How to Detect and Extract Table Data as JSON from PDF Documents in C#

Extracting tabular data is a typical use case for importing "digitally born" PDF documents such as invoices. After an invoice is read, the recognized data is typically matched against the corresponding purchase order data in an ERP system.

Import PDF Documents

Like any other supported file type, TX Text Control can import digitally born Adobe PDF documents. Typically, you can use this approach to search PDF documents for strings in document pages and extract the contents of form fields.

Learn More

This article shows different ways to extract data from existing PDF documents.

Extract Text and Data from PDF Documents in C#

Unlike in the above article, the document is actually loaded into TX Text Control here, so that the code can loop through the recognized tables. Consider the following invoice PDF document.

Extracting Tables from PDFs

In this case, the TX Text Control will recognize the table that is highlighted in red.

Extracting Tables from PDFs

Looping through Tables

We use the PDFImportSettings enumeration, which specifies how the document structure is generated when a PDF document is imported, to tell TX Text Control to import the content and format it in such a way that tables are recognized as well. The following code loads the PDF document and loops through all of the tables it finds.

	// create a new load settings object
	TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings() {
		// set the PDF import settings so that the content is imported as flow text
		PDFImportSettings = PDFImportSettings.GenerateParagraphs
	};

	// load the PDF document
	textControl1.Load("pdf_template.pdf", TXTextControl.StreamType.AdobePDF, ls);

	// iterate through all text parts
	foreach (IFormattedText formattedText in textControl1.TextParts) {
		// iterate through all tables
		foreach (Table table in formattedText.Tables) {
			Debug.WriteLine(Table2Data(table, true));
		}
	}


Generate JSON

The Table2Data method loops through all rows and creates a Dictionary for each row. If the containsTableHeader parameter is set to true, the column titles are taken from the first row and that row is not returned as data; otherwise, generic titles such as "Column 1" are used. When finished, the list of dictionaries is returned as a JSON string.

	// convert a flat (non-nested) table to a JSON string
	private string Table2Data(Table table, bool containsTableHeader) {

		// create a list of dictionaries
		var tableList = new List<Dictionary<string, string>>();

		// get the start row (rows and columns are 1-based)
		int startRow = containsTableHeader ? 2 : 1;

		// iterate through all rows, including the last one
		for (int row = startRow; row <= table.Rows.Count; row++) {
			// create a dictionary for each row
			var rowData = new Dictionary<string, string>();

			// iterate through all columns, including the last one
			for (int col = 1; col <= table.Columns.Count; col++) {
				// get the column title from the first row
				var columnTitle = containsTableHeader ?
					table.Cells[1, col].Text.Trim('\t') : $"Column {col}";

				// add the cell text to the dictionary
				rowData[columnTitle] = table.Cells[row, col].Text.Trim('\t');
			}

			// add the dictionary to the list
			tableList.Add(rowData);
		}

		// return the JSON string
		return System.Text.Json.JsonSerializer.Serialize(tableList,
			new System.Text.Json.JsonSerializerOptions { WriteIndented = true });
	}


For the example document above, the following JSON is returned.

	[
	{
	"ID": "1",
	"Name": "Name 1",
	"Description": "Description 1",
	"Price": "$100",
	"Qty": "1"
	},
	{
	"ID": "2",
	"Name": "Name 2",
	"Description": "Description 2",
	"Price": "$200",
	"Qty": "1"
	}
	]
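
Because every value is returned as a string, a consumer that wants to match the rows against purchase order data usually has to parse the numeric columns first. The following is a minimal sketch of that step, assuming the column titles "Price" and "Qty" from the example above and prices prefixed with "$". The InvoiceJsonReader class and its SumLineTotals method are hypothetical names used for illustration only; they are not part of TX Text Control.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.Json;

public static class InvoiceJsonReader
{
    // Hypothetical helper: sums Price * Qty over all rows of the JSON array.
    public static decimal SumLineTotals(string json)
    {
        // Each array element maps a column title to the cell text.
        var rows = JsonSerializer.Deserialize<List<Dictionary<string, string>>>(json)
                   ?? new List<Dictionary<string, string>>();

        // Assumption: prices use "$" as the currency symbol and "." as the decimal separator.
        var priceFormat = new NumberFormatInfo { CurrencySymbol = "$" };

        decimal total = 0m;
        foreach (var row in rows)
        {
            // NumberStyles.Currency accepts the leading currency symbol.
            decimal price = decimal.Parse(row["Price"], NumberStyles.Currency, priceFormat);
            int qty = int.Parse(row["Qty"], CultureInfo.InvariantCulture);
            total += price * qty;
        }
        return total;
    }

    public static void Main()
    {
        string json = @"[
          { ""ID"": ""1"", ""Name"": ""Name 1"", ""Description"": ""Description 1"", ""Price"": ""$100"", ""Qty"": ""1"" },
          { ""ID"": ""2"", ""Name"": ""Name 2"", ""Description"": ""Description 2"", ""Price"": ""$200"", ""Qty"": ""1"" }
        ]";

        Console.WriteLine(SumLineTotals(json)); // prints 300
    }
}
```

In a real import, the parsing would probably need to be more defensive (decimal.TryParse, missing keys), since recognized table cells can be empty or contain unexpected text.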


If the containsTableHeader parameter is set to false, the following JSON will be returned.

	[
	{
	"Column 1": "ID",
	"Column 2": "Name",
	"Column 3": "Description",
	"Column 4": "Price",
	"Column 5": "Qty"
	},
	{
	"Column 1": "1",
	"Column 2": "Name 1",
	"Column 3": "Description 1",
	"Column 4": "$100",
	"Column 5": "1"
	},
	{
	"Column 1": "2",
	"Column 2": "Name 2",
	"Column 3": "Description 2",
	"Column 4": "$200",
	"Column 5": "1"
	}
	]
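
To try the row-to-dictionary mapping without a PDF document or a TX Text Control instance, the same logic can be sketched against a plain array of strings. This is only an illustration under stated assumptions: the GridToJson class, its ToJson method and the grid parameter are hypothetical stand-ins for Table2Data and the Table object, and the array is 0-based, whereas the table cells in the code above are addressed starting at 1.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class GridToJson
{
    // Hypothetical stand-in for Table2Data: 'grid' holds one string array per table row.
    public static string ToJson(string[][] grid, bool containsTableHeader)
    {
        var tableList = new List<Dictionary<string, string>>();

        // Skip the first row if it contains the column titles (0-based here).
        int startRow = containsTableHeader ? 1 : 0;

        for (int row = startRow; row < grid.Length; row++)
        {
            var rowData = new Dictionary<string, string>();

            for (int col = 0; col < grid[row].Length; col++)
            {
                // Use the header cell as the key, or fall back to a generic title.
                string title = containsTableHeader && col < grid[0].Length
                    ? grid[0][col].Trim()
                    : $"Column {col + 1}";

                rowData[title] = grid[row][col].Trim();
            }

            tableList.Add(rowData);
        }

        return JsonSerializer.Serialize(tableList,
            new JsonSerializerOptions { WriteIndented = true });
    }

    public static void Main()
    {
        var grid = new[]
        {
            new[] { "ID", "Name", "Description", "Price", "Qty" },
            new[] { "1", "Name 1", "Description 1", "$100", "1" },
            new[] { "2", "Name 2", "Description 2", "$200", "1" }
        };

        Console.WriteLine(ToJson(grid, containsTableHeader: true));
        Console.WriteLine(ToJson(grid, containsTableHeader: false));
    }
}
```

Note that a dictionary keeps only one value per key, so in this sketch, as in Table2Data, two columns with the same title would overwrite each other within a row.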

