How to Detect and Extract Table Data as JSON from PDF Documents in C#
A common requirement is to extract tabular data from a PDF document and turn it into a structured, reusable format. This article describes how to convert tables recognized by the PDF import functionality into JSON.

Extracting tabular data is a typical use case when importing "digitally born" PDF documents such as invoices. After an invoice is read, the recognized data can be matched against the corresponding purchase order data in an ERP system.
Import PDF Documents
Like any other supported file type, TX Text Control can import digitally born Adobe PDF documents. Typically, you can use this approach to search PDF documents for strings in document pages and extract the contents of form fields.
Learn More
This article shows different ways to extract data from existing PDF documents.
Unlike in the above article, the document is actually loaded into the TX Text Control to loop through recognized tables. Consider the following invoice PDF document.
In this case, the TX Text Control will recognize the table that is highlighted in red.
Looping through Tables
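We use the PDFImportSettings property of the LoadSettings object to control how the PDF document is imported. After the document is loaded, the code iterates through all text parts and through the tables found in each of them.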
// create a new load settings object
TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings() {
    // set the PDF import settings
    PDFImportSettings = PDFImportSettings.GenerateTextFrames
};

// load the PDF document
textControl1.Load("pdf_template.pdf", TXTextControl.StreamType.AdobePDF, ls);

// iterate through all text parts
foreach (IFormattedText formattedText in textControl1.TextParts) {
    // iterate through all tables
    foreach (Table table in formattedText.Tables) {
        Debug.WriteLine(Table2Data(table, true));
    }
}
Generate JSON
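The Table2Data method loops through all rows and creates a Dictionary for each row. If the containsTableHeader parameter is set to true, the column names are taken from the first row and the data starts in the second row. Otherwise, generic names such as "Column 1" are used and the first row is treated as data. When finished, the list of dictionaries is returned as a JSON string.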
// convert a flat (non-nested) table to a JSON string
private string Table2Data(Table table, bool containsTableHeader) {
    // create a list of dictionaries
    var tableList = new List<Dictionary<string, string>>();

    // get the start row (rows and columns are 1-based)
    int startRow = containsTableHeader ? 2 : 1;

    // iterate through all rows, including the last one
    for (int row = startRow; row <= table.Rows.Count; row++) {
        // create a dictionary for each row
        var rowData = new Dictionary<string, string>();

        // iterate through all columns, including the last one
        for (int col = 1; col <= table.Columns.Count; col++) {
            // get the column title from the first row
            var columnTitle = containsTableHeader ?
                table.Cells[1, col].Text.Trim('\t') : $"Column {col}";

            // add the cell text to the dictionary
            rowData[columnTitle] = table.Cells[row, col].Text.Trim('\t');
        }

        // add the dictionary to the list
        tableList.Add(rowData);
    }

    // return the JSON string
    return System.Text.Json.JsonSerializer.Serialize(tableList,
        new System.Text.Json.JsonSerializerOptions { WriteIndented = true });
}
For the example document above, the following JSON is returned.
[
  {
    "ID": "1",
    "Name": "Name 1",
    "Description": "Description 1",
    "Price": "$100",
    "Qty": "1"
  },
  {
    "ID": "2",
    "Name": "Name 2",
    "Description": "Description 2",
    "Price": "$200",
    "Qty": "1"
  }
]
If the containsTableHeader parameter is set to false, the following JSON will be returned.
[
  {
    "Column 1": "ID",
    "Column 2": "Name",
    "Column 3": "Description",
    "Column 4": "Price",
    "Column 5": "Qty"
  },
  {
    "Column 1": "1",
    "Column 2": "Name 1",
    "Column 3": "Description 1",
    "Column 4": "$100",
    "Column 5": "1"
  },
  {
    "Column 1": "2",
    "Column 2": "Name 2",
    "Column 3": "Description 2",
    "Column 4": "$200",
    "Column 5": "1"
  }
]
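Once the table is available as JSON, it can be consumed like any other structured data, for example before matching an invoice against purchase order data. The following is a minimal sketch that is not part of the original sample. It assumes the header-based output shown earlier, with "Price" values formatted as US currency and "Qty" values that are whole numbers. The class name InvoiceJsonDemo and the method name SumLineTotals are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.Json;

public static class InvoiceJsonDemo
{
    // Hypothetical helper: reads the header-based JSON produced by Table2Data
    // and sums Price * Qty over all rows.
    public static decimal SumLineTotals(string json)
    {
        // each table row was serialized as a dictionary of column title -> cell text
        var rows = JsonSerializer.Deserialize<List<Dictionary<string, string>>>(json)
                   ?? new List<Dictionary<string, string>>();

        // assumption: prices use US formatting such as "$100"
        var usCulture = CultureInfo.GetCultureInfo("en-US");
        decimal total = 0m;

        foreach (var row in rows)
        {
            // NumberStyles.Currency accepts the leading currency symbol
            decimal price = decimal.Parse(row["Price"], NumberStyles.Currency, usCulture);
            int qty = int.Parse(row["Qty"], CultureInfo.InvariantCulture);
            total += price * qty;
        }

        return total;
    }

    public static void Main()
    {
        // shortened version of the sample output, reduced to the columns used here
        string json = "[{\"ID\":\"1\",\"Price\":\"$100\",\"Qty\":\"1\"}," +
                      "{\"ID\":\"2\",\"Price\":\"$200\",\"Qty\":\"1\"}]";

        Console.WriteLine(SumLineTotals(json));
    }
}
```

All cell values arrive as strings, so every numeric column has to be parsed explicitly. In a real import, decimal.TryParse and int.TryParse would be the safer choice, because recognized cells can be empty or contain unexpected text.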