Extracting tabular data is a typical use case for importing "digitally born" PDF documents such as invoices. After an invoice has been read, the recognized data can be matched against the corresponding purchase order data in an ERP system.
Import PDF Documents
Like any other supported file type, TX Text Control can import digitally born Adobe PDF documents. Typically, you can use this approach to search PDF documents for strings in document pages and extract the contents of form fields.
Learn More
This article shows different ways to extract data from existing PDF documents.
Unlike the approach in the article above, this sample actually loads the document into TX Text Control in order to loop through the recognized tables. Consider the following invoice PDF document.
In this case, the TX Text Control will recognize the table that is highlighted in red.
Looping through Tables
We use the PDFImportSettings enumeration, which specifies how the document structure is generated when a PDF document is imported, to tell TX Text Control to import the content and format it in such a way that tables are recognized as well. The following code loads the PDF document and loops through all of the tables it finds.
// create a new load settings object
TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings() {
    // set the PDF import settings
    PDFImportSettings = PDFImportSettings.GenerateTextFrames
};

// load the PDF document
textControl1.Load("pdf_template.pdf", TXTextControl.StreamType.AdobePDF, ls);

// iterate through all text parts
foreach (IFormattedText formattedText in textControl1.TextParts) {
    // iterate through all tables
    foreach (Table table in formattedText.Tables) {
        Debug.WriteLine(Table2Data(table, true));
    }
}
Generate JSON
The Table2Data method loops through all rows and creates a Dictionary for each row. If the containsTableHeader parameter is set to true, the column names are taken from the first row and that row is not emitted as data; otherwise, generic names such as "Column 1" are used. When finished, the list of dictionaries is returned as a JSON string.
// convert a flat (non-nested) table to a JSON string
private string Table2Data(Table table, bool containsTableHeader) {
    // create a list of dictionaries
    var tableList = new List<Dictionary<string, string>>();

    // get the start row (rows and columns are 1-based; skip the header row if present)
    int startRow = containsTableHeader ? 2 : 1;

    // iterate through all rows, including the last one
    for (int row = startRow; row <= table.Rows.Count; row++) {
        // create a dictionary for each row
        var rowData = new Dictionary<string, string>();

        // iterate through all columns, including the last one
        for (int col = 1; col <= table.Columns.Count; col++) {
            // get the column title from the first row
            var columnTitle = containsTableHeader ?
                table.Cells[1, col].Text.Trim('\t') : $"Column {col}";

            // add the cell text to the dictionary
            rowData[columnTitle] = table.Cells[row, col].Text.Trim('\t');
        }

        // add the dictionary to the list
        tableList.Add(rowData);
    }

    // return the JSON string
    return System.Text.Json.JsonSerializer.Serialize(tableList,
        new System.Text.Json.JsonSerializerOptions { WriteIndented = true });
}
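One caveat of keying each row by its header text is that two header cells with the same text, or an empty header cell, would end up sharing a dictionary key, so later values would overwrite earlier ones. The following is a minimal sketch of one way to guard against that. The class HeaderKeys and the method MakeUniqueKeys are hypothetical names introduced here for illustration; they are not part of TX Text Control, and the sketch works on plain strings so that it can be tried without the library.

```csharp
using System;
using System.Collections.Generic;

public static class HeaderKeys
{
    // Hypothetical helper (not part of TX Text Control): turns raw header
    // cell texts into unique, non-empty dictionary keys.
    public static List<string> MakeUniqueKeys(IReadOnlyList<string> headers)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var keys = new List<string>(headers.Count);

        for (int i = 0; i < headers.Count; i++)
        {
            // fall back to a positional name when the header cell is empty
            string baseName = string.IsNullOrWhiteSpace(headers[i])
                ? $"Column {i + 1}"
                : headers[i].Trim();

            // append a numeric suffix until the key has not been used yet
            string key = baseName;
            int suffix = 2;
            while (!seen.Add(key))
            {
                key = $"{baseName} ({suffix++})";
            }

            keys.Add(key);
        }

        return keys;
    }
}
```

Assuming the header texts are collected from the first table row beforehand, the resulting list could be used in place of the raw cell text when filling each row dictionary.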
For the example document above, the following JSON is returned.
[
  {
    "ID": "1",
    "Name": "Name 1",
    "Description": "Description 1",
    "Price": "$100",
    "Qty": "1"
  },
  {
    "ID": "2",
    "Name": "Name 2",
    "Description": "Description 2",
    "Price": "$200",
    "Qty": "1"
  }
]
If the containsTableHeader parameter is set to false, the following JSON will be returned.
[
  {
    "Column 1": "ID",
    "Column 2": "Name",
    "Column 3": "Description",
    "Column 4": "Price",
    "Column 5": "Qty"
  },
  {
    "Column 1": "1",
    "Column 2": "Name 1",
    "Column 3": "Description 1",
    "Column 4": "$100",
    "Column 5": "1"
  },
  {
    "Column 1": "2",
    "Column 2": "Name 2",
    "Column 3": "Description 2",
    "Column 4": "$200",
    "Column 5": "1"
  }
]
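To match the extracted rows against purchase order data, the JSON has to be read back into typed values at some point. The following is a minimal sketch of that step, assuming the header-keyed JSON shown earlier with "Price" values such as "$100" and integer "Qty" values. The class InvoiceJson and the method Total are hypothetical names used only for this illustration, and the currency handling is an assumption that would need to be adapted to the invoices actually being processed.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.Json;

public static class InvoiceJson
{
    // Hypothetical helper: sums Price * Qty over all rows of the
    // header-keyed JSON produced from the invoice table.
    public static decimal Total(string json)
    {
        // read the JSON back into the same list-of-dictionaries shape
        var rows = JsonSerializer.Deserialize<List<Dictionary<string, string>>>(json)
                   ?? new List<Dictionary<string, string>>();

        // assumption: prices use "$" as currency symbol and "." as decimal separator
        var format = new NumberFormatInfo { CurrencySymbol = "$" };

        decimal total = 0m;
        foreach (var row in rows)
        {
            decimal price = decimal.Parse(row["Price"], NumberStyles.Currency, format);
            int qty = int.Parse(row["Qty"], CultureInfo.InvariantCulture);
            total += price * qty;
        }

        return total;
    }
}
```

For the two sample rows above, this sums to 300, a value that could then be compared with the total of the corresponding purchase order.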