How to Detect and Extract Table Data as JSON from PDF Documents in C#

A common use case is extracting data from tables in a PDF document into a structured, reusable format. This article describes how to convert tables recognized by the PDF import functionality into JSON.

Extracting tabular data is a typical use case for importing "digitally born" PDF documents such as invoices. After an invoice has been read, the recognized data can be matched with the corresponding purchase order data in an ERP system.

Import PDF Documents

Like any other supported file type, TX Text Control can import digitally born Adobe PDF documents. Typically, you can use this approach to search PDF documents for strings in document pages and extract the contents of form fields.

Learn More

This article shows different ways to extract data from existing PDF documents.

Extract Text and Data from PDF Documents in C#

Unlike the approach in the article above, this article loads the document into TX Text Control and loops through the recognized tables. Consider the following invoice PDF document.

[Image: Extracting Tables from PDFs (sample invoice document)]

In this case, the TX Text Control will recognize the table that is highlighted in red.

[Image: Extracting Tables from PDFs (recognized table highlighted in red)]

Looping through Tables

We use the PDFImportSettings to specify that the TX Text Control should import the content and format it in such a way that tables are recognized as well. The following code loads the PDF document and loops through all of the tables it finds.

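// namespaces used by the snippets in this article
using System.Collections.Generic;
using System.Diagnostics;
using System.Text.Json;
using TXTextControl;

// create a new load settings object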
TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings() {
  // import the PDF content as paragraphs so that tables are recognized
  PDFImportSettings = PDFImportSettings.GenerateParagraphs
};

// load the PDF document
textControl1.Load("pdf_template.pdf", TXTextControl.StreamType.AdobePDF, ls);

// iterate through all text parts
foreach (IFormattedText formattedText in textControl1.TextParts) {
  // iterate through all tables
  foreach (Table table in formattedText.Tables) {
    Debug.WriteLine(Table2Data(table, true));
  }
}

Generate JSON

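The Table2Data method loops through all rows and creates a Dictionary for each row, keyed by column name. If the containsTableHeader parameter is set to true, the column names are taken from the first row and that row is not emitted as data; otherwise, generic names (Column 1, Column 2, and so on) are used. When finished, the list of dictionaries is returned as a JSON string.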

// convert a flat (non-nested) table to a JSON string
private string Table2Data(Table table, bool containsTableHeader) {

  // create a list of dictionaries
  var tableList = new List<Dictionary<string, string>>();

  // get the start row
  int startRow = containsTableHeader ? 2 : 1;

  // iterate through all rows (table indexes are 1-based, so include the last row)
  for (int row = startRow; row <= table.Rows.Count; row++) {
    // create a dictionary for each row
    var rowData = new Dictionary<string, string>();

    // iterate through all columns (1-based, so include the last column)
    for (int col = 1; col <= table.Columns.Count; col++) {
      // get the column title from the first row
      var columnTitle = containsTableHeader ? 
        table.Cells[1, col].Text.Trim('\t') : $"Column {col}";

      // add the cell text to the dictionary
      rowData[columnTitle] = table.Cells[row, col].Text.Trim('\t');
    }

    // add the dictionary to the list
    tableList.Add(rowData);
  }

  // return the JSON string
  return System.Text.Json.JsonSerializer.Serialize(tableList, new JsonSerializerOptions { WriteIndented = true });
}
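
Because Table2Data needs a loaded Table object, its mapping logic is hard to test without a PDF at hand. One possible way around that is to separate the row-to-dictionary step from the document access. The following is a minimal sketch of that idea, not part of the TX Text Control API: the class TableJson and the methods RowsToRecords and ToJson are hypothetical names, and the input is assumed to be one string array per table row with the cell texts already read from the table.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class TableJson
{
    // rows: one string array per table row, cells in column order
    // (hypothetical input shape; the cell texts would come from the Table object)
    public static List<Dictionary<string, string>> RowsToRecords(
        IReadOnlyList<string[]> rows, bool containsTableHeader)
    {
        var records = new List<Dictionary<string, string>>();
        if (rows.Count == 0) return records;

        // data starts after the header row, if there is one
        int firstDataRow = containsTableHeader ? 1 : 0;

        for (int r = firstDataRow; r < rows.Count; r++)
        {
            var record = new Dictionary<string, string>();

            for (int c = 0; c < rows[r].Length; c++)
            {
                // use the header cell as key when available,
                // otherwise fall back to "Column n" (n is 1-based)
                string key = containsTableHeader && c < rows[0].Length
                    ? rows[0][c].Trim('\t')
                    : $"Column {c + 1}";

                record[key] = rows[r][c].Trim('\t');
            }

            records.Add(record);
        }

        return records;
    }

    // serialize the records the same way Table2Data does
    public static string ToJson(List<Dictionary<string, string>> records) =>
        JsonSerializer.Serialize(records, new JsonSerializerOptions { WriteIndented = true });
}
```

In this sketch, the arrays use the usual zero-based C# indexes, while the generated fallback names stay 1-based to match the "Column 1" style used by Table2Data. If two header cells have the same text, the later column overwrites the earlier one, just as in the dictionary-based approach above.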

For the example document above, Table2Data returns the following JSON.

[
  {
    "ID": "1",
    "Name": "Name 1",
    "Description": "Description 1",
    "Price": "$100",
    "Qty": "1"
  },
  {
    "ID": "2",
    "Name": "Name 2",
    "Description": "Description 2",
    "Price": "$200",
    "Qty": "1"
  }
]

If the containsTableHeader parameter is set to false, the following JSON will be returned.

[
  {
    "Column 1": "ID",
    "Column 2": "Name",
    "Column 3": "Description",
    "Column 4": "Price",
    "Column 5": "Qty"
  },
  {
    "Column 1": "1",
    "Column 2": "Name 1",
    "Column 3": "Description 1",
    "Column 4": "$100",
    "Column 5": "1"
  },
  {
    "Column 1": "2",
    "Column 2": "Name 2",
    "Column 3": "Description 2",
    "Column 4": "$200",
    "Column 5": "1"
  }
]
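
The JSON shown above keeps every cell value as a string, so a consumer that wants to match the invoice against other data has to parse the values first. As a rough sketch of that step, the following code reads the header-based JSON back into dictionaries and sums Price times Qty over all rows. The class InvoiceTotals and its Total method are hypothetical names, and the code assumes that every row has a "Price" value of the form "$100" and a whole-number "Qty", as in the sample document.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.Json;

public static class InvoiceTotals
{
    // Sums Price * Qty over all rows of the JSON array created from the table.
    // Assumption: "Price" looks like "$100" and "Qty" is a whole number.
    public static decimal Total(string json)
    {
        // read the JSON array back into one dictionary per row
        var rows = JsonSerializer.Deserialize<List<Dictionary<string, string>>>(json)
                   ?? new List<Dictionary<string, string>>();

        decimal total = 0m;

        foreach (var row in rows)
        {
            // strip the leading currency symbol before parsing
            decimal price = decimal.Parse(
                row["Price"].TrimStart('$'), CultureInfo.InvariantCulture);
            int qty = int.Parse(row["Qty"], CultureInfo.InvariantCulture);

            total += price * qty;
        }

        return total;
    }
}
```

For the two rows of the sample invoice ($100 and $200, each with a quantity of 1), Total returns 300. A real invoice would need more defensive parsing, for example for missing keys or other currency formats.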
