
Extract Data from PDF Documents with C#

Learn how to extract text from PDF documents using the TX Text Control PDF import feature in C#. This article shows how to extract text, attachments, form field values and metadata from PDF documents.

TX Text Control can be used not only to create PDF documents from templates or programmatically from scratch, but also to extract data from existing PDF documents. When TX Text Control is used to process PDF documents, it provides a complete workflow including creation, form data extraction, and content searching.

This article covers data extraction for the following typical scenarios:

  • Extracting text from PDF documents
  • Extracting text at a specific location in PDF documents
  • Extracting form field data from PDF documents
  • Extracting metadata from PDF documents
  • Extracting attachments from PDF documents

Preparing the Application

A .NET 8 console application is created for the purposes of this demo.

Prerequisites

The following tutorial requires a trial version of TX Text Control .NET Server.

  1. In Visual Studio, create a new Console App using .NET 8.

  2. In the Solution Explorer, select your created project and choose Manage NuGet Packages... from the Project main menu.

    Select Text Control Offline Packages from the Package source drop-down.

    Install the latest version of the following package:

    • TXTextControl.TextControl.ASP.SDK


Extracting Text from PDF Documents

The following example uses the simple invoice document shown in the screenshot below:

invoice.pdf

With TX Text Control, you can extract all of the plain text from a PDF document by loading the document and reading the Text property. The following code shows how to extract the plain text from a PDF document:

using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
        tx.Create();

        TXTextControl.LoadSettings loadSettings = new TXTextControl.LoadSettings()
        {
                PDFImportSettings = TXTextControl.PDFImportSettings.GenerateLines
        };

        tx.Load("invoice.pdf", TXTextControl.StreamType.AdobePDF, loadSettings);

        var plainText = tx.Text;        

        Console.WriteLine(plainText);
}

The extracted text from the invoice document looks like this:

INVOICE
Text Control
Bill To:
Typer Company
Tim Typer
7872 Typing Ave.
Charlotte, NC 28210
Invoice #: 267664
Item Description Price
Product 1 Description 1 800.00
Product 2 Description 2 400.00
Product 3 Description 3 1200.00
Total:2400.00
Paid in full with CC ending with ****3425.
Thanks for your business!
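
Once the plain text is available, individual values can be pulled out of it with regular expressions. The following is a minimal sketch that does not use TX Text Control at all: it works on a shortened copy of the text above, and the helper names FindInvoiceNumber and FindLineItems are hypothetical. It also assumes that every line item ends with a price that has two decimal places, which holds for this sample invoice but not necessarily for other layouts.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Sample input: part of the plain text shown above, as returned by tx.Text
string plainText = string.Join("\n",
    "Invoice #: 267664",
    "Item Description Price",
    "Product 1 Description 1 800.00",
    "Product 2 Description 2 400.00",
    "Product 3 Description 3 1200.00",
    "Total:2400.00");

string? invoiceNumber = FindInvoiceNumber(plainText);
List<(string Label, decimal Price)> items = FindLineItems(plainText);

// Add up the line items so the result can be compared with the printed total
decimal sum = 0m;
foreach (var item in items)
{
    sum += item.Price;
}

Console.WriteLine($"Invoice {invoiceNumber}: {items.Count} items"); // prints: Invoice 267664: 3 items

// Hypothetical helper: returns the digits after "Invoice #:", or null if the line is missing
static string? FindInvoiceNumber(string text)
{
    Match match = Regex.Match(text, @"^Invoice #:\s*(\d+)\s*$", RegexOptions.Multiline);
    return match.Success ? match.Groups[1].Value : null;
}

// Hypothetical helper: treats every line that ends in " <digits>.<2 digits>" as a line item
static List<(string Label, decimal Price)> FindLineItems(string text)
{
    var result = new List<(string Label, decimal Price)>();
    string pattern = @"^(?<label>.+?)\s+(?<price>\d+\.\d{2})\s*$";

    foreach (Match match in Regex.Matches(text, pattern, RegexOptions.Multiline))
    {
        // Invariant culture: "." is always read as the decimal separator
        decimal price = decimal.Parse(match.Groups["price"].Value, CultureInfo.InvariantCulture);
        result.Add((match.Groups["label"].Value, price));
    }

    return result;
}
```

The line "Total:2400.00" is not reported as a line item because there is no whitespace in front of the amount. Comparing the summed line items with the total is a simple plausibility check for the extraction.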

Find Specific String

TX Text Control can search a PDF document for specific strings and returns each matching line together with its page number and position. The following code shows how to find the string "Total:" in a PDF document:

using TXTextControl.DocumentServer.PDF.Contents;

Lines pdfLines = new Lines("invoice.pdf");

string stringToFind = "Total:";

List<ContentLine> contentLines =
        pdfLines.Find(stringToFind);

foreach (ContentLine line in contentLines)
{
        Console.WriteLine("Found string \"{0}\" on page {1} (X: {2}, Y: {3}): {4}",
                stringToFind,
                line.Page.ToString(),
                line.X.ToString(),
                line.Y.ToString(),
                line.Text);
}

The output looks like this:

Found string "Total:" on page 1 (X: 462.75, Y: 376.2): Total:2400.00

Find Text by Location

Now that we know the location of the string "Total:", we can extract the value at that location. This is a practical way to read the total of a PDF invoice programmatically, provided that the invoices share the same layout.

The following code shows how to extract the value at the location of the string "Total:" in a PDF document:

using System.Drawing;
using System.Text.RegularExpressions;
using TXTextControl.DocumentServer.PDF.Contents;

Lines pdfLines = new Lines("invoice.pdf");

List<ContentLine> contentLines =
   pdfLines.Find(new RectangleF(462, 376, 400, 400), true);

foreach (ContentLine line in contentLines)
{
        float fTotal = ExtractFloatFromString(line.Text);

        Console.WriteLine("Found value on page {0} (X: {1}, Y: {2}): {3}",
                line.Page.ToString(),
                line.X.ToString(),
                line.Y.ToString(),
                fTotal.ToString());
}

static float ExtractFloatFromString(string input)
{
        // Regular expression to match a floating-point number
        string pattern = @"[-+]?[0-9]*\.?[0-9]+";

        // Match the pattern in the input string
        Match match = Regex.Match(input, pattern);

        if (match.Success)
        {
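                // Convert the matched value to float; the invariant culture ensures
                // that "." is read as the decimal separator on every system
                return float.Parse(match.Value, System.Globalization.CultureInfo.InvariantCulture);
        }

        // No match found: return 0.0 (alternatively, throw an exception)
        return 0.0f;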
}

The extracted value from the invoice document looks like this:

Found value on page 1 (X: 462.75, Y: 376.2): 2400
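
For monetary values, a variation of the helper above may be preferable. The following sketch is an alternative, not part of the original sample: the hypothetical helper TryExtractAmount returns a decimal instead of a float to avoid binary rounding, always parses with the invariant culture, and reports a missing number through its return value instead of returning 0.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

if (TryExtractAmount("Total:2400.00", out decimal total))
{
    Console.WriteLine(total.ToString("F2", CultureInfo.InvariantCulture)); // prints: 2400.00
}

// Hypothetical helper: true if the input contains a number, which is returned in amount
static bool TryExtractAmount(string input, out decimal amount)
{
    // Same pattern as in ExtractFloatFromString: the first number in the string wins
    Match match = Regex.Match(input, @"[-+]?[0-9]*\.?[0-9]+");

    if (match.Success)
    {
        return decimal.TryParse(match.Value, NumberStyles.Float,
            CultureInfo.InvariantCulture, out amount);
    }

    amount = 0m;
    return false;
}
```

Because the pattern picks up the first number in the string, a line such as "Product 1 Description 1 800.00" would yield 1 rather than 800.00. The helper is therefore only suitable for lines like "Total:2400.00" that contain a single number.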

Extracting Form Field Data

PDF documents can contain form fields that can be filled out by the user. These form fields can be extracted using TX Text Control. The following screenshot shows a PDF document with form fields:

form.pdf

The following code shows how to extract the form field data from a PDF document:

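// Assumed namespaces for the Forms class and the FormField types
using TXTextControl.DocumentServer.PDF;
using TXTextControl.DocumentServer.PDF.AcroForms;

FormField[] acroForms = Forms.GetAcroFormFields("lease_agreement.pdf");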

foreach (FormField field in acroForms) {
    switch (field) {
        case FormTextField textField:
            Console.WriteLine("Field \"{0}\" extracted: {1}",
                textField.FieldName,
                textField.Value);
            break;

        case FormCheckBox checkBoxField:
            Console.WriteLine("Field \"{0}\" extracted: {1}",
                checkBoxField.FieldName,
                checkBoxField.IsChecked.ToString());
            break;

        case FormComboBox comboBoxField:
            Console.WriteLine("Field \"{0}\" extracted. Selected value: {1}",
                comboBoxField.FieldName,
                comboBoxField.Value);

            foreach (var item in comboBoxField.Options) {
                Console.WriteLine(" -> Option: {0}", item);
            }

            break;
    }
}

The extracted form field data from the lease agreement document looks like this:

Field "tenant" extracted: Tim Tenant
Field "date" extracted: 6/16/2022
Field "agree" extracted: True
Field "movein" extracted. Selected value: June
 -> Option: January
 -> Option: June
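
Extracted form values usually have to be validated and converted before they are stored. The following is a minimal sketch that is independent of TX Text Control: the dictionary holds the values from the output above, the helper FindMissingFields and the list of required fields are hypothetical, and the date format month/day/year is an assumption based on the sample value.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Sample values copied from the output above
var values = new Dictionary<string, string>
{
    ["tenant"] = "Tim Tenant",
    ["date"] = "6/16/2022",
    ["agree"] = "True",
    ["movein"] = "June"
};

// Hypothetical rule: these fields must be filled in before the form is accepted
List<string> missing = FindMissingFields(values, "tenant", "date", "agree");

// Assumption: the date field uses the month/day/year format seen in the output
bool dateOk = DateTime.TryParseExact(values["date"], "M/d/yyyy",
    CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime leaseDate);

// The check box value was written with IsChecked.ToString(), so bool.TryParse can read it
bool agreed = bool.TryParse(values["agree"], out bool isChecked) && isChecked;

Console.WriteLine($"Missing: {missing.Count}, date valid: {dateOk}, agreed: {agreed}");

// Hypothetical helper: returns the names of required fields that are absent or empty
static List<string> FindMissingFields(
    IReadOnlyDictionary<string, string> fieldValues, params string[] required)
{
    var result = new List<string>();

    foreach (string name in required)
    {
        if (!fieldValues.TryGetValue(name, out string? value) || string.IsNullOrWhiteSpace(value))
        {
            result.Add(name);
        }
    }

    return result;
}
```

In a real application, the dictionary would be filled inside the switch statement shown above instead of writing the values to the console.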

Extracting Metadata

PDF documents can contain metadata such as author, title, and keywords. After a PDF document has been loaded, this metadata is available through the properties of the LoadSettings object. The following code shows how to extract the metadata from a PDF document:

using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
        tx.Create();

        TXTextControl.LoadSettings loadSettings = new TXTextControl.LoadSettings()
        {
                PDFImportSettings = TXTextControl.PDFImportSettings.GenerateLines
        };

        tx.Load("invoice.pdf", TXTextControl.StreamType.AdobePDF, loadSettings);

        // Write document meta data to the console
        Console.WriteLine($"Author: {loadSettings.Author}");
        Console.WriteLine($"Subject: {loadSettings.DocumentSubject}");
        Console.WriteLine($"Title: {loadSettings.DocumentTitle}");
        Console.WriteLine($"Creation Date: {loadSettings.CreationDate}");
        Console.WriteLine($"Application: {loadSettings.CreatorApplication}");

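        // Fall back to an empty array in case the document contains no keywords
        foreach (string keyword in loadSettings.DocumentKeywords ?? System.Array.Empty<string>())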
        {
                Console.WriteLine($"Keyword: {keyword}");
        }
}

The extracted metadata from the invoice document looks like this:

Author: Tim Typer
Subject: Sample invoice
Title: PDF Invoice
Creation Date: 6/10/2024 3:40:46 PM
Application: TX Text Control
Keyword: ERP tag
Keyword:  Tag 2
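
Note the leading space in the second keyword of the output above. If the keywords are used for indexing or comparison, it can help to normalize them first. The following sketch uses a hypothetical helper NormalizeKeywords that trims each keyword and drops empty entries and duplicates. It is plain C# and accepts null defensively; whether the keyword array can actually be null is not confirmed here.

```csharp
using System;
using System.Collections.Generic;

// Sample input: the two keywords from the output above, including the leading space
List<string> cleaned = NormalizeKeywords(new[] { "ERP tag", " Tag 2" });

Console.WriteLine(string.Join(", ", cleaned)); // prints: ERP tag, Tag 2

// Hypothetical helper: trims keywords and removes empty entries and duplicates
static List<string> NormalizeKeywords(IEnumerable<string>? keywords)
{
    var result = new List<string>();

    if (keywords is null)
    {
        return result;
    }

    // Duplicates are detected without regard to case; the first spelling is kept
    var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    foreach (string keyword in keywords)
    {
        string trimmed = keyword.Trim();

        if (trimmed.Length > 0 && seen.Add(trimmed))
        {
            result.Add(trimmed);
        }
    }

    return result;
}
```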

Extracting Attachments

PDF documents can contain attachments such as images, documents, or other files. PDF/A-3 documents enable the transition from electronic paper to an electronic container that contains both human- and machine-readable versions of a document. A PDF/A-3 document can contain an unlimited number of embedded documents for different processes.

To extract attachments, the PDF must be loaded into TX Text Control with a LoadSettings object. After loading, the attachments are available in the EmbeddedFiles property of that object. The following code writes each attachment to a separate file:

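using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl()) {
   tx.Create();

   TXTextControl.LoadSettings loadSettings = new TXTextControl.LoadSettings();

   tx.Load("mypdf.pdf", TXTextControl.StreamType.AdobePDF, loadSettings);

   // Guard against a PDF without attachments (assumes the property may be null)
   foreach (TXTextControl.EmbeddedFile embeddedFile in
            loadSettings.EmbeddedFiles ?? System.Array.Empty<TXTextControl.EmbeddedFile>()) {
      // Strip any directory part so the file is written to the working directory
      string fileName = System.IO.Path.GetFileName(embeddedFile.FileName);

      // Write binary data unchanged; decoding it as ASCII would corrupt non-text attachments
      if (embeddedFile.Data is byte[] bytes) {
         System.IO.File.WriteAllBytes(fileName, bytes);
      }
      else if (embeddedFile.Data is string text) {
         System.IO.File.WriteAllText(fileName, text);
      }
   }
}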

This code loads a PDF file, extracts any embedded files in the file, and saves those embedded files as separate files on the hard drive.
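
Attachment names come from the PDF document itself, so they should not be trusted as file paths when the PDF originates from an external source. The following sketch shows one way to build a safe target path before writing a file. The helper GetSafeOutputPath and the folder name "attachments" are hypothetical, and the code is independent of TX Text Control.

```csharp
using System;
using System.IO;

// Example: a name that tries to leave the target folder
string target = GetSafeOutputPath("attachments", "../../factur-x.xml");
Console.WriteLine(Path.GetFileName(target)); // prints: factur-x.xml

// Hypothetical helper: reduces the attachment name to a plain file name inside outputDirectory
static string GetSafeOutputPath(string outputDirectory, string attachmentName)
{
    // Treat both slash styles as separators, regardless of the operating system
    string normalized = attachmentName.Replace('\\', '/');

    // Path.GetFileName drops any directory part of the name
    string fileName = Path.GetFileName(normalized);

    if (string.IsNullOrWhiteSpace(fileName) || fileName == "." || fileName == "..")
    {
        throw new ArgumentException("Attachment has no usable file name.", nameof(attachmentName));
    }

    return Path.Combine(outputDirectory, fileName);
}
```

The returned path can then be passed to System.IO.File.WriteAllBytes in place of the raw attachment name, after the target folder has been created with Directory.CreateDirectory.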

Conclusion

TX Text Control provides a complete workflow for processing PDF documents including creation, form data extraction, and content searching. This article showed how to extract text, form field data, metadata, and attachments from PDF documents using TX Text Control.
