Extract Data from PDF Documents with C#

Bjoern Meyer

June 10, 2024

Learn how to extract text from PDF documents using the TX Text Control PDF import feature in C#. This article shows how to extract text, attachments, form field values and metadata from PDF documents.

TX Text Control can be used not only to create PDF documents from templates or programmatically from scratch, but also to extract data from existing PDF documents. When TX Text Control is used to process PDF documents, it provides a complete workflow including creation, form data extraction, and content searching.

How to extract data for the following typical scenarios is covered in this article:

Extracting text from PDF documents
Extracting text at a specific location in PDF documents
Extracting form field data from PDF documents
Extracting metadata from PDF documents
Extracting attachments from PDF documents

Preparing the Application

A .NET 8 console application is created for the purposes of this demo.

Prerequisites

The following tutorial requires a trial version of TX Text Control .NET Server.


In Visual Studio, create a new Console App using .NET 8.
In the Solution Explorer, select your created project and choose Manage NuGet Packages... from the Project main menu.

Select Text Control Offline Packages from the Package source drop-down.

Install the latest version of the following package:
- TXTextControl.TextControl.ASP.SDK

Extracting Text from PDF Documents

The following example uses the simple invoice document shown in the screenshot below:

invoice.pdf

With TX Text Control, you can extract all of the plain text from a PDF document by loading the document and accessing the plain text. The following code shows how to extract the plain text from a PDF document:

using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
        tx.Create();

        TXTextControl.LoadSettings loadSettings = new TXTextControl.LoadSettings()
        {
                PDFImportSettings = TXTextControl.PDFImportSettings.GenerateLines
        };

        tx.Load("invoice.pdf", TXTextControl.StreamType.AdobePDF, loadSettings);

        var plainText = tx.Text;        

        Console.WriteLine(plainText);
}

The extracted text from the invoice document looks like this:

INVOICE
Text Control
Bill To:
Typer Company
Tim Typer
7872 Typing Ave.
Charlotte, NC 28210
Invoice #: 267664
Item Description Price
Product 1 Description 1 800.00
Product 2 Description 2 400.00
Product 3 Description 3 1200.00
Total:2400.00
Paid in full with CC ending with ****3425.
Thanks for your business!
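Once the plain text is available as a string, individual values can be picked out with ordinary string handling. The following is a minimal sketch, not part of the TX Text Control API: the helper name TryFindInvoiceNumber is hypothetical, and the sample text is a shortened copy of the output above that stands in for the value of tx.Text.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened copy of the extracted text shown above; in the real
// application this would be the value of tx.Text.
string samplePlainText = "INVOICE\nText Control\nInvoice #: 267664\nTotal:2400.00\n";

if (TryFindInvoiceNumber(samplePlainText, out string invoiceNumber))
{
    Console.WriteLine("Invoice number: " + invoiceNumber);
}
else
{
    Console.WriteLine("No invoice number found.");
}

// Hypothetical helper: returns true and the digits that follow
// the label "Invoice #:" if the label is present in the text.
static bool TryFindInvoiceNumber(string text, out string number)
{
    Match match = Regex.Match(text, @"Invoice #:\s*(\d+)");
    number = match.Success ? match.Groups[1].Value : string.Empty;
    return match.Success;
}
```

This approach assumes that the label text is stable across invoices. If the wording varies, the location-based search described in the next sections is likely the better fit.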

Find Specific String

TX Text Control provides a powerful text search functionality that can be used to find specific strings in a PDF document. The following code shows how to find the string "Total:" in a PDF document:

using TXTextControl.DocumentServer.PDF.Contents;

Lines pdfLines = new Lines("invoice.pdf");

string stringToFind = "Total:";

List<ContentLine> contentLines =
        pdfLines.Find(stringToFind);

foreach (ContentLine line in contentLines)
{
        Console.WriteLine("Found string \"{0}\" on page {1} (X: {2}, Y: {3}): {4}",
                stringToFind,
                line.Page.ToString(),
                line.X.ToString(),
                line.Y.ToString(),
                line.Text);
}

Found string "Total:" on page 1 (X: 462.75, Y: 376.2): Total:2400.00

Find Text by Location

Now that we know the location of the string "Total:", we can extract the value at that location. This is a straightforward way to read the total amount of a PDF invoice programmatically.

The following code shows how to extract the value at the location of the string "Total:" in a PDF document:

using System.Drawing;
using System.Text.RegularExpressions;
using TXTextControl.DocumentServer.PDF.Contents;

Lines pdfLines = new Lines("invoice.pdf");

List<ContentLine> contentLines =
   pdfLines.Find(new RectangleF(462, 376, 400, 400), true);

foreach (ContentLine line in contentLines)
{
        float fTotal = ExtractFloatFromString(line.Text);

        Console.WriteLine("Found value on page {0} (X: {1}, Y: {2}): {3}",
                line.Page.ToString(),
                line.X.ToString(),
                line.Y.ToString(),
                fTotal.ToString());
}

static float ExtractFloatFromString(string input)
{
        // Regular expression to match a floating-point number
        string pattern = @"[-+]?[0-9]*\.?[0-9]+";

        // Match the pattern in the input string
        Match match = Regex.Match(input, pattern);

        if (match.Success)
        {
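                // Convert the matched value to float; the invariant culture ensures
                // that "." is read as the decimal separator on every system
                return float.Parse(match.Value, System.Globalization.CultureInfo.InvariantCulture);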
        }

        // If no match found, return 0.0 or you can throw an exception or handle it as needed
        return 0.0f;
}

The extracted value from the invoice document looks like this:

Found value on page 1 (X: 462.75, Y: 376.2): 2400
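For monetary values, parsing into a decimal instead of a float avoids binary rounding and keeps the trailing zeros. The following variant is a sketch under that assumption. The helper name TryExtractAmount is hypothetical; it reuses the regular expression from the example above and reports failure through its return value instead of returning 0.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// "Total:2400.00" is the line text found in the invoice above.
if (TryExtractAmount("Total:2400.00", out decimal total))
{
    Console.WriteLine(total.ToString("F2", CultureInfo.InvariantCulture));
}
else
{
    Console.WriteLine("No amount found.");
}

// Hypothetical helper: finds the first number in the input and parses it
// as a decimal using the invariant culture ("." as decimal separator).
static bool TryExtractAmount(string input, out decimal amount)
{
    amount = 0m;
    Match match = Regex.Match(input, @"[-+]?[0-9]*\.?[0-9]+");

    return match.Success && decimal.TryParse(
        match.Value,
        NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint,
        CultureInfo.InvariantCulture,
        out amount);
}
```

Returning a bool lets the caller distinguish between a missing value and a total that is actually zero.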

Extracting Form Field Data

PDF documents can contain form fields that can be filled out by the user. These form fields can be extracted using TX Text Control. The following screenshot shows a PDF document with form fields:

form.pdf

The following code shows how to extract the form field data from a PDF document:

using TXTextControl.DocumentServer.PDF;
using TXTextControl.DocumentServer.PDF.AcroForms;

FormField[] acroForms = Forms.GetAcroFormFields("lease_agreement.pdf");

foreach (FormField field in acroForms) {
    switch (field) {
        case FormTextField textField:
            Console.WriteLine("Field \"{0}\" extracted: {1}",
                textField.FieldName,
                textField.Value);
            break;

        case FormCheckBox checkBoxField:
            Console.WriteLine("Field \"{0}\" extracted: {1}",
                checkBoxField.FieldName,
                checkBoxField.IsChecked.ToString());
            break;

        case FormComboBox comboBoxField:
            Console.WriteLine("Field \"{0}\" extracted. Selected value: {1}",
                comboBoxField.FieldName,
                comboBoxField.Value);

            foreach (var item in comboBoxField.Options) {
                Console.WriteLine(" -> Option: {0}", item);
            }

            break;
    }
}

The extracted form field data from the lease agreement looks like this:

Field "tenant" extracted: Tim Tenant
Field "date" extracted: 6/16/2022
Field "agree" extracted: True
Field "movein" extracted. Selected value: June
 -> Option: January
 -> Option: June

Extracting Meta Data

PDF documents can contain meta data such as author, title, and keywords. This meta data can be extracted using TX Text Control. The following code shows how to extract the meta data from a PDF document:

using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
        tx.Create();

        TXTextControl.LoadSettings loadSettings = new TXTextControl.LoadSettings()
        {
                PDFImportSettings = TXTextControl.PDFImportSettings.GenerateLines
        };

        tx.Load("invoice.pdf", TXTextControl.StreamType.AdobePDF, loadSettings);

        // Write document meta data to the console
        Console.WriteLine($"Author: {loadSettings.Author}");
        Console.WriteLine($"Subject: {loadSettings.DocumentSubject}");
        Console.WriteLine($"Title: {loadSettings.DocumentTitle}");
        Console.WriteLine($"Creation Date: {loadSettings.CreationDate}");
        Console.WriteLine($"Application: {loadSettings.CreatorApplication}");

        // Guard against documents that do not define any keywords
        foreach (string keyword in loadSettings.DocumentKeywords ?? Array.Empty<string>())
        {
                Console.WriteLine($"Keyword: {keyword}");
        }
}

The extracted meta data from the invoice document looks like this:

Author: Tim Typer
Subject: Sample invoice
Title: PDF Invoice
Creation Date: 6/10/2024 3:40:46 PM
Application: TX Text Control
Keyword: ERP tag
Keyword:  Tag 2
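The second keyword in the output above starts with a space, which is typical when keywords are stored as a comma-separated list. The following sketch trims such values before they are used for tagging or indexing. The helper name NormalizeKeywords is hypothetical, and the sample array stands in for loadSettings.DocumentKeywords.

```csharp
using System;
using System.Linq;

// Sample values copied from the output above, including the leading space.
string[] rawKeywords = { "ERP tag", " Tag 2" };

string[] cleanedKeywords = NormalizeKeywords(rawKeywords);
Console.WriteLine(string.Join("|", cleanedKeywords));

// Hypothetical helper: trims surrounding whitespace and drops empty
// entries; a null array is treated as "no keywords".
static string[] NormalizeKeywords(string[] raw) =>
    (raw ?? Array.Empty<string>())
        .Select(keyword => keyword.Trim())
        .Where(keyword => keyword.Length > 0)
        .ToArray();
```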

Extracting Attachments

PDF documents can contain attachments such as images, documents, or other files. PDF/A-3 documents enable the transition from electronic paper to an electronic container that contains both human- and machine-readable versions of a document. A PDF/A-3 document can contain an unlimited number of embedded documents for different processes.

To extract attachments, load the PDF into TX Text Control and pass a LoadSettings object. After loading, the attachments are available in the EmbeddedFiles array of that object.

using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl()) {
   tx.Create();

   TXTextControl.LoadSettings loadSettings = new TXTextControl.LoadSettings();

   tx.Load("mypdf.pdf", TXTextControl.StreamType.AdobePDF, loadSettings);

   foreach (TXTextControl.EmbeddedFile embeddedFile in loadSettings.EmbeddedFiles) {
      // Write the raw bytes so that binary attachments are not corrupted
      System.IO.File.WriteAllBytes(
         embeddedFile.FileName,
         (byte[])embeddedFile.Data);
   }
}

This code loads a PDF file, extracts any embedded files in the file, and saves those embedded files as separate files on the hard drive.
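Because the file name of an attachment comes from the PDF itself, it may be worth reducing it to a plain file name before writing to disk. The following is a minimal sketch of that idea. The helper name BuildSafePath, the output folder, and the fallback name attachment.bin are assumptions for illustration and not part of TX Text Control.

```csharp
using System;
using System.IO;

// Assumed output folder below the temp directory.
string outputDirectory = Path.Combine(Path.GetTempPath(), "pdf-attachments");

// A name with a relative path prefix is reduced to the file name only.
string targetPath = BuildSafePath(outputDirectory, "../invoice.xml");
Console.WriteLine(Path.GetFileName(targetPath));

// Hypothetical helper: keeps only the last path segment of the embedded
// name and combines it with the chosen output directory.
static string BuildSafePath(string directory, string embeddedName)
{
    string fileName = Path.GetFileName(embeddedName);

    if (string.IsNullOrWhiteSpace(fileName))
    {
        fileName = "attachment.bin"; // assumed fallback for empty names
    }

    return Path.Combine(directory, fileName);
}
```

The resulting path could then be passed to the file writing call in the loop above in place of embeddedFile.FileName. Note that Path.GetFileName only treats the separators of the current platform as path separators.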

Conclusion

TX Text Control provides a complete workflow for processing PDF documents including creation, form data extraction, and content searching. This article showed how to extract text, form field data, meta data, and attachments from PDF documents using TX Text Control.
