Extract Text and Data from PDF Documents in C#

Bjoern Meyer

June 16, 2022

TX Text Control can be used to create and edit Adobe PDF documents programmatically. It can also import PDF documents in order to read, extract and manipulate their content. This article shows different ways to extract data from existing PDF documents.

TX Text Control is able to import "digitally born" Adobe PDF documents like any other supported file type. Using this approach, PDF documents can be searched for strings in document pages and form field content can be extracted.

Retrieving Text Lines

The namespace TXTextControl.DocumentServer.PDF contains the class Contents.Lines, which can be used to import text lines and their coordinates from a PDF document.

Consider the following sample PDF document that is used in this article to show how text and data can be easily extracted.

PDF in Text Control

The following code shows how to load a PDF document in order to loop through all recognized text lines:

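using System;
using TXTextControl.DocumentServer.PDF.Contents;

// Load the PDF and read all recognized text lines
Lines pdfLines = new Lines("lease_agreement.pdf");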

foreach (ContentLine contentLine in pdfLines.ContentLines) {
    Console.WriteLine("Content line on page {0}: {1}", 
        contentLine.Page.ToString(), 
        contentLine.Text);
}

The output of the above code shows the text lines and the associated page number:

Content line on page 1: Residential Lease Agreement
Content line on page 1: In consideration of the Landlord leasing certain premises to the Tenant
Content line on page 1: and other valuable consideration, the receipt and sufficiency of which
Content line on page 1: considerations is herby acknowledged, the Parties agree as follows.
Content line on page 1: 1. Lease Term.
Content line on page 1: The term of this Agreement shall be be a period of «year» years, beginning on the
Content line on page 1: date «begin» and ending on the date «end».
Content line on page 1: 2. Property.
Content line on page 1: The leased premises shall be comprised of that certain personal residence (including
Content line on page 1: the house and the land) located at «location» ("Premises"). Landlord leases the
Content line on page 1: Premises to Tenant and Tenant leases the Premises from Landlord on the terms and
Content line on page 1: conditions set forth herein.
Content line on page 1: 3. Monthly Rent.
Content line on page 1: The rent to be paid by Tenant to Landlord throughout the term of this Agreement is
Content line on page 1: «rent» per month and shall be due on the 1st day of each month.
Content line on page 1: Tenant Landlord
Content line on page 1: Name Name
Content line on page 1: Date Date
Content line on page 1: I agree
Content line on page 1: Move in month:
Content line on page 1: Tim Tenant
Content line on page 1: 6/16/2022
Content line on page 1: June
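
Once the lines are available as plain strings, ordinary .NET string handling applies. As a minimal sketch, the following snippet pulls the merge field placeholders (the names between « and ») out of line text. It works on plain strings rather than on ContentLine objects so that it runs without the library; the PlaceholderScanner class and its method name are illustrative assumptions and not part of TX Text Control.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative helper, not part of TX Text Control.
public static class PlaceholderScanner
{
    // Matches «name» and captures the text between the guillemets.
    private static readonly Regex Placeholder = new Regex("«([^»]+)»");

    // Returns the distinct placeholder names in order of first appearance.
    public static List<string> FindPlaceholders(IEnumerable<string> lines)
    {
        var names = new List<string>();
        foreach (string line in lines)
        {
            foreach (Match match in Placeholder.Matches(line))
            {
                string name = match.Groups[1].Value;
                if (!names.Contains(name))
                {
                    names.Add(name);
                }
            }
        }
        return names;
    }
}

public static class PlaceholderDemo
{
    public static void Main()
    {
        // Two lines copied from the output above.
        var lines = new[]
        {
            "The term of this Agreement shall be be a period of «year» years, beginning on the",
            "date «begin» and ending on the date «end»."
        };

        foreach (string name in PlaceholderScanner.FindPlaceholders(lines))
        {
            Console.WriteLine("Placeholder: {0}", name);
        }
    }
}
```

For the two sample lines, this reports the names year, begin and end. With the library, the Text property of each ContentLine would be passed in instead of the hard-coded strings.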

Finding Text

The next code snippet returns all text lines that contain the word Agreement:

Lines pdfLines = new Lines("lease_agreement.pdf");

string stringToFind = "Agreement";

List<ContentLine> contentLines = 
    pdfLines.Find(stringToFind);

foreach (ContentLine line in contentLines) {
    Console.WriteLine("Found string \"{0}\" on page {1} (X: {2}, Y: {3}): {4}",
        stringToFind, 
        line.Page.ToString(), 
        line.X.ToString(), 
        line.Y.ToString(), 
        line.Text);
}

The output lists the full text line, the page number and the actual location in the document:

Found string "Agreement" on page 1 (X: 72, Y: 83.25): Residential Lease Agreement
Found string "Agreement" on page 1 (X: 94.5, Y: 228.15): The term of this Agreement shall be be a period of «year» years, beginning on the
Found string "Agreement" on page 1 (X: 94.5, Y: 369): The rent to be paid by Tenant to Landlord throughout the term of this Agreement is

Radial Text Search

Other overloads of the Find method search for a regular expression or restrict the search to a specific region such as a rectangle or a radius. The following code passes a rectangle of 10 x 10 points positioned at (72, 83) and returns all lines in that region, including lines that only partially overlap it:

Lines pdfLines = new Lines("lease_agreement.pdf");

// Search a 10 x 10 rectangle at (72, 83); passing true also returns
// lines that only partially overlap the rectangle.
List<ContentLine> contentLines =
    pdfLines.Find(new RectangleF(72, 83, 10, 10), true);

foreach (ContentLine line in contentLines) {
    Console.WriteLine("Found text on page {0} (X: {1}, Y: {2}): {3}",
        line.Page.ToString(),
        line.X.ToString(),
        line.Y.ToString(),
        line.Text);
}

The output shows one entry found at that location (compare it with the first result of the previous code snippet):

Found text on page 1 (X: 72, Y: 83.25): Residential Lease Agreement
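
The difference between "fully inside" and "partially overlapping" is easier to see with plain geometry. The sketch below uses System.Drawing.RectangleF to compare a search area with a line's bounding box, and adds a point-to-rectangle distance check as one way to think about a radius search. It only illustrates the concept: the RegionChecks class is hypothetical, the width and height of the bounding box are made up (the output above only shows X and Y for a line), and nothing here claims to mirror how Find works internally.

```csharp
using System;
using System.Drawing;

// Illustrative geometry helpers, not part of TX Text Control.
public static class RegionChecks
{
    // True if the line box lies completely inside the search area.
    public static bool IsFullyInside(RectangleF area, RectangleF lineBox)
        => area.Contains(lineBox);

    // True if the line box and the search area share any region.
    public static bool Overlaps(RectangleF area, RectangleF lineBox)
        => area.IntersectsWith(lineBox);

    // True if any part of the line box is within 'radius' of 'center'.
    public static bool IsWithinRadius(PointF center, float radius, RectangleF lineBox)
    {
        // Clamp the center to the box to get the nearest point of the box.
        float nearestX = Math.Max(lineBox.Left, Math.Min(center.X, lineBox.Right));
        float nearestY = Math.Max(lineBox.Top, Math.Min(center.Y, lineBox.Bottom));
        float dx = center.X - nearestX;
        float dy = center.Y - nearestY;
        return dx * dx + dy * dy <= radius * radius;
    }
}

public static class RegionDemo
{
    public static void Main()
    {
        var area = new RectangleF(72, 83, 10, 10);

        // Position taken from the output above; width and height are assumed.
        var lineBox = new RectangleF(72f, 83.25f, 200f, 14f);

        Console.WriteLine("Fully inside:  {0}", RegionChecks.IsFullyInside(area, lineBox));
        Console.WriteLine("Overlaps:      {0}", RegionChecks.Overlaps(area, lineBox));
        Console.WriteLine("Within radius: {0}",
            RegionChecks.IsWithinRadius(new PointF(72, 83), 10, lineBox));
    }
}
```

With these assumed dimensions, the line is wider than the search area, so it is not fully inside, but it does overlap the area and lies within the radius. That matches the idea of a search that includes partially overlapping lines.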

Extracting Form Field Values

Interactive forms in the Adobe PDF format are also known as AcroForms, a de facto standard for PDF forms processing. These forms can be created and exported using TX Text Control so that end users can fill out the form fields in Acrobat Reader or other applications.

TX Text Control allows the extraction of form field data to collect results from completed documents.

The following code shows how to get all AcroForm fields from the above sample PDF document using the GetAcroFormFields method that accepts file names and byte arrays.

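using System;
using TXTextControl.DocumentServer.PDF;
using TXTextControl.DocumentServer.PDF.AcroForms;

// Read all AcroForm fields from the PDF file
FormField[] acroForms = Forms.GetAcroFormFields("lease_agreement.pdf");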

foreach (FormField field in acroForms) {
    switch (field) {
        case FormTextField textField:
            Console.WriteLine("Field \"{0}\" extracted: {1}",
                textField.FieldName,
                textField.Value);
            break;

        case FormCheckBox checkBoxField:
            Console.WriteLine("Field \"{0}\" extracted: {1}",
                checkBoxField.FieldName,
                checkBoxField.IsChecked.ToString());
            break;

        case FormComboBox comboBoxField:
            Console.WriteLine("Field \"{0}\" extracted. Selected value: {1}",
                comboBoxField.FieldName,
                comboBoxField.Value);

            foreach (var item in comboBoxField.Options) {
                Console.WriteLine(" -> Option: {0}", item);
            }

            break;
    }
}

The output lists all completed form fields including the name and possible drop-down options:

Field "tenant" extracted: Tim Tenant
Field "date" extracted: 6/16/2022
Field "agree" extracted: True
Field "movein" extracted. Selected value: June
 -> Option: January
 -> Option: June
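
Extracted values usually end up in some other system, so a flat name-to-value map can be a handy intermediate form. The sketch below feeds the same kind of pattern-matching switch into a Dictionary. To stay runnable without the library, it uses small stand-in record types (Field, TextField, CheckBox, ComboBox) that only imitate the properties used above; they are hypothetical and are not the TX Text Control classes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the library's form field classes.
public abstract record Field(string FieldName);
public record TextField(string FieldName, string Value) : Field(FieldName);
public record CheckBox(string FieldName, bool IsChecked) : Field(FieldName);
public record ComboBox(string FieldName, string Value, string[] Options) : Field(FieldName);

public static class FormValues
{
    // Flattens a list of fields into a name-to-value map.
    public static Dictionary<string, string> ToDictionary(IEnumerable<Field> fields)
    {
        var values = new Dictionary<string, string>();
        foreach (Field field in fields)
        {
            switch (field)
            {
                case TextField text:
                    values[text.FieldName] = text.Value;
                    break;
                case CheckBox check:
                    values[check.FieldName] = check.IsChecked.ToString();
                    break;
                case ComboBox combo:
                    // Only the selected value is kept; the options are ignored.
                    values[combo.FieldName] = combo.Value;
                    break;
            }
        }
        return values;
    }
}

public static class FormValuesDemo
{
    public static void Main()
    {
        // Sample data modeled on the output above.
        var fields = new Field[]
        {
            new TextField("tenant", "Tim Tenant"),
            new TextField("date", "6/16/2022"),
            new CheckBox("agree", true),
            new ComboBox("movein", "June", new[] { "January", "June" })
        };

        Dictionary<string, string> values = FormValues.ToDictionary(fields);
        Console.WriteLine("tenant = {0}", values["tenant"]);
        Console.WriteLine("movein = {0}", values["movein"]);
    }
}
```

In real code, the switch would match the library's FormTextField, FormCheckBox and FormComboBox types as shown in the previous snippet, and the resulting map could then be serialized or stored as needed.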

Additionally, all form field classes inherit from the base class FormField, which provides the alternative field name, the type and the location of the field within the PDF document.

Also See

This post references the following in the documentation:

TXTextControl.DocumentServer.PDF.AcroForms.FormField Class
TXTextControl.DocumentServer.PDF.Contents.Lines Class
TXTextControl.DocumentServer.PDF.Contents.Lines.Find Method
TXTextControl.DocumentServer.PDF.Forms.GetAcroFormFields Method
TXTextControl.DocumentServer.PDF Namespace
