Extract Text and Data from PDF Documents in C#

TX Text Control can be used to create and edit Adobe PDF documents programmatically. But it is also possible to import PDF documents to read, extract and manipulate them. This article shows different ways to extract data from existing PDF documents.

TX Text Control is able to import "digitally born" Adobe PDF documents like any other supported file type. Using this approach, PDF documents can be searched for strings in document pages and form field content can be extracted.

Retrieving Text Lines

The namespace TXTextControl.DocumentServer.PDF contains the class Contents.Lines, which can be used to import text lines and their coordinates from a PDF document.

Consider the following sample PDF document that is used in this article to show how text and data can be easily extracted.

[Figure: PDF in Text Control]

The following code shows how to load a PDF document in order to loop through all recognized text lines:

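using TXTextControl.DocumentServer.PDF.Contents;

// Load the PDF document and collect its recognized text lines.
Lines pdfLines = new Lines("lease_agreement.pdf");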

foreach (ContentLine contentLine in pdfLines.ContentLines) {
    Console.WriteLine("Content line on page {0}: {1}", 
        contentLine.Page.ToString(), 
        contentLine.Text);
}

The output of the above code shows the text lines and the associated page number:

Content line on page 1: Residential Lease Agreement
Content line on page 1: In consideration of the Landlord leasing certain premises to the Tenant
Content line on page 1: and other valuable consideration, the receipt and sufficiency of which
Content line on page 1: considerations is herby acknowledged, the Parties agree as follows.
Content line on page 1: 1. Lease Term.
Content line on page 1: The term of this Agreement shall be be a period of «year» years, beginning on the
Content line on page 1: date «begin» and ending on the date «end».
Content line on page 1: 2. Property.
Content line on page 1: The leased premises shall be comprised of that certain personal residence (including
Content line on page 1: the house and the land) located at «location» ("Premises"). Landlord leases the
Content line on page 1: Premises to Tenant and Tenant leases the Premises from Landlord on the terms and
Content line on page 1: conditions set forth herein.
Content line on page 1: 3. Monthly Rent.
Content line on page 1: The rent to be paid by Tenant to Landlord throughout the term of this Agreement is
Content line on page 1: «rent» per month and shall be due on the 1st day of each month.
Content line on page 1: Tenant Landlord
Content line on page 1: Name Name
Content line on page 1: Date Date
Content line on page 1: I agree
Content line on page 1: Move in month:
Content line on page 1: Tim Tenant
Content line on page 1: 6/16/2022
Content line on page 1: June
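
The sample output contains merge field placeholders such as «year» and «rent». As a minimal sketch of post-processing the extracted text, the following code collects those placeholder names with a regular expression. It works on plain strings copied from the output above, so it runs without TX Text Control; in an application the strings would come from ContentLine.Text. PlaceholderFinder and FindPlaceholders are hypothetical names and not part of the TX Text Control API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sample line texts copied from the output above; in a real application
// they would be read from ContentLine.Text.
string[] lineTexts =
{
    "The term of this Agreement shall be be a period of «year» years, beginning on the",
    "date «begin» and ending on the date «end».",
    "the house and the land) located at «location» (\"Premises\"). Landlord leases the",
    "«rent» per month and shall be due on the 1st day of each month."
};

List<string> names = PlaceholderFinder.FindPlaceholders(lineTexts);
Console.WriteLine(string.Join(", ", names)); // year, begin, end, location, rent

// Hypothetical helper, not part of TX Text Control.
static class PlaceholderFinder
{
    // Matches text enclosed in « and » and captures the name in between.
    private static readonly Regex Pattern = new Regex("«([^«»]+)»");

    // Returns each placeholder name once, in order of first appearance.
    public static List<string> FindPlaceholders(IEnumerable<string> lines)
    {
        var result = new List<string>();
        foreach (string line in lines)
        {
            foreach (Match match in Pattern.Matches(line))
            {
                string name = match.Groups[1].Value;
                if (!result.Contains(name))
                {
                    result.Add(name);
                }
            }
        }
        return result;
    }
}
```

Because matching is done per line, a placeholder that is split across two text lines would not be found by this sketch.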

Finding Text

The next code snippet returns all text lines that contain the word Agreement:

Lines pdfLines = new Lines("lease_agreement.pdf");

string stringToFind = "Agreement";

List<ContentLine> contentLines = 
    pdfLines.Find(stringToFind);

foreach (ContentLine line in contentLines) {
    Console.WriteLine("Found string \"{0}\" on page {1} (X: {2}, Y: {3}): {4}",
        stringToFind, 
        line.Page.ToString(), 
        line.X.ToString(), 
        line.Y.ToString(), 
        line.Text);
}

The output lists the full text line, the page number and the actual location in the document:

Found string "Agreement" on page 1 (X: 72, Y: 83.25): Residential Lease Agreement
Found string "Agreement" on page 1 (X: 94.5, Y: 228.15): The term of this Agreement shall be be a period of «year» years, beginning on the
Found string "Agreement" on page 1 (X: 94.5, Y: 369): The rent to be paid by Tenant to Landlord throughout the term of this Agreement is
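
The article does not state whether Find(string) is case-sensitive or matches whole words only. If stricter matching is needed, one option is to filter line texts on the client. The following sketch does a case-insensitive, whole-word match; LineHit and LineFilter are hypothetical stand-ins (not TX Text Control types), and the sample data is copied from the first output listing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Stand-in data copied from the output above; a real application would
// read Page and Text from each ContentLine instead.
var lines = new List<LineHit>
{
    new LineHit(1, "Residential Lease Agreement"),
    new LineHit(1, "1. Lease Term."),
    new LineHit(1, "the house and the land) located at «location» (\"Premises\"). Landlord leases the"),
    new LineHit(1, "Premises to Tenant and Tenant leases the Premises from Landlord on the terms and")
};

// "lease" matches "Lease" but not "leases", so two lines are printed.
foreach (LineHit hit in LineFilter.WholeWord(lines, "lease"))
{
    Console.WriteLine("Page {0}: {1}", hit.Page, hit.Text);
}

// Hypothetical helper, not part of TX Text Control.
static class LineFilter
{
    // Keeps lines that contain the word as a whole word, ignoring case.
    public static List<LineHit> WholeWord(IEnumerable<LineHit> lines, string word)
    {
        var pattern = new Regex(@"\b" + Regex.Escape(word) + @"\b", RegexOptions.IgnoreCase);
        return lines.Where(line => pattern.IsMatch(line.Text)).ToList();
    }
}

// Hypothetical stand-in for the page number and text of a content line.
record LineHit(int Page, string Text);
```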

Radial Text Search

Other overloads of the Find method search for a regular expression or for lines within a specific area, such as a rectangle or a radius around a point. The following code returns all lines within a 10 x 10 point rectangle whose upper-left corner is at (72, 83), including lines that only partially overlap the rectangle:

Lines pdfLines = new Lines("lease_agreement.pdf");

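// Search area: 10 x 10 points with its upper-left corner at (72, 83).
// Passing true includes lines that only partially overlap the area.
List<ContentLine> contentLines =
    pdfLines.Find(new RectangleF(72, 83, 10, 10), true);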

foreach (ContentLine line in contentLines) {
    Console.WriteLine("Found text on page {0} (X: {1}, Y: {2}): {3}",
        line.Page.ToString(),
        line.X.ToString(),
        line.Y.ToString(),
        line.Text);
}

The output shows one entry found at that location (compare it with the first result of the previous code snippet):

Found text on page 1 (X: 72, Y: 83.25): Residential Lease Agreement
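
To illustrate the difference between partial overlap and full containment, here is a small sketch that uses only System.Drawing.RectangleF, without TX Text Control. The width and height of the line's bounding box are made-up values (the output above only shows X and Y), and Geometry.Matches is a hypothetical helper that mirrors how such a test could work, not the library's actual implementation. SquareAround shows one way to build a square that encloses a circular search radius.

```csharp
using System;
using System.Drawing;

// Search area from the snippet above: upper-left corner (72, 83), 10 x 10 points.
var searchArea = new RectangleF(72, 83, 10, 10);

// Assumed bounding box of the "Residential Lease Agreement" line.
// X and Y are taken from the output above; width and height are invented.
var lineBounds = new RectangleF(72f, 83.25f, 180f, 14f);

Console.WriteLine(Geometry.Matches(searchArea, lineBounds, includePartial: true));  // True
Console.WriteLine(Geometry.Matches(searchArea, lineBounds, includePartial: false)); // False

// A square that encloses a circle with a radius of 10 points around (72, 83).
RectangleF square = Geometry.SquareAround(new PointF(72, 83), 10);
Console.WriteLine(square);

// Hypothetical helper, not part of TX Text Control.
static class Geometry
{
    // Partial mode: any overlap counts. Strict mode: the line must lie fully inside the area.
    public static bool Matches(RectangleF area, RectangleF line, bool includePartial)
        => includePartial ? area.IntersectsWith(line) : area.Contains(line);

    // Builds the axis-aligned square with the given center and half side length.
    public static RectangleF SquareAround(PointF center, float radius)
        => new RectangleF(center.X - radius, center.Y - radius, 2 * radius, 2 * radius);
}
```

With the assumed dimensions, the line starts inside the search area but extends far beyond it, so it is only found when partially overlapping lines are included.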

Extracting Form Field Values

Interactive forms in the Adobe PDF format are also known as AcroForms, a de facto standard for PDF form processing. These forms can be created and exported using TX Text Control so that end users can fill out the form fields in Acrobat Reader or other applications.

TX Text Control allows the extraction of form field data to collect results from completed documents.

The following code shows how to get all AcroForm fields from the sample PDF document above using the GetAcroFormFields method, which accepts file names and byte arrays:

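using TXTextControl.DocumentServer.PDF;
using TXTextControl.DocumentServer.PDF.AcroForms;

// Read all AcroForm fields from the PDF document.
FormField[] acroForms = Forms.GetAcroFormFields("lease_agreement.pdf");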

foreach (FormField field in acroForms) {
    switch (field) {
        case FormTextField textField:
            Console.WriteLine("Field \"{0}\" extracted: {1}",
                textField.FieldName,
                textField.Value);
            break;

        case FormCheckBox checkBoxField:
            Console.WriteLine("Field \"{0}\" extracted: {1}",
                checkBoxField.FieldName,
                checkBoxField.IsChecked.ToString());
            break;

        case FormComboBox comboBoxField:
            Console.WriteLine("Field \"{0}\" extracted. Selected value: {1}",
                comboBoxField.FieldName,
                comboBoxField.Value);

            foreach (var item in comboBoxField.Options) {
                Console.WriteLine(" -> Option: {0}", item);
            }

            break;
    }
}

The output lists all completed form fields including the name and possible drop-down options:

Field "tenant" extracted: Tim Tenant
Field "date" extracted: 6/16/2022
Field "agree" extracted: True
Field "movein" extracted. Selected value: June
 -> Option: January
 -> Option: June

Additionally, all form field classes inherit from the base class FormField, which also provides the alternative field name, the field type and the location within the PDF document.
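
Extracted values are often mapped to typed data before further processing. The following sketch shows one possible way to do that with the values from the output above. It uses a plain dictionary instead of the FormField objects so that it runs without TX Text Control; LeaseData and LeaseMapper are hypothetical names, and the date pattern M/d/yyyy is an assumption based on the sample value 6/16/2022.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Field names and values copied from the output above. In a real application
// they would come from the FormField objects (for example FieldName and Value).
var extracted = new Dictionary<string, string>
{
    ["tenant"] = "Tim Tenant",
    ["date"] = "6/16/2022",
    ["agree"] = "True",
    ["movein"] = "June"
};

LeaseData data = LeaseMapper.Map(extracted);
Console.WriteLine($"{data.Tenant} agreed: {data.Agreed}, signed in {data.SignedOn.Year}");
// Tim Tenant agreed: True, signed in 2022

// Hypothetical helper, not part of TX Text Control.
static class LeaseMapper
{
    public static LeaseData Map(IReadOnlyDictionary<string, string> fields)
    {
        // Assumption: the date field uses the month/day/year pattern of the sample value.
        DateTime signedOn = DateTime.ParseExact(
            fields["date"], "M/d/yyyy", CultureInfo.InvariantCulture);

        // "True"/"False" is the text form of the check box state shown in the output.
        bool agreed = bool.Parse(fields["agree"]);

        return new LeaseData(fields["tenant"], signedOn, agreed, fields["movein"]);
    }
}

// Hypothetical typed representation of the completed form.
record LeaseData(string Tenant, DateTime SignedOn, bool Agreed, string MoveInMonth);
```

When working with the FormCheckBox object directly, the IsChecked property can be used as is, without the round trip through a string.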

Also See

This post references the following in the documentation:

  • TXTextControl.DocumentServer.PDF.AcroForms.FormField Class
  • TXTextControl.DocumentServer.PDF.Contents.Lines Class
  • TXTextControl.DocumentServer.PDF.Contents.Lines.Find Method
  • TXTextControl.DocumentServer.PDF.Forms.GetAcroFormFields Method
  • TXTextControl.DocumentServer.PDF Namespace