Extract Text and Data from PDF Documents in C#

TX Text Control is able to import "digitally born" Adobe PDF documents like any other supported file type. Using this approach, PDF documents can be searched for strings in document pages and form field content can be extracted.

Retrieving Text Lines

The namespace TXTextControl.DocumentServer.PDF contains the class Contents.Lines, which can be used to import text coordinates from a PDF document.

Consider the following sample PDF document that is used in this article to show how text and data can be easily extracted.

PDF in Text Control

The following code shows how to load a PDF document in order to loop through all recognized text lines:

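	using System;
	using TXTextControl.DocumentServer.PDF.Contents;

	// Load the PDF document and collect all recognized text lines
	Lines pdfLines = new Lines("lease_agreement.pdf");

	// Print each line together with the page it was found on
	foreach (ContentLine contentLine in pdfLines.ContentLines) {
		Console.WriteLine("Content line on page {0}: {1}",
			contentLine.Page.ToString(),
			contentLine.Text);
	}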


The output of the above code shows the text lines and the associated page number:

Content line on page 1: Residential Lease Agreement
Content line on page 1: In consideration of the Landlord leasing certain premises to the Tenant
Content line on page 1: and other valuable consideration, the receipt and sufficiency of which
Content line on page 1: considerations is herby acknowledged, the Parties agree as follows.
Content line on page 1: 1. Lease Term.
Content line on page 1: The term of this Agreement shall be be a period of «year» years, beginning on the
Content line on page 1: date «begin» and ending on the date «end».
Content line on page 1: 2. Property.
Content line on page 1: The leased premises shall be comprised of that certain personal residence (including
Content line on page 1: the house and the land) located at «location» (?Premises?). Landlord leases the
Content line on page 1: Premises to Tenant and Tenant leases the Premises from Landlord on the terms and
Content line on page 1: conditions set forth herein.
Content line on page 1: 3. Monthly Rent.
Content line on page 1: The rent to be paid by Tenant to Landlord throughout the term of this Agreement is
Content line on page 1: «rent» per month and shall be due on the 1st day of each month.
Content line on page 1: Tenant Landlord
Content line on page 1: Name Name
Content line on page 1: Date Date
Content line on page 1: I agree
Content line on page 1: Move in month:
Content line on page 1: Tim Tenant
Content line on page 1: 6/16/2022
Content line on page 1: June
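
Once the lines are available as plain strings, standard .NET tools can be applied to them. The sample output contains merge placeholders such as «year» and «rent», and the following minimal sketch collects them with a regular expression. So that it runs without the library, it assumes a hard-coded array of strings copied from the output above in place of the ContentLine.Text values; the variable names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Stand-in for the ContentLine.Text values shown in the output above;
// in a real application these would come from pdfLines.ContentLines.
string[] lineTexts = {
    "The term of this Agreement shall be be a period of «year» years, beginning on the",
    "date «begin» and ending on the date «end».",
    "the house and the land) located at «location» (?Premises?). Landlord leases the",
    "«rent» per month and shall be due on the 1st day of each month."
};

// Collect every name enclosed in « » guillemets, in order of appearance.
List<string> placeholders = new List<string>();
Regex placeholderPattern = new Regex("«([^»]+)»");

foreach (string text in lineTexts) {
    foreach (Match match in placeholderPattern.Matches(text)) {
        placeholders.Add(match.Groups[1].Value);
    }
}

// Prints: year, begin, end, location, rent
Console.WriteLine(string.Join(", ", placeholders));
```

The same loop could run directly over pdfLines.ContentLines by reading contentLine.Text instead of the array entries.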

Finding Text

The next code snippet returns all text lines that contain the word Agreement:

	Lines pdfLines = new Lines("lease_agreement.pdf");

	string stringToFind = "Agreement";

	List<ContentLine> contentLines =
		pdfLines.Find(stringToFind);

	foreach (ContentLine line in contentLines) {
		Console.WriteLine("Found string \"{0}\" on page {1} (X: {2}, Y: {3}): {4}",
			stringToFind,
			line.Page.ToString(),
			line.X.ToString(),
			line.Y.ToString(),
			line.Text);
	}


The output lists the full text line, the page number and the actual location in the document:

Found string "Agreement" on page 1 (X: 72, Y: 83.25): Residential Lease Agreement
Found string "Agreement" on page 1 (X: 94.5, Y: 228.15): The term of this Agreement shall be be a period of «year» years, beginning on the
Found string "Agreement" on page 1 (X: 94.5, Y: 369): The rent to be paid by Tenant to Landlord throughout the term of this Agreement is

Radial Text Search

Other overloads of the Find method allow searching with a regular expression or searching for lines in a specific area such as a rectangle or a radius. The following code returns all lines within a 10 by 10 point rectangle positioned at (72, 83) and includes lines that only partially overlap that rectangle:

	Lines pdfLines = new Lines("lease_agreement.pdf");

	// Search area: X = 72, Y = 83, width = 10, height = 10;
	// true also returns lines that only partially overlap the area
	List<ContentLine> contentLines =
		pdfLines.Find(new RectangleF(72, 83, 10, 10), true);

	foreach (ContentLine line in contentLines) {
		Console.WriteLine("Found text on page {0} (X: {1}, Y: {2}): {3}",
			line.Page.ToString(),
			line.X.ToString(),
			line.Y.ToString(),
			line.Text);
	}


The output shows one found entry at that specific location (compare to the first results entry of the previous code snippet):

Found text on page 1 (X: 72, Y: 83.25): Residential Lease Agreement
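
The difference between a line that lies fully inside a search area and one that only partially overlaps it can be illustrated with System.Drawing.RectangleF alone. The following sketch is not the library's implementation. It assumes hypothetical bounds for the heading line: X and Y come from the output above, while the width and height are made-up values, since the output does not show them.

```csharp
using System;
using System.Drawing;

// Search area from the snippet above: X 72, Y 83, 10 wide, 10 high.
RectangleF searchArea = new RectangleF(72, 83, 10, 10);

// Hypothetical bounds for the heading line. X and Y are taken from the
// output above; the width and height are made-up values for illustration.
RectangleF lineBounds = new RectangleF(72f, 83.25f, 180f, 14f);

// A line is a full match only if it lies entirely inside the area.
bool fullyInside = searchArea.Contains(lineBounds);

// A line is a partial match if the two rectangles share any area.
bool overlaps = searchArea.IntersectsWith(lineBounds);

// Prints: Fully inside: False, overlaps: True
Console.WriteLine("Fully inside: {0}, overlaps: {1}", fullyInside, overlaps);
```

Under these assumed bounds the heading extends well beyond the small search area, which would explain why it is only returned when partially overlapping lines are included.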

Extracting Form Field Values

Interactive forms in the Adobe PDF format are also known as AcroForms, a de facto standard for PDF forms processing. The forms can be created and exported using TX Text Control so that end users can fill out the form fields in Acrobat Reader or other applications.

TX Text Control allows the extraction of form field data to collect results from completed documents.

The following code shows how to get all AcroForm fields from the above sample PDF document using the GetAcroFormFields method, which accepts file names and byte arrays:

	FormField[] acroForms = Forms.GetAcroFormFields("lease_agreement.pdf");

	foreach (FormField field in acroForms) {
		switch (field) {
			case FormTextField textField:
				Console.WriteLine("Field \"{0}\" extracted: {1}",
					textField.FieldName,
					textField.Value);
				break;

			case FormCheckBox checkBoxField:
				Console.WriteLine("Field \"{0}\" extracted: {1}",
					checkBoxField.FieldName,
					checkBoxField.IsChecked.ToString());
				break;

			case FormComboBox comboBoxField:
				Console.WriteLine("Field \"{0}\" extracted. Selected value: {1}",
					comboBoxField.FieldName,
					comboBoxField.Value);

				foreach (var item in comboBoxField.Options) {
					Console.WriteLine(" -> Option: {0}", item);
				}

				break;
		}
	}


The output lists all completed form fields including the name and possible drop-down options:

Field "tenant" extracted: Tim Tenant
Field "date" extracted: 6/16/2022
Field "agree" extracted: True
Field "movein" extracted. Selected value: June
 -> Option: January
 -> Option: June

Additionally, all form field classes inherit from the base class FormField and return the alternative field name, type and the location within the PDF document.
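
Collected form data usually has to be converted into typed values before it can be processed further. The following sketch starts from the strings printed in the output above, stored in a hypothetical dictionary keyed by field name, and converts the date and checkbox values. The US month/day/year date pattern is an assumption based on the sample value, and the variable names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Values as printed in the output above, keyed by field name. In a real
// application this dictionary would be filled inside the switch statement.
Dictionary<string, string> fieldValues = new Dictionary<string, string> {
    { "tenant", "Tim Tenant" },
    { "date", "6/16/2022" },
    { "agree", "True" },
    { "movein", "June" }
};

// Assumption: the date field uses the US month/day/year pattern.
DateTime formDate = DateTime.ParseExact(
    fieldValues["date"], "M/d/yyyy", CultureInfo.InvariantCulture);

// The checkbox state was printed as "True" or "False".
bool agreed = bool.Parse(fieldValues["agree"]);

Console.WriteLine("Tenant: {0}, date: {1}, agreed: {2}, move-in month: {3}",
    fieldValues["tenant"],
    formDate.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture),
    agreed,
    fieldValues["movein"]);
```

Parsing with an explicit pattern and the invariant culture keeps the result independent of the regional settings of the server that processes the form.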
