TX Text Control can be used not only to create PDF documents from templates or programmatically from scratch, but also to extract data from existing PDF documents. When TX Text Control is used to process PDF documents, it provides a complete workflow including creation, form data extraction, and content searching.

This article covers how to extract data in the following typical scenarios:

  • Extracting text from PDF documents
  • Extracting text at a specific location in PDF documents
  • Extracting form field data from PDF documents
  • Extracting meta data from PDF documents
  • Extracting attachments from PDF documents

Preparing the Application

A .NET 8 console application is created for the purposes of this demo.

Prerequisites

The following tutorial requires a trial version of TX Text Control .NET Server for ASP.NET.

  1. In Visual Studio, create a new Console App using .NET 8.

  2. In the Solution Explorer, select the project you created and choose Manage NuGet Packages... from the Project main menu.

    Select Text Control Offline Packages from the Package source drop-down.

    Install the latest version of the following package:

    • TXTextControl.TextControl.ASP.SDK


Extracting Text from PDF Documents

The following example uses the simple invoice document shown in the screenshot below:

invoice.pdf

With TX Text Control, you can extract all of the plain text from a PDF document by loading the document and accessing the plain text. The following code shows how to extract the plain text from a PDF document:

using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
    tx.Create();
    TXTextControl.LoadSettings loadSettings = new TXTextControl.LoadSettings()
    {
        PDFImportSettings = TXTextControl.PDFImportSettings.GenerateLines
    };
    tx.Load("invoice.pdf", TXTextControl.StreamType.AdobePDF, loadSettings);
    var plainText = tx.Text;
    Console.WriteLine(plainText);
}

The extracted text from the invoice document looks like this:

INVOICE
Text Control
Bill To:
Typer Company
Tim Typer
7872 Typing Ave.
Charlotte, NC 28210
Invoice #: 267664
Item Description Price
Product 1 Description 1 800.00
Product 2 Description 2 400.00
Product 3 Description 3 1200.00
Total:2400.00
Paid in full with CC ending with ****3425.
Thanks for your business!
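
Once the plain text is available, individual values can be picked out with regular expressions. The following is a minimal sketch that uses only .NET base class library calls and a shortened copy of the text shown above as input. The helper name FirstGroup and both patterns are assumptions made for this illustration and are not part of TX Text Control.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Sample input: a shortened copy of the extracted invoice text shown above.
string plainText = "Invoice #: 267664\nItem Description Price\nProduct 1 Description 1 800.00\nTotal:2400.00\n";

// Hypothetical helper: returns the first capture group of the pattern,
// or null when the pattern does not match anywhere in the text.
static string? FirstGroup(string text, string pattern)
{
    Match match = Regex.Match(text, pattern, RegexOptions.Multiline);
    return match.Success ? match.Groups[1].Value : null;
}

// Both patterns are anchored to the start of a line (RegexOptions.Multiline).
string? invoiceNumber = FirstGroup(plainText, @"^Invoice #:\s*(\d+)");
string? totalText = FirstGroup(plainText, @"^Total:\s*([0-9]+(?:\.[0-9]+)?)");

// Parse with the invariant culture so "2400.00" is read the same way on every machine.
decimal? total = totalText is null
    ? null
    : decimal.Parse(totalText, CultureInfo.InvariantCulture);

Console.WriteLine($"Invoice number: {invoiceNumber}");
Console.WriteLine($"Total: {total}");
```

This approach depends on the labels appearing at the start of a line in the extracted text, so it is best suited to documents with a known, stable layout.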

Find Specific String

TX Text Control provides a powerful text search functionality that can be used to find specific strings in a PDF document. The following code shows how to find the string "Total:" in a PDF document:

using TXTextControl.DocumentServer.PDF.Contents;

Lines pdfLines = new Lines("invoice.pdf");
string stringToFind = "Total:";

List<ContentLine> contentLines = pdfLines.Find(stringToFind);

foreach (ContentLine line in contentLines)
{
    Console.WriteLine("Found string \"{0}\" on page {1} (X: {2}, Y: {3}): {4}",
        stringToFind,
        line.Page.ToString(),
        line.X.ToString(),
        line.Y.ToString(),
        line.Text);
}

The search result for the invoice document looks like this:

Found string "Total:" on page 1 (X: 462.75, Y: 376.2): Total:2400.00

Find Text by Location

Now that we have the location of the string "Total:", we can extract the value at that location. This makes it possible to read the total of a PDF invoice programmatically, as long as the layout of the document is known.


The following code shows how to extract the value at the location of the string "Total:" in a PDF document:

using System.Drawing;
using System.Globalization;
using System.Text.RegularExpressions;
using TXTextControl.DocumentServer.PDF.Contents;

Lines pdfLines = new Lines("invoice.pdf");

// Search the area that starts at the position found for "Total:"
List<ContentLine> contentLines =
    pdfLines.Find(new RectangleF(462, 376, 400, 400), true);

foreach (ContentLine line in contentLines)
{
    float fTotal = ExtractFloatFromString(line.Text);
    Console.WriteLine("Found value on page {0} (X: {1}, Y: {2}): {3}",
        line.Page.ToString(),
        line.X.ToString(),
        line.Y.ToString(),
        fTotal.ToString());
}

static float ExtractFloatFromString(string input)
{
    // Regular expression to match a floating-point number
    string pattern = @"[-+]?[0-9]*\.?[0-9]+";

    // Match the pattern in the input string
    Match match = Regex.Match(input, pattern);

    if (match.Success)
    {
        // Convert the matched value to float; the invariant culture ensures
        // that "." is treated as the decimal separator on every system
        return float.Parse(match.Value, CultureInfo.InvariantCulture);
    }

    // If no match is found, return 0.0; throw an exception or handle it differently as needed
    return 0.0f;
}

The extracted value from the invoice document looks like this:

Found value on page 1 (X: 462.75, Y: 376.2): 2400
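
The two steps can also be chained so that the coordinates do not have to be hard-coded. The following sketch uses only the calls that appear in the examples above (Lines.Find with a string, Lines.Find with a RectangleF, and the X, Y, Page and Text properties of ContentLine). It assumes that X and Y are numeric values in the same unit that the rectangle search expects, which the sample output suggests but which should be verified against the product documentation. The 400 x 400 size is simply copied from the example above.

```csharp
using System;
using System.Drawing;
using TXTextControl.DocumentServer.PDF.Contents;

Lines pdfLines = new Lines("invoice.pdf");

// Step 1: locate the label, as in the string search above.
foreach (ContentLine label in pdfLines.Find("Total:"))
{
    // Step 2: build the search rectangle from the label position.
    // Assumption: X and Y can be converted to float and share the rectangle's unit.
    // The values are rounded down to mirror the hard-coded 462 / 376 used above.
    RectangleF area = new RectangleF(
        MathF.Floor((float)label.X),
        MathF.Floor((float)label.Y),
        400,
        400);

    // Step 3: read every line found in that area.
    foreach (ContentLine hit in pdfLines.Find(area, true))
    {
        Console.WriteLine("Page {0}: {1}", hit.Page, hit.Text);
    }
}
```

Deriving the rectangle from the label keeps the extraction working when the total moves slightly between invoices, although a rectangle this large may also return neighboring lines.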

Extracting Form Field Data

PDF documents can contain form fields that can be filled out by the user. These form fields can be extracted using TX Text Control. The following screenshot shows a PDF document with form fields:

form.pdf

The following code shows how to extract the form field data from a PDF document:

using TXTextControl.DocumentServer.PDF;
using TXTextControl.DocumentServer.PDF.AcroForms;

FormField[] acroForms = Forms.GetAcroFormFields("lease_agreement.pdf");

foreach (FormField field in acroForms)
{
    // Handle each supported field type separately
    switch (field)
    {
        case FormTextField textField:
            Console.WriteLine("Field \"{0}\" extracted: {1}",
                textField.FieldName,
                textField.Value);
            break;

        case FormCheckBox checkBoxField:
            Console.WriteLine("Field \"{0}\" extracted: {1}",
                checkBoxField.FieldName,
                checkBoxField.IsChecked.ToString());
            break;

        case FormComboBox comboBoxField:
            Console.WriteLine("Field \"{0}\" extracted. Selected value: {1}",
                comboBoxField.FieldName,
                comboBoxField.Value);

            // List all available options of the combo box
            foreach (var item in comboBoxField.Options)
            {
                Console.WriteLine(" -> Option: {0}", item);
            }
            break;
    }
}

The extracted form field data from the lease agreement document looks like this:

Field "tenant" extracted: Tim Tenant
Field "date" extracted: 6/16/2022
Field "agree" extracted: True
Field "movein" extracted. Selected value: June
 -> Option: January
 -> Option: June
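
For further processing, it is often more convenient to collect the values in a dictionary than to print them. The following sketch reuses the field types and properties from the example above and serializes the result with System.Text.Json. The two TXTextControl using directives are an assumption about where the Forms class and the field types live, and storing every value as text is a simplification made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using TXTextControl.DocumentServer.PDF;            // assumed namespace of Forms
using TXTextControl.DocumentServer.PDF.AcroForms;  // assumed namespace of the field types

// Field name -> value as text; check boxes are stored as "True" or "False".
var values = new Dictionary<string, string?>();

foreach (FormField field in Forms.GetAcroFormFields("lease_agreement.pdf"))
{
    switch (field)
    {
        case FormTextField textField:
            values[textField.FieldName] = Convert.ToString(textField.Value);
            break;
        case FormCheckBox checkBoxField:
            values[checkBoxField.FieldName] = checkBoxField.IsChecked.ToString();
            break;
        case FormComboBox comboBoxField:
            values[comboBoxField.FieldName] = Convert.ToString(comboBoxField.Value);
            break;
    }
}

// Write the collected values as indented JSON.
Console.WriteLine(JsonSerializer.Serialize(
    values,
    new JsonSerializerOptions { WriteIndented = true }));
```

Field types that are not handled in the switch statement are skipped, exactly as in the example above.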

Extracting Meta Data

PDF documents can contain meta data such as author, title, and keywords. This meta data can be extracted using TX Text Control. The following code shows how to extract the meta data from a PDF document:

using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
    tx.Create();
    TXTextControl.LoadSettings loadSettings = new TXTextControl.LoadSettings()
    {
        PDFImportSettings = TXTextControl.PDFImportSettings.GenerateLines
    };
    tx.Load("invoice.pdf", TXTextControl.StreamType.AdobePDF, loadSettings);

    // Write document meta data to the console
    Console.WriteLine($"Author: {loadSettings.Author}");
    Console.WriteLine($"Subject: {loadSettings.DocumentSubject}");
    Console.WriteLine($"Title: {loadSettings.DocumentTitle}");
    Console.WriteLine($"Creation Date: {loadSettings.CreationDate}");
    Console.WriteLine($"Application: {loadSettings.CreatorApplication}");

    // Guard against documents that do not define any keywords
    if (loadSettings.DocumentKeywords != null)
    {
        foreach (string keyword in loadSettings.DocumentKeywords)
        {
            Console.WriteLine($"Keyword: {keyword}");
        }
    }
}

The extracted meta data from the invoice document looks like this:

Author: Tim Typer
Subject: Sample invoice
Title: PDF Invoice
Creation Date: 6/10/2024 3:40:46 PM
Application: TX Text Control
Keyword: ERP tag
Keyword:  Tag 2
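
Note that the second keyword in the output above carries a leading space. If the keywords are used for lookups or tagging, it can help to normalize them first. The following is a minimal sketch that uses only .NET base class library calls; the helper name NormalizeKeywords and the sample array are assumptions made for this illustration.

```csharp
using System;
using System.Linq;

// Sample input mirroring the output above; the second entry has a leading space.
string[] rawKeywords = { "ERP tag", " Tag 2" };

// Hypothetical helper: trims each keyword and drops empty entries.
// A null array yields an empty result instead of an exception.
static string[] NormalizeKeywords(string[]? keywords) =>
    (keywords ?? Array.Empty<string>())
        .Select(k => k.Trim())
        .Where(k => k.Length > 0)
        .ToArray();

foreach (string keyword in NormalizeKeywords(rawKeywords))
{
    Console.WriteLine($"Keyword: {keyword}");
}
```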

Extracting Attachments

PDF documents can contain attachments such as images, documents, or other files. PDF/A-3 documents enable the transition from electronic paper to an electronic container that contains both human- and machine-readable versions of a document. A PDF/A-3 document can contain an unlimited number of embedded documents for different processes.

To extract attachments, the PDF must be loaded into TX Text Control using the LoadSettings class, which provides properties for advanced settings and information during load operations. The attachments are stored in its EmbeddedFiles property, an array of EmbeddedFile objects.

using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
    tx.Create();
    TXTextControl.LoadSettings loadSettings = new TXTextControl.LoadSettings();
    tx.Load("mypdf.pdf", TXTextControl.StreamType.AdobePDF, loadSettings);

    // Guard against documents that do not contain any attachments
    if (loadSettings.EmbeddedFiles != null)
    {
        foreach (TXTextControl.EmbeddedFile embeddedFile in loadSettings.EmbeddedFiles)
        {
            // Write the raw bytes; decoding them as ASCII text would corrupt binary attachments
            System.IO.File.WriteAllBytes(
                embeddedFile.FileName,
                (byte[])embeddedFile.Data);
        }
    }
}

This code loads a PDF file, extracts any embedded files in the file, and saves those embedded files as separate files on the hard drive.
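
Because the file name of an attachment comes from the PDF itself, it is worth sanitizing it before writing to disk when documents come from untrusted sources. The following is a minimal sketch that uses only System.IO calls. The helper name GetSafeOutputPath, the output folder, the fallback file name and the stand-in bytes are assumptions made for this illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper: maps an attachment name to a path inside outputDirectory.
static string GetSafeOutputPath(string outputDirectory, string attachmentName)
{
    // Path.GetFileName drops any directory part, so "../invoice-data.xml"
    // becomes "invoice-data.xml".
    string fileName = Path.GetFileName(attachmentName);

    if (string.IsNullOrWhiteSpace(fileName))
    {
        fileName = "attachment.bin"; // fallback name chosen for this sketch
    }

    return Path.Combine(outputDirectory, fileName);
}

string outputDirectory = Path.Combine(Path.GetTempPath(), "pdf-attachments");
Directory.CreateDirectory(outputDirectory);

// Stand-in bytes; in the example above they would come from embeddedFile.Data.
byte[] data = { 0x25, 0x50, 0x44, 0x46 };

string target = GetSafeOutputPath(outputDirectory, "../invoice-data.xml");
File.WriteAllBytes(target, data);

Console.WriteLine($"Saved {data.Length} bytes to {target}");
```

Writing into a dedicated folder also prevents attachments from overwriting files in the working directory of the application.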

Conclusion

TX Text Control provides a complete workflow for processing PDF documents including creation, form data extraction, and content searching. This article showed how to extract text, form field data, meta data, and attachments from PDF documents using TX Text Control.