# Template-Based Text Extraction from PDF Documents in .NET C#

> Template-based text extraction using TX Text Control provides an efficient way to retrieve structured data from PDF documents. This article shows how to extract text from PDF documents using templates.

- **Author:** Bjoern Meyer
- **Published:** 2025-02-12
- **Modified:** 2025-11-16
- **Description:** Template-based text extraction using TX Text Control provides an efficient way to retrieve structured data from PDF documents. This article shows how to extract text from PDF documents using templates.
- **6 min read** (1031 words)
- **Tags:**
  - ASP.NET
  - ASP.NET Core
  - PDF
- **Web URL:** https://www.textcontrol.com/blog/2025/02/12/template-based-text-extraction-from-pdf-documents-in-net-c-sharp/
- **LLMs URL:** https://www.textcontrol.com/blog/2025/02/12/template-based-text-extraction-from-pdf-documents-in-net-c-sharp/llms.txt
- **LLMs-Full URL:** https://www.textcontrol.com/blog/2025/02/12/template-based-text-extraction-from-pdf-documents-in-net-c-sharp/llms-full.txt

---

When working with PDF documents, extracting specific information such as a company name, social security number, or invoice number can be a challenge. In a perfect world, this data would exist in well-structured form fields that can be easily accessed programmatically, for example with TX Text Control. However, many PDF files are flattened, which means that the form fields have been removed, leaving only the raw text.

TX Text Control provides a powerful solution for text extraction based on predefined template areas, allowing you to extract structured information even from flattened PDFs.

Template-based text extraction involves defining a rectangle (bounding box) within which specific text is expected to appear in a PDF document. Once this area is defined, TX Text Control can extract lines of text within the defined rectangle, ensuring accurate data retrieval.

1. **Defining a template**: Identify a known text string in a sample document and define a bounding box around it.
2. **Applying the template**: Use the same bounding box on other, similar documents to extract the relevant text.
3. **Extracting text**: TX Text Control lets you search for text within the defined rectangle and retrieve meaningful data from it.
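
As a rough illustration of the idea (not of how TX Text Control implements it internally), the lookup can be thought of as a geometric filter over text lines that each carry a page number and a bounding box. The `TextLine` record, the `TemplateMatcher` class, and the sample lines below are hypothetical and exist only for this sketch; only the first rectangle is taken from the example later in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

// Invented sample data: three text lines with their bounding boxes.
// In a real application these would come from the PDF import, not from literals.
var lines = new List<TextLine>
{
    new("Text Control, LLC", 1, new Rectangle(1192, 2566, 1510, 180)),
    new("Request for Taxpayer Identification", 1, new Rectangle(3000, 600, 5200, 300)),
    new("Sample City, ST", 1, new Rectangle(1192, 4100, 1300, 180)),
};

// The template: the area in which the company name is expected.
var template = new Rectangle(1192, 2566, 1510, 180);

// Only the first line overlaps the template area.
foreach (var line in TemplateMatcher.FindInArea(lines, template, page: 1))
{
    Console.WriteLine(line.Text); // prints: Text Control, LLC
}

// Hypothetical type: a line of text with its page number and bounding box.
public record TextLine(string Text, int Page, Rectangle Bounds);

public static class TemplateMatcher
{
    // Returns all lines on the given page whose bounding box overlaps the area.
    public static IEnumerable<TextLine> FindInArea(
        IEnumerable<TextLine> lines, Rectangle area, int page) =>
        lines.Where(l => l.Page == page && l.Bounds.IntersectsWith(area));
}
```

The sketch only needs `System.Drawing.Rectangle`, so the template itself is nothing more than four integers that can be reused for every document with the same layout.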

### Implementing Template-Based Text Extraction

As a first step, we want to identify the text location of a known document with known data. Using a sample PDF, find a known piece of text (such as a company name or invoice number) that appears consistently in a particular location.

Let us take a look at a very typical US tax form, Form W-9.

![PDF form](https://s1-www.textcontrol.com/assets/dist/blog/2025/02/12/a/assets/form.webp "PDF form")

This document still contains its form fields. With access to it, extracting the data with TX Text Control would be as simple as iterating through the form fields.

> **Learn More**
> 
> Learn how to extract text from PDF documents using the TX Text Control PDF import feature in C#. This article shows how to extract text, attachments, form field values and metadata from PDF documents.
> 
> [Extract Data from PDF Documents with C#](https://www.textcontrol.com/blog/2024/06/10/extract-data-from-pdf-documents-with-csharp/llms-full.txt)

In this scenario, however, we don't have access to the source document. We only have a flattened version in which all form fields have been removed and only the text is visible.

![Flattened PDF form](https://s1-www.textcontrol.com/assets/dist/blog/2025/02/12/a/assets/flattened.webp "Flattened PDF form")

Now we want to define the rectangle to search for the company name.

![Rectangle](https://s1-www.textcontrol.com/assets/dist/blog/2025/02/12/a/assets/location.webp "Rectangle")

### Creating the Application

To demonstrate how easy this is with the TX Text Control library, we will use a .NET console application.

Make sure that you have installed the latest version of Visual Studio 2022, which includes the [.NET 8 SDK](https://dotnet.microsoft.com/download/dotnet/8.0).

> ### Prerequisites
> 
> The following tutorial requires a trial version of TX Text Control .NET Server.
> 
> - [Download Trial Version](https://www.textcontrol.com/product/tx-text-control-dotnet-server/download/)

1. In Visual Studio 2022, create a new project by choosing *Create a new project*.
2. Select *Console App* as the project template and confirm with *Next*.
3. Choose a name for your project and confirm with *Next*.
4. In the next dialog, choose *.NET 8 (Long-term support)* as the *Framework* and confirm with *Create*.

### Adding the NuGet Package

5. In the *Solution Explorer*, select your created project and choose *Manage NuGet Packages...* from the *Project* main menu.
    
    Select *Text Control Offline Packages* from the *Package source* drop-down.
    
    *Install* the latest version of the following package:
    
    
    - *TXTextControl.TextControl.ASP.SDK*
    
    ![ASP.NET Core Web Application](https://s1-www.textcontrol.com/assets/dist/blog/2025/02/12/a/assets/visualstudio1.webp "ASP.NET Core Web Application")

### Training Data

The following code uses TX Text Control to find the known value of our training data "Text Control, LLC" and returns the location that will later be used for all other documents.

```
using TXTextControl.DocumentServer.PDF.Contents;

try
{
    string pdfFilePath = "FormW9.pdf";

    // Check if the file exists before processing
    if (!File.Exists(pdfFilePath))
    {
        Console.WriteLine($"Error: File '{pdfFilePath}' not found.");
        return;
    }

    // Load PDF lines
    var pdfLines = new Lines(pdfFilePath);

    // Find the target text
    var trainLines = pdfLines.Find("Text Control, LLC");

    // Check if any lines were found before accessing the index
    if (trainLines.Count > 0)
    {
        Console.WriteLine(trainLines[0].Rectangle.ToString());
    }
    else
    {
        Console.WriteLine("Text not found in the PDF.");
    }
}
catch (Exception ex)
{
    Console.WriteLine($"An error occurred: {ex.Message}");
}
```

The console output shows the location of the found text.

```
{X=1192,Y=2566,Width=1510,Height=180}
```
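
Because this rectangle is reused for every other document, it can be convenient to store it instead of hard-coding it. The following is a minimal sketch of one way to do that with `System.Text.Json`; the `TemplateField` record, the field name `CompanyName`, and the file name `w9-template.json` are assumptions made for illustration and are not part of TX Text Control.

```csharp
using System;
using System.Drawing;
using System.IO;
using System.Text.Json;

// Rectangle reported by the training step above.
var found = new Rectangle(1192, 2566, 1510, 180);

// Store it under a field name (field and file names are illustrative).
var field = TemplateField.FromRectangle("CompanyName", page: 1, found);
string path = Path.Combine(Path.GetTempPath(), "w9-template.json");
File.WriteAllText(path, JsonSerializer.Serialize(field));

// Later, or in another process: load the template and rebuild the rectangle.
var loaded = JsonSerializer.Deserialize<TemplateField>(File.ReadAllText(path))
    ?? throw new InvalidDataException("Template file is empty.");
Console.WriteLine($"{loaded.Name}: {loaded.ToRectangle()}");

// Hypothetical type: a named template area as plain integers, so it serializes cleanly.
public record TemplateField(string Name, int Page, int X, int Y, int Width, int Height)
{
    public static TemplateField FromRectangle(string name, int page, Rectangle r) =>
        new(name, page, r.X, r.Y, r.Width, r.Height);

    public Rectangle ToRectangle() => new(X, Y, Width, Height);
}
```

Storing plain integers rather than the `Rectangle` struct itself keeps the JSON limited to the values that define the template.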

### Extracting Text

The next snippet loads the second document and searches for text in the given rectangle, which we retrieved from our training data.

```
using System.Drawing;
using TXTextControl.DocumentServer.PDF.Contents;

try
{
    string pdfFilePath = "FormW9_2.pdf";

    // Check if the file exists before processing
    if (!File.Exists(pdfFilePath))
    {
        Console.WriteLine($"Error: File '{pdfFilePath}' not found.");
        return;
    }

    // Load PDF lines
    var pdfLines = new Lines(pdfFilePath);

    // Define the search area
    var searchRectangle = new Rectangle(1192, 2566, 1510, 180);

    // Find text within the defined rectangle (include partial matches)
    var contentLines = pdfLines.Find(searchRectangle, true);

    // Filter only page 1 content lines
    var page1ContentLines = contentLines.Where(cl => cl.Page == 1).ToList();

    // Check if any content was found
    if (page1ContentLines.Count > 0)
    {
        Console.WriteLine(page1ContentLines[0].Text);
    }
    else
    {
        Console.WriteLine("No content found in the specified rectangle on page 1.");
    }
}
catch (Exception ex)
{
    Console.WriteLine($"An error occurred: {ex.Message}");
}
```

The console output shows the text extracted from the second document.

```
Document Processing Enterprises Ltd.
```

Because *true* is passed as the second parameter of the `Find` method, lines that only partially overlap the rectangle are included. The search therefore returns the entire line, even if the company name is longer than the one in the training document.

Even if the company name extends to the end of the line, the correct value is found.

![Long company name](https://s1-www.textcontrol.com/assets/dist/blog/2025/02/12/a/assets/long.webp "Long company name")

```
This is a very long company name - This is a very long company name - This is a very long company name
```
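
The exact matching rules of `Find` are not spelled out here. One plausible reading, assumed for this sketch only, is that a partial match corresponds to a line whose bounding box overlaps the template area without being fully contained in it. The `Overlap` helper and the width of the long line are hypothetical; the checks use the standard `Rectangle.Contains` and `Rectangle.IntersectsWith` methods.

```csharp
using System;
using System.Drawing;

// Template area taken from the training document.
var template = new Rectangle(1192, 2566, 1510, 180);

// Hypothetical bounding box of a much longer line that starts at the same position.
var longLine = new Rectangle(1192, 2566, 9000, 180);

Console.WriteLine(Overlap.IsFullyInside(template, longLine));  // False
Console.WriteLine(Overlap.IsPartialMatch(template, longLine)); // True

public static class Overlap
{
    // True only if the line's box lies completely within the template area.
    public static bool IsFullyInside(Rectangle area, Rectangle line) => area.Contains(line);

    // True if the line's box shares any area with the template area.
    public static bool IsPartialMatch(Rectangle area, Rectangle line) => area.IntersectsWith(line);
}
```

Under this assumption, a strict containment check would miss the long company name, while the overlap check still selects the line.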

### Conclusion

Template-based text extraction is a powerful feature for extracting structured information from PDF documents. By defining a rectangle around known text, TX Text Control can extract text from similar documents, even if the text is not in form fields.

---

## About Bjoern Meyer

As CEO, Bjoern is the visionary behind our strategic direction and business development, bridging the gap between our customers and engineering teams. His deep passion for coding and web technologies drives the creation of innovative products. If you're at a tech conference, be sure to stop by our booth - you'll most likely meet Bjoern in person. With an advanced graduate degree (Dipl. Inf.) in Computer Science, specializing in AI, from the University of Bremen, Bjoern brings significant expertise to his role. In his spare time, Bjoern enjoys running, paragliding, mountain biking, and playing the piano.

- [LinkedIn](https://www.linkedin.com/in/bjoernmeyer/)
- [X](https://x.com/txbjoern)
- [GitHub](https://github.com/bjoerntx)

