Converting PDF Documents to Plain Text TXT in C#

The extraction of plain text from PDF documents for further processing, analysis or full text search is a common task. This article explains how to extract plain text from PDF documents programmatically with .NET C# and TX Text Control.

Extracting plain text from PDF documents for further processing, analysis, or search is a common task. A PDF document typically stores text as a collection of characters placed at specific positions on the page, so an import filter is required to reconstruct the text to be extracted.

With TX Text Control, all typical document formats such as DOC, DOCX, RTF and PDF can be loaded for plain text extraction. The following steps show how to create a simple console application that loads a PDF document and then extracts the plain text from it.

Preparing the Application

A .NET 6 console application is created for the purposes of this demo.

Prerequisites

The following tutorial requires a trial version of TX Text Control .NET Server.

  1. In Visual Studio, create a new Console App using .NET 6.

  2. In the Solution Explorer, select your created project and choose Manage NuGet Packages... from the Project main menu.

    Select Text Control Offline Packages from the Package source drop-down.

    Install the latest version of the following package:

    • TXTextControl.TextControl.ASP.SDK

Adding a PDF

  1. Create a folder named App_Data in the root of your project. Copy your PDF documents into this folder. In this example, the name of the PDF document that will be loaded is sample.pdf.
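A relative path such as App_Data/sample.pdf is resolved against the working directory of the running process, which is not necessarily the project root (for example, when the application is started from the build output folder). As a minimal sketch, the following check prints where the file is expected before any document is loaded. The helper name FindFile is hypothetical and not part of TX Text Control.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of TX Text Control): returns the absolute
// path of the file, or null if no file exists at that location.
static string? FindFile(string relativePath)
{
    // relative paths are resolved against the current working directory
    string fullPath = Path.GetFullPath(relativePath);
    return File.Exists(fullPath) ? fullPath : null;
}

string? pdfPath = FindFile(Path.Combine("App_Data", "sample.pdf"));

Console.WriteLine(pdfPath is null
    ? $"sample.pdf was not found below {Directory.GetCurrentDirectory()}"
    : $"Found {pdfPath}");
```

If the file is reported as missing, either copy the App_Data folder to the directory shown in the message or start the application from the project root.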

Adding the Code

  1. Open the Program.cs file and add the following code:

    using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
    {
        tx.Create();
    
        TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings()
        {
            PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
        };
    
        // load PDF document
        tx.Load("App_Data/sample.pdf", TXTextControl.StreamType.AdobePDF, ls);
        
        // retrieve plain text
        var text = tx.Text;
    
        Console.WriteLine(text);
    }

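Alternatively, if the document is stored in a database or retrieved by some other means, you can load it from a byte array. The following lines replace the tx.Load call inside the using block shown above:

// read the PDF into a byte array (here from disk; a database works the same way)
byte[] document = File.ReadAllBytes("App_Data/sample.pdf");

// load the PDF from memory with the same load settings
tx.Load(document, TXTextControl.BinaryStreamType.AdobePDF, ls);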

With the GenerateParagraphs import setting, TX Text Control groups related words into paragraphs. The resulting plain text is returned by the Text property and written to the console.
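The example above only prints the extracted text. To store it as a TXT file, the string can be written to disk with the standard .NET file API. The following is a minimal sketch under stated assumptions: SaveAsTxt is a hypothetical helper, not part of TX Text Control, and the placeholder string stands in for the value of tx.Text.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of TX Text Control): writes the text to a
// file with the same name as the PDF, but with a .txt extension.
static string SaveAsTxt(string content, string pdfPath)
{
    string txtPath = Path.GetFullPath(Path.ChangeExtension(pdfPath, ".txt"));

    // make sure the target folder exists before writing
    string? folder = Path.GetDirectoryName(txtPath);
    if (!string.IsNullOrEmpty(folder))
    {
        Directory.CreateDirectory(folder);
    }

    // File.WriteAllText writes UTF-8 without a byte order mark by default
    File.WriteAllText(txtPath, content);
    return txtPath;
}

// placeholder; in the console application this would be tx.Text
string extractedText = "Plain text extracted from sample.pdf";

string txtFile = SaveAsTxt(extractedText, Path.Combine("App_Data", "sample.pdf"));
Console.WriteLine($"Plain text written to {txtFile}");
```

In the console application, the call to SaveAsTxt would go inside the using block, directly after the text has been retrieved.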
