Converting PDF Documents to Plain Text TXT in C#

Extracting plain text from PDF documents for further processing, analysis or searching is a common task. A PDF document typically stores characters at specific positions on the page rather than as flowing text, so an import filter is required to reassemble those characters into text that can be extracted.

With TX Text Control, all typical word processing formats such as DOC, DOCX, RTF and PDF can be loaded for plain text extraction. The following code shows how to create a simple console application that loads a PDF document and then extracts the plain text from it.

Preparing the Application

A .NET 6 console application is created for the purposes of this demo.

Prerequisites

The following tutorial requires a trial version of TX Text Control .NET Server for ASP.NET.


In Visual Studio, create a new Console App that targets .NET 6.
In the Solution Explorer, select the project you created and choose Manage NuGet Packages... from the Project main menu.

Select Text Control Offline Packages from the Package source drop-down.

Install the latest version of the following package:
- TXTextControl.TextControl.ASP.SDK

Adding a PDF

Create a folder named App_Data in the root of your project. Copy your PDF documents into this folder. In this example, the name of the PDF document that will be loaded is sample.pdf.
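
A relative path such as App_Data/sample.pdf is resolved against the working directory of the process, which may be the build output folder rather than the project root when the application is started from Visual Studio. The following is a minimal sketch, not part of TX Text Control, that resolves the file against a known base directory and fails early if the file is missing. The class name PdfLocator is hypothetical.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of TX Text Control.
public static class PdfLocator
{
    // Combines the base directory with a relative path and
    // throws if no file exists at the resulting location.
    public static string Resolve(string baseDirectory, string relativePath)
    {
        string fullPath = Path.GetFullPath(Path.Combine(baseDirectory, relativePath));

        if (!File.Exists(fullPath))
            throw new FileNotFoundException("PDF document not found.", fullPath);

        return fullPath;
    }
}
```

A call such as PdfLocator.Resolve(AppContext.BaseDirectory, Path.Combine("App_Data", "sample.pdf")) assumes that the PDF is copied to the output directory during the build, for example by setting Copy to Output Directory on the file.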

Adding the Code

Open the Program.cs file and add the following code:

	using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
	{
		tx.Create();

		TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings()
		{
			PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
		};

		// load PDF document
		tx.Load("App_Data/sample.pdf", TXTextControl.StreamType.AdobePDF, ls);

		// retrieve plain text
		var text = tx.Text;

		Console.WriteLine(text);
	}
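
Because the extracted text is an ordinary .NET string, it can also be written to a .txt file with the standard System.IO API instead of the console. The following is a minimal sketch under that assumption. The class name TextFileWriter and the choice of UTF-8 without a byte order mark are illustrative and not prescribed by TX Text Control.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of TX Text Control.
public static class TextFileWriter
{
    // Writes the text to outputPath as UTF-8 (no BOM),
    // creating the target directory if it does not exist yet.
    public static void Save(string text, string outputPath)
    {
        var directory = Path.GetDirectoryName(Path.GetFullPath(outputPath));

        if (!string.IsNullOrEmpty(directory))
            Directory.CreateDirectory(directory);

        File.WriteAllText(outputPath, text, new UTF8Encoding(false));
    }
}
```

Inside the using block, Console.WriteLine(text) could then be replaced by a call such as TextFileWriter.Save(text, "App_Data/sample.txt"), where the output file name is an example.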


Alternatively, if the document is stored in a database or in some other location, you can load it from a byte array. Inside the using block, replace the Load call with the following lines:

	byte[] document = File.ReadAllBytes("App_Data/sample.pdf");
	tx.Load(document, TXTextControl.BinaryStreamType.AdobePDF, ls);
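
If the document arrives as a Stream, for example from a database reader or a file upload, one way to feed it to the byte array overload shown above is to buffer it first. The following is a minimal sketch using only System.IO. The class name StreamBuffer is hypothetical.

```csharp
using System.IO;

// Hypothetical helper for illustration; not part of TX Text Control.
public static class StreamBuffer
{
    // Copies the remaining content of the source stream
    // into memory and returns it as a byte array.
    public static byte[] ToByteArray(Stream source)
    {
        using (var buffer = new MemoryStream())
        {
            source.CopyTo(buffer);
            return buffer.ToArray();
        }
    }
}
```

The resulting array can be passed to tx.Load in place of the document variable. For very large documents, buffering the whole stream in memory may not be desirable.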


With the GenerateParagraphs setting, TX Text Control groups matching words into paragraphs during import. The resulting plain text is then written to the console.
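
The returned string can then be processed or searched like any other text. The following is a minimal sketch that assumes paragraphs in the extracted text are separated by line breaks, which is worth verifying against your own documents. The class name TextSearch and its methods are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of TX Text Control.
public static class TextSearch
{
    // Splits the text at line breaks and drops empty entries.
    public static string[] SplitParagraphs(string text)
    {
        return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Returns all paragraphs that contain the keyword, ignoring case.
    public static List<string> FindParagraphs(string text, string keyword)
    {
        var matches = new List<string>();

        foreach (string paragraph in SplitParagraphs(text))
        {
            if (paragraph.Contains(keyword, StringComparison.OrdinalIgnoreCase))
                matches.Add(paragraph);
        }

        return matches;
    }
}
```

For example, TextSearch.FindParagraphs(text, "invoice") would return every paragraph of the extracted text that mentions the word, regardless of capitalization.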
