Extracting plain text from PDF documents for further processing, analysis, or search is a common task. A PDF document typically contains a collection of characters placed at specific positions, so an import filter is required to reconstruct the text to be extracted.
With TX Text Control, all typical word processing formats such as DOC, DOCX, RTF and PDF can be loaded for plain text extraction. The following code shows how to create a simple console application that loads a PDF document and then extracts the plain text from it.
Preparing the Application
A .NET 6 console application is created for the purposes of this demo.
Prerequisites
The following tutorial requires a trial version of TX Text Control .NET Server for ASP.NET.
- In Visual Studio, create a new Console App using .NET 6.
- In the Solution Explorer, select your created project and choose Manage NuGet Packages... from the Project main menu.
Select Text Control Offline Packages from the Package source drop-down.
Install the latest version of the following package:
- TXTextControl.TextControl.ASP.SDK
Adding a PDF
- Create a folder named App_Data in the root of your project. Copy your PDF documents into this folder. In this example, the name of the PDF document that will be loaded is sample.pdf.
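Depending on how the application is started, the working directory may not be the project root, so a relative path such as App_Data/sample.pdf might not resolve. The following is a minimal sketch, using only the .NET base class library, that looks for the file relative to both the working directory and the executable. FindDocument is a hypothetical helper introduced here for illustration; it is not part of TX Text Control.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of TX Text Control): returns the full path of
// the first existing candidate, or null if the file cannot be found.
static string? FindDocument(string relativePath)
{
    string[] candidates =
    {
        // relative to the current working directory
        Path.GetFullPath(relativePath),
        // relative to the folder that contains the executable
        Path.Combine(AppContext.BaseDirectory, relativePath)
    };

    foreach (string candidate in candidates)
    {
        if (File.Exists(candidate))
        {
            return candidate;
        }
    }

    return null;
}

string? pdfPath = FindDocument(Path.Combine("App_Data", "sample.pdf"));
Console.WriteLine(pdfPath ?? "sample.pdf was not found in App_Data.");
```

If the file is reported as missing, one option is to have the build copy the App_Data folder to the output directory; the resolved path can then be passed to the load call in place of the relative one.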
Adding the Code
- Open the Program.cs file and add the following code:
using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
    tx.Create();

    TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings()
    {
        PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
    };

    // load PDF document
    tx.Load("App_Data/sample.pdf", TXTextControl.StreamType.AdobePDF, ls);

    // retrieve plain text
    var text = tx.Text;
    Console.WriteLine(text);
}
Alternatively, if the document comes from a database or another source rather than a file, you can load it from a byte array. Inside the using block shown above, replace the file-based Load call with the following:
byte[] document = File.ReadAllBytes("App_Data/sample.pdf");
tx.Load(document, TXTextControl.BinaryStreamType.AdobePDF, ls);
TX Text Control recognizes words that belong together and combines them into paragraphs. The resulting plain text is written to the console.
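Once the plain text is available as a string, it can be searched with standard .NET string operations. The following is a minimal sketch that lists the lines containing a keyword. FindLines is a hypothetical helper, and the sample string stands in for the text returned by tx.Text; the line break style of the real output is an assumption, so both "\r\n" and "\n" are handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of TX Text Control): returns the 1-based line
// number and content of every line that contains the keyword, ignoring case.
static List<(int LineNumber, string Line)> FindLines(string text, string keyword)
{
    var hits = new List<(int LineNumber, string Line)>();

    // Assumption: lines are separated by "\r\n" or "\n".
    string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

    for (int i = 0; i < lines.Length; i++)
    {
        if (lines[i].Contains(keyword, StringComparison.OrdinalIgnoreCase))
        {
            hits.Add((i + 1, lines[i]));
        }
    }

    return hits;
}

// Sample text standing in for the extracted plain text.
string text = "Invoice 2024\r\nTotal amount: 42 EUR\r\nThank you";

foreach (var (lineNumber, line) in FindLines(text, "total"))
{
    Console.WriteLine($"{lineNumber}: {line}");  // prints "2: Total amount: 42 EUR"
}
```

In the console application, the sample string would be replaced with the text variable retrieved from the loaded document.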