Searching Strings in PDF Documents

Bjoern Meyer

February 5, 2014

TX Text Control can import born-digital PDF documents for viewing, editing, and format conversion. By combining ServerTextControl with LINQ and regular expressions, developers can programmatically search for string patterns within PDFs and retrieve all matching index positions.

TX Text Control can load and modify more than MS Word formats such as DOC, RTF and DOCX. It can also import "born digital" PDF documents, so that you can view, edit or convert these files.

For an overview of these features and possibilities, see:

PDF Reflow - Load, view, edit and convert Adobe PDF files

The combination of ServerTextControl, LINQ and regular expressions provides a powerful tool to search strings in PDF documents. The method FindInPDF listed below accepts the file path of a PDF document and a value to search for. ServerTextControl opens the PDF in order to provide the plain text to a regular expression. Because the value is passed to Regex.Matches as the pattern, regular expression metacharacters such as "." or "(" are interpreted as such and are not matched literally.

LINQ then projects the MatchCollection returned by the Matches method onto the index of each individual match. The indexes are returned as an IEnumerable<int> object.

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

private IEnumerable<int> FindInPDF(string path, string value)
{
    string sSourceString = "";

    // create a temporary ServerTextControl that imports the PDF file
    using (TXTextControl.ServerTextControl tx =
        new TXTextControl.ServerTextControl())
    {
        TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings();
        ls.PDFImportSettings = TXTextControl.PDFImportSettings.GenerateLines;

        tx.Create();
        tx.Load(path, TXTextControl.StreamType.AdobePDF, ls);

        // replace each two-character line break ("\r\n") with a single
        // character ("\n") so that the number of control characters matches
        sSourceString = tx.Text.Replace("\r\n", "\n");
    }

    // use Regex and LINQ to match the pattern and to return the index
    // of each match
    return
        Regex.Matches(sSourceString, value).Cast<Match>().Select(m => m.Index);
}
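
The line-break replacement in FindInPDF affects the returned positions. As a minimal sketch, assuming that the text positions you work with afterwards count every line break as a single character (which is what the comment in the method suggests), the following plain .NET snippet makes the shift visible. The class LineBreakDemo and the method IndexesOf are hypothetical names used only for this illustration; no TX Text Control API is involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of TX Text Control.
public static class LineBreakDemo
{
    // Returns the start index of every regex match in the source string.
    public static List<int> IndexesOf(string source, string pattern)
    {
        return Regex.Matches(source, pattern)
            .Cast<Match>()
            .Select(m => m.Index)
            .ToList();
    }

    // Collapses each "\r\n" pair into a single "\n" before searching,
    // mirroring the replacement step in FindInPDF.
    public static List<int> IndexesOfNormalized(string source, string pattern)
    {
        return IndexesOf(source.Replace("\r\n", "\n"), pattern);
    }
}
```

For the input "first text\r\nsecond text" and the pattern "text", IndexesOf returns 6 and 19, while IndexesOfNormalized returns 6 and 18. Every line break that precedes a match moves the unnormalized index one position further away.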

The following code returns all index positions of the string "text" in a PDF document:

IEnumerable<int> index = FindInPDF("test.pdf", "text");
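
Because FindInPDF hands the search value to Regex.Matches unchanged, a value such as "1.50" is treated as a pattern in which the dot matches any character. If a literal search is what you need, one possible approach is to escape the value first with Regex.Escape. The sketch below works on a plain string so that it runs without TX Text Control; the class LiteralSearchDemo and the method FindLiteral are hypothetical names and not part of the original sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of TX Text Control.
public static class LiteralSearchDemo
{
    // Treats the value as a regular expression pattern, as FindInPDF does.
    public static List<int> FindPattern(string source, string value)
    {
        return Regex.Matches(source, value)
            .Cast<Match>()
            .Select(m => m.Index)
            .ToList();
    }

    // Escapes regex metacharacters so that the value is matched literally.
    public static List<int> FindLiteral(string source, string value)
    {
        return FindPattern(source, Regex.Escape(value));
    }
}
```

For the string "Price: 1.50 or 1x50", FindPattern with the value "1.50" reports the positions 7 and 15, because the dot also matches the "x". FindLiteral reports only position 7.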
