
Searching Strings in PDF Documents


TX Text Control not only loads and modifies MS Word documents such as DOC, RTF and DOCX files. It can also import "born digital" PDF documents, that is, PDFs created electronically rather than scanned, so that you can view, edit or convert them.

For an overview of these features, see:

PDF Reflow - Load, view, edit and convert Adobe PDF files

Combining ServerTextControl, LINQ and regular expressions makes it possible to search for strings in PDF documents. The method FindInPDF listed below accepts the file path of a PDF document and a value to search for. ServerTextControl opens the PDF and provides its plain text to a regular expression.

The MatchCollection returned by the Matches method is then projected with LINQ to the index of each match, and the indexes are returned as an IEnumerable<int>.

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

private IEnumerable<int> FindInPDF(string path, string value)
{
    string sSourceString = "";

    // create a temporary ServerTextControl that imports the PDF file
    using (TXTextControl.ServerTextControl tx =
        new TXTextControl.ServerTextControl())
    {
        TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings();
        ls.PDFImportSettings = TXTextControl.PDFImportSettings.GenerateLines;

        tx.Create();
        tx.Load(path, TXTextControl.StreamType.AdobePDF, ls);

        // collapse each "\r\n" pair to a single "\n" so that the
        // string matches the number of control characters
        sSourceString = tx.Text.Replace("\r\n", "\n");
    }

    // use Regex and LINQ to find all matches and to return the
    // start index of each match
    return
        Regex.Matches(sSourceString, value).Cast<Match>().Select(m => m.Index);
}
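
The line-ending replacement in FindInPDF affects the positions that are returned. The following minimal sketch uses no TX Text Control API and a made-up sample string; it only illustrates how the index of a match shifts by one for every "\r\n" pair that precedes it. It assumes, as the comment in the method suggests, that a line break should count as a single character. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineEndingDemo
{
    // Returns the start index of every match of 'pattern' in 'text'.
    public static List<int> FindIndexes(string text, string pattern)
    {
        return Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(m => m.Index)
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample: two lines separated by a Windows line break.
        string raw = "first line\r\nsome text";
        string normalized = raw.Replace("\r\n", "\n");

        // One "\r\n" precedes the match, so the raw index is one higher.
        Console.WriteLine(FindIndexes(raw, "text")[0]);        // 17
        Console.WriteLine(FindIndexes(normalized, "text")[0]); // 16
    }
}
```

If the indexes are later used to select text in a control that counts each line break as one character, the normalized positions are the ones to use.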

The following code returns all index positions of the string "text" in a PDF document:

IEnumerable<int> index = FindInPDF("test.pdf", "text");
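
Because FindInPDF passes value straight to Regex.Matches, the value is interpreted as a regular expression rather than as literal text. For a plain string search whose value may contain characters such as "." or "(", one option is to escape the value first. The sketch below works on an in-memory string instead of a PDF; FindLiteral is a hypothetical helper name and not part of TX Text Control.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LiteralSearchDemo
{
    // Hypothetical helper: treats 'value' as literal text by escaping
    // regex metacharacters before matching.
    public static List<int> FindLiteral(string text, string value)
    {
        return Regex.Matches(text, Regex.Escape(value))
            .Cast<Match>()
            .Select(m => m.Index)
            .ToList();
    }

    public static void Main()
    {
        string text = "v1.0 and v1x0";

        // Unescaped, '.' matches any character, so both versions are found.
        int patternHits = Regex.Matches(text, "v1.0").Count;

        // Escaped, only the literal "v1.0" is found.
        List<int> literalHits = FindLiteral(text, "v1.0");

        Console.WriteLine(patternHits);       // 2
        Console.WriteLine(literalHits.Count); // 1
    }
}
```

Whether to escape depends on the caller: leaving the value unescaped keeps the full regular expression syntax available, while escaping gives predictable results for user-entered search terms.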
