Searching Strings in PDF Documents


TX Text Control can do more than load and modify MS Word formats such as DOC, RTF and DOCX. It can also import "born digital" PDF documents, so that you can view, edit or convert these files.
An overview of the features and the possibilities can be read here:
PDF Reflow - Load, view, edit and convert Adobe PDF files
The combination of ServerTextControl, LINQ and regular expressions provides a powerful tool for searching strings in PDF documents. The method FindInPDF listed below accepts the file path of a PDF document and a value to search for. ServerTextControl loads the PDF and provides its plain text, which is then passed to a regular expression.
The Matches method returns a MatchCollection. LINQ projects this collection to the index of each individual match, and the indexes are returned as an IEnumerable<int>.
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

private IEnumerable<int> FindInPDF(string path, string value)
{
    string sSourceString = "";

    // create a temporary ServerTextControl that imports the PDF file
    using (TXTextControl.ServerTextControl tx =
        new TXTextControl.ServerTextControl())
    {
        TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings();
        ls.PDFImportSettings = TXTextControl.PDFImportSettings.GenerateLines;

        tx.Create();
        tx.Load(path, TXTextControl.StreamType.AdobePDF, ls);

        // prepare the string to match the number of control characters:
        // each line break becomes a single character
        // (assumes the Text property returns "\r\n" line breaks)
        sSourceString = tx.Text.Replace("\r\n", "\n");
    }

    // use Regex and LINQ to match the strings and to return the index
    // of each match; value is interpreted as a regular expression pattern
    return
        Regex.Matches(sSourceString, value).Cast<Match>().Select(m => m.Index);
}
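The line-break replacement in FindInPDF matters because it shifts every index that follows a line break. The following is a minimal sketch of that effect on a plain string, without TX Text Control. It assumes the source text uses "\r\n" line breaks; the class LineBreakSketch and the helper FirstIndex are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineBreakSketch
{
    // Hypothetical helper: returns the index of the first match of
    // 'value' in 'text', or -1 if there is no match.
    public static int FirstIndex(string text, string value)
    {
        Match m = Regex.Match(text, value);
        return m.Success ? m.Index : -1;
    }

    public static void Main()
    {
        // Assumption: the plain text contains two-character line breaks.
        string withCrLf = "first line\r\nsecond text";

        // Collapse each line break to a single character.
        string normalized = withCrLf.Replace("\r\n", "\n");

        Console.WriteLine(FirstIndex(withCrLf, "text"));   // 19
        Console.WriteLine(FirstIndex(normalized, "text")); // 18
    }
}
```

Each line break before a match moves its index by one, so the offset grows with the number of lines in the document. Whether the normalized or the original index is the right one depends on how the consumer of the index counts line breaks.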
The following code returns all index positions of the string "text" in a PDF document:
IEnumerable<int> index = FindInPDF("test.pdf", "text");
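Because FindInPDF passes value directly to Regex.Matches, the value is treated as a regular expression pattern, so characters such as "." or "(" have a special meaning. If a literal search is intended, the value can be escaped first with Regex.Escape. The sketch below shows the difference on a plain string; the class SearchSketch, the helper FindIndexes and its literal parameter are hypothetical and not part of the method shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SearchSketch
{
    // Hypothetical helper: returns the index of every match of 'pattern'
    // in 'text'. With literal = true, the pattern is escaped so that
    // regex metacharacters such as '.' only match themselves.
    public static IEnumerable<int> FindIndexes(string text, string pattern, bool literal)
    {
        string effective = literal ? Regex.Escape(pattern) : pattern;
        return Regex.Matches(text, effective).Cast<Match>().Select(m => m.Index);
    }

    public static void Main()
    {
        string text = "v1.0 and v1x0";

        // As a pattern, "1.0" also matches "1x0" because '.' matches any character.
        Console.WriteLine(string.Join(",", FindIndexes(text, "1.0", false))); // 1,10

        // Escaped, "1.0" only matches the literal string "1.0".
        Console.WriteLine(string.Join(",", FindIndexes(text, "1.0", true)));  // 1
    }
}
```

For a plain word such as "text", both variants return the same indexes; escaping only matters when the search value contains regex metacharacters.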