Searching Strings in PDF Documents
TX Text Control can import born-digital PDF documents for viewing, editing, and format conversion. By combining ServerTextControl with LINQ and regular expressions, developers can programmatically search for string patterns within PDFs and retrieve all matching index positions.

TX Text Control can load and modify more than MS Word formats such as DOC, RTF, and DOCX. It can also import "born digital" PDF documents, so that you can view, edit, or convert these files.
For an overview of these features, see:
PDF Reflow - Load, view, edit and convert Adobe PDF files
The combination of ServerTextControl, LINQ, and regular expressions provides a powerful way to search for strings in PDF documents. The method FindInPDF listed below accepts the file path of a PDF document and a value to search for. ServerTextControl loads the PDF so that its plain text can be passed to a regular expression; the value is interpreted as a regular expression pattern.
LINQ then projects the MatchCollection returned by the Matches method to the index of each individual match, and these indexes are returned as an IEnumerable<int>.
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

private IEnumerable<int> FindInPDF(string path, string value)
{
    string sSourceString = "";

    // create a temporary ServerTextControl that imports the PDF file
    using (TXTextControl.ServerTextControl tx =
        new TXTextControl.ServerTextControl())
    {
        TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings();
        ls.PDFImportSettings = TXTextControl.PDFImportSettings.GenerateLines;

        tx.Create();
        tx.Load(path, TXTextControl.StreamType.AdobePDF, ls);

        // prepare the string to match the number of
        // control characters
        sSourceString = tx.Text.Replace("\r\n", "\n");
    }

    // use Regex and LINQ to match the strings and to return each match index
    return
        Regex.Matches(sSourceString, value).Cast<Match>().Select(m => m.Index);
}
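The line-break replacement in the listing matters because it changes every match index that follows a line break. The sketch below illustrates the effect on a plain string, without TX Text Control. The class name LineBreakDemo, the helper FindIndexes, and the sample text are made up for illustration. The stated reason, that positions in the control count a line break as a single character, is an assumption drawn from the comment in the listing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineBreakDemo
{
    // Hypothetical helper: the same Regex + LINQ projection as FindInPDF,
    // applied to a string that is already in memory.
    public static List<int> FindIndexes(string source, string pattern)
    {
        return Regex.Matches(source, pattern)
                    .Cast<Match>()
                    .Select(m => m.Index)
                    .ToList();
    }

    public static void Main()
    {
        // Made-up sample text with Windows-style line breaks.
        string raw = "one\r\ntwo\r\ntext";
        string normalized = raw.Replace("\r\n", "\n");

        // Every "\r\n" in front of a match shifts its index by one character.
        Console.WriteLine(string.Join(",", FindIndexes(raw, "text")));        // prints 10
        Console.WriteLine(string.Join(",", FindIndexes(normalized, "text"))); // prints 8
    }
}
```

If the indexes are later used as text positions inside the control, the normalized values are the ones to use under that assumption.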
The following code returns all index positions of the string "text" in a PDF document:
IEnumerable<int> index = FindInPDF("test.pdf", "text");
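Because Regex.Matches interprets its second argument as a regular expression pattern, a search value that contains characters such as "." or "(" is not matched literally. A minimal sketch of the difference, on a plain string rather than a PDF: Regex.Escape is the standard .NET method for escaping such characters, while the class LiteralSearchDemo, the helper FindIndexes, its literal parameter, and the sample text are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LiteralSearchDemo
{
    // Hypothetical helper: returns the start index of every match.
    // When literal is true, the value is escaped so that regex
    // metacharacters are treated as ordinary characters.
    public static List<int> FindIndexes(string source, string value, bool literal)
    {
        string pattern = literal ? Regex.Escape(value) : value;
        return Regex.Matches(source, pattern)
                    .Cast<Match>()
                    .Select(m => m.Index)
                    .ToList();
    }

    public static void Main()
    {
        // Made-up sample text.
        string source = "a.b a-b a.b";

        // As a pattern, "." matches any character, so "a-b" is found as well.
        Console.WriteLine(string.Join(",", FindIndexes(source, "a.b", false))); // prints 0,4,8

        // Escaped, only the literal "a.b" occurrences are found.
        Console.WriteLine(string.Join(",", FindIndexes(source, "a.b", true)));  // prints 0,8
    }
}
```

Applied to FindInPDF, the same idea would mean passing Regex.Escape(value) to Regex.Matches whenever callers expect a plain-text search instead of a pattern search.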
