Mining PDFs with Regex in C#: Practical Patterns, Tips, and Ideas

Mining PDFs with Regex in C# can be a powerful technique for extracting information from documents. This article explores practical patterns, tips, and ideas for effectively using regular expressions in C# to mine data from PDF files.

Regular expressions are a powerful tool for finding text in PDFs. TX Text Control exposes the text content of a PDF as lines with coordinates, which enables expressive searches and lets you locate each result precisely on the page. This makes it possible to perform audits, check clauses, scan for compliance, and build your own "find in document" features.

The Core Idea: Find Text and Get Positions

The Lines class locates ContentLine objects within a PDF document. Each ContentLine represents a line of text in the PDF, including its position on the page. The lines are stored in a collection that can be iterated over to find specific text.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using TXTextControl.DocumentServer.PDF.Contents;

class Program
{
    static void Main()
    {
        string matchWord = "shall";
        string pattern = $@"\b{Regex.Escape(matchWord)}\b";

        // Parse all text lines from the PDF
        Lines pdfLines = new Lines("NDA_Agreement.pdf");

        // Run a case insensitive regex search
        List<ContentLine> contentLines =
            pdfLines.Find(pattern, RegexOptions.IgnoreCase);

        foreach (ContentLine line in contentLines)
        {
            Console.WriteLine(
                "Found \"{0}\" on page {1} (X: {2}, Y: {3}): {4}",
                matchWord, line.Page, line.X, line.Y, line.Text);
        }
    }
}

The code above lists all text lines containing the word "shall." The following is one possible result of the above code:

Found "shall" on page 1 (X: 1560, Y: 11600): This Agreement shall not apply to information that was known to the Receiving Party before disclosure,
Found "shall" on page 1 (X: 1560, Y: 13392): Upon request or termination, the Receiving Party shall promptly return or destroy all materials
Found "shall" on page 2 (X: 1560, Y: 2304): This Agreement shall remain in effect for [two] years from the Effective Date. The obligation to protect
Found "shall" on page 2 (X: 1560, Y: 2544): Confidential Information shall survive termination for an additional [two] years.
Found "shall" on page 2 (X: 1560, Y: 8512): This Agreement shall be governed by the laws of the State of [Your State/Country]. Disputes shall be
Found "shall" on page 3 (X: 1560, Y: 5408): This Agreement may be executed in counterparts, including electronic versions, each of which shall be

Find returns ContentLine objects with the following properties:

  • Text: Text content of the line
  • Page: Page number
  • X, Y: Coordinates of the line on the page

This is sufficient for building lists, linking to viewer locations, and post-processing the match.
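
As one idea for post-processing, the regex can be run again on each returned line's Text to pull out the exact match and its character offset within the line. The following is a minimal sketch, not part of the TX Text Control API: LineHit is a hypothetical stand-in for ContentLine with the same four members so that the example runs without the library, and the sample line is copied from the output above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for ContentLine (assumption, not the library type).
record LineHit(int Page, int X, int Y, string Text);

static class MatchDetails
{
    // Returns one (offset, value) entry per regex match inside the line text.
    public static List<(int Offset, string Value)> Locate(LineHit line, string pattern)
    {
        var result = new List<(int Offset, string Value)>();
        foreach (Match m in Regex.Matches(line.Text, pattern, RegexOptions.IgnoreCase))
            result.Add((m.Index, m.Value));
        return result;
    }
}

class Program
{
    static void Main()
    {
        var line = new LineHit(2, 1560, 8512,
            "This Agreement shall be governed by the laws of the State of [Your State/Country]. Disputes shall be");

        // A single line can contain more than one match.
        foreach (var (offset, value) in MatchDetails.Locate(line, @"\bshall\b"))
            Console.WriteLine($"Page {line.Page}, Y {line.Y}: \"{value}\" at character {offset}");
    }
}
```

With real results, the same helper could be called with the Text of each ContentLine returned by Find, for example to highlight the matched word rather than the whole line.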

Legal Regex Patterns for PDF Text

These regex patterns can be used to identify specific legal text structures within PDF documents:

  • Variations of a requirement term: \b(shall|must|is required to)\b
  • Clause numbers and headings: \b(Section|Clause)\s+\d+(\.\d+)*\b
  • Monetary amounts: \$\s?\d{1,3}(,\d{3})*(\.\d{2})?
  • Dates in common legal formats: \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2},\s+\d{4}\b
  • Party definitions: \b(?:Compan(?:y|ies)|Disclos(?:er|ing|ure|es)?|Recipien(?:t|ts)|Provid(?:er|ing|es)|Custom(?:er|ers))\b\s*(?:\(the "[^"]+"\)|Party:\s*\[[^\]]+\])

When a pattern that contains double quotes is placed in a C# verbatim string (@"..."), each double quote must be written twice ("").
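
A quick way to check patterns like these before running them against a PDF is to try them on a sample string. The sketch below uses only System.Text.RegularExpressions. The LegalPatterns class and its member names are hypothetical, and the sample sentence, including its amount and date, is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LegalPatterns
{
    // Four of the patterns above, written as C# verbatim strings.
    public const string Requirement = @"\b(shall|must|is required to)\b";
    public const string Clause = @"\b(Section|Clause)\s+\d+(\.\d+)*\b";
    public const string Money = @"\$\s?\d{1,3}(,\d{3})*(\.\d{2})?";
    public const string Date =
        @"\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|" +
        @"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2},\s+\d{4}\b";

    // Returns the text of every match of the pattern, ignoring case.
    public static List<string> AllMatches(string text, string pattern)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
            values.Add(m.Value);
        return values;
    }
}

class Program
{
    static void Main()
    {
        // Invented sample sentence, not taken from the NDA document.
        string sample = "The Recipient shall pay $1,250.00 under Section 4.2 by March 3, 2025.";

        var checks = new (string Name, string Pattern)[]
        {
            ("Requirement", LegalPatterns.Requirement),
            ("Clause", LegalPatterns.Clause),
            ("Money", LegalPatterns.Money),
            ("Date", LegalPatterns.Date)
        };

        foreach (var (name, pattern) in checks)
        {
            string joined = string.Join(" | ", LegalPatterns.AllMatches(sample, pattern));
            Console.WriteLine($"{name}: {joined}");
        }
    }
}
```

Once a pattern behaves as expected on sample text, the same string can be passed to the Find method shown earlier.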

Getting More Context Than a Single Line

Sometimes, a match appears in the middle of a sentence that wraps to the next line. You can create your own "context window" around each match.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using TXTextControl.DocumentServer.PDF.Contents;

class Program
{
    static void Main()
    {
        Lines pdfLines = new Lines("NDA_Agreement.pdf");
        var all = pdfLines.ContentLines; 

        var hits = pdfLines.Find(@"\bshall\b", RegexOptions.IgnoreCase);

        // Build a quick index by page and Y
        var byPage = all
            .GroupBy(l => l.Page)
            .ToDictionary(g => g.Key, g => g.OrderBy(l => l.Y).ThenBy(l => l.X).ToList());

        foreach (var hit in hits)
        {
            var linesOnPage = byPage[hit.Page];
            var idx = linesOnPage.FindIndex(l => l.X == hit.X && l.Y == hit.Y);
            if (idx < 0) continue; // Skip hits that cannot be located in the index

            // Take previous and next line for context if available
            var context = new List<ContentLine>();
            if (idx > 0) context.Add(linesOnPage[idx - 1]);
            context.Add(linesOnPage[idx]);
            if (idx < linesOnPage.Count - 1) context.Add(linesOnPage[idx + 1]);

            Console.WriteLine($"--- Page {hit.Page} around Y={hit.Y}");
            foreach (var l in context)
                Console.WriteLine(l.Text);
        }
    }
}

Examining the lines before and after a match gives you a better understanding of its context and significance within the document. For the sample document, the code above prints:

--- Page 1 around Y=11600
4. Exclusions from Confidential Information
This Agreement shall not apply to information that was known to the Receiving Party before disclosure,
becomes public through no fault, is disclosed legally by a third party, or is independently developed
--- Page 1 around Y=13392
5. Return or Destruction
Upon request or termination, the Receiving Party shall promptly return or destroy all materials
containing Confidential Information and certify such destruction in writing if requested.
--- Page 2 around Y=2304
6. Term
This Agreement shall remain in effect for [two] years from the Effective Date. The obligation to protect
Confidential Information shall survive termination for an additional [two] years.
--- Page 2 around Y=2544
This Agreement shall remain in effect for [two] years from the Effective Date. The obligation to protect
Confidential Information shall survive termination for an additional [two] years.
7. No License
--- Page 2 around Y=8512
10. Governing Law and Jurisdiction
This Agreement shall be governed by the laws of the State of [Your State/Country]. Disputes shall be
resolved in the courts located in [City, State].
--- Page 3 around Y=5408
13. Counterparts
This Agreement may be executed in counterparts, including electronic versions, each of which shall be
deemed an original.
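
The context window also suggests a way to handle phrases that wrap across a line break, which a search over single lines may miss. One possible workaround, sketched below, joins each line with its successor on the same page and runs the regex on the combined text. This is an illustration rather than a library feature: TextLine is a hypothetical stand-in for ContentLine, the helper name is invented, and the Y value of the second sample line is assumed because it does not appear in the output above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for ContentLine (assumption, not the library type).
record TextLine(int Page, int X, int Y, string Text);

static class WrappedSearch
{
    // Returns the first line of every adjacent pair whose joined text matches
    // the pattern although neither line matches on its own.
    public static List<TextLine> FindAcrossLineBreaks(IEnumerable<TextLine> lines, string pattern)
    {
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        var hits = new List<TextLine>();

        foreach (var page in lines.GroupBy(l => l.Page))
        {
            var ordered = page.OrderBy(l => l.Y).ThenBy(l => l.X).ToList();
            for (int i = 0; i < ordered.Count - 1; i++)
            {
                string joined = ordered[i].Text + " " + ordered[i + 1].Text;
                if (regex.IsMatch(joined)
                    && !regex.IsMatch(ordered[i].Text)
                    && !regex.IsMatch(ordered[i + 1].Text))
                {
                    hits.Add(ordered[i]);
                }
            }
        }
        return hits;
    }
}

class Program
{
    static void Main()
    {
        var lines = new List<TextLine>
        {
            new TextLine(2, 1560, 8512,
                "This Agreement shall be governed by the laws of the State of [Your State/Country]. Disputes shall be"),
            // Y value assumed for illustration.
            new TextLine(2, 1560, 8752, "resolved in the courts located in [City, State].")
        };

        foreach (var hit in WrappedSearch.FindAcrossLineBreaks(lines, @"\bshall\s+be\s+resolved\b"))
            Console.WriteLine($"Wrapped match starts on page {hit.Page} at Y={hit.Y}");
    }
}
```

Joining only two lines keeps the sketch small. A phrase that spans three or more lines would need a wider window.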

Grouping Results per Page

To group results by page, you can create a data structure that organizes the matches based on their page numbers. This allows you to present the findings in a more structured way, making it easier to navigate through the document.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using TXTextControl.DocumentServer.PDF.Contents;

class Program
{
    static void Main()
    {
        Lines pdfLines = new Lines("NDA_Agreement.pdf");

        var pattern = @"\b(shall|must)\b";
        var hits = pdfLines.Find(pattern, RegexOptions.IgnoreCase);

        var perPage = hits
            .GroupBy(h => h.Page)
            .Select(g => new { Page = g.Key, Count = g.Count(), Samples = g.Take(3).Select(x => x.Text).ToList() })
            .OrderBy(x => x.Page);

        foreach (var p in perPage)
        {
            Console.WriteLine($"Page {p.Page}: {p.Count} hits");
            foreach (var s in p.Samples)
                Console.WriteLine($"   {s}");
        }

    }
}

The following shows the results grouped by page. Each hit is a matching line, so a line that contains a term twice is counted once:

Page 1: 2 hits
   This Agreement shall not apply to information that was known to the Receiving Party before disclosure,
   Upon request or termination, the Receiving Party shall promptly return or destroy all materials
Page 2: 3 hits
   This Agreement shall remain in effect for [two] years from the Effective Date. The obligation to protect
   Confidential Information shall survive termination for an additional [two] years.
   This Agreement shall be governed by the laws of the State of [Your State/Country]. Disputes shall be
Page 3: 2 hits
   This Agreement may only be amended by a signed written instrument. Waivers must be explicit and in
   This Agreement may be executed in counterparts, including electronic versions, each of which shall be

Conclusion

The PDF content extraction API lets you locate and analyze specific text patterns within legal documents. Regular expressions provide precise matching, the coordinates returned with each line tie every result to a location on the page, and grouping results by page keeps the findings organized and easy to navigate.
