Document Comparison using Image-Based Pixel Matching in .NET C#

Document comparison is a critical feature in many industries, allowing teams to efficiently identify differences between versions of documents. While text-based comparison methods are common, there are scenarios where an image-based, pixel-by-pixel approach offers unique advantages. This article provides examples and applications that demonstrate the practicality and speed of this method, and explores when and why it is useful.

Using the TX Text Control API, it would be possible to iterate through all paragraphs and characters of both documents and compare their positions and formatting. Although this is technically possible and TX Text Control is fast, this approach would be too slow for longer documents.

Image-Based Document Comparison

Image-based document comparison renders the pages of a document as images and compares them pixel by pixel. Rather than programmatically analyzing textual content, formatting, or positioning, this approach directly identifies visual differences. Traditional text-based comparison methods parse document structure, extract text, analyze formatting, and detect positional differences. This process can be computationally intensive, especially for complex documents with intricate layouts or heavy formatting. Image-based comparison skips these steps and compares rendered images directly, which can significantly reduce processing time.

Text-based methods can miss certain visual differences, such as slight font changes, alignment shifts, or color variations. Pixel-by-pixel comparison accurately captures these differences, making it ideal for visually critical applications.

Comparing Documents

For demonstration purposes, we will use our demo document that comes with the installation of TX Text Control. It is a six-page document that contains most of the features of TX Text Control.

In a first pass, we will take two exact copies of the document and compare them using the following code.

	using static DocumentComparer;

	string document1 = "demo1.tx";
	string document2 = "demo2.tx";

	// Get the comparison results
	List<PageComparisonResult> comparisonResults = DocumentComparer.CompareDocuments(document1, document2);

	// Generate and display the results
	foreach (var result in comparisonResults)
	{
		if (result.PageIndex == -1)
		{
			// Special case for differing page counts
			Console.WriteLine(result.Message);
		}
		else
		{
			string message = result.AreEqual
				? $"The document images of page {result.PageIndex + 1} are equal."
				: $"The document images of page {result.PageIndex + 1} are different.";
			Console.WriteLine(message);
		}
	}


Running this code produces the following output, which shows that the documents are identical:

The document images of page 1 are equal.
The document images of page 2 are equal.
The document images of page 3 are equal.
The document images of page 4 are equal.
The document images of page 5 are equal.
The document images of page 6 are equal.

Now let's change the font of the first paragraph on page 1 and reduce the size of the image on page 4.

When running the same code again, the result will be the following:

The document images of page 1 are different.
The document images of page 2 are equal.
The document images of page 3 are equal.
The document images of page 4 are different.
The document images of page 5 are equal.
The document images of page 6 are equal.
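
For documents with many pages, a compact list of the changed pages can be easier to read than one line per page. The following is a minimal sketch, assuming a result type with the PageIndex, AreEqual and Message members used in the code above. The PageComparisonResult definition shown here and the ChangedPages helper are illustrative stand-ins and not part of the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample data mirroring the output above: pages 1 and 4 differ.
var results = Enumerable.Range(0, 6)
    .Select(i => new PageComparisonResult { PageIndex = i, AreEqual = i != 0 && i != 3 })
    .ToList();

Console.WriteLine("Changed pages: " + string.Join(", ", ResultSummary.ChangedPages(results)));
// prints "Changed pages: 1, 4"

// Assumed shape of the result type (hypothetical stand-in for the sample's class).
public class PageComparisonResult
{
    public int PageIndex { get; set; }
    public bool AreEqual { get; set; }
    public string? Message { get; set; }
}

public static class ResultSummary
{
    // Returns the 1-based numbers of all pages whose images differ.
    // Entries with PageIndex == -1 (page count mismatch) are skipped.
    public static List<int> ChangedPages(IEnumerable<PageComparisonResult> results) =>
        results.Where(r => r.PageIndex >= 0 && !r.AreEqual)
               .Select(r => r.PageIndex + 1)
               .ToList();
}
```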

Implementation

The DocumentComparer class is a static utility for comparing two documents page by page. It provides insight into whether the documents are visually identical or contain differences. The CompareDocuments method is the entry point for comparing two documents. It uses a ServerTextControl instance, the TX Text Control component that provides high-level text processing features for server-based applications, to load both documents and converts each document into a list of bitmap objects.

	// Requires: using System.Collections.Generic; using System.Drawing; using TXTextControl;
	public static List<PageComparisonResult> CompareDocuments(string documentPath1, string documentPath2)
	{
		var comparisonResults = new List<PageComparisonResult>();

		using (var serverTextControl = new ServerTextControl())
		{
			serverTextControl.Create();

			// Load and render the first document
			serverTextControl.Load(documentPath1, StreamType.InternalUnicodeFormat);
			var bitmapsDocument1 = GetDocumentImages(serverTextControl);

			// Load and render the second document
			serverTextControl.Load(documentPath2, StreamType.InternalUnicodeFormat);
			var bitmapsDocument2 = GetDocumentImages(serverTextControl);

			// Compare pages
			if (bitmapsDocument1.Count != bitmapsDocument2.Count)
			{
				// Release the rendered bitmaps before returning early
				bitmapsDocument1.ForEach(b => b.Dispose());
				bitmapsDocument2.ForEach(b => b.Dispose());

				comparisonResults.Add(new PageComparisonResult
				{
					PageIndex = -1,
					AreEqual = false,
					Message = "The documents have different page counts."
				});
				return comparisonResults; // Return early if page counts differ
			}

			for (int i = 0; i < bitmapsDocument1.Count; i++)
			{
				// Dispose each pair of page images after it has been compared
				using (var bitmap1 = bitmapsDocument1[i])
				using (var bitmap2 = bitmapsDocument2[i])
				{
					comparisonResults.Add(new PageComparisonResult
					{
						PageIndex = i,
						AreEqual = !DocumentComparer.IsDifferent(bitmap1, bitmap2),
						Message = null
					});
				}
			}
		}

		return comparisonResults;
	}


Each bitmap represents one rendered page. The method first checks if the documents have the same number of pages. If the page counts are different, it immediately returns a result highlighting this discrepancy. For documents with matching page counts, the method uses the IsDifferent function to compare the rendered bitmap objects for each page, identifying any visual differences.
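
The PageComparisonResult type is not listed in this article. Judging from how it is used in CompareDocuments, a minimal sketch could look like the following. The property names are taken from the code above, but the exact definition in the GitHub sample may differ.

```csharp
using System;

// Usage: the special entry reported when the page counts differ.
var mismatch = new PageComparisonResult
{
    PageIndex = -1,
    AreEqual = false,
    Message = "The documents have different page counts."
};
Console.WriteLine(mismatch.Message);

// Assumed shape, inferred from the properties set in CompareDocuments.
public class PageComparisonResult
{
    // Zero-based page index, or -1 when the result describes the whole document.
    public int PageIndex { get; set; }

    // True if the rendered images of the page are identical.
    public bool AreEqual { get; set; }

    // Optional message; in the code above it is only set for a page count mismatch.
    public string? Message { get; set; }
}
```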

The GetDocumentImages method extracts high-resolution images of all pages from a document loaded into the ServerTextControl. Each page is rendered at 300 DPI to maintain high fidelity and ensure accurate pixel-based comparisons.

	private static List<Bitmap> GetDocumentImages(ServerTextControl serverTextControl)
	{
		var bitmaps = new List<Bitmap>();
		var pages = serverTextControl.GetPages();

		for (int i = 1; i <= pages.Count; i++)
		{
			// Get image for each page
			bitmaps.Add(pages[i].GetImage(300, Page.PageContent.All));
		}

		return bitmaps;
	}
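
Rendering at 300 DPI produces large bitmaps, and GetDocumentImages keeps all pages of a document in memory at once. As a rough estimate, the following sketch calculates the raw pixel data size of a single page. The page size (US Letter, 8.5 x 11 inches) and the 4 bytes per pixel (32-bit ARGB) are assumptions for illustration, and the EstimateBytes helper is hypothetical and not part of the sample.

```csharp
using System;

// Assumption: US Letter page (8.5 x 11 inches) rendered at 300 DPI, 32-bit ARGB.
long bytes = PageMemory.EstimateBytes(8.5, 11.0, 300);
Console.WriteLine($"{bytes} bytes per page ({bytes / (1024.0 * 1024.0):F1} MiB)");

public static class PageMemory
{
    // Estimates the raw pixel data size of one rendered page.
    public static long EstimateBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel = 4)
    {
        long widthPx = (long)Math.Round(widthInches * dpi);   // 8.5 in * 300 DPI = 2550 px
        long heightPx = (long)Math.Round(heightInches * dpi); // 11 in * 300 DPI = 3300 px
        return widthPx * heightPx * bytesPerPixel;            // 2550 * 3300 * 4 = 33,660,000
    }
}
```

Under these assumptions, a page takes about 32 MiB, so two six-page documents held in memory at the same time add up to several hundred megabytes. For long documents, it may be preferable to render and compare one page at a time.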


The IsDifferent method determines whether two bitmap objects are different by comparing their pixel data byte by byte. If the dimensions of the images differ, they are immediately marked as different. The method locks the pixel data for efficient access, compares the raw pixel data byte by byte for mismatches, and then unlocks the data when the comparison is complete. This approach ensures accuracy in detecting even subtle visual discrepancies.

	// Requires: using System; using System.Drawing; using System.Drawing.Imaging;
	public static bool IsDifferent(Bitmap bitmap1, Bitmap bitmap2)
	{
		if (bitmap1 == null || bitmap2 == null)
		{
			// The first constructor argument is the parameter name, the second the message.
			throw new ArgumentNullException(
				bitmap1 == null ? nameof(bitmap1) : nameof(bitmap2),
				"Bitmaps cannot be null.");
		}

		if (bitmap1.Width != bitmap2.Width || bitmap1.Height != bitmap2.Height)
		{
			// Consider images different if dimensions are not the same.
			return true;
		}

		// Lock the bits for both images for efficient pixel access.
		var rect = new Rectangle(0, 0, bitmap1.Width, bitmap1.Height);
		BitmapData data1 = bitmap1.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
		BitmapData data2 = bitmap2.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);

		try
		{
			// Copy the raw pixel data of both images into managed buffers.
			int bytes = data1.Stride * data1.Height;
			byte[] buffer1 = new byte[bytes];
			byte[] buffer2 = new byte[bytes];

			System.Runtime.InteropServices.Marshal.Copy(data1.Scan0, buffer1, 0, bytes);
			System.Runtime.InteropServices.Marshal.Copy(data2.Scan0, buffer2, 0, bytes);

			// Compare pixel data byte by byte and stop at the first mismatch.
			for (int i = 0; i < bytes; i++)
			{
				if (buffer1[i] != buffer2[i])
				{
					return true;
				}
			}
		}
		finally
		{
			// Unlock the bits.
			bitmap1.UnlockBits(data1);
			bitmap2.UnlockBits(data2);
		}

		return false;
	}
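
The byte-by-byte loop is easy to follow. On .NET Core 2.1 or later, the same check can be delegated to the span-based SequenceEqual method, which the runtime can typically process faster than a manual loop. The following is a minimal sketch of that alternative for the comparison step only. BuffersDiffer is a hypothetical helper name, and the small arrays stand in for the pixel data copied from the locked bitmaps.

```csharp
using System;

// Two small buffers standing in for the copied pixel data of two pages.
byte[] page1 = { 255, 255, 255, 255, 0, 0, 0, 255 };
byte[] page2 = { 255, 255, 255, 255, 0, 0, 0, 255 };
Console.WriteLine(PixelBuffers.BuffersDiffer(page1, page2)); // prints "False"

page2[4] = 1; // change a single color channel of the second pixel
Console.WriteLine(PixelBuffers.BuffersDiffer(page1, page2)); // prints "True"

public static class PixelBuffers
{
    // Returns true if the buffers differ in length or in any byte.
    public static bool BuffersDiffer(ReadOnlySpan<byte> buffer1, ReadOnlySpan<byte> buffer2) =>
        !buffer1.SequenceEqual(buffer2);
}
```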


Conclusion

Image-based document comparison offers a fast and efficient way to identify visual differences between documents. By rendering documents as images and comparing them pixel by pixel, it detects changes accurately, including subtle differences that text-based methods may miss, which makes it particularly useful for visually critical applications. The DocumentComparer utility demonstrates how to implement this approach using TX Text Control.

Download the sample from GitHub and test it with your own documents.
