Document comparison is a critical feature in many industries, allowing teams to efficiently identify differences between versions of documents. While text-based comparison methods are common, there are scenarios where an image-based, pixel-by-pixel approach offers unique advantages. This article provides examples and applications that demonstrate the practicality and speed of this method, and explores when and why it is useful.
Using the TX Text Control API, it would be possible to iterate over all paragraphs and characters and check their position and formatting. Although this is technically possible and TX Text Control is already fast, this approach would be too slow for longer documents.
Image-Based Document Comparison
Image-based document comparison renders the pages of a document as images and compares them pixel by pixel. Rather than programmatically analyzing textual content, formatting, or positioning, this approach directly identifies visual differences. Traditional text-based comparison methods parse document structure, extract text, analyze formatting, and detect positional differences. This process can be computationally intensive, especially for complex documents with intricate layouts or heavy formatting. Image-based comparison skips these steps and compares rendered images directly, which can significantly reduce processing time.
Text-based methods can miss certain visual differences, such as slight font changes, alignment shifts, or color variations. Pixel-by-pixel comparison accurately captures these differences, making it ideal for visually critical applications.
Comparing Documents
For demonstration purposes, we will use our demo document that comes with the installation of TX Text Control. It is a six-page document that contains most of the features of TX Text Control.
In a first pass, we will take two exact copies of the document and compare them using the following code.
using static DocumentComparer;

string document1 = "demo1.tx";
string document2 = "demo2.tx";

// Get the comparison results
List<PageComparisonResult> comparisonResults = DocumentComparer.CompareDocuments(document1, document2);

// Generate and display the results
foreach (var result in comparisonResults)
{
    if (result.PageIndex == -1)
    {
        // Special case for differing page counts
        Console.WriteLine(result.Message);
    }
    else
    {
        string message = result.AreEqual
            ? $"The document images of page {result.PageIndex + 1} are equal."
            : $"The document images of page {result.PageIndex + 1} are different.";
        Console.WriteLine(message);
    }
}
Running this code produces the following output, which means that the documents are identical:
The document images of page 1 are equal.
The document images of page 2 are equal.
The document images of page 3 are equal.
The document images of page 4 are equal.
The document images of page 5 are equal.
The document images of page 6 are equal.
Now let's change the font of the first paragraph on page 1 and reduce the size of the image on page 4.
When running the same code again, the result will be the following:
The document images of page 1 are different.
The document images of page 2 are equal.
The document images of page 3 are equal.
The document images of page 4 are different.
The document images of page 5 are equal.
The document images of page 6 are equal.
Implementation
The DocumentComparer class is a static utility for comparing two documents page by page. It provides insight into whether the documents are visually identical or contain differences. The CompareDocuments method provides an entry point for comparing two documents. It uses a ServerTextControl instance to load both documents and converts each document into a list of bitmap objects.
public static List<PageComparisonResult> CompareDocuments(string documentPath1, string documentPath2)
{
    var comparisonResults = new List<PageComparisonResult>();

    using (var serverTextControl = new ServerTextControl())
    {
        serverTextControl.Create();

        // Load and render the first document
        serverTextControl.Load(documentPath1, StreamType.InternalUnicodeFormat);
        var bitmapsDocument1 = GetDocumentImages(serverTextControl);

        // Load and render the second document
        serverTextControl.Load(documentPath2, StreamType.InternalUnicodeFormat);
        var bitmapsDocument2 = GetDocumentImages(serverTextControl);

        // Compare pages
        if (bitmapsDocument1.Count != bitmapsDocument2.Count)
        {
            // Release the rendered pages before returning early
            bitmapsDocument1.ForEach(bitmap => bitmap.Dispose());
            bitmapsDocument2.ForEach(bitmap => bitmap.Dispose());

            comparisonResults.Add(new PageComparisonResult
            {
                PageIndex = -1,
                AreEqual = false,
                Message = "The documents have different page counts."
            });

            return comparisonResults; // Return early if page counts differ
        }

        for (int i = 0; i < bitmapsDocument1.Count; i++)
        {
            // Dispose each pair of page images once it has been compared
            using (var bitmap1 = bitmapsDocument1[i])
            using (var bitmap2 = bitmapsDocument2[i])
            {
                comparisonResults.Add(new PageComparisonResult
                {
                    PageIndex = i,
                    AreEqual = !DocumentComparer.IsDifferent(bitmap1, bitmap2),
                    Message = null
                });
            }
        }
    }

    return comparisonResults;
}
Each bitmap represents one rendered page. The method first checks if the documents have the same number of pages. If the page counts are different, it immediately returns a result highlighting this discrepancy. For documents with matching page counts, the method uses the IsDifferent function to compare the rendered bitmap objects for each page, identifying any visual differences.
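The PageComparisonResult type used above is not listed in this article. As a minimal sketch, it could look like the following; the property names and types are inferred only from how CompareDocuments and the console code use them, so the class in the actual sample may differ.

```csharp
using System;

// Hypothetical reconstruction: inferred from the usage in CompareDocuments,
// not copied from the sample project.
public class PageComparisonResult
{
    // Zero-based page index, or -1 when the documents have different page counts.
    public int PageIndex { get; set; }

    // True when the rendered images of this page are identical in both documents.
    public bool AreEqual { get; set; }

    // Optional text; in the code above it is only set for the page-count mismatch.
    public string Message { get; set; }
}
```

Using -1 as a sentinel index keeps the result list uniform, at the cost of requiring callers to check for it before formatting a page number.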
The GetDocumentImages method extracts high-resolution images of all pages from a document loaded into the ServerTextControl. Each page is rendered at 300 DPI to maintain high fidelity and ensure accurate pixel-based comparisons.
private static List<Bitmap> GetDocumentImages(ServerTextControl serverTextControl)
{
    var bitmaps = new List<Bitmap>();
    var pages = serverTextControl.GetPages();

    // The page collection is addressed with a 1-based index
    for (int i = 1; i <= pages.Count; i++)
    {
        // Get image for each page
        bitmaps.Add(pages[i].GetImage(300, Page.PageContent.All));
    }

    return bitmaps;
}
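To get a feeling for how much pixel data such a rendering produces, the following sketch estimates the size of one page buffer. It takes the 300 DPI figure above at face value and assumes a US Letter page (8.5 x 11 inches) and 4 bytes per pixel, as in a 32-bit ARGB buffer; PageBufferEstimate and BytesPerPage are hypothetical names used only for this illustration.

```csharp
using System;

public static class PageBufferEstimate
{
    // Estimates the raw pixel buffer size of one rendered page.
    // Assumption: 4 bytes per pixel (32-bit ARGB) and no row padding.
    public static long BytesPerPage(double widthInches, double heightInches, int dpi, int bytesPerPixel = 4)
    {
        long widthPx = (long)Math.Round(widthInches * dpi);
        long heightPx = (long)Math.Round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }

    public static void Main()
    {
        // US Letter at 300 DPI: 2550 x 3300 pixels
        long bytes = BytesPerPage(8.5, 11, 300);
        Console.WriteLine(bytes); // prints 33660000
    }
}
```

Under these assumptions a single page buffer is roughly 32 MiB. Since CompareDocuments renders all pages of both documents before comparing them, memory use grows with the page count, which may matter for long documents.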
The IsDifferent method determines whether two bitmap objects differ. If the dimensions of the images differ, they are immediately marked as different. Otherwise, the method locks the pixel data of both images for efficient access, copies it into byte arrays, compares the arrays byte by byte, and returns as soon as it finds a mismatch. The finally block unlocks the pixel data in either case. Because every byte is compared, even subtle visual discrepancies are detected.
// Requires: using System; using System.Drawing; using System.Drawing.Imaging;
public static bool IsDifferent(Bitmap bitmap1, Bitmap bitmap2)
{
    // ArgumentNullException expects the parameter name, not a message
    if (bitmap1 == null) throw new ArgumentNullException(nameof(bitmap1));
    if (bitmap2 == null) throw new ArgumentNullException(nameof(bitmap2));

    if (bitmap1.Width != bitmap2.Width || bitmap1.Height != bitmap2.Height)
    {
        // Consider images different if dimensions are not the same.
        return true;
    }

    // Lock the bits for both images for efficient pixel access.
    var rect = new Rectangle(0, 0, bitmap1.Width, bitmap1.Height);
    BitmapData data1 = bitmap1.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
    BitmapData data2 = bitmap2.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);

    try
    {
        // Copy the raw pixel data of both images into managed buffers.
        int bytes = data1.Stride * data1.Height;
        byte[] buffer1 = new byte[bytes];
        byte[] buffer2 = new byte[bytes];

        System.Runtime.InteropServices.Marshal.Copy(data1.Scan0, buffer1, 0, bytes);
        System.Runtime.InteropServices.Marshal.Copy(data2.Scan0, buffer2, 0, bytes);

        // Compare pixel data byte by byte and stop at the first mismatch.
        for (int i = 0; i < bytes; i++)
        {
            if (buffer1[i] != buffer2[i])
            {
                return true;
            }
        }
    }
    finally
    {
        // Unlock the bits, also when returning early from the loop.
        bitmap1.UnlockBits(data1);
        bitmap2.UnlockBits(data2);
    }

    return false;
}
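As a possible variation, the manual loop over the two buffers could be replaced by a span comparison. The sketch below is not part of the sample: BufferComparison and BuffersDiffer are hypothetical names, and it assumes a runtime where Span<T> is available (.NET Core 2.1 or later, or the System.Memory package). It operates on plain byte arrays, so it can be tried without any bitmaps.

```csharp
using System;

public static class BufferComparison
{
    // Returns true when the two buffers differ in length or in any byte.
    public static bool BuffersDiffer(byte[] buffer1, byte[] buffer2)
    {
        if (buffer1 == null) throw new ArgumentNullException(nameof(buffer1));
        if (buffer2 == null) throw new ArgumentNullException(nameof(buffer2));

        // SequenceEqual compares lengths first and then the contents.
        return !buffer1.AsSpan().SequenceEqual(buffer2);
    }

    public static void Main()
    {
        byte[] a = { 255, 255, 255, 255 };
        byte[] b = { 255, 255, 255, 255 };
        byte[] c = { 255, 254, 255, 255 };

        Console.WriteLine(BuffersDiffer(a, b)); // prints False
        Console.WriteLine(BuffersDiffer(a, c)); // prints True
    }
}
```

The result is the same as with the explicit loop; the span version is mainly shorter, and the runtime is free to compare several bytes at once.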
Conclusion
Image-based document comparison provides a fast and efficient way to identify visual differences between documents. By rendering documents as images and comparing them pixel by pixel, it detects changes accurately, including subtle ones that text-based methods may miss. This makes the approach particularly useful for visually critical applications. The DocumentComparer utility demonstrates how to implement image-based document comparison using TX Text Control.
Download the sample from GitHub and test it with your own documents.