There are many strategies for comparing documents in document processing applications. One of the most common is to compare the text of the documents word by word. This is a simple and effective method of document comparison, but it does have some limitations.

Word-by-Word Comparison

Essentially, this comparison algorithm walks through all paragraphs in their given order. Each paragraph is split into sentences at the sentence delimiters. Finally, the words of each sentence in the original document are compared with the words of the corresponding sentence in the revised document.
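
The three levels described above can be sketched on plain strings, without any TX Text Control types. This is a minimal illustration, not the sample's implementation: the class name PlainTextComparisonSketch, the method Compare, and the choice of '.', '!' and '?' as delimiters are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class PlainTextComparisonSketch
{
    // Hypothetical helper: reports (paragraph, sentence, original word, revised word)
    // for every word position where the two texts differ.
    public static List<(int paragraph, int sentence, string original, string revised)> Compare(
        string[] originalParagraphs, string[] revisedParagraphs)
    {
        var result = new List<(int paragraph, int sentence, string original, string revised)>();
        char[] delimiters = { '.', '!', '?' };

        // Level 1: paragraphs in their given order
        int paragraphCount = Math.Min(originalParagraphs.Length, revisedParagraphs.Length);
        for (int p = 0; p < paragraphCount; p++)
        {
            // Level 2: sentences, split at the delimiters
            string[] s1 = originalParagraphs[p].Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
            string[] s2 = revisedParagraphs[p].Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

            for (int s = 0; s < Math.Min(s1.Length, s2.Length); s++)
            {
                // Level 3: words, compared position by position
                string[] w1 = s1[s].Trim().Split(' ');
                string[] w2 = s2[s].Trim().Split(' ');

                for (int w = 0; w < Math.Max(w1.Length, w2.Length); w++)
                {
                    string a = w < w1.Length ? w1[w] : "";
                    string b = w < w2.Length ? w2[w] : "";
                    if (a != b)
                        result.Add((p, s, a, b));
                }
            }
        }
        return result;
    }
}
```

Comparing the single paragraph "The quick fox. It runs." with "The slow fox. It jumps." reports two differences: quick/slow in the first sentence and runs/jumps in the second.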

The results are marked as tracked changes in the original document, where they are highlighted so that the user can see exactly what has changed.

Document Comparison with TX Text Control in C#

Implementation

The sample implements the DocumentComparison class, which accepts two TXTextControl.TextControl instances in its constructor. You can easily rewrite this class to use non-UI TXTextControl.ServerTextControl instances.

DocumentComparison dc = new DocumentComparison(textControl1, textControl2);

The constructor compares the two documents. It loops through all paragraphs in the original document and compares the text with the revised document. If a difference is found, the text is marked as a track change.

Extracting Sentences

The ExtractSentences method takes a string from the current paragraph and returns a list of sentences by splitting it at typical delimiters.

public static List<string> ExtractSentences(string input)
{
    List<string> sentences = new List<string>();
    // Split at the sentence delimiters; the capturing group returns each
    // delimiter as a separate entry, and white space is preserved
    string pattern = @"([.!?])";
    string[] splitSentences = Regex.Split(input, pattern);
    // Add each entry unchanged; trimming happens later, when the sentences are compared
    foreach (string sentence in splitSentences)
    {
        sentences.Add(sentence);
    }
    return sentences;
}
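
The effect of the capturing group in the pattern @"([.!?])" is easiest to see on a plain string. The following sketch shows the split as used above next to a variant that keeps each delimiter attached to its sentence. The class SentenceSplitSketch and both method names are hypothetical, and the lookbehind variant is an alternative for illustration, not part of the sample.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitSketch
{
    // Same pattern as ExtractSentences: because the delimiter is captured,
    // Regex.Split returns it as its own array entry.
    public static string[] SplitWithCapture(string input) =>
        Regex.Split(input, @"([.!?])");

    // Hypothetical variant: a lookbehind splits after each delimiter, so the
    // delimiter stays attached to its sentence; empty entries are dropped.
    public static List<string> SplitKeepingDelimiters(string input) =>
        Regex.Split(input, @"(?<=[.!?])")
             .Where(s => s.Length > 0)
             .ToList();
}
```

For the input "Hello world. How are you?", SplitWithCapture returns "Hello world", ".", " How are you" and "?" followed by a trailing empty string, while SplitKeepingDelimiters returns "Hello world." and " How are you?". With the first form, the delimiters take part in the comparison as entries of their own, which works as long as both documents are split the same way.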

Comparing Sentences

The CompareSentences method splits both sentences into words and compares the words at the same position in each sentence. It returns the list of differences as tuples, each containing three elements: the word from sentence1, the character index where that word starts in sentence1, and the corresponding word from sentence2.

private static List<(string word, int charIndex, string replacedWord)> CompareSentences(string sentence1, string sentence2)
{
    string[] words1 = sentence1.Split(' ');
    string[] words2 = sentence2.Split(' ');
    List<(string word, int charIndex, string replacedWord)> differences =
        new List<(string word, int charIndex, string replacedWord)>();
    // Track the character index
    int charIndex = 0;
    // Get the maximum number of words in the two sentences
    int maxLength = Math.Max(words1.Length, words2.Length);
    // Compare each word in the sentences
    for (int i = 0; i < maxLength; i++)
    {
        // Check if the current word exists in both sentences
        if (i < words1.Length && i < words2.Length)
        {
            // If the words are different, add the word, character index, and replaced word to the list
            if (words1[i] != words2[i])
            {
                differences.Add((words1[i], charIndex, words2[i]));
            }
        }
        // Extra word in the original sentence: it is replaced with an empty string
        else if (i < words1.Length)
        {
            differences.Add((words1[i], charIndex, ""));
        }
        // Extra word in the revised sentence: nothing is replaced, the word is inserted
        else
        {
            differences.Add(("", charIndex, words2[i]));
        }
        // Update the character index for the next word
        if (i < words1.Length)
            charIndex += words1[i].Length + 1; // Add 1 for the space
    }
    return differences;
}
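
Because words are matched purely by position, a single inserted word shifts every word after it, which is one of the limitations of this approach. The sketch below is a condensed restatement of the positional comparison for illustration only. The class PositionalCompareSketch is hypothetical, and reporting an extra revised word with an empty original word is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class PositionalCompareSketch
{
    public static List<(string word, int charIndex, string replacedWord)> Compare(
        string sentence1, string sentence2)
    {
        string[] words1 = sentence1.Split(' ');
        string[] words2 = sentence2.Split(' ');
        var differences = new List<(string word, int charIndex, string replacedWord)>();

        // Start index of the current word within sentence1
        int charIndex = 0;
        for (int i = 0; i < Math.Max(words1.Length, words2.Length); i++)
        {
            // A missing word on either side is treated as an empty string
            string a = i < words1.Length ? words1[i] : "";
            string b = i < words2.Length ? words2[i] : "";
            if (a != b)
                differences.Add((a, charIndex, b));

            // Advance past the word and the following space
            if (i < words1.Length)
                charIndex += words1[i].Length + 1;
        }
        return differences;
    }
}
```

Comparing "The quick brown fox" with "The very quick brown fox" yields four differences, (quick, 4, very), (brown, 10, quick), (fox, 16, brown) and an inserted fox at index 20, although only one word was actually added.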

Comparing Documents

The constructor of the DocumentComparison class uses the above methods to find the differences between given TextControl instances. The differences are marked as track changes in the original document.

public DocumentComparison(TXTextControl.TextControl originalDocument, TXTextControl.TextControl revisedDocument)
{
    // Initialize document references
    m_originalDocument = originalDocument;
    m_revisedDocument = revisedDocument;
    // Enable track changes in the original document
    originalDocument.IsTrackChangesEnabled = true;
    // Compare paragraphs between the original and revised documents (the loop indexes paragraphs from 1)
    for (int p = 1; p <= m_originalDocument.Paragraphs.Count; p++)
    {
        var offsetSentences = 0;
        // Retrieve the original and revised paragraphs
        Paragraph originalParagraph = m_originalDocument.Paragraphs[p];
        if (p > m_revisedDocument.Paragraphs.Count)
            break; // Break if the revised document has fewer paragraphs than the original document
        Paragraph revisedParagraph = m_revisedDocument.Paragraphs[p];
        // Get the start position of the original paragraph
        var startParagraph = originalParagraph.Start;
        var uncheckedOffset = 0;
        // Check if the text of the original and revised paragraphs differ
        if (originalParagraph.Text != revisedParagraph.Text)
        {
            // Extract sentences from the original and revised paragraphs
            var originalSentences = ExtractSentences(originalParagraph.Text);
            var revisedSentences = ExtractSentences(revisedParagraph.Text);
            // Compare sentences and replace words in the original document
            for (int i = 0; i < originalSentences.Count; i++)
            {
                // Stop if the revised paragraph has fewer sentences than the original paragraph
                if (i >= revisedSentences.Count)
                    break;
                // Trim sentences and calculate offset
                var originalTrimOffset = originalSentences[i].Length - originalSentences[i].Trim().Length;
                var originalSentence = originalSentences[i].Trim();
                var revisedSentence = revisedSentences[i].Trim();
                // Track changes offset initialization
                int trackedChangeOffset = 0;
                var differences = CompareSentences(originalSentence, revisedSentence);
                // Check if there are any differences
                if (differences.Count == 0)
                    uncheckedOffset = originalSentences[i].Length - 1;
                // Apply differences to the original document
                foreach (var difference in differences)
                {
                    m_originalDocument.Selection.Start = trackedChangeOffset + startParagraph + offsetSentences +
                        difference.charIndex + originalTrimOffset + uncheckedOffset - 1;
                    m_originalDocument.Selection.Length = difference.word.Length;
                    m_originalDocument.Selection.Text = difference.replacedWord;
                    trackedChangeOffset += difference.replacedWord.Length;
                }
                // Update offset for next sentence
                offsetSentences += originalSentences[i].Length + trackedChangeOffset;
            }
        }
    }
}

The complex part of this process is keeping track of the various index offsets and trimming the sentences so that leading and trailing spaces are ignored.
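
The offset bookkeeping can be illustrated on a plain string: every replacement changes the length of the text, so the start index of each later replacement has to be shifted by the accumulated change. The sketch below is an illustration under stated assumptions, not the sample's code. The class OffsetSketch and the method ApplyDifferences are hypothetical, and the differences are assumed to be ordered by charIndex.

```csharp
using System.Collections.Generic;
using System.Text;

public static class OffsetSketch
{
    // Applies (word, charIndex, replacedWord) tuples to a plain string,
    // where charIndex refers to positions in the unmodified original.
    public static string ApplyDifferences(
        string original,
        List<(string word, int charIndex, string replacedWord)> differences)
    {
        var text = new StringBuilder(original);
        // Accumulated length change caused by the replacements applied so far
        int shift = 0;

        foreach (var (word, charIndex, replacedWord) in differences)
        {
            int start = charIndex + shift;
            text.Remove(start, word.Length);
            text.Insert(start, replacedWord);
            // Later indexes move by the difference in length
            shift += replacedWord.Length - word.Length;
        }
        return text.ToString();
    }
}
```

Applying (quick, 4, slow) and (fox, 16, dog) to "The quick brown fox" produces "The slow brown dog". In this plain-string sketch, the shift is the length difference between the new and the old word. The constructor above instead adds the full length of the replaced word to trackedChangeOffset, presumably because a tracked deletion remains in the document text until the change is accepted.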

Conclusion

Comparing documents word by word is a common method of document comparison. This sample shows how to implement a simple word-by-word comparison algorithm using TX Text Control. The sample compares two documents and marks the differences as track changes in the original document.

Download the complete sample from our GitHub repository and test it on your own.