There are many strategies for comparing documents in document processing applications. One of the most common is to compare the text of the documents word by word. This method is simple and effective, but it has limitations: because words are matched by their position, a single inserted or deleted word causes the following words in that sentence to be reported as changed as well.
Word-by-Word Comparison
The algorithm compares the paragraphs of both documents in their given order. Each paragraph is split into sentences at the typical delimiters. Finally, the words of each sentence in the original document are compared with the words of the corresponding sentence in the revised document.
The differences are applied to the original document as track changes, so the user can see exactly what has been changed.
Implementation
The sample implements the DocumentComparison class, which accepts two TextControl instances in its constructor. You can easily rewrite this class to use non-UI ServerTextControl instances.
    DocumentComparison dc = new DocumentComparison(textControl1, textControl2);
The constructor compares the two documents. It loops through all paragraphs in the original document and compares the text with the revised document. If a difference is found, the text is marked as a track change.
Extracting Sentences
The ExtractSentences method takes the text of the current paragraph and splits it at typical delimiters. The returned list contains the sentences and, as separate entries, the delimiters themselves.
    // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
    public static List<string> ExtractSentences(string input)
    {
        List<string> sentences = new List<string>();

        // The capture group keeps the delimiters (. ! ?) in the result
        // as separate entries; white space is preserved as well
        string pattern = @"([.!?])";

        // Split the input string into sentences and delimiters
        string[] splitSentences = Regex.Split(input, pattern);

        // Add every piece unchanged; trimming happens later in the constructor,
        // which needs the original lengths to calculate positions
        foreach (string sentence in splitSentences)
        {
            sentences.Add(sentence);
        }

        return sentences;
    }
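To see what the capture group in the pattern does, the following minimal sketch runs the same split on a short string. The class name SplitDemo and the sample input are made up for illustration; only Regex.Split and the pattern are taken from the method above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Hypothetical helper for illustration: applies the same pattern
    // that ExtractSentences uses and returns the raw pieces.
    public static string[] Split(string input)
    {
        // Because the delimiter is inside a capture group, Regex.Split
        // returns each delimiter as its own array element.
        return Regex.Split(input, @"([.!?])");
    }

    // Sums the lengths of all pieces; this equals the input length
    // because nothing is dropped by the split.
    public static int TotalLength(string[] pieces)
    {
        int total = 0;
        foreach (string piece in pieces)
        {
            total += piece.Length;
        }
        return total;
    }
}
```

Calling SplitDemo.Split("Hello world. How are you?") returns five pieces: "Hello world", ".", " How are you", "?" and a trailing empty string. Since the delimiters and the leading space are kept, the lengths of the pieces add up to the length of the input, which is what makes it possible to derive document positions from them later.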
Comparing Sentences
The CompareSentences method splits both sentences into words and compares the words at the same position. It returns a list of tuples, each containing three elements: the word from sentence1, the character index at which that word starts in sentence1, and the corresponding word from sentence2. If one sentence has more words than the other, the missing word is represented by an empty string.
    private static List<(string word, int charIndex, string replacedWord)> CompareSentences(string sentence1, string sentence2)
    {
        string[] words1 = sentence1.Split(' ');
        string[] words2 = sentence2.Split(' ');

        List<(string word, int charIndex, string replacedWord)> differences =
            new List<(string word, int charIndex, string replacedWord)>();

        // Track the character index within sentence1
        int charIndex = 0;

        // Get the larger word count of the two sentences
        int maxLength = Math.Max(words1.Length, words2.Length);

        // Compare each word in the sentences
        for (int i = 0; i < maxLength; i++)
        {
            // Check if the current word exists in both sentences
            if (i < words1.Length && i < words2.Length)
            {
                // If the words are different, add the word, character index, and replaced word to the list
                if (words1[i] != words2[i])
                {
                    differences.Add((words1[i], charIndex, words2[i]));
                }
            }
            // sentence1 is longer: the extra word has no counterpart in sentence2
            else if (i < words1.Length)
            {
                differences.Add((words1[i], charIndex, ""));
            }
            // sentence2 is longer: the extra word is an insertion, so the original word is empty
            else
            {
                differences.Add(("", charIndex, words2[i]));
            }

            // Update the character index for the next word
            if (i < words1.Length)
                charIndex += words1[i].Length + 1; // Add 1 for the space
        }

        return differences;
    }
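The following sketch illustrates what a positional comparison of this kind reports for two short inputs. It is a condensed re-implementation written for this example, not the code of the sample; the class name PositionalDiffDemo and the method name Compare are hypothetical, and a word that is missing on either side is represented by an empty string.

```csharp
using System;
using System.Collections.Generic;

public static class PositionalDiffDemo
{
    // Condensed, hypothetical positional word comparison for illustration.
    public static List<(string word, int charIndex, string replacedWord)> Compare(
        string sentence1, string sentence2)
    {
        string[] words1 = sentence1.Split(' ');
        string[] words2 = sentence2.Split(' ');
        var differences = new List<(string word, int charIndex, string replacedWord)>();

        int charIndex = 0; // start of the current word within sentence1
        int maxLength = Math.Max(words1.Length, words2.Length);

        for (int i = 0; i < maxLength; i++)
        {
            // Use an empty string when one sentence has no word at position i
            string word1 = i < words1.Length ? words1[i] : "";
            string word2 = i < words2.Length ? words2[i] : "";

            if (word1 != word2)
            {
                differences.Add((word1, charIndex, word2));
            }

            // Only words of sentence1 advance the index (+1 for the space)
            if (i < words1.Length)
            {
                charIndex += words1[i].Length + 1;
            }
        }

        return differences;
    }
}
```

Comparing "The quick brown fox" with "The slow brown fox" yields a single entry: ("quick", 4, "slow"). Comparing "The quick brown fox" with "The brown fox" yields three entries, ("quick", 4, "brown"), ("brown", 10, "fox") and ("fox", 16, ""), because every word after the removed one is compared with its shifted neighbor. This is the main limitation of a purely positional comparison.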
Comparing Documents
The constructor of the DocumentComparison class uses the above methods to find the differences between given TextControl instances. The differences are marked as track changes in the original document.
    public DocumentComparison(TextControl originalDocument, TextControl revisedDocument)
    {
        // Initialize document references
        m_originalDocument = originalDocument;
        m_revisedDocument = revisedDocument;

        // Enable track changes in the original document
        originalDocument.IsTrackChangesEnabled = true;

        // Compare paragraphs between the original and revised documents
        for (int p = 1; p <= m_originalDocument.Paragraphs.Count; p++)
        {
            var offsetSentences = 0;

            // Retrieve the original and revised paragraphs
            Paragraph originalParagraph = m_originalDocument.Paragraphs[p];

            if (p > m_revisedDocument.Paragraphs.Count)
                break; // Break if the revised document has fewer paragraphs than the original document

            Paragraph revisedParagraph = m_revisedDocument.Paragraphs[p];

            // Get the start position of the original paragraph
            var startParagraph = originalParagraph.Start;
            var uncheckedOffset = 0;

            // Check if the text of the original and revised paragraphs differ
            if (originalParagraph.Text != revisedParagraph.Text)
            {
                // Extract sentences from the original and revised paragraphs
                var originalSentences = ExtractSentences(originalParagraph.Text);
                var revisedSentences = ExtractSentences(revisedParagraph.Text);

                // Compare sentences and replace words in the original document
                for (int i = 0; i < originalSentences.Count; i++)
                {
                    if (i >= revisedSentences.Count)
                        break; // Break if the revised paragraph has fewer sentences than the original paragraph

                    // Only the leading white space removed by trimming shifts the word positions
                    var originalTrimOffset = originalSentences[i].Length - originalSentences[i].TrimStart().Length;
                    var originalSentence = originalSentences[i].Trim();
                    var revisedSentence = revisedSentences[i].Trim();

                    // Track changes offset initialization
                    int trackedChangeOffset = 0;

                    var differences = CompareSentences(originalSentence, revisedSentence);

                    // Check if there are any differences
                    if (differences.Count == 0)
                        uncheckedOffset = originalSentences[i].Length - 1;

                    // Apply differences to the original document
                    foreach (var difference in differences)
                    {
                        m_originalDocument.Selection.Start = trackedChangeOffset + startParagraph + offsetSentences +
                            difference.charIndex + originalTrimOffset + uncheckedOffset - 1;
                        m_originalDocument.Selection.Length = difference.word.Length;
                        m_originalDocument.Selection.Text = difference.replacedWord;

                        // The replaced text remains in the document as a tracked deletion,
                        // so the following positions move by the length of the new text
                        trackedChangeOffset += difference.replacedWord.Length;
                    }

                    // Update offset for next sentence
                    offsetSentences += originalSentences[i].Length + trackedChangeOffset;
                }
            }
        }
    }
The complex part of this process is keeping track of the various index offsets and trimming the sentences so that leading and trailing spaces are ignored during the comparison.
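The offset bookkeeping can be illustrated on a plain string, without a TextControl instance. The sketch below is an assumption-laden simplification: the class OffsetDemo and its method AbsoluteIndex are hypothetical, and they only cover the part of the calculation that depends on the sentence pieces and the trimmed leading white space. The paragraph start position and the length of previously inserted track changes, which the sample adds on top, are left out.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OffsetDemo
{
    // Hypothetical helper: maps a character index inside a trimmed sentence
    // piece back to an index inside the full paragraph text.
    public static int AbsoluteIndex(string paragraph, int pieceIndex, int charIndexInTrimmedPiece)
    {
        // Same split as in ExtractSentences: delimiters are separate pieces
        string[] pieces = Regex.Split(paragraph, @"([.!?])");

        // Sum the lengths of all preceding pieces (delimiters included)
        int pieceStart = 0;
        for (int i = 0; i < pieceIndex; i++)
        {
            pieceStart += pieces[i].Length;
        }

        // Leading white space removed by trimming has to be added back
        string piece = pieces[pieceIndex];
        int leadingWhitespace = piece.Length - piece.TrimStart().Length;

        return pieceStart + leadingWhitespace + charIndexInTrimmedPiece;
    }
}
```

For the paragraph "First one. Second tow here.", the word "tow" starts at index 7 of the trimmed piece "Second tow here", which is piece number 2 of the split. AbsoluteIndex returns 18 for these values, and Substring(18, 3) of the paragraph is "tow".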
Conclusion
Comparing documents word by word is a common method of document comparison. This sample shows how to implement a simple word-by-word comparison algorithm using TX Text Control. The sample compares two documents and marks the differences as track changes in the original document.
Download the complete sample from our GitHub repository and test it yourself.