
Word-based Document Comparison and Track Changes Using TX Text Control and C#

This article shows how to compare two documents by their text content and how to track changes in a document using TX Text Control .NET for Windows Forms.

There are many strategies for comparing documents in document processing applications. One of the most common is to compare the text of the documents word by word. This is a simple and effective method, but it has limitations: because the words are compared by position, a single inserted or deleted word shifts all following words in that sentence, and they are all reported as changes.

Word-by-Word Comparison

Essentially, this comparison algorithm compares all paragraphs in their given order. Each paragraph is split into sentences at the delimiters (., !, and ?). Finally, the words of each sentence in the original document are compared with the words of the corresponding sentence in the revised document.

The results are marked as tracked changes in the original document, where they are highlighted so that the user can review every change.
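
To make the three levels (paragraph, sentence, word) concrete, here is a minimal sketch that works on plain strings only. It is not the sample's code: the class name WordDiffSketch and the method ChangedWords are made up for illustration, the sentence split uses a lookbehind pattern instead of the pattern shown later in this article, and surplus paragraphs, sentences, or words are simply ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordDiffSketch
{
    // Hypothetical helper: returns the (original, revised) word pairs that differ,
    // comparing paragraphs, sentences, and words strictly by position.
    public static List<(string Original, string Revised)> ChangedWords(
        string[] originalParagraphs, string[] revisedParagraphs)
    {
        var changes = new List<(string Original, string Revised)>();

        // Zip stops at the shorter sequence, so surplus items are ignored in this sketch.
        foreach (var (op, rp) in originalParagraphs.Zip(revisedParagraphs, (o, r) => (o, r)))
        {
            // Assumption: split after each delimiter so it stays with its sentence.
            string[] originalSentences = Regex.Split(op, @"(?<=[.!?])\s+");
            string[] revisedSentences = Regex.Split(rp, @"(?<=[.!?])\s+");

            foreach (var (s1, s2) in originalSentences.Zip(revisedSentences, (a, b) => (a, b)))
            {
                string[] words1 = s1.Split(' ');
                string[] words2 = s2.Split(' ');

                // Keep only the positions at which the two words differ.
                changes.AddRange(words1.Zip(words2, (a, b) => (a, b)).Where(p => p.a != p.b));
            }
        }

        return changes;
    }
}
```

For the paragraphs "The quick fox jumps. It is fast." and "The slow fox jumps. It was fast.", this sketch reports the pairs (quick, slow) and (is, was).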

Implementation

The sample implements the DocumentComparison class, which accepts two TXTextControl.TextControl instances in its constructor. You can easily rewrite this class to use non-UI TXTextControl.ServerTextControl instances.

DocumentComparison dc = new DocumentComparison(textControl1, textControl2);

The constructor compares the two documents. It loops through all paragraphs in the original document and compares the text of each paragraph with the paragraph at the same position in the revised document. If a difference is found, the text is marked as a tracked change.

Extracting Sentences

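The ExtractSentences method takes the text of the current paragraph and splits it at the typical sentence delimiters (., !, and ?). Because the pattern uses a capturing group, the delimiters themselves are returned as separate entries in the list.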

public static List<string> ExtractSentences(string input)
{
        List<string> sentences = new List<string>();

        // The capturing group in the pattern makes Regex.Split return the
        // delimiters as separate entries, so no characters are lost
        string pattern = @"([.!?])";

        // Split the input string at the sentence delimiters
        string[] splitSentences = Regex.Split(input, pattern);

        // Add every entry unchanged (including white spaces and empty strings)
        // so that the summed lengths still match the paragraph text
        foreach (string sentence in splitSentences)
        {
                sentences.Add(sentence);
        }

        return sentences;
}
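
The following small sketch shows what this split pattern produces for a sample string. The class name SplitDemo and the method Split are made up for this illustration; only Regex.Split and the pattern come from the listing above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Same pattern as in ExtractSentences: the capturing group keeps the delimiters.
    public static string[] Split(string input)
    {
        return Regex.Split(input, @"([.!?])");
    }
}
```

Calling SplitDemo.Split("Hello world. How are you?") returns five entries: "Hello world", ".", " How are you", "?", and an empty string, because the last delimiter is at the end of the input. The leading space of the second sentence is preserved, which is why the calling code trims each entry and keeps track of the trimmed length.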

Comparing Sentences

The CompareSentences method splits both sentences into words at the spaces and compares the words position by position. It returns a list of tuples, each containing three elements: the word from sentence1, the character index at which that word starts in sentence1, and the corresponding word from sentence2. If one sentence has more words than the other, the extra words are added to the list with an empty replacement word.

private static List<(string word, int charIndex, string replacedWord)> CompareSentences(string sentence1, string sentence2)
{
        string[] words1 = sentence1.Split(' ');
        string[] words2 = sentence2.Split(' ');

        List<(string word, int charIndex, string replacedWord)> differences =
                new List<(string word, int charIndex, string replacedWord)>();

        // Track the character index
        int charIndex = 0;

        // Get the maximum length of the two sentences
        int maxLength = Math.Max(words1.Length, words2.Length);

        // Compare each word in the sentences
        for (int i = 0; i < maxLength; i++)
        {
                // Check if the current word exists in both sentences
                if (i < words1.Length && i < words2.Length)
                {
                        // If the words are different, add the word, character index, and replaced word to the list
                        if (words1[i] != words2[i])
                        {
                                differences.Add((words1[i], charIndex, words2[i]));
                        }
                }
                // If one of the sentences is shorter, add the extra word to the list
                else if (i < words1.Length)
                {
                        differences.Add((words1[i], charIndex, ""));
                }
                else
                {
                        differences.Add((words2[i], charIndex, ""));
                }

                // Update the character index for the next word
                if (i < words1.Length)
                        charIndex += words1[i].Length + 1; // Add 1 for the space
        }

        return differences;
}
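
Because CompareSentences is private, the following sketch repeats its logic in condensed form so that it can be tried on its own. The class name CompareDemo and the method name Compare are made up for this illustration; the behavior is intended to match the listing above.

```csharp
using System;
using System.Collections.Generic;

public static class CompareDemo
{
    // Condensed copy of the positional word comparison shown above.
    public static List<(string word, int charIndex, string replacedWord)> Compare(
        string sentence1, string sentence2)
    {
        string[] words1 = sentence1.Split(' ');
        string[] words2 = sentence2.Split(' ');
        var differences = new List<(string word, int charIndex, string replacedWord)>();
        int charIndex = 0;

        for (int i = 0; i < Math.Max(words1.Length, words2.Length); i++)
        {
            if (i < words1.Length && i < words2.Length)
            {
                // Both sentences have a word at this position
                if (words1[i] != words2[i])
                    differences.Add((words1[i], charIndex, words2[i]));
            }
            else
            {
                // Surplus word from either sentence, with an empty replacement
                differences.Add((i < words1.Length ? words1[i] : words2[i], charIndex, ""));
            }

            // The index only advances over words of the first sentence
            if (i < words1.Length)
                charIndex += words1[i].Length + 1;
        }

        return differences;
    }
}
```

Comparing "The quick brown fox" with "The slow brown fox" returns a single entry: ("quick", 4, "slow"). Comparing "The quick fox" with "The very quick fox" returns three entries, ("quick", 4, "very"), ("fox", 10, "quick"), and ("fox", 14, ""), which shows how one inserted word shifts every following position.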

Comparing Documents

The constructor of the DocumentComparison class uses the above methods to find the differences between the two given TextControl instances. The differences are marked as tracked changes in the original document. Note that the paragraph loop starts at 1 and runs up to and including Paragraphs.Count.

public DocumentComparison(TXTextControl.TextControl originalDocument, TextControl revisedDocument)
{
  // Initialize document references
  m_originalDocument = originalDocument;
  m_revisedDocument = revisedDocument;

  // Enable track changes in the original document
  originalDocument.IsTrackChangesEnabled = true;

  // Compare paragraphs between the original and revised documents
  for (int p = 1; p <= m_originalDocument.Paragraphs.Count; p++)
  {
    var offsetSentences = 0;

    // Retrieve the original and revised paragraphs
    Paragraph originalParagraph = m_originalDocument.Paragraphs[p];

    if (p > m_revisedDocument.Paragraphs.Count)
      break; // Break if the revised document has fewer paragraphs than the original document

    Paragraph revisedParagraph = m_revisedDocument.Paragraphs[p];

    // Get the start position of the original paragraph
    var startParagraph = originalParagraph.Start;
    var uncheckedOffset = 0;

    // Check if the text of the original and revised paragraphs differ
    if (originalParagraph.Text != revisedParagraph.Text)
    {
      // Extract sentences from the original and revised paragraphs
      var originalSentences = ExtractSentences(originalParagraph.Text);
      var revisedSentences = ExtractSentences(revisedParagraph.Text);

      // Compare sentences and replace words in the original document
      for (int i = 0; i < originalSentences.Count; i++)
      {
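        // Stop if the revised paragraph has fewer sentence entries than the original
        if (i >= revisedSentences.Count)
          break;

        // Trim sentences and calculate offset
        var originalTrimOffset = originalSentences[i].Length - originalSentences[i].Trim().Length;
        var originalSentence = originalSentences[i].Trim();
        var revisedSentence = revisedSentences[i].Trim();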

        // Track changes offset initialization
        int trackedChangeOffset = 0;

        var differences = CompareSentences(originalSentence, revisedSentence);

        // Check if there are any differences
        if (differences.Count == 0)
          uncheckedOffset = originalSentences[i].Length - 1;

        // Apply differences to the original document
        foreach (var difference in differences)
        {
          m_originalDocument.Selection.Start = trackedChangeOffset + startParagraph + offsetSentences +
                               difference.charIndex + originalTrimOffset + uncheckedOffset - 1;

          m_originalDocument.Selection.Length = difference.word.Length;
          m_originalDocument.Selection.Text = difference.replacedWord;

          trackedChangeOffset += difference.replacedWord.Length;
        }

        // Update offset for next sentence
        offsetSentences += originalSentences[i].Length + trackedChangeOffset;
      }
    }
    
  }
}

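The complex part of this process is keeping track of the various index offsets: the start position of the paragraph, the length of the sentences that have already been processed, the white space trimmed from each sentence, and the length of the tracked changes that have already been inserted.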
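
The running offset can be illustrated on a plain string without TX Text Control. The following sketch is a simulation under stated assumptions, not the sample's code: it assumes, as the sample's offset arithmetic suggests, that a tracked replacement leaves the original word in the text and adds the new word, and it wraps the added word in braces purely to make it visible. The class name OffsetDemo and the method Apply are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class OffsetDemo
{
    // Simulates tracked replacements on a plain string: the original word stays
    // in place and the new word is inserted right after it in braces.
    public static string Apply(string sentence,
        IEnumerable<(string word, int charIndex, string replacedWord)> differences)
    {
        var text = new StringBuilder(sentence);
        int trackedChangeOffset = 0; // number of characters inserted so far

        foreach (var difference in differences)
        {
            // Shift the original index by everything that was inserted before it
            int start = difference.charIndex + trackedChangeOffset;

            text.Insert(start + difference.word.Length, "{" + difference.replacedWord + "}");

            // The text grew by the new word plus the two illustrative braces
            trackedChangeOffset += difference.replacedWord.Length + 2;
        }

        return text.ToString();
    }
}
```

Applying the differences ("quick", 4, "slow") and ("fox", 16, "dog") to "The quick brown fox" yields "The quick{slow} brown fox{dog}". Without the running offset, the second change would be applied six characters too early.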

Conclusion

Comparing documents word by word is a common and simple comparison strategy. This sample shows how to implement a word-by-word comparison algorithm using TX Text Control. The sample compares two documents and marks the differences as tracked changes in the original document.

Download the complete sample from our GitHub repository and test it yourself.


We proudly host our sample code on github.com/TextControl.

Please fork and contribute.

Requirements for this sample

  • Visual Studio 2022
  • TX Text Control .NET for Windows Forms
