
Extract Plain Text from Office Open XML DOCX and DOC Documents in ASP.NET Core C#

This article shows how to extract plain text from Office Open XML DOCX and DOC documents in ASP.NET Core C#. It shows how to convert DOCX and DOC files to plain text using the ServerTextControl class and how to extract specific areas of the document.

Many applications need to extract plain text from Office Open XML DOCX and DOC files. Whether you are indexing text for a search engine, feeding an AI-powered text analytics tool, or driving a text-to-speech system, you need the plain text of these documents. In this article, we will show you how to extract plain text from DOCX and DOC files using C#.

TX Text Control provides an API to extract text from DOCX and DOC files. You can convert the entire document, a specific range of pages, or the text between two specific text positions. The following sections set up a sample project and then walk through the extraction code.

Preparing the Application

A .NET 8 console application is created for the purposes of this demo.

Prerequisites

The following tutorial requires a trial version of TX Text Control .NET Server.

  1. In Visual Studio, create a new Console App using .NET 8.

  2. In the Solution Explorer, select your created project and choose Manage NuGet Packages... from the Project main menu.

    Select Text Control Offline Packages from the Package source drop-down.

    Install the latest version of the following package:

    • TXTextControl.TextControl.ASP.SDK


Extracting Text from DOCX Files

After installing the required NuGet package, you can use the following code to extract plain text from a DOCX file:

try
{
  using TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl();
  
  tx.Create();
  
  tx.Load("document.docx", TXTextControl.StreamType.WordprocessingML);
  tx.Save(out string plainText, TXTextControl.StringStreamType.PlainText);

  Console.WriteLine(plainText);
}
catch (Exception ex)
{
  Console.WriteLine($"An error occurred: {ex.Message}");
}

The code snippet above loads a DOCX file and extracts the complete plain text from the document. The extracted text is then written to the console.
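The snippet above loads a DOCX file only, while legacy DOC files are a different container format. DOCX is a ZIP package and DOC is an OLE2 compound file, so the first bytes of a file indicate which loader to use. The following is a minimal sketch of that check. The helper names StartsWith and DetectWordFormat are hypothetical and not part of TX Text Control, and the mapping to StreamType.MSWord for DOC files in the closing comment is an assumption about how you would wire the result into the Load call.

```csharp
using System;
using System.Linq;

// ZIP local file header: every DOCX package starts with these four bytes ("PK\x03\x04").
byte[] zipSignature = { 0x50, 0x4B, 0x03, 0x04 };
// OLE2 compound file header: used by legacy binary DOC files.
byte[] oleSignature = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

// Hypothetical helper: true if the data begins with the given signature.
bool StartsWith(byte[] data, byte[] signature) =>
    data.Length >= signature.Length &&
    data.Take(signature.Length).SequenceEqual(signature);

// Hypothetical helper: returns "docx", "doc", or "unknown".
// The signature only identifies the container: any ZIP file reports "docx"
// and any OLE2 file (for example XLS) reports "doc".
string DetectWordFormat(byte[] header)
{
    if (StartsWith(header, zipSignature)) return "docx";
    if (StartsWith(header, oleSignature)) return "doc";
    return "unknown";
}

// Example with in-memory headers instead of real files.
Console.WriteLine(DetectWordFormat(new byte[] { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00 }));
Console.WriteLine(DetectWordFormat(new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }));

// Assumed wiring into the load call:
//   "docx" -> tx.Load(path, TXTextControl.StreamType.WordprocessingML);
//   "doc"  -> tx.Load(path, TXTextControl.StreamType.MSWord);
```

In a real application you would read the first eight bytes of the uploaded file or stream and pass them to DetectWordFormat. Checking the content rather than the file extension guards against misnamed uploads.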

Extracting Text Between Headings

TX Text Control can also extract the text between two specific text positions. Consider a scenario where you want to get all the text sections between chapter titles that are identified by their paragraph style.

Consider the following document:

[Screenshot: Extracting text with TX Text Control]

We want to extract the complete text between the headings that use the paragraph style Heading 1. The following method walks through all paragraphs and collects the text between consecutive paragraphs that use the given style:

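using System.Collections.Generic;
using System.Text;
using TXTextControl;

// Returns the text blocks found between consecutive paragraphs that use the
// given paragraph style. If includeRemainingText is true, the text after the
// last matching paragraph is returned as a final block.
List<string> ExtractTextBlocks(string paragraphStyleName, ServerTextControl serverTextControl, bool includeRemainingText)
{
    List<string> textBlocks = new List<string>();
    bool capturing = false;
    StringBuilder currentBlock = new StringBuilder();

    // The Paragraphs collection is one-based, so the loop runs up to and
    // including Count to visit every paragraph.
    for (int i = 1; i <= serverTextControl.Paragraphs.Count; i++)
    {
        Paragraph paragraph = serverTextControl.Paragraphs[i];

        if (paragraph.FormattingStyle == paragraphStyleName)
        {
            // A matching heading closes the current block and starts the next one.
            if (capturing)
            {
                textBlocks.Add(currentBlock.ToString().Trim());
                currentBlock.Clear();
            }
            else
            {
                capturing = true;
            }
        }
        else if (capturing)
        {
            currentBlock.AppendLine(paragraph.Text);
        }
    }

    // Add the text after the last heading if requested
    if (includeRemainingText && capturing)
    {
        textBlocks.Add(currentBlock.ToString().Trim());
    }

    return textBlocks;
}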
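To see how the capture logic behaves without loading a document, the same state machine can be run on in-memory data. The following is a library-free sketch, not the TX Text Control API: paragraphs are modeled as (Style, Text) tuples, the function name SplitByStyle is hypothetical, and the sample data is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Library-free restatement of the capture logic: paragraphs are modeled as
// (Style, Text) tuples instead of TXTextControl.Paragraph objects.
List<string> SplitByStyle(IEnumerable<(string Style, string Text)> paragraphs,
                          string styleName, bool includeRemainingText)
{
    var blocks = new List<string>();
    var current = new StringBuilder();
    bool capturing = false;

    foreach (var (style, text) in paragraphs)
    {
        if (style == styleName)
        {
            // A matching heading closes the open block (if any) and opens the next one.
            if (capturing)
            {
                blocks.Add(current.ToString().Trim());
                current.Clear();
            }
            capturing = true;
        }
        else if (capturing)
        {
            current.AppendLine(text);
        }
    }

    // The text after the last heading has no closing heading; keep it only on request.
    if (includeRemainingText && capturing)
    {
        blocks.Add(current.ToString().Trim());
    }
    return blocks;
}

// Made-up sample data for illustration.
var sample = new List<(string Style, string Text)>
{
    ("Heading 1", "Chapter 1"),
    ("Normal", "Intro text."),
    ("Heading 2", "Sub-Heading 1"),
    ("Normal", "Normal text."),
    ("Normal", "Normal text 2."),
    ("Heading 2", "Sub-Heading 2"),
    ("Heading 1", "Chapter 2"),
    ("Normal", "Chapter 2 text."),
};

Console.WriteLine(SplitByStyle(sample, "Heading 1", true).Count);   // 2
Console.WriteLine(SplitByStyle(sample, "Heading 2", false).Count);  // 1
```

Note that the block following the last matching heading runs to the end of the document. With "Heading 2" and includeRemainingText set to true, the second block in this sample would also contain the "Chapter 2" title and its text, which is why the flag is useful when only properly enclosed sections are wanted.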

The code snippet below loads a DOCX file and extracts the text between all paragraphs that use the Heading 1 style. It then prints the extracted text blocks to the console.

using TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl();
tx.Create();
tx.Load("document.docx", TXTextControl.StreamType.WordprocessingML);

var test = ExtractTextBlocks("Heading 1", tx, true);

foreach (var item in test)
{
    Console.WriteLine("New block: \r\n\r\n" + item + "\r\n");
}

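The result is a list of three items, one for each section that follows a paragraph styled Heading 1. Because includeRemainingText is true, the last item contains the text after the final heading.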

New block:

This is the text of heading 1.

This is more text of heading 1.

This is the text of heading 1.

Sub-Heading 1

Normal text.

Normal text 2.

Sub-Heading 2

New block:

This is the text of heading 2.

This is more text of heading 2.

New block:

This is the text of heading 3.

If we now want to extract only the text between the Heading 2 styles, without including the trailing text that is not closed by another Heading 2 paragraph, we can pass false for includeRemainingText:

using TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl();
tx.Create();
tx.Load("document.docx", TXTextControl.StreamType.WordprocessingML);

var test = ExtractTextBlocks("Heading 2", tx, false);

foreach (var item in test)
{
    Console.WriteLine("New block: \r\n\r\n" + item + "\r\n");
}

[Screenshot: Extracting text with TX Text Control. It shows the extracted text between the Heading 2 styles.]

The result of the above code snippet is a block of text between the Heading 2 styles.

New block:

Normal text.

Normal text 2.

Conclusion

TX Text Control provides a powerful API to extract text from DOCX and DOC files. You can extract the complete text or just a specific range of text between two specific text positions. This article showed how to extract plain text from DOCX and DOC files using TX Text Control in C#.
