Extracting plain text from DOCX (Office Open XML) and legacy DOC files is a common requirement. Whether you are feeding a search index, an AI-powered text analytics tool, or a text-to-speech system, the first step is getting the text out of the document. This article shows how to extract plain text from DOCX and DOC files using C#.
TX Text Control provides an API to extract text from DOCX and DOC files. You can convert the entire document, a specific range of pages, or the text between two text positions. The following sections show how to extract plain text from a DOCX file using TX Text Control.
Preparing the Application
A .NET 8 console application is created for the purposes of this demo.
Prerequisites
The following tutorial requires a trial version of TX Text Control .NET Server for ASP.NET.
- In Visual Studio, create a new Console App using .NET 8.
- In the Solution Explorer, select your created project and choose Manage NuGet Packages... from the Project main menu.
  Select Text Control Offline Packages from the Package source drop-down.
  Install the latest version of the following package:
  - TXTextControl.TextControl.ASP.SDK
Extracting Text from DOCX Files
After installing the required NuGet package, you can use the following code to extract plain text from a DOCX file:
try
{
    using TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl();
    tx.Create();
    tx.Load("document.docx", TXTextControl.StreamType.WordprocessingML);
    tx.Save(out string plainText, TXTextControl.StringStreamType.PlainText);
    Console.WriteLine(plainText);
}
catch (Exception ex)
{
    Console.WriteLine($"An error occurred: {ex.Message}");
}
The code snippet above loads a DOCX file and extracts the complete plain text from the document. The extracted text is then written to the console.
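Legacy DOC files can be handled in the same way. The following is a minimal sketch, assuming that the binary Word format is loaded with StreamType.MSWord and that the file extension reliably identifies the format. The helper StreamTypeFor and the file name document.doc are hypothetical and only serve as an illustration.

```csharp
using System;
using System.IO;
using TXTextControl;

// Hypothetical helper: pick the stream type from the file extension.
static StreamType StreamTypeFor(string path) =>
    Path.GetExtension(path).ToLowerInvariant() switch
    {
        ".docx" => StreamType.WordprocessingML, // Office Open XML
        ".doc" => StreamType.MSWord,            // binary Word format
        _ => throw new NotSupportedException($"Unsupported file type: {path}")
    };

string path = "document.doc"; // assumed sample file name

using ServerTextControl tx = new ServerTextControl();
tx.Create();

// Load with the stream type that matches the extension,
// then save the content as plain text into a string.
tx.Load(path, StreamTypeFor(path));
tx.Save(out string plainText, StringStreamType.PlainText);

Console.WriteLine(plainText);
```

Only the stream type passed to Load changes between the two formats; the Save call is the same as in the DOCX example.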
Extracting Text Between Headings
TX Text Control provides a powerful API to extract text between two specific text positions. Consider a scenario where you want to get all the text sections between chapter titles that are defined by stylesheets.
Consider the following document:
We want to extract the text between the paragraphs that are formatted with the style Heading 1. The following function collects the text between each pair of consecutive paragraphs that use a given style:
// Required at the top of the file:
using System.Text;
using TXTextControl;

List<string> ExtractTextBlocks(string paragraphStyleName, ServerTextControl serverTextControl, bool includeRemainingText)
{
    List<string> textBlocks = new List<string>();
    bool capturing = false;
    StringBuilder currentBlock = new StringBuilder();

    // Enumerate every paragraph so that none is skipped at the start or end
    foreach (Paragraph paragraph in serverTextControl.Paragraphs)
    {
        if (paragraph.FormattingStyle == paragraphStyleName)
        {
            if (capturing)
            {
                // A new heading closes the block that is currently open
                textBlocks.Add(currentBlock.ToString().Trim());
                currentBlock.Clear();
            }
            else
            {
                // The first heading starts the capture
                capturing = true;
            }
        }
        else if (capturing)
        {
            currentBlock.AppendLine(paragraph.Text);
        }
    }

    // Add remaining text if still capturing at the end
    if (includeRemainingText && (capturing || currentBlock.Length > 0))
    {
        textBlocks.Add(currentBlock.ToString().Trim());
    }

    return textBlocks;
}
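To try the capture logic without a document, the same state machine can be run over an in-memory list. This is only a sketch: the (Style, Text) tuples, the sample data, and the method name ExtractBlocks are hypothetical stand-ins for the paragraph collection and are not part of TX Text Control.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a document: (style name, paragraph text) pairs.
var sample = new List<(string Style, string Text)>
{
    ("Heading 1", "Chapter 1"),
    ("Normal", "First body."),
    ("Heading 1", "Chapter 2"),
    ("Normal", "Second body."),
};

// Same capture logic as ExtractTextBlocks, without the TX Text Control types.
static List<string> ExtractBlocks(
    IEnumerable<(string Style, string Text)> items,
    string styleName,
    bool includeRemainingText)
{
    var blocks = new List<string>();
    var current = new StringBuilder();
    bool capturing = false;

    foreach (var (style, text) in items)
    {
        if (style == styleName)
        {
            if (capturing)
            {
                // A new heading closes the open block.
                blocks.Add(current.ToString().Trim());
                current.Clear();
            }
            capturing = true;
        }
        else if (capturing)
        {
            current.AppendLine(text);
        }
    }

    // The text after the last heading has no closing heading,
    // so it is only added on request.
    if (includeRemainingText && capturing)
    {
        blocks.Add(current.ToString().Trim());
    }

    return blocks;
}

Console.WriteLine(ExtractBlocks(sample, "Heading 1", true).Count);  // prints 2
Console.WriteLine(ExtractBlocks(sample, "Heading 1", false).Count); // prints 1
```

With includeRemainingText set to false, the trailing "Second body." paragraph is dropped because no further Heading 1 paragraph closes it.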
The code snippet below loads a DOCX file, extracts the text blocks between the Heading 1 paragraphs, and prints each block to the console.
using TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl();
tx.Create();
tx.Load("document.docx", TXTextControl.StreamType.WordprocessingML);
var test = ExtractTextBlocks("Heading 1", tx, true);
foreach (var item in test)
{
    Console.WriteLine("New block: \r\n\r\n" + item + "\r\n");
}
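The result is a list of three items. Each item contains the text that follows a Heading 1 paragraph up to the next one. Because includeRemainingText is true, the text after the last Heading 1 paragraph is returned as the final item.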
New block:
This is the text of heading 1.
This is more text of heading 1.
This is the text of heading 1.
Sub-Heading 1
Normal text.
Normal text 2.
Sub-Heading 2
New block:
This is the text of heading 2.
This is more text of heading 2.
New block:
This is the text of heading 3.
If we now want to extract only the text between the Heading 2 paragraphs, without the trailing text that is not closed by another Heading 2 paragraph, we can pass false for includeRemainingText:
using TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl();
tx.Create();
tx.Load("document.docx", TXTextControl.StreamType.WordprocessingML);
var test = ExtractTextBlocks("Heading 2", tx, false);
foreach (var item in test)
{
    Console.WriteLine("New block: \r\n\r\n" + item + "\r\n");
}
The following screenshot shows the extracted text between the Heading 2 styles:
The result of the above code snippet is a single block that contains the text between the two Heading 2 paragraphs.
New block:
Normal text.
Normal text 2.
Conclusion
TX Text Control provides a powerful API to extract text from DOCX and DOC files. You can extract the complete text or just a specific range of text between two specific text positions. This article showed how to extract plain text from DOCX and DOC files using TX Text Control in C#.