
Converting MS Word (*.docx) to Markdown (*.md) in .NET C#

Learn how to convert MS Word documents (*.docx) to Markdown files (*.md) in .NET C# using the TX Text Control .NET Server for ASP.NET. This tutorial provides a step-by-step guide to implement the conversion process, enabling developers to easily transform Word documents into Markdown format for various applications.


If your content is in Microsoft Word but your applications increasingly rely on AI, converting DOCX to Markdown is one of the most impactful moves you can make. DOCX is great for authoring and rich formatting. However, AI systems thrive on clean, explicit structure in plain text. Markdown provides both human-readable text and machine-friendly semantics.

Converting a DOCX file to Markdown provides AI with compact plain text that has an explicit structure, including headings, lists, tables, links, images, and code. This makes it easier for models to parse the text consistently. It also enables deterministic chunking by section for retrieval-augmented generation (RAG), which retrieves the most relevant document chunks from your corpus and injects them into the prompt. Beyond that, Markdown reduces tokens and latency, provides clean diffs and reviews in Git, and simplifies safe preprocessing such as PII scrubbing and the handling of tracked changes and whitespace. It also supplies models with predictable semantics that improve prompting, fine-tuning, and the quality of the final output.
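To make the idea of chunking by section more concrete, here is a minimal sketch that splits a Markdown string at its headings. The helper name ChunkByHeading and the maxLevel parameter are hypothetical and not part of TX Text Control. The sketch also assumes that the Markdown contains no fenced code blocks with lines starting with "#", which it would mistake for headings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Example input: a small Markdown string with three heading levels.
string sample = "# Title\n\nIntro text.\n\n## Background\n\nDetails.\n\n### Note\n\nMore.";

foreach (var (heading, text) in ChunkByHeading(sample))
    Console.WriteLine($"[{heading}] {text.Length} chars");

// Hypothetical helper (not part of TX Text Control): starts a new chunk at
// every ATX heading whose level is <= maxLevel. Deeper headings stay inside
// the body of the current chunk.
static List<(string Heading, string Body)> ChunkByHeading(string markdown, int maxLevel = 2)
{
    var chunks = new List<(string Heading, string Body)>();
    var headingPattern = new Regex(@"^(#{1,6})\s+(.*)$");
    string currentHeading = "";
    var body = new List<string>();

    // Stores the lines collected so far as one chunk and resets the buffer.
    void Flush()
    {
        string text = string.Join("\n", body).Trim();
        if (currentHeading.Length > 0 || text.Length > 0)
            chunks.Add((currentHeading, text));
        body.Clear();
    }

    foreach (string rawLine in markdown.Split('\n'))
    {
        string line = rawLine.TrimEnd('\r');
        Match m = headingPattern.Match(line);
        if (m.Success && m.Groups[1].Length <= maxLevel)
        {
            Flush();
            currentHeading = m.Groups[2].Value.Trim();
        }
        else
        {
            body.Add(line);
        }
    }

    Flush();
    return chunks;
}
```

Because the split points depend only on the heading lines, the same document always yields the same chunks, which is what makes this kind of chunking deterministic.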

In this tutorial, we will explain how to convert a document in the Office Open XML (DOCX) format to Markdown using TX Text Control and the TXTextControl.Markdown.Core package, which is also provided by us.

Creating the Application

Make sure that you have downloaded the latest version of Visual Studio 2022, which comes with the .NET 8 SDK.

Prerequisites

You need to download and install the trial version of TX Text Control .NET Server.

  1. In Visual Studio 2022, create a new project by choosing Create a new project.

  2. Select Console App as the project template and confirm with Next.

  3. Enter a project name and choose a location to save the project. Confirm with Next.

  4. Choose .NET 8.0 (Long Term Support) as the Framework and confirm with Create.

    Creating the .NET 8 project

Adding the NuGet Packages

  1. In the Solution Explorer, select your created project and choose Manage NuGet Packages... from the Project main menu. Select Text Control Offline Packages as the Package source.

    Install the following package:

    • TXTextControl.TextControl.Core.SDK

  2. Select nuget.org as the Package source.

    Install the following package:

    • TXTextControl.Markdown.Core


Converting the Document

  1. Download the sample document and place it in the project directory. Set the Copy to Output Directory property of the file to Copy if newer.

  2. Find the Program.cs file in the Solution Explorer and replace the code with the following code snippet:

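    using TXTextControl;
    using TXTextControl.Markdown;

    // Create a ServerTextControl instance; "using" disposes it when the program ends.
    using var tx = new TXTextControl.ServerTextControl();
    tx.Create();

    // Load the DOCX file that was copied to the output directory.
    tx.Load("test_word_document.docx", StreamType.WordprocessingML);

    // Convert the loaded document to Markdown with the default settings.
    string md = tx.SaveMarkdown();

    Console.WriteLine(md);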

This code loads the DOCX file into a TXTextControl.ServerTextControl instance and then uses the TXTextControl.Markdown.Core package to convert the document to Markdown format.

First, let's take a look at the document in MS Word. The document uses default styles, such as "Title" and "Heading 1," as well as typical features, such as inline formatting, tables, and numbered lists.

MS Word sample document

The conversion result looks like this:

# Sample Document Title

# Introduction

This is a sample document generated for testing purposes. It demonstrates the usage of default Microsoft Word styles such as Title, Heading, Heading 2, Strong, and Quote.

Subtitle will be converted to H3

## Background

Word documents rely on styles to structure and format content. Default styles make it easy to maintain a consistent look and feel.

This paragraph contains some **strongly emphasized text** using the Strong style.

This is an example of a block quote style. Quotes help highlight important excerpts or external references.

## Key features of this test document:

- Bullet item 1
- Bullet **item 2**
    - Bullet *item 3*

## Steps to create a test document:

1. Open **MS Word**
2. Add test *structures*
    1. Open in TX Text Control
    2. ~~Test document~~

## Sample Table

| Column 1 | Column 2 | Column 3 |
| ----- | ----: | ----- |
| Row 1, Col 1 | **Row 1, Col 2** | Row 1, Col 3 |
| ~~Row 2, Col 1~~ | Row 2, Col 2 | *Row 2, Col 3* |

Customizing the Conversion

The converter automatically recognizes common author style names and converts them into Markdown headings. For example, "Title" becomes H1 and "Heading 2" becomes H2. No configuration is necessary; this mapping is active when you call SaveMarkdown without options. Take another look at the MS Word screenshot. You will find a paragraph style called "Subtitle." We want to convert those to H3 elements in Markdown.

Start with the built-in default map and add your own names on top. In our example, you add "Subtitle" and "Sub-Title" and instruct the converter to treat them as H3. Nothing else changes: "Title," "Heading 1," "Heading 2," etc., still map as before.

The default mapping is as follows:

var headingMap = new HeadingStyleMap(new Dictionary<int, IEnumerable<string>>
{
    { 1, new[] { "Title", "Heading 1", "H1" } },
    { 2, new[] { "Heading 2", "H2" } },
    { 3, new[] { "Heading 3", "H3" } },
    { 4, new[] { "Heading 4", "H4" } },
    { 5, new[] { "Heading 5", "H5" } },
    { 6, new[] { "Heading 6", "H6" } },
});
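The idea of starting with the default map and adding your own names on top can be modeled with plain dictionaries. The following sketch is only an illustration of that merge behavior. It does not reproduce the internals of HeadingStyleMap, and the helper name ExtendMap as well as the case-insensitive duplicate check are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A shortened copy of the default map shown above, as plain data.
var defaults = new Dictionary<int, string[]>
{
    { 1, new[] { "Title", "Heading 1", "H1" } },
    { 2, new[] { "Heading 2", "H2" } },
    { 3, new[] { "Heading 3", "H3" } },
};

// Extra style names that should also count as H3.
var extras = new Dictionary<int, string[]>
{
    { 3, new[] { "Subtitle", "Sub-Title" } },
};

var merged = ExtendMap(defaults, extras);
Console.WriteLine(string.Join(", ", merged[3]));
// prints: Heading 3, H3, Subtitle, Sub-Title

// Hypothetical helper: copies the defaults and appends the extra names per
// heading level. Existing entries are never removed or moved.
static Dictionary<int, List<string>> ExtendMap(
    IReadOnlyDictionary<int, string[]> defaults,
    IReadOnlyDictionary<int, string[]> extras)
{
    var merged = defaults.ToDictionary(kv => kv.Key, kv => kv.Value.ToList());

    foreach (var (level, names) in extras)
    {
        if (!merged.TryGetValue(level, out var list))
            merged[level] = list = new List<string>();

        foreach (string name in names)
        {
            // Assumption for this sketch: names are compared case-insensitively.
            if (!list.Contains(name, StringComparer.OrdinalIgnoreCase))
                list.Add(name);
        }
    }

    return merged;
}
```

Levels 1 and 2 are left exactly as they were, which mirrors the statement that nothing else changes when additional names are mapped.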

Extending the Default Mapping

To convert the "Subtitle" and "Sub-Title" styles to H3 elements, you need to extend the default mapping as follows:

using TXTextControl;
using TXTextControl.Markdown;

using var tx = new TXTextControl.ServerTextControl();
tx.Create();

tx.Load("test_word_document.docx", StreamType.WordprocessingML);

// Keep the default heading map and add two more style names for H3.
var options = new MarkdownOptions
{
    HeadingMap = HeadingStyleMap.Default.Extend(new Dictionary<int, IEnumerable<string>>
    {
        { 3, new[] { "Subtitle", "Sub-Title" } }      // treat these as H3
    })
};

string md = tx.SaveMarkdown(options);

// Write the result to out.md in the output directory.
File.WriteAllText("out.md", md);

Run the application. The output will look like this:

# Sample Document Title

# Introduction

This is a sample document generated for testing purposes. It demonstrates the usage of default Microsoft Word styles such as Title, Heading, Heading 2, Strong, and Quote.

### Subtitle will be converted to H3

## Background

Word documents rely on styles to structure and format content. Default styles make it easy to maintain a consistent look and feel.

This paragraph contains some **strongly emphasized text** using the Strong style.

This is an example of a block quote style. Quotes help highlight important excerpts or external references.

## Key features of this test document:

- Bullet item 1
- Bullet **item 2**
    - Bullet *item 3*

## Steps to create a test document:

1. Open **MS Word**
2. Add test *structures*
    1. Open in TX Text Control
    2. ~~Test document~~

## Sample Table

| Column 1 | Column 2 | Column 3 |
| ----- | ----: | ----- |
| Row 1, Col 1 | **Row 1, Col 2** | Row 1, Col 3 |
| ~~Row 2, Col 1~~ | Row 2, Col 2 | *Row 2, Col 3* |
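If you want to check the heading levels without opening a viewer, you can extract the outline from the generated Markdown. The following is a minimal sketch that uses a shortened copy of the output above as input. The helper name ReadOutline is hypothetical, and the regular expression only recognizes ATX headings ("#" to "######"), so "#" lines inside fenced code blocks would be reported as headings too.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Shortened copy of the generated Markdown. In the tutorial project you
// could pass File.ReadAllText("out.md") instead.
string md =
    "# Sample Document Title\n\n" +
    "# Introduction\n\n" +
    "### Subtitle will be converted to H3\n\n" +
    "## Background\n";

foreach (var (level, text) in ReadOutline(md))
{
    // Indent each entry by its heading level to show the hierarchy.
    string indent = new string(' ', (level - 1) * 2);
    Console.WriteLine(indent + "H" + level + ": " + text);
}

// Hypothetical helper: returns the level and text of every ATX heading.
static List<(int Level, string Text)> ReadOutline(string markdown)
{
    var outline = new List<(int Level, string Text)>();
    var pattern = new Regex(@"^(#{1,6})[ \t]+(.+?)[ \t]*\r?$", RegexOptions.Multiline);

    foreach (Match m in pattern.Matches(markdown))
        outline.Add((m.Groups[1].Length, m.Groups[2].Value));

    return outline;
}
```

With the extended mapping, the "Subtitle" paragraph shows up in this outline as a level 3 entry. With the default mapping, it would not appear at all because it is exported as a regular paragraph.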

Let's take a look at the preview of the generated Markdown file in a Markdown viewer such as VS Code:

Markdown preview in VS Code

If your documents use localized style names, such as the German "Untertitel" or the Spanish "Subtítulo," you can add them to the mapping as well. The following example adds German style names and maps "Untertitel" to H2:

var options = new MarkdownOptions
{
    HeadingMap = HeadingStyleMap.Default.Extend(new Dictionary<int, IEnumerable<string>>
    {
        { 1, new[] { "Titel", "Überschrift 1", "Ueberschrift 1" } },
        { 2, new[] { "Überschrift 2", "Ueberschrift 2", "Untertitel" } },
        { 3, new[] { "Überschrift 3", "Ueberschrift 3" } },
        { 4, new[] { "Überschrift 4", "Ueberschrift 4" } },
        { 5, new[] { "Überschrift 5", "Ueberschrift 5" } },
        { 6, new[] { "Überschrift 6", "Ueberschrift 6" } },
    })
};
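If you need to support several languages at once, one option is to keep a separate name table per language and combine them before extending the default map. The sketch below shows this with plain dictionaries. The helper name CombineStyleNames is hypothetical, and the Spanish style names other than "Subtítulo" are assumptions for illustration, so check them against the style names used in your documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var german = new Dictionary<int, string[]>
{
    { 1, new[] { "Titel", "Überschrift 1" } },
    { 2, new[] { "Überschrift 2", "Untertitel" } },
};

// Assumed Spanish style names; only "Subtítulo" is taken from the text above.
var spanish = new Dictionary<int, string[]>
{
    { 1, new[] { "Título", "Título 1" } },
    { 2, new[] { "Título 2", "Subtítulo" } },
};

var combined = CombineStyleNames(german, spanish);

foreach (var (level, names) in combined)
    Console.WriteLine("H" + level + ": " + string.Join(", ", names));

// Hypothetical helper: merges any number of per-language tables into one
// dictionary with the same shape as the argument passed to Extend above.
static Dictionary<int, IEnumerable<string>> CombineStyleNames(
    params Dictionary<int, string[]>[] localizedMaps)
{
    return localizedMaps
        .SelectMany(map => map)                      // flatten to (level, names) pairs
        .GroupBy(kv => kv.Key, kv => kv.Value)       // group the name arrays by level
        .OrderBy(g => g.Key)
        .ToDictionary(
            g => g.Key,
            g => (IEnumerable<string>)g.SelectMany(names => names).Distinct().ToArray());
}
```

The combined dictionary has the type Dictionary<int, IEnumerable<string>>, so it can be passed to HeadingStyleMap.Default.Extend in the same way as the inline dictionary in the example above.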

Conclusion

Converting a DOCX file to Markdown is an effective way to prepare your documents for use with AI applications. TX Text Control simplifies this process while enabling you to customize the output according to your preferences. Leveraging Markdown's simplicity and structure enhances the performance of AI models and streamlines content workflows.
