Extracting Images from MS Word Documents in C#
This article shows how to extract all images from an MS Word document using TX Text Control .NET Server. The sample code extracts every image and saves it in its original format.

TX Text Control is primarily used for the creation of documents in typical industry-standard formats with images, tables, and other features. However, you can also use TX Text Control to extract data from documents that you already have.
In this article we will explain how to extract all the images from a document, regardless of whether the document is saved as a DOC, DOCX, RTF or PDF file.
Preparing the Application
A .NET 8 console application is created for the purposes of this demo.
Prerequisites
The following tutorial requires a trial version of TX Text Control .NET Server.
- In Visual Studio, create a new Console App using .NET 8.
- In the Solution Explorer, select your created project and choose Manage NuGet Packages... from the Project main menu. Select Text Control Offline Packages from the Package source drop-down, then install the latest version of the following package:
  - TXTextControl.TextControl.ASP.SDK
Extracting Images from Documents
TX Text Control can be used to load documents in various formats such as DOC, DOCX, RTF, and PDF. To extract images, the non-UI ServerTextControl class is used.
First, the document is loaded into the ServerTextControl class and exported to HTML. The HTML string is then parsed to extract the base64-encoded images. The following code shows how to extract images from a document:
    // Required namespaces (place at the top of the file):
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    // Finds all <img> elements with an embedded data URI and returns their payloads.
    static List<ImageData> GetImageBase64Strings(string html)
    {
        var imageDatas = new List<ImageData>();
        var regex = new Regex("<img[^>]+?src=[\"']data:image/(?<format>[^;]+);base64,(?<data>[^\"']+)[\"'][^>]*>", RegexOptions.IgnoreCase);

        foreach (Match match in regex.Matches(html))
        {
            var format = match.Groups["format"].Value;
            var base64Content = match.Groups["data"].Value;
            var extension = GetFileExtension(format);

            imageDatas.Add(new ImageData
            {
                Base64Content = base64Content,
                Format = format,
                Extension = extension
            });
        }

        return imageDatas;
    }

    // Maps the MIME subtype (the part after "image/") to a typical file extension.
    static string GetFileExtension(string format)
    {
        return format.ToLowerInvariant() switch
        {
            "jpeg" => ".jpg",
            "jpg" => ".jpg",
            "png" => ".png",
            "gif" => ".gif",
            "bmp" => ".bmp",
            "webp" => ".webp",
            "svg+xml" => ".svg",
            "tiff" => ".tif",
            "x-icon" => ".ico",
            _ => $".{format}" // unknown subtype: use it as the extension
        };
    }

    public class ImageData
    {
        public string Base64Content { get; set; } = string.Empty;
        public string Format { get; set; } = string.Empty;
        public string Extension { get; set; } = string.Empty;
    }
GetImageBase64Strings parses an HTML string and returns a list of ImageData objects, each containing the base64 data, the image format, and the typical file extension for saving the image.
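To see what the regular expression captures, the following minimal sketch runs the same pattern against a small hand-written HTML string. The class name DataUriDemo, the method Inspect, and the sample markup are assumptions made for illustration only. The HTML that TX Text Control actually exports may order or quote attributes differently.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DataUriDemo
{
    // Same pattern as used in GetImageBase64Strings.
    private static readonly Regex ImgRegex = new Regex(
        "<img[^>]+?src=[\"']data:image/(?<format>[^;]+);base64,(?<data>[^\"']+)[\"'][^>]*>",
        RegexOptions.IgnoreCase);

    // Hand-written sample (hypothetical): the first bytes of a PNG and a JPEG header.
    public const string SampleHtml =
        "<p><img alt=\"a\" src=\"data:image/png;base64,iVBORw0KGgo=\" /></p>" +
        "<img src='data:image/jpeg;base64,/9j/4A==' width=\"10\">";

    // Returns the captured format and the decoded byte count of each embedded image.
    public static List<(string Format, int ByteCount)> Inspect(string html)
    {
        var result = new List<(string Format, int ByteCount)>();
        foreach (Match m in ImgRegex.Matches(html))
        {
            byte[] bytes = Convert.FromBase64String(m.Groups["data"].Value);
            result.Add((m.Groups["format"].Value, bytes.Length));
        }
        return result;
    }
}
```

Calling DataUriDemo.Inspect(DataUriDemo.SampleHtml) yields two entries, one with the format "png" and one with "jpeg". Images that are referenced by URL rather than embedded as data URIs are not matched by this pattern.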
The following code shows how to use this method:
    using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
    {
        tx.Create();
        tx.Load("invoice.docx", TXTextControl.StreamType.WordprocessingML);

        // Embed images as base64 data in the exported HTML instead of writing them to separate files.
        TXTextControl.SaveSettings saveSettings = new TXTextControl.SaveSettings()
        {
            ImageSaveMode = TXTextControl.ImageSaveMode.SaveAsData
        };

        tx.Save(out string doc, TXTextControl.StringStreamType.HTMLFormat, saveSettings);

        var images = GetImageBase64Strings(doc);

        // Number the files so that images with the same extension do not overwrite each other.
        for (int i = 0; i < images.Count; i++)
        {
            byte[] imageBytes = Convert.FromBase64String(images[i].Base64Content);

            // write byte array to file, e.g. image1.png, image2.jpg
            File.WriteAllBytes($"image{i + 1}{images[i].Extension}", imageBytes);
        }
    }
TX Text Control is used to load the document and export it as an HTML string. The extracted images are saved to the file system with the typical extension of the image format.
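The sample above loads a DOCX file. For the other formats mentioned in this article, the stream type passed to Load has to match the file. The following sketch shows one possible way to pick it from the file extension. StreamTypeResolver and Resolve are hypothetical names, and the mapping is an assumption. The helper returns the name of the TXTextControl.StreamType member as a string so that the snippet compiles without the TX Text Control package.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StreamTypeResolver
{
    // File extension -> name of the TXTextControl.StreamType member (assumed mapping).
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            [".doc"] = "MSWord",
            [".docx"] = "WordprocessingML",
            [".rtf"] = "RichTextFormat",
            [".pdf"] = "AdobePDF"
        };

    // Returns the member name for the given path or throws for unsupported file types.
    public static string Resolve(string path)
    {
        string ext = Path.GetExtension(path);
        if (Map.TryGetValue(ext, out var name))
        {
            return name;
        }
        throw new NotSupportedException($"Unsupported file type: '{ext}'");
    }
}
```

In a project that references TX Text Control, the returned name could be converted with Enum.Parse<TXTextControl.StreamType>(StreamTypeResolver.Resolve(path)) and then passed to tx.Load(path, ...). Mapping directly to the enum values in the dictionary would be the more natural choice there.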
Conclusion
This article showed how to extract images from documents using TX Text Control. The ServerTextControl class can be used to load documents in various formats and export them to HTML. The HTML string can be parsed to extract the base64-encoded images.