Extracting Images from MS Word Documents in C#

Bjoern Meyer

June 11, 2024

This article shows how to extract all images from an MS Word document using TX Text Control .NET Server. The sample code extracts each image and saves it in its original format.

TX Text Control is primarily used for the creation of documents in typical industry-standard formats with images, tables, and other features. However, you can also use TX Text Control to extract data from documents that you already have.

In this article we will explain how to extract all the images from a document, regardless of whether the document is saved as a DOC, DOCX, RTF or PDF file.

Preparing the Application

A .NET 8 console application is created for the purposes of this demo.

Prerequisites

The following tutorial requires a trial version of TX Text Control .NET Server.

Download Trial Version

In Visual Studio, create a new Console App using .NET 8.
In the Solution Explorer, select the project you created and choose Manage NuGet Packages... from the Project main menu.

Select Text Control Offline Packages from the Package source drop-down.

Install the latest version of the following package:
- TXTextControl.TextControl.ASP.SDK

Extracting Images from Documents

TX Text Control can be used to load documents in various formats such as DOC, DOCX, RTF, and PDF. To extract images, the non-UI ServerTextControl class can be used to export the document to HTML, where images are embedded as base64-encoded strings.

First, the document is loaded into a ServerTextControl instance and exported to HTML. The HTML string is then parsed with a regular expression to find the base64-encoded images. The following code shows the parsing helpers:

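// Requires: using System.Text.RegularExpressions; at the top of the file
static List<ImageData> GetImageBase64Strings(string html)
{
        var imageDatas = new List<ImageData>();

        // Match <img> tags whose src is a base64 data URI; capture the MIME subtype and the payload
        var regex = new Regex("<img[^>]+?src=[\"']data:image/(?<format>[^;]+);base64,(?<data>[^\"']+)[\"'][^>]*>", RegexOptions.IgnoreCase);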

        foreach (Match match in regex.Matches(html))
        {
                var format = match.Groups["format"].Value;
                var base64Content = match.Groups["data"].Value;
                var extension = GetFileExtension(format);

                imageDatas.Add(new ImageData
                {
                        Base64Content = base64Content,
                        Format = format,
                        Extension = extension
                });
        }

        return imageDatas;
}

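// Map the MIME subtype of the data URI to a common file extension
static string GetFileExtension(string format)
{
        // Invariant casing avoids culture-specific surprises such as the Turkish dotless i
        return format.ToLowerInvariant() switch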
        {
                "jpeg" => ".jpg",
                "jpg" => ".jpg",
                "png" => ".png",
                "gif" => ".gif",
                "bmp" => ".bmp",
                "webp" => ".webp",
                "svg+xml" => ".svg",
                "tiff" => ".tif",
                "x-icon" => ".ico",
                _ => $".{format}"
        };
}

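// Holds one extracted image: the base64 payload, its MIME subtype, and a matching file extension
public class ImageData
{
        public string Base64Content { get; set; } = string.Empty;
        public string Format { get; set; } = string.Empty;
        public string Extension { get; set; } = string.Empty;
}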

GetImageBase64Strings parses an HTML string and returns a list of ImageData objects. Each object contains the base64 data, the image format taken from the data URI, and the usual file extension for that format.
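To see what the regular expression captures without loading a document, the pattern can be tried on a small, hand-written HTML string. The following is a minimal sketch: the sample markup and the ExtractInlineImages helper are hypothetical and only reuse the pattern from GetImageBase64Strings. HTML exported by ServerTextControl will look different.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample markup: two embedded images and one linked image.
string sampleHtml =
    "<p>Logo:</p>" +
    "<img width=\"10\" src=\"data:image/png;base64,iVBORw0KGgo=\" alt=\"logo\">" +
    "<img src='data:image/jpeg;base64,/9j/4A==' />" +
    "<img src=\"https://example.com/linked.png\">"; // linked, not embedded: not matched

foreach (var (format, bytes) in ExtractInlineImages(sampleHtml))
{
    Console.WriteLine($"{format}: {bytes.Length} bytes");
}
// Prints:
// png: 8 bytes
// jpeg: 4 bytes

// Hypothetical helper: same pattern as GetImageBase64Strings, but the payload
// is decoded immediately so that the result can be inspected.
static List<(string Format, byte[] Bytes)> ExtractInlineImages(string markup)
{
    var regex = new Regex(
        "<img[^>]+?src=[\"']data:image/(?<format>[^;]+);base64,(?<data>[^\"']+)[\"'][^>]*>",
        RegexOptions.IgnoreCase);

    var result = new List<(string Format, byte[] Bytes)>();
    foreach (Match match in regex.Matches(markup))
    {
        result.Add((match.Groups["format"].Value,
                    Convert.FromBase64String(match.Groups["data"].Value)));
    }
    return result;
}
```

Only img elements whose src attribute is a base64 data URI are matched. Images that are referenced by URL are skipped, which is why the export has to embed the image data in the HTML.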

The following code shows how to use this method:

using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
        tx.Create();

        tx.Load("invoice.docx", TXTextControl.StreamType.WordprocessingML);

        // Store the image data in the exported HTML itself (as base64) rather than as file references
        TXTextControl.SaveSettings saveSettings = new TXTextControl.SaveSettings()
        {
                ImageSaveMode = TXTextControl.ImageSaveMode.SaveAsData
        };

        tx.Save(out string doc, TXTextControl.StringStreamType.HTMLFormat, saveSettings);

        var images = GetImageBase64Strings(doc);

        // Number the files so that images sharing an extension do not overwrite each other
        for (int i = 0; i < images.Count; i++)
        {
                byte[] imageBytes = Convert.FromBase64String(images[i].Base64Content);

                // write byte array to file, e.g. image1.png, image2.jpg
                File.WriteAllBytes($"image{i + 1}{images[i].Extension}", imageBytes);
        }
}

TX Text Control is used to load the document and export it as an HTML string. The extracted images are saved to the file system with the typical extension of the image format.
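The example above loads a DOCX file with a hard-coded stream type. To handle DOC, RTF, or PDF files as well, the stream type could be derived from the file extension. The following is a rough sketch: GetStreamType is a hypothetical helper and not part of the TX Text Control API, and it assumes that the file extension reflects the actual file format. It requires the TX Text Control assembly installed in the steps above.

```csharp
using System;
using System.IO;

// Show which stream type would be chosen for a given file name.
Console.WriteLine(GetStreamType("invoice.docx"));

// Hypothetical helper (not part of TX Text Control): maps a file extension
// to the TXTextControl.StreamType value that is passed to Load.
static TXTextControl.StreamType GetStreamType(string path)
{
    return Path.GetExtension(path).ToLowerInvariant() switch
    {
        ".doc" => TXTextControl.StreamType.MSWord,
        ".docx" => TXTextControl.StreamType.WordprocessingML,
        ".rtf" => TXTextControl.StreamType.RichTextFormat,
        ".pdf" => TXTextControl.StreamType.AdobePDF,
        var ext => throw new NotSupportedException($"Unsupported file extension: '{ext}'")
    };
}
```

With this helper, the Load call could be written as tx.Load(path, GetStreamType(path)), and the rest of the extraction code stays the same. Throwing for unknown extensions is a design choice in this sketch, so that an unsupported file fails early instead of being loaded with the wrong format.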

Conclusion

This article showed how to extract images from documents using TX Text Control. The ServerTextControl class can be used to load documents in various formats and export them to HTML. The HTML string can be parsed to extract the base64-encoded images.