Extracting Images from MS Word Documents in C#

TX Text Control is primarily used for the creation of documents in typical industry-standard formats with images, tables, and other features. However, you can also use TX Text Control to extract data from documents that you already have.

In this article we will explain how to extract all the images from a document, regardless of whether the document is saved as a DOC, DOCX, RTF or PDF file.

Preparing the Application

A .NET 8 console application is created for the purposes of this demo.

Prerequisites

The following tutorial requires a trial version of TX Text Control .NET Server for ASP.NET.

In Visual Studio, create a new Console App using .NET 8.
In the Solution Explorer, select your created project and choose Manage NuGet Packages... from the Project main menu.

Select Text Control Offline Packages from the Package source drop-down.

Install the latest version of the following package:
- TXTextControl.TextControl.ASP.SDK

Extracting Images from Documents

TX Text Control can be used to load documents in various formats such as DOC, DOCX, RTF, and PDF. To extract images, the non-UI ServerTextControl class can be used to export the document to HTML, where images are embedded as base64-encoded strings.

First, the document is loaded into a ServerTextControl instance and exported to HTML. The HTML string is then parsed to extract the base64-encoded images. The following code shows the parsing step, which collects the embedded images from the HTML string:

	using System.Text.RegularExpressions;

	// Finds all <img> elements whose src is a base64 data URI
	// and returns the payload, format, and file extension of each one.
	static List<ImageData> GetImageBase64Strings(string html)
	{
		var imageDatas = new List<ImageData>();
		var regex = new Regex("<img[^>]+?src=[\"']data:image/(?<format>[^;]+);base64,(?<data>[^\"']+)[\"'][^>]*>", RegexOptions.IgnoreCase);

		foreach (Match match in regex.Matches(html))
		{
			var format = match.Groups["format"].Value;
			var base64Content = match.Groups["data"].Value;
			var extension = GetFileExtension(format);

			imageDatas.Add(new ImageData
			{
				Base64Content = base64Content,
				Format = format,
				Extension = extension
			});
		}

		return imageDatas;
	}

	// Maps the image subtype of the data URI to a typical file extension.
	static string GetFileExtension(string format)
	{
		return format.ToLowerInvariant() switch
		{
			"jpeg" => ".jpg",
			"jpg" => ".jpg",
			"png" => ".png",
			"gif" => ".gif",
			"bmp" => ".bmp",
			"webp" => ".webp",
			"svg+xml" => ".svg",
			"tiff" => ".tif",
			"x-icon" => ".ico",
			_ => $".{format}"
		};
	}

	public class ImageData
	{
		public string Base64Content { get; set; } = string.Empty;
		public string Format { get; set; } = string.Empty;
		public string Extension { get; set; } = string.Empty;
	}

GetImageBase64Strings parses an HTML string and returns a list of ImageData objects that contain the base64 data, the image format, and the typical file extension for saving the image.
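To see what the pattern matches without involving TX Text Control at all, the following minimal sketch applies the same regular expression to a hand-written HTML string. The class name DataUriDemo, the method Describe, and the sample markup are assumptions made for illustration only; they are not part of the original sample or of the TX Text Control API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DataUriDemo
{
    // Same pattern as in GetImageBase64Strings: an <img> tag whose src
    // attribute is a base64 data URI with an image/* media type.
    private static readonly Regex ImgRegex = new Regex(
        "<img[^>]+?src=[\"']data:image/(?<format>[^;]+);base64,(?<data>[^\"']+)[\"'][^>]*>",
        RegexOptions.IgnoreCase);

    // Returns one "format:byteCount" entry per embedded image found in the HTML.
    public static string[] Describe(string html)
    {
        MatchCollection matches = ImgRegex.Matches(html);
        var result = new string[matches.Count];

        for (int i = 0; i < matches.Count; i++)
        {
            // Decode the payload to verify that it is valid base64.
            byte[] bytes = Convert.FromBase64String(matches[i].Groups["data"].Value);
            result[i] = $"{matches[i].Groups["format"].Value}:{bytes.Length}";
        }

        return result;
    }
}
```

For example, passing `<img alt="a" src="data:image/png;base64,AQID" />` to DataUriDemo.Describe returns the single entry `png:3`, because `AQID` decodes to three bytes. An image that references an external file, such as `<img src="logo.png">`, is not matched, since its src attribute is not a data URI.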

The following code shows how to use this method:

	using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
	{
		tx.Create();

		// load the source document
		tx.Load("invoice.docx", TXTextControl.StreamType.WordprocessingML);

		// embed images in the HTML as base64 data instead of external files
		TXTextControl.SaveSettings saveSettings = new TXTextControl.SaveSettings()
		{
			ImageSaveMode = TXTextControl.ImageSaveMode.SaveAsData
		};

		string doc = "";

		tx.Save(out doc, TXTextControl.StringStreamType.HTMLFormat, saveSettings);

		var images = GetImageBase64Strings(doc);

		// number the files so that images with the same extension
		// do not overwrite each other
		int index = 1;

		foreach (var image in images)
		{
			byte[] imageBytes = Convert.FromBase64String(image.Base64Content);

			// write byte array to file
			using (FileStream fs = new FileStream("image" + index + image.Extension, FileMode.Create))
			{
				fs.Write(imageBytes, 0, imageBytes.Length);
			}

			index++;
		}
	}

TX Text Control is used to load the document and export it as an HTML string. The extracted images are saved to the file system with the typical extension of the image format.
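If the images should end up in a dedicated folder, the saving step could be factored into a small helper. The following is a sketch under stated assumptions: the class ImageFileWriter, the method SaveAll, and the image-001 naming scheme are hypothetical and only illustrate one way to give every file a distinct name, so that two images of the same format do not end up with the same file name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ImageFileWriter
{
    // Writes each (base64, extension) pair to outputDir as
    // image-001.ext, image-002.ext, ... and returns the paths written.
    public static List<string> SaveAll(
        IEnumerable<(string Base64Content, string Extension)> images,
        string outputDir)
    {
        // Creates the folder if needed; does nothing if it already exists.
        Directory.CreateDirectory(outputDir);

        var paths = new List<string>();
        int index = 1;

        foreach (var (base64Content, extension) in images)
        {
            string path = Path.Combine(outputDir, $"image-{index:D3}{extension}");
            File.WriteAllBytes(path, Convert.FromBase64String(base64Content));
            paths.Add(path);
            index++;
        }

        return paths;
    }
}
```

With the ImageData list from the sample, the helper could be called as `ImageFileWriter.SaveAll(images.Select(i => (i.Base64Content, i.Extension)), "extracted")`, which requires the System.Linq namespace.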

Conclusion

This article showed how to extract images from documents using TX Text Control. The ServerTextControl class can be used to load documents in various formats and export them to HTML. The HTML string can be parsed to extract the base64-encoded images.
