TX Text Control is primarily used for the creation of documents in typical industry-standard formats with images, tables, and other features. However, you can also use TX Text Control to extract data from documents that you already have.


In this article we will explain how to extract all the images from a document, regardless of whether the document is saved as a DOC, DOCX, RTF or PDF file.

Preparing the Application

A .NET 8 console application is created for the purposes of this demo.

Prerequisites

The following tutorial requires a trial version of TX Text Control .NET Server for ASP.NET.

  1. In Visual Studio, create a new Console App using .NET 8.

  2. In the Solution Explorer, select your created project and choose Manage NuGet Packages... from the Project main menu.

    Select Text Control Offline Packages from the Package source drop-down.

    Install the latest version of the following package:

    • TXTextControl.TextControl.ASP.SDK


Extracting Images from Documents

TX Text Control can be used to load documents in various formats such as DOC, DOCX, RTF, and PDF. To extract images, the non-UI ServerTextControl class can be used to export the document to HTML, where images are embedded as base64-encoded strings.

First, the document is loaded into the ServerTextControl class and exported to HTML. The HTML string is then parsed to extract the base64-encoded images. The following code shows how to extract images from a document:

using System.Text.RegularExpressions;

// Finds all <img> elements whose src attribute is a base64 data URI
// and returns the payload together with the image format.
static List<ImageData> GetImageBase64Strings(string html)
{
    var imageDatas = new List<ImageData>();
    var regex = new Regex("<img[^>]+?src=[\"']data:image/(?<format>[^;]+);base64,(?<data>[^\"']+)[\"'][^>]*>", RegexOptions.IgnoreCase);

    foreach (Match match in regex.Matches(html))
    {
        var format = match.Groups["format"].Value;
        var base64Content = match.Groups["data"].Value;
        var extension = GetFileExtension(format);

        imageDatas.Add(new ImageData
        {
            Base64Content = base64Content,
            Format = format,
            Extension = extension
        });
    }

    return imageDatas;
}

// Maps the image subtype of the data URI to a typical file extension.
// Unknown formats fall back to the subtype itself.
static string GetFileExtension(string format)
{
    return format.ToLowerInvariant() switch
    {
        "jpeg" => ".jpg",
        "jpg" => ".jpg",
        "png" => ".png",
        "gif" => ".gif",
        "bmp" => ".bmp",
        "webp" => ".webp",
        "svg+xml" => ".svg",
        "tiff" => ".tif",
        "x-icon" => ".ico",
        _ => $".{format}"
    };
}

public class ImageData
{
    public string Base64Content { get; set; } = string.Empty;
    public string Format { get; set; } = string.Empty;
    public string Extension { get; set; } = string.Empty;
}

GetImageBase64Strings parses an HTML string and returns a list of ImageData objects, each containing the base64 data, the image format, and the typical file extension for saving the image.
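To make the matching behavior easier to see, here is a minimal sketch that runs the same regular expression against a small HTML fragment. The fragment is a hypothetical sample (the base64 payload "AAAA" is a placeholder, not real image data), and the example assumes that only images embedded as data URIs should be picked up, while images referenced by URL are skipped.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input: one embedded image and one linked image.
// "AAAA" is placeholder base64 content, not a real PNG.
string html =
    "<p><img alt=\"logo\" src=\"data:image/png;base64,AAAA\" />" +
    "<img src=\"https://example.com/photo.jpg\" /></p>";

// Same pattern as in GetImageBase64Strings: it only matches <img> tags
// whose src attribute is a base64 data URI.
var regex = new Regex(
    "<img[^>]+?src=[\"']data:image/(?<format>[^;]+);base64,(?<data>[^\"']+)[\"'][^>]*>",
    RegexOptions.IgnoreCase);

MatchCollection matches = regex.Matches(html);

foreach (Match match in matches)
{
    // The named groups carry the image subtype and the base64 payload.
    string format = match.Groups["format"].Value;
    string data = match.Groups["data"].Value;
    Console.WriteLine("format=" + format + ", data=" + data);
}
```

With this sample, only the first image is matched; the second one is ignored because its src attribute is a regular URL rather than a data URI.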

The following code shows how to use this method:

using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
    tx.Create();
    tx.Load("invoice.docx", TXTextControl.StreamType.WordprocessingML);

    // Embed images as base64 data in the exported HTML
    TXTextControl.SaveSettings saveSettings = new TXTextControl.SaveSettings()
    {
        ImageSaveMode = TXTextControl.ImageSaveMode.SaveAsData
    };

    string doc = "";
    tx.Save(out doc, TXTextControl.StringStreamType.HTMLFormat, saveSettings);

    var images = GetImageBase64Strings(doc);

    for (int i = 0; i < images.Count; i++)
    {
        byte[] imageBytes = Convert.FromBase64String(images[i].Base64Content);

        // Number the files so that images with the same extension
        // do not overwrite each other
        using (FileStream fs = new FileStream($"image{i + 1}{images[i].Extension}", FileMode.Create))
        {
            fs.Write(imageBytes, 0, imageBytes.Length);
        }
    }
}

TX Text Control is used to load the document and export it as an HTML string. The extracted images are saved to the file system with the typical extension of the image format.
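The sample loads a DOCX file, but the same approach applies to the other formats mentioned in this article. One possible way to choose the matching StreamType from the file extension is sketched here. GetStreamType is a hypothetical helper and not part of TX Text Control; the sketch assumes the TXTextControl.StreamType members MSWord, WordprocessingML, RichTextFormat, and AdobePDF, and it requires the NuGet package from the prerequisites.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of TX Text Control): selects the
// StreamType for ServerTextControl.Load based on the file extension.
static TXTextControl.StreamType GetStreamType(string path)
{
    return Path.GetExtension(path).ToLowerInvariant() switch
    {
        ".doc" => TXTextControl.StreamType.MSWord,
        ".docx" => TXTextControl.StreamType.WordprocessingML,
        ".rtf" => TXTextControl.StreamType.RichTextFormat,
        ".pdf" => TXTextControl.StreamType.AdobePDF,
        // Anything else is rejected instead of guessing a format
        var ext => throw new NotSupportedException("Unsupported file extension: '" + ext + "'")
    };
}

// Possible usage inside the using block shown earlier:
//   tx.Load(path, GetStreamType(path));
```

Rejecting unknown extensions keeps the failure close to its cause; whether a fallback format would be more appropriate depends on the application.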

Conclusion

This article showed how to extract images from documents using TX Text Control. The ServerTextControl class can be used to load documents in various formats and export them to HTML. The HTML string can be parsed to extract the base64-encoded images.