Importing Text from a PDF File

PDF, like its close relative PostScript, is a page description format. Its original purpose was to display documents on different platforms while preserving layout and formatting in as much detail as possible. Loading a PDF file back into a word processor, other than to make very small changes, was never a design goal.

A PDF file therefore contains detailed information about the appearance of text characters, but not necessarily about their meaning. In other words, it specifies exactly what a character should look like and where on the page it is to be positioned, but not which ANSI or Unicode character it actually is. For this reason, it is not always possible to extract text from a PDF file.

In addition, a PDF file usually carries no information about text order or text flow, or about whether a piece of text is a header or a table cell. Although recent enhancements to the PDF specification allow this type of information to be included, it is rarely used. Fortunately, the majority of PDF files contain some form of character mapping, which enables a PDF reader to convert the contained text to a Unicode string.
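The role of a character mapping can be illustrated with a minimal Python sketch. It assumes a made-up table from font-specific character codes to Unicode, similar in spirit to the ToUnicode table a PDF font may carry. The names TO_UNICODE and decode_codes are hypothetical and are not part of TX Text Control or of any PDF library.

```python
# Hypothetical mapping from font-specific character codes to Unicode.
# A real PDF font may supply comparable data; these values are invented.
TO_UNICODE = {0x01: "T", 0x02: "e", 0x03: "x", 0x04: "t"}


def decode_codes(codes, mapping, placeholder="\ufffd"):
    """Translate character codes to a Unicode string.

    Codes without a mapping entry cannot be recovered, so they are
    replaced by the placeholder character.
    """
    return "".join(mapping.get(code, placeholder) for code in codes)


# All four codes are mapped, so the text can be extracted.
print(decode_codes([0x01, 0x02, 0x03, 0x04], TO_UNICODE))

# Code 0x09 has no mapping, so only part of the text is recoverable.
print(decode_codes([0x01, 0x09], TO_UNICODE))
```

When a code has no entry, the glyph can still be drawn on the page, but its meaning is lost. This is the situation in which text extraction fails.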

TX Text Control extracts and converts all of the text it can find, adds missing spaces and paragraph breaks, and re-sorts the various text blocks so that they appear in their logical order.
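The general idea behind this kind of reconstruction can be sketched in Python. This is not TX Text Control's actual algorithm, which is not described here. It is a simplified illustration that assumes each text fragment comes with a position and a width, that the page has a single column, and that fixed thresholds are good enough. The Fragment class, the rebuild_text function, and all threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float      # left edge of the fragment
    y: float      # baseline, measured downward from the top of the page
    width: float  # horizontal extent of the fragment


def rebuild_text(fragments, line_tol=2.0, space_gap=1.5, para_gap=18.0):
    """Re-sort fragments and add missing spaces and paragraph breaks.

    All thresholds are illustrative assumptions, in page units.
    """
    # Group fragments into lines: similar baselines belong together.
    lines = []  # each entry is (baseline, [fragments])
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(frag.y - lines[-1][0]) <= line_tol:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))

    out = []
    prev_y = None
    for y, frags in lines:
        # Within a line, read from left to right.
        frags.sort(key=lambda f: f.x)
        parts = [frags[0].text]
        for prev, cur in zip(frags, frags[1:]):
            # A visible horizontal gap is treated as a missing space.
            if cur.x - (prev.x + prev.width) > space_gap:
                parts.append(" ")
            parts.append(cur.text)
        if prev_y is not None:
            # A large vertical gap is treated as a paragraph break.
            out.append("\n\n" if y - prev_y > para_gap else "\n")
        out.append("".join(parts))
        prev_y = y
    return "".join(out)


# Fragments arrive in arbitrary order, as they might in a PDF content stream.
page = [
    Fragment("world", 40, 100, 30),
    Fragment("Hello", 10, 100.5, 28),
    Fragment("paragraph", 32, 140, 50),
    Fragment("Second", 10, 112, 35),
    Fragment("New", 10, 140, 20),
]
print(rebuild_text(page))
```

A production importer has to cope with multiple columns, tables, rotated text, and varying font sizes, none of which this sketch attempts.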

The resulting text will be formatted according to the setting of the LoadSettings.PDFImportSettings property.

Features of TX Text Control's PDF import: