TX Text Control .NET for Windows Forms Documentation

Importing Text from a PDF File

PDF, like its close relative PostScript, is a page description format. Its original purpose was to display documents on different platforms while preserving layout and formatting in the greatest possible detail. Loading a PDF file back into a word processor, other than to make very small changes, was never an intended use.

A PDF file therefore contains detailed information about the appearance of text characters, but not necessarily about their meaning. In other words, it specifies exactly what a character should look like and where on a page it is to be positioned, but not necessarily which ANSI or Unicode character it actually is. Because of that, it is not always possible to extract text from a PDF file.

In addition, a PDF file carries no information about text order or text flow, or about whether a piece of text is a header or a table cell. Although recent enhancements to the PDF specification allow this type of information to be included, it is rarely used. Fortunately, the majority of PDF files contain some form of character mapping, which enables a PDF reader to convert the contained text to a Unicode string.
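
As a rough illustration of why character mapping matters, the following Python sketch decodes a run of glyph codes through a lookup table, similar in spirit to the ToUnicode map a PDF font can carry. The function name, the glyph codes, and the table contents are hypothetical, and this is not TX Text Control's implementation; it only shows that glyphs without a mapping cannot be turned into text.

```python
def decode_glyphs(glyph_codes, to_unicode):
    """Translate glyph codes to text using a code-to-character table.

    Returns the decoded text and the number of glyphs that had no
    mapping and therefore had to be dropped.
    """
    chars = []
    dropped = 0
    for code in glyph_codes:
        char = to_unicode.get(code)
        if char is None:
            # No character code is known for this glyph: it can be
            # drawn on the page, but it cannot be extracted as text.
            dropped += 1
        else:
            chars.append(char)
    return "".join(chars), dropped


# Hypothetical mapping for a font that defines three glyphs.
sample_map = {1: "P", 2: "D", 3: "F"}
text, dropped = decode_glyphs([1, 2, 3, 99], sample_map)
print(text, dropped)
```

Glyph code 99 has no entry in the table, so it is counted as dropped rather than guessed.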

TX Text Control extracts and converts all of the text it can find, adds missing spaces and paragraph breaks, and re-sorts the various text blocks so that they appear in their logical order.
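
The general idea of restoring text order and missing spaces can be sketched in Python as follows. This is a simplified heuristic under stated assumptions, not the algorithm TX Text Control uses: the `Fragment` type, the `rebuild_lines` function, and the `gap` and `line_tolerance` thresholds are all hypothetical, and coordinates are assumed to be measured from the bottom left of the page.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float      # left edge of the fragment
    y: float      # baseline, measured from the bottom of the page
    width: float  # horizontal extent of the fragment
    text: str


def rebuild_lines(fragments, gap=2.0, line_tolerance=1.0):
    """Group positioned fragments into lines and insert missing spaces."""
    # Top of the page first: a larger y means higher on the page.
    by_height = sorted(fragments, key=lambda f: -f.y)

    # Fragments whose baselines nearly coincide belong to the same line.
    groups = []
    for frag in by_height:
        if groups and abs(groups[-1][0].y - frag.y) <= line_tolerance:
            groups[-1].append(frag)
        else:
            groups.append([frag])

    lines = []
    for group in groups:
        group.sort(key=lambda f: f.x)  # left to right within a line
        parts = [group[0].text]
        for prev, cur in zip(group, group[1:]):
            # A visible horizontal gap is treated as a missing space.
            if cur.x - (prev.x + prev.width) > gap:
                parts.append(" ")
            parts.append(cur.text)
        lines.append("".join(parts))
    return lines


# Fragments deliberately listed out of reading order.
sample = [
    Fragment(60, 700, 30, "world"),
    Fragment(10, 700, 28, "Hello"),
    Fragment(10, 720, 20, "Title"),
    Fragment(38, 700, 5, ","),
]
print(rebuild_lines(sample))  # prints ['Title', 'Hello, world']
```

A real page would also need handling for columns, rotated text, and varying font sizes, which this sketch ignores.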

The resulting text will consist of single lines of text, with a line break at the end of each line. As an optional feature, TX Text Control can combine these single lines of text into larger paragraphs, which makes editing the text more convenient. As a third option, text can be displayed in text frames. In this mode, the original layout of the document is preserved much better than in the other two, and additionally, images are included. The mode is selected using the LoadSettings.PDFImportSettings property.
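
To make the difference between the line mode and the paragraph mode concrete, here is a minimal Python sketch of one possible way to combine single lines into paragraphs. It assumes that a line noticeably shorter than the longest line marks the end of a paragraph; the `merge_lines` function and the `short_ratio` threshold are hypothetical and do not describe how TX Text Control decides where paragraphs end.

```python
def merge_lines(lines, short_ratio=0.8):
    """Join consecutive lines into paragraphs.

    A line shorter than short_ratio times the longest line is assumed
    to be the last line of its paragraph.
    """
    stripped = [line.strip() for line in lines]
    if not stripped:
        return []
    full_width = max(len(line) for line in stripped)

    paragraphs = []
    current = []
    for line in stripped:
        current.append(line)
        if len(line) < short_ratio * full_width:
            # Short line: close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Trailing lines that never reached a short line.
        paragraphs.append(" ".join(current))
    return paragraphs


imported = [
    "A PDF file contains detailed information about",
    "the appearance of text characters, but it does",
    "not always say what they mean.",
    "Because of that, text extraction relies on the",
    "character mapping stored with each font.",
]
for paragraph in merge_lines(imported):
    print(paragraph)
```

Each returned string is one paragraph with the original line breaks replaced by single spaces.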

Features of TX Text Control's PDF import:

  • Text can be imported from PDF and PDF/A files, and saved in any of the formats supported by TX Text Control
  • Logical text order and missing spaces are restored
  • Text formatting, including font names, sizes and styles, is imported
  • Unicode support
  • Adobe Acrobat® is not required

Limitations

  • Text and images are imported; PostScript® graphics elements are discarded.
  • Encrypted (i.e., password-protected) files cannot be loaded at this time.
  • Text parts that are included only as glyphs (i.e., with no information about their character codes) are discarded.
  • Paragraph formatting and document layout features such as tables or headers are not supported at this time. Text contained in tables will be displayed as continuous text.
 
 
 
