Adobe Portable Document Format (PDF) Import

Existing PDF documents for which the original source files no longer exist can be imported into TX Text Control .NET Server, where they can be edited and saved in any supported file format.

Calculating the layout of a page from an imported PDF, however, is a tricky process: PDF files contain detailed information about the appearance of a page, but not necessarily about the meaning of the characters and images contained within a page.

Furthermore, PDF files usually contain no information about the order of text, the text flow, or whether a piece of text is part of a header or a table cell. Although recent additions to the PDF specification allow some of this information to be stored (tagged PDF), this feature is rarely used.

TX Text Control .NET Server extracts and converts all of the text it can find, adds missing spaces and paragraph breaks, and re-sorts the various text blocks and images so that they appear in their logical order.
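The re-sorting and space-insertion steps can be pictured with a small sketch. The Python below is not TX Text Control's implementation; it is a simplified illustration which assumes that every extracted fragment carries a position and a width. The names Fragment, reading_order and join_line, as well as the two tolerance values, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float      # left edge, in points
    y: float      # distance from the top of the page, in points
    width: float  # rendered width, in points


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by similar y, then sort each line left to right."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # A fragment joins the current line if it sits at nearly the same height
        # as the first fragment of that line (assumed tolerance).
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [sorted(line, key=lambda f: f.x) for line in lines]


def join_line(line, gap_threshold=1.5):
    """Concatenate fragments, adding a space wherever a visible gap separates them."""
    parts = [line[0].text]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_threshold:
            parts.append(" ")  # the PDF stored no space character here
        parts.append(cur.text)
    return "".join(parts)
```

Calling reading_order on fragments stored in arbitrary order and then join_line on each resulting line yields plain text in top-to-bottom, left-to-right order. Real documents with multiple columns, rotated text or tables need considerably more analysis than this sketch performs.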

The following three parameters specify the exact behavior of the import filter:

GenerateLines: The imported document is built from single lines of text, each terminated by a line break. This option is most suitable if only the textual content of the PDF file is of interest.

GenerateParagraphs: The single lines of text are grouped together to form paragraphs. This option eases post-import editing and is best suited to text-heavy documents, such as legal contracts.

GenerateTextFrames: The imported blocks of text and images are organized into text frames and placed at the same position as in the original PDF file. This option produces documents that most closely resemble the original.
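To make the difference between the three parameters concrete, the following Python sketch shows what each kind of output might look like for the same input. It is an illustration only and does not use or reproduce the TX Text Control API: the functions group_blocks, as_lines, as_paragraphs and as_frames are hypothetical, the input is assumed to be a list of (x, y, text) tuples already in reading order, and the max_gap value of 14 points is an arbitrary assumption.

```python
def group_blocks(lines, max_gap=14.0):
    """Group consecutive lines whose vertical distance is at most max_gap."""
    blocks = []
    prev_y = None
    for x, y, text in lines:
        if prev_y is not None and y - prev_y <= max_gap:
            blocks[-1]["lines"].append(text)
        else:
            # A larger gap starts a new block at this line's position.
            blocks.append({"x": x, "y": y, "lines": [text]})
        prev_y = y
    return blocks


def as_lines(lines):
    """Lines mode: one output line per source line, each ending in a line break."""
    return "".join(text + "\n" for _, _, text in lines)


def as_paragraphs(lines):
    """Paragraphs mode: the lines of each block are merged into one paragraph."""
    return [" ".join(block["lines"]) for block in group_blocks(lines)]


def as_frames(lines):
    """Frames mode: each block keeps the position of its first line."""
    return [(b["x"], b["y"], "\n".join(b["lines"])) for b in group_blocks(lines)]
```

In this sketch the lines output discards all position information, the paragraphs output keeps only the grouping, and the frames output keeps both grouping and position, which mirrors the trade-off between editability and fidelity described above.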