During the development of TX Text Control .NET for Windows Forms 15.1, I saw most of the new features and, of course, tested most of them intensively. But there was one large item on the feature list that I was especially excited about: PDF import. Due to my travel calendar, I had not been able to look at the PDF import improvements our development team came up with for version 15.1. I couldn't wait to get my hands on the new results, so I spent part of the weekend reviewing these changes. Let me make it clear from the start: The results are amazing.
As you may know, we released the first PDF import filter in version 15.0. The idea was to make it possible to import PDF documents and process them further. That first version was able to extract the text from PDF files. It even recognized paragraphs, and the text frame mode was able to generate good-looking documents.
If you don't know the PDF format in detail, you might wonder why it is so tricky to load this format. To understand this, think of a PDF document as a compiled executable file, in contrast to the source code.
An RTF document is like the source code. You can open it easily in a text editor and if you understand RTF tags, you can make changes to the document.
A PDF document, on the other hand, can be compared to a true Win32 EXE, which can't be easily decompiled. To understand what the EXE does or how it is implemented, you need to reverse engineer the executable. The same is true for PDF documents: The Adobe PDF format is a low-level output format that was designed to be printed, not to be imported again. In most cases, it contains only geometrical information about the individual characters. You have to calculate which characters belong to a sentence, a paragraph or a text frame.
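To give a feeling for what "calculating which characters belong together" means, here is a minimal Python sketch. It is not the algorithm used by TX Text Control; the Glyph type, the function name and the threshold values are assumptions made up purely for illustration. It takes characters with positions, groups them into lines by their baseline, and inserts a space wherever the horizontal gap between two characters is large enough.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character (hypothetical type, for illustration only)."""
    char: str
    x: float      # left edge in points
    y: float      # baseline in points; larger y is further down the page here
    width: float  # advance width in points


def group_into_lines(glyphs, y_tolerance=2.0, space_gap=1.5):
    """Rebuild lines of text from individually positioned characters.

    y_tolerance and space_gap are assumed thresholds, not values taken
    from any real PDF import filter.
    """
    # Step 1: collect glyphs whose baselines are (almost) equal into one line.
    lines = []  # list of (baseline_y, [glyphs])
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if lines and abs(lines[-1][0] - g.y) <= y_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    # Step 2: inside each line, order glyphs left to right and insert a
    # space wherever the gap to the previous glyph exceeds space_gap.
    result = []
    for _, line_glyphs in lines:
        line_glyphs.sort(key=lambda g: g.x)
        text = line_glyphs[0].char
        for prev, cur in zip(line_glyphs, line_glyphs[1:]):
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        result.append(text)
    return result
```

A real import filter has to cope with much more than this, such as different font sizes, rotated text and multiple columns, but the principle of deriving structure from pure geometry is the same.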
So, what has been improved?
Aside from new image formats that are now supported, the most impressive part is the grouping of created text frames. Let's have a look at a typical PDF page:
It consists of a heading, an introductory text and a table with three columns. The following screenshots show the results from version 15.0 in comparison to version 15.1:
TX Text Control 15.0
TX Text Control 15.1
As you can see in the version 15.0 screenshot, every text area is inserted as a separate text frame in order to reproduce the distances between the paragraphs. The problem with this is that you can't extract the text from all of these frames at once using copy and paste. In version 15.1, all text frames are grouped into one large frame, and the distances are reproduced using paragraph spacing, specific line spacing and indents.
The tabular data in the PDF is now represented using tab positions rather than individual text frames, which makes the text readable and reusable. The following illustration shows the same document in TX Text Control 15.1 with visible control characters, where the line spacing and tab positions are highlighted:
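The idea of turning table columns into tab positions can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions, not the implementation in TX Text Control: the function names, the (x, text) cell representation and the tolerance value are hypothetical. The sketch clusters the left edges of all cells into tab stops and then writes each row as tab-separated text.

```python
def infer_tab_stops(rows, tolerance=3.0):
    """Cluster the left edges of all cells into a sorted list of tab stops.

    rows is a list of rows; each row is a list of (x, text) tuples, where
    x is the left edge of the cell in points. tolerance is an assumed value.
    """
    xs = sorted(x for row in rows for x, _ in row)
    stops = []
    for x in xs:
        # A new stop starts only if x is clearly to the right of the last one.
        if not stops or x - stops[-1] > tolerance:
            stops.append(x)
    return stops


def row_to_tabbed_text(row, stops):
    """Place each cell at its nearest tab stop and join the cells with tabs.

    Assumes stops is non-empty; columns without a cell stay empty.
    """
    cells = [""] * len(stops)
    for x, text in row:
        nearest = min(range(len(stops)), key=lambda i: abs(stops[i] - x))
        cells[nearest] = text
    return "\t".join(cells)
```

Because every row ends up as one line of text with tab characters, the whole table can be selected, copied and edited like normal text, which is not possible when each cell lives in its own frame.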
I would like to encourage you to test the much-improved PDF import feature of TX Text Control .NET for Windows Forms 15.1. I look forward to your comments, and I would be happy to discuss the possibilities with you.