The Beginning

The Portable Document Format, or PDF, is the most commonly used document format for business applications. Back in 1990, Adobe co-founder Dr. John Warnock published a white paper that essentially described the need for PDF:

What industries badly need is a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks. These documents should be viewable on any display and should be printable on any modern printers. If this problem can be solved, then the fundamental way people work will change.

Dr. John Warnock, Co-founder Adobe

In a nutshell, the advantage of PDF documents is their multi-platform compatibility. A document such as a tax form or invoice can be sent to any recipient, who is then able to read, complete or print it. Because of the available editing restrictions (and, later, electronic signatures), a PDF has been handled very much like printed paper and has received a similar status in business processes.

Electronic Data Interchange

Electronic Data Interchange, better known as EDI, has existed since the early 1970s to communicate data between applications. The idea was to save money by replacing paper-based documents and, with them, manual paper processes such as sorting, archiving and printing. In reality, more paper documents were produced and sent, because paper documents were maintained in parallel with the EDI data.

Legal restrictions and user experience require most data (for example invoices) to be human-readable. In theory, the PDF document is the perfect format to replace printed paper. It is easy to send, easy to read on all machines, can be searched and is good for archiving processes. But the machine-readable data is missing.

The software industry tried to solve this issue by recognizing content in PDF documents (very similar to OCR processes) to give documents a context (Is this document an invoice?) and to match content with expected fields (invoice number, addresses, products, ...).

From Electronic Paper to Container

Compared with its predecessors, PDF/A-3 introduced a significant change: PDF/A-3 (ISO 19005-3:2012) permits the embedding of files of any format (including XML, MS Word and proprietary binary formats). This change allows the progression from electronic paper to an electronic container that holds both the human-readable and the machine-readable versions of a document.

Now, the human-readable version can be ignored by applications reading the data of the document. Applications can extract the machine-readable portion of the PDF document in order to process it. A PDF/A-3 document can contain an unlimited number of embedded documents for different processes. According to the specification, software applications can extract embedded files without explicit knowledge of the PDF document itself.

Create and Extract Embedded Files

Technically, this is not an easy process. With TX Text Control X19 (29.0), we will provide PDF features to create documents with embedded files and to extract embedded files from these electronic containers. The following code creates a PDF document with an embedded XML document:

using System.IO;

// Read the XML payload that will travel inside the PDF container.
var sData = File.ReadAllText("data.xml");

// Describe the attachment and how it relates to the visible document.
TXTextControl.EmbeddedFile file = new TXTextControl.EmbeddedFile("data.xml", sData, "") {
    MIMEType = "text/xml",
    RelationShip = "Alternative"
};

// Pass the attachment to the save operation.
TXTextControl.SaveSettings saveSettings = new TXTextControl.SaveSettings() {
    EmbeddedFiles = new TXTextControl.EmbeddedFile[] { file }
};

textControl1.Save("myinvoice.pdf",
    TXTextControl.StreamType.AdobePDFA,
    saveSettings);

When the document is opened in Adobe Acrobat Reader, the embedded files are listed in the Attachments sidebar tab:

PDF/A-3 with TX Text Control

TX Text Control can be used not only to create these documents, but also to import them and extract the embedded files. The following code shows how to extract the XML data from the PDF document:

using System.Text;

// Ask the PDF import to load embedded files and metadata.
TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings();
ls.PDFImportSettings =
    TXTextControl.PDFImportSettings.LoadEmbeddedFiles |
    TXTextControl.PDFImportSettings.LoadMetadata;
textControl1.Load("myinvoice.pdf", TXTextControl.StreamType.AdobePDF, ls);

// Decode the bytes of the first embedded file into a string.
var dataXml = Encoding.Unicode.GetString((byte[])ls.EmbeddedFiles[0].Data);
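Once the XML has been extracted as a string, a business application can hand it to any XML parser. The following is only a minimal sketch of that step. It assumes a hypothetical invoice schema: the element names invoice, number and total, as well as the InvoiceXmlSketch class and its ReadInvoice method, are illustrative and not part of TX Text Control. It uses nothing beyond System.Xml.Linq from .NET, and the hard-coded string stands in for the data extracted from the PDF container:

```csharp
using System;
using System.Xml.Linq;

static class InvoiceXmlSketch
{
    // Reads two fields from an invoice XML string.
    // The schema (invoice/number/total) is hypothetical.
    public static (string Number, decimal Total) ReadInvoice(string dataXml)
    {
        XDocument doc = XDocument.Parse(dataXml);
        XElement root = doc.Root;

        // The explicit conversions parse the element content;
        // the decimal conversion is culture-invariant.
        string number = (string)root.Element("number");
        decimal total = (decimal)root.Element("total");
        return (number, total);
    }

    static void Main()
    {
        // Stand-in for the string extracted from the PDF container.
        string dataXml =
            "<invoice><number>1001</number><total>250.00</total></invoice>";

        var invoice = ReadInvoice(dataXml);
        Console.WriteLine($"Invoice {invoice.Number}, total {invoice.Total}");
    }
}
```

In a real workflow, the schema would be dictated by the invoice standard or partner agreement in use, and the parsing code would likely validate the XML before reading any values from it.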

TX Text Control Covers Complete PDF Workflow

The EmbeddedFiles property contains an array of all files embedded in the PDF document. TX Text Control can be used to cover the complete PDF document workflow, from creating the document to processing incoming documents in business applications. Combined with the powerful template-based document creation engine, TX Text Control provides developers with a complete solution for handling PDF documents in business processes.