Organizations handle an overwhelming number of documents in a variety of formats, including Adobe PDF, Office Open XML (DOCX), and older formats such as DOC or RTF. In legacy applications, the information contained in these documents was often not stored in databases or other easily accessible forms, so extracting it is time-consuming and labor-intensive. With Intelligent Document Processing (IDP), developers can automate document-related workflows to improve accuracy, efficiency, and decision-making in business applications.

What is Intelligent Document Processing?

Intelligent Document Processing uses artificial intelligence (AI) and natural language processing (NLP) to automate the extraction of data from documents. The AI models interpret the content, context, and structure of a document, which enables tasks such as document classification, data extraction, and querying documents for specific information. IDP can automate the processing of a wide variety of documents, including invoices, purchase orders, and contracts.

At Text Control, we focus on researching the best available models and AI providers to integrate AI-based document processing into .NET applications. We have created several examples of typical IDP applications that demonstrate how TX Text Control, which extracts text from PDF documents and accesses content in MS Word documents, can be combined with AI model-based queries on that content.

Document Classification

One of IDP's most important capabilities is document classification. Organizations deal with a wide variety of documents: contracts, invoices, receipts, forms, legal documents, and more. When these documents are created with TX Text Control in modern applications, the data is stored in databases or in machine-readable form and then attached to the generated PDF document in an ISO standard format such as PDF/A-3b. This data can be used to classify the document based on its content, structure, or metadata. For example, a document can be classified as an invoice based on the presence of specific keywords, patterns, or other criteria.

Learn More

Metadata in TX Text Control

Document metadata in PDFs and other formats is important for several reasons, including organization, searchability, authenticity, and compliance. This article shows how to import and export metadata in PDF documents using the TX Text Control .NET Server for ASP.NET.

The Importance of Metadata in PDF Documents: Import and Export Metadata in ASP.NET Core C#

However, documents created with older technologies rather than TX Text Control lack this metadata, so the information must be extracted in a separate process. An automated process can determine whether a document is an invoice, quote, or contract and route it to the appropriate workflow. We built a prototype that uses TX Text Control to import the text of a PDF document and OpenAI to analyze it.

For example, the following PDF is loaded and parsed using TX Text Control:

Figure: Invoice analysis

After you enter the document path, the application imports the document and sends it to OpenAI for analysis. The results are written to the console:

Enter the path to the document to classify:
Documents\invoice.pdf
invoice:0.8, receipt:0.2, contract:0, quotation:0, agreement:0, other:0
Highest probability: invoice

The application has determined that the input document is an invoice, which is correct.
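The console output above suggests a simple post-processing step: parse the label and score pairs and pick the winner. The following is a minimal sketch of that step, not the prototype's actual code. The class name `ClassificationResult`, the method names, and the confidence threshold of 0.5 are assumptions made for illustration; only the `label:score` response format is taken from the output shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper: not part of TX Text Control or the OpenAI SDK.
public static class ClassificationResult
{
    // Parses a response such as "invoice:0.8, receipt:0.2" into label/score pairs.
    public static Dictionary<string, double> Parse(string response)
    {
        var scores = new Dictionary<string, double>(StringComparer.OrdinalIgnoreCase);

        foreach (string pair in response.Split(',', StringSplitOptions.RemoveEmptyEntries))
        {
            string[] parts = pair.Split(':');
            if (parts.Length != 2) continue; // skip malformed entries

            if (double.TryParse(parts[1].Trim(), NumberStyles.Float,
                    CultureInfo.InvariantCulture, out double score))
            {
                scores[parts[0].Trim()] = score;
            }
        }

        return scores;
    }

    // Returns the label with the highest score, or "other" when no score
    // reaches the (assumed) threshold, so uncertain documents can be reviewed.
    public static string Best(IReadOnlyDictionary<string, double> scores, double threshold = 0.5)
    {
        if (scores.Count == 0) return "other";

        var top = scores.OrderByDescending(s => s.Value).First();
        return top.Value >= threshold ? top.Key : "other";
    }
}
```

Calling `ClassificationResult.Best(ClassificationResult.Parse(response))` on the response shown above yields `invoice`. The threshold is a design choice: routing low-confidence documents to a manual review queue is usually safer than sending them into the wrong workflow.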

Data Extraction

Another important aspect of IDP is the extraction of data. This data can be used to populate a database, trigger a workflow, or perform any number of other actions. For example, an invoice might contain information such as the invoice number, date, total amount, and line items. IDP can extract these details from an invoice so that the values can be reconciled with the original purchase order.

Many business documents do not follow a fixed format, making it difficult for traditional systems to extract information. With TX Text Control, there are two ways to find specific values in a PDF document:

  • Using the built-in radial text search functionality
  • Using AI models to query the content

Using a combination of both methods, we can extract specific values from a document and double-check the results with AI models.
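The double-check can be as simple as accepting a value only when both methods agree. The sketch below illustrates this idea for a monetary amount. It assumes that both extraction steps have already produced plain strings; the class `ExtractionCheck` and its methods are hypothetical and do not represent TX Text Control or OpenAI APIs. Amounts are assumed to use a period as the decimal separator.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for cross-checking two independently extracted values.
public static class ExtractionCheck
{
    // Normalizes an amount such as "$1,250.00" or "1250" to a decimal.
    public static bool TryParseAmount(string text, out decimal amount)
    {
        string cleaned = text.Trim().TrimStart('$', '€', '£').Trim();
        return decimal.TryParse(cleaned, NumberStyles.Number,
            CultureInfo.InvariantCulture, out amount);
    }

    // Returns the amount when both methods agree; otherwise null,
    // which signals that the document should be reviewed manually.
    public static decimal? Reconcile(string fromTextSearch, string fromAiModel)
    {
        if (TryParseAmount(fromTextSearch, out decimal a) &&
            TryParseAmount(fromAiModel, out decimal b) &&
            a == b)
        {
            return a;
        }

        return null;
    }
}
```

For example, `ExtractionCheck.Reconcile("$1,250.00", "1250")` returns `1250.00`, while two differing values return `null`. The same pattern could be used to compare an extracted total with the total of the original purchase order.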

Learn More

Extract data from PDF

TX Text Control can be used to create and edit Adobe PDF documents programmatically. But it is also possible to import PDF documents to read, extract and manipulate them. This article shows different ways to extract data from existing PDF documents.

Extract Text and Data from PDF Documents in C#

Querying Documents

The ability to answer questions based on the content of documents is one of the most powerful features of modern IDP systems. Imagine you have a large repository of contracts and you need to find out which contracts contain a certain clause or what the cancellation terms are. Finding the right answer manually, even with advanced search, is slow and error-prone.

Using NLP and AI, users can ask natural-language questions about the content of a document or a set of documents. Typical questions about an invoice are:

  • What is the total amount?
  • When is the invoice due?
  • What are the payment terms?

For businesses, this means faster decision-making and improved productivity. Instead of spending hours searching for information, employees can focus on higher-value tasks.

We developed a prototype called Chat PDF, which comes with full source code and uses TX Text Control to extract text from PDF documents and OpenAI to analyze the content. The example also shows how to prepare the content by breaking it into smaller chunks with a specific overlap to get accurate answers.
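Chunking with overlap means that each chunk repeats the end of the previous one, so a sentence that straddles a chunk boundary still appears complete in at least one chunk. The following is a minimal sketch of the idea that works on plain text and counts characters. The class `TextChunker` is hypothetical; the prototype's own chunking code may measure size differently (for example, in words or tokens).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that illustrates overlapping chunks on a plain string.
public static class TextChunker
{
    // Splits text into chunks of at most chunkSize characters. Each chunk
    // starts with the last `overlap` characters of the previous chunk.
    public static List<string> Chunk(string text, int chunkSize, int overlap)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0 || overlap >= chunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        var chunks = new List<string>();
        int step = chunkSize - overlap; // distance between chunk start positions

        for (int start = 0; start < text.Length; start += step)
        {
            int length = Math.Min(chunkSize, text.Length - start);
            chunks.Add(text.Substring(start, length));

            if (start + length >= text.Length) break; // last chunk reached
        }

        return chunks;
    }
}
```

With a chunk size of 4 and an overlap of 1, the string `abcdefghij` is split into `abcd`, `defg`, and `ghij`. Smaller chunks keep the prompt short and focused, while the overlap reduces the risk of cutting a relevant passage in half.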

Learn More

Questions on a PDF

This article shows how to create a generative AI application for PDF documents using TX Text Control and OpenAI functions in C#. The application uses the OpenAI GPT-3 engine to answer questions on the content of a PDF document.

Chat PDF - A Generative AI Application for PDF Documents using TX Text Control and OpenAI Functions in C#

The application is a simple .NET console app that uses the TX Text Control .NET Server for ASP.NET to import the PDF document and writes the answer generated by OpenAI to the console.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// DocumentProcessing and GPTHelper are helper classes from the sample's source code

string question = "Is contracting with other partners an option?";
//string question = "How will disputes be dealt with?";
//string question = "Can the agreement be changed or modified?";
string pdfPath = "Sample PDFs/SampleContract-Shuttle.pdf";

// load the PDF file
byte[] pdfDocument = File.ReadAllBytes(pdfPath);

// split the PDF document into chunks
var chunks = DocumentProcessing.Chunk(pdfDocument, 2500, 50);
Console.WriteLine($"{chunks.Count} chunks generated from: {pdfPath}");

// generate keywords from the question
List<string> generatedKeywords = GPTHelper.GetKeywords(question, 20);

// find the best-matching chunk (the first entry returned)
var bestMatch = DocumentProcessing.FindMatches(chunks, generatedKeywords).First();
Console.WriteLine($"The question: \"{question}\" was found in chunk {bestMatch.Key}.");

// ask the question against the matching chunk and print the answer
Console.WriteLine("\r\n********\r\n" + GPTHelper.GetAnswer(chunks[bestMatch.Key], question));

The following console output shows a sample run:

14 chunks generated from: Sample PDFs/SampleContract-Shuttle.pdf
The question: "Is contracting with other partners an option?" was found in chunk 11.

********
No, contracting with other partners is not an option unless prior approval is obtained from the COMMISSION'S Contract Manager. The document specifies that subcontracting work under this Agreement is not allowed without prior written authorization, except for those identified in the approved Fee Schedule. Subcontracts over $25,000 must include the necessary provisions from the main Agreement and must be approved in writing by the COMMISSION'S Contract Manager.

The application found the chunk that answers the question and generated an answer from the relevant text. This capability can be integrated into any business application to answer questions based on the content of documents.
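The listing selects one chunk by matching generated keywords against all chunks. One simple way to implement such a matching step is to count how many keywords occur in each chunk and sort the chunks by that count. The sketch below illustrates this approach. The class `KeywordMatcher` is hypothetical, and the sample's own matching logic may score chunks differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper that ranks chunks by the number of matching keywords.
public static class KeywordMatcher
{
    // Returns (chunk index, number of matched keywords) pairs, best match first.
    // Chunks without any matching keyword are omitted.
    public static List<KeyValuePair<int, int>> Rank(
        IReadOnlyList<string> chunks, IEnumerable<string> keywords)
    {
        var keywordList = keywords
            .Where(k => !string.IsNullOrWhiteSpace(k))
            .ToList();

        return chunks
            .Select((chunk, index) => new KeyValuePair<int, int>(
                index,
                keywordList.Count(k =>
                    chunk.Contains(k, StringComparison.OrdinalIgnoreCase))))
            .Where(match => match.Value > 0)
            .OrderByDescending(match => match.Value)
            .ToList();
    }
}
```

Sending only the best-ranked chunk to the model keeps the prompt small. A production system would typically replace plain keyword counting with a more robust ranking, such as embedding-based similarity.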

Conclusion

Intelligent Document Processing is a powerful tool that can help organizations automate document-related workflows, improve accuracy, and make better decisions. By combining the power of TX Text Control with AI models, developers can create sophisticated applications that can classify documents, extract data, and answer questions based on the content of documents. This can help organizations save time, reduce errors, and improve productivity.

At Text Control, we are committed to providing developers with the tools they need to create powerful applications that can take advantage of the latest technologies. Our research into Intelligent Document Processing is just one example of how we are working to help developers create innovative solutions that can transform the way organizations work.

For more information on Intelligent Document Processing and how you can integrate it into your applications, please contact us. We would be happy to help you get started with this exciting technology.