Chat PDF - A Generative AI Application for PDF Documents using TX Text Control and OpenAI Functions in C#
This article shows how to create a generative AI application for PDF documents using TX Text Control and OpenAI functions in C#. The application uses the OpenAI GPT-3 engine to answer questions on the content of a PDF document.

One of the strongest applications of generative AI is in understanding the context of text and being able to answer questions about it. The integration of AI into day-to-day processes will be the norm in the applications of the future. It is highly probable that AI will be integrated into every business application developed using TX Text Control. Artificial intelligence applications in digital document processing include document analysis, contract summary, and AI-powered document template generation.
Our role is to provide the interfaces, the typical use cases, and the UI for the integration of AI into the workflow of document processing. That is why we will always provide you with ideas on how to integrate AI into your applications built with TX Text Control.
One of the most interesting applications of AI in document processing is the ability to ask questions about the content of a document.
TX Text Control provides a powerful API to create, modify, and process PDF documents. The OpenAI GPT-3 engine is a powerful tool to understand the context of text and to answer questions about it. The integration of both technologies allows developers to create powerful applications that can understand the content of a PDF document and answer questions about it.
Import the PDF
The first step is to import the PDF document into the application. TX Text Control .NET Server provides an API to import PDF documents. Because every OpenAI model limits the amount of text it accepts per request, the text of the PDF document has to be split into smaller chunks.
The following method opens a PDF document using the ServerTextControl class and splits its text into chunks:
// split a PDF document into chunks
public static List<string> Chunk(byte[] pdfDocument, int chunkSize, int overlap = 1)
{
// create a new ServerTextControl instance
using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
tx.Create();
var loadSettings = new TXTextControl.LoadSettings
{
PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
};
// load the PDF document
tx.Load(pdfDocument, TXTextControl.BinaryStreamType.AdobePDF, loadSettings);
// remove line breaks
string pdfText = tx.Text.Replace("\r\n", " ");
// call the extracted chunk creation method
return CreateChunks(pdfText, chunkSize, overlap);
}
}
// split a text into chunks
private static List<string> CreateChunks(string text, int chunkSize, int overlap)
{
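// guard against an endless loop: the window must advance on every pass
if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize)
throw new ArgumentOutOfRangeException(nameof(overlap), "overlap must be non-negative and smaller than chunkSize");
List<string> chunks = new List<string>();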
// split the text into chunks
while (text.Length > chunkSize)
{
chunks.Add(text.Substring(0, chunkSize));
text = text.Substring(chunkSize - overlap);
}
// add the last chunk
chunks.Add(text);
return chunks;
}
The method accepts a chunk size and an overlap size. The chunk size should not be too small, because the chunk is selected by matching keywords that are generated from the question asked about the PDF document. A good value for the size of a chunk is about 2500 characters with an overlap of 50 characters.
Why an overlap?
Important content may sit at the very beginning or end of a chunk, where it is cut off and too little of it remains to find the right chunk based on keywords. To keep such content visible, the end of each chunk is repeated at the beginning of the next chunk. The chunk overlap is the number of characters that adjacent chunks have in common.
Consider the following sample text, which is divided into chunks of 50 characters with an overlap of 10 characters. The small chunks should not be used for real world applications because they don't contain enough information to find keyword matches and generate answers from the content. The size of the chunk is only used to provide a visual illustration of the concept.
The area marked in purple will be repeated at the beginning of the next chunk (marked in yellow).
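The overlap can also be checked in code. The following is a minimal sketch that repeats the sliding-window logic of CreateChunks on a plain string; the class name ChunkDemo and the argument guard are illustrative additions, not part of the sample project.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkDemo
{
    // Same sliding-window logic as CreateChunks in the article.
    // The argument checks are an addition for this sketch.
    public static List<string> CreateChunks(string text, int chunkSize, int overlap)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0 || overlap >= chunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        var chunks = new List<string>();

        while (text.Length > chunkSize)
        {
            // take the next window ...
            chunks.Add(text.Substring(0, chunkSize));
            // ... and step forward by (chunkSize - overlap) characters,
            // so the last 'overlap' characters start the next chunk
            text = text.Substring(chunkSize - overlap);
        }

        // whatever is left becomes the last chunk
        chunks.Add(text);
        return chunks;
    }
}
```

With a 120-character text, a chunk size of 50 and an overlap of 10, this yields three chunks of 50, 50 and 40 characters, and the last 10 characters of each chunk are the first 10 characters of the next one.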
Find Matches
After the PDF document is imported and divided into chunks, the next step is to find the chunks that match the question that is asked. To do that, OpenAI is used to generate keywords and synonyms from the question.
This OpenAI call uses function calling to return a list of keywords from the question. This method also uses the TXTextControl.OpenAI namespace, which has been extended with functions and return values for them.
public static List<string> GetKeywords(string text, int numKeywords = 10)
{
// create a list to store the keywords
List<string> keywords = new List<string>();
string prompt = $"Create {numKeywords} keywords and synonyms from the following question that can be used to find information in a larger text. Create only 1 word per keyword. Return the keywords in lowercase only. Here is the question: {text}";
// create a request object
Request apiRequest = new Request
{
Messages = new[]
{
new RequestMessage
{
Role = "system",
Content = $"Always provide {numKeywords} keywords that include relevant synonyms of words in the original question."
},
new RequestMessage
{
Role = "user",
Content = prompt
}
},
Functions = new[]
{
new Function
{
Name = "get_keywords",
Description = "Use this function to give the user a list of keywords.",
Parameters = new Parameters
{
Type = "object",
Properties = new Properties
{
List = new ListProperty
{
Type = "array",
Items = new Items
{
Type = "string",
Description = "A keyword"
},
Description = "A list of keywords"
}
}
},
Required = new List<string> { "list" }
}
},
FunctionCall = new FunctionCall
{
Name = "get_keywords",
Arguments = "{'list'}"
}
};
// get the response
if (GetResponse(apiRequest) is Response response)
{
// return the keywords
return System.Text.Json.JsonSerializer.Deserialize<ListReturnObject>(response.Choices[0].Message.FunctionCall.Arguments).List;
}
// no response: return the empty list instead of null so callers can iterate safely
return keywords;
}
The method uses the OpenAI GPT-3 engine to generate keywords and synonyms from the question, which are then counted in the generated chunks. The idea is to select the best-matching chunk and send only that chunk to OpenAI to generate the answer based on that content.
The FindMatches method returns a dictionary that maps each chunk id to a relevance score, sorted by score in descending order. For every keyword, a chunk receives the share of that keyword's occurrences across all chunks that fall inside this chunk; the score of a chunk is the sum of these shares over all keywords.
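The ListReturnObject class used for deserialization is not shown in this article. A minimal sketch of what it might look like follows; the class shape, the KeywordParsing helper and its fallback behavior are assumptions. Since the function schema names the property "list" in lowercase and System.Text.Json matches property names case-sensitively by default, the sketch maps the name explicitly.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical shape of ListReturnObject; the article does not show the class.
public class ListReturnObject
{
    // maps the lowercase JSON property "list" to the C# property
    [JsonPropertyName("list")]
    public List<string> List { get; set; } = new List<string>();
}

public static class KeywordParsing
{
    // Parses the JSON string returned as function-call arguments.
    // Returns an empty list instead of throwing when the arguments are missing or malformed.
    public static List<string> ParseKeywords(string functionArguments)
    {
        if (string.IsNullOrWhiteSpace(functionArguments))
            return new List<string>();

        try
        {
            var result = JsonSerializer.Deserialize<ListReturnObject>(functionArguments);
            return result?.List ?? new List<string>();
        }
        catch (JsonException)
        {
            // the model returned something that is not valid JSON
            return new List<string>();
        }
    }
}
```

Returning an empty list keeps the calling code simple: a question for which no keywords could be parsed simply produces no matches.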
// find matches in a list of chunks
public static Dictionary<int, double> FindMatches(List<string> chunks, List<string> keywords, int padding = 500)
{
// create a dictionary to store the total number of occurrences of each keyword across all chunks
Dictionary<string, int> df = new Dictionary<string, int>();
// create a dictionary to store the results
Dictionary<int, double> results = new Dictionary<int, double>();
// create a list to store the trimmed chunks
List<string> trimmedChunks = new List<string>();
// loop through the chunks
for (int i = 0; i < chunks.Count; i++)
{
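// trim the padding from the start of every chunk except the first
// and from the end of every chunk except the last;
// clamp the values so that chunks shorter than the padding don't throw
string chunk = chunks[i];
if (i != 0) chunk = chunk.Substring(Math.Min(padding, chunk.Length));
if (i != chunks.Count - 1) chunk = chunk.Substring(0, Math.Max(0, chunk.Length - padding));
trimmedChunks.Add(chunk.ToLower());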
}
// loop through the trimmed chunks
foreach (string chunk in trimmedChunks)
{
// loop through the keywords
foreach (string keyword in keywords)
{
// count the occurrences of the keyword in the chunk
int occurrences = chunk.CountSubstring(keyword);
// make sure the keyword has an entry in the totals dictionary
if (!df.ContainsKey(keyword))
{
df[keyword] = 0;
}
// add the occurrences found in this chunk to the keyword's total
df[keyword] += occurrences;
}
}
// loop through the trimmed chunks
for (int chunkId = 0; chunkId < trimmedChunks.Count; chunkId++)
{
// initialize the points
double points = 0;
// loop through the keywords
foreach (string keyword in keywords)
{
// count the occurrences of the keyword in the chunk
int occurrences = trimmedChunks[chunkId].CountSubstring(keyword);
// calculate the points
if (df[keyword] > 0)
{
// add the points
points += occurrences / (double)df[keyword];
}
}
// add the points to the results
results[chunkId] = points;
}
// return the results sorted by points
return results.OrderByDescending(x => x.Value).ToDictionary(x => x.Key, x => x.Value);
}
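FindMatches calls CountSubstring, which is not a built-in .NET string method, so it is presumably an extension method defined in the sample project. A minimal sketch of such a method follows, assuming that non-overlapping, case-sensitive matches are counted (the chunks and keywords are already lowercase at this point); the class name StringExtensions is illustrative.

```csharp
using System;

public static class StringExtensions
{
    // Counts the non-overlapping occurrences of 'value' in 'text'
    // using an ordinal (culture-independent, case-sensitive) comparison.
    public static int CountSubstring(this string text, string value)
    {
        // nothing to count in empty input; an empty search value would never advance
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(value))
            return 0;

        int count = 0;
        int index = 0;

        // search from the end of the previous match until nothing is found
        while ((index = text.IndexOf(value, index, StringComparison.Ordinal)) != -1)
        {
            count++;
            index += value.Length;
        }

        return count;
    }
}
```

Because this is plain substring matching, a keyword such as "contract" is also counted inside "subcontracting", which fits the goal of finding related passages.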
Generate the Answer
After the best match is found, the next step is to generate the answer based on the content of the chunk. The chunk will be sent to OpenAI along with the question and the prompt that follows:
$"```{chunk}```Your source is the information above. What is the answer to the following question? ```{question}```"
The complete method GetAnswer is shown in the code below.
public static string GetAnswer(string chunk, string question)
{
// create a prompt
string prompt = $"```{chunk}```Your source is the information above. What is the answer to the following question? ```{question}```";
// create a request object
Request apiRequest = new Request
{
Messages = new[]
{
new RequestMessage
{
Role = "system",
Content = "You should help to find an answer to a question in a document."
},
new RequestMessage
{
Role = "user",
Content = prompt
}
}
};
// get the response
if (GetResponse(apiRequest) is Response response)
{
// return the answer
return response.Choices[0].Message.Content;
}
// return null if the response is null
return null;
}
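The GetResponse helper and the Request classes are part of the sample project and are not shown here. To illustrate what such a request might look like on the wire, the following sketch builds a JSON body in the shape of the OpenAI chat completions format, with a model name and a messages array of role/content pairs. The class name, the method name and the model name "gpt-3.5-turbo" are assumptions; the article does not state which model is configured.

```csharp
using System.Text.Json;

public static class ChatPayloadSketch
{
    // three backticks, used as delimiters around the chunk and the question
    private const string Fence = "\u0060\u0060\u0060";

    // Builds the JSON body for a chat completions request.
    // The default model name is an assumption made for this sketch.
    public static string BuildAnswerRequestJson(string chunk, string question, string model = "gpt-3.5-turbo")
    {
        string prompt = Fence + chunk + Fence +
            "Your source is the information above. What is the answer to the following question? " +
            Fence + question + Fence;

        var payload = new
        {
            model,
            messages = new[]
            {
                new { role = "system", content = "You should help to find an answer to a question in a document." },
                new { role = "user", content = prompt }
            }
        };

        return JsonSerializer.Serialize(payload);
    }
}
```

Only one chunk is embedded in the prompt, which keeps the request within the limits of the model regardless of the size of the PDF document.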
Running the Application
The application is a simple .NET Console App that uses the TX Text Control .NET Server to import the PDF document and to display the answer generated by OpenAI.
string question = "Is contracting with other partners an option?";
//string question = "How will disputes be dealt with?";
//string question = "Can the agreement be changed or modified?";
string pdfPath = "Sample PDFs/SampleContract-Shuttle.pdf";
// load the PDF file
byte[] pdfDocument = File.ReadAllBytes(pdfPath);
// split the PDF document into chunks
var chunks = DocumentProcessing.Chunk(pdfDocument, 2500, 50);
Console.WriteLine($"{chunks.Count} chunks generated from: {pdfPath}");
// get the keywords
List<string> generatedKeywords = GPTHelper.GetKeywords(question, 20);
// find the matches
var matches = DocumentProcessing.FindMatches(chunks, generatedKeywords).ToList().First();
// print the matches
Console.WriteLine($"The question: \"{question}\" was found in chunk {matches.Key}.");
// print the answer
Console.WriteLine("\r\n********\r\n" + GPTHelper.GetAnswer(chunks[matches.Key], question));
A sample output is shown in the following console:
14 chunks generated from: Sample PDFs/SampleContract-Shuttle.pdf
The question: "Is contracting with other partners an option?" was found in chunk 11.

********
No, contracting with other partners is not an option unless prior approval is obtained from the COMMISSION'S Contract Manager. The document specifies that subcontracting work under this Agreement is not allowed without prior written authorization, except for those identified in the approved Fee Schedule. Subcontracts over $25,000 must include the necessary provisions from the main Agreement and must be approved in writing by the COMMISSION'S Contract Manager.
The answer is generated based on the content of the chunk that contains the best match for the question. It is not necessarily a verbatim quote: the model summarizes the relevant passage of the PDF document in its own words.
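The console application picks the best chunk by taking the first entry of the dictionary returned by FindMatches. The .NET documentation describes the enumeration order of Dictionary<TKey, TValue> as undefined, so a more defensive variant ranks the scores into a list. The following is a sketch under that assumption; the names RankingSketch and TopChunks are illustrative and not part of the sample project.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RankingSketch
{
    // Returns the ids of the best-scoring chunks, best first.
    // Chunks without any keyword hit (score 0) are skipped,
    // and ties are broken by the lower chunk id.
    public static List<int> TopChunks(IReadOnlyDictionary<int, double> scores, int count = 1)
    {
        return scores
            .Where(s => s.Value > 0)
            .OrderByDescending(s => s.Value)
            .ThenBy(s => s.Key)
            .Take(count)
            .Select(s => s.Key)
            .ToList();
    }
}
```

An empty result means that no keyword matched any chunk, so the application can report this instead of sending an unrelated chunk to OpenAI. Requesting more than one chunk would also allow sending several candidates if the model limits permit.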
Conclusion
This article shows how to create a generative AI application for PDF documents using TX Text Control and OpenAI functions in .NET C#. The application uses the OpenAI GPT-3 engine to answer questions on the content of a PDF document.
Test it yourself by downloading the sample application from our GitHub repository.
Download and Fork This Sample on GitHub
We proudly host our sample code on github.com/TextControl.
Please fork and contribute.
Requirements for this sample
- TX Text Control .NET Server Core
- OpenAI API Key