One of the strongest applications of generative AI is understanding the context of a text and answering questions about it. Integrating AI into day-to-day processes will be the norm in the applications of the future, and it is highly probable that AI will find its way into every business application developed using TX Text Control. Applications of artificial intelligence in digital document processing include document analysis, contract summarization, and AI-powered document template generation.

Our role is to provide the interfaces, the typical use cases, and the UI for the integration of AI into the workflow of document processing. That is why we will always provide you with ideas on how to integrate AI into your applications built with TX Text Control.

One of the most interesting applications of AI in document processing is the ability to ask questions about the content of a document. This article shows how to create a generative AI application for PDF documents using TX Text Control and OpenAI functions in C#. The application uses an OpenAI GPT chat model with function calling to answer questions about the content of a PDF document.

TX Text Control provides a powerful API to create, modify, and process PDF documents. The OpenAI GPT models are a powerful tool to understand the context of text and to answer questions about it. Combining both technologies allows developers to create applications that can understand the content of a PDF document and answer questions about it.

Import the PDF

The first step is to import the PDF document into the application. TX Text Control .NET Server for ASP.NET provides a powerful API to import PDF documents. Because each OpenAI model limits how much text a single request can contain, a longer document cannot be sent as a whole. This is why we need to split the PDF document into smaller chunks.

The following method opens a PDF document using the ServerTextControl class, a component in the TXTextControl namespace that provides high-level text processing features for server-based applications, and creates smaller chunks of plain text.

// split a PDF document into chunks
public static List<string> Chunk(byte[] pdfDocument, int chunkSize, int overlap = 1)
{
    // create a new ServerTextControl instance
    using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
    {
        tx.Create();
        var loadSettings = new TXTextControl.LoadSettings
        {
            PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
        };
        // load the PDF document
        tx.Load(pdfDocument, TXTextControl.BinaryStreamType.AdobePDF, loadSettings);
        // remove line breaks
        string pdfText = tx.Text.Replace("\r\n", " ");
        // call the extracted chunk creation method
        return CreateChunks(pdfText, chunkSize, overlap);
    }
}
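// split a text into chunks
private static List<string> CreateChunks(string text, int chunkSize, int overlap)
{
    // the overlap must be smaller than the chunk size, otherwise the loop never advances
    if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize)
    {
        throw new System.ArgumentException("overlap must be non-negative and smaller than chunkSize.");
    }
    List<string> chunks = new List<string>();
    // split the text into chunks
    while (text.Length > chunkSize)
    {
        chunks.Add(text.Substring(0, chunkSize));
        // continue 'overlap' characters before the end of the previous chunk
        text = text.Substring(chunkSize - overlap);
    }
    // add the last chunk
    chunks.Add(text);
    return chunks;
}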

The method accepts a chunk size and an overlap size. The chunk size should not be too small, because the chunk is selected by matching keywords generated from the question that is asked about the PDF document, and a short chunk offers little text to match against. A good value for the chunk size is about 2500 characters with an overlap of 50 characters.
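
To get a feeling for these values, the number of chunks can be estimated up front. The following is a minimal sketch that mirrors the loop in CreateChunks; the helper name EstimateChunkCount and the text length of 33,000 characters are assumptions made for illustration and are not part of the sample application.

```csharp
using System;

// Hypothetical text length, chosen only to illustrate the arithmetic.
Console.WriteLine(EstimateChunkCount(33000, 2500, 50)); // prints 14

// Hypothetical helper: predicts how many chunks the CreateChunks loop produces.
static int EstimateChunkCount(int textLength, int chunkSize, int overlap)
{
    if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize)
    {
        throw new ArgumentException("Require 0 <= overlap < chunkSize.");
    }
    // a text that fits into one chunk is returned as a single chunk
    if (textLength <= chunkSize)
    {
        return 1;
    }
    int step = chunkSize - overlap;          // new characters consumed per full chunk
    int remaining = textLength - chunkSize;  // characters beyond the first chunk
    // 1 + ceil(remaining / step), using integer arithmetic
    return 1 + (remaining + step - 1) / step;
}
```

Each full chunk advances the text by chunkSize minus overlap characters, so a larger overlap increases the number of chunks for the same document.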

Why an overlap?

Keep in mind that important content may sit at the very beginning or end of a chunk, where it is cut off from its surrounding context, so there may not be enough of it to find the right chunk based on keywords. To keep such content visible, the end of one chunk is repeated at the start of the next. In other words, the chunk overlap is the number of characters that adjacent chunks have in common.

Consider the following sample text, which is divided into chunks of 50 characters with an overlap of 10 characters. Chunks this small should not be used in real-world applications because they don't contain enough information to find keyword matches and generate answers from the content. The chunk size is only used here to illustrate the concept visually.

Text chunks with overlap

The area marked in purple will be repeated at the beginning of the next chunk (marked in yellow).
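
The same effect can be reproduced in a few lines. This sketch uses the splitting logic of CreateChunks shown earlier (with an added argument check) on a placeholder sentence that is an assumption for illustration, not the text from the figure. It prints each chunk and checks that it starts with the last 10 characters of the previous one.

```csharp
using System;
using System.Collections.Generic;

// Placeholder text for illustration only (not taken from the sample PDF).
string sample = "TX Text Control imports the PDF document, converts it to plain text, " +
                "and splits that text into overlapping chunks for keyword matching.";

List<string> parts = CreateChunks(sample, 50, 10);

for (int i = 0; i < parts.Count; i++)
{
    Console.WriteLine($"[{i}] \"{parts[i]}\"");
    if (i > 0)
    {
        // the last 10 characters of the previous chunk ...
        string previous = parts[i - 1];
        string tail = previous.Substring(previous.Length - 10);
        // ... are repeated at the start of the current chunk
        bool repeated = parts[i].StartsWith(tail, StringComparison.Ordinal);
        Console.WriteLine($"    starts with \"{tail}\": {repeated}");
    }
}

// Same splitting logic as CreateChunks in the article, plus an argument check.
static List<string> CreateChunks(string text, int chunkSize, int overlap)
{
    if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize)
    {
        throw new ArgumentException("Require 0 <= overlap < chunkSize.");
    }
    var result = new List<string>();
    while (text.Length > chunkSize)
    {
        result.Add(text.Substring(0, chunkSize));
        text = text.Substring(chunkSize - overlap);
    }
    result.Add(text);
    return result;
}
```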

Find Matches

After the PDF document is imported and divided into chunks, the next step is to find the chunks that match the question that is asked. To do this, OpenAI is used to generate keywords and synonyms from the question.

This OpenAI call uses function calling to return a list of keywords from the question. The method also uses the TXTextControl.OpenAI namespace, which has been extended with classes for function definitions and their return values.

public static List<string> GetKeywords(string text, int numKeywords = 10)
{
    // create a list to store the keywords
    List<string> keywords = new List<string>();
    string prompt = $"Create {numKeywords} keywords and synonyms from the following question that can be used to find information in a larger text. Create only 1 word per keyword. Return the keywords in lowercase only. Here is the question: {text}";
    // create a request object
    Request apiRequest = new Request
    {
        Messages = new[]
        {
            new RequestMessage
            {
                Role = "system",
                Content = $"Always provide {numKeywords} keywords that include relevant synonyms of words in the original question."
            },
            new RequestMessage
            {
                Role = "user",
                Content = prompt
            }
        },
        Functions = new[]
        {
            new Function
            {
                Name = "get_keywords",
                Description = "Use this function to give the user a list of keywords.",
                Parameters = new Parameters
                {
                    Type = "object",
                    Properties = new Properties
                    {
                        List = new ListProperty
                        {
                            Type = "array",
                            Items = new Items
                            {
                                Type = "string",
                                Description = "A keyword"
                            },
                            Description = "A list of keywords"
                        }
                    }
                },
                Required = new List<string> { "list" }
            }
        },
        FunctionCall = new FunctionCall
        {
            Name = "get_keywords",
            Arguments = "{'list'}"
        }
    };
    // get the response
    if (GetResponse(apiRequest) is Response response)
    {
        // return the keywords
        return System.Text.Json.JsonSerializer.Deserialize<ListReturnObject>(response.Choices[0].Message.FunctionCall.Arguments).List;
    }
    return null;
}

The method uses the OpenAI model to generate keywords and synonyms from the question, which are then used to count occurrences in the generated chunks. The idea is to select the best match and send only that chunk to OpenAI to generate the answer based on that content.
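
The last step of GetKeywords, turning the function call arguments into a list, can be sketched in isolation. The example below assumes that the arguments arrive as a JSON string with a single "list" array, as described by the function definition above. The class KeywordList and the helper ParseKeywords are hypothetical stand-ins for illustration, and the JSON payload is made up. ListReturnObject from the sample is not shown in this article and may look different.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Made-up example of a JSON string returned as function call arguments.
string arguments = "{\"list\":[\"subcontract\",\"partner\",\"vendor\"]}";

List<string> parsedKeywords = ParseKeywords(arguments);
Console.WriteLine(string.Join(", ", parsedKeywords)); // prints: subcontract, partner, vendor

// Hypothetical helper: returns an empty list instead of throwing on bad input.
static List<string> ParseKeywords(string json)
{
    if (string.IsNullOrWhiteSpace(json))
    {
        return new List<string>();
    }
    try
    {
        var parsed = JsonSerializer.Deserialize<KeywordList>(json);
        return parsed?.List ?? new List<string>();
    }
    catch (JsonException)
    {
        // the model returned something that is not valid JSON for this shape
        return new List<string>();
    }
}

// Hypothetical stand-in for the ListReturnObject class used in the article.
public class KeywordList
{
    [JsonPropertyName("list")]
    public List<string> List { get; set; } = new List<string>();
}
```

Returning an empty list rather than null is a design choice of this sketch. It lets the calling code iterate over the result without an additional null check.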

The FindMatches method returns a dictionary that maps each chunk id to a relevance score, sorted by relevance in descending order. For every keyword, a chunk receives the fraction of that keyword's total occurrences (across all chunks) that fall within the chunk, and these fractions are added up over all keywords.

// find matches in a list of chunks
public static Dictionary<int, double> FindMatches(List<string> chunks, List<string> keywords, int padding = 500)
{
    // create a dictionary to store the total number of occurrences of each keyword
    Dictionary<string, int> df = new Dictionary<string, int>();
    // create a dictionary to store the results
    Dictionary<int, double> results = new Dictionary<int, double>();
    // create a list to store the trimmed chunks
    List<string> trimmedChunks = new List<string>();
    // loop through the chunks
    for (int i = 0; i < chunks.Count; i++)
    {
        string chunk = chunks[i];
        // trim the padding from the start of every chunk except the first;
        // Math.Min avoids an exception when a chunk is shorter than the padding
        if (i != 0)
        {
            chunk = chunk.Substring(System.Math.Min(padding, chunk.Length));
        }
        // trim the padding from the end of every chunk except the last
        if (i != chunks.Count - 1)
        {
            chunk = chunk.Substring(0, System.Math.Max(0, chunk.Length - padding));
        }
        trimmedChunks.Add(chunk.ToLower());
    }
    // loop through the trimmed chunks
    foreach (string chunk in trimmedChunks)
    {
        // loop through the keywords
        foreach (string keyword in keywords)
        {
            // count the occurrences of the keyword in the chunk
            // (CountSubstring is not a built-in .NET method; it must be available as a string extension method)
            int occurrences = chunk.CountSubstring(keyword);
            // add the keyword to the dictionary
            if (!df.ContainsKey(keyword))
            {
                df[keyword] = 0;
            }
            // add the occurrences to the keyword's total
            df[keyword] += occurrences;
        }
    }
    // loop through the trimmed chunks
    for (int chunkId = 0; chunkId < trimmedChunks.Count; chunkId++)
    {
        // initialize the points
        double points = 0;
        // loop through the keywords
        foreach (string keyword in keywords)
        {
            // count the occurrences of the keyword in the chunk
            int occurrences = trimmedChunks[chunkId].CountSubstring(keyword);
            // calculate the points
            if (df[keyword] > 0)
            {
                // add the share of this keyword's occurrences found in this chunk
                points += occurrences / (double)df[keyword];
            }
        }
        // add the points to the results
        results[chunkId] = points;
    }
    // return the results sorted by points
    // (note: Dictionary<TKey, TValue> does not formally guarantee enumeration order)
    return results.OrderByDescending(x => x.Value).ToDictionary(x => x.Key, x => x.Value);
}
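
FindMatches relies on a CountSubstring extension method that is not listed in this article. A minimal sketch of such a method could look like the following. It assumes ordinal, non-overlapping matching, and the implementation in the sample application may differ.

```csharp
using System;

// Made-up chunk text for illustration.
string chunkText = "the contractor may not subcontract; any subcontract needs approval";

Console.WriteLine(chunkText.CountSubstring("subcontract")); // prints 2
Console.WriteLine(chunkText.CountSubstring("contract"));    // prints 3
Console.WriteLine(chunkText.CountSubstring("dispute"));     // prints 0

// Assumed implementation, not taken from the sample application.
public static class StringExtensions
{
    // Counts non-overlapping occurrences of value in text (ordinal, case-sensitive).
    public static int CountSubstring(this string text, string value)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(value))
        {
            return 0;
        }
        int count = 0;
        int index = 0;
        // search forward from the end of the previous match
        while ((index = text.IndexOf(value, index, StringComparison.Ordinal)) >= 0)
        {
            count++;
            index += value.Length;
        }
        return count;
    }
}
```

Because this is plain substring matching, the keyword "contract" is also counted inside "contractor" and "subcontract". Whether that is desirable depends on the document. Matching on word boundaries would be a stricter alternative.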

Generate the Answer

After the best match is found, the next step is to generate the answer based on the content of the chunk. The chunk will be sent to OpenAI along with the question and the prompt that follows:

$"```{chunk}```Your source is the information above. What is the answer to the following question? ```{question}```"

The complete method GetAnswer is shown in the code below.

public static string GetAnswer(string chunk, string question)
{
    // create a prompt
    string prompt = $"```{chunk}```Your source is the information above. What is the answer to the following question? ```{question}```";
    // create a request object
    Request apiRequest = new Request
    {
        Messages = new[]
        {
            new RequestMessage
            {
                Role = "system",
                Content = "You should help to find an answer to a question in a document."
            },
            new RequestMessage
            {
                Role = "user",
                Content = prompt
            }
        }
    };
    // get the response
    if (GetResponse(apiRequest) is Response response)
    {
        // return the answer
        return response.Choices[0].Message.Content;
    }
    // return null if the response is null
    return null;
}

Running the Application

The application is a simple .NET console app that uses TX Text Control .NET Server for ASP.NET to import the PDF document and then displays the answer generated by OpenAI.

string question = "Is contracting with other partners an option?";
//string question = "How will disputes be dealt with?";
//string question = "Can the agreement be changed or modified?";
string pdfPath = "Sample PDFs/SampleContract-Shuttle.pdf";
// load the PDF file
byte[] pdfDocument = File.ReadAllBytes(pdfPath);
// split the PDF document into chunks
var chunks = DocumentProcessing.Chunk(pdfDocument, 2500, 50);
Console.WriteLine($"{chunks.Count} chunks generated from: {pdfPath}");
// get the keywords
List<string> generatedKeywords = GPTHelper.GetKeywords(question, 20);
// find the matches and take the chunk with the highest relevance
var matches = DocumentProcessing.FindMatches(chunks, generatedKeywords).First();
// print the best match
Console.WriteLine($"The question: \"{question}\" was found in chunk {matches.Key}.");
// print the answer
Console.WriteLine("\r\n********\r\n" + GPTHelper.GetAnswer(chunks[matches.Key], question));

A sample output is shown in the following console:

14 chunks generated from: Sample PDFs/SampleContract-Shuttle.pdf
The question: "Is contracting with other partners an option?" was found in chunk 11.

********
No, contracting with other partners is not an option unless prior approval is obtained from the COMMISSION'S Contract Manager. The document specifies that subcontracting work under this Agreement is not allowed without prior written authorization, except for those identified in the approved Fee Schedule. Subcontracts over $25,000 must include the necessary provisions from the main Agreement and must be approved in writing by the COMMISSION'S Contract Manager.

The answer is generated based on the content of the chunk that contains the best match for the question, so it is grounded in the text of the PDF document.
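
The console code above always takes the first entry returned by FindMatches, even when no keyword occurs in any chunk and all scores are zero. A more defensive selection could be sketched as follows. The helper name SelectBestChunk and the scores are assumptions for illustration and are not part of the sample application.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative scores (chunk id -> relevance); real values come from FindMatches.
var scores = new Dictionary<int, double> { [0] = 0.0, [3] = 0.6, [11] = 1.4 };

int? best = SelectBestChunk(scores);
Console.WriteLine(best.HasValue
    ? $"Best chunk: {best.Value}"
    : "No chunk contains any of the keywords."); // prints: Best chunk: 11

// Hypothetical helper: returns the id of the highest-scoring chunk,
// or null when there are no scores or no chunk scored above zero.
static int? SelectBestChunk(Dictionary<int, double> chunkScores)
{
    if (chunkScores == null || chunkScores.Count == 0)
    {
        return null;
    }
    KeyValuePair<int, double> top = chunkScores.OrderByDescending(s => s.Value).First();
    return top.Value > 0 ? top.Key : (int?)null;
}
```

When the helper returns null, the application could skip the GetAnswer call and report that the document does not appear to cover the question.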

Conclusion

This article shows how to create a generative AI application for PDF documents using TX Text Control and OpenAI functions in .NET C#. The application uses an OpenAI GPT chat model with function calling to answer questions about the content of a PDF document.

Test it yourself by downloading the sample application from our GitHub repository.