Chat PDF - A Generative AI Application for PDF Documents using TX Text Control and OpenAI Functions in C#

One of the strongest applications of generative AI is in understanding the context of text and being able to answer questions about it. The integration of AI into day-to-day processes will be the norm in the applications of the future. It is highly probable that AI will be integrated into every business application developed using TX Text Control. Artificial intelligence applications in digital document processing include document analysis, contract summary, and AI-powered document template generation.

Our role is to provide the interfaces, the typical use cases, and the UI for the integration of AI into the workflow of document processing. That is why we will always provide you with ideas on how to integrate AI into your applications built with TX Text Control.

One of the most interesting applications of AI in document processing is the ability to ask questions about the content of a document. This article shows how to create a generative AI application for PDF documents using TX Text Control and OpenAI functions in C#. The application uses the OpenAI GPT-3 engine to answer questions on the content of a PDF document.

TX Text Control provides a powerful API to create, modify, and process PDF documents. The OpenAI GPT-3 engine is a powerful tool to understand the context of text and to answer questions about it. The integration of both technologies allows developers to create powerful applications that can understand the content of a PDF document and answer questions about it.

Import the PDF

The first step is to import the PDF document into the application. TX Text Control .NET Server for ASP.NET provides a powerful API to import PDF documents. Due to the nature of the OpenAI calls, the length of the content is limited based on the model that is used. This is why we need to create smaller chunks of the PDF document.

The following method opens a PDF document using the ServerTextControl class and creates smaller chunks of plain text.

	// split a PDF document into chunks
	public static List<string> Chunk(byte[] pdfDocument, int chunkSize, int overlap = 1)
	{
		// create a new ServerTextControl instance
		using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
		{
			tx.Create();

			var loadSettings = new TXTextControl.LoadSettings
			{
				PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
			};

			// load the PDF document
			tx.Load(pdfDocument, TXTextControl.BinaryStreamType.AdobePDF, loadSettings);

			// remove line breaks
			string pdfText = tx.Text.Replace("\r\n", " ");

			// call the extracted chunk creation method
			return CreateChunks(pdfText, chunkSize, overlap);
		}
	}


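	// split a text into chunks
	private static List<string> CreateChunks(string text, int chunkSize, int overlap)
	{
		// an overlap that is not smaller than the chunk size would never advance through the text
		if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize)
		{
			throw new ArgumentException("The overlap must be non-negative and smaller than the chunk size.");
		}

		List<string> chunks = new List<string>();

		// split the text into chunks
		while (text.Length > chunkSize)
		{
			chunks.Add(text.Substring(0, chunkSize));

			// continue with the remaining text, repeating the last 'overlap' characters
			text = text.Substring(chunkSize - overlap);
		}

		// add the last chunk
		chunks.Add(text);

		return chunks;
	}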


The method accepts a chunk size and an overlap size. The chunk size should not be too small, because the relevant chunk is selected by matching keywords generated from the question that is asked about the PDF document, and a short chunk offers little text to match against. A good starting point is a chunk size of about 2500 characters with an overlap of 50 characters.

Why an overlap?

Keep in mind that important content may sit at the very beginning or end of a chunk, where a passage can be cut in two so that neither part contains enough keywords to identify the right chunk. To avoid this, the end of each chunk is repeated at the start of the next one. The chunk overlap is the number of characters that adjacent chunks have in common.

Consider the following sample text, which is divided into chunks of 50 characters with an overlap of 10 characters. Chunks this small should not be used in real-world applications because they don't contain enough information to find keyword matches and generate answers from the content. The size is chosen only to illustrate the concept.

Text chunks with overlap

The area marked in purple will be repeated at the beginning of the next chunk (marked in yellow).
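
The same effect can be checked in code. The following is a minimal sketch, not part of the sample application: the class name ChunkOverlapDemo and the helper NeighborsShareOverlap are hypothetical, and the splitting logic is repeated from CreateChunks only so that the snippet is self-contained.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class for illustration; not part of the sample application.
public static class ChunkOverlapDemo
{
    // Same splitting logic as CreateChunks, repeated so the sketch is self-contained.
    public static List<string> CreateChunks(string text, int chunkSize, int overlap)
    {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize)
            throw new ArgumentException("The overlap must be non-negative and smaller than the chunk size.");

        var chunks = new List<string>();
        while (text.Length > chunkSize)
        {
            chunks.Add(text.Substring(0, chunkSize));
            // Step forward by (chunkSize - overlap) so the tail is repeated.
            text = text.Substring(chunkSize - overlap);
        }
        chunks.Add(text);
        return chunks;
    }

    // Returns true if every chunk starts with the last 'overlap' characters
    // of the chunk before it.
    public static bool NeighborsShareOverlap(List<string> chunks, int overlap)
    {
        for (int i = 1; i < chunks.Count; i++)
        {
            string previous = chunks[i - 1];
            string tail = previous.Substring(previous.Length - overlap);
            if (!chunks[i].StartsWith(tail, StringComparison.Ordinal))
                return false;
        }
        return true;
    }
}
```

For a text of 130 characters, a chunk size of 50 and an overlap of 10, CreateChunks produces three chunks that start at offsets 0, 40 and 80 of the original text, and NeighborsShareOverlap confirms that each chunk begins with the last 10 characters of its predecessor.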

Find Matches

After the PDF document has been imported and divided into chunks, the next step is to find the chunks that match the question. To do this, OpenAI is used to generate keywords and synonyms from the question.

This OpenAI call uses function calling to return a list of keywords from the question. This method also uses the TXTextControl.OpenAI namespace, which has been extended with functions and return values for them.

	public static List<string> GetKeywords(string text, int numKeywords = 10)
	{
		// create a list to store the keywords
		List<string> keywords = new List<string>();

		string prompt = $"Create {numKeywords} keywords and synonyms from the following question that can be used to find information in a larger text. Create only 1 word per keyword. Return the keywords in lowercase only. Here is the question: {text}";

		// create a request object
		Request apiRequest = new Request
		{
			Messages = new[]
			{
				new RequestMessage
				{
					Role = "system",
					Content = $"Always provide {numKeywords} keywords that include relevant synonyms of words in the original question."
				},
				new RequestMessage
				{
					Role = "user",
					Content = prompt
				}
			},
			Functions = new[]
			{
				new Function
				{
					Name = "get_keywords",
					Description = "Use this function to give the user a list of keywords.",
					Parameters = new Parameters
					{
						Type = "object",
						Properties = new Properties
						{
							List = new ListProperty
							{
								Type = "array",
								Items = new Items
								{
									Type = "string",
									Description = "A keyword"
								},
								Description = "A list of keywords"
							}
						}
					},
					Required = new List<string> { "list" }
				}
			},
			// force the model to call the get_keywords function
			FunctionCall = new FunctionCall
			{
				Name = "get_keywords",
				Arguments = "{'list'}"
			}
		};

		// get the response
		if (GetResponse(apiRequest) is Response response)
		{
			// the function call arguments are a JSON string that contains the keyword list
			return System.Text.Json.JsonSerializer.Deserialize<ListReturnObject>(response.Choices[0].Message.FunctionCall.Arguments).List;
		}

		// return the empty list if no response was received, so callers can iterate safely
		return keywords;
	}
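
The last step of GetKeywords deserializes the function call arguments into a ListReturnObject, a class that is not shown in this article. The following is a minimal sketch of what such a class and a defensive parsing step could look like. It assumes that the arguments arrive as a JSON object with a single "list" array, as requested in the function definition above. The property mapping and the KeywordParser helper are assumptions for illustration, not the code of the sample application.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Assumed shape of the return object; the class used by the sample is not shown in the article.
public class ListReturnObject
{
    [JsonPropertyName("list")]
    public List<string> List { get; set; } = new List<string>();
}

// Hypothetical helper for illustration.
public static class KeywordParser
{
    // Parses the JSON string returned as function call arguments,
    // for example {"list":["contract","partner"]}.
    public static List<string> Parse(string argumentsJson)
    {
        if (string.IsNullOrWhiteSpace(argumentsJson))
            return new List<string>();

        try
        {
            ListReturnObject result = JsonSerializer.Deserialize<ListReturnObject>(argumentsJson);
            return result?.List ?? new List<string>();
        }
        catch (JsonException)
        {
            // The model is not guaranteed to return valid JSON.
            return new List<string>();
        }
    }
}
```

Returning an empty list instead of throwing keeps the matching step usable even when the model returns malformed arguments. Whether that is preferable to surfacing the error depends on the application.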


The method uses the OpenAI GPT-3 engine to generate keywords and synonyms from the question, which are then counted in the generated chunks. The idea is to select the best match and send only that chunk to OpenAI to generate the answer based on that content.

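The FindMatches method returns a dictionary that maps each chunk id to its relevance score, ordered from the most relevant to the least relevant chunk. For each keyword, a chunk receives the share of that keyword's total occurrences across all chunks that fall inside the chunk, and the score of the chunk is the sum of these shares over all keywords.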

	// find matches in a list of chunks
	public static Dictionary<int, double> FindMatches(List<string> chunks, List<string> keywords, int padding = 500)
	{
		// create a dictionary to store the total number of occurrences of each keyword
		Dictionary<string, int> df = new Dictionary<string, int>();

		// create a dictionary to store the results
		Dictionary<int, double> results = new Dictionary<int, double>();

		// create a list to store the trimmed chunks
		List<string> trimmedChunks = new List<string>();

		// loop through the chunks
		for (int i = 0; i < chunks.Count; i++)
		{
			string chunk = chunks[i];

			// remove the padding from the start of every chunk except the first
			if (i != 0)
			{
				chunk = chunk.Substring(Math.Min(padding, chunk.Length));
			}

			// remove the padding from the end of every chunk except the last
			if (i != chunks.Count - 1)
			{
				chunk = chunk.Substring(0, Math.Max(0, chunk.Length - padding));
			}

			trimmedChunks.Add(chunk.ToLower());
		}

		// loop through the trimmed chunks
		foreach (string chunk in trimmedChunks)
		{
			// loop through the keywords
			foreach (string keyword in keywords)
			{
				// count the occurrences of the keyword in the chunk
				// (CountSubstring is a custom string extension method, not part of .NET)
				int occurrences = chunk.CountSubstring(keyword);

				// add the keyword to the dictionary
				if (!df.ContainsKey(keyword))
				{
					df[keyword] = 0;
				}

				// add the occurrences to the total for this keyword
				df[keyword] += occurrences;
			}
		}

		// loop through the trimmed chunks
		for (int chunkId = 0; chunkId < trimmedChunks.Count; chunkId++)
		{
			// initialize the points
			double points = 0;

			// loop through the keywords
			foreach (string keyword in keywords)
			{
				// count the occurrences of the keyword in the chunk
				int occurrences = trimmedChunks[chunkId].CountSubstring(keyword);

				// skip keywords that do not occur in any chunk
				if (df[keyword] > 0)
				{
					// add the share of this keyword's occurrences found in this chunk
					points += occurrences / (double)df[keyword];
				}
			}

			// add the points to the results
			results[chunkId] = points;
		}

		// return the results sorted by points
		return results.OrderByDescending(x => x.Value).ToDictionary(x => x.Key, x => x.Value);
	}
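
FindMatches relies on a CountSubstring extension method that is not shown in this article. The following is a minimal sketch of such a method, assuming that it counts non-overlapping, case-sensitive occurrences. The class name StringExtensions and the exact matching rules are assumptions, and the implementation in the sample application may differ.

```csharp
using System;

// Hypothetical extension class for illustration.
public static class StringExtensions
{
    // Counts non-overlapping, case-sensitive (ordinal) occurrences of value in text.
    public static int CountSubstring(this string text, string value)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(value))
            return 0;

        int count = 0;
        int index = 0;

        // Search again from the end of the previous match until nothing is found.
        while ((index = text.IndexOf(value, index, StringComparison.Ordinal)) >= 0)
        {
            count++;
            index += value.Length;
        }

        return count;
    }
}
```

With this counting rule, a keyword that occurs three times in chunk 0 and once in chunk 1 contributes 3/4 = 0.75 points to chunk 0 and 1/4 = 0.25 points to chunk 1. Because the method matches substrings, the keyword "contract" is also counted inside "subcontract". Matching on word boundaries would be a stricter alternative.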


Generate the Answer

After the best match is found, the next step is to generate the answer based on the content of the chunk. The chunk will be sent to OpenAI along with the question and the prompt that follows:

$"```{chunk}```Your source is the information above. What is the answer to the following question? ```{question}```"


The complete method GetAnswer is shown in the code below.

	public static string GetAnswer(string chunk, string question)
	{
		// create a prompt
		string prompt = $"```{chunk}```Your source is the information above. What is the answer to the following question? ```{question}```";

		// create a request object
		Request apiRequest = new Request
		{
			Messages = new[]
			{
				new RequestMessage
				{
					Role = "system",
					Content = "You should help to find an answer to a question in a document."
				},
				new RequestMessage
				{
					Role = "user",
					Content = prompt
				}
			}
		};

		// get the response
		if (GetResponse(apiRequest) is Response response)
		{
			// return the answer
			return response.Choices[0].Message.Content;
		}

		// return null if the response is null
		return null;
	}


Running the Application

The application is a simple .NET Console App that uses TX Text Control .NET Server for ASP.NET to import the PDF document and prints the answer generated by OpenAI to the console.

	string question = "Is contracting with other partners an option?";
	//string question = "How will disputes be dealt with?";
	//string question = "Can the agreement be changed or modified?";

	string pdfPath = "Sample PDFs/SampleContract-Shuttle.pdf";

	// load the PDF file
	byte[] pdfDocument = File.ReadAllBytes(pdfPath);

	// split the PDF document into chunks
	var chunks = DocumentProcessing.Chunk(pdfDocument, 2500, 50);

	Console.WriteLine($"{chunks.Count} chunks generated from: {pdfPath}");

	// get the keywords
	List<string> generatedKeywords = GPTHelper.GetKeywords(question, 20);

	// find the matches and take the chunk with the highest score
	var bestMatch = DocumentProcessing.FindMatches(chunks, generatedKeywords).First();

	// print the best match
	Console.WriteLine($"The question: \"{question}\" was found in chunk {bestMatch.Key}.");

	// print the answer
	Console.WriteLine("\r\n********\r\n" + GPTHelper.GetAnswer(chunks[bestMatch.Key], question));


A sample output is shown in the following console:

14 chunks generated from: Sample PDFs/SampleContract-Shuttle.pdf
The question: "Is contracting with other partners an option?" was found in chunk 11.

********
No, contracting with other partners is not an option unless prior approval is obtained from the COMMISSION'S Contract Manager. The document specifies that subcontracting work under this Agreement is not allowed without prior written authorization, except for those identified in the approved Fee Schedule. Subcontracts over $25,000 must include the necessary provisions from the main Agreement and must be approved in writing by the COMMISSION'S Contract Manager.

The answer is generated based on the content of the chunk that contains the best match for the question. It is grounded in the text of that chunk, although the model may paraphrase or summarize the wording rather than quote it directly.

Conclusion

This article shows how to create a generative AI application for PDF documents using TX Text Control and OpenAI functions in .NET C#. The application uses the OpenAI GPT-3 engine to answer questions on the content of a PDF document.

Test it yourself by downloading the sample application from our GitHub repository.
