Chat PDF - A Generative AI Application for PDF Documents using TX Text Control and OpenAI Functions in C#

Bjoern Meyer

February 23, 2024

This article shows how to create a generative AI application for PDF documents using TX Text Control and OpenAI functions in C#. The application uses the OpenAI GPT-3 engine to answer questions on the content of a PDF document.

One of the strongest applications of generative AI is in understanding the context of text and being able to answer questions about it. The integration of AI into day-to-day processes will be the norm in the applications of the future. It is highly probable that AI will be integrated into every business application developed using TX Text Control. Artificial intelligence applications in digital document processing include document analysis, contract summary, and AI-powered document template generation.

Our role is to provide the interfaces, the typical use cases, and the UI for the integration of AI into the workflow of document processing. That is why we will always provide you with ideas on how to integrate AI into your applications built with TX Text Control.

One of the most interesting applications of AI in document processing is the ability to ask questions about the content of a document.

TX Text Control provides a powerful API to create, modify, and process PDF documents. The OpenAI GPT-3 engine is a powerful tool to understand the context of text and to answer questions about it. The integration of both technologies allows developers to create powerful applications that can understand the content of a PDF document and answer questions about it.

Import the PDF

The first step is to import the PDF document into the application. TX Text Control .NET Server provides a powerful API to import PDF documents. Because every OpenAI model limits the amount of text that can be sent in a single request, the content of a longer PDF document cannot be sent as a whole. This is why we need to split the PDF document into smaller chunks.

The following method opens a PDF document using the ServerTextControl class and creates smaller chunks of plain text.

// split a PDF document into chunks
  public static List<string> Chunk(byte[] pdfDocument, int chunkSize, int overlap = 1)
  {
    // create a new ServerTextControl instance
    using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
    {
      tx.Create();
  
      var loadSettings = new TXTextControl.LoadSettings
      {
        PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
      };
  
      // load the PDF document
      tx.Load(pdfDocument, TXTextControl.BinaryStreamType.AdobePDF, loadSettings);
  
      // remove line breaks
      string pdfText = tx.Text.Replace("\r\n", " ");
  
      // call the extracted chunk creation method
      return CreateChunks(pdfText, chunkSize, overlap);
    }
  }

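// split a text into chunks
  private static List<string> CreateChunks(string text, int chunkSize, int overlap)
  {
    // the overlap must be smaller than the chunk size; otherwise the loop never advances
    if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize)
    {
      throw new ArgumentOutOfRangeException(nameof(overlap), "The overlap must be non-negative and smaller than the chunk size.");
    }
  
    List<string> chunks = new List<string>();
  
    // cut off one chunk at a time and keep the last 'overlap' characters for the next chunk
    while (text.Length > chunkSize)
    {
      chunks.Add(text.Substring(0, chunkSize));
      text = text.Substring(chunkSize - overlap);
    }
  
    // add the last chunk
    chunks.Add(text);
  
    return chunks;
  }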

The method accepts a chunk size and an overlap size. The chunk size should not be too small, because the relevant chunk is selected by matching keywords that are generated from the question asked about the PDF document. A good starting point is a chunk size of about 2500 characters with an overlap of 50 characters.

Why an overlap?

Important content may sit at the very beginning or end of a chunk, where a passage can be cut in two so that neither chunk contains enough of it to be found by keywords. To avoid this, the end of each chunk is repeated at the beginning of the next chunk. The chunk overlap is the number of characters that adjacent chunks have in common.

Consider the following sample text, which is divided into chunks of 50 characters with an overlap of 10 characters. Chunks this small should not be used in real-world applications because they don't contain enough information to find keyword matches and generate answers from the content. The small size is used here only to illustrate the concept.

Text chunks with overlap

The area marked in purple will be repeated at the beginning of the next chunk (marked in yellow).
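
The overlap can also be expressed in terms of offsets: adjacent chunks start chunkSize - overlap characters apart. The following is a minimal sketch of that view, not part of the sample project. The class name ChunkOffsets and the method Compute are hypothetical; the method is intended to produce the same chunk boundaries as the CreateChunks method shown earlier.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the sample project): computes the
// start offset and length of every chunk for a given text length.
public static class ChunkOffsets
{
    public static List<(int Start, int Length)> Compute(int textLength, int chunkSize, int overlap)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0 || overlap >= chunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        // adjacent chunks start this many characters apart
        int stride = chunkSize - overlap;

        var offsets = new List<(int Start, int Length)>();
        int start = 0;

        // full-size chunks as long as more than one chunk of text remains
        while (textLength - start > chunkSize)
        {
            offsets.Add((start, chunkSize));
            start += stride;
        }

        // the last chunk contains the remaining characters
        offsets.Add((start, textLength - start));
        return offsets;
    }
}
```

For a text of 26 characters with a chunk size of 10 and an overlap of 3, the chunks start at offsets 0, 7, 14 and 21. Applied to the lowercase alphabet, this yields "abcdefghij", "hijklmnopq", "opqrstuvwx" and "vwxyz": the last three characters of each chunk are repeated at the beginning of the next one.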

Find Matches

After the PDF document is imported and divided into chunks, the next step is to find the chunks that match the question. For this purpose, OpenAI is used to generate keywords and synonyms from the question.

This OpenAI call uses function calling to return a list of keywords from the question. This method also uses the TXTextControl.OpenAI namespace, which has been extended with functions and return values for them.

public static List<string> GetKeywords(string text, int numKeywords = 10)
{
  // create a list to store the keywords
  List<string> keywords = new List<string>();

  string prompt = $"Create {numKeywords} keywords and synonyms from the following question that can be used to find information in a larger text. Create only 1 word per keyword. Return the keywords in lowercase only. Here is the question: {text}";

  // create a request object
  Request apiRequest = new Request
  {
    Messages = new[]
    {
      new RequestMessage
      {
        Role = "system",
        Content = $"Always provide {numKeywords} keywords that include relevant synonyms of words in the original question."
      },
      new RequestMessage
      {
        Role = "user",
        Content = prompt
      }
    },
    Functions = new[]
    {
      new Function
      {
        Name = "get_keywords",
        Description = "Use this function to give the user a list of keywords.",
        Parameters = new Parameters
        {
          Type = "object",
          Properties = new Properties
          {
            List = new ListProperty
            {
              Type = "array",
              Items = new Items
              {
                Type = "string",
                Description = "A keyword"
              },
              Description = "A list of keywords"

            }
          }
          
        },
        Required = new List<string> { "list" }
      }
    },
    FunctionCall = new FunctionCall
    {
      Name = "get_keywords",
      Arguments = "{'list'}"
    }
  };

  // get the response
  if (GetResponse(apiRequest) is Response response)
  {
    // return the keywords
    return System.Text.Json.JsonSerializer.Deserialize<ListReturnObject>(response.Choices[0].Message.FunctionCall.Arguments).List;
  }

  return null;
}

The method uses the OpenAI GPT-3 engine to generate keywords and synonyms from the question, which are then searched for in the generated chunks. The idea is to select the best match and send only that chunk to OpenAI to generate the answer based on that content.
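
GetKeywords deserializes the function call arguments into a ListReturnObject, a class that is not listed in this article. With function calling, the arguments are returned as a JSON string, for example {"list":["contract","partner"]}. The following sketch shows one way such a class and a defensive parsing step could look. The shape of ListReturnObject is an assumption derived from the "list" property in the function definition, and KeywordParser is a hypothetical helper; the classes in the sample project may differ.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Assumed shape of the return object: a single "list" property,
// matching the parameter name in the get_keywords function definition.
public class ListReturnObject
{
    [JsonPropertyName("list")]
    public List<string> List { get; set; } = new List<string>();
}

// Hypothetical helper that never returns null and never throws
// on malformed arguments.
public static class KeywordParser
{
    public static List<string> Parse(string functionCallArguments)
    {
        if (string.IsNullOrWhiteSpace(functionCallArguments))
        {
            return new List<string>();
        }

        try
        {
            var result = JsonSerializer.Deserialize<ListReturnObject>(functionCallArguments);
            return result?.List ?? new List<string>();
        }
        catch (JsonException)
        {
            // the model returned something that is not valid JSON
            return new List<string>();
        }
    }
}
```

Returning an empty list instead of null lets the calling code pass the result to FindMatches without an additional null check.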

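The FindMatches method returns a dictionary that maps each chunk id to a relevance score, sorted by descending relevance. For each keyword, a chunk receives the number of occurrences of the keyword in that chunk divided by the total number of occurrences of the keyword in all chunks. The relevance of a chunk is the sum of these values over all keywords.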

// find matches in a list of chunks
public static Dictionary<int, double> FindMatches(List<string> chunks, List<string> keywords, int padding = 500)
{
  // create a dictionary to store the total number of occurrences of each keyword in all chunks
  Dictionary<string, int> df = new Dictionary<string, int>();

  // create a dictionary to store the results
  Dictionary<int, double> results = new Dictionary<int, double>();

  // create a list to store the trimmed chunks
  List<string> trimmedChunks = new List<string>();

  // loop through the chunks
  for (int i = 0; i < chunks.Count; i++)
  {
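    // remove the leading padding from every chunk except the first and the
    // trailing padding from every chunk except the last; the padding is
    // clamped so that chunks shorter than the padding do not throw
    string chunk = i != 0 ? chunks[i].Substring(Math.Min(padding, chunks[i].Length)) : chunks[i];
    chunk = i != chunks.Count - 1 ? chunk.Substring(0, Math.Max(0, chunk.Length - padding)) : chunk;
    trimmedChunks.Add(chunk.ToLower());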
  }

  // loop through the trimmed chunks
  foreach (string chunk in trimmedChunks)
  {
    // loop through the keywords
    foreach (string keyword in keywords)
    {
      // count the occurrences of the keyword in the chunk
      int occurrences = chunk.CountSubstring(keyword);
      
      // add the keyword to the dictionary of total occurrences
      if (!df.ContainsKey(keyword))
      {
        df[keyword] = 0;
      }
      
      // add the occurrences in this chunk to the total
      df[keyword] += occurrences;
    }
  }

  // loop through the trimmed chunks
  for (int chunkId = 0; chunkId < trimmedChunks.Count; chunkId++)
  {
    // initialize the points
    double points = 0;

    // loop through the keywords
    foreach (string keyword in keywords)
    {
      // count the occurrences of the keyword in the chunk
      int occurrences = trimmedChunks[chunkId].CountSubstring(keyword);
      
      // calculate the points
      if (df[keyword] > 0)
      {
        // add the points
        points += occurrences / (double)df[keyword];
      }
    }
    // add the points to the results
    results[chunkId] = points;
  }

  // return the results sorted by points
  return results.OrderByDescending(x => x.Value).ToDictionary(x => x.Key, x => x.Value);
}
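
FindMatches relies on a CountSubstring extension method that is not listed in this article. The following is a minimal sketch of how such a method could look, assuming non-overlapping, case-sensitive counting. The implementation in the sample project may differ.

```csharp
using System;

public static class StringExtensions
{
    // Hypothetical implementation: counts the non-overlapping occurrences
    // of 'value' in 'text' using an ordinal (case-sensitive) comparison.
    public static int CountSubstring(this string text, string value)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(value))
        {
            return 0;
        }

        int count = 0;
        int index = 0;

        // continue the search directly after each match
        while ((index = text.IndexOf(value, index, StringComparison.Ordinal)) != -1)
        {
            count++;
            index += value.Length;
        }

        return count;
    }
}
```

Because this sketch counts substrings, the keyword "contract" is also counted inside "subcontract". As an example of the scoring in FindMatches: if the keyword occurs once in chunk 0 and three times in chunk 1, chunk 0 receives 1/4 = 0.25 points and chunk 1 receives 3/4 = 0.75 points for that keyword.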

Generate the Answer

After the best match is found, the next step is to generate the answer based on the content of the chunk. The chunk will be sent to OpenAI along with the question and the prompt that follows:

$"```{chunk}```Your source is the information above. What is the answer to the following question? ```{question}```"

The complete method GetAnswer is shown in the code below.

public static string GetAnswer(string chunk, string question)
{
  // create a prompt
  string prompt = $"```{chunk}```Your source is the information above. What is the answer to the following question? ```{question}```";

  // create a request object
  Request apiRequest = new Request
  {
    Messages = new[]
    {
      new RequestMessage
      {
         Role = "system",
         Content = "You should help to find an answer to a question in a document."
      },
      new RequestMessage
      {
         Role = "user",
         Content = prompt
      }
    }
  };

  // get the response
  if (GetResponse(apiRequest) is Response response)
  {
    // return the answer
    return response.Choices[0].Message.Content;
  }

  // return null if the response is null
  return null;
}
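
Both GetKeywords and GetAnswer call a GetResponse helper that is not listed in this article. The following sketch only shows how the HTTP request for the OpenAI chat completions endpoint could be constructed, assuming the Request object has already been serialized to a JSON string. The class name OpenAIRequestBuilder and the method Build are hypothetical and not part of the sample project.

```csharp
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical helper: builds the HTTP request for a chat completion call.
public static class OpenAIRequestBuilder
{
    public const string Endpoint = "https://api.openai.com/v1/chat/completions";

    public static HttpRequestMessage Build(string apiKey, string jsonBody)
    {
        var request = new HttpRequestMessage(HttpMethod.Post, Endpoint);

        // the API key is sent as a bearer token
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);

        // the serialized request object is sent as the JSON body
        request.Content = new StringContent(jsonBody, Encoding.UTF8, "application/json");

        return request;
    }
}
```

The resulting message can be sent with HttpClient.SendAsync, and the response body can then be deserialized into the Response type used by GetKeywords and GetAnswer.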

Running the Application

The application is a simple .NET Console App that uses the TX Text Control .NET Server to import the PDF document and to display the answer generated by OpenAI.

string question = "Is contracting with other partners an option?";
//string question = "How will disputes be dealt with?";
//string question = "Can the agreement be changed or modified?";

string pdfPath = "Sample PDFs/SampleContract-Shuttle.pdf";

// load the PDF file
byte[] pdfDocument = File.ReadAllBytes(pdfPath);

// split the PDF document into chunks
var chunks = DocumentProcessing.Chunk(pdfDocument, 2500, 50);

Console.WriteLine($"{chunks.Count.ToString()} chunks generated from: {pdfPath}");

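// get the keywords
List<string> generatedKeywords = GPTHelper.GetKeywords(question, 20);

// GetKeywords returns null if no response was received
if (generatedKeywords == null)
{
  Console.WriteLine("No keywords could be generated for the question.");
  return;
}

// find the best match (the first entry has the highest relevance)
var bestMatch = DocumentProcessing.FindMatches(chunks, generatedKeywords).First();

// print the best match
Console.WriteLine($"The question: \"{question}\" was found in chunk {bestMatch.Key}.");

// print the answer
Console.WriteLine("\r\n********\r\n" + GPTHelper.GetAnswer(chunks[bestMatch.Key], question));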

A sample output is shown in the following console:

14 chunks generated from: Sample PDFs/SampleContract-Shuttle.pdf
The question: "Is contracting with other partners an option?" was found in chunk 11.

********
No, contracting with other partners is not an option unless prior approval is obtained from the COMMISSION'S Contract Manager. The document specifies that subcontracting work under this Agreement is not allowed without prior written authorization, except for those identified in the approved Fee Schedule. Subcontracts over $25,000 must include the necessary provisions from the main Agreement and must be approved in writing by the COMMISSION'S Contract Manager.

The answer is generated based on the content of the chunk that contains the best match for the question. It is phrased by the model from that chunk rather than quoted verbatim from the PDF document.

Conclusion

This article shows how to create a generative AI application for PDF documents using TX Text Control and OpenAI functions in .NET C#. The application uses the OpenAI GPT-3 engine to answer questions on the content of a PDF document.

Test it yourself by downloading the sample application from our GitHub repository.

Download and Fork This Sample on GitHub

We proudly host our sample code on github.com/TextControl.

Please fork and contribute.

Requirements for this sample

TX Text Control .NET Server Core
OpenAI API Key
