AI Assistant

The integration of artificial intelligence (AI) into PDF document management is revolutionizing how people and organizations interact with digital content. Traditionally, extracting key information from lengthy PDFs, like contracts, research papers, or reports, was a manual and time-consuming process. AI-powered tools now automate this task by summarizing documents, extracting relevant data, and even building document outline trees.

AI capabilities in DsPdf are offered via the GrapeCity.Documents.Pdf.AI namespace, which is provided in a separate package named Document Solutions for PDF AI Assistant (DsPdfAI). This package uses OpenAI services and includes two classes that implement its AI functionality: OpenAIDocumentAssistant and AzureOpenAIDocumentAssistant.

Currently, the following three AI-powered features are supported:

  1. Generating a document summary and abstract

  2. Building a document outline tree

  3. Extracting tables from a document

Licensing Considerations

The DsPdfAI library does not require a separate license. However, it works with GcPdfDocument instances provided by your application. To ensure full functionality without limitations, the DsPdf library must be properly licensed.

If DsPdf is not licensed:

  • Only the first 5 pages of a PDF document will be processed.

  • A license reminder message (nag message) will appear on each page of the output.

To learn how to apply a license in DsPdf, see the License Information topic in the Document Solutions for PDF documentation.

Setting up the environment for using DsPdfAI:

  • Open Microsoft Visual Studio and create a new project.

  • Install the DS.Documents.Pdf and DS.Documents.Pdf.AI packages.

  • Import the GrapeCity.Documents.Pdf.AI namespace, then add the following lines of code, supplying your own OpenAI or Azure OpenAI token and endpoint:

using GrapeCity.Documents.Pdf.AI;

//Using OpenAI
var openAiToken = @"...";
var a = new OpenAIDocumentAssistant(openAiToken);

Note: The AzureOpenAIDocumentAssistant class can be used in the same way.
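
Hard-coding the token as in the snippet above is convenient for a quick test, but the token is a secret. The following is a minimal sketch of one alternative: reading it from an environment variable. The helper name ReadToken and the variable name used in the note below are assumptions made for illustration; they are not part of DsPdfAI.

```csharp
using System;

// Hypothetical helper (not part of DsPdfAI): reads an API token from the
// given environment variable and fails early if it is missing or empty.
static string ReadToken(string variableName)
{
    var token = Environment.GetEnvironmentVariable(variableName);
    if (string.IsNullOrWhiteSpace(token))
        throw new InvalidOperationException(
            $"Environment variable '{variableName}' is not set.");
    return token;
}
```

With this helper in place, the assistant could be created as new OpenAIDocumentAssistant(ReadToken("OPENAI_API_KEY")), where OPENAI_API_KEY is only an example variable name.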

Generating a document summary and abstract

DsPdfAI provides the GetAbstract and GetSummary methods for generating a document's abstract and summary. Two string properties, GetAbstractMessage and GetSummaryMessage, let you customize the requests sent to the AI engine. Both methods also accept an optional pageRange parameter of type OutputRange, which specifies the range of pages to include in the request to the AI. This parameter is null by default, meaning that all pages of the document are processed.

The following code snippet shows how to use the GetAbstract and GetSummary methods:

// Load the PDF (requires the System.IO and GrapeCity.Documents.Pdf namespaces):
var doc = new GcPdfDocument();
using var fs = File.OpenRead("myDocument.pdf");
doc.Load(fs);

// Set the abstract message:
a.GetAbstractMessage = "Please analyze the PDF and return a brief abstract of the document.";
// Get the abstract for the PDF (pageRange is optional and null by default):
string @abstract = await a.GetAbstract(doc/*, pageRange*/);

// Set the summary message:
a.GetSummaryMessage = "Please analyze the PDF and return a summary of the document.";
// Set the page range to be summarized, in this case pages 1 to 5:
OutputRange o1 = new OutputRange("1-5");
// Get the summary:
string summary = await a.GetSummary(doc, pageRange: o1);

Limitations:

  • AI-generated results may not always be accurate. Because methods such as GetAbstract, GetSummary, GetTable, and BuildOutlines in DsPdfAI rely on AI-driven processes, their output can vary in accuracy and reliability.

Building a document outline tree

DsPdfAI provides the BuildOutlines method for building outlines for a PDF document. The AI-generated outline includes only the outline text, without any coordinate information. The string property BuildOutlinesMessage lets you customize the request sent to the AI engine.

An OutlineNodeCollection parameter specifies whether the resulting document should contain additional outlines besides the ones built by the AI engine. An OutputRange parameter is also provided to specify the range of pages to include in the request to the AI.

The following code snippet shows how to use the BuildOutlines method:

// Build the outlines and save the resulting document:
await a.BuildOutlines(doc);
doc.Save("myDocumentWithOutlines.pdf");

Limitations:

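  • Because the AI-generated outline contains text only, each outline entry has to be matched against the text in the PDF. This matching may occasionally fail if the text returned by the AI engine does not exactly match the text in the PDF. Variations in formatting, whitespace, or paraphrasing can cause mismatches.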

  • The AI engine returns the outline tree it built as JSON. In rare cases that JSON may be malformed, in which case the BuildOutlines method will throw an “invalid JSON response” exception.
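
Since a malformed JSON response is described as rare, simply repeating the request may be enough to get past it. The following is a minimal sketch of a generic retry wrapper. RetryAsync is a hypothetical helper, not part of DsPdfAI. Because the concrete exception type thrown by BuildOutlines is not stated here, the sketch assumes that any exception is worth retrying, which may be broader than you want.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (not part of DsPdfAI): runs an async operation and
// retries it when it throws, up to maxAttempts attempts in total.
static async Task RetryAsync(Func<Task> operation, int maxAttempts)
{
    if (maxAttempts < 1)
        throw new ArgumentOutOfRangeException(nameof(maxAttempts));

    for (int attempt = 1; ; attempt++)
    {
        try
        {
            await operation();
            return; // success, stop retrying
        }
        // Assumption: every exception is treated as retryable. On the last
        // attempt the filter is false, so the exception reaches the caller.
        catch (Exception) when (attempt < maxAttempts)
        {
            // Short, linearly growing pause before the next attempt.
            await Task.Delay(TimeSpan.FromMilliseconds(200 * attempt));
        }
    }
}
```

A call such as await RetryAsync(() => a.BuildOutlines(doc), 3) would then make up to three attempts. Whether a failed call leaves the document unchanged is not stated here, so reloading the document before each attempt may be the safer choice.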

Extracting tables from a document

DsPdfAI provides the GetTable method for extracting a table from a PDF document. The method returns a Table object, which is defined by the Table helper class in the GrapeCity.Documents.Pdf.AI namespace. This table is constructed based on the AI's response.

A string tableRequest parameter is provided to describe the specific table to fetch, so that the AI engine can find it in the document. The user sends a natural language prompt through this parameter, such as:

"Extract the table from the chapter titled '3.1 Record'."

The property GetTableMessageFmt is provided for customizing the general request to the AI engine. An OutputRange parameter is also used here, to specify the range of pages to be included in the request to the AI.

The following code snippet shows how to use the GetTable method:

// GetTableMessageFmt is the general request to the AI; the following is its default value:
a.GetTableMessageFmt = "Please analyze the PDF. {0}. Return the table only without additional information.";

// Get a GrapeCity.Documents.Pdf.AI.Table object from the PDF
// (the second argument is the tableRequest):
var t = await a.GetTable(doc, "Extract the table from the chapter named \"3.1 Record\".");

Note: It is recommended to structure AI prompts by referencing the chapter where the table is located. The PDF is passed to the AI engine as a single stream of text without any page breaks, so specifying page numbers will never work (except by chance).
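
To preview what the AI engine might receive, the default GetTableMessageFmt value and a tableRequest can be combined by hand. The following is a minimal sketch that assumes the {0} placeholder is filled in the way string.Format would fill it; how DsPdfAI composes the message internally is not confirmed here. ComposeTablePrompt is a hypothetical helper, not part of the library.

```csharp
using System;

// Default value of GetTableMessageFmt, as quoted in this topic.
const string messageFmt =
    "Please analyze the PDF. {0}. Return the table only without additional information.";

// Hypothetical helper (not part of DsPdfAI): composes the full prompt,
// assuming the {0} placeholder is substituted as string.Format does it.
static string ComposeTablePrompt(string format, string tableRequest)
{
    // Drop a trailing period from the request so that the period which
    // follows {0} in the format string is not doubled.
    return string.Format(format, tableRequest.Trim().TrimEnd('.'));
}

Console.WriteLine(ComposeTablePrompt(
    messageFmt, "Extract the table from the chapter named \"3.1 Record\"."));
// Prints: Please analyze the PDF. Extract the table from the chapter named "3.1 Record". Return the table only without additional information.
```

Under the same assumption, passing a request that ends with a period directly to GetTable would produce a doubled period in the composed message. That is most likely harmless, but previewing the text this way makes such details easy to spot when customizing GetTableMessageFmt.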