Data indexing at lightning speed

How a multi-billion dollar analytics leader accelerated document indexing from several hours to just 7 minutes, reducing cost and expanding revenue opportunities in the process.

Client

The academia and government division of a major analytics player

Transformation

Automation of indexing PDF documents with open-source LLMs

Use case

Data indexing

Generative AI is commonly understood as a way to ‘generate’ things—content, images, reports, etc. The good news is: GenAI can do a lot more than that. Large language models (LLMs) can automate any number of complex and highly intelligent processes.

One such process is data indexing, which a global analytics leader automated with great success.

About client

The client is a provider of analytics and business intelligence services. Simply put, they curate data from a wide range of sources, index it into structured information, crunch the numbers, and offer insights and dashboards to customers.

Tune AI worked with the VP of content operations of the academia and government vertical of the company.

Business situation

Context

The client regularly curates data from company filings, patents, trademarks, domain registrations, market reports, surveys, etc. Based on the information available in these documents, they create reports for industry use.

Need

Data indexing is the foundation of the client’s analytics and insights offerings. They needed the PDF documents to be indexed based on attributes such as:

  • Title of the document

  • Authors

  • Key findings

  • Citations

  • URLs
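As an illustration, the attributes above map naturally onto a simple record type. The sketch below is hypothetical (the field names mirror the bullet list; the class itself is not the client's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class DocumentIndex:
    """Illustrative record for one indexed PDF; fields mirror the attribute list."""
    title: str
    authors: list[str] = field(default_factory=list)
    key_findings: list[str] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)
    urls: list[str] = field(default_factory=list)

# Example record for a single document
record = DocumentIndex(
    title="Example Market Report 2024",
    authors=["J. Doe"],
    urls=["https://example.com/report"],
)
```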

Existing solution

Presently, the client employs over 400 people to index documents manually. On average, each employee is paid $18 an hour, and a single document takes several hours to index.

Challenges

Data type: The PDFs were unstructured, and many contained handwritten annotations.

Data variety: The documents varied widely, which made templates ineffective. In fact, even documents of the same type followed no common standard.

Expense: Data indexing became a large department within the organization, incurring high costs.

Inability to scale: When done manually, data indexing took hours for each document, restricting the organization’s scalability.

Based on deep conversations with the client’s content operations teams, we agreed that Generative AI was the way to go. However, the goal was to match the gold standard for accuracy that manual indexing offers. 

Solution approach

Tune AI’s primary approach to the client’s business problems was: Robotic Process Automation (RPA) with LLMs. 

From that strategic standpoint, we designed the solution and made several key decisions.
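At a high level, the RPA-with-LLM flow is: extract raw text from each PDF, prompt a model for the index attributes, and parse the structured reply. A minimal sketch, in which `complete()` is a stand-in for whatever LLM endpoint is actually used (the prompt wording and function names are illustrative, not the production system):

```python
import json

def complete(prompt: str) -> str:
    """Stand-in for an LLM API call; a real pipeline would call a self-hosted model."""
    return json.dumps({"title": "Example Report", "authors": ["J. Doe"]})

def index_document(pdf_text: str) -> dict:
    """Ask the model to pull index attributes out of raw PDF text."""
    prompt = (
        "Extract title, authors, key_findings, citations, and urls "
        "from the document below. Reply with JSON only.\n\n" + pdf_text
    )
    return json.loads(complete(prompt))

record = index_document("Example Report\nby J. Doe\n...")
```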

Piloting with sample documents: On a paid POC of 30 documents of ~31 pages each (664,470 tokens in total), we ran multiple closed-source and open-source models to gauge performance, speed, and cost.

We observed that closed-source models performed well out of the box. However, when fine-tuned and customized for the client's needs, open-source models were cheaper and performed to the same or comparable standards.

Using open-source models: Given the security-intensive nature of the client's business, they cannot afford closed-source solutions, which carry the risk of proprietary data being used to train publicly available models.

Human in the loop: Admittedly, the LLMs struggled with processing handwriting and some unstructured parts of the data. We enabled a human in the loop to ensure accuracy and train the models in the process.
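The human-in-the-loop step can be sketched as a simple confidence gate: extractions below a threshold are routed to a reviewer, and the corrected output is kept for later fine-tuning. Everything here (the threshold value, function names, and queues) is illustrative, not the client's production workflow:

```python
REVIEW_THRESHOLD = 0.90  # illustrative cutoff, not an actual production setting

def route_extraction(fields: dict, confidence: float,
                     review_queue: list, training_set: list) -> dict:
    """Auto-accept confident extractions; send the rest to a human reviewer."""
    if confidence >= REVIEW_THRESHOLD:
        return fields
    review_queue.append(fields)      # a human reviews and corrects these
    corrected = dict(fields)         # placeholder for the reviewer's edits
    training_set.append(corrected)   # corrected examples feed later fine-tuning
    return corrected

queue, train = [], []
accepted = route_extraction({"title": "Report A"}, 0.97, queue, train)
flagged = route_extraction({"title": "Rep0rt B"}, 0.55, queue, train)
```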

Outcomes

Accuracy

>99% accuracy, almost on par with manual indexing

Time savings

Several hours to 7 minutes per document

Cost savings

$18 to 60¢ per hour

Scalability

Potential to process millions more documents with the same capacity

Future plans

After successfully proving the concept, we’re all set to implement the solution at scale. To improve performance and accuracy, Tune AI’s experts will:

  • Try more models with image + text modality 

  • Compare the performance, price, and accuracy of these modalities

  • Analyze trade-offs and choose the right model for the use case

  • Fine-tune the models using Tune Studio to optimize performance 

  • Deploy agents to automate indexing and improve automatically based on human feedback