Data indexing at lightning speed
How a multi-billion dollar analytics leader accelerated document indexing from several hours to just 7 minutes, reducing cost and expanding revenue opportunities in the process.
Client
The academia and government division of a major analytics player
Transformation
Automated indexing of PDF documents with open-source LLMs
Use case
Data indexing
Generative AI is commonly understood as a way to ‘generate’ things—content, images, reports, etc. The good news is: GenAI can do a lot more than that. Large language models (LLMs) can automate any number of complex and highly intelligent processes.
One such process is data indexing, which a global analytics major automated with great success.
About client
The client is a provider of analytics and business intelligence services. Simply put, they curate data from a wide range of sources, index that information, crunch the numbers, and offer insights and dashboards to customers.
Tune AI worked with the VP of content operations for the company’s academia and government vertical.
Business situation
Context
The client regularly curates data from company filings, patents, trademarks, domain registrations, market reports, surveys, etc. Based on the information available in these documents, they create reports for industry use.
Need
Data indexing is the foundation of the client’s analytics and insights offerings. They needed the PDF documents to be indexed based on attributes such as the following (captured in the sketch after this list):
Title of the document
Authors
Key findings
Citations
URLs
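As an illustration, the attributes above map naturally onto a simple record type. The sketch below is our own shorthand, not the client’s actual schema; the field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IndexRecord:
    """One indexed document. Field names are illustrative, not the client's schema."""
    title: str = ""
    authors: List[str] = field(default_factory=list)
    key_findings: List[str] = field(default_factory=list)
    citations: List[str] = field(default_factory=list)
    urls: List[str] = field(default_factory=list)
```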
Existing solution
The client currently employs over 400 individuals to index documents manually. On average, each employee is paid $18 an hour, and each document takes several hours to index.
Challenges
Data type: The PDFs were unstructured and included numerous handwritten annotations.
Data variety: There was a wide variety of documents, which made templates ineffective. In fact, even within the same type of document, there was no common standard.
Expense: Data indexing became a large department within the organization, incurring high costs.
Inability to scale: When done manually, data indexing took hours for each document, restricting the organization’s scalability.
Based on deep conversations with the client’s content operations teams, we agreed that Generative AI was the way to go. However, the goal was to match the gold standard for accuracy that manual indexing offers.
Solution approach
Tune AI’s primary approach to the client’s business problem was Robotic Process Automation (RPA) powered by LLMs.
From that strategic standpoint, we designed the solution and made several key decisions.
Piloting with sample documents: In a paid POC covering 30 documents of ~31 pages each (664,470 tokens in total), we ran multiple closed-source and open-source models to gauge performance, speed, and cost.
We observed that closed-source models performed well out of the box. However, once fine-tuned and customized for the client’s needs, open-source models were cheaper and performed to the same or a comparable standard.
Using open-source models: Given the security-intensive nature of the client’s business, they could not afford closed-source solutions, which carry the risk of proprietary data being used to train publicly available models.
Human in the loop: Admittedly, the LLMs struggled to process handwriting and some unstructured parts of the data. We enabled a human in the loop to ensure accuracy and to train the models in the process (a minimal extraction sketch follows this list).
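To make the automation step concrete, here is a minimal sketch of the extraction call, assuming the fine-tuned open-source model is served behind an OpenAI-compatible endpoint. The endpoint URL, model name, and review flag are illustrative assumptions, not details from the engagement.

```python
import json

from openai import OpenAI  # any OpenAI-compatible client can talk to a self-hosted model

# Assumption: the fine-tuned open-source model is served at an internal endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

PROMPT = (
    "Extract the following fields from the document text and reply with JSON only: "
    "title, authors, key_findings, citations, urls. Use null for anything you cannot find."
)

def index_document(pdf_text: str) -> dict:
    """Ask the model for the index attributes of one document."""
    response = client.chat.completions.create(
        model="indexing-model",  # placeholder name for the fine-tuned model
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": pdf_text},
        ],
        temperature=0,
    )
    record = json.loads(response.choices[0].message.content)
    # Human in the loop: any field the model could not fill is routed to a reviewer.
    record["needs_review"] = any(value in (None, "", []) for value in record.values())
    return record
```

Documents flagged for review go to the manual indexing team, whose corrections also serve as training data for subsequent model iterations.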
Outcomes
Accuracy
>99% accuracy, almost on par with manual indexing
Time savings
Several hours to 7 minutes per document
Cost savings
$18 to 60¢ per hour
Scalability
Potential to process millions more documents with the same capacity
Future plans
After successfully proving the concept, we’re all set to implement the solution at scale. To improve performance and accuracy, Tune AI’s experts will:
Try more models with image + text modality
Compare the performance, price, and accuracy of these modalities (see the comparison sketch after this list)
Analyze trade-offs and choose the right model for the use case
Fine-tune the models using Tune Studio to optimize performance
Deploy agents to automate indexing and improve automatically based on human feedback
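As an illustration of how that comparison could be run, the sketch below scores candidate models on field-level accuracy and per-document cost over a small labelled sample. The model names, per-token prices, and the extract_fields callable are hypothetical placeholders; real figures would come from the evaluation itself.

```python
from typing import Callable, Dict, List

# Hypothetical candidates and per-1K-token prices, used only to show the shape of the comparison.
CANDIDATES: Dict[str, float] = {"text-only-model": 0.0004, "image-text-model": 0.0010}

FIELDS = ["title", "authors", "key_findings", "citations", "urls"]

def score_model(
    extract_fields: Callable[[str, str], dict],  # (model_name, document_text) -> predicted record
    model_name: str,
    sample: List[dict],  # each item: {"text": ..., "tokens": ..., "gold": {field: value}}
) -> dict:
    """Field-level accuracy and average cost per document for one candidate model."""
    correct = total = 0
    cost = 0.0
    for doc in sample:
        predicted = extract_fields(model_name, doc["text"])
        for field_name in FIELDS:
            total += 1
            correct += predicted.get(field_name) == doc["gold"].get(field_name)
        cost += doc["tokens"] / 1000 * CANDIDATES[model_name]
    return {
        "model": model_name,
        "field_accuracy": correct / total,
        "avg_cost_per_doc": cost / len(sample),
    }
```

Running score_model for each candidate makes the accuracy-versus-price trade-off explicit before committing to a model for production.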