
Smart Indexing: Efficient Document Indexing Made Easy with AI and Vector Search

What we are building here is a smart indexing system that leverages LLMs, RAG, and vector databases.


Suppose you want to store and search large documents in a database. When the documents are very large, indexing them becomes difficult and expensive. So how can we make it efficient, both economically and technically?

Here I will share an approach that tackles this problem well: AI. We can leverage the power of existing LLMs and vector databases to achieve this. In this article, I will show how it can be done efficiently.


Traditional indexing relies on keyword-based methods, such as inverted indexes, which match exact words or phrases from a query to documents. This works fine for simple searches but struggles with things like:

  • Context: It can’t tell the difference between different meanings of a word (e.g., "apple" as a fruit vs. "Apple" the company).
  • Related Concepts: It misses documents that use synonyms or related terms (e.g., "car" vs. "automobile").
  • Intent: It often fails to capture the meaning behind complex or natural language queries.

In contrast, AI-powered indexing uses embeddings from LLMs to capture the semantic meaning of text. This means it understands the context of words and phrases, so it can distinguish between different meanings. It retrieves documents that match the query’s intent, even if the exact keywords aren’t present. For example, a search for "healthy snacks" could return documents about apples, even if "snacks" isn’t mentioned. Results are more relevant and accurate because they’re based on meaning, not just word frequency.
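
To make this concrete, here is a minimal sketch of semantic matching. It uses the same all-MiniLM-L6-v2 model through the sentence-transformers library (an extra dependency, assumed to be installed, that the implementation later in this article does not require) and shows that "car" and "automobile" land close together in embedding space even though they share no keywords:

from sentence_transformers import SentenceTransformer, util

# Load a small, general-purpose embedding model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode two phrasings of the same concept and one unrelated term
embeddings = model.encode(["car", "automobile", "banana"])

# Cosine similarity: semantically related terms score much higher
print(util.cos_sim(embeddings[0], embeddings[1]))  # high: related concepts
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: unrelated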

I will share more benefits of AI-powered indexing later in this article. For now, let's look at how the approach works and how to implement it.

How It Works: A Step-by-Step Guide

To implement this AI-powered indexing system, follow these steps:

  1. Split Documents into Passages

    First, divide each document into manageable chunks of roughly 200-300 words. Smaller passages allow for precise retrieval, while embedding entire large documents is computationally expensive and dilutes meaning. You can split on natural breaks (like paragraphs) or use a fixed word count, ensuring each passage is meaningful but not overly long.

  2. Generate Embeddings with an LLM

    Convert each passage into a dense vector embedding: a numerical representation of its semantic content. This can be done with a pre-trained model optimized for generating embeddings. Pass each passage through the model to generate its embedding.

    Smart indexing with LLMs

  3. Store Embeddings in a Vector Database

    Now store the embeddings in a vector database designed for fast similarity searches, like FAISS (Facebook AI Similarity Search). Vector databases are optimized for nearest-neighbor searches, making retrieval of relevant passages quick and resource-efficient, even with millions of entries. Create an index (e.g., a FAISS IndexFlatL2 index) and add the embeddings along with pointers to their corresponding passages or documents.

  4. Process the Search Queries

    When a query comes in, convert it into an embedding and search the vector database for the most similar passages. Use the same LLM to generate an embedding for the query, then perform a similarity search to retrieve the top-k (e.g., 5-10) most relevant passages.

  5. Generate Answers with RAG

    Now, feed the retrieved passages into the LLM to generate a concise, contextually accurate answer. RAG ensures that the answer is informed by the most relevant parts of your documents, reducing the computational load by focusing only on retrieved content.

Implementation Example

Here is how you can implement this in Python:

import faiss
import numpy as np
from transformers import AutoTokenizer, AutoModel
import torch

# Load LLM for embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Function to generate embeddings
def generate_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into a single fixed-size vector
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Function to split documents into passages
def split_into_passages(document, chunk_size=200):
    words = document.split()
    return [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

# Index your documents
documents = ["Your large document text here...", "Another large document..."]
all_passages = []
for doc in documents:
    passages = split_into_passages(doc)
    all_passages.extend(passages)

# Generate embeddings (FAISS expects float32 vectors)
embeddings = np.array([generate_embedding(passage) for passage in all_passages]).astype("float32")

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # L2 distance for similarity
index.add(embeddings)

# Search function
def search(query, top_k=5):
    query_embedding = generate_embedding(query)
    distances, indices = index.search(np.array([query_embedding]), top_k)
    return [all_passages[idx] for idx in indices[0]]

# Example usage
query = "What is the main topic of the documents?"
relevant_passages = search(query)
print("Relevant Passages:", relevant_passages)

# Generate answer (simplified placeholder)
answer = " ".join(relevant_passages)  # In practice, use a generative LLM
print("Answer:", answer)

Advantages Over Traditional Indexing

Here’s how this approach improves on traditional indexing:

  1. Better Accuracy Through Semantic Understanding

    As discussed earlier, traditional keyword-based indexing matches exact words or phrases, so it struggles with context (different senses of "apple"), related concepts ("car" vs. "automobile"), and the intent behind natural language queries. AI-powered indexing instead uses LLM embeddings that capture semantic meaning: it distinguishes between different senses of a word, retrieves documents that match the query’s intent even when the exact keywords are absent (a search for "healthy snacks" can return documents about apples), and ranks results by meaning rather than word frequency.
  2. Greater Efficiency and Speed

    Traditional indexing can be slow and resource-heavy, especially for large datasets, because:

    • It may need to scan large portions of text or maintain extensive indexes.
    • Search times can grow as the dataset size increases.

    AI-powered indexing, however, is highly efficient:

    • One-Time Effort: Creating embeddings for documents requires initial computation, but this is done only once. After that, the embeddings are stored and reused.
    • Fast Searches: Tools like FAISS perform rapid similarity searches in high-dimensional vector spaces, allowing searches across millions of documents in milliseconds.
    • Lightweight Queries: Each query only requires generating an embedding for the query itself and comparing it to the stored embeddings, making it much faster than scanning entire documents.
  3. Cost-Effectiveness for Large-Scale Use

    With traditional indexing:

    • Storage: Large inverted indexes can take up significant space.
    • Processing: Each query might demand substantial compute resources, especially for massive datasets.

    AI-powered indexing offers cost savings:

    • After the initial embedding creation (a one-time cost), search queries are lightweight and require fewer resources.
    • This makes it more cost-effective for ongoing operations, especially as the document collection grows.
  4. Flexibility with Diverse Queries

    Traditional methods are optimized for keyword searches or structured queries but struggle with:

    • Natural Language: Long, conversational queries are hard to process effectively.
    • Complex Tasks: They typically retrieve documents rather than answering questions directly.

    The AI approach excels in flexibility:

    • It handles everything from short keywords to full sentences or questions, adapting to different user needs.
    • When paired with RAG, it can not only find relevant documents but also generate concise answers, making it suitable for both basic searches and advanced question-answering scenarios.
  5. Scalability for Growing Datasets

    Scaling traditional indexing to handle very large datasets (e.g., billions of documents) requires:

    • Complex strategies like sharding or partitioning.
    • Significant hardware to maintain performance.

    AI-powered indexing is built for scale:

    • Vector databases like FAISS can efficiently manage millions or billions of vectors with minimal performance loss (a sketch of an approximate-nearest-neighbor index follows after this list).
    • The process of generating and indexing embeddings can be parallelized across machines, making it practical for massive collections.
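
For collections that outgrow the flat index used earlier, FAISS provides approximate-nearest-neighbor indexes. The sketch below reuses dimension, embeddings, generate_embedding, and query from the implementation above and builds an IVF index that clusters the vectors and searches only a few clusters per query. The nlist and nprobe values are illustrative assumptions that would need tuning, and IVF training needs enough vectors to be meaningful:

import faiss
import numpy as np

# Cluster the vector space; at query time, visit only the closest clusters
nlist = 100                                  # number of clusters (tune for your data size)
quantizer = faiss.IndexFlatL2(dimension)     # coarse quantizer over the same dimension
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

ivf_index.train(embeddings)                  # IVF indexes must be trained before adding vectors
ivf_index.add(embeddings)

ivf_index.nprobe = 10                        # clusters searched per query (speed/recall trade-off)
query_vec = np.array([generate_embedding(query)], dtype="float32")
distances, indices = ivf_index.search(query_vec, 5)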

Conclusion

This approach outperforms traditional indexing by offering better accuracy through semantic understanding, faster and more efficient searches, lower long-term costs, greater flexibility across query types, and easy scalability for large datasets. These benefits make it an ideal choice for managing and searching extensive document collections, delivering more relevant results with less effort than traditional methods.