How to build an AI-powered private document search app with RAG, ChromaDB, and memory

As AI advances, more tools are being introduced into the ecosystem, enabling engineers and developers to experiment and build custom AI apps. But it’s not that simple.

In fact, each advantage of AI comes with a trade-off. With vector databases such as Chroma, for example, the major challenge is processing data efficiently. That matters because many of the latest LLM-based applications rely on vector embeddings at their core.

Vector databases are designed to store and query unstructured data—inputs that lack a fixed schema, such as text, images, and audio. This is a clear departure from traditional SQL databases, which usually query for rows where values match specific criteria, as with the “SELECT” statement.
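To make the contrast with an exact-match SELECT concrete, here is a minimal pure-Python sketch of a similarity query. The three-dimensional vectors and file names are invented for illustration. A real embedding model produces vectors with hundreds or thousands of dimensions, and a vector database such as Chroma relies on indexes rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the names and numbers are made up.
store = {
    "invoice.pdf": [0.9, 0.1, 0.0],
    "holiday_photo.png": [0.0, 0.2, 0.9],
    "receipt.txt": [0.8, 0.3, 0.1],
}

def most_similar(query_vector, store, k=2):
    """Return the k keys whose vectors are closest to the query vector."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query_vector, store[name]),
        reverse=True,
    )
    return ranked[:k]

result = most_similar([1.0, 0.2, 0.0], store)
print(result)
```

Instead of asking which rows equal a value, the query asks which stored items are nearest to the query vector.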


This tutorial is meant to help you connect an LLM with LangChain to your own data sources (in this case, a PDF document) while using ChromaDB as a vector database to serve as memory. This is where RAG enters the picture, introducing the ability to store and retrieve data during the conversation, as well as adding chat history that gives you the power to build complex apps with memory.

Here’s what the pipeline of the entire product will look like:

Project workflow with steps

Now you have a basic introduction to the type of app we’re building. This section covers the implementation steps required to: load the data into LangChain documents; split it into chunks; embed the chunks and store them in a vector database; rank the stored vectors by similarity to the question’s embedding; and, finally, insert the most relevant chunks, together with the question, into a message to a GPT model and return the model’s answer.

Let’s dive in.

Step 1: Install dependencies

In your notebook, install these Python packages (in a notebook cell, prefix the command with %):

pip install pypdf docx2txt openai langchain chromadb langchain-community langchain-openai "langchain-chroma>=0.1.2" python-dotenv

pypdf: reads PDF files and extracts their text
docx2txt: extracts text from your docx file
openai: gives access to the model
langchain: the framework that connects the LLM to your data and prompts
chromadb: open-source vector database for storing and querying embeddings
langchain-community: provides the document loaders that convert files into the standard LangChain document format
langchain-openai: the package connecting OpenAI and LangChain
"langchain-chroma>=0.1.2": gives LangChain access to the Chroma vector store
python-dotenv: loads environment variables from a .env file (used in Step 2)

These tools all work together to create the LLM-powered Q&A application.

Step 2: Load your secret

import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

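As in any Python project that uses secrets, keep your environment variables (here, your OpenAI API key) in a .env file and load them at startup. Exclude that file from source control, for example by listing it in .gitignore, so the key isn’t committed when you share the project on GitHub.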
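A quick check right after loading the .env file can surface a missing key before any API call fails. The sketch below is one way to do it. The helper name require_env is hypothetical, and it assumes the key is stored under OPENAI_API_KEY, the variable name the OpenAI integration reads by default.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value

# Typical use, after load_dotenv(...) has run (assumed variable name):
# api_key = require_env("OPENAI_API_KEY")
```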

Step 3: Loading the documents

Here, we use a load_document function to load the PDF file into LangChain documents. For a PDF, the loader relies on the pypdf library, and loader.load() returns a list of documents, one per page, where each document contains the page content and metadata that includes the page number.

def load_document(file):
    import os
    name, extension = os.path.splitext(file)

    # Pick a loader based on the file extension
    if extension == '.pdf':
        from langchain_community.document_loaders import PyPDFLoader
        print(f'Loading {file}')
        loader = PyPDFLoader(file)
    elif extension == '.docx':
        from langchain_community.document_loaders import Docx2txtLoader
        print(f'Loading {file}')
        loader = Docx2txtLoader(file)
    elif extension == '.txt':
        from langchain_community.document_loaders import TextLoader
        loader = TextLoader(file)
    else:
        print('Document format is not supported!')
        return None

    # Returns a list of LangChain Document objects
    data = loader.load()
    return data

Step 4: Chunking data

def chunk_data(data, chunk_size=256):
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    overlap = int(chunk_size * 0.15)
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)

    chunks = text_splitter.split_documents(data)

    return chunks

The chunk_data function splits the documents into text chunks of at most chunk_size characters using LangChain’s RecursiveCharacterTextSplitter, allowing consecutive chunks to overlap by up to 15% of the chunk size. It returns a list, so you can access any chunk’s text by index, for example chunks[0].page_content.
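To see what the chunk size and overlap numbers mean, here is a simplified fixed-window sketch. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function name sliding_window_chunks is hypothetical and only illustrates the arithmetic.

```python
def sliding_window_chunks(text, chunk_size=256, overlap_ratio=0.15):
    """Cut text into fixed-size windows that share `overlap` characters."""
    overlap = int(chunk_size * overlap_ratio)  # 256 * 0.15 -> 38 characters
    step = chunk_size - overlap                # each window starts 218 characters later
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# A 600-character sample made of repeating digits
sample = "".join(str(i % 10) for i in range(600))
pieces = sliding_window_chunks(sample)

print(len(pieces))                          # number of chunks
print(pieces[0][-38:] == pieces[1][:38])    # the tail of one chunk opens the next
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.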

Step 5: Query and response

def ask_and_get_answer(vector_store, q, k=3):
    from langchain_openai import ChatOpenAI
    from langchain.chains import create_retrieval_chain
    from langchain.chains.combine_documents import create_stuff_documents_chain
    from langchain_core.prompts import ChatPromptTemplate

    llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0.0)

    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': k})

    # Define a prompt template for better control
    prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context:
<context>
{context}
</context>
Question: {input}""")

    document_chain = create_stuff_documents_chain(llm, prompt)
    retrieval_chain = create_retrieval_chain(retriever, document_chain)

    response = retrieval_chain.invoke({"input": q})
    return response

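Since we need the answers in natural language, an LLM comes in handy. This function retrieves the k chunks most similar to the question from the vector store and passes them, together with the question, to the gpt-3.5-turbo model to generate an answer. It returns a dictionary whose answer key holds the generated text.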
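The "stuff" step is easier to picture with the LLM taken out. The sketch below fills the same template by hand, assuming the retrieved chunks are plain strings joined by blank lines. The name stuff_prompt and the sample chunks are made up; the real chain does this with LangChain Document objects.

```python
PROMPT_TEMPLATE = (
    "Answer the following question based only on the provided context:\n"
    "<context>\n{context}\n</context>\n"
    "Question: {input}"
)

def stuff_prompt(chunks, question):
    """Place every retrieved chunk into one prompt, separated by blank lines."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, input=question)

prompt_text = stuff_prompt(
    ["Chunk A text.", "Chunk B text."],  # invented stand-ins for retrieved chunks
    "What does chunk A say?",
)
print(prompt_text)
```

Because every chunk lands in a single message, k and chunk_size together determine how much of the model's context window each question consumes.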

Step 6: Using Chroma as a vector database

def create_embeddings_chroma(chunks, persist_directory='./chroma_db'):
    from langchain_chroma import Chroma
    from langchain_openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)

    vector_store = Chroma.from_documents(chunks, embeddings, persist_directory=persist_directory)

    return vector_store

This code instantiates an OpenAI embedding model, embeds the provided text chunks, and stores them in a Chroma vector store that is configured to save its data to the specified directory (./chroma_db by default).

def load_embeddings_chroma(persist_directory='./chroma_db'):
    from langchain_chroma import Chroma
    from langchain_openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)

    vector_store = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

    return vector_store

Here, we define a function that takes the persist directory as an argument and loads the existing embeddings from disk into a vector store object. It instantiates the same embedding model used during creation, because questions must be embedded the same way as the stored chunks, and returns the Chroma vector store loaded from that directory.
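With both a create and a load function available, a small guard can decide which one to call so the document is not re-embedded on every run. This is a sketch under a stated assumption: a non-empty persist directory is treated as an existing store, which is a heuristic and not a Chroma API. The helper name persisted_store_exists is hypothetical.

```python
import os

def persisted_store_exists(persist_directory="./chroma_db"):
    """Heuristic: a non-empty directory is assumed to hold a saved store."""
    return os.path.isdir(persist_directory) and bool(os.listdir(persist_directory))

# Possible use with the two functions defined in this step:
# if persisted_store_exists():
#     vector_store = load_embeddings_chroma()
# else:
#     vector_store = create_embeddings_chroma(chunks)
```

Skipping the create step on later runs saves embedding calls and avoids adding the same chunks to the store again.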

Step 7: Running the code

With the functions in place, it is time to run and test the code.

First, load the PDF document into LangChain documents, split it into chunks, and create the vector store. Here, files is the directory that contains the PDF:

data = load_document('files/rag_powered_by_google_search.pdf') # use any file you have

chunks = chunk_data(data, chunk_size=256)

vector_store = create_embeddings_chroma(chunks)

Running this code should print the message “Loading files/rag_powered_by_google_search.pdf”, indicating that the loader has picked up the file.

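db = load_embeddings_chroma()  # reload the persisted store from disk

q = 'How many pairs of questions and answers had the StackOverflow dataset?'

answer = ask_and_get_answer(db, q)
print(answer)

Here, we reload the persisted vector store and query it. The printed response is a dictionary that contains the question (input), the retrieved chunks (context), and the generated text (answer).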
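Printing the whole response is noisy, so it can help to pull out just the answer and the pages it was drawn from. The sketch below assumes the response has answer and context keys and that each retrieved document carries source and page metadata, as the PDF loader provides. The helper summarize_response is hypothetical, and the sample response uses invented values with stand-in objects rather than real LangChain documents.

```python
from types import SimpleNamespace

def summarize_response(response):
    """Return the answer text and the (source, page) pairs of the retrieved chunks."""
    sources = [
        (doc.metadata.get("source"), doc.metadata.get("page"))
        for doc in response.get("context", [])
    ]
    return response["answer"], sources

# Stand-in response with the assumed shape; every value here is invented.
fake_response = {
    "input": "example question",
    "context": [
        SimpleNamespace(
            page_content="example chunk text",
            metadata={"source": "files/example.pdf", "page": 3},
        )
    ],
    "answer": "example answer",
}

text, sources = summarize_response(fake_response)
print(text)
print(sources)
```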

If you ask a follow-up question, however, you won’t receive an accurate answer. Each call is independent, so without chat history (memory) the model has no way of knowing what the follow-up refers to.

q = 'Multiply that number by 2.'
answer = ask_and_get_answer(vector_store, q)

print(answer['answer'])

The response should look something like: “Since no specific number is provided in the context, it is not possible to multiply it by 2.”

Step 8: Adding chat history memory to your RAG application

from langchain_openai import ChatOpenAI
from langchain.chains import (
    create_history_aware_retriever,
    create_retrieval_chain,
)
from langchain.chains.combine_documents import (
    create_stuff_documents_chain,
)
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

llm = ChatOpenAI(
    model='gpt-3.5-turbo',
    temperature=0.0
)

retriever = vector_store.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 5}
)

contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, just "
    "reformulate it if needed and otherwise return it as is."
)

contextualize_q_prompt = ChatPromptTemplate.from_messages([
    ("system", contextualize_q_system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

qa_system_prompt = (
    "You are an assistant for question-answering tasks. Use "
    "the following pieces of retrieved context to answer the "
    "question. If you don't know the answer, just say that you "
    "don't know. Use three sentences maximum and keep the answer "
    "concise."
    "\n\n"
    "{context}"
)

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", qa_system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])

question_answer_chain = create_stuff_documents_chain(
    llm, qa_prompt
)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

For a RAG application, a basic requirement is support for follow-up questions—including those that reference past chat history.

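The code above builds a conversational chain in two stages. First, create_history_aware_retriever uses the LLM to rewrite the latest question as a standalone question based on the chat history, then retrieves chunks for it. Second, the question-answering chain answers from those chunks, with the chat history included in the prompt. The chain stores nothing itself: the conversation memory is a chat_history list of messages that you pass in with every call.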
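The control flow of a history-aware chain can be sketched with plain functions in place of the LLM and the retriever. Everything below is a stand-in: answer_with_history and the three fake_ functions are hypothetical names, and the strings they return are invented. Only the order of the steps is the point.

```python
def answer_with_history(question, chat_history, rewrite, retrieve, generate):
    """Rewrite the question if there is history, retrieve, then generate."""
    # With no history there is nothing to resolve, so the question is used as is.
    standalone = rewrite(question, chat_history) if chat_history else question
    context = retrieve(standalone)
    return generate(question, chat_history, context)

calls = []  # records which stand-in steps ran, in order

def fake_rewrite(question, history):
    calls.append("rewrite")
    return f"{question} (given: {history[-1]})"

def fake_retrieve(query):
    calls.append(f"retrieve:{query}")
    return ["chunk about " + query]

def fake_generate(question, history, context):
    return f"answer using {len(context)} chunk(s)"

first = answer_with_history(
    "How many pairs?", [], fake_rewrite, fake_retrieve, fake_generate
)
print(first)
print(calls)
```

The rewrite step is what lets a follow-up such as "Multiply that number by 2" retrieve useful chunks, since the raw follow-up on its own says nothing about the document.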

Step 9: Create a function to ask questions

from langchain_core.messages import HumanMessage, AIMessage

chat_history = []

query = "How many pairs of questions and answers had the StackOverflow dataset?"

def ask_question(query, chain):
    # Invoke the chain that was passed in, with the history collected so far
    response = chain.invoke({
        "input": query,
        "chat_history": chat_history
    })
    return response

result = ask_question(query, rag_chain)
print(result['answer'])

# Update memory manually
chat_history.append(HumanMessage(content=query))
chat_history.append(AIMessage(content=result["answer"]))

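The function takes the query and the chain (the rag_chain built in Step 8), invokes the chain with the current chat history, and returns the response. After printing the answer, we append the question and the answer to chat_history so that later questions can refer to them.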
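Because the history list is sent with every request and grows with each turn, long sessions may eventually benefit from a cap. The helper below is a hypothetical sketch, not part of the tutorial's code. It keeps only the most recent turns, counting two messages (one question, one answer) per turn, and the default of five turns is an arbitrary choice.

```python
def trim_history(chat_history, max_turns=5):
    """Keep only the last `max_turns` question/answer pairs (two messages per turn)."""
    if max_turns <= 0:
        return []
    return chat_history[-2 * max_turns:]

# Works on any list; integers stand in for messages here.
history = list(range(14))          # 7 turns of 2 messages each
print(trim_history(history, 5))    # the 10 most recent messages
```

One possible use is to pass trim_history(chat_history) instead of chat_history when invoking the chain, trading older context for a shorter prompt.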

Now, for a follow-up question, run this code in another cell block:

query = 'Multiply the answer by 4.'

result = ask_question(query, rag_chain)

print(result['answer'])

The response should give you a figure of “32 million,” and you can keep asking further questions in the same way.

Step 10: Interactive question loop in your RAG app

For a dynamic and interactive workflow, run this code:

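while True:
    query = input('Your question: ')
    if query.lower() in ['exit', 'quit', 'bye']:
        print('Bye bye!')
        break
    result = ask_question(query, rag_chain)
    print(result['answer'])
    # Record the turn so later questions can refer to it
    chat_history.append(HumanMessage(content=query))
    chat_history.append(AIMessage(content=result['answer']))
    print('-' * 100)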

Welcome to AI’s RAG era

Retrieval augmented generation is more than a flashy phrase for AI engineers. Its deeper usefulness stems from being a technique that leverages and combines an LLM with a method for searching for information.


By following the steps above, devs and engineers can build a system that helps people retrieve vital information from their documents. Because the model answers from the retrieved text instead of relying only on what it learned in training, the answers tend to be more factual, and the source data itself is never modified.

Let’s embrace RAG as we continue to build with AI. Check out this repository for the complete workflow.


Teri Eyenike is a software engineer and a member of the Andela talent network, a private global marketplace for digital talent. With more than five years of experience focused on creating usable web interfaces and other applications using modern technologies,…
