This repository implements a pipeline for extracting, processing, and querying information from PDF documents using LangChain, FAISS, and Llama-based models. The pipeline supports querying document content with similarity-based retrieval and a retrieval-augmented generation (RAG) approach.
- Load and Process PDFs: Automatically extracts content from PDFs and splits it into manageable chunks.
- Embeddings Generation: Uses Ollama-based embeddings to represent each document chunk as a vector for retrieval.
- Vector Store Integration: Uses FAISS for similarity search and Maximum Marginal Relevance (MMR)-based retrieval.
- Customizable RAG Pipeline: Combines document retrieval with Llama-based models for accurate question-answering.
- Dynamic Prompting: Adopts a flexible and concise chat prompt for generating context-aware answers.
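To make the MMR-based retrieval mentioned above more concrete, here is a minimal pure-Python sketch of the selection rule. It is an illustration only, not the repository's code: in the pipeline this step is presumably delegated to the FAISS vector store. The names `cosine`, `mmr_select`, and `lambda_mult` are assumptions chosen for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of up to k documents chosen by Maximum Marginal Relevance.

    Each step picks the candidate that maximizes
    lambda_mult * relevance_to_query - (1 - lambda_mult) * max_similarity_to_selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is 0 for the first pick, when nothing is selected yet.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected
```

A higher `lambda_mult` favors chunks that are close to the query, while a lower value favors chunks that differ from those already selected.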
Ensure Python is installed on your system. After cloning the repository, install the required Python libraries from the project root:
pip install -r requirements.txt
- Clone the repository:
git clone https://github.com/Sawanmahna/Ask-Questions-from-PDF-using-LLM.git
cd Ask-Questions-from-PDF-using-LLM
- Create a .env file in the project root and set environment variables:
LANGCHAIN_API_KEY="your_api_key"
LANGCHAIN_PROJECT="pdfchatnow"
LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
LANGCHAIN_TRACING_V2=true
- Suppress warnings (optional):
import warnings
warnings.filterwarnings("ignore")
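As a quick sanity check for the .env step, the sketch below reports which of the four variables listed above are still unset. It assumes the .env file has already been loaded into the process environment (how the repository loads it is not stated here), and `missing_env_vars` is a hypothetical helper name, not part of the project.

```python
import os

# Variable names taken from the .env step above.
REQUIRED_VARS = (
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_TRACING_V2",
)


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All LangChain environment variables are set.")
```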
Place your PDF files in the Data/ directory. The script will automatically load and process them.
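To confirm which files would be picked up, a small standard-library sketch like the following can list the PDFs in Data/. The function name `find_pdfs` is an assumption for illustration. The repository's own loader may discover files differently (for example, recursively).

```python
from pathlib import Path


def find_pdfs(data_dir="Data"):
    """Return the PDF files directly inside data_dir, sorted by path."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    # Match the extension case-insensitively so "report.PDF" is also found.
    return sorted(
        p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    for pdf_path in find_pdfs():
        print(pdf_path)
```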
Run the script to extract and split the PDF content into manageable chunks for querying.
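The idea of splitting text into overlapping chunks can be sketched in a few lines of plain Python. This is a simplified fixed-width splitter for illustration only. The repository most likely uses a LangChain text splitter, and the `chunk_size` and `chunk_overlap` defaults below are assumptions, not the project's settings.

```python
def split_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters so that a sentence
    cut at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Larger chunks give the model more context per retrieved passage, while smaller chunks make retrieval more precise. The overlap trades some storage for fewer answers lost at chunk boundaries.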
You can query the processed PDFs by asking specific questions. For example:
question = "What is the invoice number?"
output = rag_chain.invoke(question)
print("Answer:", output)
Other example questions:
- "What is the price of Web Design?"
- "What is in the PDF?"
- "What is the invoice date?"
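Conceptually, a call such as `rag_chain.invoke(question)` retrieves relevant chunks, places them in a prompt, and passes that prompt to the model. The sketch below shows that flow with stand-in functions so it runs without LangChain or Ollama. The prompt wording, `build_prompt`, `answer`, and both stubs are hypothetical and are not the repository's actual chain or prompt.

```python
def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question, retrieve, generate, k=3):
    """Retrieve up to k chunks for the question, then generate an answer."""
    chunks = retrieve(question)[:k]
    return generate(build_prompt(question, chunks))


if __name__ == "__main__":
    # Stand-ins: a real pipeline would query the vector store and call the model.
    def fake_retrieve(question):
        return ["<retrieved chunk 1>", "<retrieved chunk 2>"]

    def fake_generate(prompt):
        return f"(model output for a prompt of {len(prompt)} characters)"

    for q in ("What is the price of Web Design?", "What is the invoice date?"):
        print(q, "->", answer(q, fake_retrieve, fake_generate))
```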
- LangChain: For building the framework for language model-based pipelines.
- FAISS: For efficient similarity search and retrieval.
- Ollama: For embedding and question-answering models.
- PyMuPDF: For efficient PDF processing.