
This project was built during my Infosys Springboard internship.


Sawanmahna/Ask-Questions-from-PDF-using-LLM


Ask Questions From PDF with LangChain and FAISS

This repository implements a pipeline for extracting, processing, and querying information from PDF documents using LangChain, FAISS, and Llama-based models. The pipeline supports querying document content with similarity-based retrieval and a retrieval-augmented generation (RAG) approach.

Features

  • Load and Process PDFs: Automatically extracts content from PDFs and splits it into manageable chunks.
  • Embeddings Generation: Uses Ollama-based embeddings to represent each document chunk as a vector for similarity search.
  • Vector Store Integration: Uses FAISS for similarity search and Maximum Marginal Relevance (MMR)-based retrieval.
  • Customizable RAG Pipeline: Combines document retrieval with Llama-based models for accurate question-answering.
  • Dynamic Prompting: Adopts a flexible and concise chat prompt for generating context-aware answers.
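To make the Maximum Marginal Relevance (MMR) idea from the feature list concrete, here is a minimal pure-Python sketch of the selection rule. It is not the FAISS or LangChain implementation; the function names `cosine` and `mmr_select`, the `lambda_mult` parameter name, and its default value are assumptions made for illustration. Each round picks the candidate that best balances relevance to the query against similarity to chunks already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of up to k vectors chosen by Maximum Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values favor diversity.
    (Hypothetical helper for illustration only.)
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize candidates that resemble something already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected
```

With a low `lambda_mult`, a near-duplicate of an already selected chunk tends to be skipped in favor of a less similar one, which is the behavior MMR-based retrieval is meant to provide.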

Setup Instructions

Prerequisites

Ensure you have Python installed on your system. Install the required Python libraries:

pip install -r requirements.txt

Environment Configuration

  1. Clone the repository:
    git clone https://github.com/Sawanmahna/Ask-Questions-from-PDF-using-LLM.git
    cd Ask-Questions-from-PDF-using-LLM
  2. Create a .env file in the project root and set environment variables:
    LANGCHAIN_API_KEY="your_api_key"
    LANGCHAIN_PROJECT="pdfchatnow"
    LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
    LANGCHAIN_TRACING_V2=true
  3. Suppress warnings (optional):
    import warnings
    warnings.filterwarnings("ignore")
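Before running the pipeline, it can help to confirm that the variables from step 2 are actually visible to Python. The sketch below assumes the .env file has already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`; whether this repository uses that package is an assumption). The helper name `missing_env_vars` is hypothetical.

```python
import os

# Variable names taken from the .env example in step 2.
REQUIRED_VARS = (
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_TRACING_V2",
)


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank.

    `environ` defaults to os.environ; passing a dict makes the check easy to test.
    (Hypothetical helper for illustration only.)
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```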

Usage

1. Load PDFs

Place your PDF files in the Data/ directory. The script will automatically load and process them.
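As a rough illustration of the discovery step, the following sketch lists the PDF files in the `Data/` directory using only the standard library. The helper name `find_pdfs` is hypothetical, and the non-recursive, case-insensitive matching is an assumption; the actual script may collect files differently (for example through a LangChain document loader).

```python
from pathlib import Path


def find_pdfs(data_dir="Data"):
    """Return a sorted list of PDF paths directly inside data_dir.

    Returns an empty list if the directory does not exist.
    (Hypothetical helper for illustration only.)
    """
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    for pdf_path in find_pdfs():
        print(pdf_path)
```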

2. Process Documents

Run the script to extract and split the PDF content into manageable chunks for querying.
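The splitting step can be pictured as a sliding window over the extracted text. The sketch below is a simplified character-based stand-in, not the splitter the repository uses; the function name `split_text` and the default `chunk_size` and `chunk_overlap` values are assumptions. Overlapping chunks reduce the chance that a fact is cut in half at a chunk boundary.

```python
def split_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character chunks that overlap by chunk_overlap.

    (Hypothetical helper for illustration only; real splitters usually also
    try to break on paragraph or sentence boundaries.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Smaller chunks make retrieval more precise but give the model less surrounding context per chunk, so the two parameters are usually tuned together.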

3. Ask Questions

You can query the processed PDFs by asking specific questions. For example:

question = "What is the invoice number?"
output = rag_chain.invoke(question)
print("Answer:", output)
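The `rag_chain` object above is built by the project script. To show what a call to `invoke` does conceptually, here is a minimal pure-Python sketch: retrieve chunks, place them in a prompt, and pass the prompt to a model. The class `SimpleRagChain`, the `PROMPT_TEMPLATE` wording, and the `retrieve` and `generate` callables are all hypothetical stand-ins, not the repository's LangChain chain.

```python
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


class SimpleRagChain:
    """Toy retrieval-augmented chain (hypothetical, for illustration only).

    retrieve: callable taking a question and returning a list of text chunks.
    generate: callable taking a prompt string and returning the model's answer.
    """

    def __init__(self, retrieve, generate):
        self.retrieve = retrieve
        self.generate = generate

    def invoke(self, question):
        chunks = self.retrieve(question)
        context = "\n\n".join(chunks)
        prompt = PROMPT_TEMPLATE.format(context=context, question=question)
        return self.generate(prompt)


if __name__ == "__main__":
    # Stub retriever and model so the sketch runs without Ollama or FAISS.
    def fake_retrieve(question):
        return ["(retrieved chunk text would appear here)"]

    def fake_generate(prompt):
        return "(model output would appear here)"

    chain = SimpleRagChain(fake_retrieve, fake_generate)
    print("Answer:", chain.invoke("What is the invoice number?"))
```

In the real pipeline the retriever would be backed by the FAISS vector store and the generator by a Llama-based model served through Ollama.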

4. Sample Questions

  • "What is the price of Web Design?"
  • "What is in the PDF?"
  • "What is the invoice date?"

Acknowledgments

  • LangChain: For building the framework for language model-based pipelines.
  • FAISS: For efficient similarity search and retrieval.
  • Ollama: For embedding and question-answering models.
  • PyMuPDF: For efficient PDF processing.
