diff --git a/posts/llama_index_rag.ipynb b/posts/llama_index_rag.ipynb
index 2ae0502..4de2b52 100644
--- a/posts/llama_index_rag.ipynb
+++ b/posts/llama_index_rag.ipynb
@@ -6,33 +6,34 @@
    "source": [
     "# Running Large Language Models (LLMs) locally for Retrieval-Augmented-Generation (RAG) Systems with full privacy\n",
     "\n",
-    "**tl;dr:** You can run small LLMs locally on your consumer PC and with ollama that's very easy to set up. It is fun to chat with an LLM locally, but it gets really interesting when you build RAG systems or agents with your local LLM. I show you an example of a RAG-System built with llama-index.\n",
+    "**tl;dr:** You can run small LLMs locally on your consumer PC, and with **ollama** that's very easy to set up. It is fun to chat with an LLM locally, but it gets really interesting when you build RAG systems or agents with your local LLM; there is great synergy. I show you an example of a RAG system built with **ollama** and **llama-index**.\n",
     "\n",
     "## Running small LLMs locally with quantization\n",
     "\n",
-    "Large language models are large, mindboggingly large. Even if we had the source code and the weights of ChatGPTs GPT-4o model, with its (probably, the exact size is not known) 1,800b parameters - that is b for billion - it would be about 3 TB in size if every paramter is stored as a 16 bit float. Difficult to fit into your RAM!\n",
+    "Large language models are large, mind-bogglingly large. Even if we had the source code and the weights of ChatGPT's GPT-4o model, with its [probably 1760b parameters](https://the-decoder.com/gpt-4-has-a-trillion-parameters/) - that is b for billion - it would be about 3.5 TB in size if every parameter is stored as a 16-bit float. Difficult to fit into your RAM!\n",
     "\n",
     "``\n",
-    "We could use proper SI notation, '1800G' or '1.8T' instead of '1800b' 😞, since 'billion' means different things in different languages, but here we are.\n",
+    "We could use proper SI notation, '1800G' or '1.8T' instead of '1800b', since 'billion' means different things in different languages, but here we are 😞.\n",
     "``\n",
     "\n",
     "But never mind, we don't have the code and weights anyway. So what about open source models? While the flagships are still too large, there is a vibrant community on the HuggingFace platform that makes and improves models that have only **8b** to **30b** parameters, and those models are not useless. Meta has recently released a language model llama-3.2 with only **3b** parameters. While you cannot expect the same detailed knowledge about the world and attention span as the flagship models, these models still produce coherent text and you can have decent short conversations with them. I would recommend using at least an **8b** model, because the smaller models likely won't follow your prompt very well.\n",
     "\n",
     "An 8b model is 200 times smaller than GPT-4o, but still has a size of about 15 GB. It fits into your CPU RAM, but you want it to fit onto your GPU. If it does not fit completely onto the GPU, a part of the calculation has to be done with the CPU, and that will slow down the generation dramatically. Memory transfer speed is the bottleneck.\n",
     "\n",
-    "Fortunately, one can quantize the parameters quite strongly without loosing much. It turns out one can go down to 4 or 5 bits per parameter without loosing much (about one percent in benchmarks compared to the original model). This finally brings these models down to a size that fits onto consumer GPUs. You need some extra memory for the code and context window as well.\n",
+    "Fortunately, one can quantize the parameters quite strongly without losing much. It turns out one can go down to 4 or 5 bits per parameter at a cost of only about one percent in benchmarks compared to the original model [ref1](https://neuralmagic.com/blog/we-ran-over-half-a-million-evaluations-on-quantized-llms-heres-what-we-found/), [ref2](https://arxiv.org/pdf/2402.16775), [ref3](https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9178256&fileOId=9178257). This finally brings these models down to a size that fits onto consumer GPUs. You need some extra memory for the code and the context window as well.\n",
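+    "\n",
+    "As a rough back-of-the-envelope check (my own sketch; the parameter counts and bit widths are only illustrative), you can estimate the size of the weights like this:\n",
+    "\n",
+    "```python\n",
+    "# rough size of the model weights alone, ignoring context window and runtime overhead\n",
+    "def weight_size_gb(params_in_billions, bits_per_param):\n",
+    "    return params_in_billions * 1e9 * bits_per_param / 8 / 1e9\n",
+    "\n",
+    "print(weight_size_gb(1760, 16))  # GPT-4o-sized model at 16 bit: ~3500 GB\n",
+    "print(weight_size_gb(8, 16))     # 8b model at 16 bit: ~16 GB\n",
+    "print(weight_size_gb(8, 5))      # 8b model at 5-bit quantization: ~5 GB\n",
+    "```\n",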
     "\n",
     "**If you are interested in this sort of thing and plan to buy a GPU soon, take one with at least 16 GB of RAM. GPU speed does not really matter.**\n",
     "\n",
-    "There are a couple of libraries which allow you to run these quantized models, but the best one by far is **Ollama** in my experience. Ollama is really easy to install and use. It successfully hides a lot of the complexity from you, and gives you easy start into the world of runnig local LLMs.\n",
+    "There are a couple of libraries which allow you to run these quantized models, but the best one is **Ollama** in my experience. Ollama is really easy to install and use. It successfully hides a lot of the complexity from you, and gives you an easy start into the world of running local LLMs.\n",
     "\n",
-    "I had a lot of fun trying out different models. There are leaderboards ([Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) and [Chatbot Arena](https://lmarena.ai/)) which help to select good candidates. I noticed large differences in perceived quality among models with the same size. Generally, I recommend finetuned versions of the llama-3.1:8b and gemma2:9b models by the community. If you want to skip over that, then try out mannix/gemma2-9b-simpo, and if you have at least 16GB of GPU RAM, gemma2:27b.\n",
+    "I had a lot of fun trying out different models. There are leaderboards ([Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) and [Chatbot Arena](https://lmarena.ai/)) which help you select good candidates. I noticed large differences in perceived quality among models with the same size. Generally, I recommend fine-tuned versions of the llama-3.1:8b and gemma2:9b models from the community. If you want to skip over that, then try out `mannix/gemma2-9b-simpo`.\n",
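+    "\n",
+    "To give you an idea of how little code chatting with a local model takes, here is a minimal sketch using the official `ollama` Python package (it assumes you already pulled the model with `ollama pull mannix/gemma2-9b-simpo`):\n",
+    "\n",
+    "```python\n",
+    "# pip install ollama\n",
+    "import ollama\n",
+    "\n",
+    "# send a single chat message to the locally running Ollama server\n",
+    "response = ollama.chat(\n",
+    "    model='mannix/gemma2-9b-simpo',\n",
+    "    messages=[{'role': 'user', 'content': 'Explain quantization of LLMs in one paragraph.'}],\n",
+    ")\n",
+    "print(response['message']['content'])\n",
+    "```\n",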
     "\n",
     "## Great, I have a local LLM running, now what?\n",
     "\n",
     "Having an LLM running locally is nice and all, but for programming and asking questions about the world, the free tiers of ChatGPT and Claude are better. The really interesting use case for local LLMs is to chat with your documents using retrieval-augmented generation (RAG).\n",
     "\n",
-    "There is great synergy in running a RAG System with a local LLM.\n",
+    "There is great synergy in running a RAG system with a local LLM:\n",
+    "\n",
     "- You can keep your local documents private. Nothing will ever be transferred to the cloud.\n",
     "- No additional costs. If you want to use the API of ChatGPT or Claude, you have to pay eventually. That's especially annoying while you are still developing, when you will run the LLMs over and over to test your application.\n",
     "- Local LLMs lack detailed world knowledge, but the RAG system complements that lack of knowledge. Without RAG, local LLMs hallucinate a lot, but with RAG they will provide factual knowledge.\n",
@@ -43,11 +44,11 @@
     "\n",
     "For a RAG system, you need to convert your documents into plain text or Markdown, and an index to pull up relevant pieces from this corpus according to your query. There is currently a gold rush around developing converters for all kinds of documents into LLM-readable text, especially when it comes to PDFs. People try to make you pay for this service. For PDFs, a free alternative that runs locally is **pymupdf4llm**. If your documents contain images, you can also run a multi-modal LLM like llama-3.2-vision to make text descriptions for these images automatically.\n",
     "\n",
-    "Once you have your documents in plain text, you can split into mouth-sized pieces (mouth-sized for your LLM and its (small) context window) and use an embedding model to compute semantic vectors for each piece. These vectors magically encode semantic meaning of text, and can be used to find pieces that are relevant to a query using cosine similiarity - that's essentially a dot-product of the vectors. It is hard to believe that this works, but it actually does. Search via embeddings is superior to keyword search, but I can also say from experience that it is not a silver bullet. The best RAG Systems combine keywords with embeddings in some way. Using a good embedding model is key. If you use a model trained for english on German text, for example, it won't perform well, or if your documents contain lots of technical language that the embedding model was not trained on.\n",
+    "Once you have your documents in plain text, you can split them into mouth-sized pieces (mouth-sized for your LLM, so that multiple pieces fit into its small context window) and use an embedding model to compute semantic vectors for each piece. These vectors magically encode the semantic meaning of text, and can be used to find pieces that are relevant to a query using cosine similarity - that's essentially a dot product of the vectors. It is hard to imagine that this works, but it actually does (more or less). Search via embeddings can be superior to keyword search, but in my experience it is not a silver bullet. The best RAG systems combine keyword search with embedding search in some way. Using a good embedding model is key. If you use a model trained solely on English text for German documents, for example, it won't perform well; the same goes for documents full of technical language that the embedding model was not trained on.\n",
     "\n",
     "Thankfully, Ollama also offers embedding models, so you can run these locally as well. I found that mxbai-embed-large works well for both English and German text.\n",
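+    "\n",
+    "Here is a minimal sketch of that core idea (my own illustration, independent of the demo below): it embeds a few sentences with Ollama and ranks them against a query by cosine similarity. The sentences are toy examples, and the return format of the embeddings call may differ between versions of the ollama package:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "import ollama\n",
+    "\n",
+    "def embed(text):\n",
+    "    # ask the local Ollama server for an embedding vector\n",
+    "    return np.array(ollama.embeddings(model='mxbai-embed-large', prompt=text)['embedding'])\n",
+    "\n",
+    "pieces = ['The Colossus of Rhodes was a statue of the sun god Helios.',\n",
+    "          'The Hanging Gardens are attributed to king Nebuchadnezzar II.',\n",
+    "          'My cat prefers to sleep on the keyboard.']\n",
+    "query = embed('What do we know about the statue of Rhodes?')\n",
+    "\n",
+    "for piece in pieces:\n",
+    "    vector = embed(piece)\n",
+    "    # cosine similarity is the dot product of the normalized vectors\n",
+    "    score = np.dot(query, vector) / (np.linalg.norm(query) * np.linalg.norm(vector))\n",
+    "    print(f'{score:.3f}  {piece}')\n",
+    "```\n",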
     "\n",
-    "Writing a RAG from scratch with Ollama is not too hard, but it usually pays off to use a well-designed library to do the grunt work, and then start to improve from there. I compared many libraries, and can confidently recommend **llama-index** as the best one by far. It is feature-rich and well designed: little boilerplate code for simple things, yet easy to extend. The workflow system especially is really well designed. Just their (good) documentation is annoyingly difficult to find, they try to push you to their paid cloud services (did I mention, there is a gold rush...).\n",
+    "Writing a RAG from scratch with Ollama is not too hard, but it usually pays off to use a well-designed library to do the grunt work, and then start to improve from there. I compared many libraries, and can confidently recommend **llama-index** as the best one by far. It is feature-rich and well designed: little boilerplate code for simple things, yet easy to extend. The workflow system especially is really well designed. Just their (good) documentation is annoyingly difficult to find, because they try to push you to their paid cloud services (did I mention there is a gold rush...). I review some other libraries in the **appendix** to this post.\n",
     "\n",
     "Below, I show you a RAG demo system, where I pull in Wikipedia pages about the seven wonders of the ancient world and then ask some questions about the Rhodes statue and the Hanging Gardens. As I am German, I wanted to see how well this works with German queries on German documents. That is not trivial, because both the LLM and the embedding model then have to understand German. I compare the results with and without RAG. Without RAG, the model will hallucinate details. With RAG, it follows the facts in the source documents closely. It is really impressive.\n",
     "\n",
@@ -155,7 +156,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
    "metadata": {},
    "outputs": [
     {
@@ -350,7 +351,8 @@
     "show_sources = False\n",
     "\n",
     "# Now we ask our questions. Set show_sources=True to see which text pieces were used.\n",
-    "# For reference, we compare the RAG answer (\"RAG\") with a plain LLM query (\"Ohne RAG\"). \n",
+    "# For reference, we compare the RAG answer (\"RAG\") with a plain LLM query (\"Ohne RAG\").\n",
+    "# If you don't speak German, no problem: I discuss the results further below in English.\n",
     "for q in question:\n",
     "    q2 = q + \" Antworte detailliert auf Deutsch.\"\n",
     "    \n",
@@ -372,9 +374,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The answers without RAG are much nicer to read, but contain halluciations, while the RAG answers are dull, brief, but factually correct. The behavior of the LLM without RAG is a consequence of human preference optimization. The LLM generates answers by default that *look nice* to humans.\n",
+    "## Discussion\n",
+    "\n",
+    "Both the embedding model and the LLM handle German without issues. The answers without RAG are much nicer to read, but contain hallucinations, while the RAG answers are dull and brief, but factually correct. The behavior of the LLM without RAG is a consequence of human preference optimization. The LLM generates answers by default that *look nice* to humans.\n",
     "\n",
-    "The RAG answer is very short, because the internal prompt of llama-index asks the LLM to only use information provided by the RAG system and not use its internal knowledge. It is therefore not a bug but a feature that the answer of the LLM is so short: it faithfully tries to only make statements that are covered by the text pieces. \n",
+    "The RAG answer is very short, because the internal prompt of llama-index asks the LLM to only use information provided by the RAG system and not use its internal knowledge. It is therefore not a bug but a feature: the LLM faithfully tries to only make statements that are covered by the text pieces. The LLM is also not confused by irregularities in the text snippets that the document reader did not filter out, like wiki markup.\n",
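+    "\n",
+    "You can look at that internal prompt yourself. The sketch below assumes a `query_engine` object like the one built earlier in this notebook; the prompt key follows the llama-index documentation and may differ between versions:\n",
+    "\n",
+    "```python\n",
+    "from llama_index.core import PromptTemplate\n",
+    "\n",
+    "# print the default question-answering prompt used by the query engine\n",
+    "prompts = query_engine.get_prompts()\n",
+    "print(prompts['response_synthesizer:text_qa_template'].get_template())\n",
+    "\n",
+    "# the template can also be replaced, for example with a stricter instruction\n",
+    "custom = PromptTemplate('Context information: {context_str} Answer the following query using only this information: {query_str}')\n",
+    "query_engine.update_prompts({'response_synthesizer:text_qa_template': custom})\n",
+    "```\n",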
     "\n",
     "1. Question is about the materials used to construct the Rhodes statue.\n",
     "\n",
@@ -399,11 +403,11 @@
    "source": [
     "## Conclusions\n",
     "\n",
-    "RAG works very well even with small local LLMs. The caveats of small LLMs (lack of world knowledge) are compensated by RAG. The RAG answers are very faithful to the source in our example and contain no hallucinations. The use of local LLMs allows us to avoid additional costs and keeps our data private.\n",
+    "RAG works very well even with small local LLMs. The caveats of small LLMs (lack of world knowledge) are compensated for by RAG. The RAG answers are faithful to the sources in our example and contain no hallucinations. The use of local LLMs allows us to avoid additional costs and keeps our documents private.\n",
     "\n",
     "The main challenge in setting up a RAG is the index. Finding all relevant pieces of information, without adding too many irrelevant pieces, is a hard problem. There are multiple ways to refine the basic RAG formula:\n",
     "\n",
-    "- Getting more relevant pieces by augmenting the source documents with metadata like tags or LLM-generated summaries for larger sections.\n",
+    "- Getting more relevant pieces by augmenting the source documents with metadata like tags, LLM-generated summaries for larger sections, and cross-references to other snippets.\n",
     "- Smarter text segmentation based on semantic similarity or logical document structure.\n",
     "- Postprocessing the retrieved documents, by letting an LLM rerank them according to their relevance for the query.\n",
     "- Asking the LLM to critique its answer, and then to refine it based on the critique.\n",
     "\n",
     "Have a look into the [llama-index documentation](https://docs.llamaindex.ai/en/stable/examples/) for more advanced RAG workflows."
    ]
   },
@@ -412,6 +416,80 @@
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Appendix: RAG libraries that I explored\n",
+    "\n",
+    "There are numerous libraries available for RAG, but many have significant drawbacks from the point of view of my requirements:\n",
+    "\n",
+    "- I don't want to send documents to cloud services (data privacy).\n",
+    "- I don't want to pay fees for cloud services.\n",
+    "- I want to use a well-designed library that is easy to use and extend.\n",
+    "- I want many ready-made components, so that I get good value for my time investment.\n",
+    "\n",
+    "### Candidates Reviewed\n",
+    "\n",
+    "The GitHub stars indicate popularity.\n",
+    "\n",
+    "- **LangChain**: 96k stars\n",
+    "- **LlamaIndex**: 37k stars\n",
+    "- **Autogen**: 35k stars\n",
+    "- **Haystack**: 18k stars\n",
+    "- **Txtai**: 9.6k stars\n",
+    "- **AutoChain**: 1.8k stars\n",
+    "\n",
+    "### Problems Identified\n",
+    "\n",
+    "1. **Push to use cloud services** \n",
+    "    - All these libraries are open source, but most of them try to push you towards using paid cloud services to get essential functionality; in that case, data privacy cannot be guaranteed.\n",
+    "    - Examples of paid cloud-based services include:\n",
+    "        - Document databases\n",
+    "        - PDF converters\n",
+    "        - Web search providers\n",
+    "\n",
+    "2. **Dependencies and Installation** \n",
+    "    - Too many dependencies\n",
+    "    - Difficult installation (e.g., requires Docker, incompatible libraries, etc.)\n",
+    "\n",
+    "3. **Design Flaws** \n",
+    "    - Poor and/or bloated design\n",
+    "    - Volatile APIs\n",
+    "    - Bad cost/benefit ratio compared to custom-written software\n",
+    "\n",
+    "4. **Excluded Libraries** \n",
+    "    - **Autogen**: No focus on RAG functionality \n",
+    "    - **AutoChain**: Project seems to be dead; the codebase has not been maintained for a year\n",
+    "\n",
+    "### Candidate Shortlist\n",
+    "\n",
+    "Out of the initial contenders, Haystack and LlamaIndex survived my requirements. I installed both and tried out examples. Both are easy to install and have moderate dependencies. Both were designed for RAG, but also support agentic workflows. Both have good documentation.\n",
+    "\n",
+    "#### **Haystack**\n",
+    "\n",
+    "##### Pros\n",
+    "- API inspired by functional design principles, leading to clear information flow\n",
+    "- Claimed to be used by Netflix, Nvidia, Apple, Airbus, etc.\n",
+    "\n",
+    "##### Cons\n",
+    "- Excessive boilerplate code\n",
+    "- Inconvenient to extend\n",
+    "- Limited functionality\n",
+    "\n",
+    "#### **LlamaIndex**\n",
+    "\n",
+    "##### Pros\n",
+    "- Strong community support (e.g., [llamahub.ai](https://llamahub.ai)) offering components\n",
+    "- Minimal boilerplate code\n",
+    "- Elegant design with good defaults (see, for example, the Workflow class)\n",
+    "- Many subpackages with specific functionality, so you only install what you really need\n",
+    "\n",
+    "##### Cons\n",
+    "- Documentation is not easy to find when you land on their webpage, because they try to push you to use their cloud services\n",
+    "- Information flow in the API is not always easy to follow, because configuration is done via a global Settings object and not passed call by call; that is the price of a design with minimal boilerplate code (see the sketch below).\n",
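+    "\n",
+    "To make both points concrete (minimal boilerplate, but configuration through a global `Settings` object), here is a minimal sketch of a complete RAG pipeline with llama-index and Ollama. The `data` folder and the model names are placeholders, and the import paths refer to the llama-index 0.10+ package layout:\n",
+    "\n",
+    "```python\n",
+    "# pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama\n",
+    "from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex\n",
+    "from llama_index.embeddings.ollama import OllamaEmbedding\n",
+    "from llama_index.llms.ollama import Ollama\n",
+    "\n",
+    "# configuration goes through the global Settings object ...\n",
+    "Settings.llm = Ollama(model='mannix/gemma2-9b-simpo', request_timeout=120.0)\n",
+    "Settings.embed_model = OllamaEmbedding(model_name='mxbai-embed-large')\n",
+    "\n",
+    "# ... and is picked up implicitly by everything below\n",
+    "documents = SimpleDirectoryReader('data').load_data()\n",
+    "index = VectorStoreIndex.from_documents(documents)\n",
+    "query_engine = index.as_query_engine()\n",
+    "print(query_engine.query('What was the Colossus of Rhodes made of?'))\n",
+    "```"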
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -420,7 +498,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "default",
    "language": "python",
    "name": "python3"
   },