
Building a RAG? Tired of chunking? Maybe Vision is All You Need!

At the core of most modern GenAI solutions is a method called Retrieval-Augmented Generation, which software engineers in the applied AI space usually refer to simply as “RAG”. With RAG, a language model can answer questions based on a company’s proprietary data.

The first letter, R for retrieval, refers to the search process. When a user asks a GenAI bot a question, the search engine behind the scenes should accurately locate material related to the question so that the model can produce an accurate, hallucination-free response. The A and G refer to augmenting the language model’s prompt with the retrieved data and generating the final answer.

This article focuses primarily on the retrieval process, as it is the most critical, labor-intensive, and challenging part of implementing a RAG architecture. We will first explore retrieval at a general level and then walk through the mechanics of traditional chunk-based RAG retrieval. The latter part of the article introduces a new RAG approach that relies on image-based data for both retrieval and generation.

A Brief History of Information Retrieval

Google and other major search engine companies have been trying to solve the problem of information retrieval for decades — trying being the operative word. Information retrieval is still not as easy as it should be. One reason is that humans process information differently from machines. Converting natural language into a reasonable search query across diverse data masses is not simple. Google's power users are likely familiar with every possible trick to manipulate the search engine to their needs. Yet, it remains a laborious process, and the search results can be quite poor.

With the advancement of language models, information retrieval has suddenly been equipped with a natural language interface. Language models themselves, however, are terrible at providing fact-based information because their training data reflects a snapshot of the world at the time of training. Additionally, the knowledge is compressed within the model, and the well-known hallucination issue is inevitable. Language models, after all, are not search engines but reasoning machines.

The advantage of language models is that you can provide them with data samples and instructions, and ask them to respond based on these inputs. This is the typical use case for ChatGPT and similar conversational AI interfaces. But people are lazy, and if you have to gather and paste in the data yourself, you could often have done the task with the same effort. That's why we need RAG: you simply ask an applied AI solution a question and get an answer based on precise information. At least, that’s the ideal in a world with perfect search.

How Does Retrieval Work in Traditional RAG?

There are as many RAG retrieval methods as there are RAG implementations. Search is always an optimization problem, and no generic solution fits all: the AI architecture must be tailored to each specific use case, for search as well as for other features.

Despite this, the typical baseline solution is the so-called chunking technique. In this method, the information stored in the database, often documents, is split into small pieces, approximately the size of a paragraph. Each chunk is then turned into a numerical vector with an embedding model, and the resulting vector is stored in a specialized vector database.
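
To make this concrete, here is a minimal sketch of the indexing step, assuming paragraph-sized chunks, the sentence-transformers library for embeddings, and Qdrant's in-memory mode; the model name, file name, and collection details are illustrative choices, not part of any particular implementation.

```python
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedding model works here
client = QdrantClient(":memory:")                   # in-memory vector database for illustration

client.create_collection(
    collection_name="chunks",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)

# Split a document into roughly paragraph-sized chunks and index each one.
document = open("report.txt", encoding="utf-8").read()
chunks = [p.strip() for p in document.split("\n\n") if p.strip()]

client.upsert(
    collection_name="chunks",
    points=[
        models.PointStruct(id=i, vector=embedder.encode(chunk).tolist(), payload={"text": chunk})
        for i, chunk in enumerate(chunks)
    ],
)
```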

A naive search implementation over a vector database works as follows:

  1. The user asks a question.

  2. An embedding vector is generated from the question.

  3. A semantic search is performed on the vector database.

    • In semantic search, the proximity between the question's vector and the vectors in the database is mathematically measured, taking into account the context and meaning of the text chunks.

  4. The vector search returns, for example, the top 10 text chunks that most closely match the search.

The retrieved text chunks are then inserted into the language model's context (prompt), and the model is asked to generate an answer to the original question. These last two steps after retrieval are the A and G phases of RAG.
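
Continuing the sketch above (reusing the `client` and `embedder` objects), the retrieval, augmentation, and generation steps might look roughly like this; the OpenAI call stands in for any capable language model, and the prompt format is just one possibility.

```python
from openai import OpenAI

question = "What were the main findings of the report?"

# Steps 2-4: embed the question and run a semantic search over the chunk vectors.
hits = client.search(
    collection_name="chunks",
    query_vector=embedder.encode(question).tolist(),
    limit=10,
)

# A and G: augment the prompt with the retrieved chunks and generate the answer.
context = "\n\n".join(hit.payload["text"] for hit in hits)
response = OpenAI().chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```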

The chunking technique and other preprocessing before indexing significantly impact the search quality. There are dozens of different methods for this preprocessing, and information can also be reorganized or filtered after the search (known as reranking). In addition to vector search, traditional keyword search or any other programmatic interface that retrieves structured information can be used. Techniques such as text-to-SQL or text-to-API fall into this category, where an SQL or API query is generated from the user's question. For unstructured data, chunking and vector search are the most commonly used techniques for retrieving information.

Chunking is not without its issues. Handling different file and data formats is laborious, and you must write separate chunking code for each format. There are existing software libraries for this, but they are not perfect. The size and overlap of the chunks must also be considered. Next, you encounter challenges with images, graphs, tables, and other data where understanding the visual information and its surrounding context is crucial. This includes elements like headings, font sizes, and other subtle visual cues that are completely lost in the chunking technique.

What if this chunking wasn't necessary at all, and the search worked as if a human were skimming through entire pages of documents?

Images Retain Visual Information

Implementing search methods based on images has become possible thanks to the development of advanced multimodal models. A prime example of AI solutions based on image data is Tesla’s autonomous driving solution, which relies solely on cameras. The idea behind this approach is that humans primarily perceive their environment visually.

This same concept can be applied to RAG implementations. Instead of chunking, entire pages are indexed directly as images, i.e., in the same format as humans view them. For example, each page of a PDF document is fed as an image into a specialized AI model (ColPali), which creates a vector representation based on both the visual content and the text. These vectors are then added to the vector database. We can call this new RAG architecture Vision Retrieval-Augmented Generation, or V-RAG for short.
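
As a rough sketch of this indexing step, assuming the colpali-engine 0.3-style interface and the vidore/colpali-v1.2 checkpoint (the library's API has changed between versions, so treat the exact calls as assumptions):

```python
import pypdfium2 as pdfium
import torch
from colpali_engine.models import ColPali, ColPaliProcessor

# Load the ColPali model and its processor onto a GPU.
model = ColPali.from_pretrained(
    "vidore/colpali-v1.2", torch_dtype=torch.bfloat16, device_map="cuda"
)
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

# Render each PDF page as an image, i.e. in the same format a human would see it.
pdf = pdfium.PdfDocument("manual.pdf")
page_images = [pdf[i].render(scale=2.0).to_pil() for i in range(len(pdf))]

# One forward pass produces a multivector per page (roughly 1030 x 128 numbers each);
# these multivectors are what gets stored in the vector database.
with torch.no_grad():
    batch = processor.process_images(page_images).to(model.device)
    page_embeddings = model(**batch)
```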

The advantage of this approach is likely better retrieval accuracy than traditional methods, as the vector representation generated by the multimodal model takes into account both the textual and visual elements. The search results would be entire pages from the documents, which are then fed as images into a capable multimodal model like GPT-4. The model interpreting the source images can then directly reference information from a table or chart.

V-RAG works without needing to first extract complex structures like charts or tables into text, restructure that text into a new format, place it in a vector database, retrieve it, rerank it into a coherent prompt, and finally generate the answer. This offers a significant advantage when dealing with old manuals, table-heavy documents, and essentially any human-oriented document formats where much of the content is more than just plain text. Indexing is also much faster than in traditional layout detection and OCR processes.

Indexing speed stats from the ColPali paper

Text extraction from documents might still add value and assist the search alongside images. Nonetheless, chunking as a technique will soon become just one way to implement AI search systems.

Vision-RAG in Practice: Paligemma, ColPali, and Vector Databases

Implementing V-RAG still requires access to specialized models and GPU computation, unlike traditional text-based RAG, which can largely be run on commonly available large models from the cloud. The best way to implement V-RAG today is to use a model developed specifically for this purpose: ColPali.

ColPali is based on the multivector search introduced in the ColBERT model and Google's multimodal Paligemma language model. ColPali is a multimodal search model, meaning it understands not only text content but also the visual elements of documents. In practice, the developers of ColPali extended ColBERT's text-based search method to cover the visual space using Paligemma.

When creating embeddings, ColPali divides each image into a 32 x 32 grid, resulting in 1024 patches per image, with each patch represented by a 128-dimensional vector. The total number of vectors per page is 1030, as the instruction text "Describe the image." appended to every image adds six more tokens.

The user's text-based query is converted into the same embedding space, so that the query tokens and the image patches can be compared during the search. The search process itself is based on the so-called MaxSim method, which is well explained in this article. This search method is already implemented in many vector databases that support multivector retrieval.
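
The scoring itself is simple to express. Below is a minimal NumPy sketch of MaxSim, assuming one page is the roughly 1030 x 128 multivector described above and the query is one 128-dimensional vector per query token:

```python
import numpy as np

def maxsim_score(query_embeddings: np.ndarray, page_embeddings: np.ndarray) -> float:
    """For each query token, take its best-matching page patch, then sum over query tokens."""
    # query_embeddings: (num_query_tokens, 128), page_embeddings: (num_patches, 128)
    similarities = query_embeddings @ page_embeddings.T  # (num_query_tokens, num_patches)
    return float(similarities.max(axis=1).sum())

# Rank pages by their MaxSim score against the query (dummy data for illustration).
query = np.random.randn(12, 128).astype(np.float32)
pages = [np.random.randn(1030, 128).astype(np.float32) for _ in range(3)]
ranking = sorted(range(len(pages)), key=lambda i: maxsim_score(query, pages[i]), reverse=True)
```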

Vision is All You Need – V-RAG Demo & Code

We created a V-RAG demo, and the code is available in Softlandia’s GitHub under the repository vision-is-all-you-need. You’ll find other applied AI demos under our account as well!

Running ColPali requires a GPU with plenty of memory. Therefore, it's easiest to run it in the cloud on a platform that offers on-demand GPU compute. For this, we chose the excellent Modal platform, where using GPU capacity serverlessly is easy and cost-effective.
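
To illustrate how little boilerplate this takes on Modal, here is a hedged sketch (not the demo's actual code; the GPU type, image contents, and function body are assumptions):

```python
import modal

app = modal.App("vision-rag-sketch")
gpu_image = modal.Image.debian_slim().pip_install("torch", "colpali-engine", "pypdfium2")

@app.function(gpu="A10G", image=gpu_image, timeout=600)
def embed_pdf_pages(pdf_bytes: bytes) -> list:
    """Run ColPali inside a serverless GPU container and return one multivector per page."""
    # ... render the pages and embed them with ColPali, as in the earlier sketch ...
    return []

# Started with `modal run`, the function only consumes GPU capacity while it executes.
```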

In this demo, we also used Qdrant's in-memory mode. If you run the demo, note that the indexed data disappears once the underlying container ceases to exist. Qdrant has supported multivector search since version 1.10.0. The demo only supports PDF files, whose pages are converted into images using the handy pypdfium2 library. Additionally, we used the transformers library and the colpali-engine package developed by the creators of ColPali to run the ColPali model. Other libraries, such as opencv-python-headless (which, incidentally, is my work), are also in use.
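
The multivector setup in Qdrant takes only a few lines; a minimal sketch assuming qdrant-client 1.10 or newer (collection name and payload are illustrative):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # in-memory: the index disappears with the container

client.create_collection(
    collection_name="pdf_pages",
    vectors_config=models.VectorParams(
        size=128,                          # ColPali patch vectors are 128-dimensional
        distance=models.Distance.COSINE,
        multivector_config=models.MultivectorConfig(
            comparator=models.MultivectorComparator.MAX_SIM  # multivector scoring via MaxSim
        ),
    ),
)

# Each point stores a whole page as a list of patch vectors (here a dummy placeholder
# standing in for a real ColPali page embedding of roughly 1030 x 128 values).
client.upsert(
    collection_name="pdf_pages",
    points=[models.PointStruct(id=0, vector=[[0.1] * 128] * 1030, payload={"page": 0})],
)
```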

The demo provides an HTTP interface for indexing and asking questions. On top of this, we built a simple user interface using React. The user interface also visualizes the per-token attention maps, so you can easily check which parts of the image the model considers important.

A screenshot of the Vision is All You Need demo

Is Vision Really All You Need?

Despite the demo's title, search models like ColPali are not yet good enough, especially with multilingual data. The models are typically trained on a limited number of examples, which are almost always certain types of PDF files. Hence, the demo also only supports PDF files.

Another issue is the size of image data and the embeddings calculated from it. They take up a considerable amount of space, and searching through large data sets consumes significant computing power. This problem can be partially addressed by quantizing the embeddings into smaller forms, even down to binary. However, this leads to some loss of information, and search accuracy suffers slightly. In our demo, this has not been implemented, as optimization is not essential for the demo. Additionally, it’s worth noting that Qdrant does not support binary vectors.
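
To give an idea of the kind of quantization meant here, a generic NumPy sketch (not part of the demo) that keeps only the sign of each dimension and packs the bits:

```python
import numpy as np

def binarize(multivector: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension and pack 8 bits per byte: 128 floats -> 16 bytes."""
    return np.packbits(multivector > 0, axis=-1)

page = np.random.randn(1030, 128).astype(np.float32)  # one ColPali page embedding, ~527 kB
packed = binarize(page)                               # ~16 kB, a 32x reduction in size
```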

For these reasons, it's still a good idea to run a traditional keyword-based search as a first-stage filter alongside ColPali. After that, ColPali can be used to retrieve the final pages that will be fed into the language model's context.

Multimodal search models will continue to evolve, just like the traditional embedding models that produce text embeddings. I believe it is highly likely that OpenAI or a similar organization will soon release a ColPali-style embedding model, pushing search accuracy to a new level. This would, however, upend all current systems built on chunking and text-embedding-based search.

Without Flexible AI Architecture, You'll Fall Behind

Language models, search methods, and other innovations are being released at an accelerating pace in the AI world. More important than the innovations themselves is the ability to adopt them quickly, which provides a significant competitive advantage over slower actors.

For this reason, the AI architecture of your software, including search, must be flexible and scalable to quickly adapt to the latest technological innovations. As development speeds up, it’s crucial that the core architecture of the system is not tied to a single solution but can support diverse search methods—whether it’s traditional text-based search, multimodal image search, or even the use of entirely new search models.

ColPali is just a taste of what’s to come. Future RAG solutions will combine multiple data sources and search techniques, and only an agile and customizable architecture will enable their seamless integration.

To solve this problem, we offer the following services:

  • Evaluation of your AI architecture’s current state

    • Sparring with your technical lead and developers in the depths of AI, down to the code level

    • We examine, for example, search methods, scalability, architectural flexibility, security, and whether (Gen)AI is being used according to best practices

    • We make improvement suggestions and list the concrete next steps for development

  • Implementation of AI features or your AI platform as part of your team

    • Dedicated applied AI engineers ensure that your AI project does not fall behind other developments

  • Development of an AI product as an outsourced product development unit

    • We deliver a complete AI-based solution from start to finish

We help our clients gain a significant competitive advantage by accelerating the adoption of AI and ensuring its seamless integration. If you’re interested in learning more, reach out and let’s discuss how we can help your company stay at the forefront of AI development.