Microsoft Copilot, Grok, ChatGPT and YOKOT.AI – a Look into RAGs

Generative AI solutions took a giant leap forward during the last week: Microsoft Copilot was released, X released the beta version of the Grok model, and new OpenAI features will be revealed today during OpenAI DevDay. However, relevant information about how these solutions work is buried under constant noise and buzzwords that most people do not understand. Technical information is hard to come by, and what exists is hard to make sense of because it relies on terms and vocabulary familiar only to people who work with generative AI solutions.

In this blog post, I will explain what these solutions have in common and how they work – or at least seem to work. I do not have access to the deep technical information of all these solutions. However, based on our work with our own generative AI product YOKOT.AI, and the bits and pieces found around the internet, I can make an educated guess about their inner workings.

Generative AI Models Utilize Pre-trained Knowledge

MS Copilot, Grok, YOKOT.AI and ChatGPT utilize so-called foundation models under the hood. Usually, the foundation model is GPT-3.5 or GPT-4, since they are currently the most capable large language models (LLMs). Grok uses the Grok-1 model, which has similar capabilities to GPT-3.5.

When users ask these models a question, the models reply based on the data they were trained on. Each model has a specific cutoff date, after which it knows nothing about the world’s events or data. When users wish to utilize up-to-date or otherwise private information, that information must be fed to the model somehow.

There is an expensive and slow way of doing this: fine-tuning, where a new model version is trained with some additional data. At the scale at which Microsoft, X and even we here at Softlandia operate, this is not practical due to the slow turnaround times. However, fine-tuning is helpful in some particular use cases. That work is usually carried out as separate projects for customers who require a specialized approach to some of their business questions. So, if fine-tuning is not used, how do these models work with all the additional data?

Short Intro to RAG – Retrieval Augmented Generation

Our product YOKOT.AI uses a sophisticated RAG approach to augment the foundation model’s knowledge with data that the user has added to the system.

What does RAG mean? In practice, RAG is a software engineering approach to overcoming the knowledge gaps in LLMs. RAG works in three steps: retrieving relevant external data from somewhere, augmenting the user query with the retrieved data, and generating an answer by giving the LLM the user query together with the retrieved data as additional context. Many methods can be used for model grounding, but for simplicity, we focus here on in-context learning.

The retrieval phase is usually implemented with a vector database. A vector database stores all the additional textual information in a numerical, vectorized format so that it can be searched mathematically. Embedding models are used to convert pieces of text into vectors, that is, lists of numbers that capture the meaning of each piece. OpenAI’s Ada model (text-embedding-ada-002) is a good example of an embedding model. To retrieve the relevant content, the user’s query is vectorized in the same way, and the vector database is searched for the stored vectors closest to it. The typical metric for comparing the vectors is cosine similarity.
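To make the retrieval idea concrete, here is a minimal sketch of a cosine-similarity search in plain Python. The `embed` function is only a toy stand-in (word counts over a small vocabulary) for a real embedding model, and the documents and function names are made up for illustration; a real system would call an embedding model and a vector database instead.

```python
import math
from collections import Counter


def embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy stand-in for an embedding model: one word count per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, documents: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k (score, document) pairs ranked by similarity to the query."""
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    query_vector = embed(query, vocabulary)
    scored = [
        (cosine_similarity(query_vector, embed(doc, vocabulary)), doc)
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Made-up example documents.
documents = [
    "softlandia builds applied ai solutions",
    "qdrant is a vector database",
    "the weather is cold today",
]
results = search("which vector database", documents)
print(results[0][1])  # the document sharing the most words with the query
```

Word counts only match identical words, whereas a learned embedding model also places texts with similar meaning close to each other. The ranking step, however, works the same way in both cases.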

From the vector database query, we get a list of best-matching texts. It’s up to the system to determine how it will use, combine, or further process these before augmenting the user prompt with that data. After that, the system makes the generative call to the LLM and returns an answer to the user.
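As an illustration of the "use, combine, or further process" step, the sketch below shows one possible way to turn a list of scored matches into an augmented prompt. The function name, the score threshold, the character budget and the prompt wording are all assumptions made for this example, not a description of how any of the products discussed here do it.

```python
def build_augmented_prompt(
    question: str,
    matches: list[tuple[float, str]],
    min_score: float = 0.3,
    max_context_chars: int = 500,
) -> str:
    """Combine retrieved texts and the user question into one LLM prompt."""
    context_parts: list[str] = []
    seen: set[str] = set()
    used_chars = 0
    # Walk through the matches from best to worst score.
    for score, text in sorted(matches, key=lambda m: m[0], reverse=True):
        if score < min_score or text in seen:
            continue  # drop weak matches and exact duplicates
        if used_chars + len(text) > max_context_chars:
            break  # stay within the (hypothetical) context budget
        context_parts.append(f"[{len(context_parts) + 1}] {text}")
        seen.add(text)
        used_chars += len(text)
    context = "\n".join(context_parts) if context_parts else "(no relevant context found)"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hand-written (score, text) pairs standing in for vector database results.
matches = [
    (0.82, "YOKOT.AI is Softlandia's generative AI product."),
    (0.82, "YOKOT.AI is Softlandia's generative AI product."),
    (0.75, "YOKOT.AI uses the Qdrant vector database."),
    (0.10, "Unrelated text."),
]
prompt = build_augmented_prompt("Tell me about Softlandia.", matches)
print(prompt)
```

Numbering the context passages is a design choice that makes it easier to ask the LLM to cite which passage an answer came from.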

I will use our YOKOT.AI private generative AI solution as an example here to summarize these steps:

  1. The user has crawled the Softlandia website with the YOKOT.AI crawler. YOKOT.AI has all Softlandia webpage contents indexed under the tag “Softlandia” in a vector database.

  2. The user sends a question to YOKOT.AI and tells YOKOT.AI to use the material under “Softlandia” as a potential source for information: “Tell me about Softlandia.”

  3. The user question is converted into a vector representation with an embedding model.

  4. A vector database is searched for the best matching texts.

  5. Best matches are inserted into the final LLM query.

  6. The LLM generates an answer from the augmented query, and the system returns it to the user.

To get the best results, the process requires more complicated logic and usually additional reasoning via multiple LLM calls or the function-calling capabilities of LLMs. However, this summary should give a basic understanding of how these AI solutions work.
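The six steps can be sketched end to end in a few lines of Python. This is not the YOKOT.AI implementation: `TinyRag`, `embed`, `cosine` and `stub_llm` are hypothetical names, the embedding is a toy sparse word-count vector, and the LLM is replaced by a stub that just echoes the prompt so the example runs without any external service.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy embedding: a sparse word-count vector (a real system calls an embedding model)."""
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyRag:
    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm
        self.index: list[tuple[str, str, Counter]] = []  # (tag, text, vector)

    def add(self, tag: str, text: str) -> None:
        # Step 1: index content under a tag.
        self.index.append((tag, text, embed(text)))

    def ask(self, question: str, tag: str, top_k: int = 2) -> str:
        # Step 2: the question arrives together with the tag to use as a source.
        query_vector = embed(question)  # Step 3: vectorize the question.
        candidates = [
            (cosine(query_vector, vector), text)
            for doc_tag, text, vector in self.index
            if doc_tag == tag
        ]
        # Step 4: pick the best-matching texts.
        best = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_k]
        context = "\n".join(text for _, text in best)
        # Step 5: insert the matches into the final LLM query.
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        return self.llm(prompt)  # Step 6: generate and return the answer.


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; it only echoes what it was given."""
    return f"Stub answer. Prompt was:\n{prompt}"


rag = TinyRag(llm=stub_llm)
rag.add("Softlandia", "Softlandia develops the YOKOT.AI product.")
rag.add("Softlandia", "YOKOT.AI uses the Qdrant vector database.")
rag.add("Other", "Tell me about something unrelated.")
answer = rag.ask("Tell me about Softlandia.", tag="Softlandia", top_k=1)
print(answer)
```

Passing the LLM in as a plain callable keeps the retrieval logic testable on its own, and a real model client could be swapped in later without touching the rest.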

Analyzing MS Copilot and Grok Architecture

Keeping the previous RAG approach in mind, we can map it onto the architecture of MS Copilot and Grok. Microsoft has published some high-level architecture diagrams that look very much like a RAG system. Microsoft uses its own Graph API to retrieve the data that the user has access to. According to the diagrams, the retrieved organizational data is constantly indexed to a system that can be semantically searched, which means there’s most likely some vector storage somewhere. The LLM queries are then grounded with this additional data.
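One detail worth illustrating is that retrieval is limited to the data the user has access to. The sketch below shows the general idea of filtering by permissions before ranking. It is purely hypothetical: it does not use the Microsoft Graph API, the `Document` type and its `allowed_users` field are invented for this example, and a crude word-overlap score stands in for semantic search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    text: str
    allowed_users: frozenset[str]  # hypothetical access-control list


def word_overlap(query: str, text: str) -> int:
    """Crude relevance score standing in for semantic search."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_for_user(
    user: str, query: str, documents: list[Document], top_k: int = 3
) -> list[str]:
    """Return relevant texts, considering only documents the user may read."""
    # Filter by permission first, so restricted text never reaches the prompt.
    visible = [doc for doc in documents if user in doc.allowed_users]
    ranked = sorted(visible, key=lambda doc: word_overlap(query, doc.text), reverse=True)
    return [doc.text for doc in ranked[:top_k] if word_overlap(query, doc.text) > 0]


# Made-up organizational documents and users.
documents = [
    Document("q3 budget draft for the sales team", frozenset({"alice"})),
    Document("q3 budget summary for all staff", frozenset({"alice", "bob"})),
    Document("lunch menu for friday", frozenset({"alice", "bob"})),
]
bob_context = retrieve_for_user("bob", "q3 budget", documents)
alice_context = retrieve_for_user("alice", "q3 budget", documents)
print(bob_context)
print(alice_context)
```

The same question therefore grounds the LLM with different context depending on who asks it.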

Illustration of the MS Copilot RAG

Regarding Grok, we know that X is using the Qdrant vector database to ingest data in real time for Grok to consume. This, again, hints at a RAG approach: all new posts (tweets) are inserted into the vector database in real time, and Grok then queries this database to gain real-time context in the chats. Qdrant is also used for X’s new “similar posts” feature, which is basically the retrieval part of RAG. By the way, YOKOT.AI also uses the awesome Qdrant database!

I have oversimplified many things here, but the high-level architecture of all three solutions is most likely rather similar. Based on the above solutions, the current go-to architecture for generative AI applications is a RAG-based approach. For simple use cases, a RAG similar to the one described in this post is enough. Advanced use cases require novel approaches in data indexing, retrieval, combination, and prompting.
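The real-time aspect can be illustrated with a small in-memory index in which a post becomes searchable as soon as it is inserted. This is only a conceptual stand-in and not the Qdrant API: the `LiveVectorIndex` class, its method names and the two-dimensional vectors are invented for the example, and the vectors are written by hand instead of coming from an embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


class LiveVectorIndex:
    """In-memory stand-in for a vector database that ingests posts in real time."""

    def __init__(self) -> None:
        self._points: dict[int, tuple[list[float], str]] = {}

    def upsert(self, post_id: int, vector: list[float], text: str) -> None:
        # Inserting (or overwriting) a point makes it searchable immediately.
        self._points[post_id] = (vector, text)

    def similar(self, vector: list[float], top_k: int = 1) -> list[str]:
        """Retrieval only: return the texts whose vectors are closest to the query."""
        ranked = sorted(
            self._points.values(),
            key=lambda point: cosine(vector, point[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:top_k]]


index = LiveVectorIndex()
index.upsert(1, [1.0, 0.0], "post about football")
before = index.similar([0.0, 1.0])  # only the football post exists so far
index.upsert(2, [0.1, 0.9], "post about rockets")
after = index.similar([0.0, 1.0])  # the new post is found right away
print(before, after)
```

A "similar posts" feature would stop at `similar`, while a RAG chat would pass the returned texts on to the LLM as context.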

It’s all about Software Engineering and less about AI

What can we learn from this? Developing state-of-the-art AI solutions on top of large language models is mostly about software engineering, not AI. Of course, you need in-depth skills and information about AI and ML to develop these solutions, but the focus is mainly on the software side. We call the people who develop these new solutions applied AI engineers – software engineers with an AI twist.

As explained in this blog post, the limitations of the foundation models are overcome by clever software engineering tricks that make the LLM utilize additional data without re-training the model. Research and development in this area is ongoing. New approaches to making RAG systems even better are published continually, and we evaluate them constantly in our YOKOT.AI product.

If you want to know more about generative AI, RAGs, and state-of-the-art applied AI solutions, please get in touch with us. We are continuously working in partnership with many companies at the forefront of utilizing LLMs and other modern AI solutions.