Building Robust LLM Solutions - 3 Patterns to Avoid

Having now built dozens of Large Language Model (LLM) based solutions for different industries and use cases, we’ve found some LLM anti-patterns that everyone should be aware of. While LLMs are powerful, they are also brittle, and capabilities change from model to model. It is simple to build a demo with an LLM, but much harder to make one work reliably in production. So don’t fall into the following traps!

Part of the challenge in productizing LLM-powered services is that a problem is rarely solved just by deploying an LLM. Infrastructure around the model is just as important as the model itself. For example, in chat-with-your-data applications that are commonly built using retrieval augmented generation (RAG) architectures, the most suitable RAG implementation changes from use case to use case.

So, here are some lessons we’d like to share from the field. I will introduce three LLM anti-patterns below. At the same time, I will acknowledge that the patterns we have established so far keep evolving so fast that it is debatable whether they deserve to be called patterns in the first place 🙂

LLM anti-patterns

Let's dive right in!

Doing everything with one LLM

We’ve now got a rather large selection of LLMs to choose from for our projects. Commercially available models are great for generic use cases, while the AI community has built models that excel at niche use cases such as storytelling or on-device deployments. So when you pick a model for your service, make sure you can explain why you picked that specific model.

GPT-4 is, in our opinion, still the champion of LLMs. We’ll see how this changes with Gemini Ultra 🙂 For now, GPT-4 is slowish, has usage restrictions (tokens-per-minute limits), and can get pricey. It may be a great model, but it is not needed for everything. So consider whether you really need GPT-4; if you don’t, you may be able to provide a cheaper service or a faster user experience.

This works the other way as well. Don't hold out on trying the better and more expensive models, as they will get cheaper and faster in the coming months! You'll have a much better grasp of what is possible with LLMs if you actually use the best available models. So limiting yourself to a single model without a clear reason is an anti-pattern.
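As a rough illustration, here is a minimal sketch of per-task model routing using the OpenAI Python SDK. The model names and the heuristic for deciding which tasks need the stronger model are placeholder assumptions; in practice you would tune both for your own workload.

```python
# A minimal sketch of per-task model selection (assumed model names, not a
# production router). Requires the OpenAI Python SDK (>= 1.0) and an
# OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical task categories that we assume benefit from the stronger model.
HARD_TASKS = {"analysis", "code_review"}

def complete(task: str, prompt: str) -> str:
    # Route expensive reasoning to GPT-4, everything else to a cheaper model.
    model = "gpt-4" if task in HARD_TASKS else "gpt-3.5-turbo"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(complete("translation", "Translate 'hyvää päivää' to English."))
```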

Using naive RAG for what it cannot do

Retrieval Augmented Generation (RAG) is great. It is the primary method for making LLMs aware of data outside their training data. RAG grounds the LLM so that it generates text that is based on ground truth documents. 

In RAG, we retrieve documents from our own data sources, such as document management systems, or from the web, for example. The retrieved documents are then given to the LLM in the prompt when it generates text. In question-answering use cases we retrieve facts and previously answered questions, in marketing material generation we retrieve style examples and product sheets, and so forth.

But keep in mind that you cannot stuff unlimited examples into the prompt, so you have to be smart about the data you retrieve.

Now for the anti-pattern. Once you understand how RAG works, you see that there are many ways to build the retrieval part. The simplest method goes like this: chunk the data, embed the chunks, retrieve the chunks most similar to the query, and stuff them into the prompt. This method will not work for tasks like summarization or comparisons across documents, because you cannot guarantee that all relevant information will be retrieved. So don't use simple RAG for what it cannot do. Build your RAG to adapt to user queries.
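To make the naive pattern concrete, here is a stripped-down sketch of that pipeline, assuming the OpenAI Python SDK, fixed-size chunking, and an in-memory list instead of a vector database. The model names, chunk size, and top-k value are placeholder choices.

```python
# A stripped-down sketch of naive RAG: chunk -> embed -> retrieve by
# similarity -> stuff into the prompt.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])

def chunk(document: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real systems usually split on document structure.
    return [document[i : i + size] for i in range(0, len(document), size)]

def retrieve(query: str, chunks: list[str], chunk_vectors: np.ndarray, k: int = 3) -> list[str]:
    query_vector = embed([query])[0]
    # These embeddings are unit length, so the dot product is the cosine similarity.
    scores = chunk_vectors @ query_vector
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str, document: str) -> str:
    chunks = chunk(document)
    context = "\n\n".join(retrieve(query, chunks, embed(chunks)))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

A question like "summarize the whole report" still only receives the top three most similar chunks here, which is exactly why this pattern falls short for summaries and cross-document comparisons.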

Passing user questions directly to the LLM

So you've built a cool demo. You show it off and people are in awe. You ship it, and start getting complaints that the solution does not work. That's because you exposed the LLM directly to the users.

It's a problem for a couple of reasons. Firstly, malicious users can rather easily get the model to reveal its system prompt, instructions, RAG data and even training data. It's just the nature of the technology.

Secondly, in production you need consistency, which you just cannot get if you let users directly prompt your service. There are too many ways to write the same question, too many ways to misunderstand. 

So exposing your logic-layer LLMs directly to users is an anti-pattern. What you should do instead is add an intermediate layer that filters, guards and extracts key information from user requests. Then pass the standardized data to your primary LLMs. Btw, this is how the DALL-E API works, so look into that if you need inspiration.
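As one possible shape for that layer, the sketch below uses a cheap model to classify and normalize the user request before anything reaches the primary LLM. The intents, the extraction schema, and the refusal message are illustrative assumptions, not a prescription.

```python
# A sketch of a gatekeeper layer between users and the logic-layer LLM.
# The intents, the JSON schema, and the refusal message are illustrative
# assumptions; adapt them to your application.
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = (
    "Extract from the user message a JSON object with keys "
    "'intent' (one of: question, complaint, other) and "
    "'question' (the request rephrased as a short, standalone question). "
    "Respond with JSON only."
)

def standardize(user_message: str) -> dict:
    # A cheap model turns free-form user input into structured fields.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    # In production you would validate this output instead of trusting it.
    return json.loads(response.choices[0].message.content)

def handle(user_message: str) -> str:
    request = standardize(user_message)
    if request["intent"] == "other":
        return "Sorry, I can only help with questions about our product."
    # The primary LLM never sees the raw user input, only the normalized fields.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Answer this {request['intent']}: {request['question']}"}],
    )
    return response.choices[0].message.content
```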

That’s the lesson for today! Avoid the above anti-patterns and you’re one step closer to delivering production-grade LLM solutions. If you’d like us to handle all of this for you, be sure to check out our YOKOTAI offering and our custom solutions.