Mikko Lehtimäki
Founder, Chief Data Scientist

Building NLP solutions: strategies and tools

The recent buzz around ChatGPT and similar services has businesses thinking about how these powerful tools could best be used across industries to improve efficiency, productivity, and customer experience. And for good reason: there are huge efficiency boosts and quality improvements on the table.

Large Language Models (LLMs), like ChatGPT, LLaMA and PaLM, have taken Natural Language Processing (NLP), a subfield of machine learning, to the next level. But NLP solutions are more than API calls to ChatGPT. A well-functioning service requires that the underlying LLMs are either fine-tuned on task-specific data or given relevant context in user queries. To achieve the best results, these approaches can be combined. Here we give a roadmap for creating an effective NLP service!

Here’s ChatGPT describing NLP while sounding like a pirate:

Ahoy, mateys! Natural language modeling be like a crew of language-savvy buccaneers who can help ye translate, summarize, and even generate text. They be a valuable treasure for those who value efficiency and innovation, and have the potential to revolutionize industries like healthcare, finance, and education. So if ye be lookin' to save time and effort, weigh anchor and set sail with natural language solutions, arr!

Let’s now outline the key factors in developing great NLP solutions that add value. We’ll also list our favourite tools in the scene.

It starts with data. While ChatGPT is convincing and seemingly powerful, LLMs have some downsides and limitations that must be addressed in practical applications.

  • These models do not have access to the latest information. Their training data is based on historical events, and in some cases you may not even know how recent that history is. While we can build services that enable LLMs to get information from e.g. news sources, such a capability is not there by default.

  • These models don’t know anything about your private data. In fact, you should not assume that they even know about specific public data. They’re great at modelling human language generation, but that does not make them reliable sources of information. LLMs can hallucinate facts that are not, in fact, facts. Just b*******! A great service will use LLMs together with the relevant data to provide value to the user.

  • These models don’t cite their sources. Even when they provide correct answers or take intended actions, it’s hard to know why. Did the model understand your instructions or was the temperature parameter set so that by chance the desired outcome was achieved? You may always ask for sources from the model itself, but more often than not these are also hallucinated. Hence, we need to build the capability to cite sources, so that we can verify LLM reasoning chains.

  • These models don’t have a memory. Each request is handled independently of earlier ones, so if the application needs memory, our service must augment the LLM with a memory store.

  • The length of the input given to the model is limited. You cannot pass arbitrarily long documents for summarization, or ask the models to write a book for you. Clever strategies are needed to work around the maximum token count.
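The point about citing sources can be made concrete with a small sketch. It assumes we number the context segments in the prompt and ask the model to cite them as [n]; the function names and the prompt wording are our own illustration, not part of any library. After the model answers, we check that every citation points at a segment we actually supplied.

```python
import re


def build_cited_prompt(question, segments):
    """Number each context segment so the model can refer to it as [n]."""
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(segments, start=1)
    )
    # The instruction wording is an assumption; tune it for your model.
    return (
        "Answer the question using only the numbered sources below. "
        "Cite the sources you used as [n].\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )


def check_citations(answer, segments):
    """Return (valid, invalid) source numbers cited in the answer."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    valid = sorted(n for n in cited if 1 <= n <= len(segments))
    invalid = sorted(cited - set(valid))
    return valid, invalid
```

A citation that falls outside the supplied range is a strong hint of a hallucinated source. A valid number only tells us which segment to read when verifying the answer; it does not prove that the segment supports the claim.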

So how do we make sure our service gives reliable results from LLMs? We will prompt engineer our way to success and provide the correct context to the language model. When our service receives a request, we will add relevant context data to the request and then pass it to an LLM. For example, when requested to summarize a document, we must feed in the document to the LLM piece by piece while ensuring that no important information is lost. Or when asked to refer to previous information sent by the user, we must look up the relevant parts of the discussion and feed those to the LLM. When the user needs recent information, we must direct the LLM to the correct interfaces. Even a powerful tool created by hundreds of intelligent individuals needs some hand-holding 😃
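Feeding a document to the LLM piece by piece could look roughly like the sketch below. It rests on a few assumptions: `llm` is any function that takes a prompt string and returns text (a hypothetical stand-in for whichever model client you use), word counts serve as a rough proxy for tokens, and the prompt wording is illustrative only.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into pieces of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long_document(
    text: str, llm: Callable[[str], str], max_words: int = 500
) -> str:
    """Summarize each chunk separately, then summarize the summaries."""
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        return ""
    partial = [llm(f"Summarize the following text:\n\n{c}") for c in chunks]
    if len(partial) == 1:
        return partial[0]
    combined = "\n".join(partial)
    return llm(f"Combine these partial summaries into one summary:\n\n{combined}")
```

For very long documents the combined partial summaries can themselves exceed the input limit, in which case the combining step would need to be repeated. Splitting on word boundaries can also cut a thought in half, so a real service would more likely split on sentences or sections.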

Here’s a general strategy for developing a service around LLMs. For example, a service that allows semantic search over your documents (content understanding rather than keyword matching) is built like this.

  1. Split your data (documents, website content, GitHub repositories) into segments that are small enough to be fed into the LLM. Segments for fine-tuning will look different from segments that are provided as context in user queries. As usual, the correct segment size depends on the application.

  2. Encode the segments into embeddings and store them in a database, along with the raw text segments. Embeddings are computed with language models that can be trained on different data sets, like question answering or translation. Choose an embedding model that matches your goals.

  3. Choose a similarity measure used to retrieve segments that have information related to user requests. Cosine similarity tends to work well.

  4. Design prompt templates that achieve the goal of your service. These will be passed to the LLM, with the context segments you stored earlier. You will probably need several templates that are passed to the LLM using suitable logic for your business case.

  5. When a user query arrives, it is embedded, the most similar context segments are fetched, and the query and context are inserted into the prompt templates and sent to the LLM. The LLM should return a response that answers the original user request!
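The five steps above can be sketched in plain Python. To keep the example self-contained, `embed` is a toy bag-of-words counter standing in for a real embedding model, and a Python list stands in for the database. Both are assumptions made for illustration, and the toy embedding only matches keywords, so it does not deliver semantic search on its own. All function names and the template wording are hypothetical.

```python
import math
import re
from collections import Counter


def split_into_segments(text, max_words=50):
    """Step 1: split a document into segments of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def embed(text):
    """Step 2 (toy stand-in): word counts instead of a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Step 3: cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(documents, max_words=50):
    """Step 2: keep each embedding together with its raw text segment."""
    return [
        (embed(seg), seg)
        for doc in documents
        for seg in split_into_segments(doc, max_words)
    ]


# Step 4: a single prompt template; real services usually need several.
PROMPT_TEMPLATE = (
    "Use the context to answer the question.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def retrieve(query, store, top_k=2):
    """Step 5a: embed the query and fetch the most similar segments."""
    q = embed(query)
    ranked = sorted(
        store, key=lambda item: cosine_similarity(q, item[0]), reverse=True
    )
    return [seg for _, seg in ranked[:top_k]]


def build_prompt(query, store, top_k=2):
    """Step 5b: fill the template; the result is what gets sent to the LLM."""
    context = "\n---\n".join(retrieve(query, store, top_k))
    return PROMPT_TEMPLATE.format(context=context, question=query)
```

In a real service, `embed` would call an embedding model and `build_store` and `retrieve` would talk to a vector database, but the flow of data between the steps stays the same.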

Quite a few steps, and as you can see, there may be multiple language models used for different purposes in a single service. But there are tools that will help you be productive and successful! Most of them are open source too, and come with Python interfaces. Here are some of our favourites:

  • Trankit

    • Use it to extract segments from text. Supports multiple languages.

  • LangChain

    • Quite literally chains together prompts to LLMs in order to augment LLM capabilities. Use it to sequentially feed context to language models or create a service that can call APIs.

  • Llama-index

    • Build data structures from your context data so that the LLM always has the correct and most useful context available. Use it to interface with your embedding store.

  • Qdrant

    • A performant database for storing embedding vectors and related text and metadata. Implements vector similarity search on the database. You can additionally filter your search results based on the metadata or text content. Works well with Llama-index and LangChain.

  • Hugging Face 🤗

    • An excellent source of pretrained models for both embedding computation and query answering. You can download the models locally or use them as a service.

  • OpenAI

    • The best-performing language models for now, in our opinion. We especially like the Azure integration.

  • Metaflow

    • Orchestrate your embedding creation and fine-tuning as well as experiments with different parameters! Run locally or in the cloud.

  • Streamlit

    • Great for building interactive interfaces, which is just what language processing is about. It’s not a replacement for a production UI, but it really boosts productivity and delivers a wow effect.

These tools don’t work in silos! Best results are achieved when they are combined. It’s not out of the question for all of these tools to be involved in a single project.

In summary, building an AI service around language models involves quite a few moving pieces. But the tools are getting really good, and it’s possible to create a functioning solution very efficiently.

As an example, go check out YOKOT.AI, our Generative AI solution for enterprise use, or let us know if you'd like to build your service with us!