Serverless LLMs (demo!) and other GenAI trends for 2024
The hype around large language models (LLMs) has not toned down, and it will most likely continue to grow in 2024. This blog post focuses on one of the emerging trends that enables broader and cheaper LLM access for both companies and individuals.
Here's a direct link to the demo that is described later in this post. If you want to understand the technology and the trends behind it, keep reading.
The LLM ecosystem trends in 2024
Many new large language models, both open-source and closed-source, were published in 2023. However, one fact remains: GPT-4 is still the best of the best for multi-language understanding and generation. To date, no other model has outperformed it. Therefore, it is used as the main benchmark when evaluating the performance and accuracy of new models.
The Mixtral MoE model from Mistral AI in France got very close to GPT-4 but still needs to catch up when it comes to understanding and generating languages other than a few popular ones. It is nevertheless a remarkable achievement, because the model is almost as good as GPT-4 while being much smaller and more efficient. We can expect similar results from other new models as the year progresses.
A clear trend in the LLM ecosystem is emerging: models are getting smaller while performing better than their larger counterparts. As this trend progresses, we will see even tiny models that generate high-quality outputs.
Why are small and efficient models important?
Small LLMs enable a few use cases that will, in turn, allow us to adopt LLMs more universally. Currently, GPT-4-level generative AI is still accessible to a relatively small number of people due to its pricing. Smaller models mean lower expenses and better accessibility.
Additionally, small models can be run on devices that do not have specialized graphics processing units (GPUs). While a GPU will certainly make the models run faster, it is no longer necessary. This means that the AI lives on the device itself instead of calling a remote application programming interface (API), such as the OpenAI API, in the cloud.
Internet of Things (IoT) devices, mobile devices, cars, and any peripherals you can think of will sooner or later have a built-in language model. Will this happen this year? Probably not, but the first attempts at running local LLMs on lower-tier devices will certainly be seen this year.
Making large models smaller via quantization
Many of the “small” models are still relatively large, both in size and memory requirements. You need lots of memory to run them, and even with a modern CPU, they might be extremely slow.
To make the models run better, a process called quantization can be used to convert them into smaller ones. The model loses some accuracy depending on the level of quantization, but you end up with a model that is much smaller in both size and memory requirements.
In practice, this means that the numerical precision used to represent the numbers in the model is reduced. Most models use 32-bit floating-point precision, and the usual quantization levels are 8-bit (conservative, low accuracy loss) and 4-bit (aggressive, higher accuracy loss) integers. Floating-point numbers take much more space and are much more expensive to compute with than integers, which is why the quantized model is not only smaller but also faster to run.
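As a rough illustration of the idea, and not of how real quantization tooling such as the llama.cpp converters works internally, the minimal sketch below quantizes a float32 weight tensor to symmetric 8-bit integers with a single per-tensor scale:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization of float32 weights."""
    scale = np.max(np.abs(weights)) / 127.0  # map the largest magnitude to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float32 weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)      # a fake weight tensor
q, scale = quantize_int8(w)
print(w.nbytes, q.nbytes)                          # 16384 vs 4096 bytes: 4x smaller
print(np.max(np.abs(w - dequantize(q, scale))))    # small rounding error = accuracy loss
```

The 4-bit case works the same way but packs two weights into each byte, which is where the aggressive size savings (and the larger accuracy loss) come from.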
Most openly available LLMs have quantized versions available on Hugging Face, the open-source data science and machine learning platform. These quantized versions are usually published by LLM enthusiasts who participate in open-source development and experiment with the models.
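Such files can be fetched with the official huggingface_hub client. The repository and file names below are illustrative examples of community-published quantizations, not necessarily the exact files we used:

```python
from huggingface_hub import hf_hub_download

# Illustrative names: browse Hugging Face for the quantized variant you actually want.
model_path = hf_hub_download(
    repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",   # community-published quantizations
    filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",    # a 4-bit quantized variant
)
print(model_path)  # local path to the downloaded .gguf file
```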
Serverless LLMs in Azure
An intermediate step towards universal and cheap access to LLMs will be taken in the cloud. Currently, proper scalable and fast generation with large models can be done only with GPUs. We at Softlandia have studied how we could run some specialized tasks on smaller LLMs in the cloud, on affordable machines that have only central processing units (CPUs) available. It turns out that this is already possible, and it can be done free of charge (!).
Azure's serverless PaaS product, Azure Functions, offers several tiers: a consumption plan with a generous free quota, and premium plans that are a bit pricey but have better scaling capabilities. While we eventually succeeded in running LLMs on the consumption plan, our initial attempts required a paid plan.
First, we tried to run a quantized Phi-2 model in Azure Functions. Good generation speed was achieved only in the most expensive premium tier of Azure Functions. This is way too expensive since you could use one or more GPU machines for the base price of that plan. On the other hand, this proved to us that LLMs can be run on serverless platforms that do not have GPUs.
Our next attempt was to run the quantized TinyLlama model in the Azure Functions consumption plan. This took place around the new year, when TinyLlama's 3-trillion-token training was completed. This study yielded much better results: you can run a quantized 4-bit TinyLlama model in the Azure Functions consumption plan for free!
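The demo described below calls the llama.cpp binary directly, but if you want a quick local sanity check of a similar 4-bit model, the llama-cpp-python bindings offer a minimal way to do it. The model path below is an assumption; point it at the GGUF file you downloaded:

```python
from llama_cpp import Llama

# Assumed path: the quantized GGUF file fetched from Hugging Face earlier.
llm = Llama(model_path="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf", n_ctx=2048)

out = llm("Q: What is quantization in one sentence? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```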
Later, around the time this blog post was published, the new Qwen 2 beta models were released, and we could easily run the smallest quantized version of Qwen 2 in Azure Functions as well. A great start for the year!
Serverless Tiny LLMs Implementation
We have prepared a live demo of the two aforementioned models running in Azure that anyone can try out. To our knowledge, even Microsoft hasn't tried running LLMs on their serverless platform - so this might be the first working proof of concept of a CPU-only, Azure Functions-based serverless LLM in the world! The models are not very fast, but they still work surprisingly well. Please note that the TinyLlama model in particular can sometimes cut off its generation.
To see the code, check out the Tiny Serverless LLMs repository and test out the demo. If the load gets too high or we detect any kind of abuse, the demo will be taken down. Please be patient: it might take some time to start generating a response if there are no free instances available and the app needs to scale out.
The implementation is very simple: we use llama.cpp directly and pipe the stdout into SignalR. There's a quick-and-dirty vanilla JavaScript client served from the root of the function app. The client connects to the SignalR endpoint and triggers the LLM calls via an HTTPS API endpoint. This pushes a message to the Azure Storage Queue, and a queue trigger then makes the llama.cpp call. Llama.cpp and the LLMs are deployed to the Function App via a post-build script during the continuous integration process.
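To give a rough idea of the queue-triggered part, here is a minimal sketch using the Azure Functions Python v2 programming model. The queue name, binary and model paths, prompt format, and the push_to_signalr helper are assumptions for illustration, not the exact code in the repository:

```python
import subprocess
import azure.functions as func

app = func.FunctionApp()

def push_to_signalr(chunk: str) -> None:
    # Placeholder: in the real demo the generated text is pushed to connected
    # clients through Azure SignalR; here we just log it to keep the sketch
    # self-contained.
    print(chunk, end="")

@app.queue_trigger(arg_name="msg", queue_name="prompts",
                   connection="AzureWebJobsStorage")
def generate(msg: func.QueueMessage) -> None:
    prompt = msg.get_body().decode("utf-8")

    # Assumed paths: llama.cpp and the quantized GGUF model are deployed
    # alongside the function app by the post-build script.
    proc = subprocess.Popen(
        ["./llama/main", "-m", "./models/tinyllama-q4.gguf", "-p", prompt, "-n", "128"],
        stdout=subprocess.PIPE,
        text=True,
    )

    # Pipe llama.cpp's stdout to the clients as it is produced.
    for line in proc.stdout:
        push_to_signalr(line)

    proc.wait()
```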
Conclusion
Tiny LLMs will be one of the major trends in generative AI during 2024. We implemented a technology demonstration that shows that tiny (large) language models can already be run on cheap and scalable cloud infrastructure without GPUs. Later this year we will see even more capable small language models, and in the not-so-distant future, small models will be found in the devices all around us.
Softlandia is the leading GenAI software consultancy in Finland. Don’t hesitate to contact us in case you wish to learn more about applying LLMs to your business!