Tools of the AI engineer

Large language models (LLMs) are shaping how AI applications are developed. They’ve found applications in many tasks, and the coolest thing right now is the autonomous agents that are being built!

Behind the success of LLMs is their ability to “understand” and generate text. We communicate via text a lot, and perhaps most importantly, we write code and instructions to our computers as text. Thus, the capability to manipulate text gives LLMs a wide selection of tools to interact with us and with each other, boosting productivity and solving challenging problems.

In this blog post, we’ll give a practical tutorial on building your own LLM-based applications. Specifically, we’ll focus on building an AI that uses your private data to perform tasks. Our task of choice here is extracting structured information from text documents. The example can easily be adapted to chat-like interactions or more generative tasks as well.

We’ve previously shared some of our favorite tools in the NLP scene as well as written about the skill set for developing AI applications. It is the applied AI engineer who will be building AI-enabled solutions on top of foundation and multi-modal models and fine-tuning them to meet specific industry needs. In the following, we will break down the process of building an AI product, as well as describe tools to deliver great results.  

As an example, when we want LLMs to do something based on our private data, we’ll often provide that data as context to the LLM queries. But since LLMs can handle only a limited amount of text in a single call, we need clever strategies to make sure all the relevant context information gets to the LLM. We’ll demonstrate the necessary tools here.

We’ll build an application that scrapes, embeds, stores, retrieves, analyzes and validates data. Specifically, we:

  1. scrape some blog posts

  2. compute vector embeddings of the scraped content

  3. store that data in a vector database for semantic retrieval

  4. query the data with free-form text

  5. use the data with LLMs to perform a task

  6. make sure we get structured JSON as a response! 

We’ll be using Qdrant, Llama-index (GPT-index), LangChain and OpenAI. You can find the code of the complete application in our GitHub repository.

If this topic interests you, we have a free event (on-site + online) coming up. It's an LLM special in our data science infrastructure series. Sign up here: 🤩

Now, we’ll go hands-on.

Building an AI-powered blog-analysis tool 🛠️

We start by loading environment variables from a .env file, which keeps sensitive API keys secure. It also stores some LLM settings, which makes it handy to change models in cloud deployments, for example:

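import os
import openai
from dotenv import load_dotenv

load_dotenv()  # Load settings from the .env file (assumes the python-dotenv package)

chunk_len = 256
chunk_overlap = 32
doc_urls = [
    # URLs of the blog posts to analyze go here
]

# Setup OpenAI, we have these settings in a .env file as well
openai.api_key = os.environ["OPENAI_API_KEY"]
embedding_model = os.environ["EMBED_MODEL"]  # text-embedding-ada-002 most likely
text_model = os.environ["TEXT_MODEL"]  # text-davinci-003 most likely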

The code sets a list of URLs to the blog posts that are to be analyzed. chunk_len and chunk_overlap address one limitation of LLMs: the number of tokens that the model can consume in a single call is limited. Hence, we split the blog post texts into small, slightly overlapping chunks that can be easily fed to LLMs.
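To make the chunking idea concrete, here is a minimal sketch of overlapping windows. It is an illustration only: split_into_chunks is a hypothetical helper, and whitespace-separated words stand in for real model tokens, which the actual splitter counts with a tokenizer.

```python
def split_into_chunks(tokens, chunk_len=256, chunk_overlap=32):
    """Split a token list into windows of chunk_len that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_len:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_len")
    step = chunk_len - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break  # The last window already reaches the end of the text
    return chunks


# Whitespace "tokens" stand in for real model tokens in this illustration
words = [f"w{i}" for i in range(600)]
chunks = split_into_chunks(words, chunk_len=256, chunk_overlap=32)
print([len(c) for c in chunks])  # [256, 256, 152]
```

Each chunk repeats the last 32 tokens of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.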

We then define a collection name for the Qdrant vector database, set Qdrant host and port, and define a client for this database:

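from qdrant_client import QdrantClient

collection_name = "softlandia_blog_posts"
qdrant_host = os.environ["QDRANT_HOST"]
qdrant_port = 6333  # Qdrant default
qdrant_api_key = os.environ["QDRANT_API_KEY"]
qdrant_client = QdrantClient(
    host=qdrant_host, port=qdrant_port, api_key=qdrant_api_key
)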

With Qdrant, we store vectors and attached payloads in a collection. Qdrant collections are like tables in SQL databases. A vector database allows us to efficiently retrieve matching vectors using a metric of our choice (commonly cosine similarity), while filtering the search results by arbitrary metadata. Accurate retrieval is one of the basic mechanisms behind giving LLMs personalized data at query time. Above, we set an address and API key for using a Qdrant cloud cluster, although thanks to their open-source offering we can easily host our own clusters as well, or even prototype with an in-memory instance.
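As a rough illustration of what "retrieve by metric, filter by metadata" means, the sketch below runs a brute-force cosine search over a handful of in-memory points. It is not how Qdrant works internally (a real vector database uses an index rather than scanning every point), and the search function, vectors and payloads are all made up for the example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(points, query_vector, top_k=3, must_match=None):
    """Return the top_k points by cosine similarity whose payload matches the filter."""
    must_match = must_match or {}
    candidates = [
        p for p in points
        if all(p["payload"].get(k) == v for k, v in must_match.items())
    ]
    scored = [(cosine_similarity(p["vector"], query_vector), p) for p in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical three-dimensional vectors and payloads
points = [
    {"id": 1, "vector": [1.0, 0.0, 0.0], "payload": {"year": 2023, "title": "post A"}},
    {"id": 2, "vector": [0.9, 0.1, 0.0], "payload": {"year": 2022, "title": "post B"}},
    {"id": 3, "vector": [0.0, 1.0, 0.0], "payload": {"year": 2023, "title": "post C"}},
]
hits = search(points, [1.0, 0.0, 0.0], top_k=2, must_match={"year": 2023})
for score, point in hits:
    print(point["id"], round(score, 3))
```

With the year filter, point 2 is excluded even though it is the second-closest vector, which is the kind of metadata-aware retrieval described above.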

Now, we set our embedding function, which will use the embedding_model variable. We’ll use Ada because it’s accurate and affordable! With embeddings, we compute a numerical representation of text, so that we can evaluate text (or image) similarity programmatically. When a user asks a question, we will embed that as well, and then retrieve semantically similar text chunks from our vector database to give context information to the LLM.

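from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from llama_index import LangchainEmbedding, LLMPredictor
embed_model = LangchainEmbedding(OpenAIEmbeddings(model=embedding_model))  # model argument assumed
llm = OpenAI(model_name=text_model, max_tokens=2000)
llm_predictor = LLMPredictor(llm=llm)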

Here, OpenAIEmbeddings is a LangChain wrapper around OpenAI embedding APIs and OpenAI is a LangChain wrapper around OpenAI text completion APIs. LangChain is an open-source project for interacting with LLMs and building task-driven applications from LLM calls. We like the LangChain interface because it gives a common entry point to different models. If we ever change the embedding model, we’ll just use a different LangChain wrapper, and keep the rest of the code intact. Of course, with LangChain we can do much more, like define text summarization tasks or autonomous agents that perform actions towards goals. All that becomes much easier if LLM calls have a common API.

We further wrap the LangChain objects into the Llama-index LangchainEmbedding and LLMPredictor classes. Llama-index is an end-to-end solution for interacting with your data sets using LLMs. It is an awesome project that helps with text and data retrieval for LLM queries. It does this by allowing us to define different indices over our data. Each index has its own logic for providing context information into our LLM prompts. This is where Llama-index shines.

We’ll use Llama-index to retrieve best matching text data from our Qdrant database, given a user query, as well as to generate an answer to the query based on the retrieved data. Let’s further customize how Llama-index does this:

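splitter = TokenTextSplitter(chunk_size=chunk_len, chunk_overlap=chunk_overlap)
node_parser = SimpleNodeParser(
     text_splitter=splitter, include_extra_info=True, include_prev_next_rel=False
)
# Keyword arguments of the next two calls are assumed from the description below
prompt_helper = PromptHelper.from_llm_predictor(llm_predictor=llm_predictor)
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, prompt_helper=prompt_helper,
    embed_model=embed_model, node_parser=node_parser,
)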

NodeParser configures how we ingest text into our vector database: we parameterize a text splitter from LangChain, which uses the chunk length and overlap parameters we set earlier (remember, LLMs can take in only a limited amount of text in a single call). The PromptHelper tells Llama-index to use the LLM we defined. ServiceContext is how the prompt and index creation customization will be passed in, and this is where we specify the embedding model as well.

We’ve now defined our database for storing embedded text data and specified models for computing embeddings and querying LLMs as well as put those together into Llama-index objects. It’s time to get data!

reader = download_loader("BeautifulSoupWebReader")
loader = reader(website_extractor={"": slreader})
documents = loader.load_data(
    urls=doc_urls, custom_hostname=""
)

We can use Llama-index loaders to bring in data from many sources, like web pages, GitHub repositories and YouTube. Here we use BeautifulSoup to read blog posts. We also customize the scrape results a bit with the website_extractor. Now, documents will be a list of Llama-index documents, which are ready to interact with vector stores and LLMs! Let’s store the data in Qdrant, by defining a Llama-index vector index:

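index = GPTQdrantIndex.from_documents(  # keyword arguments assumed from the description below
    documents, client=qdrant_client, collection_name=collection_name, service_context=service_context
)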

This is where we pass in our text data (documents), our database client as well as LLM and embedding model settings (service_context). Behind the scenes, Llama-index will split the data into small chunks, call our embedding model to get vector representations, attach metadata to the vectors and store everything in Qdrant.
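The ingestion steps can be pictured with a toy pipeline that chunks a document, embeds each chunk and attaches a payload. This is only a sketch of the general pattern, not Llama-index's implementation: toy_embed is a hypothetical word-hashing stand-in for a real embedding model, build_points is a made-up helper, and the URL is a placeholder.

```python
import zlib


def toy_embed(text, dim=8):
    """Hash each word into one of dim buckets; a stand-in for a real embedding model."""
    vector = [0.0] * dim
    for word in text.lower().split():
        vector[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vector


def build_points(doc_id, text, url, chunk_words=5):
    """Split one document into chunks and pair each chunk vector with its metadata."""
    words = text.split()
    points = []
    for n, start in enumerate(range(0, len(words), chunk_words)):
        chunk = " ".join(words[start:start + chunk_words])
        points.append({
            "id": f"{doc_id}-{n}",
            "vector": toy_embed(chunk),
            "payload": {"text": chunk, "url": url},
        })
    return points


points = build_points(
    "post1",
    "Vector databases store embeddings together with metadata for later retrieval",
    url="https://example.com/post1",  # Placeholder URL
)
for point in points:
    print(point["id"], point["payload"]["text"])
```

Keeping the chunk text and source URL in the payload is what later lets a query return readable context and its origin, not just a vector id.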

It’s time to ask an LLM some questions!

task = "List technologies that are mentioned in the blog posts, and their date of mention."
result = index.query(
     task,
     similarity_top_k=3  # Increase this to get more results
)

We define a task: look up open source technologies mentioned in our blog posts, and get the blog publication time as well. The similarity_top_k argument controls how many text chunks from blog posts are used as context. The higher the number, the more time and money your queries cost 😇 When this query runs, Llama-index will embed the query and perform a vector lookup in Qdrant, where vector similarities are computed efficiently directly on the database server. Llama-index will then inject the retrieved text into LLM calls and chain the calls, so that we receive a single coherent LLM response. Very neat!
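One common way to chain calls over several retrieved chunks is a create-and-refine loop: answer from the first chunk, then ask the model to improve that answer with each further chunk. The sketch below shows the idea under stated assumptions: answer_with_refine is a hypothetical function, the prompt wording is invented and is not Llama-index's actual template, and a fake LLM is used so the example runs without an API key.

```python
def answer_with_refine(llm, question, chunks):
    """Ask the question against the first chunk, then refine the answer chunk by chunk."""
    answer = None
    for chunk in chunks:
        if answer is None:
            prompt = (
                f"Context:\n{chunk}\n\n"
                f"Using only the context, answer: {question}"
            )
        else:
            prompt = (
                f"Question: {question}\n"
                f"Existing answer: {answer}\n"
                f"New context:\n{chunk}\n\n"
                "Refine the existing answer if the new context helps."
            )
        answer = llm(prompt)
    return answer


# A fake LLM that records prompts, so the sketch runs without an API key
seen_prompts = []


def fake_llm(prompt):
    seen_prompts.append(prompt)
    return f"answer after {len(seen_prompts)} call(s)"


final = answer_with_refine(
    fake_llm,
    "Which technologies are mentioned?",
    ["chunk one", "chunk two", "chunk three"],
)
print(final)  # answer after 3 call(s)
```

The loop makes one LLM call per retrieved chunk, which is why a larger similarity_top_k costs more time and money.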

After the query runs, result.response will contain the text output from the LLM that has seen our personalized text content, and by inspecting result.source_nodes you can see exactly what data was used to synthesize the response. 

The thing with LLMs is that by default they output unstructured text, and the format of the response varies a lot from model to model. For example, the ChatGPT model gpt-3.5-turbo is much more talkative than the davinci series. We need some method of ensuring that the model output is predictable, and preferably in a machine-readable format. The response we get is something like:

Cloud Native Solutions (April 5, 2023)
Sensor Fusion & IoT (April 5, 2023)
Software Consulting (April 5, 2023)
Kubernetes (February 13, 2023)
Python APIs (February 13, 2023)
Bytewax (February 14, 2023)

Enter Guardrails.AI!

With Guardrails, we can define exactly what content and structure we want from the LLM output. Moreover, we can validate the output and perform corrective actions if necessary. This is absolutely necessary when we want to programmatically make use of LLM outputs. The validation happens with an XML-based RAIL specification as follows:

<rail version="0.1">
<output>
    <list name="technologies"  format="length: 1" on-fail-length="reask">
        <object>
            <string name="item" description="name of the technology"/>
            <date name="date" date-format="%Y-%m-%d"/>
        </object>
    </list>
</output>
<!-- Prompt layout assumed from the description below -->
<prompt>
{{task}}

{{text}}

@xml_prefix_prompt

{output_schema}

@json_suffix_prompt_v2_wo_none
</prompt>
</rail>

The above specification in the <output> tags tells the LLM to output a JSON list of objects with fields “item” and “date”, and the date should be in year-month-day format. There are a bunch of validators that can be applied, like checking the length of the list or ensuring that the output is valid Python code. So cool! With <prompt>, we use some pre-defined shorthands (@xml_prefix_prompt, @json_suffix_prompt_v2_wo_none) to instruct the LLM about the desired output format. {output_schema} is parsed from the <output> tags and given to the LLM in the prompt, so that it knows what output format to use. 
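To make the two checks in the specification concrete (a list with at least one entry, and dates in year-month-day format), here is what they amount to in plain Python. This is a standalone sketch, independent of Guardrails and not its implementation. validate_technologies is a hypothetical helper name.

```python
from datetime import datetime


def validate_technologies(output, date_format="%Y-%m-%d", min_length=1):
    """Return a list of problems found in the output; an empty list means it is valid."""
    items = output.get("technologies")
    if not isinstance(items, list) or len(items) < min_length:
        return [f"'technologies' must be a list with at least {min_length} item(s)"]
    problems = []
    for i, entry in enumerate(items):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: must be an object")
            continue
        if not isinstance(entry.get("item"), str):
            problems.append(f"entry {i}: 'item' must be a string")
        try:
            datetime.strptime(str(entry.get("date")), date_format)
        except ValueError:
            problems.append(f"entry {i}: 'date' does not match {date_format}")
    return problems


good = {"technologies": [{"item": "Kubernetes", "date": "2023-02-13"}]}
bad = {"technologies": [{"item": "Bytewax", "date": "February 14, 2023"}]}
print(validate_technologies(good))  # []
print(validate_technologies(bad))
```

The second call reports the date problem, which is the kind of failure a validator can hand back to the LLM for correction.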

This is how we implement the validation in code with Guardrails. We start by defining a Guard object from the given RAIL specification (which we import from a separate module, blog_guard):

import guardrails as gd

guard = gd.Guard.from_rail_string(blog_guard.TECHNOLOGIES_SPEC)
guard_task = "Format the technologies and their date from the text below. Only list one technology per item."
raw_llm_output, validated_output = guard(
     llm,  # We can pass any callable
     # Task and text keys are defined in our RAIL spec
     prompt_params={"task": guard_task, "text": result.response},
     num_reasks=1,  # Re-ask at most once (assumed from the description below)
)

With guard_task, we add some further instructions to tell the LLM what we’d like to achieve. This task will be added to the {{task}} placeholder seen in the RAIL XML. The first argument to the guard() call is an LLM function, and we can neatly pass in a LangChain callable, or any other callable such as openai.Completion.create directly. Here we also tell Guardrails to ask the LLM to correct its output once if the list does not contain at least one item.
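The re-ask behaviour can be sketched as a small loop: call the model, validate, and if validation fails, call again with the errors appended to the prompt. This shows the general pattern only and is not Guardrails' internal logic. call_with_reask and non_empty_list are hypothetical names, and a scripted fake LLM replaces the real one.

```python
import json


def call_with_reask(llm, prompt, validate, num_reasks=1):
    """Call the LLM, validate its JSON output, and re-ask with the errors if it fails."""
    current_prompt = prompt
    raw = None
    for _ in range(num_reasks + 1):
        raw = llm(current_prompt)
        try:
            parsed = json.loads(raw)
            errors = validate(parsed)
        except json.JSONDecodeError as exc:
            parsed, errors = None, [f"invalid JSON: {exc.msg}"]
        if not errors:
            return raw, parsed
        current_prompt = (
            f"{prompt}\n\nYour previous answer was:\n{raw}\n"
            f"It failed validation: {'; '.join(errors)}\nPlease correct it."
        )
    return raw, None  # Still invalid after the allowed re-asks


def non_empty_list(parsed):
    ok = isinstance(parsed, dict) and len(parsed.get("technologies", [])) >= 1
    return [] if ok else ["'technologies' needs at least one item"]


# Scripted fake LLM: the first reply is invalid, the second one passes
replies = iter([
    '{"technologies": []}',
    '{"technologies": [{"item": "Qdrant", "date": "2023-04-05"}]}',
])
raw, validated = call_with_reask(
    lambda p: next(replies), "List the technologies.", non_empty_list
)
print(validated)
```

Capping the number of re-asks keeps cost bounded. When the output is still invalid, the caller gets None and can decide how to handle it.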

We insert the query response from our previous LLM call into the {{text}} placeholder. Basically, the earlier query to extract technologies and dates can return in a variety of formats, and this is how we ask the LLM to structure it properly. This is the simplest way of validating LLM outputs, but Guardrails also offers deeper integration with both Llama-index (through output parsers) and LangChain (Guardrails integration). Be sure to check those out!

The validated_output should be a dictionary already, as Guardrails will handle the validation and conversion. We can pretty-print it and the result should be something like the following:

{
    "technologies": [
        {
            "item": "NLP Solutions: Strategies and Tools",
            "date": "2023-04-05"
        },
        {
            "item": "Cloud Native Solutions",
            "date": "2023-04-05"
        },
        {
            "item": "Sensor Fusion & IoT",
            "date": "2023-04-05"
        },
        {
            "item": "Software Consulting",
            "date": "2023-04-05"
        },
        {
            "item": "Kubernetes",
            "date": "2023-02-13"
        },
        {
            "item": "Python APIs",
            "date": "2023-02-13"
        },
        {
            "item": "Bytewax",
            "date": "2023-02-14"
        }
    ]
}

And the structure is golden :) 
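Once the output is a plain dictionary, ordinary Python can work with it. The short sketch below pretty-prints such a dictionary and groups the technologies by date. The dictionary is typed in by hand from part of the example output above rather than produced by a live call.

```python
import json
from collections import defaultdict

# Hand-written stand-in for validated_output, using entries from the example above
validated_output = {
    "technologies": [
        {"item": "Kubernetes", "date": "2023-02-13"},
        {"item": "Python APIs", "date": "2023-02-13"},
        {"item": "Bytewax", "date": "2023-02-14"},
    ]
}

print(json.dumps(validated_output, indent=4))

# Group technology names by the date they were mentioned
by_date = defaultdict(list)
for entry in validated_output["technologies"]:
    by_date[entry["date"]].append(entry["item"])
print(dict(by_date))
```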

This is how you go from your personal data to a complete application with structured output! The tools that evolved in the open-source AI space in the span of a few months are quite amazing. They have transformed the way developers create and deploy cutting-edge applications, democratizing access to powerful AI solutions. With the rapid growth of these tools, AI engineers can extract valuable insights, automate complex tasks, and create innovative products. This vibrant open-source ecosystem fosters collaboration, creativity, and continuous improvement, driving the AI revolution forward at an astonishing pace!

The Softlandia team just launched its own conversational AI solution for enterprise use, YOKOT.AI. It has been built utilising the best LLM technology available to date, bringing the AI revolution to daily business operations.

Don't hesitate to reach out if you'd like to learn more about building on top of LLMs!