Mikko Lehtimäki
Founder, Chief Data Scientist

LLMs and privacy - what every enterprise needs to know

Privacy is one of the big challenges around Large Language Models (LLMs), alongside cost, speed and the talent shortage. Because enterprise and research data are often highly sensitive, it is only natural that a rapidly developing new technology raises privacy questions. With LLMs, privacy needs to be considered from three primary perspectives:

  • LLMs as a technology

  • LLM providers

  • Surrounding infrastructure

The good news is that privacy concerns around LLMs can be addressed quite well! Let’s break down each of the above to understand them more thoroughly. 

Privacy of LLM technology

LLMs are large neural networks. Neural networks are sometimes referred to as black boxes because it is hard to deduce what is happening inside them, but that does not make them somehow extra secure. In fact, LLMs are terrible at keeping secrets. It is by now well established that segments of training data can be recovered from large language models [1]. This has implications for sharing trained models: in essence, you should think of a language model as an extension of its training data. Whoever shouldn’t see the data also shouldn’t have access to the model!

This is somewhat counter-intuitive, as it implies that LLMs memorize data. For machine learning practitioners, memorization has commonly meant that the training process has failed: models should learn to generalize, not memorize (i.e. overfit). But the rules have changed with LLMs, which need just the right level of memorization to achieve good performance.

Privacy of LLM providers

For now, the best LLMs are truly large and their trained model weights cannot be downloaded. This means that we sometimes have to rely on third-party LLM providers to get things done. In practice, your prompts and completions must pass through the LLM provider, and depending on their terms and conditions, this data may be used for different purposes.

Some providers may use your prompt data to train their language models, or sell it to advertisers. Prompts may be stored for several weeks even if they are not used for training, and the provider’s staff may have access to them. Some providers, like the Azure OpenAI Service which we use, allow us to opt out of all data collection completely. This is the only acceptable option if your company data is sensitive.

Privacy of the surrounding infrastructure

A functional LLM solution requires more than a language model. The models need to be made aware of your internal data, or of events more recent than their training data contains. This means that you need to fine-tune the models or give them access to data sources like SQL or vector databases, Google search or other APIs. The tooling around LLMs is evolving rapidly, and practitioners need to work day and night to keep up with best practices. We’ve written about the required novel skillset in our blog earlier, as well as given a practical guide on state-of-the-art AI tools.

A privacy-aware solution will make sure that no sensitive data is leaked to these connected services. Sensible solution providers will let you choose the region where they host your data and clearly state how the data is processed on their end.

Another very important consideration is what we call the internal privacy of your enterprise. Not everyone should have access to all data, and the LLM solution should fit your existing access control methods and hierarchy.
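
To make the idea of internal privacy concrete, here is a minimal sketch of filtering documents by the user's groups before any of them reach the prompt. The `Document` class, the group names and the `build_prompt` helper are all hypothetical, made up for illustration; a real system would take permissions from your existing access control solution rather than from a hard-coded set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    text: str
    allowed_groups: frozenset  # hypothetical: groups permitted to read this document


def visible_documents(documents, user_groups):
    """Keep only documents that share at least one group with the user."""
    user_groups = frozenset(user_groups)
    return [doc for doc in documents if doc.allowed_groups & user_groups]


def build_prompt(question, documents, user_groups):
    """Build the LLM context from permitted documents only."""
    permitted = visible_documents(documents, user_groups)
    context = "\n".join(doc.text for doc in permitted)
    return f"Context:\n{context}\n\nQuestion: {question}"


documents = [
    Document("Cafeteria menu for this week", frozenset({"everyone"})),
    Document("Salary table 2023", frozenset({"hr"})),
]
prompt = build_prompt("What is for lunch?", documents, {"everyone"})
```

The point of filtering before the prompt is built is that the model never sees data the user is not allowed to see, so it cannot leak that data in its answer.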

What does this mean? 

There are concrete actions you can take if your data is sensitive. For example, Personally Identifiable Information (PII) can be removed before sending data to LLM providers. You’ll then do the pre-processing locally, as well as any post-processing needed to properly handle LLM responses that are based on the anonymized data. Additionally, if you are fine-tuning LLMs, PII removal is an important step [2].
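
As a rough illustration of local pre- and post-processing, the sketch below replaces e-mail addresses and phone numbers with placeholders before a prompt leaves your infrastructure, and restores them in the response afterwards. The two regular expressions and the function names are our assumptions for this example only. They are nowhere near complete PII detection: names, for instance, pass through untouched and would need something like named entity recognition.

```python
import re

# Hypothetical, deliberately simple patterns; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def anonymize(text):
    """Replace matched PII with placeholders; return the text and a placeholder mapping."""
    mapping = {}
    for label, pattern in PII_PATTERNS.items():
        def substitute(match, label=label):
            placeholder = f"<{label}_{len(mapping)}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, mapping


def restore(text, mapping):
    """Put the original values back into text that contains placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


safe_prompt, mapping = anonymize("Contact Anna at anna@example.com or +358 40 123 4567.")
# safe_prompt is what would be sent to the LLM provider; the mapping never leaves your side.
```

Keeping the mapping local means the provider only ever sees the placeholders, while `restore` lets you show the user a response with the original values filled back in.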

In the near future, the performance of locally hosted language models will improve. Progress is driven by methods like quantization, data quality improvement, knowledge distillation and novel architectures. Enterprises will be able to self-host capable LLMs in their existing compute infrastructure. This will force LLM providers to provide better security guarantees to stay relevant in the enterprise game. We recently wrote about local language models in our blog.
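
To give a feel for one of these methods, here is a toy sketch of symmetric 8-bit quantization: weights are stored as small integers plus a single scale factor, which cuts memory use at the cost of a small rounding error. The function names are ours, and this is a simplification; production toolchains use more refined schemes, such as separate scales per channel.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0  # avoid dividing by zero for all-zero input
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.27]
quantized, scale = quantize_int8(weights)
approx = dequantize(quantized, scale)
# Each recovered weight differs from the original by at most about half of `scale`.
```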

Overall, sound software engineering practices should be applied. In addition to the LLM integration itself, these cover end-to-end system protection, authentication, data management and user interaction systems.

Enter YOKOTAI

Our enterprise-grade generative AI solution, YOKOTAI, takes security seriously. It is built with enterprise support in mind, and we’re making use of the best local and cloud LLMs. YOKOTAI provides a way for you to utilize your enterprise data, along with public data sources, with the power of LLMs. It’s not limited to just chatting either, so book a demo and see for yourself ;) 

[1] Carlini et al. "Extracting training data from large language models." 30th USENIX Security Symposium (USENIX Security 21). 2021.

[2] Behnia et al. "EW-Tune: A Framework for Privately Fine-Tuning Large Language Models with Differential Privacy." 2022 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2022.