Top things to know about AI transcription to stay ahead of the competition

Transcription is the process of turning speech into text. What has long been a tedious task for humans can now be automated with high quality. Transcripts let humans interact with machines with ease and automate many tedious and error-prone tasks. With transcription, automatic speech recognition becomes an integral part of your AI engine (you have one, right?).

Some real-life use cases for transcription include:

  • Saving medical professionals’ time by automating the writing of patient instructions and clinical notes

  • Automating call center analytics and customer service call handling

  • Facilitating automatic translation of content for education, training and marketing purposes

  • Enabling business intelligence data scraping from multimedia

Transcription has been a tough nut to crack in AI, but once again, approaches that leverage massive amounts of data have proven superior to manual feature engineering. At the core of a transcribing solution is the transcription model: a machine learning method that creates text from audio.

What goes into a transcribing solution

In a transcribing software solution, transcription models don’t live in a void: they need supporting infrastructure and interfaces. Users need a way to input audio or video. The input data needs to be preprocessed to enhance quality and ensure the correct format for each transcription model. Speakers need to be identified, vocabulary can be corrected, and results can be polished with a large language model. The transcript itself can be viewed, annotated and corrected by users, exported in different formats, and so on. These features are very important for the overall user experience in a transcribing service.
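To make the idea of supporting infrastructure concrete, here is a minimal Python sketch of such a pipeline: preprocess the audio, call a transcription model, post-process the result and export it. Every name in it (Segment, normalize, collapse_spaces, run_pipeline, export_text, fake_transcriber) is hypothetical and chosen for illustration only. The model is a stand-in that returns a fixed segment, so treat this as a skeleton under those assumptions rather than a real implementation.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Segment:
    start: float   # seconds from the beginning of the audio
    end: float     # seconds from the beginning of the audio
    speaker: str   # diarization label such as "S1"
    text: str


def normalize(samples: List[float]) -> List[float]:
    """Preprocessing stand-in: scale audio so the loudest sample has magnitude 1."""
    peak = max((abs(s) for s in samples), default=0.0)
    return list(samples) if peak == 0 else [s / peak for s in samples]


def collapse_spaces(segment: Segment) -> Segment:
    """Post-processing stand-in: tidy whitespace in the recognized text."""
    return replace(segment, text=" ".join(segment.text.split()))


def run_pipeline(
    samples: List[float],
    transcribe: Callable[[List[float]], List[Segment]],
    postprocessors: Iterable[Callable[[Segment], Segment]] = (),
) -> List[Segment]:
    """Preprocess, transcribe, then apply each post-processing step in order."""
    segments = transcribe(normalize(samples))
    for step in postprocessors:
        segments = [step(segment) for segment in segments]
    return segments


def export_text(segments: List[Segment]) -> str:
    """Export stand-in: one line per segment with timestamps and speaker."""
    return "\n".join(
        f"[{s.start:.1f}-{s.end:.1f}] {s.speaker}: {s.text}" for s in segments
    )


def fake_transcriber(samples: List[float]) -> List[Segment]:
    """Hypothetical model: ignores the audio and returns a fixed segment."""
    return [Segment(0.0, 1.5, "S1", "hello   world")]


if __name__ == "__main__":
    result = run_pipeline([0.5, -0.25], fake_transcriber, [collapse_spaces])
    print(export_text(result))
```

Because the model is just a function passed in, a real-time, synchronous or batch service could be swapped in without touching the rest of the pipeline, and steps such as vocabulary correction or LLM polishing would slot in as further post-processors.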

Not all transcription models are the same, and the choice of model can make or break an application. Below, we introduce the most important things to know about AI transcription models:

  1. There are different services and models for different purposes: real-time, synchronous and batch. How often and how quickly results are needed will determine the choice of transcription service type.

  2. Transcription models can make mistakes, and different models make different mistakes. For example, the well-known Whisper model can hallucinate filler words or speech on top of music or silence, but overall it is among the most accurate models. Other models may miss more words in total but don’t make them up as easily.

  3. Transcription pairs well with LLMs for formatting, checking grammar, fixing entity names, translating and summarizing content.

  4. There are different capabilities, such as timestamping, diarization, speaker recognition and vocabulary correction, that can be baked into transcription.

  5. Not every language is supported by each transcription model.

  6. Hardware matters! If your use case requires fast transcription, a GPU will almost always beat a CPU.

  7. Transcription model runtime can be optimized in different ways, for example by reducing the memory footprint or by carefully tailoring the implementation to specific platforms. Some methods come at the cost of accuracy.

  8. To boost transcription accuracy you’ll likely need to collect training data.
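As an illustration of the vocabulary correction mentioned in point 4, here is a minimal post-processing sketch that snaps near-miss words in a transcript to a list of known domain terms. It uses fuzzy matching from Python's standard difflib module. The function name correct_vocabulary, the example term and the 0.8 similarity cutoff are our own assumptions, and transcription services may implement vocabulary correction quite differently, for instance inside the model's decoding step.

```python
import difflib
from typing import Iterable


def correct_vocabulary(text: str, vocabulary: Iterable[str], cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a known term with that term.

    Matching is case-insensitive and works word by word, so multi-word terms
    and words with attached punctuation are left untouched in this sketch.
    """
    # Map lower-cased terms back to their preferred spelling.
    lookup = {term.lower(): term for term in vocabulary}
    candidates = list(lookup)
    corrected = []
    for word in text.split():
        # Returns at most one candidate whose similarity ratio is >= cutoff.
        match = difflib.get_close_matches(word.lower(), candidates, n=1, cutoff=cutoff)
        corrected.append(lookup[match[0]] if match else word)
    return " ".join(corrected)


if __name__ == "__main__":
    print(correct_vocabulary("patient takes metformine daily", ["Metformin"]))
```

A lower cutoff catches more misspellings but also risks rewriting ordinary words, so the threshold is worth tuning against real transcripts.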

Which brings us to the technical bits...

You can get good results with transcription without understanding the methodology underneath. Once you hit a wall, though, improving your results requires knowing the engineering and scientific concepts behind transcription. That's because you need to understand how transcription machine learning models are trained, evaluated and operated.
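On the evaluation side, one widely used metric is the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the model output into a reference transcript, divided by the number of words in the reference. We haven't named a metric above, so picking WER here is our assumption, and the function below is a simple sketch with naive lower-casing and whitespace tokenization rather than a full evaluation toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One missed word out of five reference words.
    print(word_error_rate("the patient has a fever", "the patient has fever"))
    # Two made-up words on a two-word reference.
    print(word_error_rate("thank you", "thank you very much"))
```

Note that inserted words count as errors too, so a model that makes up speech over silence is penalized just like one that misses words, and WER can exceed 1.0 when the output is much longer than the reference.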

The primary way to improve an existing transcription model is to collect new training data from your specific application. For example, to improve a healthcare transcription system, you would collect actual conversations between doctors and patients (while taking care of privacy). This data is then used to train the machine learning model that does the actual transcribing. Note, however, that not all providers allow you to train their models, which is why we at Softlandia like open source models that we can train and deploy at will!

Transcription is a piece in a bigger puzzle

The importance of automatic speech recognition is highlighted by advances in speech and video generation, as it is now much easier to create audio-visual solutions and content than before. Technologies such as LLMs, audio event recognition, speech generation and speaker recognition are all pieces of the puzzle when building transcription into your solution. Integrating these technologies allows for a more robust and accurate transcription system that is tailored to specific needs and contexts and is a pleasure to use.

There you go! By understanding these underlying concepts, we can better adapt and leverage the latest advancements to achieve superior results in transcription tasks.