The Speech to text module explained (STT)

Christina Dechent

In this article, we take a closer look at the Speech to text module (STT) and its setup options.

What is the Speech to text module?

The Speech to text module stores voice or typed input given by the customer in the IVR or voicebot. There are many use cases:

  • leaving a call reason
  • agreeing to a call being recorded
  • typing in a customer ID
  • confirming a previous input, etc.

The customer's answer is stored in a variable, which you can use to route the call or to send to external systems like your helpdesk. Its basic functionality is very similar to the Input Reader, but it is much more powerful.

Setting up an STT module

The STT module has many settings that you can tweak to optimize recognition accuracy as well as the customer experience. We will go through each of these settings below.

We will look at the "Audio" tab first. If you are already familiar with it, you can jump directly to the "Settings" part.

The "Audio" tab:

  1. The first row of the "Audio" tab lets you define the audio you would like to play when the customer reaches this part of the call flow. You can either use an audio file or configure a Text to Speech (TTS) input by typing what should be read to the customer. To do so, toggle the "TTS" switch and click to edit it.



    If you need more information about TTS, have a look at this article

  2. Next, let's look at the option to include a "No-input" audio:


    Enter an audio file or TTS here to have it played to the customer if they stay silent (give no input). In this case, the call does not move on to the next module in the call flow. Instead, this audio is played and the customer gets a second chance to give an input. You can also configure a second "No-input" audio.

  3. Lastly, you can also enter a "No-match" audio:
    This audio is played to the customer if they give an invalid input, e.g. a number that is too long or an input that doesn't match the defined regular expression (we will look at this in the second part of this article).

For every audio, you can also configure the timing: you can set how many milliseconds of pause there should be before and after the audio is played.

Great, now that we know our audio options, let's look at the STT settings:

The "Settings" tab:

The "Settings" tab has five sections. Each section groups settings that let you adjust and optimize speech recognition. The sections are:

  1. Speech recognition
  2. Result variable
  3. Timing
  4. Integer matching
  5. Customer behaviour

You can see them in the screenshot below. Next to each section, you also see either a minus ("-") or an arrow pointing to the left ("<"). If you click on the arrow, more settings become visible. The minus hides them again.



Below, we explain each section and all the settings it contains.


1. Speech recognition

If you are a beginner, only one setting is relevant:

Language: babelforce offers 70+ languages out of the box. Select the language you expect the input in. Please note that we cannot translate voice input out of the box. So, if you choose "English, United States" as the language for instance, our platform expects English input from the caller.

There are some advanced or beta options which you can play around with but we suggest you leave the default if you are not very familiar with speech modelling:

Template: babelforce has optimized certain use cases in specific languages. However, only some templates have been optimized so far. Please check this article to see which templates are already available for which language.

Speech module: We recommend changing the speech module only if you are an advanced user. Some modules might be optimized for a specific purpose in your language. By default, we suggest "Command and search".

If you want to dive deeper into the topic of speech recognition by Google, check out this article by Google.


2. Result variable

Here you define the variable name. This variable stores the customer's input. You can later use it in Triggers, Switch nodes, or queue selections to route the call depending on the customer input. Have a look at this article if you want to read a use case on the topic. Use a meaningful variable name so that it is easier to find in the expression list afterwards and to remember its function.




3. Timing

Timeout between inputs: This number indicates how many seconds of pause are allowed between the individual parts of the customer input before the input is considered complete and the call flow moves on to the next module. For example, imagine the customer reads out a 5-digit customer ID. If the timeout between inputs is 2, the customer has two seconds after every digit they read out before the input is considered complete.
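To illustrate the timeout logic, here is a minimal Python sketch. It is not babelforce code: the function name collect_input and the event format (a list of timestamp and token pairs) are assumptions made for this example, and the sketch ignores how long each spoken token lasts.

```python
def collect_input(events, timeout_s):
    """Join tokens until a pause longer than timeout_s occurs.

    events: list of (timestamp_in_seconds, token), in chronological order.
    This is a simplified model (hypothetical), not the platform's logic.
    """
    accepted = []
    last_time = None
    for timestamp, token in events:
        # A pause longer than the timeout ends the input.
        if last_time is not None and timestamp - last_time > timeout_s:
            break
        accepted.append(token)
        last_time = timestamp
    return "".join(accepted)


# The caller pauses 3 seconds before the last digit, so it is not captured.
events = [(0.0, "4"), (1.2, "7"), (2.5, "1"), (5.5, "9")]
result = collect_input(events, timeout_s=2)
print(result)  # 471
```

In this model, a timeout that is too short cuts off slow speakers, while a timeout that is too long makes every caller wait after their last digit.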


4. Integer matching

This section defines what type of input is expected.

DTMF input: Toggle this switch if you want to allow the customer to give the input by typing a number on their phone.

Termination: Here you can configure whether the customer should press a certain key once they have finished their input. You can also choose "None" to not use any key.

Accept only numeric input = false: Do not touch this setting if you expect string input.

Pattern (regular expression): Enter a regular expression here to check the customer input. For example, you could use ([A-Za-z]{3,25}) to only allow letters (no numbers or symbols) with a length between 3 and 25 characters. This field is only available if "Accept only numeric input" = false.
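If you want to test a pattern before entering it, you can try it locally. The following Python sketch assumes that the whole input has to match the pattern (whether the platform matches the full input or only part of it is an assumption here), and the helper name matches_pattern is made up for this example.

```python
import re

# Letters only, between 3 and 25 characters.
PATTERN = re.compile(r"[A-Za-z]{3,25}")


def matches_pattern(customer_input):
    """Return True if the whole input matches the pattern (assumed behaviour)."""
    return PATTERN.fullmatch(customer_input) is not None


print(matches_pattern("Berlin"))  # True
print(matches_pattern("ab"))      # False (too short)
print(matches_pattern("abc1"))    # False (contains a digit)

# Note: the range A-z is wider than A-Za-z. It also covers the characters
# between "Z" and "a" in ASCII, such as "_" and "^".
print(re.fullmatch(r"[A-z]{3,25}", "ab_cd") is not None)  # True
```

Keep in mind that regular expression dialects differ slightly between engines, so a pattern that works in Python should still be tested in the call flow.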




Accept only numeric input = true: Toggle this switch if you are expecting numeric input, like a customer ID consisting solely of digits, or a birth year.

min/max length: These fields only appear if "Accept only numeric input" = true. Here you define the expected minimum and maximum length of the input. If a customer ID is expected to be between 8 and 10 digits long, set min length = 8 and max length = 10. Once 10 digits are reached, the input stops and the call continues to the next module.
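The length check can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not the platform's implementation: the function name read_numeric_input is hypothetical, the input arrives one character at a time, and any input that is too short or contains a non-digit is treated as a no-match.

```python
def read_numeric_input(keypresses, min_length, max_length):
    """Collect digits until max_length is reached (hypothetical model).

    Returns the collected digits, or None if the input counts as a no-match.
    """
    collected = ""
    for key in keypresses:
        if not (key.isascii() and key.isdigit()):
            return None  # a non-digit is treated as a no-match in this sketch
        collected += key
        if len(collected) == max_length:
            break  # maximum reached: stop listening and move on
    if len(collected) < min_length:
        return None  # too short: no-match
    return collected


print(read_numeric_input("12345678", 8, 10))      # 12345678
print(read_numeric_input("123456789012", 8, 10))  # 1234567890
print(read_numeric_input("1234", 8, 10))          # None
```

The second call shows why the maximum matters: everything the customer enters after the tenth digit is no longer part of the stored value.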




5. Customer behaviour

This section lets you adjust the caller experience. For instance, you can define whether an audio can be interrupted, whether a silent customer should be ignored, and how often error messages are played in case of a wrong or missing input.


Barge in = false: Audios or TTS cannot be interrupted. The customer can only give their input after the audio has finished playing.

Max. retries if no-input & no-match: Define how often a customer can retry to enter the expected input. Min = 0, max = 2.

Ignore no-input: If the toggle is off (grey), no-input is treated as incorrect input and the no-input prompt is played to the customer. If the toggle is on, no-input is ignored.



Barge in = true: Audios or TTS can be interrupted by the customer.

Barge in delay: Define after how many seconds the customer can start interrupting the audio.
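The interplay of "Max. retries" and "Ignore no-input" can be sketched in Python as follows. This is only a model based on the descriptions above. The function name run_stt_step, the use of None for silence, and the assumption that no prompt is played once the retries are used up are all made up for illustration.

```python
def run_stt_step(attempts, is_valid, max_retries=2, ignore_no_input=False):
    """Simulate one STT step (hypothetical model, not platform code).

    attempts: inputs the caller gives in order; None stands for silence.
    Returns (stored_value, prompts_played).
    """
    prompts_played = []
    total_tries = max_retries + 1  # first try plus the allowed retries
    for try_index, spoken in enumerate(attempts[:total_tries]):
        if spoken is None and ignore_no_input:
            return None, prompts_played  # silence is ignored, flow moves on
        if spoken is not None and is_valid(spoken):
            return spoken, prompts_played  # valid input is stored
        if try_index < total_tries - 1:
            # A retry is still left, so the matching prompt is played.
            prompts_played.append("no-input" if spoken is None else "no-match")
    return None, prompts_played


# Silence first, then letters, then a valid number on the last retry.
outcome = run_stt_step([None, "abc", "12345"], str.isdigit, max_retries=2)
print(outcome)  # ('12345', ['no-input', 'no-match'])
```

With max_retries=0, the same caller would get no second chance, and the step would end without a stored value.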


