The Speech to text (STT) module explained

In this article, we take a closer look at the Speech to text (STT) module and its different setup options. But first, what is the STT module and what do you need it for?

The Speech to text module stores voice or DTMF input given by the customer during the call flow. You need it if you want the customer to provide information or answer a question, e.g., giving their call reason, agreeing to a call being recorded, or typing in or reading out their customer ID. The customer's answer is then stored in a variable, which you can use to route the call in certain directions or to send the answer to external systems like your helpdesk.

The STT module has many different settings which you can tweak to optimize the accuracy of recognition as well as the customer experience. We will go through each of these settings below.

We will look at the "Audio" tab first. If you know all this already, you can jump directly to the "Settings" part.

The "Audio" tab:

  1. The first row of the "Audio" tab lets you define the audio you would like to play when the call enters the STT module. You can either use an audio file you uploaded to your babelforce account or configure TTS (text-to-speech) instead. To do so, toggle the "TTS" switch and click to edit it.

    If you need more information about our TTS module, have a look at this article. The advantage of configuring the audio here is that you don't need a separate TTS module or Audio Player before the STT module.

  2. Next, let's look at the option to include a "No-input" audio:

    Enter an audio file or TTS here to have it played to the customer in case they do not give any input. In that case, the call does not move on to the next module in the call flow. Instead, this audio is played to the customer and they get a second chance to give their input. You can also configure a second "No-input" audio.

  3. Lastly, you can also enter a "No-match" audio:
    This audio will be played to the customer in case they give a wrong input, e.g., they enter non-numeric input although you configured the module to expect numeric input (we will look at this in the second part of this article).

For each of these audios, you can also configure the timing, that is, how many milliseconds of pause there should be before and after the audio is played.

Great, now that we know about our audio options, let's look at the STT settings:

The "Settings" tab:

1. First, let's look at the upper part of the module. This is where you configure the speech recognition settings. By default, babelforce integrates the Google Cloud Speech-to-Text service. However, babelforce is not restricted to this service; services from other providers can be integrated as well.

    1. Language: babelforce offers 70+ languages out of the box. Please note that voice input is not translated out of the box. So if you choose "English, United States" as the language, for instance, the platform expects English input from the caller.

    2. Speech model: Here we can choose between three options:
      • Command and Search: Best for short or single-word utterances like voice commands or voice search. It usually works even better than the Phone call model.
      • Phone call: Designed for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate). However, it is not available for all languages. So if you find that your input is not recorded, switch back to Command and Search.
      • Default: Best for audio that does not fit the other audio models, like long-form audio or dictation. Ideally the audio is high-fidelity, recorded at a 16 kHz or greater sampling rate.

    3. Recognition mode:
      • Single utterance: Best for brief (one-shot) speech input, like a name or a short caller intent.
      • Continuous input: Continuous speech input like a customer ID or call reason expressed in a whole sentence.

    4. Continuous input timeout: Only needed for the recognition mode "Continuous input". This number indicates how many seconds of pause are allowed between the individual parts of the customer's input before the input is considered complete and the call flow moves on to the next module. E.g., imagine the customer reads out a 5-digit customer ID. If the continuous input timeout is 2, the customer can pause for up to two seconds after every digit they read out before the input is considered complete.

    5. Industry NAICS code: Optional field that lets you provide extra context for the input.
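
The effect of the continuous input timeout can be illustrated with a small Python sketch. It is only a simplified model, not babelforce's implementation: the function name, the segment format (start time, end time, recognized text) and the sample timings are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation of one recognized part of the caller's input:
# (start time in seconds, end time in seconds, recognized text)
Segment = Tuple[float, float, str]


def collect_input(segments: List[Segment], timeout_s: float) -> str:
    """Join the recognized parts until a pause longer than timeout_s occurs."""
    collected = []
    previous_end = None
    for start, end, text in segments:
        # A pause longer than the timeout means the input is considered complete.
        if previous_end is not None and start - previous_end > timeout_s:
            break
        collected.append(text)
        previous_end = end
    return "".join(collected)


# The caller pauses 0.5 s, 1.0 s and then 3.0 s between digits.
segments = [(0.0, 0.5, "4"), (1.0, 1.5, "7"), (2.5, 3.0, "1"), (6.0, 6.5, "9")]
print(collect_input(segments, timeout_s=2.0))  # prints "471"
```

With a timeout of 2 seconds, the 3-second pause ends the input, so the last digit is no longer part of the result. A longer timeout is more forgiving for slow speakers but makes every caller wait longer before the call flow continues.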

If you want to dive deeper into the topic of speech recognition by Google, check out this article by Google.

2. Next, let's move on to the settings for "Matching":

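    1. Pattern (regular expression): Enter a regular expression here to check the customer input against. For example, you could use ([A-Za-z]{3,25}) to only allow input of 3 to 25 letters (no numbers or symbols). Note that the shorter range A-z would also match the characters [, \, ], ^, _ and the backtick, which lie between Z and a in ASCII.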

    2. Numeric: Toggle this switch if you are expecting numeric input, like a customer ID (consisting solely of digits) or a birth year. Points 3 to 5 are only relevant if "Numeric" is turned on.

    3. Min. length: This is the minimum length of the expected numeric input. If the customer ID you are checking is between 8 and 10 digits long, set this to 8. Any input shorter than 8 digits is considered wrong input by babelforce. In this case, the customer hears the audio configured in the "Audio" tab for wrong input (= no match).

    4. Max. length: This is the maximum length of the expected numeric input. If the customer ID you are checking is between 8 and 10 digits long, set this to 10. Any input longer than 10 digits is treated as wrong input. In this case, the customer hears the audio configured in the "Audio" tab for wrong input (= no match).
      • If the number is exactly 10 digits long, set both min. and max. length to 10.

    5. Termination: Here you can configure whether the customer should press a certain key once they have finished their input. You can also choose "None" to not use any key.
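
To get a feeling for the matching rules above, here is a minimal Python sketch. It is not babelforce's implementation: the function names are hypothetical, and it assumes that the pattern has to match the whole input rather than just a part of it. The sketch uses [A-Za-z] for the letter range because A-z also matches a few symbols such as the underscore.

```python
import re


def matches_pattern(text: str, pattern: str) -> bool:
    """Check whether the whole input matches the regular expression."""
    return re.fullmatch(pattern, text) is not None


def is_valid_numeric(text: str, min_length: int, max_length: int) -> bool:
    """Check that the input consists only of digits 0-9 and has an allowed length."""
    only_digits = text.isascii() and text.isdigit()
    return only_digits and min_length <= len(text) <= max_length


# Letters only, between 3 and 25 characters
print(matches_pattern("Smith", r"[A-Za-z]{3,25}"))    # True
print(matches_pattern("Sm", r"[A-Za-z]{3,25}"))       # False, too short
print(matches_pattern("Smith_1", r"[A-Za-z]{3,25}"))  # False, symbol and digit

# The A-z range is too permissive: it also accepts the underscore.
print(matches_pattern("abc_", r"[A-z]{3,25}"))        # True

# Customer ID with 8 to 10 digits
print(is_valid_numeric("12345678", 8, 10))     # True
print(is_valid_numeric("1234567", 8, 10))      # False, too short
print(is_valid_numeric("12345678901", 8, 10))  # False, too long
```

Testing a pattern like this against a handful of typical and invalid inputs before using it in a live call flow helps to avoid unnecessary "No-match" prompts.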

3. Great, let's look at the "Timing" now:

The Read timeout works similarly to the continuous input timeout: This number indicates how many seconds of pause are allowed between the individual parts of the customer's input before the input is considered complete and the call flow moves on to the next module. E.g., imagine the customer enters a 5-digit customer ID. If the read timeout is 2, the customer can pause for up to two seconds after every digit before the input is considered complete. The read timeout is only considered if DTMF is enabled, though (we will look at DTMF further down). In that case, it also overrides the continuous input timeout setting.
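
How the two timeouts relate to each other can be summed up in a short Python sketch. The function and parameter names are hypothetical, and the behaviour for single utterances without DTMF (no pause timeout at all) is an assumption, not something this article specifies.

```python
from typing import Optional


def effective_timeout(
    recognition_mode: str,
    continuous_input_timeout: float,
    read_timeout: float,
    dtmf_enabled: bool,
) -> Optional[float]:
    """Return the pause (in seconds) after which the input counts as complete."""
    if dtmf_enabled:
        # With DTMF enabled, the read timeout takes precedence.
        return read_timeout
    if recognition_mode == "continuous":
        return continuous_input_timeout
    # Assumption: single utterances do not use a pause timeout.
    return None


print(effective_timeout("continuous", 2.0, 3.0, dtmf_enabled=True))   # 3.0
print(effective_timeout("continuous", 2.0, 3.0, dtmf_enabled=False))  # 2.0
```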

 

4. Now, let's look at the part "Behaviour":


As you read in the first part of this article, we can configure audios which are played if babelforce does not receive any input from the customer or receives input that is not correct.

    1. Max. retries: Determines how often we allow the customer to give wrong input (or no input at all) before we move on to the next module.
    2. Ignore no-input: This switch is enabled by default, which means babelforce ignores wrong input or no input and simply moves on to the next module. Toggle the switch to disable it so that wrong or missing input is registered and the respective "No-input" or "No-match" audio is played to the customer.
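
The interplay of these two settings can be modelled with a short Python sketch. It is a simplified illustration, not babelforce's implementation. The function name and return format are hypothetical. The sketch also assumes that "Max. retries" counts the additional attempts after the first failed one, and that no validated result is available when the module moves on after wrong or missing input.

```python
from typing import Callable, List, Optional, Tuple


def run_stt(
    attempts: List[Optional[str]],
    is_valid: Callable[[str], bool],
    max_retries: int,
    ignore_no_input: bool,
) -> Tuple[Optional[str], List[str]]:
    """Simulate the retry behaviour.

    attempts holds the caller's inputs in order; None stands for no input.
    Returns the accepted input (or None) and the list of audios played.
    """
    played: List[str] = []
    for index, attempt in enumerate(attempts):
        if attempt is not None and is_valid(attempt):
            return attempt, played
        if ignore_no_input:
            # Wrong or missing input is ignored; the call simply moves on.
            return None, played
        if index >= max_retries:
            # All retries are used up; the call moves on.
            return None, played
        played.append("no-input" if attempt is None else "no-match")
    return None, played


def is_customer_id(text: str) -> bool:
    return text.isdigit() and len(text) == 8


attempts = [None, "12", "12345678"]
print(run_stt(attempts, is_customer_id, max_retries=2, ignore_no_input=False))
# ('12345678', ['no-input', 'no-match'])
print(run_stt(attempts, is_customer_id, max_retries=2, ignore_no_input=True))
# (None, [])
```

In the first call, the caller stays silent once and then gives an input that is too short, so both audios are played before the third attempt succeeds. In the second call, the switch is left enabled and the call moves on after the first attempt without playing any audio.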

 

5. Result handling:

In this section you can define the variable in which the customer input is stored. You can later use it in Triggers, which you can then include in Switch nodes or queue selections to route the call depending on the customer input. Have a look at this article if you want to read a use case on the topic. Use a meaningful variable name to make it easier to find in the expression list afterwards and to remember its function.
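
Conceptually, routing on the stored variable works like the following Python sketch. Everything in it is hypothetical and only for illustration: the variable name stt_call_reason, the keywords and the queue names are made up, and in babelforce itself you would configure this logic with Triggers and Switch nodes rather than in code.

```python
from typing import Dict


def route_call(variables: Dict[str, str]) -> str:
    """Pick a queue based on the call reason stored by the STT module."""
    reason = variables.get("stt_call_reason", "").lower()
    if "invoice" in reason or "billing" in reason:
        return "billing_queue"
    if "cancel" in reason:
        return "retention_queue"
    # Fallback when nothing matched or nothing was stored
    return "general_queue"


print(route_call({"stt_call_reason": "I have a question about my invoice"}))
# billing_queue
print(route_call({}))
# general_queue
```

A descriptive variable name like the one above makes it obvious later on which module filled the variable and what it contains.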

6. Now, let's look at the last part, the section "Other":

    1. Enable DTMF: Toggle this switch if you want to allow the customer to type in a number on their phone keypad to give the input.
    2. Barge in: Toggle this switch if you want to allow barge-in. This means that the customer can start talking while the text is still being read out to them.
    3. Barge in delay: Define after how many seconds the customer can interrupt the text which is being read out to them.
    4. After-flow: Determine the next module in the call flow.

 
