Voice Recognition to Transcription

Voice Recognition has been used in Interactive Voice Response applications for over two decades, but traditional technologies were costly to implement, needed deep specialist knowledge and required continuous fine-tuning. The new generation of Voice Recognition converts spoken language to text through transcription, so it can be processed by an application.
The use of transcription, which lies at the core of Oration, creates an improved caller experience because it can interpret a caller’s intent from natural and freely spoken language instead of being constrained by limited words and phrases.
Voice Recognition is the process of taking spoken audio and converting it into text that can be processed by an application. It has been used in Interactive Voice Response (IVR) applications to allow callers to speak responses rather than push buttons. Traditional Voice Recognition used a complex ‘grammar’ to constrain the words it might expect to hear in order to achieve acceptable accuracy. This was very limiting and led to costly implementations that needed regular tuning. The new generation of Voice Recognition uses what is called Transcription. Transcription, as the name suggests, simply transcribes what the user says into text. The words can be almost anything and are not constrained to a limited set, and the transcription even includes punctuation.
The use of Transcription-based Voice Recognition completely changes the processes involved in creating voice caller experiences. Oration embeds this latest technology at its very core.
Old world
A grammar is created to cover typical banking phrases. This involves significant effort and often many months of data collection prior to implementation. The caller is prompted to keep their response short so that the spoken phrase matches the grammar.
“In just a few words how can I help you?” -> “transfer some money” -> recognised as “transfer some money” -> interpreted as “transfer money” and sent to the agent queue for transactional banking.
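The grammar-constrained flow above can be pictured with a minimal sketch. The grammar contents and the function name are hypothetical and purely illustrative; real grammar-based recognisers work on audio and are far more elaborate than a phrase lookup.

```python
from typing import Optional

# Hypothetical grammar: the only phrases the recogniser is allowed to return,
# each mapped to the intent it is interpreted as.
GRAMMAR = {
    "transfer some money": "transfer money",
    "transfer money": "transfer money",
    "check my balance": "check balance",
}


def recognise_with_grammar(utterance: str) -> Optional[str]:
    """Return the intent for an in-grammar phrase, or None for a no-match."""
    normalised = " ".join(utterance.lower().split())
    return GRAMMAR.get(normalised)
```

In this sketch a short, primed response such as "transfer some money" is matched, while a freely spoken sentence that is not in the grammar produces no match at all, which is why callers had to be prompted to keep their answers short.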
New world
No priming or grammar is required and the system learns through supervised learning what the ideal response to given phrases should be.
“How can I help you today” -> “Hi yeah I need to transfer some more money” -> transcribed as “Hi yeah I need to transfer some more money” -> interpreted as “transfer money” -> offered a natural language self-service option, or sent to the agent queue for transactional banking, where the agent sees “Hi yeah I need to transfer some more money” and can engage with the request right from the start.
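As a rough illustration of the transcription flow above, the sketch below takes a free-form transcript, assigns an intent, and passes the full transcript on to the agent queue. The keyword rules, queue names and function names are assumptions made for this example only; they are not how the Oration Interpreter is implemented.

```python
import re

# Hypothetical keyword rules; a real interpreter would be considerably richer.
INTENT_KEYWORDS = {
    "transfer money": {"transfer", "send", "pay"},
    "check balance": {"balance"},
}

# Hypothetical mapping from intent to agent queue.
QUEUES = {
    "transfer money": "transactional banking",
    "check balance": "account enquiries",
}


def interpret(transcript: str) -> str:
    """Assign an intent to a free-form transcript using keyword overlap."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "unknown"


def route(transcript: str) -> dict:
    """Pick a queue and keep the full transcript so the agent can see it."""
    intent = interpret(transcript)
    return {
        "intent": intent,
        "queue": QUEUES.get(intent, "general enquiries"),
        "transcript": transcript,
    }
```

Calling route("Hi yeah I need to transfer some more money") in this sketch yields the intent "transfer money", the "transactional banking" queue, and the caller's original words for the agent to read.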
Old world
What is traditional Voice Recognition?
Voice Recognition enables a person’s natural spoken language to be converted into words so that it can be used by software applications. Traditional Voice Recognition systems needed to be primed in order to understand certain words or phrases. These phrases were compiled into a grammar that limited the range of things a recogniser could cope with. This constraint was required to maintain an acceptable level of accuracy, but it came at the cost of flexibility and range and, worst of all, required continuous tuning. Whilst specialists could create acceptable caller experiences using this technology, the reality was that only top-end contact centres could afford the cost of implementation and the ongoing tuning and change costs.
New world
Transcription
Transcription enables a person’s natural spoken language to be converted into words, transcribing the audio into text so that it can be used by software applications. The new generation of Automatic Speech Recognition (ASR) services, such as Amazon Transcribe and Google Speech-to-Text, uses vast amounts of data to create highly effective language models which can even understand uncommon words like place names, product names or scientific terms. Importantly, Voice Recognition no longer relies on a pre-programmed grammar to interpret the meaning of phrases, and it can even create a transcription that includes punctuation. Transcription recognisers also allow users to tweak behaviour by adding extra words to the language model themselves.
This is much less restrictive than traditional Voice Recognition systems, which rely on complex pre-programmed words and phrases. The Oration Interpreter can then extract information from the transcribed text to apply an intent to every call.
How does it work in a contact centre environment?
It all starts with Oration automatically greeting callers with “How can I help?” Integrating the very latest technology from Amazon and Google, Oration’s ASR capabilities are used to recognise a caller’s response and create a transcription. The Oration Interpreter then extracts information from the text to apply an ‘intent’ to the call received.
To create complete coverage for all caller intents, Oration also engages in a process of supervised learning whereby information is surfaced to the administrator who can map what was said to what was needed (the intent).
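The review loop described above, where unrecognised phrases are surfaced to an administrator who maps them to an intent, might look something like the following sketch. The class and method names are hypothetical, and a simple exact-match lookup stands in for whatever learning Oration actually performs.

```python
from collections import Counter
from typing import List, Optional, Tuple


class IntentReview:
    """Hypothetical review queue: unmatched transcripts wait for an administrator."""

    def __init__(self) -> None:
        self.known = {}             # normalised transcript -> intent
        self.unmatched = Counter()  # normalised transcript -> times heard

    @staticmethod
    def _normalise(text: str) -> str:
        return " ".join(text.lower().split())

    def classify(self, transcript: str) -> Optional[str]:
        """Return the mapped intent, or record the phrase for review."""
        key = self._normalise(transcript)
        intent = self.known.get(key)
        if intent is None:
            self.unmatched[key] += 1
        return intent

    def pending(self) -> List[Tuple[str, int]]:
        """Phrases awaiting review, most frequently heard first."""
        return self.unmatched.most_common()

    def map_intent(self, transcript: str, intent: str) -> None:
        """Administrator maps what was said to what was needed."""
        key = self._normalise(transcript)
        self.known[key] = intent
        self.unmatched.pop(key, None)
```

Once an administrator has mapped a phrase, later callers who say the same thing receive that intent straight away, and the phrase drops out of the review list.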
What makes Oration different?
With Oration, you can pick and choose the best ASR engine for your use case, giving you peace of mind that your contact centre is up to date with the world’s best contact centre practices. Oration combines this with simple supervised learning techniques which allow behaviours to be continually refined and adjusted, such as giving the system hints by adding uncommon words to its vocabulary. By knowing exactly what’s important in a caller’s spoken language and what isn’t, Oration can also be configured to carry out information extraction for a wide range of common data types, such as a caller’s date of birth.
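To give a feel for information extraction from a transcript, here is a minimal sketch that pulls a date of birth out of free text. It assumes the transcription renders the date as a day number, a month name and a four-digit year (for example "3rd of March 1985"); the pattern and function name are illustrative assumptions, and a production extractor would need to handle many more formats.

```python
import re
from datetime import date
from typing import Optional

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Day (optionally with st/nd/rd/th), optional "of", month name, four-digit year.
DOB_PATTERN = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th)?\s+(?:of\s+)?([a-z]+)\s+(\d{4})\b"
)


def extract_date_of_birth(transcript: str) -> Optional[date]:
    """Return the first valid date found in the transcript, or None."""
    for day, month_name, year in DOB_PATTERN.findall(transcript.lower()):
        month = MONTHS.get(month_name)
        if month is None:
            continue  # the word after the day was not a month name
        try:
            return date(int(year), month, int(day))
        except ValueError:
            continue  # impossible date such as 31 February
    return None
```

Everything else the caller says around the date is simply ignored, which is the point of extraction: only the part that matters for the data type is kept.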
Oration will:
- Reduce average handling times
- Increase uptake of self-service
- Provide targeted banners
- Facilitate a digital channel shift
- Improve agent and customer engagement
- Support speed to competency