Frequently Asked Questions
Answers to most of the things people ask
What type of technology is used for Speaker Independent Recognition (SIR)?
For SIR we use Hidden Markov Models (HMMs). An HMM is a statistical model of human speech. Using HMMs, the system observes certain parameters of the signal and makes assumptions regarding the statistical behavior of human speech. The models are trained on many recordings collected from different speakers. When somebody speaks to the trained system, it calculates the probabilities associated with the different options and selects the recognition result with the highest probability.
The algorithms involved in this type of system are far more sophisticated than in Speaker Dependent Recognition (SDR).
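As a rough illustration of "selects the recognition result with the highest probability", the toy sketch below scores one observation sequence against two tiny discrete HMMs with the forward algorithm and picks the more likely word. The word names, the model parameters, and the two-symbol observation alphabet are all invented for this example; a real recognizer works on continuous acoustic features, and nothing here describes Rubidium's actual engine.

```python
# Toy sketch: pick the word whose HMM assigns the highest likelihood to an
# observation sequence. All names and numbers are made up for illustration.

def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) computed with the forward algorithm."""
    n_states = len(start)
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)

# Two hypothetical 2-state word models over the observation symbols 0 and 1.
MODELS = {
    "yes": {
        "start": [1.0, 0.0],
        "trans": [[0.6, 0.4], [0.0, 1.0]],
        "emit": [[0.9, 0.1], [0.2, 0.8]],
    },
    "no": {
        "start": [1.0, 0.0],
        "trans": [[0.6, 0.4], [0.0, 1.0]],
        "emit": [[0.1, 0.9], [0.8, 0.2]],
    },
}

def recognize(obs):
    """Score obs against every word model; return the best word and all scores."""
    scores = {
        word: forward_likelihood(obs, m["start"], m["trans"], m["emit"])
        for word, m in MODELS.items()
    }
    return max(scores, key=scores.get), scores

best_word, word_scores = recognize([0, 0, 1, 1])
print(best_word, word_scores)
```

For the sequence `[0, 0, 1, 1]` the "yes" model scores far higher than the "no" model, so "yes" is returned.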
Does the speech recognition support “keyword spotting” mode?
Yes. The technology can listen continuously and alert the application when a certain word or phrase is detected. This is sometimes called “word trigger mode”.
Can the application learn the user’s voice?
Yes. The application can identify the user’s voice, or recognize a specific password that the user has trained it on. This is called “Speaker Dependent Recognition (SDR)”.
Can an application mix both Speaker Independent Recognition (SIR) and Speaker Dependent Recognition (SDR)?
Yes. An application can have both SDR and SIR modes, but not simultaneously.
Does the recognition support “barge-in” mode?
“Barge-in” refers to the ability of the speech recognition technology to recognize users as they talk over the product’s speech output. In other words, users can start talking while the application is still talking to them. Barge-in is usually resource intensive and is not realistic to implement on small embedded platforms. However, at Rubidium we have developed a patent-pending barge-in algorithm that runs in a very small footprint and still performs well.
Does Rubidium have an SDK (software development kit) for its speech recognition application development?
No. Our policy is to develop the application together with our customers. Our experience shows that when customers who are not proficient in speech recognition try to develop an application on their own, the result is often unsatisfactory. Instead, we suggest a milestone-oriented plan and deliver engineering samples to the customer at each milestone.
What is the recognition accuracy? Is there any standard method used to test recognition accuracy?
There is no standard method for measuring speech recognition accuracy. The accuracy depends on so many factors that a uniform test cannot be repeated exactly. Moreover, recognition accuracy cannot be considered on its own, without taking into account the resources the engine needs to achieve that accuracy score.
However, in most field testing of our applications we achieve over 98% recognition accuracy.
What type of microphone is required?
We can handle any microphone. Our recognition engine includes a special algorithm that cancels out the effect of different microphones. The microphone can be a simple electret condenser capsule, an analog MEMS microphone or a digital MEMS microphone. We can use omni-directional, cardioid or other types of mics. Furthermore, Rubidium has developed a beam-forming algorithm that can handle more than one microphone.
Does the speed of speech affect the recognition quality? Are there any limitations on talking speed?
There is no limitation on talking speed, provided that the user is not trying to trick the system, and the user is speaking naturally.
Is it possible to make a multilingual product? Is it possible to have products switch between the languages by just saying the name of the language?
Yes, it is possible and Rubidium has already implemented this feature in products. The language can be changed by a physical switch or by voice (stating the name of the language desired). We can even support several languages concurrently!
Can you supply the technology even just for a small production quantity?
Yes. We would charge NRE for the development and tooling and then usually refund it during mass production.
Is it possible for the same application to recognize the voices of 6- to 10-year-old boys and girls, as well as of 18- to 23-year-old women?
Yes. We can make a product recognize young voices of children over the age of 5, if the child speaks clearly. However, for optimum results, the product must be pre-trained to recognize voices in this age group.
How many chips does a solution include? What is the chipset cost?
We offer several chip configurations, depending on the application’s complexity, namely number of words, voice output duration, number of languages, interface requirements, etc.
Our typical solution includes a single chip, functioning as a System-on-Chip (SOC), with all peripherals on board, including analog input and output, microphone pre-amp, RAM and ROM.
Is it possible to replace or expand the speech recognition vocabulary or speech output corpus using a cartridge-type external memory chip?
Yes, it is possible. Changing an external memory cartridge does not require any installation; it is simply “plug and play”.
Is there a list of already-available vocabulary that can be used to pick the application vocabulary?
Yes; however, we treat this list as Confidential Information. The list of available vocabulary is therefore supplied to customers only after an NDA is in place to protect our IP. Of course, we often record new words for an application to satisfy the customer’s specific vocabulary.
What is the estimated development cost and royalties for 100,000 units?
Without a product specification, it is very difficult to estimate these costs with reasonable accuracy. However, development for a full project is generally in the range of thousands of US dollars. For high-volume applications we would consider refunding the development costs by applying a certain discount to the first orders.
Is it possible to recognize several words in a single phrase?
Yes. Furthermore, we can recognize a phrase that contains words of interest mixed with extraneous speech. Our technology isolates the vocabulary words and/or phrases and ignores the rest.
For the speech output, is it possible to use the customer's own voice recordings or the voice of famous actors and actresses? If so, does it affect the cost of development?
Rubidium gladly allows you to use the voices of your choice at no additional cost. After you provide us with the recordings, we simply process them and add them to our application.
Is it possible to create a product that is multilingual so that it can be used in multiple markets?
Yes. The language of the product can be determined by factory setup or by the end user. It just depends on the specification.
We are interested in marketing an interactive toy for instruction of foreign languages. Can Rubidium's technology be adapted for language-learning settings?
Yes, this is possible.
What is the hardware resolution of the DAC and ADC?
Most of our configurations support 10-bit audio input and output. Some configurations support 16-bit input and output.
What is the speech recognition vocabulary limitation?
The total vocabulary size is only limited by memory size. The ACTIVE vocabulary size (the maximal number of words/phrases that can be recognized at a certain dialogue state) is limited to 12-50, depending on the platform.
Can you implement a voice-controlled Remote Control?
Yes. Furthermore, the speech recognition chip can also perform the function of the remote control chip, so that there is no need to have both chips. Our chip can recognize speech, talk back to the user, control the remote control keypad and perform the infra-red remote operation.
How much memory does the compressed speech feedback consume?
We offer several speech compression algorithms, with bit rates from 4800 bps to 64 Kbps. This means that 1 minute of talk-back to the user consumes 36 Kbytes to 480 Kbytes.
A typical application would have up to a few minutes of total speech output time.
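The figures above follow directly from bit rate times duration divided by 8 bits per byte. A minimal sketch of that arithmetic, assuming decimal units (1 Kbps = 1000 bits per second, 1 Kbyte = 1000 bytes) and a hypothetical helper name:

```python
def speech_storage_bytes(bit_rate_bps, seconds):
    """Bytes needed to store `seconds` of compressed speech at `bit_rate_bps`."""
    return bit_rate_bps * seconds // 8

# One minute at the lowest and highest bit rates quoted above.
low = speech_storage_bytes(4_800, 60)     # 36_000 bytes = 36 Kbytes
high = speech_storage_bytes(64_000, 60)   # 480_000 bytes = 480 Kbytes

# Example: three minutes of speech output at the lowest bit rate.
three_minutes = speech_storage_bytes(4_800, 180)  # 108_000 bytes

print(low, high, three_minutes)
```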
I have heard the term Active Vocabulary. What does it mean?
It refers to the set of words or phrases that can be recognized at a given point in the dialog. The total number of recognized words can be larger than the maximum active vocabulary, because the active vocabulary can change during the dialog.
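One way to picture the difference between total and active vocabulary is a table that maps each dialog state to the words active in it. The sketch below is illustrative only: the states, the words, and the function names are hypothetical, and the limit of 12 is simply the lower end of the 12 to 50 range quoted in this FAQ.

```python
# Hypothetical dialog: each state lists the words that are active in it.
DIALOGUE = {
    "main_menu": ["lights", "music", "thermostat"],
    "lights": ["on", "off", "dim", "back"],
    "music": ["play", "pause", "next", "louder", "back"],
    "thermostat": ["warmer", "cooler", "back"],
}

ACTIVE_LIMIT = 12  # assumed platform limit on the active vocabulary

def total_vocabulary(dialogue):
    """All distinct words across every dialog state."""
    return sorted({word for words in dialogue.values() for word in words})

def max_active(dialogue):
    """Largest number of words active in any single state."""
    return max(len(words) for words in dialogue.values())

def states_over_limit(dialogue, limit):
    """States whose active vocabulary exceeds the platform limit."""
    return [state for state, words in dialogue.items() if len(words) > limit]

print(len(total_vocabulary(DIALOGUE)))             # 13 distinct words in total
print(max_active(DIALOGUE))                        # at most 5 active at once
print(states_over_limit(DIALOGUE, ACTIVE_LIMIT))   # [] (every state fits)
```

Here the total vocabulary (13 words) exceeds the assumed active limit of 12, yet no single state has more than 5 active words, so the dialog still fits.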
What processor does the speech recognition require? Is it a DSP? How many bits of CPU power does it need (8, 16, etc.)?
Our speech recognition can run on many processors, both DSP and non-DSP (such as ARM Cortex). The minimum requirement is a 16-bit fixed-point CPU, but we also run on floating-point CPUs.
Can the speech recognition recognize connected words and words in long sentences?
Our technology can recognize spontaneous continuous speech, meaning the user can speak naturally, including several words in one phrase that may be connected.
What is the memory consumption for speech recognition vocabulary?
Each word or phrase (depending on the dialogue structure) consumes around 2 – 2.5 KByte of ROM space.
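For a quick budget estimate, the per-entry figure can be multiplied by the number of words or phrases. A minimal sketch, assuming the 2 to 2.5 KByte range applies uniformly to every entry; the helper name and the 40-entry example are hypothetical:

```python
def vocabulary_rom_kbytes(num_entries, low_kb=2.0, high_kb=2.5):
    """Return the (low, high) ROM estimate in KBytes for a vocabulary."""
    return num_entries * low_kb, num_entries * high_kb

# Example: a vocabulary of 40 words or phrases.
low_estimate, high_estimate = vocabulary_rom_kbytes(40)
print(low_estimate, high_estimate)  # 80.0 100.0
```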
How long does it take to develop an initial prototype? How long for something that is ready for market?
It typically takes about 2 months to develop a prototype. Full development takes about the same time again, depending on the integration period and on any changes in specification between the prototype and the full product.