Friday, November 27, 2009

A not-so-simple answer

From the annotations to Irregular Webcomic:

Now, I realise that voice recognition is an extremely difficult problem in computer science. I understand that it's highly non-trivial, and that many excellent and very intelligent people have put years and years and years of research into the topic for painstaking and hard-won theoretical and practical gains in the field, against seemingly insurmountable problems. But:
  1. Computer voice recognition still sucks so incredibly badly that it's essentially useless for most purposes for which you might conceivably want to utilise it.
  2. In many of the places where it is used, it's so actively bad that it's a well-known joke how inaccurate and stupidly annoying it is.
  3. Three-year-old kids can understand the human voice, and by the time they're five, they can do it with virtually no difficulties at all other than exposure to vocabulary.
I know it's a hard problem to tackle from a computer science point of view. But I can't help feeling that we are puny ants on the face of an edifice of such size and elegance that we can't discern the patterns for which we seek. That computer science is tackling the problem of voice recognition in completely and utterly the wrong way.
I'm not arrogant enough to assert that this is true, or that I have any better ideas. But it wouldn't surprise me in the least if some young gun came along next year and did something completely out of left field that nobody in the research landscape had even considered before, and it turns out to vastly simplify the problem to something that is actually tractable to our computers. And that it will leave all the experts in the voice recognition field scratching their heads and going, "Well that was obvious. Why didn't we think of doing that before?"
Or I could be completely deluded and 400 years from now we'll still be struggling to order our pizzas on automated voice recognition systems that can't tell the difference between "Phillip Street" and "no anchovies".
Here's my not-so-simple approach. A breakthrough in speech recognition will occur when computers and A.I. systems learn to recognize speech the same way that humans learn it: within the larger context of language learning, and with plenty of help from social conditioning.

Let's look at an off-the-cuff theory of language learning and use among human infants. Starting at about two months, infants begin transitioning from crying to cooing. At its most basic level, cooing is the infant making all of the possible vowel sounds that lie within the capacity of the human vocal apparatus. Within eight to ten months, cooing will be supplemented with babbling, the same process done with consonant sounds. During this period, the infant is exploring all the sounds he or she can make.
Next begins a process of phonetic elimination. The infant begins to remove unproductive sounds from their repertoire, molding the range of phonetic production to the sounds the infant hears in the environment. The greatest influence on this process is the language use of adult humans. Infants learn most of their language behavior at this stage by mimicking the patterns of the sounds produced by adults. If the adults in the environment are speaking one language with a consistent set of phonemes, the infant will reduce their vocalizations to that set of phonemes.
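If one wanted to caricature this phonetic-elimination stage in code, it might look something like the sketch below. To be clear, the phoneme inventory and the "ambient speech" stream here are entirely made up for illustration; this is a cartoon of the idea, not a model of real infant learning.

```python
# Toy sketch of "phonetic elimination": start with a broad inventory of
# sounds, then prune it down to the phonemes actually heard in the
# ambient language. All data here is invented for illustration.
from collections import Counter

# The infant's initial, over-full inventory of producible sounds.
full_inventory = {"a", "e", "i", "o", "u", "b", "d", "g", "k", "m", "th", "zh"}

# A pretend stream of ambient adult speech, already segmented into phonemes.
ambient_speech = ["m", "i", "k", "m", "i", "k", "a", "b", "a"]

heard = Counter(ambient_speech)

# Keep only the sounds that actually occur in the environment.
pruned_inventory = {p for p in full_inventory if heard[p] > 0}

print(sorted(pruned_inventory))  # → ['a', 'b', 'i', 'k', 'm']
```

The point of the toy is only that pruning is driven by exposure: sounds never heard ('th', 'zh', and the rest) simply drop out of the set.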

Later stages of babbling begin to mold into proto-language use. This is the time when social conditioning becomes important. Over the course of babbling, the infant will tend to produce phonemes from the ambient language set in more or less random order, to a certain degree mimicking the phoneme patterns of adults. Most of this babbling will be rewarded at a modest but constant rate, enough to encourage the continuance of babbling. However, in an application of Shakespeare's infinite monkeys, eventually the infant will stumble upon a phoneme pattern that more or less matches an intelligible word of the ambient language at a time and place where it will be overheard and understood by an attending adult: a child's 'first word'. Frequently, this first word will result in the child being rewarded in some fashion.
Over the next months and years, the child will continue to be rewarded when making sounds appropriate to the ambient language and the social context, and will go unrewarded when making inappropriate sounds; making the sound 'milk' may be rewarded with the desired food item, while making the sound 'ilkm' goes unrewarded. In this way, the child's basic vocabulary is built until more complex language mechanisms come into play.
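As a toy illustration of this reward loop, imagine a babbler that repeats rewarded sounds more often. Everything in the sketch — the candidate utterances, the reward function, the reinforcement and decay rates — is a hypothetical stand-in chosen for simplicity, not a claim about how infants actually learn.

```python
# Toy sketch of reward-driven utterance shaping: rewarded sounds become
# more likely to be repeated, unrewarded sounds slowly fade.
import random

random.seed(0)  # make the toy run repeatable

# The "environment" (an attending adult) only rewards the well-formed word.
def reward(utterance):
    return utterance == "milk"

# Start with equal propensity for each candidate sound pattern.
weights = {"milk": 1.0, "ilkm": 1.0, "lmik": 1.0}

for _ in range(200):
    total = sum(weights.values())
    utterance = random.choices(list(weights), [w / total for w in weights.values()])[0]
    if reward(utterance):
        weights[utterance] *= 1.1   # reinforce the rewarded sound
    else:
        weights[utterance] *= 0.99  # let unrewarded sounds decay

best = max(weights, key=weights.get)
print(best)  # → milk
```

Because rewarded utterances only ever gain weight and unrewarded ones only ever lose it, 'milk' comes to dominate the babbler's repertoire — which is all the analogy is meant to show.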
Hearing language is probably helped by a similar social conditioning process. When one hears utterances, one responds with behavior (whether speech or action). When the responsive behavior is appropriate, one is rewarded. When the responsive behavior is inappropriate, one goes unrewarded.
Children, too, undergo this conditioning process. When they correctly understand speech, they learn to respond appropriately and are rewarded. When speech is not correctly understood, they are unable to initiate the correct responsive behavior and go unrewarded. I feel this conditioning is integral to the language-hearing learning process.
Computers, at this stage of technology, are immune to social conditioning. Outside of various academic AI labs, computers are pretty much incapable of modifying their own behavior. Certainly, it is difficult to program a computer to have needs, or to recognise its own needs. The presence and internal recognition of needs, together with the desire and ability to have those needs met, constitute the basis for the reward process of social conditioning. Only once it becomes possible to reward computers do I feel great strides will be made in the quest for decent speech recognition.
At the same time, one of the difficulties of computer speech recognition is the contextual dependency of language and the human language process. Quite a bit of error-correcting occurs in human speech by reference to the context of the utterance. If I'm talking to my doctor, and I make some sort of reference to 'elbeny', I may be speaking of my 'elbow' or my 'knee'. If I'm discussing New York geography, the same set of sounds is likely to be interpreted as 'Albany'.

Computers, at this point, are mostly incapable of making such judgements. They are less able to refer to context, and are unaware of many of the different contexts that a native human language user is capable of drawing upon. This is another barrier to computer speech recognition.
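A crude sketch of what such context-sensitive judgement might look like is below. It uses plain string similarity as a hypothetical stand-in for a real acoustic model, and the context priors are invented numbers; the only point is that the same garbled sound resolves differently depending on which conversational context is active.

```python
# Toy sketch of context-weighted disambiguation: score each candidate
# word by (acoustic similarity) x (prior probability in this context).
import difflib

def disambiguate(heard, context_priors):
    best_word, best_score = None, 0.0
    for word, prior in context_priors.items():
        # String similarity stands in for a real phonetic/acoustic model.
        similarity = difflib.SequenceMatcher(None, heard, word).ratio()
        score = similarity * prior
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Invented priors: which words are plausible in each conversation.
medical_context = {"elbow": 0.5, "knee": 0.5, "albany": 0.01}
geography_context = {"elbow": 0.01, "knee": 0.01, "albany": 0.9}

print(disambiguate("elbeny", medical_context))    # → elbow
print(disambiguate("elbeny", geography_context))  # → albany
```

Talking to the doctor, 'elbeny' comes out as 'elbow'; talking about New York, the identical sound comes out as 'Albany' — the context prior, not the acoustics, does the deciding.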

Googlebombing for a cause: 


Lord Carnifex said...

Although the human speech recognition apparatus doesn't always work that well, either.

Anonymous said...

Especially if you are deaf in one ear and can't hear out of the other.
There is a very nice video, unfortunately posted on Facebook, of your niece saying the word 'turtle'. Too bad I can't share it with you.
Have you read about recent research that shows that newborns cry in the rhythm of the language they heard in utero?

phaedrus said...

Which reminds me that it's time to restart trying to learn a bit of conversational German, to help the next biological "computer" in the household have a bit more to work with.