Speech Recognition
Speech recognition technologies allow computers
equipped with microphones to interpret human speech, e.g.
for transcription or as a control method.
Such systems can be classified as to whether they require
the user to "train" the system to recognise their own particular
speech patterns or not, whether the system can recognise
continuous speech or requires users to break up their speech
into discrete words, and whether the vocabulary the system
recognises is small (in the order of tens or at most hundreds
of words), or large (thousands of words).
Systems requiring a short period of training can (as of
2001) capture continuous speech with a large vocabulary
at a normal pace with an accuracy of about 98% (getting two
words in one hundred wrong), while other systems that
require no training can recognise a small number of words
(for instance, the ten digits of the decimal system) as
spoken by most English speakers. These small-vocabulary
systems are popular for routing incoming phone calls to
their destinations in large organisations.
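The accuracy figure above can be made concrete with a small sketch.
It assumes that accuracy is measured as one minus the word error
rate, where the word error rate is the word-level edit distance
(substitutions, insertions and deletions) between a reference
transcript and the recogniser's output, divided by the number of
reference words. The function names are illustrative and are not
taken from any particular product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: accuracy = 1 - word error rate."""
    return 1.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    # One hundred made-up words, two of which are misrecognised.
    ref_words = [f"w{i}" for i in range(100)]
    hyp_words = list(ref_words)
    hyp_words[10] = "x"
    hyp_words[50] = "y"
    print(accuracy(" ".join(ref_words), " ".join(hyp_words)))
```

Under this definition, two wrong words in a hundred-word transcript
give a word error rate of 2/100, which corresponds to the 98% quoted
above.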
Commercial systems for speech recognition have been available
off-the-shelf since the 1990s. However, despite the apparent
success of the technology, few people use such speech
recognition systems.
It appears that most computer users can create and edit
documents more quickly with a conventional keyboard, despite
the fact that most people are able to speak considerably
faster than they can type. Additionally, heavy use of the
speech organs results in vocal loading.
Some of the key technical problems in speech recognition
are that:
- Inter-speaker differences are often large and difficult
to account for. It is not clear which characteristics
of speech are speaker-independent.
- The interpretation of many phonemes, words and phrases
is context-sensitive. For example, phonemes are often
shorter in long words than in short words. Words have
different meanings in different sentences, e.g. "Philip
lies" could be interpreted either as Philip being a liar,
or as Philip lying on a bed.
- Intonation and speech timbre can completely change
the correct interpretation of a word or sentence, e.g.
"Go!", "Go?" and "Go." can clearly be distinguished by a
human, but not so easily by a computer.
- Words and sentences can have several valid interpretations
such that the speaker leaves the choice of the correct
one to the listener.
- Written language may need punctuation according to
strict rules that are not strongly present in speech,
and are difficult to infer without knowing the meaning
(commas, ending of sentences, quotations).
The "understanding" of the meaning of spoken words is
regarded by some as a separate field, that of natural language
understanding. However, there are many examples of sentences
that sound the same, but can only be disambiguated by an
appeal to context: one famous T-shirt worn by Apple Computer
researchers stated,
- I helped Apple wreck a nice beach,
which, when spoken, sounds like I helped Apple recognize
speech.
A general solution to many of the above problems effectively
requires human knowledge and experience, and would thus
require advanced artificial intelligence technologies to be
implemented on a computer. In practice, statistical language
models are often employed for disambiguation and to improve
recognition accuracy.
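As a rough illustration of how a statistical language model can
separate word sequences that sound alike, the sketch below trains a
bigram model with add-one smoothing on a tiny corpus and uses it to
choose between the two readings of the T-shirt sentence. The corpus
is invented for this example, the function names are hypothetical,
and the acoustic evidence is assumed to be equal for both candidates.
A real recogniser would combine the language model score with an
acoustic score and train on far more text.

```python
import math
from collections import Counter

# Toy training text, invented purely for illustration.
CORPUS = [
    "i helped apple recognize speech",
    "computers recognize speech",
    "systems that recognize speech are useful",
    "we walked on a nice beach",
    "storms can wreck a boat",
]


def train_bigrams(sentences):
    """Count bigrams and their left-hand context words."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<s>", "</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        contexts.update(words[:-1])            # every word that starts a bigram
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    return contexts, bigrams, len(vocab)


def log_prob(sentence, contexts, bigrams, vocab_size):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(words, words[1:]):
        # Add-one smoothing keeps unseen bigrams from having zero probability.
        total += math.log((bigrams[(prev, word)] + 1)
                          / (contexts[prev] + vocab_size))
    return total


def pick_best(candidates, model):
    """Return the candidate transcription the language model scores highest."""
    return max(candidates, key=lambda c: log_prob(c, *model))


if __name__ == "__main__":
    model = train_bigrams(CORPUS)
    candidates = [
        "i helped apple wreck a nice beach",
        "i helped apple recognize speech",
    ]
    print(pick_best(candidates, model))
```

With this corpus the model prefers the "recognize speech" reading,
because that bigram was seen in training while "apple wreck" was not.
Longer candidates are also penalised simply for containing more
words, which is one reason practical systems weight the language
model score rather than using it alone.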
See also
Robotics
Other resources
Statistical
Language Modeling (Natural Language Processing Lab, Northeastern
University, China)
This article is licensed under the GNU
Free Documentation License. It uses material from the
Wikipedia.