When a speech-recognition system hears a stream of sound, it makes a number of guesses about what has been said, then calculates the odds that it has found the right one, based on the kinds of words, phrases and clauses it has seen in its training data. At the level of phonemes, each language has strings that are permitted (an English word may begin with "str-", for example) or banned (an English word cannot start with "tsr-"). The same goes for words: some strings of words are more common than others. For example, "the" is far more likely to be followed by a noun or an adjective than by a verb or an adverb. In making guesses about homophones, the computer will have remembered that in its training data the phrase "the right to bear arms" came up much more often than "the right to bare arms", and will thus have made the right guess.
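The homophone trick above can be sketched with a toy bigram language model. This is a minimal illustration, not any real recogniser's code: the tiny corpus and the unsmoothed probabilities are invented for the example.

```python
from collections import Counter

# Toy corpus standing in for the transcribed training speech the
# article describes. Real systems use billions of words.
corpus = (
    "the right to bear arms shall not be infringed "
    "the right to bear arms is debated "
    "she likes to bare arms in summer"
).split()

# Count bigrams: how often each word follows the one before it.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    """Estimate P(word | prev) from raw counts (no smoothing, for clarity)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def pick_homophone(prev, candidates):
    """Choose the candidate the language model finds most likely after `prev`."""
    return max(candidates, key=lambda w: bigram_prob(prev, w))

print(pick_homophone("to", ["bear", "bare"]))  # prints "bear"
```

Because "bear" follows "to" twice in the corpus and "bare" only once, the model resolves the ambiguous sound in favour of "bear", exactly the behaviour the paragraph describes.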
Phonemes also differ according to context. (Compare the "l" sound at the beginning of "light" with the one at the end of "full".) Speakers differ in timbre and pitch of voice, and in accent. Conversation is far less clear than careful dictation, and people stop and restart much more often than they realise. All the same, technology has gradually mitigated many of these problems, so error rates in speech-recognition software have fallen steadily over the years, and then sharply with the introduction of deep learning. Microphones have got better and cheaper. With ubiquitous wireless internet, speech recordings can easily be beamed to computers in the cloud for analysis, and even smartphones now often have processors powerful enough to carry out the task themselves.

Bear arms or bare arms?

Perhaps the most important feature of a speech-recognition system is its set of expectations about what someone is likely to say, or its language model. Like other training data, language models are based on large amounts of real human speech, transcribed into text.
More recently, speech recognition has also gained from deep learning. English has about 44 phonemes, the units that make up the sound system of a language. "P" and "b" are different phonemes, because they distinguish words like "pat" and "bat". But in English "p" with a puff of air, as in "party", and "p" without one, as in "spin", are not different phonemes, though they are in some other languages. If a computer hears the phonemes s, p, i and n back to back, it should be able to recognise the word "spin". But the nature of live speech makes this difficult for machines. Sounds are not pronounced individually, one phoneme after the other; they mostly come in a constant stream, and finding the boundaries is not easy.
Vowels have characteristic frequencies called formants, two of which are usually enough to differentiate one vowel from another. For example, the vowel in the English word "fleece" has its first two formants at around 300Hz and 3,000Hz. Consonants have their own characteristic features. In principle, it should be easy to turn this stream of sound into transcribed speech. As in other language technologies, machines that recognise speech are trained on data gathered earlier. In this instance, the training data are sound recordings transcribed to text by humans, so that the software has both a sound and a text input. All it has to do is match the two. It gets better and better at working out how to transcribe a given chunk of sound in the same way as humans did in the training data. The traditional matching approach was a statistical technique called a hidden Markov model (HMM), which makes guesses based on what was done before.
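An HMM decoder of the kind mentioned above can be sketched in a few lines. Everything here is invented for illustration: the "acoustic frames" are just symbols, and the start, transition and emission probabilities are made-up numbers, not values from any real recogniser. The Viterbi algorithm then recovers the most likely hidden phoneme sequence.

```python
import math

# Hypothetical toy HMM: hidden states are phonemes; observations are
# symbolic acoustic "frames". All probabilities are illustrative.
states = ["s", "p", "i", "n"]
start_p = {"s": 0.7, "p": 0.1, "i": 0.1, "n": 0.1}
trans_p = {  # P(next phoneme | current phoneme): favours s -> p -> i -> n
    "s": {"s": 0.1, "p": 0.7, "i": 0.1, "n": 0.1},
    "p": {"s": 0.1, "p": 0.1, "i": 0.7, "n": 0.1},
    "i": {"s": 0.1, "p": 0.1, "i": 0.1, "n": 0.7},
    "n": {"s": 0.25, "p": 0.25, "i": 0.25, "n": 0.25},
}
# P(frame | phoneme): each phoneme mostly emits "its own" frame
emit_p = {ph: {f: (0.7 if f == ph else 0.1) for f in states} for ph in states}

def viterbi(obs):
    """Return the most likely phoneme sequence for a stream of frames."""
    # best[s] = (log-probability of best path ending in s, that path)
    best = {s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
            for s in states}
    for frame in obs[1:]:
        best = {
            s: max(
                ((lp + math.log(trans_p[prev][s]) + math.log(emit_p[s][frame]),
                  path + [s])
                 for prev, (lp, path) in best.items()),
                key=lambda t: t[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda t: t[0])[1]

print(viterbi(["s", "p", "i", "n"]))  # ['s', 'p', 'i', 'n']
```

Even when a frame is noisy, the transition probabilities (the "what was done before" in the text) pull the decoder toward plausible phoneme orders, which is the model's whole appeal.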
Deep learning has already produced big leaps in quality in all kinds of applications, including deciphering handwriting, recognising faces and classifying images. Now it is helping to improve all manner of language technologies, often bringing substantial gains. That has shifted language technology from usable at a pinch to really rather good. But so far no one has quite worked out what will move it on from merely good to reliably great. Computers have made huge strides in understanding human speech. When a person speaks, air is forced out through the lungs, making the vocal cords vibrate, which sends out characteristic wave patterns through the air. The features of the sounds depend on the arrangement of the vocal organs, especially the tongue and the lips, and the characteristic nature of the sounds comes from peaks of energy at certain frequencies.
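Those energy peaks are the formants discussed earlier, and a crude vowel classifier needs little more than the first two of them. The 300Hz/3,000Hz pair for "fleece" comes from the text; the other formant values below are rough, illustrative figures, not measurements.

```python
# Approximate first and second formant frequencies (Hz) for a few
# English vowels. Only the "fleece" pair is taken from the article;
# the rest are illustrative ballpark values.
VOWELS = {
    "fleece (i)": (300, 3000),
    "lot (o)":    (600, 900),
    "trap (a)":   (700, 1800),
    "goose (u)":  (300, 900),
}

def classify_vowel(f1, f2):
    """Nearest-neighbour match on the first two formant frequencies."""
    return min(VOWELS,
               key=lambda v: (VOWELS[v][0] - f1) ** 2 + (VOWELS[v][1] - f2) ** 2)

print(classify_vowel(320, 2900))  # prints "fleece (i)"
```

Real recognisers work on richer spectral features and learned models, but the principle, mapping measured frequencies to the nearest known sound, is the same.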
In speech recognition, the software learns from a body of recordings and the transcriptions made by humans. Thanks to the growing power of processors, falling prices for data storage and, most crucially, the explosion in available data, this approach eventually bore fruit. Mathematical techniques that had been known for decades came into their own, and big companies with access to enormous amounts of data were poised to benefit. People who had been put off by the hilariously inappropriate translations offered by online tools like Babel Fish began to have more faith in Google Translate. Apple persuaded millions of iPhone users to talk not only on their phones but to them.
The final advance, which began only about five years ago, came with the advent of deep learning through deep neural networks (DNNs). These are often touted as having qualities similar to those of the human brain: neurons are connected in software, and connections can become stronger or weaker in the process of learning. But Nils Lenke, head of research for Nuance, a language-technology company, explains matter-of-factly that DNNs are just another kind of mathematical model, the basis of which had been well understood for decades. What changed was the hardware. Almost by chance, DNN researchers discovered that the graphical processing units (GPUs) used to render graphics fluidly in applications like video games were also brilliant at handling neural networks. In computer graphics, basic small shapes move according to fairly simple rules, but there are lots of shapes and many rules, requiring vast numbers of simple calculations. The same GPUs are used to fine-tune the weights assigned to the connections in DNNs as they scour data to learn.
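The "strengthening and weakening of connections" is just repeated arithmetic on weights, which is why GPUs suit it. The sketch below trains a single artificial neuron by gradient descent on data generated from a known linear rule; it is a deliberately minimal stand-in for the millions of parallel weight updates in a real DNN.

```python
# One artificial "neuron": output = w1*x1 + w2*x2 + b.
# We tune the weights by gradient descent on squared error, the same
# arithmetic a GPU performs in parallel for millions of connections.
w1, w2, b = 0.0, 0.0, 0.0

# Training data generated from the rule y = 2*x1 - 3*x2 + 1.
data = [((x1, x2), 2 * x1 - 3 * x2 + 1)
        for x1 in range(-2, 3) for x2 in range(-2, 3)]

lr = 0.05  # learning rate
for _ in range(500):
    for (x1, x2), target in data:
        pred = w1 * x1 + w2 * x2 + b
        err = pred - target      # error signal (gradient factor)
        w1 -= lr * err * x1      # strengthen or weaken each connection
        w2 -= lr * err * x2
        b  -= lr * err

print(round(w1, 2), round(w2, 2), round(b, 2))  # close to 2.0 -3.0 1.0
```

The learning rule rediscovers the weights 2, -3 and 1. A deep network stacks many layers of such units and propagates the error signal back through all of them, but each individual update is as simple as this.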
In the bad old days researchers kept their methods in the dark and described their results in ways that were hard to evaluate. But beginning in the 1980s, Charles Wayne, then at America's Defence Advanced Research Projects Agency (DARPA), encouraged them to try another approach: the "common task". Step by step, researchers would agree on a common set of practices, whether they were trying to teach computers speech recognition, speaker identification, sentiment analysis of texts, grammatical breakdown, language identification, handwriting recognition or anything else. They would set out the metrics they were aiming to improve on, share the data sets used to train their software and allow their results to be tested by neutral outsiders. That made the process far more transparent. Funding started up again and language technologies began to improve, though very slowly.
Many early approaches to language technology, and particularly translation, got stuck in a conceptual cul-de-sac: the rules-based approach. In translation, this meant trying to write rules to analyse the text of a sentence in the language of origin, breaking it down into a sort of abstract interlanguage and rebuilding it according to the rules of the target language. These approaches showed early promise. But language is riddled with ambiguities and exceptions, so such systems were hugely complicated and easily broke down when tested on sentences beyond the simple set they had been designed for. Nearly all language technologies began to get a lot better with the application of statistical methods, often called a "brute force" approach. This relies on software scouring vast amounts of data, looking for patterns and learning from precedent. For example, in parsing language (breaking it down into its grammatical components), the software learns from large bodies of text that have already been parsed by humans. It uses what it has learned to make its best guess about a previously unseen text. In machine translation, the software scans millions of words already translated by humans, again looking for patterns.
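The pattern-finding at the heart of statistical translation can be reduced to counting co-occurrences in aligned sentence pairs. The four-pair "corpus" below is invented for illustration, and real systems (in the spirit of IBM's word-alignment models) are far more sophisticated, but the core idea is the same.

```python
from collections import Counter

# Toy parallel corpus of English-French sentence pairs, standing in
# for the millions of human-translated sentences real systems scan.
pairs = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
    ("a dog", "un chien"),
]

# Count how often each English word appears in the same sentence pair
# as each French word.
cooc = Counter()
for en, fr in pairs:
    for e in en.split():
        for f in fr.split():
            cooc[(e, f)] += 1

def translate_word(e):
    """Best guess: the French word that co-occurs with `e` most often."""
    candidates = {f: c for (ew, f), c in cooc.items() if ew == e}
    return max(candidates, key=candidates.get)

print(translate_word("cat"))  # prints "chat"
```

"Cat" appears alongside "chat" in two pairs but alongside "le" and "un" only once each, so the counts alone pick out the right translation; no grammar rules were written anywhere.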
In the period leading up to this, scholars had been promising automatic translation between languages within a few years. But the report was scathing. Reviewing almost a decade of work on machine translation and automatic speech recognition, it concluded that the time had come to spend money "hard-headedly toward important, realistic and relatively short-range goals", another way of saying that language-technology research had overpromised and underdelivered. In 1969 Pierce wrote that both the funders and eager researchers had often fooled themselves, and that "no simple, clear, sure knowledge is gained". After that, America's government largely closed the money tap, and research on language technology went into hibernation for two decades. The story of how it emerged from that hibernation is both salutary and surprisingly workaday, says Mark Liberman. As professor of linguistics at the University of Pennsylvania and head of the Linguistic Data Consortium, a huge trove of texts and recordings of human language, he knows a thing or two about the history of language technology.
Apple's Siri, Amazon's Alexa, Google Now and Microsoft's Cortana can now take a wide variety of questions, structured in many different ways, and return accurate and useful answers in a natural-sounding voice. Alexa can even respond to a request to "tell me a joke", but only by calling upon a database of corny quips; computers lack a sense of humour. When Apple introduced Siri in 2011 it was frustrating to use, so many people gave up on it. Only around a third of smartphone owners use their personal assistants regularly, even though 95% have tried them at some point, according to Creative Strategies, a consultancy. Many of those discouraged users may not realise how much the assistants have improved. In 1966 John Pierce was working at Bell Labs, the research arm of America's telephone monopoly. Having overseen the team that had built the first transistor and the first communications satellite, he enjoyed a sterling reputation, so he was asked to take charge of a report on the state of automatic language processing for the National Academy of Sciences.
If 2001 had been made to reflect the state of today's language technology, the conversation might have gone something like this: "Open the pod bay doors, HAL." "I didn't understand the question." "Open the pod bay doors, HAL." "I have a list of eBay results about pod doors, Dave." Creative and truly conversational computers able to handle the unexpected are still far off. Artificial-intelligence (AI) researchers can only laugh when asked about the prospect of an intelligent HAL, Terminator or Rosie (the sassy robot housekeeper in "The Jetsons"). Yet although language technologies are nowhere near ready to replace human beings, except in a few highly routine tasks, they are at last becoming good enough to be taken seriously. They can help people spend more time doing interesting things that only humans can do. After six decades of work, much of it with disappointing outcomes, the past few years have produced results much closer to what early pioneers had hoped for.
Language: Finding a voice

Computers have got much better at translation, voice recognition and speech synthesis, says Lane Greene. But they still don't understand the meaning of language.

"I'm afraid I can't do that." With chilling calm, HAL 9000, the on-board computer in "2001: A Space Odyssey", refuses to open the pod bay doors to Dave Bowman, an astronaut who has ventured outside the ship. HAL's decision to turn on his human companion reflected a wave of fear about intelligent computers. When the film came out in 1968, computers that could have proper conversations with humans seemed nearly as far away as manned flight to Jupiter. Since then, humankind has progressed quite a lot farther with building machines that it can talk to, and that can respond with something resembling natural speech. Even so, communication remains difficult.