How To Create A Mind
A mother rat will build a nest for her young even if she has never seen another rat in her lifetime.[1] Similarly, a spider will spin a web, a caterpillar will create her own cocoon, and a beaver will build a dam, even if no contemporary ever showed them how to accomplish these complex tasks. That is not to say that these are not learned behaviors. It is just that these animals did not learn them in a single lifetime; they learned them over thousands of lifetimes. The evolution of animal behavior does constitute a learning process, but it is learning by the species, not by the individual, and the fruits of this learning process are encoded in DNA.
To appreciate the significance of the evolution of the neocortex, consider that it greatly sped up the process of learning (hierarchical knowledge) from thousands of years to months (or less). Even if millions of animals in a particular mammalian species failed to solve a problem (requiring a hierarchy of steps), it required only one to accidentally stumble upon a solution. That new method would then be copied and spread exponentially through the population.
We are now in a position to speed up the learning process by a factor of thousands or millions once again by migrating from biological to nonbiological intelligence. Once a digital neocortex learns a skill, it can transfer that know-how in minutes or even seconds. As one of many examples, at my first company, Kurzweil Computer Products (now Nuance Speech Technologies), which I founded in 1973, we spent years training a set of research computers to recognize printed letters from scanned documents, a technology called omni-font (any type font) optical character recognition (OCR). This particular technology has now been in continual development for almost forty years, with the current product called OmniPage from Nuance. If you want your computer to recognize printed letters, you don't need to spend years training it to do so, as we did; you can simply download the evolved patterns already learned by the research computers in the form of software. In the 1980s we began on speech recognition, and that technology, which has also been in continuous development now for several decades, is part of Siri. Again, you can download in seconds the evolved patterns learned by the research computers over many years.
Ultimately we will create an artificial neocortex that has the full range and flexibility of its human counterpart. Consider the benefits. Electronic circuits are millions of times faster than our biological circuits. At first we will have to devote all of this speed increase to compensating for the relative lack of parallelism in our computers, but ultimately the digital neocortex will be much faster than the biological variety and will only continue to increase in speed.
When we augment our own neocortex with a synthetic version, we won't have to worry about how much additional neocortex can physically fit into our bodies and brains, as most of it will be in the cloud, like most of the computing we use today. I estimated earlier that we have on the order of 300 million pattern recognizers in our biological neocortex. That's as much as could be squeezed into our skulls even with the evolutionary innovation of a large forehead and with the neocortex taking about 80 percent of the available space. As soon as we start thinking in the cloud, there will be no natural limits: we will be able to use billions or trillions of pattern recognizers, basically whatever we need, and whatever the law of accelerating returns can provide at each point in time.
In order for a digital neocortex to learn a new skill, it will still require many iterations of education, just as a biological neocortex does, but once a single digital neocortex somewhere and at some time learns something, it can share that knowledge with every other digital neocortex without delay. We can each have our own private neocortex extenders in the cloud, just as we have our own private stores of personal data today.
Last but not least, we will be able to back up the digital portion of our intelligence. As we have seen, it is not just a metaphor to state that there is information contained in our neocortex, and it is frightening to contemplate that none of this information is backed up today. There is, of course, one way in which we do back up some of the information in our brains-by writing it down. The ability to transfer at least some of our thinking to a medium that can outlast our biological bodies was a huge step forward, but a great deal of data in our brains continues to remain vulnerable.
Brain Simulations
One approach to building a digital brain is to simulate precisely a biological one. For example, Harvard brain sciences doctoral student David Dalrymple (born in 1991) is planning to simulate the brain of a nematode (a roundworm).[2] Dalrymple selected the nematode because of its relatively simple nervous system, which consists of about 300 neurons, and which he plans to simulate at the very detailed level of molecules. He will also create a computer simulation of its body as well as its environment so that his virtual nematode can hunt for (virtual) food and do the other things that nematodes are good at. Dalrymple says it is likely to be the first complete brain upload from a biological animal to a virtual one that lives in a virtual world. As with his simulated nematode, whether even biological nematodes are conscious is open to debate, although in their struggle to eat, digest food, avoid predators, and reproduce, they do have experiences to be conscious of.
At the opposite end of the spectrum, Henry Markram's Blue Brain Project is planning to simulate the human brain, including the entire neocortex as well as the old-brain regions such as the hippocampus, amygdala, and cerebellum. His planned simulations will be built at varying degrees of detail, up to a full simulation at the molecular level. As I reported in chapter 4, Markram has discovered a key module of several dozen neurons that is repeated over and over again in the neocortex, demonstrating that learning is done by these modules and not by individual neurons.
Markram's progress has been scaling up at an exponential pace. He simulated one neuron in 2005, the year the project was initiated. In 2008 his team simulated an entire neocortical column of a rat brain, consisting of 10,000 neurons. By 2011 this expanded to 100 columns, totaling a million cells, which he calls a mesocircuit. One controversy concerning Markram's work is how to verify that the simulations are accurate. In order to do this, these simulations will need to demonstrate learning that I discuss below.
He projects simulating an entire rat brain of 100 mesocircuits, totaling 100 million neurons and about a trillion synapses, by 2014. In a talk at the 2009 TED conference at Oxford, Markram said, "It is not impossible to build a human brain, and we can do it in 10 years."[3] His most recent target for a full brain simulation is 2023.[4] Markram and his team are basing their model on detailed anatomical and electrochemical analyses of actual neurons. Using an automated device they created called a patch-clamp robot, they are measuring the specific ion channels, neurotransmitters, and enzymes that are responsible for the electrochemical activity within each neuron. Their automated system was able to do thirty years of analysis in six months, according to Markram. It was from these analyses that they noticed the "Lego memory" units that are the basic functional units of the neocortex.
Actual and projected progress of the Blue Brain brain simulation project.
Significant contributions to the technology of robotic patch-clamping were made by MIT neuroscientist Ed Boyden, Georgia Tech mechanical engineering professor Craig Forest, and Forest's graduate student Suhasa Kodandaramaiah. They demonstrated an automated system with one-micrometer precision that can perform scanning of neural tissue at very close range without damaging the delicate membranes of the neurons. "This is something a robot can do that a human can't," Boyden commented.
To return to Markram's simulation, after simulating one neocortical column, Markram was quoted as saying, "Now we just have to scale it up."[5] Scaling is certainly one big factor, but there is one other key hurdle, which is learning. If the Blue Brain Project brain is to "speak and have an intelligence and behave very much as a human does," which is how Markram described his goal in a BBC interview in 2009, then it will need to have sufficient content in its simulated neocortex to perform those tasks.[6] As anyone who has tried to hold a conversation with a newborn can attest, there is a lot of learning that must be achieved before this is feasible.
The tip of the patch-clamping robot developed at MIT and Georgia Tech scanning neural tissue.
There are two obvious ways this can be done in a simulated brain such as Blue Brain. One would be to have the brain learn this content the way a human brain does. It can start out like a newborn human baby with an innate capacity for acquiring hierarchical knowledge and with certain transformations preprogrammed in its sensory preprocessing regions. But the learning that takes place between a biological infant and a human person who can hold a conversation would need to occur in a comparable manner in nonbiological learning. The problem with that approach is that a brain that is being simulated at the level of detail anticipated for Blue Brain is not expected to run in real time until at least the early 2020s. Even running in real time would be too slow unless the researchers are prepared to wait a decade or two to reach intellectual parity with an adult human, although real-time performance will get steadily faster as computers continue to grow in price/performance.
The other approach is to take one or more biological human brains that have already gained sufficient knowledge to converse in meaningful language and to otherwise behave in a mature manner and copy their neocortical patterns into the simulated brain. The problem with this method is that it requires a noninvasive and nondestructive scanning technology of sufficient spatial and temporal resolution and speed to perform such a task quickly and completely. I would not expect such an "uploading" technology to be available until around the 2040s. (The computational requirement to simulate a brain at that degree of precision, which I estimate to be 10^19 calculations per second, will be available in a supercomputer according to my projections by the early 2020s; however, the necessary nondestructive brain scanning technologies will take longer.)
There is a third approach, which is the one I believe simulation projects such as Blue Brain will need to pursue. One can simplify molecular models by creating functional equivalents at different levels of specificity, ranging from my own functional algorithmic method (as described in this book) to simulations that are closer to full molecular simulations. The speed of learning can thereby be increased by a factor of hundreds or thousands depending on the degree of simplification used. An educational program can be devised for the simulated brain (using the functional model) that it can learn relatively quickly. Then the full molecular simulation can be substituted for the simplified model while still using its accumulated learning. We can then simulate learning with the full molecular model at a much slower speed.
American computer scientist Dharmendra Modha and his IBM colleagues have created a cell-by-cell simulation of a portion of the human visual neocortex comprising 1.6 billion virtual neurons and 9 trillion synapses, which is equivalent to a cat neocortex. It runs 100 times slower than real time on an IBM BlueGene/P supercomputer consisting of 147,456 processors. The work received the Gordon Bell Prize from the Association for Computing Machinery.
The purpose of a brain simulation project such as Blue Brain and Modha's neocortex simulations is specifically to refine and confirm a functional model. AI at the human level will principally use the type of functional algorithmic model discussed in this book. However, molecular simulations will help us to perfect that model and to fully understand which details are important. In my development of speech recognition technology in the 1980s and 1990s, we were able to refine our algorithms once the actual transformations performed by the auditory nerve and early portions of the auditory cortex were understood. Even if our functional model was perfect, understanding exactly how it is actually implemented in our biological brains will reveal important knowledge about human function and dysfunction.
We will need detailed data on actual brains to create biologically based simulations. Markram's team is collecting its own data. There are large-scale projects to gather this type of data and make it generally available to scientists. For example, Cold Spring Harbor Laboratory in New York has collected 500 terabytes of data by scanning a mammal brain (a mouse), which they made available in June 2012. Their project allows a user to explore a brain similarly to the way Google Earth allows one to explore the surface of the planet. You can move around the entire brain and zoom in to see individual neurons and their connections. You can highlight a single connection and then follow its path through the brain.
Sixteen sections of the National Institutes of Health have gotten together and sponsored a major initiative called the Human Connectome Project with $38.5 million of funding.[7] Led by Washington University in St. Louis, the University of Minnesota, Harvard University, Massachusetts General Hospital, and the University of California at Los Angeles, the project seeks to create a similar three-dimensional map of connections in the human brain. The project is using a variety of noninvasive scanning technologies, including new forms of MRI, magnetoencephalography (measuring the magnetic fields produced by the electrical activity in the brain), and diffusion tractography (a method to trace the pathways of fiber bundles in the brain). As I point out in chapter 10, the spatial resolution of noninvasive scanning of the brain is improving at an exponential rate. The research by Van J. Wedeen and his colleagues at Massachusetts General Hospital showing a highly regular gridlike structure of the wiring of the neocortex that I described in chapter 4 is one early result from this project.
Oxford University computational neuroscientist Anders Sandberg (born in 1972) and Swedish philosopher Nick Bostrom (born in 1973) have written the comprehensive Whole Brain Emulation: A Roadmap, which details the requirements for simulating the human brain (and other types of brains) at different levels of specificity from high-level functional models to simulating molecules.[8] The report does not provide a timeline, but it does describe the requirements to simulate different types of brains at varying levels of precision in terms of brain scanning, modeling, storage, and computation. The report projects ongoing exponential gains in all of these areas of capability and argues that the requirements to simulate the human brain at a high level of detail are coming into place.
An outline of the technological capabilities needed for whole brain emulation, in Whole Brain Emulation: A Roadmap by Anders Sandberg and Nick Bostrom.
An outline of Whole Brain Emulation: A Roadmap by Anders Sandberg and Nick Bostrom.
Neural Nets
In 1964, at the age of sixteen, I wrote to Frank Rosenblatt (1928–1971), a professor at Cornell University, inquiring about a machine called the Mark 1 Perceptron. He had created it four years earlier, and it was described as having brainlike properties. He invited me to visit him and try the machine out.
The Perceptron was built from what he claimed were electronic models of neurons. Input consisted of values arranged in two dimensions. For speech, one dimension represented frequency and the other time, so each value represented the intensity of a frequency at a given point in time. For images, each point was a pixel in a two-dimensional image. Each point of a given input was randomly connected to the inputs of the first layer of simulated neurons. Every connection had an associated synaptic strength, which represented its importance, and which was initially set at a random value. Each neuron added up the signals coming into it. If the combined signal exceeded a particular threshold, the neuron fired and sent a signal to its output connection; if the combined input signal did not exceed the threshold, the neuron did not fire, and its output was zero. The output of each neuron was randomly connected to the inputs of the neurons in the next layer. The Mark 1 Perceptron had three layers, which could be organized in a variety of configurations. For example, one layer might feed back to an earlier one. At the top layer, the output of one or more neurons, also randomly selected, provided the answer. (For an algorithmic description of neural nets, see this endnote.)[9]
Since the neural net wiring and synaptic weights are initially set randomly, the answers of an untrained neural net are also random. The key to a neural net, therefore, is that it must learn its subject matter, just like the mammalian brains on which it's supposedly modeled. A neural net starts out ignorant; its teacher (which may be a human, a computer program, or perhaps another, more mature neural net that has already learned its lessons) rewards the student neural net when it generates the correct output and punishes it when it does not. This feedback is in turn used by the student neural net to adjust the strength of each interneuronal connection. Connections that are consistent with the correct answer are made stronger. Those that advocate a wrong answer are weakened.
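The threshold neuron and the reward-and-punish feedback just described can be sketched in a few lines of Python. This is only a minimal single-neuron illustration, not the Mark 1's actual circuitry: it uses the standard textbook perceptron update rule, and the names fire and train, the learning rate, the fixed threshold of 0.5, and the toy OR task are all assumptions introduced here for illustration.

```python
import random

def fire(weights, inputs, threshold):
    """Threshold neuron: 1 if the weighted sum exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

def train(samples, n_inputs, threshold=0.5, rate=0.1, epochs=100, seed=0):
    """Adjust connection strengths from teacher feedback (hypothetical helper)."""
    rng = random.Random(seed)
    # Synaptic strengths start out random, so the untrained answers are random too.
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - fire(weights, inputs, threshold)
            # error = +1 strengthens the active connections, -1 weakens them,
            # and 0 (a correct answer) leaves them unchanged.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
    return weights

# Toy "teacher": the logical OR of two binary inputs (an assumed example task).
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train(OR_SAMPLES, n_inputs=2)
outputs = [fire(weights, x, 0.5) for x, _ in OR_SAMPLES]
print(outputs)
```

A single neuron of this kind can only learn tasks that a weighted sum and a threshold can separate, which is why the sketch uses OR rather than a harder task.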
Over time the neural net organizes itself to provide the correct answers without coaching. Experiments have shown that neural nets can learn their subject matter even with unreliable teachers. If the teacher is correct only 60 percent of the time, the student neural net will still learn its lessons with an accuracy approaching 100 percent.
However, limitations in the range of material that the Perceptron was capable of learning quickly became apparent. When I visited Professor Rosenblatt in 1964, I tried simple modifications to the input. The system was set up to recognize printed letters, and would recognize them quite accurately. It did a fairly good job of autoassociation (that is, it could recognize the letters even if I covered parts of them), but fared less well with invariance (that is, generalizing over size and font changes, which confused it).
During the last half of the 1960s, these neural nets became enormously popular, and the field of "connectionism" took over at least half of the artificial intelligence field. The more traditional approach to AI, meanwhile, included direct attempts to program solutions to specific problems, such as how to recognize the invariant properties of printed letters.
Another person I visited in 1964 was Marvin Minsky (born in 1927), one of the founders of the artificial intelligence field. Despite having done some pioneering work on neural nets himself in the 1950s, he was concerned with the great surge of interest in this technique. Part of the allure of neural nets was that they supposedly did not require programming: they would learn solutions to problems on their own. In 1965 I entered MIT as a student with Professor Minsky as my mentor, and I shared his skepticism about the craze for "connectionism."
In 1969 Minsky and Seymour Papert (born in 1928), the two cofounders of the MIT Artificial Intelligence Laboratory, wrote a book called Perceptrons, which presented a single core theorem: specifically, that a Perceptron was inherently incapable of determining whether or not an image was connected. The book created a firestorm. Determining whether or not an image is connected is a task that humans can do very easily, and it is also a straightforward process to program a computer to make this discrimination. The fact that Perceptrons could not do so was considered by many to be a fatal flaw.
Two images from the cover of the book Perceptrons by Marvin Minsky and Seymour Papert. The top image is not connected (that is, the dark area consists of two disconnected parts). The bottom image is connected. A human can readily determine this, as can a simple software program. A feedforward Perceptron such as Frank Rosenblatt's Mark 1 Perceptron cannot make this determination.
Perceptrons, however, was widely interpreted to imply more than it actually did. Minsky and Papert's theorem applied only to a particular type of neural net called a feedforward neural net (a category that does include Rosenblatt's Perceptron); other types of neural nets did not have this limitation. Still, the book did manage to largely kill most funding for neural net research during the 1970s. The field did return in the 1980s with attempts to use what were claimed to be more realistic models of biological neurons and ones that avoided the limitations implied by the Minsky-Papert Perceptron theorem. Nevertheless, the ability of the neocortex to solve the invariance problem, a key to its strength, was a skill that remained elusive for the resurgent connectionist field.
Sparse Coding: Vector Quantization
In the early 1980s I started a project devoted to another classical pattern recognition problem: understanding human speech. At first, we used traditional AI approaches by directly programming expert knowledge about the fundamental units of speech (phonemes) and rules from linguists on how people string phonemes together to form words and phrases. Each phoneme has distinctive frequency patterns. For example, we knew that vowels such as "e" and "ah" are characterized by certain resonant frequencies called formants, with a characteristic ratio of formants for each phoneme. Sibilant sounds such as "z" and "s" are characterized by a burst of noise that spans many frequencies.
We captured speech as a waveform, which we then converted into multiple frequency bands (perceived as pitches) using a bank of frequency filters. The result of this transformation could be visualized and was called a spectrogram (see page 136).
The filter bank is copying what the human cochlea does, which is the initial step in our biological processing of sound. The software first identified phonemes based on distinguishing patterns of frequencies and then identified words based on identifying characteristic sequences of phonemes.
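To make the waveform-to-frequency-band step concrete, here is a rough Python sketch. It is not the filter bank we built: it substitutes a naive discrete Fourier transform for a bank of filters, and the function names band_energies and spectrogram, the 64-sample frame length, the sixteen equal-width bands, and the synthetic test tone are all assumptions chosen for illustration.

```python
import cmath
import math

def band_energies(frame, n_bands=16):
    """Energy of one waveform frame in n_bands equal-width frequency bands."""
    n = len(frame)
    half = n // 2  # only bins below the Nyquist frequency carry new information
    power = []
    for k in range(half):
        # Naive DFT of bin k (slow, but needs no external libraries).
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]

def spectrogram(waveform, frame_len=64, n_bands=16):
    """One row of band energies per non-overlapping frame of the waveform."""
    starts = range(0, len(waveform) - frame_len + 1, frame_len)
    return [band_energies(waveform[i:i + frame_len], n_bands) for i in starts]

# Synthetic input: a pure tone completing 8 cycles in every 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(tone)
loudest = [row.index(max(row)) for row in spec]  # strongest band in each frame
print(loudest)
```

Each row of spec is one point in time described by sixteen numbers; plotting the rows side by side, with darker marks for larger values, gives a picture of the kind a spectrogram shows.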
A spectrogram of three vowels. From left to right: [i] as in "appreciate," [u] as in "acoustic," and [a] as in "ah." The Y axis represents frequency of sound. The darker the band the more acoustic energy there is at that frequency.
A spectrogram of a person saying the word "hide." The horizontal lines show the formants, which are sustained frequencies that have especially high energy.[10]
The result was partially successful. We could train our device to learn the patterns for a particular person using a moderate-sized vocabulary, measured in thousands of words. When we attempted to recognize tens of thousands of words, handle multiple speakers, and allow fully continuous speech (that is, speech with no pauses between words), we ran into the invariance problem. Different people enunciated the same phoneme differently; for example, one person's "e" phoneme may sound like someone else's "ah." Even the same person was inconsistent in the way she spoke a particular phoneme. The pattern of a phoneme was often affected by other phonemes nearby. Many phonemes were left out completely. The pronunciation of words (that is, how phonemes are strung together to form words) was also highly variable and dependent on context. The linguistic rules we had programmed were breaking down and could not keep up with the extreme variability of spoken language.
It became clear to me at the time that the essence of human pattern and conceptual recognition was based on hierarchies. This is certainly apparent for human language, which constitutes an elaborate hierarchy of structures. But what is the element at the base of the structures? That was the first question I considered as I looked for ways to automatically recognize fully normal human speech.
Sound enters the ear as a vibration of the air and is converted by the approximately 3,000 inner hair cells in the cochlea into multiple frequency bands. Each hair cell is tuned to a particular frequency (note that we perceive frequencies as tones) and each acts as a frequency filter, emitting a signal whenever there is sound at or near its resonant frequency. As it leaves the human cochlea, sound is thereby represented by approximately 3,000 separate signals, each one signifying the time-varying intensity of a narrow band of frequencies (with substantial overlap among these bands).
Even though it was apparent that the brain was massively parallel, it seemed impossible to me that it was doing pattern matching on 3,000 separate auditory signals. I doubted that evolution could have been that inefficient. We now know that very substantial data reduction does indeed take place in the auditory nerve before sound signals ever reach the neocortex.
In our software-based speech recognizers, we also used filters implemented as software: sixteen, to be exact (which we later increased to thirty-two, as we found there was not much benefit to going much higher than this). So in our system, each point in time was represented by sixteen numbers. We needed to reduce these sixteen streams of data into one while at the same time emphasizing the features that are significant in recognizing speech.
We used a mathematically optimal technique to accomplish this, called vector quantization. Consider that at any particular point in time, sound (at least from one ear) was represented by our software by sixteen different numbers: that is, the output of the sixteen frequency filters. (In the human auditory system the figure would be 3,000, representing the output of the 3,000 cochlea inner hair cells.) In mathematical terminology, each such set of numbers (whether 3,000 in the biological case or 16 in our software implementation) is called a vector.
For simplicity, let's consider the process of vector quantization with vectors of two numbers. Each vector can be considered a point in two-dimensional space.
If we have a very large sample of such vectors and plot them, we are likely to notice clusters forming.
In order to identify the cl.u.s.ters, we need to decide how many we will allow. In our project we generally allowed 1,024 cl.u.s.ters so that we could number them and a.s.sign each cl.u.s.ter a 10-bit label (because 210 = 1,024). Our sample of vectors represents the diversity that we expect. We tentatively a.s.sign the first 1,024 vectors to be one-point cl.u.s.ters. We then consider the 1,025th vector and find the point that it is closest to. If that distance is greater than the smallest distance between any pair of the 1,024 points, we consider it as the beginning of a new cl.u.s.ter. We then collapse the two (one-point) cl.u.s.ters that are closest together into a single cl.u.s.ter. We are thus still left with 1,024 cl.u.s.ters. After processing the 1,025th vector, one of those cl.u.s.ters now has more than one point. We keep processing points in this way, always maintaining 1,024 cl.u.s.ters. After we have processed all the points, we represent each multipoint cl.u.s.ter by the geometric center of the points in that cl.u.s.ter. = 1,024). Our sample of vectors represents the diversity that we expect. We tentatively a.s.sign the first 1,024 vectors to be one-point cl.u.s.ters. We then consider the 1,025th vector and find the point that it is closest to. If that distance is greater than the smallest distance between any pair of the 1,024 points, we consider it as the beginning of a new cl.u.s.ter. We then collapse the two (one-point) cl.u.s.ters that are closest together into a single cl.u.s.ter. We are thus still left with 1,024 cl.u.s.ters. After processing the 1,025th vector, one of those cl.u.s.ters now has more than one point. We keep processing points in this way, always maintaining 1,024 cl.u.s.ters. After we have processed all the points, we represent each multipoint cl.u.s.ter by the geometric center of the points in that cl.u.s.ter.
We continue this iterative process until we have run through all the sample points. Typically we would process millions of points into 1,024 (2^10) clusters; we've also used 2,048 (2^11) or 4,096 (2^12) clusters. Each cluster is represented by one vector that is at the geometric center of all the points in that cluster. Thus the total of the squared distances of all the points in the cluster to the center point of the cluster is as small as possible.
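The clustering procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function names `centroid` and `build_codebook` are hypothetical, and the text does not say how a multipoint cluster is compared with a new vector, so the sketch assumes each cluster is compared by its running geometric center. It is also deliberately naive (it rescans every pair of clusters for each new vector), so it is only practical for small examples.

```python
import math
from itertools import combinations

def centroid(points):
    """Geometric center of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def build_codebook(vectors, k):
    """Stream vectors into exactly k clusters (k >= 2); return the k centers."""
    vectors = [tuple(v) for v in vectors]
    clusters = [[v] for v in vectors[:k]]        # first k vectors: one-point clusters
    for v in vectors[k:]:
        centers = [centroid(c) for c in clusters]  # assumption: compare by centers
        nearest = min(range(k), key=lambda i: math.dist(v, centers[i]))
        a, b = min(combinations(range(k), 2),
                   key=lambda ij: math.dist(centers[ij[0]], centers[ij[1]]))
        if math.dist(v, centers[nearest]) > math.dist(centers[a], centers[b]):
            # v starts a new cluster; the two closest clusters are collapsed
            clusters[a].extend(clusters[b])
            clusters[b] = [v]
        else:
            # otherwise v joins the cluster it is closest to
            clusters[nearest].append(v)
    return [centroid(c) for c in clusters]

sample = [(0, 0), (10, 10), (0, 1), (10, 11), (20, 20)]
codebook = build_codebook(sample, k=3)
```

With this toy sample and three clusters, the two points near the origin end up sharing one cluster, the two points near (10, 10) share another, and (20, 20) keeps a cluster of its own.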
The result of this technique is that instead of having the millions of points that we started with (and an even larger number of possible points), we have now reduced the data to just 1,024 points that use the space of possibilities optimally. Parts of the space that are never used are not assigned any clusters.
We then assign a number to each cluster (in our case, 0 to 1,023). That number is the reduced, "quantized" representation of that cluster, which is why the technique is called vector quantization. Any new input vector that arrives in the future is then represented by the number of the cluster whose center point is closest to this new input vector.
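As a rough sketch of this quantization step, the function below maps an input vector to the number of its nearest cluster center. The name `quantize` and the three-entry codebook are hypothetical, chosen only to keep the example small; a real codebook would hold 1,024 sixteen-element centers.

```python
import math

def quantize(vector, codebook):
    """Return the index of the codebook center closest to the vector."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))

# Toy codebook with made-up centers, numbered 0 to 2.
codebook = [(0.0, 0.5), (10.0, 10.5), (20.0, 20.0)]
labels = [quantize(v, codebook) for v in [(1, 1), (9, 12), (19, 18)]]
print(labels)  # [0, 1, 2]
```

Each input is now a single small integer rather than a vector of measurements.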
We can now precompute a table with the distance of the center point of every cluster to every other center point. We thereby have instantly available the distance of this new input vector (which we represent by this quantized point-in other words, by the number of the cluster that this new point is closest to) to every other cluster. Since we are only representing points by their closest cluster, we now know the distance of this point to any other possible point that might come along.
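The precomputed table can be illustrated as follows. This is a sketch under the assumption that plain Euclidean distance is used; the function name `distance_table` and the sample centers are made up for the example.

```python
import math

def distance_table(codebook):
    """Distance from every cluster center to every other center."""
    n = len(codebook)
    return [[math.dist(codebook[i], codebook[j]) for j in range(n)]
            for i in range(n)]

codebook = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # hypothetical centers
table = distance_table(codebook)

# Once two inputs have been quantized to cluster numbers, their distance
# is a table lookup instead of a fresh computation.
lookup = table[0][2]  # about 10.0 for these centers
```

For 1,024 clusters the table has about a million entries, which is small enough to keep in memory.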
I described the technique above using vectors with only two numbers each, but working with sixteen-element vectors is entirely analogous to the simpler example. Because we chose vectors with sixteen numbers representing sixteen different frequency bands, each point in our system was a point in sixteen-dimensional space. It is difficult for us to imagine a space with more than three dimensions (perhaps four, if we include time), but mathematics has no such inhibitions.
We have accomplished four things with this process. First, we have greatly reduced the complexity of the data. Second, we have reduced sixteen-dimensional data to one-dimensional data (that is, each sample is now a single number). Third, we have improved our ability to find invariant features, because we are emphasizing portions of the space of possible sounds that convey the most information. Most combinations of frequencies are physically impossible or at least very unlikely, so there is no reason to give equal space to unlikely combinations of inputs as to likely ones. This technique reduces the data to equally likely possibilities. The fourth benefit is that we can use one-dimensional pattern recognizers, even though the original data consisted of many more dimensions. This turned out to be the most efficient approach to utilizing available computational resources.
Reading Your Mind with Hidden Markov Models
With vector quantization, we simplified the data in a way that emphasized key features, but we still needed a way to represent the hierarchy of invariant features that would make sense of new information. Having worked in the field of pattern recognition at that time (the early 1980s) for twenty years, I knew that one-dimensional representations were far more powerful, efficient, and amenable to invariant results. There was not a lot known about the neocortex in the early 1980s, but based on my experience with a variety of pattern recognition problems, I assumed that the brain was also likely to be reducing its multidimensional data (whether from the eyes, the ears, or the skin) using a one-dimensional representation, especially as concepts rose in the neocortex's hierarchy.
For the speech recognition problem, the organization of information in the speech signal appeared to be a hierarchy of patterns, with each pattern represented by a linear string of elements with a forward direction. Each element of a pattern could be another pattern at a lower level, or a fundamental unit of input (which in the case of speech recognition would be our quantized vectors).
You will recognize this situation as consistent with the model of the neocortex that I presented earlier. Human speech, therefore, is produced by a hierarchy of linear patterns in the brain. If we could simply examine these patterns in the brain of the person speaking, it would be a simple matter to match her new speech utterances against her brain patterns and understand what the person was saying. Unfortunately we do not have direct access to the brain of the speaker-the only information we have is what she actually said. Of course, that is the whole point of spoken language-the speaker is sharing a piece of her mind with her utterance.
So I wondered: Was there a mathematical technique that would enable us to infer the patterns in the speaker's brain based on her spoken words? One utterance would obviously not be sufficient, but if we had a large number of samples, could we use that information to essentially read the patterns in the speaker's neocortex (or at least formulate something mathematically equivalent that would enable us to recognize new utterances)?
People often fail to appreciate how powerful mathematics can be-keep in mind that our ability to search much of human knowledge in a fraction of a second with search engines is based on a mathematical technique. For the speech recognition problem I was facing in the early 1980s, it turned out that the technique of hidden Markov models fit the bill rather perfectly. The Russian mathematician Andrei Andreyevich Markov (1856-1922) built a mathematical theory of hierarchical sequences of states. The model was based on the possibility of traversing the states in one chain, and if that was successful, triggering a state in the next higher level in the hierarchy. Sound familiar?
A simple example of one layer of a hidden Markov model. S1 through S4 represent the "hidden" internal states. The Pi,j transitions each represent the probability of going from state Si to state Sj. These probabilities are determined by the system learning from training data (including during actual use). A new sequence (such as a new spoken utterance) is matched against these probabilities to determine the likelihood that this model produced the sequence.
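One standard way to compute the likelihood that a model produced a sequence is the forward algorithm, sketched below for a single layer. The caption does not spell out the calculation, so this is an illustration rather than the system's actual code. The emission table `emit` (the probability that each hidden state emits each observed symbol), the starting distribution `start`, the function name, and all the numbers are assumptions made for the example, which uses two states instead of four to stay short.

```python
def sequence_likelihood(obs, start, trans, emit):
    """Forward algorithm: probability that the model emits the non-empty
    sequence obs, summed over every possible hidden state path."""
    n = len(start)
    # alpha[i] = probability of the observations so far, ending in state i
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
                 for j in range(n)]
    return sum(alpha)

start = [1.0, 0.0]                   # always begin in the first state
trans = [[0.5, 0.5], [0.0, 1.0]]     # trans[i][j]: probability of moving i -> j
emit = [[0.9, 0.1], [0.2, 0.8]]      # emit[i][s]: probability state i emits s
likelihood = sequence_likelihood([0, 1], start, trans, emit)  # about 0.405
```

Running the same sequence through several such models and comparing the results is one way to pick the model that best explains an utterance.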
Markov's model included probabilities of each state's successfully occurring. He went on to hypothesize a situation in which a system has such a hierarchy of linear sequences of states, but those are unable to be directly examined-hence the name hidden Markov models. The lowest level of the hierarchy emits signals, which are all we are allowed to see. Markov provides a mathematical technique to compute what the probabilities of each transition must be based on the observed output. The method was subsequently refined by Norbert Wiener in 1923. Wiener's refinement also provided a way to determine the connections in the Markov model; essentially any connection with too low a probability was considered not to exist. This is essentially how the human neocortex trims connections-if they are rarely or never used, they are considered unlikely and are pruned away. In our case, the observed output is the speech signal created by the person talking, and the state probabilities and connections of the Markov model constitute the neocortical hierarchy that produced it.
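The idea of treating low-probability connections as nonexistent can be sketched as follows. The text gives no cutoff value and does not say what happens to the remaining probabilities, so the `threshold` parameter and the renormalization of each row are assumptions, as is the function name.

```python
def prune_transitions(trans, threshold):
    """Drop transitions whose probability is below threshold, then rescale
    each row so the surviving probabilities again sum to 1."""
    pruned = []
    for row in trans:
        kept = [p if p >= threshold else 0.0 for p in row]
        total = sum(kept)
        # A row with nothing left is returned as all zeros (a dead-end state).
        pruned.append([p / total for p in kept] if total > 0 else kept)
    return pruned

trans = [[0.48, 0.48, 0.04],
         [0.10, 0.90, 0.00],
         [0.00, 0.00, 1.00]]
pruned = prune_transitions(trans, threshold=0.05)
# The rarely used 0.04 connection disappears and the first row becomes
# roughly [0.5, 0.5, 0.0].
```

A transition set to zero can never be taken again, which is what changes the topology of the network rather than just its weights.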
I envisioned a system in which we would take samples of human speech, apply the hidden Markov model technique to infer a hierarchy of states with connections and probabilities (essentially a simulated neocortex for producing speech), and then use this inferred hierarchical network of states to recognize new utterances. To create a speaker-independent system, we would use samples from many different individuals to train the hidden Markov models. By adding in the element of hierarchies to represent the hierarchical nature of information in language, these were properly called hierarchical hidden Markov models (HHMMs).
My colleagues at Kurzweil Applied Intelligence were skeptical that this technique would work, given that it was a self-organizing method reminiscent of neural nets, which had fallen out of favor and with which we had had little success. I pointed out that the network in a neural net system is fixed and does not adapt to the input: The weights adapt, but the connections do not. In the Markov model system, if it was set up correctly, the system would prune unused connections so as to essentially adapt the topology.
I established what was considered a "skunk works" project (an organizational term for a project off the beaten path that has little in the way of formal resources) that consisted of me, one part-time programmer, and an electrical engineer (to create the frequency filter bank). To the surprise of my colleagues, our effort turned out to be very successful, having succeeded in recognizing speech comprising a large vocabulary with high accuracy.
After that experiment, all of our subsequent speech recognition efforts have been based on hierarchical hidden Markov models. Other speech recognition companies appeared to discover the value of this method independently, and since the mid-1980s most work in automated speech recognition has been based on this approach. Hidden Markov models are also used in speech synthesis-keep in mind that our biological cortical hierarchy is used not only to recognize input but also to produce output, for example, speech and physical movement.
HHMMs are also used in systems that understand the meaning of natural-language sentences, which represents going up the conceptual hierarchy.
Hidden Markov states and possible transitions to produce a sequence of words in natural-language text.
To understand how the HHMM method works, we start out with a network that consists of all the state transitions that are possible. The vector quantization method described above is critical here, because otherwise there would be too many possibilities to consider.
Here is a possible simplified initial topology:
A simple hidden Markov model topology to recognize two spoken words.
Sample utterances are processed one by one. For each, we iteratively modify the probabilities of the transitions to better reflect the input sample we have just processed. The Markov models used in speech recognition code the likelihood that specific patterns of sound are found in each phoneme, how the phonemes influence one another, and the likely orders of phonemes. The system can also include probability networks on higher levels of language structure, such as the order of words, the inclusion of phrases, and so on up the hierarchy of language.
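A heavily simplified sketch of re-estimating transition probabilities from samples is shown below. It is not the training method described in the text: it assumes that each training utterance has already been aligned to a sequence of state numbers (in practice such alignments would themselves have to be inferred, since the states are hidden), and it simply counts how often each transition is used. The function name and the `smoothing` parameter are hypothetical.

```python
def reestimate_transitions(paths, n_states, smoothing=1e-3):
    """Estimate trans[i][j] from aligned state paths by counting i -> j moves.
    A small smoothing count keeps rows for unvisited states well defined."""
    counts = [[smoothing] * n_states for _ in range(n_states)]
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return [[c / sum(row) for c in row] for row in counts]

# Two made-up aligned samples over states 0 and 1.
paths = [[0, 0, 1], [0, 1, 1]]
trans = reestimate_transitions(paths, n_states=2)
# State 0 moved to state 1 in two of its three observed transitions,
# so trans[0][1] comes out close to 2/3.
```

Repeating this after every new batch of samples lets the probabilities drift toward what the data actually contains.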
Whereas our previous speech recognition systems incorporated specific rules about phoneme structures and sequences explicitly coded by human linguists, the new HHMM-based system was not explicitly told that there are forty-four phonemes in English, the sequences of vectors that were likely for each phoneme, or what phoneme sequences were more likely than others. We let the system discover these "rules" for itself from thousands of hours of transcribed human speech data. The advantage of this approach over hand-coded rules is that the models develop probabilistic rules of which human experts are often not aware. We noticed that many of the rules that the system had automatically learned from the data differed in subtle but important ways from the rules established by human experts.
Once the network was trained, we began to attempt to recognize speech by considering the alternate paths through the network and picking the path that was most likely, given the actual sequence of input vectors we had seen. In other words, if we saw a sequence of states that was likely to have produced that utterance, we concluded that the utterance came from that cortical sequence. This simulated HHMM-based neocortex included word labels, so it was able to propose a transcription of what it heard.
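Picking the most likely path through a trained network is commonly done with the Viterbi algorithm, sketched here for a single layer. The text does not name the search procedure that was used, so this is an illustration of the general idea; the function name, the emission table `emit`, the starting distribution `start`, and the numbers are all assumptions.

```python
def most_likely_path(obs, start, trans, emit):
    """Viterbi search: the single hidden state path most likely to have
    produced the non-empty sequence obs, and that path's probability."""
    n = len(start)
    # best[j] = probability of the best path so far that ends in state j
    best = [start[i] * emit[i][obs[0]] for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for symbol in obs[1:]:
        prev = [max(range(n), key=lambda i: best[i] * trans[i][j])
                for j in range(n)]
        best = [best[prev[j]] * trans[prev[j]][j] * emit[j][symbol]
                for j in range(n)]
        back.append(prev)
    state = max(range(n), key=lambda i: best[i])
    path = [state]
    for prev in reversed(back):  # walk the back-pointers to recover the path
        state = prev[state]
        path.append(state)
    path.reverse()
    return path, max(best)

start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.2, 0.8]]
path, prob = most_likely_path([0, 1, 1], start, trans, emit)
print(path)  # [0, 1, 1]
```

Practical systems usually work with logarithms of the probabilities, because multiplying many small numbers quickly underflows.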
We were then able to improve our results further by continuing to train the network while we were using it for recognition. As we have discussed, simultaneous recognition and learning also take place at every level in our biological neocortical hierarchy.
Evolutionary (Genetic) Algorithms