Everything short of pseudocode
May 25, 2006 at 8:27 pm | Posted in Jeff Hawkins, Numenta | 4 Comments
Numenta published a paper dated May 10th, 2006, entitled Hierarchical Temporal Memory, about the theory and concepts of HTM. In a nutshell, they propose to use Bayesian networks and belief propagation to model the human neocortex.
Originally, I intended to summarize the paper by stringing quotes together. But the last page says it all, so let's start at the end:
Technically, HTMs can be considered a form of Bayesian network where the network consists of a collection of nodes arranged in a tree-shaped hierarchy. Each node in the hierarchy self-discovers a set of causes in its input through a process of finding common spatial patterns and then finding common temporal patterns. Unlike many Bayesian networks, HTMs are self-training, have a well-defined parent/child relationship between each node, inherently handle time-varying data, and afford mechanisms for covert attention.

Sensory data is presented at the “bottom” of the hierarchy. To train an HTM, it is necessary to present continuous, time-varying, sensory input while the causes underlying that sensory data persist in the environment. That is, you either move the senses of the HTM through the world, or the objects in the world move relative to the HTM’s senses.

During inference, information flows up the hierarchy starting at the lowest level nodes closest to sensory input. As the information rises up the hierarchy, beliefs are formed at successively higher nodes, each representing causes over larger and larger spatial areas and longer and longer temporal periods. Belief propagation-like techniques lead all nodes in the network to quickly reach beliefs that are consistent with the bottom-up sensory data. Top-down predictions can influence the inference process by biasing the network to settle on predicted causes.

HTMs are memory systems. By this we mean that HTMs must learn about their world. You sometimes can supervise the learning process but you can’t program an HTM. Everything an HTM learns is stored in memory matrices at each node. These memory matrices represent the spatial quantization points and sequences learned by the node.
OK? Let's go back to the beginning. By the way, what you won't find in the paper are algorithms, pseudo-code, proposed data structures, implementation hints, or indeed anything tangible to get you going as a developer. Never mind, let's read on:
What do HTMs do? They discover causes (e.g. objects) in the world. One of the goals of an HTM is to discover from the raw sensory input that objects like “cars” and “words” exist. Sensory data will be a topologically arrayed collection of inputs, where each input measures a local and simple quantity. There are two essential characteristics of sensory data. First, the sensory data must measure something that is directly or indirectly impacted by the causes in the world that you might be interested in. Second, the sensory data must change and flow continuously through time, while the causes underlying the sensory data remain relatively stable. At any moment in time, given current and past input, an HTM will assign a likelihood that individual causes are currently being sensed.
HTMs consist of a hierarchy of memory nodes where each node learns causes and forms beliefs. Part of the learning algorithm (never explained in detail) performed by each node is to store likely sequences of patterns. By combining memory of likely sequences with current input, each node has the ability to make predictions of what is likely to happen next. When an HTM predicts what is likely to happen next, the prediction can act as what is called a “prior probability”, meaning it biases the system to infer the predicted causes.
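Since the paper offers no concrete mechanism, here is a minimal sketch of how a prediction could act as a "prior probability" that biases inference. The specific numbers and the simple multiply-and-normalize rule are my assumptions, not Numenta's algorithm:

```python
import numpy as np

# Bottom-up evidence for three hypothetical causes A, B, C (illustrative values).
likelihood = np.array([0.30, 0.35, 0.35])
# Top-down prediction acting as a prior: the node expects cause C next.
prediction = np.array([0.10, 0.10, 0.80])

# Combining evidence with the prediction biases the node toward cause C.
posterior = likelihood * prediction
posterior /= posterior.sum()  # renormalize to a probability distribution
# posterior now strongly favors C even though the bottom-up evidence was ambiguous
```

This is just the standard Bayesian prior-times-likelihood update; whatever HTM actually does inside a node is presumably more elaborate, but the biasing effect described above would look like this.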
HTMs are structured as a hierarchy of nodes, where each node performs the same learning algorithm. Each node in an HTM generally has a fixed number of causes and a fixed number of output variables. The nodes do not “add” causes as they are discovered; instead, over the course of training, the meaning of the outputs gradually changes. This happens at all levels of the hierarchy simultaneously.
The basic operation of each node is divided into two steps. The first step is to assign the node’s input pattern to one of a set of quantization points (representing common spatial patterns of input). In the second step, the node looks for common sequences of these quantization points. The set of these sequence variables is the output of the node, and is passed up the hierarchy to the parent(s) of the node. A node also can send information to its children. The messages going down the hierarchy represent the distribution over the quantization points, whereas the messages going up the hierarchy represent the distribution over the sequences.
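The two-step node operation can be sketched in a few lines. Since the paper never says how "closeness" is measured or how sequences are matched, this sketch assumes Euclidean distance with a softmax-style soft assignment, and a trivial sequence-matching rule; all names and values are illustrative:

```python
import numpy as np

# Step 0: a node's stored memory (tiny toy sizes).
quantization_points = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
sequences = [(0, 1), (1, 2)]  # stored sequences of quantization-point indices

def spatial_belief(x):
    """Step 1: distribution over quantization points for input x
    (assumed: softmax over negative Euclidean distance)."""
    d = np.linalg.norm(quantization_points - x, axis=1)
    p = np.exp(-d)
    return p / p.sum()

def sequence_belief(p_spatial, prev_index):
    """Step 2: distribution over stored sequences, given the current
    spatial belief and the previously active quantization point."""
    scores = np.array([p_spatial[b] if a == prev_index else 1e-6
                       for a, b in sequences])
    return scores / scores.sum()

p = spatial_belief(np.array([0.9, 0.95]))  # input close to stored pattern 1
out = sequence_belief(p, prev_index=0)     # output passed up to the parent
```

The distribution `out` is what goes up the hierarchy; the messages going down would be over the quantization points, i.e. in the space of `p`.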
Why is the use of a hierarchy important? HTMs try to match inputs to previously seen patterns, but they do so a piece at a time and in a hierarchy. A node couldn’t store every pattern that it would likely see in its lifetime. Instead, the node stores a limited, fixed number of patterns, say 50 or 100. These stored patterns are the quantization points. You can think of the quantization points as the most common patterns seen by the node during training. Further training will not increase the number of quantization points, but it can change them. At every moment, the node takes a new and novel input and determines how close (never explained how) it is to each stored quantization point. After sufficient initial training, most new learning occurs in the upper levels of the HTM hierarchy. When training a new HTM from scratch, the lower-level nodes become stable before the upper-level nodes, reflecting the common sub-properties of causes in the world.

HTMs do not just exploit the hierarchical spatial structure of the world. They take advantage of the hierarchical temporal structure of the world as well. Nodes at the bottom of an HTM find temporal correlations among patterns that occur relatively close together in both space and time: “pattern B immediately follows pattern A”.
When designing an HTM system for a particular problem, it is important to ask whether the problem space (and the corresponding sensory data) has hierarchical structure. For example, if you want an HTM to understand financial markets, you might present data to the HTM where adjacent sensory inputs are likely to be correlated in space and time. Perhaps this means first grouping stock prices by category, and then by industry segment (e.g. technology stocks such as semiconductors, communications, and biotechnology would be grouped together at the first level; at the next level, the technology group is combined with manufacturing, financial, and other groups).
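One hypothetical way to lay out that market example, so inputs feeding the same low-level node sit next to each other (the segment names under "manufacturing" and "financial" are my own placeholders; the paper only names the technology segments):

```python
# Hypothetical input layout for the market example above.
hierarchy = {
    "technology": ["semiconductors", "communications", "biotechnology"],
    "manufacturing": ["automotive", "aerospace"],
    "financial": ["banks", "insurance"],
}
# Level-1 nodes: one per segment, each seeing that segment's price inputs.
# Level-2 nodes: one per group, pooling the outputs of its segments.
# Level 3: a single top node pooling all the groups.
```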
A connected graph where each node in the graph represents a belief or set of beliefs is commonly referred to as a Bayesian network. Thus, HTMs are similar to Bayesian networks. In a Bayesian network, beliefs at one node can modify the beliefs at another node if the two nodes are connected via a conditional probability table (CPT). A CPT is a matrix of numbers where the columns of the matrix correspond to the individual beliefs from one node and the rows correspond to the individual beliefs from the other node. Multiplying a vector representing the belief in a source node by the CPT results in a vector in the dimension and “language” of beliefs in the destination node.

Belief Propagation (BP) is a mathematical technique that is used with Bayesian networks. BP doesn’t iterate to reach its final state; it happens in one pass. HTM uses a variation of Belief Propagation to do inference. The sensory data imposes a set of beliefs at the lowest level in an HTM hierarchy, and by the time the beliefs propagate to the highest level, each node in the system represents a belief that is mutually consistent with all the other nodes. The highest level nodes show what highest level causes are most consistent with the inputs at the lowest levels. A key property of BP is that it makes it possible to build large systems that settle rapidly. The time it takes for an HTM to infer its input increases linearly with the number of levels in the hierarchy, but the memory capacity of the HTM increases exponentially with the number of levels. HTM networks can have millions of nodes, yet still have the longest path be short, say five or ten steps. Because basic belief propagation has no way of handling time-varying data, the concept of time must be added to do inference in these domains. The nodes in an HTM are also more sophisticated than in standard BP.
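The CPT mechanics described above are concrete enough to sketch directly. Following the text's convention (columns index the source node's beliefs, rows index the destination node's), with made-up values:

```python
import numpy as np

# A CPT mapping a 2-belief source node into a 3-belief destination node.
# Columns = source beliefs, rows = destination beliefs (values illustrative).
cpt = np.array([[0.7, 0.2],
                [0.2, 0.5],
                [0.1, 0.3]])

source_belief = np.array([0.9, 0.1])  # belief vector in the source node

# Multiplying by the CPT yields a vector in the destination's "language".
message = cpt @ source_belief
message /= message.sum()  # renormalize to a probability distribution
```

The resulting `message` is a length-3 distribution over the destination node's beliefs; this is the shape of the messages BP passes between nodes.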
In summary, there are three sources of dynamic change in an HTM. The first occurs because of the changing sensory input. The second occurs as each node uses its memory of sequences to predict what will happen next and passes this prediction down the hierarchy. The third happens only during training, and at a much slower time scale, as the stored patterns themselves are modified.
Let’s say a node identifies the fifty most common spatial patterns found in its input. Let’s label the learned spatial patterns sp1 through sp50. Suppose the node observes that over time sp4 often follows sp7. Assume a node stores the 100 most common temporal sequences. Here then is what nodes in an HTM do. At every point in time, a node looks at its input and assigns a probability that this input matches each element in a set of commonly occurring spatial patterns. Then the node takes this probability distribution and combines it with previous state information to assign a probability that the current input is part of a set of commonly occurring temporal sequences. The distribution over the set of sequences is the output of the node and is passed up the hierarchy. Finally, if the node is still learning, it may modify the set of stored spatial and temporal patterns to reflect the new input.

In summary, we can say that each node in an HTM first learns to represent the most commonly occurring spatial patterns in its input. Then it learns to represent the most commonly occurring sequences of those spatial patterns. The node’s outputs going up the hierarchy are variables that represent the sequences, or more precisely, the probability that those sequences are active at this moment in time. A node also may pass predicted spatial patterns down the hierarchy.
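The per-timestep update above ("combine with previous state information") is the part the paper leaves vaguest. One way it could plausibly work is to treat each stored sequence as a tiny left-to-right Markov chain and run a forward-style update; that modeling choice is entirely mine, with the 50-pattern / 100-sequence sizes taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patterns, n_sequences, seq_len = 50, 100, 3
# sequences[k] lists which spatial pattern occurs at each step of sequence k.
sequences = rng.integers(0, n_patterns, size=(n_sequences, seq_len))

def step(p_spatial, state):
    """One timestep. state[k, t] = belief that the input stream is
    currently at step t of stored sequence k."""
    advanced = np.roll(state, 1, axis=1)    # every sequence moves forward one step
    advanced[:, 0] = 1.0 / n_sequences      # any sequence may (re)start now
    evidence = p_spatial[sequences]         # how well the input matches each (k, t)
    state = advanced * evidence             # combine prior state with new evidence
    return state / state.sum()

state = np.full((n_sequences, seq_len), 1.0 / (n_sequences * seq_len))
p_spatial = rng.dirichlet(np.ones(n_patterns))  # stand-in spatial belief
state = step(p_spatial, state)
sequence_output = state.sum(axis=1)  # distribution over sequences, passed up
```

Whatever Numenta's real mechanism is, its input/output contract should match this: a distribution over spatial patterns comes in, a distribution over stored sequences goes out, and internal state carries over between timesteps.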
The fact that a node always sees distributions means it is generally not practical to simply enumerate and count spatial and temporal patterns; probabilistic techniques must be used. For example, the idea of a sequence in an HTM is generally not as clean as the sequence of notes in a melody. For most causes in the world, it is not clear when a sequence begins or ends. Nodes in an HTM have to decide when the change in the input pattern is sufficient to mark it as a new event. There is much prior art on how to learn spatial patterns from messy real-world data; some of these models try to precisely model parts of the visual cortex. There is less prior art on learning sequences from distributions, at least not in ways that will work in an HTM.
Recall that Bayesian networks send belief messages between nodes, and that CPTs (conditional probability tables) are two-dimensional memory matrices that convert a belief in one node into the dimension and language of the belief in another node. The CPT allows the belief at one node to modify the belief at another node. Earlier we illustrated CPTs with the example of nodes representing temperature and precipitation. After that, we didn’t explain how the CPTs were learned. Well, we did, but in different language. In an HTM, the CPTs used in passing information from node to node going up the hierarchy are formed as a result of learning the quantization points. The quantization function itself is the CPT. By contrast, in a traditional Bayesian network the causes at each node would be fixed, and the CPT would be created by pairing instantaneous beliefs between two nodes. We can’t do this in HTMs because the causes represented by each node are not fixed and have to be learned. Learning the quantization points is, in essence, a method of creating a CPT on the fly.