Prediction
Prediction is a term used at several levels of my hierarchy
of different levels of description to mean the use of stored information to facilitate
future processes. The stored information is broadly what can be called memory, but the nature of the information stored differs
between the levels, so the mechanisms by which prediction uses it also differ.
Although prediction sits with memory in my hierarchy as a
brain-wide function, and the two are closely connected, there is one major difference:
the process of memory is part of my model of my world, and I therefore have an innate (although often incorrect) understanding of it;
but prediction is a process that I am generally unaware of: it is not modelled, except as part of the process of perception.
Even though I am not normally aware of the powerful assistance that prediction provides, it is actually crucial for many of my every-day
activities such as walking or picking up an object, and is most likely a fundamental part of how I perceive and interact with the world and my body.
Contents of this page
- Overview - an overview of my proposals related to prediction.
- Introduction to the science - an introduction to the science of prediction in the brain.
- The science of prediction before Predictive Processing - a brief history of prediction before Predictive Processing and prerequisites for it.
- Predictive Processing - an overview of the theory of Predictive Processing and the Free Energy Principle.
- Comments on Predictive Processing and the Free Energy Principle - how my proposals fit with these.
- Details - further details of my proposals on prediction.
- References - references and footnotes.
Overview
- Prediction takes place in the brain at many
different levels of description in my
seven-level hierarchy, ranging from
level 2, the lowest level of afferent processing,
through to high-level functions such as perception,
cognoception (the perception of internal brain functions),
action and attention that are all part of
level 6.
- At the lowest level, prediction operates at the level of
individual neurons and synapses.
- Prediction uses information stored as low-level memories
to help make decisions based on past events.
- It is used in basic coincidence detection (part of level 2) in order to
detect real coincidences of two events, and ignore chance events where two things just happen to occur at the same time.
- It is an automatic and unconscious mechanism that enables the brain
to react more quickly to incoming data and to fill in gaps in sometimes incomplete data,
so it must have evolved because it helps the survival of the owner of the brain.
- My afferent processing examples show this can happen,
particularly examples 1-4 and 6.
- At a higher level, prediction acts at the level of symbol schemas and the
connections between them (level 4), and then is also part of a number
of functions in level 6 that depend on the existence of symbol schemas.
- The strengths of both afferent and efferent connections from and to individual symbol schemas
are examples of memories that are, in effect, predictions about connections between different symbols.
- These connections have an influence on how easily the symbol schemas are activated,
and how the activations of one symbol schema can make the activation of another more or less likely.
- This has an effect on both perception and
cognoception,
and in both cases the perception happens only when the relevant symbol schema is activated.
- The same is also true for symbol schemas that represent actions,
so certain actions are more likely to be triggered by the activation of certain other symbol schemas.
- The process of attention is also closely tied in with
these predictions, because the end result of attention depends on the influences of multiple connections
from and to many symbol schemas over many levels of processing.
- Higher level prediction is an emergent feature of many occurrences of lower level predictions;
they are the same thing at different levels of description.
- Since around 1999, there has been a big focus on prediction as a driving force in the brain.
This page covers an overview of the new science, and compares the implications of the theories with my proposals.
- Overall, there is a very good match between the latest theories and my proposals, in
terms of the likely outcomes of both.
- The science is largely theoretical and mostly quite high-level, with only a small amount of
supporting evidence, whereas my proposals build on the lowest levels of detail.
An introduction to the science
- There were hints many centuries ago that prediction played a part in the processing of the brain,
but it was not until the middle of the 19th century that specific proposals were made.
- The topic of prediction in the brain has become a very popular subject in the 21st century, with thousands of
scientific papers1
and whole books written on the subject2,
focussing on describing brain function from the perspective of it being predictive.
- It certainly is an intriguingly different way of looking at the processing of the brain and it has generated a number
of new concepts and made some interesting suggestions, but it is possible to take it too far, I feel.
- Proponents have claimed that the brain is a
“probabilistic prediction machine”3
and that it is “predictive, not reactive”5.
- Saying that the brain is “predictive, not reactive” is certainly going too far because
clearly it has to be reactive in many circumstances.
- The brain of a young baby, for example, cannot be predictive because it has no experience to create predictions
from6.
- When undertaking a new adventure, such as a new sport, exploring a new place, and so on, some experiences are not going
to be able to be predicted, so the brain has to be reactive.
So a fairer statement, I feel, is that the brain has to balance reaction with prediction,
and although it may be true to say it is more predictive than reactive, I think it is a mistake to only
consider the processing of the brain from a predictive viewpoint.
- There have been claims in recent years that Predictive Processing and the Free Energy Principle
(details of both are below) determine the actions of not only the brain, but the operation of all life,
and explain attention, action, perception, and even the concept of self and consciousness.
- Again, these are interestingly different ways of looking at things, but they do not necessarily
provide insights that cannot be provided from other ways of looking at things.
- It may be of relevance that much of the popularisation of these more unusual ways of looking at things
has been done by philosophers rather than neuroscientists.
The science of prediction before Predictive Processing
- This section is a review of the history of investigations into the subject of prediction in the brain before around 1999.
It is not intended to be complete, so there are inevitably some gaps. It covers a wide range of subjects at a high level,
so I have included plenty of Wikipedia links for those who want more detailed information.
- There are some hints from ancient Greece that philosophers realised that our perceptions are not what they seem,
and that we do not have direct access to the external world.
- An example is the
allegory of the cave
by the Greek philosopher Plato,
dating from around 375 BCE.
- This is usually understood to be an allegory or metaphor saying that our perceptions are like shadows of
reality7.
The implication is that if these shadows of reality are all we can know about the world, we therefore are continually
guessing, or perhaps predicting, what is really out there.
- The suggestion that the brain processes things unconsciously before any awareness happens was probably
first made by the Middle-Eastern scientist
Ibn al-Haytham, also known as
Alhazen, in his Book of Optics
written between the years 1011 and 1021.
- He was the first to correctly deduce that vision is produced by light entering the eye after being
reflected from objects (see theory of optics).
- He is now recognised as probably
the first true scientist
because of his use of experiments and evidence to confirm proposals, so he helped develop the so-called
scientific method.
- He got quite close to describing the concept of
unconscious inference
(see under Helmholtz immediately below) when discussing the perception of colour and light, and also the perception of the
moon illusion.
- The person who is usually credited as being the first to suggest that the brain predicts outcomes is the German physicist
Hermann von Helmholtz.
He also played an important part in several other areas of relevance, as will be seen in the next few sections.
- In 1867, Helmholtz said that perception is a process of unconscious conclusion, more normally known now as
unconscious
inference10,
meaning that the brain decides what is being perceived without our conscious knowledge.
- He also realised that in order to be predictive, the brain must create a
model of the world that contains symbols that represent things in the real
world11,
and he described a number of illusions that also supported this idea.
- He provided an outline for how a motor command to move the eyes could be used by the brain to predict
the incoming visual signals that would be received, and gave a simple example of how to show that this is probably
true12:
- If you close one eye and focus your open eye on some object, and then move your eye just a little
to one side, you do not perceive any movement of the general scene behind the object.
- However, if you push with a finger (very gently) at the side of your open eye (not on the eyeball itself),
you will see that the world appears to move slightly, depending on the direction of your push.
- This would seem to show that the brain is compensating for, and predicting the outcome of, the movement
of the eye when it is moved by the brain, but obviously cannot do the same when the movement is caused by an outside force,
so it has to assume that the world has moved.
- Helmholtz understood that the brain resolved perceptions by predicting what might be out there, but
that this was done completely unconsciously so that we are not able to know about this process, only the result of
it8.
- Unfortunately, these ideas were pretty much ignored and were not really developed for the next hundred years.
This was at least partly because people felt that the idea that the brain might make decisions unconsciously caused potential
moral and legal issues relating to personal responsibility.
- The British psychologist Richard Gregory
published several papers in the 1960s and 1970s that took up some of the ideas of Helmholtz.
- Like Helmholtz, he described a number of illusions of the senses, categorised them, and then put forward his
ideas on a predictive brain.
- He described in some detail the advantages of the brain having a model of the world and using prediction to come
to a conclusion about perceptions, rather than being purely
reactive13.
- Once again, however, these ideas were not picked up very much at the time, and little progress was made
in this line for the next thirty years or more.
- In the meantime, however, there were developments in several other lines of research that would later come together
to make progress in the area of prediction in the brain.
- The first was in the area of
statistics and particularly the use of
Bayesian inference to model the process of perception in the brain.
- Bayesian statistics had been around for
a long time. It was first developed by Rev. Thomas Bayes,
but formalised and first published by others in 1763, two years after his death.
- Bayes’ theorem provides a formula
for calculating the conditional probability of an event when new information relevant to the event is taken into account.
- With respect to perception in the brain, the prior relates to the existing model,
the posterior relates to the new version of the model after it is updated with the new evidence,
the new evidence is new sense data that doesn’t fit with the existing model,
and the likelihood is how reliable the brain judges the new evidence to be.
- The relevance to prediction is that using a model in the brain to try to match with a perception
is the same as using prediction to make a best guess as to what is being perceived.
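A minimal sketch of this update, with illustrative numbers (the probabilities here are made up for the example, not taken from any of the science described on this page): the prior is the model's existing belief, the likelihoods measure how well the new sense data fits each possibility, and the posterior is the updated belief.

```python
# Bayes' theorem: P(model | evidence) = P(evidence | model) * P(model) / P(evidence)

def bayes_update(prior, likelihood, likelihood_other):
    """Posterior probability of a hypothesis given new evidence."""
    # P(evidence) sums over both possibilities: hypothesis true or false
    evidence = likelihood * prior + likelihood_other * (1 - prior)
    return likelihood * prior / evidence

prior = 0.7             # prior: the model already expects the object to be there
likelihood = 0.9        # P(this sense data | object present)
likelihood_other = 0.2  # P(this sense data | object absent)

posterior = bayes_update(prior, likelihood, likelihood_other)
print(round(posterior, 3))  # 0.913 - the new evidence strengthens the belief
```

In the perceptual reading above, the prior comes from the existing model, and the posterior becomes the new state of the model once the sense data has been taken into account.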
- The second is
artificial neural networks
as a method of machine learning,
which developed particularly as computers became more powerful from the 1960s onwards.
- Despite the name, very few artificial neural networks were actually based on the known architecture of the brain,
or developed with the aim of understanding it, because they were mostly aimed at advancing machine learning;
some did, however, provide clues about the possible predictive nature of the brain (without actually providing any evidence that this was the way the brain worked).
- The history of artificial neural networks
goes back many years and there have been
many different models proposed.
- Bayesian statistics was used as part of machine learning in many different models, for example
Bayesian optimization and
Bayesian network (see also
Bayes theorem for machine learning).
- One of the first models that emphasised the requirement for a hierarchical structure with both
feedforward (what I prefer to call afferent) and feedback (efferent) connections and weightings was called the
Helmholtz machine in acknowledgement of the contributions of
Helmholtz14.
- This model was co-created by one of the well-known names in machine learning of the 1990s,
Geoffrey Hinton. He was one of the
first to suggest that the brain learned by using Bayesian statistics to update models, and that this was essentially the same as prediction.
- The final area is information theory,
an area of research that was largely spurred by the development of computers and
Information Technology (IT) from the 1940s onwards.
It covers many concepts, but those that were later relevant to prediction in the brain include
entropy,
free energy and
data compression.
- Information theory
is the mathematical treatment of information, and defines information as the resolution of
uncertainty (see overview of information theory).
- It studies how information is stored, transmitted and used, which is what computers do,
but of course is also exactly what the brain does.
- The field was defined by an important paper published in 1948 by
Claude Shannon. A
Mathematical Theory of Communication
defined the concepts required for the production, transmission and receiving of information,
as well as introducing the term “bit” for a unit of information.
- In particular, theorem 2 from the Mathematical Theory of Communication gives a formula for what we now call
information entropy (more below).
- Entropy
is a concept that causes a lot of confusion, particularly among non-scientists, because it has several
apparently different definitions, and also because it is used in several different fields to have different, but related,
meanings15.
In summary, it is a measure of the amount of disorder or uncertainty in a system (for more detail, see
entropy disambiguation,
introduction to entropy and
history of entropy).
- In the field of
thermodynamics
(the study of heat and related concepts) it had been observed that heat only flows from a hot body to a cold body
and not the other way around, and that some energy is always lost in the process. The term entropy was
first used in 1865
as a name for this inevitable loss of heat.
- Entropy in thermodynamics
is defined in the
second law of thermodynamics
as a quantity that cannot decrease over time in a
closed system
(a closed system is any environment that is isolated from outside influences).
This means that any closed system will inevitably over time change towards a state of thermodynamic equilibrium where the entropy is highest.
- In 1872,
Ludwig Boltzmann,
who worked with Helmholtz in Berlin for a time, published his
H-theorem, which is basically a
derivation of the second law of thermodynamics from statistical mechanics
using thermodynamic entropy.
- Entropy in statistical thermodynamics
is a quantity that is proportional to the logarithm of the number of possible states of a
system16
(also see
irreversible process,
entropy as an arrow of time).
- Entropy in information theory
is defined by the formula given by Shannon in his Mathematical Theory of Communication (as mentioned above).
This is directly connected to entropy in statistical thermodynamics (see immediately above), because
the larger the number of possible states of a system, the more information is needed to describe the system.
So in this field, entropy is, in effect, the same as information.
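Shannon's formula can be stated in a few lines; the examples show the direct link to the number of possible states mentioned above (a system with N equally likely states has an entropy of log2(N) bits):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over all outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# More possible states means more uncertainty, so more bits are
# needed to describe the system:
print(shannon_entropy([0.25] * 4))  # 2.0 bits (4 equally likely states)
print(shannon_entropy([0.5, 0.5]))  # 1.0 bit  (2 equally likely states)
```

A certain outcome (a single state with probability 1) has an entropy of zero: no information is gained by observing it.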
- Free energy
is closely related to entropy, and therefore also has slightly different definitions in different fields.
It is the energy available in a system to do useful work, so broadly it decreases as entropy increases.
The word “free” in this context means “available”.
It has now become an important concept in describing not only the
predictive nature of the brain (see below for more details), but also a principle
that is proposed to drive the development and operation of all living things.
- In 1882, Helmholtz (see above) proposed the concept of free energy in an article about the thermodynamics of
chemical processes17,
which is now generally referred to as
Helmholtz free energy.
- Thermodynamic free energy
is the amount of work or energy that can be obtained from a system.
- Assuming a constant temperature, free energy decreases as the thermodynamic entropy of the system increases,
so as a system becomes more disordered, its capacity for useful work falls.
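The relationship in the point above is captured by the standard definition of Helmholtz free energy; as a brief sketch in LaTeX notation:

```latex
% Helmholtz free energy: F is the energy available for useful work in a
% system with internal energy U, temperature T and entropy S.
F = U - TS
% At constant temperature and internal energy, a rise in entropy
% (more disorder) directly lowers the free energy:
\Delta F = -T \, \Delta S
```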
- One of the first attempts to define the concept of free energy in information theory was a paper published in
198618.
- It was first applied to artificial neural networks in
199419.
- Data compression
was very important in the early days of signal processing and computer science to try to save on bandwidth and storage, both of which were expensive.
- In the 1950s, Bell Labs
developed early predictive compression techniques that were forerunners of the familiar
JPEG and other
compression algorithms20.
- These algorithms compress expected and repeated patterns, and only encode unexpected variations when the actual value is
different from the expected one, which can be considered to be prediction errors.
- When these compression techniques were used in audio and speech compression, it was called
linear predictive coding (LPC).
- LPC was built into early silicon chips, such as the
TMC0280 LPC speech synthesiser chip,
which was used in the
Speak & Spell toy, first sold in 1978.
- LPC is also used in the
European GSM mobile phone standard for voice compression.
- Similar techniques were later used in
machine learning
and later in computer models of visual processing, and then proposed as the way the brain predicts.
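The idea behind these predictive coders can be sketched in a few lines. This is a toy first-order predictor, not the actual LPC algorithm (which predicts each sample from a weighted sum of several past samples), but the principle is the same: expected values compress well, and only the surprises cost bits.

```python
# Each sample is predicted to equal the previous one; only the
# prediction error (the residual) is encoded.

def encode(samples):
    """Encode a signal as its first sample plus prediction errors."""
    errors = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        errors.append(cur - prev)  # prediction error: actual minus predicted
    return errors

def decode(errors):
    """Reconstruct the signal by accumulating the prediction errors."""
    samples = [errors[0]]
    for e in errors[1:]:
        samples.append(samples[-1] + e)
    return samples

signal = [10, 11, 12, 12, 13, 30, 31]
residual = encode(signal)
print(residual)  # [10, 1, 1, 0, 1, 17, 1] - mostly small, compressible numbers
assert decode(residual) == signal  # lossless: the signal is fully recoverable
```

The large residual (17) marks the one unexpected jump in the signal; everything predictable has shrunk to small values that a subsequent entropy coder can store cheaply.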
- Some of these areas started to be knitted together in the 1980s.
- The first proposal for the use of
Bayesian approaches to brain function was
in a paper by the physicist E. T. Jaynes in
198823.
- Jaynes proposed that the brain uses Bayes probability for all reasoning, not just perception, and that this was
the only “plausible reasoning”, which he equated with “common sense”.
- He derived the rules of Bayes probability from theorem 2 from Shannon’s
Mathematical Theory of Communication (see above), which is a definition of
information entropy24,
so linking entropy to Bayes probability.
- This paper also quoted a principle that Jaynes had first derived in 1957 called the
principle of maximum entropy.
This says that, in terms of Bayes’ Theorem, the best representation of prior probability is the one with the largest entropy,
so giving an indication on how the measure of entropy may be used in describing brain function.
Predictive Processing
- A new approach started to be taken from around 1999 that brings together all the areas outlined above.
The following summary also includes many Wikipedia links for those who want more information.
- It started with a theory called
Predictive Coding, now more usually called
Predictive Processing, which, as its name suggests, concentrates on prediction as the driver of perception,
but now also encompasses both action and
attention21.
- The term “Predictive Coding” came from linear predictive coding used in compression algorithms,
as explained above.
- Predictive Coding was first used as a proposed description of visual processing
in the brain in a paper published in 1999 by Rajesh Rao and Dana
Ballard25.
- It describes a “hierarchical predictive coding model” created as a computer
simulation that successfully emulated some particular known features of the visual cortex in the brain.
- The simulation included prediction from feedback (efferent) connections and feed-forward (afferent)
incoming data, with prediction errors being created if the two did not match.
- It is described as a hierarchical generative model,
but there is little mention of the minimisation of prediction errors, although it does briefly touch on the concept of
minimum description length
as being the most efficient model because it has the most compression.
- “Predictive Processing” was later defined as a proposed method that the brain uses to perceive,
using hierarchical probabilistic generative models. This 1999 model does fit within this definition, so it was the first to detail
how Predictive Processing may be used in the
brain22.
- The elements of Predictive Processing are as
follows26:
- A model of the world is built by extracting patterns or statistical regularities from incoming signals.
- The Scottish philosopher
Andy Clark calls this a
“multilayer probabilistic generative model”4.
- The word “generative” is used because the model is capable of regenerating the sensory patterns
that produced its input, but the theory doesn’t specify exactly what form the model takes or how it is updated.
- At each level of the hierarchy, the model (top-down or efferent) is compared to incoming information (bottom-up or afferent).
- A mis-match generates a prediction error, which updates the model on the level above.
- Prediction error is also called surprise or surprisal.
- Bayesian statistics are used to update probabilities; the actual mechanism proposed is known as
empirical Bayes,
where prior probabilities are learnt over time and extracted from the higher levels of the hierarchy.
- A perception happens only when the prediction errors are minimised.
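A toy sketch of a single level of this loop, under the assumptions listed above (the model is reduced to a single number standing in for a whole generative model, and a simple learning rate stands in for the Bayesian update):

```python
# One level of predictive processing: a stored "model" value predicts the
# incoming signal, the mismatch is the prediction error (surprise), and
# the error updates the model. Perception "settles" once the error is small.

def perceive(model, signal, rate=0.5, tolerance=0.01, max_steps=100):
    """Iteratively reduce the prediction error between model and signal."""
    for step in range(max_steps):
        error = signal - model       # prediction error: actual minus predicted
        if abs(error) < tolerance:   # errors minimised: perception happens
            return model, step
        model += rate * error        # update the model from the error
    return model, max_steps

model, steps = perceive(model=0.0, signal=10.0)
print(round(model, 2), steps)  # 9.99 10 - the model has converged on the signal
```

In the full theory this happens at every level of the hierarchy at once, with each level's prediction errors updating the level above; this sketch shows only the error-minimisation step itself.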
- The free energy principle
was proposed by
Karl Friston in
200927.
This was at first seen as an addition to Predictive Processing, but more recently it is recognised as a more fundamental principle that
applies to life in general and therefore Predictive Processing can be derived from it.
- It is a general mathematical principle describing the ability of a model to represent the thing it is modelling,
based on the existing theories of free energy (see above).
- The proposal is that the brain implements the principle by minimising prediction errors or surprise.
- Free energy is an upper bound on surprise (prediction error), and the average of surprise over a period of time is entropy.
- The minimisation of prediction errors has been shown to be the approach of least energy usage,
in other words, the most efficient method to maintain a model.
- The concept of precision weighting or expectation is defined as a measure of
how accurate a prediction might be, and depends on the details or circumstances of the sensing.
- Attention then is the minimisation of prediction errors by selecting the
signal most likely to fit the model28.
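A sketch of how precision weighting combines a prediction with incoming data. This is the standard inverse-variance weighting that these theories assume; the numbers are illustrative.

```python
# When prediction and sense data disagree, they are combined in proportion
# to their precisions (inverse variances): a reliable signal moves the
# estimate a lot, a noisy one barely moves it.

def precision_weighted(pred, pred_var, obs, obs_var):
    """Combine a prediction and an observation, weighted by precision."""
    w_pred = 1.0 / pred_var  # precision of the prediction
    w_obs = 1.0 / obs_var    # precision of the observation (sense data)
    return (w_pred * pred + w_obs * obs) / (w_pred + w_obs)

# Reliable sense data (low variance) dominates the prediction:
print(precision_weighted(0.0, 1.0, 10.0, 0.1))   # ~9.09
# Noisy sense data (high variance) barely shifts it:
print(precision_weighted(0.0, 1.0, 10.0, 10.0))  # ~0.91
```

On this view, attention amounts to turning up the precision weighting on the selected signal, so that it dominates the updating of the model.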
- A further proposal is that action is driven by the requirement to minimise prediction errors,
or to try to match incoming sense data with the model.
This is known as active inference.
- This means that if I sense an object, but cannot immediately be clear what it is,
I will take action, such as moving my eyes, or my head, or even my body, to try to change or improve my
ability to sense the object, and allow me to perceive it.
- This is likely to be done subconsciously, because a perception only happens when prediction errors are minimised.
- Predictive Processing takes this one stage further, however, and says that all action is to minimise prediction error,
so that, for example, if I predict I am picking up a cup of coffee, the only way to minimise the error is to actually pick it up.
Comments on Predictive Processing and the Free Energy Principle
- These new theories are a very interesting and innovative way of looking at things, and seem to suggest that the brain,
or possibly life in general, has some underlying rules or principles that guide how it develops and how it is used.
However, I have a number of concerns and comments.
- The theories assume that a hierarchical, generative model of the world is built and used, but don’t specify exactly
how the model is built, what it looks like, or how exactly it is updated.
The assumption seems to be that the processes described can build a model from scratch and maintain it, but no detail is given.
- New concepts, such as prediction error or precision weighting, are proposed, described and developed,
but no account is given of the levels of description at which they emerge, and there is very little evidence on how they are represented or used by the brain.
- To take an example, how could the brain keep track of prediction errors so that it could minimise them over a period of time?
According to the proposals of Predictive Processing, this is necessary to generate attention, minimise energy usage and maintain entropy.
- I think prediction error is kept to a minimum by virtue of the way that the brain processes data and builds its model.
For example, if I am looking out into my garden and there is a small movement in my peripheral vision,
my brain will make a decision whether or not to divert my gaze to it,
and a separate decision will also be made as to whether my conscious attention is drawn to it.
- In my proposals, these things are not driven by any over-arching principle to do with free energy or prediction error minimisation,
but simply the biased multi-level competition that we call attention, which depends on influences
(the strengths of many synapse connections) from many directions.
- These strengths, in turn, have been influenced and changed by previous interactions and
events, some very recent, some from a long time previously.
- This is the whole basis of how thought works, and how my conscious and unconscious thoughts flow from one thing to another.
- I have defined a symbol schema as a network of neurons and their mutual synapse connections that
represents a concept in the brain when it is activated.
- A symbol schema is activated when a majority of its component neurons fire at the same time,
or within a very short space of time.
- The concept only gains meaning if it becomes conscious, and it becomes conscious only if the
process of attention connects it to the self symbol schema.
- So I only recognise an object that I sense if it triggers the activation of a symbol schema,
and this only happens if there is a good match between the sense data and the stored generative model that represents it.
- A further step is then required for me to become consciously aware of the object, by the process of attention
connecting that symbol schema to my self symbol schema.
- So prediction errors are minimised if my brain recognises the object.
If it doesn’t recognise it (that small movement in my peripheral vision, for example)
because there is very little input, and my attention is not drawn to it because I am focussing on something else at the time,
then there are prediction errors, but very little changes in my brain because of it, and I can never be aware of what my brain ignored.
- So prediction errors are minimised for what my brain conceives as being useful or functional, but are not for those things that are not.
- On active inference, the question is: when I sense that small movement in my peripheral vision, does my brain assess the current level
of free energy, or entropy, or cumulative prediction error when it decides whether or not to instruct my eyes to move towards the movement?
- It seems unlikely that it does, because there is no evidence that it is, or can be, keeping track of any of these things.
- The decision is made, in the way described above, by the multi-level competitive process that is
attention, which is driven, ultimately, by the memory of previous events stored in the strengths of connections.
- Regarding attention as the minimisation of prediction errors by selecting the signal most likely to fit the model:
- In one sense, this is self-fulfilling, because (in my definition) there is no perception without attention.
- However, attention is a multi-level biased competition involving relative signal strengths,
with influences not only vertically but also laterally, so the process of signal selection is not just prediction versus incoming data.
- However high the priority of efferent predictive influences from the model are,
other influences can always override them.
Further details of my proposals on prediction
- In my hierarchical explanation of the working of the brain,
prediction appears right from the very lowest levels of afferent processing,
the processing of incoming data. This data can be sense data from external or internal senses, or data from internal brain processes.
- The very first diagram of the first, and simplest, of my examples taken from
afferent processing example 1 (reproduced here),
includes four possibilities for memory, and therefore four possibilities for future prediction (three low-level, and a fourth at a slightly higher level).
- The diagram illustrates the first step of two neurons connected to two different
cells in the retina of the eye detecting the “coincidence” of both cells seeing the same colour and the same
brightness at the same time.
- It shows only three of the simplified model neurons that I call ABCD neurons,
where A and B are activated at very nearly the same time, which means that C is then activated.
- The links from A to C and B to C may be strengthened, as well as any links between A and B (not shown),
so these are two possibilities for low-level memory.
- The two red arrows, which in all my diagrams indicate efferent connections
back towards the origin of the data, are also strengthened, which is a third possibility for low-level memory.
- These three possibilities for low-level memory can all be used for prediction the next time a similar set of incoming data is encountered:
since the connections have been strengthened by the first occurrence, they are more likely to be activated the next time.
- The fourth possibility for memory is at a slightly higher level, because it involves a new mini-circuit of neurons A, B and C,
and this provides the possibility of a slightly higher-level of prediction because an activated circuit can have a longer-term effect
than occurrences of individual strengthened connections.
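A toy sketch of this coincidence step. The threshold, weights and strengthening amount here are made up for illustration, and are not part of my ABCD neuron definition; the point is only that each real coincidence strengthens the links, so the strengthened links act as a low-level memory and therefore a low-level prediction.

```python
# C is activated when the combined input from A and B reaches a threshold,
# and each coincidence strengthens the A-to-C and B-to-C links, making C
# easier to activate the next time similar data arrives.

def step(weights, a_fired, b_fired, threshold=1.0, boost=0.1):
    """One time step: returns (whether C fired, updated link weights)."""
    drive = weights["AC"] * a_fired + weights["BC"] * b_fired
    c_fired = drive >= threshold
    if c_fired:
        # Hebbian-style strengthening: a real coincidence reinforces the links
        weights = {k: w + boost for k, w in weights.items()}
    return c_fired, weights

weights = {"AC": 0.5, "BC": 0.5}
fired, weights = step(weights, 1, 1)  # A and B fire together, so C fires
print(fired)                          # True
fired, weights = step(weights, 1, 0)  # A alone is below threshold
print(fired)                          # False
```

A chance event (only A firing, or only B) leaves the circuit unchanged, while a repeated genuine coincidence keeps lowering the effective barrier to activating C.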
- The other two steps that make up afferent processing example 1
build on step 1 to show how the whole frisbee is sensed. Many more links are created and each strengthening of a link and each new
mini-circuit provides an opportunity for a low-level memory to be created and therefore the possibility of a future low-level prediction.
However, the sensing of the whole frisbee in step 3 provides the opportunity for high-level memory and prediction.
- When the whole frisbee is viewed (step 3), and the start of the formation of a
symbol schema is demonstrated, a memory of seeing the whole frisbee has been created,
which may, or may not, become part of a consciously accessible memory.
- Either way, this can then lead to predictions that assist when the incoming data is incomplete or even corrupt or noisy,
which translates to: I can recognise a frisbee even when I can’t see all of it, or I am looking through a dirty window, or both.
- Once symbol schemas have been created, a much higher level of description of both memory and prediction is possible.
- The connections between symbol schemas enable what I have called
symbol schema memory, because the connections represent connections in the
real world.
- For example, if I see a frisbee in the garden shed, connections between my symbol schema for the frisbee
and the symbol schema for the shed will be strengthened and these will be part of an episodic memory of when I saw it.
- The existence of these connections can be used as a prediction in the future, so that when I next go into
the shed I will expect to see the frisbee, which may be a subconscious or a conscious expectation.
- The strength of the connections will also play a part in how strong the prediction is, and this can
also contribute to whether the prediction becomes conscious or not.
- The existence and strength of links between symbol schemas are part of the process of
attention; they are the primary drivers of lateral influences on the multi-level
competition.
- As is shown in my afferent processing examples,
incoming data goes through recursive and hierarchical
afferent processing that looks for patterns or coincidences in the data and compresses it.
- The resulting data that is stored for future use is a form of
memory, but at different levels, which means it may or may not be consciously accessible.
- A memory is only consciously accessible when perception is complete,
which is when the symbol schema has been activated or updated, and also connected to the self symbol schema.
- Efferent connections are made or strengthened as
part of this same afferent processing, and it is these links that can subsequently be used for higher-level prediction, among other things.
- Compression works hand-in-hand with prediction: with more prediction, more compression is possible,
and with more compression, better prediction is
possible29.
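The hand-in-hand relationship between prediction and compression can be illustrated with the classic predictive-coding trick from signal processing (described in the Clark quote in the references): transmit only the prediction errors. A minimal sketch, using a made-up row of pixel values:

```python
# Predictive coding as compression: predict each value from its
# predecessor and keep only the prediction errors (residuals).
# The pixel row below is an illustrative, made-up example.

pixels = [100, 101, 101, 102, 180, 181, 181, 182]  # a row with one edge

def encode(values):
    """Replace each value by its deviation from the prediction
    (here the simplest prediction: 'same as the previous value')."""
    errors, prediction = [], 0
    for v in values:
        errors.append(v - prediction)  # transmit only the surprise
        prediction = v                 # update the prediction
    return errors

def decode(errors):
    values, prediction = [], 0
    for e in errors:
        prediction += e
        values.append(prediction)
    return values

errors = encode(pixels)
print(errors)                    # [100, 1, 0, 1, 78, 1, 0, 1]
assert decode(errors) == pixels  # nothing is lost, yet most errors are tiny
```

The better the prediction, the smaller (and more compressible) the residuals; and the more regular the data, the better the prediction can be.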
- It is also clear that prediction goes hand-in-hand with perception:
rather than bottom-up processing of sense data being responsible for perception, it is the top-down prediction
that causes perception, because the symbol schema, the top-level representation, must match the incoming data.
- On my page about memory, there are three types of low-level implicit memory
that are described by the science. These are habituation, sensitisation, and classical conditioning.
- Although these three were studied in sea slugs at the level of single neurons and even single synapses,
human-level examples involve symbol schemas representing things in the world.
- For example, when I no longer notice a clock ticking, I am habituated to it; this is not because the sound is no longer reaching my brain,
but because the strength of the connections of the incoming sense data towards the symbol schema that represents
a clock ({clock}) has been dialled down, so that attention gives priority to almost everything else.
- Changes to multiple synapses have caused an emergent behaviour which in this case is to keep the noise of a ticking clock away from consciousness.
- Similarly, classical conditioning is the strengthening of links between two (or more) symbol schemas that were hardly connected before.
- The result is that whenever I perceive one, the thought of the other is triggered.
- Another good example is walking or running. Without realising it, you are sensing more than one step ahead
in order to keep your balance, keep your pace and maintain an upright posture. If you try to think about what you are doing,
how you are doing it, or what you are predicting, you may find you can hardly walk or run at all; it is just not smooth any more.
- The highest level of prediction is between symbol schemas when they are consciously perceived.
- However, we are still not aware of the prediction process, because the model in our brain of perception is
all we can be aware of, and it is a very high-level model.
- For example, when you go into a bakery, the unmistakable smell of freshly baked bread
invades your consciousness. You are also consciously aware of all the things that this might remind you of,
but you are not aware of the prediction that drives these associations.
- My model of perception, however, merely says that some data comes to my senses, I recognise what it is,
and I associate it with other things because of previous encounters I have had.
- On the other side of the coin, a bad experience with food poisoning can put you off eating a certain
food or type of food for many years. You know the reason for it, and you consciously avoid it,
but again prediction is playing a big part in your associations.
- Prediction can also be described as pattern matching by probability.
- Patterns in sense data, in either space or time, are stored in the
synaptic connections between neurons,
either by the creation of new connections or by the strengthening of existing ones
(see afferent processing examples).
- If a pattern is seen several times, and then in future a part of that pattern
is seen again, the likelihood that it represents a part of the whole pattern can be assessed,
depending on the number of connections and their strength.
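This kind of assessment can be sketched by scoring stored patterns against a partial input by the strength of the connections the input activates. The patterns, features and weights below are invented purely for illustration:

```python
# Sketch of pattern matching by probability: stored patterns are scored
# against a partial input by the fraction (and strength) of their
# connections that the input activates. All values are illustrative.

patterns = {
    "frisbee": {"round": 1.0, "flat": 0.8, "plastic": 0.6},
    "plate":   {"round": 1.0, "flat": 1.0, "ceramic": 0.9},
}

def match_score(observed, pattern):
    """Strength of the activated connections divided by total strength:
    a crude estimate of how likely the partial input fits the pattern."""
    total = sum(pattern.values())
    hit = sum(w for feature, w in pattern.items() if feature in observed)
    return hit / total

partial = {"round", "plastic"}  # incomplete input, e.g. a dirty window
scores = {name: match_score(partial, p) for name, p in patterns.items()}
best = max(scores, key=scores.get)
print(best)                     # frisbee wins on the partial evidence
```

The more often a pattern has been seen, the stronger its connections, and the more confidently a fragment of it can be completed.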
- If there are differences between the incoming signal and the best-fit model,
the perception will not be completed, and some updates may be needed.
What then happens will depend on the strength of the signals that have been
filtered through the process of attention.
- An action may be required, for example, a movement of the eyes, a
movement of the body, or a total shift in perspective.
- All of this is, in effect, a calculation of the probability based on previous experience
and then an update if required, and is a form of
Bayesian inference,
as described above in the history section.
However, this does not mean that the brain rigorously adheres to Bayesian probabilities,
it will only be an approximation9.
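The idealised form of the calculation described above is Bayes' rule, which the brain only approximates. A worked toy example, with probabilities invented for illustration:

```python
# Bayes' rule applied to perception: a prior belief in a hypothesis is
# updated by how well the hypothesis predicts the incoming evidence.
# All probabilities here are invented for illustration.

def bayes_update(prior, likelihood, evidence_prob):
    """P(hypothesis | evidence) = P(evidence | hypothesis) * P(hypothesis) / P(evidence)"""
    return likelihood * prior / evidence_prob

# Hypothesis: "there is a frisbee in the shed".
prior = 0.7                          # I usually see it there
likelihood = 0.9                     # if it is there, I expect this input
p_evidence = 0.9 * 0.7 + 0.2 * 0.3   # total probability of the input

posterior = bayes_update(prior, likelihood, p_evidence)
print(round(posterior, 3))           # belief strengthens: 0.913
```

The posterior then becomes the prior for the next moment's incoming data, which is what makes the process continuous rather than a one-off inference.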
- Since the symbol schema must match the incoming data, then clearly prediction error, or surprise,
must be minimised by the brain, and therefore the
free energy principle
must be true almost by definition.
- Prediction is happening all the time, at many levels, affecting all incoming data in your brain, all at once.
It happens for incoming sense data, and it happens for outgoing action.
- Some examples of prediction in real life:
- If you are lying in bed half-asleep, one thing that can wake you up and come to your attention is a steady noise, such as the heating or air-conditioning system, suddenly stopping. You may not even be aware of what stopped, but your unconscious, in predicting the continuing noise, had filtered it out so that you did not notice it, until it unexpectedly stopped.
- Walking - you are continually looking several steps ahead.
- Catching a ball
- The old idea was that the brain is purely reactive and that perception is a matter of accepting input from the senses, analysing it, and deciding what is out there. This has been turned on its head; the radical new idea is that the brain is primarily proactive and predictive, rather than reactive and rule driven.
- Prediction is necessary because otherwise we would not have enough time to react to urgent situations. Passing a signal from one neuron to the next takes on the order of ten milliseconds, so even a chain of neurons only a hundred long would take a second to propagate a signal. If this included analysing from scratch what was being sensed and then reacting to it, a chain much, much longer than this would be needed. Prediction gets round this problem by allowing short-cuts to be generated, both so that the correct thing is recognised much more quickly and so that the correct action can be taken much more quickly.
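The timing argument can be made explicit with back-of-envelope arithmetic; the per-neuron delay and chain lengths below are rough illustrative assumptions, not measured values:

```python
# Back-of-envelope latency arithmetic: a long serial chain of neurons is
# too slow for urgent reactions, and a learned predictive short-cut
# collapses it. Delay and chain lengths are illustrative assumptions.

delay_ms = 10                # assumed per-neuron transmission time
analyse_from_scratch = 500   # neurons in a hypothetical full-analysis chain
predictive_shortcut = 20     # neurons in a learned short-cut

print(analyse_from_scratch * delay_ms, "ms: far too slow for urgent reactions")
print(predictive_shortcut * delay_ms, "ms: fast enough, thanks to prediction")
```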
-
^
There are over 4000 papers currently listed in Google Scholar that use the phrase 'Predictive brain', but fewer than
100 of these are
from before the year 2000.
-
^
Examples of whole books on the subject of prediction in the brain:
The Predictive Mind - Jakob Hohwy 2013
Surfing Uncertainty: Prediction, Action, and the Embodied Mind - Andy Clark 2016
Both of these authors are philosophers rather than neuroscientists.
-
^
Surfing Uncertainty - Prediction, Action and the Embodied Mind - Clark 2016 Oxford University Press
doi: 10.1093/acprof:oso/9780190217013.001.0001
There are five occurrences of the phrase “probabilistic prediction machine” in the book. For example, page 53, start of chapter 2 “Adjusting the Volume (Noise, Signal, Attention)”, under the heading “2.1 Signal Spotting”:
“...the on-board probabilistic prediction machine that underpins our contact with the world.”
Page 57, under the heading “2.3 The Delicate Dance between Top-Down and Bottom-Up”:
“Driving fast along an unfamiliar winding mountain road, we need to let sensory input take the lead. How is a probabilistic prediction machine to cope?”
-
^
Ibid. Surfing Uncertainty - Prediction, Action and the Embodied Mind
Pages 3-4 in Introduction:
“For to be able to predict the play of sensory data at multiple spatial and temporal scales just is, or so I shall argue, to encounter the world as a locus of meaning. It is to encounter, in perception, action, and imagination, a world that is structured, populated by organism-salient distal causes, and prone to evolve in certain ways. Perception, understanding, action and imagination, if PP [Predictive Processing] is correct, are constantly co-constructed courtesy of our ongoing attempts at guessing the sensory signal. That guessing ploy is of profound importance. It provides the common currency that binds perception, action, emotion, and the exploitation of environmental structure into a functional whole. In contemporary cognitive scientific parlance, this ploy turns upon the acquisition and deployment of a 'multilayer probabilistic generative model'.”
-
^
Response to the Edge.org question What do you consider the most interesting recent [scientific] news? What makes it important? - Lisa Feldman Barrett 2015
Opening sentence: “Your brain is predictive, not reactive.”
-
^
How emotions are made - The secret life of the brain - Lisa Feldman Barrett 2017 Pan Books (UK)
or see GoogleScholar.
This book is by the same author who said that the brain is “predictive, not reactive” (see reference above).
In the chapter entitled “How the brain makes emotions”, page 113, third paragraph:
“The infant brain is missing most of the concepts that we have as adults. ... Not surprisingly, the infant brain does not predict well. A grown-up brain is dominated by prediction, but an infant brain is awash in prediction error. So babies must learn about the world from sensory input before their brains can model the world. This learning is a primary task of the infant brain. At first, much of the onslaught of sensory input is new to an infant’s brain, and its significance is undetermined, so little will be ignored. ... Infants absorb the sensory input around them and learn, learn, learn. The developmental psychologist Alison Gopnik describes babies as having a 'lantern' of attention that is exquisitely bright but diffuse. In contrast, your adult brain has a network to shut out information that might sidetrack your predictions, allowing you to do things like read this book without distraction. You have a built-in 'spotlight' of attention that illuminates some things, such as these words, while leaving other things in the dark. The infant brain’s 'lantern' cannot focus in this manner. As the months pass, if everything is working properly, the infant brain begins to predict more effectively. Sensations from the outside world have become concepts in the infant’s model of the world.”
-
^
Being You - A new science of consciousness - Anil Seth Faber & Faber London 2021
Page 80, third paragraph, in the chapter entitled “Perceiving from the inside out”:
“The first glimmers of a top-down theory of perception emerge in ancient Greece, with Plato’s Allegory of the Cave. Prisoners, chained and facing a blank wall all their lives, see only a play of shadows cast by objects passing by a fire behind them, and they give the shadows names, because for them the shadows are what is real. The allegory is that our own conscious perceptions are just like these shadows, indirect reflections of hidden causes that we can never directly encounter.”
-
^
Ibid. Being You - A new science of consciousness
Page 81, second paragraph:
“Helmholtz proposed the idea of perception as a process of 'unconscious inference'. The contents of perception, he argued, are not given by sensory signals themselves but have to be inferred by combining these signals with the brain’s expectations or beliefs about their causes. In calling this process unconscious, Helmholtz understood that we are not aware of the mechanisms by which perceptual inferences happen, only of the results.”
-
^
Ibid. Being You - A new science of consciousness
Page 107, second paragraph in chapter 5 entitled “The Wizard of Odds”:
“By minimising prediction errors everywhere and all the time, it turns out that the brain is actually implementing Bayes’ rule. More precisely, it is approximating Bayes’ rule.”
-
^
Treatise on Physiological Optics, Volume III -
Hermann von Helmholtz 1867, translated from German by James P. C. Southall 1925
downloadable here.
Page 4, in the chapter headed “Concerning the Perceptions in General”:
“...activities that lead us to infer that there in front of us at a certain place there is a certain object of a certain character, are generally not conscious activities, but unconscious ones. In their result they are equivalent to a conclusion, to the extent that the observed action on our senses enables us to form an idea as to the possible cause of this action; although, as a matter of fact, it is invariably simply the nervous stimulations that are perceived directly, that is, the actions, but never the external objects themselves.”
-
^
Ibid. Treatise on Physiological Optics, Volume III
Page 23:
“The idea of a single individual table which I carry in my mind is correct and exact, provided I can deduce from it correctly the precise sensations I shall have when my eye and my hand are brought into this or that definite relation with respect to the table. Any other sort of similarity between such an idea and the body about which the idea exists, I do not know how to conceive. One is the mental symbol of the other.”
-
^
Principles of Neural Science - Sixth edition - Kandel et al. McGraw-Hill US 2021 - or see GoogleScholar.
Page 721, in chapter 30 “Principles of Sensorimotor Control” under the heading “Estimation of the Body’s Current State Relies on Sensory and Motor Signals”:
“The concept of motor prediction was first considered by Helmholtz when trying to understand how we localize visual objects. To calculate the location of an object relative to the head, the central nervous system must know both the retinal location of the object and the gaze direction of the eye. Helmholtz’s ingenious suggestion was that the brain, rather than sensing the gaze direction, predicted it based on a copy of the motor command to the eye muscles. Helmholtz used a simple experiment on himself to demonstrate this. If you move your own eye without using the eye muscles (cover one eye and gently press with your finger on your open eye through the eyelid), the retinal locations of visual objects change. Because the motor command to the eye muscles is required to update the estimate of the eye’s state, the predicted eye position is not updated. However, because the retinal
image has changed, this leads to the false percept that the world must have moved.”
I have not yet managed to locate the source of this text in the work of Helmholtz.
-
^
Perceptual illusions and brain models -
Gregory 1968
doi: 10.1098/rspb.1968.0071 downloadable
here or see
GoogleScholar.
(All papers of Richard Gregory are available at Richard Gregory - papers)
Page 6, from sixth paragraph of left-hand column:
“Perception seems, then, to be a matter of 'looking up' stored information of objects, and how they behave in various situations. Such systems have great advantages. ... Systems which control their output directly from currently available input information have serious limitations. In biological terms, these would be essentially reflex systems. Some of the advantages of using input information to select stored data for controlling behaviour, in situations which are not unique to the system, are as follows:
1. In typical situations they can achieve high performance with limited information transmission rate. It is estimated that human transmission rate is only about 15 bits/second. They gain results because perception of objects - which are redundant - requires identification of only certain key features of each object.
2. They are essentially predictive. In typical circumstances, reaction-time is cut to zero.
3. They can continue to function in the temporary absence of input; this increases reliability and allows trial selection of alternative inputs.
4. They can function appropriately to object-characteristics which are not signalled directly to the sensory system. This is generally true of vision, for the image is trivial unless used to 'read' non-optical characteristics of objects.
5. They give effective gain in signal/noise ratio, since not all aspects of the model have to be separately selected on the available data, when the model has redundancy. Provided the
model is appropriate, very little input information can serve to give adequate perception and control.
There is, however, one disadvantage of 'internal model' look-up systems, which appears inevitably when the selected stored data are out of date or otherwise inappropriate. We may with some
confidence attribute perceptual illusions to selection of an inappropriate model, or to mis-scaling of the most appropriate available model.”
-
^
The Helmholtz Machine - Dayan, Hinton, Neal and Zemel 1994
doi: 10.1162/neco.1995.7.5.889
downloadable here or see GoogleScholar.
Beginning of introduction, page 1:
“Following Helmholtz, we view the human perceptual system as a statistical inference engine whose function is to infer the probable causes of sensory input. We show that a device of this kind can learn how to perform these inferences without requiring a teacher to label each sensory input vector with its underlying causes.”
And page 8, second paragraph:
“The Helmholtz machine is closely related to other schemes for self-supervised learning that use feedback as well as feedforward weights. ...the Helmholtz machine treats self-supervised learning as a statistical problem - one of ascertaining a generative model which accurately captures the structure in the input examples.”
-
^
On Entropy, Information, and Conservation of Information - Cengel 2021
doi: 10.3390/e23060779
downloadable here or see GoogleScholar.
Start of abstract:
“The term entropy is used in different meanings in different contexts, sometimes in contradictory ways, resulting in misunderstandings and confusion. The root cause of the problem is the close resemblance of the defining mathematical expressions of entropy in statistical thermodynamics and information in the communications field, also called entropy, differing only by a constant factor with the unit 'J/K' in thermodynamics and 'bits' in the information theory.”
-
^
Ibid. On Entropy, Information, and Conservation of Information
In the section headed “4. Information and Entropy”, last paragraph of page 10, to page 11:
“Information (or entropy) in physical sciences and in the communications field is proportional to the number of possible states or configurations N with non-zero probability. At a given time, the probability of any of the possible states of an equiprobable system is p = 1/N. These possible states may be reshuffled as time progresses. The larger the number of allowed states N is, the larger the information, the larger the uncertainty or the degrees of freedom to keep track of, and thus the larger what is not known. Therefore, ironically, information in physical and information sciences turns out to be a measure of ignorance, not a measure of knowledge...”
-
^
Physical Memoirs, Selected and Translated from Foreign Sources, Volume 1, Part 1 - Helmholtz 1882, published Taylor & Francis, 1888
downloadable here or see
GoogleScholar.
In the second section starting on page 43 entitled “On the thermodynamics of Chemical Processes”, page 49 onwards entitled “Idea of Free Energy”, page 55 third paragraph:
“For isothermal changes the function δ coincides, as we have seen, with the value of the potential energy for work-values convertible without limit. I propose therefore to style this quantity the 'free energy' of the system of bodies.”
-
^
Relating thermodynamics to information theory: the equality of free energy and mutual information - Feinstein 1986
doi: 10.7907/XVQB-7902 downloadable
here or see
GoogleScholar.
Fourth sentence of abstract, page iv:
“Thermodynamic free energy measures the approach of the system toward equilibrium. Information theoretical mutual information measures the loss of memory of initial state. We regard the free energy and the mutual information as operators which map probability distributions over state space to real numbers.”
-
^
Autoencoders, minimum description length and Helmholtz free energy - Hinton and Zemel 1994
downloadable here or see
GoogleScholar.
Last paragraph of discussion, page 10:
“In this paper we have shown that an autoencoder network can learn factorial codes by using non-equilibrium Helmholtz free energy as an objective function. ... We anticipate that the general approach described here will be useful for a wide variety of complicated generative models. It may even be relevant for gradient descent learning in situations where the model is so complicated that it is seldom feasible to consider more than one or two of the innumerable ways in which the model could generate each observation.”
-
^
Whatever next? Predictive brains, situated agents, and the future of cognitive science
- Andy Clark 2013
doi: 10.1017/S0140525X12000477
downloadable here or see
GoogleScholar.
Pages 2-3:
“Predictive coding itself was first developed as a data compression strategy in signal processing. Thus, consider a basic task such as image transmission: In most images, the value of one pixel regularly predicts the value of its nearest neighbors, with differences marking important features such as the boundaries between objects. That means that the code for a rich image can be compressed (for a properly informed receiver) by encoding only the 'unexpected' variation: the cases where the actual value departs from the predicted one. What needs to be transmitted is therefore just the difference (a.k.a. the 'prediction error') between the actual current signal and the predicted one. This affords major savings on bandwidth, an economy that was the driving force behind the development of the techniques by James Flanagan and others at Bell Labs during the 1950s. Descendents [sic] of this kind of compression technique are currently used in JPEGs, in various forms of lossless audio compression, and in motion-compressed coding for video.”
-
^
Ibid. Whatever next? Predictive brains, situated agents, and the future of cognitive science
Beginning of abstract:
“Brains, it has recently been argued, are essentially prediction machines. They are bundles of cells that support perception and action by constantly attempting to match incoming sensory inputs with top-down expectations or predictions. This is achieved using a hierarchical generative model that aims to minimize prediction error within a bidirectional cascade of cortical processing. Such accounts offer a unifying model of perception and action, illuminate the functional role of attention, and may neatly capture the special contribution of cortical processing to adaptive success.”
-
^
Ibid. Whatever next? Predictive brains, situated agents, and the future of cognitive science
Note 5 on page 22:
“In speaking of 'predictive processing' rather than resting with the more common usage 'predictive coding', I mean to highlight the fact that what distinguishes the target approaches is not simply the use of the data compression strategy known as predictive coding. Rather, it is the use of that strategy in the special context of hierarchical systems deploying probabilistic generative models. Such systems exhibit powerful forms of learning and are able flexibly to combine top-down and bottom-up flows of information within a multilayer cascade”
-
^
How does the brain do plausible reasoning? - Jaynes 1988
downloadable here or see GoogleScholar or see
Google books
Start of abstract:
“We start from the observation that the human brain does plausible reasoning in a fairly definite way. It is shown that there is only a single set of rules for doing this which is consistent and in qualitative correspondence with common sense. These rules are simply the equations of probability theory, and they can be deduced without any reference to frequencies. We conclude that the method of maximum-entropy inference and the use of Bayes’ theorem are statistical techniques fully as valid as any based on the frequency interpretation of probability.”
Page 15:
“Shannon’s theorem 2 tells us that the consistent measure of the 'amount of uncertainty' in a probability distribution is its entropy, and therefore we must choose the distribution which has maximum entropy subject to the constraints. Any other distribution would represent an arbitrary assumption of some kind of information which was not given to us.”
Unfortunately, the last page, containing the last four references for this paper, is missing from all sources I have found.
-
^
A Mathematical Theory of Communication - Shannon 1948
doi: 10.1002/j.1538-7305.1948.tb01338.x
downloadable here or see GoogleScholar.
Page 11 concerning theorem 2:
“The form of H will be recognized as that of entropy as defined in certain formulations of statistical mechanics...
H is then, for example, the H in Boltzmann’s famous H theorem.”
-
^
Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects - Rao and Ballard 1999
doi: 10.1038/4580
downloadable here or see
GoogleScholar.
Start of abstract:
“We describe a model of visual processing in which feedback connections from a higher- to a lower-order visual cortical area carry predictions of lower-level neural activities, whereas the feedforward connections carry the residual errors between the predictions and the actual lower-level activities.”
End of introduction, bottom of page 79:
“Using a hierarchical model of predictive coding, we show that visual cortical neurons with extra-classical RF [Receptive Field] properties can be interpreted as residual error detectors, signaling the difference between an input signal and its statistical prediction based on an efficient internal model of natural images.”
Under the heading “Results” and “Hierarchical Predictive Coding Model”, page 80:
“Each level in the hierarchical model network (except the lowest level, which represents the image) attempts to predict the responses at the next lower level via feedback connections. The error between this prediction and the actual response is then sent back to the higher level via feedforward connections. This error signal is used to correct the estimate of the input signal at each level... The prediction and error-correction cycles occur concurrently throughout the hierarchy, so top-down information influences lower-level estimates, and bottom-up information influences higher-level estimates of the input signal.”
-
^
YouTube video - “Ransom & Fazelpour’s Intro to 'Three Problems For Predictive Coding Theory Of Attention'” - Ransom and Fazelpour 2016
The summary of Predictive Processing is taken partly from this video, which is an accompaniment to the online paper
Three Problems for the Predictive Coding Theory of Attention - Ransom and Fazelpour 2015.
The video contains a useful introduction to the theory, as well as a description of a possible problem with the theory, and the online paper has a number of thoughts and answers at the end.
The following quote is from a slide in the YouTube video at 4' 45":
“Attention is the process of selecting the prediction error expected to be most precise and revising perceptual hypotheses on this basis.”
-
^
The free-energy principle: a rough guide to the brain? -
Friston 2009
doi: 10.1016/j.tics.2009.04.005
downloadable here or see GoogleScholar.
Third line of Introduction (first page, numbered page 293):
“...any adaptive change in the brain will minimize free-energy.”
-
^
Ibid. The free-energy principle: a rough guide to the brain?
Page 299, under the heading “Attention and precision”, second paragraph:
“...attention is simply the process of optimising precision [of prediction errors] during hierarchical perceptual inference.”
-
^
Driven by compression progress
(or here) - Schmidhuber 2009
doi: 10.1007/978-3-642-02565-5_4 downloadable
here or see
GoogleScholar.
Introduction to section 3 on page 12:
“... predictors and compressors are closely related. Any type of partial predictability of the incoming sensory data stream can be exploited to improve the compressibility of the whole.”
Page last uploaded
Sat Mar 2 02:55:43 2024 MST