\chapter{Introduction}

There are many ways to understand the brain, from directly manipulating the activity of neurons {\em in vitro} to observing the behavior of patients who have suffered a stroke. Among the most powerful tools we have today for understanding the brain are mathematical models and computer simulations. This approach started sometime in the ’40s but has become prominent in the past decade due to a combination of factors, including the recent rise of deep-learning neural networks in AI and the successes of reinforcement learning in robotics and automation.

Over the course of many decades, a core set of computational frameworks has become prominent in the field of computational neuroscience. These frameworks are remarkable for many reasons. Among them: they have all enjoyed enormous popularity, at different times, in computer science, Artificial Intelligence, and engineering; they have all found their way into cognitive science; and they have all given rise to their own specialized, consistent, and well-defined sub-fields.

Most importantly, all of these frameworks have become essential tools for understanding one or more aspects of the functional neuroanatomy of the brain.

Intuitively, the brain is a complex system that has evolved to solve multiple problems at the same time. Each of these frameworks has evolved from an original simple question (“How should one learn from reward?”, “How should one form memories?”, “What is the best way to recognize objects?”). So, none of these frameworks really answers the question, “How does the brain work?”. However, all of them provide partial answers; the brain does learn from rewards; it does memorize facts; and it does recognize objects. Thus, in a way, these frameworks provide important insights into how certain parts of the brain work, and why they work precisely the way they do. Some of these answers partially overlap; fading memories can be used to learn better from rewards, for example.

\section{What is a Model?}

All of these frameworks attack these problems from a modeling angle. There are many definitions of what a model is but, in the simplest possible terms, a model is just an abstract, simplified representation of a complex system. The model usually simplifies certain characteristics of the system and explicitly captures its internal workings in a set of formal equations or computational processing steps. Once these workings are captured, researchers can do a variety of things.

\subsection{Explanation and Prediction}

There are at least two reasons why the computational approach is important. And, although intertwined, they are also separate.

The first is explanation. When we understand what a circuit does, we can gather insight into why our data looks the way it does. You might have a puzzling experimental result, and the model might explain why this result occurs in the first place.

The second is prediction. A model that is a good approximation of a system should be able to predict what would happen in that system. An epidemiological model, for example, could be used to predict how many people will be infected by a specific disease in the upcoming days.

There is a tension between the two. In many ways, the difference between explanation and prediction is not as clear-cut as it seems. Because a model, by its very nature, produces an output every time it is run, it is always making a prediction; the difference between explanation and prediction often comes down to whether a prediction is about past or future data, or about existing data versus yet unseen data (even {\em past} unseen data).

\subsection{Models as Functions}

Researchers can use the model to explain previously puzzling patterns of results and, by examining the model, better understand how and why these results arise. They can also use the model to predict what would happen in circumstances that have not been experimentally tested yet. And, finally, they can compare the model to data, and examine whether, and to what extent, the model does a good job.

So, a model is a theory of a particular system’s {\em function}. In fact, mathematically, a model can be thought of as a function that connects a set of conditions $X$ to a set of observed outcomes $Y$, i.e., $X \rightarrow f(X) \rightarrow Y$. The model captures how and why the initial variables $X$ affect the outcome.
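To make the notation concrete, here is a minimal, purely illustrative Python sketch; the particular function and numbers are invented and stand in for a real theory:

\begin{verbatim}
# A toy "model": a function mapping experimental conditions X
# to predicted outcomes Y.  The internal workings (2x + 1) are
# arbitrary placeholders for a real theory.
def model(X):
    return [2.0 * x + 1.0 for x in X]

conditions = [0.0, 1.0, 2.0]      # X
predictions = model(conditions)   # Y = f(X)
print(predictions)                # [1.0, 3.0, 5.0]
\end{verbatim}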

\subsection{Why Should We Understand the Function?}

But why would one care about the function in the first place? After all, experimental scientists do a lot of work characterizing what happens in a system when a set of variables is changed. And having a model does not do away with the need to run experiments: in most cases, nobody would trust a model’s predictions blindly, and most researchers would like to see them verified anyway. So, why would experimental scientists need to use models?

I like to summarize the difference between these two approaches with a metaphor: the difference between the rules of movement of a chess piece and that piece’s function in a game, as in Figure \ref{fig:computation_function}. Consider, for example, the knight. The knight’s movement is the most complicated of all the chess pieces: it moves two squares in one direction (horizontally or vertically) and then one square perpendicular to that, tracing an L-shaped trajectory (Figure \ref{fig:computation_function}, left). But, if we were to observe how the knight is played during a game, we might not be able to make this inference at all. At the beginning of a game, for example, the knight might be brought out very early and placed strategically to defend the center of the board (Figure~\ref{fig:computation_function}, center). At the end of a game, instead, the knight might be used to restrict the movement of the king in preparation for a checkmate (Figure~\ref{fig:computation_function}, right). When we, as neuroscientists, perform experiments on how a certain brain region is being used, we are in fact just observing how the brain might be using the same chess piece under different conditions. If all we can say about the knight is that it is being used to “defend the center at the beginning” and to “attack the king in the endgame”, we are left with little explanatory power. Even worse, if all we can say is that “the knight is used at the beginning and the end of a game”, we are left with not much more knowledge than we started with.

If, on the other hand, we can describe exactly how the knight moves, then we can make sense of all of its functions, {\it explain} why the player has used it that way, and {\it predict} how and when the knight will be used in the future.

\begin{figure}
\centering
\fullfigure{figures/computation_function.png}
\caption{A comparison of the knight’s “computations” ({\it left}) and two of its possible “functions”: defending the center, at the beginning ({\it center}) and supporting a check, in endgame ({\it right}). Although the functions might be different, the computations remain the same, and in fact it is the piece’s computations (the rules of movement) that explain its use in different phases of the game.}
\label{fig:computation_function}
\end{figure}

\section{Two Traditions of Modeling}

In a landmark paper, Breiman\cite{breiman2001statistical} identified two traditions of statistics, one he called {\em data modeling} and one he called {\em algorithmic modeling}.

In the most general terms possible, a model is a function $f(X)$ that connects a set of data $X$ to a set of outcomes $Y$, i.e., $X \rightarrow f(X) \rightarrow Y$. This extremely general definition holds for anything we might want to call a “model”: it works for detailed models of brain networks as well as for statistical models. Whether you are creating a large neural network that simulates how the visual brain sees the world or you are just fitting a linear regression model, you are doing the same thing: creating a {\em function} that makes sense of the {\em data}.

The difference lies in how these two traditions think about this mysterious function. The data modeling approach works in a sort of top-down manner: it starts with some assumptions about the nature of the data and proceeds to derive predictions about the outcomes from these assumptions. When a statistician states that two variables need to be “independent and normally distributed”, they are doing precisely that: making assumptions about the processes that generate the data; in this case, that the data might be generated by sampling independently from a pool of values with a certain mean and variance. From this assumption, a statistician can derive very precise predictions about the observed outcomes $Y$; for example, they might derive the probability that all outcomes come from the very same pool. Because this approach starts with assumptions about the function $f$ that generates the $Y$, it is called data modeling.

However, one could also be agnostic about these hypotheses, and simply use mathematical tools that approximate the underlying function from the constraints posed by the data itself. Breiman called this approach “algorithmic”; nowadays, it is commonly known as {\em machine learning}.

These two traditions reverberate throughout any modeling approach, in any field I have ever seen. In old-fashioned, symbolic AI, adherents of the two traditions were sometimes called “neats” and “scruffies”\footnote{I personally love these terms. Full disclosure: I am a scruffy.}, and in Computational Psychiatry, for example, the two approaches are called “explanatory models” and (much more transparently) “machine learning”.

Confusingly, the same concepts, abstractions, and techniques are sometimes used in both approaches. Take, for instance, the technique called {\em linear regression}: it consists of a simple model in which the effects of a series of independent variables add up to determine the value of a dependent variable. This modeling technique is commonly used in statistics to test for the existence of a relationship between two variables (an example of data modeling), but it can also be used to approximate an unknown function, as in the regularized variant known as LASSO (an example of machine learning). Even more dramatically, neural networks can be taken as a structural model of the brain (an example of data modeling) but can also be trained, as we will see, to approximate {\em any} function (which is why they are ubiquitous in contemporary machine learning).
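To see how the same technique can serve both traditions, here is a minimal sketch in Python using simulated data and the scikit-learn library; the predictors, coefficients, and penalty value are all arbitrary choices made for illustration:

\begin{verbatim}
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))     # five candidate predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

# Data-modeling use: fit ordinary least squares and inspect the
# coefficients as estimates of an assumed generative process.
ols = LinearRegression().fit(X, y)
print("OLS coefficients:", ols.coef_)

# Machine-learning use: fit a LASSO, which shrinks coefficients to
# approximate the underlying function; here we mostly care about
# how well it predicts new observations.
lasso = Lasso(alpha=0.1).fit(X, y)
print("LASSO coefficients:", lasso.coef_)
print("Prediction for a new observation:", lasso.predict(X[:1]))
\end{verbatim}

The code is identical up to the choice of estimator; what differs is the question being asked of it.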

These ambiguities notwithstanding, in this book I will focus on computational models of the {\em explanatory} tradition. All of the models described here embody a theory about a specific brain function (learning, memory, perception) and use different abstractions to make sense of what we know.

\section{An Example of an Explanatory Model: Fitts’ Law}

To understand the different facets of an explanatory model, let’s consider a simple one. It is a mathematical model of response times for motor movements, known as Fitts’ Law \cite{fitts1954information}. Fitts’ Law is an equation that predicts the time required to move a hand (or a cursor, or a pen) to a target area that has width $W$ and is located at a distance $D$ from the current position of the hand. Figure \ref{fig:intro:fitts} illustrates a typical example: calculating the time needed to move a cursor from one position to a different area. According to Fitts’ Law, the time $T$ is related to width $W$ and distance $D$ by Equation \ref{eq:intro:fittslaw}:

\begin{equation}
T = a + b \log_2 \left( \frac{2D}{W} \right)
\label{eq:intro:fittslaw}
\end{equation}

\begin{marginfigure}
\centering
\caption{A possible application of Fitts’ law: Determining the time it takes to use a mouse to move a cursor (black arrow) from its original position to a new area (the grey circle)}
\label{fig:intro:fitts}

\begin{tikzpicture}[x=1cm, y=1cm, >=stealth, node distance=1.25cm]
\draw (0,0) rectangle (5,4);
\draw[fill=lightgray] (3.5, 1.5) circle (1);
\draw[thick,dashed, ->] (0.5,3.5) -- node[anchor=south] {{\em D}} (3.5,1.5);
\draw[thick,dashed, <->] (2.5,1.5) -- node[anchor=north] {{\em W}} (4.5,1.5);
\draw[line width=2,solid, ->] (0.7,3.1) -- (0.5,3.4);
\end{tikzpicture}
\end{marginfigure}

\noindent
where $a$ is an intercept, and can be considered the minimum amount of time required to initiate any movement, while $b$ is a scaling factor, and can be considered a general parameter that captures an individual’s speed of movement.

\subsection{Inside a Model: Fit, Features, and Free Parameters}

If we peek inside a model ({\it any} model) we can find some common elements. First, any model must have an {\it output}. In the case of Fitts’ Law, the output is the movement time $T$. This output might or might not reflect the data; the degree to which the model’s output matches the data is called the model’s {\it fit}.

Second, each model contains certain quantities that capture specific aspects of the outside world and environment. In Eq. \ref{eq:intro:fittslaw}, for example, the quantities $W$ and $D$ (width and distance of the target) represent everything we need to know about the world in which we need to make a movement. These variables are called {\it features}; in choosing the appropriate features, the designer of a model implicitly defines the level of abstraction and the degree of simplification they want to impose on the world. Note that, once the level of abstraction is chosen, the features are, in principle, measurable properties of the outside world. (A partial exception to this rule is represented by contemporary deep-learning models, which are trained on raw data and are capable of extracting features on their own.)

Finally, Eq. \ref{eq:intro:fittslaw} contains two more variables, $a$ and $b$. Unlike $D$ and $W$, they do not represent measurable properties of the world; in fact, there is no way they can be measured independently of the equation itself. These variables, which mediate the effect of the features (the outside world) on the output, are called {\it free parameters}. One of the defining characteristics of explanatory models is that, because they embody a theory, it is somewhat clear what their parameters represent. By looking at Equation \ref{eq:intro:fittslaw}, it is clear that no response time can ever be smaller than $a$; thus, $a$ can be thought of as the smallest time it takes to initiate a movement on the given device. The parameter $b$, on the other hand, mediates the additional time it takes to move a cursor to a given location. Thus, we can think of $b$ as representing the {\em effort} necessary to control the movement itself. In general, a movement will be slower as $D$ grows and faster as $W$ grows, but some individuals will be faster overall, while others will need more time to move the cursor; these differences will be reflected in different values of $b$, and we can say that, when $b$ is smaller, the amount of effort that the movement takes is also smaller.
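To make these elements concrete, Equation \ref{eq:intro:fittslaw} can be written as a short Python function; the parameter values in the example call below are placeholders, not fitted estimates:

\begin{verbatim}
import math

# Fitts' Law: D and W are the features (distance and target width);
# a and b are the free parameters; the return value T is the output.
def fitts_time(D, W, a, b):
    return a + b * math.log2(2 * D / W)

# Example call with placeholder parameter values.
print(fitts_time(D=300, W=100, a=1.0, b=1.0))
\end{verbatim}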

\subsection{Fitting a Model}

But, now that we have identified features and free parameters, how do we know whether our model is any good?

To do so, we need to find the specific parameter values of the model that best fit the data. In the case of Fitts’ model, this comes down to finding the values of $a$ and $b$ that minimize the difference between the model’s predictions $Y'$ and the actual data $Y$. This difference, or any other quantity we want to minimize, is called the {\em loss} function.

Suppose, for instance, that you ran an experiment varying the distance $D$ from a cursor to a target area of width $W$, and you obtained the data in Table \ref{tab:intro:fitts_data}.

\begin{margintable}
\centering
\begin{tabular}{c|c|c}
\hline
$W$ & $D$ & Time ($s$) \\
\hline
100 & 300 & 3.36 \\
150 & 50 & 1.08 \\
220 & 100 & 1.16 \\
110 & 200 & 3.07 \\
40 & 250 & 4.12 \\
\hline
\end{tabular}
\caption{Results from a hypothetical experiment with the setup of Figure \ref{fig:intro:fitts}, with varying values of the distance $D$ and the target width $W$.}
\label{tab:intro:fitts_data}
\end{margintable}

We can define a loss function that quantifies how close the predictions of Fitts’ model come to the times recorded in the third column of the table. For each combination of values of the parameters $a$ and $b$, we can plug in the different values of $W$ and $D$ and compare the model’s predictions $Y'$ against the five observed values $Y$. A convenient way to do so is to calculate the sum of squared differences, much as is done in statistics:

\begin{equation*}
L = \sum_{y \in Y} (y - y')^2
\end{equation*}

Fitting a model is, therefore, the process of finding the values of parameters $a$ and $b$ that minimize the output of this equation.
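As a minimal sketch, this loss function can be computed in a few lines of Python for the data of Table \ref{tab:intro:fitts_data}; the candidate parameter values in the last line are arbitrary:

\begin{verbatim}
import math

# (W, D, observed time) triplets from the table above.
data = [(100, 300, 3.36), (150, 50, 1.08), (220, 100, 1.16),
        (110, 200, 3.07), (40, 250, 4.12)]

# Sum of squared differences between the model's predictions
# and the observed movement times.
def loss(a, b):
    return sum((t - (a + b * math.log2(2 * d / w))) ** 2
               for w, d, t in data)

print(loss(1.0, 1.0))   # loss for one arbitrary pair of parameters
\end{verbatim}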

In some cases, you might be lucky enough that there are specific formulae that let you calculate the ideal parameters in a few simple steps. This is the case for linear regression. It also happens to be the case for Fitts’ Law. If you consider Equation \ref{eq:intro:fittslaw}, you will notice that it is essentially a linear equation of the form $y = a + bx$, once you consider $\log_2(2D/W)$ as your independent variable $x$. To find the values of $a$ and $b$ that produce the best fit, you simply combine each pair of $W$ and $D$ into a single variable $x$, collect these values into a matrix $X$ (together with a column of ones for the intercept), and apply the linear regression formula $(X^TX)^{-1} X^T Y$.

The result, in this case, is $a = 1.4438$ and $b = 0.7560$. With these parameter values, the loss function is only 0.093. Notice that, to calculate these values, we had to change the model’s {\em features}: Fitts’ model sees the world as made of distances and widths, but the linear regression model sees only a single value $x$. This is a case in which you need to transform the data to fit it into the model’s worldview. But is this a good value? You can judge for yourself: Figure \ref{fig:intro:fitts_linear} shows the predictions of Fitts’ Law, represented as a line, against the experimental results.

\begin{marginfigure}
\centering
\includegraphics{figures/intro/fitts_linear.png}
\caption{Predictions of Fitts’ Law (red dashed line) against the experimental results of Table \ref{tab:intro:fitts_data} (blue dots). The values of features $W$ and $D$ have been combined into a single value $x$, and the parameters $a$ and $b$ were fit with linear regression}
\label{fig:intro:fitts_linear}
\end{marginfigure}
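The same least-squares calculation can be sketched in a few lines of NumPy; when run on the data of Table \ref{tab:intro:fitts_data}, the estimates should come out close to the values reported above:

\begin{verbatim}
import numpy as np

W = np.array([100, 150, 220, 110, 40], dtype=float)
D = np.array([300, 50, 100, 200, 250], dtype=float)
Y = np.array([3.36, 1.08, 1.16, 3.07, 4.12])

# Combine the two features into the single regressor x = log2(2D/W),
# and add a column of ones so that the intercept a can be estimated.
x = np.log2(2 * D / W)
X = np.column_stack([np.ones_like(x), x])

# Closed-form least-squares solution: (X^T X)^{-1} X^T Y.
a, b = np.linalg.inv(X.T @ X) @ X.T @ Y
print(a, b)
\end{verbatim}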

But what if we cannot use a direct formula to calculate the best values of a model’s parameters? In the most general case, it is possible to use brute force and examine multiple values of $a$ and $b$ until we identify the combination that minimizes our loss function. For example, one could sample all values of $a$ and $b$ from 0.5 to 1.5 in increments of 0.01, and compute the loss function for each combination. This approach, called {\em grid search}, gives you an approximate idea of the fit of Fitts’ model within a slice of its parameter space, as shown in Figure \ref{fig:intro:fitts_grid}. In the figure, colors represent the magnitude of the loss function, and darker areas represent smaller loss values and, thus, better fits. The cross sign “+” marks the position, in the parameter space, that corresponds to the solution found by linear regression.

\begin{marginfigure}
\centering
\includegraphics{figures/intro/fitts_grid.png}
\caption{Loss function for the Fitts model across different values of parameters $a$ and $b$, when compared against the data in Table \ref{tab:intro:fitts_data}.}
\label{fig:intro:fitts_grid}
\end{marginfigure}
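One way to implement the grid search described above is sketched below, reusing the \texttt{loss()} function from the earlier sketch; the sampling range and step follow the example in the text:

\begin{verbatim}
import numpy as np

values = np.arange(0.5, 1.5, 0.01)   # 100 candidate values each for a and b
grid = np.array([[loss(a, b) for b in values] for a in values])

# Locate the smallest loss over the sampled slice of parameter space.
i, j = np.unravel_index(np.argmin(grid), grid.shape)
print("best a:", values[i], "best b:", values[j], "loss:", grid[i, j])
\end{verbatim}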

However, this brute-force, grid-search approach is rarely used in practice, as sampling all of the parameter combinations is often unfeasible, especially as models become more complex and take longer to run. For example, the plot in Figure \ref{fig:intro:fitts_grid} was generated by examining 10,000 combinations of $a$ and $b$ values; such a sample might not always be feasible. Furthermore, grid search requires setting a predefined sampling step that discretizes the possible values of $a$ and $b$. For example, the grid search examined the cases in which $a=1.44$ and $a=1.45$, but never examined the case in which $a = 1.4438$; such a value would, in fact, be invisible to the method.

For all of these reasons, it is common to use special techniques called {\em optimization algorithms} instead of grid searches. These algorithms capitalize on the fact that, in most models, similar parameter values produce similar values of the loss function. In Fitts’ Law, for example, changing the value of $a$ from 1.44 to 1.45 does not produce appreciable changes in the loss function, no matter what the value of $b$ is. Furthermore, the direction of the changes in the loss function is usually consistent: if changing $a$ from 1.44 to 1.45 increases the loss function, then a further change of $a$ to 1.46 will likely result in an even larger loss value. In other words, the surface of the loss function over the two parameters’ values is smooth. And smooth functions can be explored fairly easily by finding the direction in which the parameters can be changed to reduce the loss function. This is exactly what optimization algorithms do: they start with an initial guess for the model parameters, and modify them iteratively in the direction that reduces the loss function, until a minimum value is found\footnote{For this reason, these algorithms are also called {\em minimization} algorithms.}. Optimization algorithms explore only a small portion of the parameter space, but they quickly converge on the correct solution. Figure \ref{fig:intro:fitts_optimize} depicts the points (in white) explored by one such method, the Nelder-Mead algorithm, to find the values of $a$ and $b$ that minimize the loss function of Fitts’ Law, starting at $a=1$, $b=1$ and terminating at the same values that were identified by linear regression.

\begin{marginfigure}
\centering
\includegraphics{figures/intro/fitts_optimize.png}
\caption{Points of the parameter space explored by the Nelder-Mead algorithm}
\label{fig:intro:fitts_optimize}
\end{marginfigure}
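In practice, one rarely implements such algorithms by hand. A minimal sketch using SciPy’s implementation of the Nelder-Mead method, reusing the \texttt{loss()} function defined earlier and starting from the same initial guess used in the text, might look like this:

\begin{verbatim}
from scipy.optimize import minimize

# Minimize the loss over (a, b), starting from the guess a = 1, b = 1.
result = minimize(lambda p: loss(p[0], p[1]), x0=[1.0, 1.0],
                  method="Nelder-Mead")
print(result.x)   # should land close to the linear-regression solution
\end{verbatim}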

\section{Models as Theories and Models as Measures}

Every explanatory model is an abstract, simplified representation of some system—a theory of how the system works. When fitting a model, however, researchers might be interested in two very different things: the theory itself, and the properties of the system that the theory allows them to measure.

\subsection{Models as Theories}

In the first case, the researchers might be interested in the model itself. Every model, in a sense, is a {\em theory}, and the researchers might have developed the model as a new theory that explains how and why a particular set of phenomena occurs. Fitting the model to empirical data is done as a way to provide a quantitative measure of how good a theory the model is.

This is exactly how Fitts’ Law was originally conceived: it was proposed as a principled way to make sense of motor movements. Like much of the mathematical psychology of the ’50s (such as the work of Hick \cite{hick1952rate} and Hyman \cite{hyman1953stimulus} on response times in the presence of multiple options), it was deeply influenced by information theory\footnote{The use of logarithms in base 2 is a dead giveaway!}. In his paper, Fitts derived the equation from a purely theoretical point of view. In this sense, the model was a representation of the computational problem that brains face when performing a movement. A very {\it successful} representation, as evidenced by the sheer number of citations that the original paper keeps amassing over the years. In the original paper, experimental data were used to corroborate the theoretical intuition.

%\subsection{Comparing Models: Fit and Complexity}

%Often, researchers who are proposing a new model attempt to compare it to {\em other} models or theories. Imagine, for example, that you are coming up with a better model of how
%Now, since we introduced the topic of comparing models.
% In general, it is easier to increase the fit of a model by adding more parameters. For this reason, the number of parameters is often considered a proxy for the complexity of a model. This is, of course, an extreme simplification: It is possible to create extremely complex and flexible functions that contaion few parameters. However, at least within models of the same family

\subsection{Models as Measures}

There is a second view of models, which is related to and easily confused with the first one, but remains substantially different. In this view, researchers typically {\em assume} that a given model is correct, and are interested in the values of the model’s parameters.

Since it was originally proposed, Fitts’ Law has become a {\it de facto} assumption in much research in Human Factors and Human-Computer Interaction—that is why it is called Fitts’ {\em Law}. In this case, the model is not used as a representation, but as a measurement method. It is assumed that the model is a true, or at least sufficiently accurate, representation, and it is used to make predictions. Do you want to calculate how easy it is to move a mouse to a “submit” button on a login page? You can use Fitts’ Law. Have you created a new interface that uses special laser pointers aimed at an LCD screen? It will still follow Fitts’ Law, but you might need to collect some data and establish new values of $a$ and $b$ for your interface.

Because the parameters of an explanatory model are conceptually clear, it is also possible to use the model to make sense of complex data patterns. As noted above, the $a$ and $b$ parameters can be interpreted as the time it takes to initiate a movement and the effort it takes to perform it. Suppose you want to investigate whether it is easier to use a computer mouse or a touch device (like a smartphone’s screen) to drag a window. You can collect data and extract the values of $a$ and $b$ for both devices. You might find, for example, that $a$ is smaller for the touch device, as it is faster to point your finger than it is to grab a mouse and click, but that $b$ is smaller for the mouse, as it can cover long distances much more easily than the finger. This would be an insightful analysis, and would help you conclude, for example, that touch devices are better for small screens and mice are better for larger ones.

Parameters can also be used to investigate differences between individuals. In general, some people will be consistently slower or faster at making movements, and these differences will be reflected in different values of $a$ and $b$.

An entire class of explanatory models described herein, that of accumulator models, was designed precisely as a way to measure unobserved but conceptually clear processes from the distribution of response times—and they are incredibly successful at it.

\section{Levels and Traditions of Models}

As I mentioned at the beginning of this chapter, a useful model is simpler than the object it is trying to simulate. When deciding what to model, one of the fundamental choices is the level of abstraction at which you intend to capture the phenomenon of interest. Consider, for example, the case of a scientist interested in developing a model of memory. One choice would be to start from first principles: what is memory for? If memory is necessary to make relevant events of the past available, it likely mirrors the statistics of the environment, so that memories of very common events are more likely to be remembered than memories of less likely ones.

A different approach would be to abstract some feature of memory and try to capture it with a simple mechanism. For example, memories tend to fade with time; thus, we can imagine memories being discrete entities whose availability decays over time. We could borrow the metaphor of radioactive decay, and approximate forgetting with an exponential decay.
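As a purely illustrative sketch of this metaphor, forgetting can be written as an exponential function of time; the decay constant below is arbitrary, and the exponential form is just the radioactive-decay analogy mentioned above, not a commitment to a particular theory of forgetting:

\begin{verbatim}
import math

# Availability of a memory trace t time units after encoding,
# under an exponential-decay assumption with time constant tau.
def availability(t, tau=10.0):
    return math.exp(-t / tau)

print([round(availability(t), 2) for t in (0, 5, 10, 20)])
\end{verbatim}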

Yet another approach would be to consider where in the brain memory is implemented. We know from patient studies that memories are stored (at least initially) in a circuit known as the hippocampus. The hippocampus, and in particular an area known as CA3, has a particular structure: it is a single layer of interconnected neurons. We can start by modeling this structure, the interaction between the different neurons, and see how memories can be represented in the network of neurons.

All three of these approaches have been attempted, and they are covered in Chapters 4 and 6. David Marr, one of the pioneers of computational approaches in the study of the brain and cognition, proposed a classification of these approaches that has proven influential \cite{marr1982vision}. According to him, each phenomenon could be modeled at three levels:

\begin{itemize}
\item The {\em functional} level\footnote{In Marr’s book, this level is actually called “computational”. This is often confusing, since all of the other models are also computational in the common sense of the word, so I prefer to use “functional” here.}. At this level, the experimenter is investigating the general structure of the problem, and typically asks what would be the most general and optimal solution. A memory researcher who uses a Bayesian approach to capture when memories are more likely to be retrieved is working at this level.

\item The {\em algorithmic} level. At this level, a researcher outlines the basic elements of the model, including how to represent the problem (the features) and how the model works step by step. The memory researcher who adopts the metaphor of radioactive decay and tries to predict the effect of time on forgetting works at this level.

\item The {\em implementation} level. At this level, the modeler seriously considers the nature of the physical processes that occur. The memory researcher who decides to study memory by modeling the interactions of neurons in the hippocampus would be working at this level.
\end{itemize}

Much has been written about these levels of abstraction, and many authors have proposed their own classifications or expanded them. There are no right or wrong levels; each and every level provides different insights into the nature of thought. Similarly, the levels are not so clear-cut. For example, is Fitts’ Law a model that exists at the functional or at the implementation level? And, finally, each of these approaches still remains an abstraction of the original phenomenon—even if you dig deep into the implementation level.

\subsection{Symbolic vs. Connectionist Traditions}

Historically, when the idea of understanding brain function through modeling was still in its infancy, the fields of cognitive science, cognitive psychology, and artificial intelligence were very close and almost indistinguishable. But even then, modelers aligned themselves with two different traditions, which have often been named “symbolic” and “connectionist”. Symbolic models tend to explicitly represent abstract concepts and their relationships, much in the same way as variables are represented in a computer program. They often make for elegant theories, but they tend to overlook the nitty-gritty details of how networks of neurons carry out the computations. For example, in Section \ref{sec:actr:representation}, we will present a model of memory in which the different features of a fact are represented as a list, so that “The canary is a yellow bird” becomes something like “[(Object: Canary), (Type: Bird), (Property: Yellow)]”. The use of such representations has earned this tradition the name “symbolic”.
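As a purely illustrative sketch, such a symbolic representation could be written as a list of feature-value pairs; the notation below is hypothetical, and the actual representation used later in the book may differ:

\begin{verbatim}
# A symbolic encoding of "The canary is a yellow bird" as a list
# of (feature, value) pairs.
canary_fact = [("Object", "Canary"), ("Type", "Bird"),
               ("Property", "Yellow")]

# Symbolic models manipulate such structures directly, for example
# by looking up the value of a feature.
properties = [v for f, v in canary_fact if f == "Property"]
print(properties)   # ['Yellow']
\end{verbatim}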

In the brain, of course, these lists do not exist, and properties such as being a “bird” or being “yellow” are represented in a distributed network of neurons. The specific ways in which the brain encodes and modifies these representations can be ignored only up to a certain point. A group of researchers has argued, since the very beginning, that it is better to start right there, with a better understanding of how networks of neurons represent concepts and carry out computations. This school of thought has given rise to modern neural networks (including those used in contemporary deep-learning AIs) and has become known as “connectionist”.

These two traditions have often been in sharp disagreement with each other. Over the course of decades, they have taken turns dominating the cognitive neurosciences (and also the fields of artificial intelligence and machine learning). As usual, I maintain that there is no unique, correct answer as to which one is best—it largely depends on what one sets out to achieve and (why not?) on individual preferences. These two traditions are reflected in the structure of this book, whose first part contains symbolic models while its second part contains connectionist models.

\section{What This Textbook is {\it Not}}

A few words of caution. This is largely a textbook I have written because I needed to create a consistent set of materials for my own classes at the University of Washington. Although I could refer my students to individual papers or tutorials, I was constantly bothered by the lack of an easy way to integrate the different aspects of my classes. So, eventually, during my 2020-21 sabbatical, I finally got around to starting this textbook.

Compared to other textbooks on computational neuroscience, the material covered here is much more focused on systems-level neuroscience than on single-neuron properties. Similarly, the focus is much more on large-scale theories (reinforcement learning, memory associators) than on biophysical models.

%\section{Acknowledgements}

%Many thanks to the students and collaborators who read early versions of this textbook and provided helpful feedback: Linxing Jiang, Catherine Sibert, Ellen Xing, Annie C. Yang