#### The traditional reasoning behind the need for nonlinear activation functions is only one aspect of this story.

What do the softmax, ReLU, sigmoid and tanh functions have in common? They are all **activation functions**, and they are all **nonlinear**. But why do we need activation functions in the first place, and specifically nonlinear ones? There is a traditional explanation, and also a new way of looking at it.

The traditional reasoning is this: without nonlinear activation functions, a deep neural network is just a composition of matrix multiplications and bias additions. These are **linear transformations**, and you can prove using linear algebra that **the composition of linear transformations is just another linear transformation.**

So no matter how many linear layers we stack on top of each other, without activation functions our entire model is no better than a linear regression. It will completely fail to capture nonlinear relationships, even simple ones like XOR.

Enter activation functions: by allowing the model **to learn a nonlinear function**, we gain the ability to model all kinds of complicated relationships in the real world.

This story, which you may already know, is completely true. But the study of any subject benefits from different perspectives, especially deep learning with all its interpretability challenges. Today I want to share with you another way of looking at the need for activation functions and what it reveals about the inner workings of deep learning models.

In short, what I want to share with you is this: the way we normally construct deep learning classifiers creates an **inductive bias** in the model. More specifically, **using a linear layer for the output** means that the rest of the model must learn a **linearly separable** transformation of the input. The intuition behind this can be very useful, so I’ll share some examples that will hopefully clear up some of the jargon.

### The traditional explanation

Let us revisit the traditional reasoning for nonlinear activation functions with an example. We consider a simple case: **XOR**.

A plot of the XOR function with ground truth values colored. Background color represents linear regression predictions. Image by author.

Here I trained a linear regression model on the XOR function with two binary inputs (ground truth values are plotted as dots). I plotted the outputs of the regression as background color. The regression learned nothing at all: it guessed 0.5 in all cases.
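To make this concrete, here is a minimal sketch of that failure in NumPy, with an ordinary least-squares fit standing in for the linear regression:

```python
import numpy as np

# The four XOR points and their labels.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Ordinary least squares with a bias column: w0 + w1*x1 + w2*x2.
A = np.hstack([np.ones((4, 1)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
preds = A @ w

print(np.round(w, 4))      # the input weights collapse to ~0, only the 0.5 bias survives
print(np.round(preds, 4))  # every prediction is 0.5
```

By symmetry, each input is uncorrelated with the XOR label, so the best linear fit is the constant mean 0.5.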

Now, instead of a linear model, I’m going to train a very simple deep learning model with MSE loss: **one linear layer with two neurons**, followed by the **ReLU** activation function, and finally the output neuron. For simplicity, I use only weights, no biases.
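In code, this architecture might look like the following PyTorch sketch. The layer sizes and bias-free setup follow the description above; the optimizer and training hyperparameters are my own assumptions, and a given random initialization may not always land on the perfect solution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One hidden linear layer with two neurons, ReLU, then one output
# neuron. bias=False matches the "weights only, no biases" setup.
model = nn.Sequential(
    nn.Linear(2, 2, bias=False),
    nn.ReLU(),
    nn.Linear(2, 1, bias=False),
)

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()  # MSE loss, as in the article

initial_loss = loss_fn(model(X), y).item()
for _ in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(model(X).detach())  # ideally close to [0, 1, 1, 0]
```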

A diagram of our basic neural network. Created with draw.io by author.

What happens now?

Another plot of the XOR function, this time with predictions from a simple deep learning model. Image by author.

Wow, now it’s perfect! What do the weights look like?

```
Layer 1 weight: [[ 1.1485, -1.1486],
                 [-1.0205,  1.0189]]
(ReLU)
Layer 2 weight: [[0.8707, 0.9815]]
```

So for two inputs *x* and *y*, our output is:

output = 0.8707 · ReLU(1.1485·x − 1.1486·y) + 0.9815 · ReLU(−1.0205·x + 1.0189·y)

This is really similar to

ReLU(x − y) + ReLU(y − x) = |x − y|

which you can verify is exactly the XOR function for inputs *x*, *y* ∈ {0, 1}.

If we didn’t have the ReLU, we could simplify our model to 0.001·*y* − 0.13·*x*, a linear function that wouldn’t work at all. So there you have it, the traditional explanation: since XOR is an inherently nonlinear function, it can’t be modeled exactly by a linear function. Even a composition of linear functions won’t work, because that’s just another linear function. **Introducing the nonlinear ReLU function** allows us to capture nonlinear relationships.
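We can check this arithmetic directly. The snippet below runs the forward pass by hand, using the learned weights printed above, and compares the result to true XOR:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

# Learned weights as printed above.
W1 = np.array([[ 1.1485, -1.1486],
               [-1.0205,  1.0189]])
w2 = np.array([0.8707, 0.9815])

preds = []
for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    pred = w2 @ relu(W1 @ np.array([x, y]))
    preds.append(float(pred))
    print((x, y), round(float(pred), 3), "XOR:", x ^ y)

# Rounded to the nearest integer, the predictions reproduce XOR exactly.
```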

### Digging Deeper: Inductive Bias

Now we will work with the same XOR model, but look at it from a different perspective to get a better picture of the **inductive bias** of this model.

What is an inductive bias? Given any problem, there are many ways to solve it. In essence, an inductive bias is something built into the architecture of a model that causes it to choose one method of solving a problem over another.

In this deep learning model, our last layer is a simple linear layer. This means that our model cannot work at all unless the output of the layers immediately before the last one can be solved by linear regression. In other words, **the final hidden state before the output must be linearly separable for the model to work.** This inductive bias is a property of our model architecture, not of the XOR function.

Fortunately, our hidden state in this model has only two neurons. Therefore, we can visualize it in two dimensions. What does it look like?

The input representation for the XOR function is transformed into a hidden representation using deep learning (after one linear layer and ReLU). Background color represents the predictions of a linear regression model. Image by author.

As we saw earlier, a linear regression model alone is not effective on the raw XOR input. But once we pass the input through the first layer and ReLU of our neural network, our output classes can be neatly *separated by a line* (they are **linearly separable**). This means that linear regression will now work, and in fact our last layer is doing exactly this linear regression.

Now what does this tell us about inductive bias? Since our last layer is a linear layer, the representation before this layer **must** be at least approximately linearly separable. Otherwise, the last layer, which functions as a linear regression, will fail.
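This is easy to verify numerically. The sketch below reuses the learned first-layer weights from earlier as an assumption, computes the hidden representation, and fits a plain least-squares probe on it, which now succeeds where it failed on the raw input:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0)

# First-layer weights as printed earlier in the article.
W1 = np.array([[ 1.1485, -1.1486],
               [-1.0205,  1.0189]])

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Hidden representation after the first linear layer and ReLU.
H = relu(X @ W1.T)

# Linear regression (with bias) on the hidden states.
A = np.hstack([np.ones((4, 1)), H])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
probe_preds = A @ w

print(np.round(probe_preds, 3))  # close to [0, 1, 1, 0]
```

Note that both (0, 0) and (1, 1) map to the hidden point (0, 0), so the two output classes collapse onto three hidden points that a line separates exactly.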

### Linear Classification Probes

For the XOR model, this may seem like a trivial extension of the traditional view we saw earlier. But how does this work for more complex models? As models get deeper, we can gain more insight by looking at nonlinearity in this way. This paper by Guillaume Alain and Yoshua Bengio explores this idea using **linear classifier probes** (1).

“The hex dump shown on the left has more information content than the image on the right. Only one of them can be processed by the human brain in time to save their life. Computational convenience is important. Not just entropy.” Image and caption by Alain & Bengio, 2018 (Link). (1)

For many tasks, such as MNIST handwritten digits, all the information needed to make a prediction already exists in the input: it is just a matter of processing it. Alain and Bengio note that as we go deeper into a model, we actually have *less* information at each layer, not more. But the advantage is that at each layer, the information we do have becomes “easier to use”. What we mean by this is that the information becomes more and more linearly separable after each hidden layer.

How do we find out how linearly separable the model’s representation is after each layer? Alain and Bengio propose using what they call **linear classifier probes**. The idea is that after each layer, we train a **linear regression to predict the final output, using the hidden states at that layer as input.**

This is essentially what we did for the final XOR plot: we trained a linear regression on the hidden states just before the final layer, and we found that this regression successfully predicted the final output (1 or 0). We couldn’t do this with the raw input, where the data wasn’t linearly separable. Remember, the final layer is essentially linear regression, so in a sense this method is like creating a new final layer that’s shifted earlier in the model.

Alain and Bengio applied this to a convolutional neural network trained on MNIST handwritten digits: before and after each convolution, ReLU and pooling, they added a linear probe. What they found is that the test error almost always decreased from one probe to the next, indicating an increase in linear separability.
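The mechanics of their procedure can be sketched on a smaller scale. The toy example below substitutes a two-rings dataset and a small MLP for MNIST and their convolutional network (those substitutions, and all hyperparameters, are my own assumptions): train the network, then fit an independent linear classifier on the input and on each hidden representation, and compare accuracies.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Toy stand-in for MNIST: two concentric rings, which are not
# linearly separable in the raw input space.
n = 500
theta = rng.uniform(0, 2 * np.pi, 2 * n)
radius = np.concatenate([np.full(n, 1.0), np.full(n, 2.0)])
radius = radius + rng.normal(0, 0.1, 2 * n)
X = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1).astype(np.float32)
y = np.concatenate([np.zeros(n), np.ones(n)]).astype(np.float32)

# A small MLP; we will probe the representation after each ReLU.
layers = nn.ModuleList([nn.Linear(2, 16), nn.Linear(16, 16), nn.Linear(16, 1)])
opt = torch.optim.Adam(layers.parameters(), lr=0.01)
Xt, yt = torch.from_numpy(X), torch.from_numpy(y).unsqueeze(1)

for _ in range(500):
    h = Xt
    for layer in layers[:-1]:
        h = torch.relu(layer(h))
    loss = nn.functional.binary_cross_entropy_with_logits(layers[-1](h), yt)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Collect the input plus each hidden representation.
with torch.no_grad():
    reps, h = [Xt], Xt
    for layer in layers[:-1]:
        h = torch.relu(layer(h))
        reps.append(h)

# One linear classifier probe per depth, each trained independently.
accs = []
for depth, rep in enumerate(reps):
    probe = LogisticRegression(max_iter=1000).fit(rep.numpy(), y)
    accs.append(probe.score(rep.numpy(), y))
    print(f"probe at depth {depth}: accuracy {accs[-1]:.2f}")
```

The probe at depth 0 hovers near chance, while the probes on deeper representations should approach the trained network’s own accuracy, mirroring the layer-by-layer improvement Alain and Bengio report.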

Why does the data become linearly separable, and not “polynomially separable” or something else? Because the last layer is linear, the loss function we use forces all the other layers of the model to work together to create a linearly separable representation for the last layer to predict from.

Does this idea also apply to large language models (LLMs)? Yes, it certainly does. Jin et al. (2024) used linear classifier probes to demonstrate how LLMs learn different concepts. They found that **simple concepts**, such as whether a particular city is the capital of a particular country, **become linearly separable early in the model**: only a few nonlinear activations are needed to model these relationships. In contrast, many **reasoning skills only become linearly separable later in the model**, or not at all for smaller models (2).

### Conclusion

When we use **activation functions**, we introduce **nonlinearity** into our deep learning models. This is certainly good to know, but we can get even more value from interpreting the consequences of linearity and nonlinearity in multiple ways.

While the traditional interpretation looks at the model as a whole, a useful mental model focuses on the **last linear layer** of a deep learning model. Because this is a linear layer, everything that comes before it must be linearly separable; otherwise, the model will not work. Therefore, during training, the other layers of the model will work together to **find a linearly separable representation that the final layer can use** for its prediction.

It is always good to have more than one intuition for the same thing. This is especially true for deep learning, where models can be so black-box that any trick to get better interpretability is useful. Many papers have applied this intuition to get fascinating results: Alain and Bengio (2018) used it to develop the concept of the **linear classifier probe**, while Jin et al. (2024) built on this to watch increasingly complex concepts develop layer by layer in a language model.

I hope this new mental model for the role of nonlinearities has been useful to you, and that it helps you shed more light on black-box deep neural networks!

Photo by Nashad Abdu on Unsplash

### References

(1) G. Alain and Y. Bengio, Understanding intermediate layers using linear classifier probes (2018), arXiv

(2) M. Jin et al., Exploring Concept Depth: How Do Large Language Models Acquire Knowledge at Different Layers? (2024), arXiv

A Fresh Look at Nonlinearity in Deep Learning was originally published in Towards Data Science on Medium.