The Non-linear Privilege: How Feedforward Networks Enable Neuron Interpretability
Feedforward Networks (FFNs) are fundamental building blocks in deep learning. They are the basis of many architectures - from simple MLPs to complex Transformer layers.
But did you know that the non-linearity inside FFNs is what makes neurons interpretable? The concept of a "meaningful neuron" is impossible without this non-linearity. I was stunned by this idea when I read about it a few days ago. In this blog, I will explain this non-linearity magic.
A feedforward network layer can generally be expressed as:

y = Wx + b

where x is the input, W the weight matrix, and b the bias term. Typically, we apply a non-linear activation such as ReLU or softmax after each layer.
We do this because, without non-linear activation, a network made up of stacked linear layers would still represent a single linear transformation overall. Non-linear functions allow the network to capture complex, non-linear relationships between inputs and outputs.
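To see this collapse concretely, here is a quick NumPy sketch (the shapes and values are arbitrary, chosen just for illustration): two stacked linear layers produce exactly the same output as one linear layer with merged weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers with no activation in between.
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

y_stacked = W2 @ (W1 @ x + b1) + b2

# They collapse into a single linear layer: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
y_single = W @ x + b

print(np.allclose(y_stacked, y_single))  # prints True
```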
The non-linearity magic
To better understand how the non-linearity of FFNs makes neurons interpretable, let's work through an example.
Consider a linear two-layer network:

y = W2 (W1 x + b1) + b2

After the first layer, we get:

h = W1 x + b1

Now, suppose we multiply this intermediate representation by an orthogonal matrix Q (where Q^T Q = I):

h' = Q h

We can "undo" this rotation in the next layer by multiplying with Q^T:

y = (W2 Q^T)(Q h) + b2 = W2 h + b2
The output hasnât changed at all.
Implication: Since the output remains the same after rotation, a purely linear network is rotationally invariant.
This means the coordinate axes (i.e., individual neurons) are arbitrary and features do not prefer any particular direction. No single neuron activation is inherently meaningful. In other words, neurons in a pure linear network are not interpretable. They are just points in a space that can be rotated however you like.
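The argument above can be checked numerically. In this NumPy sketch (again with arbitrary illustrative shapes), we rotate the hidden representation by a random orthogonal matrix Q and absorb Q^T into the second layer; the output is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# A random orthogonal matrix Q, from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))

h = W1 @ x + b1
y_original = W2 @ h + b2

# Rotate the hidden layer by Q, then absorb Q^T into the second layer's weights.
y_rotated = (W2 @ Q.T) @ (Q @ h) + b2

print(np.allclose(y_original, y_rotated))  # prints True
```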
Now let's introduce non-linearity into the equation:

y = W2 f(W1 x + b1) + b2

Here, f is an elementwise non-linear activation function (like ReLU or GELU).
As earlier, if we try to rotate the internal representation:

h' = Q h

After applying the non-linearity:

f(Q h)

Implication: Unlike before, we can't "undo" the rotation anymore, because the non-linearity acts independently on each neuron and the output has changed:

f(Q h) ≠ Q f(h)

This means the network is no longer rotationally invariant. The rotation changes the output in a way that cannot be compensated by simply adjusting W2. The rotational invariance is broken.
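We can verify this broken symmetry directly. In this NumPy sketch (with ReLU standing in for the elementwise non-linearity), applying the rotation before or after the activation gives different results, so no adjustment of the next layer's weights can undo it:

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(z):
    return np.maximum(z, 0)

h = rng.normal(size=4)
# A random orthogonal matrix Q, from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))

# The elementwise non-linearity does not commute with rotation:
# relu(Q h) != Q relu(h), so the rotation cannot be undone in general.
print(np.allclose(relu(Q @ h), Q @ relu(h)))  # prints False
```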
The privileged basis
This breaking of rotational symmetry has a major consequence.
Each neuron now has a specific direction in the representation space. Non-linearity forces the model to represent features along the coordinate axes, i.e., the individual neurons.

The elementwise activation "locks in" a coordinate system where each axis (neuron) becomes independently meaningful, and features are aligned with specific neurons rather than arbitrary rotated directions.
This is why, in practice, we can meaningfully talk about "neuron interpretability" - for example, "this neuron activates for gendered words" or "this one fires for negative sentiment." Such alignment cannot exist in a purely linear network.
Why It Matters for Interpretability
Because of this non-linear privilege, neurons are interpretable: we can inspect and identify specific neurons that correspond to human-interpretable features. A neuron can be ablated or modified to observe the behavioral changes and understand how the network disentangles features internally. This is the fundamental building block for many mechanistic interpretability techniques and current research. Without this privileged basis, such interpretability would not be possible - every neuron would look the same.
Thanks for reading the blog. If you have any feedback for me or want to talk about mech interp or anything related to AI, you can text me on Twitter: angkul