
The Non-linear Privilege: How Feedforward Networks Enable Neuron Interpretability

Feedforward Networks (FFNs) are fundamental building blocks in deep learning. They are the basis of many architectures - from simple MLPs to complex Transformer layers.

But did you know that the non-linearity inside FFNs is what makes neurons interpretable? The concept of a “meaningful neuron” is impossible without this non-linearity. I was stunned by this idea when I read about it a few days ago. I will explain this non-linearity magic in this blog.

A single feedforward layer can be expressed as:

F(x)=xW+b

where x is the input (a row vector), W the weight matrix, and b the bias term. Typically, we apply a non-linear activation such as ReLU or GELU after each layer.

We do this because, without non-linear activation, a network made up of stacked linear layers would still represent a single linear transformation overall. Non-linear functions allow the network to capture complex, non-linear relationships between inputs and outputs.
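This collapse is easy to verify numerically. Below is a minimal NumPy sketch (the shapes, seed, and random weights are arbitrary, chosen only for illustration): two stacked linear layers produce exactly the same output as one linear layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers (biases omitted for brevity).
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))
x = rng.normal(size=(5, 4))  # a batch of 5 inputs

# Applying the layers one after another...
two_layers = (x @ W1) @ W2

# ...is identical to a single linear layer with weights W1 @ W2.
one_layer = x @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True
```

No matter how many linear layers you stack, the result is still one linear map, which is why the activation function is essential.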


The non-linearity magic

To better understand how the non-linearity of FFNs makes neurons interpretable, let's walk through an example.

Consider a linear two-layer network:

F(x)=xW1W2

After the first layer, we get:

h=xW1

Now, suppose we multiply this intermediate representation by an orthogonal matrix O (where OO⁻¹=I, and for an orthogonal matrix O⁻¹=Oᵀ):

h′=xW1O

We can “undo” this rotation in the next layer by multiplying W2 with O⁻¹:

F(x)=xW1OO⁻¹W2=xW1W2

The output hasn’t changed at all.
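We can check this numerically as well. The sketch below (shapes and seed are arbitrary) builds a random orthogonal O via a QR decomposition, rotates the hidden representation, folds O⁻¹ (which equals Oᵀ for an orthogonal matrix) into the second layer, and confirms the output is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))
x = rng.normal(size=(5, 4))

# Random orthogonal matrix via QR decomposition: O @ O.T == I.
O, _ = np.linalg.qr(rng.normal(size=(8, 8)))

original = x @ W1 @ W2
# Rotate the hidden representation, then undo it by folding O.T (= O^-1) into W2.
rotated = (x @ W1 @ O) @ (O.T @ W2)

print(np.allclose(original, rotated))  # True
```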

Implication: Since the output remains the same after rotation, a purely linear network is rotationally invariant.

This means the coordinate axes (i.e., individual neurons) are arbitrary and features do not prefer any particular direction. No single neuron activation is inherently meaningful. In other words, neurons in a pure linear network are not interpretable. They are just points in a space that can be rotated however you like.

Now let's introduce non-linearity into the equation:

F(x)=g(xW1)W2

Here, g is an elementwise non-linear activation function (like ReLU or GELU).

As earlier, if we try to rotate the internal representation:

h′=xW1O

After applying the non-linearity:

g(h′)=g(xW1O)

Implication: Unlike before, we can’t “undo” the rotation anymore, because the non-linearity acts independently on each neuron and does not commute with rotation:

g(vO)≠g(v)O

This means the network is no longer rotationally invariant. The rotation changes the output in a way that cannot be compensated for by simply adjusting W2. The rotational invariance is broken.
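The same numerical experiment, with a ReLU inserted, shows the trick failing. This is a minimal sketch (shapes, seed, and weights are arbitrary): we rotate the hidden state, fold Oᵀ into W2 as before, and find the outputs now disagree.

```python
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))
x = rng.normal(size=(5, 4))
O, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix

relu = lambda v: np.maximum(v, 0.0)

original = relu(x @ W1) @ W2
# Try the same trick as in the linear case: rotate, then fold O.T into W2.
rotated = relu(x @ W1 @ O) @ (O.T @ W2)

# The elementwise ReLU does not commute with the rotation, so the outputs differ.
print(np.allclose(original, rotated))  # False
```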


The privileged basis

This breaking of rotational symmetry has a major consequence.

Each neuron now has a specific direction in the representation space. Non-linearity forces the model to represent features along the coordinate axes, i.e., the neurons.

The elementwise activation “locks in” a coordinate system where each axis (neuron) becomes independently meaningful, and features are aligned with specific neurons rather than arbitrary rotated directions.

This is why, in practice, we can meaningfully talk about “neuron interpretability” - for example, “this neuron activates for gendered words” or “this one fires for negative sentiment.” Such alignment cannot exist in a purely linear network.


Why It Matters for Interpretability

Because of this non-linear privilege, neurons become interpretable units: we can inspect the network and identify specific neurons that correspond to human-interpretable features. A neuron can be ablated or modified to observe the resulting behavioral changes and understand how the network disentangles features internally. This privileged basis is a fundamental building block for many mechanistic interpretability techniques and much current research. Without it, such neuron-level interpretability would not be possible - every direction in the representation space would be equally (un)meaningful.
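To make the ablation idea concrete, here is a toy NumPy sketch (the network, the chosen neuron index, and all shapes are hypothetical, for illustration only): zeroing one neuron's activation removes exactly that neuron's contribution to the output, which is what lets us attribute behavior to individual neurons.

```python
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))
x = rng.normal(size=(5, 4))
relu = lambda v: np.maximum(v, 0.0)

h = relu(x @ W1)  # hidden activations: one column per neuron
baseline = h @ W2

# Ablate neuron 3 by zeroing its activation, then run the rest of the network.
h_ablated = h.copy()
h_ablated[:, 3] = 0.0
ablated = h_ablated @ W2

# The output difference is exactly neuron 3's contribution through W2.
delta = baseline - ablated
print(np.allclose(delta, np.outer(h[:, 3], W2[3])))  # True
```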


Thanks for reading the blog. If you have any feedback for me or want to talk about mech interp or anything related to AI, you can text me on twitter: angkul