Deep Dive into Anthropic’s Sparse Autoencoders by Hand ✍️
by Srijanie Dey, PhD | May 2024
Following the story of Zephyra, Anthropic AI set out on an expedition to extract meaningful features from a model. The idea behind this investigation is to understand how the different components of a neural network interact with one another and what role each component plays.
According to the paper “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning”, a Sparse Autoencoder is able to successfully extract meaningful features from a model. In other words, Sparse Autoencoders help tackle the problem of ‘polysemanticity’ (neural activations that correspond to several meanings or interpretations at once) by focusing on sparsely activating features that each hold a single interpretation, i.e. features that are closer to monosemantic.
To understand how all of it is done, we have these beautiful handiworks on Autoencoders and Sparse Autoencoders by Prof. Tom Yeh that explain the behind-the-scenes workings of these phenomenal mechanisms.
(All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn posts, which I have edited with his permission.)
To begin, let us first explore what an Autoencoder is and how it works.
Imagine a writer has his desk strewn with different papers — some are his notes for the story he is writing, some are copies of final drafts, some are again illustrations for his action-packed story. Now amidst this chaos, it is hard to find the important parts — more so when the writer is in a hurry and the publisher is on the phone demanding a book in two days. Thankfully, the writer has a very efficient assistant — this assistant makes sure the cluttered desk is cleaned regularly, grouping similar items, organizing and putting things into their right place. And as and when needed, the assistant would retrieve the correct items for the writer, helping him meet the deadlines set by his publisher.
Well, the name of this assistant is Autoencoder. It mainly has two functions — encoding and decoding. Encoding refers to condensing input data and extracting the essential features (organization). Decoding is the process of reconstructing original data from encoded representation while aiming to minimize information loss (retrieval).
Now let’s look at how this assistant works.
Given : Four training examples X1, X2, X3, X4.
[1] Auto
The first step is to copy the training examples to the targets Y’. The Autoencoder’s job is to reconstruct these training examples. Since the targets are the training examples themselves, the word ‘Auto’ is used, which is Greek for ‘self’.
[2] Encoder : Layer 1 + ReLU
As we have seen in all our previous models, a simple weight matrix and bias vector coupled with ReLU is powerful and able to do wonders. Thus, by using the first Encoder layer we reduce the size of the original feature set from 4×4 to 3×4.
A quick recap:
Linear transformation : The input embedding vector x is multiplied by the weight matrix W and then the bias vector b is added:
z = Wx + b.
ReLU activation function : Next, we apply the ReLU to this intermediate z.
ReLU returns the element-wise maximum of the input and zero. Mathematically, h = max{0,z}.
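To make the recap concrete, here is a minimal NumPy sketch of a single encoding layer. The matrix values and shapes are placeholders chosen to mirror the 4×4 → 3×4 reduction above, not the numbers from the hand-drawn example.

```python
import numpy as np

def encoder_layer(X, W, b):
    """One encoder layer: linear transformation z = Wx + b followed by ReLU."""
    z = W @ X + b            # linear transformation
    return np.maximum(0, z)  # ReLU: element-wise max(0, z)

# Four 4-dimensional training examples stored as the columns of X (values made up).
X = np.array([[1., 0., 2., 1.],
              [0., 1., 1., 0.],
              [2., 1., 0., 1.],
              [1., 2., 1., 0.]])        # shape (4, 4)

W1 = np.random.randn(3, 4) * 0.5        # 3×4 weight matrix
b1 = np.zeros((3, 1))                   # bias vector, broadcast across the four examples
H1 = encoder_layer(X, W1, b1)           # shape (3, 4): feature set reduced from 4×4 to 3×4
```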
[3] Encoder : Layer 2 + ReLU
The output of the previous layer is processed by the second Encoder layer, which reduces the representation further, from three features per example down to two (a 2×4 matrix for our four examples). This is where the extraction of the most relevant features occurs. This layer is also called the ‘bottleneck’ since its outputs have far fewer features than the inputs.
[4] Decoder : Layer 1 + ReLU
Once the encoding process is complete, the next step is to decode the relevant features to build ‘back’ the final output. To do so, we multiply the features from the last step by the corresponding weights, add the biases, and apply ReLU. The result is a 3×4 matrix.
[5] Decoder : Layer 2 + ReLU
A second Decoder layer (weights, biases + ReLU) is applied to the previous output to give the final result, the reconstructed 4×4 matrix. We do so to get back to the original dimensions so that the result can be compared with our original target.
[6] Loss Gradients & BackPropagation
Once the output of the decoder layer is obtained, we calculate the gradient of the Mean Squared Error (MSE) between the outputs (Y) and the targets (Y’). To do so, we find 2*(Y-Y’), which gives us the gradient that kicks off the backpropagation process and updates the weights and biases accordingly.
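Putting steps [1] through [6] together, a compact NumPy sketch of one forward pass and the loss gradient might look as follows. The weights here are random placeholders and the layer sizes follow the 4 → 3 → 2 → 3 → 4 walkthrough above.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0, z)

# [1] Auto: the targets Y' are copies of the four training examples (columns of X).
X = rng.random((4, 4))
Y_target = X.copy()

# Randomly initialised weights and biases for a 4 -> 3 -> 2 -> 3 -> 4 autoencoder.
W_e1, b_e1 = rng.standard_normal((3, 4)), np.zeros((3, 1))
W_e2, b_e2 = rng.standard_normal((2, 3)), np.zeros((2, 1))
W_d1, b_d1 = rng.standard_normal((3, 2)), np.zeros((3, 1))
W_d2, b_d2 = rng.standard_normal((4, 3)), np.zeros((4, 1))

h1 = relu(W_e1 @ X + b_e1)           # [2] Encoder layer 1: 4×4 -> 3×4
h2 = relu(W_e2 @ h1 + b_e2)          # [3] Encoder layer 2 (bottleneck): 3×4 -> 2×4
h3 = relu(W_d1 @ h2 + b_d1)          # [4] Decoder layer 1: 2×4 -> 3×4
Y  = relu(W_d2 @ h3 + b_d2)          # [5] Decoder layer 2 (reconstruction): 3×4 -> 4×4

mse  = np.mean((Y - Y_target) ** 2)  # [6] reconstruction error
grad = 2 * (Y - Y_target)            # gradient of the squared error w.r.t. Y, the signal
                                     # backpropagation uses to update weights and biases
```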
Now that we understand how the Autoencoder works, it’s time to explore how its sparse variation is able to achieve interpretability for large language models (LLMs).
To start with, suppose we are given:
- The output of a transformer after the feed-forward layer has processed it, i.e. the model activations for five tokens (X). These activations are useful, but they do not shed light on how the model arrives at its decisions or makes its predictions.
The prime question here is:
Is it possible to map each activation (3D) to a higher-dimensional space (6D) in a way that aids understanding?
[1] Encoder : Linear Layer
The first step in the Encoder layer is to multiply the input X with encoder weights and add biases (as done in the first step of an Autoencoder).
[2] Encoder : ReLU
The next sub-step is to apply the ReLU activation function to add non-linearity and suppress negative activations. This suppression sets many features to 0, which is what creates sparsity: the output is a set of sparse, interpretable features f.
Interpretability happens when only one or two features are positive. If we examine f6, we can see that it is positive for X2 and X3, and we may say that both have ‘Mountain’ in common.
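A minimal sketch of this encoder step, assuming 3-dimensional activations for five tokens and a 6-dimensional feature space. The weights and activations are placeholders, not the values from the hand-drawn example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Model activations for five tokens, one 3-dimensional activation per column.
X = rng.standard_normal((3, 5))

# [1] Encoder linear layer: map each 3-d activation into a 6-d feature space.
W_enc = rng.standard_normal((6, 3))
b_enc = np.zeros((6, 1))

# [2] ReLU suppresses negative activations, so most entries of f end up at zero.
f = np.maximum(0, W_enc @ X + b_enc)   # shape (6, 5): sparse, interpretable features

# Which tokens does feature f6 fire on? (row index 5 corresponds to f6)
print(np.nonzero(f[5])[0])
```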
[3] Decoder : Reconstruction
Once we are done with the encoder, we proceed to the decoder step. We multiply f by the decoder weights and add biases. This outputs X’, which is the reconstruction of X from the interpretable features.
As done in an Autoencoder, we want X’ to be as close to X as possible. To ensure that, further training is essential.
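Continuing the sketch above, the decoder step is a plain linear map from the 6-d features back to 3-d activations; the weights below are again placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sparse features f for five tokens (6 features per token), e.g. as produced by the
# encoder sketch above; here they are simply placeholder values.
f = np.maximum(0, rng.standard_normal((6, 5)))

# [3] Decoder: multiply f by the decoder weights and add biases to reconstruct X'.
W_dec = rng.standard_normal((3, 6))
b_dec = np.zeros((3, 1))
X_prime = W_dec @ f + b_dec     # shape (3, 5): reconstruction of the original activations
```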
[4] Decoder : Weights
As an intermediate step, we compute the L2 norm of the decoder weights and keep these values aside to be used later.
L2-norm
Also known as Euclidean norm, L2-norm calculates the magnitude of a vector using the formula: ||x||₂ = √(Σᵢ xᵢ²).
In other words, it sums the squares of the components and then takes the square root of the result. This norm provides a straightforward way to quantify the length or distance of a vector in Euclidean space.
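In code, the per-feature decoder norms from the previous step can be computed in one line (the decoder weights below are placeholders):

```python
import numpy as np

W_dec = np.random.randn(3, 6)              # decoder weights: one 3-d vector per feature
col_norms = np.linalg.norm(W_dec, axis=0)  # L2 norm of each feature's decoder vector
# Equivalent to np.sqrt((W_dec ** 2).sum(axis=0)); these values are kept aside for later.
```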
As mentioned earlier, a Sparse Autoencoder requires extensive training to bring the reconstructed X’ closer to X. To illustrate that, we proceed to the next steps below:
[5] Sparsity : L1 Loss
The goal here is to push as many of the feature activations as possible to zero, or very close to it. We do so by invoking L1 sparsity to penalize the absolute values of the features f, the core idea being that we want to make their sum as small as possible.
L1-loss
The L1-loss is calculated as the sum of the absolute values of the feature activations: L1 = λΣ|f|, where λ is a regularization parameter.
This encourages many activations to become exactly zero, simplifying the model and thus enhancing interpretability.
In other words, the L1 penalty keeps the focus on the most relevant features while also preventing overfitting, improving model generalization, and reducing computational complexity.
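A minimal sketch of this penalty on the feature activations f; the value of λ and the features themselves are placeholders.

```python
import numpy as np

lam = 3e-4                                # regularization strength (placeholder)
f = np.maximum(0, np.random.randn(6, 5))  # sparse feature activations for five tokens

l1_loss = lam * np.abs(f).sum()           # L1 penalty: lambda times the sum of |f|
```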
[6] Sparsity : Gradient
The next step is to calculate the gradient of the L1 term. Since the derivative of |f| is 1 for f > 0, the descent direction (the negative of the gradient) is -1 for every positive feature. Thus, for all values of f > 0, the result is set to -1, nudging those features toward zero during the update.
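In code, this amounts to the sign pattern of f scaled by λ (again a sketch with placeholder values):

```python
import numpy as np

lam = 3e-4                                # placeholder regularization strength
f = np.maximum(0, np.random.randn(6, 5))  # sparse feature activations (placeholders)

grad_l1 = lam * np.sign(f)                # derivative of lam * |f|: +lam where f > 0, 0 elsewhere
step_direction = -grad_l1                 # descent direction: -lam for every positive feature,
                                          # pushing those features toward zero
```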