https://cdn.openai.com/papers/sparse-autoencoders.pdf
- TLDR: SAEs are based on the superposition hypothesis: deep learning systems ‘cram’ more features/logic circuits than they have dimensions into the available nodes (in the spirit of compressed sensing), while the real feature space is higher dimensional and nearly orthogonal, and its features can be recovered as linear combinations of network activations (small demo below).
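A tiny, purely illustrative check of that premise (sizes and seed are made up, not from either paper): pack many random unit directions into a modest activation space and see how close to orthogonal they stay.

```python
import torch

# Toy check of the superposition premise: a d-dimensional activation space can hold
# far more than d directions that are only mildly non-orthogonal, so many sparse
# features can coexist with limited interference. Sizes below are arbitrary.
torch.manual_seed(0)
d_model, n_features = 512, 4096
directions = torch.nn.functional.normalize(torch.randn(n_features, d_model), dim=-1)
cos = directions @ directions.T
cos.fill_diagonal_(0.0)
print(f"max |cosine| between distinct directions: {cos.abs().max().item():.2f}")
# ~0.25 here: 8x more directions than dimensions, all nearly orthogonal
```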
- So we train a one-hidden-layer network that takes residual stream/MLP layer activations as input and tries to reconstruct them at the output layer, with a high-dimensional hidden layer in which each node is a ‘feature’
- Both papers use residual activations
- Cost function = MSE + sparsity constraint (minimal sketch below)
- Although minimizing this objective isn’t the ultimate goal (interpretable features are), it seems to be a good proxy for it
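A minimal sketch of the setup in the bullets above, i.e. a ReLU SAE trained with MSE plus an L1 sparsity penalty. The dimensions, `l1_coeff`, and the lack of any input normalization are assumptions for illustration, not details from either paper.

```python
import torch
import torch.nn as nn

class ReLUSAE(nn.Module):
    """One-hidden-layer autoencoder over residual-stream/MLP activations (sketch)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # each hidden unit ~ one "feature"
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction of the input activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # MSE reconstruction term + L1 penalty as the sparsity constraint (the proxy objective)
    mse = (x_hat - x).pow(2).mean()
    l1 = f.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1
```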
- SAE work on both Claude and GPT suggests that models reason over sentences and paragraphs, because feature activations for certain concepts disproportionately fall on period or newline tokens
- Very few features exhibit this, probably due to distribution
- k-SAE: only the k highest activations are kept in the hidden layer (sketch after the k-SAE bullets below)
- From the k-SAE paper:
- No activation functions are used aside from the ReLU applied when computing the k largest activations…
- Tied weights: encoder matrix = decoder matrix transposed (explicit reconstruction)
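A sketch of the TopK variant under the reading above: tied/transposed initialization and a TopK selection with ReLU as the only nonlinearity. Whether ReLU is applied to the kept values, and how biases are handled, vary between implementations, so treat this as an approximation rather than either paper’s exact code.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Sparse autoencoder keeping only the k largest pre-activations per input (sketch)."""
    def __init__(self, d_model: int, d_hidden: int, k: int):
        super().__init__()
        self.k = k
        self.decoder = nn.Linear(d_hidden, d_model)
        self.encoder = nn.Linear(d_model, d_hidden)
        with torch.no_grad():
            # initialize the encoder as the transposed decoder (see dead-latent notes below)
            self.encoder.weight.copy_(self.decoder.weight.t())

    def forward(self, x):
        pre = self.encoder(x)
        vals, idx = pre.topk(self.k, dim=-1)  # keep only the k largest activations
        f = torch.zeros_like(pre).scatter_(-1, idx, torch.relu(vals))
        x_hat = self.decoder(f)
        return x_hat, f, pre  # pre-activations are reused by the auxiliary loss below
```

Because the TopK step fixes L0 = k directly, no L1 penalty is needed and the loss can be plain MSE.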
- Anthropic used the mid-layer residual stream while OAI used a layer near the end
- Anthropic chose this by intuition; many activation-engineering studies have found middle layers to be effective
- OAI did a sweep and analysis, but the reasoning seemed sketchy to me (probably due to my lack of understanding, though)
- Anthropic uses a ReLU SAE whereas OAI uses a TopK SAE
- TopK has a better MSE/sparsity tradeoff and scales better with the number of hidden nodes
- Preventing dead latents:
- Initialize the encoder matrix to the transposed decoder matrix
- Add an auxiliary error term in which the top-k dead latents try to reconstruct the leftover reconstruction error rather than the raw input (sketch after this list)
- Dead latents dropped from ~90% to ~7%
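A rough sketch of that auxiliary term as I understand it: the top dead latents (ranked by pre-activation) are asked to reconstruct whatever error the main reconstruction left behind, so they receive gradient signal and can revive. The dead-latent bookkeeping, `k_aux`, and the coefficient are placeholders, not the paper’s exact values.

```python
import torch

def aux_dead_latent_loss(x, x_hat, pre, dead_mask, decoder, k_aux=256, coeff=1 / 32):
    """x: input activations, x_hat: main reconstruction, pre: encoder pre-activations,
    dead_mask: bool per latent, True if it hasn't fired recently (tracked elsewhere),
    decoder: the SAE's decoder module."""
    n_dead = int(dead_mask.sum())
    if n_dead == 0:
        return x.new_zeros(())
    err = x - x_hat                                        # what the main reconstruction missed
    masked_pre = pre.masked_fill(~dead_mask, float("-inf"))
    vals, idx = masked_pre.topk(min(k_aux, n_dead), dim=-1)
    f_dead = torch.zeros_like(pre).scatter_(-1, idx, torch.relu(vals))
    err_hat = decoder(f_dead)                              # dead latents try to explain the error
    return coeff * (err_hat - err).pow(2).mean()
```

In training this term would simply be added to the main reconstruction loss.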
- They do a good job of summarizing the scaling laws so I won’t add anything there
My questions:
- Are dead latents necessarily bad?
- Yes, we get lower loss and can set L0 directly with TopK, but again that’s not the final goal; qualitatively, based on the small sample I looked at, Anthropic’s features seem more interpretable and show more cohesive token activations
- We should expect multiple feature activations per token, because the same token can carry many different meanings in human language, so I can’t say I agree with TopK from a qualitative perspective