Monosemanticity in Claude 3 Sonnet:
https://transformer-circuits.pub/2024/scaling-monosemanticity/
- We now apply the SAE to residual stream activations halfway through the model rather than to MLP neurons, because it's cheaper (the residual stream has a lower dimension than the MLP layer) — minimal SAE sketch at the end of these notes
- Also mitigates cross-layer superposition, since the residual stream accumulates the outputs of all earlier layers
- Fewer than 300 features are active per token on average, good!
- The % of dead features (features not active on any of 10^7 tokens) increases with SAE size
- Features generalize across modalities, languages, and models
- Feature steering: choose a feature, amplify its activation in the SAE's hidden layer, and replace the residual stream with the SAE's decoder output (which is the same residual stream, reconstructed with that feature amplified) — steering sketch at the end
- Feature steering seems to generalize across prompts, with varying effects. I wonder if we can improve current LLMs by amplifying good features like truthfulness, honesty, corrigibility…?
- A UMAP of the feature vectors (the per-feature decoder directions of the SAE), based on cosine similarity, reveals that semantically similar features sit close together — UMAP sketch at the end
- We don't have ALL the features the model uses, only roughly 60%. Whether a feature exists for a given concept seems correlated with how frequently the relevant tokens/concepts appear in the SAE training data / pre-training data (similar samples)
- Safety-relevant deception case study (!!!): https://transformer-circuits.pub/2024/scaling-monosemanticity/#safety-relevant-deception-case-study
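
To pin down the setup, here is a minimal PyTorch sketch of an SAE over residual-stream activations. The dimensions, the ReLU encoder, and the plain L2 + L1 loss are my assumptions of a standard SAE recipe, not necessarily the paper's exact training details:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Minimal SAE: overcomplete dictionary over the residual stream.
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstructed residual stream
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # L2 reconstruction error plus an L1 sparsity penalty on feature activations.
    # (Coefficient is a hypothetical placeholder.)
    return ((x - x_hat) ** 2).sum(-1).mean() + l1_coeff * f.abs().sum(-1).mean()

# Hypothetical sizes: d_model of the host model, n_features of the dictionary.
sae = SparseAutoencoder(d_model=4096, n_features=65536)
x = torch.randn(8, 4096)  # stand-in for a batch of mid-layer residual activations
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
```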
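Feature steering, continuing from the sketch above: encode the residual stream, scale one feature's activation, decode, and substitute the decoder output back into the model. The hook point, feature index, and scale are hypothetical; the paper clamps features to multiples of their observed maximum activation, which plain scaling only approximates:

```python
@torch.no_grad()
def steer(residual: torch.Tensor, sae: SparseAutoencoder,
          feature_idx: int, scale: float = 5.0) -> torch.Tensor:
    # Amplify one feature in SAE space, then decode; the decoder output
    # replaces the residual stream at this layer.
    f = torch.relu(sae.encoder(residual))
    f[..., feature_idx] = f[..., feature_idx] * scale
    return sae.decoder(f)

steered = steer(x, sae, feature_idx=123, scale=10.0)  # hypothetical feature index
```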
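And the UMAP view, assuming the per-feature vectors are the columns of the SAE decoder and using the umap-learn package (plotting omitted):

```python
import umap  # pip install umap-learn

# decoder.weight is (d_model, n_features); transpose to one row per feature.
feature_vectors = sae.decoder.weight.T.detach().cpu().numpy()
embedding = umap.UMAP(metric="cosine").fit_transform(feature_vectors)
# embedding: (n_features, 2); nearby rows should be semantically similar features.
```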