Monosemanticity in Claude 3 Sonnet:
https://transformer-circuits.pub/2024/scaling-monosemanticity/
- We now apply the SAE to residual stream activations halfway through the model rather than to MLP neurons, because it's cheaper (the residual stream has a lower dimension than the MLP layer) — minimal SAE sketch at the end of these notes
- Also mitigates cross-layer superposition, since the residual stream accumulates the outputs of all earlier layers
- Fewer than 300 features are active per token on average, good!
- The % of dead features (features not active on any of 10^7 tokens) increases with SAE size
- Features generalize across modalities, languages, and models
- Feature steering: choose a feature, amplify its activation in the SAE's hidden layer, and replace the residual stream with the SAE's decoder output (which is the same residual stream, reconstructed with that feature amplified) — steering sketch at the end
- Feature steering seems to generalize across prompts, with varying effects. I wonder if we can improve current LLMs by amplifying good features like truthfulness, honesty, corrigibility…?
- A UMAP of the feature vectors (the per-feature decoder directions of the SAE), based on cosine similarity, reveals that semantically similar features sit close together — UMAP sketch at the end
- We don't have ALL the features the model uses, only roughly 60%. Whether a feature exists for a given concept seems correlated with how frequently the relevant tokens/concepts appear in the SAE training data / pre-training data (similar samples)
- Safety-relevant deception case study (!!!): https://transformer-circuits.pub/2024/scaling-monosemanticity/#safety-relevant-deception-case-study
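
To pin down the setup, here is a minimal PyTorch sketch of an SAE over residual-stream activations. The dimensions, the ReLU encoder, and the plain L2 + L1 loss are my assumptions of a standard SAE recipe, not necessarily the paper's exact training details:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Minimal SAE: overcomplete dictionary over the residual stream.
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstructed residual stream
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # L2 reconstruction error plus an L1 sparsity penalty on feature activations.
    # (Coefficient is a hypothetical placeholder.)
    return ((x - x_hat) ** 2).sum(-1).mean() + l1_coeff * f.abs().sum(-1).mean()

# Hypothetical sizes: d_model of the host model, n_features of the dictionary.
sae = SparseAutoencoder(d_model=4096, n_features=65536)
x = torch.randn(8, 4096)  # stand-in for a batch of mid-layer residual activations
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
```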
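Feature steering, continuing from the sketch above: encode the residual stream, scale one feature's activation, decode, and substitute the decoder output back into the model. The hook point, feature index, and scale are hypothetical; the paper clamps features to multiples of their observed maximum activation, which plain scaling only approximates:

```python
@torch.no_grad()
def steer(residual: torch.Tensor, sae: SparseAutoencoder,
          feature_idx: int, scale: float = 5.0) -> torch.Tensor:
    # Amplify one feature in SAE space, then decode; the decoder output
    # replaces the residual stream at this layer.
    f = torch.relu(sae.encoder(residual))
    f[..., feature_idx] = f[..., feature_idx] * scale
    return sae.decoder(f)

steered = steer(x, sae, feature_idx=123, scale=10.0)  # hypothetical feature index
```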
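And the UMAP view, assuming the per-feature vectors are the columns of the SAE decoder and using the umap-learn package (plotting omitted):

```python
import umap  # pip install umap-learn

# decoder.weight is (d_model, n_features); transpose to one row per feature.
feature_vectors = sae.decoder.weight.T.detach().cpu().numpy()
embedding = umap.UMAP(metric="cosine").fit_transform(feature_vectors)
# embedding: (n_features, 2); nearby rows should be semantically similar features.
```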