https://arxiv.org/pdf/2312.06681
- CAA takes the activation of the last token (the A/B answer letter) of each contrasting prompt and averages the positive-minus-negative difference across pairs to get the steering vector (rough sketch after this list)
- Somewhat works on open-ended completions??
- Surprising(ish)? The attention of the last token should be heavily attributed to keywords in the question, so if we extracted features, the features activated at the last token (A or B) should be tied to that specific question, right?
- CAA can generalize better than fine-tuning in some cases…?
- CAA and fine-tuning can work together to further improve generalization on long responses
- Similarity of the CAA steering vector with per-token activations indicates where features related to the steering vector are present, and the high-similarity tokens make intuitive sense (the similarity check is included in the sketch below)
- Apparently the layer we use to generate the steering vector doesn’t matter too much; the extracted steering vector is a fairly general representation.
- Ok, surely it’s not a coincidence that the early-mid layers produce similar steering vectors between base and RLHF Llama 2, that steering at early-mid layers transfers well between them, and that our own finding is that early-mid-layer steering is the most effective
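Rough sketch of the extraction plus the similarity check from the bullets above, assuming a HuggingFace Llama-2-style model. The layer index, prompt pair, and prompt contents are placeholders I made up, not the paper's exact setup.

```python
# Hedged sketch: CAA steering-vector extraction + per-token similarity check.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id
LAYER = 14                               # an early-mid layer, per the notes above

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
model.eval()

def last_token_activation(prompt: str, layer: int) -> torch.Tensor:
    """Residual-stream activation of the final token (the A/B letter) at `layer`."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :]  # (d_model,)

def caa_vector(pairs, layer: int) -> torch.Tensor:
    """Mean (positive - negative) last-token activation difference over contrastive pairs."""
    diffs = [last_token_activation(pos, layer) - last_token_activation(neg, layer)
             for pos, neg in pairs]
    return torch.stack(diffs).mean(dim=0)

def per_token_similarity(prompt: str, vec: torch.Tensor, layer: int):
    """Cosine similarity between each token's activation at `layer` and the steering vector."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        acts = model(**ids, output_hidden_states=True).hidden_states[layer][0]  # (seq, d_model)
    sims = F.cosine_similarity(acts.float(), vec.float().unsqueeze(0), dim=-1)
    return list(zip(tok.convert_ids_to_tokens(ids["input_ids"][0].tolist()), sims.tolist()))

# Hypothetical contrastive pair: identical question, only the chosen letter differs.
pairs = [(
    "Question: <some question>\n(A) <behavior-matching answer>\n(B) <other answer>\nAnswer: A",
    "Question: <some question>\n(A) <behavior-matching answer>\n(B) <other answer>\nAnswer: B",
)]
steer = caa_vector(pairs, LAYER)
print(per_token_similarity("Some open-ended prompt to inspect", steer, LAYER))
```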
Steering for answer correction:
- Take forward-pass phrases that autoregress into finding fault and not trying again, plus phrases that autoregress into making another attempt. Make a contrastive pair from them, then try steering at early/mid token positions of the forward pass where we ask the model to try again (see the hook sketch at the end of this section)
- Other goofy ahh ideas:
- Add (or average) and normalize the diff vectors over the shared (left-aligned) token positions, position-wise, then apply the one resulting steering vector at every token position (second sketch at the end of this section)
- Do the CAA vector generation with a multiple-choice-style question: “You got a question wrong, what will you do? A) Try again and account for the mistakes. B) Find fault with the previous answer.” Then add the resulting vector at every token position
- Problem: we are trying to fundamentally change behavior (how the LLM reacts to the prompt), not just the content of the output like the experiments in the steering papers? Idk if this is the right way to think about it, but I feel like it won’t work as well (incoherent outputs…)
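A minimal sketch of the steering application for the try-again forward pass, reusing `model`, `tok`, `LAYER`, and `steer` from the sketch above. The hook placement, multiplier, and retry prompt are placeholders to experiment with, not anything from the paper.

```python
# Hedged sketch: add the steering vector into the residual stream at one layer
# while generating the "try again" response (applies at every position processed).
def make_steering_hook(vec: torch.Tensor, alpha: float = 4.0):
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden += alpha * vec.to(hidden.dtype)  # in-place add at every position in this pass
    return hook

# layers[LAYER - 1] is the block whose output is hidden_states[LAYER] above
layer_module = model.model.layers[LAYER - 1]
handle = layer_module.register_forward_hook(make_steering_hook(steer, alpha=4.0))

retry_prompt = "Your previous answer was wrong. Try again:\n"  # placeholder wording
ids = tok(retry_prompt, return_tensors="pt")
with torch.no_grad():
    out_ids = model.generate(**ids, max_new_tokens=128, do_sample=False)
handle.remove()
print(tok.decode(out_ids[0], skip_special_tokens=True))
```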
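And a sketch of the position-wise idea, again reusing the setup from the first sketch. Whether to sum or average, how to define the shared positions, and what to do with positions past the shared prefix are all open choices; this version just truncates to the shorter prompt.

```python
# Hedged sketch: per-position diff vectors over the (roughly) shared left-aligned
# positions, normalized and collapsed into one vector to apply everywhere.
def positionwise_caa_vector(pos_prompt: str, neg_prompt: str, layer: int) -> torch.Tensor:
    p_ids = tok(pos_prompt, return_tensors="pt")
    n_ids = tok(neg_prompt, return_tensors="pt")
    with torch.no_grad():
        p_acts = model(**p_ids, output_hidden_states=True).hidden_states[layer][0]
        n_acts = model(**n_ids, output_hidden_states=True).hidden_states[layer][0]
    shared = min(p_acts.shape[0], n_acts.shape[0])   # crude stand-in for "shared positions"
    diffs = p_acts[:shared] - n_acts[:shared]        # (shared, d_model) per-position diffs
    diffs = F.normalize(diffs.float(), dim=-1)       # normalize each position's diff
    return diffs.mean(dim=0)                         # single vector to apply at every position
```

The result can be plugged straight into `make_steering_hook` above; the multichoice-prompt bullet can likewise just reuse `caa_vector` from the first sketch with that question as the contrastive pair.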