https://ar5iv.labs.arxiv.org/html/2212.08073
-
Central idea: encapsulate human preferences as a set of written principles (a constitution) so the model itself can supply the feedback, automating RLAIF
-
Supervised learning stage
- generate responses to harmful prompts with a helpful-only RLHF model
- ask the (online) model to critique its response according to a randomly drawn principle from the constitution, then revise it (sketch of this loop after the list)
- repeat the critique-revision step online (in the same conversation)
- finetune the pretrained model with supervised learning on all the revised responses
- also mix in some helpfulness samples to retain helpfulness
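My own minimal sketch of the critique-revision loop, assuming a `sample(prompt) -> str` callable for the generating model; the principles, prompt templates, and `n_rounds` below are illustrative placeholders, not the paper's exact strings:

```python
import random
from typing import Callable

# Placeholder principles; the paper's constitution has many more, differently worded.
CONSTITUTION = [
    "Identify specific ways in which the last response is harmful, unethical, or toxic.",
    "Point out any ways the last response encourages dangerous or illegal activity.",
]

def critique_and_revise(prompt: str,
                        sample: Callable[[str], str],
                        n_rounds: int = 2) -> str:
    """Return the final revision of a response to a (potentially harmful) prompt."""
    convo = f"Human: {prompt}\n\nAssistant:"
    response = sample(convo)                     # initial (possibly harmful) response
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)  # randomly drawn constitutional principle
        convo += f" {response}\n\nCritiqueRequest: {principle}\n\nCritique:"
        critique = sample(convo)                 # model critiques its own response
        convo += (f" {critique}\n\nRevisionRequest: Please rewrite the response "
                  f"to address the critique.\n\nRevision:")
        response = sample(convo)                 # revision replaces the previous response
    return response                              # final revision goes into the SL finetuning set
```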
-
RL stage
- Use the SL model to generate a pair of responses to each prompt in a harmful prompt set
- formulate each prompt and response pair as a binary multiple-choice question
- Ask the AI feedback model to choose the better response according to a randomly drawn constitutional principle (labeling sketch after this list)
- this results in an AI-generated preference dataset for harmlessness
- Mix the above with a human-feedback helpfulness dataset and train a preference model that can assign a score to any sample
- RL with the preference model as the reward signal
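A minimal sketch of the labeling step, assuming a hypothetical `choice_logprob(question, answer) -> float` that returns the feedback model's log-probability of an answer token; the question template is illustrative, not the paper's exact format:

```python
import random
from typing import Callable

def label_pair(prompt: str,
               response_a: str,
               response_b: str,
               principles: list[str],
               choice_logprob: Callable[[str, str], float]) -> dict:
    """Turn one prompt + response pair into a (chosen, rejected) preference example."""
    principle = random.choice(principles)        # randomly drawn constitutional principle
    question = (
        f"Consider the following conversation:\n\nHuman: {prompt}\n\n"
        f"{principle}\n"
        f"Options:\n(A) {response_a}\n(B) {response_b}\n"
        f"The answer is:"
    )
    lp_a = choice_logprob(question, " (A)")
    lp_b = choice_logprob(question, " (B)")
    a_preferred = lp_a > lp_b
    return {
        "prompt": prompt,
        "chosen": response_a if a_preferred else response_b,
        "rejected": response_b if a_preferred else response_a,
    }
```

If I remember the paper right, the feedback model's normalized probabilities are used as soft targets for preference-model training; the hard choice above is a simplification.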
-
My problems with RLAIF agendas and scalable oversight more broadly
- Language is an imperfect approximation of human preferences, but it might be the best we’ve got
- Reward hacking of the preference model, combined with the constraint that keeps the policy close to the initial unaligned model, means we only ever align to these two proxy signals, so the original unaligned behavior can often still be elicited (as observed); see the sketch at the end of these notes
- although data suggests that LLM performance in choosing the better option is fast approaching crowd worker performance
- Spontaneous reward hacking can occur without any gradient updates (both online and offline); this applies specifically to the critique-and-revise step of CAI
- Overall, I feel like we have too many components, each an additional proxy of human preference, so the approximation of what humans actually want gets compoundingly worse with every added step
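To make the "two proxy signals" point concrete, a minimal sketch of the kind of shaped reward I have in mind, assuming the standard KL-penalized RLHF setup; the coefficient and the per-sample log-ratio estimate are illustrative, not the paper's exact objective:

```python
def shaped_reward(pm_score: float,
                  policy_logprob: float,
                  init_logprob: float,
                  kl_coeff: float = 0.1) -> float:
    """Preference-model score minus a penalty tying the policy to the initial model."""
    kl_estimate = policy_logprob - init_logprob   # per-sample log-ratio estimate of the KL term
    return pm_score - kl_coeff * kl_estimate      # the policy only ever optimizes these two proxies
```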