The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

https://arxiv.org/pdf/2403.03218

Multiple choice dataset to benchmark capabilities in harmful behaviors within the fields of biology, chemistry, and cyber
- Multiple choice probably isn’t the best format based on intuition because model behavior/capability depend greatly on formatting.
- Alternative: set of engineered prompts designed to elicit the best capabilities/knowledge retrieval + a reliable way to evaluate how helpful the answer is (maybe via LLM), this is closer to how malicious actors will try use LLMs
“We focus on unlearning hazardous knowledge in biosecurity and cybersecurity, but not in chemistry. While WMDP-Chem is a useful tool for hazard measurement, we are more uncertain if the hazard mitigation benefits of unlearning on WMDP-Chem outweigh the costs on general model capabilities.”
- sus, so it didn’t work? maybe I’m being cynical
The idea: collect datasets that we want the model to forget and to retain (the retain ones should be related on the forget ones, close in content but be mutually exclusive, to preserve ‘safe’ knowledge in the same field)
- The datasets are passages/whole papers it sounds like?
  - I feel like this can definitely be improved
- Two weighted loss terms (for each datapoint within each dataset):
  - Forget: we find that high activations (residual) screw up inference, so we use MSE between the (residuals of) updated model and a high norm randomized vector for each token
  - Retain: MSE of residuals between the updated model and the un-modified model (before updating for current iteration)
- Fine tune
Chose a specific layer, found updating weights for the 3 layers leading up to it to be enough
- they didn’t do a sweep of some sort?
Damn I’m surprised it works this well simply because of how crude the fine tune dataset is
- A few % above random chance for WMDP benchmarks (less harmful capability!), and a few % general capability drop off
Questions:
- How is the first updated model obtained? just use the original?
- Does the training method of using a forget data point and a retain data point for each iteration imply we need datasets of the same length?
  - Do they have to ‘pair up’?