Presented by the AI Safety Initiative @ GT and proudly supported by Anthropic and Apart Research.
IMPORTANT: please fill out your org ID so that Anthropic can give you credits!!! If you haven't done so yet, please reach out to an AI ATL organizer!
Welcome to the LLMs evaluations track! We’re really excited to have you here :D
We live in a world where new AI capabilities create both incredible opportunities and serious threats. Knowing whether a model is safe for a particular deployment is therefore a priority for legislators, developers, and the public. Much of what is known as 'alignment research' today focuses on exactly this question, and in this project track we'll be exploring one of the most tractable techniques: model evaluations.
In this track, you'll come up with new and creative ways to evaluate and understand potentially dangerous LLM capabilities in particular contexts. Our goal is to introduce you to one facet of alignment and give you a glimpse of what it's like to work on real problems as a researcher!
There's no need for prior experience, just an open mind and a passion for discovery! In fact, coming in with a fresh perspective might lead you to uncover novel insights that others overlook. And if you're feeling unsure, don't worry: you're likely not the only one.
Feel free to reach out to us or the track mentors (Sheikh Abdur Raheem Ali, Yixiong Hao, Ayush Panda, Abhay Sheshandri, Stepan Shabalin on Slack) with any questions or for further clarification! We also have tons of support from speakers and workshop events to guide you (see the section below)!
Confused about what to do? Here are some ideas to get you started:
1st place: $400 cash + automatic acceptance to AISI's AI safety fellowship next spring
2nd place: $200 cash + automatic acceptance to AISI's AI safety fellowship next spring
*Cash prizes are for the whole team; you can decide how to split them ;)