Research Scientist - Frontier Data

San Francisco, CA | In-person | Full-Time

About

About the role

This is a hands-on, high-leverage research role. You will design the datasets and evaluation frameworks that shape how frontier models are trained and measured. Working directly with research teams at the world's top AI labs, you will experiment with data collection strategies, diagnose model failure modes, and develop the metrics that determine whether a model is actually getting better. This is not a theorizing role. You will quickly move from hypothesis to a live experiment, and your output will directly influence model training runs at scale. The team is small, the impact is outsized, and individual contributors here have a direct line to how the next generation of models learns and improves.

Design data slices and explore data shapes that expose meaningful model failure modes across domains, including finance, code, and enterprise workflows
Build and refine evaluation rubrics and reward signals for RLHF and RLVR training pipelines
Model annotator behavior and run experiments to improve different model capabilities
Develop quantitative frameworks for measuring dataset quality, diversity, and downstream impact on model alignment and capability
Partner with lab research teams to translate their training objectives into concrete data and evaluation specifications
Move fast from hypothesis to experiment, extract actionable insights from messy results, and iterate quickly

Requirements

Must-have

Strong quantitative instincts and familiarity with LLM training pipelines, RLHF or RLVR, or evaluation methodology. A PhD is not required, but you must have the research depth of a strong undergraduate or master's researcher
Genuine obsession with how data structure, selection, and quality drive model behavior. This is the core of the work, and the motivation for it must be intrinsic
Ability to design lightweight experiments, move fast, and extract actionable insights from messy and incomplete results
Comfort working across domains: the work touches finance, software engineering, policy, and more. Must be able to context-switch and reason clearly across all of them
Bias toward building over theorizing. Ships experiments and iterates rather than getting stuck in design

Nice-to-have

Prior work or internship at RL environment companies, AI safety organizations, or benchmarking organizations such as METR or Artificial Analysis
Background in evaluation methodology, benchmark design, or dataset curation at a lab or research organization
Exposure to annotator modeling, reward signal design, or alignment-related research

Benefits & perks

Equity
Bonus
Work with a founding team from top companies

Interview process

1. Application Review
2. Initial Screen
3. Take Home
4. Take Home Review
5. Onsite
6. Offer
7. Hired

Drop your CV for this role.

One PDF and your email. We read it, score your fit for this role at AfterQuery, and route the introduction through us.