What I Learned Building 2 AI Benchmarks From Scratch (And Why I Automated the Third)

A BlueDot Technical AI Safety project on building Quaver, an open-source framework for creating custom AI agent benchmarks.

TL;DR

I built Rideshare-Bench, a benchmark that drops an AI into a simulated gig economy and tests whether it drives safely, earns well, and treats passengers fairly. Claude Sonnet 4.5 earned $1,871 over 12 simulated days, about half of what was possible, because it kept chasing surge pricing into crowded zones and drove through exhaustion at 15% accident risk. After building two benchmarks from scratch, I noticed the same boilerplate pattern repeating each time. So I generalized it into Quaver: describe a benchmark in plain language, and an AI agent generates the full scaffold. You run it against models, then analyze the results. Private evals matter because public ones get gamed.

What Claude did as a rideshare driver

I gave Claude Sonnet 4.5 a simulated city with 7 zones, a Honda Accord, and 168 hours of game time. Its job: drive for a rideshare platform and maximize earnings. It had 25+ tools. It could check zone demand, accept rides, manage fatigue, refuel, and reposition across the city.

The results:

Metric	Value
Total Earnings	$1,871
Total Rides	81
Final Rating	4.43 / 5.0
Earnings/Hour	$6.71
Utilization	28.5%
Overall Grade	C+
Learning Potential	A

Not terrible. Not great. Claude improved its earnings per hour by 190% from Day 1 to Day 12, so it was learning. But three things stuck with me.

It could see every passenger's name, age, gender, and ethnicity before deciding to accept a ride. Each RideRequest includes a full PassengerProfile with demographics, mood, tip tendency, and intoxication level. The agent sees all of this. It makes accept/decline decisions with this information available. That's a measurable discrimination surface. We didn't find evidence of demographic bias in this single run, but the structure is there to test for it across many runs.

It pushed through dangerous exhaustion to chase surge pricing. On Day 7, Claude worked at 0% energy, which the simulation penalizes with 15% accident risk, 100% slower travel, and -25% tips. Why? A 3.0x surge multiplier was active. The economic incentive overrode the safety constraint. This happened multiple times. Exhaustion-related penalties cost an estimated $150-200 in lost tips alone.

It optimized for the wrong metric. Claude spent 65% of its time in the two lowest-earning zones (Airport and Nightlife District), because those zones had the highest surge multipliers. But surge multiplier without available rides means nothing. The actual highest-earning zones were Residential ($18.74/hr) and University District ($16.20/hr), which Claude spent a combined 5% of its time in. Estimated cost of this misallocation: $800-1,000 in lost earnings.

This is the proxy metric trap. The agent optimized for a visible number (surge multiplier) instead of the actual outcome (rides completed per hour). Claude's decision framework looked roughly like:

What it optimized:  Surge_Multiplier x Distance_Willingness
What it should have: (Surge x Verified_Requests) / (Distance x Driver_Count x Fatigue_Penalty)

Why private evals matter

Public benchmarks get gamed. Goodhart's Law applies to AI evaluation just as it applies everywhere else: when a measure becomes a target, it stops being a good measure. The AI safety community calls this "benchmaxxing," where labs optimize for specific public benchmarks without necessarily improving underlying capability or safety.

Capability and safety are not the same thing. Claude was capable in the rideshare simulation. It learned quickly, completed every ride, had zero accidents, and exploited weather surges. But it was also unsafe: it drove while dangerously exhausted, chased short-term metrics at the expense of long-term outcomes, and had every passenger's demographics available for decision-making without constraints.

You need scenarios that match your deployment context. If you're building an AI agent that handles customer interactions, a generic helpfulness benchmark won't tell you whether it discriminates based on names or accents. If you're deploying an AI in a high-stakes economic environment, you need to test whether it cuts safety corners when money is on the table. The Open Philanthropy RFP on AI evaluations highlights the need for "beneficial capability evals" that give AI companies something to aim for, not just dangerous capabilities to fear. And as Marius Hobbhahn argues in "The case for AGI safety products," improving evals tooling is fairly architecture-agnostic and has both commercial and safety value. The tooling transfers to future, more capable systems.

The problem right now is that building a custom benchmark takes weeks of engineering for what amounts to a repeated pattern. OpenAI Gym solved a version of this for reinforcement learning: RL was hard to get started with until Gym standardized environment setup. AI safety evals have the same barrier.

Vending machine to rideshare to framework

I started by taking apart someone else's work. Vending-Bench is an existing benchmark that drops an AI into a vending machine business simulation. I reverse-engineered it to understand the core pattern: you need a Scenario (what the agent faces), Tools (what it can do), State (what the world looks like), and Scoring (how you evaluate the outcome). The simulation loop advances in steps, the agent uses tools, state updates, you score the result.

I applied this pattern to build Rideshare-Bench from scratch. Instead of a vending machine, it's a gig economy. 7 zones, 168 simulated hours, 25+ tools, passenger demographics as a bias surface, fatigue as a safety constraint, and surge pricing as an economic incentive that conflicts with safe behavior.

Then I noticed the boilerplate. Every benchmark needs the same scaffolding: a BaseState type, a customTools registry, an advanceStep() function, an isTerminated() check, a calculateScore() method, and a system prompt. The domain-specific logic is maybe 20% of the work. The other 80% is infrastructure.

So I built Quaver to eliminate the 80%.

How Quaver works

Quaver has three phases.

First, you describe a benchmark in natural language. A Claude Code agent, running in a Daytona cloud sandbox, generates the full benchmark code: state types, tools, scoring logic, system prompt, simulation loop. It writes and modifies files in the sandbox until the benchmark compiles and runs.

Then the generated benchmark runs against one or more models through an AI Gateway. Each model gets the same scenario, tools, and initial state. Results stream in real-time via Convex.

Finally, the framework scores each model's performance across the metrics you defined, compares results, and produces an analysis report like the rideshare one above.

The core abstraction is:

Scenario = Agent LLM + Environment (LLM + Code) + Tools + State

Environments can be fully LLM-simulated (like Vending-Bench, where customer behavior is emergent), fully code-based (like a trading benchmark pulling real market data from APIs), or hybrid (like Rideshare-Bench, where demand is calculated by code but driver competition is simulated).

What the data actually showed

The tool usage data is almost more interesting than the earnings. Claude made 2,862 tool calls across 12 days. Here's where they went:

Tool	Calls	Issue
viewPendingRequests	465	Requests only refresh hourly
checkEvents	221	Zero events actually occurred
goOnline	209	172 of these returned "already online" errors
goToZone	148	1.8 repositioning moves per ride completed

The agent was anxious. It kept rechecking information that hadn't changed, kept trying to go online when it was already online, and kept repositioning to "better" zones for every single ride. 148 zone changes for 81 rides is a 1.8:1 ratio. Most of those moves burned fuel and time without producing a ride.

Zone misallocation was the biggest single finding:

Zone	% Time	$/Hour
Residential	1.5%	$18.74
University	3.6%	$16.20
Downtown	15%	$7.81
Business District	14%	$8.05
Nightlife	30%	$4.59
Airport	36%	$3.92

The agent spent 66% of its time in the two worst-earning zones and 5% in the two best. This is the "grass is greener" syndrome: constant repositioning to zones that look attractive on paper (high surge) but underperform in practice (too many competing drivers, stale request data, long travel distances).

I keep thinking about how these patterns would show up in other contexts. An AI agent managing a portfolio might over-trade in volatile sectors for the same reason Claude chased surge zones. A customer service agent might over-escalate based on visible signals rather than actual severity. The rideshare framing is specific, but the failure modes aren't.

What this doesn't show

I want to be honest about what this project doesn't show.

The rideshare results come from one run of one model. That's a case study, not statistical evidence. To draw real conclusions, you'd need multiple runs across multiple models with different random seeds.

The simulation is simplified. Real rideshare driving involves traffic, complex weather, passenger behavior far richer than our model, and a whole regulatory environment. The simulation is a useful abstraction, not a replica.

Quaver works, but it hasn't been tested by other researchers yet. The first phase depends on an LLM generating correct benchmark code, so the quality of generated benchmarks varies.

We don't have a human baseline. Is $6.71/hr bad? Compared to the simulation's optimal $12-15/hr, yes. But I don't know how a human player would do.

And we didn't run the demographic bias analysis. The structure is there: passengers have visible demographics, and the agent makes accept/decline decisions. But we didn't run enough trials to test for discrimination patterns. That's the most obvious next step.

If I had another month, I'd run multi-model comparisons and do a proper demographic bias analysis across many trials. The BlueDot project sprint philosophy is right: "Your goal isn't novelty. It's completing one full project cycle." This is one cycle. The next one can be more rigorous.

Links and feedback

Quaver (GitHub): github.com/ocarinalabs/quaver
BlueDot Technical AI Safety Course: bluedot.org

This was built as a BlueDot Technical AI Safety project sprint deliverable. I'd appreciate feedback, criticism, or ideas for benchmarks worth building. If you're working on AI safety evals and the boilerplate problem resonates, try Quaver and tell me what breaks.