Service Page · For AI Labs & Tooling Companies

AI Evaluation Engineer for Code Agents

I author production-grade evaluation tasks for AI coding agents. As an expert task author with Mercor, I have recently shipped 130+ SWE-bench-Extended tasks across eight languages: Docker-reproducible OSS issues with implementation-agnostic rubrics, golden solutions, and automated test harnesses, each graded at a ≥0.95 QC threshold.

What I do

Agentic-benchmark task authoring

Take a real GitHub issue or PR from an OSS project. Turn it into a self-contained evaluation task: pinned Docker environment, structured problem and prompt statements, interface contract, requirements file, golden patch, test patch, and an implementation-agnostic rubric covering functional / robustness / style criteria. Validate end-to-end via automated grader before shipping.
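
To make that bundle concrete, here is a minimal sketch of one possible task layout plus a trivial completeness check. The file names are my own illustration of the artifacts listed above, not the program's actual schema.

```python
# Hypothetical task-bundle layout and a minimal completeness check.
# File names are illustrative, not a program deliverable.
from pathlib import Path

EXPECTED_FILES = [
    "Dockerfile",            # pinned base image + locked dependencies
    "problem_statement.md",  # structured problem description
    "prompt.md",             # the statement the agent actually sees
    "interface.md",          # interface contract the solution must honor
    "requirements.json",     # enumerated behaviors the fix must exhibit
    "golden.patch",          # reference solution
    "test.patch",            # tests verifying the rubric's functional criteria
    "rubric.json",           # implementation-agnostic grading criteria
]

def missing_files(task_dir: str) -> list[str]:
    """Return any expected artifacts missing from a task bundle."""
    root = Path(task_dir)
    return [name for name in EXPECTED_FILES if not (root / name).exists()]

if __name__ == "__main__":
    print(missing_files("tasks/example-task"))
```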

Rubric design & QC

Apply the alignment rules that matter: every functional criterion verifiable by a test, atomic behaviors (no AND-stacked criteria), behavioral descriptions only (no implementation details), source attribution to requirements / prompt / interface, and rationale tied to user/codebase impact. Re-trigger QC until both reviewer and super-reviewer scores clear the threshold.

LLM-as-judge pipeline operation

Operate the full LLM-orchestration toolchain: context generators, problem-statement generators, golden-plan generators, planning + execution graders, and QC checks for test alignment, prompt clarity, fairness, and rubric quality. Comfortable extending these pipelines with custom Python tooling.
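
As a rough illustration of what one grading stage in such a pipeline can look like: the prompt, the unweighted scoring, and the call_llm() placeholder below are assumptions made for the sketch, not the production toolchain.

```python
# Minimal LLM-as-judge sketch: grade one rubric criterion against an agent's patch.
# call_llm() is a placeholder for whatever model client the pipeline uses;
# the prompt and scoring scheme are illustrative, not the production pipeline.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your model client here")

JUDGE_PROMPT = """You are grading one rubric criterion.
Criterion: {description}
Rationale: {rationale}
Candidate patch:
{patch}
Answer with JSON: {{"met": true|false, "evidence": "<quote from the patch>"}}"""

def grade_criterion(criterion: dict, patch: str) -> dict:
    """Ask the judge whether a single criterion is satisfied by the patch."""
    raw = call_llm(JUDGE_PROMPT.format(
        description=criterion["description"],
        rationale=criterion["rationale"],
        patch=patch,
    ))
    return json.loads(raw)

def grade_rubric(rubric: list[dict], patch: str) -> float:
    """Fraction of criteria the judge marks as met (unweighted, for illustration)."""
    verdicts = [grade_criterion(c, patch) for c in rubric]
    return sum(v["met"] for v in verdicts) / max(len(verdicts), 1)
```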

Eval-harness consulting

For companies building internal AI coding agents or assistants, design the eval suite that grades them honestly. Rubric taxonomy, harness architecture, dataset sourcing, and the LLM-judge pipeline that QCs each evaluation cycle.

AI integration audits

Independent review of GPT/Claude integrations already running in production code: prompts, fallback paths, cost ceilings, failure modes, observability. Where applicable, recommend prompt caching, streaming, fine-tuning vs RAG, model selection.

Q1 2026 output

  • 130+ SWE-bench tasks shipped
  • 8 languages
  • ≥0.95 QC threshold
  • Active program, Q1 '26
Mercor's customer relationships are under NDA. I can describe the program, methodology, and my output, but not name the downstream labs. References available on request to qualified hiring contacts.

Languages I author tasks in

Across the 130+ tasks shipped in Q1 2026, Go and Rust are dominant:

Go
Rust
Java
Kotlin
C++
JavaScript
TypeScript
Python

Authoring a task in a language means reading a real OSS PR end-to-end, building a reproducible Docker environment, and designing a working test rubric. So beyond what my shipped product code shows, I can credibly read and reason about production code in all of the above.

Methodology (NDA-safe summary)

Two alignment rules I never break

  • Tests → rubric. Every functional criterion is verifiable by an offline assertion in test.patch (see the sketch after this list). Two unverified criteria cap the score at ~0.80 regardless of other quality.
  • Requirements → rubric. Behaviors specified in requirements.json and problem_statement.md have rubric coverage. Coverage gaps are penalized, but less severely than alignment gaps.
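
A made-up example of the tests → rubric rule in practice; the criterion fields, module path, and test are hypothetical.

```python
# Hypothetical illustration of the tests → rubric rule: the criterion below is
# only acceptable because test.patch ships an offline assertion that checks it.
criterion = {
    "id": "FC-1",
    "description": "parse_duration() returns a timedelta of 90 seconds for the input '1m30s'",
    "source": "requirements.json#R3",
    "rationale": "Callers schedule retries from this value; a wrong parse silently delays jobs.",
}

# Matching assertion shipped in test.patch:
def test_parse_duration_minutes_and_seconds():
    from mypkg.duration import parse_duration  # module path is illustrative
    assert parse_duration("1m30s").total_seconds() == 90
```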

Stacking rule

One observable behavior per criterion. If a criterion contains an AND joining two independent behaviors, split it into two criteria with a dependent_on link.
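
A hypothetical before/after showing the split and the dependent_on link:

```python
# Before: one criterion stacking two independent behaviors with an AND.
stacked = {
    "id": "FC-2",
    "description": "rejects negative offsets AND logs a warning naming the offending field",
}

# After: one observable behavior per criterion, linked with dependent_on.
split = [
    {
        "id": "FC-2a",
        "description": "rejects negative offsets with a ValueError",
    },
    {
        "id": "FC-2b",
        "description": "logs a warning that names the offending field",
        "dependent_on": "FC-2a",
    },
]
```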

Description quality

  • Behavioral outputs only: no "correctly parses", no "appropriately handles", no "either A or B" disjunctions (see the before/after sketch below).
  • Self-contained: no "before the change" back-references.
  • Rationale states the user / codebase impact, not a restatement of the description.
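
One hypothetical before/after for a description:

```python
# Hypothetical before/after for a criterion description.
bad = "correctly handles malformed config files"  # vague verb, not directly observable
good = ("load_config() raises ConfigError naming the missing key "
        "when a required field is absent")        # observable behavior, self-contained
```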

Online-test handling

Criteria only verifiable by @pytest.mark.online tests are marked minor, with the verification limitation called out in the rationale.
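
For illustration, a test of the kind this rule covers; the module and function are made up.

```python
# A criterion backed only by a test like this one cannot be verified offline,
# so it is downgraded to minor and the limitation is noted in the rationale.
import pytest

@pytest.mark.online
def test_fetch_release_metadata_live():
    from mypkg.releases import fetch_release_metadata  # illustrative module
    meta = fetch_release_metadata("v2.1.0")             # hits the real registry
    assert meta["version"] == "2.1.0"
```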

How to engage

Task authoring contracts

Volume work, typically per-task fixed pricing or hourly. I'm currently active in the Mercor expert program; I have capacity for 1-2 additional eval programs (no overlap with Mercor's customer set).

Rubric & harness consulting

Hourly or per-engagement. Common scope: design the eval taxonomy, prototype the LLM judge, pilot on a small dataset, hand off the harness for your team to extend.

AI integration audits

Fixed-bid 1-week engagement: review the integration end-to-end, deliver a written report with severity-ranked findings and concrete recommendations.

Polyglot code review

Hourly. Useful when you need an outside reviewer for Go / Rust / Java / Kotlin / C++ code where you don't have in-team depth.

Want a second opinion on your code-agent eval?

I respond to qualified inbound within 24 hours. Tell me what you're evaluating and what's not yet working.

Email me ↗