What I do
Agentic-benchmark task authoring
Take a real GitHub issue or PR from an OSS project. Turn it into a self-contained evaluation task: pinned Docker environment, structured problem statement and prompt, interface contract, requirements file, golden patch, test patch, and an implementation-agnostic rubric covering functional / robustness / style criteria. Validate end-to-end with an automated grader before shipping.
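For a sense of shape, below is a minimal sketch of what one task bundles; the class and field names are my own shorthand for this page, not any harness's actual schema.

```python
from dataclasses import dataclass

@dataclass
class EvalTask:
    """Illustrative shape of one self-contained task bundle (not a real schema)."""
    dockerfile: str          # pinned environment: digest-pinned base image, frozen dependencies
    problem_statement: str   # what needs to change, described in user-facing terms
    prompt: str              # the exact text handed to the agent under test
    interface_contract: str  # signatures / CLI / API the solution must expose
    requirements: str        # requirements file: enumerated behaviors to satisfy
    golden_patch: str        # reference solution diff
    test_patch: str          # offline assertions that grade a candidate solution
    rubric: str              # implementation-agnostic functional / robustness / style criteria
```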
Rubric design & QC
Apply the alignment rules that matter: every functional criterion verifiable by a test, atomic behaviors (no AND-stacked criteria), behavioral descriptions only (no implementation details), source attribution to requirements / prompt / interface, and rationale tied to user / codebase impact. Re-trigger QC until both reviewer and super-reviewer scores hit the threshold.
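Concretely, a single criterion that passes these rules looks roughly like the sketch below; the field names and values are illustrative, not any program's actual rubric schema.

```python
# Illustrative only: hypothetical criterion, example field names and values.
criterion = {
    "id": "func-03",
    "category": "functional",
    "description": (
        "Returns HTTP 429 with a Retry-After header when a client exceeds "
        "the configured request budget"  # one observable behavior, no implementation detail
    ),
    "source": "requirements.json#rate_limiting",  # attribution to requirements / prompt / interface
    "rationale": (
        "Clients need a machine-readable backoff signal; silently dropping requests "
        "would break downstream retry logic"  # impact on users / the codebase, not a restatement
    ),
    "verified_by": ["test_patch::test_rate_limit_returns_429"],  # the offline test that asserts it
}
```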
LLM-as-judge pipeline operation
Operate the full LLM-orchestration toolchain: context generators, problem-statement generators, golden-plan generators, planning + execution graders, and QC checks for test alignment, prompt clarity, fairness, and rubric quality. Comfortable extending these pipelines with custom Python tooling.
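As one example of the custom Python tooling I bolt onto these pipelines, here is a toy lint pass over rubric descriptions; it is a rough regex stand-in for the LLM-driven QC checks, not a reproduction of them.

```python
import re

# Toy QC pass: flags vague verbs, disjunctions, AND-stacking, and back-references
# in rubric criterion descriptions. Illustrative only.
VAGUE = re.compile(r"\b(correctly|appropriately|properly|gracefully)\b", re.IGNORECASE)
DISJUNCTION = re.compile(r"\beither\b.*\bor\b", re.IGNORECASE)

def lint_description(description: str) -> list[str]:
    issues = []
    if VAGUE.search(description):
        issues.append("vague verb: name the observable output instead")
    if DISJUNCTION.search(description):
        issues.append("disjunction: a criterion should pin down one behavior")
    if " and also " in description.lower():
        issues.append("possible AND-stacking: split into atomic criteria")
    if "before the change" in description.lower():
        issues.append("back-reference: criterion must be self-contained")
    return issues

print(lint_description("Correctly handles invalid input and also logs it"))
# -> ['vague verb: ...', 'possible AND-stacking: ...']
```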
Eval-harness consulting
For companies building internal AI coding agents or assistants, design the eval suite that grades them honestly. Rubric taxonomy, harness architecture, dataset sourcing, and the LLM-judge pipeline that QCs each evaluation cycle.
AI integration audits
Independent review of GPT/Claude integrations already running in production code: prompts, fallback paths, cost ceilings, failure modes, observability. Where applicable, recommend prompt caching, streaming, fine-tuning vs. RAG, and model selection.
Q1 2026 output
Languages I author tasks in
Across the 130+ tasks I shipped in Q1 2026, Go and Rust were the dominant languages.
Authoring a task in a language requires reading a real OSS PR end-to-end, building a reproducible Docker environment, and designing working tests and a rubric. So beyond what my shipped product code shows, I can credibly read and reason about production code in every language I author tasks in.
Methodology (NDA-safe summary)
Two alignment rules I never break
- Tests → rubric. Every functional criterion is verifiable by an offline assertion in test.patch. Two unverified criteria cap the score at ~0.80 regardless of other quality (a rough sketch of this cross-check follows the list).
- Requirements → rubric. Behaviors specified in requirements.json and problem_statement.md have rubric coverage. Coverage gaps are penalized, but less severely than alignment gaps.
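The cross-check referenced above, in miniature; it assumes each criterion carries a verified_by list naming tests in test.patch, which is my illustrative convention here rather than the grader's real schema or scoring formula.

```python
def max_score_after_alignment(criteria: list[dict], tests_in_patch: set[str]) -> float:
    """Cap the achievable score based on test → rubric alignment (illustrative only)."""
    unverified = [
        c for c in criteria
        if c.get("category") == "functional"
        and not any(t in tests_in_patch for t in c.get("verified_by", []))
    ]
    # My reading of the rule above: two (or more) functional criteria with no
    # offline assertion in test.patch cap the score around 0.80.
    return 0.80 if len(unverified) >= 2 else 1.0
```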
Stacking rule
One observable behavior per criterion. If a criterion contains an AND joining two independent behaviors, split it into two criteria with a dependent_on link.
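A hypothetical before/after of applying the rule; only the dependent_on link is a term from my workflow, the criteria themselves are invented.

```python
# Before: one criterion joins two independent behaviors with an AND.
stacked = {
    "id": "func-07",
    "description": "Rejects malformed config files AND logs the offending key",
}

# After: two atomic criteria. The logging behavior only makes sense once rejection
# exists, so the second criterion declares a dependent_on link to the first.
split = [
    {"id": "func-07a",
     "description": "Rejects malformed config files with a non-zero exit code"},
    {"id": "func-07b",
     "description": "Logs the offending key at warning level when a config file is rejected",
     "dependent_on": "func-07a"},
]
```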
Description quality
- Behavioral outputs only: no "correctly parses", no "appropriately handles", no "either A or B" disjunctions (a bad-vs-good contrast follows this list).
- Self-contained: no "before the change" back-references.
- Rationale is the user / codebase impact, not a description restatement.
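The bad-vs-good contrast referenced above, in miniature; both entries are my own invented wording, not from any shipped rubric.

```python
# Breaks all three rules: vague verb, back-reference, rationale that restates the description.
bad = {
    "description": "Correctly handles invalid input like it did before the change",
    "rationale": "Because invalid input should be handled correctly",
}

# One observable behavior, self-contained, rationale states the impact.
good = {
    "description": "Exits with code 2 and prints the offending line number when the "
                   "config file contains an unknown key",
    "rationale": "Operators can fix a bad config without rerunning the whole pipeline",
}
```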
Online-test handling
Criteria only verifiable by @pytest.mark.online tests are marked minor, with the verification limitation called out in the rationale.
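In practice that looks something like the criterion below; everything except the @pytest.mark.online marker mentioned above is hypothetical.

```python
# Hypothetical criterion whose only covering test needs network access.
online_only_criterion = {
    "id": "func-11",
    "category": "functional",
    "severity": "minor",  # downgraded because its test carries @pytest.mark.online
    "description": "Falls back to the mirror registry URL when the primary registry "
                   "returns a 5xx response",
    "rationale": "Keeps installs working during registry outages. Verification limitation: "
                 "the covering test is marked @pytest.mark.online, so the grader cannot "
                 "assert it offline.",
    "verified_by": ["test_patch::test_mirror_fallback"],
}
```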
How to engage
Task authoring contracts
Volume work, typically per-task fixed pricing or hourly. I'm currently active in the Mercor expert program; I have capacity for 1-2 additional eval programs (no overlap with Mercor's customer set).
Rubric & harness consulting
Hourly or per-engagement. Common scope: design the eval taxonomy, prototype the LLM judge, pilot on a small dataset, hand off the harness for your team to extend.
AI integration audits
Fixed-bid 1-week engagement: review the integration end-to-end, deliver a written report with severity-ranked findings and concrete recommendations.
Polyglot code review
Hourly. Useful when you need an outside reviewer for Go / Rust / Java / Kotlin / C++ code where you don't have in-team depth.
Want a second opinion on your code-agent eval?
I respond to qualified inbound within 24 hours. Tell me what you're evaluating and what's not yet working.
Email me ↗