What Is ChainReason?
ChainReason is a lightweight AI benchmark built by Joshua Yamamoto for evaluating LLM reasoning on Ethereum and DeFi tasks. It targets DeFi LLM teams, and its seed suite covers 64 curated examples across five tasks: protocol_qa, vuln_detect, contract_class, tx_intent, and slippage_pred. The point is not scale; the point is to separate symbolic reasoning, code understanding, structural pattern matching, and AMM math.
Quick Overview
| Attribute | Details |
|---|---|
| Type | AI Benchmarks |
| Best For | DeFi LLM teams and Ethereum researchers |
| Language/Stack | Python 3.9+, PyTorch, Transformers, OpenAI API, Ethereum and DeFi |
| License | MIT |
| GitHub Stars | N/A |
| Pricing | Open-Source |
| Last Release | N/A |
ChainReason runs as a small Python package with task-specific prompts, parsers, and scorers. It is built for quick regression checks on on-chain reasoning, not for broad general-purpose NLP scoring. If your workflow already inspects transaction traces with OpenTrace or compares reasoning-centric models with Open R1, ChainReason fits as the domain-specific scoring layer.
Who Should Use ChainReason?
- LLM engineers benchmarking model versions on Ethereum-specific reasoning before a release.
- Security researchers testing whether a model can classify Solidity vulnerabilities and identify contract types from ABI summaries.
- DeFi protocol teams evaluating support copilots that need to explain swaps, pool state, and protocol behavior without hallucinating.
- Indie hackers building crypto tools who need a compact sanity check before wiring a model into production.
Not ideal for:
- Leaderboard hunters who need thousands of examples and broad academic coverage rather than a focused reasoning set.
- Teams that only want code generation, because ChainReason evaluates models rather than synthesizing Solidity from scratch.
- Non-Ethereum projects where the benchmark would add little signal compared with domain-specific test sets.
Key Features of ChainReason
- Five-task coverage — ChainReason spans protocol QA, vulnerability detection, contract classification, transaction-intent inference, and slippage prediction. That mix matters because each task stresses a different failure mode, from symbolic reasoning to closed-form numeric math.
- Task-specific metrics — `protocol_qa` uses accuracy, `vuln_detect` and `contract_class` add macro-F1, `tx_intent` checks label accuracy, and `slippage_pred` uses tiered relative error. That gives you a more honest picture than a single aggregate score.
- Curated seed dataset — the repository ships with 64 small, hand-written examples that are easy to run in under a minute. The design is intentional: ChainReason is meant to validate model behavior, not to simulate a giant Etherscan crawl.
- Local and API model support — ChainReason works with OpenAI models through the API client and with local HuggingFace models via `torch`, `transformers`, and `accelerate`. That makes it usable for both closed-model smoke tests and offline open-weight runs.
- Configurable data loading — the benchmark accepts custom data via `--data-path`, so you can extend the seed set with your own protocol snippets, traces, and AMM scenarios. For teams with proprietary flows, that is the difference between a toy benchmark and a real internal gate.
- Programmatic runner — the package exposes `get_task`, `run_eval`, and model adapters, which makes it easy to slot into CI or a research notebook. You can score a single task or sweep the whole suite without writing glue code.
- Results aggregation — `aggregate_results.py` turns per-run outputs into a summary markdown file. That is useful when you want a human-readable report for model comparisons, not just JSON blobs.
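To make the metric names in the list above concrete, here is a minimal sketch of accuracy, macro-F1, and a tiered relative-error check in plain Python. The function names and the tolerance tiers in `TIERS` are assumptions for illustration; ChainReason's own scorers and thresholds may differ.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the target label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in targets or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treat an empty class as 0.0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical tolerance tiers (1%, 5%, 10%); not taken from the repository.
TIERS = (0.01, 0.05, 0.10)


def tiered_relative_error(target, predicted, tiers=TIERS):
    """For each tier, report whether |predicted - target| / |target| is within it.

    Assumes a nonzero target value.
    """
    rel_err = abs(predicted - target) / abs(target)
    return {tier: rel_err <= tier for tier in tiers}


y_true = ["reentrancy", "overflow", "reentrancy", "none"]
y_pred = ["reentrancy", "reentrancy", "reentrancy", "none"]
print(accuracy(y_true, y_pred))            # 3 of 4 labels match
print(macro_f1(y_true, y_pred))            # the missed "overflow" class drags the mean down
print(tiered_relative_error(100.0, 103.0))
```

In this toy case macro-F1 comes out lower than accuracy because the one missed class counts as much as the classes the model got right, which is why the two metrics are worth reporting side by side for classification tasks.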
ChainReason vs Alternatives
| Tool | Best For | Key Differentiator | Pricing |
|---|---|---|---|
| ChainReason | Ethereum and DeFi reasoning checks | Five task families cover protocol mechanics, transaction intent, vulnerability detection, and AMM math | Open-Source |
| lm-eval-harness | Broad LLM benchmarking across many datasets | Huge benchmark catalog and standardized evaluation plumbing | Open-Source |
| OpenAI Evals | API-first model regression tests | Tight integration with OpenAI workflows and simple eval iteration | Open-Source |
| Open R1 | Reasoning model research and reproducible experiments | Focus on open reasoning-model development rather than domain-specific DeFi scoring | Open-Source |
Pick ChainReason when the question is whether a model understands on-chain behavior, not whether it can answer generic trivia. Pick OpenTrace alongside it when you need to inspect decoded transaction sequences before scoring them. Pick lm-eval-harness when you want one harness for many unrelated benchmarks, and pick OpenAI Evals when your workflow is already centered on API-backed model regression.
How ChainReason Works
ChainReason uses a small Task abstraction as the core unit of evaluation. Each task defines how examples load, how prompts are built, how responses are parsed, and how predictions are scored. The runner then feeds those prompts into a model adapter, collects completions, and computes task-specific metrics against the target labels or numeric outputs.
The design is intentionally narrow. protocol_qa asks multiple-choice questions about protocol mechanics, vuln_detect classifies Solidity snippets by vulnerability type, contract_class infers contract category from ABI summaries, tx_intent reasons over decoded actions, and slippage_pred computes AMM swap output from pool state. That structure matters because ChainReason is testing different reasoning paths, not just one generalized answer format.
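As a rough illustration of the kind of closed-form math `slippage_pred` targets, the sketch below computes swap output for a constant-product pool. It assumes a Uniswap-v2-style invariant with the fee taken from the input amount and integer arithmetic; the source does not say which AMM formula or fee ChainReason's examples use, and `swap_output`, `slippage`, and `fee_bps` are hypothetical names.

```python
def swap_output(reserve_in, reserve_out, amount_in, fee_bps=30):
    """Output amount for a constant-product pool (x * y = k), fee charged on input.

    Assumption: Uniswap-v2-style formula with integer (floor) division.
    """
    amount_in_after_fee = amount_in * (10_000 - fee_bps)
    numerator = amount_in_after_fee * reserve_out
    denominator = reserve_in * 10_000 + amount_in_after_fee
    return numerator // denominator


def slippage(reserve_in, reserve_out, amount_in, fee_bps=30):
    """Relative shortfall of the actual output versus the pre-trade spot price."""
    actual = swap_output(reserve_in, reserve_out, amount_in, fee_bps)
    at_spot = amount_in * reserve_out / reserve_in
    return 1 - actual / at_spot


# A 10,000-unit swap into a balanced pool holding 1,000,000 units per side.
print(swap_output(1_000_000, 1_000_000, 10_000))             # with the 0.30% fee
print(swap_output(1_000_000, 1_000_000, 10_000, fee_bps=0))  # price impact only
print(slippage(1_000_000, 1_000_000, 10_000))
```

A model that answers with the spot-price amount instead of the curve-adjusted amount would miss by about one percent here, which is the sort of error a tiered relative-error metric is designed to grade.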
The architecture is easy to inspect because the code path is small: load examples, build a prompt, call the model, parse the answer, score it, and write results. If you need to extend the benchmark with a new DeFi workflow, you implement the Task interface, register it in TASK_REGISTRY, and run the same evaluation loop against your custom data.
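The load, prompt, call, parse, score path described above can be sketched as follows. This is not ChainReason's actual code: the method names on `Task`, the `register` decorator, the `evaluate` helper, and the toy task are all hypothetical, and only the names `Task` and `TASK_REGISTRY` come from the text.

```python
import re
from abc import ABC, abstractmethod

# Hypothetical stand-in for the registry the text calls TASK_REGISTRY.
TASK_REGISTRY = {}


def register(name):
    """Class decorator that records a Task subclass under a task name."""
    def wrap(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return wrap


class Task(ABC):
    """Assumed interface: one method per stage of the evaluation path."""

    @abstractmethod
    def load_examples(self): ...

    @abstractmethod
    def build_prompt(self, example): ...

    @abstractmethod
    def parse_response(self, text): ...

    @abstractmethod
    def score(self, predictions, targets): ...


@register("toy_protocol_qa")
class ToyProtocolQA(Task):
    def load_examples(self):
        return [{
            "question": "Which invariant does a fee-less constant-product AMM preserve?",
            "choices": {"A": "x + y = k", "B": "x * y = k"},
            "answer": "B",
        }]

    def build_prompt(self, example):
        options = "\n".join(f"{k}. {v}" for k, v in example["choices"].items())
        return f"{example['question']}\n{options}\nAnswer with a single letter."

    def parse_response(self, text):
        # Take the first standalone letter A-D; None if the model gave no letter.
        match = re.search(r"\b([A-D])\b", text.upper())
        return match.group(1) if match else None

    def score(self, predictions, targets):
        correct = sum(p == t for p, t in zip(predictions, targets))
        return {"accuracy": correct / len(targets)}


def evaluate(task_name, model):
    """Run one task end to end; `model` is any callable mapping prompt -> text."""
    task = TASK_REGISTRY[task_name]()
    examples = task.load_examples()
    predictions = [task.parse_response(model(task.build_prompt(ex))) for ex in examples]
    return task.score(predictions, [ex["answer"] for ex in examples])


# A stub model keeps the sketch runnable without any API key.
result = evaluate("toy_protocol_qa", lambda prompt: "Answer: B")
print(result)
```

Treating the model as a plain callable is a design choice for the sketch: it lets the same loop run against an API client, a local checkpoint, or a stub in a unit test.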
```shell
python scripts/run_eval.py --task slippage_pred --client openai --model gpt-4o-mini --limit 5
python scripts/aggregate_results.py results/full -o results/full/SUMMARY.md
```
The first command runs a small evaluation pass on one task and writes per-example outputs. The second command rolls those outputs into a summary file so you can compare runs, track regressions, or paste the result into a review doc. If you are testing local checkpoints, swap the client layer for the HuggingFace path and keep the rest of the workflow unchanged.
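For a sense of what the aggregation step produces, the following sketch turns a list of per-run result records into a markdown table. The record fields (`task`, `model`, `metric`, `score`) and the table layout are assumptions, not the real output format of `aggregate_results.py`, and the scores are placeholder values rather than measured results.

```python
def summarize(rows):
    """Render per-run result records as a markdown table, sorted by task then model."""
    header = ["| Task | Model | Metric | Score |", "|---|---|---|---|"]
    ordered = sorted(rows, key=lambda r: (r["task"], r["model"]))
    body = [
        f"| {r['task']} | {r['model']} | {r['metric']} | {r['score']:.3f} |"
        for r in ordered
    ]
    return "\n".join(header + body)


# Placeholder records; "model-a" and the scores are made up for illustration.
rows = [
    {"task": "vuln_detect", "model": "model-a", "metric": "macro_f1", "score": 0.5},
    {"task": "protocol_qa", "model": "model-a", "metric": "accuracy", "score": 0.75},
]
summary = summarize(rows)
print(summary)
```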
Pros and Cons of ChainReason
Pros:
- Domain-specific signal — ChainReason tests Ethereum and DeFi reasoning directly, which is more useful than generic language scores for on-chain products.
- Multiple reasoning modes — the benchmark covers textual QA, code classification, trace interpretation, and numeric AMM calculation in one suite.
- Small and fast — the seed set is tiny enough to run quickly, which makes it practical for CI or pre-merge checks.
- Extensible interface — the `Task` abstraction and registry make new datasets straightforward to add.
- Works with open and closed models — you can compare OpenAI API models against local HuggingFace models without changing the benchmark structure.
- Readable outputs — the aggregation script generates a markdown summary that is easy to share with a team.
Cons:
- Limited sample size — 64 seed examples are useful for regression checks but too small for final model selection.
- Narrow domain — ChainReason is specific to Ethereum and DeFi, so it will not replace a general LLM benchmark stack.
- Manual curation bias — hand-written examples are high quality, but they do not cover every real-world edge case.
- API cost for hosted models — if you run the suite with OpenAI models, you still pay inference costs.
- No giant leaderboard — the repository is about evaluation mechanics and signal quality, not public ranking theater.
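The sample-size caveat can be quantified with a quick back-of-the-envelope calculation. The sketch below uses the normal approximation to the binomial, which is a standard simplification and not something ChainReason itself computes; `accuracy_margin` is a hypothetical helper, and the 0.75 accuracy is an arbitrary example value.

```python
import math


def accuracy_margin(p, n, z=1.96):
    """Half-width of the approximate 95% interval for accuracy p measured on n examples."""
    return z * math.sqrt(p * (1 - p) / n)


# An observed accuracy of 0.75 on all 64 seed examples.
margin = accuracy_margin(0.75, 64)
print(round(margin, 3))  # roughly 0.106, i.e. about plus or minus 10.6 points
```

Two models whose scores differ by a few points on the full seed set are therefore statistically hard to tell apart, and each individual task uses only a subset of the 64 examples, so per-task intervals are wider still.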
Getting Started with ChainReason
```shell
git clone https://github.com/joshawome/chainreason
cd chainreason
pip install -e .
pip install torch transformers accelerate
export OPENAI_API_KEY=...
python scripts/run_eval.py --task protocol_qa --client openai --model gpt-4o-mini --limit 5
```
After that run, ChainReason writes results for the selected task and model so you can inspect per-example predictions and metrics. If you want a full sweep, point scripts/run_eval.py at a YAML config and then aggregate the run into a summary file. If you are extending the benchmark, use --data-path to point at your own curated examples and keep the same task-scoring flow.
Verdict
ChainReason is a strong option for Ethereum-focused LLM evaluation when you need a compact benchmark that tests protocol knowledge, vulnerability detection, contract classification, transaction intent, and AMM math in one pass. Its main strength is breadth across reasoning modes; its main caveat is the 64-example seed set, so treat it as a regression suite, not a final scorecard. Use it if your models touch DeFi; skip it if you need broad general-purpose evaluation.



