Frequently Asked Questions

Is ChainReason free to use?

Yes. ChainReason is MIT-licensed, so you can clone it, run it locally, and modify it without paying a license fee. You still pay for any model API calls or cloud compute you use while running evaluations.

How does ChainReason compare to lm-eval-harness?

ChainReason is narrower and more domain-specific than lm-eval-harness. ChainReason focuses on Ethereum and DeFi reasoning tasks, while lm-eval-harness is a broad benchmark runner for many general-purpose datasets. If you need signal on on-chain behavior, ChainReason is the better fit.

Does ChainReason support local HuggingFace models?

Yes. ChainReason supports local inference paths through `torch`, `transformers`, and `accelerate`, so you can evaluate open-weight models without sending prompts to an API. That makes ChainReason useful for offline testing, reproducible research, and side-by-side comparisons against hosted models.

Can ChainReason evaluate custom DeFi tasks?

Yes. ChainReason exposes a `Task` interface with `load`, `build_prompt`, `parse_response`, and `score`, so you can add your own dataset and scoring logic. Once you register the task in `TASK_REGISTRY`, ChainReason can run it through the same evaluation loop as the built-in tasks.
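
A custom task along these lines might look like the sketch below. The four method names and `TASK_REGISTRY` come from the description above; the method signatures, the dictionary-based registry, and the `LiquidationRiskTask` example are assumptions made for illustration and may not match the real package.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

# Hypothetical registry: the name comes from the text, the dict shape is assumed.
TASK_REGISTRY: Dict[str, type] = {}


class Task(ABC):
    """Assumed shape of the Task interface (signatures are guesses)."""

    @abstractmethod
    def load(self) -> List[Dict[str, Any]]: ...

    @abstractmethod
    def build_prompt(self, example: Dict[str, Any]) -> str: ...

    @abstractmethod
    def parse_response(self, response: str) -> Any: ...

    @abstractmethod
    def score(self, predictions: List[Any], targets: List[Any]) -> Dict[str, float]: ...


class LiquidationRiskTask(Task):
    """Hypothetical custom task: can a lending position be liquidated?"""

    def load(self) -> List[Dict[str, Any]]:
        # Inline examples for illustration; a real task would read a data file.
        return [
            {"health_factor": 0.92, "label": "yes"},
            {"health_factor": 1.45, "label": "no"},
        ]

    def build_prompt(self, example: Dict[str, Any]) -> str:
        return (
            f"A lending position has health factor {example['health_factor']}. "
            "Can it be liquidated? Answer yes or no."
        )

    def parse_response(self, response: str) -> Optional[str]:
        # Use only the first word; anything other than yes/no is unparseable.
        words = response.strip().lower().split()
        first = words[0].strip(".,!:") if words else ""
        return first if first in {"yes", "no"} else None

    def score(self, predictions: List[Any], targets: List[Any]) -> Dict[str, float]:
        correct = sum(p == t for p, t in zip(predictions, targets))
        return {"accuracy": correct / len(targets) if targets else 0.0}


# Register the task under a name the runner could look up.
TASK_REGISTRY["liquidation_risk"] = LiquidationRiskTask
```

Returning `None` for unparseable answers keeps formatting failures visible as wrong predictions instead of silently dropping them.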

What does ChainReason measure on Ethereum tasks?

ChainReason measures protocol knowledge, Solidity vulnerability classification, contract identification from ABI summaries, transaction intent from decoded actions, and AMM slippage math. That mix lets ChainReason expose different failure modes instead of collapsing everything into one generic accuracy number.

When should I use ChainReason instead of a generic benchmark?

Use ChainReason when the product question is whether an LLM can reason about DeFi mechanics, traces, or token math. A generic benchmark can tell you whether a model is broadly capable; ChainReason tells you whether it handles the on-chain reasoning your product depends on. With 64 seed examples, treat the result as a regression signal rather than a safety guarantee.

ChainReason: Best AI Benchmarks for DeFi LLM Teams in 2026

ChainReason benchmarks whether an LLM can reason over Ethereum mechanics, Solidity vulnerabilities, transaction intent, and AMM math instead of just generating code.

What Is ChainReason?

ChainReason is a lightweight AI benchmark built by Joshua Yamamoto for evaluating LLM reasoning on Ethereum and DeFi tasks. Its seed suite covers 64 curated examples across five tasks: protocol_qa, vuln_detect, contract_class, tx_intent, and slippage_pred. The point is not scale; the point is to separate symbolic reasoning, code understanding, structural pattern matching, and AMM math.

Quick Overview

Attribute	Details
Type	AI Benchmarks
Best For	DeFi LLM teams and Ethereum researchers
Language/Stack	Python 3.9+, PyTorch, Transformers, OpenAI API, Ethereum and DeFi
License	MIT
GitHub Stars	N/A
Pricing	Open-Source
Last Release	N/A

ChainReason runs as a small Python package with task-specific prompts, parsers, and scorers. It is built for quick regression checks on on-chain reasoning, not for broad general-purpose NLP scoring. If your workflow already inspects transaction traces with OpenTrace or compares reasoning-centric models with Open R1, ChainReason fits as the domain-specific scoring layer.

Who Should Use ChainReason?

LLM engineers benchmarking model versions on Ethereum-specific reasoning before a release.
Security researchers testing whether a model can classify Solidity vulnerabilities and identify contract types from ABI summaries.
DeFi protocol teams evaluating support copilots that need to explain swaps, pool state, and protocol behavior without hallucinating.
Indie hackers building crypto tools who need a compact sanity check before wiring a model into production.

Not ideal for:

Leaderboard hunters who need thousands of examples and broad academic coverage rather than a focused reasoning set.
Teams wanting only code generation, because ChainReason evaluates models rather than synthesizing Solidity from scratch.
Non-Ethereum projects, where ChainReason would add little signal compared with test sets built for their own domain.

Key Features of ChainReason

Five-task coverage — ChainReason spans protocol QA, vulnerability detection, contract classification, transaction-intent inference, and slippage prediction. That mix matters because each task stresses a different failure mode, from symbolic reasoning to closed-form numeric math.
Task-specific metrics — protocol_qa uses accuracy, vuln_detect and contract_class add macro-F1, tx_intent checks label accuracy, and slippage_pred uses tiered relative error. That gives you a more honest picture than a single aggregate score.
Curated seed dataset — the repository ships with 64 small, hand-written examples that are easy to run in under a minute. The design is intentional: ChainReason is meant to validate model behavior, not to simulate a giant Etherscan crawl.
Local and API model support — ChainReason works with OpenAI models through the API client and with local HuggingFace models via torch, transformers, and accelerate. That makes it usable for both closed-model smoke tests and offline open-weight runs.
Configurable data loading — the benchmark accepts custom data via --data-path, so you can extend the seed set with your own protocol snippets, traces, and AMM scenarios. For teams with proprietary flows, that is the difference between a toy benchmark and a real internal gate.
Programmatic runner — the package exposes get_task, run_eval, and model adapters, which makes it easy to slot into CI or a research notebook. You can score a single task or sweep the whole suite without writing glue code.
Results aggregation — aggregate_results.py turns per-run outputs into a summary markdown file. That is useful when you want a human-readable report for model comparisons, not just JSON blobs.
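
To make the task-specific metrics above concrete, here is a minimal sketch of accuracy, macro-F1, and a tiered relative-error score. The function names are hypothetical, macro-F1 is averaged over the labels that appear in the targets, and the tier thresholds (full credit within 1%, half credit within 5%) are illustrative assumptions; ChainReason's actual tiers are not stated here.

```python
from typing import Sequence, Tuple


def accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that match the target label exactly."""
    if not targets:
        return 0.0
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def macro_f1(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over labels present in the targets."""
    labels = sorted(set(targets))
    scores = []
    for label in labels:
        tp = sum(p == label and t == label for p, t in zip(preds, targets))
        fp = sum(p == label and t != label for p, t in zip(preds, targets))
        fn = sum(p != label and t == label for p, t in zip(preds, targets))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def tiered_relative_error_score(
    pred: float,
    target: float,
    tiers: Tuple[Tuple[float, float], ...] = ((0.01, 1.0), (0.05, 0.5)),
) -> float:
    """Credit based on relative error; the tier values are assumptions."""
    if target == 0:
        return 1.0 if pred == 0 else 0.0
    rel_err = abs(pred - target) / abs(target)
    for max_err, credit in tiers:  # tiers are ordered from strictest to loosest
        if rel_err <= max_err:
            return credit
    return 0.0


preds = ["reentrancy", "overflow", "reentrancy", "access_control"]
targets = ["reentrancy", "overflow", "overflow", "access_control"]
print(accuracy(preds, targets), macro_f1(preds, targets))
```

Macro-F1 weights every class equally, so a model that ignores a rare vulnerability class is penalized even when its overall accuracy looks fine.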

ChainReason vs Alternatives

Tool	Best For	Key Differentiator	Pricing
ChainReason	Ethereum and DeFi reasoning checks	Five task families cover protocol mechanics, transaction intent, vulnerability detection, and AMM math	Open-Source
lm-eval-harness	Broad LLM benchmarking across many datasets	Huge benchmark catalog and standardized evaluation plumbing	Open-Source
OpenAI Evals	API-first model regression tests	Tight integration with OpenAI workflows and simple eval iteration	Open-Source
Open R1	Reasoning model research and reproducible experiments	Focus on open reasoning-model development rather than domain-specific DeFi scoring	Open-Source

Pick ChainReason when the question is whether a model understands on-chain behavior, not whether it can answer generic trivia. Pick OpenTrace alongside it when you need to inspect decoded transaction sequences before scoring them. Pick lm-eval-harness when you want one harness for many unrelated benchmarks, and pick OpenAI Evals when your workflow is already centered on API-backed model regression.

How ChainReason Works

ChainReason uses a small Task abstraction as the core unit of evaluation. Each task defines how examples load, how prompts are built, how responses are parsed, and how predictions are scored. The runner then feeds those prompts into a model adapter, collects completions, and computes task-specific metrics against the target labels or numeric outputs.

The design is intentionally narrow. protocol_qa asks multiple-choice questions about protocol mechanics, vuln_detect classifies Solidity snippets by vulnerability type, contract_class infers contract category from ABI summaries, tx_intent reasons over decoded actions, and slippage_pred computes AMM swap output from pool state. That structure matters because ChainReason is testing different reasoning paths, not just one generalized answer format.
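
As an illustration of the closed-form math a slippage_pred example might require, the sketch below computes swap output for a constant-product pool. It assumes a Uniswap v2 style pool with a 0.3% fee charged on the input amount; the text only says "AMM swap output from pool state", so the pool model, the fee, and the function names are all assumptions.

```python
import math


def swap_output(amount_in: float, reserve_in: float, reserve_out: float,
                fee: float = 0.003) -> float:
    """Output of a constant-product swap, with the fee taken from the input.

    Assumes x * y = k pricing (Uniswap v2 style); not confirmed by the source.
    """
    if amount_in <= 0 or reserve_in <= 0 or reserve_out <= 0:
        raise ValueError("amounts and reserves must be positive")
    effective_in = amount_in * (1.0 - fee)
    return effective_in * reserve_out / (reserve_in + effective_in)


def price_impact(amount_in: float, reserve_in: float, reserve_out: float,
                 fee: float = 0.003) -> float:
    """Shortfall versus the pre-trade spot price, as a fraction (fee included)."""
    spot_out = amount_in * reserve_out / reserve_in
    actual_out = swap_output(amount_in, reserve_in, reserve_out, fee)
    return 1.0 - actual_out / spot_out


# Hypothetical pool: 1,000 ETH against 2,000,000 USDC, swapping in 10 ETH.
out = swap_output(10.0, 1_000.0, 2_000_000.0)
impact = price_impact(10.0, 1_000.0, 2_000_000.0)
print(f"output: {out:.2f} USDC, price impact: {impact:.4%}")
```

The output falls short of the 20,000 USDC implied by the spot price because both the fee and the movement along the curve reduce what the trader receives.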

The architecture is easy to inspect because the code path is small: load examples, build a prompt, call the model, parse the answer, score it, and write results. If you need to extend the benchmark with a new DeFi workflow, you implement the Task interface, register it in TASK_REGISTRY, and run the same evaluation loop against your custom data.
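
The code path described above (load, prompt, call, parse, score, write) can be sketched in a few functions. Everything here is hypothetical: the example data, the `evaluate` and `stub_model` names, and the record layout are stand-ins for illustration, not ChainReason's actual runner or output format.

```python
import json
import re
from typing import Any, Callable, Dict, List, Optional

# Hypothetical examples in a multiple-choice shape similar to protocol_qa.
EXAMPLES: List[Dict[str, Any]] = [
    {
        "question": "In a fee-less constant-product AMM swap, which quantity stays fixed?",
        "choices": {"A": "x + y", "B": "x * y", "C": "x / y", "D": "x - y"},
        "answer": "B",
    },
    {
        "question": "Which ERC standard defines the fungible token interface?",
        "choices": {"A": "ERC-20", "B": "ERC-721", "C": "ERC-1155", "D": "ERC-4626"},
        "answer": "A",
    },
]


def build_prompt(example: Dict[str, Any]) -> str:
    options = "\n".join(f"{key}. {text}" for key, text in example["choices"].items())
    return f"{example['question']}\n{options}\nAnswer with a single letter."


def parse_response(response: str) -> Optional[str]:
    # Take the first standalone capital letter A-D; None if there is none.
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None


def evaluate(examples: List[Dict[str, Any]], model: Callable[[str], str]) -> Dict[str, Any]:
    """Run prompt -> model -> parse -> score over every example."""
    records = []
    for example in examples:
        prediction = parse_response(model(build_prompt(example)))
        records.append({
            "prediction": prediction,
            "target": example["answer"],
            "correct": prediction == example["answer"],
        })
    acc = sum(r["correct"] for r in records) / len(records) if records else 0.0
    return {"accuracy": acc, "records": records}


def stub_model(prompt: str) -> str:
    # Stand-in for a model adapter; a real one would call an API or local model.
    return "The answer is B."


results = evaluate(EXAMPLES, stub_model)
print(json.dumps(results, indent=2))
```

Passing the model as a plain callable keeps the loop independent of whether completions come from a hosted API or a local checkpoint.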

python scripts/run_eval.py --task slippage_pred --client openai --model gpt-4o-mini --limit 5
python scripts/aggregate_results.py results/full -o results/full/SUMMARY.md

The first command runs a small evaluation pass on one task and writes per-example outputs. The second command rolls those outputs into a summary file so you can compare runs, track regressions, or paste the result into a review doc. If you are testing local checkpoints, swap the client layer for the HuggingFace path and keep the rest of the workflow unchanged.

Pros and Cons of ChainReason

Pros:

Domain-specific signal — ChainReason tests Ethereum and DeFi reasoning directly, which is more useful than generic language scores for on-chain products.
Multiple reasoning modes — the benchmark covers textual QA, code classification, trace interpretation, and numeric AMM calculation in one suite.
Small and fast — the seed set is tiny enough to run quickly, which makes it practical for CI or pre-merge checks.
Extensible interface — the Task abstraction and registry make new datasets straightforward to add.
Works with open and closed models — you can compare OpenAI API models against local HuggingFace models without changing the benchmark structure.
Readable outputs — the aggregation script generates a markdown summary that is easy to share with a team.

Cons:

Limited sample size — 64 seed examples are useful for regression checks but too small for final model selection.
Narrow domain — ChainReason is specific to Ethereum and DeFi, so it will not replace a general LLM benchmark stack.
Manual curation bias — hand-written examples are high quality, but they do not cover every real-world edge case.
API cost for hosted models — if you run the suite with OpenAI models, you still pay inference costs.
No giant leaderboard — the repository is about evaluation mechanics and signal quality, not public ranking theater.

Getting Started with ChainReason

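# Clone the repository and install the package in editable mode
git clone https://github.com/joshawome/chainreason
cd chainreason
pip install -e .
# Dependencies for local HuggingFace models
pip install torch transformers accelerate
# Required for the OpenAI client
export OPENAI_API_KEY=...
# Run five protocol_qa examples against a hosted model
python scripts/run_eval.py --task protocol_qa --client openai --model gpt-4o-mini --limit 5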

After that run, ChainReason writes results for the selected task and model so you can inspect per-example predictions and metrics. If you want a full sweep, point scripts/run_eval.py at a YAML config and then aggregate the run into a summary file. If you are extending the benchmark, use --data-path to point at your own curated examples and keep the same task-scoring flow.

Verdict

ChainReason is a strong fit for Ethereum-focused LLM evaluation when you need a compact benchmark that tests protocol knowledge, vulnerability detection, transaction intent, and AMM math in one pass. Its main strength is breadth across reasoning modes; its main caveat is the small seed set, so treat it as a regression suite, not a final scorecard. Use it if your models touch DeFi; skip it if you need broad general-purpose evaluation.
