What Is ProgramBench?
ProgramBench is an open-source AI coding benchmark from Facebook Research that asks agents to rebuild a complete codebase from a compiled binary and its documentation. It is aimed at eval engineers who need a demanding benchmark for coding agents. The project is backed by a 2026 paper with 12 named authors and a Hugging Face test dataset, which gives teams a reproducible target for measuring how well a model can infer program structure, APIs, and behavior from black-box evidence alone.
ProgramBench is not a toy coding task. It is a reconstruction benchmark that starts from a compiled artifact and forces the model to design, implement, and validate a full source tree that matches the original program's observed behavior. That framing makes it useful for measuring real agent capability, not just autocomplete quality.
Quick Overview
| Attribute | Details |
|---|---|
| Type | AI Coding Benchmarks |
| Best For | eval engineers |
| Language/Stack | Python, uv, CLI packaging, Hugging Face datasets |
| License | See LICENSE |
| GitHub Stars | N/A |
| Pricing | Open-Source |
| Last Release | N/A |
Who Should Use ProgramBench?
ProgramBench fits teams that need a hard benchmark for end-to-end code reconstruction, not just unit-test completion. It is especially relevant when the question is whether an agent can infer architecture and behavior from indirect evidence.
- Eval engineers building benchmark suites for coding agents who need a black-box task that is harder than standard prompt-to-code generation.
- Research teams studying program synthesis, decompilation-adjacent reasoning, or agent planning under incomplete information.
- Platform teams comparing model revisions where the output must be measured against behavioral fidelity rather than style or diff size.
- Indie builders testing whether their agent pipeline can move from docs to a functioning implementation without source access.
Not ideal for:
- Teams that only need boilerplate generation or scaffolded CRUD apps.
- Workflows where the source code is already available and the problem is mostly refactoring.
- Product teams that want a simple benchmark with minimal setup and no reverse-engineering component.
Key Features of ProgramBench
- Binary-first task framing: ProgramBench starts from a compiled binary plus docs, so the model has to infer APIs, control flow, and data models from behavior rather than source code. That makes it a much stricter test than prompt-only coding benchmarks.
- Behavioral reconstruction target: The objective is to reproduce the original program's behavior, not just pass one-off tests. That pushes agents toward architecture-level reasoning, interface discovery, and careful implementation choices.
- Reproducible benchmark package: The repository exposes a Python CLI and a Hugging Face dataset entry, which makes it straightforward to pin versions and rerun evaluations in CI.
- CLI-driven workflow: `uvx programbench --help`, `uv pip install programbench`, and `pip install programbench` are all supported, so teams can adopt it without hand-rolling environment management.
- Paper-backed evaluation design: The benchmark ships with a 2026 arXiv paper, which gives you a published description of the task definition, evaluation philosophy, and intended research use.
- Leaderboard orientation: ProgramBench is tied to a public leaderboard, so you can compare models, prompts, and agent scaffolds against a shared target instead of inventing private scoring rules.
- Research-friendly packaging: The repo includes a usage guide and development setup via `uv sync`, which means contributors can modify the harness, inspect the code, and extend the benchmark without fighting environment drift.
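Pinning versions for CI reruns usually means recording what was actually installed next to each result. The sketch below is one way to do that with the Python standard library. It assumes the distribution name is `programbench`, as the `pip install programbench` command suggests. The `run_manifest` helper and its field names are hypothetical and are not part of ProgramBench.

```python
import json
import platform
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def run_manifest(dist_name: str = "programbench") -> dict:
    """Collect basic facts needed to reproduce a run on another machine.

    Hypothetical helper: the keys are illustrative, not a ProgramBench format.
    """
    return {
        "package": dist_name,
        "version": installed_version(dist_name),  # None means not installed
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    # Store this JSON alongside evaluation results so a rerun can pin the same version.
    print(json.dumps(run_manifest(), indent=2))
```

Writing the manifest next to each result file makes it easier to tell a model regression apart from a harness upgrade when scores move between runs.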
ProgramBench vs Alternatives
| Tool | Best For | Key Differentiator | Pricing |
|---|---|---|---|
| ProgramBench | Rebuilding programs from binaries and docs | Black-box reconstruction with behavioral fidelity as the core signal | Open-Source |
| SWE-bench | Fixing real GitHub issues in existing repos | Source-available bug-fix workflow with issue-driven evaluation | Open-Source |
| HumanEval | Small-function synthesis | Compact function-level problems with fast scoring | Open-Source |
| OpenTrace | Inspecting agent execution paths | Run-time tracing for debugging failures in benchmark pipelines | Open-Source |
Pick ProgramBench when you care about whether an agent can recover a complete program shape from a compiled artifact and documentation. Pick SWE-bench when your evaluation target is realistic issue fixing inside an existing repository, and pick HumanEval when you need fast, low-friction synthesis metrics for small functions.
If you are instrumenting a multi-agent evaluation stack, pair ProgramBench with OpenTrace to see where reasoning collapses during reconstruction. If you are orchestrating multiple attempts or planner-worker flows, OpenSwarm is a good companion because it helps you benchmark agent coordination around the same hard task.
How ProgramBench Works
ProgramBench uses a black-box reconstruction model: the model sees a compiled binary, associated documentation, and benchmark materials, then must infer what the original source tree likely did. The key abstraction is not a prompt template or a single function signature; it is a full target program whose observable behavior becomes the contract.
That design matters because it changes the failure mode. A coding model can often pass HumanEval by emitting a short correct function, but ProgramBench forces it to solve a larger software engineering problem: discover the program's interface, preserve behavior, and choose a structure that can survive evaluation across more than one code path. For teams testing agent planning, this is closer to real reverse engineering than to autocomplete.
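To make "observable behavior becomes the contract" concrete, here is a minimal sketch of a behavioral comparison: run a reference program and a candidate rebuild on the same inputs and count how often their exit codes and stdout agree. This is not ProgramBench's actual scoring harness, which the source does not describe in detail. The `observe` and `match_rate` names are hypothetical, and the two commands are small Python one-liners standing in for a compiled binary and a reconstructed program.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What a black-box run exposes: exit code and captured stdout."""
    returncode: int
    stdout: str


def observe(cmd: list[str], stdin_text: str, timeout: float = 10.0) -> Observation:
    """Run one command on one input and record its observable behavior."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=timeout
    )
    return Observation(proc.returncode, proc.stdout)


def match_rate(reference: list[str], candidate: list[str], inputs: list[str]) -> float:
    """Fraction of inputs on which the candidate behaves like the reference."""
    if not inputs:
        raise ValueError("need at least one input")
    hits = sum(observe(reference, i) == observe(candidate, i) for i in inputs)
    return hits / len(inputs)


if __name__ == "__main__":
    # Stand-ins (hypothetical) for the original binary and a rebuilt program.
    ref = [sys.executable, "-c", "import sys; print(sys.stdin.read().strip().upper())"]
    cand = [sys.executable, "-c", "import sys; print(sys.stdin.read().strip().title())"]
    print(match_rate(ref, cand, ["a", "hello", "x y"]))
```

The candidate here agrees with the reference on some inputs and diverges on others, which is the situation a single hand-picked test would miss. A real comparison would likely also need to cover command-line arguments, stderr, and file side effects.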
The repository is a Python package with a CLI entrypoint, so the workflow is centered on uv and standard package installation instead of custom build tooling. The benchmark also exposes a Hugging Face dataset and a public leaderboard, which makes it easier to compare results across models and agent frameworks. That combination is useful when you want a repeatable evaluation target for OpenSwarm runs, or when you want to post-process failures with OpenTrace.
```bash
# get the CLI without installing globally
uvx programbench --help

# install into the current environment
uv pip install programbench

# or use pip directly
pip install programbench
```
The commands above let you inspect the CLI, install the package, and confirm that your environment can resolve the benchmark dependencies. For development work, clone the repo and run `uv sync`, which installs the project in editable mode along with its development dependencies. That is the path you want if you plan to patch the harness or run local experiments.
Pros and Cons of ProgramBench
Pros:
- Harder than function-level benchmarks because the model has to reconstruct system behavior from a compiled binary and docs.
- Good for agent evaluation since it measures planning, inference, and implementation quality in one task.
- Reproducible Python packaging through `uv`, `pip`, and `uvx`, which keeps setup friction low.
- Research-backed design with a published paper and a public dataset reference.
- Leaderboard-friendly so model comparisons are easier to interpret across runs.
- Useful for failure analysis when paired with trace tooling and multi-agent orchestration.
Cons:
- Not a quick benchmark because reverse-engineering style tasks are slower to run and harder to score than small coding problems.
- Requires more context management than standard code-generation tests, so simple agents will underperform.
- Limited value for CRUD-only teams that do not need binary reconstruction or behavioral matching.
- License details are not spelled out in the README excerpt, so teams should read the repository LICENSE before redistribution.
- Less approachable for casual users because the benchmark assumes some familiarity with Python tooling, CLI workflows, and eval design.
Getting Started with ProgramBench
The fastest way to start with ProgramBench is to use uvx for a zero-install CLI check, then install the package into a virtual environment if you want to run repeatable experiments. If you are contributing to the benchmark itself, clone the repository and use uv sync so your editable install matches the repo state.
```bash
# inspect the command surface
uvx programbench --help

# install locally
uv pip install programbench

# or work from source
git clone https://github.com/facebookresearch/programbench.git
cd programbench
uv sync
```
After the install, the most important next step is to read the usage guide in docs/README.md and the leaderboard page on programbench.com. That gives you the benchmark's expected workflow before you wire it into CI, agent experiments, or regression tests.
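Before wiring the benchmark into CI, a small preflight check can catch a missing install early. The sketch below assumes the installed entry point is called `programbench` and is on `PATH`, which the `uvx programbench --help` command suggests but which you should verify in your environment. It only calls `--help`, the one flag shown in this article, and the `find_cli` and `help_succeeds` helper names are hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def find_cli(name: str = "programbench") -> Optional[str]:
    """Return the absolute path of the CLI on PATH, or None if missing."""
    return shutil.which(name)


def help_succeeds(executable: str, timeout: float = 30.0) -> bool:
    """Run `<executable> --help` and report whether it exited with code 0."""
    proc = subprocess.run(
        [executable, "--help"], capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode == 0


if __name__ == "__main__":
    path = find_cli()
    if path is None:
        print("programbench not found on PATH; install it first")
    else:
        print("programbench --help ok:", help_succeeds(path))
```

Running this as the first CI step keeps an environment problem from being misread as a benchmark failure later in the pipeline.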
Verdict
ProgramBench is a strong option when you need a strict, research-grade benchmark for measuring whether an agent can rebuild software from a compiled binary. Its main strength is the black-box reconstruction setup; its main caveat is the higher setup and reasoning cost. If you are evaluating serious coding agents, ProgramBench is worth using.



