Frequently Asked Questions

Is OpenAI Privacy Filter free to use?

Yes. OpenAI Privacy Filter is released under Apache 2.0, so you can use it without per-seat licensing fees. Your costs are only the compute and any internal infrastructure you run.

How does OpenAI Privacy Filter compare to Microsoft Presidio?

OpenAI Privacy Filter is more model-centric than Microsoft Presidio. Presidio is a broader PII detection and anonymization framework with many built-in recognizers, while OpenAI Privacy Filter focuses on a compact long-context transformer that you can finetune and run on-prem.

Does OpenAI Privacy Filter support long documents?

Yes. OpenAI Privacy Filter supports a 128,000-token context window, so transcripts and documents that fit within that window can be processed without manual chunking. That reduces boundary errors when sensitive spans cross section breaks.

Can OpenAI Privacy Filter run on CPU?

Yes. OpenAI Privacy Filter can run on CPU with the `--device cpu` flag, which is useful for laptops, CI jobs, and locked-down servers. Expect lower throughput than GPU execution, but the workflow stays local.

What data does OpenAI Privacy Filter detect?

OpenAI Privacy Filter detects eight privacy categories: account numbers, private addresses, private emails, private people, private phones, private URLs, private dates, and secrets. The labels are emitted as BIOES spans, so the tool redacts contiguous token ranges instead of isolated tokens.

Why would I fine-tune OpenAI Privacy Filter?

You should fine-tune OpenAI Privacy Filter when your data has domain-specific identifier formats, or when you need a different precision and recall balance than the default checkpoint provides. Training on your own labeled JSONL data helps the model learn your internal jargon, ticket format, or account structure.

OpenAI Privacy Filter: Open-Source PII Redaction CLI

OpenAI Privacy Filter turns a small bidirectional transformer into an on-prem PII redactor that labels long text in one pass and decodes spans with BIOES constraints.

What Is OpenAI Privacy Filter?

OpenAI Privacy Filter is a PII redaction CLI from OpenAI, built on a bidirectional token classifier, that detects and masks sensitive spans in text for teams running on-prem sanitization pipelines. It is a good fit for platform teams, security engineers, and data/ML teams because the repo ships a 1.5B-parameter model with 50M active parameters, a 128,000-token context window, and Apache 2.0 licensing.

Quick Overview

Attribute	Details
Type	PII Redaction CLI
Best For	On-prem PII detection, masking, evaluation, and finetuning
Language/Stack	Python, bidirectional transformer, token classification, Viterbi decoding
License	Apache 2.0
GitHub Stars	N/A as of Feb 2026
Pricing	Open-Source
Last Release	N/A
Model Size	1.5B parameters total, 50M active parameters
Context Window	128,000 tokens

The repo is split into a CLI, runtime, evaluation harness, and training path. That separation matters because you can run redaction locally, inspect the decoding stack, and finetune against your own label distribution without turning the model into a managed black box.

Who Should Use OpenAI Privacy Filter?

Security engineers building redaction into ingestion or export jobs who need deterministic local inference and a clean CLI.
Data platform teams handling logs, customer tickets, or support transcripts that contain names, dates, emails, and account numbers.
ML platform teams that need a trainable privacy classifier with explicit labels instead of a generic named-entity pipeline.
CTOs and indie hackers who want an Apache 2.0 privacy layer that can run on a laptop, in CI, or inside a private VPC.

Not ideal for:

Teams looking for broad information extraction beyond the eight privacy label categories.
Workflows that need a fully managed SaaS instead of local inference and checkpoint control.
Low-volume scripts where a regex pass is enough and model startup overhead is wasted.

Key Features of OpenAI Privacy Filter

Single-pass token classification — OpenAI Privacy Filter labels the full sequence in one forward pass instead of generating text token by token. That makes throughput predictable for long logs, chat transcripts, and batch redaction jobs.
BIOES span decoding — The model emits B-, I-, E-, S-, and O labels, then a constrained Viterbi decoder turns those logits into valid spans. That avoids broken redaction spans that often happen with naive argmax decoding.
128,000-token context window — The model can process long documents without chunking, which reduces boundary errors around names, dates, and addresses split across sections. That is a concrete advantage over shorter-context NER systems that need manual segmentation.
Banded bidirectional attention — The repo uses a band size of 128 with an effective attention window of 257 tokens including self. This keeps token classification context-aware while staying much cheaper than full dense attention across the entire sequence.
Runtime precision and recall controls — Preset operating points let you bias the system toward higher recall for compliance sweeps or higher precision for conservative redaction. That matters when false positives can damage downstream analytics or search.
Local CLI and Python API — The project exposes opf as a terminal entrypoint and opf/_api.py as a Python-facing runtime wrapper. That makes it easy to embed into pipelines that already use Airflow, Prefect, or a custom ETL runner.
Finetuning support — opf train can adapt the checkpoint to your own labeled dataset so the model learns your data distribution, document style, and domain-specific identifiers. If your environment uses unique account formats or internal secret markers, this is where the tool earns its keep.
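
The band-size arithmetic in the attention feature above can be checked with a short sketch. This is not the repo's attention code; `band_mask` and `window_size` are hypothetical helper names, and the only assumption is that a band of 128 means each token sees up to 128 neighbors on each side plus itself.

```python
def band_mask(seq_len: int, band: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j, i.e. |i - j| <= band."""
    return [[abs(i - j) <= band for j in range(seq_len)] for i in range(seq_len)]


def window_size(band: int) -> int:
    """Tokens visible to an interior position: band on the left, band on the right, plus itself."""
    return 2 * band + 1


# Small illustrative case: 10 tokens with a band of 2.
mask = band_mask(seq_len=10, band=2)
interior_visible = sum(mask[5])  # a token away from the edges sees the full window
edge_visible = sum(mask[0])      # the first token has no left neighbors

print(interior_visible, edge_visible)
print(window_size(128))  # the band size quoted above gives the 257-token window
```

Tokens near the start or end of a sequence see fewer neighbors, which is why 257 is an upper bound per position rather than a fixed count.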

OpenAI Privacy Filter vs Alternatives

Tool	Best For	Key Differentiator	Pricing
OpenAI Privacy Filter	On-prem PII redaction with long-context token classification	Small bidirectional transformer with finetuning and constrained span decoding	Open-Source
Microsoft Presidio	General PII detection and anonymization pipelines	Mature rule-plus-NER ecosystem with many built-in operators	Open-Source
spaCy + custom NER	Teams that already train custom entity extractors	Full control over pipelines, labels, and model components	Open-Source
Google Cloud DLP	Enterprise data loss prevention and managed scanning	Hosted DLP service with policy integration and cloud governance	Paid

Pick Microsoft Presidio if you want a broader privacy framework with more prebuilt integrations and a well-known API surface. Pick spaCy if your team already owns the labeling and training loop and wants to keep the entire NER stack in-house.

Pick Google Cloud DLP if you prefer a managed service over local model ops and are fine sending data to a cloud vendor. Pick OpenAI Privacy Filter when on-prem inference, long-context handling, and editable span behavior matter more than managed convenience.

If your pipeline also needs governed storage or downstream auditability, pair OpenAI Privacy Filter with DataHaven. If you need a traceable record of every sanitization run, OpenTrace is a cleaner companion than bolting logging onto the CLI.

How OpenAI Privacy Filter Works

OpenAI Privacy Filter starts from an autoregressive checkpoint that was later converted into a bidirectional token classifier over a privacy taxonomy. The implementation uses a pre-norm transformer stack with 8 blocks, grouped-query attention, rotary positional embeddings, and a sparse mixture-of-experts feed-forward layer with 128 experts and top-4 routing per token.

The design is deliberate. The output head targets 33 token classes instead of a natural-language vocabulary, so inference becomes a structured classification problem that yields logits of shape [T, 33] for a sequence of T tokens rather than a generation loop. The result is a model that is easier to calibrate for privacy work than an autoregressive redactor, while still carrying the contextual signal needed to catch spans like multi-token names, addresses, and dates.
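
The count of 33 classes is consistent with eight privacy categories tagged with B, I, E, and S prefixes plus a single O label. The sketch below only illustrates that arithmetic; the category spellings and the `build_labels` helper are assumptions, not the repo's actual label strings.

```python
# Hypothetical spellings for the eight categories named in this article.
CATEGORIES = [
    "account_number", "private_address", "private_email", "private_person",
    "private_phone", "private_url", "private_date", "secret",
]
PREFIXES = ["B", "I", "E", "S"]  # Begin, Inside, End, Single-token span


def build_labels(categories: list[str]) -> list[str]:
    """Return the BIOES label set: one 'O' label plus four tagged labels per category."""
    labels = ["O"]
    for category in categories:
        for prefix in PREFIXES:
            labels.append(f"{prefix}-{category}")
    return labels


LABELS = build_labels(CATEGORIES)
print(len(LABELS))  # 8 categories * 4 prefixes + 1 'O' label
```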

After classification, OpenAI Privacy Filter runs constrained Viterbi decoding over BIOES transitions so the final spans are coherent. That matters because privacy redaction is sensitive to boundary errors, and one bad token can leave a partial address or account number exposed.
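
To make the decoding step concrete, here is a minimal sketch of constrained Viterbi decoding over BIOES labels. It is not the repo's decoder: the transition rules are the standard BIOES constraints, the scores are made-up numbers, and every function name is hypothetical.

```python
import math

NEG_INF = -math.inf


def tag_of(label: str) -> str:
    """Return the BIOES prefix of a label ('O' for the outside label)."""
    return label.partition("-")[0]


def allowed(prev: str, curr: str) -> bool:
    """Standard BIOES rule: an open span (B/I) must continue with I/E of the same type."""
    if tag_of(prev) in ("B", "I"):
        same_type = prev.partition("-")[2] == curr.partition("-")[2]
        return tag_of(curr) in ("I", "E") and same_type
    # prev is O, E, or S, so no span is open: only O, B, or S may follow.
    return tag_of(curr) in ("O", "B", "S")


def viterbi(scores: list[list[float]], labels: list[str]) -> list[str]:
    """scores[t][k] is the score of labels[k] at token t; returns the best valid path."""
    n = len(labels)
    # A sequence may not start in the middle of a span.
    best = [s if tag_of(l) in ("O", "B", "S") else NEG_INF
            for s, l in zip(scores[0], labels)]
    back = []
    for row in scores[1:]:
        new_best, pointers = [], []
        for k in range(n):
            prev_score, prev_idx = max(
                (best[j], j) for j in range(n) if allowed(labels[j], labels[k])
            )
            new_best.append(prev_score + row[k])
            pointers.append(prev_idx)
        best = new_best
        back.append(pointers)
    # A sequence may not end with a span still open.
    final = [b if tag_of(l) in ("O", "E", "S") else NEG_INF
             for b, l in zip(best, labels)]
    k = max(range(n), key=lambda i: final[i])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [labels[i] for i in path]


labels = ["O", "B-date", "I-date", "E-date", "S-date"]
# Made-up scores for four tokens; the third token is mis-scored toward 'O'.
scores = [
    [2.0, 0.1, 0.0, 0.0, 0.1],
    [0.2, 2.0, 0.1, 0.1, 1.0],
    [1.0, 0.1, 0.9, 0.2, 0.1],
    [0.3, 0.1, 0.2, 2.0, 0.5],
]
naive = [labels[max(range(len(labels)), key=row.__getitem__)] for row in scores]
decoded = viterbi(scores, labels)
print(naive)    # per-token argmax breaks the span in the middle
print(decoded)  # constrained decoding keeps one contiguous date span
```

In this toy case, per-token argmax opens a span, drops out of it, and then closes a span that is no longer open. The constrained decoder instead picks the highest-scoring path that obeys the transition rules.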

opf --checkpoint /path/to/checkpoint_dir --device cpu 'Alice was born on 1990-01-02.'

The command above loads a local checkpoint, scores each token against the privacy taxonomy, and prints structured redaction output instead of a free-form summary. Expect the decoder to emit contiguous spans for dates, people, emails, and other supported labels, with CPU mode trading speed for portability.

Pros and Cons of OpenAI Privacy Filter

Pros:

Runs on-prem so sensitive text does not need to leave your own environment.
Long-context inference cuts down on manual chunking and the boundary bugs that come with it.
Apache 2.0 licensing allows commercial use, internal customization, and checkpoint experimentation.
Finetunable architecture gives you control over precision and recall on your own data distribution.
Structured CLI workflow supports files, pipes, interactive mode, evaluation, and training from one repo.
Explicit span labels make audits easier than opaque black-box redaction outputs.

Cons:

Only eight label families are built in, so it is not a universal entity extraction system.
Model deployment still needs compute; even small models are not free if you run them at scale.
Finetuning requires labeled data, which means your best results depend on annotation quality.
The CLI is opinionated, so teams that want a pure library-only flow may need to wrap it.
It is not a managed service, so you own upgrades, checkpoint storage, and monitoring.

Getting Started with OpenAI Privacy Filter

git clone https://github.com/openai/privacy-filter.git
cd privacy-filter
pip install -e .
opf 'Alice was born on 1990-01-02.'

After the editable install, opf becomes available as a shell command and as python -m opf. If you do not pass --checkpoint, the CLI looks for a checkpoint at OPF_CHECKPOINT or ~/.opf/privacy_filter and downloads the model when it is missing, so the first run may take longer than later runs.

For local testing, add --device cpu when you do not have GPU access. For pipeline integration, use file mode with opf -f /path/to/file, pipe input through standard Unix commands, or move straight to opf eval and opf train when you are ready to validate or finetune against labeled JSONL data.
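
For pipeline integration from Python, one option is to shell out to the CLI. The sketch below uses only the flags shown in this article (--checkpoint, --device, -f). The helper names `build_opf_command` and `run_opf` are hypothetical, and it is an assumption that these flags can be combined freely. The output format is not assumed, so stdout is returned unparsed.

```python
import shlex
import subprocess


def build_opf_command(text=None, file_path=None, checkpoint=None, device=None):
    """Assemble an opf argument list from the flags mentioned in this article."""
    if (text is None) == (file_path is None):
        raise ValueError("pass exactly one of text or file_path")
    cmd = ["opf"]
    if checkpoint:
        cmd += ["--checkpoint", checkpoint]
    if device:
        cmd += ["--device", device]
    if file_path is not None:
        cmd += ["-f", file_path]  # file mode
    else:
        cmd.append(text)          # inline text as a positional argument
    return cmd


def run_opf(**kwargs) -> str:
    """Run opf and return its raw stdout; raises CalledProcessError on a non-zero exit."""
    result = subprocess.run(
        build_opf_command(**kwargs), capture_output=True, text=True, check=True
    )
    return result.stdout


cmd = build_opf_command(text="Alice was born on 1990-01-02.", device="cpu")
print(shlex.join(cmd))  # shell-quoted form of the command, for logging
```

Passing an argument list to subprocess.run, rather than a shell string, avoids quoting problems when the input text contains quotes or shell metacharacters.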

Verdict

OpenAI Privacy Filter is a strong option for on-prem PII redaction when you need long-context inference and editable span labeling instead of regex-heavy heuristics. Its main strength is single-pass, context-aware masking across 128,000 tokens; its main caveat is that precision and recall still depend on your checkpoint and labels. Use it when privacy control matters more than managed convenience.
