Benchmarks

The benchmark suite is meant to support a narrow claim: ferro-ta is often faster on selected indicators, and the evidence is published in a reproducible form.

What is published

The authoritative benchmark workflow lives in benchmarks/:

Cross-library speed suite: benchmarks/test_speed.py
Cross-library accuracy suite: benchmarks/test_accuracy.py
TA-Lib head-to-head script: benchmarks/bench_vs_talib.py
Backtesting engine benchmark: benchmarks/bench_backtest.py
Table generation from benchmark JSON: benchmarks/benchmark_table.py
Perf-contract artifact bundle: benchmarks/run_perf_contract.py

Backtesting engine — competitor comparison

Measured on Apple M-series, Python 3.13, Rust 1.91, using an SMA(20/50) crossover strategy with 0.1% commission and 5 bps slippage. Median of 5 runs.

Speed vs backtesting libraries (signal → equity curve)
Library	1k bars	10k bars	100k bars	vs ferro-ta core (100k)
ferro-ta `backtest_core`	0.004 ms	0.033 ms	0.286 ms	—
ferro-ta `backtest_ohlcv_core`	0.004 ms	0.037 ms	0.332 ms	~same
NumPy vectorized (manual)	0.013 ms	0.042 ms	0.459 ms	1.6× slower
vectorbt 0.28	1.32 ms	1.31 ms	2.90 ms	10× slower
backtesting.py	10.5 ms	42.3 ms	319.6 ms	1,117× slower
backtrader 1.9	53.9 ms	518 ms	n/a (skipped)	>15,000× slower (at 10k bars; 100k skipped)

Accuracy: ferro-ta positions and bar-returns are bit-exact against the NumPy reference implementation (max per-bar equity diff = 0.00e+00 with zero commission/slippage).
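The NumPy reference used for the accuracy check is not reproduced on this page. As a rough illustration of what a vectorized SMA(20/50) crossover baseline can look like, here is a minimal sketch. The function names, the next-bar execution rule, and the way commission and slippage are charged on position changes are assumptions made for illustration; they are not taken from benchmarks/bench_backtest.py.

```python
import numpy as np


def sma(close: np.ndarray, window: int) -> np.ndarray:
    """Simple moving average; the first window-1 entries are NaN."""
    out = np.full(close.shape, np.nan)
    out[window - 1:] = np.convolve(close, np.ones(window) / window, mode="valid")
    return out


def crossover_backtest(close, fast=20, slow=50, commission=0.001, slippage_bps=5.0):
    """Hypothetical long/flat SMA crossover backtest (illustrative only)."""
    fast_ma, slow_ma = sma(close, fast), sma(close, slow)

    # Long (1.0) while the fast SMA is above the slow SMA, flat (0.0) otherwise.
    # Comparisons against NaN are False, so the warmup period stays flat.
    with np.errstate(invalid="ignore"):
        signal = (fast_ma > slow_ma).astype(np.float64)

    # Assumption: act on the bar after the signal to avoid look-ahead.
    position = np.zeros_like(signal)
    position[1:] = signal[:-1]

    # Simple close-to-close returns; the first bar has no return.
    bar_returns = np.zeros_like(close)
    bar_returns[1:] = close[1:] / close[:-1] - 1.0

    # Assumption: costs are charged as a fraction of each position change.
    turnover = np.abs(np.diff(position, prepend=0.0))
    cost = turnover * (commission + slippage_bps * 1e-4)

    strategy_returns = position * bar_returns - cost
    equity = np.cumprod(1.0 + strategy_returns)
    return position, strategy_returns, equity


# Synthetic random-walk prices, only to exercise the function.
rng = np.random.default_rng(0)
demo_close = 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, 10_000)))
demo_position, demo_returns, demo_equity = crossover_backtest(demo_close)
print(f"final equity multiple: {demo_equity[-1]:.4f}")
```

Setting commission and slippage_bps to zero removes the cost term, which is the configuration under which a bit-exact comparison of positions and bar returns is meaningful.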

Additional ferro-ta capabilities not present in the libraries above:

Capability	ferro-ta result	NumPy baseline	Speedup
Monte Carlo 1,000 sims (100k bars)	50 ms (parallel Rayon + LCG)	612 ms (Python loop)	12×
23 performance metrics, single call (100k bars)	2.8 ms	0.36 ms (2 metrics only)	not a like-for-like speedup; ≈0.12 ms per metric
Multi-asset 100 assets (100k bars)	43 ms parallel / 88 ms serial	—	2× parallel speedup
Walk-forward fold indices (100k bars)	0.3 µs	—	—

Reproduce the backtest benchmark:

python benchmarks/bench_backtest.py --sizes 10000 100000 \
    --json benchmarks/artifacts/latest/bench_backtest_results.json

Latest checked-in TA-Lib artifact

The current checked-in TA-Lib comparison artifact was measured on contiguous float64 arrays at 10k and 100k bars. The machine was an Apple M3 Max with 14 logical cores and about 38.7 GB RAM, running CPython 3.13.5 and Rust 1.91.1 with the default release profile (lto = true, codegen-units = 1).

Summary from benchmarks/artifacts/latest/benchmark_vs_talib.json:

Size	Rows	ferro-ta wins	Median speedup	TA-Lib wins or ties
`10,000`	12	6	`1.0850x`	`EMA`, `RSI`, `ATR`, `STOCH`, `ADX`, `OBV`
`100,000`	12	6	`1.0784x`	`EMA`, `RSI`, `ATR`, `STOCH`, `ADX`, `OBV`

Examples from the 100k-bar run:

Indicator	ferro-ta	TA-Lib	Speedup	Read
`SMA`	`0.0985 ms`	`0.2241 ms`	`2.2751x`	clear ferro-ta win
`BBANDS`	`0.2122 ms`	`0.4966 ms`	`2.3402x`	clear ferro-ta win
`MACD`	`0.5152 ms`	`0.7111 ms`	`1.3801x`	ferro-ta win
`STOCH`	`1.7064 ms`	`0.7603 ms`	`0.4455x`	TA-Lib win
`ADX`	`0.7910 ms`	`0.5769 ms`	`0.7294x`	TA-Lib win
`ATR`	`0.5087 ms`	`0.5147 ms`	`1.0118x`	tie on this machine

Methodology notes

The head-to-head script uses the same synthetic OHLCV generator, the same parameters, and the same contiguous float64 array layout for both libraries.
Reported speedup is TA-Lib median time / ferro-ta median time, so values above 1x mean ferro-ta was faster and values below 1x mean TA-Lib was faster.
The script uses 1 warmup run and 7 measured runs per case, and records the full per-run timing samples rather than a single summary number.
Published JSON artifacts include machine/runtime metadata, git metadata, Rust toolchain and build-profile metadata, per-run variance statistics, and Python-tracked peak allocation snapshots.
Allocation snapshots are based on tracemalloc and capture Python-tracked allocations only; they are not full native RSS profiles.
If your workload uses non-contiguous arrays, different dtypes, or different batch sizes, benchmark that exact workload. Those factors can materially change the result.
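The measurement scheme described in these notes (one warmup, seven measured runs, median-based speedup, per-run variance, tracemalloc peak) can be sketched as follows. This is a minimal illustration, not the code in benchmarks/bench_vs_talib.py: the helper names are hypothetical, and the two NumPy SMA functions are stand-ins for the library calls you would actually compare.

```python
import statistics
import time
import tracemalloc

import numpy as np


def measure(fn, *args, warmup=1, runs=7):
    """Return per-run wall-clock samples in milliseconds for fn(*args)."""
    for _ in range(warmup):
        fn(*args)  # warmup runs are executed but not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


def summarize(samples):
    """Keep every sample alongside the median and sample standard deviation."""
    return {
        "samples_ms": list(samples),
        "median_ms": statistics.median(samples),
        "stdev_ms": statistics.stdev(samples),
    }


def speedup(baseline_samples, candidate_samples):
    """Baseline median / candidate median; above 1 means the candidate is faster."""
    return statistics.median(baseline_samples) / statistics.median(candidate_samples)


def peak_python_alloc_bytes(fn, *args):
    """Peak Python-tracked allocation during one call (not native RSS)."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Stand-in workloads (assumption): two NumPy SMA variants, not TA-Lib or ferro-ta.
def sma_convolve(close, window=20):
    return np.convolve(close, np.ones(window) / window, mode="valid")


def sma_cumsum(close, window=20):
    csum = np.cumsum(np.insert(close, 0, 0.0))
    return (csum[window:] - csum[:-window]) / window


rng = np.random.default_rng(0)
# Same contiguous float64 input for both sides of the comparison.
close = np.ascontiguousarray(100.0 + np.cumsum(rng.normal(0.0, 1.0, 100_000)))

baseline = summarize(measure(sma_convolve, close))
candidate = summarize(measure(sma_cumsum, close))
ratio = speedup(baseline["samples_ms"], candidate["samples_ms"])
print(f"speedup: {ratio:.4f}x")
print(f"peak Python-tracked allocation: {peak_python_alloc_bytes(sma_cumsum, close)} bytes")
```

As a check against the table above, the SMA row at 100k bars works out to 0.2241 ms / 0.0985 ms ≈ 2.2751x. To benchmark your own workload, swap the stand-in functions for the calls you care about and pass the arrays in the layout and dtype you actually use.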

Reproduce the TA-Lib comparison

pip install ta-lib
python benchmarks/bench_vs_talib.py --sizes 10000 100000 --json benchmark_vs_talib.json

The JSON output is the main artifact to review when publishing performance claims.

Cross-library suite

Run the broader speed suite on 100,000 bars:

uv run pytest benchmarks/test_speed.py --benchmark-only --benchmark-json=benchmarks/results.json -v

Selected throughput examples from the checked-in table:

Indicator	Throughput
`ADD`	1.9 G bars/s
`CDLENGULFING`	454 M bars/s
`EMA`	444 M bars/s
`SMA`	259 M bars/s
`RSI`	145 M bars/s
`MACD`	104 M bars/s
`ATR`	70 M bars/s
`STOCH`	33 M bars/s
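To relate a throughput figure to a per-call time, a small conversion helper is enough. The sketch below assumes throughput is defined as input length divided by time per call, which the table does not state explicitly, and the function names are hypothetical. Because this suite and the TA-Lib head-to-head script use different harnesses, converted times should not be expected to match the head-to-head table.

```python
def bars_per_second(n_bars: int, seconds_per_call: float) -> float:
    """Throughput for one call over n_bars input values."""
    return n_bars / seconds_per_call


def per_call_ms(n_bars: int, throughput_bars_per_s: float) -> float:
    """Per-call time in milliseconds implied by a throughput figure."""
    return n_bars / throughput_bars_per_s * 1e3


# Example: 33 M bars/s on a 100,000-bar input implies roughly 3.03 ms per call.
print(f"{per_call_ms(100_000, 33e6):.2f} ms")
```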

Perf-contract artifacts

Use the perf-contract runner when you want a compact, machine-readable artifact bundle for single-series latency, batch throughput, streaming throughput, and hotspot attribution:

uv run python benchmarks/run_perf_contract.py --output-dir benchmarks/artifacts/latest

See benchmarks/README.md for the detailed benchmark playbook and the checked-in comparison tables.