Your Flutter CI is slow, and the test job is most of it. Two levers cut it down, and they stack: very_good_cli compiles the whole suite once instead of once per file, and sharding splits the run across parallel machines. Used together, the right way, a 300-file suite went from ~280s to ~33s on GitHub Actions, about 8× faster.
TL;DR
- Compile once with very_good test. It builds one optimized entry point for the whole suite instead of one per file: ~64% off on a single runner.
- Then shard across runners, by file. Split whole test files across machines, each running its slice through very_good. Another ~67% off, ~88% total.
- Shard by file, not by case. flutter test --total-shards does split the run across machines, but by test case, so every shard still recompiles the whole suite (and cross-case state can break). Splitting whole files gives each shard one compile.
- Cache the SDK and pub first, or per-job setup swamps the win.
Everything here is reproducible: a testbed and the GitHub Actions workflow that produced every number.
The two levers
They fix different costs. very_good_cli’s very_good test goes after compile cost. A normal flutter test compiles each test file as its own entry point, so 300 files means 300 separate compiles. very_good test bundles them into one optimized entry point the VM compiles once.
[Figure: very_good test collapses 300 per-file compiles into one.]
Sharding goes after parallelism: split the work across machines that run at once. In the ideal case, four machines means a quarter of the wait.
The catch, and the crux: sharding splits the work to run, not the cost to compile. If every machine still compiles the whole suite, the split barely helps. So you do both, in order: compile once, then split what’s left.
The benchmark
The testbed is a self-contained Rubik’s cube timer. Here’s what it produces, and what the tests pin down:
- A scramble (20 WCA moves, never the same face twice in a row): F2 R D2 R2 D2 F2 B' D2 U R F2 U R2 D' B2 D' L F' L U2
- A solve: a raw time with an optional penalty, like 12.34, 14.34 (+2), or DNF
- Rolling stats: best, ao5, ao12 (drop the best and worst, average the rest; a DNF counts as the worst)
The suite is a hand-written core plus generated tests (real assertions against the real library) scaled to 300 files, about the shape of a production app: real widget, integration, and computation work, not just trivial unit checks. It’s all in the repository.
I measured on a GitHub Actions matrix: ubuntu-latest (2-core Linux), sharded scenarios across four parallel runners. Wall-clock is the slowest shard (what you wait for); compute is the sum of every shard (what you’re billed for).
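The two metrics are easy to mix up, so here is a minimal sketch of how they fall out of per-shard durations. The function name `summarize_shards` and the four durations are hypothetical, chosen only for illustration; they are not measurements from the testbed.

```shell
#!/usr/bin/env bash
# Summarize per-shard durations (whole seconds) into the two metrics:
#   wall-clock = the slowest shard (what you wait for)
#   compute    = the sum of every shard (what you are billed for)
summarize_shards() {
  local wall=0 compute=0 t
  for t in "$@"; do
    compute=$((compute + t))
    if [ "$t" -gt "$wall" ]; then wall=$t; fi
  done
  echo "wall=${wall}s compute=${compute}s"
}

# Hypothetical per-shard times, NOT measured values.
summarize_shards 30 25 28 27   # prints: wall=30s compute=110s
```

Adding a runner can lower the first number while raising the second, which is why both columns appear in the sharded results.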
These gains rest on a cached pipeline. Every number here is the test step only. A real CI job first installs the Flutter SDK, runs pub get, and activates very_good_cli, and pays for that setup per job. Cached (this workflow caches the SDK and ~/.pub-cache), it’s about 35 seconds; uncached, minutes, enough to swamp any test-step win. Cache the SDK especially, or none of the speedups below matter.
Compile once, then shard
Start on one runner and switch flutter test for very_good test:
| Scenario | Wall-clock | vs baseline |
|---|---|---|
| flutter test (baseline) | 280s | |
| very_good test | 100s | 64% faster |
That’s the single biggest cut: one compile instead of 300. For many suites it’s all you need.
Now split the files across four runners, each running very_good on its slice. Round-robin the test files into shards, generate one entry point per shard that imports its files and runs each main() in a group(), and hand that to very_good test:
# Assign whole files to this shard (round-robin by line number; INDEX is the 0-based shard index, 0..3).
shard_files=$(find test -name '*_test.dart' | sort | awk -v s=4 -v i="$INDEX" 'NR % s == i')
# Generate one entry point: import every shard file, dispatch each main() in a group().
runner="shard_${INDEX}_test.dart" # at the repo root, so `find test` never re-picks it up
{
echo "import 'package:flutter_test/flutter_test.dart';"
n=0; while IFS= read -r f; do n=$((n+1)); echo "import '$f' as t_$n;"; done <<< "$shard_files"
echo "void main() {"
n=0; while IFS= read -r f; do n=$((n+1)); echo " group('$f', t_$n.main);"; done <<< "$shard_files"
echo "}"
} > "$runner"
very_good test "$runner" -j 8
You split the files yourself because the two primitives don’t compose with a flag: very_good test has no shard option, and flutter test --total-shards shards the wrong way (next section). So you hand each runner its own slice and let very_good collapse that slice to one compile.
| Scenario | Wall-clock | vs baseline | Compute |
|---|---|---|---|
| combined | 33s | 88% faster | 118s |
Each shard compiles ~75 files once and runs only its quarter of the tests. And it’s honest sharding: I checked that the four shards run the suite’s 2,188 tests exactly once between them, disjoint files, per-shard counts that sum to the full suite, no test run twice.
The key is that the split is by file. Cases that share file-scope setup stay together, and each shard still gets the one-compile win.
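The exactly-once property is cheap to check at the file level. The sketch below is one way to do it, not necessarily how the testbed verifies it: `check_partition` is a hypothetical helper that reapplies the same `NR % s == i` rule for each shard index and confirms the shards together reproduce the full file list, with nothing dropped or duplicated. It does not count individual test cases.

```shell
#!/usr/bin/env bash
# Usage: <file list on stdin> | check_partition SHARDS INDEX...
# Prints "ok" when the listed shard indices cover every file exactly once,
# "mismatch" otherwise.
check_partition() {
  local shards=$1; shift
  local all assigned i
  all=$(sort)                      # same ordering the sharding script uses
  assigned=$(
    for i in "$@"; do
      awk -v s="$shards" -v i="$i" 'NR % s == i' <<< "$all"
    done | sort
  )
  if [ "$assigned" = "$all" ]; then echo "ok"; else echo "mismatch"; fi
}

# Example with a made-up file list; in a repo you would pipe in
#   find test -name '*_test.dart'
printf '%s\n' a_test.dart b_test.dart c_test.dart d_test.dart e_test.dart |
  check_partition 4 0 1 2 3
```

The main thing this catches is an off-by-one in the matrix: with indices 1..4 instead of 0..3, every file whose line number is a multiple of four lands in no shard, and the check reports a mismatch.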
Don’t shard by cases
flutter test --total-shards is real sharding, and it does help: it splits the run across machines, so an execution-heavy suite dropped to ~210s (~25% off). The catch is how it splits. The test package documents --total-shards as splitting suites (whole files), but through flutter test the split lands at the test-case level: every one of the 300 files shows up in every shard, so each shard still recompiles the entire suite. You get the parallel run, never the one-compile win, so it never approaches the file-level ~33s.
It can also break correctness. Because a file’s cases scatter across shards, any test that leans on file-scope state from an earlier case in the same file fails:
flutter test demo/cross_case_state_test.dart -> all 4 pass
flutter test demo/... --total-shards 4 --shard-index 1..3 -> 3 of 4 FAIL
So shard whole files, as above. You get the parallelism without recompiling everything four times or scrambling shared state.
The numbers
| Scenario | Wall-clock | vs baseline |
|---|---|---|
| flutter test (baseline) | 280s | |
| very_good test (one runner) | 100s | 64% faster |
| --total-shards 4 (cases) | ~210s | ~25% faster |
| file-level shards + very_good | 33s | 88% faster |
Compile once is the big cut; sharding by file compounds on top, once there’s real execution to spread. --total-shards is the odd one out: it splits cases but recompiles everything per shard, so it never gets near the file-level result.
What it costs
Wall-clock isn’t the bill. GitHub’s Linux runners (ubuntu-latest) cost $0.008/min, billed per job, rounded up to the whole minute. A single runner is billed once; four shards are billed four times, and each shard first pays the same fixed setup. So very_good alone is the cheapest option as well as the simplest. File-level sharding buys the fastest wall-clock, but four runners cost more billed minutes than one. Shard when the feedback loop matters more than the bill; otherwise very_good alone is the sweet spot.
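To make the billing rule concrete, here is a rough sketch of the arithmetic. It assumes each job is just the ~35s cached setup plus the test step, and that every shard takes about the 33s wall-clock; real jobs also pay for checkout and other steps, so treat the output as an estimate, not a measurement. `billed_minutes` and `total_billed` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Billed minutes for one job: seconds rounded up to a whole minute.
billed_minutes() {
  echo $(( ($1 + 59) / 60 ))
}

# Total billed minutes for JOBS identical jobs of SECONDS each.
total_billed() {
  local jobs=$1 seconds=$2
  echo $(( jobs * $(billed_minutes "$seconds") ))
}

SETUP=35  # cached setup per job, in seconds (assumed identical for every job)

echo "flutter test, one runner:   $(total_billed 1 $((SETUP + 280))) min"
echo "very_good test, one runner: $(total_billed 1 $((SETUP + 100))) min"
echo "file-level shards, four:    $(total_billed 4 $((SETUP + 33))) min"
```

Under those assumptions the three scenarios bill 6, 3, and 8 minutes, or about $0.048, $0.024, and $0.064 per run at $0.008/min. The sharded run is the fastest to finish and the most expensive, because per-minute rounding and setup are paid four times.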
When to use it
- Cache the SDK and ~/.pub-cache first. Uncached setup swamps every speedup.
- Reach for very_good test. One line, ~64% off here, cheapest. Often all you need.
- Shard by file (round-robin + very_good per shard) when tests have real run time to split. ~88% off.
- Shard by file, not --total-shards alone. It splits cases, so every shard recompiles the whole suite and cross-case state can break. Splitting whole files gives each shard one compile.
- Mind the bill. Each runner pays setup again, so add runners only when the wall-clock is worth it.
Resources
- very_good_cli for very_good test and its optimization step
- package:test sharding for --total-shards and --shard-index
- The full testbed (the app, the 300-file suite, the scenario scripts, and the workflow that produced every number) is in the repository
Takeaway
Speeding up Flutter tests is two moves, in order: compile the whole suite once with very_good test, then split the files across runners so each does a fraction of the work. The 300-file suite went from ~280s to ~33s. The one thing to get right is to shard by file, not by case: flutter test --total-shards splits cases, so every shard recompiles the whole suite, while splitting whole files gives each runner one compile and its own slice, exactly once.
—Joshua