Faster Flutter Tests — Joshua de Guzman

Your Flutter CI is slow, and the test job is most of it. Two levers cut it down, and they stack: very_good_cli compiles the whole suite once instead of once per file, and sharding splits the run across parallel machines. Used together, the right way, a 300-file suite went from ~280s to ~33s on GitHub Actions, about 8× faster.

TL;DR

Compile once with very_good test. It builds one optimized entry point for the whole suite instead of one per file: ~64% off on a single runner.
Then shard across runners, by file. Split whole test files across machines, each running its slice through very_good. Another ~67% off, ~88% total.
Shard by file, not by case. flutter test --total-shards does split the run across machines, but by test case, so every shard still recompiles the whole suite (and cross-case state can break). Splitting whole files gives each shard one compile.
Cache the SDK and pub first, or per-job setup swamps the win.

Everything here is reproducible: a testbed and the GitHub Actions workflow that produced every number.

The two levers

They fix different costs. very_good_cli’s very_good test goes after compile cost. A normal flutter test compiles each test file as its own entry point, so 300 files means 300 separate compiles. very_good test bundles them into one optimized entry point the VM compiles once.

Figure 1. Where does the time go? very_good test collapses 300 per-file compiles into one.

Sharding goes after execution time, through parallelism: split the work across machines that run at once. In the ideal case, four machines means a quarter of the wait.

Figure 2. Sharding, in principle: split the work across machines that run at once, and ideally wait a fraction of the time.

The catch, and the crux: sharding splits the work to run, not the cost to compile. If every machine still compiles the whole suite, the split barely helps. So you do both, in order: compile once, then split what’s left.
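To make that concrete, here is a toy model, assuming wall-clock on a shard is simply its compile cost plus its share of the run. The function name and every number below are hypothetical round figures chosen for illustration, not measurements from the benchmark.

```shell
#!/usr/bin/env bash
# Toy model: wall-clock on one shard = what it compiles + its share of the run.
# Arguments are whole seconds: compile cost, total run time, shard count.
wall_clock() {
  local compile=$1 run=$2 shards=$3
  echo $(( compile + run / shards ))
}

# Hypothetical inputs, for illustration only.
COMPILE_PER_FILE=120   # every file compiled as its own entry point
COMPILE_ONCE=12        # one bundled entry point
RUN=40                 # time spent actually executing tests

echo "per-file compile, 1 runner:       $(wall_clock "$COMPILE_PER_FILE" "$RUN" 1)s"
echo "per-file compile on all 4 shards: $(wall_clock "$COMPILE_PER_FILE" "$RUN" 4)s"
echo "compile once, 1 runner:           $(wall_clock "$COMPILE_ONCE" "$RUN" 1)s"
echo "compile once, 4 shards:           $(wall_clock "$COMPILE_ONCE" "$RUN" 4)s"
```

With these made-up inputs, sharding alone only moves 160s to 130s, because the compile term never shrinks. Compiling once and then sharding lands at 22s. The last row is pessimistic, since it charges each shard the full single-compile cost even though a shard only compiles its own slice.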

The benchmark

The testbed is a self-contained Rubik’s cube timer. Here’s what it produces, and what the tests pin down:

A scramble (20 WCA moves, never the same face twice in a row): F2 R D2 R2 D2 F2 B' D2 U R F2 U R2 D' B2 D' L F' L U2
A solve: a raw time with an optional penalty, like 12.34, 14.34 (+2), or DNF
Rolling stats: best, ao5, ao12 (drop the best and worst, average the rest; a DNF counts as the worst)

The suite is a hand-written core plus generated tests (real assertions against the real library) scaled to 300 files, about the shape of a production app: real widget, integration, and computation work, not just trivial unit checks. It’s all in the repository.

I measured on a GitHub Actions matrix: ubuntu-latest (2-core Linux), sharded scenarios across four parallel runners. Wall-clock is the slowest shard (what you wait for); compute is the sum of every shard (what you’re billed for).
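The two metrics are easy to mix up, so here is a minimal sketch of how they fall out of per-shard durations. The helper name and the four timings are hypothetical, not the shard timings from this benchmark.

```shell
#!/usr/bin/env bash
# Reduce per-shard durations (whole seconds) to the two reported numbers:
# wall-clock = the slowest shard, compute = the sum of all shards.
summarize_shards() {
  local wall=0 compute=0 t
  for t in "$@"; do
    compute=$(( compute + t ))
    if [ "$t" -gt "$wall" ]; then wall=$t; fi
  done
  echo "wall=${wall}s compute=${compute}s"
}

# Made-up shard timings, for illustration only.
summarize_shards 40 35 38 37   # prints: wall=40s compute=150s
```

Adding runners can only lower the first number. The second tends to rise, because each shard repeats work that a single runner would do once.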

These gains rest on a cached pipeline. Every number here is the test step only. A real CI job first installs the Flutter SDK, runs pub get, and activates very_good_cli, and pays for that setup per job. Cached (this workflow caches the SDK and ~/.pub-cache), it’s about 35 seconds; uncached, minutes, enough to swamp any test-step win. Cache the SDK especially, or none of the speedups below matter.

Compile once, then shard

Start on one runner and switch flutter test for very_good test:

Scenario	Wall-clock	vs baseline
`flutter test` (baseline)	280s
`very_good test`	100s	64% faster

That’s the single biggest cut: one compile instead of 300. For many suites it’s all you need.

Now split the files across four runners, each running very_good on its slice. Round-robin the test files into shards, generate one entry point per shard that imports its files and runs each main() in a group(), and hand that to very_good test:

# Assign whole files to this shard (round-robin by line number).
# INDEX must be this runner's 0-based shard number (0..3), since NR % 4 yields 0..3.
shard_files=$(find test -name '*_test.dart' | sort | awk -v s=4 -v i="$INDEX" 'NR % s == i')

# Generate one entry point: import every shard file, dispatch each main() in a group().
runner="shard_${INDEX}_test.dart"   # at the repo root, so `find test` never re-picks it up
{
  echo "import 'package:flutter_test/flutter_test.dart';"
  n=0; while IFS= read -r f; do n=$((n+1)); echo "import '$f' as t_$n;"; done <<< "$shard_files"
  echo "void main() {"
  n=0; while IFS= read -r f; do n=$((n+1)); echo "  group('$f', t_$n.main);"; done <<< "$shard_files"
  echo "}"
} > "$runner"

very_good test "$runner" -j 8

You split the files yourself because the two primitives don’t compose with a flag: very_good test has no shard option, and flutter test --total-shards shards the wrong way (next section). So you hand each runner its own slice and let very_good collapse that slice to one compile.

Figure 4. File-level shards keep cases together and collapse each shard to one compile. Parallelism that actually pays.

Scenario	Wall-clock	vs baseline	Compute
combined	33s	88% faster	118s

Each shard compiles ~75 files once and runs only its quarter of the tests. And it’s honest sharding. I checked that the four shards run the suite’s 2,188 tests exactly once between them: the files are disjoint, the per-shard counts sum to the full suite, and no test runs twice.
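The file-level half of that check can be scripted. What follows is a sketch under assumptions, not the check used for the numbers here: the function names are hypothetical, and it reuses the same NR % total == index rule as the shard script. It confirms only that the shards partition the files. Per-shard test counts would still have to be read from the test output.

```shell
#!/usr/bin/env bash
# Print the test files assigned to shard INDEX (0-based) out of TOTAL,
# using the same round-robin rule as the shard script.
shard_slice() {
  local total=$1 index=$2 root=$3
  find "$root" -name '*_test.dart' | sort | awk -v s="$total" -v i="$index" 'NR % s == i'
}

# Succeed only if the shards are disjoint and together cover every test file.
check_partition() {
  local total=$1 root=$2 all union i
  all=$(find "$root" -name '*_test.dart' | sort)
  union=$(for (( i = 0; i < total; i++ )); do shard_slice "$total" "$i" "$root"; done | sort)
  # A duplicated or missing file makes the sorted union differ from the full list.
  [ "$all" = "$union" ]
}
```

From the repo root, `check_partition 4 test && echo "shards partition the suite"` would run it against the real test directory.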

The key is that the split is by file. Cases that share file-scope setup stay together, and each shard still gets the one-compile win.

Don’t shard by cases

flutter test --total-shards is real sharding, and it does help: it splits the run across machines, so an execution-heavy suite dropped to ~210s (~25% off). The catch is how it splits. The test package documents --total-shards as splitting suites (whole files), but through flutter test the split lands at the test-case level: every one of the 300 files shows up in every shard, so each shard still recompiles the entire suite. You get the parallel run, never the one-compile win, so it never approaches the file-level ~33s.

It can also break correctness. Because a file’s cases scatter across shards, any test that leans on file-scope state from an earlier case in the same file fails:

flutter test demo/cross_case_state_test.dart                  ->  all 4 pass
flutter test demo/... --total-shards 4 --shard-index 1..3      ->  3 of 4 FAIL

Figure 3. The trap: native sharding splits test cases, so every runner recompiles all 300 files. It spreads the run, never the compile.

So shard whole files, as above. You get the parallelism without recompiling everything four times or scrambling shared state.

The numbers

Figure 5. CI wall-clock per scenario. Bars in a group start together, so they run in parallel; the longest is what you wait for. Native sharding's bars stay long because each runner recompiles the suite; file-level shards collapse to one compile each.

Scenario	Wall-clock	vs baseline
`flutter test` (baseline)	280s
`very_good test` (one runner)	100s	64% faster
`--total-shards 4` (cases)	~210s	~25% faster
file-level shards + `very_good`	33s	88% faster

Compile once is the big cut; sharding by file compounds on top, once there’s real execution to spread. --total-shards is the odd one out: it splits cases but recompiles everything per shard, so it never gets near the file-level result.

What it costs

Wall-clock isn’t the bill. GitHub’s Linux runners (ubuntu-latest) cost $0.008/min, billed per job, rounded up to the whole minute. A single runner is billed once; four shards are billed four times, and each shard first pays the same fixed setup. So very_good alone is the cheapest option as well as the simplest. File-level sharding buys the fastest wall-clock, but four runners cost more billed minutes than one. Shard when the feedback loop matters more than the bill; otherwise very_good alone is the sweet spot.
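A back-of-the-envelope version of that bill, as a sketch. It assumes each job lasts exactly the ~35s cached setup plus the test step, with no checkout or other overhead, and that every shard takes as long as the slowest one (33s). Both are simplifications, and the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Billed minutes for one job: its duration in seconds, rounded up to the whole minute.
billed_minutes() { echo $(( ($1 + 59) / 60 )); }

# Print one row: label, total billed minutes, cost at $0.008/min.
# Cost is kept in thousandths of a dollar (8 per minute) to stay in integer maths.
row() {
  local label=$1 jobs=$2 seconds_per_job=$3
  local minutes=$(( jobs * $(billed_minutes "$seconds_per_job") ))
  printf '%-18s %d min  $0.%03d\n' "$label" "$minutes" $(( minutes * 8 ))
}

SETUP=35  # assumption: cached setup per job, from the ~35s figure quoted earlier

row 'flutter test'   1 $(( SETUP + 280 ))
row 'very_good test' 1 $(( SETUP + 100 ))
row 'file shards x4' 4 $(( SETUP + 33 ))   # assumption: each shard takes the slowest shard's 33s
```

Under those assumptions the three rows come to 6, 3, and 8 billed minutes, or $0.048, $0.024, and $0.064 per run. Rounding up per job is what hurts the sharded setup: each shard's 68 seconds is billed as two full minutes.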

When to use it

Cache the SDK and ~/.pub-cache first. Uncached setup swamps every speedup.
Reach for very_good test. One line, ~64% off here, cheapest. Often all you need.
Shard by file (round-robin + very_good per shard) when tests have real run time to split. ~88% off.
Shard by file, not with --total-shards alone. That flag splits cases, so every shard recompiles the whole suite and cross-case state can break. Splitting whole files gives each shard one compile.
Mind the bill. Each runner pays setup again, so add runners only when the wall-clock is worth it.

Resources

very_good_cli for very_good test and its optimization step
package:test sharding for --total-shards and --shard-index
The full testbed (the app, the 300-file suite, the scenario scripts, and the workflow that produced every number) is in the repository

Takeaway

Speeding up Flutter tests is two moves, in order: compile the whole suite once with very_good test, then split the files across runners so each does a fraction of the work. The 300-file suite went from ~280s to ~33s. The one thing to get right is to shard by file, not by case: flutter test --total-shards splits cases, so every shard recompiles the whole suite, while splitting whole files gives each runner one compile and its own slice, exactly once.

—Joshua