Stress test for OpenComputer checkpoint/restore integrity. Creates checkpoints and restores from them at scale, verifying git object integrity and file-level SHA-256 on every restore.
Requires Node.js 20+ and an OpenComputer API key.
A customer reported corrupted snapshots — files had incorrect or missing content after checkpoint/restore operations. Root cause was race conditions in the QEMU backend where concurrent operations accessed qcow2 virtual disk files without synchronization. Specifically, tar was archiving qcow2 files while QEMU was modifying them (`tar: rootfs.qcow2: file changed as we read it`).
Four fixes were applied, all based on the same principle — never read a qcow2 file that another process can write:
- Hibernate archive: reflink-copy qcow2 to staging before archiving
- Destroy during archive: wait for archive goroutine before deleting files
- Migration upload: mutex + reflink staging during S3 upload
- Checkpoint cache delete: write-lock blocks removal while forks hold read locks
This repo validates the fix at the SDK level — 1,000 checkpoint restores against production.
Full incident report: docs/incident-report.pdf
Run against OpenComputer production: 5 independent rounds, 200 restores each, concurrency 5.
| Metric | Result |
|---|---|
| Total restores | 1,000 |
| Corruptions | 0 |
| Infra errors (timeouts) | 3 |
| Avg restore (fork from checkpoint) | ~130ms |
| Avg verify (git + SHA-256 over 5MB) | ~10s |
| Round | Restores | Corrupted | Infra errors | Setup | Avg create | Avg verify |
|---|---|---|---|---|---|---|
| 1 | 200 | 0 | 0 | 7.0s | 140ms | 10.3s |
| 2 | 200 | 0 | 0 | 7.8s | 130ms | 11.5s |
| 3 | 200 | 0 | 1 | 6.2s | 132ms | 10.8s |
| 4 | 200 | 0 | 2 | 7.7s | 149ms | 11.2s |
| 5 | 200 | 0 | 0 | 59.6s | 135ms | 8.9s |
| Total | 1,000 | 0 | 3 | | ~137ms | ~10.5s |
The 3 infra errors were Cloudflare 524 timeouts — transient network issues, not data integrity failures. Round 5's longer setup was a slow checkpoint readiness poll.
Raw results: results/2026-04-01_15-11-52/
Each round:
- Boot a sandbox (1 CPU / 4 GB), write a 5MB random marker file, commit it to a git repo
- Checkpoint the sandbox
- Restore from that checkpoint N times concurrently
- Verify every restore with three checks:
  - `git status`: segfaults on a corrupted filesystem or memory image
  - `git log`: verifies the object database matches the expected commit
  - SHA-256 of the marker file: detects bit rot or truncation
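The SHA-256 check needs nothing beyond Node's built-in `crypto` module; a minimal sketch (function names are illustrative, not necessarily the repo's `verify.ts`):

```typescript
import { createHash } from "node:crypto";

// Hash the marker file's bytes; any truncation or single bit flip in the
// restored file produces a completely different digest.
function sha256Hex(data: Buffer | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Compare against the digest recorded when the checkpoint was created.
function verifyMarker(restoredContents: Buffer, expectedHex: string): boolean {
  return sha256Hex(restoredContents) === expectedHex;
}
```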
Git's content-addressed object store (SHA-1) makes it sensitive to even single bit flips — corruption typically surfaces as segfaults or hash errors rather than silent data loss.
Errors are classified as corruption (segfault, hash mismatch, git breakage) or infra (timeout, 502, rate limit). Only corruption counts as a test failure.
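A sketch of that classification (the patterns and the treat-unknown-as-corruption default are assumptions, not necessarily what `errors.ts` does):

```typescript
type FailureKind = "corruption" | "infra";

// Integrity damage: any of these fails the whole test run.
const CORRUPTION_PATTERNS = [/segmentation fault/i, /hash mismatch/i, /corrupt/i];

// Transient infrastructure noise: retryable, does not indict the data path.
const INFRA_PATTERNS = [/timeout/i, /\b502\b/, /\b524\b/, /rate limit/i];

function classifyFailure(message: string): FailureKind {
  // Check corruption first so a message matching both is counted strictly.
  if (CORRUPTION_PATTERNS.some((p) => p.test(message))) return "corruption";
  if (INFRA_PATTERNS.some((p) => p.test(message))) return "infra";
  // Err on the strict side: an unrecognized failure counts as corruption.
  return "corruption";
}
```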
cp .env.example .env
# add your OpenComputer API key
npm install
source .env
npm run test:smoke # 10 restores, ~30s
npm run test:full    # 1000 restores, ~40 min

Each run creates a timestamped directory under results/:
results/2026-04-01_14-30-00/
├── report.json # structured results
├── run.log # console output (ANSI stripped)
└── error-details.log # verbose error bodies (if any)
-n, --restores <num> Total restores across all rounds (default: 1000)
-r, --rounds <num> Independent checkpoint rounds (default: 5)
-c, --concurrency <num> Max simultaneous restores (default: 5)
--marker-size <mb> Marker file size in MB (default: 5)
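The `--concurrency` cap can be implemented with a small worker pool; a hypothetical sketch (`runLimited` is not the repo's actual helper):

```typescript
// Run tasks with at most `limit` in flight at once. Each worker pulls the
// next task index; because JS is single-threaded, the increment is race-free.
async function runLimited<T>(tasks: Array<() => Promise<T>>, limit: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}
```

With the defaults above, each round would pass 200 restore thunks and a limit of 5.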
npx tsx src/stress-test.ts -n 500 -r 3 -c 8
npx tsx src/stress-test.ts -n 2000 -r 10 -c 5 --marker-size 10

src/
├── stress-test.ts # entry point
└── lib/
├── round.ts # one round: create checkpoint, restore N times
├── verify.ts # one restore: 3 integrity checks
├── errors.ts # classify corruption vs infra errors
├── report.ts # terminal output + JSON report
└── types.ts # shared types and utilities
docs/
└── incident-report.pdf # root cause analysis & remediation
results/
└── 2026-04-01_15-11-52/ # 1000-restore full run