diggerhq/oc-snapshot-stress-test

OpenComputer Snapshot Stress Test

Stress test for OpenComputer checkpoint/restore integrity. Creates checkpoints and restores from them at scale, verifying git object integrity and file-level SHA-256 on every restore.

Requires Node.js 20+ and an OpenComputer API key.


Background

A customer reported corrupted snapshots: files had incorrect or missing content after checkpoint/restore operations. The root cause was a set of race conditions in the QEMU backend, where concurrent operations accessed qcow2 virtual disk files without synchronization. Specifically, tar was archiving qcow2 files while QEMU was still modifying them, which surfaced as the error "tar: rootfs.qcow2: file changed as we read it".

Four fixes were applied, all based on the same principle — never read a qcow2 file that another process can write:

  1. Hibernate archive: reflink-copy qcow2 to staging before archiving
  2. Destroy during archive: wait for archive goroutine before deleting files
  3. Migration upload: mutex + reflink staging during S3 upload
  4. Checkpoint cache delete: write-lock blocks removal while forks hold read locks
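
The staging idea behind fix 1 can be sketched in a few lines. This is only an illustration in TypeScript: the backend's actual implementation is not shown in this README, and `stageForArchive` plus the paths below are hypothetical. Node's `COPYFILE_FICLONE` flag requests a copy-on-write reflink and silently falls back to a regular copy on filesystems without reflink support, so a real implementation would probably want to require reflinks rather than accept the fallback.

```typescript
import { constants, copyFileSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { basename, join } from "node:path";

// Hypothetical helper: copy a disk image into a private staging directory
// so the archiver only ever reads a file that nothing else writes to.
// COPYFILE_FICLONE attempts a copy-on-write reflink and falls back to a
// plain copy if the filesystem does not support it.
function stageForArchive(imagePath: string, stagingDir: string): string {
  const stagedPath = join(stagingDir, basename(imagePath));
  copyFileSync(imagePath, stagedPath, constants.COPYFILE_FICLONE);
  return stagedPath;
}

// Demo with a throwaway file standing in for a qcow2 image.
const workDir = mkdtempSync(join(tmpdir(), "live-"));
const source = join(workDir, "rootfs.qcow2");
writeFileSync(source, "disk image bytes");

const stagingDir = mkdtempSync(join(tmpdir(), "staging-"));
const staged = stageForArchive(source, stagingDir);

// The archiver would now read `staged`, never `source`.
console.log(readFileSync(staged, "utf8"));
```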

This repo validates the fixes at the SDK level by running 1,000 checkpoint restores against production.

Full incident report: docs/incident-report.pdf


Results: 1,000 restores (April 2026)

Ran against OpenComputer production. 5 independent rounds, 200 restores each, concurrency 5.

Metric                                 Result
Total restores                         1,000
Corruptions                            0
Infra errors (timeouts)                3
Avg restore (fork from checkpoint)     ~130ms
Avg verify (git + SHA-256 over 5MB)    ~10s

Round   Restores   Corrupted   Infra errors   Setup   Avg create   Avg verify
1       200        0           0              7.0s    140ms        10.3s
2       200        0           0              7.8s    130ms        11.5s
3       200        0           1              6.2s    132ms        10.8s
4       200        0           2              7.7s    149ms        11.2s
5       200        0           0              59.6s   135ms        8.9s
Total   1,000      0           3              -       ~137ms       ~10.5s

The 3 infra errors were Cloudflare 524 timeouts, which are transient network issues rather than data integrity failures. Round 5's longer setup (59.6s) was caused by a slow checkpoint readiness poll.
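
The Total row can be re-derived from the per-round rows. The sketch below uses a hypothetical `RoundResult` shape (not the actual `report.json` schema) and a plain mean across rounds, which is reasonable here because every round ran the same number of restores.

```typescript
// Hypothetical shape for one row of the per-round table.
interface RoundResult {
  round: number;
  restores: number;
  corrupted: number;
  infraErrors: number;
  avgCreateMs: number;
  avgVerifySec: number;
}

// Values copied from the per-round table.
const rounds: RoundResult[] = [
  { round: 1, restores: 200, corrupted: 0, infraErrors: 0, avgCreateMs: 140, avgVerifySec: 10.3 },
  { round: 2, restores: 200, corrupted: 0, infraErrors: 0, avgCreateMs: 130, avgVerifySec: 11.5 },
  { round: 3, restores: 200, corrupted: 0, infraErrors: 1, avgCreateMs: 132, avgVerifySec: 10.8 },
  { round: 4, restores: 200, corrupted: 0, infraErrors: 2, avgCreateMs: 149, avgVerifySec: 11.2 },
  { round: 5, restores: 200, corrupted: 0, infraErrors: 0, avgCreateMs: 135, avgVerifySec: 8.9 },
];

function summarize(results: RoundResult[]) {
  const sum = (pick: (r: RoundResult) => number) =>
    results.reduce((acc, r) => acc + pick(r), 0);
  return {
    restores: sum((r) => r.restores),
    corrupted: sum((r) => r.corrupted),
    infraErrors: sum((r) => r.infraErrors),
    // Unweighted mean of the per-round averages.
    avgCreateMs: sum((r) => r.avgCreateMs) / results.length,
    avgVerifySec: sum((r) => r.avgVerifySec) / results.length,
  };
}

const summary = summarize(rounds);
console.log(summary.restores, summary.corrupted, summary.infraErrors);
console.log(summary.avgCreateMs.toFixed(1), summary.avgVerifySec.toFixed(2));
```

The means work out to 686 / 5 = 137.2 ms and 52.7 / 5 = 10.54 s, matching the ~137ms and ~10.5s in the Total row.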

Raw results: results/2026-04-01_15-11-52/


Methodology

Each round:

  1. Boot a sandbox (1 CPU / 4 GB), write a 5MB random marker file, commit it to a git repo
  2. Checkpoint the sandbox
  3. Restore from that checkpoint N times concurrently
  4. Verify every restore with three checks:
    • git status: crashes (segfault) or errors out if the filesystem or memory is corrupted
    • git log: verifies that the object database matches the expected commit
    • SHA-256 of the marker file: detects bit rot or truncation
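
The third check amounts to comparing a digest recorded before the checkpoint with one computed after restore. A minimal local sketch follows, assuming the file is readable from Node; in the real test the marker lives inside the sandbox, so the hash is presumably computed there. `sha256File` and `markerMatches` are illustrative names, not functions from `verify.ts`.

```typescript
import { createHash, randomBytes } from "node:crypto";
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hex-encoded SHA-256 of a file's full contents.
function sha256File(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

// True when the file still hashes to the digest recorded before the checkpoint.
function markerMatches(path: string, expectedHex: string): boolean {
  return sha256File(path) === expectedHex;
}

// Demo: a small random marker (the README's default is 5MB) and a truncated copy.
const dir = mkdtempSync(join(tmpdir(), "marker-"));
const marker = join(dir, "marker.bin");
writeFileSync(marker, randomBytes(1024));
const expected = sha256File(marker);

const truncated = join(dir, "truncated.bin");
writeFileSync(truncated, readFileSync(marker).subarray(0, 512));

console.log(markerMatches(marker, expected));    // true
console.log(markerMatches(truncated, expected)); // false
```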

Git's content-addressed object store (SHA-1) makes it sensitive to even single bit flips — corruption typically surfaces as segfaults or hash errors rather than silent data loss.
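
To make the content-addressing point concrete: with Git's default SHA-1 object format, a blob's ID is the SHA-1 of the header `blob <size>\0` followed by the file bytes. The sketch below recomputes that ID and shows that flipping one bit yields a different ID, so an object stored under the old ID no longer matches its name. `gitBlobId` is an illustrative helper, not part of this repo.

```typescript
import { createHash } from "node:crypto";

// Compute the object ID Git assigns to a blob with this content
// (SHA-1 object format): sha1("blob " + byteLength + "\0" + content).
function gitBlobId(content: Buffer): string {
  const header = Buffer.from(`blob ${content.length}\0`);
  return createHash("sha1").update(header).update(content).digest("hex");
}

const original = Buffer.from("hello world\n");

// Copy the buffer and flip the lowest bit of the first byte.
const flipped = Buffer.from(original);
flipped[0] ^= 0x01;

console.log(gitBlobId(original)); // 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
console.log(gitBlobId(original) === gitBlobId(flipped)); // false
```

The first value is the same ID that `echo 'hello world' | git hash-object --stdin` prints.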

Errors are classified as corruption (segfault, hash mismatch, git breakage) or infra (timeout, 502, rate limit). Only corruption counts as a test failure.
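
A classifier along these lines could look like the sketch below. The patterns and the extra "unknown" bucket are assumptions for illustration; the actual rules live in `src/lib/errors.ts` and may differ. Corruption patterns are checked first so that an ambiguous message is never downgraded to an infra error.

```typescript
type ErrorKind = "corruption" | "infra" | "unknown";

// Illustrative patterns only; not the list used by the real classifier.
const CORRUPTION_PATTERNS: RegExp[] = [
  /segmentation fault|segfault/i,
  /hash mismatch/i,
  /bad object/i,
  /object file .* is empty/i,
  /corrupt/i,
];

const INFRA_PATTERNS: RegExp[] = [
  /timeout|timed out/i,
  /\b(502|524)\b/,
  /rate limit/i,
];

function classifyError(message: string): ErrorKind {
  // Corruption wins over infra when both match.
  if (CORRUPTION_PATTERNS.some((p) => p.test(message))) return "corruption";
  if (INFRA_PATTERNS.some((p) => p.test(message))) return "infra";
  return "unknown";
}

console.log(classifyError("Segmentation fault (core dumped)")); // corruption
console.log(classifyError("HTTP 524: a timeout occurred"));     // infra
console.log(classifyError("502 Bad Gateway"));                  // infra
```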


Usage

cp .env.example .env
# add your OpenComputer API key
npm install
source .env

npm run test:smoke    # 10 restores, ~30s
npm run test:full     # 1000 restores, ~40 min

Each run creates a timestamped directory under results/:

results/2026-04-01_14-30-00/
├── report.json        # structured results
├── run.log            # console output (ANSI stripped)
└── error-details.log  # verbose error bodies (if any)
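
For reference, a directory name in this format can be produced as sketched below. Whether the tool uses local time or UTC is not stated here, so the local-time getters are an assumption, and `resultsDirName` is a hypothetical helper.

```typescript
// Format a Date as results/YYYY-MM-DD_HH-MM-SS/ using local time (assumption).
function resultsDirName(now: Date): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const date = `${now.getFullYear()}-${pad(now.getMonth() + 1)}-${pad(now.getDate())}`;
  const time = `${pad(now.getHours())}-${pad(now.getMinutes())}-${pad(now.getSeconds())}`;
  return `results/${date}_${time}/`;
}

// Months are zero-based in the Date constructor, so 3 means April.
console.log(resultsDirName(new Date(2026, 3, 1, 14, 30, 0))); // results/2026-04-01_14-30-00/
```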

Options

-n, --restores <num>      Total restores across all rounds (default: 1000)
-r, --rounds <num>        Independent checkpoint rounds (default: 5)
-c, --concurrency <num>   Max simultaneous restores (default: 5)
--marker-size <mb>        Marker file size in MB (default: 5)

Examples:

npx tsx src/stress-test.ts -n 500 -r 3 -c 8
npx tsx src/stress-test.ts -n 2000 -r 10 -c 5 --marker-size 10
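
A cap like `-c` is commonly enforced with a small worker pool: start at most `limit` workers, and have each pull the next task as soon as it finishes one. The sketch below shows that pattern under the assumption that each restore is an async function; `runWithConcurrency` is an illustrative name and may not match how `round.ts` does it. Failures are recorded per task rather than aborting the batch, so a single timeout does not hide the other results.

```typescript
// Run tasks with at most `limit` in flight; never rejects, records each outcome.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<PromiseSettledResult<T>[]> {
  const results: PromiseSettledResult<T>[] = new Array(tasks.length);
  let next = 0;

  // Each worker claims the next unclaimed index until none are left.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      try {
        results[index] = { status: "fulfilled", value: await tasks[index]() };
      } catch (reason) {
        results[index] = { status: "rejected", reason };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Demo: 20 fake restores with concurrency 5, tracking the peak in-flight count.
let inFlight = 0;
let peak = 0;
const fakeRestores = Array.from({ length: 20 }, (_, i) => async () => {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await new Promise((resolve) => setTimeout(resolve, 5));
  inFlight--;
  return i;
});

runWithConcurrency(fakeRestores, 5).then((results) => {
  console.log(results.length, peak); // 20 5
});
```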

Structure

src/
├── stress-test.ts          # entry point
└── lib/
    ├── round.ts            # one round: create checkpoint, restore N times
    ├── verify.ts           # one restore: 3 integrity checks
    ├── errors.ts           # classify corruption vs infra errors
    ├── report.ts           # terminal output + JSON report
    └── types.ts            # shared types and utilities
docs/
└── incident-report.pdf     # root cause analysis & remediation
results/
└── 2026-04-01_15-11-52/    # 1000-restore full run
