skills/fullstack-dev/references/release-checklist.md
shihao 6487becf60 Initial commit: add all skills files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:52:49 +08:00

Release & Acceptance Checklist

6-gate release checklist for backend and full-stack applications. Prevents "it works on my machine" and "we forgot to check X" failures.

Iron Law: NO RELEASE WITHOUT ALL GATES PASSING.


Release Gates Overview

Feature Complete
    ↓
Gate 1: Functional Acceptance        → Does it do what it should?
    ↓
Gate 2: Non-Functional Acceptance    → Is it fast, reliable, observable?
    ↓
Gate 3: Security Review              → Is it safe?
    ↓
Gate 4: Deployment Readiness         → Can we deploy and rollback safely?
    ↓
Gate 5: Release Execution            → Deploy with canary + monitoring
    ↓
Gate 6: Post-Release Validation      → Did it actually work in production?

Gate 1: Functional Acceptance

Question: Does it do what the requirements say?

  • All acceptance criteria from ticket/PRD have passing tests
  • Happy path works end-to-end
  • Edge cases tested (empty inputs, max lengths, Unicode)
  • Error cases tested (invalid input, not found, timeout)
  • Data integrity verified (CRUD cycle produces correct state)
  • Backward compatibility confirmed (existing clients not broken)
  • API contract matches OpenAPI spec
  • Idempotency verified (retries don't create duplicates)
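The idempotency item can be checked with a test that sends the same request twice. The sketch below is a minimal in-memory illustration, not a production design: `OrderService`, `create_order`, and the `idempotency_key` parameter are hypothetical names, and a real service would store keys in a durable database.

```python
import uuid


class OrderService:
    """Toy in-memory service (hypothetical); real code would persist the keys."""

    def __init__(self):
        self._orders_by_key = {}

    def create_order(self, idempotency_key, items):
        # A retry with a known key returns the original order instead of creating a new one.
        if idempotency_key in self._orders_by_key:
            return self._orders_by_key[idempotency_key]
        order = {"id": str(uuid.uuid4()), "items": list(items)}
        self._orders_by_key[idempotency_key] = order
        return order

    def order_count(self):
        return len(self._orders_by_key)


def test_retry_does_not_duplicate():
    service = OrderService()
    first = service.create_order("key-123", ["widget"])
    retry = service.create_order("key-123", ["widget"])
    assert first["id"] == retry["id"]
    assert service.order_count() == 1
```

The same shape of test works against a real API: repeat the request with an identical key and assert that the resource count did not change.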

Evidence Template

| Requirement | Test | Status | Notes |
|-------------|------|--------|-------|
| User can create order | orders.api.test:creates order | PASS | |
| Empty cart → error | orders.api.test:rejects empty | PASS | |
| Payment failure handled | payments.test:handles decline | PASS | |

Gate 2: Non-Functional Acceptance

Question: Is it fast, reliable, and observable?

Performance

  • Response time within budget (p95 < ___ms) — measured, not assumed
  • No N+1 queries (checked with query logging)
  • New queries use indexes (EXPLAIN ANALYZE)
  • Pagination works on large datasets
  • Caching effective (hit rate > 80%)
  • Connection pool healthy under load
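"Measured, not assumed" means computing the percentile from recorded latencies. The sketch below uses the nearest-rank definition, which is one of several common percentile definitions; the function names and the sample data are illustrative assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def within_budget(samples_ms, budget_ms, pct=95):
    """True if the chosen percentile of the measured latencies is under the budget."""
    return percentile(samples_ms, pct) < budget_ms
```

Feed it latencies from a load test against staging rather than a handful of local requests, since the tail is what the budget is meant to constrain.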

Reliability

  • Graceful degradation when dependencies fail (circuit breaker)
  • Retry logic works for transient failures
  • All external calls have timeouts
  • Rate limiting returns 429 correctly
  • Health check endpoints verified (/health, /ready)
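Retry behaviour for transient failures is easy to get subtly wrong, so it helps to have a small unit under test. This is a sketch under stated assumptions: `TransientError` is a hypothetical marker exception, and the exponential backoff values are placeholders rather than recommendations.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, a timeout)."""


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # Out of attempts: surface the failure to the caller.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter lets a test record the delays instead of actually waiting. Non-transient errors propagate immediately, which is the behaviour the checklist item should confirm.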

Observability

  • Structured logging with request ID (not console.log)
  • Metrics exposed (request count, latency, error rate)
  • Alerts configured (error spike, latency spike)
  • Request tracing works end-to-end
  • Dashboard updated for new feature
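As an illustration of structured logging with a request ID, the sketch below emits one JSON object per log line using only the standard library. The logger name, the field names, and `build_logger` are assumptions made for this example; many teams use a dedicated structured-logging library instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id arrives through the `extra` argument of the logging call.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("release-checklist-demo")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

A call such as `logger.info("order created", extra={"request_id": "req-42"})` then produces a line that a log aggregator can filter by `request_id`.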

Evidence

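| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| p95 response | < 500ms | ___ms | PASS / FAIL |
| p99 response | < 1000ms | ___ms | PASS / FAIL |
| Error rate (load) | < 0.1% | ___% | PASS / FAIL |
| Throughput | > ___ RPS | ___ RPS | PASS / FAIL |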

Gate 3: Security Review

Question: Does this introduce vulnerabilities?

Input & Output

  • All input validated server-side (never trust client)
  • SQL injection prevented (parameterized queries only)
  • XSS prevented (output encoding)
  • File upload validated (type, size, name sanitized)
  • Rate limiting on sensitive endpoints (login, reset, APIs)
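To make "parameterized queries only" concrete, here is a small sketch using Python's built-in `sqlite3` module. The `users` table and the `find_user` helper are hypothetical; other drivers use the same idea with their own placeholder syntax.

```python
import sqlite3


def find_user(conn, email):
    # The ? placeholder passes email as data, so it is never parsed as SQL text.
    cur = conn.execute("SELECT id, email FROM users WHERE email = ?", (email,))
    return cur.fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
```

A classic injection string such as `' OR '1'='1` is then treated as a literal email address and matches no row.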

Auth & Data

  • Protected endpoints require valid credentials
  • Users can only access their own resources
  • Admin routes require admin role
  • Tokens expire (short-lived access + refresh)
  • Passwords hashed with a slow, salted algorithm (bcrypt/argon2, not plain MD5/SHA digests)
  • Sensitive data not logged (passwords, tokens, PII)
  • Secrets in env vars (not hardcoded)
  • Error messages don't leak internals
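One way to enforce "sensitive data not logged" is to pass payloads through a redaction step before they reach the logger. The sketch below is illustrative only: the key list is a hypothetical starting point, it assumes dictionary keys are strings, and it matches on key names rather than inspecting values.

```python
# Hypothetical list of key names to mask; extend it for your own payloads.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn"}


def redact(value):
    """Return a copy of value with sensitive dictionary entries masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

For example, `redact({"user": "a", "password": "x"})` keeps `user` and masks `password`; nested dictionaries and lists are handled recursively.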

Dependencies

  • No known vulnerabilities (npm audit / pip-audit / govulncheck)
  • Dependencies pinned in lockfile
  • Unused dependencies removed

Gate 4: Deployment Readiness

Question: Can we deploy safely and roll back if needed?

Code

  • All tests pass in CI (not "it passed locally")
  • Linter clean, build succeeds
  • Code reviewed and approved
  • No unresolved TODO/FIXME/HACK

Database

  • Migration tested on staging with production-like data
  • Down migration works (tested!)
  • Migration is non-destructive (additive only)
  • Migration timing estimated on production data size
  • Backfill plan documented (if needed)
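"Down migration works (tested!)" can be automated as a round-trip check: apply the migration, roll it back, and compare the schema with the starting state. The sketch below uses `sqlite3` and a hypothetical additive migration that creates an `order_notes` table; a real project would run its migration tool's up and down commands against a staging database.

```python
import sqlite3

# Hypothetical additive migration and its inverse.
UP = "CREATE TABLE order_notes (order_id INTEGER NOT NULL, note TEXT)"
DOWN = "DROP TABLE order_notes"


def table_names(conn):
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def check_round_trip(conn):
    """Apply UP then DOWN and return the table lists seen at each step."""
    before = table_names(conn)
    conn.execute(UP)
    after_up = table_names(conn)
    conn.execute(DOWN)
    after_down = table_names(conn)
    return before, after_up, after_down
```

The check passes when the final table list equals the initial one and the intermediate list contains the new table.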

Configuration

  • New env vars documented in .env.example
  • Env vars set in staging and verified
  • Env vars set in production
  • Feature flags configured (if applicable)

Rollback Plan Template

## Rollback Plan: [Feature]

### When to rollback
- Error rate > 1% sustained 5 minutes
- p99 latency > 3000ms sustained 10 minutes
- Critical business function broken

### Steps
1. Revert deploy: [command]
2. Rollback migration (if applied): [command]
3. Invalidate cache: [command]
4. Notify team: #incidents channel
5. Verify rollback: [verification steps]

### Estimated time: [X minutes]
### Data recovery: [procedure if data was modified]
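The "sustained" conditions in the rollback template can be evaluated mechanically. The sketch below assumes one metric sample per minute, oldest first, and reuses the template's example thresholds; the function names are hypothetical and the third trigger (a broken critical business function) still needs a human or a dedicated check.

```python
def sustained_breach(samples, threshold, window_minutes):
    """True if every one of the last window_minutes samples exceeds threshold."""
    if len(samples) < window_minutes:
        return False  # Not enough data to call the breach sustained.
    return all(value > threshold for value in samples[-window_minutes:])


def should_rollback(error_rate_pct, p99_ms):
    """Apply the template's triggers: error rate > 1% for 5 min, or p99 > 3000ms for 10 min."""
    return (
        sustained_breach(error_rate_pct, 1.0, 5)
        or sustained_breach(p99_ms, 3000, 10)
    )
```

A single good minute inside the window resets the condition, which matches the intent of "sustained" and avoids rolling back on a brief spike.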

Gate 5: Release Execution

Deployment Sequence

1. 📢 ANNOUNCE in release channel

2. 🗄️ DATABASE — Apply migration
   - Run migration
   - Verify completion
   - Check data integrity

3. 🚀 DEPLOY — Roll out code
   - Canary first (10% traffic)
   - Monitor 5 minutes
   - If OK → 50% → monitor → 100%
   - If NOT OK → STOP immediately

4. 🔍 SMOKE TEST
   - Health check → 200
   - Login works
   - Core operation works
   - No error spikes

5. ✅ ANNOUNCE "Release complete. Monitoring 30 min."
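The deploy step of the sequence above can be sketched as a loop over traffic stages. Everything here is an assumption for illustration: `set_traffic` and `is_healthy` are hypothetical callables supplied by your deployment tooling, and `is_healthy` is expected to block for the monitoring period before returning.

```python
def staged_rollout(set_traffic, is_healthy, stages=(10, 50, 100)):
    """Shift traffic stage by stage; stop at the first stage that looks unhealthy."""
    for pct in stages:
        set_traffic(pct)
        if not is_healthy():
            return ("stopped", pct)  # Caller decides whether to roll back.
    return ("complete", stages[-1])
```

Returning the stage at which the rollout stopped gives the release announcement and any incident report a concrete data point.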

Canary Decision Table

| Metric | Baseline | Canary OK | STOP | ROLLBACK |
|--------|----------|-----------|------|----------|
| Error rate | 0.05% | < 0.1% | 0.5% | > 1% |
| p95 latency | 300ms | < 500ms | 700ms | > 1000ms |
Gate 6: Post-Release Validation

Immediate (0-30 min)

  • Health checks green on all instances
  • Error rate within normal range
  • Latency normal (p95, p99)
  • Core user journey manually tested
  • Logs clean — no unexpected errors
  • Alerts silent

Short-term (1-24 hours)

  • No customer complaints
  • Business metrics stable (conversion, revenue, signups)
  • Memory/CPU stable (no creeping usage)
  • Queue backlogs clear
  • Database performance stable

Post-Release Report Template

## Release Report: [Feature]
- Deployed: [timestamp] by @[engineer]
- Duration: [minutes]

| Check | Status | Notes |
|-------|--------|-------|
| Health checks | ✅ | All healthy |
| Error rate | ✅ | 0.03% (baseline: 0.05%) |
| p95 latency | ✅ | 310ms (baseline: 300ms) |
| Core flow | ✅ | Order creation verified |

Issues found: None / [details]
Rollback used: No / Yes: [reason]

Release Readiness Score

Score each gate 0-2 (0 = not checked, 1 = partially verified, 2 = fully verified with evidence).

| Gate | Score |
|------|-------|
| 1. Functional Acceptance | /2 |
| 2. Non-Functional Acceptance | /2 |
| 3. Security Review | /2 |
| 4. Deployment Readiness | /2 |
| 5. Release Execution Plan | /2 |
| 6. Post-Release Validation Plan | /2 |
| Total | /12 |

Decision:

  • 12/12 → Ship it
  • 10-11 → Ship with documented exceptions + owner assigned
  • < 10 → Do NOT release. Fix gaps first.
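The scoring rule is simple enough to encode directly, which keeps the decision consistent across releases. This is a minimal sketch; the function name and the tuple return shape are assumptions.

```python
def release_decision(scores):
    """Map six gate scores (each 0, 1, or 2) to a total and a decision."""
    if len(scores) != 6 or any(score not in (0, 1, 2) for score in scores):
        raise ValueError("expected six scores, each 0, 1, or 2")
    total = sum(scores)
    if total == 12:
        return total, "Ship it"
    if total >= 10:
        return total, "Ship with documented exceptions + owner assigned"
    return total, "Do NOT release. Fix gaps first."
```

For example, `release_decision([2, 2, 2, 2, 1, 1])` gives a total of 10, which falls in the "documented exceptions" band.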

Common Rationalizations

| Excuse | Reality |
|--------|---------|
| "It's a small change" | Small changes cause outages every day |
| "We tested locally" | Local ≠ production |
| "We'll fix it if it breaks" | You'll fix it at 3 AM. Prevent now. |
| "Deadline is today" | Broken code costs more than late code |
| "CI passed" | CI doesn't check everything. Run the checklist. |
| "We can always rollback" | Only if you planned and tested rollback |
| "We did this last time fine" | Survivorship bias. Checklist every time. |