
shihao 6487becf60 Initial commit: add all skills files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-10 16:52:49 +08:00

7.7 KiB


Release & Acceptance Checklist

6-gate release checklist for backend and full-stack applications. Prevents "it works on my machine" and "we forgot to check X" failures.

Iron Law: NO RELEASE WITHOUT ALL GATES PASSING.

Release Gates Overview

Feature Complete
    ↓
Gate 1: Functional Acceptance        → Does it do what it should?
    ↓
Gate 2: Non-Functional Acceptance    → Is it fast, reliable, observable?
    ↓
Gate 3: Security Review              → Is it safe?
    ↓
Gate 4: Deployment Readiness         → Can we deploy and roll back safely?
    ↓
Gate 5: Release Execution            → Deploy with canary + monitoring
    ↓
Gate 6: Post-Release Validation      → Did it actually work in production?

Gate 1: Functional Acceptance

Question: Does it do what the requirements say?

All acceptance criteria from ticket/PRD have passing tests
Happy path works end-to-end
Edge cases tested (empty inputs, max lengths, Unicode)
Error cases tested (invalid input, not found, timeout)
Data integrity verified (CRUD cycle produces correct state)
Backward compatibility confirmed (existing clients not broken)
API contract matches OpenAPI spec
Idempotency verified (retries don't create duplicates)
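
The idempotency item can be checked with a test that replays the same request and counts the resulting records. The sketch below is only an illustration: `OrderService`, its in-memory storage, and the idempotency-key parameter are hypothetical stand-ins for a real API and database, not something this checklist prescribes.

```python
import uuid


class OrderService:
    """Hypothetical in-memory stand-in for an order API."""

    def __init__(self):
        self._orders = {}   # order_id -> payload
        self._by_key = {}   # idempotency key -> order_id

    def create_order(self, idempotency_key, payload):
        # A retried request with the same key returns the original order
        # instead of creating a second one.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        order_id = str(uuid.uuid4())
        self._orders[order_id] = payload
        self._by_key[idempotency_key] = order_id
        return order_id

    def order_count(self):
        return len(self._orders)


def check_idempotent_create():
    """Replay one request and confirm only one order exists afterwards."""
    svc = OrderService()
    first = svc.create_order("key-123", {"sku": "A1", "qty": 2})
    retry = svc.create_order("key-123", {"sku": "A1", "qty": 2})
    return first == retry and svc.order_count() == 1
```

In a real suite the same replay would go through the HTTP layer so that the test also covers how the key is read from the request.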

Evidence Template

| Requirement | Test | Status |
|-------------|------|--------|
| User can create order | `orders.api.test:creates order` | ✅ PASS |
| Empty cart → error | `orders.api.test:rejects empty` | ✅ PASS |
| Payment failure handled | `payments.test:handles decline` | ✅ PASS |

Gate 2: Non-Functional Acceptance

Question: Is it fast, reliable, and observable?

Performance

Response time within budget (p95 < ___ms) — measured, not assumed
No N+1 queries (checked with query logging)
New queries use indexes (EXPLAIN ANALYZE)
Pagination works on large datasets
Caching effective (hit rate > 80%)
Connection pool healthy under load
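
"Measured, not assumed" means computing the percentile from recorded latencies. A minimal sketch follows, assuming the samples are already collected in milliseconds and that the nearest-rank definition of a percentile is acceptable; monitoring tools may interpolate differently, and the function names are illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def within_budget(samples_ms, budget_ms, pct=95):
    """True when the chosen percentile is strictly below the budget."""
    return percentile(samples_ms, pct) < budget_ms
```

With twenty samples where one request took 900 ms and the rest took 100 ms, the p95 is still 100 ms while the p99 is 900 ms, which is why the evidence table tracks both.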

Reliability

Graceful degradation when dependencies fail (circuit breaker)
Retry logic works for transient failures
All external calls have timeouts
Rate limiting returns 429 correctly
Health check endpoints verified (/health, /ready)
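
Retry behaviour is easier to verify when the sleep function is injectable, so a test can run instantly and inspect the delays. The sketch below assumes exponential backoff and a hypothetical `TransientError` class that marks retryable failures; neither is mandated by this checklist.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def call_with_retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying TransientError with exponential backoff.

    Any other exception propagates immediately. The final TransientError
    is re-raised once the attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep=delays.append` in a test records the backoff schedule without waiting.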

Observability

Structured logging with request ID (not console.log)
Metrics exposed (request count, latency, error rate)
Alerts configured (error spike, latency spike)
Request tracing works end-to-end
Dashboard updated for new feature
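
As one possible shape for structured logging with a request ID, the sketch below uses Python's standard `logging` module with a small JSON formatter. The logger name and the field names (`level`, `message`, `request_id`) are assumptions chosen for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id arrives via the `extra` argument of the log call
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


def build_logger(stream):
    logger = logging.getLogger("release_checklist_demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def demo_log_line():
    """Log one line to an in-memory stream and parse it back."""
    stream = io.StringIO()
    logger = build_logger(stream)
    logger.info("order created", extra={"request_id": "req-42"})
    return json.loads(stream.getvalue())
```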

Evidence

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| p95 response | < 500ms | ___ms | ✅/❌ |
| p99 response | < 1000ms | ___ms | ✅/❌ |
| Error rate (load) | < 0.1% | ___% | ✅/❌ |
| Throughput | > ___ RPS | ___ RPS | ✅/❌ |

Gate 3: Security Review

Question: Does this introduce vulnerabilities?

Input & Output

All input validated server-side (never trust client)
SQL injection prevented (parameterized queries only)
XSS prevented (output encoding)
File upload validated (type, size, name sanitized)
Rate limiting on sensitive endpoints (login, password reset, APIs)
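
To illustrate "parameterized queries only", here is a small sketch using the standard-library `sqlite3` module. The `users` table and the helper names are made up for the example; the same placeholder idea applies to other drivers, though the placeholder syntax varies.

```python
import sqlite3


def make_demo_db():
    """Create a throwaway in-memory database with one user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
    return conn


def find_user(conn, email):
    # The ? placeholder passes email as data, so it is never parsed as SQL.
    cursor = conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    )
    return cursor.fetchone()
```

A classic injection string such as `' OR '1'='1` is treated as a literal email address here and matches no row.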

Auth & Data

Protected endpoints require valid credentials
Users can only access their own resources
Admin routes require admin role
Tokens expire (short-lived access + refresh)
Passwords hashed with a password-hashing function (bcrypt/argon2, not bare MD5/SHA digests)
Sensitive data not logged (passwords, tokens, PII)
Secrets in env vars (not hardcoded)
Error messages don't leak internals
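
One way to keep sensitive data out of logs is to mask known fields before a payload reaches the logger. This is a sketch under assumptions: the key list is illustrative and incomplete, keys are assumed to be strings, and key-based masking does not catch secrets embedded in free text.

```python
# Assumed list of sensitive field names; extend it for your own payloads.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn"}


def redact(data):
    """Return a copy of a nested dict/list with sensitive values masked."""
    if isinstance(data, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(value)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [redact(item) for item in data]
    return data
```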

Dependencies

No known vulnerabilities (npm audit / pip-audit / govulncheck)
Dependencies pinned in lockfile
Unused dependencies removed

Gate 4: Deployment Readiness

Question: Can we deploy safely and roll back if needed?

Code

All tests pass in CI (not "it passed locally")
Linter clean, build succeeds
Code reviewed and approved
No unresolved TODO/FIXME/HACK comments

Database

Migration tested on staging with production-like data
Down migration works (tested!)
Migration is non-destructive (additive only)
Migration timing estimated on production data size
Backfill plan documented (if needed)

Configuration

New env vars documented in .env.example
Env vars set in staging and verified
Env vars set in production
Feature flags configured (if applicable)

Rollback Plan Template

## Rollback Plan: [Feature]

### When to roll back
- Error rate > 1% sustained for 5 minutes
- p99 latency > 3000ms sustained for 10 minutes
- Critical business function broken

### Steps
1. Revert deploy: [command]
2. Rollback migration (if applied): [command]
3. Invalidate cache: [command]
4. Notify team: #incidents channel
5. Verify rollback: [verification steps]

### Estimated time: [X minutes]
### Data recovery: [procedure if data was modified]
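
The two numeric triggers in the template can be evaluated mechanically. The sketch below assumes metrics arrive as one sample per minute, oldest first, and reads "sustained" as every sample in the window exceeding the threshold; both are interpretations, not part of the template.

```python
def sustained_breach(samples, threshold, window):
    """True if each of the last `window` samples exceeds the threshold."""
    if len(samples) < window:
        return False  # not enough data to call it sustained
    return all(value > threshold for value in samples[-window:])


def should_roll_back(error_rate_pct, p99_ms):
    """Apply the template's triggers to per-minute series (assumed cadence)."""
    return (
        sustained_breach(error_rate_pct, 1.0, 5)    # > 1% for 5 minutes
        or sustained_breach(p99_ms, 3000, 10)       # > 3000 ms for 10 minutes
    )
```

The third trigger, a broken critical business function, stays a human judgment and is not modelled here.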

Gate 5: Release Execution

Deployment Sequence

1. 📢 ANNOUNCE in release channel

2. 🗄️ DATABASE — Apply migration
   - Run migration
   - Verify completion
   - Check data integrity

3. 🚀 DEPLOY — Roll out code
   - Canary first (10% traffic)
   - Monitor 5 minutes
   - If OK → 50% → monitor → 100%
   - If NOT OK → STOP immediately

4. 🔍 SMOKE TEST
   - Health check → 200
   - Login works
   - Core operation works
   - No error spikes

5. ✅ ANNOUNCE "Release complete. Monitoring 30 min."
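
Step 3 of the sequence can be sketched as a loop over traffic stages. `set_traffic` and `is_healthy` are hypothetical callbacks standing in for the deploy tool and the monitoring check; the monitoring wait (5 minutes in the sequence above) is assumed to happen inside `is_healthy`.

```python
def staged_rollout(set_traffic, is_healthy, stages=(10, 50, 100)):
    """Promote through traffic stages and stop at the first unhealthy one.

    Returns ("complete", last_stage) or ("stopped", failing_stage).
    """
    for percent in stages:
        set_traffic(percent)
        if not is_healthy(percent):
            return ("stopped", percent)
    return ("complete", stages[-1])
```

Stopping only halts promotion; whether to roll back afterwards is a separate decision that follows the rollback plan.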

Canary Decision Table

| Metric | Baseline | Canary OK | STOP | ROLLBACK |
|--------|----------|-----------|------|----------|
| Error rate | 0.05% | < 0.1% | 0.5% | > 1% |
| p95 latency | 300ms | < 500ms | 700ms | > 1000ms |
Gate 6: Post-Release Validation

Immediate (0-30 min)

Health checks green on all instances
Error rate within normal range
Latency normal (p95, p99)
Core user journey manually tested
Logs clean — no unexpected errors
Alerts silent

Short-term (1-24 hours)

No customer complaints
Business metrics stable (conversion, revenue, signups)
Memory/CPU stable (no creeping usage)
Queue backlogs clear
Database performance stable

Post-Release Report Template

## Release Report: [Feature]
- Deployed: [timestamp] by @[engineer]
- Duration: [minutes]

| Check | Status | Notes |
|-------|--------|-------|
| Health checks | ✅ | All healthy |
| Error rate | ✅ | 0.03% (baseline: 0.05%) |
| p95 latency | ✅ | 310ms (baseline: 300ms) |
| Core flow | ✅ | Order creation verified |

Issues found: None / [details]
Rollback used: No / Yes: [reason]

Release Readiness Score

Score each gate 0-2 (0 = not checked, 1 = partially verified, 2 = fully verified with evidence).

| Gate | Score |
|------|-------|
| 1. Functional Acceptance | /2 |
| 2. Non-Functional Acceptance | /2 |
| 3. Security Review | /2 |
| 4. Deployment Readiness | /2 |
| 5. Release Execution Plan | /2 |
| 6. Post-Release Validation Plan | /2 |
| Total | /12 |

Decision:

12/12 → Ship it ✅
10-11 → Ship with documented exceptions + owner assigned
< 10 → Do NOT release. Fix gaps first.
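
The scoring rule is simple enough to automate, for example as a pre-release script. A minimal sketch, where the function name and the returned strings are illustrative:

```python
def release_decision(scores):
    """Map six gate scores (each 0, 1, or 2) to the decision above."""
    if len(scores) != 6 or any(score not in (0, 1, 2) for score in scores):
        raise ValueError("expected six scores, each 0, 1, or 2")
    total = sum(scores)
    if total == 12:
        return "ship"
    if total >= 10:
        return "ship with documented exceptions"
    return "do not release"
```

For instance, `release_decision([2, 2, 2, 2, 1, 1])` totals 10 and falls into the "documented exceptions" band, while a single unchecked gate alongside one partial gate drops the total to 9 and blocks the release.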

Common Rationalizations

| ❌ Excuse | ✅ Reality |
|-----------|-----------|
| "It's a small change" | Small changes cause outages every day |
| "We tested locally" | Local ≠ production |
| "We'll fix it if it breaks" | You'll fix it at 3 AM. Prevent it now. |
| "Deadline is today" | Broken code costs more than late code |
| "CI passed" | CI doesn't check everything. Run the checklist. |
| "We can always roll back" | Only if you planned and tested the rollback |
| "We did this last time fine" | Survivorship bias. Run the checklist every time. |
