skills/fullstack-dev/references/release-checklist.md
shihao 6487becf60 Initial commit: add all skills files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:52:49 +08:00

# Release & Acceptance Checklist
A six-gate release checklist for backend and full-stack applications. It guards against "it works on my machine" and "we forgot to check X" failures.
**Iron Law: NO RELEASE WITHOUT ALL GATES PASSING.**
---
## Release Gates Overview
```
Feature Complete
Gate 1: Functional Acceptance → Does it do what it should?
Gate 2: Non-Functional Acceptance → Is it fast, reliable, observable?
Gate 3: Security Review → Is it safe?
Gate 4: Deployment Readiness → Can we deploy and rollback safely?
Gate 5: Release Execution → Deploy with canary + monitoring
Gate 6: Post-Release Validation → Did it actually work in production?
```
---
## Gate 1: Functional Acceptance
**Question: Does it do what the requirements say?**
- [ ] All acceptance criteria from ticket/PRD have passing tests
- [ ] Happy path works end-to-end
- [ ] Edge cases tested (empty inputs, max lengths, Unicode)
- [ ] Error cases tested (invalid input, not found, timeout)
- [ ] Data integrity verified (CRUD cycle produces correct state)
- [ ] Backward compatibility confirmed (existing clients not broken)
- [ ] API contract matches OpenAPI spec
- [ ] Idempotency verified (retries don't create duplicates)
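
The idempotency item can be checked with a small test. The following is a minimal sketch, assuming a hypothetical in-memory `OrderService` that deduplicates on a client-supplied idempotency key; a real service would persist the key alongside the order.

```python
import uuid


class OrderService:
    """Hypothetical in-memory service, used only to illustrate the check."""

    def __init__(self):
        self.orders = {}    # order_id -> order payload
        self._by_key = {}   # idempotency key -> order_id

    def create_order(self, idempotency_key, items):
        # A retried request reuses its key, so return the original order.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        if not items:
            raise ValueError("cart is empty")
        order_id = str(uuid.uuid4())
        self.orders[order_id] = {"items": list(items)}
        self._by_key[idempotency_key] = order_id
        return order_id


def check_idempotent_create():
    """Return True if a retry with the same key creates no duplicate."""
    service = OrderService()
    first = service.create_order("key-123", ["sku-1"])
    retry = service.create_order("key-123", ["sku-1"])
    return first == retry and len(service.orders) == 1
```

The same shape works against a real API: send the request twice with one key, then assert on both the returned ID and the stored row count.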
### Evidence Template
| Requirement | Test | Status | Notes |
|-------------|------|--------|-------|
| User can create order | `orders.api.test:creates order` | ✅ PASS | |
| Empty cart → error | `orders.api.test:rejects empty` | ✅ PASS | |
| Payment failure handled | `payments.test:handles decline` | ✅ PASS | |
---
## Gate 2: Non-Functional Acceptance
**Question: Is it fast, reliable, and observable?**
### Performance
- [ ] Response time within budget (p95 < ___ms) — measured, not assumed
- [ ] No N+1 queries (checked with query logging)
- [ ] New queries use indexes (`EXPLAIN ANALYZE`)
- [ ] Pagination works on large datasets
- [ ] Caching effective (hit rate > 80%)
- [ ] Connection pool healthy under load
### Reliability
- [ ] Graceful degradation when dependencies fail (circuit breaker)
- [ ] Retry logic works for transient failures
- [ ] All external calls have timeouts
- [ ] Rate limiting returns 429 correctly
- [ ] Health check endpoints verified (`/health`, `/ready`)
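
For the retry item, a sketch of bounded retries with exponential backoff is shown here. `TransientError` and `call_with_retry` are hypothetical names; timeouts are not covered and would be set on the underlying client.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func(); on TransientError, retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter lets a test record the delays instead of waiting for them.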
### Observability
- [ ] Structured logging with request ID (not `console.log`)
- [ ] Metrics exposed (request count, latency, error rate)
- [ ] Alerts configured (error spike, latency spike)
- [ ] Request tracing works end-to-end
- [ ] Dashboard updated for new feature
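
To illustrate the structured-logging item, here is a minimal sketch using Python's standard `logging` module. `JsonFormatter`, `build_logger`, and the field names are assumptions made for this example, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id arrives through the `extra` argument of the log call.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(name="app", stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: capture output in memory to inspect the emitted line.
buffer = io.StringIO()
logger = build_logger("release-demo", stream=buffer)
logger.info("order created", extra={"request_id": "req-42"})
entry = json.loads(buffer.getvalue())
```

Because every line is a JSON object carrying `request_id`, log queries can follow one request across services.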
### Evidence
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| p95 response | < 500ms | ___ms | ✅/❌ |
| p99 response | < 1000ms | ___ms | ✅/❌ |
| Error rate (load) | < 0.1% | ___% | ✅/❌ |
| Throughput | > ___ RPS | ___ RPS | ✅/❌ |
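
The "Actual" column can be filled from raw load-test samples. This sketch assumes the nearest-rank percentile definition and uses the example targets from the table as defaults; `evaluate_load_test` is a hypothetical helper name.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_load_test(latencies_ms, error_count, request_count,
                       p95_target_ms=500, p99_target_ms=1000,
                       error_target_pct=0.1):
    """Compare measured p95, p99, and error rate against the targets."""
    p95 = percentile(latencies_ms, 95)
    p99 = percentile(latencies_ms, 99)
    error_pct = 100 * error_count / request_count
    return {
        "p95_ms": p95,
        "p99_ms": p99,
        "error_pct": error_pct,
        "passed": (p95 < p95_target_ms
                   and p99 < p99_target_ms
                   and error_pct < error_target_pct),
    }
```

Monitoring tools may interpolate percentiles differently, so use the same definition for the baseline and the measurement.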
---
## Gate 3: Security Review
**Question: Does this introduce vulnerabilities?**
### Input & Output
- [ ] All input validated server-side (never trust client)
- [ ] SQL injection prevented (parameterized queries only)
- [ ] XSS prevented (output encoding)
- [ ] File upload validated (type, size, name sanitized)
- [ ] Rate limiting on sensitive endpoints (login, reset, APIs)
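
As an illustration of "parameterized queries only", this sketch uses Python's built-in `sqlite3` module. The `users` table and `find_user` helper are made up for the example.

```python
import sqlite3


def find_user(conn, email):
    """Look up a user; the ? placeholder binds `email` as data, not SQL."""
    cursor = conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    )
    return cursor.fetchone()


# Demo database, held in memory only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
demo_conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
```

With binding, a classic injection string such as `' OR '1'='1` is compared literally against the column and matches no row.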
### Auth & Data
- [ ] Protected endpoints require valid credentials
- [ ] Users can only access their own resources
- [ ] Admin routes require admin role
- [ ] Tokens expire (short-lived access + refresh)
- [ ] Passwords hashed with a password-hashing function (bcrypt/argon2), not a fast hash such as MD5 or SHA-256
- [ ] Sensitive data not logged (passwords, tokens, PII)
- [ ] Secrets in env vars (not hardcoded)
- [ ] Error messages don't leak internals
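
One way to enforce "sensitive data not logged" is to mask fields before they reach the logger. This is a sketch only; the key list in `SENSITIVE_KEYS` is hypothetical and should be adapted to the application's own field names.

```python
# Hypothetical key list: extend it with your own field names.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn"}


def redact(value):
    """Return a copy with sensitive keys masked, recursing into containers."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Calling `redact(payload)` at the logging boundary keeps the masking in one place instead of relying on every call site to remember it.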
### Dependencies
- [ ] No known vulnerabilities (`npm audit` / `pip-audit` / `govulncheck`)
- [ ] Dependencies pinned in lockfile
- [ ] Unused dependencies removed
---
## Gate 4: Deployment Readiness
**Question: Can we deploy safely and roll back if needed?**
### Code
- [ ] All tests pass in CI (not "it passed locally")
- [ ] Linter clean, build succeeds
- [ ] Code reviewed and approved
- [ ] No unresolved TODO/FIXME/HACK comments in the changed code
### Database
- [ ] Migration tested on staging with production-like data
- [ ] Down migration works (tested!)
- [ ] Migration is non-destructive (additive only)
- [ ] Migration timing estimated on production data size
- [ ] Backfill plan documented (if needed)
### Configuration
- [ ] New env vars documented in `.env.example`
- [ ] Env vars set in staging and verified
- [ ] Env vars set in production
- [ ] Feature flags configured (if applicable)
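
The env var items can be partly automated by comparing `.env.example` against the target environment. This sketch assumes simple `KEY=value` lines; `required_keys` and `missing_env_vars` are hypothetical helpers.

```python
def required_keys(env_example_text):
    """Extract variable names from KEY=value lines, skipping comments."""
    keys = []
    for line in env_example_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.append(line.split("=", 1)[0].strip())
    return keys


def missing_env_vars(env_example_text, environ):
    """Return documented variables that are unset or empty in `environ`."""
    return [key for key in required_keys(env_example_text)
            if not environ.get(key)]
```

In a deploy script, `environ` would typically be `os.environ`; an empty result means every documented variable is set.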
### Rollback Plan Template
```markdown
## Rollback Plan: [Feature]
### When to rollback
- Error rate > 1% sustained for 5 minutes
- p99 latency > 3000ms sustained for 10 minutes
- Critical business function broken
### Steps
1. Revert deploy: [command]
2. Rollback migration (if applied): [command]
3. Invalidate cache: [command]
4. Notify team: #incidents channel
5. Verify rollback: [verification steps]
### Estimated time: [X minutes]
### Data recovery: [procedure if data was modified]
```
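
The "sustained" triggers in the template can be evaluated mechanically. The sketch below assumes one metric sample per minute, ordered oldest first; `sustained_breach` and `should_rollback` are hypothetical names.

```python
def sustained_breach(samples, threshold, minutes):
    """True if the most recent `minutes` samples all exceed `threshold`.

    Assumes one sample per minute, oldest first.
    """
    if len(samples) < minutes:
        return False  # not enough history to call the breach sustained
    return all(value > threshold for value in samples[-minutes:])


def should_rollback(error_rate_pct, p99_ms):
    """Apply the two numeric triggers from the rollback plan template."""
    return (
        sustained_breach(error_rate_pct, threshold=1.0, minutes=5)
        or sustained_breach(p99_ms, threshold=3000, minutes=10)
    )
```

The third trigger, a broken critical business function, still needs a human or a smoke test to detect.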
---
## Gate 5: Release Execution
### Deployment Sequence
```
1. 📢 ANNOUNCE in release channel
2. 🗄️ DATABASE: Apply migration
   - Run migration
   - Verify completion
   - Check data integrity
3. 🚀 DEPLOY: Roll out code
   - Canary first (10% traffic)
   - Monitor 5 minutes
   - If OK → 50% → monitor → 100%
   - If NOT OK → STOP immediately
4. 🔍 SMOKE TEST
   - Health check → 200
   - Login works
   - Core operation works
   - No error spikes
5. ✅ ANNOUNCE "Release complete. Monitoring 30 min."
```
### Canary Decision Table
| Metric | Baseline | Canary OK | STOP | ROLLBACK |
|--------|----------|-----------|------|----------|
| Error rate | 0.05% | < 0.1% | 0.5% | > 1% |
| p95 latency | 300ms | < 500ms | 700ms | > 1000ms |
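
The decision table can be encoded as a small function. This sketch is one interpretation: readings past the ROLLBACK column trigger a rollback, readings under the "Canary OK" column pass, and everything in between is treated as STOP, following the "If NOT OK → STOP immediately" rule in the deployment sequence. The `Thresholds` type and constant names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    ok_below: float        # "Canary OK" column
    rollback_above: float  # "ROLLBACK" column


ERROR_RATE_PCT = Thresholds(ok_below=0.1, rollback_above=1.0)
P95_LATENCY_MS = Thresholds(ok_below=500, rollback_above=1000)


def canary_decision(error_rate_pct, p95_ms):
    """Return "OK", "STOP", or "ROLLBACK" for one canary reading."""
    if (error_rate_pct > ERROR_RATE_PCT.rollback_above
            or p95_ms > P95_LATENCY_MS.rollback_above):
        return "ROLLBACK"
    if (error_rate_pct < ERROR_RATE_PCT.ok_below
            and p95_ms < P95_LATENCY_MS.ok_below):
        return "OK"
    # Assumption: anything that is not clearly OK halts the rollout.
    return "STOP"
```

The worst metric decides the outcome, so one bad reading on either metric is enough to halt or revert.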
---
## Gate 6: Post-Release Validation
### Immediate (0-30 min)
- [ ] Health checks green on all instances
- [ ] Error rate within normal range
- [ ] Latency normal (p95, p99)
- [ ] Core user journey manually tested
- [ ] Logs clean — no unexpected errors
- [ ] Alerts silent
### Short-term (1-24 hours)
- [ ] No customer complaints
- [ ] Business metrics stable (conversion, revenue, signups)
- [ ] Memory/CPU stable (no creeping usage)
- [ ] Queue backlogs clear
- [ ] Database performance stable
### Post-Release Report Template
```markdown
## Release Report: [Feature]
- Deployed: [timestamp] by @[engineer]
- Duration: [minutes]
| Check | Status | Notes |
|-------|--------|-------|
| Health checks | ✅ | All healthy |
| Error rate | ✅ | 0.03% (baseline: 0.05%) |
| p95 latency | ✅ | 310ms (baseline: 300ms) |
| Core flow | ✅ | Order creation verified |
Issues found: None / [details]
Rollback used: No / Yes: [reason]
```
---
## Release Readiness Score
Score each gate **0-2** (0 = not checked, 1 = partially verified, 2 = fully verified with evidence).
| Gate | Score |
|------|-------|
| 1. Functional Acceptance | /2 |
| 2. Non-Functional Acceptance | /2 |
| 3. Security Review | /2 |
| 4. Deployment Readiness | /2 |
| 5. Release Execution Plan | /2 |
| 6. Post-Release Validation Plan | /2 |
| **Total** | **/12** |
**Decision:**
- **12/12** → Ship it ✅
- **10-11** → Ship with documented exceptions + owner assigned
- **< 10** → Do NOT release. Fix gaps first.
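
The scoring rule is simple enough to automate, for example as a CI step that reads the recorded gate scores. A minimal sketch, with `GATES` and `release_decision` as hypothetical names:

```python
GATES = [
    "Functional Acceptance",
    "Non-Functional Acceptance",
    "Security Review",
    "Deployment Readiness",
    "Release Execution Plan",
    "Post-Release Validation Plan",
]


def release_decision(scores):
    """Map per-gate scores (0, 1, or 2) to a total and a decision."""
    missing = [gate for gate in GATES if gate not in scores]
    if missing:
        raise ValueError(f"unscored gates: {missing}")
    if any(scores[gate] not in (0, 1, 2) for gate in GATES):
        raise ValueError("each gate score must be 0, 1, or 2")
    total = sum(scores[gate] for gate in GATES)
    if total == 12:
        return total, "ship"
    if total >= 10:
        return total, "ship with documented exceptions"
    return total, "do not release"
```

Rejecting unscored gates stops a skipped gate from being silently counted as zero or overlooked.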
---
## Common Rationalizations
| ❌ Excuse | ✅ Reality |
|----------|-----------|
| "It's a small change" | Small changes cause outages every day |
| "We tested locally" | Local ≠ production |
| "We'll fix it if it breaks" | You'll fix it at 3 AM. Prevent now. |
| "Deadline is today" | Broken code costs more than late code |
| "CI passed" | CI doesn't check everything. Run the checklist. |
| "We can always rollback" | Only if you planned and tested rollback |
| "We did this last time fine" | Survivorship bias. Checklist every time. |