279 lines
7.7 KiB
Markdown
279 lines
7.7 KiB
Markdown
# Release & Acceptance Checklist
|
|
|
|
6-gate release checklist for backend and full-stack applications. Prevents "it works on my machine" and "we forgot to check X" failures.
|
|
|
|
**Iron Law: NO RELEASE WITHOUT ALL GATES PASSING.**
|
|
|
|
---
|
|
|
|
## Release Gates Overview
|
|
|
|
```
|
|
Feature Complete
|
|
↓
|
|
Gate 1: Functional Acceptance → Does it do what it should?
|
|
↓
|
|
Gate 2: Non-Functional Acceptance → Is it fast, reliable, observable?
|
|
↓
|
|
Gate 3: Security Review → Is it safe?
|
|
↓
|
|
Gate 4: Deployment Readiness → Can we deploy and rollback safely?
|
|
↓
|
|
Gate 5: Release Execution → Deploy with canary + monitoring
|
|
↓
|
|
Gate 6: Post-Release Validation → Did it actually work in production?
|
|
```
|
|
|
|
---
|
|
|
|
## Gate 1: Functional Acceptance
|
|
|
|
**Question: Does it do what the requirements say?**
|
|
|
|
- [ ] All acceptance criteria from ticket/PRD have passing tests
|
|
- [ ] Happy path works end-to-end
|
|
- [ ] Edge cases tested (empty inputs, max lengths, Unicode)
|
|
- [ ] Error cases tested (invalid input, not found, timeout)
|
|
- [ ] Data integrity verified (CRUD cycle produces correct state)
|
|
- [ ] Backward compatibility confirmed (existing clients not broken)
|
|
- [ ] API contract matches OpenAPI spec
|
|
- [ ] Idempotency verified (retries don't create duplicates)
|
|
|
|
### Evidence Template
|
|
|
|
| Requirement | Test | Status | Notes |
|
|
|-------------|------|--------|-------|
|
|
| User can create order | `orders.api.test:creates order` | ✅ PASS | |
|
|
| Empty cart → error | `orders.api.test:rejects empty` | ✅ PASS | |
|
|
| Payment failure handled | `payments.test:handles decline` | ✅ PASS | |
|
|
|
|
---
|
|
|
|
## Gate 2: Non-Functional Acceptance
|
|
|
|
**Question: Is it fast, reliable, and observable?**
|
|
|
|
### Performance
|
|
|
|
- [ ] Response time within budget (p95 < ___ms) — measured, not assumed
|
|
- [ ] No N+1 queries (checked with query logging)
|
|
- [ ] New queries use indexes (`EXPLAIN ANALYZE`)
|
|
- [ ] Pagination works on large datasets
|
|
- [ ] Caching effective (hit rate > 80%)
|
|
- [ ] Connection pool healthy under load
|
|
|
|
### Reliability
|
|
|
|
- [ ] Graceful degradation when dependencies fail (circuit breaker)
|
|
- [ ] Retry logic works for transient failures
|
|
- [ ] All external calls have timeouts
|
|
- [ ] Rate limiting returns 429 correctly
|
|
- [ ] Health check endpoints verified (`/health`, `/ready`)
|
|
|
|
### Observability
|
|
|
|
- [ ] Structured logging with request ID (not `console.log`)
|
|
- [ ] Metrics exposed (request count, latency, error rate)
|
|
- [ ] Alerts configured (error spike, latency spike)
|
|
- [ ] Request tracing works end-to-end
|
|
- [ ] Dashboard updated for new feature
|
|
|
|
### Evidence
|
|
|
|
| Metric | Target | Actual | Status |
|
|
|--------|--------|--------|--------|
|
|
| p95 response | < 500ms | ___ms | ✅/❌ |
|
|
| p99 response | < 1000ms | ___ms | ✅/❌ |
|
|
| Error rate (load) | < 0.1% | ___% | ✅/❌ |
|
|
| Throughput | > ___ RPS | ___ RPS | ✅/❌ |
|
|
|
|
---
|
|
|
|
## Gate 3: Security Review
|
|
|
|
**Question: Does this introduce vulnerabilities?**
|
|
|
|
### Input & Output
|
|
|
|
- [ ] All input validated server-side (never trust client)
|
|
- [ ] SQL injection prevented (parameterized queries only)
|
|
- [ ] XSS prevented (output encoding)
|
|
- [ ] File upload validated (type, size, name sanitized)
|
|
- [ ] Rate limiting on sensitive endpoints (login, reset, APIs)
|
|
|
|
### Auth & Data
|
|
|
|
- [ ] Protected endpoints require valid credentials
|
|
- [ ] Users can only access their own resources
|
|
- [ ] Admin routes require admin role
|
|
- [ ] Tokens expire (short-lived access + refresh)
|
|
- [ ] Passwords hashed (bcrypt/argon2, not MD5/SHA)
|
|
- [ ] Sensitive data not logged (passwords, tokens, PII)
|
|
- [ ] Secrets in env vars (not hardcoded)
|
|
- [ ] Error messages don't leak internals
|
|
|
|
### Dependencies
|
|
|
|
- [ ] No known vulnerabilities (`npm audit` / `pip audit` / `govulncheck`)
|
|
- [ ] Dependencies pinned in lockfile
|
|
- [ ] Unused dependencies removed
|
|
|
|
---
|
|
|
|
## Gate 4: Deployment Readiness
|
|
|
|
**Question: Can we deploy safely and roll back if needed?**
|
|
|
|
### Code
|
|
|
|
- [ ] All tests pass in CI (not "it passed locally")
|
|
- [ ] Linter clean, build succeeds
|
|
- [ ] Code reviewed and approved
|
|
- [ ] No unresolved TODO/FIXME/HACK
|
|
|
|
### Database
|
|
|
|
- [ ] Migration tested on staging with production-like data
|
|
- [ ] Down migration works (tested!)
|
|
- [ ] Migration is non-destructive (additive only)
|
|
- [ ] Migration timing estimated on production data size
|
|
- [ ] Backfill plan documented (if needed)
|
|
|
|
### Configuration
|
|
|
|
- [ ] New env vars documented in `.env.example`
|
|
- [ ] Env vars set in staging and verified
|
|
- [ ] Env vars set in production
|
|
- [ ] Feature flags configured (if applicable)
|
|
|
|
### Rollback Plan Template
|
|
|
|
```markdown
|
|
## Rollback Plan: [Feature]
|
|
|
|
### When to rollback
|
|
- Error rate > 1% sustained 5 minutes
|
|
- p99 latency > 3000ms sustained 10 minutes
|
|
- Critical business function broken
|
|
|
|
### Steps
|
|
1. Revert deploy: [command]
|
|
2. Rollback migration (if applied): [command]
|
|
3. Invalidate cache: [command]
|
|
4. Notify team: #incidents channel
|
|
5. Verify rollback: [verification steps]
|
|
|
|
### Estimated time: [X minutes]
|
|
### Data recovery: [procedure if data was modified]
|
|
```
|
|
|
|
---
|
|
|
|
## Gate 5: Release Execution
|
|
|
|
### Deployment Sequence
|
|
|
|
```
|
|
1. 📢 ANNOUNCE in release channel
|
|
|
|
2. 🗄️ DATABASE — Apply migration
|
|
- Run migration
|
|
- Verify completion
|
|
- Check data integrity
|
|
|
|
3. 🚀 DEPLOY — Roll out code
|
|
- Canary first (10% traffic)
|
|
- Monitor 5 minutes
|
|
- If OK → 50% → monitor → 100%
|
|
- If NOT OK → STOP immediately
|
|
|
|
4. 🔍 SMOKE TEST
|
|
- Health check → 200
|
|
- Login works
|
|
- Core operation works
|
|
- No error spikes
|
|
|
|
5. ✅ ANNOUNCE "Release complete. Monitoring 30 min."
|
|
```
|
|
|
|
### Canary Decision Table
|
|
|
|
| Metric | Baseline | Canary OK | STOP | ROLLBACK |
|
|
|--------|----------|-----------|------|----------|
|
|
| Error rate | 0.05% | < 0.1% | 0.5% | > 1% |
|
|
| p95 latency | 300ms | < 500ms | 700ms | > 1000ms |
|
|
|
|
---
|
|
|
|
## Gate 6: Post-Release Validation
|
|
|
|
### Immediate (0-30 min)
|
|
|
|
- [ ] Health checks green on all instances
|
|
- [ ] Error rate within normal range
|
|
- [ ] Latency normal (p95, p99)
|
|
- [ ] Core user journey manually tested
|
|
- [ ] Logs clean — no unexpected errors
|
|
- [ ] Alerts silent
|
|
|
|
### Short-term (1-24 hours)
|
|
|
|
- [ ] No customer complaints
|
|
- [ ] Business metrics stable (conversion, revenue, signups)
|
|
- [ ] Memory/CPU stable (no creeping usage)
|
|
- [ ] Queue backlogs clear
|
|
- [ ] Database performance stable
|
|
|
|
### Post-Release Report Template
|
|
|
|
```markdown
|
|
## Release Report: [Feature]
|
|
- Deployed: [timestamp] by @[engineer]
|
|
- Duration: [minutes]
|
|
|
|
| Check | Status | Notes |
|
|
|-------|--------|-------|
|
|
| Health checks | ✅ | All healthy |
|
|
| Error rate | ✅ | 0.03% (baseline: 0.05%) |
|
|
| p95 latency | ✅ | 310ms (baseline: 300ms) |
|
|
| Core flow | ✅ | Order creation verified |
|
|
|
|
Issues found: None / [details]
|
|
Rollback used: No / Yes: [reason]
|
|
```
|
|
|
|
---
|
|
|
|
## Release Readiness Score
|
|
|
|
Score each gate **0-2**: (0 = not checked, 1 = partially, 2 = fully verified with evidence)
|
|
|
|
| Gate | Score |
|
|
|------|-------|
|
|
| 1. Functional Acceptance | /2 |
|
|
| 2. Non-Functional Acceptance | /2 |
|
|
| 3. Security Review | /2 |
|
|
| 4. Deployment Readiness | /2 |
|
|
| 5. Release Execution Plan | /2 |
|
|
| 6. Post-Release Validation Plan | /2 |
|
|
| **Total** | **/12** |
|
|
|
|
**Decision:**
|
|
- **12/12** → Ship it ✅
|
|
- **10-11** → Ship with documented exceptions + owner assigned
|
|
- **< 10** → Do NOT release. Fix gaps first.
|
|
|
|
---
|
|
|
|
## Common Rationalizations
|
|
|
|
| ❌ Excuse | ✅ Reality |
|
|
|----------|-----------|
|
|
| "It's a small change" | Small changes cause outages every day |
|
|
| "We tested locally" | Local ≠ production |
|
|
| "We'll fix it if it breaks" | You'll fix it at 3 AM. Prevent now. |
|
|
| "Deadline is today" | Broken code costs more than late code |
|
|
| "CI passed" | CI doesn't check everything. Run the checklist. |
|
|
| "We can always rollback" | Only if you planned and tested rollback |
|
|
| "We did this last time fine" | Survivorship bias. Checklist every time. |
|