Initial commit: add all skills files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-10 16:52:49 +08:00
commit 6487becf60
396 changed files with 108871 additions and 0 deletions

View File

@@ -0,0 +1,178 @@
# Metal Shader Reference
Expert reference for Metal shaders, real-time rendering, and Apple's Tile-Based Deferred Rendering (TBDR) architecture.
## Core Principles
**Half precision first → Leverage TBDR → Function constant specialization → Use Intersector API**
### When to Use
- Metal Shading Language (MSL) development
- Apple GPU optimization (TBDR architecture)
- PBR rendering pipelines
- Compute shaders and parallel processing
- Apple Silicon ray tracing
- GPU profiling and debugging
### When NOT to Use
- WebGL/GLSL (different architecture)
- CUDA (NVIDIA only)
- OpenGL (deprecated on Apple)
- CPU-side optimization
## Expert vs Novice
| Topic | Novice | Expert |
|-------|--------|--------|
| Data types | `float` everywhere | Default `half`, `float` only for position/depth |
| Branching | Runtime conditionals | Function constants for compile-time elimination |
| Memory | Everything in device | Know constant/device/threadgroup tradeoffs |
| Architecture | Treat as desktop GPU | Understand TBDR: tile memory is free, bandwidth is expensive |
| Ray tracing | intersection queries | intersector API (hardware-aligned) |
| Debugging | print debugging | GPU capture, shader profiler, occupancy analysis |
## Common Anti-Patterns
| Anti-Pattern | Problem | Solution |
|--------------|---------|----------|
| 32-bit floats | Wastes registers, reduces occupancy, doubles bandwidth | Default `half`, `float` only for position/depth |
| Ignoring TBDR | Not using free tile memory | Use `[[color(n)]]`, memoryless targets |
| Runtime constant branches | Warp divergence, wastes ALU | Function constants + pipeline specialization |
| intersection queries | Not hardware-aligned | Use intersector API |
## Metal Evolution
| Era | Key Development |
|-----|-----------------|
| Metal 2.x | OpenGL migration, basic compute |
| Apple Silicon | Unified memory, tile shaders critical |
| Metal 3 | Mesh shaders, hardware-accelerated ray tracing |
| Latest | Neural Engine + GPU cooperation, Vision Pro foveated rendering |
**Apple Family 9 Note**: Threadgroup memory less advantageous vs direct device access.
## Shader Types
| Type | Purpose | Key Attributes |
|------|---------|----------------|
| Vertex | Vertex transformation | `[[stage_in]]`, `[[buffer(n)]]` |
| Fragment | Pixel shading | `[[color(n)]]`, `[[texture(n)]]` |
| Compute/Kernel | General computation | `[[thread_position_in_grid]]` |
| Tile | TBDR-specific | `[[imageblock]]` |
| Mesh | Metal 3 geometry | `[[mesh_id]]` |
## Rendering Techniques
| Technique | Description |
|-----------|-------------|
| Fullscreen quad | 4 vertex triangle strip, no MVP, post-processing basis |
| PBR Cook-Torrance | Fresnel Schlick + GGX Distribution + Smith Geometry |
| Blinn-Phong | Simple specular, half-vector calculation |
## Procedural Generation
| Technique | Use Case |
|-----------|----------|
| Hash functions | Pseudo-random basis for noise, random sampling |
| Voronoi | Cell textures, stones, cracks |
| Value/Perlin Noise | Continuous random fields |
| FBM | Multi-octave layering, fractal terrain, clouds |
| Domain Warping | Coordinate distortion, organic shapes |
## Numerical Techniques
| Technique | Formula |
|-----------|---------|
| Central difference gradient | `(f(x+h) - f(x-h)) / (2h)` |
| Smoothstep | `x * x * (3 - 2 * x)` |
| SDF operations | `min/max/smooth_min` boolean ops |
## SwiftUI + MTKView Integration
### Architecture Pattern
```
MetalView (UIViewRepresentable)
└── Coordinator = Renderer (MTKViewDelegate)
├── MTLDevice
├── MTLCommandQueue
├── MTLRenderPipelineState
└── MTLBuffer (vertices, uniforms)
```
### Uniform Alignment Rules
| Swift Type | Metal Type | Alignment |
|------------|------------|-----------|
| `Float` | `float` | 4 bytes |
| `SIMD2<Float>` | `float2` | 8 bytes |
| `SIMD3<Float>` | `float3` | **16 bytes** |
| `SIMD4<Float>` | `float4` | 16 bytes |
**Key**: `float3` aligns to 16 bytes. Use `MemoryLayout<T>.size` to verify.
## Command Line Tools
| Command | Purpose |
|---------|---------|
| `xcrun metal -c shader.metal -o shader.air` | Compile to AIR |
| `xcrun metallib shader.air -o shader.metallib` | Link to metallib |
| `xcrun metal shader.metal -o shader.metallib` | One-step compile & link |
| `xcrun metal -Weverything -c shader.metal` | Syntax check |
| `xcrun metal-objdump --disassemble shader.metallib` | Disassemble |
## GPU Debugging
### Xcode Workflow
1. **GPU Capture**: ⌘⇧⌥G
2. **Shader Profiler**: Select draw call → View Shader
3. **Memory Viewer**: Inspect buffer/texture
4. **Performance HUD**: Enable in device options
### Key Metrics
| Metric | Healthy Value | Low Value Cause |
|--------|---------------|-----------------|
| GPU Occupancy | > 80% | Memory bandwidth bottleneck |
| ALU Utilization | > 60% | Waiting on memory |
| Bandwidth | As low as possible | TBDR should minimize store |
### Debug Utility Functions
| Function | Purpose |
|----------|---------|
| heatmap | Value visualization (blue→green→red) |
| debugNaN | NaN/Inf detection (magenta marker) |
| visualizeDepth | Linearized depth visualization |
## Performance Optimization Checklist
### Data Types
- [ ] Default `half`, `float` only for position/depth
### Memory Management
- [ ] Constants in constant address space
- [ ] Use `.storageModeShared`
- [ ] Leverage tile memory (TBDR free reads)
- [ ] Avoid unnecessary render target stores
### Branch Optimization
- [ ] Function constants to eliminate branches
- [ ] Fixed loop bounds (GPU unrolling)
### Rendering Tips
- [ ] Fullscreen quad with 4 vertex triangle strip
- [ ] Procedural textures to avoid sampling bandwidth
- [ ] `[[early_fragment_tests]]` for early depth test
- [ ] `setFragmentBytes` for small data
### Compute Optimization
- [ ] Vectorize (SIMD)
- [ ] Reduce register pressure
---
*Metal, Apple Silicon, and Xcode are trademarks of Apple Inc.*