Sound Synthesis — Detailed Reference

This document is a complete reference supplement to SKILL.md, covering prerequisites, detailed explanations of each implementation step, in-depth variant descriptions, performance optimization analysis, and combination suggestions with code examples.

Prerequisites

  • GLSL Fundamentals: Functions, vector operations, float/vec2 types, math functions like sin()/exp()/fract()
  • Audio Fundamentals: Sample rate (typically 44100Hz), frequency-to-pitch relationship, waveform concepts (sine, sawtooth, square)
  • Music Theory Basics: MIDI note numbers, equal temperament, octave relationship (frequency doubles), chord construction
  • ShaderToy Sound Mode: vec2 mainSound(int samp, float time) returns a vec2 stereo sample value in the range [-1, 1]

Implementation Steps

Step 1: mainSound Entry Point and Basic Framework

What: Establish the standard entry function for a sound shader, outputting a stereo signal.

Why: ShaderToy requires the fixed signature vec2 mainSound(int samp, float time), where the return value's .x and .y are the left and right channels respectively, with a range of [-1, 1]. samp is the sample index, and time is the corresponding time (in seconds).

// ShaderToy sound shader basic framework
#define TAU 6.28318530718
#define BPM 120.0                    // Adjustable: tempo
#define SPB (60.0 / BPM)             // Seconds per beat

vec2 mainSound(int samp, float time) {
    vec2 audio = vec2(0.0);

    // Layer instruments/tracks here
    // audio += instrument(time);

    // Master volume control + anti-click fade-in
    audio *= 0.5 * smoothstep(0.0, 0.5, time);

    return clamp(audio, -1.0, 1.0);
}

Step 2: MIDI Note to Frequency Conversion

What: Convert a MIDI note number to its corresponding frequency value.

Why: In equal temperament, each semitone up multiplies the frequency by 2^(1/12). MIDI 69 = A4 = 440Hz is the standard reference point. This is the foundation of all melodic synthesis.

// MIDI note number to frequency
// 69 = A4 = 440Hz, every +12 is one octave (frequency doubles)
float noteFreq(float note) {
    return 440.0 * pow(2.0, (note - 69.0) / 12.0);
}
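
A quick sanity check of the formula, with rounded values:

// Worked examples:
// noteFreq(69.0) = 440.0 * 2^( 0/12) = 440.00 Hz (A4, the reference)
// noteFreq(60.0) = 440.0 * 2^(-9/12) ≈ 261.63 Hz (middle C)
// noteFreq(81.0) = 440.0 * 2^(12/12) = 880.00 Hz (A5, one octave above A4)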

Step 3: Basic Oscillators

What: Implement four standard waveform generators — sine, sawtooth, square, and triangle waves.

Why: Different waveforms have different harmonic characteristics. Sine waves are pure (fundamental only), sawtooth waves are rich in all harmonics (bright), square waves contain only odd harmonics (hollow), and triangle waves have faster harmonic decay (soft). These four are the building blocks of all timbre synthesis.

// Sine wave - pure tone, fundamental only
float osc_sin(float t) {
    return sin(TAU * t);
}

// Sawtooth wave - contains all harmonics, bright and sharp
float osc_saw(float t) {
    return fract(t) * 2.0 - 1.0;
}

// Square wave - odd harmonics only, hollow texture
float osc_sqr(float t) {
    return step(fract(t), 0.5) * 2.0 - 1.0;
}

// Triangle wave - fast harmonic decay, soft and warm
float osc_tri(float t) {
    return abs(fract(t) - 0.5) * 4.0 - 1.0;
}
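
Note that all four oscillators take their argument in cycles rather than seconds, so the caller multiplies frequency by time. A minimal usage sketch, assuming noteFreq from Step 2 (the mix weights are illustrative):

// t is in seconds; freq * t converts it to cycles for the oscillators
float osc_demo(float note, float t) {
    float freq = noteFreq(note);
    return 0.5 * osc_saw(freq * t)          // bright lead layer
         + 0.3 * osc_tri(0.5 * freq * t);   // sub-octave triangle for warmth
}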

Step 4: Additive Synthesis Instrument

What: Build a timbre by layering multiple harmonics (integer multiples of the fundamental), each with independent amplitude and decay rate.

Why: The timbre of real instruments is determined by their harmonic content (spectrum). Layering 3-8 harmonics with faster decay for higher harmonics can simulate piano, bell, and other timbres. This is the core technique for additive timbre synthesis.

// Additive synthesis instrument
// freq: fundamental frequency, t: time within note
float instrument_additive(float freq, float t) {
    float y = 0.0;

    // Layer harmonics: fundamental × 1, 2, 4 (the 2.01/4.01 ratios below add slight inharmonicity)
    // Decreasing amplitude + frequency-dependent decay (higher harmonics decay faster)
    y += 0.50 * sin(TAU * 1.00 * freq * t) * exp(-0.0015 * 1.0 * freq * t);
    y += 0.30 * sin(TAU * 2.01 * freq * t) * exp(-0.0015 * 2.0 * freq * t);
    y += 0.20 * sin(TAU * 4.01 * freq * t) * exp(-0.0015 * 4.0 * freq * t);

    // Nonlinear waveshaping to enrich harmonics
    y += 0.1 * y * y * y;                          // Adjustable: 0.0-0.35, higher = more distortion

    // Tremolo
    y *= 0.9 + 0.1 * cos(40.0 * t);                // Adjustable: 40.0 rad/s ≈ 6.4 Hz tremolo rate

    // Smooth attack to avoid clicks
    y *= smoothstep(0.0, 0.01, t);                  // Adjustable: 0.01 = attack time

    return y;
}

Step 5: FM Synthesis Instrument

What: Use one oscillator's (modulator) output as the phase offset of another oscillator (carrier) to produce rich harmonics.

Why: FM synthesis can generate extremely rich timbres with very few oscillators. Varying modulation depth over time can simulate the "bright→dark" decay characteristic of instruments. Electric pianos and sitar-like timbres are both based on this principle.

// FM synthesis electric piano
vec2 fm_epiano(float freq, float t) {
    // Stereo micro-detuning for chorus effect
    vec2 f0 = vec2(freq * 0.998, freq * 1.002);    // Adjustable: detune amount

    // "Glass" layer - high-frequency FM, fast decay → metallic attack quality
    vec2 glass = sin(TAU * (f0 + 3.0) * t
        + sin(TAU * 14.0 * f0 * t) * exp(-30.0 * t)  // Adjustable: 14.0=mod ratio, -30.0=mod decay
    ) * exp(-4.0 * t);                                 // Adjustable: -4.0 = glass layer decay
    glass = sin(glass);                                 // Soft saturation (sine waveshaping) to round the tone

    // "Body" layer - low-frequency FM, slow decay → sustained warm tone
    vec2 body = sin(TAU * f0 * t
        + sin(TAU * f0 * t) * exp(-0.5 * t) * pow(440.0 / f0.x, 0.5)  // Low-frequency compensation
    ) * exp(-t);                                        // Adjustable: -1.0 = body decay

    return (glass + body) * smoothstep(0.0, 0.001, t) * 0.1;
}

// FM synthesis generic instrument (struct-parameterized)
struct Instr {
    float att;      // Attack speed (higher = faster)
    float fo;       // Decay rate
    float vibe;     // Vibrato speed
    float vphas;    // Vibrato phase
    float phas;     // FM modulation depth
    float dtun;     // Detune amount
};

float fm_instrument(float freq, float t, float beatTime, Instr ins) {
    float f = freq - beatTime * ins.dtun;
    float phase = f * t * TAU;
    float vibrato = cos(beatTime * ins.vibe * 3.14159 / 8.0 + ins.vphas * 1.5708);
    float fm = sin(phase + vibrato * sin(phase * ins.phas));
    float env = exp(-beatTime * ins.fo) * (1.0 - exp(-beatTime * ins.att));
    return fm * env * max(1.0 - beatTime * 0.125, 0.0);  // Linear fade; clamped so long notes don't go negative
}
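
A hypothetical preset showing how the struct is wired up (the field values here are illustrative, not from the source, and would be tuned by ear):

// Hypothetical preset: att, fo, vibe, vphas, phas, dtun
float lead_voice(float time, float beatTime) {
    Instr lead = Instr(40.0, 2.0, 6.0, 0.0, 2.5, 0.1);
    return fm_instrument(noteFreq(64.0), time, beatTime, lead);
}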

Step 6: Percussion Synthesis

What: Synthesize kick drum, snare/clap, and hi-hat percussion instruments.

Why: Percussion is typically composed of pitch sweeps (kick) or noise pulses (hi-hat/clap) with fast envelopes. The kick's core is a sine sweep from high to low frequency; hi-hats are noise with exponential decay. Nearly all complete music shaders require these.

// Pseudo-random hash (replaces noise texture)
float hash(float p) {
    p = fract(p * 0.1031);
    p *= p + 33.33;
    p *= p + p;
    return fract(p);
}

// 909-style kick drum
float kick(float t) {
    float df = 512.0;                               // Adjustable: frequency sweep depth
    float dftime = 0.01;                             // Adjustable: sweep time constant
    float freq = 60.0;                               // Adjustable: base frequency

    // Exponential frequency sweep: rapidly slides from high to base frequency
    float phase = TAU * (freq * t - df * dftime * exp(-t / dftime));
    float body = sin(phase) * smoothstep(0.3, 0.0, t) * 1.5;

    // Transient noise click
    float click = sin(TAU * 8000.0 * fract(t)) * hash(t * 2000.0)
                * smoothstep(0.007, 0.0, t);

    return body + click;
}

// Hi-hat (open / closed)
float hihat(float t, float decay) {
    // decay: 5.0 = open hat (long decay), 15.0 = closed hat (short decay)
    float noise = hash(floor(t * 44100.0)) * 2.0 - 1.0;
    return noise * exp(-decay * t) * smoothstep(0.0, 0.02, t);
}

// Clap / snare
float clap(float t) {
    float noise = hash(floor(t * 44100.0)) * 2.0 - 1.0;
    return noise * smoothstep(0.1, 0.0, t);
}
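
A sketch combining the three into a one-bar pattern (four-on-the-floor kick, offbeat hats, backbeat clap); the mix levels are assumptions to tune by ear:

// One-bar drum pattern sketch (assumes SPB from Step 1)
float drums(float time) {
    float d = 0.0;
    d += kick(mod(time, SPB));                               // kick on every beat
    d += hihat(mod(time - SPB * 0.5, SPB), 15.0) * 0.3;      // closed hat on offbeats
    d += clap(mod(time - SPB, SPB * 2.0)) * 0.5;             // clap on beats 2 and 4
    return d;
}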

Step 7: Note Sequence Arrangement

What: Implement melody/chord temporal arrangement, determining which note should play at each moment.

Why: Music = timbre × timing. ShaderToy has three mainstream arrangement approaches: (A) D() macro accumulation for handwritten melodies, (B) array lookup for complex arrangements, (C) hash pseudo-random for algorithmic composition.

// === Approach A: D() Macro Accumulation ===
// Usage: D(duration, MIDI note number) arranged sequentially
// b = accumulated time, x = current note start time, n = current note
#define D(duration, note) b += float(duration); if(t > b) { x = b; n = float(note); }

float melody_macro(float time) {
    float t = time / 0.18;                          // Adjustable: 0.18 = seconds per unit duration
    float n = 0.0, b = 0.0, x = 0.0;

    D(10,71) D(2,76) D(3,79) D(1,78) D(2,76) D(4,83) D(2,81) D(6,78)
    // ... continue arranging notes ...

    float freq = noteFreq(n);
    float noteTime = 0.18 * (t - x);
    return instrument_additive(freq, noteTime);
}

// === Approach B: Array Lookup ===
const float NOTES[16] = float[16](
    60., 62., 64., 65., 67., 69., 71., 72.,         // Adjustable: note sequence
    60., 64., 67., 72., 65., 69., 64., 60.
);

float melody_array(float time, float bpm) {
    float beat = time * bpm / 60.0;
    int idx = int(mod(beat, 16.0));
    float noteTime = fract(beat);
    float freq = noteFreq(NOTES[idx]);
    return instrument_additive(freq, noteTime * 60.0 / bpm);
}

// === Approach C: Hash Pseudo-Random ===
float nse(float x) {
    return fract(sin(x * 110.082) * 19871.8972);
}

// Scale quantization: filter out dissonant notes
float scale_filter(float note) {
    float n2 = mod(note, 12.0);
    // Major scale: filter out semitones 1,3,6,8,10
    if (n2==1.||n2==3.||n2==6.||n2==8.||n2==10.) return -100.0;
    return note;
}

float melody_random(float time, float bpm) {
    float beat = time * bpm / 60.0;
    float seqn = nse(floor(beat));
    float note = 48.0 + floor(seqn * 24.0);         // Adjustable: 48.0=lowest note, 24.0=range
    note = scale_filter(note);
    float freq = noteFreq(note);
    float noteTime = fract(beat) * 60.0 / bpm;
    return instrument_additive(freq, noteTime);
}

Step 8: Chord Construction

What: Layer multiple notes according to chord relationships to form harmony.

Why: A chord is a combination of multiple pitches sounding simultaneously. The common structure is root + third + fifth (triad); adding the seventh and ninth yields jazz voicings, from which jazz progressions can be built.

// Chord construction
// NOTE: assumes an fm_epiano variant extended with a third parameter that
// scales each note's envelope decay; with the two-argument version from
// Step 5, drop the last argument from the calls below.
vec2 chord(float time, float root, float isMinor) {
    vec2 result = vec2(0.0);
    float bass = root - 24.0;                        // Root two octaves lower

    // Root (bass)
    result += fm_epiano(noteFreq(bass), time, 2.0);
    // Root
    result += fm_epiano(noteFreq(root), time - SPB * 0.5, 1.25);
    // Third (major third = 4 semitones, minor third = 3 semitones)
    result += fm_epiano(noteFreq(root + 4.0 - isMinor), time - SPB, 1.5);
    // Fifth
    result += fm_epiano(noteFreq(root + 7.0), time - SPB * 0.5, 1.25);
    // Seventh
    result += fm_epiano(noteFreq(root + 11.0 - isMinor), time - SPB, 1.5);
    // Ninth
    result += fm_epiano(noteFreq(root + 14.0), time - SPB, 1.5);

    return result;
}

Step 9: Delay and Reverb Effects

What: Simulate spatial echo and reverb effects by layering time-offset copies of the audio signal.

Why: Dry audio sounds "flat". Multi-tap delay creates spatial depth by layering signal copies at different delays and decay amounts. Ping-pong delay bounces alternately between left and right channels, enhancing stereo width.

// Multi-tap echo/reverb
// NOTE: in GLSL ES 3.00, "sample" is a reserved word — use "samp" instead
vec2 echo_reverb(float time) {
    vec2 tot = vec2(0.0);
    float hh = 1.0;
    for (int i = 0; i < 6; i++) {                   // Adjustable: 6 = echo count
        float h = float(i) / 5.0;
        float delayedTime = time - 0.7 * h;         // Adjustable: 0.7 = echo interval

        // Call your instrument function to get audio at that time point
        float samp = get_instrument_sample(delayedTime);

        // Stereo spread: each echo has different L/R ratio
        tot += samp * vec2(0.5 + 0.1 * h, 0.5 - 0.1 * h) * hh;
        hh *= 0.5;                                   // Adjustable: 0.5 = decay per echo
    }
    return tot;
}

// Ping-pong stereo delay
vec2 pingpong_delay(float time) {
    vec2 mx = get_stereo_sample(time) * 0.5;
    float ec = 0.4;                                  // Adjustable: initial echo volume
    float fb = 0.6;                                  // Adjustable: feedback decay coefficient
    float delay_time = 0.222;                        // Adjustable: delay time (seconds)
    float et = delay_time;

    // 4 alternating left/right ping-pong taps
    mx += get_stereo_sample(time - et) * ec * vec2(1.0, 0.5); ec *= fb; et += delay_time;
    mx += get_stereo_sample(time - et) * ec * vec2(0.5, 1.0); ec *= fb; et += delay_time;
    mx += get_stereo_sample(time - et) * ec * vec2(1.0, 0.5); ec *= fb; et += delay_time;
    mx += get_stereo_sample(time - et) * ec * vec2(0.5, 1.0); ec *= fb; et += delay_time;

    return mx;
}
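
The two helpers above are placeholders for whatever signal chain feeds the effect. A minimal hypothetical wiring, assuming melody_array from Step 7 and guarding the negative times that early delay taps produce:

// Hypothetical placeholder wiring for the delay examples
float get_instrument_sample(float t) {
    return t < 0.0 ? 0.0 : melody_array(t, BPM);    // silence before t = 0
}

vec2 get_stereo_sample(float t) {
    return vec2(get_instrument_sample(t));           // mono source, duplicated
}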

Step 10: Beat and Arrangement Structure

What: Define a time grid using BPM, arrange different instruments at different beat positions, and control the overall song structure (intro, verse, interlude, etc.).

Why: The rhythmic skeleton of music is built on a uniform beat grid. Using floor(time * BPM / 60) gets the current beat number, and fract() gets the position within the beat. smoothstep gating controls instrument entry and exit at specific sections.

vec2 mainSound(int samp, float time) {
    vec2 audio = vec2(0.0);

    float beat = time * BPM / 60.0;                  // Current beat count
    float bar = beat / 4.0;                           // Current bar (4/4 time)
    float beatInBar = mod(beat, 4.0);                 // Beat position within bar

    // --- Rhythm layer ---
    // Kick: trigger every beat
    float kickTime = mod(time, SPB);
    audio += vec2(kick(kickTime) * 0.5);

    // Hi-hat: trigger every half beat
    float hatTime = mod(time, SPB * 0.5);
    audio += vec2(hihat(hatTime, 15.0) * 0.15);

    // --- Melody layer ---
    audio += vec2(melody_array(time, BPM)) * 0.3;

    // --- Arrangement automation ---
    // Use smoothstep to control instrument entry/exit
    float introFade = smoothstep(0.0, 4.0, bar);     // Fade in over first 4 bars
    float dropGate = smoothstep(16.0, 16.1, bar);    // Drop at bar 16 (gate a layer with it, e.g. bass *= dropGate)

    audio *= introFade;

    // Master volume + anti-click
    audio *= 0.35 * smoothstep(0.0, 0.5, time);
    return clamp(audio, -1.0, 1.0);
}

Variant Details

Variant 1: Subtractive Synthesis / TB-303 Acid Synthesizer

Difference from basic version: Instead of building timbre by layering harmonics, generates a harmonic-rich waveform (sawtooth) and then sculpts it with a resonant low-pass filter to remove high frequencies. The filter cutoff frequency is modulated by an envelope, producing the classic "wah" sound.

Key modified code:

#define NSPC 128                                    // Adjustable: synthesis harmonic count (higher = better quality)

// Resonant low-pass frequency response
float lpf_response(float h, float cutoff, float reso) {
    cutoff -= 20.0;
    float df = max(h - cutoff, 0.0);
    float df2 = abs(h - cutoff);
    return exp(-0.005 * df * df) * 0.5              // Adjustable: -0.005 = rolloff slope
         + exp(df2 * df2 * -0.1) * reso;            // Adjustable: resonance peak
}

// TB-303 acid synthesizer
vec2 acid_synth(float freq, float noteTime) {
    vec2 v = vec2(0.0);
    // Envelope-driven filter cutoff frequency
    float cutoff = exp(noteTime * -1.5) * 50.0      // Adjustable: -1.5=envelope speed, 50.0=sweep range
                 + 10.0;                             // Adjustable: minimum cutoff
    float sqr = step(0.5, fract(noteTime * 4.5));   // Sawtooth/square switching

    for (int i = 0; i < NSPC; i++) {
        float h = float(i + 1);
        float inten = 1.0 / h;                      // Sawtooth spectrum
        inten = mix(inten, inten * mod(h, 2.0), sqr); // Square wave variant
        inten *= lpf_response(h, cutoff, 2.2);
        v.x += inten * sin((TAU + 0.01) * noteTime * freq * h);  // +0.01 slightly detunes the left channel for width
        v.y += inten * sin(TAU * noteTime * freq * h);
    }
    float amp = smoothstep(0.05, 0.0, abs(noteTime - 0.31) - 0.26)
              * exp(noteTime * -1.0);
    return clamp(v * amp * 2.0, -1.0, 1.0);
}
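
A minimal sketch of driving it as a bassline, retriggering once per beat (the root note is an assumption):

// Retrigger the acid synth every beat on a fixed root
vec2 acid_line(float time) {
    float noteTime = mod(time, SPB);               // time within the current beat
    return acid_synth(noteFreq(33.0), noteTime);   // 33 = A1 (assumed root)
}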

Variant 2: IIR Biquad Filter

Difference from basic version: Uses a time-domain IIR filter based on the Audio EQ Cookbook instead of frequency-domain methods. It supports the cookbook's full filter family (low-pass, high-pass, band-pass, notch, all-pass, peaking, and shelving) and behaves closer to real hardware, but requires maintaining past sample state.

Key modified code:

// Sawtooth oscillator (sample-domain, anti-aliasing friendly)
float waveSaw(float freq, int samp) {
    return fract(freq * float(samp) / iSampleRate) * 2.0 - 1.0;
}

// Stereo widening
vec2 widerSaw(float freq, int samp) {
    int offset = int(freq) * 64;                    // Adjustable: 64 = width factor
    return vec2(waveSaw(freq, samp - offset), waveSaw(freq, samp + offset));
}

// Biquad low-pass filter coefficient calculation
void biquadLPF(float freq, float Q, float sr,
    out float b0, out float b1, out float b2,
    out float a0, out float a1, out float a2) {
    float omega = TAU * freq / sr;
    float sn = sin(omega), cs = cos(omega);
    float alpha = sn / (2.0 * Q);                   // Adjustable: Q = resonance (0.5-20)
    b0 = (1.0 - cs) * 0.5;
    b1 = 1.0 - cs;
    b2 = (1.0 - cs) * 0.5;
    a0 = 1.0 + alpha;
    a1 = -2.0 * cs;
    a2 = 1.0 - alpha;
}
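
The coefficients define the recurrence y[n] = (b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]) / a0. Since a shader cannot keep state between samples, one common workaround is to re-run the recurrence over the last PAST_SAMPLES inputs starting from silence. A hedged sketch of that approach (the helper name biquadApply and the 110 Hz source are assumptions):

#define PAST_SAMPLES 128                            // more = truer IIR tail, slower

// Stateless biquad: warm-start the filter from silence for each output sample
vec2 biquadApply(float cutoff, float Q, int samp) {
    float b0, b1, b2, a0, a1, a2;
    biquadLPF(cutoff, Q, iSampleRate, b0, b1, b2, a0, a1, a2);
    vec2 x1 = vec2(0.0), x2 = vec2(0.0);                 // input history
    vec2 y = vec2(0.0), y1 = vec2(0.0), y2 = vec2(0.0);  // output history
    for (int i = PAST_SAMPLES - 1; i >= 0; i--) {
        vec2 x = widerSaw(110.0, samp - i);              // dry input at that sample
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0;
        x2 = x1; x1 = x;
        y2 = y1; y1 = y;
    }
    return y;                                            // filtered sample at 'samp'
}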

Variant 3: Vocal / Formant Synthesis

Difference from basic version: Uses a sinusoidal tract model to simulate the human voice. By setting formants at different frequencies with their bandwidths, vowels can be synthesized. Consonants are implemented through fricative noise.

Key modified code:

// Vocal tract formant model
float tract(float x, float formantFreq, float bandwidth) {
    return sin(TAU * formantFreq * x)
         * exp(-bandwidth * 3.14159 * x);
}

// "Ah" vowel synthesis
float vowel_aah(float t, float pitch) {
    float period = 1.0 / pitch;
    float x = mod(t, period);
    // Formant frequencies and bandwidths (Hz) — adjustable to simulate different vowels
    float aud = tract(x, 710.0, 70.0) * 0.5         // F1: 710Hz ('a' vowel)
              + tract(x, 1000.0, 90.0) * 0.6         // F2: 1000Hz
              + tract(x, 2450.0, 140.0) * 0.4;       // F3: 2450Hz
    return aud;
}

// Fricative consonant noise
float fricative(float t, float formantFreq) {
    return (hash(floor(formantFreq * t) * 20.0) - 0.5) * 3.0;   // hash() from Step 6 (any 1D hash works)
}
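
A hedged usage sketch, singing a sustained "ah" at A3 with a touch of vibrato (the 5 Hz rate and 1% depth are assumptions):

// Sustained "ah" vowel with ~5 Hz pitch vibrato
float voice_demo(float t) {
    float pitch = noteFreq(57.0) * (1.0 + 0.01 * sin(TAU * 5.0 * t));
    return vowel_aah(t, pitch) * 0.3;
}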

Variant 4: Algorithmic Composition (Generative Music)

Difference from basic version: Does not use handwritten note sequences; instead uses hash functions to generate pseudo-random melodies, with scale quantization to ensure harmonic consistency. Multi-level rhythmic subdivision (1-beat/2-beat/4-beat) produces fractal-like musical structure.

Key modified code:

// 8-note pseudo-random loop
vec2 noteRing(float n) {
    float r = 0.5 + 0.5 * fract(sin(mod(floor(n), 32.123) * 32.123) * 41.123);
    n = mod(n, 8.0);
    // Adjustable: modify these intervals to change the melodic character
    float note = n<1.?0. : n<2.?5. : n<3.?-2. : n<4.?4. : n<5.?7. : n<6.?4. : n<7.?2. : 0.;
    return vec2(note, r);                            // (interval, volume)
}

// FBM-style layered note generation
vec2 generativeNote(float beat) {
    float b0 = floor(beat);            // 1-beat grid (small scale)
    float b1 = floor(beat * 0.5);      // 2-beat grid (medium scale)
    float b2 = floor(beat * 0.25);     // 4-beat grid (large scale)
    // Large-scale + medium-scale + small-scale layering
    vec2 note = noteRing(b2)
              + noteRing(b1)
              + noteRing(b0);
    return note;
}
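
A hypothetical wiring that turns the (interval, volume) pair into audio; the A3 base note and the 1/3 volume normalization (three layers, each contributing 0.5-1.0) are assumptions:

// Render the generated note with the Step 4 instrument
float generative_voice(float time, float bpm) {
    float beat = time * bpm / 60.0;
    vec2 nv = generativeNote(beat);                   // (summed interval, summed volume)
    float freq = noteFreq(57.0 + nv.x);               // 57 = A3 base (assumed)
    float noteTime = fract(beat) * 60.0 / bpm;
    return instrument_additive(freq, noteTime) * (nv.y / 3.0);
}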

Variant 5: Chord Progression System (Circle of Fifths)

Difference from basic version: Automatically generates harmonic progressions based on the circle of fifths interval. Every 4 beats advances one fifth (+7 semitones), automatically alternating major/minor chords with jazz chord extensions (seventh, ninth).

Key modified code:

vec2 mainSound(int samp, float time) {
    float id = floor(time / SPB / 4.0);             // Current chord number
    float offset = id * 7.0;                         // Circle of fifths: +7 semitones per step
    float minor = mod(id, 4.0) >= 3.0 ? 1.0 : 0.0; // Every 4th chord is minor
    float t = mod(time, SPB * 4.0);

    float root = 57.0 + mod(offset, 12.0);           // Adjustable: 57.0 = starting root (A3)
    vec2 result = chord(t, root, minor);

    // Two-tap ping-pong delay
    result += vec2(0.5, 0.2) * chord(t - SPB * 0.5, root, minor);
    result += vec2(0.05, 0.1) * chord(t - SPB, root, minor);

    return result;
}

Performance Optimization Details

  1. Reduce Harmonic Count: In additive synthesis and frequency-domain filters, the harmonic count (NUM_HARMONICS / NSPC) is the biggest performance bottleneck. Start with 4-8 harmonics and don't add more once the sound is satisfactory. Using 256 harmonics is an extreme case.

  2. Avoid Sample History in Loops: IIR filters need to process 128 historical samples, meaning each output sample requires 128 loop iterations. Prefer frequency-domain methods or reduce PAST_SAMPLES.

  3. Simplify Echo/Delay: Each delay tap requires recomputing the complete signal chain. 4 taps means 5x computation. Consider reducing the complexity (fewer harmonics) for delayed signals.

  4. Use fract() Instead of mod(): When the divisor is 1.0, fract(x) is faster than mod(x, 1.0).

  5. Precompute Constants: Move loop-invariant expressions like TAU * freq outside the loop (see the sketch after this list).

  6. Use the Common Pass: Place constant definitions and shared functions in ShaderToy's Common tab, accessible by both Sound and Image, avoiding redundant computation of BPM/SPB, etc.
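
A small sketch illustrating items 1, 4, and 5 together (the voicing itself is illustrative):

// 4 harmonics (item 1), fract() for phase wrap (item 4), hoisted invariant (item 5)
float cheapSaw(float freq, float t) {
    return fract(freq * t) * 2.0 - 1.0;     // fract(x), not mod(x, 1.0)
}

float cheapAdditive(float freq, float t) {
    float w = TAU * freq * t;               // loop-invariant, computed once
    float y = 0.0;
    for (int h = 1; h <= 4; h++) {
        y += sin(w * float(h)) / float(h);  // 1/h sawtooth-like spectrum
    }
    return 0.4 * y;
}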

Combination Suggestions

1. Combining with Audio Visualization

ShaderToy exposes audio inputs (SoundCloud tracks, the microphone) to the Image shader as a 512×2 texture on an iChannel: the bottom row holds the frequency spectrum, the top row the waveform. Sample it to drive visual effects (waveforms, spectrum bar charts, etc.); to visualize a Sound-tab shader's own output, re-evaluate shared envelope/beat functions from the Common tab instead.
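
A minimal Image-pass sketch under that assumption (audio input bound to iChannel0):

void mainImage(out vec4 fragColor, in vec2 fragCoord) {
    vec2 uv = fragCoord / iResolution.xy;
    float fft  = texture(iChannel0, vec2(uv.x, 0.25)).x;   // spectrum row
    float wave = texture(iChannel0, vec2(uv.x, 0.75)).x;   // waveform row
    float bars = step(uv.y, fft);                          // spectrum bar chart
    fragColor = vec4(vec3(bars), 1.0);
}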

2. Combining with Raymarching Scenes

Define shared timeline/cue events in the Common tab and reference them from both the Sound and Image shaders, so visuals and audio stay locked to the same clock.

3. Combining with Particle Systems

Use beat events (kick trigger moments) to drive particle emission. In the Image shader, use the same BPM/SPB to calculate the current beat position, and increase particle count or velocity at the kick trigger moment.
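
A sketch of the matching Image-pass trigger, assuming BPM/SPB live in the Common tab (the decay rate 8.0 is an assumption):

// Decaying envelope that spikes at every kick (Image pass)
float kickEnv(float time) {
    float t = mod(time, SPB);            // time since the last kick
    return exp(-8.0 * t);
}
// e.g. emissionRate *= 1.0 + 3.0 * kickEnv(iTime);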

4. Combining with Post-Processing Effects

Share Sound shader envelope values (e.g., sidechain compression coefficient) with the Image shader via the Common Pass, driving bloom intensity, color shifting, screen shake, and other effects.

5. Combining with Text/Graphic Overlays

Use a text-rendering helper in the Image shader (community shaders often call it message()) to display hints, parameter values, or interaction instructions that help viewers understand what is being played.