THE PHASE VOCODER
I dedicate this to Einstein's mum. I've never written any
educational material before, so forgive me if it's a
little poorly explained in places.
A phase vocoder rips the sound apart into its individual
harmonics, which are sine waves, the tiniest element
that the ear can perceive as a separate frequency.
Then it restores the sound as it was, so it can be played like
any other sound after it's been converted back
to the time domain, which is the usual domain for sound:
amplitude over time.
You can tell frequencies maybe 2 Hz apart, no less than this, and
that's only near the bottom of the spectrum.
This means the "samples" you take from the audio signal
to rip apart into individual harmonics are spaced apart
similarly to how your ear can perceive differences in
frequency. If you take enough samples, the spaces between
don't matter to your ear. If you don't take enough samples,
you will hear an empty sound that sounds like it's had noise
reduction added to it. It sounds "jingly".
That means if there is someone out there with very sensitive
ears, the phase vocoder will sound more jingly to him than
it does to you, hypothetically.
The purposes of a phase vocoder are varied. Once the sound has
been ripped apart into sine waves, and before you go back to
the time domain, there are a few things you could do to the
sound. It's highly experimental, though, and there's not much out
there for the phase vocoder yet, because only recently have
computers had enough power to do this in real time.
Possible purposes for a phase vocoder:
* Pitch transposition without altering time scale.
* Time scaling without altering pitch.
* Noise reduction
* Harmonic chorusing
* Speech recognition (it's where the word vocoder comes from,
The quality you get out of the phase vocoder ranges from
completely indistinguishable from the original, without fault,
to poppy and crackly if you stuff it up. It can also sound
wooshy and blurry if it's not set up right.
On Wikipedia today, they say the phase vocoder blurs the
sound and makes it sound crap. This isn't true for the model
I'm explaining, which doesn't suffer from this problem.
It sounds a little ringy when you pitch up, but
the transients are preserved perfectly.
But the woosh you can get out of a poor phase vocoder can actually
be used in techno as an effect.
I'm quite sure that on an expensive piece of equipment the pitch
scaler can be perfect; it is possible.
This method uses the Fourier transform to get the phase and
amplitude of each sine wave, then uses sine wave oscillators to
restore the sound.
The Fourier transform takes one segment of the sound at a time (say 512
samples) and produces a batch of amplitudes and phases, one pair for each
sine wave it found in that segment. These are pumped into the sine wave
oscillators on the output of the phase vocoder. Each oscillator must have
correct readings at each sound segment or you won't get the exact same
output as input.
HOW THE FOURIER TRANSFORM WORKS
The transform works by taking a window of the sound, and for each
sine wave that fits inside it you can get a phase and amplitude
reading. Only whole cycles can be pulled out of it
successfully. If the window size is 512, only cycle sizes 512 (512/1),
256 (512/2), about 170.7 (512/3), 128 (512/4), etc. will fit inside the
window as whole cycles, so these are the only sine waves you can
get out of it, but funnily enough these are the only sine waves you
Take a sine wave of the cycle size you want the amplitude
and phase for, and multiply it with the signal, once starting at 0 degree
phase and once at 90 degree phase (a sine and a cosine wave), at normalized
You may be thinking something funny now, that it works using ring
modulation! And it does!
Then sum up all the products across the segment, keeping one sum for
the sine wave and one for the cosine wave.
Then treat the sine sum as an x component and the cosine sum as a y
component. The length of the 2D vector you make is the amplitude
of the sine wave, and the angle of the vector is its phase, going
from 0 to 2*PI.
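The multiply, sum and vector steps above can be sketched in a few lines
of Python. This is only a sketch: the function name analyse_harmonic and
the 2/N scaling (chosen so a full-strength sine wave reads as amplitude
1.0) are my own assumptions, not something the text fixes.

```python
import math

def analyse_harmonic(segment, cycles):
    """Amplitude and phase of the sine wave that fits `cycles` whole
    cycles into the segment (hypothetical helper, for illustration)."""
    n = len(segment)
    x = 0.0  # running sum of signal * sine   (0 degree reference)
    y = 0.0  # running sum of signal * cosine (90 degree reference)
    for i, sample in enumerate(segment):
        angle = 2.0 * math.pi * cycles * i / n
        x += sample * math.sin(angle)
        y += sample * math.cos(angle)
    # Assumed normalisation: a sine of amplitude 1.0 reads back as 1.0.
    x *= 2.0 / n
    y *= 2.0 / n
    amplitude = math.hypot(x, y)                 # length of the (x, y) vector
    phase = math.atan2(y, x) % (2.0 * math.pi)   # angle, wrapped to 0..2*PI
    return amplitude, phase

# A test tone: 4 whole cycles in a 512 sample window, amplitude 0.5, phase 1.0.
tone = [0.5 * math.sin(2.0 * math.pi * 4 * i / 512 + 1.0) for i in range(512)]
amplitude, phase = analyse_harmonic(tone, 4)
```

Running it over every whole cycle count from 1 up to half the window size
gives the full batch of readings for one segment.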
TWEAKING THE TRANSFORM
Now you know it's that easy to get fragments of sound, but you still have
to use the transform right for it to work properly.
You must overlap the time segments you look at. This is because
the transform only looks at a window of the sound, with pops
either side of it. If you don't overlap, that is, if your looking
segment (say 1024 samples) is not longer than your segment interval
(say 512 samples), then you will get pops and crackle in your output
because the phase readings will be wrong.
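Here is one way the overlap could look in Python, using the 1024 sample
looking segment and 512 sample interval from the text. The function names
are made up for this sketch, and it simply drops any leftover samples at
the end that don't fill a whole segment, which is my own simplification.

```python
def segment_starts(num_samples, window=1024, interval=512):
    """Start positions of the overlapping looking segments."""
    starts = []
    pos = 0
    while pos + window <= num_samples:
        starts.append(pos)
        pos += interval  # step by less than the window, so segments overlap
    return starts

def overlapping_segments(signal, window=1024, interval=512):
    """Cut the signal into overlapping segments ready for the transform."""
    return [signal[s:s + window]
            for s in segment_starts(len(signal), window, interval)]

segments = overlapping_segments(list(range(2048)))
```

With these sizes each segment shares its second half with the first half
of the next one.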
If you're using the FFT (I only explained the ordinary transform), you
could find all the amplitudes and phases in real time. If you're using
the transform I explained, you precompute all the sine wave amplitudes
and phases into a harmonic wave file; then you can avoid this step when
playing back the file, and only render the oscillators, it's all you have
OUTPUTTING BACK TO THE TIME DOMAIN
Make a sine wave oscillator (optimize with a lookup table) for each
harmonic. Every segment, feed each oscillator its amplitude reading and
phase reading for that segment, and you should
have the exact sound playing out of the sine wave oscillators!
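A bare-bones oscillator bank for one segment might look like the sketch
below. It calls math.sin directly instead of using a lookup table, to keep
it short, and the (cycles, amplitude, phase) tuple layout is just an
assumption made for this example.

```python
import math

def render_segment(readings, length):
    """Sum one sine oscillator per harmonic over a single segment.

    readings: list of (cycles_per_segment, amplitude, phase) tuples,
    a hypothetical layout chosen for this sketch.
    """
    out = [0.0] * length
    for cycles, amplitude, phase in readings:
        step = 2.0 * math.pi * cycles / length  # phase advance per sample
        for i in range(length):
            out[i] += amplitude * math.sin(phase + step * i)
    return out

# One oscillator: 4 cycles per 512 samples, amplitude 0.5, starting at 0 phase.
one = render_segment([(4, 0.5, 0.0)], 512)
# Two oscillators mixed together.
two = render_segment([(4, 0.5, 0.0), (8, 0.25, math.pi / 2.0)], 512)
```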
The only problem is that, from small errors in the transform, you will hear
slight pops and crackle in the sound.
To remedy this (and so the pitch transposition works), slide the amplitude
from the oscillator's last reading to its new reading, and give it a slight
tremor in frequency so the phase matches up from the old phase to the new phase.
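The slide and tremor described above can be sketched for a single
oscillator like this. The straight-line amplitude slide and the idea of
spreading the phase error evenly over the interval are my reading of the
text, and the names wrap and render_with_phase_fix are invented for the
example.

```python
import math

def wrap(angle):
    """Wrap an angle into the range -PI .. PI."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def render_with_phase_fix(old, new, step, length):
    """Render one interval of one oscillator.

    old, new: (amplitude, phase) readings at the start and end.
    step: the oscillator's normal phase advance per sample, in radians.
    Returns the samples and the slightly adjusted step that was used.
    """
    old_amp, old_phase = old
    new_amp, new_phase = new
    # How far off the phase would land if the frequency was left alone.
    error = wrap(new_phase - (old_phase + step * length))
    # The "tremor": nudge the frequency so the phase lands on new_phase.
    fixed_step = step + error / length
    out = []
    for i in range(length):
        amp = old_amp + (new_amp - old_amp) * i / length  # amplitude slide
        out.append(amp * math.sin(old_phase + fixed_step * i))
    return out, fixed_step

step = 2.0 * math.pi * 4 / 512
samples, fixed_step = render_with_phase_fix((0.5, 0.0), (1.0, 0.3), step, 512)
```

Because the correction is wrapped, the frequency never moves by more than
half a cycle per interval, which keeps the tremor slight.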
And you've basically got it now.
PITCH TRANSPOSITION AND TIME SCALING
Only pitch and time scaling are explained in this document; the rest of
the uses for this device are up to you if you want to experiment with it.
Changing the pitch is simple. Just alter the oscillators' frequencies
up or down and you will get instantaneous pitch scaling on output, keeping
the time scale the same. The "phase fix" will keep the phase from
popping while the pitch changes.
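As a rough sketch of the idea, here is one oscillator rendered with its
frequency multiplied by a pitch ratio while the number of output samples
stays the same. To keep it short, the oscillator just carries its running
phase from one interval to the next instead of doing a full phase fix
toward measured phases; that shortcut, and all the names, are assumptions
for this example only.

```python
import math

def render_transposed(amplitudes, step, interval, ratio):
    """amplitudes: one reading per segment boundary for one oscillator.
    step: original phase advance per sample (radians).
    ratio: pitch ratio, e.g. 2.0 for an octave up."""
    out = []
    phase = 0.0
    shifted_step = step * ratio  # only the frequency changes
    for a0, a1 in zip(amplitudes, amplitudes[1:]):
        for i in range(interval):
            amp = a0 + (a1 - a0) * i / interval  # amplitude slide
            out.append(amp * math.sin(phase))
            phase += shifted_step  # phase carries on, so no pop at the join
    return out

def count_upward_crossings(samples):
    """Count negative-to-positive zero crossings, roughly one per cycle."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0.0 <= b)

readings = [1.0, 1.0, 1.0, 1.0, 1.0]
step = 2.0 * math.pi / 64
plain = render_transposed(readings, step, 512, 1.0)
octave_up = render_transposed(readings, step, 512, 2.0)
```

Both outputs are the same length, but the second one packs in about twice
as many cycles.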
But note, you must have an interval size per oscillator large enough
for the phase fix to have enough room to pop the phase back and alter
the pitch properly, or the frequencies will stick and not change, and
it'll sound more like a phaser than vibrato.
You can use a smaller segment size for your higher frequencies than for
your lower frequencies. This will improve the transients, so you should do it.
Changing the time scale can be implemented by simply doubling up the
algorithm with a resample: pitch up and resample down and you will get
a slower sound, pitch down then resample up and you will get a quicker