A classmate of mine built a homebrew lipsync system for his senior thesis. IIRC, he transformed the input waveform into Mel-frequency cepstral coefficients, or MFCCs (google it), and then fed those into some sort of neural network that recognized visemes (the visual counterparts of phonemes; there are fewer visemes than phonemes, at least if you ignore tongue positions).
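To make that concrete, here's a rough sketch of that kind of pipeline, assuming librosa and scikit-learn are available. The viseme set, the training data, and the file name "speech.wav" are all placeholders; a real system would train on frames labeled via force-aligned phoneme transcripts mapped down to visemes.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

VISEMES = ["rest", "A", "E", "O", "F/V", "M/B/P"]  # toy viseme set

def mfcc_frames(wav_path, n_mfcc=13):
    """Load audio and return one MFCC vector per analysis frame."""
    y, sr = librosa.load(wav_path, sr=16000)
    # librosa returns shape (n_mfcc, n_frames); transpose to (n_frames, n_mfcc)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160)  # 10 ms hop
    return mfcc.T

# Fake labeled data, just to show the shape of the problem.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 13))           # stand-in MFCC vectors
y_train = rng.integers(0, len(VISEMES), 1000)   # stand-in viseme labels

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
clf.fit(X_train, y_train)

frames = mfcc_frames("speech.wav")
pred = clf.predict(frames)                      # one viseme id per frame
print([VISEMES[i] for i in pred[:20]])
```

The per-frame viseme predictions would then drive whatever mouth shapes your animation rig exposes.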
Anyway, as roel says, it's not a trivial subject, but it is also a very well-researched one, and thankfully a lot easier than full speech recognition. I'm sure you can find plenty of papers online.
Finally, FWIW, I believe the simple amplitude-based approach is what Half-Life 1 used.
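That one is almost trivially cheap by comparison: per-frame loudness mapped straight to mouth openness, no phonetics at all. A minimal sketch of my understanding of it, again assuming librosa and a placeholder "speech.wav"; the smoothing constant and normalization are guesses you'd tune by eye:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
# Per-frame RMS energy, ~10 ms hop.
rms = librosa.feature.rms(y=y, frame_length=512, hop_length=160)[0]

# Normalize to 0..1, then low-pass so the jaw doesn't flutter every frame.
openness = rms / (rms.max() + 1e-9)
smoothed = np.empty_like(openness)
acc = 0.0
for i, v in enumerate(openness):
    acc += 0.3 * (v - acc)   # simple one-pole smoothing filter
    smoothed[i] = acc
# smoothed[i] now drives the mouth-open blend shape for frame i.
```

It looks nowhere near as good as viseme-based lipsync, but it's a few lines of code and was apparently good enough for its day.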