Been looking at open source Automatic Speech Recognition (ASR) engines. What I want to do is integrate with an ASR engine so I can run speech audio through it and generate timing information on when words (or more specifically, phonemes) were spoken. I can then take this information and create a first-pass lip-sync animation in ManuelBastioniLab.

I looked at a couple of ASR engines, one called "DeepSpeech" and the other called "pocketsphinx". The former is written in C++ and the latter in C. Man, pocketsphinx is a whole lot easier to understand, and it has the functionality I want. DeepSpeech doesn't produce timing information. Well, it does, but it's not exposed in the API, which means that if I wanted to use it, I'd have to take the initiative and expose that information in the API myself. I actually thought about it, but it's a lot more work than I'm willing to undergo at the moment.
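To give an idea of what "has the functionality I want" means, here is a minimal sketch of pulling word timings out of pocketsphinx through its Python bindings. The model file names, the audio file, and the exact config style are assumptions (the binding API has shifted a bit between pocketsphinx versions), so treat it as a rough outline rather than working addon code:

```python
import os
from pocketsphinx import Decoder, get_model_path

# Point the decoder at the default US English acoustic model, language
# model and dictionary. Paths/names are assumptions based on the models
# that ship with the Python package.
model_path = get_model_path()
config = Decoder.default_config()
config.set_string('-hmm', os.path.join(model_path, 'en-us'))
config.set_string('-lm', os.path.join(model_path, 'en-us.lm.bin'))
config.set_string('-dict', os.path.join(model_path, 'cmudict-en-us.dict'))
decoder = Decoder(config)

# Feed raw 16 kHz, 16-bit mono PCM audio into the decoder.
decoder.start_utt()
with open('speech.raw', 'rb') as f:
    while True:
        buf = f.read(4096)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

# Each segment carries start/end frame indices; at the default frame
# rate of 100 frames per second, frame / 100.0 gives seconds.
for seg in decoder.seg():
    print(seg.word, seg.start_frame / 100.0, seg.end_frame / 100.0)
```

That gives word-level timing; pocketsphinx can also be run in its "allphone" mode against a phoneme language model to get phoneme-level segments instead, which is closer to what lip-sync actually needs.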

pocketsphinx is used in rhubarb-lip-sync. However, that application runs only on Windows and Macs (as far as I can see), and I'm not a Windows user. I have seen the light and abandoned Windows. In other words, I need to have something working on Linux, and I'm cool with only supporting Linux.

rhubarb-lip-sync is also a generic application designed to work with multiple different applications, like After Effects. Anyway, the end result for me is to develop an application which works very similarly to rhubarb-lip-sync, but is directly integrated into Blender. A rough sketch of what that integration could look like follows below.
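For the Blender side, the idea would be to take the (phoneme, start, end) segments from the recognizer and turn them into shape-key keyframes on the character mesh. The phoneme-to-shape-key mapping and the shape-key names below are placeholders, not the actual ManuelBastioniLab names, so this is just a sketch of the approach:

```python
import bpy

# Hypothetical mapping from phonemes to mouth shape keys on the mesh.
# A real character rig will have its own set of viseme shapes.
PHONEME_TO_SHAPE_KEY = {
    'AA': 'mouth_open',
    'M':  'mouth_closed',
    'F':  'mouth_f',
}

def keyframe_visemes(obj, segments, fps=24):
    """segments: iterable of (phoneme, start_seconds, end_seconds)."""
    key_blocks = obj.data.shape_keys.key_blocks
    for phoneme, start, end in segments:
        name = PHONEME_TO_SHAPE_KEY.get(phoneme)
        if name is None or name not in key_blocks:
            continue
        kb = key_blocks[name]
        start_frame = int(start * fps)
        end_frame = max(start_frame + 1, int(end * fps))
        # Ramp the shape key up at the start of the phoneme
        # and back down at the end.
        kb.value = 0.0
        kb.keyframe_insert(data_path='value', frame=start_frame - 2)
        kb.value = 1.0
        kb.keyframe_insert(data_path='value', frame=start_frame)
        kb.value = 0.0
        kb.keyframe_insert(data_path='value', frame=end_frame)

# Example usage inside Blender, with made-up segment data:
# keyframe_visemes(bpy.data.objects['MyCharacterMesh'],
#                  [('M', 0.10, 0.18), ('AA', 0.18, 0.35)])
```

A real addon would obviously need smarter co-articulation (blending neighbouring visemes instead of hard on/off ramps), but the data flow is the same: timings in, keyframes out.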

We'll see how that goes.

P.S. If you're interested in learning more about pocketsphinx, here is their wiki.