Well, I finally have a program that takes a speech clip and breaks it down into its phonemes, with a start time and duration for each phoneme. The JSON output looks like this:

                }, {
                        "word": "letter",
                        "start":        945,
                        "duration":     36,
                        "phonemes":     [{
                                        "phoneme":      "L",
                                        "start":        945,
                                        "duration":     6
                                }, {
                                        "phoneme":      "EH",
                                        "start":        951,
                                        "duration":     7
                                }, {
                                        "phoneme":      "T",
                                        "start":        958,
                                        "duration":     8
                                }, {
                                        "phoneme":      "ER",
                                        "start":        966,
                                        "duration":     16
                                }]
                }]

The next step is to take this information and convert it into animation data for a ManuelBastioniLab character. The code is here. It still needs work before it compiles and builds easily. (Linux only... sorry.)
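As a rough idea of what that conversion could start from, here is a minimal Python sketch that reads JSON shaped like the excerpt above and turns each phoneme into a start and end animation frame. Several things here are my assumptions, not something the tool guarantees: that the top level of the file is a list of word objects, that `start` and `duration` are in units of 1/100 second (`units_per_second`), and that the animation runs at 24 fps. The function name `phoneme_keyframes` is made up for illustration. Mapping phonemes to actual mouth shapes on the character is not covered.

```python
import json

# One word copied from the output above, wrapped in a list so it parses on its own.
# Assumption: the real file is a list of such word objects.
SAMPLE = """
[{
    "word": "letter",
    "start": 945,
    "duration": 36,
    "phonemes": [
        {"phoneme": "L",  "start": 945, "duration": 6},
        {"phoneme": "EH", "start": 951, "duration": 7},
        {"phoneme": "T",  "start": 958, "duration": 8},
        {"phoneme": "ER", "start": 966, "duration": 16}
    ]
}]
"""


def phoneme_keyframes(words, units_per_second=100.0, fps=24.0):
    """Flatten the word list into (phoneme, first_frame, last_frame) tuples.

    units_per_second and fps are assumptions; adjust them to match the
    recognizer's time base and the animation's frame rate.
    """
    keys = []
    for word in words:
        for p in word["phonemes"]:
            start_s = p["start"] / units_per_second
            end_s = (p["start"] + p["duration"]) / units_per_second
            keys.append((p["phoneme"], round(start_s * fps), round(end_s * fps)))
    return keys


words = json.loads(SAMPLE)
keys = phoneme_keyframes(words)
for phoneme, first, last in keys:
    print(phoneme, first, last)
```

Each tuple could then drive a keyframe on whatever mouth shape stands in for that phoneme, with the frame range giving how long the shape is held.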