speech to text – word segmentation


op.recognize is a speech recognition (speech-to-text) external based on the Sphinx library. It transcribes the incoming signal into text and can report timestamps giving the position of each sound in a buffer. No speaker-specific voice training is needed.

op.recognize is built on a hidden Markov model (HMM) system that works with phonemes, so it can handle any language for which a dictionary is available. op.recognize has 63,998 English words and 107,227 French words.

search graph - HMM with phonemes
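To make the phoneme HMM idea concrete, here is a minimal sketch of Viterbi decoding over a toy two-phoneme model, followed by a helper that turns the per-frame result into segment boundaries. Everything in it is an illustrative assumption: the phoneme labels, the discrete "nasal"/"vowel" observations, the probabilities, and the function names `viterbi` and `segments` are hypothetical. Sphinx itself works on continuous acoustic features and a much larger search graph.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state (phoneme) for each observation frame."""
    # Log space avoids underflow; a missing entry counts as probability 0.
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log-probability of the best path ending in s at frame t,
    #               previous state on that path)
    first = observations[0]
    best = [{s: (log(start_p.get(s, 0)) + log(emit_p[s].get(first, 0)), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + log(trans_p[p].get(s, 0)), p) for p in states
            )
            column[s] = (score + log(emit_p[s].get(obs, 0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


def segments(path):
    """Collapse a per-frame path into (label, first_frame, end_frame) runs."""
    runs = []
    for i, label in enumerate(path):
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1], i + 1)
        else:
            runs.append((label, i, i + 1))
    return runs


# Toy model for the word "no": phoneme N followed by phoneme OW (assumed values).
states = ["N", "OW"]
start_p = {"N": 1.0}
trans_p = {"N": {"N": 0.6, "OW": 0.4}, "OW": {"OW": 1.0}}
emit_p = {"N": {"nasal": 0.9, "vowel": 0.1}, "OW": {"nasal": 0.2, "vowel": 0.8}}

frames = ["nasal", "nasal", "vowel", "vowel"]
path = viterbi(frames, states, start_p, trans_p, emit_p)
print(path)            # ['N', 'N', 'OW', 'OW']
print(segments(path))  # [('N', 0, 2), ('OW', 2, 4)]
```

Multiplying the frame indices by the frame duration gives the kind of per-sound timing information mentioned above.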

The more flexible the database, the greater the chance of misrecognized words. Recognition can be prepared in three ways:

— a grammar file in JSpeech Grammar Format (JSGF) describing the order of the words to recognize.

— a trigram language model file, generated with the Carnegie Mellon University Statistical Language Modeling toolkit from the text to recognize.

— no preparation at all, when you do not need to specify in advance what will be said. This method is more general, but it is more likely to misrecognize words.
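As an illustration of what a trigram language model captures, the following sketch counts word triples in a text and estimates the probability of a word given the two preceding words. It is not the CMU toolkit: the function names are hypothetical, there is no smoothing or back-off, and no model file is written.

```python
from collections import Counter


def trigram_counts(text):
    """Count every run of three consecutive words, with sentence markers."""
    words = ["<s>"] + text.lower().split() + ["</s>"]
    return Counter(zip(words, words[1:], words[2:]))


def trigram_probability(counts, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2), without smoothing."""
    context_total = sum(
        c for (a, b, _), c in counts.items() if (a, b) == (w1, w2)
    )
    return counts[(w1, w2, w3)] / context_total if context_total else 0.0


counts = trigram_counts("open the door open the window")
print(trigram_probability(counts, "open", "the", "door"))   # 0.5
print(trigram_probability(counts, "open", "the", "piano"))  # 0.0
```

A recognizer guided by such a model favours word sequences that appeared in the training text, which is why it makes fewer mistakes than unconstrained recognition but is less general.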

An acoustic model is required. An English model from Sphinx is provided with the external. For another language or accent, you can create your own acoustic models using SphinxTrain.

How to make someone talk with the voice of someone else:

Real-time voice alignment is one possible application. Two sentences are spoken at different speeds. After segmentation, one of the sentences is time-stretched to fit the other (using supervp.play~ for the stretch).
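The alignment step can be sketched as computing one stretch factor per word from the two segmentations. This is only a sketch under assumptions: the `(word, start, end)` tuple format in seconds and the function name `stretch_factors` are hypothetical, both recordings are assumed to contain the same words in the same order, and how the factors are sent to supervp.play~ is not shown.

```python
def stretch_factors(source_segments, target_segments):
    """Per-word factors that map source word durations onto target durations.

    Each segment is a (word, start_seconds, end_seconds) tuple.
    A factor above 1 means the source word must be slowed down.
    """
    source_words = [word for word, _, _ in source_segments]
    target_words = [word for word, _, _ in target_segments]
    if source_words != target_words:
        raise ValueError("both recordings must contain the same words in order")

    factors = []
    for (word, s_start, s_end), (_, t_start, t_end) in zip(
        source_segments, target_segments
    ):
        factors.append((word, (t_end - t_start) / (s_end - s_start)))
    return factors


# Hypothetical segmentations of the same sentence spoken by two voices.
fast = [("hello", 0.0, 0.25), ("world", 0.25, 1.25)]
slow = [("hello", 0.0, 0.5), ("world", 0.5, 1.5)]

print(stretch_factors(fast, slow))  # [('hello', 2.0), ('world', 1.0)]
```

Working word by word rather than with a single global factor keeps each word of the stretched voice lined up with the corresponding word of the other voice.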

Other applications could include text following in theatre or installations, dictation, translation, chatbots, summarization, and concatenative voice synthesis (recreating sentences from existing segments).

op.recognize help patch