Mimic 3
Overview
Mimic 3 is a neural text-to-speech engine that runs locally, even on low-end hardware like the Raspberry Pi 4. Mycroft A.I. has trained hundreds of voices in over a dozen languages and made them freely available to the open source community.
You can hear samples from all three Mimic systems below, speaking the same sentence.
Name | Sample | Local | Technology |
---|---|---|---|
Mimic 1 | (audio sample) | Yes | Festival Lite (flite) |
Mimic 2 | (audio sample) | No | Tacotron |
Mimic 3 | (audio sample) | Yes | VITS: Conditional Variational Autoencoder with Adversarial Learning |
The beige hue on the waters of the loch impressed all, including the French queen, before she heard that symphony again, just as young Arthur wanted.
Voice Keys
Voices in Mimic 3 are keyed by a name with specific parts. These parts include the voice's language, region, training dataset, quality level, and speaker.
The default voice is en_UK/apope_low.
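As a rough illustration of how such a key breaks down, the sketch below splits a key with shell parameter expansion. It assumes the layout language_REGION/dataset_quality#speaker that the keys on this page follow, with the region and speaker optional. The function name split_voice_key is hypothetical and not part of Mimic 3.

```shell
# Hypothetical helper (not part of Mimic 3): split a voice key into its parts.
# Assumed layout: <language>[_<REGION>]/<dataset>_<quality>[#<speaker>]
# The body runs in a subshell so its variables do not leak into the caller.
split_voice_key() (
    key=$1
    speaker=
    region=
    # An optional speaker follows the first '#'.
    case $key in
        *'#'*) speaker=${key#*'#'}; key=${key%%'#'*} ;;
    esac
    locale=${key%%/*}        # e.g. en_US or nl
    name=${key#*/}           # e.g. vctk_low
    language=${locale%%_*}   # part before the first '_', or the whole locale
    case $locale in
        *_*) region=${locale#*_} ;;
    esac
    dataset=${name%_*}       # everything before the last '_'
    quality=${name##*_}      # everything after the last '_'
    printf 'language=%s region=%s dataset=%s quality=%s speaker=%s\n' \
        "$language" "$region" "$dataset" "$quality" "$speaker"
)
```

For example, split_voice_key 'en_US/vctk_low#p236' reports the language en, region US, dataset vctk, quality low, and speaker p236.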
Web Server
A basic web server and interface is provided for quick testing and for serving multiple text-to-speech clients.
You can run the web server with the following command:
mimic3-server --host localhost --port 59125 --preload-voice 'en_UK/apope_low'
With the web server running, clients can connect through the command line with the remote option:
mimic3 --remote 'http://localhost:59125' 'Some text to speak.' > output.wav
See below for more command line examples.
Mary TTS API
A web API compatible with Mary TTS is also available, allowing Mimic 3 to be used in other projects like Home Assistant.
curl -X GET -G \
--data-urlencode "INPUT_TEXT=Some text to speak." \
--data-urlencode "VOICE=en_UK/apope_low" \
--data-urlencode 'INPUT_TYPE=TEXT' \
--data-urlencode 'OUTPUT_TYPE=AUDIO' \
--data-urlencode 'AUDIO=WAVE' \
'localhost:59125/process' \
--output output.wav
Command Line Interface
The Mimic 3 command line interface makes it easy to convert text into audio.
mimic3 'Some text to speak.' > output.wav
Loading voice models can be slow, so the web server is recommended for repeated usage.
Many different voices are available in over a dozen languages.
mimic3 --voice 'en_US/vctk_low' 'Using a different voice.' > output.wav
Voices are downloaded automatically from GitHub on first use. To list the available voices:
mimic3 --voices | awk '{print $1}'
KEY
de_DE/m-ailabs_low
de_DE/thorsten_low
el_GR/rapunzelina_low
en_UK/apope_low
en_US/cmu-arctic_low
en_US/ljspeech_low
en_US/vctk_low
es_ES/carlfm_low
es_ES/m-ailabs_low
...
Voice models are stored locally in your home directory:
tree "${HOME}/.local/share/mycroft/mimic3/voices"
├── de_DE
│ ├── m-ailabs_low
│ │ ├── ALIASES
│ │ ├── config.json
│ │ ├── generator.onnx
│ │ ├── LICENSE
│ │ ├── phoneme_map.txt
│ │ ├── phonemes.txt
│ │ ├── README.md
│ │ ├── SOURCE
│ │ ├── speaker_map.csv
│ │ └── speakers.txt
...
Some voices have multiple speakers; en_US/vctk_low has over one hundred. To pick one, append # and the speaker name to the voice key:
mimic3 --voice 'en_US/vctk_low#p236' 'Using a different speaker.' > output.wav
Batch Processing
Multiple sentences can be synthesized with a single command and stored as separate audio files.
cat << EOF |
The birch canoe slid on the smooth planks.
Glue the sheet to the dark blue background.
It's easy to tell the depth of a well.
EOF
mimic3 --output-dir output/
ls output/
Glue_the_sheet_to_the_dark_blue_background.wav
Its_easy_to_tell_the_depth_of_a_well.wav
The_birch_canoe_slid_on_the_smooth_planks.wav
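The listing above suggests that each file is named after its sentence with punctuation removed and spaces replaced by underscores. The sketch below approximates that pattern so a script can predict an output name. The rule is inferred from these three examples only and is an assumption; Mimic 3's actual naming logic may differ, and sentence_to_filename is a hypothetical helper.

```shell
# Hypothetical approximation of the file naming seen in the listing above.
# Inferred from three examples only; Mimic 3's real rule may differ.
sentence_to_filename() (
    # Drop punctuation, then turn spaces into underscores.
    name=$(printf '%s' "$1" | tr -d '[:punct:]' | tr ' ' '_')
    printf '%s.wav\n' "$name"
)
```

For instance, sentence_to_filename "It's easy to tell the depth of a well." yields the same name as the second file in the listing.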
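CSV mode lets you name each sentence and set its voice or speaker. Each input line has three fields separated by |: an ID that becomes the output file name, a voice or #speaker, and the text.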
cat << EOF |
s01|#awb|The birch canoe slid on the smooth planks.
s02|#rms|Glue the sheet to the dark blue background.
s03|#slt|It's easy to tell the depth of a well.
EOF
mimic3 --csv-voice --voice 'en_US/cmu-arctic_low' --output-dir output/
ls output/
s01.wav s02.wav s03.wav
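If the sentences start out as plain text, the three-field lines can be generated instead of typed by hand. The sketch below numbers each input line and assigns one speaker to all of them. The helper name make_csv_lines and the s01, s02 numbering scheme are illustrative choices modeled on the example above, not something Mimic 3 requires.

```shell
# Hypothetical helper: turn plain sentences on stdin into id|#speaker|text lines.
# $1 is the speaker name applied to every line.
make_csv_lines() {
    # NR is the current line number; %02d pads it to two digits (s01, s02, ...).
    awk -v speaker="$1" '{ printf "s%02d|#%s|%s\n", NR, speaker, $0 }'
}
```

Assuming a file named sentences.txt with one sentence per line, the output can be piped straight into the command shown above: make_csv_lines slt < sentences.txt | mimic3 --csv-voice --voice 'en_US/cmu-arctic_low' --output-dir output/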
Longer texts like books can be synthesized in real time. This example reads Alice in Wonderland:
curl --output - 'https://www.gutenberg.org/files/11/11-0.txt' | \
mimic3 --interactive --process-on-blank-line
SSML
Speech Synthesis Markup Language, or SSML, is available through the command line and web interface. SSML lets you fine-tune the output, for example by inserting pauses or switching voices.
cat << EOF |
<speak>
<s>
Spoken before pause with default voice.
</s>
<break time="2s" />
<voice name="en_US/vctk_low#p236">
<s>
Spoken after pause in a different voice.
</s>
</voice>
</speak>
EOF
mimic3 --ssml --voice 'en_US/cmu-arctic_low#eey' > output.wav
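When the text comes from a script rather than a hand-written file, the SSML can be generated on the fly. The sketch below wraps one sentence in the same speak, voice, and s elements used above and escapes the XML special characters &, <, and > so arbitrary text does not break the markup. The helper name ssml_for is hypothetical, and emitting the document on a single line is an assumption that the parser ignores the whitespace between elements.

```shell
# Hypothetical helper: wrap one sentence in <speak>/<voice>/<s> elements.
# $1 = voice key, $2 = text to speak.
ssml_for() (
    # Escape '&' first so the entities produced for '<' and '>' are not re-escaped.
    text=$(printf '%s' "$2" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')
    printf '<speak><voice name="%s"><s>%s</s></voice></speak>\n' "$1" "$text"
)
```

A possible use: ssml_for 'en_US/vctk_low#p236' 'Fish & chips, please.' | mimic3 --ssml > output.wav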
SSML even lets you mix and match languages:
cat << EOF |
<speak>
<voice name="de_DE/thorsten_low">
<s>
Eine Sprache ist niemals genug.
</s>
</voice>
<voice name="nl/rdh_low">
<s>
Eén taal is nooit genoeg.
</s>
</voice>
<voice name="en_US/vctk_low">
<s>
One language is never enough.
</s>
</voice>
</speak>
EOF
mimic3 --ssml > output.wav
Thanks for listening. You can try Mimic 3 for yourself, and share your feedback.