Mimic 3


Welcome to a demonstration of the Mimic 3 text-to-speech system, designed to run on the Mark II.

Overview

Mimic 3 is a neural text-to-speech engine that can run locally, even on low-end hardware like the Raspberry Pi 4. Mycroft A.I. has trained hundreds of voices in over a dozen languages and made them freely available to the open source community.

You can hear samples from all three Mimic systems below, speaking the same sentence.

Name     Local  Technology
Mimic 1  Yes    Festival Lite (flite)
Mimic 2  No     Tacotron
Mimic 3  Yes    VITS: Conditional Variational Autoencoder with Adversarial Learning

Sample sentence: The beige hue on the waters of the loch impressed all, including the French queen, before she heard that symphony again, just as young Arthur wanted.

Voice Keys

Voices in Mimic 3 are keyed by a name with specific parts. These parts include the voice's language, region, training dataset, quality level, and speaker.

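The default voice is en_UK/apope_low: language en, region UK, dataset apope, quality low. When a voice has more than one speaker, the speaker is appended after a #, as in en_US/vctk_low#p236.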
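
To make the key layout concrete, here is a minimal sketch that splits a key into its parts using only shell parameter expansion. It assumes keys follow the pattern language[_region]/dataset_quality[#speaker], which is inferred from the keys shown on this page. The helper name parse_voice_key is hypothetical and not part of Mimic 3.

```sh
# Hypothetical helper, not part of Mimic 3.
# ASSUMPTION: keys look like <language>[_<region>]/<dataset>_<quality>[#<speaker>].
# The body runs in a subshell ( ... ) so its variables stay local.
parse_voice_key() (
  key=$1
  speaker=
  case $key in
    *"#"*)
      speaker=${key#*"#"}    # text after the first '#'
      key=${key%%"#"*}       # text before the first '#'
      ;;
  esac
  lang_region=${key%%/*}     # e.g. en_US
  name=${key#*/}             # e.g. vctk_low
  language=${lang_region%%_*}
  region=
  case $lang_region in
    *_*) region=${lang_region#*_} ;;
  esac
  dataset=${name%_*}         # everything before the last '_'
  quality=${name##*_}        # everything after the last '_'
  printf 'language=%s region=%s dataset=%s quality=%s speaker=%s\n' \
    "$language" "$region" "$dataset" "$quality" "$speaker"
)
```

Calling parse_voice_key 'en_US/vctk_low#p236' prints one line of name=value pairs. Parts that a key omits, such as the region in nl/rdh_low, are printed as empty values.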


Web Server

A basic web server and interface are provided for quick testing and for handling multiple text-to-speech clients.


You can run the web server with the following command:

mimic3-server --host localhost --port 59125 --preload-voice 'en_UK/apope_low'

With the web server running, clients can connect through the command line with the remote option:

mimic3 --remote 'http://localhost:59125' 'Some text to speak.' > output.wav

See below for more command line examples.

Mary TTS API

A web API compatible with Mary TTS is also available, allowing Mimic 3 to be used in other projects like Home Assistant.

curl -X GET -G \
--data-urlencode "INPUT_TEXT=Some text to speak." \
--data-urlencode "VOICE=en_UK/apope_low" \
--data-urlencode 'INPUT_TYPE=TEXT' \
--data-urlencode 'OUTPUT_TYPE=AUDIO' \
--data-urlencode 'AUDIO=WAVE' \
'localhost:59125/process' \
--output output.wav
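
For repeated requests, the same call can be wrapped in a small shell function. This is a sketch only: mary_tts is a hypothetical name, the default base URL simply reuses the host and port from the server command on this page, and the request parameters are copied from the curl example rather than taken from a full API reference.

```sh
# Hypothetical wrapper around the Mary TTS compatible endpoint.
# Usage: mary_tts TEXT VOICE OUTPUT_FILE [BASE_URL]
# ASSUMPTION: the server listens on localhost:59125 unless BASE_URL is given.
mary_tts() {
  curl --silent --show-error --fail --get \
    --data-urlencode "INPUT_TEXT=$1" \
    --data-urlencode "VOICE=$2" \
    --data-urlencode 'INPUT_TYPE=TEXT' \
    --data-urlencode 'OUTPUT_TYPE=AUDIO' \
    --data-urlencode 'AUDIO=WAVE' \
    --output "$3" \
    "${4:-localhost:59125}/process"
}
```

With the server running, mary_tts 'Some text to speak.' 'en_UK/apope_low' output.wav would write the audio to output.wav. The --fail option makes curl return a non-zero status on an HTTP error, so a script can detect a failed request.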

Command Line Interface

The Mimic 3 command line interface makes it easy to convert text into audio.

mimic3 'Some text to speak.' > output.wav

Loading voice models can be slow, so the web server is recommended for repeated usage.

Many different voices are available in over a dozen languages.

mimic3 --voice 'en_US/vctk_low' 'Using a different voice.' > output.wav

Voices are automatically downloaded from GitHub on first use. You can list the available voices:

mimic3 --voices | awk '{print $1}'
KEY
de_DE/m-ailabs_low
de_DE/thorsten_low
el_GR/rapunzelina_low
en_UK/apope_low
en_US/cmu-arctic_low
en_US/ljspeech_low
en_US/vctk_low
es_ES/carlfm_low
es_ES/m-ailabs_low
...

Voice models are stored locally in your home directory:

tree "${HOME}/.local/share/mycroft/mimic3/voices"

├── de_DE
│   ├── m-ailabs_low
│   │   ├── ALIASES
│   │   ├── config.json
│   │   ├── generator.onnx
│   │   ├── LICENSE
│   │   ├── phoneme_map.txt
│   │   ├── phonemes.txt
│   │   ├── README.md
│   │   ├── SOURCE
│   │   ├── speaker_map.csv
│   │   └── speakers.txt
...

Some voices even have multiple speakers. The en_US/vctk_low voice has over one hundred; add # and a speaker name to the voice key to select one.

mimic3 --voice 'en_US/vctk_low#p236' 'Using a different speaker.' > output.wav

Batch Processing

Multiple sentences can be synthesized with a single command and stored as separate audio files.

cat << EOF |
The birch canoe slid on the smooth planks.
Glue the sheet to the dark blue background.
It's easy to tell the depth of a well.
EOF
    mimic3 --output-dir output/
ls output/
Glue_the_sheet_to_the_dark_blue_background.wav
Its_easy_to_tell_the_depth_of_a_well.wav
The_birch_canoe_slid_on_the_smooth_planks.wav

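Enabling CSV mode with --csv-voice lets you name each sentence and set its voice or speaker. Each input line has the form id|voice|text. In the example below, the voice field holds only a speaker (#awb, #rms, #slt) of the voice given by --voice.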

cat << EOF |
s01|#awb|The birch canoe slid on the smooth planks.
s02|#rms|Glue the sheet to the dark blue background.
s03|#slt|It's easy to tell the depth of a well.
EOF
    mimic3 --csv-voice --voice 'en_US/cmu-arctic_low' --output-dir output/
ls output/
s01.wav  s02.wav  s03.wav
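
When the sentences already live in a plain text file, the id|voice|text rows can be generated rather than typed. The sketch below is one way to do that with awk. The helper name make_csv_voice_lines and the s01, s02, ... numbering scheme are assumptions chosen to match the example above, and the helper does not handle sentences that themselves contain a | character.

```sh
# Hypothetical helper: reads one sentence per line on standard input and
# prints id|voice|text rows, skipping blank lines.
# Usage: make_csv_voice_lines VOICE_OR_SPEAKER < sentences.txt
make_csv_voice_lines() {
  awk -v voice="$1" 'NF { n++; printf "s%02d|%s|%s\n", n, voice, $0 }'
}
```

The output can be piped straight into the command shown above, for example: make_csv_voice_lines '#awb' < sentences.txt | mimic3 --csv-voice --voice 'en_US/cmu-arctic_low' --output-dir output/ (sentences.txt is a placeholder file name).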

Longer texts like books can be synthesized in real time. This example reads Alice in Wonderland:

curl --output - 'https://www.gutenberg.org/files/11/11-0.txt' | \
    mimic3 --interactive --process-on-blank-line

SSML

Speech Synthesis Markup Language, or SSML, is available through the command line and web interface. SSML allows you to fine-tune your output.

cat << EOF |
<speak>
  <s>
    Spoken before pause with default voice.
  </s>
  <break time="2s" />
  <voice name="en_US/vctk_low#p236">
    <s>
      Spoken after pause in a different voice.
    </s>
  </voice>
</speak>
EOF
    mimic3 --ssml --voice 'en_US/cmu-arctic_low#eey' > output.wav

SSML even lets you mix and match languages:

cat << EOF |
<speak>
  <voice name="de_DE/thorsten_low">
    <s>
      Eine Sprache ist niemals genug.
    </s>
  </voice>
  <voice name="nl/rdh_low">
    <s>
      Eén taal is nooit genoeg.
    </s>
  </voice>
  <voice name="en_US/vctk_low">
    <s>
      One language is never enough.
    </s>
  </voice>
</speak>
EOF
    mimic3 --ssml > output.wav
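
Multi-voice documents like the one above can also be generated from a simple list. The following sketch turns voice|text rows into an SSML document with one voice element per row. The helper names xml_escape and make_ssml are hypothetical, and the output uses only the speak, voice, and s elements that appear in the examples on this page.

```sh
# Hypothetical helpers, not part of Mimic 3.

# Escape characters that are special in XML text and attribute values.
xml_escape() {
  printf '%s' "$1" |
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

# Read voice|text rows on standard input and print an SSML document.
# The "|| [ -n "$voice" ]" part keeps a final row that lacks a trailing newline.
make_ssml() {
  printf '<speak>\n'
  while IFS='|' read -r voice text || [ -n "$voice" ]; do
    [ -n "$voice" ] || continue   # skip blank rows
    printf '  <voice name="%s">\n    <s>%s</s>\n  </voice>\n' \
      "$(xml_escape "$voice")" "$(xml_escape "$text")"
  done
  printf '</speak>\n'
}
```

A file of rows such as de_DE/thorsten_low|Eine Sprache ist niemals genug. could then be converted and spoken with make_ssml < rows.txt | mimic3 --ssml > output.wav, where rows.txt is a placeholder file name.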

Thanks for listening. You can try Mimic 3 for yourself, and share your feedback.