
espeak, espeak-ng, pyttsx3, and MBROLA

47 Posts
5 Users
5 Likes
14.6 K Views
Robo Pi
(@robo-pi)
Robotics Engineer
Joined: 5 years ago
Posts: 1669
Topic starter  

Has anyone worked with espeak, espeak-ng, pyttsx3, or MBROLA?

espeak is a local TTS (Text-to-Speech) engine.  I think espeak-ng is just a newer version of the same program.

I have espeak installed on Linux Ubuntu and it works just fine.  I'm also using it in Python with pyttsx3, which actually uses espeak as the TTS engine.  So I have it all working and the computer is talking.
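
For reference, the basic setup I have working looks roughly like this (a minimal sketch; on my Ubuntu machine pyttsx3 picks up the espeak driver automatically):

import pyttsx3

# On Linux, pyttsx3 normally falls back to the espeak driver.
engine = pyttsx3.init()

# The properties I started with.
engine.setProperty('rate', 150)     # words per minute
engine.setProperty('volume', 1.0)   # 0.0 to 1.0

engine.say("Hello, I am talking through espeak.")
engine.runAndWait()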

However, I'm trying to go deeper into this and learn more about how to fine tune it and I'm having great difficulty in finding more advanced information.

I was able to change the voice characteristics to some degree, and it's much better than it was originally.  However, I still have two questions I haven't been able to find answers to on the Internet yet.

Question 1: How do I get espeak to use MBROLA voices via Pyttsx3 in Python?

It's my understanding that espeak can use MBROLA voices, which are supposedly better than the standard voices that come with espeak.  However, I haven't been able to find any detailed information on how to get MBROLA voices to work with espeak via pyttsx3.
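
In case it helps anyone answer, this is where I'm assuming an installed MBROLA voice would have to show up: in the list of voices the pyttsx3 espeak driver reports (I haven't actually seen an MBROLA entry there yet, so this is just where I'm looking):

import pyttsx3

engine = pyttsx3.init()

# List every voice the espeak driver knows about.  My assumption is that an
# installed MBROLA voice would appear here with an "mb-" style id, but I
# haven't confirmed that on my machine.
for voice in engine.getProperty('voices'):
    print(voice.id, voice.name)

# If one did show up, I would expect to select it like any other voice:
# engine.setProperty('voice', 'mb-en1')   # hypothetical voice id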

Question 2: Is there a way to change how espeak speaks specific words?

Again, I'm not finding any information on this.

My specific issue is that when espeak says "Alysha", it actually says "Malysha".  I can't seem to get rid of the "m" at the beginning of the word.  I even tried using an alternative spelling, "ah leash ah", but it still says "Malysha".

So I'm trying to learn how to modify the behavior of the language dictionary, but as I say, I'm not finding any useful information on this.
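
One thing I do plan to try is espeak's -x option, which prints the phoneme mnemonics it generates for a word without speaking it, so I can at least see where the stray "m" is coming from.  A quick check from Python might look like this:

import subprocess

# -x writes the phoneme mnemonics to stdout, -q suppresses the audio.
subprocess.run(['espeak', '-x', '-q', 'Alysha'])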

If anyone has any experience or knowledge of using espeak, and especially of how to modify its behavior, I would be very grateful.

Please note also that I'm doing this on Linux Ubuntu.  I know that Windows already has much better sounding voices.   I have TTS set up on Windows using Microsoft Speech Platform, and I seem to have full control of that system.  But I can't seem to find a lot of information on how to modify espeak, MBROLA or pyttsx3 on Linux.

By the way, I'm fully aware that there are "better" TTS engines that use the Internet.  But I don't want to be dependent on the Internet.  This is why I turned to pyttsx3 and espeak to begin with.  They are totally local to the machine, and are even suited to small SBCs such as the Raspberry Pi and Jetson Nano.  So this is why I'm going with this particular TTS.

It's just really difficult to find any detailed or extensive information on it.  I've searched for YouTube videos, but I'm already at the point where they typically stop.  I have it installed and it's working.  I'd just like to learn how to customize it in more detail.

So if you have any experience with espeak, espeak-ng, pyttsx3 or MBROLA please let me know.

Thanks.

DroneBot Workshop Robotics Engineer
James


   
Quote
Duce robot
(@duce-robot)
Member
Joined: 5 years ago
Posts: 680
 

I have been reading into it.  I think there are extra modules you can buy for espeak; I think I saw a HAT for it.


   
ReplyQuote
Robo Pi
(@robo-pi)
Robotics Engineer
Joined: 5 years ago
Posts: 1669
Topic starter  

I did find the following.  Not exactly a tutorial, but it might contain the information I need.  I'll need to go through this and see if I can learn how to create my own voice and language dictionary.

eSpeak Documents

It contains links to the following documents:

  • Voices
  • Mbrola Voices
  • Pronunciation Dictionary
  • Adding or Improving a Language
  • Phonemes
  • Phoneme Tables
  • Intonation
  • eSpeak Library API
  • Markup Tags
  • The eSpeak Edit Program

So it looks like I'm off to eSpeak college.  No easy tutorials, but hopefully I can glean everything I need from this documentation.  Then I can create the one-and-only video tutorial that explains all of this stuff in detail. 😊 

It's hard to believe that someone didn't already make an in-depth tutorial on this.

DroneBot Workshop Robotics Engineer
James


   
ReplyQuote
(@starnovice)
Member
Joined: 5 years ago
Posts: 110
 

@robo-pi

This looks like a text-to-speech program, right?  I would like to find a solution for speech input to the DB1.

Pat Wicker (Portland, OR, USA)


   
ReplyQuote
Robo Pi
(@robo-pi)
Robotics Engineer
Joined: 5 years ago
Posts: 1669
Topic starter  
Posted by: @starnovice

This looks like a text-to-speech program, right?

Yes, this is only TTS.  It converts written words into speech, so this is just to make the robot talk.

Posted by: @starnovice

I would like to find a solution for speech input to the DB1.

That requires an SRE (Speech Recognition Engine).  There are many different SREs available.  If you're willing to use Windows you can use the Microsoft Speech Platform, which contains both TTS and SRE.  MSP is not Internet dependent, but it is dependent on APIs that only run on the Windows OS.

There are also many others.  However, many of those require Internet and cloud dependency.  In other words, you're not only tied to the Internet, but you're also tied to a specific server that does the SRE for you.  There are pros and cons to using Internet-dependent SREs.

  • Pros
    • They supposedly work very well and are constantly upgraded.
  • Cons
    • You don't have the flexibility of fine-tuning them to your specific needs.
    • They can introduce delays, depending on your Internet speed.
    • If there's no Internet service, your robot can no longer understand speech.
    • Everything you say to your robot will be available on the Cloud.

There are very few Local SREs.  By local I mean one that runs solely on your machine with no dependency on external servers or the Internet.  However, some do exist.

One Local SRE is called CMU Sphinx by Carnegie Mellon University.

It comes in two flavors:

  • Sphinx4
  • Pocket Sphinx

Sphinx4 is a very large program with a very large database.  It also requires a pretty large and fast computer to run smoothly.

Pocket Sphinx is a condensed version of Sphinx4 aimed at SBCs and other computers that have limited speed and resources.

I'm currently trying to learn how to use Pocket Sphinx.  I've downloaded it onto a Raspberry Pi 4.  I have it up and running and it has been converting my voice to text almost perfectly, very close to 100% accuracy.  There is a small delay between when you speak and when the computer returns the deciphered text.  However, it does this for entire sentences and even paragraphs.  So it appears to be a very good choice for robotics.

So far I only have it running from the command line prompt.  I'm trying to learn how to access it from within Python programs, but I haven't been able to figure that out yet.  Like the eSpeak TTS, Pocket Sphinx SRE is also difficult to find good tutorials on, especially anything in-depth.  About all I've been able to find are tutorials that show you how to install it (no small task) and then run it from the command line.  But I can already do that, so those tutorials aren't very helpful now.

I've printed out the entire CMU tutorial for Pocket Sphinx and I'm determined to learn the details.  But thus far I haven't found anything that teaches how to access it from Python.  I did see one YouTube tutorial on how to do this, but all they were doing was importing the os module and then executing the same commands you would type on the command line.  That's a bit cumbersome.  I'm hoping there's a better way to interface with Pocket Sphinx from Python than this.
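
For reference, the approach in that video amounted to something like the sketch below.  I'm writing the binary name and the -inmic option from memory, so treat them as assumptions; it works, but it feels like a workaround rather than a proper Python interface:

import subprocess

# Launch the command-line recognizer and read its output line by line.
# 'pocketsphinx_continuous -inmic yes' is what I run in the terminal; I'm
# assuming the same command behaves the same way when started from Python.
proc = subprocess.Popen(
    ['pocketsphinx_continuous', '-inmic', 'yes'],
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    text=True,
)

for line in proc.stdout:
    line = line.strip()
    if line:
        print("Heard:", line)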

In any case, the whole SRE thing is quite different from the TTS, so I'm trying to keep them separate.  I have two notebook binders: one for eSpeak, MBROLA, and pyttsx3 (all for the TTS project), and a second binder for the Pocket Sphinx SRE project.

So I'm doing both.   And they are both very difficult to find good tutorials on.  So when I learn these I'm hoping to make some tutorial videos on them myself.  Assuming I can get them up and running the way I want. 😊 

I think this is going to take me some time to learn all this stuff.  But if it all pans out in the end it will be worth it.

DroneBot Workshop Robotics Engineer
James


   
ReplyQuote
(@starnovice)
Member
Joined: 5 years ago
Posts: 110
 

@robo-pi

I have been trying to use an Alexa but the tutorials for that are few and the code doesn't seem to work in practice.  It has been quite frustrating.  Like you said it would only work when you have cloud access.

Pat Wicker (Portland, OR, USA)


   
ReplyQuote
Robo Pi
(@robo-pi)
Robotics Engineer
Joined: 5 years ago
Posts: 1669
Topic starter  

@starnovice

I thought about using Alexa too, but I'm not sure whether there is any way to control what Alexa says.

I'm working on what I was calling a "Semantic AI" project.  I have since renamed it the "Linguistic AI" project for reasons I won't go into here.   In any case, I need to have full control over what the robot says in response to my speech. 

I could do my entire Linguistic AI project just by typing in my text and having the robot reply with text output.  But I wanted to go with audio both ways.  eSpeak and Pocket Sphinx appear to be very promising for my specific application.  I will have full control over programming exactly what the robot understands and how it chooses to respond.  And that's exactly what I need.  So I really don't want something like Alexa that someone else programmed to respond to me.  That wouldn't work for my Linguistic AI project.

 

DroneBot Workshop Robotics Engineer
James


   
ReplyQuote
(@starnovice)
Member
Joined: 5 years ago
Posts: 110
 

@robo-pi

You have total control over what Alexa says and does.  The only thing that seems to be hard to change is the name "Alexa", called the "wake" word.  You can choose "Amazon" or "Computer" as alternate wake words, even on an Alexa or Echo.  Otherwise you have to patch in your own wake-word engine.

Bottom line it sounds like your approach will be more satisfying.

Pat Wicker (Portland, OR, USA)


   
ReplyQuote
Robo Pi
(@robo-pi)
Robotics Engineer
Joined: 5 years ago
Posts: 1669
Topic starter  
Posted by: @starnovice

You have total control over what Alexa says and does.

I didn't know that.  In fact, there's a LOT I don't know when it comes to exactly what is available for TTS and SRE.  I'm no expert in this field, to be sure.  I came across pyttsx3 for TTS, and it looked promising.  Then I discovered that it actually uses eSpeak as the TTS engine, and so I decided to go with this for the TTS.  My main considerations in this area are that it is totally local to the machine and relatively compact for use on SBCs.

After looking at several SRE options I chose Pocket Sphinx because it also appeared to meet my criteria the best.  It's local to the machine, independent of the Internet, and specifically designed for use with SBCs.  It's also supposedly fully configurable in terms of grammar and dictionaries.  So it looks promising for my application.  But I still have a lot to learn about both eSpeak and Pocket Sphinx.

I don't know whether I chose the best systems for my purpose yet or not.  All I can say is that they look promising.   But they both also lack any really good in-depth tutorials.  If I learn how to use them efficiently, that will change as I will definitely make my own tutorials on both of these.

In the meantime I'm in the early stages yet, and I could potentially switch over to something else depending on how well my experience goes with these.

I'm focusing on eSpeak right now.  TTS is totally different from SRE so working on both of them simultaneously has the potential to become confusing.

Although having said that it's also important that they can work together to some degree.  I was previously working with Microsoft Speech Platform which has both TTS and SRE, and they are well integrated so that it's easy to use the same grammatical structures and semantic indexing.   The only problem I had with MSP is that it is dependent on the Windows OS.   I wanted to move over to the Raspberry Pi and Jetson Nano so I needed to find something that runs on Linux.  I also wanted to move from C# to Python.   Both languages are great, but Python is typically used in AI applications so I wanted to do this in Python for the sake of AI tutorials.  I think more people would follow them using Python than if I used C#.

So the whole point of this project for me is to move away from Windows dependency.  Otherwise I could just stay with the Microsoft Speech Platform, which is actually pretty nice.  The only bad thing about it is that it's totally tied to Windows.  I think it's also pretty much tied to C#; I believe it can also be used from C++, but I don't think it's compatible with Python.  Even if it were, it's still tied to Windows, which I don't want.

But for someone who doesn't mind being tied to Windows and C# or C++, the Microsoft Speech Platform is definitely a nice TTS/SRE package.  It's local, Internet independent, and fully flexible for personal configuration.  Plus it has far nicer voices than eSpeak.  Although with any luck I'll be able to improve the voices on eSpeak.  That's yet to be determined.  In fact, that's what I'm currently looking into.

By the way, I have modified the voices quite a bit from what comes with eSpeak out of the box.  But I'd like to do even more tweaking.  So I'm looking for ways to obtain even more control over the sound of the voice. 😎 

~~~~~~~

Perhaps I should mention here, too, that when it comes to TTS there are basically two ways to generate a voice.  One way is to synthesize it from scratch using digital techniques within the program.  This is how eSpeak generates its voice.  The other way is to use *.WAV files for the phonemes.  Those speech engines can sound very human because a human voice can actually be used to record the phonemes.  So that's a major difference there.

Like I say, I have a LOT to learn in terms of actually gaining experience with TTS and SRE, so where I'll actually end up is anyone's guess.  All I can say is that, for the moment, eSpeak and Pocket Sphinx have my attention.  They look promising for use with Linux, Python, and SBCs.  Only time will tell now.

DroneBot Workshop Robotics Engineer
James


   
ReplyQuote
Robo Pi
(@robo-pi)
Robotics Engineer
Joined: 5 years ago
Posts: 1669
Topic starter  

Progress report on the eSpeak voices

It's not much, but I was able to create my own new voice configuration file.  I named it "alysha" as this is the name of my robot.  And now I can use setProperty('voice', 'alysha') in Python and it uses my new voice configuration.  So this is great.  I'm making progress.
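
In Python that looks something like this (a minimal sketch; "alysha" is just the name of my own voice file, so anyone trying this would substitute their own voice name):

import pyttsx3

engine = pyttsx3.init()

# "alysha" is my own voice configuration file in espeak's voices directory;
# substitute whatever voice name you created.
engine.setProperty('voice', 'alysha')

engine.say("Hello, my name is Alysha.")
engine.runAndWait()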

Originally I only had access to three parameters:

  1. Rate of speech
  2. Volume of speech
  3. Very limited pitch change (integer values 0 to 5 only)

Now I have access to 23 parameters, many of which I haven't even had time to experiment with yet.  I also have far more control over the pitch instead of just 5 integer options, and I can modify how the pitch changes over time, such as raising or lowering it at the end of a sentence.  So I've already gained far more control than I had before.  Progress is being made, ever so slowly.

These changes can also be made programmatically on the fly, so the robot can ultimately have far more control over its voice.  This will allow emotional expression to be incorporated into the Linguistic AI model. 😊 
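
As a rough illustration of what I mean by "on the fly" (this sketch drives espeak directly with its -p pitch, -s speed, and -a amplitude options rather than going through the voice file, and the values are just guesses at what "calm" versus "excited" might sound like):

import subprocess

def say(text, pitch=50, speed=160, amplitude=100):
    """Speak a phrase with espeak, varying pitch/speed/volume per call."""
    subprocess.run([
        'espeak',
        '-p', str(pitch),      # pitch, 0 to 99
        '-s', str(speed),      # speed in words per minute
        '-a', str(amplitude),  # amplitude, 0 to 200
        text,
    ])

say("I am calm.", pitch=35, speed=130)
say("I am excited!", pitch=70, speed=190)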

So thus far this looks very promising.   I haven't seen any tutorials anywhere that even suggest this is possible, much less showing how to actually do it.   So this is GREAT!

This is just for the TTS right now.  I'll look into the Pocket Sphinx SRE later.  But that doesn't speak, it just deciphers what is spoken to the robot.  So that's a totally different beast.  This is why it's best not to confuse the two.  One speaks, the other one listens.   Right now I'm just focusing on TTS.

DroneBot Workshop Robotics Engineer
James


   
ReplyQuote
(@starnovice)
Member
Joined: 5 years ago
Posts: 110
 

@robo-pi

Congratulations on your progress.  It is the little successes that keep us working.  I'm looking forward to your tutorial so I don't have to repeat all of your pain :-). If I wasn't busy working on the programming for moving the DB1 I would tackle the Pocket Sphinx SRE, but maybe soon.

Pat Wicker (Portland, OR, USA)


   
ReplyQuote
Robo Pi
(@robo-pi)
Robotics Engineer
Joined: 5 years ago
Posts: 1669
Topic starter  
Posted by: @starnovice

I'm looking forward to your tutorial so I don't have to repeat all of your pain

Unfortunately the tutorial won't be happening anytime soon.  I'm doing yard work and gardening today, slipping in every once in a while to take a break.  That's when I try to learn a tad bit more about eSpeak.  I am learning more and more all the time.  But I won't produce a tutorial video until I've basically mastered everything that I would like to master about it.

Today I've just learned the difference between eSpeak and eSpeak-NG.  The NG stands for "Next Generation".  It's basically an improved version of eSpeak.  I'm still a bit confused about exactly what I'm using.  I actually installed eSpeak, but I notice that on my computer I have a lot of folders named espeak-ng.  I don't know whether that was already on the computer, or whether it came in when I installed espeak.  So still lots to learn.

If you beat me to Pocket Sphinx I hope  you'll share your pain in a tutorial video as well. 😊 

Like I say, I have it installed and working (very well I might add) from the command line.  My next step with Pocket Sphinx is to learn how to access it from Python.  From what I've seen it is possible, but thus far I haven't been able to do it.

For example here's a link that claims to be using it with Python: Pocketsphinx Python

But in their code they have the following:

from pocketsphinx import LiveSpeech

for phrase in LiveSpeech():
    print(phrase)

But when I try that I just get an error message saying that it can't find the module pocketsphinx.

So apparently I'm doing something wrong.

But clearly it's possible to import LiveSpeech from pocketsphinx in Python.  So I just need to figure out why my Python can't find the pocketsphinx module.
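
My best guess at the moment is that the module just isn't installed for the interpreter I'm actually running.  A quick check like this should tell me (pocketsphinx is the package name on PyPI, so something like "python3 -m pip install pocketsphinx" ought to be the fix if it's missing):

import importlib.util
import sys

# Which Python am I actually running, and can it see pocketsphinx?
print("Interpreter:", sys.executable)
spec = importlib.util.find_spec("pocketsphinx")
print("pocketsphinx found at:", spec.origin if spec else "not installed")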

If I can get it to import LiveSpeech I think I'll be OK, at least in terms of being able to access pocketsphinx from Python.  Once I finally have it working in Python I can move on to looking into dictionaries, etc.

DroneBot Workshop Robotics Engineer
James


   
ReplyQuote
Robo Pi
(@robo-pi)
Robotics Engineer
Joined: 5 years ago
Posts: 1669
Topic starter  

UPDATE April 17, 2020 on eSpeak and MBROLA

This has been like pulling teeth, to be sure!  But fortunately I learned quite a bit and ended up with a system that I'm happy with.

The Early Mass Confusion

In the beginning I basically didn't know much at all about this TTS system.  All I knew were the following criteria, which made it attractive for my Linguistic AI project.

  • eSpeak is independent of the Internet and runs entirely on the local computer.
  • eSpeak is designed for small computer systems such as SBCs.
  • eSpeak can be used on all platforms from Windows, to MacOS and Linux.
  • eSpeak also runs on ARM64 processors.
  • eSpeak is fully configurable in terms of language rules and word dictionaries.
  • eSpeak is fully configurable in terms of many grammar features.
  • eSpeak creates its own computer-synthesized voice.
  • eSpeak can also be used with MBROLA voices which are very nice.

In the beginning I was running eSpeak from Python using pyttsx3.  But I later discovered that there are features available in eSpeak that pyttsx3 actually inhibits.  So I have since dropped pyttsx3 as the Python module.  Instead I've chosen to use the Python subprocess library, which allows eSpeak to be driven via its command line from Python without having to use os.system.
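
For anyone curious, the subprocess approach boils down to something like this (a minimal sketch; mb-en1 is just the MBROLA voice I happen to be using, so substitute whichever voice you have installed):

import subprocess

def speak(text, voice='mb-en1', speed=140):
    """Drive the espeak command line directly, here with an MBROLA voice."""
    subprocess.run(['espeak', '-v', voice, '-s', str(speed), text])

speak("Hello, this is the MBROLA voice speaking.")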

eSpeak-NG 

I was told that I would "need" to use eSpeak-ng which is supposedly an upgraded version of eSpeak.  However, after having installed it I found that I was having more problems with eSpeak-ng than I had with the original eSpeak.  So I'm not yet convinced that the so-called "upgrade" is necessary or even desirable. So I've gone back to using the earlier version of eSpeak.

The MBROLA voices

I finally got eSpeak to use an MBROLA voice.  That too was like pulling teeth but it was well worth it.  The MBROLA voice sounds much better than the standard voice that comes with eSpeak.  So now I have the following system up and running.

  • eSpeak
  • Python interface using subprocess to access eSpeak
  • and a nicely configured MBROLA voice.

The eSpeak Dictionary Files

Perhaps the greatest eSpeak feature for my project is the totally configurable set of language dictionary files.  The language dictionary consists of two text files and a compiled binary file:

  • lang_rules
  • lang_list
  • lang_dict

The lang_rules File

The lang_rules file is a text file that can be modified to your heart's content.  This file allows eSpeak to pronounce words based purely on how they are spelled, as well as on whether it thinks they are a noun, verb, adjective, etc.

Exactly how the rules file works would take a very long time to learn.  But the fact that it's so comprehensive makes it perfect for an AI project.  Obviously, the English language (which is what I'm using) does not lend itself to an easy or perfect translation between how words are spelled and how they are pronounced.  The English language simply has too many exceptions to the rules.  Also, some words can be spoken in different ways; for example, the word "read" can be spoken as "reed" or as "red".  The rules file does allow for some determination of past and future tense based on a larger context.  But as I say, you'd almost need a Ph.D.-level understanding of how this file actually works to improve upon it.  Apparently that's also what many people are working on.  Perhaps that's where eSpeak-ng excels?

In any case, the rules file allows eSpeak to attempt to speak based just on what is written, so that's nice.  However, there is a second file that allows specific words to be listed:

The lang_list File

The lang_list file is indeed a list.  It's a list of words with specific pronunciations.  It even allows you to give a word different pronunciations depending on whether it's being used as a noun or a verb, or in past, present, or future tense, based on the larger context.  So this is pretty powerful as well.

The lang_dict File

The lang_dict file is simply a compiled version of the lang_rules and lang_list files.  It is compiled into a binary form that eSpeak can then use.  So we modify the lang_rules and lang_list files for our purposes, as they are just text files, and then compile them into a working lang_dict file.
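
As a rough sketch of the workflow for English (where the actual files are en_rules, en_list, and en_dict), it looks something like the following.  The phoneme string here is only my guess at the mnemonics; espeak -x or espeakedit will give you the real ones, and on my machine the compile step needs write access to espeak's data directory:

import subprocess

# 1. Add (or edit) an entry in en_list: a word followed by its phoneme string.
#    The phoneme string below is only a placeholder guess for "ah-LEE-sha".
with open('en_list', 'a') as f:
    f.write("alysha    @l'i:S@\n")

# 2. Recompile the dictionary.  espeak's --compile option rebuilds en_dict from
#    the en_rules and en_list files in the current directory (this may need
#    sudo, since en_dict is written into espeak's data directory).
subprocess.run(['espeak', '--compile=en'])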

Why this is Great for an AI Project

The ultimate goal I have in mind is to have the robot modify its own lang_rules and lang_list file as it grows.   So this is why this particular system is so attractive to me.  It's more than just a "canned system" of predefined phrases.   This system actually operates on grammar rules, along with allowing for a dictionary to be included as the robot learns new words.

There's Even More! - The eSpeak Edit Software

There is a program called espeakedit which allows you to modify the phonemes for individual words.  These can then be added to the lang_list file so you can make the robot speak words precisely as you would like it to speak them.  This is especially useful for personal names, which may not sound correct if left up to the grammar rules.

The espeakedit software also breaks down each word graphically and displays the sound of each phoneme, which you can then modify in terms of duration, pitch, amplitude, accent, and more.  You can also use this program to modify the fundamental phonemes that are used by the lang_rules file.

So there's clearly a lot to this eSpeak utility.  It appears to be precisely what I was hoping that it might be. 😊 

~~~~~~

Now that I have everything working, I'll need to make some videos on this.  As you can probably tell from everything above, there is enough material to create an entire series of videos, and that's just on eSpeak alone.  The actual Linguistic AI project that I'm hoping to create is a totally separate thing entirely.

Now that I have eSpeak figured out to some degree, and up and running in Python with a nice MBROLA voice, I'm going to move over to Pocket Sphinx and take a look at what is required to get that all ironed out and ready to use. Pocket Sphinx will allow the robot to recognize spoken words.

Then once I have both the TTS (text-to-speech) and the SRE (speech recognition engine) all set up, I can finally move forward to setting up my Linguistic AI system.

DroneBot Workshop Robotics Engineer
James


   
ReplyQuote
codecage
(@codecage)
Member Admin
Joined: 5 years ago
Posts: 1037
 

@robo-pi

WOW!  Great work. Can't wait to see a tutorial!

Can you give a very abbreviated set of steps on where to start, like what to download and install and in what order, as if you were starting over and wanted to skip all the wrong turns you made on your initial journey?

I had not gotten as far as thinking about my robot speaking yet, but your post has really piqued my interest!

SteveG


   
ReplyQuote
codecage
(@codecage)
Member Admin
Joined: 5 years ago
Posts: 1037
 

@robo-pi

Just couldn't wait!  I downloaded and installed espeak using the defaults and have played around just a little.  I'm not really happy with the sound of the voices, so I thought I'd look into the MBROLA addition, but all of the links to MBROLA in the espeak documents give me a "Bad Gateway" error.  I saw a couple of different GitHub sites related to MBROLA, but decided to wait for a couple of pointers from you before proceeding.

Having more fun than a barrel of monkeys!

SteveG


   
ReplyQuote
Page 1 / 4