The Sound of Learning: Using OpenAI’s Text-to-Speech API for a Simple Dictation Game for Kids

3 min read · Feb 24, 2024

Update (March 1, 2024): The code on GitHub can now optionally use the Google Text-to-Speech API as well, which produces much better results.

My elementary school kids regularly get English words to practice at home for dictation. I decided to make this into a fun little game built with Python that uses OpenAI’s Text-to-Speech API to convert words into speech and then helps the kids practice their dictation.

Configuring the practice words

To keep the game configurable, I started with a simple YAML file that holds the game configuration, including the word list. OpenAI’s API supports multiple voice options, and I wanted to be able to switch between them easily.

Reading the YAML configuration is trivial with the PyYAML package (https://pyyaml.org/).
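To make that concrete, here is a minimal sketch of such a configuration and how it could be read. The key names (model, voice, words_per_game, words) and the defaults are my own assumptions for illustration, not necessarily what the project's actual file uses; only yaml.safe_load comes from PyYAML itself.

```python
import yaml  # provided by the PyYAML package

# Hypothetical configuration: every key name below is an assumption
# made for this example.
EXAMPLE_CONFIG = """
model: tts-1-hd
voice: echo
words_per_game: 3
words:
  - jam
  - fin
  - ship
  - cloud
"""


def load_config(text):
    """Parse YAML text and check the fields the game relies on."""
    config = yaml.safe_load(text)
    if not isinstance(config, dict):
        raise ValueError("configuration must be a YAML mapping")
    words = config.get("words")
    if not words or not all(isinstance(w, str) for w in words):
        raise ValueError("'words' must be a non-empty list of strings")
    # Fall back to assumed defaults when optional keys are missing.
    config.setdefault("model", "tts-1-hd")
    config.setdefault("voice", "alloy")
    config.setdefault("words_per_game", min(5, len(words)))
    return config


config = load_config(EXAMPLE_CONFIG)
print(config["voice"], config["words"])
```

One thing to watch with word lists in YAML: PyYAML reads unquoted words such as no, yes, or on as booleans, so a practice word like "no" has to be written with quotes. The string check in load_config would catch that mistake.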

Converting the words to speech

OpenAI provides a Python library, openai, that makes it easy to interact with all of their APIs. The package only needs the OPENAI_API_KEY environment variable to be set to an API key, which can be obtained from https://platform.openai.com/api-keys. Passing each word from the word list to the client.audio.speech.with_streaming_response.create API streams back an MP3 file for that word. To avoid repeated invocations of the API, I cached the MP3 files, although the whole operation is quite inexpensive: each word is only a few characters, and the tts-1-hd model costs only $0.030 per 1K characters.
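The caching idea can be sketched roughly as follows. This is an illustration under my own assumptions rather than the exact code in the repository: the cache directory layout and the helper names cache_path, synthesize, and estimate_cost are hypothetical. The streaming call itself is the one named above.

```python
from pathlib import Path

PRICE_PER_1K_CHARS = 0.030  # USD, the tts-1-hd price quoted in this post


def cache_path(word, voice, cache_dir):
    """Return where the MP3 for this word and voice is (or will be) stored."""
    # Assumed layout: <cache_dir>/<voice>/<word>.mp3. This presumes the
    # words are plain tokens that are safe to use as file names.
    return Path(cache_dir) / voice / f"{word.lower()}.mp3"


def synthesize(word, voice="echo", model="tts-1-hd", cache_dir="cache"):
    """Return the path to an MP3 of `word`, calling the API only on a cache miss."""
    path = cache_path(word, voice, cache_dir)
    if path.exists():
        return path  # cache hit: no API call, no cost

    # Imported here so that cache hits need neither the SDK nor an API key.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".part")  # avoid caching a half-written file
    with client.audio.speech.with_streaming_response.create(
        model=model, voice=voice, input=word
    ) as response:
        response.stream_to_file(tmp)
    tmp.replace(path)
    return path


def estimate_cost(words):
    """Rough cost in USD of synthesizing every word once."""
    return sum(len(w) for w in words) / 1000 * PRICE_PER_1K_CHARS
```

Calling synthesize("jam") twice would hit the API only the first time. As a sense of scale, a list of 20 five-letter words is 100 characters, which estimate_cost puts at roughly $0.003.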

Playing the dictation game

With the word list and the speech files, playing the dictation game is simple. Select a few words for the game, play the MP3 file for each one, and get the kids to type in their responses. For the MP3 playback, I used the Pygame Mixer API (https://www.pygame.org/docs/ref/mixer.html).
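A stripped-down version of that loop might look like the sketch below. The function names play_mp3 and run_round, and the way results are collected, are my own choices for illustration. Only the pygame.mixer calls are taken from the Pygame documentation.

```python
import random
import time


def play_mp3(path):
    """Play an MP3 file with Pygame's mixer and block until it finishes."""
    import pygame  # imported here so the game logic runs without audio

    if not pygame.mixer.get_init():
        pygame.mixer.init()
    pygame.mixer.music.load(str(path))
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        time.sleep(0.1)  # poll until playback is done


def run_round(words, count, speak, ask=input, rng=random):
    """Dictate `count` randomly chosen words and return the results.

    `speak` is any callable that says a word out loud; `ask` prompts
    for the typed answer. Both are parameters so they can be swapped.
    """
    chosen = rng.sample(list(words), min(count, len(words)))
    results = []
    for word in chosen:
        speak(word)
        answer = ask("Type the word you heard: ").strip().lower()
        results.append((word, answer, answer == word.lower()))
    return results
```

Assuming the MP3s live in a cache folder named after each word, a round could be started with run_round(words, 3, speak=lambda w: play_mp3(f"cache/{w}.mp3")). Keeping the audio behind the speak parameter also makes it easy to try a different TTS backend later.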

Putting it all together, here’s what it looks like:

Problems with OpenAI TTS

While my kids enjoy this method of practicing dictation, I found that OpenAI’s TTS API isn’t great for short words, even with the HD model. In some cases the words sound a bit truncated, especially very short ones, and some of the pronunciations are just “wrong”.

Here’s the word “jam” using the Echo voice.

The word “jam” converted by OpenAI’s TTS HD API with the Echo voice

And the word “fin” with the same voice.

The word “fin” converted by OpenAI’s TTS HD API with the Echo voice

Perhaps using some other TTS service might produce better results here.

The full code for this project is on GitHub.

GitHub - arunkv/dictation: Simple dictation game that uses OpenAI's TTS API
