Imagine being able to turn any text into audio with a natural and realistic voice, all running locally on your machine, without relying on cloud services or paid APIs. That’s exactly what Kokoro-82M offers.
Kokoro is an open-weight Text-to-Speech (TTS) model with only 82 million parameters. Despite being lightweight, it delivers audio quality comparable to that of much larger models, while being significantly faster and cheaper to run. Released under the Apache 2.0 license, it can be used in both personal projects and production environments.
In this tutorial, we’ll create a simple Python script that converts text into a .wav audio file using Kokoro with Brazilian Portuguese voices.
Prerequisites
This tutorial was tested with Python 3.12.12 on macOS, but it should work on any operating system with Python 3.10 to 3.12.
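The next steps assume a virtual environment is active. If you don't have one yet, a minimal setup with Python's built-in venv module looks like this (the .venv folder name is just a convention):

python3 -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate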
Installing the Python dependencies
With your virtual environment active, install the required libraries:
pip install kokoro soundfile
The kokoro package is the model’s inference library, and soundfile is responsible for saving the generated audio to a file.
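One caveat: for non-English languages such as Portuguese, Kokoro's grapheme-to-phoneme step may also depend on the espeak-ng system package. If generation later fails with a phonemizer-related error, installing it through your system's package manager usually solves it:

brew install espeak-ng          # macOS
sudo apt-get install espeak-ng  # Debian/Ubuntu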
Available voices for Brazilian Portuguese
Kokoro offers three voices for pt-br:
| Voice | Gender |
|---|---|
| pf_dora | Female |
| pm_alex | Male |
| pm_santa | Male |
The p prefix indicates the language (Brazilian Portuguese) and the following letter indicates the gender (f for female, m for male).
The code
Create a file called tts_kokoro.py and paste the following code:
import warnings

# These filters must run before importing kokoro (see explanation below).
warnings.filterwarnings("ignore", message="dropout option adds dropout")
warnings.filterwarnings("ignore", message=".*weight_norm.*is deprecated")

from kokoro import KPipeline
import soundfile as sf
import numpy as np

# Pipeline for Brazilian Portuguese ("p") with the Kokoro-82M weights.
pipe = KPipeline(lang_code="p", repo_id="hexgrad/Kokoro-82M")

# Generate the audio chunk by chunk and accumulate the segments.
audio = []
for _, _, chunk in pipe(
    "A inteligência artificial avançou significativamente, "
    "oferecendo, hoje em dia, vozes extremamente realistas.",
    voice="pf_dora",
):
    audio.append(chunk)

# Concatenate the segments and save them as WAV at the model's native 24 kHz.
audio = np.concatenate(audio)
sf.write("saida.wav", audio, 24000)

print("Audio generated successfully: saida.wav")
Run it with:
python tts_kokoro.py
On the first run, the model will be automatically downloaded from Hugging Face (around 330 MB). On subsequent runs, it will be loaded from the local cache.
The result will be a saida.wav file with the generated audio.
Understanding the code
Let’s go through each part of the script.
Silencing PyTorch warnings
import warnings
warnings.filterwarnings("ignore", message="dropout option adds dropout")
warnings.filterwarnings("ignore", message=".*weight_norm.*is deprecated")
These filters must come before importing Kokoro. The model internally uses PyTorch components that emit two harmless warnings: one about dropout configuration in LSTM layers and another about the weight_norm function being replaced by a newer version. Neither affects the model’s functionality, but they clutter the terminal output.
Initializing the pipeline
from kokoro import KPipeline
pipe = KPipeline(lang_code="p", repo_id="hexgrad/Kokoro-82M")
The lang_code="p" parameter indicates that we’ll be using Brazilian Portuguese. Each language supported by Kokoro has its own code, such as "a" for American English, "j" for Japanese, and "z" for Mandarin. The repo_id points to the model’s repository on Hugging Face.
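As a quick illustration (not required for this tutorial), switching languages is just a matter of passing a different lang_code and a voice whose prefix matches it. The sketch below assumes the American English voice af_heart shipped with the model and reuses sf and np from the script above; some languages, such as Japanese and Mandarin, need extra dependencies.

# Illustrative example: American English instead of Brazilian Portuguese.
en_pipe = KPipeline(lang_code="a", repo_id="hexgrad/Kokoro-82M")
en_audio = np.concatenate(
    [chunk for _, _, chunk in en_pipe("Hello from Kokoro!", voice="af_heart")]
)
sf.write("hello.wav", en_audio, 24000)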
Generating the audio
audio = []
for _, _, chunk in pipe("Your text here...", voice="pf_dora"):
    audio.append(chunk)
The pipeline processes the text in chunks and returns an iterator with three values per iteration: the graphemes, the phonemes, and the audio for that segment. Since we only need the audio, we ignore the first two values with _. Each chunk is accumulated in the list and then concatenated with np.concatenate().
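If you want to see what the model is actually pronouncing, keep the first two values instead of discarding them. A small sketch, reusing the pipe object created above:

# Print the text and the phoneme sequence of each generated segment.
for graphemes, phonemes, chunk in pipe("Olá, mundo!", voice="pf_dora"):
    print("Text:    ", graphemes)
    print("Phonemes:", phonemes)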
Saving the file
audio = np.concatenate(audio)
sf.write("saida.wav", audio, 24000)
The audio is generated at 24,000 Hz (24 kHz), which is the model’s native sample rate. soundfile saves the data in WAV format.
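For long texts, an alternative is to skip the concatenation and write one file per chunk as it is produced; the file naming below is just a suggestion:

# Variation: one WAV file per generated chunk (saida_0.wav, saida_1.wav, ...).
for i, (_, _, chunk) in enumerate(pipe("Your text here...", voice="pf_dora")):
    sf.write(f"saida_{i}.wav", chunk, 24000)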
Experimenting with voices
To switch the voice, simply change the voice parameter in the pipeline call (and start again from an empty audio list so segments from different voices don't get mixed). Try all three available voices to find the one that best suits your project:
for _, _, chunk in pipe("Testando a voz masculina do Alex.", voice="pm_alex"):
    audio.append(chunk)
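To compare all three quickly, you can also generate one file per voice in a loop, reusing pipe, sf, and np from the script above (the output file names are just a suggestion):

# Generates teste_pf_dora.wav, teste_pm_alex.wav and teste_pm_santa.wav.
for voice_name in ["pf_dora", "pm_alex", "pm_santa"]:
    parts = [chunk for _, _, chunk in pipe("Testando as vozes do Kokoro.", voice=voice_name)]
    sf.write(f"teste_{voice_name}.wav", np.concatenate(parts), 24000)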
Conclusion
With just a few lines of code and no external services, we were able to generate high-quality audio in Brazilian Portuguese using Kokoro-82M. The model is lightweight, fast, and free, making it an excellent choice for accessibility projects, voice assistants, automated content narration, and much more.
To explore all available voices and languages, check out the official Kokoro repository on Hugging Face.