Official Python Package for Supertonic

Supertonic-3: Multilingual synthesis across 31 languages plus a na fallback for text whose language is unknown or outside the supported set.

Quick Start

pip install supertonic

Python

Every parameter is annotated inline, so the snippet doubles as copy-and-paste documentation for an LLM assistant:

from supertonic import TTS

# Note: first run downloads the model (~400MB) into ~/.cache/supertonic3/
tts = TTS(auto_download=True)       # Initialize TTS engine

style = tts.get_voice_style(voice_name="M1")   # 10 built-in voices: M1–M5, F1–F5

wav, duration = tts.synthesize(
    text="Supertonic is a lightning fast, on-device TTS system.",
    voice_style=style,              # Voice style object
    total_steps=8,                  # Quality: 5 (low) to 12 (high), default 8
    speed=1.05,                     # Speed: 0.7 (slow) to 2.0 (fast)
    max_chunk_length=300,           # Max characters per chunk (auto: 120 for Korean)
    silence_duration=0.3,           # Silence between chunks (seconds)
    lang="en",                      # ISO code; see "Supported Languages" below
    verbose=False,                  # Show detailed progress (default: False)
)
tts.save_audio(wav, "output.wav")

# Multilingual: just swap `lang` and the input text
wav_ko, _ = tts.synthesize("회의는 잠시 후에 시작되며 모두가 자리에 앉아 기다립니다.", voice_style=style, lang="ko")
wav_es, _ = tts.synthesize("La reunión comienza pronto y todos se sientan en silencio para escuchar.", voice_style=style, lang="es")
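
The inline comments in the snippet give a range for each tuning parameter. As a minimal sketch, those ranges can be checked before calling synthesize(). SynthesisParams and as_kwargs() are hypothetical names introduced here, not part of the supertonic package, and this page does not say whether the library enforces the ranges itself. The defaults simply mirror the values used in the example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SynthesisParams:
    """Hypothetical container for the tuning options shown in the example."""

    total_steps: int = 8           # documented range: 5 (low) to 12 (high)
    speed: float = 1.05            # documented range: 0.7 (slow) to 2.0 (fast)
    max_chunk_length: int = 300    # max characters per chunk
    silence_duration: float = 0.3  # seconds of silence between chunks

    def __post_init__(self) -> None:
        # Reject values outside the ranges stated in the inline comments.
        if not 5 <= self.total_steps <= 12:
            raise ValueError(f"total_steps must be in 5..12, got {self.total_steps}")
        if not 0.7 <= self.speed <= 2.0:
            raise ValueError(f"speed must be in 0.7..2.0, got {self.speed}")
        if self.max_chunk_length <= 0:
            raise ValueError("max_chunk_length must be positive")
        if self.silence_duration < 0:
            raise ValueError("silence_duration must not be negative")

    def as_kwargs(self) -> dict:
        """Return the options as keyword arguments for a synthesize() call."""
        return asdict(self)


params = SynthesisParams(total_steps=10, speed=1.2)
```

Assuming the tts and style objects from the example, the validated values could then be forwarded with `**params.as_kwargs()` alongside text, voice_style, and lang.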

Custom voices (Voice Builder)

get_voice_style() loads one of the ten built-in voices (M1–M5, F1–F5). To use a voice created in Voice Builder (zero-shot cloning from a short reference clip), pass its JSON export to get_voice_style_from_path():

# Any voice-style JSON works here:
#   - a Voice Builder export, or
#   - one of the bundled defaults at
#     ~/.cache/supertonic3/voice_styles/{M1..M5,F1..F5}.json
#     (downloaded alongside the model on first run)
style = tts.get_voice_style_from_path("~/voices/my_voice.json")
wav, _ = tts.synthesize("Hello in my own cloned voice.", voice_style=style, lang="en")
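
This page does not say whether get_voice_style_from_path() expands "~" or how it reports a missing file. The defensive sketch below resolves the path and confirms the file parses as JSON before handing it to the library. resolve_voice_style_path is a hypothetical helper, and it deliberately does not inspect the JSON contents because the export schema is not described here.

```python
import json
from pathlib import Path


def resolve_voice_style_path(path: str) -> str:
    """Expand '~', then confirm the file exists and contains valid JSON."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"voice style file not found: {resolved}")
    with resolved.open(encoding="utf-8") as fh:
        # json.JSONDecodeError is a subclass of ValueError.
        json.load(fh)
    return str(resolved)
```

The returned absolute path can then be passed to get_voice_style_from_path() in place of the raw "~/voices/my_voice.json" string.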

CLI

# Note: first run will download the model (~400MB) from HuggingFace
supertonic tts 'Supertonic is a lightning fast, on-device TTS system.' -o output.wav

# Pick a built-in voice and bump quality
supertonic tts 'Use a different voice.' -o output.wav --voice F1 --steps 10

# Use a custom voice: a Voice Builder export, or a bundled
# ~/.cache/supertonic3/voice_styles/*.json file
supertonic tts 'Hello in my own cloned voice.' -o output.wav \
  --custom-style-path ~/voices/my_voice.json

# Multilingual support: each language with natural text handling
supertonic tts '회의는 잠시 후에 시작되며 모두가 자리에 앉아 기다립니다.' -o korean.wav --lang ko
supertonic tts 'La reunión comienza pronto y todos se sientan en silencio para escuchar.' -o spanish.wav --lang es
supertonic tts 'A reunião começa em breve e todos se sentam em silêncio para ouvir.' -o portuguese.wav --lang pt
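
When the CLI is driven from a script, building the argument list in Python avoids shell-quoting problems with non-ASCII or quoted text. The sketch below uses only the flags shown in the commands above (-o, --voice, --steps, --lang, --custom-style-path). build_tts_command and run_tts are hypothetical helper names, and the sketch assumes the supertonic executable is on PATH.

```python
import subprocess
from typing import List, Optional


def build_tts_command(
    text: str,
    output: str,
    voice: Optional[str] = None,
    steps: Optional[int] = None,
    lang: Optional[str] = None,
    custom_style_path: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `supertonic tts` from the documented flags."""
    cmd = ["supertonic", "tts", text, "-o", output]
    if voice is not None:
        cmd += ["--voice", voice]
    if steps is not None:
        cmd += ["--steps", str(steps)]
    if lang is not None:
        cmd += ["--lang", lang]
    if custom_style_path is not None:
        cmd += ["--custom-style-path", custom_style_path]
    return cmd


def run_tts(text: str, output: str, **options) -> None:
    """Run the CLI without a shell; raises CalledProcessError on failure."""
    subprocess.run(build_tts_command(text, output, **options), check=True)
```

Because no shell is involved, a leading "~" in custom_style_path is not expanded automatically; expand it first with os.path.expanduser() if needed.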

Local HTTP server

Run Supertonic as a thin local HTTP wrapper for n8n, browser extensions, Electron apps, Unity, Home Assistant, or anything that already speaks the OpenAI Audio Speech API:

pip install 'supertonic[serve]'
supertonic serve --host 127.0.0.1 --port 7788

# Native endpoint
curl -X POST http://127.0.0.1:7788/v1/tts \
  -H 'content-type: application/json' \
  -d '{"text":"Supertonic is a lightning fast, on-device TTS system.","voice":"M1","lang":"en"}' \
  -o output.wav

See Local Server for the OpenAI-compatible alias, Voice Builder custom-voice import, and the batch endpoint.
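
The curl call to the native endpoint can be mirrored from Python with the standard library alone. This is a sketch under stated assumptions: it reuses the /v1/tts path and the text, voice, and lang fields from the curl example, and it assumes the response body is the WAV data, as the `-o output.wav` flag suggests. build_tts_request and synthesize_to_file are hypothetical names, and error handling for non-200 responses is left out.

```python
import json
import urllib.request


def build_tts_request(
    text: str,
    voice: str = "M1",
    lang: str = "en",
    base_url: str = "http://127.0.0.1:7788",
) -> urllib.request.Request:
    """Build the POST request for the native /v1/tts endpoint."""
    payload = json.dumps({"text": text, "voice": voice, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/tts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str, timeout: float = 60.0, **options) -> None:
    """Send the request and write the response body (assumed WAV) to disk."""
    request = build_tts_request(text, **options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
```

With `supertonic serve` running on the host and port shown above, synthesize_to_file("Hello.", "output.wav") would be the equivalent of the curl command.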

Get Started with the Full Guide

Explore installation options, voice customization, and advanced configuration.

Requirements

Supertonic has minimal dependencies - just 4 core libraries:

  • onnxruntime - Fast ONNX model inference
  • numpy - Numerical operations
  • soundfile - Audio file I/O
  • huggingface-hub - Model downloads

✨ Highlights

⚡ Blazingly Fast: Low-latency, real-time synthesis across desktop, browser, mobile, and edge, fast enough to turn an entire webpage into audio in under a second

🌍 31-Language Multilingual: Synthesize directly from text across 31 languages, or pass lang="na" to let Supertonic process the text language-agnostically when you don't know the input language. No separate language adapters are needed

🪶 99M-Parameter Open-Weight Model: A compact, fully open-weight checkpoint, a fraction of the size of 0.7B–2B class open TTS systems, for smaller downloads, faster cold starts, and lower memory footprint

📱 Edge-Device Ready: Runs locally on desktop, mobile, browsers, and resource-constrained hardware like Raspberry Pi or e-readers, with zero network dependency, complete privacy, and no GPU required

🔊 44.1kHz High-Quality Audio: Outputs studio-grade 44.1kHz 16-bit WAV directly, ready for production playback without any external upsampler

🎭 Expression Tags: 10 inline tags (e.g. <laugh>, <breath>, <sigh>) bring natural human nuance into generated speech without prompt engineering or reference audio

🛠️ Multi-Runtime SDKs: Ready-to-use examples through ONNX Runtime across Python, Node.js, Browser (WebGPU), Java, C++, C#, Go, Swift, iOS, Rust, and Flutter

Supported Languages

Supertonic-3 supports the following 31 ISO codes, plus a special na fallback for unknown / unsupported languages:

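| Code | Language | Code | Language | Code | Language | Code | Language |
|------|----------|------|----------|------|----------|------|----------|
| en | English | ko | Korean | ja | Japanese | ar | Arabic |
| bg | Bulgarian | cs | Czech | da | Danish | de | German |
| el | Greek | es | Spanish | et | Estonian | fi | Finnish |
| fr | French | hi | Hindi | hr | Croatian | hu | Hungarian |
| id | Indonesian | it | Italian | lt | Lithuanian | lv | Latvian |
| nl | Dutch | pl | Polish | pt | Portuguese | ro | Romanian |
| ru | Russian | sk | Slovak | sl | Slovenian | sv | Swedish |
| tr | Turkish | uk | Ukrainian | vi | Vietnamese | na | unknown / fallback |
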
# Pick any supported code, or use 'na' for text whose language is unknown
wav, _ = tts.synthesize("Some uncommon text.", voice_style=style, lang="na")
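
When the language code comes from user input or a locale string, it can be mapped onto the supported set before synthesis. The sketch below copies the 31 codes from the table and falls back to "na" for anything else. resolve_lang is a hypothetical helper, and stripping a region suffix such as "pt-BR" down to "pt" is an assumption of this sketch, not documented library behavior.

```python
from typing import Optional

# The 31 ISO codes listed in the table above.
SUPPORTED_LANGS = frozenset({
    "en", "ko", "ja", "ar", "bg", "cs", "da", "de",
    "el", "es", "et", "fi", "fr", "hi", "hr", "hu",
    "id", "it", "lt", "lv", "nl", "pl", "pt", "ro",
    "ru", "sk", "sl", "sv", "tr", "uk", "vi",
})
FALLBACK_LANG = "na"


def resolve_lang(code: Optional[str]) -> str:
    """Return a supported code, or 'na' when the code is missing or unsupported."""
    if not code:
        return FALLBACK_LANG
    # Normalize case and drop any region suffix ("pt-BR", "en_US").
    primary = code.strip().lower().replace("_", "-").split("-")[0]
    return primary if primary in SUPPORTED_LANGS else FALLBACK_LANG
```

The result can be passed directly as the lang argument, for example lang=resolve_lang(user_locale), where user_locale is whatever locale string the application has.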

Performance Benchmarks

Metrics:
  • Characters per Second: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
  • Real-time Factor (RTF): Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
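
Both metrics follow directly from the definitions above and can be sketched in a few lines. The benchmark() helper and its synthesize callable are hypothetical: the callable is assumed to run one synthesis and return the duration of the generated audio in seconds, which is not necessarily how the figures on this page were measured.

```python
import time
from typing import Callable, Tuple


def chars_per_second(num_chars: int, synthesis_seconds: float) -> float:
    """Throughput: input characters divided by synthesis time. Higher is better."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return num_chars / synthesis_seconds


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: synthesis time divided by audio duration. Lower is better."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def benchmark(synthesize: Callable[[str], float], text: str) -> Tuple[float, float]:
    """Time one call and return (characters per second, real-time factor)."""
    start = time.perf_counter()
    audio_seconds = synthesize(text)
    elapsed = time.perf_counter() - start
    return chars_per_second(len(text), elapsed), real_time_factor(elapsed, audio_seconds)


# 0.1 s of compute for 1 s of audio gives an RTF of 0.1, as in the definition.
rtf = real_time_factor(0.1, 1.0)
```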

Characters per Second

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|------------------|-----------------|------------------|
| Supertonic (M4 pro - CPU) | 912 | 1048 | 1263 |
| Supertonic (M4 pro - WebGPU) | 996 | 1801 | 2509 |
| Supertonic (RTX4090) | 2615 | 6548 | 12164 |
| API: ElevenLabs Flash v2.5 | 144 | 209 | 287 |
| API: OpenAI TTS-1 | 37 | 55 | 82 |
| API: Gemini 2.5 Flash TTS | 12 | 18 | 24 |
| API: Supertone Sona speech 1 | 38 | 64 | 92 |
| Open: Kokoro | 104 | 107 | 117 |
| Open: NeuTTS Air | 37 | 42 | 47 |

Notes:
  • API = Cloud-based API services (measured from Seoul)
  • Open = Open-source models
  • Supertonic (M4 pro - CPU) and (M4 pro - WebGPU): Tested with ONNX
  • Supertonic (RTX4090): Tested with PyTorch model
  • Kokoro: Tested on M4 Pro CPU with ONNX
  • NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF

Real-time Factor

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|------------------|-----------------|------------------|
| Supertonic (M4 pro - CPU) | 0.015 | 0.013 | 0.012 |
| Supertonic (M4 pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| Supertonic (RTX4090) | 0.005 | 0.002 | 0.001 |
| API: ElevenLabs Flash v2.5 | 0.133 | 0.077 | 0.057 |
| API: OpenAI TTS-1 | 0.471 | 0.302 | 0.201 |
| API: Gemini 2.5 Flash TTS | 1.060 | 0.673 | 0.541 |
| API: Supertone Sona speech 1 | 0.372 | 0.206 | 0.163 |
| Open: Kokoro | 0.144 | 0.124 | 0.126 |
| Open: NeuTTS Air | 0.390 | 0.338 | 0.343 |

Additional Performance Data (5-step inference)

Characters per Second (5-step)

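| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|------------------|-----------------|------------------|
| Supertonic (M4 pro - CPU) | 596 | 691 | 850 |
| Supertonic (M4 pro - WebGPU) | 570 | 1118 | 1546 |
| Supertonic (RTX4090) | 1286 | 3757 | 6242 |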

Real-time Factor (5-step)

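| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|------------------|-----------------|------------------|
| Supertonic (M4 pro - CPU) | 0.023 | 0.019 | 0.018 |
| Supertonic (M4 pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| Supertonic (RTX4090) | 0.011 | 0.004 | 0.002 |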

Natural Text Handling

Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.

🎧 Audio samples are easier to browse in the Interactive Demo, which collects all of the audio examples in one place.

Overview of Test Cases:

| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini | Microsoft |
|----------|----------------|------------|------------|--------|--------|-----------|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ | ❌ |
| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ | ❌ |
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ | ❌ |

Example 1: Financial Expression

"The startup secured $5.2M in venture capital, a huge leap from their initial $450K seed round."

Challenges:

  • Decimal point in currency ($5.2M should be read as "five point two million")
  • Abbreviated magnitude units (M for million, K for thousand)
  • Currency symbol ($) that needs to be properly pronounced as "dollars"

Audio Samples:

| System | Result |
|--------|--------|
| Supertonic | ✅ |
| ElevenLabs Flash v2.5 | ❌ |
| OpenAI TTS-1 | ❌ |
| Gemini 2.5 Flash TTS | ❌ |
| VibeVoice Realtime 0.5B | ❌ |

Example 2: Time and Date

"The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."

Challenges:

  • Time expression with PM notation (4:45 PM)
  • Abbreviated weekday (Wed)
  • Abbreviated month (Apr)
  • Full date format (Apr 3, 2024)

Audio Samples:

| System | Result |
|--------|--------|
| Supertonic | ✅ |
| ElevenLabs Flash v2.5 | ❌ |
| OpenAI TTS-1 | ❌ |
| Gemini 2.5 Flash TTS | ❌ |
| VibeVoice Realtime 0.5B | ❌ |

Example 3: Phone Number

"You can reach the hotel front desk at (212) 555-0142 ext. 402 anytime."

Challenges:

  • Area code in parentheses that should be read as separate digits
  • Phone number with hyphen separator (555-0142)
  • Abbreviated extension notation (ext.)
  • Extension number (402)

Audio Samples:

| System | Result |
|--------|--------|
| Supertonic | ✅ |
| ElevenLabs Flash v2.5 | ❌ |
| OpenAI TTS-1 | ❌ |
| Gemini 2.5 Flash TTS | ❌ |
| VibeVoice Realtime 0.5B | ❌ |

Example 4: Technical Unit

"Our drone battery lasts 2.3h when flying at 30kph with full camera payload."

Challenges:

  • Decimal time duration with abbreviation (2.3h = two point three hours)
  • Speed unit with abbreviation (30kph = thirty kilometers per hour)
  • Technical abbreviations (h for hours, kph for kilometers per hour)
  • Technical/engineering context requiring proper pronunciation

Audio Samples:

| System | Result |
|--------|--------|
| Supertonic | ✅ |
| ElevenLabs Flash v2.5 | ❌ |
| OpenAI TTS-1 | ❌ |
| Gemini 2.5 Flash TTS | ❌ |
| VibeVoice Realtime 0.5B | ❌ |

Note: These samples demonstrate how each system handles text normalization and pronunciation of complex expressions without requiring pre-processing or phonetic annotations.

Citation

The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:

SupertonicTTS: Main Architecture

This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.

@article{kim2025supertonic,
  title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
  author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
  journal={arXiv preprint arXiv:2503.23108},
  year={2025},
  url={https://arxiv.org/abs/2503.23108}
}

Length-Aware RoPE: Text-Speech Alignment

This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.

@article{kim2025larope,
  title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
  author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
  journal={arXiv preprint arXiv:2509.11084},
  year={2025},
  url={https://arxiv.org/abs/2509.11084}
}

Self-Purifying Flow Matching: Training with Noisy Labels

This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.

@article{kim2025spfm,
  title={Training Flow Matching Models with Reliable Labels via Self-Purification},
  author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
  journal={arXiv preprint arXiv:2509.19091},
  year={2025},
  url={https://arxiv.org/abs/2509.19091}
}

🏠 Main Repository: github.com/supertone-inc/supertonic

🎧 Try it live: Hugging Face Spaces

🤗 Model Repository: Hugging Face Models

License

Code: MIT License

Model: OpenRAIL-M License

Copyright © 2025 Supertone Inc.