Official Python Package for Supertonic
Supertonic-2: Now with multilingual support! 5 languages available: English, Korean, Spanish, Portuguese, and French.
Quick Start
CLI
# Note: First run will download the model (~305MB) from HuggingFace
supertonic tts 'Supertonic is a lightning fast, on-device TTS system.' -o output.wav
# Multilingual support - each language with natural text handling
supertonic tts '회의는 잠시 후에 시작되며 모두가 자리에 앉아 기다립니다.' -o korean.wav --lang ko
supertonic tts 'La reunión comienza pronto y todos se sientan en silencio para escuchar.' -o spanish.wav --lang es
supertonic tts 'A reunião começa em breve e todos se sentam em silêncio para ouvir.' -o portuguese.wav --lang pt
supertonic tts 'La réunion commence bientôt et tout le monde s’assoit en silence pour écouter.' -o french.wav --lang fr
Python
from supertonic import TTS
# Note: First run downloads model automatically (~305MB)
tts = TTS(auto_download=True)
# Get a voice style
style = tts.get_voice_style(voice_name="M1")
# Generate speech (English - default)
text = "The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
# Multilingual synthesis - each language with natural text handling
wav_ko, _ = tts.synthesize("회의는 잠시 후에 시작되며 모두가 자리에 앉아 기다립니다.", voice_style=style, lang="ko")
wav_es, _ = tts.synthesize("La reunión comienza pronto y todos se sientan en silencio para escuchar.", voice_style=style, lang="es")
wav_pt, _ = tts.synthesize("A reunião começa em breve e todos se sentam em silêncio para ouvir.", voice_style=style, lang="pt")
wav_fr, _ = tts.synthesize("La réunion commence bientôt et tout le monde s’assoit en silence pour écouter.", voice_style=style, lang="fr")
# Save to file
tts.save_audio(wav, "output.wav")
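The Quick Start above saves only the English clip. As a minimal sketch, the helper below saves one file per language; `synthesize_all` is a hypothetical convenience function, not part of the package, and it assumes only the `synthesize` and `save_audio` calls shown above.

```python
from pathlib import Path


def synthesize_all(tts, style, texts_by_lang, out_dir="."):
    """Synthesize one clip per language and save each as <lang>.wav.

    `tts` is expected to expose synthesize() and save_audio() with the
    call shapes used in the Quick Start; this helper is hypothetical.
    Returns a dict mapping language code to the reported duration.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    durations = {}
    for lang, text in texts_by_lang.items():
        # Same call shape as the Quick Start: text, voice style, language code.
        wav, duration = tts.synthesize(text, voice_style=style, lang=lang)
        tts.save_audio(wav, str(out_dir / f"{lang}.wav"))
        durations[lang] = duration
    return durations
```

With the `tts` and `style` objects from the Quick Start, calling `synthesize_all(tts, style, {"en": "...", "ko": "..."}, out_dir="clips")` would write `clips/en.wav` and `clips/ko.wav`.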
Get Started with the Full Guide
Explore installation options, voice customization, and advanced configuration.
Requirements
Supertonic has minimal dependencies - just 4 core libraries:
- onnxruntime - Fast ONNX model inference
- numpy - Numerical operations
- soundfile - Audio file I/O
- huggingface-hub - Model downloads
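To check whether these four libraries are available in the current environment, a small script along these lines may help. It is a sketch using only the standard library; `missing_modules` is a hypothetical helper name. Note that the `huggingface-hub` distribution is imported as `huggingface_hub`.

```python
from importlib.util import find_spec

# Import names for the four packages listed above.
CORE_MODULES = ["onnxruntime", "numpy", "soundfile", "huggingface_hub"]


def missing_modules(modules=CORE_MODULES):
    """Return the names in `modules` that cannot be located for import."""
    # find_spec() locates a top-level module without importing it.
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All core dependencies found.")
```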
Key Features
⚡ Blazingly Fast: Generates speech up to 167× faster than real-time on consumer hardware (M4 Pro)
🪶 Ultra Lightweight: Only 66M parameters, optimized for efficient on-device performance
📱 On-Device Capable: Runs entirely on the local device, so text never leaves it and there is no network latency
🌐 Multilingual (v2): Supports 5 languages — English, Korean, Spanish, Portuguese, and French
🎨 Natural Text Handling: Processes complex expressions such as numbers, dates, and abbreviations without a separate grapheme-to-phoneme (G2P) module
⚙️ Highly Configurable: Adjust inference steps, speech speed, and other parameters
🧩 Flexible Deployment: Deploy across servers, browsers, and edge devices
Performance Benchmarks
- Characters per Second: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
- Real-time Factor (RTF): Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
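The two metrics above can be computed as follows. This is an illustrative sketch of the definitions, not the benchmark harness behind the tables; the function names are assumptions. The inverse of RTF gives the speed-up over real time, so an RTF of 0.006 corresponds to roughly 167× real time.

```python
def chars_per_second(num_chars, synthesis_seconds):
    """Throughput: input characters divided by time spent generating audio."""
    return num_chars / synthesis_seconds


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF: synthesis time relative to the duration of the generated audio."""
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf):
    """How many times faster than playback speed the synthesis runs."""
    return 1.0 / rtf


# The example from the definition: 0.1 s of compute per 1 s of audio.
rtf = real_time_factor(synthesis_seconds=0.1, audio_seconds=1.0)
print(f"RTF = {rtf}")
```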
Characters per Second
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 pro - CPU) | 912 | 1048 | 1263 |
| Supertonic (M4 pro - WebGPU) | 996 | 1801 | 2509 |
| Supertonic (RTX4090) | 2615 | 6548 | 12164 |
| API ElevenLabs Flash v2.5 | 144 | 209 | 287 |
| API OpenAI TTS-1 | 37 | 55 | 82 |
| API Gemini 2.5 Flash TTS | 12 | 18 | 24 |
| API Supertone Sona speech 1 | 38 | 64 | 92 |
| Open Kokoro | 104 | 107 | 117 |
| Open NeuTTS Air | 37 | 42 | 47 |
Notes:
- API = Cloud-based API services (measured from Seoul)
- Open = Open-source models
- Supertonic (M4 pro - CPU) and (M4 pro - WebGPU): Tested with ONNX
- Supertonic (RTX4090): Tested with PyTorch model
- Kokoro: Tested on M4 Pro CPU with ONNX
- NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF
Real-time Factor
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 pro - CPU) | 0.015 | 0.013 | 0.012 |
| Supertonic (M4 pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| Supertonic (RTX4090) | 0.005 | 0.002 | 0.001 |
| API ElevenLabs Flash v2.5 | 0.133 | 0.077 | 0.057 |
| API OpenAI TTS-1 | 0.471 | 0.302 | 0.201 |
| API Gemini 2.5 Flash TTS | 1.060 | 0.673 | 0.541 |
| API Supertone Sona speech 1 | 0.372 | 0.206 | 0.163 |
| Open Kokoro | 0.144 | 0.124 | 0.126 |
| Open NeuTTS Air | 0.390 | 0.338 | 0.343 |
Additional Performance Data (5-step inference)
Characters per Second (5-step)
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 pro - CPU) | 596 | 691 | 850 |
| Supertonic (M4 pro - WebGPU) | 570 | 1118 | 1546 |
| Supertonic (RTX4090) | 1286 | 3757 | 6242 |
Real-time Factor (5-step)
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 pro - CPU) | 0.023 | 0.019 | 0.018 |
| Supertonic (M4 pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| Supertonic (RTX4090) | 0.011 | 0.004 | 0.002 |
Natural Text Handling
Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.
🎧 Listen to the audio samples: the Interactive Demo presents all of the audio examples in one place.
Overview of Test Cases:
| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini | Microsoft |
|---|---|---|---|---|---|---|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ | ❌ |
| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ | ❌ |
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ | ❌ |
Example 1: Financial Expression
Challenges:
- Decimal point in currency ($5.2M should be read as "five point two million")
- Abbreviated magnitude units (M for million, K for thousand)
- Currency symbol ($) that needs to be properly pronounced as "dollars"
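For contrast, a conventional pipeline would need an explicit normalization rule for this case before synthesis. The sketch below shows what such a hand-written rule might look like. It is not part of Supertonic, which handles the raw text directly; `expand_currency` and `int_to_words` are hypothetical names, and the rule covers only dollar amounts below 1,000 with an optional `K` or `M` suffix.

```python
import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]
MAGNITUDES = {"K": "thousand", "M": "million"}

# "$", 1-3 integer digits, optional decimal part, optional K/M suffix.
_CURRENCY = re.compile(r"\$(\d{1,3})(?:\.(\d+))?([KM])?\b")


def int_to_words(n):
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + int_to_words(rest) if rest else "")


def expand_currency(text):
    """Replace amounts such as "$5.2M" with their spoken form."""
    def replace(match):
        whole, fraction, suffix = match.groups()
        words = [int_to_words(int(whole))]
        if fraction:
            # Decimal digits are read one by one after "point".
            words.append("point")
            words.extend(ONES[int(d)] for d in fraction)
        if suffix:
            words.append(MAGNITUDES[suffix])
        singular = int(whole) == 1 and not fraction and not suffix
        words.append("dollar" if singular else "dollars")
        return " ".join(words)

    return _CURRENCY.sub(replace, text)


print(expand_currency("Revenue reached $5.2M last year."))
```

Each new notation (thousands separators, currency codes, other symbols) would need another rule of this kind, which is the maintenance burden that end-to-end text handling avoids.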
Audio Samples:
| System | Result | Audio |
|---|---|---|
| Supertonic | ✅ | |
| ElevenLabs Flash v2.5 | ❌ | |
| OpenAI TTS-1 | ❌ | |
| Gemini 2.5 Flash TTS | ❌ | |
| VibeVoice Realtime 0.5B | ❌ | |
Example 2: Time and Date
Challenges:
- Time expression with PM notation (4:45 PM)
- Abbreviated weekday (Wed)
- Abbreviated month (Apr)
- Full date format (Apr 3, 2024)
Audio Samples:
| System | Result | Audio |
|---|---|---|
| Supertonic | ✅ | |
| ElevenLabs Flash v2.5 | ❌ | |
| OpenAI TTS-1 | ❌ | |
| Gemini 2.5 Flash TTS | ❌ | |
| VibeVoice Realtime 0.5B | ❌ | |
Example 3: Phone Number
Challenges:
- Area code in parentheses that should be read as separate digits
- Phone number with hyphen separator (555-0142)
- Abbreviated extension notation (ext.)
- Extension number (402)
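As an illustration of what these challenges involve, the sketch below spells out a phone number digit by digit, the way a rule-based pre-processor might. It is not Supertonic's implementation. The area code `212` in the usage line is a made-up value, and reading the extension digit by digit is an assumption about the desired output.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# "(AAA) PPP-LLLL" with an optional "ext. N" suffix.
_PHONE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})(?:\s*ext\.\s*(\d+))?")


def spell_digits(digits):
    """Read a digit string one digit at a time."""
    return " ".join(DIGIT_WORDS[int(d)] for d in digits)


def expand_phone(text):
    """Replace phone numbers in `text` with a digit-by-digit reading."""
    def replace(match):
        area, prefix, line, ext = match.groups()
        # Commas stand in for the pauses between digit groups.
        spoken = ", ".join(spell_digits(part) for part in (area, prefix, line))
        if ext:
            spoken += ", extension " + spell_digits(ext)
        return spoken

    return _PHONE.sub(replace, text)


# Hypothetical input; only 555-0142 and ext. 402 come from the list above.
print(expand_phone("Call (212) 555-0142 ext. 402 today."))
```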
Audio Samples:
| System | Result | Audio |
|---|---|---|
| Supertonic | ✅ | |
| ElevenLabs Flash v2.5 | ❌ | |
| OpenAI TTS-1 | ❌ | |
| Gemini 2.5 Flash TTS | ❌ | |
| VibeVoice Realtime 0.5B | ❌ | |
Example 4: Technical Unit
Challenges:
- Decimal time duration with abbreviation (2.3h = two point three hours)
- Speed unit with abbreviation (30kph = thirty kilometers per hour)
- Technical abbreviations (h for hours, kph for kilometers per hour)
- Technical/engineering context requiring proper pronunciation
Audio Samples:
| System | Result | Audio |
|---|---|---|
| Supertonic | ✅ | |
| ElevenLabs Flash v2.5 | ❌ | |
| OpenAI TTS-1 | ❌ | |
| Gemini 2.5 Flash TTS | ❌ | |
| VibeVoice Realtime 0.5B | ❌ | |
Note: These samples demonstrate how each system handles text normalization and pronunciation of complex expressions without requiring pre-processing or phonetic annotations.
Citation
The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:
SupertonicTTS: Main Architecture
This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.
@article{kim2025supertonic,
title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
journal={arXiv preprint arXiv:2503.23108},
year={2025},
url={https://arxiv.org/abs/2503.23108}
}
Length-Aware RoPE: Text-Speech Alignment
This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.
@article{kim2025larope,
title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
journal={arXiv preprint arXiv:2509.11084},
year={2025},
url={https://arxiv.org/abs/2509.11084}
}
Self-Purifying Flow Matching: Training with Noisy Labels
This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.
@article{kim2025spfm,
title={Training Flow Matching Models with Reliable Labels via Self-Purification},
author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
journal={arXiv preprint arXiv:2509.19091},
year={2025},
url={https://arxiv.org/abs/2509.19091}
}
Related Projects
🏠 Main Repository: github.com/supertone-inc/supertonic
🎧 Try it live: Hugging Face Spaces
🤗 Model Repository: Hugging Face Models
License
Code: MIT License
Model: OpenRAIL-M License
Copyright © 2025 Supertone Inc.
