Cartesia TTS is a highly expressive text-to-speech model focused on liveliness, human-like delivery, and strong resemblance to recorded voice actors. It is a performance-driven model that excels at natural prosody, making it especially suitable for short-form, character-led, or conversational content where energy and realism are prioritized over absolute stability. The Cartesia Sonic model is also the backbone of the accents and speech-to-speech offerings.
Key Features
- Expressive, Human-Like Delivery: Produces lively, lifelike, and natural-sounding speech that closely resembles the original voice actor’s recorded performance.
- Different Accents: Cartesia can localize any voice to a different English accent. For localized content, use Cartesia Sonic to make a voice sound local with a native American, British, Australian, or American-Southern accent.
- Speech-to-Speech Support: Supports speech-to-speech, a “voice change” operation that transforms spoken audio into another voice while preserving perfect sync (timing), melody, emphasis, and other performance characteristics.
- Main Weakness (Stability): Can be prone to glitches, unexpected noises, or minor artifacts, especially in longer (2000+ characters) or more complex generations.
Technical Capabilities
- Modalities: Text to Speech, Speech to Speech
- Custom Voice Cloning: Not supported
- Supported Settings: Speed control (0.5-1.5), Voice Effects
- Emotions available: Best Fit, Optimistic, Surprised, Sad, Angry
- Voice Tags Options:
  - Add pause: for a one-second pause, insert “<break time="1s" />” as part of the prompt.
  - Voice tags: insert “[Laughter]” as part of the prompt to make a voice laugh.
  - Nuanced emotions: insert “<emotion value="emotion" />” as part of the prompt, replacing “emotion” with one of the supported emotions to get a more nuanced performance.
- Accents Available: American, British, Australian, Indian
- Languages Available: English, French, German, Portuguese, Spanish, Dutch, Italian, Japanese, Polish, Russian, Swedish, Turkish
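The tags and speed range listed above can also be assembled programmatically. The following is a minimal Python sketch, not part of any Cartesia SDK: the helper names (`pause_tag`, `emotion_tag`, `clamp_speed`) are hypothetical, and both the inline placement of the tags and the lowercase emotion value "sad" are assumptions to verify against your own workflow.

```python
# Hypothetical prompt-building helpers; not part of any official SDK.

LAUGHTER_TAG = "[Laughter]"  # voice tag quoted in the list above


def pause_tag(seconds: int) -> str:
    """Return a break tag such as <break time="1s" /> for a whole-second pause."""
    if seconds < 1:
        raise ValueError("seconds must be at least 1")
    return f'<break time="{seconds}s" />'


def emotion_tag(emotion: str) -> str:
    """Return an emotion tag; the emotion name is not validated here."""
    return f'<emotion value="{emotion}" />'


def clamp_speed(speed: float, low: float = 0.5, high: float = 1.5) -> float:
    """Keep a requested speed inside the documented 0.5-1.5 range."""
    return max(low, min(high, speed))


# Assumption: tags are written inline, at the point where they should take effect.
prompt = " ".join([
    emotion_tag("sad"),  # "sad" is an assumed value; check the supported list
    "I thought we had more time.",
    pause_tag(1),
    "Anyway.",
    LAUGHTER_TAG,
])
print(prompt)
print(clamp_speed(2.0))  # 1.5
```

Building tags through small functions avoids typos in the tag syntax, and clamping keeps a speed value within the supported range before it is used.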
Best Use Cases
Character-Driven Content
Ideal for scripts that require personality, charm, or a strong sense of human presence, such as short narratives, character lines, or branded characters.
Regional Accents
A best-in-class solution for localized English content: select the specific English accent you want for the voice.
Voice Changer
Use Cartesia’s speech-to-speech to change the voice of a recorded audio file.
Conversational & Casual Voiceovers
Well-suited for dialogue-style reads, social content, or informal explainers where natural flow matters more than studio perfection.
Strengths and Limitations
Strengths
- High Liveliness: Voices sound energetic, engaging, and closer to real human delivery.
- Strong Voice Actor Similarity: Very accurate to the recorded voice actor’s original performance.
- Local Accents: Choose among American, British, Australian, and Indian accents.
- Speech-to-Speech Capability: Enables performance-preserving voice transformations.
Limitations
- Lower Stability: Can introduce random noises, glitches, or errors across generations.
- Limited Language Coverage: Supports far fewer languages than other models.
- Not Ideal for Long-Form: Consistency can degrade in longer scripts.
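One possible way to work around the long-form limitation is to split a long script into shorter pieces before generating each one separately. The sketch below is an illustration under stated assumptions, not a documented Cartesia feature: the function name `split_script` is hypothetical, the 2000-character default is taken from the stability note above, and sentences are detected with a simple punctuation rule that will not suit every script.

```python
import re


def split_script(script: str, max_chars: int = 2000) -> list[str]:
    """Split a script into chunks of at most max_chars, breaking between sentences."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_chars is kept whole, not cut.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_script("One. Two. Three.", max_chars=9))
```

Each chunk can then be generated on its own and the resulting audio joined in editing software. Splitting at sentence boundaries keeps each piece a complete thought, which tends to matter for delivery.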
Tips for Better Prompts
- Keep Scripts Short: Cartesia performs best on short to medium-length scripts.
- Write Naturally: Use conversational phrasing rather than formal narration-style writing.
- Use Punctuation for Rhythm: Commas, periods, exclamation marks, and parentheses can guide more natural, expressive pacing.
  - For example, for a more dramatic effect, the phrase:
    - “Listen, if we walk away today, me, you, all of us, we may never get another chance.”
  - Can be written:
    - “Listen… If we walk away today? me… you… all of us: we may never! get another chance...”
- Add Context: If your workflow allows it, add context to the prompt and cut it out later with editing software. For example, instead of “This is how you do it.”, write “And then the smug man softly said: ‘This is how you do it.’”
- Spell Out Numbers and Dates: Instead of “Using Sonic 3.0 is great”, write “Using Sonic three point oh is great”.
- Leverage Speed Control: Slight speed adjustments can dramatically improve delivered performance.
- Insert Pauses Intentionally: Use “<break time="1s" />” to add pauses that create emphasis or breathing room. For a two-second pause, for example, add “<break time="2s" />”.
- Test Multiple Generations: Due to variability, running multiple takes often yields the best result.
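The tip about spelling out numbers can be partly automated for simple version-style numbers. The following is a small Python sketch under stated assumptions: the function name `spell_out_versions` is hypothetical, it only handles single digits separated by dots (such as “3.0”), and it reads zero as “oh” to match the example above. Dates, larger numbers, and anything else are left untouched and still need to be written out by hand.

```python
import re

# Spoken form of each digit; zero is read as "oh", as in the tip above.
DIGIT_WORDS = {
    "0": "oh", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# Matches single digits joined by dots, e.g. "3.0" or "1.2.3", but not "13.0".
VERSION_PATTERN = re.compile(r"\b\d(?:\.\d)+\b")


def spell_out_versions(text: str) -> str:
    """Replace version-style numbers with their spoken form."""
    def to_words(match: re.Match) -> str:
        return " ".join(
            "point" if ch == "." else DIGIT_WORDS[ch] for ch in match.group()
        )
    return VERSION_PATTERN.sub(to_words, text)


print(spell_out_versions("Using Sonic 3.0 is great"))
```

The pattern is deliberately narrow so that it never rewrites a number it cannot read correctly.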