text_to_speech

Convert any text into natural-sounding speech using the MiniMax TTS API. The generated audio is returned inline in the MCP response and also saved to disk.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | | Text to synthesize (max 10,000 characters) |
| model | string | No | speech-02-hd | TTS model: speech-2.8-hd, speech-2.8-turbo, speech-02-hd, speech-02-turbo |
| voice_id | string | No | female-shaonv | Voice ID. See the MiniMax voice list |
| speed | number | No | 1.0 | Playback speed, range [0.5, 2.0] |
| vol | number | No | 1.0 | Volume, range (0, 10] |
| pitch | integer | No | 0 | Pitch adjustment, range [-12, 12] |
| emotion | string | No | | Emotional tone: happy, sad, angry, fearful, disgusted, surprised, calm |
| format | string | No | mp3 | Output format: mp3, wav, flac, pcm |
| sample_rate | integer | No | 32000 | Sample rate in Hz: 8000, 16000, 22050, 24000, 32000, 44100 |
| bitrate | integer | No | 128000 | Audio bitrate in bits per second: 32000, 64000, 128000, 256000 |
| channel | integer | No | 1 | Channels: 1 = mono, 2 = stereo |
| language_boost | string | No | | Language-specific enhancement, e.g. English, Chinese, auto |
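To make the ranges in the table concrete, here is a minimal client-side sketch that merges caller overrides onto the documented defaults and rejects out-of-range values before a request is sent. The helper name `build_tts_args` and the constant names are hypothetical, not part of the tool or the MiniMax API, and the server may validate differently (for example, it may clamp rather than reject). `language_boost` and `voice_id` are not checked because the table gives only examples for them.

```python
"""Hypothetical pre-flight check for text_to_speech arguments.

Values mirror the parameter table; the server's own validation may differ.
"""

MODELS = {"speech-2.8-hd", "speech-2.8-turbo", "speech-02-hd", "speech-02-turbo"}
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "calm"}
FORMATS = {"mp3", "wav", "flac", "pcm"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}

# Documented defaults; emotion and language_boost have none, so they are omitted.
DEFAULTS = {
    "model": "speech-02-hd",
    "voice_id": "female-shaonv",
    "speed": 1.0,
    "vol": 1.0,
    "pitch": 0,
    "format": "mp3",
    "sample_rate": 32000,
    "bitrate": 128000,
    "channel": 1,
}


def build_tts_args(text: str, **overrides) -> dict:
    """Merge overrides onto the defaults, then range-check every known field."""
    if not text or len(text) > 10_000:
        raise ValueError("text must be 1 to 10,000 characters")
    args = {**DEFAULTS, **overrides, "text": text}

    problems = []
    if args["model"] not in MODELS:
        problems.append("model")
    if not 0.5 <= args["speed"] <= 2.0:  # closed interval [0.5, 2.0]
        problems.append("speed")
    if not 0 < args["vol"] <= 10:  # half-open interval (0, 10]
        problems.append("vol")
    if not isinstance(args["pitch"], int) or not -12 <= args["pitch"] <= 12:
        problems.append("pitch")
    if "emotion" in args and args["emotion"] not in EMOTIONS:
        problems.append("emotion")
    if args["format"] not in FORMATS:
        problems.append("format")
    if args["sample_rate"] not in SAMPLE_RATES:
        problems.append("sample_rate")
    if args["bitrate"] not in BITRATES:
        problems.append("bitrate")
    if args["channel"] not in (1, 2):
        problems.append("channel")

    if problems:
        raise ValueError("invalid argument(s): " + ", ".join(problems))
    return args


args = build_tts_args("Welcome to our annual product review.",
                      speed=0.9, channel=2, sample_rate=44100)
```

Rejecting rather than clamping is a deliberate choice here: a silently adjusted value (say, a speed of 3.0 becoming 2.0) is harder to notice than an error.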

Basic narration:

“Read this paragraph aloud: ‘In the depths of the ocean, creatures of light dance through the darkness.’”

Custom voice and emotion:

“Convert this text to speech using a calm male voice: ‘Welcome to our annual product review.’”

High-quality with specific settings:

“Generate speech from this script at 0.9× speed, stereo, 44100 Hz sample rate.”

Non-English text:

“Narrate this Chinese text with language_boost set to Chinese.”
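Behind each prompt, the MCP client sends a JSON-RPC `tools/call` request naming the tool and its arguments. The sketch below builds such a request for the "0.9× speed, stereo, 44100 Hz" example, assuming the standard MCP `tools/call` message shape. The helper `tools_call_request` and the `SCRIPT` text are hypothetical placeholders, and the arguments an assistant actually chooses for a given prompt are up to the client.

```python
import json


def tools_call_request(request_id: int, text: str, **arguments) -> str:
    """Serialize an MCP tools/call request for the text_to_speech tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "text_to_speech",
            # Only the explicitly set arguments are sent; the server
            # fills in the documented defaults for everything else.
            "arguments": {"text": text, **arguments},
        },
    }
    return json.dumps(message)


# Placeholder script text; not taken from the example prompt.
SCRIPT = "Placeholder script text."
request = tools_call_request(1, SCRIPT, speed=0.9, channel=2, sample_rate=44100)
print(request)
```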

The tool returns two content parts: a base64-encoded audio part in the chosen format, and a text part containing the saved file path, duration (ms), format, and sample rate.
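A client that wants the audio bytes directly, rather than the saved file, can decode the audio part itself. This sketch assumes the result follows the MCP content shapes (an audio part with `type: "audio"`, base64 `data`, and `mimeType`; a text part with `type: "text"` and `text`). Because the exact layout of the text part is not documented here, the sketch returns it verbatim instead of parsing out the path or duration. `extract_audio` and the sample result are hypothetical.

```python
import base64


def extract_audio(result: dict) -> tuple[bytes, str, list[str]]:
    """Return (decoded audio, MIME type, text parts) from a tools/call result."""
    audio = None
    mime_type = ""
    texts = []
    for part in result.get("content", []):
        kind = part.get("type")
        if kind == "audio" and audio is None:
            # Audio parts carry base64 in "data" plus a "mimeType" field.
            audio = base64.b64decode(part["data"])
            mime_type = part.get("mimeType", "")
        elif kind == "text":
            # Text layout (path, duration, ...) is unspecified; keep it as-is.
            texts.append(part["text"])
    if audio is None:
        raise ValueError("result has no audio content part")
    return audio, mime_type, texts


# Hand-built sample result for illustration; not real server output.
sample_result = {
    "content": [
        {
            "type": "audio",
            "data": base64.b64encode(b"fake-mp3-bytes").decode("ascii"),
            "mimeType": "audio/mpeg",
        },
        {"type": "text", "text": "example metadata line"},
    ]
}
audio, mime_type, texts = extract_audio(sample_result)
```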

  • The speech-2.8-hd and speech-2.8-turbo models are the latest generation; speech-02-hd and speech-02-turbo are previous-generation alternatives.
  • Not all emotions are supported by every voice. If an unsupported combination is requested, the API ignores the emotion parameter.
  • Generated files are saved to a temporary directory and cleaned up according to your cache settings.