Hey SSML

Mar 14, 2019

Hey Devs,

Now, let’s talk about something super amazing in Actions on Google. Did you know that you can shape your agent’s responses with SSML? By using SSML, you can make your agent’s responses sound more lifelike. Yup, SSML is a super amazing one! Let’s dive into it.

SSML stands for Speech Synthesis Markup Language. I think all of you are familiar with HTML, the markup language for web pages. In the same way, SSML is the markup language for synthesizing speech on the web and in desktop or mobile applications, and it helps us build a super polished Action. SSML lets you control how your agent speaks. My favorite part of SSML is that I can create a persona for my Action by changing the pitch, speed, and volume of its voice. Let’s move on!

So, here I have an audio file for you which is created with the given SSML code right here down below:

Now, let’s check some SSML elements!

<speak>

It’s the root element of SSML, and it plays the same role as <html> in HTML. You have to wrap your SSML code in a <speak> element.
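
As a rough illustration of the root-element rule, here is a small Python sketch. The helper name speak is my own (hypothetical), not part of any Actions on Google library. It only wraps plain text in <speak> and escapes the characters that would otherwise break the markup, then parses the result to confirm it is well-formed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def speak(text: str) -> str:
    """Wrap plain text in the <speak> root element (hypothetical helper).

    escape() replaces &, < and > so the text cannot break the markup.
    """
    return f"<speak>{escape(text)}</speak>"


ssml = speak("Tom & Jerry say hi")
print(ssml)  # <speak>Tom &amp; Jerry say hi</speak>

# Parsing confirms the string is well-formed XML with <speak> as its root.
root = ET.fromstring(ssml)
print(root.tag)  # speak
```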

<break>

It’s an empty element that controls pauses, mostly used between two words. If this element is not present between words, the break is determined automatically based on the linguistic context.

It accepts two optional attributes: time, which sets the length of the break in seconds or milliseconds (for example “3s” or “250ms”), and strength, which sets the strength of the pause. Valid values for strength are “x-weak”, “weak”, “medium”, “strong”, “x-strong”, and “none”.
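
To make the two attributes concrete, here is a minimal Python sketch. The function name break_tag and the simplified time pattern are my own assumptions; the strength values are the ones listed above. It is only a string builder with basic validation, not an official API.

```python
import re
from typing import Optional

# Strength values listed in the text above.
STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}

# Simplified check (assumption): a number followed by "s" or "ms".
TIME_PATTERN = re.compile(r"^\d+(\.\d+)?(s|ms)$")


def break_tag(time: Optional[str] = None, strength: Optional[str] = None) -> str:
    """Build a <break/> element (hypothetical helper)."""
    attrs = []
    if time is not None:
        if not TIME_PATTERN.match(time):
            raise ValueError(f"time must look like '3s' or '250ms', got {time!r}")
        attrs.append(f'time="{time}"')
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if not attrs:
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


pause = break_tag(time="500ms")
print(f"<speak>Step one.{pause}Step two.</speak>")
# <speak>Step one.<break time="500ms"/>Step two.</speak>
```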

<say-as>

This element allows you to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.

The say-as element has three attributes: interpret-as, format, and detail. The interpret-as attribute is a required one, and the other two are optional. Interpret-as supports the following as valid values:
ordinal: reads a number as an ordinal. The following snippet says “You are first”:
<speak>You are <say-as interpret-as="ordinal">1</say-as></speak>
cardinal: reads a number as a cardinal. The following snippet says “Twelve thousand three hundred forty-five” in US English and “Twelve thousand three hundred and forty-five” in UK English:
<speak><say-as interpret-as="cardinal">12345</say-as></speak>
characters: spells the text out character by character. This example says “S S M L”:
<speak><say-as interpret-as="characters">SSML</say-as></speak>
fraction: dealing with fractions? Try this:
<speak><say-as interpret-as="fraction">4/5</say-as></speak>
date: with date as the value of interpret-as, you have to specify the format attribute too. The character codes in the format are y, m, and d for the year, month, and day respectively. Here is an example you can try:
<speak><say-as interpret-as="date" format="ddmmyyyy">03-05-2002</say-as></speak>
time: the character codes in the format are h, m, s, Z, 12, and 24 for hour, minute, second, time zone, 12-hour time, and 24-hour time respectively. The default format is “hms12”:
<speak><say-as interpret-as="time" format="hms12">10:30am</say-as></speak>
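
If you generate these snippets from code, a tiny builder keeps the attribute rules in one place. This is only a sketch: say_as and its fmt parameter are names I made up, and the one rule it enforces (date needs a format) is the one described above.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str,
           fmt: Optional[str] = None, detail: Optional[str] = None) -> str:
    """Build a <say-as> element (hypothetical helper).

    interpret-as is required; format and detail are optional,
    except that "date" needs a format such as "ddmmyyyy".
    """
    if interpret_as == "date" and fmt is None:
        raise ValueError('interpret-as="date" needs a format, e.g. "ddmmyyyy"')
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    if detail is not None:
        attrs += f" detail={quoteattr(detail)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


print(say_as("1", "ordinal"))
# <say-as interpret-as="ordinal">1</say-as>
print(say_as("03-05-2002", "date", fmt="ddmmyyyy"))
# <say-as interpret-as="date" format="ddmmyyyy">03-05-2002</say-as>
```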

There are some other possible values for the interpret-as attribute. You can find them in the W3C specification or in the Google Developers documentation.

<audio>

The audio element inserts pre-recorded audio files into your speech.
It supports a small number of attributes; see the table in the Google Developers documentation.
Among those attributes, src is the required one, and it specifies the source audio file.
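
Here is a small Python sketch of building an audio element. The helper name audio and the example.com URL are placeholders of mine. The optional text inside the element is alternate content, which the W3C specification allows for when the audio cannot be played; this article does not cover it, so treat that part as an extra.

```python
from xml.sax.saxutils import escape, quoteattr


def audio(src: str, fallback_text: str = "") -> str:
    """Build an <audio> element (hypothetical helper).

    src is required. fallback_text is optional alternate content.
    """
    if not src:
        raise ValueError("src is required")
    # quoteattr() adds the quotes and escapes characters such as & in the URL.
    return f"<audio src={quoteattr(src)}>{escape(fallback_text)}</audio>"


# The URL below is a placeholder, not a real sound file.
clip = audio("https://example.com/sounds/intro.ogg", "Intro jingle")
print(f"<speak>{clip}</speak>")
# <speak><audio src="https://example.com/sounds/intro.ogg">Intro jingle</audio></speak>
```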

Please check out the W3C specification for detailed info.

<p>, <s>

These elements mark up paragraphs and sentences. Here is an example:
<speak><p><s>Sentence one. </s><s> Sentence two. </s></p></speak>
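
The same structure can be generated from a list of sentences. This is a minimal sketch; the helper name paragraph is hypothetical, and it assumes you have already split your text into sentences yourself.

```python
from typing import List
from xml.sax.saxutils import escape


def paragraph(sentences: List[str]) -> str:
    """Wrap each sentence in <s> and the whole group in <p> (hypothetical helper)."""
    body = "".join(f"<s>{escape(s.strip())}</s>" for s in sentences)
    return f"<p>{body}</p>"


block = paragraph(["Sentence one.", "Sentence two."])
print(f"<speak>{block}</speak>")
# <speak><p><s>Sentence one.</s><s>Sentence two.</s></p></speak>
```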

And finally, my personal favorite:

<prosody>

This element is used to customize the pitch, speed, and volume of the speech. It accepts three optional attributes: pitch, rate, and volume.
See the W3C specification for how to set these attributes.

There are three options for setting the value of the pitch attribute:

Relative: Specify a relative value (e.g. “low”, “medium”, “high”, etc) where “medium” is the default pitch.
Semitones: Increase or decrease pitch by “N” semitones using “+Nst” or “-Nst” respectively. Note that “+/-” and “st” are required.
Percentage: Increase or decrease pitch by “N” percent by using “+N%” or “-N%” respectively. Note that “%” is required but “+/-” is optional.
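
The three pitch formats above can be checked with a couple of regular expressions. This is a sketch under my own assumptions: the names pitch_kind and prosody are hypothetical, the relative labels are the ones I know from the W3C specification, and the number patterns are simplified.

```python
import re
from xml.sax.saxutils import escape

# Relative labels from the W3C specification; "medium" is the default pitch.
RELATIVE = {"x-low", "low", "medium", "high", "x-high", "default"}
SEMITONES = re.compile(r"^[+-]\d+(\.\d+)?st$")   # sign and "st" are required
PERCENT = re.compile(r"^[+-]?\d+(\.\d+)?%$")     # "%" is required, sign is optional


def pitch_kind(value: str) -> str:
    """Classify a pitch value as relative, semitones, or percentage."""
    if value in RELATIVE:
        return "relative"
    if SEMITONES.match(value):
        return "semitones"
    if PERCENT.match(value):
        return "percentage"
    raise ValueError(f"not a valid pitch value: {value!r}")


def prosody(text: str, pitch: str) -> str:
    """Build a <prosody> element with a validated pitch (hypothetical helper)."""
    pitch_kind(pitch)  # raises ValueError if the value is malformed
    return f'<prosody pitch="{pitch}">{escape(text)}</prosody>'


print(pitch_kind("+2st"))        # semitones
print(prosody("Hello", "-10%"))  # <prosody pitch="-10%">Hello</prosody>
```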

You know, SSML has a lot more to cover. But I’m leaving the rest up to you! As I said before, you can learn more about SSML from the Google Developers documentation and the W3C specification.

And one last thing: there is a TTS simulator in the Actions on Google Console. To check it out, go to the Actions on Google Console and open Simulator -> Audio. Type the SSML code you want to test and click the Update and Listen button to hear the TTS output.