The offset in bytes (not characters) of the end of the object in the input text (not including viseme marks)
The offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)
The timestamp in milliseconds from the beginning of the corresponding audio stream
The type of speech mark (sentence, word, viseme, or ssml)
The value of the speech mark. This varies depending on the type of speech mark:
ssml: the SSML tag
viseme: the viseme name
word or sentence: a substring of the input text, as delimited by the start and end fields
Data that describe the speech that you synthesize, such as where a sentence or word starts and ends in the audio stream. For more information on speech marks, see https://docs.aws.amazon.com/polly/latest/dg/speechmarks.html
SpeechMark
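Because the start and end offsets count bytes rather than characters, slicing the input text by character index gives wrong results as soon as the text contains multi-byte characters. The following is a minimal Python sketch of one way to read speech marks and recover the matching substring. It rests on several assumptions that the descriptions above do not state: that the marks arrive as one JSON object per line, that the fields are named time, type, start, end, and value, that the input text is encoded as UTF-8, and that the end offset is exclusive. The SpeechMark class, the helper functions, and the sample data are hypothetical and exist only for illustration.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeechMark:
    # Field names are assumed from the descriptions above.
    time: int                    # milliseconds from the start of the audio stream
    type: str                    # "sentence", "word", "viseme", or "ssml"
    value: str                   # meaning depends on the type
    start: Optional[int] = None  # byte offset; absent for viseme marks
    end: Optional[int] = None    # byte offset; absent for viseme marks


def parse_speech_marks(payload: str) -> List[SpeechMark]:
    """Parse speech marks, assuming one JSON object per line."""
    marks = []
    for line in payload.splitlines():
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        marks.append(
            SpeechMark(
                time=obj["time"],
                type=obj["type"],
                value=obj["value"],
                start=obj.get("start"),
                end=obj.get("end"),
            )
        )
    return marks


def text_for(mark: SpeechMark, input_text: str) -> Optional[str]:
    """Return the substring a mark refers to, slicing by bytes, not characters."""
    if mark.start is None or mark.end is None:
        return None  # for example, viseme marks carry no offsets
    raw = input_text.encode("utf-8")  # assumption: offsets refer to UTF-8 bytes
    return raw[mark.start:mark.end].decode("utf-8")


# Made-up sample data: "é" occupies two bytes in UTF-8, so byte offsets
# and character offsets diverge after the first word.
INPUT_TEXT = "Café olé"
SAMPLE_MARKS = """
{"time": 0, "type": "word", "start": 0, "end": 5, "value": "Café"}
{"time": 0, "type": "viseme", "value": "k"}
{"time": 300, "type": "word", "start": 6, "end": 10, "value": "olé"}
"""

for mark in parse_speech_marks(SAMPLE_MARKS):
    print(mark.time, mark.type, text_for(mark, INPUT_TEXT))
```

In this sample, the second word starts at byte 6 but at character 5, so slicing the string directly with the byte offsets would not return the word the mark describes.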