The Web Speech API provides two distinct areas of functionality — speech recognition, and speech synthesis (also known as text-to-speech, or TTS) — which open up interesting new possibilities for accessibility and control mechanisms. This article provides a simple introduction to both areas, along with demos.
Speech recognition
Speech recognition involves receiving speech through a device's microphone, which is then checked by a speech recognition service against a list of grammar (basically, the vocabulary you want to have recognized in a particular app.) When a word or phrase is successfully recognized, it is returned as a result (or list of results) as a text string, and further actions can be initiated as a result.
The Web Speech API has a main controller interface for this — `SpeechRecognition` — plus a number of closely related interfaces for representing grammar, results, and so on. Generally, the default speech recognition system available on the device will be used for the speech recognition — most modern OSes have a speech recognition system for issuing voice commands. Think about Dictation on macOS, Siri on iOS, Cortana on Windows 10, Android Speech, etc.
Note: On some browsers, like Chrome, using Speech Recognition on a web page involves a server-based recognition engine. Your audio is sent to a web service for recognition processing, so it won't work offline.
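For context before the React libraries, here is a minimal sketch of driving the browser's `SpeechRecognition` interface directly. The grammar string, the `#start-button` element, and the handlers are illustrative only, not taken from any of the libraries below:

```js
// Minimal sketch of the native SpeechRecognition interface (vendor-prefixed where needed).
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const SpeechGrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList;

const recognition = new SpeechRecognition();

if (SpeechGrammarList) {
  // Optional: constrain recognition with a JSGF grammar list.
  const grammarList = new SpeechGrammarList();

  grammarList.addFromString('#JSGF V1.0; grammar districts; public <district> = Tuen Mun | Yuen Long;', 1);
  recognition.grammars = grammarList;
}

recognition.lang = 'en-US';
recognition.interimResults = false;

recognition.onresult = event => {
  // Each result contains one or more alternatives with a transcript and a confidence score.
  const { transcript, confidence } = event.results[0][0];

  console.log(`Recognized "${transcript}" with confidence ${confidence}`);
};

recognition.onerror = event => console.error('Recognition error:', event.error);

// Most browsers require recognition to be started from a user gesture, e.g. a button click.
document.querySelector('#start-button').addEventListener('click', () => recognition.start());
```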
Examples and Libraries
Let us look at some examples and libraries in React.
1. react-dictate-button
A button to start speech recognition using the Web Speech API, with an easy-to-understand event lifecycle.
Background
Reasons why we need to build our own component instead of using existing packages on NPM:
- Most browsers require speech recognition (or WebRTC) to be triggered by a user event (button click)
- Bring your own engine for the Web Speech API
- Enable speech recognition on unsupported browsers by bridging it with a cloud-based service
- Support grammar lists through the JSpeech Grammar Format
- Ability to interrupt recognition
- Ability to morph into other elements
Design considerations
- Hide the complexity of Web Speech events because we only want to focus on the recognition experience
  - Complexity in lifecycle events: `onstart`, `onaudiostart`, `onsoundstart`, `onspeechstart`
  - `onresult` may not fire in some cases; `onnomatch` is not fired in Chrome
  - To reduce complexity, we want to make sure events fire in one of two ways:
    - Happy path: `onProgress`, then either `onDictate` or `onError`
    - Otherwise: `onError`
- "Web Speech" could mean speech synthesis, which is out of scope for this package
- "Speech Recognition" could mean we would expose the Web Speech API as-is, whereas we want to hide the details and make it straightforward for the recognition scenario
Step 1: Installation
First, install the production version with `npm install react-dictate-button`, or the development version with `npm install [email protected]`.
Step 2: Use
Use it as shown below. The `handleDictate` and `handleProgress` callbacks are defined here only so the snippet runs standalone:

```jsx
import DictateButton from 'react-dictate-button';

// Illustrative handlers: log the final result and interim results.
const handleDictate = ({ result }) => console.log('Dictate:', result);
const handleProgress = ({ results }) => console.log('Progress:', results);

export default () => (
  <DictateButton
    className="my-dictate-button"
    grammar="#JSGF V1.0; grammar districts; public <district> = Tuen Mun | Yuen Long;"
    lang="en-US"
    onDictate={handleDictate}
    onProgress={handleProgress}
  >
    Start/stop
  </DictateButton>
);
```
Props
| Name | Type | Default | Description |
|---|---|---|---|
| `className` | `string` | `undefined` | Class name to apply to the button |
| `disabled` | `boolean` | `false` | `true` to abort ongoing recognition and disable the button, otherwise `false` |
| `extra` | `{ [key: string]: any }` | `{}` | Additional properties to set on `SpeechRecognition` before `start`, useful when bringing your own `SpeechRecognition` |
| `grammar` | `string` | `undefined` | Grammar list in JSGF format |
| `lang` | `string` | `undefined` | Language to recognize, for example, `'en-US'` or `navigator.language` |
| `speechGrammarList` | `any` | `window.SpeechGrammarList` (or vendor-prefixed) | Bring your own `SpeechGrammarList` |
| `speechRecognition` | `any` | `window.SpeechRecognition` (or vendor-prefixed) | Bring your own `SpeechRecognition` |
Note: changes to `extra`, `grammar`, `lang`, `speechGrammarList`, and `speechRecognition` will not take effect until the next speech recognition is started.
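To illustrate the `extra` and `speechRecognition` props, here is a hedged sketch. `maxAlternatives` is a standard `SpeechRecognition` property and is used purely as an example of something you might set before `start`; whether it is useful depends on the engine you bring:

```jsx
import React from 'react';
import DictateButton from 'react-dictate-button';

// Properties assigned onto the SpeechRecognition instance before start() is called.
const extra = { maxAlternatives: 3 };

// Bring your own engine; here the native (possibly vendor-prefixed) constructor is passed explicitly.
const speechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

export default () => (
  <DictateButton
    extra={ extra }
    lang="en-US"
    speechRecognition={ speechRecognition }
    onDictate={({ result }) => result && console.log(result.transcript)}
  >
    Start/stop
  </DictateButton>
);
```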
Events
- `onClick` - Emitted when the user clicks the button; calling `preventDefault` will stop recognition from starting

  `(event: MouseEvent) => void`

- `onDictate` - Emitted when recognition is completed

  ```
  ({
    result: {
      confidence: number,
      transcript: string
    },
    type: 'dictate'
  }) => void
  ```

- `onError` - Emitted when an error has occurred or recognition is interrupted

  `(event: SpeechRecognitionErrorEvent) => void`

- `onProgress` - Emitted for interim results; the array contains every segment of recognized text

  ```
  ({
    abortable: boolean,
    results: [{
      confidence: number,
      transcript: string
    }],
    type: 'progress'
  }) => void
  ```

- `onRawEvent` - Emitted for handling raw events from `SpeechRecognition`

  `(event: SpeechRecognitionEvent) => void`
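As a rough sketch of consuming these payloads (the handler bodies are illustrative, not part of the library):

```jsx
import React from 'react';
import DictateButton from 'react-dictate-button';

// Interim results arrive as an array of { confidence, transcript } segments.
const handleProgress = ({ abortable, results = [] }) =>
  console.log('Interim:', results.map(({ transcript }) => transcript).join(' '), { abortable });

// The final result may be absent when nothing was recognized.
const handleDictate = ({ result }) =>
  console.log('Final:', result ? result.transcript : '(nothing recognized)');

// Error codes include 'aborted', 'no-speech', and 'not-allowed'.
const handleError = event => console.warn('Recognition error:', event.error);

export default () => (
  <DictateButton lang="en-US" onDictate={handleDictate} onError={handleError} onProgress={handleProgress}>
    Start/stop
  </DictateButton>
);
```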
Hooks
Although previous versions exported a React Context, it is recommended to use the hooks interface.
| Name | Signature | Description |
|---|---|---|
| `useAbortable` | `[boolean]` | If the ongoing speech recognition can be aborted, `true`; otherwise, `false` |
| `useReadyState` | `[number]` | Returns the current state of recognition; see the ready state table below |
| `useSupported` | `[boolean]` | If speech recognition is supported, `true`; otherwise, `false` |
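A hedged sketch of using the hooks, assuming they are exported from `react-dictate-button` alongside the default export and are read from a component rendered as a child of the dictate button (the destructuring follows the array signatures above):

```jsx
import React from 'react';
import DictateButton, { useReadyState, useSupported } from 'react-dictate-button';

// Reads the recognition state provided by the surrounding <DictateButton>.
const ButtonLabel = () => {
  const [readyState] = useReadyState();
  const [supported] = useSupported();

  if (!supported) {
    return 'Not supported';
  }

  return readyState === 2 ? 'Listening...' : 'Start';
};

export default () => (
  <DictateButton lang="en-US" onDictate={({ result }) => result && console.log(result.transcript)}>
    <ButtonLabel />
  </DictateButton>
);
```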
Checking if speech recognition is supported
To determine whether speech recognition is supported in the browser:
- If the `speechRecognition` prop is `undefined`:
  - If both `window.navigator.mediaDevices` and `window.navigator.mediaDevices.getUserMedia` are falsy, it is not supported
    - Probably the browser is not on a secure HTTP connection
  - If both `window.SpeechRecognition` and its vendor-prefixed counterparts are falsy, it is not supported
  - If recognition failed once with the `not-allowed` error code, it is not supported
- Otherwise, it is supported

Even if the browser is on an insecure HTTP connection, `window.SpeechRecognition` (or the vendor-prefixed counterpart) will continue to be truthy. Instead, `mediaDevices.getUserMedia` is used for capability detection.
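A plain-JavaScript sketch of that detection logic, following the steps above rather than the library's actual implementation (the bookkeeping for a previous `not-allowed` failure is omitted):

```js
// Rough capability check. The library additionally treats recognition as
// unsupported after an attempt fails with the 'not-allowed' error code.
function isSpeechRecognitionSupported(speechRecognition) {
  // If a custom engine is provided via the speechRecognition prop, trust it.
  if (speechRecognition) {
    return true;
  }

  // getUserMedia is falsy on insecure origins, so it doubles as a secure-context check.
  if (!window.navigator.mediaDevices || !window.navigator.mediaDevices.getUserMedia) {
    return false;
  }

  // Fall back to the native constructor (vendor-prefixed where needed).
  return !!(window.SpeechRecognition || window.webkitSpeechRecognition);
}
```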
Event lifecycle
One of the design aspects is to make sure events are easy to understand and deterministic. The first rule of thumb is that `onProgress` will lead to either `onDictate` or `onError`. Here are some samples of event firing sequences (tested on Chrome 67); a sketch that builds on this guarantee follows the list.

- Happy path: speech is recognized
  1. `onProgress({})` (just started, therefore no `results`)
  2. `onProgress({ results: [] })`
  3. `onDictate({ result: ... })`
- Heard some sound, but nothing can be recognized
  1. `onProgress({})`
  2. `onDictate({})` (nothing is recognized, therefore no `result`)
- Nothing is heard (audio device available but muted)
  1. `onProgress({})`
  2. `onError({ error: 'no-speech' })`
- Recognition aborted
  1. `onProgress({})`
  2. `onProgress({ abortable: true, results: [] })`
  3. While speech is being recognized, abort recognition (for example, by setting `props.disabled` to `true`)
  4. `onError({ error: 'aborted' })`
- Not authorized to use speech, or no audio device is available
  1. `onError({ error: 'not-allowed' })`
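Because every `onProgress` sequence terminates in exactly one `onDictate` or `onError`, a component can keep a single listening flag in sync without special cases. A minimal sketch (the flag and handler names are illustrative):

```jsx
import React, { useCallback, useState } from 'react';
import DictateButton from 'react-dictate-button';

export default () => {
  const [listening, setListening] = useState(false);

  // onProgress marks the start of a recognition attempt...
  const handleProgress = useCallback(() => setListening(true), []);

  // ...which always terminates in exactly one of onDictate or onError.
  const handleDictate = useCallback(({ result }) => {
    setListening(false);
    result && console.log(result.transcript);
  }, []);

  const handleError = useCallback(() => setListening(false), []);

  return (
    <DictateButton lang="en-US" onDictate={handleDictate} onError={handleError} onProgress={handleProgress}>
      {listening ? 'Listening...' : 'Start'}
    </DictateButton>
  );
};
```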
Function as a child
Instead of passing child elements, you can pass a function to render different content based on ready state. This is called function as a child.
| Ready state | Description |
|---|---|
| `0` | Not started |
| `1` | Starting recognition engine; recognition is not ready until it turns to `2` |
| `2` | Recognizing |
| `3` | Stopping |
For example:

```jsx
<DictateButton>
  {({ readyState }) =>
    readyState === 0 ? 'Start' : readyState === 1 ? 'Starting...' : readyState === 2 ? 'Listening...' : 'Stopping...'
  }
</DictateButton>
```
Checkbox version
In addition to `<button>`, we also ship an `<input type="checkbox">` version out of the box. The checkbox version is better suited for toggle-button scenarios and web accessibility. You can use the following code for the checkbox version:

```jsx
import { DictateCheckbox } from 'react-dictate-button';

// Illustrative handlers, as in the earlier <DictateButton> example.
const handleDictate = ({ result }) => console.log('Dictate:', result);
const handleProgress = ({ results }) => console.log('Progress:', results);

export default () => (
  <DictateCheckbox
    className="my-dictate-checkbox"
    grammar="#JSGF V1.0; grammar districts; public <district> = Redmond | Bellevue;"
    lang="en-US"
    onDictate={handleDictate}
    onProgress={handleProgress}
  >
    Start/stop
  </DictateCheckbox>
);
```
We also provide a "textbox with dictate button" version. Instead of shipping a full-fledged control, we make it a minimally-styled control so you can copy the code into your own project and customize it. The sample code can be found at DictationTextbox.js.
2. react-say
A React component that synthesizes text into speech using the Web Speech API.
Try out the demo at https://compulim.github.io/react-say/.
Step 1: Install it
First, run `npm install react-say` for the production build, or run `npm install [email protected]` for the latest development build.
Step 2: Synthesizing an utterance
`react-say` offers comprehensive ways to synthesize an utterance:

- Synthesize text using the `<Say>` component
- Synthesize text using the `<SayButton>` component
- Synthesize an utterance using the `<SayUtterance>` component
- Synthesize text or an utterance using the `useSynthesize` hook
Below are examples and use cases:
Using the `<Say>` component
The following will speak the text immediately upon mounting. Some browsers may not speak the text until the user has interacted with the page.

```jsx
import React from 'react';
import Say from 'react-say';

export default () =>
  <Say speak="A quick brown fox jumped over the lazy dogs." />
```
Customizing pitch/rate/volume
To modify the speech, you can vary the `pitch`, `rate`, and `volume` props of `<Say>`.

```jsx
import React from 'react';
import Say from 'react-say';

export default () =>
  <Say
    pitch={ 1.1 }
    rate={ 1.5 }
    speak="A quick brown fox jumped over the lazy dogs."
    volume={ 0.8 }
  />
```
Selecting a voice
To select a different voice for synthesis, you can pass either a `SpeechSynthesisVoice` object or a selector function to the `voice` prop.

For a selector function, the signature is `(voices: SpeechSynthesisVoice[]) => SpeechSynthesisVoice`.

```jsx
import React, { useCallback } from 'react';
import Say from 'react-say';

export default () => {
  // Depending on the Web Speech API implementation used, the first argument may be
  // an array-like object instead of an array, so convert it before searching.
  const selector = useCallback(voices => [...voices].find(v => v.lang === 'zh-HK'), []);

  return (
    <Say
      speak="A quick brown fox jumped over the lazy dogs."
      voice={ selector }
    />
  );
};
```

Note: this also works with `<SayButton>`.
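If you already have a concrete `SpeechSynthesisVoice` object, it can be passed directly. A sketch, assuming the browser has already loaded its voices (some browsers only populate `getVoices()` asynchronously after a `voiceschanged` event):

```jsx
import React, { useMemo } from 'react';
import Say from 'react-say';

export default () => {
  // Pick a SpeechSynthesisVoice object up front (illustrative; real apps may need
  // to wait for speechSynthesis's "voiceschanged" event before voices are available).
  const voice = useMemo(() => window.speechSynthesis.getVoices().find(v => v.lang === 'en-US'), []);

  return (
    <Say
      speak="A quick brown fox jumped over the lazy dogs."
      voice={ voice }
    />
  );
};
```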
Using the `<SayButton>` component
If you want the web page to say something when the user clicks a button, you can use `<SayButton>`.

```jsx
import React from 'react';
import { SayButton } from 'react-say';

export default props =>
  <SayButton
    onClick={ event => console.log(event) }
    speak="A quick brown fox jumped over the lazy dogs."
  >
    Tell me a story
  </SayButton>
```
Using the `<SayUtterance>` component
Instead of synthesizing text, you can prepare your own `SpeechSynthesisUtterance` object.

```jsx
import React, { useMemo } from 'react';
import { SayUtterance } from 'react-say';

export default () => {
  const utterance = useMemo(() => new SpeechSynthesisUtterance('A quick brown fox jumped over the lazy dogs.'), []);

  return (
    <SayUtterance
      utterance={ utterance }
    />
  );
};
```
Using the `useSynthesize` hook
If you want to build your own component that uses speech synthesis, you can use the `useSynthesize` hook.

```jsx
import React, { useCallback } from 'react';
import { useSynthesize } from 'react-say';

export default () => {
  const synthesize = useSynthesize();
  const handleClick = useCallback(() => {
    synthesize('A quick brown fox jumped over the lazy dogs.');
  }, [synthesize]);

  return (
    <button onClick={ handleClick }>Tell me a story</button>
  );
};
```
Cancelling an active or pending synthesis
Once you call the `synthesize()` function, the utterance is queued. The queue prevents multiple utterances from being synthesized at the same time. You can call `cancel()` to remove the utterance from the queue. If the utterance is currently being synthesized, it will be aborted.

```jsx
import React, { useEffect } from 'react';
import { useSynthesize } from 'react-say';

export default () => {
  const synthesize = useSynthesize();

  // When this component is mounted, the utterance will be queued immediately.
  useEffect(() => {
    const { cancel } = synthesize('A quick brown fox jumped over the lazy dogs.');

    // When this component is unmounted, the synthesis will be cancelled.
    return () => cancel();
  }, [synthesize]);

  return <p>Telling you a story...</p>;
};
```
Bring your own SpeechSynthesis
You can bring your own `window.speechSynthesis` and `window.SpeechSynthesisUtterance` for custom speech synthesis. For example, you can bring Azure Cognitive Services Speech Services through the `web-speech-cognitive-services` package.

```jsx
import React, { useMemo } from 'react';
import Say from 'react-say';
import createPonyfill from 'web-speech-cognitive-services/lib/SpeechServices';

export default () => {
  // You are recommended to use an authorization token instead of a subscription key.
  const ponyfill = useMemo(() => createPonyfill({
    region: 'westus',
    subscriptionKey: 'YOUR_SUBSCRIPTION_KEY'
  }), []);

  return (
    <Say
      ponyfill={ ponyfill }
      speak="A quick brown fox jumped over the lazy dogs."
    />
  );
};
```
Caveats
- Instead of using the native queue for utterances, we implement our own speech queue for browser compatibility reasons
  - The queue is managed by `<Composer>`; all `<Say>`, `<SayButton>`, and `<SayUtterance>` components inside the same `<Composer>` share the same queue
  - The native queue does not support partial cancel: when `cancel` is called, all pending utterances are stopped
  - If `<Say>` or `<SayButton>` is unmounted, its utterance can be stopped without affecting other pending utterances
  - Utterance order can be changed on the fly
- Browser quirks
  - Chrome: if `cancel` and `speak` are called repeatedly, `speak` will appear to succeed (`speaking === true`) but audio is never played (the `start` event is never fired)
  - Safari: when speech is not triggered by a user event (e.g. a mouse click or tap), the speech will not be played
    - Workaround: on page load, prime the speech engine with any user event (see the sketch below)
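One possible way to prime the engine, sketched here rather than taken from react-say: speak an empty utterance from the first user gesture so that later programmatic speech is allowed.

```js
// Speak an empty utterance from the first user gesture to unlock speech synthesis
// on browsers that require user activation (e.g. Safari).
function primeSpeechSynthesis() {
  const unlock = () => {
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(''));
    window.removeEventListener('click', unlock);
    window.removeEventListener('touchend', unlock);
  };

  window.addEventListener('click', unlock);
  window.addEventListener('touchend', unlock);
}

primeSpeechSynthesis();
```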