What is SpeakRight?

SpeakRight is an open-source Java framework for writing speech recognition applications in VoiceXML. Unlike most proprietary speech-app tools, SpeakRight is code-based. Applications are written in Java using SpeakRight's extensible classes. Java IDEs such as Eclipse provide great debugging, fast Java-aware editing, and refactoring. Dynamic generation of VoiceXML is done using the popular StringTemplate templating framework.

See Getting Started, Tutorial, and Table of Contents

Saturday, February 24, 2007

Internal Architecture

A typical VoiceXML application has a software stack like this:
  • Application code
  • Web server
  • VoiceXML browser
  • VoiceXML platform
  • Telephony hardware, VOIP stack
Starting at the bottom, a phone call arrives at the telephony/VOIP layer. This layer notifies the VoiceXML platform layer, which allocates a speech recognition engine, a text-to-speech engine, and a VoiceXML browser. The browser plays the same role as a web browser: it makes an HTTP request to a web server, which runs some application code. The application code is a mixture of static and dynamic web content that generates a VoiceXML page. The web server returns this page to the browser, which renders the page as audio. Speech input and DTMF digits are collected according to the VoiceXML tags. At some point, a tag such as <submit> or <goto> is executed, which makes a new HTTP request (sending user input as GET or POST data), and the process repeats.

SpeakRight lives in the application code layer, typically in a servlet. The SpeakRight runtime dynamically generates VoiceXML pages, one per HTTP request. Between requests, the runtime is stateless, in the same sense as a "stateless bean". State is saved in the servlet session and restored on each HTTP request.
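The save-and-restore cycle can be pictured with a minimal sketch. It is not SpeakRight's actual code: a plain Map stands in for the servlet session, and DialogState, StatelessRuntime, handleRequest, and the "dialogState" key are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical snapshot of what a runtime might need to remember between requests.
class DialogState {
    int pagesServed;
}

class StatelessRuntime {
    static final String KEY = "dialogState";   // hypothetical session attribute name

    // Handles one HTTP request. The runtime keeps no fields of its own:
    // everything it needs is read from, and written back to, the session.
    static String handleRequest(Map<String, Object> session) {
        DialogState state = (DialogState) session.get(KEY);   // restore
        if (state == null) {
            state = new DialogState();                        // first request of the call
        }
        state.pagesServed++;
        String page = "<vxml><!-- page " + state.pagesServed + " --></vxml>"; // placeholder page
        session.put(KEY, state);                              // save
        return page;
    }
}

class StatelessRuntimeDemo {
    public static void main(String[] args) {
        Map<String, Object> session = new HashMap<>();   // stand-in for the servlet session
        System.out.println(StatelessRuntime.handleRequest(session));
        System.out.println(StatelessRuntime.handleRequest(session));
    }
}
```

Because the runtime holds no state itself, any instance can serve any request as long as it is handed the right session.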

The SpeakRight framework is a set of Java classes specifically designed for writing speech recognition applications. Although VoiceXML uses a web architecture similar to HTML's, the needs of a speech app are very different (see Why Speech is Hard TBD).

SpeakRight has a Model-View-Controller (MVC) architecture similar to GUI frameworks. In GUIs, a control represents the view and controller. Controls can be combined using nesting to produce larger GUI elements. In SpeakRight, a flow object represents the view and controller. Flow objects can be combined using nesting to produce larger flow objects. Flow objects can be customized by setting their properties (getter/setter methods), and extended through inheritance and extension points. For instance, the confirmation strategy used by a flow object is represented by another flow object, so various types of confirmation can be plugged in.

Flow objects contain sub-flow objects. The application is simply the top-level flow object.

Flow objects implement the IFlow interface. The basics of this interface are:

IFlow getFirst();
IFlow getNext(IFlow current, SRResults results);
void execute(ExecutionContext context);

getFirst returns the first flow object to be run. A flow object with sub-flows would return its first sub-flow object. A leaf object (one with no sub-flows) returns itself. (See also Optional Sub-Flow Objects)

getNext returns the next flow object to be run. It is passed the results of the previous flow object to help it decide. The results contain user input and other events sent by the VoiceXML platform.

In the execute method, the flow object renders itself into a VoiceXML page. (See also StringTemplate template engine.)
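To make the three methods concrete, here is a minimal sketch of one leaf flow object and one flow object with sub-flows. Only the three IFlow method signatures come from the text above. PromptFlow, SequenceFlow, and the fields of SRResults and ExecutionContext are hypothetical stand-ins, and the sketch renders a prompt fragment by string concatenation rather than a full page through StringTemplate.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for the SpeakRight types (assumptions, not the real classes).
class SRResults {
    final String input;                              // hypothetical: recognized user input
    SRResults(String input) { this.input = input; }
}

class ExecutionContext {
    final StringBuilder page = new StringBuilder();  // hypothetical: collects rendered output
}

interface IFlow {
    IFlow getFirst();
    IFlow getNext(IFlow current, SRResults results);
    void execute(ExecutionContext context);
}

// A leaf flow object: plays one prompt, then finishes.
class PromptFlow implements IFlow {
    private final String text;
    PromptFlow(String text) { this.text = text; }

    public IFlow getFirst() { return this; }         // a leaf returns itself

    public IFlow getNext(IFlow current, SRResults results) {
        return null;                                 // nothing more to run after one page
    }

    public void execute(ExecutionContext context) {
        context.page.append("<prompt>").append(text).append("</prompt>");
    }
}

// A flow object with sub-flows, run in the order given.
class SequenceFlow implements IFlow {
    private final List<IFlow> subFlows;
    SequenceFlow(IFlow... subFlows) { this.subFlows = Arrays.asList(subFlows); }

    public IFlow getFirst() { return subFlows.get(0); }   // assumes at least one sub-flow

    public IFlow getNext(IFlow current, SRResults results) {
        int i = subFlows.indexOf(current);
        // The sub-flow after the current one, or null when the sequence is done.
        return (i >= 0 && i + 1 < subFlows.size()) ? subFlows.get(i + 1) : null;
    }

    public void execute(ExecutionContext context) {
        // This sketch never executes the sequence itself; only its leaves render.
    }
}

class IFlowSketch {
    public static void main(String[] args) {
        IFlow app = new SequenceFlow(new PromptFlow("Welcome"), new PromptFlow("Goodbye"));
        ExecutionContext context = new ExecutionContext();
        app.getFirst().execute(context);
        System.out.println(context.page);            // prints <prompt>Welcome</prompt>
    }
}
```

Here the application is just the outer SequenceFlow, and the decision about what runs next lives entirely in getNext.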

Execution uses a flow stack. An application starts by pushing the application flow object (the outer-most flow object) onto the stack. Pushing a flow object is known as activation. If the application object's getFirst returns a sub-flow then the sub-flow is pushed onto the stack. This process continues until a leaf object is encountered. At this point all the flow objects on the stack are considered "active". Now the runtime executes the top-most stack object, calling its execute method. The rendered content (a VoiceXML page) is sent to the VoiceXML platform.
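Activation as described above can be sketched in a few lines. This is an illustration under assumptions, not the runtime's real code: FlowStack, SimpleFlow, and the placeholder page text are hypothetical, SRResults and ExecutionContext are reduced to stubs, and getFirst is assumed never to return null.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stub stand-ins for the SpeakRight types (assumptions, not the real classes).
class SRResults { }

class ExecutionContext {
    String page;                                   // hypothetical: the rendered page
}

interface IFlow {
    IFlow getFirst();
    IFlow getNext(IFlow current, SRResults results);
    void execute(ExecutionContext context);
}

// Hypothetical flow object with at most one sub-flow; null means it is a leaf.
class SimpleFlow implements IFlow {
    final String name;
    private final IFlow firstSubFlow;

    SimpleFlow(String name, IFlow firstSubFlow) {
        this.name = name;
        this.firstSubFlow = firstSubFlow;
    }

    public IFlow getFirst() { return firstSubFlow == null ? this : firstSubFlow; }
    public IFlow getNext(IFlow current, SRResults results) { return null; }
    public void execute(ExecutionContext context) {
        context.page = "<vxml><!-- page for " + name + " --></vxml>";   // placeholder page
    }
}

class FlowStack {
    private final Deque<IFlow> stack = new ArrayDeque<>();

    // Activation: push the flow object, then keep pushing whatever getFirst()
    // returns until a leaf (an object that returns itself) is on top.
    void activate(IFlow flow) {
        stack.push(flow);
        for (IFlow first = flow.getFirst(); first != stack.peek(); first = first.getFirst()) {
            stack.push(first);
        }
    }

    int depth() { return stack.size(); }

    IFlow top() { return stack.peek(); }

    // Executes the topmost flow object and returns the page it rendered.
    String renderTop() {
        ExecutionContext context = new ExecutionContext();
        stack.peek().execute(context);
        return context.page;
    }
}

class ActivationDemo {
    public static void main(String[] args) {
        IFlow leaf = new SimpleFlow("askCity", null);
        IFlow menu = new SimpleFlow("menu", leaf);
        IFlow app = new SimpleFlow("app", menu);

        FlowStack flows = new FlowStack();
        flows.activate(app);                       // pushes app, menu, askCity
        System.out.println(flows.depth());         // prints 3
        System.out.println(flows.renderTop());     // prints <vxml><!-- page for askCity --></vxml>
    }
}
```

Only the leaf on top of the stack renders a page; the objects beneath it stay active and wait for control to return to them.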

When the results of the VoiceXML page are returned, the runtime gives them to the top-most flow object in the stack, by calling its getNext method. This method can do one of three things:
  • return null to indicate it has finished. A finished flow object is popped off the stack and the next flow object is executed.
  • return itself to indicate it wants to execute again.
  • return a sub-flow, which is activated (pushed onto the stack).
The result is that a new VoiceXML page is generated. Execution continues like this until the flow stack is empty.
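The three outcomes of getNext can be sketched as a small runner. This is a simplified illustration, not SpeakRight's runtime: FlowRunner, AskFlow, SequenceFlow, and the page strings are hypothetical, and two details are assumptions of the sketch: a leaf is passed itself as current, and after a pop the new topmost object is asked via getNext(finished, results) what to run next.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Simplified stand-ins for the SpeakRight types (assumptions, not the real classes).
class SRResults {
    final String input;
    SRResults(String input) { this.input = input; }
}

class ExecutionContext {
    final List<String> pages = new ArrayList<>();   // rendered pages, newest last
}

interface IFlow {
    IFlow getFirst();
    IFlow getNext(IFlow current, SRResults results);
    void execute(ExecutionContext context);
}

// Leaf that repeats itself until it receives non-empty input.
class AskFlow implements IFlow {
    private final String question;
    AskFlow(String question) { this.question = question; }
    public IFlow getFirst() { return this; }
    public IFlow getNext(IFlow current, SRResults results) {
        return results.input.isEmpty() ? this : null;   // itself = run again, null = finished
    }
    public void execute(ExecutionContext context) { context.pages.add("ask: " + question); }
}

// Parent that runs its sub-flows in order.
class SequenceFlow implements IFlow {
    private final List<IFlow> subFlows;
    SequenceFlow(IFlow... subFlows) { this.subFlows = Arrays.asList(subFlows); }
    public IFlow getFirst() { return subFlows.get(0); }
    public IFlow getNext(IFlow current, SRResults results) {
        int i = subFlows.indexOf(current);
        return (i >= 0 && i + 1 < subFlows.size()) ? subFlows.get(i + 1) : null;
    }
    public void execute(ExecutionContext context) { }
}

class FlowRunner {
    final Deque<IFlow> stack = new ArrayDeque<>();
    final ExecutionContext context = new ExecutionContext();

    void start(IFlow app) {
        activate(app);
        stack.peek().execute(context);
    }

    // Pushes the flow object and its chain of first sub-flows, down to a leaf.
    private void activate(IFlow flow) {
        stack.push(flow);
        for (IFlow first = flow.getFirst(); first != stack.peek(); first = first.getFirst()) {
            stack.push(first);
        }
    }

    // Feeds one set of results to the stack; returns false once the stack is empty.
    boolean handleResults(SRResults results) {
        IFlow top = stack.peek();
        IFlow next = top.getNext(top, results);
        while (next == null) {                      // finished: pop and ask the parent
            IFlow finished = stack.pop();
            if (stack.isEmpty()) {
                return false;                       // the application itself has finished
            }
            next = stack.peek().getNext(finished, results);
        }
        if (next != stack.peek()) {
            activate(next);                         // a sub-flow: push it onto the stack
        }                                           // otherwise the same object runs again
        stack.peek().execute(context);              // generate the next page
        return true;
    }
}

class FlowRunnerDemo {
    public static void main(String[] args) {
        FlowRunner runner = new FlowRunner();
        runner.start(new SequenceFlow(new AskFlow("name"), new AskFlow("city")));
        runner.handleResults(new SRResults(""));      // no input: "name" is asked again
        runner.handleResults(new SRResults("Ann"));   // "name" finishes, "city" is activated
        boolean running = runner.handleResults(new SRResults("Oslo"));   // stack empties
        System.out.println(runner.context.pages + " running=" + running);
        // prints [ask: name, ask: name, ask: city] running=false
    }
}
```

Each call to handleResults corresponds to one HTTP request: results come in, the stack is adjusted, and at most one new page goes out.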
