What is SpeakRight?

SpeakRight is an open-source Java framework for writing speech recognition applications in VoiceXML.Unlike most proprietary speech-app tools, SpeakRight is code-based. Applications are written in Java using SpeakRight's extensible classes. Java IDEs such as Eclipse provide great debugging, fast Java-aware editing, and refactoring. Dynamic generation of VoiceXML is done using the popular StringTemplate templating framework. Read more...

See Getting Started, Tutorial, and Table of Contents
Powered By Blogger

Saturday, February 17, 2007

What is SpeakRight?

SpeakRight is an open-source Java framework for writing speech recognition applications in VoiceXML. Unlike most proprietary speech-app tools, SpeakRight is code-based. Developers write their application in Java using SpeakRight classes and their own extensions. Java IDEs such as Eclipse provide great debugging, and fast Java-aware editing and refactoring.

SpeakRight is based around the flow object which has a similar role to controls in GUI frameworks. A flow object manages presentation (prompts, grammars, and retry logic), and control flow. In MVC terms, a flow object is both the view and the controller. Flow objects can be customized (by setting properties), or extended using sub-classing. SpeakRight provides built-in objects for standard data types (time, date, alphanum), and for standard flow algorithms (forms, menus, list traversal, confirmation).

Flow objects can contain sub-flow objects. This nesting continues all the way out to a single application flow object. At runtime SpeakRight executes the application flow object. It may decide that the first thing to do is to execute a sub-flow. The sub-flow may execute its sub-flow. And so on. Nesting continues until a leaf flow object that actually produces VoiceXML content is encountered. At this point, a flow stack has been built up of all the active flow objects. The topmost item in the stack is the leaf. Only one flow object executes at a time. SpeakRight persists itself across HTTP requests (using serialization) . The results of VoiceXML page arrive back at the servlet as a new HTTP request. SpeakRight gives the results to the currently executing flow object. It can do one of three things:
  • execute again, and produce another VoiceXML page
  • return null to indicate it has finished. It will be popped off the flow stack and the next flow object will be executed.
  • "throw" an event. SpeakRight uses a Throw/Catch style of event handling. A thrown event causes a search down the stack looking for a handler.

SpeakRight prompts use a powerful formatting system. A prompt string such as "The current interest rate is {$M.rate} percent. Please wait while we get your portfolio {..}{music.wav}" represents:
  • A TTS (text-to-speech) utteran "The current interest rate is"
  • a model value
  • more TTS "percent. Please wait while we get your portfolio.."
  • silence (500 msec)
  • audio file
Applications can use in-place prompts, such as the one shown. Or they can use prompt ids, such as "id:sayInterestRateAndWait" which is a reference to an external XML file containing the prompt string. Multi-lingual apps are possible by deploying language-specific prompt XML files.

SpeakRight grammars are... GRXML files. I've totally punted in this early version. Dynamic and GSL prompts later, but for now, get the wonderful Grammar Editor that comes (free) with the Microsoft Speech Application SDK (SASDK for MSS 2004 R2).

MVC architecture requires a Model and SpeakRight has one. The model is used to share data between flow objects. User input (or data retrieved from a database) that needs to be used later in the call flow should be put into the model. SpeakRight provides binding so this occurs automatically. In SpeakRight you generate the model using code generation -- a tool called MGen. You write a simple XML file specifying your model fields (such as "Date departureDate") and MGen generates a type-safe Model.java file. Your application can now use intellisense (called CodeAssist in Eclipse) to inspect the model variables. Each model variable has get, set, and clear methods.

Applications need to have side effects. Their purpose is to interact with some back-end database to transfer funds, buy a train ticket, or vote. SpeakRight calls these transactions. Each flow object, upon finishes, can invoke a transaction. If the transaction is invoked asynchronously, SpeakRight pauses the callflow until the transaction completes.

Of course, before we accept data we need to validate it. Flow objects have a ValidateInput method that is called whenever user input has occurred. The method can accept or reject the input. If input is rejected, the flow object is executed again.

Another part of speech recognition is confirmation. This is done whenever the recognition confidence level is below a confirmation level. SpeakRight uses pluggable confirmation strategies. Some of the strategies are: None, YesNo, ConfirmAndCorrect, and ImplicitConfirm. You can basically plug any confirmation strategy into any flow object.

Finally, and most importantly, SpeakRight provides an extensible architecture for re-usable components called SRO (SpeakRight Objects). These speech objects provide standard mechanisms for common tasks. There are (to come) SROs for input of common data types such as time, date, currency, and zip code. There are also SROs for common control flow algorithms such as list traversal. SROs are highly configurable: prompts and grammars can be customized or replaced. Prompts can be customized at several levels, including pluggable formatters for rendering model values.

Re-usability is a key technical challenge in the project. It should be easier (and better) to re-use existing objects than rolling your own. Prompts, for example, need to be highly flexible. Users do not like prompts that switch back and forth between TTS and pre-recorded audio. A single voice is important. SpeakRight provides an audio matching feature, where all prompt segments are looked up in an audio match XML file that contains a replacement audio file for that segment. Of course, prompt coverage tools are needed to help build the list of prompts to be sent for professional recording. These tools are (cough) TBD.

2 comments:

Mister Meta said...

"If the transaction is invoked asynchronously, SpeakRight pauses the callflow until the transaction completes."

Synchronously?

Are there async transactions and how does callback work?

IanRae said...

Async transactions work like this. An HTTP postback occurs and SRRunner.continueApp() gets called. It starts executing the active flow object; that is calling its execute() method. If the flow object finishes (meaning its getNext() returns null), then its doTransaction() is called. Here is where you can put your business logic. For async business logic you kick it off inside doTransaction() and return false. SpeakRight pauses the execution; doesn't generate any
VoiceXML.

I'm no Java guru (I come from .Net world), but I'm pretty sure there is a way for a servlet to delay writing to its PrintWriter. If not then we'll have to add a wait of some sort in the servlet's doGet or doPost.

The async operation will end through some external event. For example, you start a thread to listen on a socket. When data arrives, you can resume the SpeakRight app using SRRunner.resume(). It will pick up where it left off and generate a VoiceXML page.

Hope that makes things a little clearer. I'll make an example for this in the next month or so.