What is SpeakRight?

SpeakRight is an open-source Java framework for writing speech recognition applications in VoiceXML. Unlike most proprietary speech-app tools, SpeakRight is code-based. Applications are written in Java using SpeakRight's extensible classes. Java IDEs such as Eclipse provide great debugging, fast Java-aware editing, and refactoring. Dynamic generation of VoiceXML is done using the popular StringTemplate templating framework.

See Getting Started, Tutorial, and Table of Contents

Tuesday, December 4, 2007

The StringTemplate Template Engine

The StringTemplate template engine is a popular choice for generating markup text in Java. It comes from Terence Parr, the inventor of ANTLR.

SpeakRight uses StringTemplate (ST) for all its VoiceXML generation. When a flow object is rendered, it is first converted into a SpeechPage object. A SpeechPage is not VoiceXML-specific, which allows SpeakRight to output other formats such as SALT, or whatever you want. It's also the glue that ST requires. SpeechPages are rendered using one of the ISpeechPageWriter classes. For testing, an HTML page writer is available. The main page writer, though, is VoiceXMLPageWriter.

VoiceXMLPageWriter renders a SpeechPage into VoiceXML. A StringTemplate file defines the format for prompts, grammars, fields, forms, and other VoiceXML tags. This gives a lot of flexibility: if your VoiceXML platform has special requirements, simply modify the speakright.stg template file.
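To give a feel for how ST rendering works, here's a rough sketch (not SpeakRight's actual rendering code; the "field" template and its attributes are hypothetical, though the StringTemplate 3.x API calls are real):

import java.io.FileReader;
import org.antlr.stringtemplate.StringTemplate;
import org.antlr.stringtemplate.StringTemplateGroup;

// load the group file that defines the VoiceXML fragments
StringTemplateGroup group = new StringTemplateGroup(new FileReader("speakright.stg"));

// fill a template with values and render it to text
StringTemplate st = group.getInstanceOf("field"); // hypothetical template name
st.setAttribute("name", "city");
st.setAttribute("prompt", "What city?");
String vxmlFragment = st.toString(); // the rendered VoiceXML text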

Matt Raible on web frameworks

Matt Raible has a fascinating video comparing web frameworks. Comparisons are tricky since frameworks are changing rapidly, with multiple releases per year. However, he makes an interesting aside about the (lack of) value of visual IDEs. JSF comes with a drag-and-drop IDE that is "appealing to managers", but "if one wants to develop anything substantial, we're going to have to get down and dirty with the code."

I've been espousing a code-based approach for speech applications for a while. Indeed, that's the whole premise of the SpeakRight framework. Any substantial app will use dynamically generated code, not pages of handwritten markup text.

Things are a bit simpler for speech applications. The following criteria for comparing web frameworks don't apply:

  • Bookmarkable URLs
  • Avoiding the double-POST problem
  • AJAX
  • Massive scalability. Web applications may involve millions of users, but speech apps are still orders of magnitude smaller.
  • Page decoration. The vast topic of graphical design doesn't exist in a speech app. Persona is as close as one gets to "decoration".

Monday, December 3, 2007

NBest

NBest is a very useful feature for handling similar sounding words. Normally a speech rec engine finds the grammar rule that is the best match for the user's speech utterance. Large grammars can suffer from substitution errors where the wrong rule is matched: caller says "Boston" but the engine selects "Austin". NBest helps the application sort out this type of ambiguity.

When enabled, NBest is a request to the speech rec engine to return the top N matches, sorted in order of decreasing confidence level. N is usually a small number, such as 4. Remember that the NBest value is a maximum; fewer results may be returned.

In SpeakRight, NBest is enabled using the QuestionFlow method enableNBest.


flow.enableNBest(4); //up to 4 results


When the SRResults come back for that question, you can check for NBest results. The SRResults method hasNBest indicates whether more than one result was returned.

NBest Pruning
The simplest thing an application can do is check the NBest results in validateInput, and use additional application logic to select the most likely result. This is called NBest pruning. For example, if the user is asked for her account number, each result can be checked against the database. If only one result is a valid account number, the application could assume that's what the caller said.


String probableAccountNumber = "";
int matches = 0;
for (int i = 0; i < results.getNBestCount(); i++) { // count accessor name assumed
    String value = results.getNBestValue(i);
    if (CheckAccountNumber(value)) { //check against the database
        probableAccountNumber = value;
        matches++;
    }
}
if (matches == 1) {
    results.replaceInput(probableAccountNumber); //let's use it!
}



NBest Confirmation
A more common use for NBest is to do confirmation. When NBest results are returned, the application confirms each NBest result, stopping as soon as the user says "yes".

C: What city?
H: Boston
C: (returns 'Austin' and 'Boston' as NBest results) Do you want Austin?
H: No
C: Do you want Boston?
H: Yes
C: Great. Flying to Boston on what date?
...

The application may still want to prune the NBest results, re-ordering the results according to the most likely answers. This way the first confirmation question is more likely to be the correct one. This is an important part of NBest -- using additional context information and application logic to improve on the speech rec engine's results.

Pass an NBestConfirmerFlow and the question flow object to a ConfirmationWrapper object that will manage the confirmation process. It will ask the user to confirm values until the caller accepts one (by saying "yes" or whatever your confirmation grammar uses for acceptance). If the caller says "no" to all NBest values, then the question is asked again, and the process repeats. You can override NBestConfirmerFlow to adjust this behaviour.

Note that NBest confirmation is an extension of basic confirmation. A YesNoConfirmerFlow confirms a single result, while a NBestConfirmerFlow confirms multiple results.

ConfirmationWrapper cw = new ConfirmationWrapper(new AskCity(),
new NBestConfirmer("yesno.grxml"));


Skip Lists
A skip list is a list of words that the application will not confirm because the caller has already rejected them. This is an optional feature of NBestConfirmerFlow. Enable it with the enableSkipList method. If the caller says "no" to all NBest values, then the question is asked again. Before beginning confirmation, NBestConfirmerFlow will remove from the new NBest results any values that were rejected during the previous round of confirmation questions. If this results in only a single NBest result, then there is no need for confirmation.
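In code, enabling a skip list might look like this (a sketch; the constructor argument follows the ConfirmationWrapper example above, and the exact signatures are assumptions):

NBestConfirmerFlow confirmer = new NBestConfirmerFlow("yesno.grxml");
confirmer.enableSkipList(); // remember rejected values between confirmation rounds
ConfirmationWrapper cw = new ConfirmationWrapper(new AskCity(), confirmer);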

C: What city?
H: Crosston
C: (returns 'Austin' and 'Boston' as NBest results) Do you want Austin?
H: No
C: Do you want Boston?
H: No
C: (asking the question again) Let's try again. What city?
H: Crosston
C: (returns 'Austin', 'Crosston', and 'Aulston' as NBest results. Austin is removed.) Do you want Crosston?
H: yes
C: Got it. Flying to Crosston on what date?
...

If you don't use a skip list, the application can infuriatingly confirm the same wrong result again and again.

NBest With SROs
SpeakRight Reusable Objects (SROs) are pre-built flow objects for gathering common data such as numbers, dates, etc.

To enable NBest for an SRO, use its enableNBest method. This will use an SROConfirmNBest confirmer object. If you need to use a custom confirmer, call enableNBest followed by setConfirmer to pass in your custom confirmer.


SRONumber flow = new SRONumber("tickets", 1, 10);
flow.enableNBest(4); //up to 4 results


Friday, November 30, 2007

Version 0.1.4 now available

The latest release is available here. Transfer and Record are now supported. There are several new flow object classes, including RawContentFlow (roll your own VoiceXML), and GotoUrlFlow (transfer to another VoiceXML application).

Content-logging is a new feature that's helpful during development -- VoiceXML content is dumped to text files so you can see what the rendered VoiceXML looks like.

Some code refactoring has also been done. Flow object classes are now in the package org.speakright.core.flows.

Enjoy.

Thursday, June 14, 2007

Initialization

SpeakRight apps normally run in three different environments: in a JUnit test, in the interactive tester, and, most importantly, in a servlet. You can avoid problems by creating a single piece of initialization code that is used across all environments. This piece is called the app factory. It should be derived from SRFactory, which performs standard initialization.

Your class should override onCreateRunner and onInitRunner to do additional initialization, such as:
  • create and attach a model object
  • register prompt file(s)
  • set the extension point factory
  • other things. For example, the SimpsonsDemo app records votes in a text file, and its Voting object needs to be initialized with the path
Initialization is done using the createRunner method of SRFactory:

public SRRunner createRunner(String projectDir, String returnUrl, String baseUrl, ISRServlet servlet);

The projectDir is a path to the application's base directory, which usually has sub-directories audio, grammar, and sro.

The two URLs are only needed in a servlet environment. returnUrl is the URL that the VoiceXML page should postback to. baseUrl is used to generate URLs for audio and grammars.

The servlet parameter can be null. It's an extension point that allows the servlet to do extra initialization.
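Putting this together, an app factory might look like the following sketch (the onInitRunner signature and the setModel method are assumptions; registerPromptFile is described under Prompt Ids and Prompt XML Files below):

public class AppFactory extends SRFactory {
    @Override
    public void onInitRunner(SRRunner runner) {
        // attach the application's model object (method name assumed)
        runner.setModel(new Model());
        // register the app-specific prompt XML file (hypothetical path)
        runner.registerPromptFile("prompts/app.xml");
    }
}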

Now let's look at each environment in turn.

JUnit

In a unit test, the dependencies can be visualized like this, from top to bottom:

JUnit test class
App (your callflow)
SRRunner
SRFactory or your derived class
SRConfig

Use your app factory to create a runner:
AppFactory factory = new AppFactory();
SRRunner run = factory.createRunner();
Then run your application using the start and proceed methods of SRRunner.

If your app uses properties in the srf.properties file, you need to initialize SRConfig first. JUnit 4 has a per-class initializer, @BeforeClass:

@BeforeClass static public void initConfig() {
    SRConfig.init("C:\\source\\app2\\", "srf.properties");
}


Interactive tester

The interactive tester is a console app. The dependencies can be visualized like this, from top to bottom:

App (your callflow)
SRInteractiveTester
SRRunner
SRFactory or your derived class
SRConfig

SRInteractiveTester inits SRConfig for you.

SRInteractiveTester tester = new SRInteractiveTester();

AppFactory factory = new AppFactory();
SRRunner runner = factory.createRunner(appDir, "http://def.com", "", null);

App app = new App();
tester.init(app, runner);
tester.run();

Servlet

In a servlet, the dependencies can be visualized like this, from top to bottom:

Servlet
App (your callflow)
SRRunner
SRServletRunner
SRFactory or your derived class
SRConfig

In a servlet, the SRServletRunner class is used. You pass it your app factory and it does initialization, including SRConfig. The SRRunner's project directory is set to the directory corresponding to the web app's "/" URL.

The code in doGet should be:

SRServletRunner runner = new SRServletRunner(new AppFactory(), null, request, response, "GET");

if (runner.isNewSession()) {
    SRRunner run = runner.createNewSRRunner(this);

    IFlow flow = new App();
    runner.startApp(flow);
}
else {
    runner.continueApp();
}

The code in doPost should be:

SRServletRunner runner = new SRServletRunner(new AppFactory(), null, request, response, "POST");

if (runner.isNewSession()) {
    //err!!
    runner.log("can't get new session in a POST!!");
}
else {
    runner.continueApp();
}


SRConfig

SRConfig provides access to an srf.properties file. Properties are often used by the constructors of flow objects. Therefore it's important to initialize SRConfig early:

SRConfig.init(path, "srf.properties");

For console apps or JUnit, a hard-coded path is used. For servlets, this is done for you by SRServletRunner, which uses the directory corresponding to the web app's "/" base URL.

Currently the SpeakRight framework itself does not use any properties, but applications are free to.

Thursday, May 24, 2007

List of Flow Objects

Flow objects are the building blocks of SpeakRight applications. Here is the list of available objects:
  • BranchFlow Performs branching in the callflow based on an application-defined condition
  • ChoiceFlow Branches based on user input, such as in a menu
  • DisconnectFlow Hangs up the call
  • FlowList A sequence of flow objects, optionally ending with an AppEvent
  • GotoUrlFlow Redirects to an external URL
  • LoopFlow Iterates over a sequence of sub-flows
  • NBestConfirmerFlow Confirms NBest results
  • PromptFlow Plays one or more prompts
  • QuestionFlow Asks the user a question. Has built-in error retries for silence and nomatch.
  • RawContentFlow Outputs raw VoiceXML
  • RecordAudioFlow Records the caller's voice to an audio file
  • SRApp The root flow object
  • TransferFlow Transfers the call
  • YesNoConfirmerFlow Confirms a single result
Additional flow objects can be created by implementing the IFlow interface.

There are also SROs (SpeakRight Reusable Objects) which you can use.

Call Control

The focus of SpeakRight is to be a framework for voice user interfaces. As such, its support for call control is fairly limited. Consider using CCXML for applications that make heavy use of call control, or RawContentFlow to generate platform-specific transfer and conferencing features.

SpeakRight catches the disconnect event (usually connection.disconnect) in order to do a final postback. This results in the DISCONNECT event being thrown and onDisconnect being invoked. SRApp provides a default handler for this.
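An app that needs custom cleanup at hangup can override the handler. A sketch, assuming the handler signature listed in Control Flow, Errors, and Event Handling below:

public class MyApp extends SRApp {
    @Override
    public IFlow onDisconnect(IFlow current, SRResults results) {
        saveCallStats(); // hypothetical app-specific cleanup
        return null; // nothing more to execute; the call is over
    }

    private void saveCallStats() { /* record call statistics, release resources */ }
}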

There are a number of call control flow objects.

DisconnectFlow
This flow object plays a final prompt and hangs up.

SROTransferCall
This flow object transfers the call using one of the VoiceXML 2.1 transfer types: Blind, Bridged, or Consultation. The destination parameter is a string, such as "tel:+2560001112222". The format of the dial string is often platform-specific.

TransferFlow has two prompts. An initial prompt called main is played before the transfer is initiated. A transferFailed prompt is played if the transfer fails to complete. Both prompts can be overridden in an app-specific prompt XML file.

If the transfer fails because, for example, the destination is busy, then onTransferFailed is invoked. It plays the transferFailed prompt. Or you can override onTransferFailed and provide your own behaviour.

TransferFlow
This flow object is a low-level object. We recommend the use of SROTransferCall instead.

RawContentFlow
This flow object is an escape hatch. The application can supply any VoiceXML it likes. RawContentFlow may be useful for invoking platform-specific call control features.

Thursday, May 17, 2007

Optional Sub-Flow Objects

It's common in a speech application for some sections of the callflow to be optional: if the user is a preferred customer, do X; or if the app has forceLogin enabled, then do Y.

Applications can simply create or not create certain flow objects based on the required logic. However, an alternative solution is provided, called optional sub-flows. The BasicFlow and LoopFlow classes have been enhanced: the flow objects they contain (called sub-flows) can indicate that they don't wish to run by returning false from shouldExecute. When this occurs, the sub-flow is skipped and the next sub-flow is executed.

The advantage of optional sub-flows is that the decision on whether to run or not can be deferred until a sub-flow is executed. The initialization code doesn't need to handle this.

However, there is a restriction: the final sub-flow cannot be optional, because if the final sub-flow returns null from its getFirst, there's nothing to execute.

Example code that creates the callflow using a BasicFlow object:

//callflow creation..
BasicFlow flow = new BasicFlow();
flow.add(new PromptFlow("Welcome"));
flow.add(new RegisterUser());
flow.add(new MainMenu());
app.add(flow);

And the class definition for the optional sub-flow looks like this:

class RegisterUser extends BaseSROQuestion {
    public Model M;

    @Override public boolean shouldExecute() {
        return !M.userHasRegistered; // skip registration if the user already registered
    }
    //..rest of class definition omitted..
}

Note: I'm not completely happy with this feature. It's handy, but the can't-be-last restriction will be easy to forget and will only fail at runtime. Full test coverage is required!



Wednesday, May 16, 2007

Supported Platforms

SpeakRight has been tested on these platforms:

  • Voxeo Evolution (free hosting site http://community.voxeo.com/). VoiceCenter 5.5
  • Voxeo Prophecy platform version 8.0 beta
The main mechanism for porting to a new platform is to customize the template file; see The StringTemplate Template Engine.

VoiceXML Tags Supported

Here's an alphabetical list of VoiceXML tags supported by SpeakRight:

  • assign name expr used to track dialog state
  • audio src play audio file
  • block
  • break time time is in msec
  • catch connection.disconnect so the app gets a final postback
  • disconnect
  • exit
  • field name
  • filled
  • form one per page
  • goto next To goto an external URL
  • grammar type src type can be "text/gsl", "application/srgs+xml", or "application/srgs".
  • noinput count bargein main prompt(s), one per escalation
  • nomatch count bargein main prompt(s), one per escalation
  • prompt count bargein main prompt(s), one per escalation
  • submit next namelist method
  • transfer type dest connecttimeout
  • var name expr used to track dialog state
In addition, the RawContentFlow can be used by an app to output custom VoiceXML. It's useful for features not yet supported by SpeakRight.

You can customize the VoiceXML; see the StringTemplate template engine.

Thursday, May 10, 2007

Version 0.0.3 Released

Latest code drop is available at SourceForge (see Download link on the left). The project has been elevated to alpha status and can be used to build some real apps. The SimpsonsDemo app is an example of this, and is included as part of the release.

Thursday, May 3, 2007

Automated Testing

Computers are good at doing repetitive work. Few things are more repetitive than testing software -- so let the computer do it!

SpeakRight provides an automated tester SRAutoTester. It runs your callflow in a test harness where user input comes from strings that you provide. It checks the progress through the callflow to validate the application logic. Tests can be run directly on a developer's machine, no VoiceXML platform is needed.

SRAutoTester uses the same test harness as SRInteractiveTester, so you can test manually (using the keyboard) as well as auto-test.

The format of the string you give is a set of commands separated by semi-colons, where each command's format is: cmd[~ExpectedCurrentFlowObject]

The commands are
  • "e" echo. toggle echo to log of VoiceXML on/off
  • "g" go. simulate the current VoiceXML page ending and posting back. Causes SpeakRight to proceed to the next flow object. Can contain user input, such as "go chicago" or if a slot is set use ":" like this "go city:chicago"
  • "q" quit
Let's do a quick example. An app begins with a welcome prompt then asks for user id and password. If the user input is a valid login then the app proceeds to "MainMenu" otherwise it plays "LoginFailed".

The commands for testing a good login are: g;g 4552;g 1234;g~MainMenu;q
Let's break that down:
  • "g" means run the first flow object, which is the welcome prompt
  • "g 4552" is the user id
  • "g 1234" is the password
  • "g~MainMenu" validates that we're at the MainMenu flow object
To test a bad login: g;g 9999;g 9999;g~LoginFailed;q

Note that these tests are high-level tests of the callflow. They let you verify requirements such as "A bad login does not proceed to the Main Menu". Testing the details of the VUI prompts and grammars needs to be done as well, either with another SpeakRight tester (TBD) or on the VoiceXML platform itself.
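In code, such tests might look like this sketch (the SRAutoTester constructor and run method shown here are assumptions; only the command-string format is documented above):

// hypothetical API: feed the command strings from the examples above
SRAutoTester tester = new SRAutoTester(new App());
tester.run("g;g 4552;g 1234;g~MainMenu;q"); // good login reaches MainMenu
tester.run("g;g 9999;g 9999;g~LoginFailed;q"); // bad login plays LoginFailed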

Wednesday, May 2, 2007

Benefits of a Code-Based Approach

Sometime in the 1980s, voice applications (called IVRs) appeared, and drag-and-drop toolkits followed. IVR apps were structured much like a flowchart, since the user navigated the callflow using one of 12 DTMF keys. A visual programming model seemed appropriate for IVR development. The tools work well on small projects of up to fifty nodes or so. On larger apps the visual approach breaks down. It becomes hard to navigate an app with hundreds or thousands of nodes. Code changes become tedious; try changing the MaxRetries value from 3 to 4 in all GetDigits nodes in a 200-node app! Also, the architectural weakness of visual programming becomes more apparent as size increases. Its programming model is really a 1960s FORTRAN model based on GOTOs and global variables. Structured programming features, let alone object-oriented features, are simply not supported.

Drag-and-drop toolkits remain viable for DTMF apps because the apps are simple. Speech applications are much more complicated than the equivalent DTMF app. There are roughly nine times as many prompts (escalated versions of the main prompt plus silence and nomatch prompts). Confirmation needs to be done, since speech recognition is never 100% accurate. Lastly, speech apps are more complicated because, released from the limitations of 12 DTMF keys, they try to do more. This complexity means that speech apps need a more powerful development environment, such as the Java programming language.

The first wave of speech applications were written directly in VoiceXML. Again, this is simple for small apps but doesn't scale. A large app has many VoiceXML files, and the relationship between them is not clearly shown. A login.vxml file may submit its results to main_menu.vxml, but that is not apparent from looking at a list of files. Raw VoiceXML does not have any modern programming constructs such as inheritance or design patterns. Lastly, unit testing and debugging are difficult.

This brings us to the final option: a code-based approach. Write the application in Java.

IDEs are powerful
Use the full power of a good IDE with refactoring support, code assist (AKA Intellisense), unit testing, debugging, and integrated source control. The Eclipse IDE, for example, is used by millions of programmers. It will be improved and extended at a far faster rate than any proprietary toolkit. And Eclipse is free.

Better Debugging
Java IDEs have real debuggers. Enough said.

Better Testing
Java IDEs have excellent unit testing. SpeakRight provides a keyboard-based interactive tester, and an HTML mode for executing an app using an ordinary web browser (HTML is generated instead of VXML).

More tools
There are source code tools for code coverage, profiling, generating documentation and design diagrams. Source code control tools allow important questions such as 'what's changed since last week' to be answered.

Code is flexible
Source code is extremely flexible. Unlike drag-and-drop tools that offer only a few levels of granularity, code can be organized and combined in many ways. Let's look at the ways code can be used.

See also Matt Raible On Web Frameworks

Configuration
An object can be configured by setting its properties. This allows re-usable objects to be customized for each use. The customization can be done in code:

flow.m_maxAttempts = 4;

or it can come from external configuration files. SpeakRight allows prompts and grammars to reside in XML files that can be changed post-deployment without having to rebuild the app.

Sub-classing
The DRY principle is Don't Repeat Yourself. DRY reduces the amount of source code and makes modifications simpler. Java inheritance is one way of centralizing common code. Any class deriving from a base class gets the base class's behaviour automatically. In our example concerning GetDigits nodes, the MaxRetries value can be defined in a single place (the base class). Changing the value in the base class causes the change to ripple down to all derived classes.

Java inheritance is flexible because values or behaviour can be overridden at any point in the class hierarchy.
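A minimal sketch of the GetDigits example (QuestionFlow and the m_maxAttempts field appear elsewhere in this blog; BaseGetDigits and MyGetPassword are hypothetical classes):

// all digit-collecting questions inherit one MaxRetries setting
public abstract class BaseGetDigits extends QuestionFlow {
    public BaseGetDigits() {
        m_maxAttempts = 4; // defined once; ripples down to every subclass
    }
}

// a concrete question simply derives from the shared base
public class MyGetPassword extends BaseGetDigits {
}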

Composition
Composition is the process of assembling objects into useful components. Modern frameworks make much use of interfaces and extension points. An extension point allows behaviour to be changed by plugging in different implementations of it. In SpeakRight, confirmation is an extension point where different types of confirmation can be plugged in: yes-no confirmation, confirm-and-correct confirmation, and implicit confirmation.

Extension points increase re-use because the number of options multiply. If you have four types of GetNumber objects and three types of confirmation, you have twelve types of GetNumber-And-Confirm behaviour to choose from.

A menu is another example of an extension point. A menu is basically a question followed by a switch statement. Both are extension points that allow flexible menus to be created that still share the common base code.

Refactoring
Refactoring is the process of improving code quality without changing the external behaviour. Common code can be pulled into methods or classes. Interfaces and extension points can be added to increase the flexibility of a class. Code can be packaged into namespaces and libraries. Inherited behaviour can be overridden.

A code-based approach allows all these modern software development techniques to be applied to speech apps.

Changing The Framework
SpeakRight is open-source so everything is available to you for modification.

Everything is a Flow Object
In SpeakRight, the callflow is built out of flow objects. Everything from a low-level "yes-no" question, to a form with multiple fields, up to the app itself are flow objects.

Flow objects participate in generating content (VoiceXML). A flow object is notified of each prompt being rendered, and allowed to modify it. Flow objects can control which VoiceXML is generated, and if needed, the entire VoiceXML rendering can be replaced (it's another extension point).

Flow objects participate in deciding the execution path through the callflow. Because they return an IFlow object to be run next, it's easy to inject additional behaviour when needed. Confirmation is done this way.

Consider a VUI dialog for traversing a list of items. The common behaviour is the commands "next" and "previous" (and possibly "first" and "last"). These move to a new item and say its value, or play an error message if the end of the list has been reached. List traversal is a common VUI feature, but difficult to make into a re-usable artifact in a non-code-based approach. With code however, this is a standard sort of OO design task.
  • Prompts and grammars are made into fields. Default values are provided but can be overridden or configured using getter and setter methods.
  • The list is a generic Java Collection, allowing it to be a list of anything. An IItemFormatter interface is created so the rendering of a Java object (string, integer, XML, whatever) into a prompt becomes an extension point. The default formatter just uses toString.
  • SpeakRight's flow objects are pause-able. This means that a list traverser can pause while another VUI dialog runs, and resume when it finishes. A list traverser can now be a main menu for an app that works on a list of items (such as flights to select from). Additional commands can be added, so that in addition to the traversal commands, the menu can accept additional commands such as "details", "accept", and "search". All of this is built on top of the existing list traversal class; no code duplication is required.
And unlike a drag-and-drop toolkit where a list traversal node has a fixed set of features, there are no restrictions in a code-based approach. You want "next" to wrap-around when it reaches the end of the list? No problem.

Less Code, Less Testing
A code-based approach, by promoting re-use and the DRY principle, reduces the size of the application. This has many benefits:
  • Faster development. Less time spent in tedious repetitive work.
  • Easier to change. A class hierarchy is like the paragraph styles in a word processor. Rather than sprinkling formatting all over the document, it's kept in a few styles (base classes) where it's easily managed and changed.
  • Consistent Voice User Interface. Shared code leads to shared behaviour, which leads to a consistent user interface.
  • Reduced testing. This is a huge gain. When common code is used, it only needs to be tested once, even though it's used multiple times in the app. For example, yes-no confirmation can be plugged in to the confirmation extension point of a flow object. Once you've validated that it works in one flow object, there's no need to test all other flow objects, since they share the same (base class) code.


Flexible Development
A code-based approach lacks the artificial boundaries of drag-and-drop toolkits. Development can begin by using existing classes and configuring them as needed. When you find the same VUI dialogs appearing in multiple places, sub-classing can be used to centralize a common configuration, such as a MyGetPassword class. As more new classes are created, they can be combined into a class hierarchy in order to share common code, with extension points added where variability is needed. When classes are re-used in other projects, they can be packaged as a library.

Prompt Ids and Prompt XML Files

Tuning a speech application often involves changing prompts, to re-word a question or improve an error message. This should be possible without having to rebuild the app. SpeakRight uses prompt ids to provide this feature.

A prompt id is a PText item beginning with "id:", such as "id:outOfRange". When the prompt is rendered, a set of XML files is searched to find an entry for that id. The entry looks like this:
<prompt name="outOfRange" def="true">That value is out of range. </prompt>
The prompt text for the id is a full PText; it can contain references to other ids, for example. It's an error if a prompt id cannot be found in any of the XML files.

Which set of XML files? You define them using SRRunner.registerPromptFile, usually one per app. The framework itself may register some; each SRO has its own prompt XML. The registration may be permanent (for the life of the app) or temporary (for the current flow object execution). The list of XML files is searched in reverse order, so that your XML files are searched first and framework XML files last.

Prompt Groups

Another useful feature is the ability to build an app using rough prompts, and then finalize the prompts later without any code changes. Prompt groups do this. Each flow object has a prompt group; the default value is the flow object's name. Prompt ids are looked up twice. First the prefix is added, so for an id "id:outOfRange" in a flow object "MyMenu" the first lookup is "id:MyMenu.outOfRange". If this prompt id is found, the value in the XML file is used. If not, then a second lookup without the prefix is done, which for our example would be "id:outOfRange".
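For example, an app-specific prompt file could override the message just for MyMenu, while other flow objects fall back to the generic entry (illustrative entries, following the prompt-file format shown above):

<prompt name="MyMenu.outOfRange">Please pick one of the menu choices. </prompt>
<prompt name="outOfRange">That value is out of range. </prompt>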

All SROs use prompt ids with default values (see Prompts in SROs). The default prompts are usually good enough to get your app logic up and tested. Then you can create an app-specific prompt XML file and register it (using SRRunner.registerPromptFile). Now you can define the prompt text for all the flow objects at your leisure. No code changes needed!

Thursday, April 26, 2007

Simpsons Demo

SimpsonsDemo is a simple voting application. It's a working speech application that demonstrates the use of SpeakRight in a non-trivial app. In SimpsonsDemo, users can call in and vote for a character from the Simpsons TV show, one vote per phone call. After voting they can hear the current standings and other information about the application. In order to make the application more complex, each character is assumed to have a related character who they are closest to. For Mister Burns this might be Smithers (and vice versa!).

Development Process

Productivity of developers is important. When code is quick to create, it's quick to change (and test!). This encourages an agile development process.

A SpeakRight application is developed in stages.

Stage 1. The application flow objects are created and wired together. Concentrate on the Model and the overall callflow. Don't worry about grammars and prompts at this point. Use inline grammars and default prompts. Do all testing using the keyboard-based interactive tester (and unit tests). The goal at this stage is to get the callflow logic working. External data access can be done at this stage, or mocked out, and done later in parallel with other stages.

Stage 2. Define the grammars and prompts. Use external grammars that take into consideration pre-amble, post-amble, and various ways of saying things ("LA" and "Los Angeles"). An application prompt XML file should be created that defines the main and error prompts for your flow objects. Deploy the app as a servlet and test using the HTML mode, where the app can be executed using a web browser. This will flush out missing files and other errors.

Stage 3. Deploy the app to the VoiceXML platform. At this point the callflow logic should already be complete and well tested, and the prompts and grammars defined. All that remains is to listen for mispronunciations, poor prosody, and other VUI-level errors.

The idea is to get as much testing done before deploying to the VoiceXML platform, where testing becomes much slower and more difficult (especially automated testing). Of course, it's not a fixed waterfall approach; you may, for example, need to prototype some VUI design issues on the VoiceXML platform before tackling stage 1.

Description of the Call Flow
When a user calls in they choose a character by saying the character's name, such as "Mister Burns". The application plays a short description of the character and asks if the user wishes to vote for this character. If yes, the vote is recorded and the user is taken to the main menu. Otherwise the user is asked if they want to hear about the related character. If the user says yes, the related character's description is played and the user has the opportunity to vote for the character as before. If the user chooses not to hear about the related character then they are asked to choose another character.

The main menu has four options.
  • choose character. Select a Simpsons character and vote, as described above.
  • voting results. Hear the voting results. Results are played in sets of three. The top three characters are listed. If the user says 'next', then the next three characters are listed.
  • call statistics. Lists # of calls, average call duration, and other statistics.
  • speakright. Plays a description of the SpeakRight framework.
Sample Call
(For brevity we've left off examples of error handling for silence and nomatch errors)

Computer: Welcome to the Simpsons Demo where you can vote for your favorite Simpson's character. {pause} Please say the name of a character, such as "Mister Burns".
Human: Homer
C: Homer Simpson is the show's main character. His signature annoyed grunt "D'oh!" has been included in the Oxford English Dictionary. Do you want to vote for Homer?
H: No
C: Do you want to hear about Marge, Homer's wife?
H: Yes
C: Marge Simpson is the well-meaning and patient wife of Homer. Do you want to vote for Marge?
H: Yes.
C: Vote recorded. {pause} Main Menu. You can say 'choose character', 'voting results', 'call statistics', or 'hear about speakright'.
H: call statistics
C: There have been 413 calls with average length of 65 seconds. The average completion is...
H: Hangup


Pseudo-Code for the Call Flow

Let's write the call flow logic as a series of actions with some basic pseudo-code to represent branching and looping. Labels are marked in bold.

Welcome
A: ChooseCharacter
B: SayCharacterInfo
VoteForCharacterYesNo
if yes then goto MainMenu
HearAboutRelatedCharacterYesNo
if yes then goto B else goto A

MainMenu: MainMenu
if 'character' then goto A
else if 'results' then SayVotingResults
else if 'statistics' then SayCallStatistics
else if 'speakright' then SaySpeakRightInfo

Writing the Call Flow in Java
In SpeakRight, the pseudo-code for a callflow can be converted into Java code in a fairly simple way. Each action becomes a flow object, and is represented by a class derived from one of the SpeakRight base classes, such as a PromptFlow that plays some audio output. A series of flow objects are executed in sequence.

Where branching is required, the getNext method of a flow object can be overridden to add the branching logic. getNext returns either a flow object to be executed, or an event object which causes execution to jump to a previous point. Event objects and event handlers act like "throw" and "catch" respectively.

The application's data is stored in the model, a special class generated by the SpeakRight tool MGen. The model can be used to hold user input (such as the currently selected Simpson's character) and retrieved data (eg. from a database). It can also hold control data that is used to control the execution path.

Let's get started. The outermost flow object represents the entire callflow, and is always derived from SRApp.


public static class SimpsonsDemo extends SRApp
{
    public SimpsonsDemo()
    {
        addPromptFlow("Welcome");
        add(new MainLoop());
    }
}

The constructor adds two sub-flows: a welcome prompt and a loop flow object. The welcome is only played once and the remainder of the callflow is done in a loop since the user can go back and forth from selecting a character to the main menu as many times as he or she likes.

The MainLoop class defines a model variable M. By convention, SpeakRight will inject a value for M at runtime automatically.

public static class MainLoop extends BranchFlow
{
    public Model M;

The first method called in a flow object is its onBegin method. Here we initialize the two main model values. nextAction is used to control what MainLoop does next, and currentCharacterId is the currently chosen Simpsons' character.


@Override
public void onBegin() {
    M.nextAction().set("A");
    M.CurrentCharacterId().clear();
}

The next method to be called is branch, BranchFlow's override point for choosing what to run next. If nextAction is set to choose a character, then we build a sequence of sub-flows for selecting and voting for a character. Otherwise we return the main menu flow object.

 
@Override
public IFlow branch() {
    if (M.nextAction().get().equals("A")) {
        BasicFlow flow = new BasicFlow();
        if (M.getCurrentCharacterId() == 0) { //no char selected?
            flow.add(new AskCharacter());
        }
        flow.add(new PromptFlow("{$M.CurrentCharacterId}"));
        flow.add(new VoteYesNo());
        flow.add(new RelatedCharacterYesNo());
        return flow;
    }
    else {
        MainMenu menu = new MainMenu();
        return menu;
    }
}

AskCharacter is a flow object that asks the user to enter a character name.


public static class AskCharacter extends BaseSROQuestion {

    public AskCharacter() {
        super("character");

        m_main1Prompt = "Say the name of a Simpson's character";
        m_slotName = "x";
        m_modelVar = "currentCharacterId";
    }
}

The voting flow object is a yes/no question. The VoteYesNo class asks the question, and then its getNext method handles the result. If 'yes' is the input, then we record the vote and jump to the main menu. A goto event (MainLoop.GotoEvent in the code below) is used to do this.

public static class VoteYesNo extends SROYesNo {
    public VoteYesNo()
    {
        m_main1Prompt = "Do you want to vote for {$M.CharacterName}";
    }

    @Override
    public IFlow onYes() {
        return new MainLoop.GotoEvent(MainLoop.BRANCH_MAIN_MENU);
    }
}


The goto event is caught by MainLoop in its onCatchGotoBranchEvent method. Here we set the branching condition M.nextAction.

protected void onCatchGotoBranchEvent(GotoBranchEvent ev)
{
    log("branch. action: " + ev.m_action);
    M.nextAction().set(ev.m_action);
}

MainLoop is a BranchFlow with its loop-forever option set. MainLoop will be executed again and, depending on nextAction, do either the main menu or choose-a-character.

SimpsonsDemo has an interactive tester (for keyboard testing), and an auto-tester (see Automated Testing).

Thursday, April 19, 2007

SpeakRight Reusable Objects (SROs)

One of the biggest challenges for a speech app framework is to maximise the reuse of common VUI dialogs. Collection of common data elements should be reusable; this includes time, date, currency, numbers, phone numbers, and zip codes. Another area of commonality is user interface elements: login, main menu, "hot word" commands, list traversal, the enter-or-cancel pattern, and confirmation.

SpeakRight provides a set of reusable speech objects called SROs. They are configurable and extensible. Here is a list of the ways SROs can be "tweaked":

  • SROs have a full set of prompts, with main, silence, no-reco, and help prompts. Up to four escalations of each can be defined.
  • any or all prompts can be replaced. Each SRO has a subject, a word such as "flights", which is used to build prompts. Changing the subject word is the simplest way to adjust the prompts. There is an extension point for handling the plurality of subject words ("flight", "flights"). Or the entire prompt can be replaced.
  • prompts can be conditional, such as a prompt that only plays the first time an SRO is executed.
  • prompts can be defined at compile time, at runtime in code, or in external XML files.
  • grammars are replaceable. Inline grammars or grammar files can be used. The only restriction on a grammar is that it uses the slot names required by the SRO.
  • validation code can be added. This server-side code inspects user input and either accepts it or causes the SRO to re-execute.
  • Model binding. An SRO has a model variable name. When set, the user input results are bound to the model (i.e. stored in the model for later use by the app)
  • Command phrases can be added to an SRO. A common data entry pattern is the enter-data-or-say-cancel pattern. SROs have a list of command phrases that you can add to.
  • confirmation can be added. An SRO has a confirmation plug-in that can be used to add various forms of confirmation (explicit, implicit, or confirm-and-correct).


The current list of SROs is:
  • SROCancelCommand
  • SROChoice
  • SROConfirmYesNo
  • SRODigitString
  • SROListNavigator
  • SRONumber
  • SROOrdinalItem
  • SROTransferCall
  • SROYesNo
Adding new SROs is simple. Code-generation (using StringTemplate) generates a base class (e.g. genSRONumber) that manages prompts and grammars. You need only derive the actual SRO class to add specific logic.
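The pattern looks roughly like this (a sketch: genSROColor is a hypothetical generated base class, the constructor argument follows the SRONumber examples, and the validateInput signature is an assumption):

// derive from the generated base class, which manages prompts and grammars
public class SROColor extends genSROColor {
    public SROColor() {
        super("color"); // the SRO's subject word
    }

    @Override
    public boolean validateInput(SRResults results) {
        return true; // add SRO-specific checks here
    }
}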

Thursday, March 29, 2007

Prompts in SROs

SpeakRight speech objects (SROs) offer a highly-reusable approach to prompts. Recall that SRO classes are generated by the SROGen tool. Each SRO has an XML file that defines what code will be generated. The XML file contains the grammars and prompts for the SRO. You can modify this file and regenerate an SRO.

SROGen also generates an XML file holding the default prompts. This file is deployed and is read at runtime to load the SRO prompt fields. You can modify this file on a production system to modify the default prompts for an SRO; no re-compile (or restart) is necessary.

Here's what a prompt definition in the XML file looks like:

<prompt name="main1" def="no">What {%subject%} would you like?</prompt>

The prompt id main1 has a corresponding field in the SRO class called m_main1Prompt. A derived class, or an SR app, can modify the field using get/set methods.

This works fairly well with the PText feature {%fieldName%} for extracting values from a field.

Sub-Prompts
The one thing missing so far is the ability to define conditional prompts, such as a play-once prompt. You can do it in code, of course, but that's not very flexible:

if (executionCount() == 1) {
    m_main1Prompt = "Let's get started. " + m_main1Prompt;
}

To remedy this, SROs allow multiple sub-prompts to be defined for a single prompt, such as the MAIN prompt.

<prompt name="main1welc" group="MAIN" cond="once_ever">Let's get started.</prompt>
<prompt name="main1" group="MAIN" def="no">What {%subject%} would you like?</prompt>

Each prompt tag results in a field being created. Multiple prompts in the same group are rendered as a single VoiceXML prompt, in the order they occur in the XML file. The neat thing is that conditions can now be applied to individual sub-prompts. The cond attribute defines a condition. Here, "once_ever" means a play-once-ever condition. The first time the SRO is executed, the prompt will be: "Let's get started. What flight would you like?". On subsequent executions, the prompt is "What flight would you like?".

Implementation note: Sub-prompts are implemented using the m_subIndex field of Prompt. When prompts are rendered (in a PromptSet), all the rendered items are gathered together in the first sub-prompt. But since each sub-prompt is an independent Prompt object, its rendering can be enabled or disabled by its condition.

Tuesday, March 6, 2007

Release 0.0.2 is out

This second release is a bit more real. The first two SROs (SRONumber and SROConfirmYesNo) are available, and confirmation and validation are working.

Here's a simple app to ask for the number of tickets in a travel application:

SRApp flow = new SRApp();

SRONumber sro = new SRONumber("tickets", 1, 10);
sro.setConfirmer(new SROConfirmYesNo("tickets"));

flow.add(sro);
flow.addPrompt("You said {$INPUT}");

From this, the following dialog is possible:

Computer: How many tickets would you like?
Human: (silence)
Computer: I didn't hear you. How many tickets would you like?
Human: (silence)
Computer: I still didn't hear you. How many tickets would you like?
Human: twelve
Computer: Sorry, I'm looking for a number between one and ten. How many tickets would you like?
Human: two
Computer: Do you want two?
Human: yes
Computer: You said two.

As you can see, escalating error prompts are given and user input is validated against the range given to the SRONumber flow object. Utterances below 80% confidence are confirmed, and finally the user's input is played back to them.

This release uses an extremely simple grammar in SRONumber (a built-in grammar for one to nine). You're free to replace it with your own, and of course the next release will improve on this.

Enjoy :)

Saturday, March 3, 2007

Testing

Testing is an important part of SpeakRight. Running applications on a VoiceXML platform is not easy or quick to do. It's essential to be able to test your app under more programmer-friendly conditions.

Unit Tests

Both JUnit and XMLUnit are supported. Here is a basic test that creates an app and runs it, feeding the required user input. Notice the use of TrailWrapper, a flow object that tracks execution in a trail of flow object names. The test shown below is in org.speakright.sro.tests and uses the MockSpeechPageWriter. This mock object remembers the rendered SpeechPage object which can then be checked to see if the expected grammars and prompts are there.

@Test public void testConfirmNo()
{
    log("--------testConfirmNo--------");
    SRApp flow = createApp(true);
    TrailWrapper wrap1 = new TrailWrapper(flow);

    SRInstance run = StartIt(wrap1);
    Proceed(run, "2", "num", 40); //question with low confidence
    Proceed(run, "no"); //reject the confirmation
    Proceed(run, "8", "num"); //question again
    Proceed(run, ""); //you said...

    assertEquals("fail", false, run.isFailed());
    assertEquals("fin", true, run.isFinished());
    assertEquals("start", true, run.isStarted());

    ChkTrail(run, "SROQuantity;ConfYNFlow;SROQuantity;PFlow");
    assertEquals("city", "8", M.city().get());
}

XMLUnit can also be used. See the TestRender.java file in org.speakright.core.tests. XML comparison is more fussy (and slower), but you can check the actual VoiceXML.

ISRInteractiveTester

Next up is the ISRInteractiveTester class. Use it in a console app for executing a SpeakRight app interactively from the keyboard. See InteractiveTester.java in org.speakright.core.tests for an example. Run this file as a Java application.

Here are the commands it uses:

SpeakRight ITester..
RUNNING App2.........
1> ???
available cmds:
q -- quit
go -- run or proceed
bye -- simulate a DISCONNECT
echo -- toggle echo of generated content
version -- display version
status -- show interpreter status
out -- turn on file output of each page in tmpfiles dir
ret -- set return url
gramurl -- gram base dir
prompturl -- prompt base dir
html -- switch to HTML output
vxml -- switch to VXML output

The out command causes each rendered VoiceXML page to be output as a file (page1.vxml, page2.vxml, etc).

Running Servlet in HTML mode

Once you're satisfied with your app, it's time to test it inside a real servlet. Write your servlet, the one that will output VoiceXML. However, when you run it, add the CGI param "mode=html", like this:

http://localhost:8080/MyServlet3/App1?mode=html

MyServlet3 is the name of your dynamic web project, and App1 is the servlet inside it.

SpeakRight will render HTML instead of VoiceXML. Pressing the Next or Submit button simulates the VoiceXML platform returning results.
Running Servlet in VoiceXML mode


OK, time for the real thing. Point your VoiceXML platform at your servlet's URL, like this:

http://www.someplace.com/MyServlet3/App1

Some platforms (such as Voxeo's) have a real-time debugger that shows events and log messages as they occur.

You can also use the log4j log file that SpeakRight writes. Here's a sample:

03-03 11:32:12.015 [or24] INFO srf - SR: startApp
03-03 11:32:12.015 [or24] INFO srf - START: MyApp
03-03 11:32:12.015 [or24] INFO srf - push MyApp
03-03 11:32:12.015 [or24] INFO srf - push PFlow
03-03 11:32:12.015 [or24] INFO srf - EXEC PFlow
03-03 11:32:12.031 [or24] DEBUG srf - prompt (1 items): Welcome to Joe's Pizza
03-03 11:32:12.359 [or24] INFO srf - SR: writing content.
03-03 11:32:12.375 [or24] INFO srf - SR: saving state.
03-03 11:32:12.375 [or24] INFO srf - SR: done.
03-03 11:32:14.109 [or25] INFO srf - SR: doing POST

Automated Testing

See Automated Testing.

Friday, March 2, 2007

Servlets

Let's consider how to enclose a SpeakRight application in a Java servlet. Servlets are a portable API for responding to HTTP requests. Tomcat or other types of web servers (such as TBD) support servlets.

SpeakRight provides a class, SRServletRunner, that does most of the work. Here's how to use it in the doGet method of a servlet

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
    SRServletRunner runner = new SRServletRunner(new AppFactory(), this, request, response, "GET");

    if (runner.isNewSession()) {
        SRRunner run = runner.createNewSRRunner(this);
        IFlow flow = new App();
        runner.startApp(flow);
    } else {
        runner.logger().log("continue in GET!!");
        runner.continueApp();
    }
}

First we pass the request and response objects into SRServletRunner, along with a string (used for logging) and our app factory (see Initialization). Then we check whether this is a new session. If it is, we create our SpeakRight application and call startApp; otherwise we call continueApp. SRServletRunner manages passivating and re-activating the SpeakRight runtime between HTTP requests.

Logging is done using log4j.

Monday, February 26, 2007


First Release!

Our first release is out! Click on the Download link on the left to try SpeakRight out. Pretty minimal support, but enough to get an idea of the approach and build some simple VoiceXML apps. Definitely more coming. Next up are multi-slot questions, transfer, and a starter set of SROs (re-usable speech objects).

This is my first open-source project. It's been an intense learning curve, but getting this far is due to some fine OSS software: Eclipse, Subversion, SourceForge, StringTemplate, JUnit, XMLUnit, Skype, and the Voxeo community.

Saturday, February 24, 2007

Control Flow, Errors, and Event Handling

As we saw in Internal Architecture, a flow stack is the basis of execution in SpeakRight. In this approach, pushing a flow object onto the stack is similar to making a sub-routine call. Popping a flow object off the stack is similar to returning from a sub-routine.

Execution of a sequence of flow objects is done by a flow object having a list of sub-flow objects. Every time its getNext method is called it returns the next object from the list.

Conditional flow is done by adding logic in getNext, to return one flow object or another based on some condition.

Looping is done by having getNext return its sub-flows more than once, iterating over them multiple times.
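A sketch of what sequential getNext logic might look like (assuming a hypothetical m_subFlows list; SpeakRight's actual implementation may differ):

// return sub-flows one after another; null means this flow is finished
@Override
public IFlow getNext(IFlow current, SRResults results) {
    int i = m_subFlows.indexOf(current);
    if (i >= 0 && i + 1 < m_subFlows.size()) {
        return m_subFlows.get(i + 1);
    }
    return null; // no more sub-flows; pop this flow off the stack
}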

Error Handling

Callflows can have errors, such as user errors (failing to say anything recognizable) and application errors (missing files, db errors, etc). SpeakRight manages error handling separately from the getFirst/getNext mechanism. getNext handles the error-free case. If an error occurs, then one of the IFlow error handling methods is called:

IFlow OnNoInput(current, results); //app was expecting input and none was provided by the user
IFlow OnException(current, results); //a generic failure such as exception being thrown.
IFlow OnDisconnect(current, results); //user terminated the interaction (usually by hanging up the phone. how in multimodal?)
IFlow OnHalt(current, results); //system is stopping
IFlow OnValidateFailed(current, results);

Note that a number of things that aren't really errors are handled this way. The goal is to keep the "nexting" logic clean, and handle everything else separately.

Errors are handled in a similar manner to exceptions: a search up the flow stack is done for an error handler. If the current flow doesn't have one, then its parent is tried. It's a runtime error if no error handler is found.

The outermost flow is usually a class derived from SRApp. SRApp provides error handlers with default behaviour. They play a prompt indicating that a problem has occurred, and transfers the call to an operator.

Catch and Throw

The basic flow of control in a SpeakRight app is nesting of flow objects. These behave like subroutine calls; when the nested flow finishes, the parent flow resumes execution. Sometimes a non-local transfer of control is needed. SpeakRight supports a generic throw and catch approach. A flow can throw a custom flow event, which MUST be caught by a flow above it in the flow stack.

return new ThrowFlowEvent("abc");

and the catch looks like any other error handler

IFlow OnCatch(current, results, thrownEvent);

Note: like all other handlers, a flow object can catch its own thrown event. This may seem weird, but it lets developers move code around easily.

Some control flow is possible in execute; for example, a flow can throw a flow event from execute if a db error happens. However, in this case a flow object cannot catch its own flow event, since that would cause execute to be called again, leading to infinite recursion.

Update: See also Optional Sub-Flow Objects

GotoUrlFlow
The GotoUrlFlow flow object is used to redirect the VoiceXML browser to an external URL. It is used to redirect to static VoiceXML pages or to another application.

Internal Architecture

A typical VoiceXML application has a software stack like this:
  • Application code
  • Web server
  • VoiceXML browser
  • VoiceXML platform
  • Telephony hardware, VOIP stack
Starting at the bottom, a phone call arrives at the telephony/VOIP layer. This layer notifies the VoiceXML platform layer, which allocates a speech recognition engine, a text-to-speech engine, and a VoiceXML browser. The browser plays the same role as a web browser; it makes an HTTP request to a web server, which runs some application code. The application code is a mixture of static and dynamic web content that generates a VoiceXML page. The web server returns this page to the browser, which renders the page as audio. Speech input and DTMF digits are collected, according to the VoiceXML tags. At some point, a submit or goto tag is executed, which makes a new HTTP request (sending user input as GET or POST data), and the process repeats.

SpeakRight lives in the application code layer, typically in a servlet. The SpeakRight runtime dynamically generates VoiceXML pages, one per HTTP request. Between requests, the runtime is stateless, in the same sense as a "stateless bean". State is saved in the servlet session, and restored on each HTTP request.

The SpeakRight framework is a set of Java classes specifically designed for writing speech rec applications. Although VoiceXML uses a similar web architecture as HTML, the needs of a speech app are very different (see Why Speech is Hard TBD).

SpeakRight has a Model-View-Controller (MVC) architecture similar to GUI frameworks. In GUIs, a control represents the view and controller; controls can be combined using nesting to produce larger GUI elements. In SpeakRight, a flow object represents the view and controller; flow objects can be combined using nesting to produce larger VUI elements. Flow objects can be customized by setting their properties (getter/setter methods), and extended through inheritance and extension points. For instance, the confirmation strategy used by a flow object is represented by another flow object. Various types of confirmation can be plugged in.

Flow objects contain sub-flow objects. The application is simply the top-level flow object.

Flow objects implement the IFlow interface. The basics of this interface are

IFlow getFirst();
IFlow getNext(IFlow current, SRResults results);
void execute(ExecutionContext context);
getFirst returns the first flow object to be run. A flow object with sub-flows would return its first sub-flow object. A leaf object (one with no sub-flows) returns itself. (See also Optional Sub-Flow Objects)

getNext returns the next flow object to be run. It is passed the results of the previous flow object to help it decide. The results contain user input and other events sent by the VoiceXML platform.

In the execute method, the flow object renders itself into a VoiceXML page. (see also StringTemplate template engine).
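For example, a minimal leaf flow might look like this (a sketch only: real applications typically extend framework classes such as QuestionFlow rather than implementing IFlow directly, and the full interface has additional members):

// A minimal leaf flow: it runs itself once, then finishes.
public class WelcomeFlow implements IFlow {
    public IFlow getFirst() {
        return this;   // leaf object: no sub-flows, so run itself
    }
    public IFlow getNext(IFlow current, SRResults results) {
        return null;   // finished; pop this flow off the stack
    }
    public void execute(ExecutionContext context) {
        // render a VoiceXML page here, e.g. a welcome prompt (details omitted)
    }
}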

Execution uses a flow stack. An application starts by pushing the application flow object (the outer-most flow object) onto the stack. Pushing a flow object is known as activation. If the application object's getFirst returns a sub-flow then the sub-flow is pushed onto the stack. This process continues until a leaf object is encountered. At this point all the flow objects on the stack are considered "active". Now the runtime executes the top-most stack object, calling its execute method. The rendered content (a VoiceXML page) is sent to the VoiceXML platform.

When the results of the VoiceXML page are returned, the runtime gives them to the top-most flow object in the stack, by calling its getNext method. This method can do one of three things:
  • return null to indicate it has finished. A finished flow object is popped off the stack, and the next flow object is executed.
  • return itself to indicate it wants to execute again.
  • return a sub-flow, which is activated (pushed onto the stack).
The result is that a new VoiceXML page is generated. Execution continues like this until the flow stack is empty.
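The stack discipline can be sketched as a loop (illustrative only: the real runtime is request-driven and stateless between HTTP requests, saving this stack in the servlet session; waitForResults is a stand-in for the arrival of the next HTTP request):

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative runtime loop, not the actual SpeakRight code.
class FlowRunnerSketch {
    void run(IFlow appFlow, ExecutionContext context) {
        Deque<IFlow> stack = new ArrayDeque<IFlow>();
        activate(stack, appFlow);                         // push the app and drill down to a leaf
        while (!stack.isEmpty()) {
            stack.peek().execute(context);                // render one VoiceXML page
            SRResults results = context.waitForResults(); // stand-in for the next HTTP request
            IFlow finished = stack.peek();
            IFlow next = finished.getNext(finished, results);
            while (next == null) {                        // finished: pop, then ask the parent
                IFlow child = stack.pop();
                if (stack.isEmpty()) return;              // the application is done
                next = stack.peek().getNext(child, results);
            }
            if (next != stack.peek()) {
                activate(stack, next);                    // a new sub-flow: activate it
            }
            // if next == stack.peek(), the same flow simply re-executes
        }
    }

    void activate(Deque<IFlow> stack, IFlow flow) {
        stack.push(flow);
        IFlow first = flow.getFirst();
        if (first != flow) {
            activate(stack, first);                       // drill down until a leaf returns itself
        }
    }
}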


Table of Contents

Overview

SpeakRight documentation

Project Plans
  • Project Plan
  • Wish List
  • Contributors
  • Powered By

Sunday, February 18, 2007

Grammars

Grammars define the actual spoken phrases that will be recognized, as well as the return values (called slots). Grammars are an important abstraction layer in SpeakRight because they separate user input values from how that input is actually spoken. Synonyms can map to the same user input value: both "Los Angeles" and "LA" could map to the input value city="Los Angeles". Multi-lingual apps use this feature; the spoken phrases are in the target language but the results are in the default language (usually English).

SpeakRight supports three types of grammars: external grammars (referenced by URL), built-in grammars, and inline grammars (which use a simplified GSL format). Grammars work much like prompts: you specify a grammar text, known as a gtext, that uses a simple formatting language (each form is shown in the sketch after this list):
  • (no prefix). The grammar text is a URL. It can be an absolute or relative URL.
  • inline: prefix. An inline grammar. The prefix is followed by a simplified version of GSL, such as "small medium [large big] (very large)".
  • builtin: prefix. One of VoiceXML's built-in grammars. The prefix is followed by something like "digits?minlength=3;maxlength=9"
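For example, the three gtext forms as Java strings (a hedged sketch: setGrammar is a hypothetical setter name on a question flow object):

question.setGrammar("grammars/fruits.grxml");                  // external grammar, relative URL
// or:
question.setGrammar("inline:small medium [large big]");        // inline, simplified GSL
// or:
question.setGrammar("builtin:digits?minlength=3;maxlength=9"); // VoiceXML built-in grammar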
Grammars, like prompts, can have a condition. Currently the only condition is DTMFOnlyMode (explained below).

When a flow object is rendered, its grammars are rendered using a pipeline that applies the following logic:
  • check the grammar condition. If false then skip the grammar.
  • parse an inline grammar into its word list
  • parse the builtin grammar
  • convert relative URLs into absolute URLs
External Grammars

The grammar text is a URL. It can be an absolute URL (eg. http://www.somecompany.com/speechapp7/grammars/fruits.grxml), or a relative URL. Relative URLs (eg. "grammars/fruits.grxml") are converted into absolute URLs when the grammar is rendered. The servlet's URL is currently used for this.

The grammar file extension is used to determine the type value for the grammar tag (a sketch follows the list):
  • ".grxml" means type="application/srgs+xml"
  • ".gsl" means type="text/gsl"
  • all other files are assumed to be ABNF SRGS format, type="application/srgs"
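This mapping could be expressed as a helper like the following (hypothetical; not part of the framework API):

// Map a grammar URL's file extension to the VoiceXML grammar type attribute.
static String grammarType(String url) {
    if (url.endsWith(".grxml")) return "application/srgs+xml";
    if (url.endsWith(".gsl"))   return "text/gsl";
    return "application/srgs";  // all other files are assumed to be ABNF SRGS
}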
A Grammar Editor is helpful, such as the wonderful GRXML editor that comes with the (free) Microsoft SASDK.

Built-In Grammars


TBD. Built-ins are part of VoiceXML 2.0, but optional. They are mainly intended for prototyping; it's recommended that production applications use full, properly tuned grammars.

Inline Grammars

GSL is (I believe) a proprietary Nuance format. SpeakRight uses a simplified version that currently only supports [ ] and ( ).
A single utterance can contain one or more slots. The simplest directed-dialog VUIs use single-slot questions, such as "How many passengers are there?". SpeakRight only supports single-slot questions for now.

Grammar Types

There are two types of grammars (represented by the GrammarType enum).
  • VOICE is for spoken input
  • DTMF is for touchtone digits
The SpeakRight class Question holds up to two grammars, one of each type.

DTMF Only Mode
Speech recognition may not work at all in very noisy environments. Not only will recognition fail, but prompts may never play, due to false barge-in. For these reasons, speech applications should be able to fall back to a DTMF-only mode. The user can activate this mode by pressing a certain key, usually '*'. Once activated, SpeakRight will not render any VOICE grammars, so the VoiceXML engine will only listen for DTMF digits.

Slots
A grammar represents a series of words, such as "A large pizza please". The application may only care about a few of the words; here, the size word "large" is the only word of importance to the app. Important words are attached to named return values called slots. In our pizza example, a slot called "size" would be bound to the words "small", "medium", or "large". Any of those words would fill the slot.

Slots define the interface between a grammar and a VoiceXML field. The field's name (shown below) defines a slot that the grammar must fill in order for the field to be filled.
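For example, a field named "size" (an illustrative snippet; the grammar URL is made up):

<field name="size">
  <grammar src="http://myIPaddress/pizza.grxml"/>
</field>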

Any grammar that fills the slot "size" can be used.

A single utterance can fill multiple slots, as in "I would like to fly to Atlanta on Friday."
SpeakRight doesn't yet support multi-slot questions.

Prompts

A prompt text, known as a ptext, defines one or more items using a simple formatting scheme.
Here is a basic prompt that plays some text using TTS (text-to-speech):

"Welcome to Inky's Travel".

PTexts are Java strings. Here's another prompt:

"Welcome to Inky's Travel. {audio:logo.wav}"

This prompt contains two items: a TTS phrase and an audio file. Items are delimited by '{' and '}'. The delimiters are optional for the first item. This is equivalent:

"{Welcome to Inky's Travel. }{audio:logo.wav}"

PTexts can contain as many items as you want. They will be rendered as a prompt tag (or possibly as a nomatch or noinput tag):


<prompt>Welcome to Inky's Travel. <audio src="http://myIPaddress/logo.wav"></audio>
</prompt>



For convenience, audio items can be specified without the "audio:" prefix; the prefix is optional if the filename ends in ".wav" and contains no whitespace characters. The following is equivalent to the previous prompt:

"{Welcome to Inky's Travel. }{logo.wav}"


You can also add pauses, using '.' inside an item. Each period represents 250 msec. Pause items must contain only periods (otherwise they're treated as TTS). Here's a 750 msec pause:

"{Welcome to Inky's Travel. }{...}{logo.wav}"


Model variables can be prompt items by using a "$M." prefix. The model variable's value is rendered:

"The current price is {$M.price}"

The most recent user input can also be played back, like this:

"You chose {$INPUT}"

Fields (i.e. member variables) of a flow object can also be items, by wrapping the field name in '%'. If a flow class has a member variable int m_numPassengers; then you can play its value in a prompt like this (note the m_ prefix is omitted in the ptext):

"There are {%numPassengers%} passengers"

If you're familiar with SSML, you can use raw prompt items, which have a "raw:" prefix. These are output as-is, and can contain SSML tags.

"That's a <emphasis>big</emphasis> order!"

Lastly, there are id prompt items, which are references to an external prompt string in an XML file. This is useful for multi-lingual apps, or for changing prompts after deployment. See Prompt Ids and Prompt XML Files.

"id:sayPrice"

Let's summarize. There are seven types of prompt items (a combined example follows the list):
  • "audio:" audio prompts
  • "M$." model values
  • "%value%" field values (of currently executing flow object)
  • ".." pause (250 msec for each period)
  • "raw:" raw SSML prompts
  • "id:" id prompts
  • TTS prompt (any prompt item that doesn't match one of the above types is played as TTS)
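For instance, a single ptext combining several item types (a hedged sketch: setPrompt is a hypothetical setter name):

flow.setPrompt("{logo.wav}{..}{Welcome back. }{The current price is }{$M.price}");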

Prompt Conditions

By default, all the prompts in a flow object are played. However there are occasions when the playing of a prompt needs to be controlled by a condition. Conditions are evaluated when the flow object is executed; if the condition returns false the prompt is not played.

  • (none): always play the prompt.
  • PlayOnce: only play the first time the flow is executed. If the flow is re-executed (the same flow object executes more than once in a row), the prompt is not played. PlayOnce is useful in menus where the initial prompt contains information that should only be played once.
  • PlayOnceEver: only play once during the entire session (phone call).
  • PlayIfEmpty: only play if the given model variable is empty (""). Useful when you want to play a prompt as long as something has not yet occurred.
  • PlayIfNotEmpty: only play if the given model variable is not empty ("").


Prompt Rendering

Prompts are rendered using a pipeline of steps. The order of the steps has been chosen to maximize usefulness.

  1. Apply the condition. If it's false, stop; the prompt is not rendered.
  2. Resolve ids. Read the external XML file and replace each prompt id with its specified prompt text.
  3. Evaluate model values.
  4. Call fixup handlers in the flow objects. The IFlow method fixupPrompt allows a flow object to tweak TTS prompt items.
  5. Merge runs of TTS items into a single item.
  6. Do audio matching. An external XML file defines TTS text for which an audio file exists; the text is replaced with the audio file.

The result is a list of TTS and/or audio items that are sent to the page writer.

Audio matching

Audio matching is a technique that lets you use TTS during the initial development of a speech app. Once the app is fairly stable, record audio files for all the prompts, then create an audio-matching XML file that lists each audio file and the prompt text it replaces. When the SpeakRight application runs, matching text is automatically replaced with the audio file. No source code changes are required.
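The exact file format isn't documented here; a hypothetical entry might look like this (illustrative only):

<audioMatch text="welcome to inky's travel" audio="welcome.wav"/>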

The match is a soft match that ignores case and punctuation. That is, the prompt item "Dr. Smith lives on Maple Dr." would match an audio-match text of "dr smith lives on maple dr".

Audio matching works at the item level. (Open question: do we need to support a tag that spans multiple items?)