Gesture Interfaces for
Multimodal Systems in HCI
Andrea CORRADINI
Center for Human-Computer Communication
Department of Computer Science and Engineering
Oregon Health & Science University
Beaverton, Oregon 97006, USA
andrea@cse.ogi.edu
Abstract
When people interact with each other, they use a set of different modalities
(e.g., spoken language, facial expressions, eye gaze, hand gestures, body postures,
etc.) to convey information. In the domain of Human-Computer Interaction (HCI),
the aim is to create flexible and natural interaction techniques that are
easy to learn and use. For such interfaces to be truly natural,
we first need to understand the modalities involved in human-human communication.
To date, hand and arm gestures, along with speech, have received the most attention
in HCI research. In HCI, the notion of gesture is loosely defined and depends
on the context of the interaction. It varies from pen mark to hand/arm movement,
and usually researchers who create experimental gesture-based systems use their
own definitions that tend to be application-specific and therefore less spontaneous
and more learned. Thus, gestures simply become predefined templates to include
into or recognize from a library (gestobulary). In addition, despite the growing
evidence that gestures are integral to human-human communication, computer scientists
have mostly focused on gesture as its own language rather than as a complementary
or accompanying mode. We believe that the combination of speech and gesture
is a mandatory step toward powerful HCI. In this paper, we present two different
multimodal speech-gesture architectures, which have been developed at our Department.
The first system, QuickSet, is a collaborative, map-based pen gesture/voice
system that runs on wireless hand-held PCs, allowing a user to set up a military
simulation, control the entities as it runs, and navigate in a synthetic 3-D
world. The second one is a pointing and speech alternative to the current paint
programs based on traditional devices like mouse, pen or keyboard. Drawing occurs
with natural human pointing by using the hand to define a line in space, while
spoken commands can be issued to act on the current painting in conformity with
a predefined grammar.
1 Introduction
Computers have become an almost indispensable part of everyday life, from
the way we communicate to the way we interact with the environment. Typical
computer interfaces, however, are primarily functional, as they are developed
for office applications, scientific computation, graphics visualization, etc.
Such interfaces are neither social nor natural, because the interaction they offer
differs from the way people interact with each other or with the world.
In the last decade, some researchers have been discussing post-WIMP (window,
icon, menu, pointing device) interaction techniques in an effort to support
alternative, natural, efficient, adaptive and expressive interaction techniques.
The use of natural gestures, in particular, provides a very appealing alternative
to traditional input devices for human-computer interaction.
In order to enable computers to take gestural input, human gestures need to
be understood better. Most of the existing research has been performed by
psychologists, neurologists, and cognitive scientists, who relate gestures to
human expression and social interaction. However, in the context of HCI, the
final goal is to make gestures understandable to machines rather than to investigate
their relationship to human expression and social interaction. Several
definitions are used within the HCI community, yet they are chosen to fit the
task and the application at hand. Sometimes gestures are referred to as static
body postures, as hand movements or even as pen marks. The allowed gestures
are typically limited and severely restrict the input available to users. The
gestural command set is often small and unnatural, leading one to wonder what
benefit the user gains by using this mode.
Despite that, there is no doubt that gestures are integral to human-human
communication and therefore need to be further investigated to build post-WIMP
natural interfaces. Gestures alone will not help much, though, as humans naturally
communicate multimodally using several other modes such as speech (which is indeed
the primary communication channel in normal environments), body posture, facial
expressions, mood and eye gaze. Although there may be redundancy in the information
contained within the many modes, each of them is commonly used to enrich the
information available.
Many researchers have investigated the relation between human gestures and speech.
Kendon [27] ordered gestures of varying nature along a continuum of linguisticity
(gesticulation -> language-like gestures -> pantomimes -> emblems ->
sign language) in which, going from one side to the other, the presence of speech
declines while the presence of language properties increases. Simultaneously,
moving along this continuum, (socially) regulated signs replace idiosyncratic
gestures. In other words, signs replace the formulated, linguistic component
of the expressions present in the speech. This supports the idea that gestures
and speech are an integrated form of expression of utterances where gestures
and speech are complementary. Concerning gesticulation, McNeill [29] stated
that there is no body language, but instead gestures, with their spatial representations,
complement spoken language and convey extra information about the internal mental
process of the speaker.
We also believe that gestures are not a mere embellishment or by-product of
speech. If the premise is to make the interaction between humans and machines
take less effort for the users, the logical conclusion is to build transparent
interfaces that consider both speech and gestures, and in which the user is not
consciously aware of using an interface. In the following sections, we briefly present two
multimodal systems we have built. The first system, QuickSet, is a collaborative,
map-based pen gesture/voice system that runs on wireless hand-held
PCs, allowing a user to set up a military simulation, control the entities as
it runs, and navigate in a synthetic 3-D world. In this system, gestures are
considered as drawing marks. The second one is a pointing and speech alternative
to the current paint programs based on traditional devices like mouse, pen or
keyboard. Drawing occurs with natural human pointing by using the hand to define
a line in space, while spoken commands can be issued to act on the current painting
in conformity with a predefined grammar.
2 QuickSet
In order to train personnel more effectively, the US military is developing
large-scale distributed simulation capabilities. Begun as SIMNET in the 1980's
[23], these distributed, interactive environments attempt to provide a high
degree of fidelity in simulating combat, including simulations of the individual
combatants, the equipment, entity movements, atmospheric effects, etc. There
are four general phases of user interaction with these simulations: Creating
entities, supplying their initial behavior, interacting with the entities during
a running simulation, and reviewing the results. The present research concentrates
on the first two of these stages.
Our contribution to the distributed interactive simulation (DIS) effort is to
rethink the nature of the user interaction. As with most modern simulators,
DISs are controlled via graphical user interfaces (GUIs). However, the simulation
GUI is showing signs of strain, since even for a small-scale scenario, it requires
users to choose from hundreds of entities in order to select the desired ones
to place on a map. To compound these interface problems, the military is intending
to increase the scale of the simulations dramatically, while at the same time,
for reasons of mobility and affordability, desiring that simulations should
be creatable from small devices (e.g., PDAs). This impending collision of trends
for smaller screen size and for more entities requires a different paradigm
for human-computer interaction.
We have argued generically that GUI technologies offer advantages in allowing
users to manipulate objects that are on the screen, in reminding users of their
options, and in minimizing errors [7]. However, GUIs are often weak in supporting
interactions with many objects, or objects not on the screen. In contrast, it
was argued that linguistically-based interface technologies offer the potential
to describe large sets of objects, which may not all be present on a screen,
and can be used to create more complex behaviors through specification of rule
invocation conditions. Simulation is one type of application for which these
limitations of GUIs, as well as the strengths of natural language, especially
spoken language, are apparent [6].
It has become clear, however, that speech-only interaction is not optimal for
spatial tasks. Using a high-fidelity "Wizard-of-Oz" methodology [20],
recent empirical results demonstrate clear language processing and task performance
advantages for multimodal (pen/voice) input over speech-only input for map-based
systems [17,18].
Figure 1: on the left, QuickSet running on a wireless handheld PC; on the right,
QuickSet interface as the user establishes two platoons, a barbed-wire fence,
a breached minefield, and then issues a command to one platoon to follow a traced
route.
2.1 Overview
To address these simulation interface problems, and motivated by the above results,
QuickSet has been developed at our department. QuickSet (see Figure 1) is a
collaborative, handheld, multimodal system for configuring military simulations
based on LeatherNet [5], a system used in training platoon leaders and company
commanders at the USMC base at 29 Palms, California. LeatherNet simulations
are created using the ModSAF simulator [10] and can be visualized in a CAVE-based
virtual reality environment [11, 26] called CommandVu. In addition to LeatherNet,
QuickSet is being used in a second effort called ExInit (Exercise Initialization),
that will enable users to create division-sized exercises. Because of the use
of OAA, QuickSet can interoperate with agents from CommandTalk [14], which provides
a speech-only interface to ModSAF.
QuickSet runs on both desktop and hand-held PCs, communicating over wired and
wireless LANs, or modem links. The system combines speech and pen-based gesture
input on multiple 3-lb hand-held PCs (Fujitsu Stylistic 1000), which communicate
via wireless LAN through the Open Agent Architecture (OAA) [8], to ModSAF, and
also to CommandVu. With this highly portable device, a user can create entities,
establish "control measures" (e.g., objectives, checkpoints, etc.),
draw and label various lines and areas (e.g., landing zones), and give the entities
behavior.
In the next subsections, we illustrate the system briefly, describe its components,
and discuss its application. See also [32] for more details.
2.2 System Architecture
Architecturally, QuickSet uses distributed agent technologies based on the Open
Agent Architecture for interoperation, information brokering and distribution.
An agent-based architecture was chosen to support this application because it
offers easy connection to legacy applications, and the ability to run the same
set of software components in a variety of hardware configurations, ranging
from stand-alone on the handheld PC, to distributed operation across numerous
workstations and PCs. Additionally, the architecture supports mobility in that
lighter weight agents can run on the handheld, while more computationally-intensive
processing can be migrated elsewhere on the network. The agents may be written
in any programming language (here, Quintus Prolog, Visual C++, Visual Basic,
and Java), as long as they communicate via an interagent communication language.
A brief description of each agent follows:
QuickSet interface: On the handheld PC is a geo-referenced map of the region
such that entities displayed on the map are registered to their positions on
the actual terrain, and thereby to their positions on each of the various user
interfaces connected to the simulation. The map interface agent provides the
usual pan and zoom capabilities, multiple overlays, icons, etc. The user can
draw directly on the map, in order to create points, lines, and areas. The user
can create entities, give them behavior, and watch the simulation unfold from
the handheld. When the pen is placed on the screen, the speech recognizer is
activated, thereby allowing users to speak and gesture simultaneously.
Speech recognition agent: The speech recognition agent used in QuickSet employs
either IBM's VoiceType Application Factory or VoiceType 3.0 recognizers. The
recognizers use an HMM-based continuous speaker-independent speech recognition
technology for PC's under Windows 95/NT/98/2000. Currently, the system has a
vocabulary of 450 words. It produces a single most likely interpretation of
an utterance.
Gesture recognition agent: OGI's gesture recognition agent processes all pen
input from a PC screen or tablet. The agent weights the results of both HMM
and neural net recognizers, producing a combined score for each of the possible
recognition results. Currently, 45 gestures can be recognized, resulting in
the creation of 21 military symbols, irregular shapes, and various types of
lines.
Natural language agent: The natural language agent currently employs a definite
clause grammar and produces typed feature structures as a representation of
the utterance meaning. Currently, for this task, the language consists of noun
phrases that label entities, as well as a variety of imperative constructs for
supplying behavior.
Multimodal integration agent: The multimodal interpretation agent accepts typed
feature structure meaning representations from the language and gesture recognition
agents, and produces a unified multimodal interpretation.
Simulation agent: The simulation agent, developed primarily by SRI International,
but modified by us for multimodal interaction, serves as the communication channel
between the OAA-brokered agents and the ModSAF simulation system. This agent
offers an API for ModSAF that other agents can use.
Web display agent: The Web display agent can be used to create entities, points,
lines, and areas. It posts queries for updates to the state of the simulation
via Java code that interacts with the blackboard and facilitator. The queries
are routed to the running ModSAF simulation, and the available entities can
be viewed over a WWW connection using a suitable browser.
Other user interfaces: When another user interface connected to the facilitator
subscribes to and produces the same set of events as others, it immediately
becomes part of a collaboration. One can view this as human-human collaboration
mediated by the agent architecture, or as agent-agent collaboration.
CommandVu agent: Since the CommandVu virtual reality system is an agent, the
same multimodal interface on the handheld PC can be used to create entities
and to fly the user through the 3-D terrain. For example, the user can ask "CommandVu,
fly me to this platoon <gesture on the map>."
Application bridge agent: The bridge agent generalizes the underlying applications'
API to typed feature structures, thereby providing an interface to the various
applications such as ModSAF, CommandVu, and ExInit. This allows for a domain-independent
integration architecture in which constraints on multimodal interpretation are
stated in terms of higher-level constructs such as typed feature structures,
greatly facilitating reuse.
CORBA bridge agent: This agent converts OAA messages to CORBA IDL (Interface
Definition Language) for the Exercise Initialization project.
More detail on the architecture and the individual agents is provided in [12,
22].
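To make the score combination performed by the gesture recognition agent concrete, the following is a minimal sketch of how the outputs of two recognizers could be merged into one ranked list. The paper does not state the weighting scheme; the weighted average, the function name combine_scores, and the default weight of 0.5 are assumptions made purely for illustration.

```python
def combine_scores(hmm_scores, nn_scores, hmm_weight=0.5):
    """Rank gesture labels by a weighted average of two recognizers' scores.

    hmm_scores, nn_scores: dicts mapping a gesture label to a score in [0, 1].
    hmm_weight: hypothetical mixing weight; the actual weighting used in
    QuickSet is not given in the text.
    """
    labels = set(hmm_scores) | set(nn_scores)
    combined = {
        label: hmm_weight * hmm_scores.get(label, 0.0)
        + (1.0 - hmm_weight) * nn_scores.get(label, 0.0)
        for label in labels
    }
    # Best candidate first; the whole list is kept so that later stages
    # can still consider the lower-ranked interpretations.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


ranked = combine_scores(
    {"line": 0.6, "area": 0.3, "point": 0.1},
    {"line": 0.2, "area": 0.7, "point": 0.1},
)
```

Keeping the full ranked list, rather than only the top label, matters because the multimodal integration step may prefer a lower-ranked gesture when it is the only one compatible with the spoken input.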
2.3 An Example
Holding QuickSet in hand, the user views a map from the ModSAF simulation, and
with spoken language coupled with pen gestures, issues commands to ModSAF. In
order to create a unit in QuickSet, the user would hold the pen at the desired
location and utter (for instance): "red T72 platoon" resulting in
a new platoon of the specified type being created at that location.
The user then adds a barbed-wire fence to the simulation by drawing a line at
the desired location while uttering "barbed wire." Similarly a fortified
line is added. A minefield of an amorphous shape is drawn and is labeled verbally,
and finally an M1A1 platoon is created as above. Then the user can assign a
task to the new platoon by saying "M1A1 platoon follow this route"
while drawing the route with the pen. The results of these commands are visible
on the QuickSet screen, in the ModSAF simulation, and in the CommandVu 3D rendering
of the scene. In addition to multimodal input, unimodal spoken language and
gestural commands can be given at any time, depending on the user's task and
preference.
2.4 Multimodal Integration in QuickSet
Since any unimodal recognizer will make mistakes, the output of the gesture
recognizer is not accepted as a simple unilateral decision. Instead the recognizer
produces a set of probabilities, one for each possible interpretation of the
gesture. The recognized entities, as well as their recognition probabilities,
are sent to the facilitator, which forwards them to the multimodal interpretation
agent. In combining the meanings of the gestural and spoken interpretations,
we attempt to satisfy an important design consideration, namely that the communicative
modalities should compensate for each other's weaknesses [7, 16]. This is accomplished
by selecting the highest scoring unified interpretation of speech and gesture.
Importantly, the unified interpretation might not include the highest scoring
gestural (or spoken language) interpretation because it might not be semantically
compatible with the other mode. The key to this interpretation process is the
use of a typed feature structure [1, 3] as a meaning representation language
that is common to the natural language and gestural interpretation agents. Johnston
et al. [12] present the details of multimodal integration of continuous speech
and pen-based gesture, guided by research in users' multimodal integration and
synchronization strategies [19]. Unlike many previous approaches to multimodal
integration (e.g., [2, 9, 12, 15, 25]), speech is not "in charge," in
the sense of relegating gesture to a secondary and dependent role. This mutually-compensatory
interpretation process is capable of analyzing multimodal constructions, as
well as speech-only and pen-only constructions when they occur. Vo and Wood's
system [24] is similar to the one reported here, though we believe the use of
typed feature structures provides a more generally usable and formal integration
mechanism than their frame-merging strategy. Cheyer and Julia [4] sketch a system
based on Oviatt's [17] results and the OAA [8], but do not discuss the integration
strategy nor multimodal compensation.
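The selection of the highest scoring unified interpretation can be illustrated with a small sketch. It is not the QuickSet implementation: feature structures are reduced to nested dictionaries without a type hierarchy, the joint score is assumed to be the product of the two recognition probabilities, and the names unify and integrate, as well as the feature names, are hypothetical.

```python
from itertools import product


def unify(a, b):
    """Unify two feature structures given as nested dicts.

    Returns the merged structure, or None when two atomic values clash.
    (Simplification: no type hierarchy, and None is reserved for failure.)
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None
                result[key] = merged
            else:
                result[key] = b_val
        return result
    return a if a == b else None


def integrate(speech_nbest, gesture_nbest):
    """Return (structure, score) for the best compatible speech/gesture pair."""
    best = None
    for (s_fs, s_p), (g_fs, g_p) in product(speech_nbest, gesture_nbest):
        merged = unify(s_fs, g_fs)
        if merged is None:
            continue  # semantically incompatible pair
        score = s_p * g_p  # assumed scoring rule
        if best is None or score > best[1]:
            best = (merged, score)
    return best


# "barbed wire" spoken while drawing: speech requires a line location.
speech = [({"object": "barbed_wire", "location": {"kind": "line"}}, 0.9)]
gesture = [
    ({"location": {"kind": "area", "coords": [(0, 0), (1, 1), (0, 1)]}}, 0.6),
    ({"location": {"kind": "line", "coords": [(0, 0), (1, 1)]}}, 0.4),
]
result = integrate(speech, gesture)
```

In this example the top-ranked gesture reading (an area) fails to unify with the spoken command, so the lower-ranked line reading is chosen, which is the mutual compensation effect described above.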
2.5 The future of QuickSet
QuickSet has been delivered to the US Navy (NRaD) and US Marine Corps for use
at 29 Palms, California, where it is primarily used to set up training scenarios
and to control the virtual environment. It is also installed at NRaD's Command
Center of the Future. The system was used by the US Army's 82nd Airborne Corps
at Ft. Bragg during the Royal Dragon Exercise. There, QuickSet was deployed
in a tent, where it was subjected to an extreme noise environment, including
explosions, low-flying jet aircraft, generators, and the like. Not surprisingly,
spoken interaction with QuickSet was not feasible under these conditions. Instead,
users wanted to gesture, and they did so successfully. Although we had provided a
multimodal interface for use in less hostile conditions, we nevertheless needed to
provide, and in fact have provided, a complete overlap in functionality, such that any task
can be accomplished with pen alone or with speech alone when necessary. Finally,
QuickSet is now being extended for use in the ExInit simulation initialization
system for DARPA's STOW-97 Advanced Concept Demonstration that is intended for
creation of division-sized exercises.
Regarding the multimodal interface itself, QuickSet has undergone a "proactive"
interface evaluation in that the studies that were performed in advance of building
the system predicted the utility of multimodal over unimodal speech as an input
to map-based systems [17, 18]. In particular, it was discovered in this research
that multimodal interaction generates simpler language than unimodal spoken
commands to maps. For example, to create a "phase line" between two
three-digit <x,y> grid coordinates, a user would have to say: "create
a line from nine four three nine six one to nine five seven nine six eight and
call it phase line green" [14]. In contrast, a QuickSet user would say
"phase line green" while drawing a line. Creation of area features
with unimodal speech would be more complex still, if not infeasible. Given that
numerous difficult-to-process linguistic phenomena (such as utterance disfluencies)
are known to be elevated in lengthy utterances, and also to be elevated when
people speak locative constituents [17, 18], multimodal interaction that permits
pen input to specify locations and that results in brevity offers the possibility
of more robust recognition.
Further development of QuickSet's spoken, gestural, and multimodal integration
capabilities is continuing. Research is also ongoing to examine and quantify
the benefits of multimodal interaction in general, and our architecture in particular.
3 A Speech-Gesture Painting System
While in QuickSet a gesture is a mark entered using a pen, in the painting system
a gesture is considered as a 3D hand movement. In the following, we describe
a pointing and speech alternative to the current paint programs based on traditional
devices like mouse, pen or keyboard.
We present a simple magnetic field tracker-based pointing system. It is used
as an input device for a painting system to provide a convenient means for the
user to specify paint locations on any virtual paper. The virtual paper itself
is determined by the operator as a limited plane surface in the three dimensional
space.
Drawing occurs with natural human pointing by using the hand to define a line
in space, and considering its possible intersection point with this plane. In
addition, some vocal commands can be utilized to act on the current painting
in conformity with a predefined grammar.
3.1 Estimating the pointing direction
For the whole system to work, the user is required to wear a glove, on top of
which we mount one Flock of Birds (FOB) [30] sensor. The FOB is a six-degree-of-freedom
tracking device based on magnetic fields, which we exploit to track the position
and orientation of the user's hand with respect to the coordinate system
determined by the FOB's transmitter. The hand's position is given by
the position vector P reported by the sensor at a frequency of approximately
103 Hz. For the orientation, we place the sensor roughly at the back of the index
finger with its relative x-coordinate axis directed toward the index fingertip.
In this way, using the quaternion values reported by the sensor, we can apply
mathematical transformations within quaternion algebra to determine the unit
vector X which unambiguously defines the direction of the sensor and therefore
that of pointing (see Figure 2).
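The transformation from the reported quaternion to the pointing vector X can be sketched as follows. The sketch assumes that the quaternion is given in (w, x, y, z) order and describes the rotation from the sensor frame to the transmitter frame; the component order and convention of the actual driver may differ, and the function name pointing_direction is hypothetical.

```python
import math


def pointing_direction(q):
    """Rotate the sensor's x-axis (1, 0, 0) by the quaternion q = (w, x, y, z).

    Returns the unit pointing vector X in the transmitter frame.
    The (w, x, y, z) ordering is an assumption of this sketch.
    """
    w, x, y, z = q
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("zero quaternion does not describe a rotation")
    w, x, y, z = (c / norm for c in (w, x, y, z))
    # First column of the rotation matrix equivalent to q, i.e. q * (1,0,0) * q^-1.
    return (
        1.0 - 2.0 * (y * y + z * z),
        2.0 * (x * y + w * z),
        2.0 * (x * z - w * y),
    )


# A 90-degree rotation about the z-axis turns the x-axis into the y-axis.
half = math.radians(90.0) / 2.0
direction = pointing_direction((math.cos(half), 0.0, 0.0, math.sin(half)))
```

Normalizing the quaternion first guards against small numerical drift in the sensor readings, so the returned vector always has unit length.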
The point P along with vector X is then used to determine the equation of the
imaginary line passing through P and having direction X. When the system is
started for the first time, the user has to choose the region he wants to paint
in. This is accomplished by letting the user choose three of the vertices of
the future rectangular painting region. These points are chosen by pointing
at them. However, since this procedure is to be done in the 3D space, the user
has to aim at each of the vertices from two different positions. The two different
vectors triangulate to select a point as vertex. In 3D space, two lines will
generally not have an intersection. In such cases, we will use the point of
minimum distance from both lines.
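One way to compute the point of minimum distance from both lines is to take the midpoint of the shortest segment connecting them. The sketch below follows this reading, which is an assumption on our part since the text does not spell out the computation; the function and parameter names are hypothetical.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def closest_point_between_lines(p1, d1, p2, d2, eps=1e-12):
    """Midpoint of the shortest segment between lines p1 + t*d1 and p2 + s*d2.

    Returns None when the lines are (nearly) parallel, in which case no
    unique closest point exists and the user would have to point again.
    """
    w0 = sub(p1, p2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if denom < eps:
        return None
    t = (b * e - c * d) / denom  # parameter of the closest point on line 1
    s = (a * e - b * d) / denom  # parameter of the closest point on line 2
    q1 = tuple(p + t * v for p, v in zip(p1, d1))
    q2 = tuple(p + s * v for p, v in zip(p2, d2))
    return tuple((u + v) / 2.0 for u, v in zip(q1, q2))


# Two skew pointing lines: the closest points are (1, 0, 0) and (1, 0, 2).
vertex = closest_point_between_lines((0, 0, 0), (1, 0, 0), (1, -1, 2), (0, 1, 0))
```

When the two lines do intersect, both closest points coincide and the function simply returns the intersection point.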
With natural human pointing behavior, the hand is used to define a line in space,
roughly passing through the base and the tip of the index finger. Normally,
this line does not lie in the target plane but may intersect it at some point.
It is this point that we aim to recover.
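Recovering this point amounts to a standard line-plane intersection followed by a bounds check against the rectangular region. The following sketch assumes that the three chosen vertices are one corner of the rectangle and its two adjacent corners; the text does not say which three vertices are selected, and the function name intersect_paper is hypothetical.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def intersect_paper(p, x, corner, corner_u, corner_v, eps=1e-9):
    """Intersect the pointing line (origin p, direction x) with the virtual paper.

    corner, corner_u, corner_v: one rectangle corner and its two neighbours
    (an assumed convention). Returns normalized paper coordinates (u, v) in
    [0, 1] x [0, 1], or None if the user points away from, parallel to, or
    outside the paper.
    """
    u_edge = sub(corner_u, corner)
    v_edge = sub(corner_v, corner)
    normal = cross(u_edge, v_edge)
    denom = dot(normal, x)
    if abs(denom) < eps:
        return None  # pointing parallel to the paper
    t = dot(normal, sub(corner, p)) / denom
    if t < 0:
        return None  # the paper lies behind the hand
    hit = tuple(pi + t * xi for pi, xi in zip(p, x))
    rel = sub(hit, corner)
    # Projection onto the edges is valid because the edges of a rectangle
    # are perpendicular to each other.
    u = dot(rel, u_edge) / dot(u_edge, u_edge)
    v = dot(rel, v_edge) / dot(v_edge, v_edge)
    if 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0:
        return (u, v)
    return None


# Hand three units above the centre of a 2 x 1 paper, pointing straight down.
paint_at = intersect_paper((1, 0.5, 3), (0, 0, -1), (0, 0, 0), (2, 0, 0), (0, 1, 0))
```

The normalized coordinates can then be scaled to the pixel size of whatever surface renders the painting.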
For this reason, when the region selected in the 3D space is neither a wall
screen nor a surface on which the painting can be displayed directly (a tablet,
the computer's monitor, etc.), the system can be properly used only when
the magnetic sensor is aligned and used together with a light pointer. However,
in this situation we also implemented a rendering module to draw the actual
painting on the screen regardless of the target plane chosen in the 3D space.
This part of the system was implemented by an agent (the VR agent in Figure
2) on an SGI machine utilizing the Virtual Reality Peripheral Network (VRPN)
[31] driver for the FOB.
Figure 2: on the left, selecting a graphic tablet as target region for painting
enables direct visual feedback. The frame of reference of the sensor is shown
on the left. On the right: agent communication within the entire system. Hand
orientation and position are tracked by the FOB. Any valid voice command is
passed to the VR Agent for both determining the possible intersection with the
target region and performing the associated action. The target region along
with painting progress is eventually rendered on either the SGI screen or a
device allowing for visual feedback when, like in figure 2, the virtual paper
coincides with that device.
3.2 The Speech Agent
We make use of Dragon 4.0, a Microsoft SAPI 4.0 compliant speech engine. This
speech recognition engine captures an audio stream and produces a list of text
interpretations (with associated probabilities of correct recognition) of that
speech audio. These text interpretations are limited by a grammar that is supplied
to the speech engine upon startup.
The following grammar specifies the possible sentences:
<Sentence> = <color> | <verb> | <answer>
<color> = green | red | blue | yellow | white | magenta | cyan
<verb> = draw on | draw off | cursor on | cursor off | zoom in |
    zoom out | select begin | select end | line begin |
    line end | paste | rectangle begin | rectangle end |
    circle begin | circle end | save | load | delete | copy |
    cancel | undo | send to background | exit | free buffer |
    help | switch to foreground | switch to background | restart
<answer> = no | yes
The user uses voice commands to put the system into various modes that remain
in effect until he changes them. The system is a state machine whose modal nature
ensures consistent command sequences (e.g., "line begin"
can only be followed by "undo", "cancel"
or "line end"). Speech commands can be entered at any time
and are recognized in continuous mode.
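The modal behavior can be pictured with a small state machine. The text only states the rule for "line begin"; extending the same rule to the rectangle, circle and select commands, and letting both "undo" and "cancel" close the open mode, are assumptions of this sketch, as is the class name CommandStateMachine. The sketch also does not check commands against the full grammar.

```python
SHAPES = ("line", "rectangle", "circle", "select")


class CommandStateMachine:
    """Tracks which 'begin' command, if any, is currently open."""

    def __init__(self):
        self.open_shape = None  # None means that no mode is open

    def allowed(self, command):
        if self.open_shape is None:
            # Nothing is open, so an "... end" command makes no sense.
            return not command.endswith(" end")
        # Inside a mode only these three commands are accepted.
        return command in ("undo", "cancel", f"{self.open_shape} end")

    def handle(self, command):
        """Apply a command; return False if it is not allowed in this state."""
        if not self.allowed(command):
            return False
        if self.open_shape is None:
            for shape in SHAPES:
                if command == f"{shape} begin":
                    self.open_shape = shape
        else:
            # "undo", "cancel" and the matching "end" all close the mode
            # (assumed behavior).
            self.open_shape = None
        return True


machine = CommandStateMachine()
accepted = [machine.handle(c) for c in ("line begin", "red", "line end", "red")]
```

In the example, the color command is rejected while a line is being drawn and accepted once the line has been closed.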
3.3 Agent Architecture
The modules implemented for tracking, pointing and painting, and speech command
recognition, need to communicate with each other. Agents communicate by passing
Prolog-type ASCII strings (Horn clauses) via TCP/IP.
The central agent is the facilitator. Agents can inform the facilitator of their
interest in messages which match (logically unify) with a certain expression.
Thereafter, when the facilitator receives a matching message from some other
agent, it will pass it along to the interested agent. Since ASCII strings and
TCP/IP are common across various platforms, agents can be used as software components
that can communicate across platforms.
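The routing role of the facilitator can be sketched in a few lines. This is an in-process simplification and not the actual agent architecture: messages are tuples instead of Prolog-type strings sent over TCP/IP, and logical unification is reduced to matching with None as a wildcard. The names Facilitator, register, post and matches are hypothetical.

```python
def matches(pattern, message):
    """True if the message fits the pattern; None in the pattern is a wildcard."""
    return len(pattern) == len(message) and all(
        p is None or p == m for p, m in zip(pattern, message)
    )


class Facilitator:
    """Forwards each posted message to every agent whose pattern it matches."""

    def __init__(self):
        self._subscriptions = []

    def register(self, pattern, callback):
        # An agent declares its interest in messages of a certain shape.
        self._subscriptions.append((pattern, callback))

    def post(self, message):
        delivered = 0
        for pattern, callback in self._subscriptions:
            if matches(pattern, message):
                callback(message)
                delivered += 1
        return delivered


facilitator = Facilitator()
vr_agent_inbox = []
# The VR agent asks for every parse_speech message, whatever its argument.
facilitator.register(("parse_speech", None), vr_agent_inbox.append)
sent = facilitator.post(("parse_speech", "line begin"))
ignored = facilitator.post(("tracker_report", 0.1, 0.2))
```

Because senders never address a receiver directly, an agent can be replaced or moved to another machine without changing the agents that talk to it.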
In this case, the Speech Agent is running on a Windows platform. The best off-the-shelf
speech recognition engines available to us (currently, Dragon) are on the Windows
platform. On the other hand, the Flock of Birds and the VRPN server are set
up for Unix. Therefore, it makes sense to tie them together with the agent architecture.
Communication is straightforward. The Speech Agent produces messages of the
type "parse_speech(Message)", which the facilitator forwards
to the VR agent. The VR agent, with some simple parsing, can then extract the alternative
speech recognition interpretations and their associated probabilities from
the message strings. The command associated with the highest probability value
above an experimental threshold (currently 0.85) is chosen. Eventually, the
VR agent either takes the action (such as changing drawing color, pen down or
pen up, etc.) associated with the speech command or issues an acoustic warning
signal if the verbal command is not allowed by the current system modality.
Depending on the performed action, the system may undergo a state change.
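The selection rule applied by the VR agent can be summarized in a short sketch. Since the layout of the message string is not described here, the sketch starts from interpretations that have already been parsed into (text, probability) pairs; this input shape and the function name choose_command are assumptions.

```python
def choose_command(alternatives, threshold=0.85):
    """Pick the most probable interpretation if it exceeds the threshold.

    alternatives: list of (text, probability) pairs, assumed to be already
    extracted from the message string.
    Returns the command text, or None when no interpretation is reliable enough.
    """
    if not alternatives:
        return None
    text, probability = max(alternatives, key=lambda alt: alt[1])
    return text if probability > threshold else None


command = choose_command([("draw on", 0.91), ("draw off", 0.06)])
rejected = choose_command([("zoom in", 0.60), ("zoom out", 0.35)])
```

A rejected utterance is simply ignored, which trades a few missed commands for fewer unintended actions on the painting.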
3.4 Future Work with the Painter
The presented system represents a real-time application of drawing in space
on a two-dimensional limited rectangular surface. This is a first step toward
a 3D multimodal speech and gesture system for computer aided design and cooperative
tasks. Such a system might recognize 3D objects from an iconic library in the
user's input and refine the user's drawings accordingly. We anticipate
expanding the use of speech to operate with 3D objects.
Since the VR component (see Figure 2) is an agent, we are going to make it a
module in the entire QuickSet Adaptive Agent Architecture [28], to further use
it as a sort of virtual mouse for the QuickSet user interface. Possible alternative
applications for this system range from hand cursor control by pointing to target
selection in virtual environments.
4 Conclusions
There is still much to be done before gestural input can become pervasive, robust
and reliable. The most widely used interaction devices at this time (e.g., keyboard,
mouse, joystick) lower the naturalness and ease of interaction. The more direct
use of natural means of interaction, such as speech and human gestures, plays an essential
role in the solution of this problem.
In HCI, the notion of gesture is loosely defined, and depends on the context
of the interaction. It varies from pen mark to hand/arm movement, and usually
researchers who create experimental gesture-based systems use their own definitions
that tend to be application-specific and therefore less spontaneous and more
learned. Computer scientists have mostly focused on gesture as its own language
rather than as a complementary or accompanying mode. Besides the lack of a common definition,
it is also very difficult to assess the reliability of current gesture-based
systems, as there is no common gestural data to use as a benchmark.
In any case, we believe that the combination of speech and gesture is a mandatory
step toward powerful HCI. In this paper, we presented two different multimodal
speech-gesture architectures. The first system, QuickSet, is a collaborative,
map-based pen gesture/voice system that runs on wireless hand-held PCs,
allowing a user to set up a military simulation, control the entities as it
runs, and navigate in a synthetic 3-D world. The second one is a pointing and
speech alternative to the current paint programs based on traditional devices
like mouse, pen or keyboard.
Acknowledgements
This research has been supported by the Office of Naval Research, Grants N00014-99-1-0377,
N00014-99-1-0380 and N00014-02-1-0038. We are also very thankful to Philip R.
Cohen, Richard M. Wesson and David McGee for fruitful suggestions, programming
help, and support.
References
1. Calder, J. Typed unification for natural language processing. In E. Klein
and J. van Benthem (Eds.), Categories, Polymorphisms, and Unification. Centre
for Cognitive Science, University of Edinburgh, Edinburgh, 1987, 65-72.
2. Brison, E. and N. Vigouroux. (unpublished ms.). Multimodal references: A
generic fusion process. URIT-URA CNRS. Université Paul Sabatier, Toulouse,
France.
3. Carpenter, R. The logic of typed feature structures. Cambridge University
Press, Cambridge, 1992.
4. Cheyer, A., and L. Julia. Multimodal maps: An agent-based approach. International
Conference on Cooperative Multimodal Communication (CMC/95), May 1995. Eindhoven,
The Netherlands, 1995, 24-26.
5. Clarkson, J. D., and Yi, J. LeatherNet: A synthetic forces tactical training
system for the USMC commander. Proceedings of the Sixth Conference on Computer
Generated Forces and Behavioral Representation. Institute for Simulation and
Training. Technical Report IST-TR-96-18, 1996, 275-281.
6. Cohen, P. R. Integrated Interfaces for Decision Support with Simulation,
Proceedings of the Winter Simulation Conference, Nelson, B. and Kelton, W. D.
and Clark, G. M., (eds.), ACM, New York, December, 1991, 1066-1072.
7. Cohen, P. R. The Role of Natural Language in a Multimodal Interface. Proceedings
of UIST'92, ACM Press, New York, 1992, 143-149.
8. Cohen, P.R., Cheyer, A., Wang, M., and Baeg, S.C. An Open Agent Architecture.
Working notes of the AAAI Spring Symposium Series on Software Agents, Stanford
Univ., CA, March, 1994, 1-8.
9. Cohen, P. R., Dalrymple, M., Moran, D.B., Pereira, F. C. N., Sullivan, J.
W., Gargan, R. A., Schlossberg, J. L., and Tyler, S.W. Synergistic Use of Direct
Manipulation and Natural Language, Human Factors in Computing Systems: CHI'89
Conference Proceedings, ACM, Addison Wesley Publishing Co., New York, 227-234,
1989.
10. Courtemanche, A.J. and Ceranowicz, A. ModSAF Development Status. Proceedings
of the Fifth Conference on Computer Generated Forces and Behavioral Representation,
Univ. Central Florida, Orlando, 1995, 3-13.
11. Cruz-Neira, C., D.J. Sandin, and T.A. DeFanti. Surround-Screen Projection-Based
Virtual Reality: The Design and Implementation of the CAVE. Computer Graphics
(Proceedings of SIGGRAPH'93), ACM SIGGRAPH, August 1993, 135-142.
12. Johnston, M., Cohen, P. R., McGee, D., Oviatt, S. L., Pittman, J., and Smith,
I. Unification-based multimodal integration, in submission.
13. Koons, D.B., C.J. Sparrell and K.R. Thorisson. 1993. Integrating simultaneous
input from speech, gaze, and hand gestures. In Mark T. Maybury (ed.) Intelligent
Multimedia Interfaces. AAAI Press/ MIT Press, Cambridge, MA, 257-276.
14. Moore, R., Dowding, J., Bratt, H., Gawron, J. M., and Cheyer, A., CommandTalk:
A Spoken-Language Interface for Battlefield Simulations, 1997, (this volume).
15. Neal, J.G. and Shapiro, S.C. Intelligent multi-media interface technology.
In J.W. Sullivan and S.W. Tyler, editors, Intelligent User Interfaces, chapter
3, pages 45-68. ACM Press Frontier Series, Addison Wesley Publishing Co., New
York, New York, 1991.
16. Oviatt, S. L., Pen/Voice: Complementary multimodal communication, Proceedings
of SpeechTech'92, New York, February, 1992, 238-241.
17. Oviatt, S.L. Multimodal interfaces for dynamic interactive maps. Proceedings
of CHI'96 Human Factors in Computing Systems (April 13-18, Vancouver, Canada),
ACM Press, NY, 1996, 95-102.
18. Oviatt, S. L., Multimodal interactive maps: Designing for human performance,
Human-Computer Interaction, in press.
19. Oviatt, S. L., A. DeAngeli, and K. Kuhn. In press. Integration and synchronization
of input modes during multimodal human-computer interaction. Proceedings of
the Conference on Human Factors in Computing Systems (CHI '97), ACM Press, New
York.
20. Oviatt, S. L., Cohen, P. R, Fong, M. W. and Frank, M. P., A rapid semi-automatic
simulation technique for interactive speech and handwriting, Proceedings of
the 1992 International Conference Spoken Language Processing, vol. 2, University
of Alberta, J. Ohala (ed.), October, 1992, 1351-1354.
21. Oviatt, S. L., Cohen, P. R., Wang, M. Q., Toward interface design for human
language technology: Modality and structure as determinants of linguistic complexity,
Speech Communication, 15 (3-4), 1994.
22. Pittman, J.A., Smith, I.A., Cohen, P.R., Oviatt, S.L., and Yang, T.C. QuickSet:
A Multimodal Interface for Military Simulation. In Proceedings of the Sixth
Conference on Computer-Generated Forces and Behavioral Representation, Orlando,
Florida, 1996.
23. Thorpe, J. A., The new technology of large scale simulator networking: Implications
for mastering the art of warfighting. Proceedings of the 9th Interservice/Industry
Training Systems Conference, Orlando, Florida, December, 1987, 492-501.
24. Vo, M. T. and C. Wood. Building an application framework for speech and
pen input integration in multimodal learning interfaces. International Conference
on Acoustics, Speech, and Signal Processing, Atlanta, GA, 1996.
25. Wauchope, K. Eucalyptus: Integrating natural language input with a graphical
user interface. Naval Research Laboratory, Report NRL/FR/5510--94-9711, 1994.
26. Zyda, M. J., Pratt, D. R., Monahan, J. G., and Wilson, K. P., NPSNET: Constructing
a 3-D virtual world, Proceedings of the 1992 Symposium on Interactive 3-D Graphics,
March, 1992.
27. Kendon, A., The Biological Foundations of Gestures: Motor and Semiotic Aspects,
Lawrence Erlbaum Associates, 1986.
28. Kumar, S., Cohen, P.R., Levesque, H.J. The Adaptive Agent Architecture:
Achieving Fault-Tolerance Using Persistent Broker Teams, in Proceedings of the
Fourth International Conference on Multi-Agent Systems, 2000, 159-166.
29. McNeill, D., Hand and Mind: What Gestures Reveal about Thought, The University
of Chicago Press, 1992.
30. http://www.ascension-tech.com
31. http://www.cs.unc.edu/Research/vrpn/
32. http://www.cse.ogi.edu/CHCC/QuickSet/mainProj.html