Is speech-gesture production ballistic
or interactive?
Nobuhiro Furuyama
National Institute of Informatics
(JAPAN)
David McNeill
University of Chicago
(USA)
Mischa Park-Doob
University of Chicago
(USA)
Abstract
A ballistic model of speech and gesture argues that once the planning
of a sentence
and a gesture is completed in the conceptualizer, the rest of the production
processes of
speech and gesture are independent from each other. Observing Arrernte speakers
who gesture
at arm's length, De Ruiter and Wilkins hypothesized that the preparation
of their gestures
should take longer than usual, and that the co-expressive speech should be delayed
to the
same degree as the gesture to maintain synchrony between the two if their production
processes
are interactive. The result was that the preparation phase of Arrernte gesture
was longer
than usual, but speech did not become delayed. They concluded the ballistic
model was
therefore correct. The present paper asks whether the same holds true for non-Arrernte
speakers (e.g., English speakers) gesturing at arm's length. Our preliminary
analysis shows
that the preparation phase does not necessarily take longer than usual and that
in any case
speech remains synchronized with the related gesture. We conclude that the production
of
speech and that of gesture are interactive throughout the entire process, and
that De Ruiter
and Wilkins's findings likely resulted from factors other than the modularity
of speech-gesture
production.
Introduction
This paper argues that speech production and production of the concurrent gesture
are
interactive throughout the entire process, regardless of where in the gesture
space a gesture is
produced, or how long the preparation phase of a gesture lasts. This is meant
to be contrasted
with the ballistic (or modular) model of speech-gesture production (e.g., Levelt,
Richardson,
and La Heij 1985; De Ruiter and Wilkins 1998) which argues that once the planning
of a sentence
and a gesture is completed in the conceptualizer, speech production and gesture
production
are independent from each other.
Evidence in support of the ballistic model allegedly comes in part from a cross-linguistic
(and cross-cultural) comparison of speech-gesture synchronization differences
between Arrernte speakers and Dutch speakers (De Ruiter and Wilkins 1998). Arrernte
is a
language spoken in central Australia. The key fact is that Arrernte speakers,
unlike non-Arrernte speakers, perform gestures at arm's length; such gestures,
performed at the outer limit
of the gesture space, require a longer preparation phase. The question
De Ruiter and Wilkins ask is, does the co-expressive (affiliated)
speech become delayed to the same degree as the gesture to maintain the
synchrony between gesture and speech? The finding was that the preparation phases
of Arrernte gestures were longer than for Dutch gestures, but speech did not
become delayed: gestures occurred after the co-expressive speech by an amount
equal to the extra time needed for the gesture preparation. Thus, De Ruiter
and Wilkins concluded, gesture and speech are modular in that speech
and gesture are on separate ballistic tracks; once launched they unfold independently
of each other. This is schematically shown in (a) and (b) in Figure 1.
Figure 1. Relative timing between speech and the concurrent gesture: (a) Arrernte
(large) gesture with long preparation phase; (b) non-Arrernte (small) gesture with
short preparation phase. Black = preparation phase; gray = stroke phase, or the
interval between the onset of the stroke phase and the offset of the entire gesture
movement; slanted stripes = speech.
One may naturally wonder what the synchronization relationship of speech and
gesture would be if non-Arrernte speakers (Dutch or English) could somehow
be induced to make their gestures at arm's length, just as Arrernte speakers
do, albeit spontaneously. The present study examines this question, using English
speakers as participants. A ballistic model of speech and gesture would predict
that under these circumstances English speakers would exhibit behavior similar
to that of Arrernte speakers: the preparation phase of English speakers'
gestures would take longer than normal, causing gesture strokes to become delayed
with respect to the co-expressive speech. If this turns out to be the case,
the ballistic model may prove to be correct. If not, as the present study argues,
the result could show two things: one is that the production of speech and gesture
are interactive throughout the duration of both processes, if despite an extension
of the arms to
the outer limit of the gesture space speech and gesture remain in synchrony.
The other would
be that the findings of De Ruiter and Wilkins's experiment (that in
Arrernte speech continues
forward while the gesture becomes delayed by an amount equal to the extra time
for gesture
preparation) resulted from factors other than the modularity of speech-gesture
production.
The Present Predictions
There are two possible scenarios that fit the interactive theory. One is that
the preparation
phases of large gestures by speakers of English do not become delayed (i.e.,
they are
comparable to Dutch gestures), and that the relative timing between the onset
of speech and
that of the stroke phase is intact for large gestures.
The other possibility that also fits the interactive theory is that the preparation
phases
of large gestures by English speakers become longer than those of small (Dutch)
gestures, yet
the relative timing between the onset of speech and the stroke phase remains
intact.
These predictions are schematically shown in (c) and (d) of Figure 2, respectively.
These diagrams are to be contrasted with the prediction that De Ruiter and Wilkins
would
make, as shown in (a) and (b) of Figure 1 above.
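The contrast between the ballistic prediction and the two interactive scenarios can be made concrete with a small timing sketch. The function and class names are hypothetical, and treating "synchrony" as a stroke lag of exactly zero is a simplifying assumption made here for illustration; the two preparation durations are the Dutch and Arrernte means quoted later in this paper, used only as example inputs.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    """Onsets in msec, measured from the onset of gesture preparation."""
    speech_onset: float
    stroke_onset: float

    @property
    def stroke_lag(self) -> float:
        # Positive lag: the stroke starts after the co-expressive speech.
        return self.stroke_onset - self.speech_onset


def ballistic(prep_ms: float, baseline_prep_ms: float) -> Timing:
    # Assumption: speech runs on its own track, timed for a baseline-length
    # preparation, so extra preparation time delays only the stroke.
    return Timing(speech_onset=baseline_prep_ms, stroke_onset=prep_ms)


def interactive(prep_ms: float) -> Timing:
    # Assumption: speech adjusts to the gesture, so stroke and co-expressive
    # speech start together whatever the preparation length.
    return Timing(speech_onset=prep_ms, stroke_onset=prep_ms)


BASELINE = 559.0   # mean Dutch preparation phase (msec), example input
LONG_PREP = 803.0  # mean Arrernte preparation phase (msec), example input

# Ballistic: lag equals the extra preparation time.
print(ballistic(LONG_PREP, BASELINE).stroke_lag)
# Interactive scenario (c): preparation not lengthened, synchrony intact.
print(interactive(BASELINE).stroke_lag)
# Interactive scenario (d): preparation lengthened, synchrony intact.
print(interactive(LONG_PREP).stroke_lag)
```

Under these assumptions the ballistic model yields a lag that grows with preparation length, whereas both interactive scenarios yield no lag; only the lag, not the preparation duration itself, distinguishes the two accounts.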
Method
To induce English speakers to produce gestures at the outer limit of the gesture
space (gestures that might induce longer preparation phases), we modified
our standard procedure
of eliciting gestures via a cartoon narration.
Eight native speakers of American English participated in the study. Pair 1
consisted
of a female narrator and male listener, all four subjects in Pairs 2 and 4 were
female, and both
members of Pair 3 were male (the subjects volunteered in pairs; we conducted
the experiment
without regard to gender). They were all undergraduate students at the University
of
Chicago at the time of the experiment, and were recruited through the study-pool
list, an
email database of interested volunteers maintained by the Department of Psychology.
The material used to elicit the narrative featured the well-known cartoon characters
Sylvester and Tweety. The cartoon is entitled Canary Row (Warner
Brothers, Inc.). A detailed
description of the story and some aspects of the animated film can be found
in the Appendices of McNeill (1992). We asked our subjects to recount the cartoon
story and, as they spoke, to point at relevant photos (still image clips printed
in color on standard US letter-sized paper, 23 in total) that were taken from
each of the episodes of the cartoon and posted in a random arrangement
on the walls of the experiment room. The participants were tested in pairs.
The narrator watched the entire cartoon once in a separate room. The narrator
then joined the listener in the experiment room and was given approximately
one minute to familiarize herself/himself with the locations of the image clips.
Then the narrator was instructed to recount the entire cartoon
story, and to point at the image clips posted on the walls of the experiment
room whenever
the narration came to a point where the image clips were relevant. The listener
was instructed
to attend carefully and attempt to remember the events recounted by the narrator.
The cameras were then switched on and the pair was left alone in the experiment
room until
the task was completed. The narrator was videotaped such that her/his whole
body would fit
on the screen even if s/he produced a gesture at the outer-limit of the gesture
space. The listener
was excluded from the camera field in favor of having a wider space available
for the
narrator.
Figure 2. The present predictions on speech-gesture timing when a gesture is
performed at the outer limit of the gesture space: (c) first prediction from the
interactive theory for large gestures; (d) second prediction from the interactive
theory for large gestures.
Results
The gesture space can be segmented as shown in Figure 3. The present analysis
is
limited to pointing gestures for which the preparation started in the center
gesture space and
terminated at the extreme periphery. This coding restriction ensured that we
could see the
effects of large movements on preparation phase duration and speech-gesture
timing.
Analysis shows that in most cases speech either remains synchronized with the
meaningfully
related gesture or even waits until the pointing gesture is fully executed in
the outer limit
of the gesture space.
Figure 4 shows the histogram of the duration of preparation phase when English
speakers were induced to gesture at the outer-limit of the gesture space.
(1) The mean duration of the preparation phase of the English speakers is 631.11
msec. (SD = 34.27 msec.; median = 566.67 msec.). This is more comparable to that
of the Dutch speakers (Mean = 559 msec., SD unknown) than to that of the Arrernte
speakers (Mean = 803 msec., SD unknown).
(2) The histogram also shows that the distribution of duration of preparation
phase is not
entirely normal. There are some cases of abnormally long preparation phases
towards
the right edge of the chart. Durations in the range 301-600 msec. have higher
frequencies
than other durations. These deviations from an otherwise relatively normal
distribution imply that there may be several interacting factors involved.
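The comparison in (1) can be checked with simple arithmetic on the three reported means. The variable names are introduced here for illustration only, and because no SDs are available for the Dutch and Arrernte data, this is a descriptive comparison rather than a significance test.

```python
# Mean preparation-phase durations in msec, as reported in the text.
english_mean = 631.11
dutch_mean = 559.0
arrernte_mean = 803.0

# Distance of the English mean from each of the two comparison means.
gap_to_dutch = round(english_mean - dutch_mean, 2)
gap_to_arrernte = round(arrernte_mean - english_mean, 2)

closer_to = "Dutch" if gap_to_dutch < gap_to_arrernte else "Arrernte"
print(gap_to_dutch, gap_to_arrernte, closer_to)
```

The English mean lies roughly 72 msec above the Dutch mean and roughly 172 msec below the Arrernte mean, which is the sense in which it is "more comparable" to the Dutch value.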
Figure 3. The gesture space (Pedelty 1987, cited in McNeill 1992).
Crucially for the present purpose, however, performing the stroke phase of a gesture
at the outer limit of the gesture space does not necessarily make the duration of preparation
phases longer. The longest preparation phase in the data set was 2166.67 msec. (= 65 frames).
When a preparation phase lasts as long as this, it typically is accompanied by a so-called
pre-stroke hold between the onset of the preparation phase and the onset of the stroke
phase. Our longest preparation phase is a case in point: there was a long pre-stroke
hold. The shortest preparation phase for each subject was equal to or below
366.67 msec. (= 11 frames). The shortest preparation phase in the entire data
set lasted 333.33 msec. (= 10 frames).
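The frame counts and millisecond values quoted above are consistent with a video frame rate of 30 frames per second (65 frames / 30 fps = 2166.67 msec.). The frame rate is inferred from these figures rather than stated in the text, and the helper name below is hypothetical; a minimal sketch of the conversion:

```python
FPS = 30  # assumed frame rate, inferred from 65 frames = 2166.67 msec.


def frames_to_msec(frames: int, fps: int = FPS) -> float:
    """Convert a duration in video frames to msec, rounded to two decimals."""
    return round(frames * 1000 / fps, 2)


# Frame counts reported for the shortest, per-subject shortest, median,
# and longest preparation phases.
for frames in (10, 11, 17, 65):
    print(frames, frames_to_msec(frames))
```

Under this assumption one frame corresponds to about 33.33 msec., which is the finest temporal distinction the frame-by-frame coding can make.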
As to the relative timing between the stroke phase and the meaningfully related
speech segments, they remain robustly in synchrony for the English speakers
regardless of
the length of the preparation phase. This is clearly shown in examples (1) through
(4) below:
Example (1): A gesture with one of the shortest prep. phases (see footnote 1)
[there's a picture / <h>of wha^t it looks like #]
Example (2): Another gesture with one of the shortest prep. phases
[Sylvester / on* / the pictures on the door / goes u]p /
Example (3): A gesture with one of the mid-length prep. phases
and then<n> / the camera sorta [pulls back n we see that / driving
the trolley #]
Example (4): A gesture with the longest prep. phase
[a<aa>nd / Tweety's ¿cage? // ¿right? is sitting on
the ¿window sill? /]
Footnote 1: The following symbols are used in the transcription: [ = onset of preparation
phase; ] = offset of retraction; ^ = superimposed beat; / = unfilled speech pause;
Bold face = stroke phase & post-stroke hold; _ = pre-stroke hold;
# = breath pause; < ... > = filled speech pause; * = aborted speech or
speech trouble; ¿...? = rising intonational contour.
Figure 4. Histogram of the duration of the preparation phase (in msec.) when
English speakers were induced to gesture at the outer limit of the gesture space.
Durations labeled in the figure: 333.33 msec. (10 frames), 366.67 msec. (11 frames),
566.67 msec. (17 frames), 2166.67 msec. (65 frames).
Discussion
Speech-gesture synchrony by non-Arrernte speakers would imply that, for the
Arrernte,
the temporal separation of speech and gesture is not just a mechanical effect
of a larger
gesture space. That is, there is something extra causing the separation of speech
and gesture.
This extra something could be a rhetorical use of gesture. Suppose that, in
the Arrernte
culture, there is active control of gesture such that gestures are timed to
be deployed after the
co-expressive speech. This control might be rhetorical in that gestures
are made to occur as
reinforcements or echoes of what is said in speech.
Even if this hypothetical rhetorical use is not present in the Arrernte, as
long as there
is some form of active control of gesture, gesture could follow speech, as it
were, by design.
The large gesture space then would be an effect of the gesture-speech timing
difference rather
than a cause of it. The extra long preparation phase would be the instrument
of the speaker's
active control over the timing of the gesture. Using the preparation phase as
the instrument
of control would automatically make the amount of gesture delay equal the extra
length of the
preparation phase, as De Ruiter and Wilkins report.
But whatever the extra factor in Arrernte gesture performance is, something
extra in
the control of gesture means that gesture is not ballistic. And this means that
Arrernte speech
and gestures cannot be taken as evidence of modularity. On the contrary, they
would reveal
the very opposite of modularity: a continuing on-line process by Arrernte
speakers of controlling
the relationship between speech and gesture, in which the gesture is aimed to
occur at
the moment that the semantically co-expressive speech ends.
Conclusion
The evidence shows decisively that separation of speech and gesture in the Arrernte
manner is not the result of using a larger gesture space with its attendant
longer preparation
phase. While a ballistic model of speech and gesture would predict gesture to
be delayed according to length of preparation phase, our evidence has shown
that speakers can maintain
tight synchrony between speech and gesture even as preparation phase length
varies widely.
Such behavior is possible only if speakers exert careful control over the temporal
relationship
between the various parts of their simultaneously unfolding speech and gestures.
References
De Ruiter, J.P. and Wilkins, D. (1998). The synchronization of gesture and speech
in Dutch
and Arrernte (an Australian Aboriginal language). In S. Santi, I. Guaïtella,
C. Cavé
and G. Konopczynski (eds.), Oralité et Gestualité, pp. 603-607.
Paris: L'Harmattan.
Kita, S. (2000). How representational gestures help speaking. In McNeill (2000).
Levelt, W.J., Richardson, G. and La Heij, W. (1985). Pointing and voicing in
deictic expressions.
Journal of Memory and Language, 24:133-164.
McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. Chicago:
University
of Chicago Press.
McNeill, D. (ed.) (2000). Language and Gesture. Cambridge: Cambridge University
Press.
McNeill, D. and Duncan, S. (2000). Growth points in thinking-for-speaking.
In McNeill (2000).
Nobe, S. (1996). Representational Gestures, Cognitive Rhythms, and Acoustic
Aspects of
Speech: A Network/Threshold Model of Gesture Production. Unpublished Ph.D. Dissertation,
Department of Psychology, The University of Chicago.