Is speech-gesture production ballistic or interactive?
Nobuhiro Furuyama
National Institute of Informatics
(JAPAN)
David McNeill
University of Chicago
(USA)
Mischa Park-Doob
University of Chicago
(USA)


Abstract

A “ballistic” model of speech and gesture argues that once the planning of a sentence
and a gesture is completed in the conceptualizer, the rest of the production processes of
speech and gesture are independent from each other. Observing Arrernte speakers who gesture
at arm’s length, De Ruiter and Wilkins hypothesized that the preparation of their gestures
should take longer than usual, and that the co-expressive speech should be delayed to the
same degree as the gesture to maintain synchrony between the two if their production processes
are interactive. The result was that the preparation phase of Arrernte gesture was longer
than usual, but speech did not become delayed. They concluded the ballistic model was
therefore correct. The present paper asks whether the same holds true for non-Arrernte
speakers (e.g., English speakers) gesturing at arm’s length. Our preliminary analysis shows
that the preparation phase does not necessarily take longer than usual and that in any case
speech remains synchronized with the related gesture. We conclude that the production of
speech and that of gesture are interactive throughout the entire process, and that De Ruiter
and Wilkins’s findings likely resulted from factors other than the modularity of speech-gesture
production.


Introduction

This paper argues that speech production and production of the concurrent gesture are
interactive throughout the entire process, regardless of where in the gesture space a gesture is
produced, or how long the preparation phase of a gesture lasts. This position contrasts
with the ballistic (or modular) model of speech-gesture production (e.g., Levelt, Richardson,
and La Heij 1985; De Ruiter and Wilkins 1998), which argues that once the planning of a sentence
and a gesture is completed in the conceptualizer, speech production and gesture production
are independent of each other.

Evidence in support of the ballistic model allegedly comes in part from a cross-linguistic
(and cross-cultural) comparison of speech-gesture synchronization
between Arrernte speakers and Dutch speakers (De Ruiter and Wilkins 1998). Arrernte is a
language spoken in central Australia. The key fact is that Arrernte speakers, unlike
non-Arrernte speakers, perform gestures at arm’s length; such gestures, performed at the outer limit
of the gesture space, require a longer preparation phase. The question De Ruiter and Wilkins ask is, does the co-expressive (‘affiliated’) speech become delayed to the same degree as the gesture to maintain the synchrony between gesture and speech? The finding was that the preparation phases of Arrernte gestures were longer than for Dutch gestures, but speech did not become delayed: gestures occurred after the co-expressive speech by an amount equal to the extra time needed for the gesture preparation. Thus, De Ruiter and Wilkins concluded, gesture and speech are ‘modular’ in that speech and gesture are on separate ballistic tracks; once launched they unfold independently of each other. This is schematically shown in (a) and (b) in Figure 1.

Figure 1. Relative timing between speech and the concurrent gesture: (a) Arrernte (large) gesture with long preparation phase; (b) non-Arrernte (small) gesture with short preparation phase. (Black = preparation phase; gray = stroke phase, or the interval between the onset of the stroke phase and the offset of the entire gesture movement; slanted stripe = speech.)

One may naturally wonder what the synchronization relationship of speech and gesture would be if non-Arrernte speakers (Dutch or English) could somehow be induced to make their gestures at arm’s length, as Arrernte speakers do spontaneously. The present study examines this question, using English speakers as participants. A ballistic model of speech and gesture would predict that under these circumstances English speakers would behave like Arrernte speakers: the preparation phase of their gestures would take longer than normal, causing gesture strokes to be delayed with respect to the co-expressive speech. If this turns out to be the case, the ballistic model may prove to be correct. If not, as the present study argues, the result could show two things. First, if speech and gesture remain in synchrony despite an extension of the arms to the outer limit of the gesture space, the production of speech and the production of gesture are interactive throughout the duration of both processes. Second, the finding of De Ruiter and Wilkins’s experiment (that in Arrernte, speech continues forward while the gesture is delayed by an amount equal to the extra time for gesture preparation) resulted from factors other than the modularity of speech-gesture production.


The Present Predictions

There are two possible scenarios that fit the interactive theory. One is that the preparation
phases of large gestures by speakers of English do not become longer (i.e., they are
comparable to those of Dutch gestures), and that the relative timing between the onset of speech and
that of the stroke phase is intact for large gestures.

The other possibility that also fits the interactive theory is that the preparation phases
of large gestures by English speakers become longer than those of small (Dutch) gestures, yet
the relative timing between the onset of speech and the stroke phase remains intact.
These predictions are schematically shown in (c) and (d) of Figure 2, respectively.
These diagrams are to be contrasted with the prediction that De Ruiter and Wilkins would
make, as shown in (a) and (b) of Figure 1 above.
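
The contrast between the two accounts can be stated as a simple timing rule. The following Python sketch is only an illustration under simplifying assumptions and is not part of the original analysis: the function name `predicted_stroke_lag`, its parameters, and the example durations are all hypothetical. It treats the ballistic prediction as a stroke lag equal to the extra preparation time, and the interactive prediction as zero lag because speech is held back to stay aligned with the gesture.

```python
def predicted_stroke_lag(prep_ms, baseline_prep_ms, model):
    """Predicted lag (msec) of stroke onset behind the co-expressive speech.

    prep_ms          -- preparation-phase duration of the large gesture
    baseline_prep_ms -- preparation-phase duration of an ordinary gesture
    model            -- "ballistic" or "interactive"
    All names and values here are illustrative assumptions.
    """
    extra = max(0.0, prep_ms - baseline_prep_ms)
    if model == "ballistic":
        # Speech runs on its own track, so the stroke trails it by the extra time.
        return extra
    if model == "interactive":
        # Speech is held back as needed, so stroke and speech stay aligned.
        return 0.0
    raise ValueError(f"unknown model: {model!r}")


# Hypothetical durations in msec, chosen only for illustration.
for model in ("ballistic", "interactive"):
    print(model, predicted_stroke_lag(800.0, 550.0, model))
```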


Method

To induce English speakers to produce gestures at the outer-limit of the gesture space
—gestures that might induce longer preparation phases—we modified our standard procedure
of eliciting gestures via a cartoon narration.

Eight native speakers of American English participated in the study. Pair 1 consisted
of a female narrator and male listener, all four subjects in Pairs 2 and 4 were female, and both
members of Pair 3 were male (the subjects volunteered in pairs—we conducted the experiment
without regard to gender). They were all undergraduate students at the University of
Chicago at the time of the experiment, and were recruited through the “study-pool” list, an
email database of interested volunteers maintained by the Department of Psychology.

The material used to elicit the narrative featured the well-known cartoon characters
Sylvester and Tweety. The cartoon is entitled “Canary Row” (Warner Brothers, Inc.). A detailed
description of the story and some aspects of the animated film can be found in the Appendices of McNeill (1992). We asked our subjects to recount the cartoon story and, as they spoke, to point at relevant photos (still image clips printed in color on standard US letter-sized paper, 23 in total) that were taken from each of the episodes of the cartoon and posted in a random arrangement
on the walls of the experiment room. The participants were tested in pairs. The narrator watched the entire cartoon once in a separate room. The narrator then joined the listener in the experiment room and was given approximately one minute to familiarize herself/himself with the locations of the image clips. Then the narrator was instructed to recount the entire cartoon
story, and to point at the image clips posted on the walls of the experiment room whenever
the narration came to a point where the image clips were relevant. The listener was instructed
to attend carefully and attempt to remember the events recounted by the narrator.
The cameras were then switched on and the pair was left alone in the experiment room until
the task was completed. The narrator was videotaped such that her/his whole body would fit
on the screen even if s/he produced a gesture at the outer-limit of the gesture space. The listener
was excluded from the camera field in favor of having a wider space available for the
narrator.

Figure 2. The present predictions on speech-gesture timing when a gesture is performed at the outer-limit of the gesture space: (c) first prediction from the interactive theory for large gestures; (d) second prediction from the interactive theory for large gestures.


Results

The gesture space can be segmented as shown in Figure 3. The present analysis is
limited to pointing gestures for which the preparation started in the center gesture space and
terminated at the extreme periphery. This coding restriction ensured that we could see the
effects of large movements on preparation phase duration and speech-gesture timing.
Analysis shows that in most cases speech either remains synchronized with the meaningfully
related gesture or even waits until the pointing gesture is fully executed at the outer-limit
of the gesture space.

Figure 4 shows the histogram of the duration of preparation phase when English
speakers were induced to gesture at the outer-limit of the gesture space.
(1) The mean duration of the preparation phase of the English speakers is 631.11 msec.
(SD = 34.27 msec.; median = 566.67 msec.). This is closer to that of
the Dutch speakers (Mean = 559 msec., SD unknown) than to that of the Arrernte speakers
(Mean = 803 msec., SD unknown).
(2) The histogram also shows that the distribution of duration of preparation phase is not
entirely normal. There are some cases of abnormally long preparation phases towards
the right edge of the chart. Durations in the range 301-600 msec. have higher frequencies
than other durations. These deviations from an otherwise relatively normal
distribution imply that there may be several interacting factors involved.
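
The comparison in (1) can be checked with simple arithmetic on the three reported means. The short Python sketch below is only a sanity check of the raw differences, with variable names chosen for illustration; because no standard deviations are available for the Dutch and Arrernte data, it is not a statistical test.

```python
# Mean preparation-phase durations in msec, as reported in the text.
english_mean = 631.11
dutch_mean = 559.0
arrernte_mean = 803.0

# Absolute distance of the English mean from each comparison group.
gap_to_dutch = abs(english_mean - dutch_mean)
gap_to_arrernte = abs(arrernte_mean - english_mean)

print(f"English vs. Dutch:    {gap_to_dutch:.2f} msec")
print(f"English vs. Arrernte: {gap_to_arrernte:.2f} msec")
```

On these figures the English mean lies about 72 msec. above the Dutch mean and about 172 msec. below the Arrernte mean.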
Crucially for the present purpose, however, performing the stroke phase of a gesture
at the outer-limit of the gesture space does not necessarily make the duration of preparation
phases longer. The longest preparation phase in the data set was 2166.67 msec. (= 65 frames).
When a preparation phase lasts as long as this, it is typically accompanied by a so-called pre-stroke
hold between the onset of the preparation phase and the onset of the stroke phase. Our longest preparation phase is a case in point: there was a long pre-stroke hold. The shortest preparation phase for each subject was equal to or below 366.67 msec. (= 11 frames). The shortest preparation phase in the entire data set lasted 333.33 msec. (= 10 frames).
Figure 3. The gesture space (Pedelty 1987, cited in McNeill 1992).
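
The durations above are reported both in milliseconds and in video frames. The Python sketch below shows the conversion, assuming a frame rate of 30 frames per second. That rate is not stated in the text; it is an assumption inferred from the reported pairs (e.g., 65 frames = 2166.67 msec.), and the names `FRAME_RATE_FPS` and `frames_to_msec` are hypothetical.

```python
FRAME_RATE_FPS = 30  # assumption inferred from 65 frames = 2166.67 msec.


def frames_to_msec(frames, fps=FRAME_RATE_FPS):
    """Convert a count of video frames to a duration in milliseconds."""
    return frames * 1000.0 / fps


# Frame counts quoted in this section.
for frames in (10, 11, 17, 65):
    print(f"{frames} frames = {frames_to_msec(frames):.2f} msec")
```

Under this assumption one frame corresponds to about 33.33 msec., which is therefore the finest timing difference the frame-based coding can resolve.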
As to the relative timing between the stroke phase and the meaningfully related
speech segments, the two remain robustly in synchrony for the English speakers regardless of
the length of the preparation phase. This is shown in examples (1) through (4) below:
Example (1): A gesture with one of the shortest prep. phases¹
[there’s a picture / <h>of wha^t it looks like #]
Example (2): Another gesture with one of the shortest prep. phases
[Sylvester / on* / the picture’s on the door / goes u]p /
Example (3): A gesture with one of the mid-length prep. phases
and then<n> / the camera sorta [pulls back n’ we see that / driving the trolley #]
Example (4): A gesture with the longest prep. phase
[a<aa>nd / Tweety’s ¿cage? // ¿right? is sitting on the ¿window sill? /]
¹ The following symbols are used in the transcription: [ = onset of preparation phase; ] = offset of retraction; ^ =
superimposed beat; / = unfilled speech pause; Bold face = stroke phase & post-stroke hold; _ = pre-stroke hold;
# = breath pause; < ... > = filled speech pause; * = aborted speech or “speech trouble”; ¿...? = rising intonational
contour.
Preparation phase durations in Examples (1) through (4), in order: 366.67 msec. (11 frames); 333.33 msec. (10 frames); 566.67 msec. (17 frames); 2166.67 msec. (65 frames).
Figure 4. The histogram of the duration of preparation phase in msec
when English speakers were induced to gesture at the outer-limit of the
gesture space.


Discussion

Speech-gesture synchrony by non-Arrernte speakers would imply that, for the Arrernte,
the temporal separation of speech and gesture is not just a mechanical effect of a larger
gesture space. That is, there is something extra causing the separation of speech and gesture.
This extra something could be a rhetorical use of gesture. Suppose that, in the Arrernte
culture, there is active control of gesture such that gestures are timed to be deployed after the
co-expressive speech. This control might be ‘rhetorical’ in that gestures are made to occur as
‘reinforcements’ or echoes of what is said in speech.

Even if this hypothetical rhetorical use is not present in the Arrernte, as long as there
is some form of active control of gesture, gesture could follow speech, as it were, by design.
The large gesture space then would be an effect of the gesture-speech timing difference rather
than a cause of it. The extra long preparation phase would be the instrument of the speaker’s
active control over the timing of the gesture. Using the preparation phase as the instrument
of control would automatically make the amount of gesture delay equal the extra length of the
preparation phase, as De Ruiter and Wilkins report.

But whatever the extra factor in Arrernte gesture performance is, something extra in
the control of gesture means that gesture is not ballistic. And this means that Arrernte speech
and gestures cannot be taken as evidence of modularity. On the contrary, they would reveal
the very opposite of modularity—a continuing on-line process by Arrernte speakers of controlling
the relationship between speech and gesture, in which the gesture is aimed to occur at
the moment that the semantically co-expressive speech ends.


Conclusion

The evidence shows decisively that separation of speech and gesture in the Arrernte
manner is not the result of using a larger gesture space with its attendant longer preparation
phase. While a ballistic model of speech and gesture would predict gesture to be delayed according to length of preparation phase, our evidence has shown that speakers can maintain
tight synchrony between speech and gesture even as preparation phase length varies widely.
Such behavior is possible only if speakers exert careful control over the temporal relationship
between the various parts of their simultaneously unfolding speech and gestures.


References

De Ruiter, J.P. and Wilkins, D. (1998). The synchronization of gesture and speech in Dutch
and Arrernte (an Australian Aboriginal language). In S. Santi, I. Guaïtella, C. Cavé
and G. Konopczynski (eds.), Oralité et Gestualité, pp. 603-607. Paris: L’Harmattan.
Kita, S. (2000). How representational gestures help speaking. In McNeill (2000).
Levelt, W.J., Richardson, G. and La Heij, W. (1985). Pointing and voicing in deictic expressions.
Journal of Memory and Language, 24:133-164.
McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. Chicago: University
of Chicago Press.
McNeill, D. (ed.) (2000). Language and Gesture. Cambridge: Cambridge University Press.
McNeill, D. and Duncan, S. (2000). Growth points in thinking-for-speaking. In McNeill
(2000).
Nobe, S. (1996). Representational Gestures, Cognitive Rhythms, and Acoustic Aspects of
Speech: A Network/Threshold Model of Gesture Production. Unpublished Ph.D. Dissertation,
Department of Psychology, The University of Chicago.