
Abstract
Children with disordered vowel phonologies exhibit systematic substitution patterns, but the reason for these substitutions is a matter of debate. We describe two experiments that potentially shed light on this issue. In the two experiments, adult male speakers with normal phonologies attempted to imitate vowel-like target sounds they had produced earlier. In the first experiment, "serial imitation," speakers were presented with a self-produced target and their imitations of the target were played back as the target for the next imitation. This continued for ten steps, resulting in a chain of imitations. In the second experiment, "multiple imitation," speakers imitated each self-produced target several times. Formant analysis of the targets and the imitations showed that (a) the speakers were unable to imitate themselves accurately, (b) the imitations deviated from the targets in systematic ways, and (c) the deviations did not appear to be structured by the linguistic background of the speaker. The patterns of deviation were reminiscent of the substitution patterns exhibited in vowel disorders. Therefore, we propose a hypothesis for the cause of the deviations and discuss its implications for the etiology and treatment of vowel disorders.
Introduction
In speech research, consonants have historically been seen as more indicative of speech-specific processing than vowels, and therefore more theoretically interesting. Likewise, in clinical speech pathology, more attention has been paid to consonantal disorders than vowel disorders. There is a pragmatic aspect to this: the diagnosis of vowel phonological disorders is more difficult because of the confound between normal developmental change (e.g. producing a diphthong as a monophthong) and dialect variation.
However, in recent years there has been an increased interest in describing, diagnosing and treating vowel disorders. Pollock (2002) gives a good overview of the issues, and also a preliminary estimate of the incidence of vowel disorders. She reports that at 36 months of age, 4% of children with normal consonant phonologies had mild vowel problems, although none had severe problems. Of the children with disordered consonant phonologies, 35% had mild vowel problems and 9% had severe problems (Pollock, 2002; Table 3-4).
Despite the difficulty of diagnosis, researchers have been able to identify some systematic patterns in disordered vowel production, such as lowering (/e/ → /a/), fronting (/?/ → /a/), diphthong reduction (/e?/ → /e/), and a "preference for peripheral vowel quality" (Reynolds, 2002). It is not clear why these patterns are preferred over others. One approach is to look for similar patterns in diachronic and synchronic change. For example, lowering/raising, coloring (fronting and/or rounding), bleaching (lowering and/or unrounding), and tensing/laxing are all diachronic processes (Donegan, 2002). Likewise, certain kinds of vowel errors, such as the preference for /i,a,u/, may be described by formalisms for synchronic change (Ball, 2002).
Another approach to understanding why certain patterns are preferred is to think about the perceptual representations of vowels: Are they coded discretely or continuously? Is perception more variable (or unstable) in certain parts of the vowel space? What kinds of perception and production errors can result from such instability? In this paper, we discuss two experiments that address these issues and concerns.
Theoretical Background
Vowel perception is usually non-categorical and highly context-sensitive (Fry et al., 1962), i.e., people hear more distinctions among vowels than they can categorize. Consequently, theories of vowel processing come in two flavors: they either try to find the acoustic transformations that predict human vowel categorization, or they try to explain the acoustic correlates underlying subcategorical perceptual differences. In our experiments (below) we were primarily interested in subcategorical distinctions, so we concentrated on theories of the second type (for reviews of categorization models, see Rosner & Pickering, 1994; Nearey, 1989; for overviews of speech perception theories, see Wright, Frisch, & Pisoni, 1999).
The non-categorical perception of vowels has striking similarities to the perception of non-speech sounds (Eimas, 1963), which suggests vowels are more influenced by auditory processes than are consonants. However, there is evidence that vowel discrimination is partly mediated by phonetic labels (Repp, Healy & Crowder, 1979) and is worse among more prototypical exemplars of a category (Iverson & Kuhl, 2000; see also Shigeno, 1991). Further, there is a significant stimulus order effect in vowel discrimination that seems to depend on the phonetic range and location of the tokens (Repp & Crowder, 1990). Thus, the perception of a vowel seems to be influenced by its phonetic status and its location in the vowel space.
Part of the difficulty with resolving this issue is methodological. To detect subtle perceptual variations using discrimination tests or multidimensional scaling, stimulus continua must be finely divided. This increases the overall number of stimuli and the total number of trials in the experiment (for example, a multi-dimensional experiment with 20 stimuli takes four times as long as one with 10 stimuli). Consequently, there is a trade-off between the area of the vowel space that can be explored and the stimulus resolution.
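The scaling argument above can be checked with a little arithmetic. The sketch below assumes a design in which every unordered pair of stimuli is compared once (the text does not specify the design, so this is an illustration only); the function name `n_pairs` is hypothetical.

```python
def n_pairs(n_stimuli: int) -> int:
    """Number of unordered stimulus pairs: n * (n - 1) / 2."""
    return n_stimuli * (n_stimuli - 1) // 2


# Assumption: one trial per unordered pair of stimuli.
trials_10 = n_pairs(10)        # 45 pairs
trials_20 = n_pairs(20)        # 190 pairs
ratio = trials_20 / trials_10  # a little over 4
```

Under this assumption, doubling the number of stimuli roughly quadruples the number of trials, which is the trade-off described in the text.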
One alternative, suggested by Chistovich, Fant, de Serpa-Leitao, and Tjernlund (1966), exploits the fact that listeners are also speakers by using vowel imitation as a measure of vowel perception and the F1/F2 patterns of the imitations as clues to vowel organization. Vowel production is intrinsically continuous and multidimensional, so in principle subjects can directly "identify" a graded and multidimensional vowel percept. Moreover, imitation appears to be an integral aspect of speech processing. Infants only a few weeks old attempt to mimic the vocalizations of adults around them and 20-week-old infants can imitate the point vowels /i/, /a/ and /u/ (Kuhl & Meltzoff, 1996). When adult subjects are asked to listen to a sentence or syllable sequence and repeat it with as little delay as possible, they perform with remarkable accuracy at latencies as short as 150 ms for syllables (Porter & Lubker, 1980). In fact, the response latency for shadowing is faster than the latency for simply detecting the vowel and uttering a standard response. These results suggest that speech perception and production are deeply linked in some manner, and that imitation does not require the conscious recognition and evaluation of speech. Thus, imitation seems to be methodologically and theoretically suited to studying speech.
Previous imitation studies
The utility of imitation to speech perception research has been recognized since the 1960s. In one of the first imitation studies, Chistovich et al. (1966) synthesized 12 vowels along an [a-e-i] trajectory in F1/F2 space and asked a phonetically-trained female subject to imitate them. When the mean F2 of the imitations was plotted against the ordinal of the stimulus (1-12), the resulting curve showed four well-defined plateaus. In addition, the F2 histogram showed four peaks near the category centers and the standard deviation of F2 showed peaks near the boundaries of the putative categories. Chistovich et al. interpreted these results to indicate that the subject had between four and six categories across the [a-e-i] continuum. Since the subject's native language (Russian) has only 3 vowel phonemes across the continuum, they posited that vowel representation is fundamentally discrete but at a finer granularity than the native phonemic categories. This conclusion does not imply vowels are discretely perceived, as it is possible that the discreteness occurs in vowel production.
The Chistovich et al. study, while suggestive, is of limited generality since it examined productions from only a single subject. Furthermore, the categorical structure was only observed for F2; F1 and F3 were much more ambiguous. Kent (1973) attempted to replicate the study with four speakers of American English. He asked subjects to imitate synthetic targets along two continua, /æ-i/ and /u-i/ and used the criteria that category boundaries are marked by peaks in formant standard deviations and category centers are marked by peaks in the formant histograms. The results tentatively suggested that the /æ-i/ continuum is more categorical than the /u-i/ continuum, but the data were too irregular to identify categories within each continuum (also see Kent and Forner, 1979). Repp and Williams (1985) re-examined the issue of discreteness in an in-depth study of imitation patterns of two phonetically-trained male speakers. Their task was to imitate 150 ms synthetic vowels along the [æ-i] and [u-i] continua. There were clear peaks in the resulting formant histograms, indicating the speakers had distinct response preferences. However, the formant frequency curves did not exhibit the plateaus observed by Chistovich et al. and the standard deviations did not show any consistent pattern across the two speakers.
It is possible the response preferences in the imitations occurred because the subjects were physiologically unable to match the synthetic target sounds, rather than due to how the vowels are represented internally. Repp and Williams (1987) tested this idea by asking speakers to imitate self-produced rather than synthetic vowels; thus, the subjects were physiologically capable of perfectly matching the acoustics of the target. They used as subjects the same two speakers from their earlier experiment. For each speaker, they chose 12 self-produced targets that were approximately equidistant along the [u-i] and [æ-i] continua. In addition, they also recorded the speakers' [hVd] utterances, and were thus able to map each speaker's "prototypical" vowels.
The results of the experiment were striking. First, the subjects were consistently inaccurate in imitating the (self-produced) targets, indicating the response preferences seen in earlier experiments were not solely due to a physiological mismatch. Second, the inaccuracies did not seem to be influenced by the locations of the vowel prototypes, supporting Chistovich et al.'s observation that the discreteness (if any) is not at the level of the native phonemic categories.
In summary, the line of research initiated by Chistovich et al. (1966) showed that speakers prefer certain formant frequency regions. There are also clear nonlinearities in vowel imitation that hint at the existence of representational categories, but none of the studies have been able to identify the putative categories or relate them to phonemes or allophones. A related problem is that all the experiments used only one-dimensional continua (such as /æ-i/), so it is unclear whether the observed nonlinearities are specific to certain parts of the vowel space, or whether they are part of a larger pattern. For example, Wildgen (1991) proposed that vowel systems may be organized with respect to the corner vowels /i/, /a/ and /u/, and that these vowels might serve as attractors during perception and production. This proposal is plausible, since there is evidence that the perceptual space of vowels is indeed warped with respect to a linear formant space (e.g. Kewley-Port & Atal, 1989; Iverson & Kuhl, 1995).
The two experiments (below) re-examine global vowel organization using the method of self-imitation. As discussed earlier, the imitation method allows us to cover the entire vowel space, so as to get a more comprehensive picture and possibly make sense of the complex results from the previous studies. Self-imitation provides two additional advantages. First, it serves as a best-case scenario for imitation and bypasses the question of physiological capability. Second, it allows us to avoid the issue of vowel normalization. If the subject is imitating vowels produced by someone else, then are response preferences due to "incorrect" normalization, or higher-level perceptual processes, or both? If the subject is imitating him/herself, then "incorrect" normalization is not an issue.
Experiment 1: Serial imitation of isolated vowels
This experiment was inspired by Frederic Bartlett's investigations of memory, in particular the task of "serial reproduction" (Bartlett, 1932/1995). This task is the familiar "telephone game." A subject is presented with a structured form (e.g. stories or sketches) and asked to reproduce it some time later to a second subject. The second subject reproduces the story to a third subject, and so on. Bartlett was interested in whether the stories tended towards a common or archetypical form. Our experiment was slightly different in that each subject imitated himself, and we were interested in whether certain areas of the vowel space would function as "attractors" pulling the successive imitations (see Note 1).
Method
Subjects
The subjects (CD, CK, EL, TS) were four male monolingual speakers of American English. All subjects grew up in American English-speaking households, though from different dialect regions (CD grew up in New York City, CK in Denver and San Diego, EL in Connecticut, and TS in N. Florida). The subjects had no phonetic training, had not participated in any previous imitation experiments, and did not have any hearing problems.
Stimuli & Procedure
The experiment was conducted over two days. On Day 1, subjects were asked to imitate a set of 100 synthetic vowel-like stimuli; the purpose of this step was to encourage the subject to produce vowel-like sounds that were likely to be well distributed in formant space. They were also asked to read /hVd/ words in citation and sentence contexts. For each subject, 15 vowels were chosen from the 100 self-produced vowels to serve as the targets. On Day 2, the 15 self-produced targets were presented for serial imitation. The interval between the two days was 17 days (CK), 15 (CD), 25 (EL) and 13 (TS).
Synthetic stimuli (Figure 1a): One hundred steady-state vowel-like stimuli were synthesized using the Klatt Synthesizer (Klatt, 1980). Each stimulus had three formants, with F1 varying from 300 to 750 Hz in steps of ~65 Hz, F2 varying from 998 Hz to 2400 Hz in steps of ~115 Hz, and F3 constant at 2500 Hz. For all stimuli, F0 was 120 Hz and the duration was 200 ms.
Recording Procedure. The subject was seated in a sound-shielded room and stimuli were presented binaurally via headphones at a comfortable loudness. The subject's productions were recorded and digitized at a sampling rate of 10 kHz. Both the stimulus presentation and recording were computer-controlled.
Imitation of the synthetic stimuli (Figure 1b). The 100 stimuli were randomized and divided into 2 blocks of 50 stimuli each. The same randomized block and presentation order was used for all subjects. Each stimulus was presented and imitated twice consecutively, and only the second imitation was used for further analysis. After each stimulus presentation, the subject had 4 seconds to initiate his imitation. There was a 1-second silent gap between the end of an imitation and the start of the next trial. Subjects were encouraged to produce all their imitations at the same level of subjective loudness.
Prototypical vowels (Figure 1b). Each subject was asked to read heed, hid, head, had, hud, hod, hoed, hood, and who'd in 5 citation contexts and 5 sentence contexts. These readings provided 10 tokens of the subject's natural productions for each of the 9 monophthongal vowels in American English.
Selection of the self-produced targets. The "vowel quadrilateral" of each speaker was approximated by a 5x3 grid imposed over the 100 imitations (Figure 1c). The cells of the grid are denoted by row#-col#, where the top-left cell is 1-1 and bottom-right cell is 5-3. For each subject, one production was selected from each cell of the vowel grid, for a total of 15 targets (Figure 1d). These targets are referred to as "seeds".
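The grid-based selection can be sketched as follows. The text does not say how the 5x3 grid was fitted or how a production was chosen within a cell, so this sketch makes several assumptions: cells are of equal width over the bounding box of the productions, row 1 holds the lowest F1 values and column 1 the highest F2 values (the usual vowel-chart orientation), and the first production found in a cell is taken as its seed. The names `cell_of` and `pick_seeds` are hypothetical.

```python
def cell_of(f1, f2, f1_range, f2_range, n_rows=5, n_cols=3):
    """Return the 1-indexed (row, col) of a production in an equal-width grid.

    Assumed orientation: row 1 = lowest F1 (high vowels),
    column 1 = highest F2 (front vowels), as on a conventional vowel chart.
    """
    f1_lo, f1_hi = f1_range
    f2_lo, f2_hi = f2_range
    row = int((f1 - f1_lo) / (f1_hi - f1_lo) * n_rows)
    col = int((f2_hi - f2) / (f2_hi - f2_lo) * n_cols)
    # Clamp points lying exactly on the far edge into the last cell.
    return min(row, n_rows - 1) + 1, min(col, n_cols - 1) + 1


def pick_seeds(productions, n_rows=5, n_cols=3):
    """Map each occupied cell to one production (here: the first one seen).

    productions: list of (F1, F2) pairs in Hz.
    """
    f1s = [p[0] for p in productions]
    f2s = [p[1] for p in productions]
    f1_range = (min(f1s), max(f1s))
    f2_range = (min(f2s), max(f2s))
    seeds = {}
    for f1, f2 in productions:
        cell = cell_of(f1, f2, f1_range, f2_range, n_rows, n_cols)
        seeds.setdefault(cell, (f1, f2))
    return seeds
```

With real data some cells may be empty, in which case `pick_seeds` returns fewer than 15 entries.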
Imitation of the self-produced targets. The 15 seeds were the starting points for 15 trajectories. Each "trajectory" was recorded as follows: the seed was played to the subject and his imitation was recorded. After a 0.5s silent gap, the imitation was played back to the subject (without any modification) as the target for his next imitation. This continued for 10 steps - the imitation of the seed is step 1, its imitation is step 2, and so on. The subject was asked to imitate each sound as well as possible, and to disregard the previous targets in the trajectory (the session was monitored to ensure that these instructions were generally followed). After the final step, there was a 5s silent gap and the seed of the next trajectory was presented. In all cases, the subject had 3s to initiate his imitation after the presentation of the target. If the trajectory was interrupted for any reason (e.g., subject clearing his throat or a recording glitch) then at the end of the block the seed of the affected trajectory was presented again, and the entire trajectory was re-recorded.
The trajectories were recorded in two blocks, 7 trajectories in the first block and 8 in the second. This acquisition of the 15 trajectories was Set 1, and it was repeated two more times (Set 2 and Set 3). The presentation order of the seeds was randomized, but the same randomized order was used for all sets and all subjects. This was intended to minimize the variability introduced by the vowel context of the different seeds. Prior to the imitation session, the subjects were familiarized with the protocol in a brief training session consisting of 3-step trajectories of 4 seeds.
Formant Analysis. The formants for each imitation were estimated using a customized LPC analysis tool. The analysis frame was Hamming-windowed, pre-emphasized at 100%, and submitted to LPC analysis (Vallabha & Tuller, 2002). If the imitations were diphthongized, the selection was made from the portion of the diphthong that most closely matched the quality of the target. Fortunately, most of the imitations were monophthongs with unambiguous steady states.
Results
Figure 2 shows the raw trajectories for two subjects. There are 11 points on each trajectory (including the seed) and consequently, 10 steps. As is evident, there was a lot of noise and the trajectories were smoothed to reveal larger patterns (the smoothing consisted of a 3-point average for each formant). Figure 3 shows the smoothed trajectories for the same two subjects. There are three points to note.
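The 3-point smoothing can be illustrated with a minimal sketch. The treatment of the two endpoints of a trajectory is not stated in the text; here they are averaged over the two available points, which is an assumption. The function name `smooth3` is hypothetical, and it would be applied to the F1 and F2 tracks of a trajectory separately.

```python
def smooth3(values):
    """3-point moving average of one formant track.

    Assumption: the first and last points use a 2-point window,
    since they have only one neighbor.
    """
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Example with a made-up F1 track (Hz): seed followed by three imitations.
f1_smoothed = smooth3([500.0, 520.0, 490.0, 510.0])
```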
Firstly, the subjects' imitations are not accurate and many of the deviations are large enough to affect the phonetic quality of the vowel. Moreover, the trajectories are extremely erratic, sometimes doubling back and intersecting themselves. This behavior is especially remarkable since the speakers are imitating their own productions at all times. Nonetheless, there are some interesting regularities. Even allowing for the difference in the vowel spaces, the two subjects are qualitatively different - CD shows a preference for the high-back and high-front regions, while TS completely ignores the high region with the vowels /i/, /?/ and /u/. Subject EL (not shown) had a general preference for /i/, but exhibited a general preference for raising that no other subject displayed.
Secondly, the trajectories exhibit local consistency in that trajectories with nearby starting points have similar evolutions. For example, notice the trajectories in the high-back region of CD's set 3, or the front axis and high-back region for TS. Such behavior suggests trajectories are principled or lawful in some way and evolved under shared constraints. However, few trajectories end at the prototype regions, as would be expected if these regions were acting as attractors. There are a few instances where trajectories seem influenced by the prototype regions - such as near CD's /i/, /u/ and /a/ - but these are exceptions rather than the rule.
Finally, observe that trajectories behave differently in different sets, even if they start with the same seed (e.g. the trajectories in CD's cell 1-2 and 5-, and TS's cell 1-2 and 3-2). This indicates that general patterns of deviation are slightly different for each set. Figure 4 shows the average F1 and F2 step size for each of CD's trajectories and sets. For example, in CD's Set 1, the trajectory starting in cell 1-2 has an average F1 step size of 13.7 Hz, and an average F2 step size of -1.1 Hz. In Set 2, the trajectory from the same seed has average F1 and F2 step sizes of 6.4 Hz and -13.9 Hz, respectively. These changes across sets occur for all four subjects.
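The average step sizes quoted above are signed, so a plausible reading is that each one is the mean of the successive differences along one formant track. The sketch below assumes exactly that (whether the raw or the smoothed trajectories were used is not stated); `mean_step` is a hypothetical name.

```python
def mean_step(formant_track):
    """Mean signed difference between successive points of one formant track (Hz)."""
    steps = [b - a for a, b in zip(formant_track, formant_track[1:])]
    return sum(steps) / len(steps)


# Made-up F1 track (Hz): seed plus three imitations, i.e. three steps.
example = mean_step([500.0, 510.0, 505.0, 530.0])
```

Because the differences telescope, this signed mean equals the net displacement from seed to final imitation divided by the number of steps, so it reflects overall drift rather than the jaggedness of the path.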
However, there is one pattern common to all subjects and sets. Figure 5 shows the Euclidean magnitude of each step, averaged across all trajectories, as calculated using the Bark values of the formants (see Note 2). The key pattern is the sharp decrease in the step magnitude after the first step. Again, recall that in all steps the subject is imitating his own production; the only difference is that the target for step 1 was produced during Day 1 (imitation of the synthetic stimuli) whereas the targets for the subsequent steps were produced a few seconds earlier.
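A minimal sketch of the step-magnitude calculation is given below. The Hz-to-Bark conversion used in the study is specified in Note 2, which is not reproduced here; the sketch assumes Traunmüller's (1990) approximation instead, so the numbers it yields may differ from those in Figure 5. The function names are hypothetical.

```python
import math


def hz_to_bark(f_hz):
    """Hz to Bark using Traunmüller's (1990) approximation.

    Assumption: the conversion actually used in the study may differ.
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def step_magnitudes(trajectory_hz):
    """Euclidean size of each step in Bark F1/F2 space.

    trajectory_hz: list of (F1, F2) pairs in Hz, seed first.
    """
    bark = [(hz_to_bark(f1), hz_to_bark(f2)) for f1, f2 in trajectory_hz]
    return [math.dist(a, b) for a, b in zip(bark, bark[1:])]
```

Averaging the first element of `step_magnitudes` across trajectories, then the second, and so on, would give a per-step curve of the kind the figure describes.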
Discussion
The main conclusion of the above experiment is that subjects cannot accurately imitate their own vowels. The imitations deviate from the targets in a systematic and locally consistent manner, and these patterns of deviations differ across the subjects. It is also significant that the imitations of the synthetic vowels (Figure 1b) are adapted to the subjects' individual vowel spaces, suggesting that the subjects are not trying to reproduce the absolute formant frequencies of the target, but are rather focusing on higher-order phonetic and/or auditory qualities.
Nonetheless, there are two possible confounds that need to be addressed. The first is whether the deviations are methodological artifacts. One possibility is that the deviations are introduced into the waveform during recording or playback. Alternatively, during imitation a speaker's self-perception of his production is distorted, perhaps by bone conduction, so that he is not really producing what he thinks he is producing. However, these effects would cause very similar deviations within and across subjects, which was clearly not the case. Additionally, deviations cannot be ascribed to a lack of skill on the part of the subjects; Repp & Williams (1987) reported similar levels of inaccuracy even though their subjects had extensive phonetic training, and we, too, obtained similar results in pilot studies with experienced subjects.
The second issue is whether the results observed here are due to vowel context effects. It is known that vowel perception is proactively and retroactively influenced by surrounding vowels (Repp, Healy & Crowder, 1979; also see Traunmüller & Lacerda, 1987). The effects are usually contrastive, and ambiguous vowels are more susceptible to the effect than are strongly prototypical vowels (Thompson & Hollien, 1970). Thus when subjects produce successive imitations, the preceding sequence of targets and imitations will have an influence on how the current target is perceived. However, in the studies cited above the vowel context was imposed on the subject, whereas in the current experiment the preceding sequence of imitations is entirely self-produced, and therefore the context is "set up" by the subject himself. Hence, it seems reasonable to assume that context effects (if any) would be consistent with underlying response preferences.
While methodological artifacts can be ruled out as a confound, the precise origin of the deviations is still unclear. The vowel space did not seem to be governed by fixed attractors and the prototypical vowels of the subjects did not seem to influence the deviations. One clue to the structure of the space is given by Figures 4 and 5, which show the differences in imitation behavior between different sets and different steps of a trajectory. These differences suggest that the imitation is conditioned by the context; in different contexts, the subject emphasizes (and perhaps hears) slightly different qualities of the target sound. When a sound is produced in one context and perceived in another, the context-mismatch would show up as a deviation. We shall refer to this effect of context on perception and production as the "perceptuomotor stance" (see Note 3). Thus, the variety in Figure 4 may have been caused by subjects performing different sets with slightly different perceptuomotor stances, and the large size of step 1 in Figure 5 may be due to a difference between the subjects' stances during the two sessions of the experiment.
One problem with the idea of a perceptuomotor stance is the pervasive noise in the imitations. Each trajectory is produced within the same set, so all imitations in that trajectory should have been perceived and produced with the same stance; even so, the trajectories are usually very irregular. The use of the three sets was originally intended as a measure of the noisiness, but it confounds two possible sources of systematic deviation - the change in stance between the sessions, and the change between different sets. Further compounding the problem, each target in a trajectory (other than the seed) was imitated only once. Therefore, it is difficult to determine how much of the deviation is due to a systematic component and how much is due to noise in perception and production. The next experiment tries to separate these two components in a more rigorous manner and looks at possible theoretical explanations of the systematic components.
Experiment 2: Multiple imitation of isolated vowels
The serial imitation experiment showed that subjects cannot imitate themselves accurately. The deviations from the targets are systematic over the vowel space and indicate a preference for certain areas and avoidance of others. These preferences do not depend directly on the subject's prototypical vowels and in fact seem to change over sessions and (to a lesser degree) over sets within a session.
The primary goal of the current experiment is to tease out the systematic component of the deviation from the noise in perception and production. The general idea is as follows: In the first session, the imitations of the synthetic sounds are recorded (this step is the same as in the previous experiment). In the second session, only the productions recorded in the first session are presented as targets, and the subject imitates each target multiple times in randomized order. Thus, any observed deviations in the imitations would be due only to the (hypothesized) change in perceptuomotor stance between the two sessions. Since each target is imitated several times, it is then possible to calculate the systematic and noise components.
A secondary goal is to test the null hypothesis that inaccuracies in imitation are caused by random articulatory or perceptual fluctuations and do not reflect any deeper phonetic organization. To evaluate this claim, we compare the speakers' patterns of imitation inaccuracies with those from an articulatory model (Rubin et al., 1981) and a perceptual model based on formant difference limens.
Method
Subjects
The subjects (CD, DR, and FC) were three male native speakers of American English who volunteered to participate in the experiment. They had different dialect backgrounds (CD grew up in New York City, DR in North Carolina, and FC in New Hampshire). The subjects did not have any phonetic training, were not fluent in a second language, and did not have any hearing problems. CD had participated in the serial imitation experiment.
Stimuli & Procedure
The basic design of the experiment was similar to Experiment 1. On Day 1, subjects were asked to imitate a set of 100 synthetic vowel-like stimuli, and to read /hVd/ words in citation and sentence contexts. For each subject, 45 of the 100 self-produced sounds were chosen as the targets. On Day 2, the 45 targets were presented for imitation 10 times each. The interval between "Day 1" and "Day 2" was 20 months (CD), 26 months (FC) and 2 days (DR). The parameters for the synthetic stimuli, the protocol for their imitation, and the elicitation of the prototypical vowels are exactly the same as in Experiment 1. CD's prototypical vowels and imitations of the 100 synthetic sounds were already available from Experiment 1 and were reused.
Selection of the self-produced targets: As in Experiment 1, the vowel space of each subject was divided into a 5x3 grid (Figure 1c). For each subject, three productions were selected from each cell, for a total of 45 targets. If a cell did not have at least three productions, then the closest productions from adjacent cells were assigned to it.
Imitations of the self-produced targets: The presentation list for the imitations of the self-produced targets contained 10 instances of each of the 45 unique targets. The sequence of 450 instances was randomized and divided into 9 blocks of 50 instances each. The protocol for each imitation trial was the same as for the synthetic stimuli, and the same randomized block and presentation order was used for all subjects. Prior to the imitation session, there was a training session with 15 randomly chosen self-produced targets.
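The construction of the presentation list can be sketched as below. Targets are represented by indices 0 to 44, and a fixed random seed stands in for "the same randomized order for all subjects"; both choices, along with the name `presentation_blocks`, are illustrative assumptions. Any extra constraints on the original randomization (for example, avoiding immediate repeats) are not stated in the text and are not modeled.

```python
import random


def presentation_blocks(n_targets=45, n_repeats=10, block_size=50, seed=0):
    """Build a shuffled trial list and split it into equal blocks.

    Each trial is a target index; the fixed seed makes the order reproducible,
    so every subject can receive the same sequence.
    """
    trials = [t for t in range(n_targets) for _ in range(n_repeats)]
    rng = random.Random(seed)
    rng.shuffle(trials)
    return [trials[i:i + block_size] for i in range(0, len(trials), block_size)]


blocks = presentation_blocks()  # 9 blocks of 50 trials each
```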
Formant Analysis: All productions were analyzed using the methods described in Experiment 1.
Models
The results from Experiment 1 suggest that we can expect systematic deviations and noise from the multiple imitation. To draw interesting conclusions about the underlying mechanisms, it is necessary to formulate plausible null hypotheses. We therefore simulated two models: one assumed random noise in the articulatory system, and the other assumed random noise in the perceptual system.
Articulatory Model
The effect of random articulatory perturbations was simulated using the Haskins articulatory synthesizer (ASY; Rubin et al., 1981). In brief, the vocal tract configurations for six key vowels /i ɛ æ a ɤ u/ and an additional /a/-like sound were created in ASY (using the data from Rubin & Goldstein, 1998). Next, the 10 ASY parameters that make up a vocal tract configuration were linearly interpolated between the key vowels (the key parameters were the location of the tongue body center and the amount of lip rounding); this resulted in 196 vocalic sounds. A vowel grid was constructed with these sounds using the same method as with the speakers' productions. Finally, six vowels were chosen from each of the 15 cells to serve as "targets" (Figure 6a). Each target was "imitated" (perturbed) 10 times by independently adding zero-mean Gaussian noise to the x and y coordinates of the target's tongue body center (Figure 6c). The sd of the noise was 1 mm for both coordinates (see Table IV of Beckman et al., 1995).
Perceptual Model
As noted earlier, our experiment focused on subcategorical changes in the vowel percept. However, most models of vowel perception (e.g. Syrdal & Gopal, 1986; Nearey, 1989) emphasize categorization behavior without a systematic account of gradations in vowel quality. We therefore formulated a very simple model of perception based on discrimination limens (DL), as follows: Each vowel percept is treated as a point in Bark formant space (Kewley-Port & Atal, 1989; Syrdal & Gopal, 1986). There is intrinsic noise in the perceptual process which independently perturbs the F1 and F2 values; this noise is Gaussian and its sd is given by the formant DL (this interpretation of DL as a measure of intrinsic variability is taken from signal-detection approaches; Macmillan, Goldberg & Braida, 1988; van Hessen & Schouten, 1998). The vowel formant DL, and hence the sd of perceptual noise, is assumed to be 0.28 barks for both F1 and F2 (Kewley-Port and Zheng, 1998).
To allow the articulatory (ASY) and perceptual models to be meaningfully compared, the 90 ASY targets were also used for the perceptual perturbation. Each ASY target was converted to Bark space and perturbed 10 times. In each perturbation, zero-mean Gaussian noise with 0.28-bark sd was independently added to F1 and F2.
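A minimal Python sketch of the perceptual perturbation follows. The Hertz-to-Bark conversion is the Traunmüller (1990) formula given in Note 2, and the 0.28-bark sd and the 10 perturbations per target come from the text. The function names and the example formant values in the usage note are assumptions made for illustration.

```python
import random

DL_BARK = 0.28  # formant discrimination limen, used as the sd of perceptual noise


def hz_to_bark(f_hz):
    """Hertz-to-Bark conversion (Traunmuller, 1990), as given in Note 2."""
    return 26.81 / (1.0 + 1960.0 / f_hz) - 0.53


def perceptual_imitations(f1_hz, f2_hz, n=10, sd_bark=DL_BARK, rng=None):
    """Return n noisy percepts of one target as (F1, F2) pairs in barks.

    The target is converted to Bark space, then zero-mean Gaussian noise
    is added independently to F1 and F2.
    """
    rng = rng or random.Random()
    z1, z2 = hz_to_bark(f1_hz), hz_to_bark(f2_hz)
    return [
        (z1 + rng.gauss(0.0, sd_bark), z2 + rng.gauss(0.0, sd_bark))
        for _ in range(n)
    ]
```

Calling `perceptual_imitations(500.0, 1500.0, rng=random.Random(1))` gives ten simulated percepts of a hypothetical target with F1 = 500 Hz and F2 = 1500 Hz. Because the noise is added after the Bark conversion, its size is the same everywhere in the vowel space.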
Results
The following presentation of the results is limited to F1 and F2 patterns since only the F1 and F2 locations of the targets were controlled. When examining the results, it is important to note that we do not know whether the subjects perceived differences between the targets and their imitations as they were producing them.
Figure 7 summarizes the imitation behavior of subjects CD and FC. Each arrow will henceforth be referred to as a "bias vector". The principal component ellipses show the variation around the corresponding means, and are shown separately in order to make the overall bias pattern more salient (all formants were converted to barks before the bias vectors and ellipses were calculated; the Hertz scale of the figure is for presentation only). The main point to note is that both subjects (and also DR, not shown) exhibited distinctive patterns of bias over the entire vowel space. The pattern of bias is different for each subject, and the bias vectors do not seem to be influenced by the nearest prototypes (this is consistent with the results from Experiment 1). The biases do not seem to be a simple centralization, since the bias vectors seem to depart from the center as frequently as they enter it. Some patterns are interesting by their absence: low-back vowels are lowered or raised but rarely move directly to the center, and in no case does a mid-back vowel move towards the high-back region.
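The two summaries plotted for each target, the bias vector and the 1-sd principal component ellipse, can be sketched as follows. The sketch assumes that the bias vector runs from the target to the mean of its imitations, and that the ellipse axes are the square roots of the eigenvalues of the sample covariance matrix (n - 1 denominator) of the imitations in Bark F1/F2 space. These computational details are assumptions; the text specifies only that both quantities were calculated in barks.

```python
import math


def bias_vector(target, imitations):
    """Vector from the target to the mean of its imitations (barks)."""
    n = len(imitations)
    mean_f1 = sum(p[0] for p in imitations) / n
    mean_f2 = sum(p[1] for p in imitations) / n
    return (mean_f1 - target[0], mean_f2 - target[1])


def pc_ellipse(imitations):
    """1-sd principal component ellipse of the imitations.

    Returns (semi_major, semi_minor, angle_rad), where the angle is that
    of the major axis measured from the F1 axis. Uses the closed-form
    eigen-decomposition of the 2x2 sample covariance matrix.
    """
    n = len(imitations)
    m1 = sum(p[0] for p in imitations) / n
    m2 = sum(p[1] for p in imitations) / n
    s11 = sum((p[0] - m1) ** 2 for p in imitations) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in imitations) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in imitations) / (n - 1)
    half_trace = (s11 + s22) / 2.0
    root = math.hypot((s11 - s22) / 2.0, s12)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)  # guard against rounding below zero
    angle = 0.5 * math.atan2(2.0 * s12, s11 - s22)
    return math.sqrt(lam_major), math.sqrt(lam_minor), angle
```

The magnitude of a bias vector, as used in the comparison of magnitude distributions, is then `math.hypot(*bias_vector(target, imitations))`.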
Figure 8 shows the "imitation" plots of the articulatory and perceptual models overlaid with the 1-sd principal component ellipses (in order to make the plots less crowded, ellipses are shown for only 76 of the 90 targets). The key point to observe is that the articulatory model's bias patterns do not match the subjects'. The model's bias vectors are smaller than the subjects' and are much less consistent, i.e. adjacent targets do not usually move in the same direction (in fact, the ASY model's bias vectors seem as irregular as the perceptual model's). In addition, the ASY ellipses have much greater variation along F2 than either the perceptual model or the subjects'. The dramatic difference between Figures 7 and 8 suggests that the subjects' biases are not caused by either articulatory or perceptual noise alone. This qualitative conclusion is supported by a statistical analysis of the bias vectors (Table 1). There is an additional remarkable fact: while each subject has a different pattern of bias vector directions, the distribution of bias vector magnitudes (measured in bark F1/F2 space) is very similar for all the subjects (p > .01, Kolmogorov-Smirnov test).
Table 2 shows the variability of the imitations. Note that the subjects' F1 and F2 sds are in the range 0.23 - 0.31 barks, and that this range is much more similar to the perceptual model than the articulatory model. Furthermore, F1 and F2 are only weakly correlated. These results suggest the variability is driven partly by perceptual noise and this conclusion is reinforced by the fact that the variability is independent of the directions and magnitudes of the bias vectors.
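The variability statistics can be computed along the following lines. Per the note to Table 2, each statistic is calculated over the 10 imitations of a target and then averaged across targets. The use of the sample standard deviation, the Pearson correlation, and the names below are assumptions made for this sketch; the significance test on the correlations is not reproduced.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def imitation_variability(imitations_by_target):
    """Average per-target F1 sd, F2 sd, and F1-F2 correlation.

    imitations_by_target is a list with one entry per target; each entry
    is a list of (F1, F2) imitations in barks.
    """
    f1_sds, f2_sds, corrs = [], [], []
    for imitations in imitations_by_target:
        f1 = [p[0] for p in imitations]
        f2 = [p[1] for p in imitations]
        f1_sds.append(statistics.stdev(f1))
        f2_sds.append(statistics.stdev(f2))
        corrs.append(pearson_r(f1, f2))
    return {
        "f1_sd": statistics.fmean(f1_sds),
        "f2_sd": statistics.fmean(f2_sds),
        "f1_f2_r": statistics.fmean(corrs),
    }
```

Running this on a subject's 45 sets of imitations and on each model's 90 sets would give values directly comparable to the 0.23 - 0.31 bark range reported for the subjects.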
Discussion
The results described above confirm that the deviations observed in Experiment 1 are indeed systematic, and that there is variability in the imitation process. Part of the variability may be attributed to the context effects (the stimuli are presented in randomized order, and each presentation of a target is likely to have a different sequence of preceding vowel sounds). However, the similarity of the subjects' F1 and F2 standard deviations to those of the perceptual model suggests that a large part of the variance stems from noise in the imitation process, and that this noise fluctuates across the vowel space and across different sessions.
This view of noisy imitation fits well with current theories. Noise is endemic in vowel production studies (e.g. Beckman et al., 1995) and there is substantial evidence that the same stimulus can generate different percepts. This perceptual variability is clear in vowel experiments that use ratings or absolute identification rather than categorization (e.g. Sawusch & Nusbaum, 1979) and affects even phonetically trained listeners (Laver, 1965). Moreover, the assumption of normally distributed percepts is central to several theories of vowel perception (Macmillan et al., 1988; Chistovich et al., 1966; Maddox et al., 2001; Uchanski & Braida, 1998).
There is one curious result, however. The formant variability of the articulatory model is approximately at the same level as that of the perceptual model, even though the parameters of the two models were independently motivated - the 1 mm perturbation radius of the articulatory model was based on the variability of tongue movement, while the 0.28 bark sd of the perceptual model was based on vowel discrimination. These results suggest some kind of attunement between the articulatory and perceptual systems, which may be due to co-development (speakers tacitly learn the precision with which vowels may be produced or perceived) or co-evolution (the intrinsic noise levels of the production and perception systems have become matched/balanced over time).
The results related to the imitation bias, on the other hand, are novel, robust, and were previously unanticipated by existing theories of speech perception and production. As discussed earlier, the biases cannot be ascribed to artifacts in recording or playback, bone conduction effects, or to subjects' lack of attention or skill. Below, we examine whether current theories of vowel production and perception can account for the imitation bias.
Production-based explanations. The ASY model of noise only examined the consequences of random noise around each articulatory configuration and showed that such noise does not explain the directionality of the subjects' imitations. However, ASY models only the gross anatomy and kinematics of the vocal tract and does not (usually) take into account muscles that actually move the articulators or the functional synergies that exist between them. Thus, it is still possible that articulatory noise, shaped by physiological or functional constraints that are omitted from the ASY model, can lead to the kinds of biases seen in the current data.
The issue of physiological noise was addressed by Mooshammer, Perrier and Payan (1999), using a 2D biomechanical tongue model that included the major tongue muscles and elastic properties of the tissues. In the context of the equilibrium-point motor hypothesis, they added independent signal-related Gaussian noise to the muscle commands and found that the noise does not account for the token-to-token articulatory variability observed with real speakers. Moreover, the acoustic variability (i.e. F1/F2 dispersion ellipses) of the Mooshammer et al. model is qualitatively similar to that seen with ASY (Figure 8a). Thus, physiological noise fails to account for the directionality of the imitations.
Alternatively, the noise may not be at the level of individual articulators but at the level of functional relations between them. For example, the "functional variables" for vowels could be the location and degree of the vocal tract constriction (cf. Stevens, 1989). However, if increases and decreases in constriction degree and location are equally likely, we would expect imitations to be distributed around the target (Gay, Boé, and Perrier, 1992), which is clearly not the case with the current data. One possibility is that increases and decreases are not equally likely; for example, one speaker might prefer alveolar constrictions and a less constricted vocal tract, while another may prefer velar constrictions. There is no evidence as yet of such preferences; however, the idea is quite similar to the notion of "perceptuomotor stance" that we discussed earlier.
Finally, observe that quantal theory (QT; Stevens, 1989) is also an unlikely explanation for the observed biases. QT tries to explain how point vowels such as /i/ and /u/ are stable and therefore prevalent in vowel systems, but the current data show subjects moving away from the /i/ and /u/ regions. In addition, QT has been applied primarily to peripheral point vowels, whereas the data show biases in non-peripheral regions also.
Perception-based explanations. As discussed earlier, imitation biases have traditionally been explained as assimilation caused by a categorical phonemic code. Similarly, the perceptual magnet effect (Iverson & Kuhl, 1995) predicts perceptual assimilation towards the phonemic prototypes. Alternatively, if perception and production are seen as aspects of a single linguistic system with a common control space (cf. motor theory, Liberman & Mattingly, 1985), then key locations of this space may function as attractors. Yet all these explanations fall short because in the two-dimensional space it is clear that biases are not always influenced by the nearest phoneme (and in any case, the bias patterns are too different across the subjects).
One possibility is that the subjects were moving away from the centers of the prototypes and toward the boundaries. Macmillan et al. (1988) found that vowels were labeled most reliably near category boundaries, and proposed that they are encoded in terms of distance from "perceptual anchors" located at the category boundaries. Moreover, Repp and Crowder (1990) observed that "vowels held in memory were assimilated toward some standard(s) located between prototypes" (p. 2088). However, this perceptual anchor/memory decay theory is untenable because earlier experiments indicate that imitation biases are unaffected by the response latency (Repp & Williams, 1985, 1987). Hence, the biases seen in the current data cannot be caused by representations changing across time and in particular, by any kind of memory decay process.
There is one other possibility. In a recent study, Dissard & Darwin (2000) played synthetic one- and two-formant sounds to subjects, and asked them to match the target by adjusting the formant frequency of a comparison sound (this is akin to imitation with a synthetic "vocal tract"). Interestingly, the matched sound deviated systematically from the target when the two sounds had different F0s. Dissard & Darwin suggested that this occurs because listeners' estimate of the formant location is biased towards the closest F0 harmonic (cf. Vallabha & Tuller, 2002). However, this explanation makes two predictions that are contradicted by the current experiments: (1) the serial imitation trajectories should gravitate towards the harmonics of F0, and (2) because there are fewer harmonics in the range of F1 frequencies, the biases should be more prominent across F1 than F2.
In summary, none of the current theories offer a plausible explanation of the directions or the magnitudes of the bias, which is not too surprising since these theories attempt to explain linguistic phenomena and the biases show no evident linguistic influence.
Verification of experimental results
The above experiment has some limitations. In particular, the three subjects did not speak the same dialect of American English, making it impossible to evaluate whether the intersubject differences in bias patterns are due to individual differences or dialect differences. Another concern is that the subjects imitated isolated steady-state vowels. Such sounds are uncommon in English, and it is possible that their "strangeness" evoked the imitation biases.
We tested these possibilities in a new experiment (see Experiment 4 in Vallabha, 2003). We chose four naive subjects (two male, two female) who were matched to have the same dialect background. They underwent the same protocol as in Experiment 2, except that they had an additional condition where the vowels were embedded in a /dV/ context (i.e. in Session 1, they imitated synthetic /dV/ sounds where the F1 and F2 of the vowel nucleus were manipulated; these imitations were used as the targets in Session 2). The results showed that (a) the magnitudes of the imitation biases were very similar for the /dV/ and /V/ targets, (b) the pattern of bias directions is not the same across subjects with the same dialect, and (c) the imitation bias occurred even if the target productions were prototypical vowels. These data confirm that the imitation bias is definitely not due to the linguistic background of the subject or the temporal context of the target sound.
General Discussion
The conclusion from Experiment 2 is that the bias is not primarily linguistic. This leaves us with the perplexing problem of a robust phenomenon (imitation bias) that is not due to recording artifacts, subjects' skill, "low-level" auditory or production noise, or the language background. Moreover, the phenomenon does not fit well with current theories of vowel perception.
One clue to the mystery is the local consistency of the biases (i.e. neighboring targets tend to have approximately similar bias directions and magnitudes) and their overall patterns. For example, CD and DR showed a preference for lowering and for midfront and midback vowels, and FC showed a preference for lowering, raising, and depalatalization (Figure 7). In addition to their general preference for lowering, CD and DR also had a slight tendency to lengthen their vowels.
It is instructive to compare these patterns with Donegan’s (2002). She notes that in normally developing children and in synchronic and diachronic change, lax, lengthened, or uncolored vowels are typically lowered (here, coloring refers to a frontal or a rounded quality). In addition, vowels with weak color are prone to loss of color. For example, children substitute [ɪ]→[ɛ], [ʊ]→[o] and [æ]→ [ɑ] (Reynolds, 2002, also notes [ɛ]→[a] substitutions). Moreover, in certain languages lengthened [ɪ,ʊ] alternate with [ɛː,ɔː], and historical [æ] →[ɑ] and [ɔ,ɛ ] →[ə] changes have been observed. Another pattern is the “implicational” or linked change. For example, “other things being equal, a lower vowel is more susceptible to depalatalization than a higher vowel of the same series, so if [e] depalatalizes, [æ] must also depalatalize, and if [i] depalatalizes,[e] and [æ] depalatalize as well.” (Donegan, 2002, p. 17). This pattern is particularly notable with regard to FC’s biases with high targets (Figure 7).
The one explanation consistent with the experimental results is that the imitation biases (and possibly the vowel substitutions in children) stem from the "perceptuomotor stance" of the listener. We hypothesize that each subject approaches the task with assumptions about what counts as a "good" imitation and tacitly, about what kinds of information must be attended or ignored (cf. Francis & Nusbaum, 2002). The assumptions and tacit knowledge shape the perceptuomotor stance of the subject and consequently influence the perception and production of the sound. This stance, akin to an overall "cognitive policy", shifts across different sessions and possibly within the same session. The pattern of bias directions can vary across subjects and sessions (reflecting the pliability of the policy) but the distribution of magnitude and variability does not (reflecting the constancy of the mechanisms that implement the policy).
The idea of a perceptuomotor stance also suggests that vowel perception is not categorical or subcategorical in the ways traditionally thought. Rather, listeners appear to pay tacit attention to certain complex qualities of the vowel (e.g. labiality, height, sonority, frontness), and the attentional weighting of the different qualities changes across time. In theoretical phonology, this idea is most closely related to government phonology (Ball, 2002; Harris, 1994). In this view, a vowel is treated as a fusion of several "elements" (the extreme sounds [i], [a] and [u]) with one element being dominant. For example, if [i] and [a] are the two elements being fused, then dominance of [a] produces [æ], and dominance of [i] produces [e]. We link this idea to the perceptuomotor stance by supposing that listeners have preferences for certain kinds of dominance, so a listener with an [a]-preference may hear a sound as [æ] whereas another may hear it as [e]. Presumably, normally developing children learn a set of attentional preferences that allow them to perceive and produce accurately the vowels of their native language.
Relevance to vowel disorder etiology
The hypothesis that children learn attentional preferences suggests a possible behavioral cause of vowel disorders. Under this view, disordered production occurs because children pay attention to the wrong dimensions or pay insufficient attention to the correct ones (e.g. focusing on height when it is necessary to focus on palatalization). Merzenich et al. (1996) proposed a similar explanation for language-learning impaired (LLI) children: "the deficits underlying the phonetic reception limitations of a LLI child might arise in early life as a consequence of abnormal perceptual learning that then contributes to abnormal language learning" (Merzenich et al., 1996, p. 77). Such ingrained attentional "bad habits" are more likely with vowels than with consonants since it is more difficult for a listener to resolve vowel ambiguities by referring to visible articulator positions. In addition, there is much more latitude in the perception and production of vowels (see Laver, 1965, for an illustration of the variability of vowel perception in even trained phoneticians).
The above hypothesis may also be related to the well-known problem of acquiring non-native speech contrasts. For example, Japanese adults have difficulty distinguishing between English /r/ and /l/ sounds. This is thought to occur because the Japanese adults do not pay attention to the F3 cue that reliably signals the /r/-/l/ contrast for native speakers of English (Iverson & Kuhl, 1996). Interestingly, Japanese listeners are quite capable of using F3 to distinguish between other English contrasts, such as /d/-/g/ (Mann, 1986). This suggests that attentiveness to F3 is not "fixed" by early language exposure but varies with the sound context, an idea akin to our proposal of a time-varying perceptuomotor stance (however, note that the dimensions in our proposal are not auditory dimensions like F1 and F2, but complex qualities of the entire sound).
Relevance to Therapy
If in fact vowel disorders are caused by attention to the wrong dimensions of sounds, then exposure to many exemplars of the sounds, even with consistent feedback, may not be effective since the listener does not know what aspects of the sound to attend to (see Gibbon & Beck, 2002, for a review of therapies). One alternative is to use computer-based methods that analyze productions in real time and give visual feedback about their approximate perceptual quality. Zimmer, Dai, and Zahorian (1998) and Hatzis and Green (2001) describe two such promising methods. However, it should be noted that in these methods the perceptual-quality spaces were defined in an ad-hoc manner, and were not experimentally compared to normal human data. This limitation could potentially be rectified by first asking normal listeners to perform multi-dimensional scaling with an ensemble of vowel sounds, and using the scaled distances as additional constraints in constructing the perceptual-quality spaces.
Another way to sensitize a listener to particular vowel dimensions is the method of perceptual fading. In this scheme, the listener is presented with extreme exemplars of sounds from the two categories, and asked to classify them. Once he or she is able to accomplish this reliably, the exemplars are made less extreme and the process is repeated. The extreme exemplars can either be synthesized, or they can be acoustically “extrapolated” from recorded normal exemplars (see McCandliss et al., 2002, for an application of this procedure to /r/-/l/ training; also see Merzenich et al., 1996). A third possible method is to exploit contrastive vowel context effects. For example, to sensitize listeners to an [ɛ] - [ə] contrast, the listener could be presented with a sequence of front vowels, followed by [ə]. This presentation sequence may sharpen the perceptual prominence of the depalatalization or backing. (We caution that this particular application of context effects is speculative. While context effects have been robustly shown with one-dimensional vowel continua, their effect with two-dimensional arrangements of stimuli is unclear.)
The method of self-imitation used in Experiments 1 and 2 also has implications for therapy. It may be possible to use imitation to map the pattern of biases for a child with disordered vowels (as in Figure 7); however, it is currently an open research issue how to distinguish between the imitation biases exhibited with a normal vowel phonology and the biases exhibited by disordered vowel phonologies. Alternatively, the imitation method may simply be used to elicit a wide variety of productions (as in Session 1 of the experiments), thereby "exercising" the perceptual and articulatory capacities.
References
Ball, M. J. (2002). Clinical phonology of vowel disorders. In M. J. Ball & F. E. Gibbon (Eds). Vowel Disorders (pp. 187-216). Boston, MA: Butterworth-Heinemann.
Bartlett, F. C. (1932/1995). Remembering. Cambridge: Cambridge University Press.
Beckman, M. E., Jung, T., Lee, S., de Jong, K., Krishnamurthy, A. K., Ahalt, S. C., Cohen, K. B., & Collins, M. J. (1995). Variability in the production of quantal vowels revisited. Journal of the Acoustical Society of America, 97(1), 471-490.
Chistovich, L., Fant, G., de Serpa-Leitao, A., & Tjernlund, P. (1966). Mimicking of synthetic vowels. Quarterly Progress and Status Report, Speech Transmission Lab, Royal Institute of Technology, Stockholm(2), 1-18.
Dissard, P., & Darwin, C. J. (2000). Extracting spectral envelopes: Formant frequency matching between sounds on different and modulated fundamental frequencies. Journal of the Acoustical Society of America, 107(2), 960-969.
Donegan, P. D. (2002). Normal vowel development. In M. J. Ball & F. E. Gibbon (Eds). Vowel Disorders (pp. 1-36). Boston, MA: Butterworth-Heinemann.
Eimas, P. D. (1963). The relation between identification and discrimination along speech and non-speech continua. Language and Speech, 6, 206-217.
Francis, A. L., & Nusbaum, H. C. (2002). Selective attention and the acquisition of new phonetic categories. Journal of Experimental Psychology: Human Perception and Performance, 28(2), 349-366.
Fry, D. B., Abramson, A. S., Eimas, P. D., & Liberman, A. M. (1962). The identification and discrimination of synthetic vowels. Language and Speech, 5, 171-189.
Gay, T., Boe, L. J., & Perrier, P. (1992). Acoustic and perceptual effects of changes in vocal tract constrictions for vowels. Journal of the Acoustical Society of America, 92(3), 1301-1309.
Gibbon, F. E., & Beck, J. M. (2002). Therapy for abnormal vowels in children with phonological impairment. In M. J. Ball & F. E. Gibbon (Eds). Vowel Disorders (pp. 217-248). Boston, MA: Butterworth-Heinemann.
Harris, J. (1994). English Sound Structure. Cambridge, MA: Blackwell.
Hatzis, A., & Green, P. D. (2001). A two-dimensional kinematic mapping between speech acoustics and vocal tract configurations. Paper presented at the Workshop on Innovation in Speech Processing (WISP'01), April, 2001.
Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97(5), 3099-3111.
Iverson, P., & Kuhl, P. K. (1995). Mapping the perceptual magnet effect for speech using signal-detection-theory and multidimensional-scaling. Journal of the Acoustical Society of America, 97(1), 553-562.
Iverson, P., & Kuhl, P. K. (1996). Influences of phonetic identification and category goodness on American listeners' perception of /r/ and /l/. Journal of the Acoustical Society of America, 99(2), 1130-1140.
Iverson, P., & Kuhl, P. K. (2000). Perceptual magnet and phoneme boundary effects in speech perception: Do they arise from a common mechanism? Perception & Psychophysics, 62(4), 874-886.
Kent, R. D. (1973). The imitation of synthesized vowels and some implications for speech memory. Phonetica, 28, 1-25.
Kent, R. D., & Forner, L. L. (1979). Developmental study of vowel formant frequencies in an imitation task. Journal of the Acoustical Society of America, 65(1), 208-217.
Kewley-Port, D., & Atal, B. (1989). Perceptual differences between vowels located in a limited phonetic space. Journal of the Acoustical Society of America, 85(4), 1726-1740.
Kewley-Port, D., & Zheng, Y. (1998). Vowel formant discrimination: Towards more ordinary listening conditions. Journal of the Acoustical Society of America, 106(5), 2945-2958.
Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 67, 971-995.
Kuhl, P. K., & Meltzoff, A. N. (1996). Infant vocalizations in response to speech: Vocal imitation and formant change. Journal of the Acoustical Society of America, 100(4), 2425-2438.
Laver, J. D. M. H. (1965). Variability in vowel perception. Language and Speech, 8(2), 95-121.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1-36.
Macmillan, N. A., Goldberg, R. F., & Braida, L. D. (1988). Resolution for speech sounds: Basic sensitivity and context memory on vowel and consonant continua. Journal of the Acoustical Society of America, 84(4), 1262-1280.
Maddox, W. T., Molis, M. R., & Diehl, R. L. (2001). Generalizing a neuropsychological model of visual categorization to auditory categorization of vowels. Perception & Psychophysics, 64, 584-597.
Mann, V. A. (1986). Distinguishing universal and language-dependent levels of speech perception: Evidence from Japanese listeners' perception of English /l/ and /r/. Cognition, 24, 169-196.
McCandliss, B. D., Fiez, J. A., Protopapas, A., Conway, M., & McClelland, J. L. (2002). Success and failure in teaching the [r]-[l] contrast to Japanese adults: Tests of a Hebbian model of plasticity and stabilization in spoken language perception. Cognitive, Affective, & Behavioral Neuroscience, 2(2), 89-108.
Merzenich, M. M., Jenkins, W. M., Johnston, P., Schreiner, C., Miller, S. L., & Tallal, P. (1996). Temporal processing deficits of language-learning impaired children ameliorated by training. Science, 271, 77-80.
Mooshammer, C. R., Perrier, P., & Payan, Y. (1999). Simulation of token-to-token variability in vowel production. Journal of the Acoustical Society of America, 105(2), 1356.
Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85(5), 2088-2113.
Pollock, K. (2002). Identification of vowel errors: Methodological issues and preliminary data from the Memphis Vowel Project. In M. J. Ball & F. E. Gibbon (Eds). Vowel Disorders (pp. 83-116). Boston, MA: Butterworth-Heinemann.
Porter, R. J., & Lubker, J. F. (1980). Rapid reproduction of vowel-vowel sequences - Evidence for a fast and direct acoustic-motoric linkage in speech. Journal of Speech and Hearing Research, 23(3), 593-602.
Repp, B. H., & Crowder, R. G. (1990). Stimulus order effects in vowel discrimination. Journal of the Acoustical Society of America, 88(5), 2080-2090.
Repp, B. H., & Williams, D. R. (1985). Categorical trends in vowel imitation: Preliminary observations from a replication experiment. Speech Communication, 4, 105-120.
Repp, B. H., & Williams, D. R. (1987). Categorical tendencies in imitating self-produced isolated vowels. Speech Communication, 6, 1-14.
Repp, B. H., Healy, A. F., & Crowder, R. G. (1979). Categories and context in the perception of isolated steady-state vowels. Journal of Experimental Psychology: Human Perception and Performance, 5(1), 129-145.
Reynolds, J. (2002). Recurring patterns and idiosyncratic systems in some English children with vowel disorders. In M. J. Ball & F. E. Gibbon (Eds). Vowel Disorders (pp. 115-144). Boston, MA: Butterworth-Heinemann.
Rosner, B. S., & Pickering, J. B. (1994). Vowel perception and production. Oxford, UK: Oxford University Press.
Rubin, P., & Goldstein, L. (1998). Articulatory Synthesis: ASY sample synthesis tables, [Data files]. Retrieved from: http://www.haskins.yale.edu/Haskins/MISC/ASY/DYNAMIC/samples.html.
Rubin, P., Baer, T., & Mermelstein, P. (1981). An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America, 70(2), 321-328.
Sawusch, J. R., & Nusbaum, H. C. (1979). Contextual effects in vowel perception I: Anchor-induced contrast effects. Perception & Psychophysics, 25(4), 292-302.
Shigeno, S. (1991). Assimilation and contrast in the phonetic perception of vowels. Journal of the Acoustical Society of America, 90(1), 103-111.
Stevens, K. N. (1989). On the quantal nature of speech. Journal of Phonetics, 17, 3-45.
Syrdal, A. K., & Gopal, H. S. (1986). A perceptual model of vowel recognition based on auditory representation of American-English vowels. Journal of the Acoustical Society of America, 79, 1086-1100.
Thompson, C. L., & Hollien, H. (1970). Some contextual effects on the perception of synthetic vowels. Language and Speech, 13, 1-13.
Traunmüller, H. (1990). Analytical expressions for the tonotopic sensory scale. Journal of the Acoustical Society of America, 88(1), 97-100.
Traunmüller, H., & Lacerda, F. (1987). Perceptual relativity in identification of two-formant vowels. Speech Communication, 6, 143-157.
Uchanski, R. M., & Braida, L. D. (1998). Effects of token variability on our ability to distinguish between vowels. Perception & Psychophysics, 60(4), 533-543.
Vallabha, G. K. (2003). Perceptuomotor bias in vowel imitation. Unpublished doctoral dissertation, Florida Atlantic University, Boca Raton, FL.
Vallabha, G. K., & Tuller, B. (2002). Systematic errors in the formant analysis of steady-state vowels. Speech Communication, 38(1-2), 141-160.
Wildgen, W. (1990). Basic principles of self-organization in language. In H. Haken & M. Stadler (Eds.), Synergetics of Cognition (pp. 429-452). Berlin: Springer-Verlag.
Wright, R., Frisch, S., & Pisoni, D. B. (1999). Speech perception. In J. Webster (Ed.), Encyclopedia of electrical and electronics engineering (pp. 175-195). New York: John Wiley & Sons.
Zimmer, A. M., Dai, B., & Zahorian, S. A. (1998). Personal Computer Software Vowel Training Aid for the Hearing Impaired. Paper presented at the International Conference on Acoustics, Speech, and Signal Processing.
Author Notes
Gautam K. Vallabha, Center for Complex Systems & Brain Sciences, Florida Atlantic University; Betty Tuller, Center for Complex Systems & Brain Sciences, Florida Atlantic University.
Gautam K. Vallabha is now at the Center for the Neural Basis of Cognition, Carnegie Mellon University. This paper is based on the doctoral dissertation of the first author. The research described herein was supported by NIMH Predoctoral Training Grant MH19116 and NIMH Grant MH-42900.
Correspondence concerning this article should be addressed to Gautam K. Vallabha, 4400 Fifth Avenue, Mellon Institute Room 110, Pittsburgh, PA 15213, U.S.A. Electronic mail may be sent to vallabha@cnbc.cmu.edu.
Notes
1 These experiments are described in more detail in Vallabha (2003; see Chapters 4 and 5).
2 The following equation was used for the transformation from Hertz to Bark units (Traunmüller, 1990): Bark(F) = [26.81 / (1 + (1960 / F))] - 0.53. The Bark transformation approximately equates the perceptual sensitivity to changes in F1 and F2 (see Syrdal & Gopal, 1986, for an application).
3 The more common term for this is psychological "set", defined as "a state of psychological preparedness usually of limited duration for action in response to an anticipated stimulus or situation" (Merriam-Webster Collegiate Dictionary). We avoid this term because of the potential confusion with our existing usage of the word "set".
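The Hertz-to-Bark transformation quoted in note 2 can be sketched in a few lines of Python. This is only an illustration of the formula as written there; the function names `hz_to_bark` and `bark_step_magnitude` are hypothetical, and the use of a Euclidean distance in the Bark-scaled F1-F2 plane is an assumption made for the example, not a description of the analysis code used in the study.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Convert a frequency in Hz to Bark units.

    Uses the formula quoted in note 2 (Traunmüller, 1990):
        Bark(F) = 26.81 / (1 + 1960 / F) - 0.53
    """
    if f_hz <= 0:
        raise ValueError("frequency must be positive")
    return 26.81 / (1.0 + 1960.0 / f_hz) - 0.53


def bark_step_magnitude(f1_a: float, f2_a: float,
                        f1_b: float, f2_b: float) -> float:
    """Euclidean distance between two (F1, F2) points after Bark scaling.

    Hypothetical helper: assumes both formants are given in Hz and that
    distance is measured in the Bark-transformed F1-F2 plane.
    """
    d1 = hz_to_bark(f1_b) - hz_to_bark(f1_a)
    d2 = hz_to_bark(f2_b) - hz_to_bark(f2_a)
    return math.hypot(d1, d2)


if __name__ == "__main__":
    # At F = 1960 Hz the denominator is exactly 2, so the result is
    # 26.81 / 2 - 0.53, i.e. approximately 12.875 Bark.
    print(hz_to_bark(1960.0))
    # Made-up formant values, for illustration only.
    print(bark_step_magnitude(500.0, 1500.0, 550.0, 1400.0))
```

Because the transformation compresses higher frequencies, a given change in Hz corresponds to a smaller Bark change in the F2 range than in the F1 range, which is the sense in which the note says it approximately equates sensitivity to changes in the two formants.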
Table 1: Statistics of the bias vectors

a Calculated over the 45 bias vectors (for the subjects) and the 90 bias vectors (for the models). b Number of significant bias vectors, using Hotelling's T² test.
Table 2: Variability of the imitations

a The statistics (such as F1 sd) were computed for 10 imitations of each target, and averaged across all the targets (90 for the models, 45 for the subjects). b Number of correlations significant at p < .05, using the t-test.

Figure 1. Selection of the self-produced targets for subject CD. (a) The synthetic sounds (empty circles) and the mean locations of American English vowels from Hillenbrand et al. (1995). (b) The imitations of the synthetic sounds (tips of the blue lines), and the 1-sd ellipses for the prototypical vowels of the subject (filled ellipses). (c) The subject's imitations (empty blue circles) and the corresponding 5x3 grid. (d) The 15 imitations chosen to be the targets ("seeds") for the trajectories (red circles).

Figure 2. The raw trajectories for two subjects. The solid circles are the start of the trajectories (i.e. the seeds). The filled ellipses are the 1-sd principal component ellipses of the subject's prototypical vowels (also shown is the 5x3 vowel grid of the subject). The plot is not color coded; the different shades of the lines are used only for contrast.

Figure 3. The smoothed trajectories for the subjects, for the first and last sets. The smoothing is a 3-point average for each formant (see text). The solid circles are the seeds, and the arrowheads are the smoothed successive imitations.

Figure 4. The average F1 and F2 step sizes for each of subject CD's trajectories and sets. The labels on the horizontal axis denote the cell containing the seed of the trajectory (see text).

Figure 5. The Euclidean magnitude of each step, averaged across all the trajectories and sets.

Figure 6. (a) "Vowel space" of the ASY model, showing the locations of the 7 key vowels (large blue circles; E = [e], ae = [æ], Y = [?]), the locations of the 196 interpolated vocal tract configurations (small blue circles), the corresponding vowel grid (solid lines), and the 90 selected targets (red circles). (b) An illustration of the interpolated ASY vocal tract configurations from /i/ to /a/. (c) Examples of the ten perturbations for three sample configurations.

Figure 7. (a) The imitation behavior of subject CD. Right subplot: the base of each arrow is a target and the tip is the mean of the 10 imitations of that target. Shaded regions are the 1-sd principal components for the /hVd/ vowels. Red arrows indicate statistically significant bias vectors (Hotelling's T² test, p < .05). Top left: a sketch of the overall movement tendencies. Bottom left: the 1-sd principal components for each set of imitations; the center of the ellipse is the mean imitation. (b) The imitation behavior of subject FC.

Figure 8. The mean "imitations" and 1-sd ellipses for (a) the articulatory (ASY) model, and (b) the perceptual model. Red arrows indicate statistically significant bias vectors (Hotelling's T² test, p < .05). The ellipses are the 1-sd principal components for each set of "imitations".