Identification of Concurrent Harmonic and Inharmonic Vowels: A Test of the Theory of Harmonic Cancellation and Enhancement
Alain de Cheveigné (CNRS, URA1028), Stephen McAdams, Jean Laroche (ENST/CNRS), Muriel Rosenberg (ENST/CNRS)
Journal of the Acoustical Society of America, 97, 3736-3748 (1995)
Copyright © ASA 1995
Abstract
The improvement of identification accuracy of concurrent vowels with
differences in fundamental frequency (F0) is usually attributed to
mechanisms that exploit harmonic structure. To decide whether identification is
aided primarily by the harmonic structure of the target ("harmonic
enhancement") or that of the ground ("harmonic cancellation"), pairs of
synthetic vowels, each of which was either harmonic or inharmonic, were
presented to listeners for identification. Responses for each vowel were
scored according to the vowel's harmonicity, the harmonicity of the vowel that
accompanied it, and the F0 difference. For a given target, identification was better
by about 3% for a harmonic ground unless the target was also harmonic with the
same F0. This supports the cancellation hypothesis. Identification was worse
for harmonic than for inharmonic targets by 3-8%. This does not support the
enhancement hypothesis. When both vowels were harmonic, identification was
better by about 6% when the F0s differed by 1/2 semitone. However, when at
least one vowel was inharmonic, the F0 difference had no significant effect.
Identification of constituents of pairs was generally not the same when the
target was harmonic and the ground inharmonic or vice-versa. Results are
interpreted in terms of harmonic enhancement and harmonic cancellation, and
alternative explanations such as phase effects are considered.
Introduction
When two voices are present at the same time, differences in fundamental
frequency (F0) can help listeners attend to one or the other voice and
understand what is being said. This has been verified for natural and
synthetic speech (Brokx and Nooteboom 1982) and for pairs of synthetic vowels
(Scheffers 1983; Culling and Darwin 1993a). One interpretation is that
differences in F0 allow the voices to segregate from each other. Various
models and methods have been proposed to explain or reproduce this process (see
de Cheveigné 1993a, for a review). Some make use of the harmonic
structure of a voice to identify its components within the composite spectrum.
The voice is then isolated by enhancing those components relative to the
ground. Others make use of the harmonic structure of the interfering voice,
which is then removed by cancelling its components. Either strategy (or
both) can be used if both voices are harmonic, as long as they have different
F0s. Both strategies fail if the vowels have the same F0, which explains why
performance in double vowel identification experiments is not as good in this
case.
Each strategy has its advantages and disadvantages. Harmonic enhancement
allows harmonic sounds such as voiced speech to emerge from any type of
interference (except harmonic interference with the same F0 as the target).
Harmonic cancellation on the other hand allows any type of target to emerge
from harmonic interference. Enhancement works best when the signal-to-noise
ratio is high, because the F0 of the target is then relatively easy to
estimate. However separation is probably most needed when the signal-to-noise
ratio is low, in which case cancellation should be easier to implement.
Cancellation removes all components that belong to the harmonic series of the
interference, and may thus distort the spectrum of the target. Enhancement
should cause no spectral distortion to the target, as long as it is perfectly
harmonic. Cancellation of perfectly harmonic interference can be obtained using
a filter with a short impulse response, whereas enhancement requires a filter
with a long impulse response to be effective (de Cheveigné 1993a). The
non-stationarity of speech may limit the effectiveness of such filters.
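For intuition, the two strategies can be sketched as comb filters applied to a mixture whose relevant period is known. This is only a schematic rendering of the filtering argument above, in the spirit of the models reviewed by de Cheveigné (1993a); the function names and the averaging depth are our own illustrative choices.

```python
import numpy as np

def cancel_interference(x, period):
    """Harmonic cancellation: y[n] = x[n] - x[n - T] nulls any signal
    that is periodic with period T samples (the interfering voice),
    using an impulse response only T + 1 samples long."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[period:] -= x[:-period]
    return y

def enhance_target(x, period, n_periods=8):
    """Harmonic enhancement: average several delayed copies to boost
    components at multiples of 1/T; selectivity requires a long
    impulse response (n_periods * T samples)."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k in range(n_periods):
        d = k * period
        y[d:] += x[:len(x) - d]
    return y / n_periods
```

The asymmetry discussed above is visible here: cancellation needs only the ground's period and little more than one period of signal, whereas enhancement needs the target's period and many clean periods to become selective, which is what makes it fragile under non-stationarity.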
The aim of this paper is to study the degree to which each strategy is used by
the auditory system in a double vowel identification experiment. An answer to
this question may allow us to better understand auditory processes of sound
organization, and refine our models of harmonic sound separation. We first
review the literature on mixed vowel identification experiments and present the
rationale and predictions for our experiment. We then present our experimental
design and methods, report the results, and analyze them in relation to the
predictions.
A. Double Vowel Identification Experiments
Bregman (1990) has suggested that, in order to understand speech in an
environment with many competing sound sources, the auditory system must first
analyze the acoustic scene and then build up an integral perceptual
representation of the behavior of a voice over time. A mixture of several
voices poses what Cherry (1953) called the "cocktail party problem". Cherry
showed that among the cues useful to the listener trying to track a source is
its spatial position, which creates binaural information that the auditory
system uses to segregate the source. Another important cue for the separation
of natural speech is the fundamental frequency. Brokx and Nooteboom (1982)
found that this cue helped listeners separate competing speech streams and
better reproduce the message carried by one or other stream. Fundamental
frequency differences are especially effective when reinforced by onset
asynchronies (Darwin and Culling 1990) or binaural cues (Summerfield and
Assmann 1991; Zwicker 1984).
Another cue that might be expected to reinforce F0 differences is frequency
modulation, particularly if competing streams are modulated incoherently.
McAdams (1989) and Marin and McAdams (1991) demonstrated that frequency
modulation increased the perceptual prominence of a vowel presented
concurrently with two other vowels at relatively large F0 separations (5
semitones, or 33%). However, they also found that this increase was independent
of whether the vowels were modulated coherently or not. Subsequent studies
confirmed that the effects of frequency modulation incoherence can be accounted
for by the instantaneous differences in F0 that it causes (Demany and Semal
1990; Carlyon 1991; Summerfield 1992; Summerfield and Culling 1992).
These results all suggest a crucial importance of harmonicity, exploited by the
auditory system when there are differences in fundamental frequency
(ΔF0) between constituents of an acoustic mixture. The effects of
ΔF0 have been studied in detail by a number of authors (Assmann and
Summerfield, 1989, 1990; Scheffers, 1983; Summerfield and Assmann, 1991;
Zwicker 1984; Chalikia & Bregman, 1989, 1993; Darwin and Culling, 1990;
Culling and Darwin, 1993a). In these studies, two synthetic vowels were
presented simultaneously at various F0 values and subjects were
requested to identify both vowels from a predetermined set of five to eight
vowels. Identification scores reflecting the ability to identify both vowels
(combinations-correct score) for several of these studies are plotted in Figure
1.
Fig. 1 Dotted lines: combinations-correct identification rates as a function of ΔF0 reported in previous studies. Continuous line: combinations-correct rates obtained in this study for mixtures of harmonic vowels (HH condition).
There are large differences in overall identification rate between studies
that may be attributed to differences in training of subjects, presence or
absence of feedback, size of vowel set, inclusion of pairs of identical vowels,
stimulus duration, level, etc. A common trend is a rapid increase in
identification performance with ΔF0 up to between 1/2 and 2 semitones of
separation (3-12% difference in F0), followed by an asymptote. This effect is
usually explained by assuming that the mechanism that exploits the harmonic
structure of the vowel spectrum is effective when the F0s are different but
fails when they are the same and the harmonic series of both vowels coincide.
However, a question that none of these studies has addressed is whether it is
primarily the harmonicity of the vowel being recognized that aids its
segregation and subsequent identification, or that of the background vowel.
This leaves open many issues involved in the design of voice separation
models. The primary aim of the present study is to directly test the effect of
the harmonicity of both the target vowel and the background vowel on the
target's identification.
One study that approached this question was conducted by Lea (1992; Lea and
Summerfield 1992). He presented listeners with pairs of vowels of which each
could be either voiced or whispered, and requested them to identify both
vowels. He scored results according to the harmonicity of the vowel being
answered (the target) and that of the other vowel (the ground). He found that
targets were better identified when the ground was voiced than when it was
whispered. There was no significant advantage when the target itself was
voiced rather than whispered. However, with a slightly different method, Lea
and Tsuzaki (1993a,b) found that targets were better recognized when they were
voiced.
A difficulty with this experiment is that it requires voiced and whispered
vowels to be equivalent in both "phonetic quality" and "masking power" (except
insofar as these depend on harmonicity). This is a difficult requirement
because it is not evident how one should go about matching the continuous
spectrum of a whispered vowel to the discrete spectrum of a voiced vowel. Lea
(1992) used a model of basilar membrane excitation to match the vowels, but the
possibility remains that some imbalance, for example of level, might have
affected the results. Here we describe a similar experiment in which whispered
vowels are replaced by inharmonic vowels with spectral structure and density
closer to those of harmonic vowels.
B. Experimental Rationale and Predictions
We wish to determine whether the auditory system uses the harmonicity of the
target or that of the ground to segregate the target from the mixture. For
that purpose we used stimuli consisting of pairs of vowels, each of which was
either harmonic or inharmonic. Inharmonic vowels were obtained by perturbing
the frequencies of the components of a harmonic vowel by small random amounts,
as explained in section I-B and Appendix A-2. In addition to the harmonicity
state we introduced a difference in fundamental frequency (ΔF0) in order to compare their effects and study their interaction, as well
as to allow comparisons with previous studies. Pairs of vowels were presented
simultaneously. Subjects were asked to identify both vowels and respond with an
unordered pair of vowel names. For each vowel in the stimulus, the answer was
deemed correct if the vowel's name appeared within the response pair. This
answer was classified according to the harmonic state of that vowel (the
target), the state of the other vowel (the ground), and the nominal F0
difference between them. This step was repeated for the second vowel in the
pair, reversing the roles of target and ground.
In this paper, the notation 'HI', for example, indicates a harmonic target with
an inharmonic ground, and 'R(HI)' indicates the identification rate for that
target. Other combinations are noted IH, HH, and II. Where necessary, the
relation between the F0s may also be specified: HI0 signifies the same F0 and
HIx signifies a different F0 (HI implies that both ΔF0 conditions are
taken together). For each hypothesis concerning the strategy that is used by
the auditory system to separate harmonic sounds, specific predictions can be
made concerning the outcome of this experiment.
1. Enhancement
According to this hypothesis, harmonicity of the target promotes segregation
(unless the ground is also harmonic and has the same F0). All else being equal,
a target should be better identified if it is harmonic:
R(HI0) > R(II0),
R(HIx) > R(IIx),
R(HHx) > R(IHx).
If the hypothesis is false, these differences should be insignificant.
2. Cancellation
According to this hypothesis, harmonicity of the ground allows the target to be
segregated (unless it is also harmonic and has the same F0). All else being
equal, identification should be better when the ground is harmonic:
R(IH0) > R(II0),
R(IHx) > R(IIx),
R(HHx) > R(HIx).
If the hypothesis is false, the differences should be insignificant. In
addition to these two hypotheses that our experiment was specifically designed
to test, there are others that are worth considering.
3. Symmetric Mechanisms
According to Bregman (1990), a characteristic of primitive segregation is the
symmetry of its effects: segregation causes both parts of a mixture to become
equally accessible. We might thus expect vowels in a pair to be equally
affected by factors that promote segregation such as differences in harmonic
structure:
R(IH0) = R(HI0),
R(IHx) = R(HIx).
Specific cues or mechanisms that might show that behavior are:
a) Component Mismatch
According to this explanation, harmonicity per se is unimportant; segregation
is limited by the proximity of components and increases when harmonic
structures are different. In the HH0 condition harmonic series coincide,
whereas all other conditions introduce a mismatch between component frequencies
that should ease identification of both constituents:
R(all conditions other than HH0) > R(HH0).
b) Beating between Partials
Culling (1990) and Culling and Darwin (1993a,b) suggested that beating between
partials in the F1 region might explain improvements in identification with
F0. Beating occurs for example if two partials belonging to different
vowels fall within the same auditory filter: the output fluctuates at a rate
that depends on the difference in frequency between the partials. Fluctuations
may allow the amplitudes of the two partials to be better estimated, as long as
they are neither too slow to be appreciable within the duration of the
stimulus, nor too fast to be resolved temporally by the auditory system.
Beating is likely to affect identification in a complex fashion, but insofar as
it depends on frequency difference between partials of both vowels, both should
be equally affected.
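A toy computation (ours, not the authors') makes the relevant orders of magnitude concrete: two partials a half semitone apart in the F1 region beat at a rate that fits a few cycles into a 200 ms stimulus.

```python
import numpy as np
from scipy.signal import hilbert

fs = 16000                           # sampling rate of the stimuli (Hz)
t = np.arange(int(0.2 * fs)) / fs    # 200 ms, the stimulus duration
f1 = 500.0                           # a partial in the F1 region
f2 = f1 * 2 ** (0.5 / 12)            # a partial 1/2 semitone higher (~514.6 Hz)
x = np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)

# The envelope beats at |f2 - f1| ~ 14.6 Hz, about 3 cycles in 200 ms:
# slow enough to track, fast enough to sample several times per stimulus.
envelope = np.abs(hilbert(x))
print(f2 - f1, "Hz beat rate;", (f2 - f1) * 0.2, "cycles per stimulus")
```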
c) Quality Differences (pitch, timbre)
Vowels that share the same pitch and harmonic nature (such as constituents of
the HH0 and II0 conditions) may "sound alike" and thus be difficult to
segregate when mixed. Differences in quality should promote segregation:
R(all conditions other than HH0, II0) > R(HH0,II0).
In contrast to the predictions of the component-mismatch hypothesis, the II0
condition does not promote segregation here (assuming that all of the
inharmonic stimuli used are perceived as having a similar quality).
4. The Effect of ΔF0
For all models, ΔF0 effects are likely to be smaller when either vowel
is inharmonic than when both are harmonic. For example in the IH condition the
effectiveness of enhancement would be reduced, whereas that of cancellation
should change relatively little with ΔF0 (much of the ΔF0
effect in the HH condition is due to the fact that when ΔF0=0 all
target components fall precisely on the ground vowel's harmonic series, and are
cancelled together with those of the ground). Component mismatch or beating
should also be less affected by ΔF0 than in the HH condition, leading
to smaller effects when either vowel is inharmonic. On the other hand when
both vowels are harmonic, all models predict alike:
R(HHx) > R(HH0).
It is for this reason that classic double-vowel experiments do not allow us to
choose between hypotheses.
I. Stimuli
A. Spectral Envelopes
Vowels belonged to a set of French vowels -- /a/, /e/, /i/, /o/, /u/ -- which
are also common to many different languages. The spectral envelopes were
derived from natural voiced speech by a screening procedure that produced a set
of 10 allophones for each vowel (see Appendix A-1). Envelopes for each
experimental condition were drawn at random from the allophone set. By using
random allophones, we hoped a) to reduce the likelihood that a listener might
learn the spectra of particular combinations of synthetic vowels and respond
correctly without using separation mechanisms, b) to make the task more
difficult in conditions such as equal F0 and thus obtain larger effects when
the F0s differed, and c) to lower the overall recognition rate to avoid ceiling
effects. We reasoned that intraclass variability would make the task more
typical of situations in which human beings recognize speech.
B. Harmonic Structure
Vowels were synthesized in one of two harmonicity states (harmonic and
inharmonic) and at three nominal fundamental frequencies (125 Hz and ±
1/4-semitone, or ± 1.45% of the F0). Harmonic vowels had component
frequencies equally spaced at multiples of the F0. For inharmonic vowels, each
component frequency was shifted from the harmonic series by an amount drawn at
random from a uniform distribution bounded by ±3% of the harmonic frequency,
or half the spacing between adjacent harmonics, whichever was smaller (see
Appendix A-2 for more details). The "nominal F0" of an inharmonic vowel is by
definition that of the harmonic series before modification. We chose to use a
rather mild perturbation to ensure that the spectral density was similar to
that of a harmonic vowel shaped by the same envelope. A different component
frequency pattern was used for each inharmonic allophone (however this pattern
remained the same at different nominal F0s). Inharmonic patterns are
illustrated in Fig. 2, together with a histogram illustrating the distribution
of inter-component spacings in the F1 and F1-F3 regions.
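A minimal sketch of this perturbation rule follows (Appendix A-2 is the authoritative description; names and the seeding scheme are our illustrative choices). Because the bound scales with F0, reusing a seed per allophone reproduces the constraint that the pattern stays the same, in relative terms, across nominal F0s.

```python
import numpy as np

def inharmonic_pattern(f0, n_components=45, seed=0):
    """Shift each harmonic of f0 by an amount drawn uniformly from
    +/- 3% of its frequency, capped at half the harmonic spacing
    (f0/2); the cap takes over from the 17th harmonic upward."""
    rng = np.random.default_rng(seed)       # one seed per allophone
    harmonics = f0 * np.arange(1, n_components + 1)
    bound = np.minimum(0.03 * harmonics, 0.5 * f0)
    return harmonics + rng.uniform(-bound, bound)
```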
The F0 values we chose allow ΔF0s of 0 and 2.9% (1/2 semitone) to be
investigated. Based on previous studies (Fig. 1), such values should ensure an
effect large enough to be significant while leaving room for improvement with
other factors. The range corresponds approximately to the maximum frequency
shift of the partials of our inharmonic vowels, and to the mistuning up to
which individual partials still make a full contribution to virtual pitch, as
estimated by Moore et al. (1985).
Fig. 2 (a) Top: harmonic series; middle: range of frequencies from which
inharmonic partials are drawn; bottom: a particular inharmonic series. (b)
Histogram showing the distribution of inter-component spacings (divided by F0)
for inharmonic series. Full line: spacings between components up to 750 Hz (F1
region). Dotted line: spacings between components up to 3 kHz (F1-F3
region).
C. Synthesis
Individual vowels were generated by 16-bit additive synthesis at a sampling
rate of 16 kHz. Their spectra comprised 45 components with amplitudes
determined by interpolated look-up in a spectral envelope table corresponding
to a given envelope. There was an additional 5 dB/component de-emphasis from
the 30th to the 45th component. All components started in sine phase. Stimuli
were 200 ms in duration including 25 ms raised cosine onset and offset ramps.
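Under the parameters just listed, the synthesis step reduces to the following sketch (the interpolated envelope lookup mirrors the table lookup described above; the exact reading of the de-emphasis rule is our assumption):

```python
import numpy as np

def synthesize_vowel(freqs, env_freqs, env_db, fs=16000, dur=0.2, ramp=0.025):
    """Additive synthesis: sine-phase components with amplitudes read by
    interpolated lookup in a spectral envelope, 25 ms raised-cosine ramps."""
    t = np.arange(int(dur * fs)) / fs
    amp_db = np.interp(freqs, env_freqs, env_db)
    k = np.arange(1, len(freqs) + 1)
    amp_db -= 5.0 * np.clip(k - 30, 0, None)   # 5 dB/component past the 30th
    x = (10.0 ** (amp_db / 20.0)) @ np.sin(2 * np.pi * np.outer(freqs, t))
    n = int(ramp * fs)
    ramp_up = 0.5 * (1 - np.cos(np.pi * np.arange(n) / n))
    x[:n] *= ramp_up
    x[-n:] *= ramp_up[::-1]
    return x
```

A harmonic vowel uses freqs = 125.0 * np.arange(1, 46); an inharmonic one substitutes a perturbed pattern such as the one sketched above.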
II. Pre-test: Single Vowel Identification
The purpose of the pre-test was to screen listeners for their ability to
identify the synthesized vowels used in the experiment. We were also interested
in whether there were any systematic effects of harmonicity or F0 on the
identifiability of vowels, as such effects might interfere with the effects
studied in the main experiments.
A. Subjects
Subjects were 21 male and 11 female caucasian homo sapiens volunteers
recruited from the staff and students at IRCAM and ENST (including the four
authors). Their ages ranged from 23 to 50 years (mean 31.4). None of the
subjects reported having a hearing disorder. The subjects had French as either
their mother tongue (23) or as a highly fluent second language which they
practised on a daily basis in their professional lives (9). The large majority
had extensive experience producing and listening to synthesized sounds.
Nineteen of the subjects had participated in a similar pilot experiment about
two months prior to this one.
B. Stimuli
Ten allophones of the vowels /a/, /e/, /i/, /o/ and /u/ were each synthesized
at the three F0s to be used in the main experiment (123.208, 125.0, 126.818
Hz). Each of these 30 combinations was synthesized in both harmonic and
inharmonic versions. All stimuli were equalized for rms level.
C. Procedure
Subjects were seated in a Soluna SN-1 double-walled sound-proof booth, in front
of a computer terminal that was used for prompting and to collect responses.
Digital stimuli stored on hard disk were upsampled to 44.1 kHz, sent through
the NeXT Cube D-A converters and presented diotically over Sennheiser HD 520 II
earphones. The sound system was calibrated using a flat-plate coupler
connected to a Brüel & Kjær 2209 sound level meter to obtain a level of
approximately 60 dBA.
Subjects were informed that they would hear individual vowel sounds and were to
identify them as one of /a/, /e/, /i/, /o/, /u/ by typing the appropriate key
on the computer keyboard (a, e, i, o, u, respectively). They were informed
that they needed to attain a criterion performance level of 95% to continue on
to the main experiment. The computer logged the spectral envelope, F0,
harmonicity and response to each stimulus in a separate file for each subject.
Each combination of allophone, nominal F0, and harmonicity was presented once
for a total of 300 trials that were presented in random order (Appendix
A-4). The pre-test lasted 15 minutes on average.
D. Results
All but two of the subjects attained 95% criterion performance and continued on
to participate in the main experiment. The identification rates for those
subjects were 91% and 94%. Overall performance for a given subject varied from
91% (27 errors) to 100%. The global mean was 98% (11.2 errors; s.d. = 1.85%).
Performance varied for different vowels: the order according to identification
rate was /a/ (99%), /i/ (99%), /u/ (98%), /e/ (97%), and /o/ (97%). The only
allophones whose performance fell below 90% were an /e/ and an /o/. The
performance of one allophone (a /u/) fell between 90% and 95%. The remaining
allophones were identified at better than 95%. The confusion matrix indicating
the frequency of each response category assigned to each stimulus was examined.
The only confusions that represented more than 1% of the judgments were
confusions between /e/ and /i/ and between /o/ and /u/. These confusions
accounted for 1.5% and 3.9% of the identification errors, respectively.
A multivariate repeated measures analysis of variance on factors vowel class
(5) X harmonicity (2) X F0 (3) was performed with, as the dependent variable,
proportion correct identifications across allophones by each subject within a
given condition. Each data point was based on 10 judgments. The analysis
revealed that the main effect of vowel noted above was significant
(F(4,124)=3.5, p=0.016, GG=0.78). [1] There was
no significant effect of fundamental frequency nor any significant interactions
involving this factor. There was no main effect of harmonicity but the
interaction between vowel and harmonicity was highly significant (F(4,124)=6.4,
p=0.0002, GG=0.88) indicating an effect of harmonicity on vowel identification
that is limited to certain vowels. Contrasts between harmonic and inharmonic
versions for each vowel class showed that the effect of harmonicity was only
significant for /e/ and /u/ vowels. Harmonic stimuli were better identified
than inharmonic ones for /e/ by 2.8% (F(1,124)=19.5, p<0.0001, GG=0.88) and
the reverse was true by 1.4% for /u/ (F(1,124)=4.5, p=0.041, GG=0.88). We can
summarize these results by noting that there were small, though significant,
effects of harmonicity for some stimuli and no effect of F0 for any
stimuli. The general level of performance is quite good for the large majority
of allophones in both harmonic and inharmonic versions.
III. Main Experiment: Double Vowel Identification
A. Subjects
Subjects were the 30 who attained criterion performance on the pre-test.
Nineteen of these had participated in a similar pilot experiment about two
months prior to this one.
B. Stimuli
The stimulus set consisted of pairs of synthesized vowel allophones belonging
to the set: /a/, /e/, /i/, /o/, /u/. Vowels within a pair were always
different, yielding 10 unordered combinations. Each vowel within a pair was
either harmonic or inharmonic, yielding four combinations of harmonicity.
Finally, there were two conditions of F0 difference: 0 and 1/2-semitone (2.9%).
All factors, vowel pair (10), harmonicity (4) and ΔF0 (2), were crossed
giving 80 different combinations.
In addition to the factors that interest us, the design contained others that
might also influence the phonetic quality of the target or the masking power of
the ground: absolute F0, choice of inharmonic pattern, choice of allophone, or
presentation order. To avoid any systematic bias due to these factors, the
following precautions were taken: a) Pairs were duplicated so that each vowel
of each pair occurred once at the higher and once at the lower F0 when
ΔF0 ≠ 0. b) For each inharmonic allophone, the same component pattern
was used to synthesize different F0 conditions. c) Allophones were assigned in
a balanced fashion across conditions. For example the subset of allophones
representing the eight repetitions of the vowel /a/ (2 positions X 4 other
vowels) in the HH0 condition within a presentation of the stimulus set also
represented that vowel in all other main conditions (HHx, HI0, etc.). Other
subsets were used for other presentations. d) Stimuli were presented in random
order, and this order was renewed for each run and each subject.
In the inharmonic state each allophone used a different component pattern.
Since vowels within a pair were different, component patterns within an
inharmonic-inharmonic pair were also different. As noted above, the same
subsets of allophones appeared in all conditions, but for practical reasons it
was not possible to guarantee that the occurrence of allophone pairs was
similarly balanced. Allophones were paired at random, and the pairing was
renewed for each presentation and subject. Duplication of F0
conditions resulted in a 160-stimulus set.
Preliminary experiments had shown that when vowels are mixed at equal rms
signal levels, one vowel might dominate the pair due to unequal mutual
interference, as noted by McKeown (1992). In that case, the identification
probability of one vowel is likely to be at its "floor" and the other at its
"ceiling", both being thereby insensitive to the conditions of interest. To
avoid such a situation, we performed a preliminary experiment to determine
levels of equal "mutual interference" (see Appendix A-3). From these results we
derived a level correction factor for all pairs, such that identification rates
for both vowels were the same. Vowel levels were adjusted according to this
factor, the vowels were summed, and the rms signal level of the sum was set to
a standard level for all pairs.
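In outline, the mixing stage then reduces to a few operations (a sketch; the correction values come from the Appendix A-3 measurements, and the dB convention and target rms are our assumptions):

```python
import numpy as np

def mix_pair(v1, v2, correction_db, target_rms=0.05):
    """Apply the pair-specific level correction that equalizes mutual
    interference, sum, then normalize the mixture to a standard rms."""
    mix = v1 * 10.0 ** (correction_db / 20.0) + v2
    return mix * (target_rms / np.sqrt(np.mean(mix ** 2)))
```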
C. Procedure
The experimental apparatus was the same as in the pre-test. The double vowel
stimuli were presented at a level of about 60 dBA. Subjects were
informed that they would hear a complex sound composed of two different vowels
from the set /a/, /e/, /i/, /o/, /u/. Each vowel pair was presented once,
followed by a visual prompt on the terminal screen. Subjects were required to
hit two keys in succession, corresponding to the two vowels heard (two of a, e,
i, o, u)--or else Q to quit temporarily. Any other response produced a message
reminding the subject of the options, and requesting a new response. A response
with two identical vowels produced a message reminding the subject that the
vowels were different, and requesting a new response. Aside from information
about response constraints, no feedback was given concerning the correct
response. Subjects were presented with three consecutive runs of all
combinations of vowel pair, harmonicity, and ΔF0 in randomized order for a
total of 480 stimuli.
Responses for each subject were gathered in a file. Each response was scored
twice, once for each vowel present within the stimulus. The vowel was deemed
correctly identified if its name appeared within the response pair. This
partial response was classified according to the harmonic state of that vowel
(the target), the state of the other vowel (the ground), the nominal F0
difference between them, and the names of both vowels. This procedure was
repeated for the other constituent vowel, reversing the roles of target and
ground, leading to a total of 960 "answers" for each subject. Figure 3
summarizes these conditions and their notation. This method of scoring is
equivalent to that used by Lea (1992) to obtain "constituents-correct"
scores.
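In code form, the scoring of a single trial might look like this (the data layout and field names are our assumptions; the condition labels follow the paper's notation):

```python
def score_trial(pair, response):
    """Score one double-vowel trial twice, once per constituent vowel.

    pair: two records like {"vowel": "a", "harmonic": True, "f0": 123.208}
    response: unordered set of two vowel names, e.g. {"a", "o"}
    """
    answers = []
    for target, ground in (pair, pair[::-1]):
        label = (("H" if target["harmonic"] else "I")
                 + ("H" if ground["harmonic"] else "I")
                 + ("0" if target["f0"] == ground["f0"] else "x"))
        answers.append({"condition": label,          # e.g. "HIx"
                        "target": target["vowel"],
                        "correct": target["vowel"] in response})
    return answers
```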
Fig. 3 Response conditions: Target harmonicity X Ground harmonicity X
ΔF0 X vowel pairs.
D. Results
Within each harmonicity and ΔF0 condition, proportion correct
identification measures for each target vowel were calculated for each subject
across all vowel combinations, yielding eight data points per subject. Each
data point was based on 120 judgments (20 vowel pairs X 2 vowel identifications
X 3 repetitions). A multivariate repeated measures analysis of variance was
performed on factors ΔF0 (2), target harmonicity (2), and ground
harmonicity (2). All main effects and interactions were statistically
significant (see Table I). Subsequent discussion will focus on tests of the
various hypotheses outlined in the introduction.
Table I Analysis of variance table for the main
experiment. Dependent variable: mean identification performance for target
vowels across vowel pairs. Independent variables: fundamental frequency
difference (ΔF0), target harmonicity (Tar), ground harmonicity (Gnd).
1. Effect of ΔF0
In Figure 4 the means across subjects are plotted as a function of ΔF0.
Each line represents one of the four combinations of target and ground
harmonicity. Filled symbols represent harmonic targets and open symbols
inharmonic targets. Squares represent harmonic grounds and circles inharmonic
grounds. When both vowels are harmonic, performance increases with ΔF0
by about 6%. Planned contrasts show that this effect is highly significant
(F(1,29)=50, p<0.0001). When at least one vowel is inharmonic the effect is
not significant (for HI: F(1,29)=0.1; for IH: F(1,29)=0.4; for II:
F(1,29)=0.4). We take advantage of this fact to group these conditions across
ΔF0 in subsequent contrasts.
Fig. 4 Identification rate as a function of ΔF0 for each of the harmonicity conditions. Error bars represent ± 1 standard error of the mean. The standard deviations vary between 0.066 and 0.081. Data points for HH and HI are displaced horizontally for visibility.
2. Effect of Harmonicity of Ground.
The data are replotted in Fig. 5 to emphasize the effects of ground and target
harmonicity. Contrasts planned to test the cancellation hypothesis
(Introduction, B.1) show that identification improves significantly when the
ground is harmonic, unless the target is also harmonic and ΔF0=0 (IH
vs. II: F(1,29)=26, p<0.0001; HHx vs. HI: F(1,29)=14, p=0.0008). The
improvement in identification rate is about 3%. These results are compatible
with the cancellation hypothesis. An additional contrast shows that when the
target is harmonic and ΔF0=0, performance is significantly worse with a
harmonic ground, also by about 3% (HH0 vs. HI0: F(1,29)=13, p=0.0009).
3. Effect of Harmonicity of Target.
Whatever the ΔF0 and whatever the nature of the ground, identification
is worse when the target is harmonic. Contrasts planned to test the enhancement
hypothesis (Introduction, B.2) are highly significant (HI vs. II: F(1,29)=15,
p=0.0004; HHx vs. IH: F(1,29)=13, p=0.0008), but the direction of the effects
observed is opposite to that predicted by that hypothesis. The effect
is similar in size, about 3%, to what was observed for ground harmonicity. An
additional contrast shows that the larger effect (about 8%) obtained when the
ground is harmonic and ΔF0=0 is also significant (HH0 vs. IH0:
F(1,29)=99, p<0.0001).
Fig. 5 Identification rate of target as a function of ground harmonicity, for harmonic and inharmonic targets and nominal F0 differences of 0 and 1/2
semitone.
4. Evidence of Symmetrical Segregation.
A contrast planned to test the hypothesis of symmetrical segregation
(Introduction, B.3) shows that, contrary to what this hypothesis predicted,
performance is significantly better for IH than for HI conditions (HI vs. IH:
F(1,29)=96, p<0.0001), by about 5% (Fig. 5). Symmetric segregation
mechanisms cannot account for our results. They might however coexist with
other asymmetric mechanisms, so it is of interest to consider contrasts
specific to the various symmetric segregation hypotheses.
Performance for HH0 is worse than for all other conditions (HI vs. HH0:
F(1,29)=19, p<0.0001; IH vs. HH0: F(1,29)=142, p<0.0001; II vs. HH0:
F(1,29)=59, p<0.0001). This would be consistent with the component-mismatch
hypothesis, were it not for the asymmetry between HI and IH mentioned above.
Performance is better for IH than for II (F(1,29)=26, p<0.0001) but worse
for HI than for II (F(1,29)=15, p=0.0004). This is inconsistent with the
quality differences hypothesis, already weakened by the asymmetry between HI
and IH.
5. Confusion Matrix
The confusion matrix for vowel pairs is shown in Table II. There was a slight
bias towards responses containing 'o' (22.0%) and 'e' (21.0%), rather than
those containing 'i' (19.1%), 'u' (19.1%) or 'a' (18.8%). The unordered
response pair 'ou' was recorded most often (14.2%), and 'au' least often
(7.1%). The vowel /u/ appears to be correctly identified most often (85%),
followed by /o/ (80%), /e/ (76%), /a/ (73%) and /i/ (72%). Vowels paired with
/a/ are identified correctly most often (91%) followed by those paired with /i/
(86%), /e/ (82%), /o/ (79%) and /u/ (49%). The poor rate for vowels paired with
/u/ is most certainly due to the excessive level emphasis given to /u/ relative
to other vowels (see Appendix A-3).
Stimulus \ Response |  ae  |  ai  |  ao  |  au  |  ei  |  eo  |  eu  |  io  |  iu  |  ou
--------------------|------|------|------|------|------|------|------|------|------|------
ae                  | 1182 |   81 |   41 |   18 |   55 |   45 |   13 |    1 |    1 |    3
ai                  |  169 | 1114 |   21 |   42 |   53 |    3 |    9 |   11 |   10 |    8
ao                  |   29 |    2 |  953 |   24 |    0 |  236 |    3 |   78 |    2 |  113
au                  |    3 |    3 |   28 |  434 |    5 |   26 |  209 |   19 |  149 |  564
ei                  |  172 |   47 |    1 |    1 |  957 |   82 |  111 |   22 |   47 |    0
eo                  |  139 |    6 |  134 |    8 |   48 |  906 |   37 |   82 |    3 |   77
eu                  |   31 |   44 |   13 |   76 |  148 |  140 |  371 |  102 |  297 |  218
io                  |   29 |   73 |   46 |   13 |  130 |  262 |   51 |  707 |   66 |   63
iu                  |    3 |   66 |    9 |  103 |   59 |   22 |  172 |  182 |  494 |  330
ou                  |    2 |    3 |   26 |  298 |    2 |   32 |  234 |   20 |  160 |  663
total               | 1759 | 1439 | 1272 | 1017 | 1457 | 1754 | 1210 | 1224 | 1229 | 2039
%                   | 12.2 |  9.9 |  8.8 |  7.1 | 10.1 | 12.2 |  8.4 |  8.5 |  8.5 | 14.2
Table II Confusion matrix for vowel pairs (response orders are confounded). The total number of times each stimulus type was actually presented was 1440. The bottom two rows give the total number of responses of each kind, and the proportion of total responses they represent.
6. Dependency of Effects on Vowel Pair
Our experiment was designed assuming that data would be averaged over vowel
pairs (and thus over allophone pairs and component pattern pairs). This was
deemed appropriate because we had no theoretical reason to expect major
differences in the way different vowel pairs, allophone pairs, or pattern pairs
might affect the dependency of identification rate on our main conditions:
ground harmonicity, target harmonicity, and ΔF0. It is nevertheless of
interest to note such effects. Figure 6 displays the identification rate as a
function of ground harmonicity for each of the 20 vowel pairs, for both
conditions of ΔF0 and both conditions of target harmonicity. Vowel
pairs differ considerably in overall identification rate, as well as in the
size and direction of the effects of ground harmonicity. These differences
might be due to genuine vowel specificities, or to some effect of the level
correction factors that we applied, or possibly to differences between the
component patterns used to synthesize each vowel pair (each allophone had its
own inharmonic pattern when it was synthesized in an inharmonic state; each
vowel was thus represented by a different set of patterns). Our experimental
design does not allow us to decide which of these factors are responsible for
the differences. It is however of interest to keep them in mind when
interpreting our main effects. For example it may be that the population of
"inharmonic" patterns that we treat as homogenous is actually made up of
members with widely differing properties.
Fig. 6 Identification rate of target vowel as a function of ground harmonicity for each vowel pair and for all four conditions of ΔF0 and target harmonicity. The thick lines without markers represent the effect averaged over vowel pairs, also plotted in Fig. 5.
IV. Discussion
A. Effect of ΔF0 in Comparison with Previous Studies
Most previous studies report the proportion of responses for which
both vowels in a pair were correctly identified (combinations-correct
rates). To allow comparisons to be made, combinations-correct scores were
calculated from our data for the HH condition and plotted in Fig. 1 together
with data from those studies. The effect of ΔF0 is quite similar.
Although our task was relatively easy (chance level is 10%, as in Culling and
Darwin (1993a) and Lea (1992), compared to 3.8% for Scheffers (1983), or 6.7%
for Summerfield and Assmann (1991)), our rates are relatively low. This
probably reflects the greater variability of our stimulus material, and
differences in training (we used a large number of relatively untrained
subjects).
B. Evidence for Cancellation
At a ΔF0 of 1/2 semitone, whatever the target, and at ΔF0=0
when the target is inharmonic, identification is better when the ground is
harmonic. This is consistent with the cancellation hypothesis. No advantage
was to be expected for a harmonic ground in the fourth condition (ΔF0=0
with a harmonic target), but identification was actually worse when the
ground was harmonic than when it was inharmonic (an unexpected outcome). One
possible explanation is that our inharmonic stimuli were approximately harmonic
with a "pseudo-period" that differed from their nominal period (on informal
listening they often appeared to have a pitch different from that of a harmonic
vowel of the same nominal F0). A harmonic sieve tuned to reject the "pseudo-period"
might partially remove the inharmonic ground without completely removing the
target, whereas that target would be eliminated if both vowels were harmonic
and had the same F0. Another possible explanation is that other mechanisms are
at work in addition to cancellation.
Lea (1992) also found evidence for cancellation: when the target was a 112 Hz
voiced vowel, identification rates were better by 3% for a 100 Hz voiced ground
than for a whispered ground. When the target was a whispered vowel, the
advantage was 8%. Subsequent experiments (Lea and Tsuzaki 1993a,b) gave similar
results. The largest effect found by Lea (1992) was about 2.7 times the size
of the ground harmonicity effects we found (~3%). The
smaller size of our effects may be due to the fact that our inharmonic vowels
were more "harmonic" than whispered vowels.
C. Evidence for Enhancement
Our results do not support enhancement. In fact identification rates are
worse when the target is harmonic, whereas in the absence of enhancement
we predicted a null effect. This result is unexpected. It is worth considering
in more detail at this point the assumptions upon which we based our
predictions. We assumed that both vowels could be retrieved simultaneously via
independent processing channels involving enhancement and/or cancellation, and
thus that both hypotheses could be tested independently. If instead the
auditory system must choose between strategies, factors that favor one
may penalize the other. If, for example, cancellation is used systematically,
it may tend to "lock" onto whatever happens to be harmonic within the stimulus,
and thus impair the identification of harmonic targets. Inharmonic targets
would be relatively immune. Thus the unexpected outcome of our experiment may
be due to the mutual interference between segregation mechanisms. If so, we
cannot rule out the possibility that enhancement is used, but its
effects are swamped by the side effects of cancellation. Enhancement might
then show up in tasks in which cancellation is less likely to come into
play. Our results contrast with those of Lea (1992), who found no significant
difference between whispered and voiced targets, and Lea and Summerfield
(1992), who found an advantage for targets that were voiced rather than
whispered.
An explanation for the apparent preference of the auditory system for
cancellation over enhancement may be found in an experiment by McKeown (1992).
He requested subjects to identify both vowels within a pair, and at the same
time judge which vowel was "dominant", and which was "dominated". Improvements
in identification with ΔF0 only concerned the dominated vowel. If we
suppose that it is easier to estimate the F0 of a dominant vowel than that of a
dominated vowel, it should follow that cancellation is easier to apply to
segregate the dominated vowel (de Cheveigné 1993a). It is then
reasonable that factors upon which cancellation depends should affect the
scores. Another explanation may be found in an experiment by de
Cheveigné (1993b, 1994). Harmonic enhancement and cancellation were
implemented in a speech recognition system to reduce the effects of co-channel
speech interference. Cancellation was more effective than enhancement,
presumably because it was less affected by the non-stationarity of speech. The
synthetic vowels used in our experiments are stationary so this consideration
should not apply here. However, the auditory system may have evolved to use
only strategies that are robust for natural stimuli.
D. Compatibility with F0-guided Models of Concurrent Vowel Perception
A variety of models make use of explicit F0 information. Some clearly take
sides for either enhancement (Frazier et al. 1976) or cancellation (Childers
and Lee 1987; Hanson and Wong 1984; Naylor and Boll 1987), but most other
models are capable of both. Models come in three sorts: spectral,
spectro-temporal, and temporal.
The harmonic sieve employed by spectral models based on Parsons' harmonic
selection method (Assmann and Summerfield 1990; Parsons 1976; Scheffers 1983;
Stubbs and Summerfield 1988, 1990, 1991) can be used in either of two modes:
to retain components that fall close to a harmonic series, or else to remove
them. These modes correspond to enhancement and cancellation, respectively.
The sieve may be applied in turn for each F0, correlates of one voice being
selected among those rejected by the sieve tuned to the other. In that case
each voice retrieved is actually a product of both strategies. Similar remarks
can be made concerning models derived from Weintraub's spectro-temporal model
(Assmann and Summerfield 1990; Lea 1992; Meddis and Hewitt 1992; Weintraub
1985): channels dominated by the period of a voice can be retained
(enhancement) or else removed (cancellation). If both operations are applied in
turn, each voice retrieved is really the product of both strategies. In the
model of Meddis and Hewitt (1992), only one F0 was used, so one voice (the
dominant one) was purely the product of enhancement, whereas the other voice
was purely the product of cancellation. However nothing in the model prevents
it from being extended to use both strategies to segregate both voices.
Finally, de Cheveigné (1993a) proposed a time-domain comb-filtering
model implemented by neural circuits involving inhibition that was capable of
either enhancement or cancellation.
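For concreteness, the two modes of the sieve can be sketched as follows (the tolerance and interface are illustrative assumptions; none of the cited models is reproduced exactly):

```python
import numpy as np

def harmonic_sieve(freqs, amps, f0, tol=0.03, mode="cancel"):
    """Partition spectral components by proximity to the harmonic series
    of f0: mode="enhance" retains components near k*f0, mode="cancel"
    removes them and keeps the residue."""
    k = np.rint(freqs / f0)
    near = (k >= 1) & (np.abs(freqs - k * f0) <= tol * k * f0)
    keep = near if mode == "enhance" else ~near
    return freqs[keep], amps[keep]
```

Applying the sieve in turn at each voice's F0, one voice being selected among the components rejected by the sieve tuned to the other, makes each recovered voice a product of both strategies, as noted above.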
Since most models allow both strategies, our results do not allow us to choose
among them, but they do allow us to better understand how each model functions.
E. Compatibility with other Models of Concurrent Vowel Perception
A number of models that do not require explicit extraction of F0 have been
proposed to explain improvement of identification with ΔF0. Summerfield
and Assmann (1991) suggested that such an improvement might be explained by
misalignment between partials of constituent vowels. At unison the partials of
both vowels coincide, and their relative contributions to the combined spectrum
are obscured by phase-dependent vector summation. Misaligned partials on the
other hand may show up as independent peaks within a high-resolution spectrum
and thus template-matching strategies might be more successful. Summerfield and
Assmann (1991) found some evidence for an effect of component misalignment for
vowels with widely-spaced components (200 Hz), but none for monaurally
presented vowels at 100 Hz. On the other hand in a masking experiment in which
thresholds were determined for synthetic vowels masked by vowel-like maskers,
Summerfield (1992) attributed up to 9 dB of a 17 dB release from masking to
component misalignment. The remaining 8 dB were attributed to F0-guided
mechanisms. Our results certainly cannot be explained solely in terms of
component misalignment. HI and IH conditions involve the same inter-component
intervals, yet they produce identification rates that are very different.
However if harmonic misalignment were involved together with other mechanisms,
it might help explain for example why the HH0 condition was significantly worse
than the HI0 condition. Our experiments used II pairs in which the inharmonic
patterns of the vowels were different, and thus partials did not coincide at
nominal ΔF0=0. It would be worth investigating a similar condition in
which both vowels have the same inharmonic pattern. Comparisons between
the two would allow us to factor out possible effects of component
misalignment.
If the period of a vowel is long relative to time constants of integration
within the auditory system, the vowel's auditory representation may fluctuate
during the period. Mutual interference between concurrent vowels may be more or
less severe according to whether the fluctuations of their respective
representations line up in time or not. A small F0 difference is equivalent to
a gradually increasing delay of one vowel relative to the other, and this might
allow the auditory system to select some favorable interval on which to base
identification. Differences in F0 might thus enhance identification.
Summerfield and Assmann (1991) investigated the effects of pitch period
asynchrony on identification rate using vowels with the same F0 but varying degrees
of phase shift. They found a significant effect at 50 Hz, but none at 100 Hz,
presumably because the integrating properties of the auditory representation
smooth out fluctuations at this rate. Our vowels had even higher F0s, so this
explanation is unlikely to account for our data.
Slower fluctuations may occur in the compound representation of the
vowel pair. Two partials falling within the same peripheral channel produce
beats with a depth that depends on their relative amplitudes, and a rate equal
to their difference frequency. Three or more partials produce yet more complex
interactions. These fluctuations may cause the auditory representation to take
on a shape that momentarily allows one vowel or the other, or both together, to
be better identified. Culling and Darwin (1993a,b) suggested that such beats
might explain increases of identification rate with differences in F0. Assmann
and Summerfield (1994) found that successive 50 ms segments excised from a 200
ms stimulus composed of two vowels with different F0s were not equally
identifiable. For small ΔF0, identification of the whole stimulus
could be accounted for assuming it was based on the "best" of the segments that
composed it. This result is compatible with the notion that F0 differences
cause the auditory representation to fluctuate (as does, indeed, the short-term
spectrum itself), and provide the auditory system with various intervals upon
which to base identification, one of which may be particularly favorable to
either vowel or both.
Inharmonicity or F0 differences between vowels can be interpreted as slowly
varying phase relationships between partials of harmonic vowels with the same F0.
The "best interval" provided by beating can be interpreted simply as a phase
relationship that is particularly favorable for identification. The harmonic
vowels used in our experiments were all synthesized in sine phase, whereas the
partials of inharmonic vowels can be interpreted as progressively moving out of
this phase relationship. If the masking power of vowels in sine-phase is
relatively small and the resistance to masking of vowels in sine-phase is
relatively poor, then harmonic vowels will appear to be both less well
recognized and less effective as maskers, as indeed we found. Phase effects
thus constitute a possible alternative explanation of our results.
F. Harmonicity and the Cohesion of Sound
The lack of a positive effect of harmonicity on target vowel identification is
the most surprising result of this study. It has been suggested that
harmonicity labels parts of a sound as belonging together in several ways:
continuity of F0 indicates that successive parts of speech belong to the same
voice, the same F0 indicates that different formants belong to the same vowel,
a common F0 signals that partials within a formant belong together (Bregman
1990; Broadbent and Ladefoged 1957; Darwin 1981). Without this "harmonic glue"
components would fall apart, and the sound might lose its intelligibility or be
more easily masked. Nevertheless Darwin (1981) found that speech sounds
synthesized with different formants on different F0s retained their phonetic
quality. Culling and Darwin (1993a) synthesized vowels with a difference in F0
between their first and higher formants, and paired them so that the components
making up the first formant of one vowel belonged to the same harmonic series
as the higher formants of the other. In other words, the F0s were swapped
between vowels in the F1 region. Identification was as good as for vowels with
unswapped F0s for all but the largest ΔF0s, from which Culling and
Darwin concluded that a common F0 between formants does not affect how
they are grouped together. Our results go a step further. They suggest that a
common F0 between partials has no positive effect (and apparently even a
negative effect) on the identification of the sound that they form. This result
is counter-intuitive, and is contradicted by some other studies. For example,
Darwin and Gardner (1986) found that mistuning a single partial within a
formant affected the phonetic quality of a vowel. However the effect of
mistuning (which was phase-dependent) was not always in the direction expected
on the basis of harmonic grouping.
A common F0 does have one important effect: it produces the impression of a
single source. The presence of multiple F0s within a sound, what Marin (1991)
calls "polyperiodicity", produces the impression of multiple sources, and thus
signals to the auditory system that segregation is called for. This signal
might have great value for sound organization, and yet have no effect in
psychoacoustic experiments for which the listening frame is already determined
by the task.
V. Summary and Conclusion
1) Listeners identified vowels within synthetic pairs better by about 3% when
they were inharmonic than when they were harmonic, except when the ground was
harmonic and ΔF0=0, in which case the advantage was 8%. This result is
contrary to what one would expect if a strategy of harmonic enhancement was
used to segregate the vowels.
2) Listeners identified vowels within synthetic pairs better by about 3% when
the vowels accompanying them were harmonic than when they were inharmonic,
except when the target vowel was also harmonic and ΔF0=0, in which case
they were less well identified by about 3%. These results are consistent
with the hypothesis of harmonic cancellation.
3) When both vowels within a pair were harmonic, they were better identified by
about 6% when there was a difference in F0 of 1/2 semitone. This is consistent
with results of previous studies. When either vowel was inharmonic, a
difference in F0 did not affect identification.
4) When one vowel within a pair was harmonic and the other inharmonic, the
inharmonic component was identified significantly better than the harmonic
component. Effects do not follow the symmetric pattern that is sometimes
assumed to be characteristic of primitive segregation.
5) Our experiments employed a particular starting phase pattern (sine) to
synthesize all vowels. In the light of recent results that demonstrate the role
of beats in the identification of concurrent vowels (Assmann and Summerfield
1993, Culling and Darwin 1993b), we cannot rule out the possibility that our
results are partly specific to this phase pattern.
Fundamental frequency had two putative roles for Darwin (1981): to "group
consecutive sounds together into the continuing speech of a single talker" and
to "group together the harmonics from different formants of one talker, to the
exclusion of harmonics from other sources" (p. 186). Our results suggest that
these roles are minor in comparison to a third: to group together components
that belong to an interfering source to better eliminate it. The lack of
benefit of target harmonicity for identification is surprising, as target
harmonicity can in principle be exploited by a majority of harmonic sound
separation models. The question merits further examination, perhaps using
tasks that do not trigger cancellation.
Acknowledgments
Thanks to Gérard Bertrand for technical assistance, to Laurent Ghys for
guiding some of us through the mysteries of the NeXT Machine, and to Nina Fales
for assistance during the experiments. Thanks to John Culling and Quentin
Summerfield for providing data on which Fig. 1 was based, and to Andrew Lea for
useful discussions. This research was supported by a grant from the "Cognitive
Sciences" program of the French Ministry of Research and Space.
Appendix A-1: Preparation of Spectral Envelopes
We wished to use stimuli with high intra-class variability in order to make the
identification task more difficult and more typical of real speech
communication. We reasoned that the best place to look for such variability is
in natural, continuous speech. We systematically extracted voiced
(quasi-periodic) tokens from a multi-speaker speech database to obtain samples
of a wide range of spectra. We then screened them in several stages to obtain
a set of spectral envelopes that were consistently identifiable as given vowels
after resynthesis. The thresholds of acceptance in these screening tests were
chosen to strike an (arbitrary) balance between the goals of variability and
consistent identifiability.
Database
The database consisted initially of 50 phonetically balanced French sentences
pronounced by 11 adult speakers (5 male, 6 female), belonging to the CD6_GRECO1
disk of the GRECO1 database (GRECO 1987). To this initial database we later
added 16 sentences containing mainly /u/ vowels and a set of CVCV (V=/u/) words
from the same database. Data were sampled at 16 kHz with 16-bit resolution.
Estimation of Period and Periodicity, Extraction of Tokens
The initial database was processed by an F0 estimation algorithm based on the
Average Magnitude Difference Function (AMDF) algorithm (described in Appendix
B-2 of de Cheveigné 1993a), that produces as a by-product a measure of
periodicity. The F0 and periodicity measure were used to label portions of
voiced speech as follows: wherever the periodicity measure was above an
arbitrary threshold (2.0) for more than 50 ms, and the F0 was within the range
111 Hz - 141 Hz, an index was set every 50 ms. A total of 1788 indices were
thus set.
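The AMDF itself is standard; the sketch below indicates the computation (the actual periodicity measure of de Cheveigné 1993a is not reproduced here, so the dip-depth ratio is only a stand-in):

```python
import numpy as np

def amdf_f0(frame, fs, fmin=111.0, fmax=141.0):
    """Estimate F0 by minimizing the Average Magnitude Difference
    Function d(tau) = mean(|x[n] - x[n + tau]|) over lags spanning
    the 111-141 Hz search range."""
    lags = np.arange(int(fs / fmax), int(fs / fmin) + 1)
    n = len(frame) - lags[-1]
    d = np.array([np.mean(np.abs(frame[:n] - frame[tau:tau + n]))
                  for tau in lags])
    best = int(np.argmin(d))
    periodicity = float(np.mean(d) / (d[best] + 1e-12))  # stand-in measure
    return fs / lags[best], periodicity
```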
First Screening
For each index, a 50 ms stimulus was synthesized by extracting a single period
of the waveform and repeating it by concatenation at the original F0. These
stimuli were screened by informal listening by one of the authors (M.R.).
Stimuli that did not sound like vowels were rejected and the others were
labelled with a vowel name. 572 vowel tokens were kept (107 /a/, 180 /e/, 113
/i/, 94 /o/, 78 /u/).
Spectral Analysis
Spectral analysis was performed according to the following procedure: for each
labeled token, a single period was extracted and a DFT was performed on this
period to obtain a magnitude spectrum. A 512-point, 0-8 kHz spectral envelope
was derived by linear interpolation between the magnitude spectrum coefficients
(each representing a harmonic of the original vowel) [2]. An envelope file was produced for each
token.
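In outline, this analysis amounts to the following sketch (our illustration; the 512-point grid and 0-8 kHz span are from the text, the function name is hypothetical):

```python
import numpy as np

def spectral_envelope(period, fs=16000, n_points=512, f_max=8000.0):
    """One period in, 512-point 0-8 kHz envelope out: the DFT of a
    single period yields one coefficient per harmonic, and the
    envelope is linearly interpolated between their magnitudes."""
    mags = np.abs(np.fft.rfft(period))          # bin k = harmonic k
    f0 = fs / len(period)                       # F0 of the extracted period
    harm_freqs = f0 * np.arange(len(mags))      # harmonic frequencies
    grid = np.linspace(0.0, f_max, n_points)    # 512 points, 0-8 kHz
    return np.interp(grid, harm_freqs, mags)    # linear interpolation
```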
Second Screening
For each envelope file, a periodic synthetic stimulus was produced by additive
synthesis. Forty-five harmonics of 125 Hz were added in sine phase with
amplitudes determined by the envelope, with a 5 dB/component de-emphasis from
the 30th to the 45th component to reduce edge effects. Stimuli were 400 ms in
duration, including 50 ms raised-cosine onset and offset ramps. Subjects were the
four authors. All stimuli were presented three times in random order,
diotically via headphones in an office environment. Subjects could listen
repeatedly to each stimulus and were required to press a key representing the
vowel, or "x" if the stimulus was not identifiable as one of the target vowels.
Stimuli that were consistently identified as the same vowel (independently of
their initial label) at least 11 times out of 12 were kept. We thus retained
79 /a/, 54 /e/, 73 /i/, and 20 /o/, but only 3 /u/. To obtain more /u/ tokens, we
extended the analyzed database and repeated the previous analysis/synthesis
steps to obtain 58 more /u/ tokens.
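The additive synthesis used in this screening can be sketched as follows (parameter values are from the text; the envelope argument is the hypothetical 512-point envelope of Appendix A-1, and the sketch is ours, not the original synthesis code):

```python
import numpy as np

def synthesize_vowel(envelope, fs=16000, f0=125.0, n_harm=45,
                     dur=0.400, ramp=0.050, f_max=8000.0):
    """Add n_harm harmonics of f0 in sine phase, with amplitudes read
    off the envelope, a 5 dB/component de-emphasis from the 30th to
    the 45th component, and 50 ms raised-cosine onset/offset ramps."""
    t = np.arange(int(dur * fs)) / fs
    grid = np.linspace(0.0, f_max, len(envelope))
    x = np.zeros_like(t)
    for n in range(1, n_harm + 1):
        amp = np.interp(n * f0, grid, envelope)
        amp *= 10.0 ** (-5.0 * max(0, n - 30) / 20.0)   # edge de-emphasis
        x += amp * np.sin(2 * np.pi * n * f0 * t)       # sine phase
    n_ramp = int(ramp * fs)
    win = 0.5 * (1.0 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    x[:n_ramp] *= win                                    # raised-cosine onset
    x[-n_ramp:] *= win[::-1]                             # and offset
    return x
```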
Third Screening
The allophones were screened once more. The same four subjects judged the
"quality" of the vowels on a 3-point scale (1-excellent, 2-average, 3-poor).
With a rejection criterion of 1.5 (1.7 for the /u/), we retained 75 /a/, 49
/e/, 35 /i/, 13 /o/ and 13 /u/.
Clustering
For each vowel we wished to obtain a set of 10 allophones with as much
intra-set variability as possible. To obtain a better distribution than by
random choice, we used a clustering algorithm based on the classic K-means
algorithm, using Euclidean distance on the 512-point spectral envelopes. The
algorithm performed the following steps:
1) Choose an initial 10-point reference set. The first point is chosen at
random, each following point is chosen as far away as possible from previously
chosen points.
2) Partition the set into clusters, assigning each data point to its closest
reference.
3) Replace each reference by the centroid of its cluster, then loop to 2) until
a convergence criterion is met.
4) For each centroid finally obtained, choose the closest original data
point.
The result was a set of 50 allophones, 10 for each vowel.
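A minimal sketch of steps 1-4, assuming the envelopes are the rows of a numpy array (our illustration, with an arbitrary convergence test, not the original code):

```python
import numpy as np

def pick_allophones(envelopes, k=10, n_iter=100):
    """K-means with farthest-point seeding on the 512-point envelopes;
    returns indices of the data points closest to the final centroids."""
    refs = [envelopes[np.random.randint(len(envelopes))]]  # step 1: random seed
    for _ in range(k - 1):                                 # ...then farthest-point
        d = np.min([np.linalg.norm(envelopes - r, axis=1) for r in refs], axis=0)
        refs.append(envelopes[np.argmax(d)])
    refs = np.array(refs)
    for _ in range(n_iter):
        d = np.linalg.norm(envelopes[:, None, :] - refs[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)                      # step 2: partition
        refs_new = np.array([envelopes[labels == j].mean(axis=0)
                             if np.any(labels == j) else refs[j]
                             for j in range(k)])           # step 3: centroids
        if np.allclose(refs_new, refs):                    # convergence criterion
            break
        refs = refs_new
    d = np.linalg.norm(envelopes[:, None, :] - refs[None, :, :], axis=2)
    return np.argmin(d, axis=0)                            # step 4: snap to data
```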
Final Screening
A pilot version of our experiments served as a final screening test. Stimuli
were 50 harmonic and 50 inharmonic allophones with a nominal F0 of 125 Hz,
presented diotically via headphones in a sound-treated booth in random order.
Subjects were 30 adults, of whom 27 had French as their mother tongue and 3
used it as a highly fluent second language. A criterion was set to eliminate
subjects whose identification rate fell below 92%. The analysis of the
confusion matrix of the 20 subjects that remained showed a very high error rate
for /u/ stimuli (18.6%). Errors for /u/ represented 62% of all errors, and 95%
of these errors were /u/-/o/ confusions. Practically all errors
occurred for four /u/ allophones that tended to be identified as /o/, even by
subjects that had consistently classified them as /u/ in previous screening
tests. This result no doubt illustrates effects of context on vowel
identification. We eliminated these allophones, duplicated four of the
remaining allophones, renamed them, and proceeded as if /u/ had the same number
(10) of allophones as the other phonemes.
We repeatedly met difficulties with /u/. For some reason, very few portions of
speech isolated from our database sounded like /u/ after resynthesis, even
those taken from sentences labelled as containing mainly /u/ phonemes. A
tentative explanation is that in French /u/ is articulated with a protrusion of
the lips. The target position may require some time to be attained, and the
resulting spectral transition may in fact be necessary for identification.
Evidently no such transition is present in the resynthesized vowel. This does
not explain however why a few tokens do sound reasonably /u/-like after
synthesis. Overall, surprisingly few of the original voiced speech tokens were
identified consistently as vowels after resynthesis: less than 10% of the
original tokens survived the final screening. In real speech vowel identity is
probably largely determined by contextual or dynamic features that are absent
from the resynthesized vowels (Hillenbrand and Gayvert 1993).
Appendix A-2: Synthesis of Inharmonic Component Patterns
We wished to obtain vowels that were inharmonic, but with a spectral density
close to that of a harmonic vowel. The frequency of each component of a
harmonic series was shifted by a random amount drawn from a uniform
distribution bounded by ±3% of the harmonic frequency, or half the spacing
between adjacent harmonics, whichever was smaller. We synthesized twice the
required number of component patterns (50), then screened out the "least
inharmonic" half by choosing those with the greatest values of the following
measure of inharmonicity:

[equation missing from this version of the text]

where fn is the frequency of the nth component.
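Since the equation itself is missing here, the following sketch should be read with that caveat: the generation of the shifted patterns follows the text, but the ranking measure (mean relative deviation from the nominal harmonic series) is our guess at a plausible form, not the published formula.

```python
import numpy as np

def inharmonic_pattern(f0=125.0, n_comp=45):
    """Shift each component by a uniform random amount bounded by +-3%
    of the harmonic frequency or half the spacing (f0/2), whichever is
    smaller (bounds from the text)."""
    n = np.arange(1, n_comp + 1)
    bound = np.minimum(0.03 * n * f0, f0 / 2.0)
    return n * f0 + np.random.uniform(-bound, bound)

def inharmonicity(freqs, f0=125.0):
    """ASSUMED measure: mean relative deviation of each component from
    the nominal harmonic series. The paper's formula is not reproduced."""
    n = np.arange(1, len(freqs) + 1)
    return float(np.mean(np.abs(freqs - n * f0) / (n * f0)))

# Synthesize twice the required number of patterns, keep the most inharmonic half.
patterns = sorted((inharmonic_pattern() for _ in range(100)),
                  key=inharmonicity, reverse=True)[:50]
```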
Appendix A-3: Level Correction Factors
When vowels are mixed at equal rms signal levels, one vowel may dominate the
pair due to unequal mutual interference. We wished to avoid this situation.
Informal listening showed that equal rms level results in approximately equal
loudness; we concluded that matching for equal loudness was unlikely to fulfil
our goal. Instead we decided to determine experimentally a corrective factor
to balance mutual interference.
We first determined informally, for each of the 10 vowel pairs,
the rms level differences for which either vowel appeared to be absent. We
then centered a scale with 4 dB steps and 10 levels on the mean of these two
differences, and synthesized pairs of unison harmonic vowels according to this
scale. There were 10 such scales, one for each different vowel pair. The
stimuli were presented 5 times each in random order to four subjects (the four
authors). At each presentation the stimulus was repeated twice; after each
repetition the subject had to identify one constituent (SRSR pattern). A
response could be any of the five vowels, or 'x' if no vowel could be heard,
but the two responses had to be different. Psychometric functions were plotted
for each component of a pair, and their intercept was taken as the corrective
factor (more sophisticated interpolation techniques were judged unnecessary for
our purpose of adjusting levels to avoid complete dominance within a pair).
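For concreteness, the crossing point can be computed by linear interpolation between adjacent points of the 4 dB scale, as in this sketch (the data names are ours; it assumes the two identification-rate functions cross somewhere within the scale):

```python
import numpy as np

def crossing_level(levels, rate_a, rate_b):
    """Level difference at which the psychometric functions for the two
    vowels of a pair cross, by linear interpolation between scale points."""
    diff = np.asarray(rate_a, float) - np.asarray(rate_b, float)
    i = int(np.where(np.diff(np.sign(diff)) != 0)[0][0])  # first sign change
    frac = diff[i] / (diff[i] - diff[i + 1])              # position within step
    return levels[i] + frac * (levels[i + 1] - levels[i])
```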
The corrective factors for all pairs are shown in Table A3-I:

         /e/     /i/     /o/     /u/
/a/     -5.0    -7.5   -17.5   -31.0
/e/              1.0   -11.5   -17.0
/i/                     -2.0   -16.0
/o/                            -16.5
Table A3-I Level correction factors for vowel pairs, in dB.
The level after correction of one vowel relative to another is shown at the
intersection of the row and column that they label, respectively. For example,
to synthesize the pair /ae/, the rms level of /a/ should be set 5.0 dB below
that of /e/.
These results are roughly compatible with those reported by McKeown (1992) for
three of his four subjects: /a/ tends to dominate all other phonemes, /u/ tends
to be dominated by all others. Other phonemes are intermediate: /o/, /i/, /e/
in order of increasing dominance. However, our factors were determined before
the final screening that eliminated four allophones of /u/. Levels are
therefore certainly biased too far in favor of /u/ to compensate for their poor
quality. This is evident in the identification rates as a function of ground
vowel, which are particularly low when /u/ is the ground (Sec. III-D-6), but it
should not affect our
main conclusions concerning the effects of harmonicity or F0: they
remain quite similar when pairs containing /u/ are removed from analysis. We do
not recommend that these particular level correction factors be used in other
studies.
Appendix A-4: Randomization
It is worth stating explicitly what we mean by random order and quantities.
Inharmonic vowel component patterns were "random" in the limited sense that a
set of 50 inharmonic patterns was obtained by random perturbation of a harmonic
series. This same set was used for all subjects and presentations. Stimulus
presentation order was random in the sense that the stimulus set was shuffled
before each run according to a fresh series of random numbers. The pairing of
allophones representing a vowel pair was random in the sense that it depended
on the random order with which allophones were called upon to represent a given
vowel within a stimulus set. All series of random numbers were produced by the
random() routine of the UNIX C library, after initialization by a call to
srandom() with an argument derived from the system clock.
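In modern terms, the shuffling procedure corresponds to something like the following Python stand-in for the C library calls named above (illustrative only):

```python
import random
import time

random.seed(int(time.time()))     # srandom() with a clock-derived argument
n_stimuli = 100                   # placeholder for the size of a stimulus set
order = list(range(n_stimuli))
random.shuffle(order)             # fresh shuffle before each run
```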
References
Assmann, P. F. and Summerfield, Q. (1989). "Modeling the perception of
concurrent vowels: Vowels with the same fundamental frequency," J. Acoust. Soc.
Am. 85, 327-338.
Assmann, P. F. and Summerfield, Q. (1990). "Modeling the perception of
concurrent vowels: vowels with different fundamental frequencies," J. Acoust.
Soc. Am. 88, 680-697.
Assmann, P. F. and Summerfield, Q. (1994). "The contribution of waveform
interactions to the perception of concurrent vowels," J. Acoust. Soc. Am. 95,
471-484.
Bregman, A. S. (1990). Auditory scene analysis (MIT Press, Cambridge, Mass.).
Broadbent, D. E. and Ladefoged, P. (1957). "On the fusion of sounds reaching
different sense organs," J. Acoust. Soc. Am. 29, 708-710.
Brokx, J. P. L. and Nooteboom, S. G. (1982). "Intonation and the perceptual
separation of simultaneous voices," Journal of Phonetics 10, 23-36.
Carlyon, R. P. (1991). "Discriminating between coherent and incoherent
frequency modulation of complex tones," J. Acoust. Soc. Am. 89, 329-340.
Cherry, E. C. (1953). "Some experiments on the recognition of speech with one,
and with two ears," J. Acoust. Soc. Am. 25, 975-979.
Childers, D. G. and Lee, C. K. (1987). "Co-channel speech separation", IEEE
ICASSP, 181-184.
Culling, J. (1990). "Exploring the conditions for the perceptual segregation of
concurrent voices using F0 differences," Proc. of the Institute of Acoustics
12, 559-566.
Culling, J. F. and Darwin, C. J. (1993a). "Perceptual separation of
simultaneous vowels: within and across-formant grouping by F0," J. Acoust. Soc.
Am. 93, 3454-3467.
Culling, J. F. and Darwin, C. J. (1993b). "Perceptual and computational
separation of simultaneous vowels: cues arising from low frequency beating,"
draft submitted for publication.
Darwin, C. J. (1981). "Perceptual grouping of speech components differing in
fundamental frequency and onset-time," Q. J. Exp. Psychol. 33A, 185-207.
Darwin, C. J. and Culling, J. F. (1990). "Speech perception seen through the
ear," Speech Communication 9, 469-475.
Darwin, C. J. and Gardner, R. B. (1986). "Mistuning of a harmonic of a vowel:
grouping and phase effects on vowel quality," J. Acoust. Soc. Am. 79,
838-845.
de Cheveigné, A. (1993a). "Separation of concurrent harmonic sounds:
Fundamental frequency estimation and a time-domain cancellation model of
auditory processing," J. Acoust. Soc. Am. 93, 3271-3290.
de Cheveigné, A. (1993b). "Time-domain comb filtering for speech
separation", ATR Human Information Processing Laboratories technical report
TR-H-016.
de Cheveigné, A., Kawahara, H., Aikawa, K., and Lea, A. (1994). "Speech
separation for speech recognition", Proc. 3rd French Congress of Acoustics,
Toulouse, 1994.
Demany, L. and Semal, C. (1990). "The effect of vibrato on the recognition of
masked vowels," Perception & Psychophysics 48, 436-444.
Denbigh, P. N. and Zhao, J. (1992). "Pitch extraction and separation of
overlapping speech," Speech Communication 11, 119-125.
Frazier, R. H., Samsam, S., Braida, L. D., and Oppenheim, A. V. (1976).
"Enhancement of speech by adaptive filtering", IEEE ICASSP, 251-253.
GRECO (1987). "BDSONS, base de données des sons du français, GRECO1", edited
by Jean-François Sérignat and Ofelia Cervantes, ICP, Grenoble (France).
Hanson, B. A. and Wong, D. Y. (1984). "The harmonic magnitude suppression (HMS)
technique for intelligibility enhancement in the presence of interfering
noise", IEEE ICASSP 2, 18A.5.1-4.
Hillenbrand, J. and Gayvert, R. T. (1993). "Identification of steady-state
vowels synthesized from the Peterson and Barney measurements," J. Acoust. Soc.
Am. 94, 668-674.
Lea, A. (1992). "Auditory models of vowel perception", unpublished doctoral
dissertation (University of Nottingham, UK).
Lea, A. and Tsuzaki, M. (1993a). "Segregation of competing voices: perceptual
experiments," Proc. Acoust. Soc. Jap., Spring session, 361-362.
Lea, A. P. and Tsuzaki, M. (1993b). "Segregation of voiced and whispered
concurrent vowels in English and Japanese," J. Acoust. Soc. Am. 93, 2403 (A).
Lea, A. P. and Summerfield, Q. (1992). "Monaural segregation of competing
voices," Proc. Acoust. Soc. Japan committee on Hearing H-92-31, 1-7.
Marin, C. (1991). "Processus de séparation perceptive des sources
sonores simultanées", unpublished doctoral dissertation
(Université de Paris III, France).
Marin, C. and McAdams, S. (1991). "Segregation of concurrent sounds. II:
Effects of spectral envelope tracing, frequency modulation coherence, and
frequency modulation width," J. Acoust. Soc. Am. 89, 341-351.
McAdams, S. (1989). "Segregation of concurrent sounds. I: Effects of frequency
modulation coherence," J. Acoust. Soc. Am. 86, 2148-2159.
McKeown, J. D. (1992). "Perception of concurrent vowels: the effect of varying
their relative level," Speech Communication 11, 1-13.
Meddis, R. and Hewitt, M. J. (1992). "Modeling the identification of concurrent
vowels with different fundamental frequencies," J. Acoust. Soc. Am. 91,
233-245.
Moore, B. C. J., Glasberg, B. R., and Peters, R. W. (1985). "Relative
dominance of individual partials in determining the pitch of complex tones," J.
Acoust. Soc. Am. 77, 1853-1860.
Naylor, J. A. and Boll, S. F. (1987). "Techniques for suppression of an
interfering talker in co-channel speech", IEEE ICASSP, 205-208.
Parsons, T. W. (1976). "Separation of speech from interfering speech by means
of harmonic selection," J. Acoust. Soc. Am. 60, 911-918.
Scheffers, M. T. M. (1983). "Sifting vowels", unpublished doctoral dissertation
(University of Groningen, the Netherlands).
Stubbs, R. J. and Summerfield, Q. (1988). "Evaluation of two voice-separation
algorithms using normal-hearing and hearing-impaired listeners," J. Acoust.
Soc. Am. 84, 1236-1249.
Stubbs, R. J. and Summerfield, Q. (1990). "Algorithms for separating the speech
of interfering talkers: evaluations with voiced sentences, and normal-hearing
and hearing-impaired listeners," J. Acoust. Soc. Am. 87, 359-372.
Stubbs, R. J. and Summerfield, Q. (1991). "Effects of signal-to-noise ratio,
signal periodicity, and degree of hearing impairment on the performance of
voice-separation algorithms," J. Acoust. Soc. Am. 89, 1383-1393.
Summerfield, Q. (1992). "Roles of harmonicity and coherent frequency modulation
in auditory grouping," in The auditory processing of speech, edited by B.
Schouten (Mouton deGruyter, Berlin), 157-165.
Summerfield, Q. and Assmann, P. F. (1991). "Perception of concurrent vowels:
effects of harmonic misalignment and pitch-period asynchrony," J. Acoust. Soc.
Am. 89, 1364-1377.
Summerfield, Q. and Culling, J. F. (1992). "Auditory segregation of competing
voices: absence of effects of FM or AM coherence," Phil. Trans. R. Soc. Lond. B
336, 357-366.
Weintraub, M. (1985). "A theory and computational model of auditory monaural
sound separation", unpublished doctoral dissertation (Stanford University,
USA).
Zwicker, U. T. (1984). "Auditory recognition of diotic and dichotic vowel
pairs," Speech Communication 3, 256-277.
Footnotes:
[1] In all reports of F statistics in this article, the probabilities
reflect, where necessary, an adjustment of the degrees of freedom by the
Greenhouse-Geisser factor to correct for the inherent correlation of repeated
measurements (Geisser and Greenhouse, 1958). GG indicates the epsilon factor
by which the degrees of freedom were multiplied to determine the probability
level. This is a conservative correction factor.
[2] We tried a cepstral smoothing technique at an early stage, but rejected
it because we were dissatisfied with the quality of the resynthesized vowels.
However, we met similar problems with the other methods we tried, so it seems
likely that the poor quality was essentially due to variability and
coarticulation effects in the speech we analyzed, and to the harshness of
purely periodic resynthesized speech. We nevertheless kept the present
procedure, which has the advantage of making no normative assumptions about
the shape of the spectral envelope.