IRCAM - Centre Pompidou. Server © IRCAM - Centre Pompidou 1996-2005.
All rights reserved for all countries.

Auditory processing of frequency modulation coherence

Stephen McAdams, Cécile M.H. Marin

Proceedings of the 6th Annual Meeting of the International Society for Psychophysics, 1990
Copyright © International Society for Psychophysics 1990


Thresholds for detection of frequency modulation coherence and incoherence were measured. The task consisted of detecting a "figure" composed of 1 or more frequency components embedded in a multi-component "ground". The relations among components were either harmonic or inharmonic. Components were modulated in frequency with a random waveform. Frequency modulation width was varied in an adaptive tracking procedure. The components of the figure were always modulated coherently so as to maintain constant frequency ratios. The components of the ground were either modulated coherently or independently of one another. Thresholds for coherent grounds were considered to reflect the sensitivity to modulation incoherence between figure and ground components. Thresholds for incoherent grounds were considered to reflect the sensitivity to modulation coherence among the figure components. The results show that thresholds in the coherent ground conditions depend primarily on the proximity of figure and ground components, suggesting that interactions within auditory channels are used to detect incoherent modulation. Thresholds for incoherent grounds are higher than those for coherent grounds and show that coherently modulated components must fall within the same auditory channels in order for the coherence to be detected, but that information about coherence can be integrated across channels to some extent.


Frequency modulation coherence has been shown to contribute to the perceptual segregation of concurrent sounds ([1], [3], [5]). Two kinds of information related to this phenomenon have been proposed [2]: within-channel incoherence generated by adjacent components that derive from different sound sources and cross-channel coherence existing among components deriving from the same sound source. Whether the information that is encoded in the auditory system truly allows access to both of these properties of the stimulus is currently being debated. The experiments described below were designed to examine both kinds of information by controlling 1) the proximity among figure components and between figure and ground components, 2) the harmonicity (or periodicity) of all the components, 3) the coherence of modulation among ground components, and 4) the number of figure components. Combinations of these parameters allow us to understand the relative contributions of within-channel and cross-channel cues derived from the patterns of activity generated by frequency modulation in the auditory system.


General Stimulus Properties

The stimuli consisted of a series of 8, 9 or 15 equal-amplitude partials whose starting phases were randomized for each tone presented. The tones were presented at a level of 45 dB/component. In test tones, some of the partials composed the "figure" to be detected while the rest composed the ground. All components were modulated with a low-frequency random waveform (jitter) derived from a white noise signal low-pass filtered at 30 Hz. The same jitter waveform was applied to all components of the figure, though this waveform changed from tone to tone. The ground components were either coherently or incoherently modulated. For coherent grounds, the same waveform was applied to all components, while for incoherent grounds, independent waveforms were applied to each component. Figure jitters were always independent of ground jitters. Tones were synthesized at a sampling rate of 9 kHz and low-pass filtered at 3.6 kHz. Tone duration was 1 s with 200 ms raised cosine attack and decay ramps.
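The synthesis described above can be sketched roughly as follows. This is a minimal illustration, not the authors' synthesis code. Several details are assumptions: the jitter is built as a sum of random-phase sinusoids at 1 to 30 Hz (standing in for low-pass filtered white noise), the modulation is applied multiplicatively so that coherent components keep constant frequency ratios, and the names `make_jitter` and `synth_tone` are hypothetical. The 3.6 kHz output filter and the presentation level are not modelled.

```python
import math
import random

SAMPLE_RATE = 9000   # Hz, as stated in the text
RAMP_S = 0.2         # raised-cosine attack and decay, in seconds


def make_jitter(rng, n_samples, cutoff_hz=30):
    """Unit-rms random waveform with no energy above cutoff_hz (assumed construction)."""
    wave = [0.0] * n_samples
    for f in range(1, cutoff_hz + 1):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        step = 2.0 * math.pi * f / SAMPLE_RATE
        for i in range(n_samples):
            wave[i] += math.sin(phase + step * i)
    rms = math.sqrt(sum(v * v for v in wave) / n_samples)
    return [v / rms for v in wave]


def synth_tone(freqs, figure_idx, rms_width, coherent_ground, rng, dur_s=1.0):
    """Sum of equal-amplitude, random-phase partials with frequency jitter.

    rms_width is the rms modulation width as a fraction of each partial's
    frequency (0.01 means 1%). figure_idx holds indices into freqs.
    """
    n = int(SAMPLE_RATE * dur_s)
    figure_jitter = make_jitter(rng, n)   # shared by all figure partials
    ground_jitter = make_jitter(rng, n)   # shared only if the ground is coherent
    tone = [0.0] * n
    for k, f in enumerate(freqs):
        if k in figure_idx:
            jit = figure_jitter
        elif coherent_ground:
            jit = ground_jitter
        else:
            jit = make_jitter(rng, n)     # independent waveform per ground partial
        phase = rng.uniform(0.0, 2.0 * math.pi)
        for i in range(n):
            tone[i] += math.sin(phase)
            # Integrate the instantaneous frequency to get the running phase.
            phase += 2.0 * math.pi * f * (1.0 + rms_width * jit[i]) / SAMPLE_RATE
    n_ramp = int(SAMPLE_RATE * RAMP_S)
    for i in range(n_ramp):
        gain = 0.5 * (1.0 - math.cos(math.pi * i / n_ramp))
        tone[i] *= gain
        tone[n - 1 - i] *= gain
    return tone
```

Calling `synth_tone` twice with the same generator draws fresh jitter waveforms each time, which mirrors the statement that the waveform changed from tone to tone.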

Experimental Conditions

Figure conditions. Figures were composed of 1, 4 or 7 frequency components. For 1-component figures, only the coherent ground was presented. The component ranks for these figures included partials numbered 3, 5, 7, 9, 11, 13; that is, one of these components was modulated independently of the other 14. For the other figure types, both coherent and incoherent grounds were presented. The components were either all adjacent or interleaved with those of the ground. In 4-component figures, the lowest component was either of rank 2 or 5 (labeled fig4/2 and fig4/5, respectively). In 7-component figures, the lowest component was always of rank 2 (fig7/2). The possible combinations of figure component adjacency and ground coherence are shown in Figure 1 for the fig4/5 figure.
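The figure configurations above can be enumerated with a short sketch. The function names are hypothetical, and the interleaved disposition is assumed to take every other partial starting from the lowest figure rank, which matches the example ranks (2, 4, 6, 8 versus 2, 3, 4, 5) given later for fig4/2.

```python
def figure_ranks(n_components, lowest, interleaved):
    """Partial ranks of a figure; interleaved figures skip every other rank."""
    step = 2 if interleaved else 1
    return [lowest + step * i for i in range(n_components)]


def ground_ranks(figure, total=15):
    """All remaining ranks of the complex make up the ground."""
    return [r for r in range(1, total + 1) if r not in figure]


# The three multi-component figure types in both dispositions.
for label, n, low in (("fig4/2", 4, 2), ("fig4/5", 4, 5), ("fig7/2", 7, 2)):
    for interleaved in (False, True):
        fig = figure_ranks(n, low, interleaved)
        disposition = "interleaved" if interleaved else "adjacent"
        print(label, disposition, fig, "ground:", ground_ranks(fig))
```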

Figure 1. Schematic representation of stimulus configurations for a 4-component figure (fig4/5) with all components belonging to the same harmonic series.

Harmonicity conditions

Three types of frequency series were used: harmonic, perturbed harmonic and equal-Bark spacing. The Bark scale is a psychophysical scale that reflects the distance between activity patterns stimulated by frequency components on the basilar membrane in terms of critical band rate. Components separated by more than 1 Bark are considered not to interact substantially within a single auditory channel. For the harmonic series, 15 harmonics of 200 Hz were generated. All figure, adjacency and ground coherence conditions were tested with the harmonic series. For the perturbed harmonic series, each harmonic component was displaced randomly by 3-4.5% above or below its ideal harmonic value. This gave component spacings that were similar to those in the harmonic series but the periodicity of the waveform was perturbed. Only single-component and fig4/5 figures were presented with this series. The third series was composed of 8 or 9 partials with intercomponent spacings of 0.5, 1.0, or 1.5 Barks. This allows control of the degree of within-channel interaction between components, while destroying any periodicity. Only 4-component figures were tested in this condition. For figures with interleaved components, the lowest was of rank 2 and the total number was 9. For figures with adjacent components, the rank of the lowest component was 3 and the number of total partials was 8 in order to keep approximately the same number of partials in the two cases. These complex tones were centered in frequency on the geometric mean of the normal and perturbed harmonic series (1015 Hz).
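As an illustration of the first two series, the sketch below builds the 15 harmonics of 200 Hz and a perturbed version of them. The text does not say how the displacement was drawn, so a uniform magnitude between 3% and 4.5% with an equiprobable sign is assumed here; the function names are hypothetical.

```python
import random


def harmonic_series(f0=200.0, n=15):
    """Frequencies of the first n harmonics of f0, in Hz."""
    return [f0 * k for k in range(1, n + 1)]


def perturbed_series(freqs, rng, lo=0.03, hi=0.045):
    """Displace each component by lo..hi of its frequency, up or down.

    The uniform magnitude and 50/50 sign are assumptions for illustration.
    """
    out = []
    for f in freqs:
        shift = rng.uniform(lo, hi) * rng.choice((-1, 1))
        out.append(f * (1.0 + shift))
    return out


harmonic = harmonic_series()
perturbed = perturbed_series(harmonic, random.Random(0))
```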


In each trial subjects heard two tones separated by a silence of 200 ms. In one tone a figure was modulated independently of the ground components and in the other, no figure was present, i.e. all components were modulated either coherently or incoherently depending on the ground coherence condition in the test tone. All components were modulated at the same rms modulation width in a given trial. The task was to detect the interval containing the embedded figure and press the appropriate button. A feedback light was illuminated after each response. A 1-up, 2-down adaptive procedure was used to determine figure detection threshold at 71% performance as a function of rms frequency modulation width. Each run consisted of 12 turnarounds of which the last 8 were averaged to estimate threshold. The step-size was adjusted for each condition and each subject such that it was no more than 25% of the threshold modulation width. Since the psychometric functions were non-monotonic, care was taken to start each condition at a modulation width positioned near the upper end of the lowest positively sloped portion of the function.
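The tracking rule can be sketched as below. This is an illustrative reading of the procedure, not the authors' software: an additive step is assumed, the modulation width is floored at one step so it stays positive, and the `respond` callback and the `max_trials` guard are hypothetical additions. A 1-up, 2-down rule of this kind converges on roughly 71% correct.

```python
def staircase(respond, start, step, n_reversals=12, n_average=8, max_trials=1000):
    """1-up, 2-down track; returns the mean of the last n_average reversals.

    respond(level) must return True for a correct response at that
    modulation width. Returns None if too few reversals occur.
    """
    level = start
    correct_run = 0
    direction = 0          # -1 while descending, +1 while ascending
    reversals = []
    trials = 0
    while len(reversals) < n_reversals and trials < max_trials:
        trials += 1
        if respond(level):
            correct_run += 1
            if correct_run < 2:
                continue   # need two correct in a row before stepping down
            correct_run = 0
            move = -1
        else:
            correct_run = 0
            move = +1
        if direction != 0 and move != direction:
            reversals.append(level)   # the track turned around at this level
        direction = move
        level = max(level + move * step, step)
    if len(reversals) < n_reversals:
        return None
    return sum(reversals[-n_average:]) / n_average


# Idealized listener who is correct whenever the width is at least 1.0.
estimate = staircase(lambda width: width >= 1.0, start=2.0, step=0.25)
```

With this deterministic listener the track alternates between 0.75 and 1.0, so the estimate lands between the two. A real listener responds probabilistically, and the reversals scatter around the 71% point.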

Prior to each run, the subject was allowed to explore the stimulus set at will since there were great differences in the perceptual cues that indicated the presence of the figure for the different experimental conditions. One button presented a tone pair in which the figure was in the first tone. Another button presented a tone pair with the figure in the second tone. A third button changed the modulation width. Conditions were loosely blocked by stimulus type. Runs were performed until thresholds stabilized and then 3 threshold estimates for each experimental condition were obtained for each subject. These estimates were then averaged across subjects.

Figure 2. Mean detection thresholds of 1-component figures embedded in harmonic and perturbed harmonic series as a function of component rank. Data points represent the mean of 3 thresholds for each of 4 subjects, with the exception of rank = 3 where one subject did not attain threshold.


Thresholds for 1-component Figures

Figure 2 displays the mean thresholds for single-component figures for both harmonic and perturbed harmonic stimuli. These reflect the sensitivity to incoherent modulation on a single component. Thresholds decreased with increasing rank up to 7 and then remained at about 0.3% rms modulation width above that rank. Once the 7th component in these stimuli is reached, adjacent components are separated by less than a critical bandwidth and the degree of interaction of incoherent components within an auditory channel has apparently attained its maximum. The same qualitative results hold for the perturbed harmonic series except that thresholds tend to be 2-4 times higher. Thus when the background is less periodic, a greater modulation width is needed before the incoherence can be detected, though it is still quite possible as is witnessed by the relatively low thresholds (~1%) for the higher ranked components.

Thresholds for Multi-component Harmonic Figures

Detection thresholds for three multi-component figure conditions are shown in Figure 3.

Figure 3. Mean detection thresholds for harmonic figures with 4 or 7 components. Vertical bars indicate +/-1 standard deviation. Means represent 3 thresholds for each of 4 subjects with the following exceptions: fig4/5+coherent+adjacent: 3 subjects; incoherent+interleaved: 1 subject.

Coherent ground conditions

These conditions reflect the sensitivity to modulation incoherence. There were no differences among mean thresholds as a function of number of figure components or of component proximity for figures in either the adjacent or the interleaved dispositions. There was a sizable effect of adjacency however: thresholds for adjacent figures were about 4 times higher than those for interleaved figures. This is most likely due to a combined effect of increased proximity between figure and ground components and of an increase in the number of interacting figure and ground components. Proximity was greater in the interleaved condition compared to the adjacent condition since the harmonic numbers were higher (e.g. 2, 4, 6, 8 vs 2, 3, 4, 5 for fig4/2 with interleaved and adjacent components, respectively). The number of component interactions also increased since there were only two regions of interaction for adjacent figures (e.g. components 2/3 and 5/6 for fig4/2), whereas each figure component was surrounded by ground components for interleaved figures. The presence of within-channel incoherence alone is not sufficient to explain detection. One must also consider the number of channels that contain incoherent modulation.

A comparison of multi-component and single-component thresholds was made. Those for adjacent figures were higher than would be expected from the 1-component stimulus that is equivalent to the most proximal figure and ground components in the complex figure (e.g. compare fig4/2 in Fig. 3 with the 1-component figure at rank 5 in Fig. 2). The reverse is true for interleaved figures where the thresholds were lower than those for the most proximal single-component stimuli (e.g. compare fig4/2 with the 1-component figure at rank 9). These results also indicate that a simple within-channel incoherence detection mechanism would be too simplistic to explain the data.

Incoherent ground conditions

These conditions reflect the sensitivity to modulation coherence. Since the entire ground is incoherently modulated, the figure's coherence generates the only cues available to perform the task. The results are shown in the right half of Figure 3. In contrast to conditions where the ground is coherent and the figure components are adjacent, there are large effects of the proximity of figure components (fig4/2 vs fig4/5) as well as their number (fig4/5 vs fig7/2). Note that in the latter comparison, the two figures share the same upper 4 partials for an adjacent disposition of the components. This suggests that the lower components in the 7-component figure also contribute information to the detection task even though they are separately resolved by the auditory system. To perform this task the subject had to detect the region where adjacent components were coherently modulated. These data suggest that this comparison is performed across the array of stimulated auditory channels.

For interleaved figures, only 1 of the 4 subjects could perform the task and only for two of the figure conditions. In cases where a threshold could be measured, it was much larger than that for an adjacent figure in an incoherent ground. This indicates an inability to use cross-channel coherence unless some modulating periodicity information is present within a channel, as would be the case for the higher components in the two conditions for which one subject reached threshold. However, even here the modulation width values are so large as to fall completely outside the range of natural jitter found in physical forced-vibration systems (0.4% in sustained musical instrument tones and generally less than 1.5% in trained voice in the absence of vibrato) [4].

Comparison of Harmonic and Inharmonic 4-component Figures

Detection thresholds for multi-component figures in three harmonicity conditions are shown in Figure 4.

Figure 4. Mean detection thresholds for multi-component figures with harmonic, perturbed harmonic or equal-Bark spaced components. Vertical bars indicate +/-1 standard deviation. Means represent 3 thresholds for each subject with the following exceptions: perturbed harm/incoherent/adjacent -- 2 of 4 subjects; equal-spaced/coherent -- 3 of 4 subjects.

Coherent ground conditions

These conditions are shown in the two positions on the left of Figure 4. Perturbed harmonic conditions yielded thresholds that were about 4 times higher than those for comparable harmonic conditions. The equal-Bark spaced series yielded even higher thresholds. Thresholds were measurable for only 3 of the 4 subjects in these latter conditions. Thresholds for the adjacent disposition are higher than those for the interleaved disposition for all three types of frequency series. The effect of component spacing in the equal-Bark series depends on the adjacency condition of the figure. A 1.5 Bark spacing results in the highest thresholds for both dispositions. A 1 Bark spacing yields the lowest thresholds for the adjacent disposition, whereas the 0.5 Bark spacing gives the lowest thresholds in the interleaved disposition. It should be noted that the actual spacing of the figure components in the interleaved disposition is twice that in the adjacent disposition. It is surprising that in spite of the destruction of harmonicity in the equal-spaced inharmonic series, most subjects are still quite sensitive to incoherent modulation between proximal figure and ground components, even when the components are separated by more than a critical band. For all harmonicity conditions, as the amount of incoherence across channels increases with the interleaving of figure and ground components, the detectability of incoherence also increases. There is, then, information about incoherent modulation that is not necessarily signaled by a loss of periodicity in the activity within auditory channels.

Incoherent ground conditions

Since only one subject could perform the detection task for interleaved figures in the presence of an incoherent ground for harmonic and equal-spaced (1.5 Bark) stimuli, only the adjacent disposition is shown on the right of Figure 4. Data were collected for only 2 subjects for the equal-Bark spaced series in the presence of an incoherent ground. Thresholds are virtually identical for harmonic and perturbed harmonic fig4/5 conditions (which have mean component spacings of 1.04 Barks) as well as for the 1.0 Bark-spaced figures. Subjects cannot perform the 0.5 Bark condition. Thresholds for 1.5 Bark spacing are greater than those for 1.0 Bark spacing, and those for the harmonic fig4/2 condition (mean component spacing of 1.55 Barks) are even higher. In this condition, the main stimulus factor that appears to covary with thresholds is the mean component spacing, though the mean spacing for fig7/2, which has the lowest threshold, is about 1.3 Barks. The relative lack of effect of harmonicity is quite surprising and we have no ready explanation for it.
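The mean component spacings quoted above can be approximated with a short calculation. The paper does not state which Hz-to-Bark conversion was used; the sketch below assumes the Zwicker and Terhardt (1980) analytic approximation, and the function names are hypothetical. Adjacent harmonic figures on a 200 Hz fundamental are assumed, as in the conditions discussed here.

```python
import math


def hz_to_bark(f_hz):
    """Critical-band rate in Bark (Zwicker & Terhardt 1980 approximation, assumed)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def mean_bark_spacing(freqs_hz):
    """Average distance in Bark between neighbouring components."""
    barks = [hz_to_bark(f) for f in sorted(freqs_hz)]
    return (barks[-1] - barks[0]) / (len(barks) - 1)


F0 = 200.0
fig4_5 = [F0 * k for k in range(5, 9)]   # harmonics 5-8
fig4_2 = [F0 * k for k in range(2, 6)]   # harmonics 2-5
fig7_2 = [F0 * k for k in range(2, 9)]   # harmonics 2-8

for name, freqs in (("fig4/5", fig4_5), ("fig4/2", fig4_2), ("fig7/2", fig7_2)):
    print(name, round(mean_bark_spacing(freqs), 2))
```

Under this assumed conversion the three figures come out at roughly 1.04, 1.55 and 1.3 Bark, in line with the spacings reported in the text.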


One of the main results of this study can be gleaned from a comparison of interleaved figures embedded in coherent or incoherent grounds. In order for subjects to perform the figure detection task in an incoherent ground with an interleaved figure, they would need access to information on modulation coherence across auditory channels. The fact that they are generally unable to perform this task suggests that this information is not available in a raw form.

Another main result is that the auditory system is very sensitive to within-channel modulation incoherence even for inharmonic sounds, at least in the case of complex sounds composed of sinusoidal components and presented in quiet. It is also relatively sensitive to coherent modulation of adjacent figure components embedded in an incoherent ground even when these components have inharmonic relations. This raises some very interesting questions about auditory sensitivity to dynamic stimuli that cannot be totally explained by mechanisms based purely on detection of periodic activity in a given set of auditory channels [2].

The task of detecting frequency modulation coherence and incoherence may involve storing a representation of a coherence profile across channels and comparing it to the succeeding tone. This profile has the greatest clarity in the case of two independently modulated harmonic series as is attested to by the very low figure detection thresholds for harmonic figures in a coherent harmonic ground. When regions of coherence are periodic, the detection of incoherence is very sensitive. Likewise, when the ground is incoherent and the task is to detect a region of coherent modulation (adjacent disposition), this is more easily performed when the figure is harmonic, and thus periodic. However, this task can also be performed with inharmonic stimuli, though the thresholds increase with increasing departure of the series from harmonicity and with decreasing proximity of frequency components.

These results lead us to consider a model of frequency modulation coherence and incoherence detection based on the storage of a global activity profile that can then be compared with a succeeding tone. This profile must necessarily contain a great deal of fine-grained temporal information to account for the sensitivity of subjects to modulation coherence of harmonic as well as inharmonic sounds.


[1] Bregman, A. & Doehring, P. Fusion of simultaneous tonal glides: The role of parallelness and simple frequency relations. Perc. & Psychophys., 1984, 36, 251.

[2] Carlyon, R. & Stubbs, R. Detecting single-cycle frequency modulation imposed on sinusoidal, harmonic and inharmonic carriers. J. Acoust. Soc. Am., 1989, 85, 2563.

[3] Marin, C.M.H. & McAdams, S. Segregation of concurrent sounds. II: Effects of spectral envelope tracing, frequency modulation coherence, and frequency modulation width. J. Acoust. Soc. Am., 1990 (accepted subject to revisions).

[4] McAdams, S. Spectral fusion, spectral parsing and the formation of auditory images, Chap. 3. PhD. Dissertation, Stanford University, Stanford, California, 1984.

[5] McAdams, S. Segregation of concurrent sounds. I: Effects of frequency modulation coherence. J. Acoust. Soc. Am., 1989, 86, 2148.
