Abstract
Classification and generation of sound require a modeling approach
that takes into account not only the common sound features but also
the statistical behaviour of the sound components. Such
statistics include the stationary random fluctuations in amplitude and
frequency that occur during sustained portions of the sound and the
stochastic behaviour of sound during its lifetime.
So far, our work has considered statistical models of the variations
that occur during a sustained portion of the sound. Various aspects,
such as phase coupling and its relation to Higher Order Statistical
(HOS) analysis, were investigated and shown to be important for sound
characterization.
The purpose of the current work is to extend this research towards
modeling the temporal behaviour of sound. We consider a unified model
that combines spectral and HOS features, and we apply a new method for
comparing the temporal evolutions of these features. The envisioned
applications are broad and include
characterisation for analysis/synthesis, coding and sound database
retrieval.
In order to understand the problems in comparing sounds, one must note
that sound behaviour unfolds on different temporal scales. These
include short-term correlations related to the timbral properties
(such as formants), correlations due to the pitch period, slower
modulations such as vibrato, expressive inflections, and transitions
between different notes. Thus a sequence that might seem stationary
on one time scale departs from stationarity and ergodicity on another
time scale.
This situation makes it difficult to assess the right probability
function for the sequence of samples. Moreover, for purposes of
classification, similarity measures between sounds are usually based
on specific models (such as Markov models of a certain order) or on a
priori knowledge of the parametric shape of the probability
distribution, a dependence which we would like to avoid.
A possible solution for this problem is to consider the Markovian
property at different time scales by using multiple features and
capturing their temporal behaviours. Thus, we consider a model
composed of features that represent stationary segments (states) and
transitions between these states.
For the short-time description of the sound we use a set of spectral
envelopes (Mel Frequency Cepstral Coefficients (MFCC), as in speech
processing), which allow up to 90% data reduction in the sound
representation.
Moreover, a vector quantisation (VQ) procedure further reduces the set
of envelopes by optimally representing the complete dataset with just
a few typical envelopes.
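
The quantisation step can be illustrated with a minimal sketch. It
assumes the cepstral vectors have already been computed (MFCC
extraction is not shown) and uses plain k-means (Lloyd) iterations as
the VQ procedure; the abstract does not state which VQ algorithm is
used, so this choice and the names train_codebook and quantise are
assumptions made for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, vector):
    """Index of the codeword closest to the given vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(codebook[i], vector))


def train_codebook(vectors, size, iterations=20, seed=0):
    """Assumed VQ training: k-means iterations over the feature vectors."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iterations):
        clusters = [[] for _ in range(size)]
        for v in vectors:
            clusters[nearest(codebook, v)].append(v)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous codeword
                codebook[i] = [sum(col) / len(members)
                               for col in zip(*members)]
    return codebook


def quantise(vectors, codebook):
    """Replace each frame by the index of its typical envelope."""
    return [nearest(codebook, v) for v in vectors]


# Toy two-dimensional "envelopes" forming two well-separated groups.
frames = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.5],
          [10.0, 10.0], [10.0, 11.0], [10.0, 10.5]]
codebook = train_codebook(frames, size=2)
symbols = quantise(frames, codebook)
```

The resulting symbol sequence is the kind of discrete representation
on which transitions between typical envelopes can then be studied.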
In order to capture the information present in higher cepstral
coefficients as well, additional parameters were used. These higher
cepstral coefficients correspond to the excitation signal (also called
the residual). These parameters are variations in the fundamental
frequency and HOS parameters that describe the properties of the
residual (such as kurtosis, which is related to phase coupling).
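
As an illustration of one such HOS parameter, the following sketch
computes the excess kurtosis (standardised fourth moment minus 3) of a
block of residual samples. How the residual is obtained is not
specified here, so the input is simply assumed to be a list of samples;
the function name excess_kurtosis is hypothetical.

```python
def excess_kurtosis(samples):
    """Fourth central moment over squared variance, minus 3.

    The value is 0 for a Gaussian signal, positive for impulsive
    (heavy-tailed) signals and negative for flat-topped ones.
    """
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 ** 2) - 3.0


# A two-level signal is maximally flat; a single spike is impulsive.
flat = excess_kurtosis([-1.0, 1.0, -1.0, 1.0])
spiky = excess_kurtosis([0.0] * 9 + [10.0])
```

In this toy case the two-level signal yields a negative value and the
spiky one a positive value, which is the contrast the feature is meant
to capture.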
The investigation into temporal structure of the signal was done along
two lines:
1). The short-time temporal evolution is described by specific
features such as cepstral "difference" and "acceleration". The
evolution is considered in terms of transitions between "typical"
envelopes found by VQ. This method gives excellent performance for
limited data sets such as isolated notes by matching both the
instantaneous spectral shapes and their evolution.
2). For the long-term behaviour of the signal we applied
information-theoretic tools for classification of the feature
sequences. Using the Ziv-Merhav "universal" sequence classification
method, the cross-entropy comparison is done without estimation of a
specific Markov model. The model requires long feature sequences to
reveal its structure and is applicable to complex sounds such as note
sequences and some non-musical sounds.
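
The idea behind the second line of investigation can be sketched as
follows. A test sequence of VQ symbols is parsed into the longest
phrases that occur somewhere in a reference sequence; few long phrases
indicate similar statistics, many short phrases indicate dissimilar
ones. The score c * log2(n) / n used below is a simplified stand-in for
the cross-entropy term, since the exact normalisation of the method is
not given in this abstract. The function names are hypothetical and the
naive substring search is chosen for clarity, not efficiency.

```python
import math


def occurs(reference, pattern):
    """True if pattern appears as a contiguous run inside reference."""
    m = len(pattern)
    return any(reference[i:i + m] == pattern
               for i in range(len(reference) - m + 1))


def cross_parse_count(test, reference):
    """Number of phrases when test is parsed with respect to reference."""
    test, reference = tuple(test), tuple(reference)
    count, pos = 0, 0
    while pos < len(test):
        length = 1
        # Grow the candidate phrase while it is still found in reference.
        while (pos + length <= len(test)
               and occurs(reference, test[pos:pos + length])):
            length += 1
        # The longest match has length - 1 symbols; an unseen symbol
        # still consumes one position and counts as a phrase.
        pos += max(length - 1, 1)
        count += 1
    return count


def cross_entropy_score(test, reference):
    """Simplified score in bits per symbol (an assumed normalisation)."""
    n = len(test)
    return cross_parse_count(test, reference) * math.log2(n) / n


reference = [0, 1, 0, 1, 0, 1]
similar = cross_entropy_score([0, 1, 0, 1], reference)
different = cross_entropy_score([1, 1, 1, 1], reference)
```

No Markov order is fixed anywhere in this procedure: the parsing adapts
the phrase length to whatever structure the reference sequence
contains, which is why long feature sequences are needed.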
The model, classification scheme and refinements for specific types of
sounds will be presented in the paper.
