Animal Vocalization Analysis: the gPLP

Wayne Staab
December 2, 2012
Prologue:

Every now and then one has to release the nerd inside.  This blog is one of those times.  However, this venture turned out worse than I had imagined.  So, it might be helpful if I provided some background as to why I delved into this topic.

As some of you might have noticed, a couple of my previous blogs have been about animals and the sounds they make.  This was my way to show a few of my favorite videos and photos and still provide a blog that “in part” was related to hearing.  It was the “in part” issue that got me started on the current blog because acoustical descriptions of animal sounds that I was able to find were minimal at best.  Most are merely audio recordings and essentially fail to describe the sound.  And, I still had a few animals I wanted to write about, specifically the leopard, cape buffalo, and rhino, and I could find little scientific information about the sounds they produced.  My investigative efforts produced what looked like a good wading pool of information, and like a politician with no regard for prior investigation, I jumped in with both feet.  Unfortunately, the wading pool was deeper than I anticipated, and I found myself drowning in details with only small knowledge improvements to share in this blog.

So, while this blog was an attempt to find a method to meaningfully and easily describe animal sound analyses, it seems that such methods have yet to reach animal researchers – at least for the animals I am most interested in.  Like them, I would most likely be more interested in the excitement of field observations than in laboratory work.  As a result, the information in this blog comes primarily from a single source, Clemins and Johnson in 2006 {{1}}[[1]] Clemins, PJ and Johnson, MT. 2006. Generalized perceptual linear prediction features for animal vocalization analysis, J. Acoust. Soc. Am., Vol. 120, No. 1, pp 527-534[[1]], with some of the text taken directly from the article.

Animal Vocalization Analysis: the gPLP (generalized perceptual linear prediction) model

One of the primary tasks in analyzing animal vocalizations is to determine and measure acoustically relevant features.  For the most part, the features used are based on the entire vocalization, often extracted by hand from spectrogram plots.  Some of the features are common to analysis of human vocalizations as well:

  • Duration
  • Fundamental frequency
  • Amplitude
  • Spectral information such as Fourier transform coefficients

However, these traditional features (referred to as global features) are unable to capture temporally fine details of animal vocalization because each feature has only one value for the entire vocalization.  Additionally, these features are often susceptible to researcher bias because the features are determined interactively.

An alternative to this feature extraction paradigm is to divide signals into frames and then extract features automatically on a frame-by-frame basis.  Doing this generates a feature matrix for each vocalization that captures information about how the vocalization changes over time – something that is characteristic of most animal sounds {{2}}[[2]] Pfefferle, D., West, PM, Grinnell, J., Packer, C., and Fischer, J. 2007. Do acoustic features of lion, Panthera leo, roars reflect sex and male condition? J. Acoust. Soc. Am., Vol. 121, No. 6, pp 3947-3953[[2]] (Figure 1).

Figure 1. (a) Spectrogram of a complete male lion call (FFT length: 1024; Frame [%]: 100; Window: Hamming; Overlap: 87.5). The call starts with a series of soft introductory moans (1), followed by a series of full-throated roars (2), and terminates with a sequence of grunts (3). (b) Example of one call unit extracted from the series of full-throated roars (2). (FFT length: 1024; Frame [%]: 100; Window: Hamming; Overlap: 87.5; Sampling frequency: 5000 Hz).


Another limitation of traditional features, either global or frame based, is that they typically do not use explicit information about the perceptual abilities of the species under study in the feature extraction process.

PLP vs gPLP

Analyses of the sounds that animals make have generally employed a variety of ad hoc procedures.  More recently, however, a new feature extraction model (gPLP – generalized perceptual linear prediction) has been developed to calculate a set of perceptually relevant features for digital signal analysis of animal vocalizations.  The model is a generalized adaptation of the perceptual linear prediction (PLP) model popular in human speech processing and includes the same components.  However, gPLP incorporates perceptual information, such as frequency warping and equal loudness normalization, into the feature extraction process.  Because such perceptual information is available for a number of animal species, this new approach integrates that information into a generalized model to extract perceptually relevant features for a particular species.

PLP (perceptual linear prediction) refers to the feature extraction procedure built around human perceptual data, whereas gPLP (generalized perceptual linear prediction) refers to the generalized model that applies the same frame-based extraction to other species.

The gPLP feature extraction model generates features based on the source-filter model of speech production.  Although originally developed for human speech processing, this model has been shown to be applicable to the vocalizations of terrestrial mammals to describe their vocal production mechanisms.  For example, in land mammals the source excitation – modeled as a pulse train for voiced sound or white noise for unvoiced sound – is produced by the glottis.  In birds, the corresponding source is the tympaniform membranes of the syrinx, and in marine mammals, the air sacs.  This excitation then propagates through a filter consisting of the vocal tract and nasal cavity in terrestrial animals, or the body cavity and melon in marine animals.  The gPLP model is designed to suppress excitation information and quantify the vocal tract filter characteristics of the vocalizations.  The excitation information includes the fundamental frequency contour, while vocal tract characteristics are represented by formant information.  In human speech, vocal tract features carry the majority of the information.  However, in animals, there is reason to believe that excitation information is also important for vocalization discrimination.  In fact, many studies have used fundamental frequency measures to classify vocalizations {{3}}[[3]] Buck, JR and Tyack, PL. 1993. A quantitative measure of similarity for Tursiops truncatus signature whistles, J. Acoust. Soc. Am., Vol. 94, No. 5, pp 2497-2506[[3]] {{4}}[[4]] Darden, S., Dabelsteen, T., and Pedersen, SB. 2003. A potential tool for swift fox (Vulpes velox) conservation: Individuality of long-range barking sequences, J. Mammal., Vol. 84, No. 4, pp 1417-1427[[4]].
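To make the source-filter idea concrete, here is a rough sketch in Python (my own illustration, not code from the article); the sampling rate, fundamental frequency, and filter coefficients are arbitrary example values, not measurements from any species:

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                                         # example sampling rate (Hz)
f0 = 120                                          # hypothetical fundamental frequency (Hz)
n = np.arange(fs)                                 # one second of samples

# Source: a glottal pulse train models a voiced excitation
excitation = (n % (fs // f0) == 0).astype(float)

# Filter: a crude all-pole "vocal tract" with a single formant-like resonance;
# the coefficients are arbitrary examples chosen only to be stable
vocal_tract_a = [1.0, -1.3, 0.9]
voiced_sound = lfilter([1.0], vocal_tract_a, excitation)
```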

The gPLP feature extraction model generates features in what is called the discrete cepstral domain*:

c[n] = F⁻¹{log[F(s[n])]}

Suffice it to say that this formula is built from two primary elements: a) the Fourier transform (F) and its inverse (F⁻¹), and b) the original sampled time-domain signal (s[n]).  This is a preferred speech processing domain because the general shape of the spectrum is accurately described by the first few cepstral coefficients, yielding an efficient signal representation.  The cepstral domain is particularly appropriate for source-filter model analysis because the logarithm operation effectively separates the excitation from the vocal tract filter.  Also, because the cepstral values tend to be relatively uncorrelated with each other, the coefficients are well suited to statistical analysis methods.
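For the curious, here is a minimal sketch of that operation in Python with NumPy (my own illustration, not code from the article); it computes the common real-cepstrum variant, which takes the log of the magnitude spectrum:

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one signal frame: c[n] = F^-1{ log|F(s[n])| }.

    The first few coefficients describe the smooth spectral envelope
    (vocal tract filter); higher coefficients carry excitation detail
    such as the fundamental frequency.
    """
    spectrum = np.fft.rfft(frame)                # F{s[n]}
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # log magnitude; small offset avoids log(0)
    return np.fft.irfft(log_mag)                 # inverse transform back to the "quefrency" axis
```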

Figure 2 diagrams the feature extraction process and how it is tailored to the species under study.  The components identified by dotted boxes indicate where species-specific perceptual information is incorporated into the model.

The following information relates to Figure 2:

Figure 2. gPLP block diagram.

Preprocessing

The vocalization is first filtered using a preemphasis filter.  This gives greater weight to higher frequencies to emphasize the higher frequency formants and reduce spectral tilt.  It also reduces the dynamic range of the spectrum so that the spectrum is more easily approximated by the autoregressive modeling component.

The vocalization is then broken into frames, with the frame size usually chosen to include several fundamental frequency peaks (30+ msec, depending on the species).  Framing allows the spectrum to be estimated from quasistationary segments of the signal, which improves the precision of the spectral estimate.
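A rough sketch of these two preprocessing steps (my own illustration; the preemphasis coefficient of 0.97 and the 30 msec / 10 msec frame and step sizes are common example values, not requirements of the model):

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    """First-order preemphasis: y[n] = x[n] - alpha*x[n-1].
    Boosts high frequencies and reduces spectral tilt."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, fs, frame_ms=30.0, step_ms=10.0):
    """Slice the signal into overlapping frames, each long enough to span
    several fundamental frequency periods."""
    frame_len = int(fs * frame_ms / 1000)
    step = int(fs * step_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // step)
    return np.stack([signal[i * step : i * step + frame_len] for i in range(n_frames)])
```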

Power spectral estimation

A discrete FFT is used to estimate the power spectrum based on the windowed frame of the signal, but other spectral estimation methods could be used as well.
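A hedged sketch of this step (the Hamming window and FFT length of 1024 are example choices, echoing the settings noted in the Figure 1 caption, rather than fixed requirements):

```python
import numpy as np

def frame_power_spectrum(frame, nfft=1024):
    """Periodogram estimate of the power spectrum for one windowed frame."""
    windowed = frame * np.hamming(len(frame))    # taper the frame edges
    spectrum = np.fft.rfft(windowed, n=nfft)     # one-sided FFT
    return (np.abs(spectrum) ** 2) / len(frame)  # power at nfft//2 + 1 frequency bins
```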

Filter bank analysis

It is at this stage that the power spectrum is modified, taking into account various psychoacoustic phenomena; specifically, frequency masking and the nonlinear mapping between cochlear position and frequency sensitivity.

The frequency masking model implemented uses simplified triangular-shaped filters to approximate the critical band masking curve, rather than the computationally complex critical band shapes of Fletcher {{5}}[[5]] Fletcher, H. 1940. Auditory patterns, Rev. Mod. Phys., Vol. 12, pp 47-65[[5]].  Besides, very little is known about critical bands (auditory filter shapes) in animals, so there is little reason at this time to use approaches more complicated than necessary.

Nonlinear mapping goes by a number of different names (frequency warping, frequency mapping, etc.).  Frequency warping is a process in which one spectral representation, on a certain frequency scale and with a certain frequency resolution, is mapped to another representation on a new frequency scale (something we see in some of our hearing aids today).  The new representation has a uniform frequency resolution on the new scale – however, it has a non-uniform resolution when observed from the old scale.  In doing this, warping changes the function’s local spectral density.  Of interest is that the warping of the frequency axis can be tuned so that it resembles the frequency distribution of critical bands along the basilar membrane of the ear.  In the case of most animals, however, this mapping is not precisely known.
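In gPLP, the warping follows the Greenwood cochlear map function, for which species-specific constants exist for some animals (see Figure 3).  Below is a rough sketch of triangular filters spaced evenly along a Greenwood-warped axis; the constants shown are the commonly cited human values, and the function and parameter names are my own, so treat this only as an illustration:

```python
import numpy as np

def greenwood(x, A=165.4, a=2.1, k=0.88):
    """Greenwood cochlear map: frequency (Hz) at relative cochlear position x in [0, 1]."""
    return A * (10.0 ** (a * x) - k)

def greenwood_inverse(f, A=165.4, a=2.1, k=0.88):
    """Relative cochlear position for frequency f (Hz)."""
    return np.log10(f / A + k) / a

def triangular_filterbank(n_filters, nfft, fs, f_lo=20.0, f_hi=None):
    """Triangular filters with edges spaced evenly along the warped (cochlear position) axis."""
    f_hi = f_hi or fs / 2.0
    # Filter edge frequencies: equally spaced in cochlear position, mapped back to Hz
    positions = np.linspace(greenwood_inverse(f_lo), greenwood_inverse(f_hi), n_filters + 2)
    edges_hz = greenwood(positions)
    bins = np.floor((nfft + 1) * edges_hz / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, center, hi = bins[m - 1], bins[m], bins[m + 1]
        for k_bin in range(lo, center):                      # rising slope
            fbank[m - 1, k_bin] = (k_bin - lo) / max(center - lo, 1)
        for k_bin in range(center, hi):                      # falling slope
            fbank[m - 1, k_bin] = (hi - k_bin) / max(hi - center, 1)
    return fbank  # apply as fbank @ frame_power_spectrum(frame) to get filter bank energies
```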

Equal loudness normalization

Once the filter bank energies are calculated, an equal loudness curve is used to normalize the filter bank energies, approximated from the species audiogram (provided one has been obtained, which seems to be rare).
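A minimal sketch of how an audiogram could be turned into equal loudness weights for the filter bank channels (my own illustration; the audiogram points in the usage comment are placeholders, not real data for any species):

```python
import numpy as np

def equal_loudness_weights(center_freqs_hz, audiogram_freqs_hz, audiogram_thresholds_db):
    """Weight each filter bank channel by the species' relative sensitivity,
    interpolated from audiogram thresholds (lower threshold -> higher weight)."""
    thresholds = np.interp(center_freqs_hz, audiogram_freqs_hz, audiogram_thresholds_db)
    relative_db = thresholds - thresholds.min()   # 0 dB at the most sensitive frequency
    return 10.0 ** (-relative_db / 10.0)          # convert the dB difference to a power-domain weight

# Hypothetical usage:
# weights = equal_loudness_weights(center_freqs, [100, 1000, 8000], [40.0, 10.0, 25.0])
# normalized_energies = filterbank_energies * weights
```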

Intensity-loudness power law

The last psychoacoustic related operation is the application of the intensity-loudness power law as utilized with humans.  Although this exact relationship may not hold for other species, it is likely that the structural similarities between species yield a comparable correspondence between power and loudness.  This relationship may also be different for marine species because of the differences in the propagation of sound through air and water.  Regardless of the appropriate power coefficient, this operation is beneficial from a mathematical modeling sense because it reduces the spectrum’s dynamic range to make the normalized filter bank energies  more easily modeled by a low-order autoregressive all-pole model.
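In code, this step is a single compressive nonlinearity.  The 0.33 exponent shown is the value used for humans in standard PLP; as noted above, the appropriate exponent for other species is uncertain:

```python
import numpy as np

def intensity_to_loudness(normalized_energies, exponent=0.33):
    """Cube-root-like compression (intensity-loudness power law).
    Reduces the spectrum's dynamic range so that a low-order
    all-pole model can approximate it more easily."""
    return np.power(normalized_energies, exponent)
```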

Autoregressive modeling

The last two components of the gPLP model transform the filter bank energies into more mathematically robust features.  The appropriate order of the LP analysis for various species is dependent on the number of harmonics present in the vocalization, the relative complexity of the power spectrum, and the task being performed.
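A sketch of this step, assuming the common PLP-style approach: treat the compressed filter bank energies as a sampled auditory spectrum, recover autocorrelation values by inverse FFT, and fit an all-pole model with the Levinson-Durbin recursion (the model order of 8 is only an example):

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation r[0..order] -> LP coefficients a (a[0] = 1)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / error
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]   # reflection-coefficient update
        error *= (1.0 - k * k)
    return a, error

def auditory_spectrum_to_lpc(compressed_energies, order=8):
    """Treat the compressed filter bank energies as a one-sided power spectrum,
    recover autocorrelation values by inverse FFT, then fit an all-pole (LP) model."""
    # Mirror to a symmetric spectrum before the inverse FFT
    spectrum = np.concatenate([compressed_energies, compressed_energies[-2:0:-1]])
    autocorr = np.fft.ifft(spectrum).real
    return levinson_durbin(autocorr[:order + 1], order)
```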

Cepstral* domain transform

The autoregressive coefficients from the LP analysis can be transformed directly into equivalent cepstral coefficients using a recursive formula.  Cepstral coefficients are generally less correlated with each other than autoregressive coefficients because they are based on an orthonormal set of functions.  In their final form, the acoustic features can also be visualized more easily (Figure 3), where the warped, perceptually based frequency scale is evident.
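A sketch of that recursion (my own rendering of the standard LP-to-cepstrum conversion, using the same coefficient convention as the Levinson-Durbin sketch above):

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Convert LP coefficients a (with a[0] = 1) into cepstral coefficients
    using the standard recursion; no FFT is needed."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = a[m] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k]
        c[m] = -acc
    return c[1:]   # the zeroth (gain) coefficient is omitted here for simplicity
```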

Figure 3. Perceptual spectrograms. The top plots are traditional FFT-based spectrograms, while the bottom plots are perceptual spectrograms created using gPLP features. The left plots are of an African elephant’s noisy rumble, and the right plots are of a beluga whale’s whistle. Notice how the perceptual spectrogram enhances the peaks and valleys of the spectrum and warps the frequency axis according to the Greenwood cochlear map function.

Summary:

Collectively, the features generated by the gPLP model outlined can be used to perform many types of analyses on animal vocalizations.

gPLP spectrograms are shown to enhance the spectral peaks and suppress broadband background noise.  For an individual (speaker) identification task, the perceptual information included in the gPLP feature extraction model improves classification accuracy.  Finally, MANOVA analysis can show that individual animals produce significantly different vocalizations, which is consistent with the speaker identification results.

gPLP coefficients can be added to a feature vector of traditional features before a statistical analysis; and, because they are relatively uncorrelated with each other, they can be added either before or after principal component analysis or a related technique.

Finally, gPLP coefficients have no interpretive bias and decrease analysis time because they can be automatically extracted from the vocalization.  Because of its efficiency and adaptability to various species’ perceptual abilities, the gPLP model for feature extraction is an innovative and valuable addition to current tools available for bioacoustic signal analysis.  And, I can’t help but wonder if such feature extraction might not be beneficial in measuring hearing aid amplification systems.

*Cepstral is a modification of the word “spectral,” in which the first four letters are reversed.  It is identified as providing information about the rate of change in different spectrum bands – something that helps identify many animal sounds because they change in spectral content as they are prolonged.
