Top Post of 2017 for Hear the Music

Marshall Chasin
December 12, 2017

Speech is not a broadband signal… but music is

We tend to be biased, both in our training and in the technologies that we use. We tend to look at things in terms of spectra or frequencies. Phrases such as “bandwidth” and “long term average speech spectrum” show this bias. The long term average speech spectrum, which is averaged over time, is indeed a broad bandwidth spectrum made up of lower frequency vowels and higher frequency consonants, but the impression it gives that speech is a broadband signal is actually false.

At any point in time, speech is either low frequency OR high frequency, but not both. A speech utterance, over time, may be LOW FREQUENCY VOWEL then HIGH FREQUENCY CONSONANT then SOMETHING ELSE, but speech will never have both low frequency emphasis and high frequency emphasis at any one point in time.

Speech is sequential. One speech segment follows another in time; two segments never occur at the same time.

In contrast, music is broadband and very rarely narrow band. 

With the exception of percussion sounds, music is made up of a low frequency fundamental note (or tonic) and then a series of progressively higher frequency harmonics whose amplitudes and exact frequency location define the timbre.  It is impossible for music to be a narrow band signal.
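To make the harmonic structure concrete, here is a minimal sketch that lists the harmonics of a single note as integer multiples of its fundamental. The note (A3 at 220 Hz) and the 8000 Hz upper limit are illustrative assumptions, not values from this post, and the function name is hypothetical.

```python
def harmonic_series(fundamental_hz, upper_limit_hz):
    """Return the harmonics (integer multiples of the fundamental) up to a limit."""
    harmonics = []
    n = 1
    while n * fundamental_hz <= upper_limit_hz:
        harmonics.append(n * fundamental_hz)
        n += 1
    return harmonics


# Assumed example: A3 (220 Hz) within an assumed 8000 Hz bandwidth.
a3 = harmonic_series(220.0, 8000.0)

# Number of harmonics in the band, and the first five of them.
print(len(a3), a3[:5])
```

Even this one low note spreads energy from 220 Hz to nearly 8000 Hz at the same instant, which is the sense in which music is broadband.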

It is actually a paradox that (1) hearing aids have one receiver that needs to have similar efficiency in both the high frequency and the low frequency regions and (2) musicians “prefer” in-ear monitors and earphones that have more than one receiver.  If anything, it should be the opposite.  I would suspect that the musicians’ preference for more receivers (and drivers) is a marketing element where “more” may be perceived as “better”.

At any one point in time, musicians should be wearing a single receiver, single microphone, and single bandwidth in-ear monitor.  This will ensure that what is generated in the lower frequency region (the fundamental or tonic) will have a well-defined amplitude (and frequency) relationship with the higher frequency harmonics.  This can only be achieved with a truly single channel system: a “less is more” solution.

This same set of constraints does not hold for speech.  If speech contains a vowel or nasal (collectively called sonorants), it is true that there are well-defined harmonics that generate a series of resonances or formants, but for good intelligibility one only needs to have energy up to about 3500 Hz.  Indeed, telephones only carry information up to about 3500 Hz.  If speech contains a sibilant consonant (one of the class known as obstruents: ‘s’, ‘sh’, ‘th’, ‘f’, …), there are no harmonics and minimal sound energy below 2500 Hz.  Sibilant consonants can extend beyond 12,000 Hz, but have essentially no energy below 2500 Hz.

Speech is either low frequency sonorant (with well-defined harmonics) or high frequency obstruent (no harmonics); at any one point in time it’s one or the other, not both.  Music must have both low and high frequency harmonics, and the exact frequencies and amplitudes of the harmonics provide much of the definition to music.

This also has ramifications for the use of frequency transposition or shifting.

It makes perfect sense to use a form of frequency transposition or shifting for speech.  This alters the high frequency bands of speech where no harmonics exist.  Moving a band of speech (e.g. ‘s’) to a slightly lower frequency region will not alter any of the harmonic relationships.

But for music, which is defined only by harmonic relationships in both the lower and the higher frequency regions, frequency transposition or shifting will alter these higher frequency harmonics.
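A short numerical sketch shows why. It assumes a deliberately simplified form of transposition, in which every component above a cutoff is lowered by a fixed number of hertz; actual hearing aid algorithms differ and vary by manufacturer. The 220 Hz fundamental, the 3000 Hz cutoff, the 500 Hz shift, and the function names are all hypothetical values chosen for illustration.

```python
def shift_above(freqs_hz, cutoff_hz, shift_hz):
    """Lower every component above cutoff_hz by a fixed shift_hz.

    This is a simplified stand-in for frequency transposition,
    not any manufacturer's actual algorithm.
    """
    return [f - shift_hz if f > cutoff_hz else f for f in freqs_hz]


def is_harmonic(freq_hz, fundamental_hz, tolerance=1e-9):
    """True if freq_hz is an integer multiple of the fundamental."""
    ratio = freq_hz / fundamental_hz
    return abs(ratio - round(ratio)) < tolerance


fundamental = 220.0  # assumed note (A3)
harmonics = [n * fundamental for n in range(1, 21)]  # 220 Hz to 4400 Hz

# Assumed settings: shift everything above 3000 Hz down by 500 Hz.
shifted = shift_above(harmonics, cutoff_hz=3000.0, shift_hz=500.0)

# Components that no longer sit at an integer multiple of the fundamental.
inharmonic = [f for f in shifted if not is_harmonic(f, fundamental)]

print(all(is_harmonic(f, fundamental) for f in harmonics))
print(len(inharmonic), inharmonic[:2])
```

Under these assumptions, the seven harmonics above the cutoff are no longer integer multiples of the fundamental once shifted (for example, 3080 Hz becomes 2580 Hz, about 11.7 times the fundamental). A noise-like ‘s’ has no such ratios to break, which is why the same operation is benign for speech.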

Clinically, for a music program, if there are sufficient dead cochlear regions or severe sensory damage, reducing the gain in a frequency region is the correct approach, rather than changing the frequency for a small group of harmonics.


  1. Great article. I have always felt that music is easier to hear and understand than speech. Speech is complex; each word is made of vowels and consonants. The vowels are loud, but the consonants are soft or even softer. Add to that multiple words in a sentence, and then a person who does not project their voice, or enunciate, or who has a different dialect. It gets worse with background noise.
    Music, on the other hand, is not as complex as speech. Sure, there are different instruments or vocals, but it is much easier to listen to.
    So stop talking and listen to the music!
