Chapter 2: The Digital Representation of Sound
When we start talking about taking 44,100 samples per second (a 44.1 kHz sampling rate), each one of those samples has a 16-bit value, so we're building up a whole heck of a lot of bits. In fact, it's too many bits for most purposes. While it's not too wasteful if you want an hour of high-quality sound on a CD, it's kind of unwieldy if we need to download or send it over the Internet, or store a bunch of it on our home hard drive. Even though high-quality sound data aren't anywhere near as large as image or video data, they're still too big to be practical. What can we do to reduce the data explosion?
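To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python. The function name `bytes_needed` is ours, and the two-channel (stereo) default is an assumption; the text above only specifies the sampling rate and the 16-bit sample size.

```python
def bytes_needed(seconds, sample_rate=44_100, bits_per_sample=16, channels=2):
    """Return the storage, in bytes, for uncompressed audio.

    channels=2 (stereo) is an assumption, not something stated in the text.
    """
    total_bits = seconds * sample_rate * bits_per_sample * channels
    return total_bits // 8  # 8 bits per byte


one_hour = bytes_needed(60 * 60)
print(f"One hour of uncompressed stereo sound: {one_hour:,} bytes")
```

Under these assumptions, an hour of sound comes to roughly 635 million bytes, which is why an hour or so fits on a CD but is awkward to send over a network.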
If we keep in mind that we're representing sound as a kind of list of symbols, we just need to find ways to express the same information in a shorter string of symbols. That's called data compression, and it's a rapidly growing field dealing with the problems involved in moving around large quantities of bits quickly and accurately.
The goal is to store the most information in the smallest amount of space, without compromising the quality of the signal (or at least, compromising it as little as possible). Compression techniques and research are not limited to digital sound: data compression plays an essential part in the storage and transmission of all types of digital information, from word-processing documents to digital photographs to full-screen, full-motion videos. As the amount of information in a medium increases, so does the importance of data compression.
What is compression exactly, and how does it work? There is no one thing that is "data compression." Instead, there are many different approaches, each addressing a different aspect of the problem. We'll take a look at just a couple of ways to compress digital audio information. What's important about these different ways of compressing data is that they tend to illustrate some basic ideas in the representation of information, particularly sound, in the digital world.
There are a number of classic approaches to data compression. The first, and most straightforward, is to try to figure out what's redundant in a signal, leave it out, and put it back in when needed later. Something that is redundant could be as simple as something we already know. For example, examine the following messages:
It's pretty clear that leaving out the vowels makes the phrases shorter, unambiguous, and fairly easy to reconstruct. Other phrases may not be as clear and may need a vowel or two. However, clarity of the intended message occurs only because, in these particular messages, we already know what it says, and we're simply storing something to jog our memory. That's not too common.
Now say we need to store an arbitrary series of colors:
This is easy to shorten to:
In fact, we can shorten that even more by saying:
We could shorten it even more, if we know we're only talking about colors, by:
We can reasonably guess that "y" means yellow. The "b" is more problematic, since it might mean "brown" or "black," so we might have to use more letters to resolve its ambiguity. This simple example shows that a reduced set of symbols will suffice in many cases, especially if we know roughly what the message is "supposed" to be. Many complex compression and encoding schemes work in this way.
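The idea of using just enough letters to resolve an ambiguity can be sketched in a few lines of Python. This is only an illustration: the color list and the function name `shortest_unique_prefixes` are made up for the example and are not the messages shown above.

```python
def shortest_unique_prefixes(words):
    """Map each word to the shortest prefix that no other word shares."""
    prefixes = {}
    for word in words:
        others = [w for w in words if w != word]
        prefix = word
        for length in range(1, len(word) + 1):
            candidate = word[:length]
            # Stop growing the prefix as soon as it identifies this word alone.
            if not any(other.startswith(candidate) for other in others):
                prefix = candidate
                break
        prefixes[word] = prefix
    return prefixes


# Hypothetical vocabulary: both sender and receiver must know it in advance.
colors = ["yellow", "brown", "black", "red"]
codes = shortest_unique_prefixes(colors)
print(codes)  # {'yellow': 'y', 'brown': 'br', 'black': 'bl', 'red': 'r'}
```

Notice that "y" and "r" are enough on their own, while "brown" and "black" each need a second letter. The scheme only works because both ends agree on the list of possible colors beforehand.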
A second approach to data compression is similar. It also tries to get rid of data that do not "buy us much," but this time we measure the value of a piece of data in terms of how much it contributes to our overall perception of the sound.
Here's a visual analogy: if we want to compress a picture for people or creatures who are color-blind, then instead of having to represent all colors, we could just send black-and-white pictures, which as you can well imagine would require less information than a full-color picture. However, now we are attempting to represent data based on our perception of it. Notice here that we're not using numbers at all: we're simply trying to compress all the relevant data into a kind of summary of what's most important (to the receiver). The tricky part of this is that in order to understand what's important, we need to analyze the sound into its component features, something that we didn't have to worry about when simply shortening lists of numbers.
MP3 is the current standard for data compression of sound on the web. But keep in mind that these compression standards change frequently as people invent newer and better methods.
Soundfiles 2.8, 2.9, and 2.10 were all compressed into the MP3 format but at different bit rates. The lower the bit rate, the more degradation. (Kbps means kilobits per second.)
Perceptually based sound compression algorithms usually work by eliminating numerical information that is not perceptually significant and just keeping what's important.
µ-law ("mu-law") encoding is a simple, common, and important perception-based compression technique for sound data. It's an older technique, but it's far easier to explain here than a more sophisticated algorithm like MP3, so we'll go into it in a bit of detail. Understanding it is a useful step toward understanding compression in general.
µ-law is based on the principle that our ears are far more sensitive to amplitude changes at low amplitudes than at high ones. That is, when sounds are soft, we notice a change in amplitude more easily than we notice the difference between a very loud sound and another that is nearly as loud. µ-law compression takes advantage of this phenomenon by mapping 16-bit values onto an 8-bit µ-law table like Table 2.6.
Notice how the range of numbers is divided logarithmically rather than linearly, giving more precision at lower amplitudes. In other words, loud sounds are just loud sounds.
To encode a µ-law sample, we start with a 16-bit sample value, say 330. We then find the entry in the table that is closest to our sample value. In this case, it would be 324, which is the 28th entry (starting with entry 0), so we store 28 as our µ-law sample value. Later, when we want to decode the µ-law sample, we simply read 28 as an index into the table, and output the value stored there: 324.
You might be thinking, "Wait a minute! Our original sample value was 330, but now we have a value of 324. What good is that?" While it's true that we lose some accuracy when we encode µ-law samples, we still get much better sound quality than if we had just used regular 8-bit samples.
Here's why: in the low-amplitude range of the µ-law table, our encoded values are only going to be off by a small margin, since the entries are close together. For example, if our sample value is 3 and it's mapped to 0, we're only off by 3. But since we're dealing with 16-bit samples, which have a total range of 65,536, being off by 3 isn't so bad. As amplitude increases we can miss the mark by much greater amounts (since the entries get farther and farther apart), but that's OK too: the whole point of µ-law encoding is to exploit the fact that at higher amplitudes our ears are not very sensitive to amplitude changes. Using that fact, µ-law compression offers near-16-bit sound quality in an 8-bit storage format.
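The table-lookup procedure can be sketched in Python as follows. Be aware of the assumptions: the table built here is a hypothetical one, generated from the standard µ-law curve with µ = 255, and is not Table 2.6, so its entries and index numbers will not match the 324 and 28 of the worked example. The layout of one sign bit plus a 7-bit table index, and the names `mulaw_encode` and `mulaw_decode`, are also our own choices for illustration.

```python
import bisect

MU = 255
MAX_AMPLITUDE = 32767  # largest magnitude we encode for a signed 16-bit sample

# Hypothetical table (NOT Table 2.6): 128 magnitudes, closely spaced near
# zero and widely spaced near the maximum, following the mu-law curve.
TABLE = [round(((1 + MU) ** (i / 127) - 1) / MU * MAX_AMPLITUDE)
         for i in range(128)]


def mulaw_encode(sample):
    """Return an 8-bit code: a sign bit plus the index of the nearest entry."""
    magnitude = min(abs(sample), MAX_AMPLITUDE)
    pos = bisect.bisect_left(TABLE, magnitude)
    # The nearest entry is either just below or just above the magnitude.
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(TABLE)]
    index = min(candidates, key=lambda i: abs(TABLE[i] - magnitude))
    return index | (0x80 if sample < 0 else 0)


def mulaw_decode(code):
    """Look the 7-bit index up in the table and restore the sign."""
    magnitude = TABLE[code & 0x7F]
    return -magnitude if code & 0x80 else magnitude


for sample in (3, 330, -330, 30000):
    restored = mulaw_decode(mulaw_encode(sample))
    linear_8bit = (sample >> 8) << 8  # plain 8-bit: keep only the top 8 bits
    print(sample, "-> mu-law:", restored, " plain 8-bit:", linear_8bit)
```

Running the loop shows the trade-off described above: quiet samples come back almost unchanged, loud samples can be off by hundreds of units, and a quiet value like 330 survives far better than it does when we simply throw away the low 8 bits.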
A third type of compression technique involves attempting to predict what a signal is going to do (usually in the frequency domain, not in the time domain) and only storing the difference between the prediction and the actual value. When a prediction algorithm is well tuned for the data on which it's used, it's usually possible to stay pretty close to the actual values. That means that the difference between your prediction and the real value is very small and can be stored with just a few bits.
Let's say you have a sample value range of 0 to 65,535 (a 16-bit range, in all positive integers) and you invent a magical prediction algorithm that is never more than 127 units above or below the actual value. You now need only 8 bits (256 possible values, enough to cover differences from -128 to +127) to store the difference between your predicted value and the actual value. You might even keep a running average of the actual differences between sample values, and use that adaptively as the range of numbers you need to represent at any given time. Pretty neat stuff! In actual practice, coming up with such a good prediction algorithm is tricky, and what we've presented here is an extremely simplified presentation of how prediction-based compression techniques really work.
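Here is a deliberately tiny Python sketch of the idea. It swaps in the simplest predictor we can think of, "the next sample will equal the previous one," working directly on the sample values in time; that predictor, the function names `delta_encode` and `delta_decode`, and the sample data are all assumptions made for illustration, not a description of any real codec. Differences are required to fit in a signed 8-bit range (-128 to 127).

```python
def delta_encode(samples):
    """Keep the first sample in full, then store only prediction errors."""
    first = samples[0]
    deltas = []
    previous = first
    for sample in samples[1:]:
        delta = sample - previous  # prediction was: same as previous sample
        if not -128 <= delta <= 127:
            raise ValueError("prediction error does not fit in 8 bits")
        deltas.append(delta)
        previous = sample
    return first, deltas


def delta_decode(first, deltas):
    """Rebuild the samples by re-running the prediction and adding each error."""
    samples = [first]
    for delta in deltas:
        samples.append(samples[-1] + delta)
    return samples


# Made-up, slowly changing 16-bit sample values.
original = [30000, 30050, 30120, 30100, 29990, 29900]
first, deltas = delta_encode(original)
print(first, deltas)  # 30000 [50, 70, -20, -110, -90]
assert delta_decode(first, deltas) == original
```

Each stored difference fits in 8 bits even though the samples themselves need 16, and decoding recovers the original values exactly. When the signal jumps by more than the predictor can follow, this naive version simply gives up, which hints at why real prediction schemes are much more elaborate.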
The Pros and Cons of Compression Techniques
Each of the techniques we've talked about has advantages and disadvantages. Some are time-consuming to compute but accurate; others are simple to compute (and understand) but less powerful. Each tends to be most effective on certain kinds of data. Because of this, many of the actual compression implementations are adaptive: they employ some variable combination of all three techniques, based on qualities of the data to be encoded.
A good example of a currently widespread adaptive compression technique is the MPEG (Moving Picture Experts Group) standard now used on the Internet for the transmission of both sound and video data. MPEG (which in audio is currently referred to as MP3) is now the standard for high-quality sound on the Internet and is rapidly becoming an audio standard for general use. A description of how MPEG audio really works is well beyond the scope of this book, but it might be an interesting exercise for the reader to investigate further.
©Burk/Polansky/Repetto/Roberts/Rockmore. All rights reserved.