Music DSP Frequently Asked Questions

Please note that some of the dsp FAQs are unattibuted quotes i've taken from postings to the music-dsp mailing list. Others have been posted explicitly as FAQs by list members or added to the old musicdsp.org FAQs wiki. Please send additions/corrections/comments to douglas * music columbia edu

The Questions

Meta FAQs
Code and theory FAQs
Development FAQs
Administrative FAQs
1. music-dsp administrative FAQs

Answers

Meta FAQs

where can I find off-the-shelf algorithms for integration into my products?
Companies that provide off-the-shelf audio signal processing algorithms:
- Algorithmix GmbH - http://fwww.algorithmix.com/
- BTE Audio - http://www.bteaudio.com/
- Clearec GmbH - http://www.clearec.com/
- DSP Dimension - http://www.dspdimension.com/
- DSPECIALISTS - http://www.dspecialists.de
- iZotope Inc - http://www.izotope.com/
- superpowered - http://superpowered.com
- zplane.development - http://www.zplane.de/

What's a good book for learning about digital signal processing?
- There are lots and lots and lots of dsp/computer music books out there.
  see the music-dsp book reviews page for info on many of them.
- Also, there is a comprehensive book list the the DSPguru web site at http://www.dspguru.com/info/refinfo2.htm
- "DSP Processor Fundamentals : Architectures and Features" by Phil Lapsley, Jeff Bier, Amit Shoham, Edward A. Lee. ISBN: 0780334051

Is there a good online dsp guide?
- Miller Puckette's "The Theory and Technique of Electronic Music" is available in various formats, both online and in print.
- The Scientist and Engineer's Guide to Digital Signal Processing by Steven W. Smith, Ph.D. is available online ( http://www.dspguide.com/) for free!
- http://www.bores.com/index_online.htm is a free online course that covers most of the basics and is a good intro for beginners written in plain English (well, plain enough). There is also an advanced section for more experienced programmers.
- Also, be sure to check out Julius Orion Smith's ever evolving DSP site: http://ccrma.stanford.edu/~jos There is a wealth of information to be found here.(Nikhil Sarma). (And eventually, the answer will be: 'you're soaking in it'.)

Should i use a dsp board or a regular computer?
DSP boards are still useful in certain situations (installations, small spaces, wearable computers) but most general sound processing algorithms can now run comfortably on regular desktop computers. So unless you have special needs you should probably start out using a regular computer.

Which dsp board should i use?
Using DSP boards is a nice perspective, but almost all DSP boards are expensive and really impractical. I recommend to concentrate on creating efficient NATIVE plugins (means, for your computer central processor) to test/evaluate audio algorithms. (George Yohng)

Which language should I use?
This depends on your ultimate development goals. Almost any language can be used for non-performance-critical DSP work. Some interpreted ones like Perl are fairly useless except for non-real-time calculation. Python fares better. But for performance you need a fast compiled language or assembly in some cases. Angelo Farina did some comprehensive testing - see musicdsp mailing list archives.
- C and C++
  C is in many ways the "lingua franca" of modern computer programming, so if you want to use DSP as a route to learning general programming practice, it might be a good choice to start with, though it's not the easiest programming language for a beginner. C++ is assumed by the widely-used VST audio plugin framework. It has many features beyond those of its parent language, C, and thus is even more complex to learn.
- Delphi
  Some people like to use Delphi (essentially a modern, object-oriented dialect of Pascal) to do their DSP work. Performance is similar to C/C++, but it's a lot harder to find support resources and to get questions answered. However, working in Delphi seems to give some people remarkable powers of imagination and productivity, so don't be embarrassed if it works for you. Delphi also lacks C/C++'s wide variety of methods for shooting oneself in the foot.
- Java and .NET
  Newest C# and .NET framework trials yielded ten times worse performance, than C/C++. C# should perform on par with Java if well written. Java is similar to C++ and has the same modern object-oriented design, but is far more elegant. Unfortunately, traditional Java implementations are much slower at raw math operations than C/C++; the exact performance hit varies by platform and processor, but Java can be as slow as half the speed of C in some cases. (Although there are ways of using C code from inside of Java to speed up critical parts of your code.) For more information on Java performance in numerical computing apps, see http://www.philippsen.com/JGI2001/finalpapers/18500097.pdf Java may be effecient with JIT (just in time compilation), but generally you cannot control the result directly.
- Assembly
  For dedicated DSP chips, assembly language appropriate to that architecture is generally necessary. Inline assembly within C or C++ may be useful for optimal inner loops.
- REALbasic
  REALbasic is a modern, object-oriented version of BASIC (perhaps more OO than Java) that provides a visual development environment, creates native applications and compiles to machine code. The IDE runs on Windows 98-XP as well as both Mac Classic and Mac OS X (and soon Linux as well) and it cross-compiles to Windows, Mac and Linux. If you're looking for a language and environment that will let you quickly create apps that run on Windows, Linux and Mac, it's one to consider. The compiler has not yet been optimized but users rarely complain about the performance as it is. http://www.realbasic.com

Where can I go to study sound-related dsp?
- The University of Southern California at Los Angeles has an excellent graduate program officially titled M.S.E.E - Multimedia and Creative Technologies. The required courses for completion of graduation requirements include the extremely interesting EE 522 - Immersive Audio Signal Processing (under Prof. Chris Kyriakakis) and DSP for Speech (this may be called something else now).
- A part-time job in the sound department at the USC School of Cinema-Television (the best in the US) is a great way to get some great hands on experience and the opportunity to talk to the audio legend, Tom Holman.
- The CCRMA at Stanford is also an excellent choice going by their courses and the body of research that has been generated there.

Should I study Electronic Engineering or Computer Science or something else?
That depends on what your goal is. It's really a matter of personal choice, as DSP is such a broad field. The field can be seen as a combination of acoustics and physical sound, digital audio, computational numerical methods, statistics and probablity, psychoacoustics, and classical music theory. Obviously, no one course can provide all these in depth. Are you a musician looking to learn more about DSP so that you can use it in your projects? Are you hoping to build DSP software or plugins? Or is your dream to build DSP hardware? (Or all of the above?) I personally intended to double major in CS/EE and minor in music, but then I realized I didn't have quite enough time for all that, so I decided to go with an EE major with heavy focus in CS and take lots of music classes on the side. It also depends on where you go to school; in some schools the music department has the most to offer in DSP, other schools it is the EE department. Whatever you decide, make sure to take lots of math classes. An EE/CS major is generally a good choice. Studying formal classic music theory is unnecessary for DSP. The important parts of music theory to understand if you plan to write your own applications are terms such as octaves, harmonics, decibels etc. One must distinguish between formal, academic knowledge, and experience. Very few degree or masters programmes are able to supply both. If you are serious about learning DSP, while studying for a degree, spend time with books by authorities on DSP and learn from example code on the net.
There is a poll on KvR initiated by Remy Muller http://www.kvr-vst.com/forum/viewtopic.php?t=45855

What is the situation with MPEG patents?
Consult http://en.wikipedia.org/wiki/MP3 Apparently mp3 codec licencing is handled via http://www.mp3licensing.com rather than the directly by teh Fraunhofer Intitute. See Eric Scheirer's: MPEG, Patents, and Audio Coding FAQ: http://music.columbia.edu/cmc/music-dsp/FAQs/MPEG_FAQ.html
If you need a decent audio compression/decompression engine for a custom application, consider using OGG Vorbis, which is a patent-free replacement for MP3. The quality is better, the format is not compatible though. http://www.vorbis.com/

Code and theory FAQs

How do I make an equalizer or guitar effect pedal?
The music-dsp source code archive has code for many filter and guitar-pedal type effects, as well as code for doing lots of other strange things to sound. If you're serious about sound processor design, you should research existing algorithms published by the various conferences and the Journal of the AES (audio engineering society) (available online http://www.aes.org) You could also look in publications of the DAFX conference (available free online at http://www.dafx.de) and the International Computer Music Conference (http://www.computermusic.org, not free online but available in many university libraries)
I love Rock and/or Roll, how can I make digital distortion for my guitar?
See Jeffrey Traer Bernstein's Digital Guitar Distortion FAQ: http://music.columbia.edu/cmc/music-dsp/FAQs/guitar_distortion_FAQ.html

How do I implement digital filters?
For questions about IIRs, FIRs, EQs, etc. There are many resources. RBJs filter cookbook is a standard work, and there's a grand digital filters FAQ on its way...Meanwhile, here's a 'free' book which has probably more than you want to know about filters: http://www-ccrma.stanford.edu/%7Ejos/filters/filters.html
Some more links that might be helpful:
- KVR http://www.kvraudio.com/forum/viewtopic.php?t=55471&postdays=0&postorder=asc&highlight=highpass&start=0
- Smart Electronix http://jaha.smartelectronix.com/

How do I make white/pink/brown/pale/friendly noise?
White noise (digitally speaking) is a series of random numbers. This produces a spectrum which (when time-averaged) is flat (equal energy per Hz). The sound of white noise is a bright hiss.
Another kind of noise is "Rustle Noise", a random series of pulses, which if rapid enough, approximate white noise. This is something like the clicks of a Geiger counter.http://www.sfu.ca/sonic-studio/handbook/Rustle_Noise.html
Sometimes noise with other properties will sound better. Other kinds of noise are often referred to with "color" names, by analogy with the optical spectrum. Just as white light has equal energy across the spectrum, so does white noise. Light with more energy at the low frequency end will appear reddish or pinkish. Light with more high frequency energy will appear bluish, and so on. For example, pink noise has a spectrum which rolls off linearly toward the high end, such that it has a spectrum with equal energy per octave (or in other words, equal energy per decade). This is also referred to as 1/f (one-over-f) noise. This is done by filtering white noise with a 3 dB per octave lowpass filter. Note that this is a gentler rolloff than even a 1-pole filter, so it is slightly tricky to do. One could use an FIR filter, which of course can produce any arbitrary response. However, FIR filters use a fairly high amount of CPU power, so something more efficient is desirable. Another common trick is to approximate the rolloff by combining (summing) the outputs of multiple 1-pole lowpass filters. The filter frequencies and amplitudes are selected to space them equally in log-frequency, with progressively lower gains for higher frequency filters. For example, if the frequencies are spaced an octave apart over the available bandwidth, each filter would be summed in with -3 dB of gain compared to the previous (1 octave lower) filter. This method of approximation will produce a curve with positive ripples at the corner frequencies, and negative ripples halfway between, but if enough sub-filters are used, it sounds close enough. Here are some links:
- http://musicdsp.org/files/pink.txt
- http://www.firstpr.com.au/dsp/pink-noise/
Brown noise rolls off at 6 dB per octave, or one-over-f-squared. This is easy; just pass white noise through an integrator, or more practically, through a lowpass filter. The corner frequency of the lowpass defines the "left end" of the accurate portion of the brown noise approximation. The problem with an integrator is that it has an infinitely high response to DC, and still a very high response at low frequencies, and might therefore saturate. Brown noise is not only named because of its analogy to "brown" light, but also it is named after an English botanist who observed this kind of motion in microscopic particles. Another name for Brown noise is Gausiian noise, after Karl Gauss, the German physicist who described it mathematically. Here are links:
Blue, green and other noise colours seem not to be rigorously defined although the word "colour" is used a lot in describing noise. Some define the 7 rainbow colours to correspond to a width of about three critcal bands in the Bark frequency scale such that green lies in the corresponding point of greatest sensitivity for the ISO226 equal loudness curve as for an optical wavelength of 530nm (the greatest sensitivity for the eye). This identifies green noise as the most troublesome for speech systems. Blue refers to noise with emphasis in the high frequencies. One could readily imagine analogies to pink noise ("azure"?) or brown noise ("violet" or "navy blue"?) which would use +3 dB or +6 dB per octave high pass filters. Doepfer makes an analog noise module with "Red" and "Blue" knobs: http://www.doepfer.de/a100_man/A118_man.pdf

What's a fast, stable, accurate method for sinewave generation?
One of the best methods - is to just use sinf() function from math.h which works surprisingly fast. For when the software is ready and performance is not satisfying, there are other ways to generate sine waves, such as wavetable lookup or second order IIR oscilator. For LFOs less accurate methods are appropriate such as parabolic approximation. Where optimisation is important and good harmonic purity is needed polynomial and tuncated Taylor series methods can be used.

What is aliasing?
To understand what aliasing is, let's start with a visual example:
When watching a movie, especially a good old western, on tv or in cinema, everybody (and his Auntie) once has wondered how vehicles could move forward while their wheels seem to turn backward or at least on a different speed. When you can explain this effect, you understand aliasing. A basic concept of movies is that they slice continuous motion into single pictures. Watching a movie in cinema means looking at thousands of pictures, each shown for only a 24th of a second.
Now, imagine a wheel with 18 spokes. That is one spoke every 20 degree. If the wheel turns an angle of 24 degrees in a second, every picture shows a difference of 1 degree then. Each spoke is then shown near to the position where it was in the previous picture (or frame in the correct film terminology). Hence you can see each single spoke move smoothly forward.
But when the wheel gains speed and rotates with, say, 360 degrees per second, things become more complicated. Then a spoke makes 15 degree each frame (360 degree/24 frames = 15 degree/frame). That's more than half of the distance between two spokes. Each spoke is now shown nearer to the position of another spoke than to it's own position spotted in the previous frame. So to say, each spoke appears to be recognized as another (being an alias!) from frame to frame. Accidently, the spokes do not seem to make 15 degrees forward but rather 5 degrees backward!
A special case occurs, when each spoke makes half the angle between two spokes, that is 10 degree per frame or 240 degree per second. Then we see a flickering image of a wheel with 36 spokes. Depending on the initial conditions we may perceive a wheel turning either forward or backward. When each spoke makes exactly 20 degree per frame the wheel seems to stand still!!! To describe the same phenomenon in digital audio we use a slightly different terminology.
Instead of frames we have samples. The slicing of motion into frames we call discrete as opposed to continuous. The number of samples per second is called sample rate and it's not 24 or 30 but rather 44100Hz or 48000Hz. So digital audio is more continous than movies are. Henry Nyquist (1889 - 1976) who in 1938 stated that the frequency range of a discrete signal is limited to half its sample rate. This is called the Nyquist Theorem. It means that the highest frequency a 48000 Hz system can provide is 24000Hz or 24kHz (1kHz == 1000Hz).
Returning to the wheel analogy, we could see the wheel turn backwards when the frequency rises beyond half the sampling rate, in this case degree per frame in relation to angle between spokes. In audio there is corresponding backwards movement. Aliased frequencies above the Nyquist seem to decrease. The Nyquist frequency behaves like a mirror for those frequencies going beyond it. Putting it simply, a sinusoidal signal of 24001 Hz sampled with a rate of 48000 Hz becomes a sinusoidal signal of 23999 Hz.
Oscillators in a digital synthesiser can produce frequencies (partials) beyond the Nyquist. The workaround here is called bandlimiting, to remove these frequencies as the waveform is produced. Many processes (pitchshifting, vibrato, compression, tremolo, filters etc.) may produce aliasing. Care must be taken not to introduce excessive modulation because once aliasing occurs no effort can practically remove it. It must be stopped before it actually happens! A demonstration of aliasing and some figures can be found at http://fen.wikipedia.org/wiki/Aliasing.

How do I avoid aliasing?
1. Filtering, reverb, delay, flanger, phaser, mixer: Normally these processes won't cause aliasing, unless you modulate them very rapidly. These are all linear processes and therefore don't add any frequency content to the audio. Filters with distortion (e.g. models of classic analog synth filters) will have to be treated as distortion algorithms; see below.
2. Ring Modulation, AM: Remember that the output frequencies from a Ring or Amplitude Modulator are the sums and differences of all the input frequencies. Therefore you must ensure that none of the sums exceed the Nyquist rate. One obvious way to do this is to lowpass filter both inputs so they contain nothing over 1/2 Nyquist (1/4 of sample rate). Or you could filter one to 1/3 of Nyquist, and the other to 2/3 Nyquist, etc.
3. Frequency Modulation, rapid filter modulation: This is harder, because the output spectrum of a Frequency Modulator is complicated; each partial in the modulation source generates an infinite series of partials in the output. Rapidly changing the cutoff frequency of a filter is similar to Frequency Modulation.Your realistic goal is to keep the amplitude of high frequency input partials low enough that the aliased partials in the output, are not too objectionable. You could also try the brute-force method of up-sampling, modulating, filtering, and finally down-sampling.
4. Distortion, distorting filters: Another hard case. If possible, don't distort too hard. For example, clipping or rectification generate much more high frequency content than does soft saturation. If the end product doesn't need extreme quantities of high frequencies, it's more efficient to go easy on the distortion, than to distort hard and then filter away the excess. Another option is brute force: up-sample, distort, filter, and finally down-sample. Aliasing can be reduced by lowpass filtering before the distortion, this reduces the number of frequency components created above nyquist during the distortion process.

What is up/down sampling?
Upsampling and downsampling refer to increasing or decreasing the sample rate of a signal. In both cases, the most important problem is to avoid creating alias frequencies. You will need to insert a lowpass filter in the middle of the process to prevent aliasing.

How can efficient and accurate up/down sampling be done?
If you are down-sampling, you need to lowpass filter the signal to remove anything above the new Nyquist rate, then discard samples. If you are upsampling, you need to insert zero samples between the old samples, then lowpass filter to remove anything above the old Nyquist rate. In both cases, you can take advantage of the fact that either many of the input samples to the filter are zero, or many of the output samples from the filter are not used. The trick to doing this is called Poly-Rate or Poly-Phase filtering. Here are a few links:
- http://www-ccrma.stanford.edu/%7Ejos/Interpolation/Digital_Audio_Resampling_Home.html
- http://www.cmsa.wmin.ac.uk/%7Eartur/Poly.html
- http://leute.server.de/wilde/resample.html
- The multirate FAQ at the dspGuru site is very informative as well. The upsampling/interpolation FAQ has a clear explanation of polyphase filtering: http://dspguru.com/info/faqs/mrfaq.htm
- There is also a library for up/down sampling. Its called Secret Rabit Code (Secret Rabbit Code is a Sample Rate Converter) and is available here: http://www.mega-nerd.com/SRC/ It's distibuted under dual license, the GNU General Public License or alternatively a commercial use license. It is distributed as source code and compiles on Linux, *nix, MacOSX and Win32. Secret Rabbit Code can do time varying sample rate conversion and is basically a reimplementation of the algorithm desrcibed in Julius O. Smith's "Digital Audio Resampling Home" page above.
- 3-stage resampling is another way to achieve fast and high-quality resampling. There is a document explaining the theory behind: http://ldesoras.free.fr/prod.html#doc_resampler
- Resampler is a crossplatform C++ library dedicated to sampler softwares, using the 3-stage method. Source code is distributed under LGPL license:http://ldesoras.free.fr/prod.html#src_resampler

How do I do pitch detection?
There are many ways of detecting pitch depending on what kind of accuracy and reliability you need. Autocorrelation, FFT methods, zero crossing and phase locked tracking are some that may be appropriate depending on the quality of the wavefrom. See wikipedia: http://en.wikipedia.org/wiki/Pitch_detection_algorithm Here are some links:
- http://www.iua.upf.es/%7Exserra/articles/msm/pitch.html
- http://www.cnmat.berkeley.edu/tristan/Report/node4.html

How do I do pitch shifting?
Pitch Shifting and Frequency Shifting are somewhat different operations. Pitch Shifting MULTIPLIES all the frequencies of an input signal by some constant, so the output is the same sound at the same tempo but a higher pitch. More sophisticated pitch shift algorithms also attempt to preserve the formant shapes (overall spectral envelope) of the original signal, so for example, you end up with what sounds like the same person singing a higher note, rather than a Munchkin singing. Frequency shifting is quite different: it ADDS a constant to all the input frequencies. In general, this will convert a series of harmonics to non-harmonic relationships, and will produce metallic timbres. For example, a slight downward frequency shift combined with a corresponding upward re-tuning, can simulate the characteristic of a plucked string, where higher harmonics are actually slightly sharp compared with the fundamental.
Frequency shifting is actually an older technology from the analog days, and is also somewhat simpler than pitch shifting. Nonetheless there are some things to watch out for. When done with ring modulation as part of the process (most methods do this), you must filter carefully to prevent aliasing. The best (simple efficient clean) current method is Weaver modulation; see this link: http://www.csounds.com/ezine/summer2000/processing/
Pitch shifting is generally done by a sort of granular resynthesis, where short sections of the input are either played faster and repeated, or played slower, with some dropped. The sections will generally have an envelope applied to eliminate glitches. The problems here are to select the audio sections so they synchronize with transients in the audio, and in better algorithms, to extract pitch and formant information, so pitch and formants can be separately processed. By the way, formant-uncorrected pitch shifting is equivalent to time stretching/compression combined with resampling (or sample interpolation). Here are some links:
- http://www.dspdimension.com/data/html/timepitch.html
- http://www.tc-helicon.tc/Files/helicon_files/Pitch_shifting.pdf
- http://www.dspdimension.com/data/html/pshiftstft.html (pitch shifting tutorial, comes with full C source code)

How does the FFT work, and what are its applications?
The Fourier transform of a signal produces an array/list of the strengths of frequencies in it. The Discrete Fourier Transform (DFT) is a mathematical tool that operates on a short window of a signal, so we know that in a particular time interval certain frequencies are present. The FFT (fast Fourier transform) is an algorithm to compute a discrete transform. It is very efficient and is the reason much of the real-time signal processing all around us exists today, from cell-phones to digital TV to effects boxes. Here are some links:
- http://www.fftw.org
- http://ourworld.compuserve.com/homepages/steve_kifowit/fftnote.asc
- http://www.dspdimension.com/data/html/dftapied.html ("Mastering The Fourier Transform in One Day")
The Short Time Fourier Transform (STFT) refers to any process where the Fourier Transform is performed on short time segments of input.
Overlap add (OLA) and Overlap Save (OLS) fall into this category since they use the Discrete Fourier Transfrom (DFTs) of segments of a signal to implement convolution, aka FIR filtering, in the frequency domain.
Since multiplication in the frequency domain corresponds to convolution in the time domain we can take advantage of the speed of the Fast Fourier Transfor (FFT) to do convolution faster in the frequency domain since in most situations the FFT and subsequent multiply and Inverse FFT result in less multiplications then a time domain convolution. To perform convolution in the frequency domain you take equal length transforms of the 2 signals (zero padded as necessary), multiply them, then take the inverse transform of the result.
The only problem is that multiplication in the frequency domain corresponds to cyclic convolution in the time domain wheras we wish to perform linear convolution. In cyclic convolution, the response to the end of a block of samples wraps around and sums to the beginning. This is known as time-aliasing.
Overlap add and Overlap Save work around this problem by using a tranform size that ensures there is no wrap around thereby emulating linear convolution.
So now you can perform frequency domain convolution on an entire signal all at once or you can hop along and take successive windows of the input.
In OLA and OLS segments of the input are extracted by hopping and taking successive windows of the input.
A window is a segment of a larger signal. Much like the window in your house provides a limited view of the outdoors (if you don't stick your head out of it)
There are many different windows used to take a segment of a signal. A rectangular window is equivalent to simply extracting a block of numbers from a signal (look at it as multiplying the signal by a rectangle of height 1 with zeros everywhere else, shifted in time to the location of the block you wish to extract). Other windows taper the signal on either end.
Say you have a window of input data of length N and an FIR filter of length M, If you linearly convolve the result is of size N+M-1 so you must use a transform size at least this long so there is no wrap-around.
However, now this result is longer than the original window of data so what do we do with this extra tail?
In Overlap Add (OLA) we take a window of N samples of input data and zeropad the signal to a length of at least N+M-1. When we convolve in the frequency domain the first N samples contain the filtered data and the remaining M-1 samples are the tail, or filter ringing, that would wrap around in cyclic convolution. So we take the N+M-1 samples and add it into the output. The next block of N+M-1 samples is then added onto this so that it overlaps the tail, or ringing, of the previous block, from whence derives the name of this process.
As long as the window and hop size (overlap) satisfy the Constant Overlap Add (COLA) constraint this process is exactly equivalent to convolution in the time domain. The COLA constraint is that if you take the sum of the superposition of all the windows at their respective times it must add up to a constant (preferably 1 for no gain) across all time.
Here are some example COLA constraints:
- Rectangular window hop size of the window length (no overlap)
- Hanning window (raised cosine) hop size of half the window (50% overlap)
- Bartlett (or triangular) window hop size of half the window (50% overlap)
- Blackman window hop size of one third the window (66% overlap)
It is interesting to note that Overlap Add can also be interpreted as a filterbank summation and obtain precisely the same result. In Overlap Save (OLS) we use a rectangular window (the use of other windows is decidedly more complicated and better suited to Overlap Add) and also take a transform of length N+M-1, however, in this case we take a window of N samples of input data and reach back to include the M-1 samples at the end of the previous window of data (at the beginning we just prepend with M-1 zeros). This means that when we convolve, the tail or ring is wrapped around to the beginning M-1 samples of the block, which is discarded since it contains the time-domain aliasing. The good part (the N samples) are simply placed into the output and already contains the tail or ringing from the previous block. This way the overlaping and adding of the tails effectively takes place implicity at the input instead of the explicitly at output as in OLA. This is called Overlap Save since we save the end of the previous input block to become the beginning portion of the next block. Overlap Save is also precisely equivalent to time-domain convolution.
If there is no analysis involved and only filtering you need only employ a rectangular window with either OLS or OLA since there is no real need for increased time resolution (i.e. more transforms per unit time) nor better side lobe rejection etc. People frequently use OLA to include analysis. When analysis is involved the choice of window becomes important. Such as if you wished the filter to vary in time based on the signal, for example: using a time varying bandpass filter to de-noise. This is not well suited to OLS since the ringing from the wrong filter would be added into each block resulting in some artifacts and an altogether incorrect result.
A fairly complex and interesting application of the STFT is Sinusoidal Modelling Synthesis (SMS) or Sines+Noise+Transients models which can be used for all sorts of fun stuff like time compression/expansion, pitch shifting, or data compression.
If you do not need a convolution filtering, but a simple filter/equalizer for e.g. a synthesizer, avoid FFT as much as possible. FFT (as you see in the Jeffrey Traer Bernstein's description) tends to create problems, such as latency, inefficiency and artifacts. On the other hand, FFT is the best way to perform spectral processing/convolution. If the frequency response of the desired filter doesn't change very often, you can do an inverse FFT on the frequency response to get an impulse response, then implement an FIR filter with that response.

How do I mix sounds together?
Mixing n sounds together is one of the simplest operations in audio signal processing. It means adding the signals together, sample by sample. In short, Mix(n) = An + Bn + Cn and so on. You just add corresponding digital samples together. The operation is neat because this is also how different sounds usually combine in air (by superposition). In practice mixing is slightly more complicated because we have to worry about the finite dynamic range of digital arithmetic.
Suppose we are working in 16-bit fixed point arithmetic and we want to mix two signals. If both signals happen to have near-full scale values, they could sum to a number which takes 17 bits to represent losslessly. More generally, doubling the number of signals added will require an extra bit to represent all the possible sums, so we would need log_2(n) bits of headroom.
The first way to deal with the problem is the simplest: we can just add the signals, clip any overflows, and pray. This is a bad idea, especially in audio DSP, because clipping will quickly add noticeable distortion and aliasing.
The second solution is to throw more bits at the problem. If we add n signals, we use log_2(n) extra bits in our intermediate results. If we can spare the bits, they will assure perfect representation of the sum. The downside is that wider integers are costlier to compute with, and we now have to worry about more than one integer format in our software.
Alternatively we could divide the sum by the same factor to make sure even the largest results do not overflow. This is the commonest solution in fixed point processing, but it gives rise to further trouble. First, the individual signals are now attenuated and none of them can drive the sum to full scale by themselves. In many applications this is unacceptable because often some of the inputs are actually quiet. When this happens, we are wasting dynamic range on the output. Second, division loses accuracy. If we mix two signals, halving the sum, we lose one bit of precision on each of the input signals.
One way to mitigate the drawbacks is to note that most sums of full-scale signals do not actually take an extra bit to represent. This is the consequence of a deep result in mathematical probability called the "central limit theorem", which states that adding large numbers of independent random variables will tend to lead to normal distributions. This means that when our n is large, the expected amplitude of the sum is not n-fold, but typically grows as the square root of n. This does not work all the time, because we might be adding signals which are not independent or have pathological distributions. Sometimes there will be brief expected overflows as well. But dividing by a number between n and sqrt(n) is still good advice.
We should also remember that dropping resolution isn't actually quite as simple as that. Strictly speaking we aren't allowed to just truncate the result of the division, but should actually dither the result. This is difficult to implement efficiently.
Usually the best way to deal with dynamic range limitations is to work with a wide, uniform bit-width throughout the software. Most computing platforms can deal efficiently with 32-48 bit numbers, which is quite enough for mixing needs. Many current practitioners also advocate floating point processing, because of its robustness and huge dynamic range. Nowadays we can afford the computational cost, so it makes sense to reserve lots of headroom in all signals to deal with the extra bits generated while processing. Going about it this way simplifies our algorithms considerably.

What is a denormal number and why is it a problem?
A denormal number is a way that certain processors represent very small or large numbers. See http://en.wikipedia.org/wiki/Denormal for a more accurate description. When the numbers become very small the ALU of the processor calculates them in a more expensive way, so the performance drops and many DSP programs use much more CPU. Dernormals are a problem with certain processors (specifically Intel/AMD x86-based) and not a general problem with ADC/DACs, or any other part of the signal chain. The typical DSP case where denormals occur are IIR filters, once the input signal goes to 0, and reverbs, again once the input signal goes to 0. The reason for this is that these structures use feedback, which results in exponential decay, which eventually will make the internal coefficients reach values small enough to be denormal.
The denormal number issue is not a design flaw, but rather a side effect of the ability of Intel/AMD x86-based processors to calculate accurate results for incredibly small numbers. Some engineers may opt for integer and fixed point calculations because of denormal performance penalty with floating point numbers.
For scientific calculations, precision is desired throughout the whole range of the floating point representation. This comes at the expense of speed when the value of any number in the calculation is below the denormal threshold. For Audio processing, the performance penalty is unacceptable because only a finite amount of time is available before the next set of calculations must begin. This is the point at which the CPU meter in a DAW software package reaches 100%, or the software just crashes.
The single precision floating point format consists of a sign bit, eight exponent bits in offset-127 unsigned form, and 23 mantissa bits. The mantissa never gets small, because just when you think it's going to get small, the exponent will decrease, and it'll be full range again. The mantissa always starts with a leading "1" bit because of this, and the leading "1" bit is never encoded because, well, it's always 1! Thus, 23 bits mantissa can correctly and accurately encode 24-bit unsigned integers -- add the sign bit, and single-precision floating point is sufficient to represent the full resolution of a 25-bit ADC. (Of course, at the levels where this makes any difference, anything from the 18th bit down or so is lost by your ears anyway -- or even more, if you're old and "disco injured")
So how do you represent "0" if there's always an implicit leading 1? Well, I lied. There is one exception to the case where there's always a leading 1 assumed in the mantissa; this is when the exponent is 0. When the exponent is 0, there is no leading 1 assumed, so if the mantissa is all 0s, and the exponent is all 0s, then your floating point value is equal to 0. Note that the sign bit doesn't matter, so there can potentially be two "0" values! When the exponent is 0, but the mantissa is NOT 0, then there is no leading implicit 1, and the number is said to be "denormal" because it no longer can contain a full 24 bits of mantissa value (the leading 1 is gone). Because of the way the circuitry in the FPU is implemented, it works much slower with denormal values, because it can't just assume there's 24 bits of precision and an exponent shift to play tricks with. This slow-down can be very substantial. The lowest reprentable number that's larger than 0 is about 1.401298464324817e-45, this is a denormal number. The largest denormal number is about 1.175494210692441e-38, followed by the smallest non-denormal single precision floating point number, 1.175494350822288e-38. You can find these numbers out for yourself using the following program. Note that there's some error in the program, because var-args parameters (like used in printf()) are always pushed as doubles, so the denormal number is loaded as such, and then converted to a double (where it is not denormal) before being printed:

#include int main() { unsigned int i1 = 0x00000001; unsigned int i2 = 0x007fffff; unsigned int i3 = 0x00800000; printf( "%.15e\n", *(float *)&i1 ); printf( "%.15e\n", *(float *)&i2 ); printf( "%.15e\n", *(float *)&i3 ); return 0; }
Avoiding denormals
To avoid denormals, don't let any value get too small. Luckily for Audio applications, any number small enough to become denormal is inaudible.
1. Test for denormals, and nuke them to 0 when you find them. You can use this handy macro if you wish:
  #DEFINE IS_DENORMAL(f) (((*(unsigned int *)&(f))&0x7f800000) == 0)
  You can use this on each value you put into your reverb, but for IIR filters, a better approach is to nuke denormals to 0 only in the coefficient memory, and only after each block of input samples has been processed. Thus, worst case, you'll take a denormal hit on one block of samples. If you want to avoid that, you can nuke any number that is "very small" to 0. Use this macro for the test:
  #DEFINE IS_ALMOST_DENORMAL(f) (((*(unsigned int *)&(f))&0x7f800000) < 0x08000000)
  or simple choose a number and compare in floating point, if your CPU makes that faster:
  #DEFINE IS_ALMOST_DENORMAL(f) (fabs(f) < 3.e-34)
2. A simple way to avoid denormals - is to add a very small number to a variable before multiplication to cause it to never reach denormal state. As a 24-bit converter with p2p range of -1.0 to 1.0 puts the quantization point at about .00000011920928955078 (which is about 1e-7) you can add noise that's another seven digits less significant than that and still stay well above the denormal floor. I e, if you add noise of the magnitude 1e-14, there is no way that this noise will be amplified so that it's actually hearable, and it's still sufficient to prevent pretty much any system to go into denormal degradation. You dont' have to use high-quality noise, a 32-element table you cycle through is probably quite sufficient.So, instead of:
  y = x*a0 + y*feedback_coeff;
  
  we write:
  #define TOOSMALL 0.0000000000000000000000001f y = x*a0 + y*feedback_coeff + TOOSMALL;
3. You can use special FPU modes that treat denormals as 0, available in Pentium III and Pentium 4 SIMD mode, and probably on other platforms, too. This usually requires assembler. As an interesting side note, it is said that the Intel 486 built-in FPU treated 0 as a denormal number (because it had a 0 exponent) and thus high-performance code went to great lengths to try to avoid using the value 0. These days, you luckily don't have to worry about that :-)
There's also a very nice PDF file by Laurent de Soras that explains denormals in greater detail: http://ldesoras.free.fr/doc/articles/denormal.pdf
In ISO C FLT_MIN from gives 1.17549435e-38f as the smallest number that can be expressed in 32-bit floating point. http://www.rustyspigot.com/Programming/IEEE%20754%20Floating%20Point%20Standard.htm#denormal

How do I perform ring modulation digitally?
To ring modulate 2 signals, X and Y, at sampling rate fs you need to upsample to 2fs to avoid aliasing. Why? Because if you multiply to signals with frequencies of fs/2 you produce the sum and difference of the signals the former being fs and the later being 0 so you must oversample to 2fs and filter to avoid this aliasing. A recipe to multiply signals X and Y with sampling rate fs without aliasing:
1. Upsample X to 2fs by adding zeros between all the samples (this doubles the spectrum produces frequencies above fs/2 even though the original signal was bandlimited) (b) brick-wall lopass with cutoff @ fs/2 to correctly bandlimit.
2. Lather, rinse, repeat with Y
3. Multiply X * Y
4. Brick-wall lopass X * Y with cutoff at fs/2

What is swept sine measurements, swept bandpass filters and time delay spectrometry all about?
To analyze the frequency response (in magnitude and phase) of a system, sine signals with variable frequency have proven useful. It is of interest how swept sine sweeps and sine signals with an arbitrary overall frequency spectrum are generated and what to bear in mind when doing measurements. Furthermore, harmonics and noise rejection can be accomplished using a bandpass filter whose center frequency tracks the momentary frequency of the swept sine with a delay to accomodate for propagation delay. All this is covered in the paper by Swen M�ller and Paolo Massarani linked below. From the abstract:
Compared to using pseudo-noise signals, transfer function measurements using sweeps as excitation signal show significantly higher immunity against distortion and time variance. Capturing binaural room impulse responses for high-quality auralization purposes requires a signal-to-noise ratio of >90 dB which is unattainable with MLS-measurements due to loudspeaker non-linearity but fairly easy to reach with sweeps due to the possibility of completely rejecting harmonic distortion. Before investigating the differences and practical problems of measurements with MLS and sweeps and arguing why sweeps are the preferable choice for the majority of measurement tasks, the existing methods of obtaining transfer functions are reviewed. The continual need to use pre-emphasized excitation signals in acoustical measurements will also be addressed. A new method to create sweeps with arbitrary spectral contents, but constant or prescribed frequency-dependent temporal envelope is presented. Finally, the possibility of simultaneously analysing transfer function and harmonics is investigated.

See http://www.anselmgoertz.de/Page10383/Monkey_Forest_dt/Manual_dt/Aes-swp.pdf

Development FAQs

How do I get microsoft API xxx to work?
When it comes to audio design, Microsoft is a serious nut. It is near to impossible to understand how to write DirectX/DirectShow plugins from the Microsoft documentation, therefore I have supplied links to wrappers, made by Cakewalk and Sonic Foundry.
- Cakewalk DX/DXi SDK: http://www.thedirectxfiles.com/developers.htm
- Sonic Foundry DX SDK: http://download.sonypictures.com/dev/spde_audiopidk.zip [ download.sonypictures.com/dev/spde_audiopidk.zip ]
Cakewalk SDK has used to be more advanced, but as the time goes, no new versions come. The latest SDK version is dated almost two years ago, and is extremely buggy.
If you plan to develop audio plugins, please, consider using VST standard instead, as DX design is very complex and completely bizarre. Cakewalk supports it for concurrency politics reason only.
Microsoft has recently developer DMO plugin standard, which is a simplified version of DX/DirectShow, and is now supported by the applications. For more information, see Gargle/DMO samples from DirectX SDK.
For using WDM Audio for sound output see Microsoft DirectKS sample http://www.microsoft.com/whdc/hwdev/tech/audio/DirectKS.mspx

How do I write cross-platform sound code?
"PortAudio is a cross platform, open-source, audio I/O library. It lets you write simple audio programs in 'C' that will compile and run on Windows, Macintosh, Unix(OSS), SGI, and BeOS." PortAudio got its start on the music-dsp list! http://www.portaudio.com/
"JSyn allows you to develop interactive computer music programs in Java. You can run them as stand-alone applications, or as Applets in a web page using the JSyn Plugin." JSyn uses PortAudio (see above). It's a great way to easily write cross-platform sound apps in Java.http://softsynth.com/jsyn
Lots of people also use Steinberg's cross-platform VST plugin API.
Also, an OpenSource? JAVA VST wrapper is available: http://jvstwrapper.sourceforge.net

How do I read and write sound files?
It's certainly possible to write your own code to read/write sound files, but why would you bother when a library like libsndfile: http://www.mega-nerd.com/libsndfile/ is so easily avaiable. It is Licensed under the GNU Lesser General Public License which means it can be used for Free Software, Open Source, Shareware and Proprietary software as long as you use libsndfile as a windows DLL or a Unix style shared library. Libsndfile reads and writes a large number of file types and a large number of file encodings.
Another option would be MiniAiff? available fromhttp://www.dspdimension.com/data/html/download.html. It reads AIFF files in a very straightforward and easy way that enables you to read and write AIFF sound files without bothering about setting up a full fledged library.

Contributions by Ross Bencina, Jeffrey Traer Bernstein, Russell Borogove, Andy Farnell, Urs Heckmann, Rebecca Lovelace, Joe Orgren, Geoff Perlman, Nikhil Sarma, George Yohng, last edit by Andy Farnell 12/05/08)