Learning sparse instrument models

One of the first steps toward high-level analysis of audio recordings is decomposing the signal into a representation that can be easily digested by a computer.  A more or less standard approach is to carve the signal up into a sequence of small frames (say, 50ms long), and then extract some features from each frame, such as chroma/pitch distributions or timbre/Mel-frequency cepstral coefficients.

One of the things that I’ve been working on is learning audio features which are informed by commonly used instrumentation in jazz recordings. The idea here is that if we can decompose a song into its constituent instruments — even approximately — it may be easier to detect high-level patterns, such as repetitions, instrument solos, etc. Lofty goals, indeed!

As a first step in this direction, I gathered the RWC Instrument Database, and extracted recordings of all the instruments we’re likely to encounter in any given jazz recording. These instrument recordings are extremely clean: one note at a time, in a controlled environment with almost no ambient noise. So it’s not exactly representative of what you’d find in the wild, but it’s a good starting point under nearly ideal conditions.

Each recording was chopped up into short frames (~46ms), and each frame was converted into a log-amplitude Mel spectrum in \R^{128}.
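For concreteness, the feature extraction looks something like this. I'm sketching it with librosa purely for illustration (any Mel filterbank implementation would do); a window of 2048 samples at 44.1kHz is what gives the ~46ms frames:

    import numpy as np
    import librosa

    # Load one (hypothetical) RWC instrument recording at 44.1 kHz
    y, sr = librosa.load("rwc_piano_note.wav", sr=44100)

    # 2048-sample windows at 44.1 kHz ~= 46ms per frame; 128 Mel bands
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                       hop_length=2048, n_mels=128)

    # Log-amplitude compression: each column of X is one frame in R^128
    X = librosa.power_to_db(S, ref=np.max)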

Given this collection of instrument-labeled audio frames, my general strategy will be to learn a latent factorization of the feature space so that each frame can be explained by relatively few factors.

If we assume that the factors (the codebook) D \in \R^{128\times k} are already known, then an audio frame x_t can be encoded via non-negative sparse coding:

    \[ f(x_t \given D, \lambda) := \argmin_{\alpha\in\R_+^{k}} \frac{1}{2}\|x_t - D\alpha\|^2 + \lambda \|\alpha\|_1, \]

where \lambda > 0 is a parameter to control the amount of desired sparsity in the encoding f(\cdot).
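Any non-negative lasso solver can compute this encoding. For example, here's a minimal sketch using scikit-learn's Lasso with a positivity constraint (note the rescaling of \lambda, since sklearn divides the quadratic term by the number of rows):

    import numpy as np
    from sklearn.linear_model import Lasso

    def encode(x, D, lam):
        """Non-negative sparse code of one frame x (128,) against a codebook D (128, k)."""
        # sklearn's Lasso minimizes (1/(2m)) * ||x - D a||^2 + alpha * ||a||_1,
        # so divide lam by m = 128 to match the objective above.
        m = D.shape[0]
        coder = Lasso(alpha=lam / m, positive=True, fit_intercept=False, max_iter=5000)
        coder.fit(D, x)
        return coder.coef_  # alpha in R_+^k

    # toy usage with random stand-ins for a real frame and codebook
    rng = np.random.default_rng(0)
    D = np.abs(rng.normal(size=(128, 64)))
    x = np.abs(rng.normal(size=(128,)))
    alpha = encode(x, D, lam=0.1)
    print((alpha > 0).sum(), "active codewords")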

Of course, we don’t know D yet, so we’ll have to learn it.  We can do this on a per-instrument level by grouping all n audio frames x_t^I associated with the I-th instrument, and alternately solving the following problem for D^I and the codes \alpha_t^I:

    \[ \min_{D^I,\, \alpha_1^I, \dots, \alpha_n^I \in \R_+^{k}} \sum_{t=1}^n \frac{1}{2}\|x_t^I - D^I \alpha_t^I\|^2 + \lambda\|\alpha_t^I\|_1. \]

After doing this independently for each instrument, we can collect the codebooks D^I into one giant codebook D.  In my experiments, I’ve been allowing 64 basis elements for most instruments, and 128 for those with a wide pitch range (piano, vibraphone, etc.).  The resulting D has around 2400 elements.
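Just to make the procedure concrete, here's a sketch of the per-instrument learning and stacking using scikit-learn's MiniBatchDictionaryLearning with non-negativity constraints; the instrument_frames dictionary below is a toy stand-in for the real RWC frames:

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning

    def learn_codebook(frames, n_atoms, lam):
        # frames: (n, 128) array of log-Mel frames for one instrument.
        # positive_code/positive_dict constrain both the codes and D^I to be non-negative;
        # sklearn also caps each atom's norm at 1, which keeps the l1 penalty meaningful.
        learner = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=lam,
                                              positive_code=True, positive_dict=True,
                                              random_state=0)
        learner.fit(frames)
        return learner.components_.T  # sklearn stores atoms as rows; transpose to 128 x n_atoms

    # toy stand-in data; in reality, one entry per RWC instrument
    rng = np.random.default_rng(0)
    instrument_frames = {"piano": np.abs(rng.normal(size=(500, 128))),
                         "trumpet": np.abs(rng.normal(size=(500, 128)))}

    codebooks = []
    for name, frames in instrument_frames.items():
        n_atoms = 128 if name in ("piano", "vibraphone") else 64
        codebooks.append(learn_codebook(frames, n_atoms, lam=1.0))

    D = np.hstack(codebooks)  # the giant codebook: 128 x (sum of per-instrument sizes)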

It can be difficult to discern much from visual inspection of thousands of codebook elements, but some interesting things happen if we plot the correlation between the learned features across instruments:
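The figure itself is nothing fancy: just the correlation matrix between codebook columns, plotted as an image, something along these lines:

    import numpy as np
    import matplotlib.pyplot as plt

    # D is the 128 x K stacked codebook from the sketch above
    corr = np.corrcoef(D, rowvar=False)   # K x K correlation between codebook columns

    plt.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)
    plt.colorbar(label="correlation")
    plt.title("Codebook element correlations, grouped by instrument")
    plt.show()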

Not too surprisingly, there’s a large amount of block structure in this figure.  Let’s zoom in on a few interesting regions.  First up, the upper-left block:

From this, we can see that piano, electric piano, vibraphone, and flute might be difficult to tease apart, but both acoustic and electric guitar separate nicely.  Note that the input features here have no notion of dynamics, such as attack and sustain, which may help explain the collision of flute with piano and vibes. [Future work!]

The picture is much clearer in the middle block, where instruments seem to separate out by their range and harmonics.  Note that violin still collides with piano and vibes (not pictured).

Finally, the lower-right block includes a variety of instruments, percussion, and human voice.  With the exception of kick/toms, it’s largely an undifferentiated mess:

It seems a bit curious that cymbals show such strong correlations with almost all other instruments.  One possible explanation is that most instrument codebooks will need to include at least one component that models broad-band noise; but cymbals are almost entirely broad-band noise.  So, although the basis elements themselves appear ambiguous, it may be that the encodings derived from them are still interpretable: at least, interpretable by a clever learning algorithm.  More on this as it develops…


loudness vs duration

I’ve been playing with plotting various Echo Nest analysis quantities against one another. I thought that pitch vs loudness or pitch vs segment duration might turn up something interesting, but visually at least, there’s not much of interest. Then I tried loudness vs duration, and wow! Some pretty distinct distributions. Parker playing Ornithology is fairly consistent, whereas our ten versions of Autumn Leaves by ten different ensembles are more varied. Almost all of the tracks seem to have mostly short, loud segments. That might just be the nature of segments: they’re distinct events, so very quiet moments probably don’t end up as independent segments…

Ornithology:

Autumn Leaves:
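For reference, the quantities behind these scatter plots are just the per-segment duration and peak loudness from the Echo Nest analysis. A rough sketch with the remix API (the exact module path and field names depend on which version of the API you have installed):

    import matplotlib.pyplot as plt
    from echonest.remix import audio

    track = audio.LocalAudioFile("ornithology.mp3")
    segments = track.analysis.segments

    durations = [seg.duration for seg in segments]
    loudness = [seg.loudness_max for seg in segments]   # peak loudness per segment, in dB

    plt.scatter(durations, loudness, alpha=0.3)
    plt.xlabel("segment duration (s)")
    plt.ylabel("segment loudness (dB)")
    plt.show()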


Infinite tracks

I used Paul Lamere’s Infinite Jukebox app to generate some fun examples for the j-disc MIR launch event a couple weeks ago:

* Sonny Stitt: Autumn Leaves

* Bill Monroe: Roanoke

* Kenny G: Careless Whisper

 


segments vs beats

We’re starting to think that maybe beat tracking, as it’s usually implemented, isn’t really that useful for a lot of jazz. Not only do many jazz tracks seem to confuse beat trackers, but it’s not clear that “beats” are really that useful when asking the kinds of questions we’re interested in.

Here is a second round of graphs looking at tempo over time. But this time I’ve plotted both beat lengths and segment lengths of many versions of Autumn Leaves using the Echo Nest analysis engine. Beat detectors try to estimate the track’s tempo and then find a beat grid that maps nicely onto the events in the track. That works well for most pop music, since there is a beat grid to be found. That’s often not quite the case in jazz. Segments are simply short snippets of sound that are meant to represent individual audio events, regardless of tempo/beat. Generally a beat will be composed of several segments, and segments can and often do cross over beat divisions.

The graphs aren’t particularly revelatory, but some of the differences between the beat curves and the segment curves are interesting. Next we need to listen through these while watching the segment curves to see if anything intriguing pops out…

N.B.: The Y axis is now fixed at 0.0-1.0 seconds to make it easier to compare across tracks. This also tames some of the wild jumps in beat length that appeared in the previous graphs.
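For the curious, the beat and segment duration curves come straight out of the analysis object. A rough sketch with the remix API (field names again depend on the API version), minus the smoothing:

    import matplotlib.pyplot as plt
    from echonest.remix import audio

    track = audio.LocalAudioFile("autumn_leaves.mp3")
    analysis = track.analysis

    # duration of each event, plotted against where it starts in the track
    plt.plot([b.start for b in analysis.beats], [b.duration for b in analysis.beats],
             label="beats")
    plt.plot([s.start for s in analysis.segments], [s.duration for s in analysis.segments],
             label="segments")

    plt.ylim(0.0, 1.0)   # fixed Y range, per the note above
    plt.xlabel("time (s)")
    plt.ylabel("event duration (s)")
    plt.legend()
    plt.show()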

 


The tempo of Autumn Leaves

A few years ago Paul Lamere from The Echo Nest posted some experiments in click track detection: http://musicmachinery.com/2009/03/02/in-search-of-the-click-track

While we’re not interested in finding click tracks in jazz recordings, the question of beat duration stability is an interesting one for jazz. What, if anything, can you tell about the players on a recording by analyzing the stability of the beat? Does the drummer determine the tempo of a performance? If so, do different drummers have identifiable beat stability profiles? Does a group maintain a beat stability profile across different performances of the same composition? Across different compositions?

A nice thing about working with jazz is that there are often lots of different versions of a given composition, both by the same group and by different groups. So you can compare, for example, ten different performances of Autumn Leaves by ten different groups, or sometimes even ten different performances of one composition, say Ornithology, by one performer, Charlie Parker.

Just to get things rolling, I decided to do a version of Paul’s analysis on ten recordings of Autumn Leaves by ten different groups. I used the Echo Nest remix API to gather the data and python/gnuplot to make the graphs of beat length over time. The X axis is elapsed seconds in the recording; the Y axis is the duration of each beat. I applied a little bit of smoothing to the data so that the trends are easier to see.
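If you want to reproduce the smoothing, something as simple as a short moving average will do; for example (window width chosen by eye):

    import numpy as np

    def smooth(durations, width=9):
        # simple moving-average smoothing of a sequence of beat durations
        window = np.ones(width) / width
        return np.convolve(durations, window, mode="same")

    # e.g. smoothed = smooth([b.duration for b in track.analysis.beats])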

We’re still listening to the tracks while looking at these graphs to try to understand if they’re telling us anything interesting. At the least they’ve suggested a bunch of other experiments to do, particularly looking at multiple recordings of the same composition by the same group to see if a beat duration profile exists.  More on that soon.

Final note: the Echo Nest analyzer borks on the Ahmad Jamal recording (and Don Byas and possibly some others). So that big plateau you see towards the end is an artifact produced by bad/confused output rather than a big change in the beat duration. You should always do a reality check of some sort on your analysis results!

 
