Blue Flower

Let's start with the piano. In very simple terms, this instrument is a set of white and black keys, each of which, when pressed, produces a sound of a predetermined frequency, from low to high. Of course, every keyboard instrument has its own unique timbre, thanks to which we can distinguish, say, a piano from an accordion, but roughly speaking, each key is simply a generator of sinusoidal acoustic waves of a certain frequency.

When a musician plays a piece, he presses and releases the keys, simultaneously or in turn, and the many sinusoidal signals are superimposed on one another to form a pattern. It is this pattern that we perceive as the melody, which is why we easily recognize the same piece performed on different instruments, in different genres, or even hummed by an amateur.
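
To make the idea concrete, here is a minimal sketch (not from the original article) that sums several sinusoids into one buffer, the way several pressed keys superimpose their waves; the note frequencies are the standard C4, E4 and G4, everything else is an illustrative assumption:

using System;

namespace Rainbow
{
    public static class ChordSketch
    {
        // Sum three sinusoids (a C major triad) into one sample buffer.
        public static double[] Synthesize(double sampleRate = 16000, double seconds = 1.0)
        {
            double[] frequencies = {261.63, 329.63, 392.00}; // C4, E4, G4 in Hz
            var samples = new double[(int) (sampleRate*seconds)];
            for (var i = 0; i < samples.Length; i++)
            {
                var t = i/sampleRate;
                foreach (var f in frequencies)
                    samples[i] += Math.Sin(2*Math.PI*f*t)/frequencies.Length; // keep within [-1, 1]
            }
            return samples;
        }
    }
}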

[Image: graphic illustration of the musical pattern]


Determining the frequency (guitar tuner mode)

The inverse problem is to parse a sounding piece of music back into its notes, that is, to decompose the total acoustic signal picked up by the ear into the original sine waves. In essence, this process is the direct Fourier transform, while pressing the keys and producing the sound is the inverse Fourier transform.

Mathematically, in the first case a complex periodic function (on a certain time interval) is decomposed into a series of orthogonal elementary functions (sines and cosines). In the second, the reverse happens: a complex signal is synthesized from their sum.

Orthogonality is, in a sense, the immiscibility of functions. For example, if you take a few pieces of colored modeling clay and press them together, you can still make out which colors they originally were; but if you mix several jars of gouache paint thoroughly, restoring the original colors without additional information is impossible.
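
In mathematical terms, this "immiscibility" means the inner product of two different basis functions vanishes; for example, on the interval $[0, 2\pi]$

$$\int_0^{2\pi} \sin(mx)\,\sin(nx)\,dx = 0, \qquad m \neq n,$$

which is exactly what lets the coefficient of each sine be recovered independently of all the others.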

(!) It is important to understand that when we set out to analyze a real signal with the Fourier transform, we idealize the situation and proceed from the assumption that the signal is periodic on the current time interval and consists of elementary sine waves. This is often the case, since acoustic signals are generally harmonic in nature, but more complex cases are possible. All our assumptions about the nature of the signal usually lead to partial distortions and errors, but without them it is extremely difficult to extract useful information.

Now let's describe the whole analysis process in more detail:

1. It all starts with sound waves vibrating the microphone membrane, which converts them into analog oscillations of an electric current.

2. Then the analog electrical signal is sampled into digital form. This point is worth dwelling on.

Since an analog signal mathematically consists of an infinite set of amplitude values continuous in time, in the measurement process we can extract from it only a finite number of values at discrete points in time; in effect, we perform discretization in time...

As a rule, sample values are taken at short regular intervals, that is, with a certain frequency, for example 16000 or 22000 Hz. In general, the samples may also be spaced unevenly, but this complicates the mathematical apparatus of the analysis, so it is not usually done in practice.


There is an important theorem, the Kotelnikov (Nyquist-Shannon) sampling theorem, which states that an analog periodic signal with a finite (band-limited) spectrum can be unambiguously reconstructed, without distortion or loss, from its samples taken at a frequency greater than or equal to twice the upper frequency of its spectrum (called the sampling rate or Nyquist rate).

For this reconstruction special interpolating functions must be applied, but the problem is that the calculations with these functions have to be performed over an infinite time interval, which is impossible in practice. Therefore, in real life, you cannot artificially increase the sampling rate arbitrarily without distortion, even if the original rate satisfies the Kotelnikov (Nyquist-Shannon) theorem. For this operation Farrow filters are used.
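
For reference, the ideal reconstruction is the Whittaker-Shannon interpolation formula, which restores the signal from its samples $x[n]$ taken with period $T = 1/f_s$, provided $f_s \ge 2 f_{\max}$:

$$x(t) = \sum_{n=-\infty}^{\infty} x[n]\,\operatorname{sinc}\!\left(\frac{t - nT}{T}\right), \qquad \operatorname{sinc}(u) = \frac{\sin(\pi u)}{\pi u}.$$

The infinite sum is precisely the "infinite time interval" that makes ideal reconstruction impractical.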

Sampling also takes place not only in time but also in the amplitude values, since a computer can manipulate only a limited set of numbers. This, too, introduces a small error.

3. The next step is the direct discrete Fourier transform itself.

We extract a short frame (interval) of the composition, consisting of digital samples, conventionally treat it as periodic, and apply the Fourier transform to it. As a result of the transform we obtain an array of complex numbers containing information about the amplitude and phase spectra of the analyzed frame. Moreover, the spectra are also discrete, with a step equal to (sampling rate) / (number of samples). That is, the more samples we take, the finer the frequency resolution we obtain. However, at a constant sampling rate, by increasing the number of samples we increase the analyzed time interval, and since in real pieces of music the notes have different durations and can alternate quickly, they overlap, so the amplitude of long notes "overshadows" the amplitude of short ones. On the other hand, for guitar tuners this way of increasing the frequency resolution works well, since a note usually sounds alone and for a long time.
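
For reference, the direct discrete Fourier transform of an $N$-sample frame $x[n]$, and the resulting frequency step between neighboring bins, are:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-2\pi i k n / N}, \qquad k = 0, \ldots, N-1, \qquad \Delta f = \frac{f_s}{N}.$$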

There is also a fairly simple trick to increase the frequency resolution: pad the original digital signal with zeros appended after the samples. However, as a result of such padding the phase spectrum is strongly distorted, although the amplitude resolution increases. It is also possible to use Farrow filters to artificially increase the sampling rate, but this, too, distorts the spectrum.
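
A sketch of the zero-padding trick (the factor of two is an arbitrary illustration; doubling also keeps the length a power of two for the FFT used below):

using System.Numerics;

// Append zeros after the samples: the spectral grid becomes twice as dense
// (interpolated), although no genuinely new information appears.
var padded = new Complex[frame.Length*2];
for (var i = 0; i < frame.Length; i++)
{
    padded[i] = frame[i]; // the tail of 'padded' remains zero
}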

The frame duration usually ranges from about 30 ms to 1 s. The shorter it is, the better the time resolution we obtain, but the worse the frequency resolution; the longer the frame, the better the frequency resolution, but the worse the time resolution. This is very similar to the Heisenberg uncertainty principle of quantum mechanics, and not by accident: as Wikipedia notes, the uncertainty principle in quantum mechanics is, in the mathematical sense, a direct consequence of the properties of the Fourier transform...
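
This trade-off can even be written down precisely: for any signal, the effective duration $\Delta t$ and effective bandwidth $\Delta f$ (measured as standard deviations) obey the Gabor limit, a purely Fourier-analytic fact:

$$\Delta t \cdot \Delta f \ge \frac{1}{4\pi}.$$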
Another interesting fact is that the amplitude spectrum obtained by analyzing a frame containing a single sine wave looks very similar to a diffraction pattern...

[Image: a sine wave bounded by a rectangular window, and its "diffraction"]

[Image: diffraction of light waves]


In practice, this undesirable effect complicates signal analysis, so people try to reduce it by applying window functions. Many such functions have been invented; here are implementations of some of them, along with a comparison of their influence on the spectrum of a single sine wave.

Applying a window function to the input frame is very simple:

for (var i = 0; i < frameSize; i++)
{
    frame[i] *= Window.Gausse(i, frameSize);
}


using System;
using System.Numerics;

namespace Rainbow
{
    public class Window
    {
        private const double Q = 0.5;

        public static double Rectangle(double n, double frameSize)
        {
            return 1;
        }

        public static double Gausse(double n, double frameSize)
        {
            var a = (frameSize - 1)/2;
            var t = (n - a)/(Q*a);
            t = t*t;
            return Math.Exp(-t/2);
        }

        public static double Hamming(double n, double frameSize)
        {
            return 0.54 - 0.46*Math.Cos((2*Math.PI*n)/(frameSize - 1));
        }

        public static double Hann(double n, double frameSize)
        {
            return 0.5*(1 - Math.Cos((2*Math.PI*n)/(frameSize - 1)));
        }

        public static double BlackmannHarris(double n, double frameSize)
        {
            return 0.35875 - (0.48829*Math.Cos((2*Math.PI*n)/(frameSize - 1))) +
                   (0.14128*Math.Cos((4*Math.PI*n)/(frameSize - 1))) - (0.01168*Math.Cos((6*Math.PI*n)/(frameSize - 1)));
        }
    }
}


[Image: comparative influence of the window functions on the spectrum of a single sine wave]

As for computation, there is the Fast Fourier Transform algorithm, which minimizes the number of mathematical operations required to calculate it. The only requirement the algorithm imposes is that the number of samples be a power of two (256, 512, 1024, and so on).

Below is its classic recursive implementation in C#.

using System;
using System.Numerics;

namespace Rainbow
{
    public static class Butterfly
    {
        public const double SinglePi = Math.PI;
        public const double DoublePi = 2*Math.PI;

        public static Complex[] DecimationInTime(Complex[] frame, bool direct)
        {
            if (frame.Length == 1) return frame;
            var frameHalfSize = frame.Length >> 1; // frame.Length/2
            var frameFullSize = frame.Length;

            var frameOdd = new Complex[frameHalfSize];
            var frameEven = new Complex[frameHalfSize];
            for (var i = 0; i < frameHalfSize; i++)
            {
                var j = i << 1; // j = 2*i
                frameOdd[i] = frame[j + 1];
                frameEven[i] = frame[j];
            }

            var spectrumOdd = DecimationInTime(frameOdd, direct);
            var spectrumEven = DecimationInTime(frameEven, direct);

            var arg = direct ? -DoublePi/frameFullSize : DoublePi/frameFullSize;
            var omegaPowBase = new Complex(Math.Cos(arg), Math.Sin(arg));
            var omega = Complex.One;
            var spectrum = new Complex[frameFullSize];

            for (var j = 0; j < frameHalfSize; j++)
            {
                spectrum[j] = spectrumEven[j] + omega*spectrumOdd[j];
                spectrum[j + frameHalfSize] = spectrumEven[j] - omega*spectrumOdd[j];
                omega *= omegaPowBase;
            }

            return spectrum;
        }

        public static Complex[] DecimationInFrequency(Complex[] frame, bool direct)
        {
            if (frame.Length == 1) return frame;
            var halfSampleSize = frame.Length >> 1; // frame.Length/2
            var fullSampleSize = frame.Length;

            var arg = direct ? -DoublePi/fullSampleSize : DoublePi/fullSampleSize;
            var omegaPowBase = new Complex(Math.Cos(arg), Math.Sin(arg));
            var omega = Complex.One;
            var spectrum = new Complex[fullSampleSize];

            for (var j = 0; j < halfSampleSize; j++)
            {
                spectrum[j] = frame[j] + frame[j + halfSampleSize];
                spectrum[j + halfSampleSize] = omega*(frame[j] - frame[j + halfSampleSize]);
                omega *= omegaPowBase;
            }

            var yTop = new Complex[halfSampleSize];
            var yBottom = new Complex[halfSampleSize];
            for (var i = 0; i < halfSampleSize; i++)
            {
                yTop[i] = spectrum[i];
                yBottom[i] = spectrum[i + halfSampleSize];
            }

            yTop = DecimationInFrequency(yTop, direct);
            yBottom = DecimationInFrequency(yBottom, direct);
            for (var i = 0; i < halfSampleSize; i++)
            {
                var j = i << 1; // j = 2*i
                spectrum[j] = yTop[i];
                spectrum[j + 1] = yBottom[i];
            }

            return spectrum;
        }
    }
}


There are two kinds of FFT algorithm, decimation in time and decimation in frequency, and both give identical results. The function takes an array of complex numbers filled with the real amplitude values of the signal in the time domain, and after execution returns an array of complex numbers containing information about the amplitude and phase spectra. It is worth remembering that the real and imaginary parts of a complex number are not the same thing as its magnitude and phase!
magnitude = Math.Sqrt(x.Real*x.Real + x.Imaginary*x.Imaginary)
phase = Math.Atan2(x.Imaginary, x.Real)

The resulting array of complex numbers is filled with useful information for exactly half its length; the other half is a mirror image of the first and can safely be excluded from consideration. If you think about it, this nicely illustrates the Kotelnikov (Nyquist-Shannon) sampling theorem: the sampling rate must be at least twice the maximum frequency of the signal...
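
So in practice only bins 0 through N/2 need to be examined; a small sketch of mapping them to physical frequencies (spectrum and sampleRate are assumed to come from the surrounding code):

// Only the first half of the spectrum carries independent information.
var frameSize = spectrum.Length;
for (var bin = 0; bin <= frameSize/2; bin++)
{
    var frequency = bin*sampleRate/frameSize; // bin index -> Hz
    var magnitude = spectrum[bin].Magnitude;
    var phase = spectrum[bin].Phase;
    // ... use frequency, magnitude and phase ...
}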

There is also a non-recursive variant of the Cooley-Tukey FFT algorithm, which is often applied in practice, but it is a little harder to understand.

Immediately after calculating the Fourier transform, it is convenient to normalize the amplitude spectrum:

var spectrum = Butterfly.DecimationInTime(frame, true);
for (var i = 0; i < frameSize; i++)  
{
    spectrum[i] /= frameSize;
}

This ensures that the amplitude values are of the same order of magnitude regardless of the frame size.

Having calculated the amplitude and phase spectra, it is easy to process the signal, for example, to apply frequency filtering or compression. In fact, this is one way to make an equalizer: after performing the direct Fourier transform, it is easy to increase or decrease the amplitude of a particular frequency range and then perform the inverse Fourier transform (although real equalizers usually work on a different principle: phase-shifting the signal). Compressing the signal is also very simple: you just need a dictionary where the key is a frequency and the value is the corresponding complex number. Only those frequencies at which the signal amplitude exceeds some minimum threshold need to go into the dictionary. The information about "quiet" frequencies inaudible to the ear will be lost, but you get a sizable compression while maintaining acceptable sound quality. This principle partly underlies many codecs.
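
A rough sketch of such an equalizer using the Butterfly class above (the band boundaries and the gain are arbitrary assumptions, not values from the article):

// Attenuate one frequency band, then resynthesize the signal.
// Assumes frame.Length is a power of two and sampleRate is known.
var spectrum = Butterfly.DecimationInTime(frame, true);
var binToFrequency = sampleRate/frame.Length;
for (var bin = 1; bin < frame.Length/2; bin++)
{
    var frequency = bin*binToFrequency;
    if (frequency >= 400 && frequency <= 800) // band to attenuate (arbitrary)
    {
        spectrum[bin] *= 0.25;
        spectrum[frame.Length - bin] *= 0.25; // mirror bin, keeps the result real
    }
}
var restored = Butterfly.DecimationInTime(spectrum, false);
for (var i = 0; i < frame.Length; i++)
{
    frame[i] = restored[i]/frame.Length; // unscaled forward+inverse pair gains a factor of N
}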

4. Determining the exact frequency

The discrete Fourier transform gives us a discrete spectrum, where each amplitude value is separated from the next by an equal frequency interval. If the signal frequency is a multiple of the step, equal to (sampling rate) / (number of samples), we get a pronounced, pointed peak; but if the signal frequency lies somewhere between the grid points, closer to the middle, we get a peak with a "cut-off" top, and it is difficult to say what frequency was actually there. It may well be that there are two signals with frequencies lying next to each other. This is the limitation of frequency resolution. Just as in a low-resolution photo small objects merge and become indistinguishable, fine details of the spectrum can be lost.

But the frequencies of musical notes are far from lying exactly on the grid of Fourier transform steps, while for the everyday tasks of tuning instruments and recognizing music it is precisely the exact frequency that you need to know. Moreover, in the low octaves, at a resolution of 1024 samples or fewer, the grid of Fourier frequencies becomes so sparse that several notes begin to fit into a single step, and determining which of them actually sounded becomes virtually impossible.

To somehow get around this limitation, approximating functions are sometimes used, for example parabolic ones:
www.ingelec.uns.edu.ar/pds2803/Materiales/Articulos/AnalisisFrecuencial/04205098.pdf
mgasior.web.cern.ch/mgasior/pap/biw2004_poster.pdf
But all these are artificial measures which, while improving some characteristics, may introduce distortions into others.
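
For illustration, a common form of parabolic peak interpolation fits a parabola through the peak bin and its two neighbors; a sketch (the helper name is hypothetical, and magnitudes is assumed to be the full N-bin normalized amplitude spectrum):

// Refine a peak frequency by fitting a parabola through the peak bin
// and its two neighbors in the amplitude spectrum.
public static double InterpolatePeak(double[] magnitudes, int peakBin, double sampleRate)
{
    var alpha = magnitudes[peakBin - 1];
    var beta = magnitudes[peakBin];
    var gamma = magnitudes[peakBin + 1];
    // Vertex offset of the parabola relative to peakBin, in bins; lies in [-0.5, 0.5]
    var p = 0.5*(alpha - gamma)/(alpha - 2*beta + gamma);
    var binToFrequency = sampleRate/magnitudes.Length;
    return (peakBin + p)*binToFrequency;
}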

Is there a natural way to determine the exact frequency?
Yes, and it is hidden in the phase of the signal spectrum, which is often neglected.
This method of refining the signal frequency is based on calculating the phase delay between the spectra of two frames that overlap but are slightly shifted in time.

You can read more about it at these links:
www.guitarpitchshifter.com/algorithm.html
www.dspdimension.com/admin/pitch-shifting-using-the-ft (+ code samples)
eudl.eu/pdf/10.1007/978-3-642-29157-9_43
ctuner.googlecode.com (examples of the algorithm in C++ and Java)

In C#, implementing the method is quite simple:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Numerics;

namespace Rainbow
{
    //  Δ∂ωπ
    public static class Filters
    {
        public const double SinglePi = Math.PI;
        public const double DoublePi = 2*Math.PI;

        public static Dictionary<double, double> GetJoinedSpectrum(
            IList<Complex> spectrum0, IList<Complex> spectrum1,
            double shiftsPerFrame, double sampleRate)
        {
            var frameSize = spectrum0.Count;
            var frameTime = frameSize/sampleRate;
            var shiftTime = frameTime/shiftsPerFrame;
            var binToFrequency = sampleRate/frameSize;
            var dictionary = new Dictionary<double, double>();

            for (var bin = 0; bin < frameSize; bin++)
            {
                var omegaExpected = DoublePi*(bin*binToFrequency); // ω=2πf
                var omegaActual = (spectrum1[bin].Phase - spectrum0[bin].Phase)/shiftTime; // ω=∂φ/∂t
                var omegaDelta = Align(omegaActual - omegaExpected, DoublePi); // Δω=(∂ω + π)%2π - π
                var binDelta = omegaDelta/(DoublePi*binToFrequency);
                var frequencyActual = (bin + binDelta)*binToFrequency;
                var magnitude = spectrum1[bin].Magnitude + spectrum0[bin].Magnitude;
                dictionary.Add(frequencyActual, magnitude*(0.5 + Math.Abs(binDelta)));
            }
            }

            return dictionary;
        }

        public static double Align(double angle, double period)
        {
            var qpd = (int) (angle/period);
            if (qpd >= 0) qpd += qpd & 1;
            else qpd -= qpd & 1;
            angle -= period*qpd;
            return angle;
        }
    }
}

Applying it is also straightforward:

var spectrum0 = Butterfly.DecimationInTime(frame0, true);
var spectrum1 = Butterfly.DecimationInTime(frame1, true);
for (var i = 0; i < frameSize; i++)
{
    spectrum0[i] /= frameSize;
    spectrum1[i] /= frameSize;
}

var spectrum = Filters.GetJoinedSpectrum(spectrum0, spectrum1, ShiftsPerFrame, Device.SampleRate);

Typically, the original frames are shifted by 1/16 or 1/32 of their length, i.e. ShiftsPerFrame is 16 or 32.
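
For clarity, here is how two such overlapping frames might be cut from the sample stream (a sketch; samples, frameSize and ShiftsPerFrame are assumed to come from the surrounding code, and samples must hold at least frameSize + shiftSize values):

var shiftSize = frameSize/ShiftsPerFrame;
var frame0 = new Complex[frameSize];
var frame1 = new Complex[frameSize];
for (var i = 0; i < frameSize; i++)
{
    frame0[i] = samples[i];
    frame1[i] = samples[i + shiftSize]; // same frame, shifted by shiftSize samples
}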

The result is a frequency-to-amplitude dictionary whose frequency values are quite close to reality. However, the "cut-off peaks" will still occur, albeit less pronounced; to deal with this shortcoming, you can simply "draw them to completion":



using System;
using System.Collections.Generic;
using System.Linq;
using System.Numerics;

namespace Rainbow
{
    public static class Filters
    {
        public static Dictionary<double, double> Antialiasing(Dictionary<double, double> spectrum)
        {
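            // Slide a four-point window over the spectrum: fit a line through
            // points 0-1 and another through points 2-3, find where the two
            // lines intersect, and if the intersection rises above both chords
            // (and lies between them in frequency), add it as the restored peak top.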
            var result = new Dictionary<double, double>();
            var data = spectrum.ToList();
            for (var j = 0; j < spectrum.Count - 4; j++)
            {
                var i = j;
                var x0 = data[i].Key;
                var x1 = data[i + 1].Key;
                var y0 = data[i].Value;
                var y1 = data[i + 1].Value;

                var a = (y1 - y0)/(x1 - x0);
                var b = y0 - a*x0;

                i += 2;
                var u0 = data[i].Key;
                var u1 = data[i + 1].Key;
                var v0 = data[i].Value;
                var v1 = data[i + 1].Value;

                var c = (v1 - v0)/(u1 - u0);
                var d = v0 - c*u0;

                var x = (d - b)/(a - c);
                var y = (a*d - b*c)/(a - c);

                if (y > y0 && y > y1 && y > v0 && y > v1 &&
                    x > x0 && x > x1 && x < u0 && x < u1)
                {
                    result.Add(x1, y1);
                    result.Add(x, y);
                }
                else
                {
                    result.Add(x1, y1);
                }
            }

            return result;
        }
    }
}


Outlook

Breaking music down into notes opens up a number of interesting prospects. After all, having the ready-made note pattern of a piece in hand, you can search for other pieces with a similar pattern.

For example, the same piece may be performed on a different instrument, in a different manner, in a different key, or transposed by octaves, yet its note pattern will look similar, which makes it possible to find different versions of the same piece. This is very similar to the game "Guess the Melody".

In some cases, such analysis can help identify plagiarism in musical works. Also, in theory, the note pattern could be used to search for pieces of a specific mood or genre, which would take search to a new level.

Results

This article has outlined the basic principles of precisely determining the frequencies of acoustic signals and extracting notes. It has also shown a subtle, intuitive connection between the discrete Fourier transform and quantum physics, which encourages thinking about a unified picture of the world.

P.S. The Rainbow framework, containing all of the above sample code, can be downloaded from CodePlex.

P.P.S. Perhaps this article will someday be genuinely useful to you in your professional work and will save you a lot of time and effort, so if you would like to thank the author for it, you can make a donation, buy the app (there is a free version with ads) thanks to which the article appeared, or simply say a kind word.

Literature

1. Fundamentals of spectral analysis of sounds: pandia.org/text/77/481/644.php

2. The Cooley-Tukey algorithm: www.codeproject.com/Articles/32172/FFT-Guitar-Tuner

3. Oversampling (resampling)
www.dsplib.ru/forum/viewtopic.php?f=5&t=11
www.dsplib.ru/content/farrow/farrow.html

4. Frequency correction via phase shift:
www.guitarpitchshifter.com/algorithm.html
www.dspdimension.com/admin/pitch-shifting-using-the-ft (+ code samples)
eudl.eu/pdf/10.1007/978-3-642-29157-9_43
ctuner.googlecode.com (examples of the algorithm in C++ and Java)

Comments

Q: Thanks for the article; I have one question. What about extracting the part of an individual instrument from a recording? For example, how feasible is it, from a technical point of view, to extract the violas or, say, the drums from Benny Goodman's Sing, Sing, Sing? I understand that the frequency spectra of some instruments overlap considerably, but still.

A: Good question.

Extracting the parts of instruments whose spectra overlap only slightly is technically possible even now: you just need to apply an equalizer and suppress the unwanted frequencies. In practice, however, the hardest thing of all is transcribing low-frequency sounds into notes, since bass notes lie closer to each other in frequency, so longer frames are needed, which leads to low time resolution.

With instruments that have similar spectra, the situation is much more interesting. We recognize them by their tonal color, the perception of which is largely psychoacoustic in nature. Since a person with their auditory system can accomplish this task, then, theoretically, it is also within reach of a computer... But this requires, at a minimum, a specially trained neural network. Training such a network takes time, just as it does for people, who spend years listening to the sounds of the world and to different music, and playing music themselves...

One man, as the saying goes, has had a bear step on his ear, while another can distinguish a difference of a few hertz. And the reason is probably not the structure of the ear but the training of the neural network, which, by the way, begins to take shape in early childhood, probably even in the womb... Scientists once ran an experiment: they sealed one eye of a newborn kitten, and when the patch was removed a couple of weeks later, the kitten remained blind in that eye for life, although physiologically the eye was perfectly fine... That is, the capacity for vision is laid down in early childhood, and later it is not so easy to repair. So when mom and dad sing or talk to a child who has not yet been born, it is by no means a waste of time, as it may seem to a pragmatic person.

Remember how a foreign language you do not know sounds to you: just a stream of fused, incomprehensible sounds, sheer gibberish. But when you start studying it, you suddenly find you can pick out individual words, then combinations of them, then sentences... And if you master the language to perfection, you can reconstruct what was said from scraps of phrases even in a noisy environment.

It's the same with music: when you listen to a lot of it, you begin to distinguish timbres and vibrato, and suddenly discover new sounds that you somehow had not noticed before in old compositions.


Q: Do you happen to know who else is working in this area? Are there any existing developments?

A: I think many people around the world are working on AI and computer hearing, but unfortunately I cannot say anything more specific, and at the moment I have no such developments of my own.

Indeed, at a preliminary stage the auditory system converts the acoustic signal into spectral form. Moreover, according to recent studies, it takes into account not only the amplitude but also the phase of the signal (!), so that from the phase and amplitude difference between the two ears the brain is able to determine the direction of a sound source more accurately.