Blue Flower

I think that for many people audio compression lossy resembles a magical black box, where a surprisingly complex manner with the use of mathematical alchemy compress data at the expense of loss of redundancy, plohorazlichimoy or inaudible by the human ear, and, as a consequence, a reduction recording quality. However, just to evaluate the significance of such losses and understand their essence is not very easy. But today we will try to find out what's the deal there and making all possible similar data compression process in the dozens of times ...

It is time to lift the veil, to open the door and personally take a look at the mysterious algorithm disturbing the hearts and minds, welcome to the a session with exposure !



Uncompressed audio stream - an array of integers ( C # short ), for example, coming from the microphone buffer. It is a discrete set of values of the analog signal amplitudes taken at regular intervals (ie, at a certain sampling rate and quantization on the level).

image
* For simplicity we will hereafter be considered a classic mono

If you just write that array to a file, even a short time period to obtain a very voluminous. It is clear that, apparently, in a stream signal contains a lot of redundant data, so it is reasonable to question arises, how to select the right and remove unnecessary The answer is simple and stern -? use of the Fourier transform .

Yes, exactly the same, which is regularly interpreted in technical universities and that we successfully manage to forget. However, it does not matter - study introductory presentation and read article , where the issue is discussed in more detail than. What we need only the sheer size of the algorithm just 32 line of code in the language C # .

   public   static  Complex [] DecimationInTime ( this  Complex [] frame,  bool  direct)
        {
             if  (frame.Length ==  1 )  return  frame;
             var  frameHalfSize = frame.Length & gt; & gt;  1 ;  // frame.Length / 2 
             var  frameFullSize = frame.Length;

             var  frameOdd =  new  Complex [frameHalfSize];
             var  frameEven =  new  Complex [frameHalfSize];
             for  ( var  i =  0 ; i & lt; frameHalfSize; i ++)
            {
                 var  j = i & lt; & lt;  1 ;  // i = 2 * j; 
                frameOdd [i] = frame [j +  1 ];
                frameEven [i] = frame [j];
            }

             var  spectrumOdd = DecimationInTime (frameOdd, direct);
             var  spectrumEven = DecimationInTime (frameEven, direct);

             var  arg = direct? -DoublePi / FrameFullSize: DoublePi / frameFullSize;
             var  omegaPowBase =  new  Complex (Math.Cos (arg), Math.Sin (arg));
             var  omega = Complex.One;
             var  spectrum =  new  Complex [frameFullSize];

             for  ( var  j =  0 ; j & lt; frameHalfSize; j ++)
            {
                spectrum [j] = spectrumEven [j] + omega * spectrumOdd [j];
                spectrum [j + frameHalfSize] = spectrumEven [j] - omega * spectrumOdd [j];
                omega * = omegaPowBase;
            }

             return  spectrum;
        }
 

Conversion is applied to a small portion of the signal - Personnel with the number of counts multiples of powers of two, that is typically 1024 , 2048 , 4096 . In the standard sampling rate of 44100 Hz , which according to Nyquist theorem - Nyquist - Shannon allows distortion-free to restore the original signal with a maximum frequency in the spectrum to the 22050 Hz , which is the maximum frequency threshold of human hearing, the passages on the equivalent length of about 23 , 46 and 93 ms , respectively. Then we have an array of complex numbers (of the same length as the frame), which contains information about the phase and frequency spectra of the signal fragment. The resulting array consists of two mirrored copies of parts, so the actual information content has only half of its members.

At this point, just we can remove information about the quiet frequencies, while maintaining a high-profile , for example, in the form of a dictionary, and during playback to recover lost items, replacing them with zeros, and then perform the inverse Fourier transform and get fit for the reproduction signal. It is on this principle based work is a huge number of audio codecs, as described operation makes it possible to produce compression in the tens or even hundreds of times (!). Of course, the method of recording information in a file also makes its own adjustments to the output size and far it is not superfluous to use the additional archiving lossless algorithms, but the most significant contribution delivers exactly mute inaudible frequencies.

At first, I can not believe that such a trivial way you can achieve such significant compression ratios, but if you look at the spectral image on a linear frequency scale, not logarithmic, it immediately becomes clear that only a narrow set of harmonics is usually present in the real spectrum, carrying useful information, and everything else - a light noise, which is out of earshot. And, paradoxically, at low degrees of compression of the signal does not deteriorate for perception, but rather only cleaned of noise, that is idealized!


* on pictures displayed in a frame of reference 4096 (93 ms) and the range of frequencies up to 22,050 Hz (see. LimitFrequency in the source code). This example demonstrates how little bearing harmonics in real signals

Not to be unfounded offer test demo application ( Rainbow Framework ) , where the compression algorithm is trivially simple, but works well. At the same time, you can evaluate the distortions that arise depending on the degree of compression, as well as to explore ways to visualize the sound, and much more ...

   var  fftComplexFrequencyFrame = inputComplexTimeFrame.DecimationInTime ( true );
 var  y =  0 ;
 var  original = fftComplexFrequencyFrame.ToDictionary (c = & gt; y ++, c = & gt; c);
 var  compressed = original.OrderByDescending (p = & gt; p.Value.Magnitude) .Take (CompressedFrameSize) .ToList ();
 

Yes, of course, there are puzzling algorithms take into account the psycho-acoustic model of perception, the probability of occurrence of certain frequencies in the spectrum, etc., etc., but they help only partly to improve performance, while maintaining a decent quality, while the lion's share of compression is achieved by such simple in a manner that in C # takes just a few lines.

Thus, the total compression scheme is the following:

1. Splitting signal amplitude and time frames *
2. Direct Fourier Transform - receiving amplitude-frequency frames
3. Full mute quiet frequencies and additional optional processing
4. Record the data in the file

* of the frame partially overlap each other, which helps to avoid clicks when a certain harmonic signal starts or ends in one frame, but still weak, and be drowned in it, though in a different frame of his sound into a full blaze and not muted. Because of that, depending on the phase of the sine wave amplitude is possible sharp rise on the borders not overlap frame is perceived as poschёlkivanie.

Reverse the process includes the following stages:

1. Reading a file
2. Restoration of the amplitude-frequency frames
3. The inverse Fourier transform - to provide amplitude and time frames
4. Formation of a signal amplitude and time frames

That's all. Perhaps you were expecting something much more complex?

Links:
1. Introductory presentation
2. Article by Fourier transform
3. demo application with source code (Rainbow Framework)
( backup link)