Low Latency Background - Buffer and Latency Jitter
What you always wanted to know about latency (or perhaps not...)


Latency jitter means variation in the latency, i.e. in the reaction time of the system. It has been almost completely ignored up to now, but will be one of the key challenges for the software industry in the future. The good news first: for simultaneous recording and playback, latency is constant. Varying latency arises with software synthesizers and samplers, i.e. with sound generation triggered from outside the system.


To understand the reason for this effect, one has to know that the communication between audio hardware and software is never a continuous data flow, but occurs in blocks or packages, the buffers. Applications copy the audio data into a buffer in advance, and the audio hardware plays back the data later.

Step by step: the diagram shows an example of a double buffer (as found in ASIO). The closed loop illustrates the sequence, because after the second buffer the first buffer comes around again. Each buffer here has a capacity of 256 samples, corresponding to a latency of about 6 ms. It is easy to see where the 6 ms come from. When playback starts, the cursor is at the label 0. Because the first buffer is being played back at that moment, the software copies its data into the second buffer. Imagine in slow motion how the cursor moves on clockwise. At a sample rate of 44.1 kHz, the 256 samples are processed in only 5.8 ms. As soon as the cursor reaches the border between the buffers (position 1), an interrupt is triggered and the software copies the upcoming data into the first buffer, which has already been played. The second buffer is now being played back, and at its end (position 0) another interrupt is triggered, which causes data to be copied into the second buffer, and so on.

Thus it is clear: no data processed by the application is output before a complete buffer has been run through, which leads to the latency mentioned above. And it is nicely constant as well!
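
The arithmetic behind the 6 ms figure can be checked in a few lines. This is only a minimal sketch; the function name buffer_latency_ms is chosen here for illustration and is not part of ASIO or any driver interface.

```python
def buffer_latency_ms(buffer_samples: int, sample_rate_hz: float) -> float:
    """Return the time needed to play one buffer, in milliseconds."""
    return 1000.0 * buffer_samples / sample_rate_hz


# One buffer of 256 samples at 44.1 kHz, as in the double-buffer example.
print(f"{buffer_latency_ms(256, 44100):.1f} ms")
```

For 256 samples at 44.1 kHz this gives roughly 5.8 ms, the value rounded to 6 ms in the text.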

It gets interesting (and a bit complicated too) when, instead of playback from the hard disk (constant streaming), keys on a MIDI keyboard are pressed at unforeseeable times. Example 1: the key is pressed at cursor position T1, shortly before a buffer change. The application calculates a sound immediately, and the cursor crosses the border to the next buffer shortly afterwards, triggering a transfer of the calculated data into the first buffer. Once the cursor has passed through the second buffer, the generated sound is played back, with a delay of one buffer plus the short time between T1 and buffer change 1. The extra time on top of one buffer is very small and hardly worth mentioning.

This changes with example 2: the key is pressed at cursor position T2, shortly after a buffer change. The application calculates a new sound immediately, but the cursor takes almost 6 ms to reach the next border and trigger the copying of the data into the second buffer, which is played back only after another 6 ms (the first buffer is played first). This means that the sound generated 'in real time' is played back by the audio hardware almost 12 ms later!
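
The two examples can be reproduced with a simplified model of the double buffer. The sketch below assumes that the sound is calculated instantly and that the only delays are the wait for the next buffer border plus one complete buffer; the names delay_samples and to_ms and the positions 250 and 5 are illustrative choices, not taken from any real driver.

```python
SAMPLE_RATE_HZ = 44100   # sample rate used in the text
BUFFER_SAMPLES = 256     # size of each of the two buffers


def delay_samples(press_position: int, buffer_samples: int = BUFFER_SAMPLES) -> int:
    """Delay from key press to playback in a simplified double-buffer model.

    press_position is the cursor position inside the buffer that is being
    played when the key is pressed (0 = right after a buffer change).
    """
    if not 0 <= press_position < buffer_samples:
        raise ValueError("press_position must lie inside the buffer")
    # Wait until the cursor reaches the next buffer border (interrupt) ...
    wait_for_border = buffer_samples - press_position
    # ... then the other buffer is played completely before the new data.
    return wait_for_border + buffer_samples


def to_ms(samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Convert a number of samples to milliseconds."""
    return 1000.0 * samples / sample_rate_hz


for label, position in (("T1, shortly before the border", 250),
                        ("T2, shortly after the border", 5)):
    print(f"{label}: {to_ms(delay_samples(position)):.1f} ms")
```

In this model the delay always lies between just over one buffer and exactly two buffers, depending only on where the cursor is when the key is pressed.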

At the moment there is a rule of thumb under ASIO: with software synths and samplers, the real delay varies between one and two buffers. The amount of variation depends on the buffer size itself and on the technique used. The technique known from the Gigasampler, for instance, works with 3 buffers of 128 samples each, two of which are a fixed part. This results in 6 to 9 ms actual latency.
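
The rule of thumb can be written as a small calculation. The sketch below rests on an assumption about how to read the text: a number of buffers always has to pass completely (the fixed part), and on top of that comes a waiting time of anything between zero and one buffer. The function latency_range_ms and its parameters are hypothetical names.

```python
def latency_range_ms(buffer_samples: int, fixed_buffers: int,
                     sample_rate_hz: float = 44100.0) -> tuple[float, float]:
    """Return (minimum, maximum) latency in milliseconds.

    Assumption: fixed_buffers buffers always pass completely, and the wait
    for the next buffer border adds between 0 and 1 further buffer.
    """
    one_buffer_ms = 1000.0 * buffer_samples / sample_rate_hz
    return fixed_buffers * one_buffer_ms, (fixed_buffers + 1) * one_buffer_ms


# ASIO double buffer with 256 samples: one fixed buffer plus the wait.
print("ASIO 2 x 256: %.1f to %.1f ms" % latency_range_ms(256, 1))
# Three buffers of 128 samples, two of them fixed.
print("3 x 128:      %.1f to %.1f ms" % latency_range_ms(128, 2))
```

Under these assumptions the second case has the same minimum as the double buffer but only half the variation, because the variable part is a 128 sample buffer instead of a 256 sample one.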

...and practice

You are certainly wondering whether this is just theoretical nonsense or has an effect in practice. Let's try a simple but convincing example: set the audio card to the highest latency (e.g. 46 ms). Now play like a metronome on a MIDI keyboard (percussive sound, e.g. cowbell) at a constant rate of, say, 120 bpm. The sounds now stumbling out of the speaker are far from what you played on the keyboard. They should come out as they were played, just delayed by a fixed amount of time. Because of latency jitter, however, the timing of the cowbells varies by 46 ms, so they are sometimes 454 ms, sometimes 546 ms apart. This corresponds to a tempo variation of 110 to 132 bpm and is definitely audible.

It gets worse: for a single note we had looked at a variation of 46 ms (plus/minus 23 ms), but for the distance between two notes we now have plus/minus 46 ms. An explanation can be found in the time diagram, which shows the original keystrokes 0.5 s apart. The basic delay of 46 ms has been left out for clarity.

The grey field shows the range in which the audio signal may be played back because of latency jitter: anywhere from 500 up to 546 ms. In the worst case, the first cowbell is played at 546 ms and the second at 1 s. The distance between cowbells 1 and 2 is then only 454 ms, so the playback seems to speed up.
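
The numbers of the cowbell example can be verified with a short calculation. This is a sketch under the assumptions of the text: keystrokes exactly 500 ms apart and an extra delay of 0 to 46 ms per note. The names interval_range_ms and bpm are illustrative.

```python
BEAT_MS = 500.0     # distance between keystrokes at 120 bpm
JITTER_MS = 46.0    # extra delay per note: anywhere from 0 to 46 ms


def interval_range_ms(beat_ms: float, jitter_ms: float) -> tuple[float, float]:
    """Return the shortest and longest distance between two played notes."""
    # Shortest: first note maximally late, second note without extra delay.
    # Longest: first note without extra delay, second note maximally late.
    return beat_ms - jitter_ms, beat_ms + jitter_ms


def bpm(interval_ms: float) -> float:
    """Tempo that corresponds to a given distance between beats."""
    return 60000.0 / interval_ms


shortest, longest = interval_range_ms(BEAT_MS, JITTER_MS)
print(f"{shortest:.0f} ms apart -> {bpm(shortest):.0f} bpm")
print(f"{longest:.0f} ms apart -> {bpm(longest):.0f} bpm")
```

Each single note moves within a window of 46 ms, but the distance between two notes can change by 46 ms in either direction, which is why the tempo swings both above and below 120 bpm.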

But latency jitter doesn't have to occur. It is (unfortunately) the standard at the moment, but it is not unavoidable. The software could delay all output data up to the maximum of 2 buffers. The latency for real-time synthesizers would then be twice as high, but constant. As the example above shows, a fixed delay is much more comfortable than shaky timing. According to our information, the company emagic was a pioneer of this technique: there was no latency jitter in Logic 4.2. Because the audio engine was completely redone for Logic 4.5, the jitter is back again. Without any doubt, updates from emagic and Steinberg (Cubase 5.0 jitters as well) will cure the problem in the near future.
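
How such a constant delay could be achieved can be sketched in the same simplified double-buffer model. The idea assumed here is that the software notes the cursor position at the moment of the key press and starts the sound that many samples into the buffer it fills next, instead of at the very beginning. This is a hypothetical illustration and not a description of how Logic or any other application actually works; delay_samples and its compensate parameter are invented names.

```python
BUFFER_SAMPLES = 256


def delay_samples(press_position: int, buffer_samples: int = BUFFER_SAMPLES,
                  compensate: bool = False) -> int:
    """Delay from key press to playback, with optional jitter compensation."""
    # Time until the next buffer border, where the data is handed over.
    wait_for_border = buffer_samples - press_position
    # Hypothetical compensation: the sound starts press_position samples
    # into the buffer being filled, not at its first sample.
    start_offset = press_position if compensate else 0
    # The other buffer is played completely before the filled one.
    return wait_for_border + buffer_samples + start_offset


jittery = {delay_samples(p) for p in range(BUFFER_SAMPLES)}
constant = {delay_samples(p, compensate=True) for p in range(BUFFER_SAMPLES)}
print("without compensation:", min(jittery), "to", max(jittery), "samples")
print("with compensation:   ", sorted(constant), "samples")
```

With compensation the wait for the border and the start offset always add up to one buffer, so every note is delayed by exactly two buffers, regardless of when the key was pressed.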

Math lesson

Calculating the latency can quickly cause headaches, simply because some applications don't display it in a user-friendly way. The buffer size is generally shown in bytes or in samples. The latter is the only sensible choice, because the latency can then be found simply by dividing by the sampling rate:

8192 samples divided by 44100 Hz yields 186 ms, divided by 48000 Hz still 171 ms, and so on.

If the size is given in bytes, the number of channels, the number of buffers and the word length have to be taken into account. Example: WaveLab, stereo:

4 times 4096 equals 16384 bytes. Half of this per channel means 8192 bytes. At a resolution of 16 bit, 2 bytes per sample are needed (1 byte = 8 bit). The buffer therefore effectively holds 4096 samples, corresponding to 93 ms at 44.1 kHz.

At 24 bit, 3 bytes per sample are necessary, so the effective buffer is reduced to 62 ms.

In the so-called unpacked 32 bit buffer mode, called '24 bit alt.' in WaveLab, 4 bytes per sample are used (although in most cases only 24 bit audio data is transferred). The resulting latency is only 46 ms.
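
The conversion from bytes to milliseconds in the WaveLab example can be sketched as follows. The function latency_from_bytes_ms is a hypothetical helper written for this illustration; it assumes interleaved stereo data and a sample rate of 44.1 kHz, as in the text.

```python
def latency_from_bytes_ms(total_bytes: int, channels: int,
                          bytes_per_sample: int,
                          sample_rate_hz: float = 44100.0) -> float:
    """Convert a buffer size in bytes into a latency in milliseconds."""
    # Bytes available for each channel, then samples per channel.
    samples_per_channel = total_bytes / channels / bytes_per_sample
    return 1000.0 * samples_per_channel / sample_rate_hz


total_bytes = 4 * 4096  # the WaveLab example: 4 times 4096 bytes

for label, bytes_per_sample in (("16 bit", 2),
                                ("24 bit", 3),
                                ("24 bit alt. (32 bit unpacked)", 4)):
    ms = latency_from_bytes_ms(total_bytes, 2, bytes_per_sample)
    print(f"{label}: {ms:.0f} ms")
```

The same number of bytes holds fewer samples as the word length grows, which is why the latency drops from 93 ms to 62 ms and 46 ms.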

Keep in mind that the displayed latency refers to the effective buffer size for playback or for recording alone. The total time from input to output through the computer (record plus playback) is twice as high. The value for monitoring an input signal is therefore double the value for playback only.

Because of the very small latencies that can be achieved today, techniques like RME's Zero Latency Monitoring (ZLM) often don't have to be used. However, hard disk recording with many tracks generally needs bigger buffers. ZLM and ADM will thus be necessary for quite some time.

Copyright © Matthias Carstens, 2000.

All entries in this Tech Infopaper have been thoroughly checked, however no guarantee for correctness can be given. RME cannot be held responsible for any misleading or incorrect information provided throughout this manual. Lending or copying any part or the complete document or its contents is only possible with the written permission from RME.


Copyright 2002 RME. All rights reserved. RME is a registered trademark.
This website contains names and marks of other companies.