• End-to-end
  • Capture
  • Transcoding
  • Audio codecs
  • Video codecs
  • Scheduling
  • Network Transmission
  • Rendering
  • Conclusion

    Discussion on Latency

    Author: Guilhem Tardy
    Last update: June 6, 2004

    The purpose of this paper is to explain in greater detail the latency inherent to all audio and video conferencing systems.

    End-to-end property

    Latency is defined by the sum of all delays incurred by the media end-to-end.

    For example, delays from the source microphone to the destination loudspeakers include audio capture, encoding, operating system scheduling, packetization, transmission over the Internet, de-packetization, buffering, decoding, and rendering (i.e. playout).

    There are two benchmarks of interest: the minimum and maximum latency of the conferencing system. Alternatively, one may consider the average latency and its variance.


    Capture

    The A/D and D/A converters in a sound card have a typical latency in the range of 30-50 samples, which represents about 0.7-1.1 ms of delay at 44.1 kHz, or about 4-6 ms at 8 kHz.
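
    The conversion from a latency expressed in samples to a delay in milliseconds is a simple division by the sampling rate. The following is a minimal sketch; the function name is hypothetical and only the 30-50 sample range and the two sampling rates come from the text.

```python
def samples_to_ms(samples: int, sample_rate_hz: float) -> float:
    """Convert a delay expressed in samples to milliseconds."""
    return 1000.0 * samples / sample_rate_hz


if __name__ == "__main__":
    # 30-50 samples is the converter latency range quoted in the text.
    for rate_hz in (44_100, 8_000):
        low = samples_to_ms(30, rate_hz)
        high = samples_to_ms(50, rate_hz)
        print(f"{rate_hz} Hz: {low:.2f} to {high:.2f} ms")
```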

    The acquisition of audio through an external device may incur additional delays. For example, a Bluetooth microphone adds the following steps:

    • encoding (64 kbit/s, μ-law or A-law PCM), 1 ms
    • packetization
    • transmission over Bluetooth, 0.625 ms
    • preparation for re-encoding, 1 ms

    The complexity of video capture naturally varies with the image resolution. Typical A/D and D/A converters induce about 15 ms delay on the video signal, excluding additional delays due to the camera and USB signaling (if applicable).

    The driver API can also make a difference, for example in Linux (ALSA vs. OSS) and Windows (Multimedia Extensions vs. DirectSound, Video for Windows vs. DirectShow). Especially for raw video with its extremely high bandwidth (>12 Mbps), the fewer the memory copies the better.
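
    To give a sense of where a figure above 12 Mbps can come from, the sketch below computes the bit rate of uncompressed video. The picture format (CIF, 352x288), the pixel format (YUV 4:2:0, 12 bits per pixel) and the frame rates are assumptions chosen for illustration; the text does not state which format the author had in mind.

```python
def raw_video_bitrate_mbps(width: int, height: int,
                           bits_per_pixel: int, fps: float) -> float:
    """Bit rate of uncompressed video, in Mbit/s."""
    return width * height * bits_per_pixel * fps / 1e6


if __name__ == "__main__":
    # Assumed format: CIF (352x288) in YUV 4:2:0, i.e. 12 bits per pixel.
    for fps in (10, 30):
        rate = raw_video_bitrate_mbps(352, 288, 12, fps)
        print(f"CIF at {fps} fps: {rate:.2f} Mbps")
```

    Each extra memory copy in the driver path moves this amount of data once more, which is why the number of copies matters.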


    Transcoding

    Transcoding takes place at both ends of the audio and video conference: encoding at the sender and decoding at the receiver.

    It should be noted that encoding takes a much longer time (usually 10x) than decoding, all other things being equal.


    Audio codecs

    There are many audio codecs to choose from, each with its particular set of advantages and disadvantages. The following table summarizes a few audio codecs published by the International Telecommunication Union (ITU):

                               G.711       G.723lr       G.723hr       G.726                  G.729
    bit rate (kbps)            48, 56, 64  5.3           6.3           16, 24, 32, 40         8
    voice quality              excellent   fair          good          good (40), fair (24)   good
    frame time (ms)            1           30            30            1                      10
    frame size (bytes)         6, 7, 8     20            24            2, 3, 4, 5             10
    typical bundle (frames)    20 - 30     1 - 3         1 - 3         15 - 30                2 - 3
    typical delay (ms)         20 - 30     37.5 - 97.5   37.5 - 97.5   15 - 30                25 - 35

    Note: G.723 and G.729 take advantage of the characteristics of speech to further compress the signal, and are thus of limited use for other applications.

    Stream codecs (here G.711 and G.726) handle each sample independently, at a typical sampling frequency of 8 kHz for telephony. For clarity, we assume a frame time of 1 ms (i.e. 8 samples).

    In the case of frame-based codecs, the size of the packet payload is a multiple of the frame size. As a result, audio codecs have a typical end-to-end delay (excluding network delay) that is dependent on the frame size, look-ahead and bundle. For example, G.729 uses a 10 ms frame with 5 ms look-ahead.
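
    The "typical delay" row of the table can be reproduced as bundle x frame time + look-ahead. The sketch below is one way to write that down; the function and dictionary names are hypothetical, the 5 ms look-ahead for G.729 comes from the text, and the 7.5 ms look-ahead for G.723 is an assumption inferred from the table (37.5 = 30 + 7.5).

```python
# name: (frame_ms, lookahead_ms, min_bundle, max_bundle), taken from the table.
# Look-ahead is 0 for the stream codecs; 7.5 ms for G.723 is inferred.
CODECS = {
    "G.711": (1, 0.0, 20, 30),
    "G.723": (30, 7.5, 1, 3),
    "G.726": (1, 0.0, 15, 30),
    "G.729": (10, 5.0, 2, 3),
}


def codec_delay_ms(frame_ms: float, lookahead_ms: float, bundle: int) -> float:
    """Algorithmic delay of one packet: bundled frames plus look-ahead."""
    return bundle * frame_ms + lookahead_ms


if __name__ == "__main__":
    for name, (frame_ms, lookahead_ms, lo, hi) in CODECS.items():
        shortest = codec_delay_ms(frame_ms, lookahead_ms, lo)
        longest = codec_delay_ms(frame_ms, lookahead_ms, hi)
        print(f"{name}: {shortest} to {longest} ms")
```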

    The selection of one codec is essentially a trade-off between complexity, bit rate, quality (from the voice compression), error resilience and compatibility with other equipment.

    Incidentally, several of these characteristics also affect the latency in other parts of the conferencing system. A more complex algorithm naturally results in a longer transcoding time, all other things being equal, and a higher bit rate induces further delays at the network bottleneck.


    Video codecs

    The delay incurred by the video codec varies with the codec features (e.g. baseline H.263 vs. annexes), settings (e.g. intraframe vs. interframes), and characteristics of the video signal (e.g. frame size, amount of motion for interframes).

    There is no framing or bundling mechanism for video: each video frame is handled separately.

    Unlike audio, a video frame is usually split into several chunks (<1400 bytes) so that each fits into an IP packet. Codecs handle this in one of two ways:

    • incremental
    • monolithic

    The latter means that the codec encodes an entire video frame before its first packet is sent, and/or waits for all the corresponding IP packets before decoding a video frame. This may also burden the scheduler, and thus delay other media such as audio.
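
    The splitting step itself is straightforward. The sketch below cuts an encoded frame into payloads of bounded size; the function name is hypothetical and the 1400-byte default is only a round figure in line with the limit mentioned above, not a value mandated by any codec.

```python
def split_frame(frame: bytes, max_payload: int = 1400) -> list:
    """Cut one encoded video frame into chunks of at most max_payload bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [frame[i:i + max_payload] for i in range(0, len(frame), max_payload)]


if __name__ == "__main__":
    encoded_frame = bytes(5000)  # stand-in for a 5000-byte encoded frame
    chunks = split_frame(encoded_frame)
    print(len(chunks), [len(c) for c in chunks])
```

    A monolithic sender would only start transmitting once the whole frame has been encoded and split in this way, whereas an incremental one could hand over each chunk as soon as it is ready.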

    Video dominates the processing and bandwidth requirement of the conferencing system, and thus greatly influences its overall performance. Therefore, it should be considered with special care by the designer.


    Scheduling

    The host operating system induces interrupt latency, a delay between when an external event (e.g. hardware interrupt) occurs and when the corresponding thread receives control of the processor to perform its task.

    This delay is largely dependent on the operating system's design, and possible configuration tweaks. For instance, Windows exhibits a much higher variance (typically 1-100 ms on Win9x, and 12-245 ms on Win2K) than Linux.

    The hardware and number of simultaneous applications/threads also have a tremendous effect on the interrupt delay, as exemplified by embedded systems and massively multi-threaded applications like MCUs.

    Network transmission

    The transmission of audio and video packets over the Internet introduces various delays, mainly due to queuing (at the busiest router) and the actual act of transmitting over the wire (at the slowest link, or "bottleneck").

    The network bottleneck is often the Internet access (e.g. dial-up), but not necessarily. For instance, the use of satellite communications easily adds 300 ms to the overall delay.
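
    The time spent putting a packet on the wire at the bottleneck is its size divided by the link rate. The sketch below illustrates this; the function name, the 56 kbit/s link rate and the packet sizes are assumptions for illustration (the 40 bytes of overhead correspond to IPv4, UDP and RTP headers of 20, 8 and 12 bytes).

```python
def serialization_delay_ms(packet_bytes: int, link_kbps: float) -> float:
    """Time to transmit one packet over a link, in milliseconds.

    bits divided by kbit/s gives milliseconds directly.
    """
    return packet_bytes * 8 / link_kbps


if __name__ == "__main__":
    headers = 20 + 8 + 12        # IPv4 + UDP + RTP headers, in bytes
    audio_packet = 160 + headers  # assumed: 20 ms of G.711 at 64 kbps
    video_packet = 1400 + headers  # assumed: one full video chunk
    for label, size in (("audio", audio_packet), ("video", video_packet)):
        delay = serialization_delay_ms(size, 56)  # assumed dial-up rate
        print(f"{label}: {size} bytes, {delay:.1f} ms at 56 kbit/s")
```

    On such a slow link, a single large video packet can hold back the audio packets queued behind it for far longer than an audio frame lasts.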

    The network delay changes constantly, due to other traffic and adaptive routing. Audio and video conferencing libraries typically implement a "de-jitter" buffer at the receiving end that smooths out these variations at the cost of yet another delay.

    Note: If the network delay varies anywhere between 10 ms and 100 ms and the de-jitter buffer is set to 60 ms, all packets delayed by more than 70 ms (the 10 ms minimum plus the 60 ms buffer) are considered lost, and the rest are delivered to the application 70 ms after their transmission time.
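
    The behaviour described in this note can be modelled in a few lines. This is a simplified sketch with a fixed buffer and hypothetical names; real de-jitter buffers usually adapt their depth over time.

```python
from typing import Optional


def playout_delay_ms(network_delay_ms: float,
                     min_delay_ms: float = 10.0,
                     buffer_ms: float = 60.0) -> Optional[float]:
    """Delay from transmission to playout, or None if the packet is too late.

    Every packet that arrives in time is held until the same deadline,
    which is what removes the jitter.
    """
    deadline_ms = min_delay_ms + buffer_ms
    if network_delay_ms > deadline_ms:
        return None  # arrived after its playout time: treated as lost
    return deadline_ms


if __name__ == "__main__":
    for delay in (10, 45, 70, 71, 100):
        print(delay, "->", playout_delay_ms(delay))
```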

    Although it may be impractical to buffer video packets in the same way, they too are silently discarded if received too late.


    Rendering

    There may be some post-processing (e.g. echo and noise control, deblocking filter, scaling) before the media is presented to the user through loudspeakers or on a screen.

    The audio playout buffer (at the driver level) introduces a delay to counteract the interrupt latency of the host operating system, followed by the D/A conversion (see "Capture" above).

    The application renders video to the GUI, a procedure that may involve several costly overlay operations before "blitting" to the D/A converter (see "Capture" above).


    Conclusion

    The audio is most sensitive to delay and its variations, with possible breaks (i.e. silence) if a packet was delayed too long or another thread (e.g. video encoding) monopolizes the processor.

    The video experiences longer capture, transcoding and rendering delays. Still, the absence of framing or bundling keeps its end-to-end delay in a range similar to that of the audio.

    Understanding delays is essential to the synchronization of audio and video signals rendered to the user. Although a time difference of 80 ms is below the limit of human perception, the ITU recommends no more than 20 or 40 ms depending on which of the two is ahead of the other.