Depacketization and Decoding



So far this chapter has discussed how video and audio signals are encoded, packetized, and multiplexed onto the IP network. The following sections describe what happens when the packets reach the destination TelePresence endpoint and how the video and audio signals are decoded.

Managing Latency, Jitter, and Loss

The first step in decoding is to receive, buffer, and reassemble the packets to prepare them to be decoded. Recall that the encoder encodes and packetizes the video and audio signals at a smooth, consistent rate. (A 30 Hz camera clock rate and a 48 kHz audio sampling rate result in fixed, consistent encoding intervals.) However, as the packets containing those video and audio samples traverse the IP network, there will inevitably be variation in their arrival times and possibly in the order in which they arrive. Therefore, the receiving endpoint must buffer the packets, reordering them if necessary, until an adequate number of packets have arrived to begin decoding a given video frame or audio sample. Lost packets, or packets that arrive too late to be decoded (late packets), must also be dealt with by the decoder.
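As a rough illustration of this receive path only (not the codec’s actual implementation), the following Python sketch buffers the RTP packets of a single video frame, reorders them by sequence number, and releases the frame to the decoder once the packet carrying the end-of-frame marker bit and all preceding packets have arrived. The function and field names are hypothetical.

# Hypothetical sketch: reorder the RTP packets of one video frame and release
# the frame for decoding only when it is complete.
def reassemble_frame(packets):
    """packets: list of dicts with 'seq', 'payload', and 'marker' (end of frame)."""
    ordered = sorted(packets, key=lambda p: p["seq"])
    seqs = [p["seq"] for p in ordered]
    complete = (ordered[-1]["marker"] and
                seqs == list(range(seqs[0], seqs[0] + len(seqs))))
    if not complete:
        return None                                   # keep buffering, or handle as loss
    return b"".join(p["payload"] for p in ordered)    # hand the frame to the decoder

frame = reassemble_frame([
    {"seq": 11, "payload": b"B", "marker": False},
    {"seq": 10, "payload": b"A", "marker": False},
    {"seq": 12, "payload": b"C", "marker": True},
])
print(frame)   # b'ABC'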
The following sections detail how the Cisco TelePresence codec handles latency, jitter, and loss in the packets that it receives. 

Latency

At the human experience level, latency is defined and measured as the time it takes for the speech or gestures of one individual (the speaker) to reach the ears and eyes of another (the listener), and for the audible or visual reaction of that listener to come all the way back to the speaker so that the speaker can hear and see the listener’s reaction. Hence, the human experience is round-trip in nature. This is referred to as conversational latency, or experience-level latency; 250 ms to 350 ms is the threshold at which the human mind begins to perceive latency and be annoyed by it.
At the technical level, however, the latency in Cisco TelePresence is defined and measured as the time it takes for an audio or video packet containing speech or motion to travel from the Ethernet network interface of the speaker’s TelePresence system to the Ethernet network interface of the listener’s TelePresence system in one direction. The listener’s TelePresence system processes the incoming packets and computes a running average of the latency based on timestamps within the RTP packets and their associated RTCP sender reports. Therefore, latency is measured only at the network level from one TelePresence system to another, not at the experience level. It is measured unidirectionally by each TelePresence system, not round-trip, and it does not take into account the processing time (encoding and decoding) of the packets.
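Cisco does not publish the exact averaging algorithm, so the following Python sketch is only a loose illustration of the idea of a running average, assuming each RTCP sender report yields a one-way delay sample in milliseconds; the class name and smoothing factor are hypothetical.

# Hypothetical sketch: fold each one-way delay sample into a running average.
class LatencyMonitor:
    def __init__(self, alpha=0.1):
        self.alpha = alpha           # weight given to the newest sample (illustrative)
        self.avg_latency_ms = None   # running average of one-way latency

    def add_sample(self, delay_ms):
        if self.avg_latency_ms is None:
            self.avg_latency_ms = delay_ms
        else:
            self.avg_latency_ms = (self.alpha * delay_ms +
                                   (1 - self.alpha) * self.avg_latency_ms)
        return self.avg_latency_ms

monitor = LatencyMonitor()
for sample_ms in (120, 130, 145, 160):   # example one-way delay samples
    monitor.add_sample(sample_ms)
print(round(monitor.avg_latency_ms, 1))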

Latency Target

To maintain acceptable experience-level latency, Cisco recommends that customers engineer their networks with a target of no more than 150 ms of network-level latency, in each direction, between any two TelePresence systems. Given the circumference of the earth, the speed of light, and the cabling paths that light travels on between cities, it is not always possible to achieve 150 ms between any two points on the globe. Therefore, Cisco TelePresence implements the following thresholds to alert the network administrator and the user when network-level latency exceeds acceptable levels.

Latency Thresholds

When network-level latency exceeds 250 ms averaged over any 10-second period, the Cisco TelePresence system receiving those packets generates an alarm, and an onscreen message displays to the user. The alarm is written to the syslog log file of that TelePresence system, and an SNMP trap message is generated. The onscreen message displays for 15 seconds, after which it is removed. The onscreen message does not display again for the duration of the meeting, unless the media is interrupted or restarted, such as when the user places the meeting on hold and then resumes it (using the Hold or Resume softkeys), or the user terminates the meeting and then reestablishes it (using the End Call or Redial softkeys).
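The following Python sketch is a hedged approximation of the alarm behavior just described, not the codec’s actual code: latency averaged over a 10-second window is compared against the 250 ms threshold, syslog/SNMP alarms repeat while the condition persists, and the onscreen message is shown at most once per meeting. The one-sample-per-second cadence is an assumption.

from collections import deque

LATENCY_THRESHOLD_MS = 250   # adjusted to 2000 when the satellite license is active
WINDOW_SECONDS = 10

class LatencyAlarm:
    def __init__(self):
        self.samples = deque(maxlen=WINDOW_SECONDS)   # assumed: one sample per second
        self.onscreen_shown = False                   # shown at most once per meeting

    def on_latency_sample(self, latency_ms):
        self.samples.append(latency_ms)
        if len(self.samples) < WINDOW_SECONDS:
            return
        if sum(self.samples) / len(self.samples) > LATENCY_THRESHOLD_MS:
            self.raise_alarm()

    def raise_alarm(self):
        print("syslog: latency threshold exceeded")   # stands in for syslog + SNMP trap
        if not self.onscreen_shown:
            print("onscreen: high latency (displayed for 15 seconds)")
            self.onscreen_shown = True                # cleared only by hold/resume or redial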
Tip 
Cisco TelePresence release 1.5 added support for satellite networks. This feature requires a software license to activate. When activated, the latency threshold is adjusted from 250 ms to 2 seconds.

Understanding Latency Measurements in Multipoint Meetings

As audio and video packets traverse a Cisco TelePresence Multipoint Switch (CTMS), the RTP header containing the original timestamp information is overwritten, and a new timestamp value is applied by the CTMS. Therefore, the latency measured by each participating TelePresence system is only a measurement of the latency from the CTMS to that endpoint. It is possible for the end-to-end latency from one TelePresence system through the CTMS to another TelePresence system to exceed the 250 ms latency threshold without the TelePresence systems realizing it.
For example, if the latency from one TelePresence system in Hong Kong to the CTMS in London is 125 ms, and the latency from the CTMS in London to the other TelePresence system in San Francisco is 125 ms, the end-to-end latency from the Ethernet network interface of the Hong Kong system to the Ethernet network interface of the San Francisco system is 250 ms, plus approximately 10 ms added by the CTMS, for a total of 260 ms. The TelePresence system in San Francisco does not realize this and considers the latency for that meeting to be only 125 ms. Therefore, care should be taken when designing the network and choosing the location of the CTMS to reduce the probability of this situation occurring as much as possible. The CTMS is the only device in the network that is aware of the end-to-end latency between any two TelePresence systems in a multipoint meeting. Network administrators can view the end-to-end statistics (calculated as the sum of any two legs in that meeting) through the CTMS Administration interface.
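The arithmetic of this example can be summarized in a few lines; the roughly 10 ms CTMS switching delay is the approximate figure quoted above.

# End-to-end latency in the multipoint example above.
hk_to_ctms_ms = 125      # leg from the Hong Kong system to the CTMS in London
ctms_to_sf_ms = 125      # leg from the CTMS in London to the San Francisco system
ctms_switching_ms = 10   # approximate delay added by the CTMS itself

end_to_end_ms = hk_to_ctms_ms + ctms_to_sf_ms + ctms_switching_ms
print(end_to_end_ms)     # 260 ms, yet the San Francisco system measures only its 125 ms leg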

Jitter

Simply put, jitter is variation in network latency. In Cisco TelePresence, jitter is measured by comparing the arrival time of the current video frame to the expected arrival time of that frame based on a running clock of fixed 33 ms intervals. Unlike most other video conferencing and telepresence products on the market that use variable frame rate codecs, Cisco TelePresence operates at a consistent 30 frames per second (30 fps). Therefore, the sending codec generates a video frame every 33 ms, and the receiving codec expects those video frames to arrive every 33 ms.
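As a loose illustration (assuming frame arrival times are available in milliseconds), the following sketch compares each frame’s arrival time against a fixed clock started at the first frame; the nominal 33 ms frame interval (1/30 second) is approximated as 33.3 ms here.

FRAME_INTERVAL_MS = 33.3   # 30 frames per second

def frame_jitter(arrival_times_ms):
    # Jitter of each frame relative to the expected fixed-interval arrival time.
    start = arrival_times_ms[0]
    return [arrival - (start + n * FRAME_INTERVAL_MS)
            for n, arrival in enumerate(arrival_times_ms)]

# Example: the third frame arrives roughly 25 ms later than the fixed clock expects.
print([round(j, 1) for j in frame_jitter([0.0, 33.3, 91.6, 100.0])])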

Frame Jitter Versus Packet Jitter

Video frames vary in size based on how much motion is represented by a given video frame. When a low amount of motion occurs within the encoded video, the video frame is relatively small. When a large amount of motion occurs within the encoded video, the video frame is large. Cisco TelePresence 1080p video frames can be as large as 65,000 bytes (65 KB) and average approximately 13 KB.
These video frames are then segmented into smaller chunks and placed within the payload of RTP packets. Cisco TelePresence video packets tend to be approximately 1100 bytes each, with relatively minor variation in size.
Given a constant end-to-end network latency and relatively constant packet sizes, you can expect low packet-level jitter. However, there will still inevitably be variation in the arrival times of video frames simply because of the variation in their size. This variation is primarily a function of the serialization rate (speed) of the network interfaces that the packets constituting those video frames traverse. It can also be affected by queuing and shaping algorithms within the network routers along the path, which might need to queue (buffer) the packets to prioritize them relative to other traffic, shape them prior to transmission, and then transmit (serialize) them on their outgoing interfaces. On fast networks (45 Mbps DS-3 circuits or faster), the difference between the time required to serialize all the packets constituting a large 65 KB video frame and the time required to serialize a small 13 KB video frame is inconsequential. On slower networks (10 Mbps or slower), the difference between the serialization times of these frame sizes can be significant.
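A quick worked example of the serialization effect described above (ignoring queuing, shaping, and protocol overhead, so the numbers are indicative only):

# Time to clock an entire video frame onto the wire at a given link speed.
def serialization_ms(frame_bytes, link_bps):
    return frame_bytes * 8 / link_bps * 1000

for frame_bytes in (65_000, 13_000):              # large vs. average 1080p frame
    for link_bps in (45_000_000, 10_000_000):     # DS-3 vs. 10 Mbps
        print(frame_bytes, link_bps, round(serialization_ms(frame_bytes, link_bps), 1))
# A 65 KB frame takes ~11.6 ms at 45 Mbps but ~52 ms at 10 Mbps, while a 13 KB
# frame takes ~2.3 ms and ~10.4 ms, so the frame-to-frame variation is far more
# pronounced on the slower link.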
Cisco TelePresence systems implement jitter buffers to manage these variations in video frame arrival times. Upon receipt at the destination, the packets are buffered until an adequate portion of the video frame has arrived, and then the packets are removed from the buffer and decoded. The size (depth) of the jitter buffer dictates how much jitter can be absorbed before it becomes noticeable to the user. Packets exceeding the jitter buffer are dropped by the receiving codec because they arrived too late to be decoded. The depth of the jitter buffer has an important consequence for experience-level latency: every millisecond spent waiting in the jitter buffer increases the end-to-end latency between the humans, so jitter buffers must be kept just large enough to accommodate network-level jitter without adding an unacceptable amount of experience-level latency.

Jitter Target

To maintain acceptable experience-level latency, Cisco recommends that customers engineer their networks with a target of no more than 10 ms of packet-level jitter and no more than 50 ms of video frame jitter in each direction between any two TelePresence systems. Given the desire to deploy TelePresence over the smallest, and hence least expensive, amount of bandwidth possible, and the need in some circumstances to implement shaping within the routers along the path to conform to a service provider’s contractual rates and policing enforcements, 50 ms of video frame-level jitter is not always possible to achieve. Therefore, Cisco TelePresence implements the following thresholds and jitter buffer behavior to alert the network administrator when video frame-level jitter exceeds acceptable levels.

Jitter Thresholds

Cisco TelePresence uses a quasi-adaptive jitter buffer. At the beginning of every new meeting, the jitter buffer starts at a depth of 85 ms. After monitoring the arrival times of the video frames for the first few seconds of the meeting, if the incoming jitter exceeds an average of 85 ms, the jitter buffer is dynamically adjusted to 125 ms. After that, if the jitter exceeds 125 ms averaged over any 10-second period, the Cisco TelePresence system receiving those video frames generates an alarm and dynamically adjusts the jitter buffer to 165 ms. The alarm is written to the syslog log file of that TelePresence system, and an SNMP trap message is generated. No onscreen message is displayed to the user.
Any packets exceeding the 165 ms jitter buffer depth are discarded by the receiving TelePresence system and logged as “late packets” in the call statistics. No alarms or onscreen messages are triggered by this threshold. However, late packets are just as bad as lost packets in that they can cause a noticeable effect on the video quality, so care should be taken to design the network so that video frame jitter never exceeds 165 ms.
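The adjustment and discard behavior described in the last two paragraphs can be sketched as a small state machine; this is a hedged approximation, not Cisco’s implementation, and the method names are hypothetical.

class JitterBuffer:
    def __init__(self):
        self.depth_ms = 85                 # initial depth at the start of a meeting

    def on_average_jitter(self, avg_jitter_ms):
        if self.depth_ms == 85 and avg_jitter_ms > 85:
            self.depth_ms = 125            # first adjustment, no alarm
        elif self.depth_ms == 125 and avg_jitter_ms > 125:
            self.depth_ms = 165
            print("syslog: jitter threshold exceeded")   # alarm + SNMP trap, no onscreen message

    def on_packet(self, wait_time_ms):
        # Packets delayed beyond the current depth are discarded as late packets.
        if wait_time_ms > self.depth_ms:
            return "late packet"           # counted in call statistics, no alarm
        return "decode"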

Packet Loss

Loss is defined as packets that did not arrive because they were dropped somewhere along the network path. Loss is measured by each TelePresence system by comparing the sequence numbers of the RTP packets it receives against the sequence numbers it expected to receive. Packet loss can occur anywhere along the path for a variety of reasons; the three most common follow:
  • Layer-1 errors on the physical interfaces and cables along the path, such as a malfunctioning optical interface
  • Misconfigured network interfaces along the path, such as Ethernet speed or duplex mismatches between two devices
  • Bursts of packets exceeding the buffer (queue) limit or policer configurations on network interfaces along the path, such as Ethernet switches with insufficient queue depth or oversubscribed backplane architectures, or WAN router interfaces that police traffic to conform to a service provider’s contractual rates
A closely related metric is late packets, which are packets that arrived but exceeded the jitter buffer (arrived too late to be decoded) and hence were discarded (dropped) by the receiving TelePresence system. Lost packets and late packets are tracked independently by Cisco TelePresence systems, but they both result in the same outcome, noticeable pixelization of the video.
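A minimal sketch of loss detection by RTP sequence number (the sequence number is 16 bits, so it wraps at 65,536); real receivers must also handle reordering, which this illustration ignores.

class LossCounter:
    def __init__(self):
        self.expected = None
        self.lost = 0

    def on_packet(self, seq):
        if self.expected is not None and seq != self.expected:
            # Packets between the expected and received sequence numbers never arrived.
            self.lost += (seq - self.expected) % 65536
        self.expected = (seq + 1) % 65536

counter = LossCounter()
for seq in (100, 101, 104, 105):   # packets 102 and 103 were dropped in the network
    counter.on_packet(seq)
print(counter.lost)                # 2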
Loss is by far the most stringent of the three metrics discussed here. Latency can be annoying to the users, but their meeting can still proceed, and jitter is invisible to the user, but loss (including packets that arrived but exceeded the 165 ms jitter buffer and became late packets) is immediately apparent. Consider the following calculation:
1080p resolution uncompressed (per screen)
     2,073,600 pixels per frame
× 3 colors per pixel
× 1 byte (8 bits) per color
× 30 frames per second
= 1.5 Gbps uncompressed
The Cisco TelePresence systems use the H.264 codec to compress this down to 4 Mbps (per screen). This represents a compression ratio of > 99 percent. Therefore, each packet is representative of a large amount of video data, and, hence, a small amount of packet loss can be extremely damaging to the video quality.
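The numbers behind that statement, per screen:

uncompressed_bps = 1920 * 1080 * 3 * 8 * 30     # pixels x colors x bits per color x fps
compressed_bps = 4_000_000                      # H.264 output for one 1080p screen

print(round(uncompressed_bps / 1e9, 2))                         # ~1.49 Gbps uncompressed
print(round((1 - compressed_bps / uncompressed_bps) * 100, 2))  # ~99.73 percent reduction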
At the time this book was written, Cisco TelePresence was just beginning to implement a new technique known as Long-Term Reference Frames. This technique enables the system to recover from packet loss significantly faster by maintaining multiple reference frames, thereby reducing the number of IDR reference frames that need to be retransmitted when packet loss occurs.

Loss Target

To maintain acceptable experience-level video quality, Cisco recommends that customers engineer their networks with a target of no more than 0.05 percent packet loss in each direction between any two TelePresence systems. This is an incredibly small amount, and given the complexity of today’s global networks, 0.05 percent loss is not always possible to achieve. Therefore, Cisco TelePresence implements the following thresholds to alert the network administrator when packet loss (or late packets) exceeds acceptable levels.

Loss Thresholds

When packet loss (or late packets) exceeds 1 percent averaged over any 10-second period, the Cisco TelePresence system receiving those packets generates an alarm, and an onscreen message appears. The alarm is written to the syslog log file of that TelePresence system, and an SNMP trap message is generated. The onscreen message appears for 15 seconds, after which it is removed, and a 5-minute hold timer is started. During the 5-minute hold timer, syslog/SNMP alarms continue to be generated, but no onscreen message displays.
When packet loss (or late packets) exceeds 10 percent averaged over any 10-second period, the Cisco TelePresence system receiving those packets generates a second alarm, and a second onscreen message appears (unless the hold timer is already in effect). The alarm is written to the syslog log file of that TelePresence system, and an SNMP trap message is generated. The onscreen message displays for 15 seconds, after which it is removed, and a 5-minute hold timer starts (if it wasn’t already started by the first loss threshold). During the 5-minute hold timer, syslog/SNMP alarms continue to be generated, but no onscreen message appears.
If loss (or late packets) exceeds 10 percent averaged over any 60-second period, in addition to the actions described, the system downgrades the quality of its outgoing video. When the video is downgraded, an alarm is generated, and an onscreen icon and message are displayed indicating that the quality has been reduced. The video quality is downgraded by reducing its motion handling (by applying a higher compression factor to the motion); the resolution is not affected. For example, if the meeting runs at 1080p-Best, it downgrades to 1080p-Good. If the meeting runs at 720p-Best, it downgrades to 720p-Good.
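The loss-threshold behavior from the preceding three paragraphs can be sketched roughly as follows; this is an illustration under stated assumptions (the hold timer suppresses only onscreen messages, and the 10-second and 60-second averages are supplied by the caller), not the actual codec logic.

import time

class LossAlarms:
    def __init__(self):
        self.hold_timer_until = 0.0             # onscreen messages suppressed until this time

    def _onscreen(self, text):
        now = time.monotonic()
        if now >= self.hold_timer_until:
            print("onscreen:", text)            # displayed for 15 seconds
            self.hold_timer_until = now + 300   # start the 5-minute hold timer

    def on_10s_average(self, loss_pct):
        if loss_pct > 10:
            print("syslog: loss > 10%")         # alarm + SNMP trap
            self._onscreen("severe packet loss detected")
        elif loss_pct > 1:
            print("syslog: loss > 1%")          # alarm + SNMP trap
            self._onscreen("packet loss detected")

    def on_60s_average(self, loss_pct):
        if loss_pct > 10:
            print("syslog: downgrading outgoing video")   # e.g., 1080p-Best -> 1080p-Good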
Tip 
Ten percent packet loss could mean 1 out of every 10 packets lost, in which case every inter-reference frame would be impacted, causing the video to completely freeze; or it could mean 10 packets lost consecutively followed by 90 packets received consecutively, which would have a much less severe effect on the video quality. For example, 10 percent packet loss due to a duplex mismatch, in which packets are dropped consistently and evenly, would have a much more severe effect than 10 percent packet loss due to a queue in the network tail-dropping several packets in a burst and then forwarding the remaining packets.
Finally, if loss equals 100 percent for more than 30 seconds, the codec hangs up the call. If the packets begin flowing again at any time before the 30-second timer expires, the codec immediately recovers.
