Video Encoding | Cisco TelePresence

When the video coming from the cameras is presented at the HDMI inputs of the codecs, it passes through the DSP array to be encoded and compressed using the H.264 algorithm. The encoding engine within the Cisco TelePresence codec derives its clock from the camera input, so the video is encoded at 30 frames per second (30 times a second, the camera passes a video frame to the codec to be encoded and compressed).

H.264 Compression Algorithm

H.264 is a video encoding and compression standard jointly developed by the Telecommunication Standardization Sector (ITU-T) Video Coding Experts Group (VCEG) and the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group (MPEG). It was originally completed in 2003, with development of additional extensions continuing through 2007 and beyond.
H.264 is equivalent to, and also known as, MPEG-4 Part 10, or MPEG-4 AVC (Advanced Video Coding). These standards are jointly maintained so that they have identical technical content and are therefore synonymous. Generally speaking, PC-based applications such as Microsoft Windows Media Player and Apple QuickTime refer to it as MPEG-4, whereas real-time, bidirectional applications such as video conferencing and telepresence refer to it as H.264.

H.264 Profiles and Levels

The H.264 standard defines a series of profiles and levels, with corresponding target bandwidths and resolutions. Basing its development of the Cisco TelePresence codec upon the standard, while using the latest digital signal processing hardware and advanced software techniques, Cisco developed a codec that could produce 1080p resolution (1920x1080) at a bit rate of under 4 Mbps. One of the key ingredients used to achieve this level of performance was the implementation of Context-Adaptive Binary Arithmetic Coding (CABAC).
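To put "1080p at under 4 Mbps" in perspective, the arithmetic below compares it against the raw, uncompressed bandwidth of a 1080p / 30 source. The 4:2:0, 8-bit sampling assumption (12 bits per pixel) is a common baseline for H.264 video and is an assumption here, not something stated in the text:

```python
# Rough compression-ratio estimate for 1080p30 video encoded at ~4 Mbps.
width, height, fps = 1920, 1080, 30
bits_per_pixel = 12            # assumes 4:2:0 chroma subsampling, 8-bit samples
raw_bps = width * height * bits_per_pixel * fps
encoded_bps = 4_000_000        # ~4 Mbps target from the text
ratio = raw_bps / encoded_bps
print(f"raw: {raw_bps / 1e6:.0f} Mbps, compression ratio ~{ratio:.0f}:1")
# raw: 746 Mbps, compression ratio ~187:1
```

In other words, the encoder is discarding or predicting away roughly 99.5 percent of the raw bits, which is why computationally heavy tools such as CABAC matter.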

CABAC

CABAC is a method of encoding that provides considerably better compression but is extremely computationally expensive, and hence requires considerable processing power to encode and decode. CABAC is fully supported by the H.264 standard, but Cisco TelePresence, with its advanced array of Digital Signal Processing (DSP) resources, was the first implementation on the market capable of performing this level of computational complexity while maintaining the extremely unforgiving encode and decode times (latency) needed for real-time, bidirectional human interaction.

Resolution

The resolution of an image is defined by the number of pixels contained within the image. It is expressed in the format of number of pixels wide x number of pixels high (pronounced "x by y"). The H.264 standard supports myriad video resolutions ranging from Sub-Quarter Common Interchange Format (SQCIF) (128x96) all the way up to ultra-high-definition resolutions such as Quad Full High Definition (QFHD) or 2160p (3840x2160). Table 1 lists the most common resolutions used by video conferencing and telepresence devices. The resolutions supported by Cisco TelePresence are noted with an asterisk.
Table 1: H.264 Common Resolutions

Name     Width   Height   Aspect Ratio
QCIF      176     144     4:3
CIF*      352     288     4:3
4CIF      704     576     4:3
480p      854     480     16:9
XGA*     1024     768     4:3
720p*    1280     720     16:9
1080p*   1920    1080     16:9
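The entries in Table 1 can be sanity-checked in code. Note that the storage dimensions do not always divide exactly to the nominal ratio: CIF (352x288) divides to 11:9 but is displayed at 4:3 via non-square pixels, and 854x480 is only approximately 16:9. The sketch below therefore classifies each resolution by the nearest nominal display ratio rather than by exact division:

```python
# Classify each Table 1 resolution by its nearest nominal display aspect ratio.
RESOLUTIONS = {
    "QCIF": (176, 144), "CIF": (352, 288), "4CIF": (704, 576),
    "480p": (854, 480), "XGA": (1024, 768),
    "720p": (1280, 720), "1080p": (1920, 1080),
}

NOMINAL = {"4:3": 4 / 3, "16:9": 16 / 9}

def closest_aspect(width, height):
    # Pick whichever nominal ratio is closest to width/height.
    return min(NOMINAL, key=lambda name: abs(width / height - NOMINAL[name]))

for name, (w, h) in RESOLUTIONS.items():
    print(f"{name}: {closest_aspect(w, h)}")
```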

Aspect Ratio

The aspect ratio of an image is its width divided by its height. Aspect ratios are typically expressed in the format of x:y (pronounced x-by-y, such as "16 by 9"). The two most common aspect ratios in use by video conferencing and telepresence devices are 4:3 and 16:9. Standard-definition displays are 4:3 in shape, whereas high-definition displays provide a widescreen format of 16:9. As shown in Table 1, Cisco TelePresence supports CIF and XGA, which are designed to be displayed on standard-definition displays, along with 720p and 1080p, which are designed for high-definition displays. The Cisco TelePresence system uses high-definition format displays that are 16:9 in shape, so when a CIF or XGA resolution is displayed on the screen, it is surrounded by black borders, as illustrated in Figure 1.

Figure 1: 16:9 and 4:3 images displayed on a Cisco TelePresence system
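The size of those black borders follows directly from the arithmetic. A minimal sketch (the function name is illustrative): scale the 4:3 content to the full display height while preserving its aspect ratio, then center it, leaving equal black bands on either side.

```python
# Pillarbox arithmetic: fit 4:3 content onto a 16:9 display.
def pillarbox(display_w, display_h, src_w, src_h):
    scaled_w = round(display_h * src_w / src_h)  # scale to full height, keep aspect
    border = (display_w - scaled_w) // 2         # black band width on each side
    return scaled_w, border

scaled, border = pillarbox(1920, 1080, 1024, 768)  # XGA content on a 1080p display
print(scaled, border)  # 1440 240
```

So an XGA image on a 1080p Cisco TelePresence display occupies a 1440-pixel-wide area with 240-pixel black borders on the left and right.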

Frame Rate and Motion Handling

The frame rate of the video image is the number of video frames per second (fps) contained within the encoded video. Cisco TelePresence operates at 30 frames per second (30 fps). This means that 30 times per second (every 33 milliseconds) a video frame is produced by the encoder. Each 33-ms period can be referred to as a frame interval.
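The 33-ms figure is simply the reciprocal of the frame rate:

```python
# Deriving the frame interval from a constant 30 fps frame rate.
FPS = 30
frame_interval_ms = 1000 / FPS
print(f"{frame_interval_ms:.1f} ms per frame")  # 33.3 ms per frame
```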
Motion handling defines the degree of compression within the encoding algorithm to either enhance or suppress the clarity of the video when motion occurs within the image. High motion handling results in a smooth, clear image even when a lot of motion occurs within the video (people walking around, waving their hands, and so on). Low motion handling results in a noticeably choppy, blurry, grainy, or pixelized image when people or objects move around. These concepts have been around since the birth of video conferencing, and nearly all video conferencing and telepresence products on the market offer user-customizable settings to enhance or suppress the clarity of the video when motion occurs. Figure 2 shows a historical reference that many people can relate to: a screenshot of a Microsoft NetMeeting desktop video conferencing client from the 1990s, which provided a slider bar allowing users to choose whether they preferred higher quality (slower frame rate but clearer motion) or faster video (higher frame rate but blurrier motion).

Figure 2: Microsoft NetMeeting video quality setting
Cisco TelePresence provides a similar concept. Although the Cisco TelePresence cameras operate at 1080p resolution 30 Hz (1080p / 30), the encoder within the Cisco TelePresence codec to which the camera attaches can encode and compress the video into either 1080p or 720p resolution at three different motion-handling levels per resolution, providing the customer with the flexibility to decide how much bandwidth the system consumes. Instead of a sliding scale, Cisco uses the terms Good, Better, and Best. Best motion handling provides the clearest image (and hence uses the most bandwidth), whereas Good motion handling provides the least-clear image (and hence the least bandwidth). The Cisco TelePresence codec also supports the CIF resolution (352x288) for interoperability with traditional video conferencing devices, and the XGA resolution (1024x768) for the auxiliary video channels used for sharing a PC application or a document camera image. Table 2 summarizes the resolutions, frame rates, motion-handling levels, and bit rates supported by the Cisco TelePresence codec.
Table 2: Cisco TelePresence Supported Resolutions, Motion Levels, and Bit Rates

Resolution   Frame Rate   Motion Handling    Bit Rate
CIF          30 fps       Not configurable   768 kbps
XGA          5 fps        Not configurable   500 kbps
XGA          30 fps       Not configurable   4 Mbps
720p         30 fps       Good               1 Mbps
720p         30 fps       Better             1.5 Mbps
720p         30 fps       Best               2.25 Mbps
1080p        30 fps       Good               3 Mbps
1080p        30 fps       Better             3.5 Mbps
1080p        30 fps       Best               4 Mbps
Tip 
Note that Cisco TelePresence always runs at 30 fps (except for the auxiliary PC and document camera inputs that normally run at 5 fps but can also run at 30 fps if the customer wants to enable that feature). Most video conferencing and telepresence providers implement variable frame-rate codecs that attempt to compensate for their lack of encoding horsepower by sacrificing frame rate to keep motion handling and resolution quality levels as high as possible. Cisco TelePresence codecs do not need to do this because they contain so much DSP horsepower. The Cisco TelePresence codec can comfortably produce 1080p resolution at a consistent 30 fps regardless of the amount of motion in the video. The only reason Cisco provides the motion handling (Good, Better, Best) setting is to give the customer the choice of sacrificing motion quality to fit over lower bandwidths.
The video from the Cisco TelePresence cameras and the auxiliary video inputs (PC or document camera) is encoded and compressed by the Cisco TelePresence codec using the H.264 standard. Each input is encoded independently and can be referred to as a channel. (This is discussed further in subsequent sections, which detail how these channels are packetized and multiplexed onto the network using the Real-Time Transport Protocol.) On multiscreen systems, such as the CTS-3000 and CTS-3200, each camera is connected to its respective codec, and that codec encodes the camera's video into an H.264 stream. Thus, there are three independently encoded video streams. The primary (center) codec also encodes the auxiliary (PC or document camera) video into a fourth, separate H.264 stream.

Frame Intervals and Encoding Techniques

Each 33-ms frame interval contains encoded slices of the video image. A reference frame is the largest and least compressed and contains a complete picture of the image. After sending a reference frame, subsequent frames contain only the changes in the image since the most recent reference frame. For example, if a person is sitting in front of the camera, talking and gesturing naturally, the background (usually a wall) behind the person does not change. Therefore, the inter-reference frames need to encode only the pixels within the image that are changing. This allows for substantial compression of the amount of data needed to reconstruct the image. Reference frames are sent at the beginning of a call, or anytime the video is interrupted, such as when the call is placed on hold and the video suspended, and then the call is taken off hold and the video resumed.

Instantaneous Decode Refresh (IDR) Frames

An IDR frame is a reference frame containing a complete picture of the image. When an IDR frame is received, the decode buffer is refreshed so that all previously received frames are marked as "unused for reference," and the IDR frame becomes the new reference picture. IDR frames are sent by the encoder at the beginning of the call and at periodic intervals to refresh all the receivers. They can also be requested at any time by any receiver. There are two pertinent examples of when IDRs are requested by receivers:
  • In a multipoint meeting, as different sites speak, the Cisco TelePresence Multipoint Switch (CTMS) switches the video streams to display the active speaker. During this switch, the CTMS sends an IDR request to the speaking endpoint so that all receivers can receive a new IDR reference frame.
  • Whenever packet loss occurs on the network and the packets lost are substantial enough to cause a receiver to lose sync on the encoded video image, the receiver can request a new IDR frame so that it can sync back up.
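The decode-buffer refresh described above can be sketched as a toy model (the class and method names are illustrative, not a real decoder API): on an IDR, everything previously buffered becomes unusable for reference, and the IDR frame starts a fresh reference set.

```python
# Toy model of a decoder's reference buffer reacting to IDR frames.
class DecodeBuffer:
    def __init__(self):
        self.references = []

    def on_frame(self, frame_id, is_idr):
        if is_idr:
            # Mark all previously received frames "unused for reference."
            self.references.clear()
        self.references.append(frame_id)

buf = DecodeBuffer()
for fid, idr in [(1, True), (2, False), (3, False), (4, True), (5, False)]:
    buf.on_frame(fid, idr)
print(buf.references)  # [4, 5]
```

Frames 1 through 3 are discarded as references the moment IDR frame 4 arrives, which is exactly why a receiver that has lost sync can request an IDR to get a clean starting point.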
This concept of reference frames and inter-reference frames results in a highly variable bit rate. When the video traffic on the network is measured over time, peaks and valleys occur in the traffic pattern. (The peaks are the IDR frames, and the valleys are the inter-IDR frames.) It is important to note that Cisco TelePresence does not use a variable frame rate (it is always 30 fps), but it does use a variable bit rate. (The amount of data sent per frame interval varies significantly depending upon the amount of motion within each frame interval.) Figure 3 illustrates what a single stream of Cisco TelePresence 1080p / 30 encoded video traffic looks like over a one-second time interval.

Figure 3: 1080p / 30 traffic pattern as viewed over one second
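The distinction between constant frame rate and variable bit rate can be made concrete with an illustrative second of traffic. The frame sizes below are assumed values chosen only to show the shape of the pattern (one large IDR peak followed by smaller inter frames), not measured Cisco TelePresence data:

```python
# One second of 30 frames: a large IDR frame followed by smaller inter frames.
FPS = 30
idr_bits = 400_000           # assumed IDR frame size (the "peak")
inter_bits = 110_000         # assumed inter-frame size (the "valley")
frame_sizes = [idr_bits] + [inter_bits] * (FPS - 1)

avg_bps = sum(frame_sizes)   # all 30 frames arrive within one second
print(f"{avg_bps / 1e6:.2f} Mbps")  # 3.59 Mbps
```

The frame rate never wavers (30 frames every second), yet the bits per frame interval swing by almost 4x, which is the peaks-and-valleys pattern shown in Figure 3.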
Long-Term Reference (LTR) Frames

A newer technique, being implemented at the time this book was authored, is Long-Term Reference (LTR) frames. LTR allows multiple frames to be marked for reference, providing the receiver with multiple points of reference from which to reconstruct the video image. This substantially reduces the need for periodic IDR frames, further reducing the amount of bandwidth needed to maintain picture quality and increasing the efficiency of the encode and decode process. With LTR, the periodic spikes in the traffic pattern illustrated in Figure 3 would be substantially less frequent, resulting in overall bandwidth savings. It would also enable the receiver to handle missing frame data more efficiently, resulting in substantially higher tolerance to network packet loss.
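A simplified sketch of the recovery benefit (the function and variable names are hypothetical, and real LTR recovery involves encoder cooperation this toy omits): a decoder holding several long-term references can resynchronize from the newest intact one rather than requesting a full IDR refresh.

```python
# Toy model: recover from loss using the newest long-term reference (LTR)
# that precedes the lost frame, falling back to an IDR request if none exists.
def recovery_point(long_term_refs, lost_frame):
    intact = [f for f in long_term_refs if f < lost_frame]
    return max(intact) if intact else None  # None -> must request a new IDR

ltrs = [30, 60, 90]                        # frame numbers marked as LTRs
print(recovery_point(ltrs, lost_frame=75))  # 60
```

Losing frame 75 costs only the frames back to LTR 60, instead of forcing a bandwidth-expensive IDR from the encoder.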
