Real-Time Transport Protocol | Encoding and Packetization



The preceding content covered the basic concepts of video and audio inputs and encoding processes. The video is encoded using H.264, and the audio is encoded using AAC-LD. Now those video and audio samples must be turned into IP packets and sent onto the network to be transported to the other end. This is done using the Real-Time Transport Protocol (RTP).
RTP is a network protocol specifically designed to provide transport services for real-time applications such as interactive voice and video. The services it provides include identification of payload type, sequence numbering, timestamps, and monitoring of the delivery of RTP packets through the RTP control protocol (RTCP). RTP and RTCP are both specified in IETF RFC 3550.

RTP Packet Format

RTP defines a standard packet format for delivering the media, as shown in Figure 1.

Figure 1: RTP packet format
The following are the fields within the RTP packet (a short parsing sketch in Python follows the list):
  • Version (V): A 2-bit field indicating the protocol version. The current version is 2.
  • Padding (P): A 1-bit field indicating padding at the end of the RTP packet.
  • Extension Header (X): A 1-bit field indicating the presence of an optional extension header.
  • CSRC Count (CC): A 4-bit field indicating the number of Contributing Source (CSRC) identifiers that follow the fixed header. CSRC identifiers are present only when inserted by an RTP mixer such as a conference bridge or transcoder.
  • Marker (M): A 1-bit marker bit that identifies events such as frame boundaries.
  • Payload Type (PT): A 7-bit field that identifies the format of the RTP payload.
  • Sequence Number: A 16-bit field that increments by one for each RTP packet sent. The receiver uses this field to identify lost packets.
  • Timestamp: A 32-bit timestamp field that reflects the sampling instant of the first octet of the RTP packet.
  • Synchronization Source Identifier (SSRC): A 32-bit field that uniquely identifies the source of a stream of RTP packets.
  • Contributing Source Identifiers (CSRC): A variable-length field that contains a list of the sources of RTP streams that have contributed to a combined stream produced by an RTP mixer. You can use this to identify the individual speakers when a mixer combines streams in an audio or video conference.
  • RTP Extension (Optional): A variable-length field that contains a 16-bit profile-specific identifier and a 16-bit length field, followed by variable-length extension data. Intended for limited use.
  • RTP Payload: A variable-length field that holds the real-time application data (voice, video, and so on).
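To make the layout in Figure 1 concrete, the following is a minimal sketch in Python that unpacks the 12-byte fixed header into the fields listed above. CSRC list, extension, and padding handling are omitted for brevity, and the function name parse_rtp_header is illustrative rather than part of any standard library.

```python
# A minimal sketch (not production code) that unpacks the 12-byte fixed
# RTP header shown in Figure 1. Field layout follows RFC 3550.
import struct

def parse_rtp_header(packet: bytes) -> dict:
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version":      b0 >> 6,          # V: 2 bits, currently 2
        "padding":      (b0 >> 5) & 0x1,  # P: 1 bit
        "extension":    (b0 >> 4) & 0x1,  # X: 1 bit
        "csrc_count":   b0 & 0x0F,        # CC: 4 bits
        "marker":       b1 >> 7,          # M: 1 bit (e.g., frame boundary)
        "payload_type": b1 & 0x7F,        # PT: 7 bits
        "sequence":     seq,              # 16 bits, +1 per packet sent
        "timestamp":    ts,               # 32 bits, sampling instant
        "ssrc":         ssrc,             # 32 bits, stream source ID
    }
```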

Frames Versus Packets

To understand the role of RTP within TelePresence, you need to understand the behavior of voice and video over a network infrastructure. Figure 2 shows a sample comparison of voice and video traffic as it appears on a network.


Figure 2: Comparison of voice and video on the network
 
As you can see in Figure 2, voice appears as a series of audio samples, spaced at regular intervals. In the case of Cisco TelePresence, voice packets are typically sent every 20 msec. Each packet contains one or more encoded samples of the audio, depending on the encoding algorithm used and how many samples per packet it is configured to use. G.711 and G.729, two of the most common VoIP encoding algorithms, typically carry two 10-msec voice samples per packet. The sizes of the voice packets are consistent, averaging slightly over 220 bytes for a two-sample G.711 packet; therefore, the overall characteristic of voice is a constant bit-rate stream.
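The arithmetic behind those numbers is easy to verify. The back-of-the-envelope sketch below assumes G.711 with 20 msec of audio per packet, IPv4, and untagged Ethernet framing; exact on-wire sizes vary with the Layer 2 encapsulation.

```python
# Back-of-the-envelope check of the G.711 packet size and bit rate
# described above (assumes IPv4 and untagged Ethernet framing).
G711_BYTES_PER_SEC = 8000   # 8 kHz sampling, 1 byte per audio sample
PACKET_INTERVAL = 0.020     # two 10-msec samples = 20 msec of audio

payload = int(G711_BYTES_PER_SEC * PACKET_INTERVAL)   # 160 bytes of audio
rtp_udp_ip = 12 + 8 + 20                              # 40 bytes of headers
ethernet = 18                                         # L2 framing overhead

print(payload + rtp_udp_ip + ethernet)   # 218 bytes on the wire (~220;
                                         # +4 more with an 802.1Q tag)
print((payload + rtp_udp_ip) * 8 / PACKET_INTERVAL)   # 80000.0 b/s at Layer 3
```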
Video traffic appears as a series of video frames spaced at regular intervals. In the case of Cisco TelePresence, video frames are sent approximately every 33 msec. The size of each frame varies based on the amount of changes since the previous frame. Therefore, the overall characteristic of TelePresence video is a relatively bursty, variable bit-rate stream.
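One quick way to see the contrast Figure 2 draws is to compute the instantaneous bit rate of each interval. The sizes in the sketch below are illustrative (the video frame sizes echo those measured later in this section), not captured traffic.

```python
# Sketch: per-interval bit rate of a constant voice stream versus a
# variable video stream (sizes illustrative, not captured traffic).
voice_packets = [218, 218, 218, 218, 218]           # bytes every 20 msec
video_frames = [65536, 13312, 12000, 19456, 13312]  # bytes every 33 msec

print([b * 8 / 0.020 for b in voice_packets])        # flat: constant bit rate
print([round(b * 8 / 0.033) for b in video_frames])  # ~3-16 Mb/s: bursty,
                                                     # variable bit rate
```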
A video frame can also be referred to as an Access Unit in H.264 terminology. The H.264 standard defines two layers, a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). Figure 3 shows a simplified example.

 
Figure 3: Mapping TelePresence video into RTP packets
The VCL is responsible for encoding the video. Its output is a string of bits representing the encoded video. The function of the NAL is to map the string of bits into units that can then be transported across a network infrastructure. IETF RFC 3984 defines the format for H.264 video carried within the payload of the RTP packets. Each video frame consists of multiple RTP packets spaced out over the frame interval. The boundary of each video frame is indicated through the use of the marker bit, as shown in Figure 1. Each RTP packet contains one or more NAL Units (NALU), depending upon the packet type: single NAL unit packet, single-time or multi-time aggregation packet, or fragmentation unit (part of a NALU). Each NALU consists of an integer number of bytes of coded video.
Note 
RTP packets within a single video frame and across multiple frames are not necessarily independent of each other. In other words, if one packet within a video frame is discarded, it affects the quality of the entire video frame and can also affect the quality of other video frames.
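As an illustration of that mapping, the following simplified Python sketch packetizes the NALUs of one frame along the lines of RFC 3984: a NALU that fits within the payload budget becomes a single NAL unit packet, a larger one is split into FU-A fragmentation units, and the marker bit is set only on the frame's last packet. Aggregation packets (STAP/MTAP) are omitted, and the payload budget is an illustrative value, not a TelePresence parameter.

```python
# Simplified H.264 packetization sketch after RFC 3984 (STAP/MTAP
# aggregation omitted; MAX_PAYLOAD is illustrative).
MAX_PAYLOAD = 1400  # per-packet payload budget, illustrative value
FU_A = 28           # RFC 3984 NAL unit type code for an FU-A fragment

def packetize_frame(nalus):
    """Return (rtp_payload, marker_bit) pairs for one frame (Access Unit)."""
    payloads = []
    for nalu in nalus:
        if len(nalu) <= MAX_PAYLOAD:
            payloads.append(nalu)                  # single NAL unit packet
        else:
            header = nalu[0]                       # F | NRI | type bits
            fu_indicator = (header & 0xE0) | FU_A  # keep F/NRI, type = FU-A
            rest = nalu[1:]
            first = True
            while rest:
                chunk, rest = rest[:MAX_PAYLOAD - 2], rest[MAX_PAYLOAD - 2:]
                fu_header = header & 0x1F          # original NAL unit type
                if first:
                    fu_header |= 0x80              # S bit: first fragment
                    first = False
                if not rest:
                    fu_header |= 0x40              # E bit: last fragment
                payloads.append(bytes([fu_indicator, fu_header]) + chunk)
    # the marker bit flags the frame boundary (the M field in Figure 1)
    return [(p, i == len(payloads) - 1) for i, p in enumerate(payloads)]
```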

TelePresence Video Packet and Frame Sizes

The sizes of individual RTP packets within frames vary, depending upon the number of NALUs they carry and the sizes of those NALUs. Overall, packet sizes average 1100 bytes for Cisco TelePresence video. The number of packets per frame also varies considerably based upon how much information is contained within the video frame. This is partially determined by how the video is encoded: as either reference frames or inter-reference frames.
Note 
Coding is actually done at the macroblock layer. An integer number of macroblocks then form a slice, and multiple slices form a frame. Therefore, technically slices are intrapredicted (I-slices) or interpredicted (P-slices).
Compression of reference frames is typically only moderate because only spatial redundancy within the frame is eliminated. Therefore, reference frames tend to be much larger than inter-reference frames. Reference frame sizes of 64 KB to 65 KB (approximately 60 individual packets) have been observed with TelePresence endpoints. Inter-reference frames achieve much higher compression because only the difference between the frame and the reference frame is sent. This information is typically sent in the form of motion vectors indicating the relative motion of objects from the reference frame. The size of TelePresence inter-reference frames depends on the amount of motion within the conference call. Under normal motion, TelePresence inter-reference frames tend to average 13 KB in size and typically consist of approximately 12 individual packets. Under high motion they can reach approximately 19 KB in size and consist of approximately 17 individual packets.
From a bandwidth utilization standpoint, much better performance can be achieved by sending reference frames infrequently. During normal operation, Cisco TelePresence codecs send reference frames (IDRs) only once every several minutes in point-to-point calls to reduce the burstiness of the video and lower the overall bit rate.
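As a rough consistency check on these figures, dividing the observed frame sizes by the roughly 1100-byte average packet size from the previous section reproduces the packet counts quoted above, and the frame sizes together with the 33-msec frame interval imply the sustained bit rates involved. The sketch below simply reruns that arithmetic.

```python
# Sketch: relating observed frame sizes to packets per frame and
# sustained bit rate (uses the ~1100-byte average packet size and
# ~33-msec frame interval cited earlier in the section).
AVG_PACKET = 1100        # bytes per video RTP packet, on average
FRAME_INTERVAL = 0.033   # seconds between frames (~30 fps)

frames = [("reference (IDR)", 64 * 1024),
          ("inter, normal motion", 13 * 1024),
          ("inter, high motion", 19 * 1024)]

for label, size in frames:
    packets = size / AVG_PACKET
    mbps = size * 8 / FRAME_INTERVAL / 1e6
    print(f"{label}: ~{packets:.0f} packets, ~{mbps:.1f} Mb/s if sustained")

# reference (IDR): ~60 packets, ~15.9 Mb/s if sustained
# inter, normal motion: ~12 packets, ~3.2 Mb/s if sustained
# inter, high motion: ~18 packets, ~4.7 Mb/s if sustained
```

The first line of output makes the motivation for infrequent IDRs plain: a stream that sent reference frames at every frame interval would sustain roughly five times the bit rate of one dominated by inter-reference frames.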
