Latency Calculations in Software Defined Radios

This section addresses requirements and concerns associated with latency-critical applications. As the stock product is not optimized for latency, we encourage users with latency-sensitive applications to contact us directly to help determine the optimal implementation.

For the purposes of this section, we will concern ourselves with two types of latencies: the receive or transmit latency and the round trip latency. Receive or transmit latency is the time required for the unidirectional receipt or transmission of data between the antenna of the radio chain and the host computer. The round trip latency is the time required for radio data that is received on one radio chain to be transmitted on another chain (or vice versa).

Receive or Transmit Latency

This measurement captures the time elapsed from when a signal incident on the antenna is processed and presented to the user. Alternatively, it is the time elapsed from when a user application sends radio samples to the time those samples are transmitted from the antenna.

Various sources contribute to this latency, including the radio chain, converters, FPGA DSP, packetization, buffering, network transmission latency, and OS receive latency. The equations below summarize the major latency contributors when considering receive and transmit latency:

\[\begin{equation} \tau_{\text{lat},\text{Rx}}=\tau_{\text{RF},\text{Rx}}+\tau_{\text{ADC}}+\tau_{\text{DSP}}+\tau_{\text{buf},\text{Rx}}+\tau_{\text{frame}}+\tau_{\text{net}}+\tau_{\text{os},\text{Rx}}+\tau_{\text{app},\text{Rx}} \label{eq:latrx} \end{equation}\]
\[\begin{equation} \tau_{\text{lat},\text{Tx}}=\tau_{\text{app},\text{Tx}}+\tau_{\text{os},\text{Tx}}+\tau_{\text{net}}+\tau_{\text{deframe}}+\tau_{\text{buf},\text{Tx}}+\tau_{\text{DSP}}+\tau_{\text{DAC}}+\tau_{\text{RF},\text{Tx}} \label{eq:lattx} \end{equation}\]

where \(\tau_{\text{lat},\text{Rx}}\) and \(\tau_{\text{lat},\text{Tx}}\) are the total receive and transmit latencies, \(\tau_{\text{RF},\text{Tx}}\) and \(\tau_{\text{RF},\text{Rx}}\) represent the transmit and receive group delays associated with the radio front end (RFE), \(\tau_{\text{ADC}}\) and \(\tau_{\text{DAC}}\) are the total converter (de)serialization delays, \(\tau_{\text{DSP}}\) is the DSP processing delay on the FPGA (digital up/down conversion, decimation, interpolation, and filtering), \(\tau_{\text{buf},\text{Rx}}\) and \(\tau_{\text{buf},\text{Tx}}\) represent the FPGA receive and transmit sample buffer delays, \(\tau_{\text{frame}}\) and \(\tau_{\text{deframe}}\) represent the time required for the FPGA to frame or deframe the Ethernet packets, \(\tau_{\text{net}}\) is the total network latency, \(\tau_{\text{os},\text{Rx}}\) and \(\tau_{\text{os},\text{Tx}}\) are the host PC operating system network stack latencies, and \(\tau_{\text{app},\text{Rx}}\) and \(\tau_{\text{app},\text{Tx}}\) are the user space client receive and transmit latencies.
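As a quick illustration of how these terms combine, the following Python sketch sums an arbitrary set of contributor values into a total receive latency. All numbers are illustrative placeholders rather than measured Crimson TNG/Cyan figures.

    # Minimal sketch: summing the receive latency contributors from the
    # equation above. All values are illustrative placeholders, not
    # measured Crimson TNG/Cyan figures.

    contributors_us = {
        "rf_rx":  1.0,    # RFE group delay (assumed)
        "adc":    0.5,    # converter deserialization (assumed)
        "dsp":    2.0,    # FPGA DSP chain (assumed)
        "buf_rx": 400.0,  # sample buffer at 1 MSPS (see buffer Rx equation below)
        "frame":  1.28,   # Ethernet framing (see deframe equation below)
        "net":    10.0,   # network latency (assumed)
        "os_rx":  15.0,   # host OS network stack (assumed)
        "app_rx": 5.0,    # user-space client (assumed)
    }

    tau_lat_rx_us = sum(contributors_us.values())
    print(f"Total receive latency ~= {tau_lat_rx_us:.2f} us")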

It is worth noting that not all of these sources have a constant latency contribution. Examining some components in more detail, we find that the radio chain group delay, \(\tau_{\text{RF}}\), is approximately constant and largely invariant to changes in sample rate or frequency. In contrast, the network latency, especially the transmit latency (\(\tau_{\text{net, Tx}}\)), and the operating system latencies (\(\tau_{\text{os, Rx}}\) and \(\tau_{\text{os, Tx}}\)) can be highly variable, with a strong dependency on the type of network card, the operating system, and system load. This latency may be reduced by purchasing a 10GBASE-R NIC that is optimized for low latency, and made more deterministic by running a real-time operating system to help limit variance.

Taking this variance into account is especially important when transmitting at high sample rates on multiple channels (i.e., over 30 MSPS on 4 channels). This is because the transmit FIFO is specified as a fixed number of samples: at very high sample rates, the cumulative period represented by the fixed number of samples stored in the FIFO is reduced, which makes the overall system more susceptible to the transmit NIC variance. Also note that, due to the determinism of Crimson TNG/Cyan, the delivery and receipt of samples to and from the FPGA 10G PHY have substantially lower variance and jitter than those of the host PC. If this may impact your application, please contact us for more specific advice.
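To illustrate why this matters, the short Python sketch below computes the span of time represented by a fixed-depth transmit FIFO at several sample rates; the 65,536-sample depth is the default transmit FIFO size described later in this section, and the rates are arbitrary.

    # Time span covered by a fixed-depth transmit FIFO at various sample
    # rates. Uses the default 65,536-sample FIFO described later in this
    # section; higher sample rates leave less time to absorb NIC variance.

    FIFO_DEPTH_SAMPLES = 65_536

    for rate_msps in (1, 10, 30, 100):
        fs = rate_msps * 1e6                      # samples per second
        span_ms = FIFO_DEPTH_SAMPLES / fs * 1e3   # milliseconds of signal in FIFO
        print(f"{rate_msps:>4} MSPS -> FIFO holds {span_ms:.3f} ms of samples")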

Note

In the event of a buffer underflow or overrun, the default FPGA configuration inserts zero-valued samples or discards samples, respectively, to preserve sample coherency.

In addition, the sample buffer latencies (\(\tau_{\text{buf, Rx}}\), \(\tau_{\text{buf, Tx}}\)) strongly depend on the sample rate. This is because the sample buffers are located immediately before and after the 10GBASE-R framing code. In the case of the receive chain, the samples accumulate in the sample buffer, at a divisor of the sample rate clock, until enough samples have accumulated to fill a complete UDP packet payload. Those samples are then popped from the FIFO, at the network clock rate, and assembled into a complete UDP packet that is immediately transmitted (this framing time is accounted for in \(\tau_{\text{frame}}\) and \(\tau_{\text{deframe}}\)).

As a result of this, we can expand \(\tau_{\text{buf, Rx}}\) and \(\tau_{\text{(de)frame}}\) into:

Buffer Rx equation:

\[\begin{equation} \label{eq:bufrx} \tau_{\text{buf, Rx}}=\frac{P_{\text{pkt,bytes}}}{S_{\text{smpl,size,bytes}}}\cdot\left(\frac{1}{f_{s}}\right) \end{equation}\]

deframe equation:

\[\begin{equation} \tau_{\text{(de)frame}} =\frac{P_{\text{pkt,bytes}}}{S_{\text{framesize}}}\cdot\left(\frac{1}{f_{\text{c,nw}}}\right) \label{eq:deframe} \end{equation}\]

Given that Crimson TNG/Cyan uses complex samples of 32 bits each, \(S_{\text{smpl,size,bytes}}=4\), the Ethernet framer is 64 bits wide, \(S_{\text{framesize}}=8\), the network clock is \(f_{\text{c,nw}}=156.25\ \text{MHz}\), and the default specified packet size is \(P_{\text{pkt,bytes}}=1600\) bytes, applying the buffer Rx and (de)frame equations gives the default latencies: \(\tau_{\text{buf,Rx,default}}\approx\frac{400}{f_{s}}\) and \(\tau_{\text{(de)frame,default}} \approx \frac{1600}{8}\cdot \frac{1}{156.25 \times 10^6} \approx 1.28\ \mu s\)
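The following Python sketch simply evaluates the buffer Rx and (de)frame equations with these default parameters; the constant and function names are illustrative.

    # Evaluate the buffer Rx and (de)frame equations with the default
    # Crimson TNG/Cyan parameters quoted above.

    PKT_BYTES    = 1600       # default packet payload size, bytes
    SAMPLE_BYTES = 4          # one complex sample = 32 bits
    FRAME_BYTES  = 8          # 64-bit Ethernet framer
    F_NETWORK_HZ = 156.25e6   # network clock

    def tau_buf_rx(fs_hz):
        """Receive sample-buffer latency, in seconds, at sample rate fs_hz."""
        return (PKT_BYTES / SAMPLE_BYTES) / fs_hz

    def tau_frame():
        """Framing/deframing latency, in seconds."""
        return (PKT_BYTES / FRAME_BYTES) / F_NETWORK_HZ

    print(f"tau_buf_rx @ 1 MSPS : {tau_buf_rx(1e6) * 1e6:.1f} us")   # 400 us
    print(f"tau_(de)frame       : {tau_frame() * 1e6:.2f} us")       # ~1.28 us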

When analyzing transmit latency, consideration needs to be given to the transmit sample buffer. Crimson TNG/Cyan communicates over a packetized Ethernet network; that is, the minimum unit of sample data transmission is a UDP packet, whose payload (made up of a number of complex radio samples) is generally sized according to the amount of data the application passes to Crimson TNG/Cyan when sending data.

Note

More specifically, when calling the UHD library send() command, the nsamps_per_buff argument is used to determine the number, and size, of the UDP packets that will actually encapsulate the sample data. While the smallest meaningful payload consists of a single sample, the maximum payload size is bounded by the maximum individual Ethernet packet size (9000 bytes when using jumbo frames) and by Crimson TNG/Cyan protocol support (which does not currently allow for fragmented UDP packets).

In other words, the stock Crimson TNG/Cyan firmware image does not have any provisions for the regular and monotonic transmission of individual samples, but rather aggregates a number of samples into UDP packets which are sent at the 10GBASE-R line rate to Crimson TNG/Cyan. This is somewhat analogous to how the FPGA queues samples from the ADC until a sufficient number have accumulated. In a well designed system, the actual radio sample generation is not a limiting factor, and the transmit buffer latency analysis is identical to the receive case. However, this analysis does not take into account, or compensate for, other sources of system variance. Therefore, in order to complete the analysis of \(\tau_{\text{buf, Tx}}\), we need to consider the impact of \(\tau_{\text{os, Tx}}\) and \(\tau_{\text{net, Tx}}\).
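As a rough illustration of the packetization described above, the Python sketch below splits a send() buffer of nsamps_per_buff samples into UDP packets. The payload cap and the omission of header overhead are assumptions for illustration; this is not the actual UHD or Crimson TNG/Cyan packetization logic.

    import math

    # Illustrative only: how a buffer of nsamps_per_buff complex samples
    # might be split into UDP packets. This is NOT the actual UHD/Crimson
    # packetization logic; it only shows the relationship between buffer
    # size, payload size, and packet count discussed above.

    SAMPLE_BYTES      = 4      # one complex sample = 32 bits
    MAX_PAYLOAD_BYTES = 8900   # assumed payload cap under 9000-byte jumbo frames

    def packetize(nsamps_per_buff):
        max_samps_per_pkt = MAX_PAYLOAD_BYTES // SAMPLE_BYTES
        n_pkts = math.ceil(nsamps_per_buff / max_samps_per_pkt)
        return n_pkts, min(nsamps_per_buff, max_samps_per_pkt)

    for nsamps in (1, 400, 10_000):
        pkts, samps_per_pkt = packetize(nsamps)
        print(f"nsamps_per_buff={nsamps:>6}: {pkts} packet(s), "
              f"up to {samps_per_pkt} samples each")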

In an ideal system, the host PC would transmit individual radio samples to Crimson TNG/Cyan at exactly the correct sample rate, and perfectly regularly. Such an ideal system would not require any buffering: because the host PC and application sample rates and jitter are perfectly matched, the FPGA would always have exactly the correct number of samples to send to the DAC. The closest practical approximation to such a system is interfacing the SDR to a synchronous data processing source, such as another FPGA, that shares the same reference source and clock. In such cases it is comparatively easy to design and optimize the amount of buffering required to compensate for sample rate clock variances (due to random drift, thermal variance, noise, or silicon variations), and thereby minimize the latency contribution of \(\tau_{\text{buf, Tx}}\).

When interfacing with an external host PC running a traditional operating system, a number of different considerations come into play. In addition to not sharing a common clock (which requires us to address crossing clock domains with a potentially large variance), there are two major sources of variance: the operating system (\(\tau_{\text{os, Tx}}\)) and the 10GBASE-R NIC (\(\tau_{\text{net, Tx}}\)). When a host PC application calls send(), the UHD library needs to make a number of system calls to the operating system in order to actually send the data over the network, to the correct address, at the correct time. The time required for these calls to be serviced by the operating system (\(\tau_{\text{os, Tx}}\)) can vary quite substantially. Additionally, most host computers lack the stable and precise time source required to maintain a frequency reference (one is included by default in Crimson TNG/Cyan), and therefore cannot inherently agree with the SDR on the exact duration of one second.

In addition to this, different 10GBASE-R Ethernet PHYs can have substantially different latencies (\(\tau_{\text{net, Tx}}\)), and those latencies can themselves vary substantially. Part of this behaviour is intrinsic to the Ethernet protocol, which requires random back-off periods, and part of it is due to the design and implementation of specific network cards, which are broadly optimized for throughput rather than latency. Table 1 provides some very rough figures for the round trip time of a simple packet.

Table 1: Round Trip Time (RTT) comparison for Crimson TNG/Cyan-PC communication between various NICs.

This table compares the round trip time between a host PC and Crimson TNG/Cyan using various network cards. These figures are not intended to be a performance proxy, and are not statistically significant or rigorous; they are simply intended to illustrate real world NIC variances. Using the default Crimson TNG/Cyan image, the results of: ping 10.10.10.2 -f -c 1000000 were used to populate the table (this constitutes a comparatively small sample size, and took between 13 and 24 seconds to complete).

NIC Part Number    Manufacturer         min RTT (µs)   avg RTT (µs)   max RTT (µs)   Std Dev RTT (µs)   IPG (µs)
FFRM-NS12-000      Atto Technologies    6              8              908            3                  13
OCe14102B-UM       Emulex Corporation   9              13             271            4                  21
ADD-PCIE-2SFP+     AddOn                5              9              400            4                  13

RTT = Round Trip Time


Table 2: Ping statistics for high packet count (long duration) tests of Crimson TNG/Cyan-PC communication using Emulex OCe14102B-UM NICs.

This table provides ping statistics over a point-to-point connection between a host PC and Crimson TNG/Cyan for various packet counts, using an Emulex Corporation NIC and a default Crimson TNG/Cyan image, populated with the results of: ping -f -c <N> <Address>, where N is the number of packets (as an integer) and Address is the destination IP address. Though the increased number of samples is intended to provide greater significance, no controls were applied to the operating system or kernel. As a result, results may vary with host PC configuration, NIC used, and system load.

Index (N)   Dest. Addr    Nbr. Pkts (N)   Duration (HH:MM:SS)   Loss (N)   min RTT (µs)   avg RTT (µs)   max RTT (µs)   Std Dev RTT (µs)   IPG (µs)
1           10.10.10.2    100M            00:51:19              Nil        9              21             2594           9                  32
2           10.10.11.2    100M            00:53:02              Nil        9              16             3196           7                  30

RTT = Round Trip Time

As the user sample rate increases, the variation in \(\tau_{\text{os, Tx}}\) and \(\tau_{\text{net, Tx}}\) can rapidly become greater than the temporal duration represented by the payload of a single UDP packet. For example, using the default payload of 1600 bytes, we can apply the buffer Rx equation to determine the duration of radio signal encapsulated by a single packet. With a sample rate of 1 MSPS, a single packet represents 400 \(\mu s\) of signal, which is roughly comparable to the maximum round trip times observed in Table 1. But with a sample rate of 10 MSPS, each payload represents only 40 \(\mu s\). At this point, because the NIC variance is greater than the duration of time represented by one packet, transmitting a single packet at a time, in the absence of a compensation mechanism, means that the samples within a packet may cover less time than it can take for the next packet to arrive. In other words, even though the time-averaged throughput of the system may be constant over a sufficiently long period, because the packet-to-packet variation can exceed the duration of radio data represented by the samples within a packet, an accumulation of such events could temporarily exhaust the transmission buffer on Crimson TNG/Cyan.
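A short Python sketch makes this comparison explicit, using the default 1600-byte payload and a worst-case round trip time on the order of those observed in Table 1 (an assumed figure for illustration):

    # Compare the signal duration carried by one default-size packet to an
    # assumed worst-case NIC round trip time (on the order of Table 1).

    PAYLOAD_BYTES = 1600      # default packet payload
    SAMPLE_BYTES  = 4         # one complex sample = 32 bits
    MAX_RTT_S     = 400e-6    # assumed worst-case RTT, for illustration

    for rate_msps in (1, 10, 30):
        fs = rate_msps * 1e6
        pkt_duration_s = (PAYLOAD_BYTES / SAMPLE_BYTES) / fs
        print(f"{rate_msps:>3} MSPS: one packet covers {pkt_duration_s * 1e6:6.1f} us "
              f"({pkt_duration_s / MAX_RTT_S:.2f}x the assumed worst-case RTT)")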

Of course, this poses a very substantial and critical problem: the transmission of radio data requires a tightly coupled and synchronous data flow in order to effectively represent arbitrary waveforms. When we run out of user data to transmit to the DAC (i.e., we are in an underflow condition), we therefore need to insert samples into the data stream to ensure specified performance, and to provide a well-defined mode of operation. In addition, due to the real-time requirements of a radio system, data that arrives late is not particularly useful; it is often better to keep track of how many samples the system inserted and, on receipt of a late packet, discard the late data and resume from the correct point (to better preserve phase coherency).

Note

In the case of the stock product, we insert zero-valued samples whenever we exhaust the FIFO. Custom firmware may be implemented to support other modes, including repeating the last valid user sample.

Moreover, depending on the application implementation, the fact that data is delayed does not mean it will not arrive eventually. Depending on the duration of the delay, enough packets may have accumulated that Crimson TNG/Cyan does not have room to buffer all of the late data (an overflow condition).

As a direct result of these considerations, Crimson TNG/Cyan includes a fairly robust flow control mechanism to ensure synchronous data transmission and timing. Using a fairly large sample buffer (65,536 samples), Crimson TNG/Cyan targets 80% utilization when streaming data. This substantial buffer serves to compensate for variations in \(\tau_{\text{os, Tx}}\) and \(\tau_{\text{net, Tx}}\) and to avoid underflow or overflow conditions.

As a result, using the default stock product, the transmit buffer latency may be modeled as:

Buffer Tx equation:

\[\begin{equation} \tau_{\text{buf, Tx}}=C_{\text{tgt,buf}}\cdot\left(\frac{S_{\text{fifo,smpl}}}{f_{s}}\right) \end{equation}\]

where \(S_{\text{fifo,smpl}}\) is the size of the FIFO, in samples (by default 65,536), \(f_{s}\) is the user specified sample rate, and \(C_{\text{tgt,buf}}\) is the target buffer utilization coefficient (ranging from 0 to 1, with a default value of 0.8). For the default product, the buffer Tx equation may be simplified to approximately:

\[\begin{equation} \tau_{\text{buf,Tx,default}}\approx\frac{52429}{f_{s}} \label{eq:bufTxDefault} \end{equation}\]

while the deframing delay remains approximately the same, as given by the (de)frame equation above.
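The following Python sketch evaluates the buffer Tx equation with the default FIFO depth and target utilization at a few arbitrary sample rates; the names are illustrative only.

    # Evaluate the buffer Tx equation with the default FIFO depth and
    # target utilization, for a few sample rates.

    FIFO_SAMPLES = 65_536   # default transmit FIFO depth, in samples
    C_TGT_BUF    = 0.8      # default target buffer utilization

    def tau_buf_tx(fs_hz, c_tgt=C_TGT_BUF, fifo=FIFO_SAMPLES):
        """Transmit buffer latency, in seconds, at sample rate fs_hz."""
        return c_tgt * fifo / fs_hz

    for rate_msps in (1, 10, 100):
        print(f"{rate_msps:>4} MSPS: tau_buf_tx ~= "
              f"{tau_buf_tx(rate_msps * 1e6) * 1e3:.3f} ms")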

Round Trip Latency

We can calculate the total round trip latency by adding the host PC processing time to the Rx and Tx latency equations. This results in:

\[\begin{equation} \tau_{\text{lat,RTT}}=\tau_{\text{lat, Rx}}+\tau_{\text{app, process}}+\tau_{\text{lat, Tx}} \label{eq:latrtt} \end{equation}\]

where RTT denotes the total round trip time, and \(\tau_{\text{app, process}}\) is the latency due to processing done on the host machine (such as GNU Radio or a C++ program), which is variable.

Note that the application latency \(\tau_{\text{app, process}}\) can vary substantially. For example, GNU Radio Companion flow graphs impose a fixed delay between Rx and Tx to ensure sufficient buffering; there is a fixed latency of about 0.2 s when the program is instantiated. This latency may be substantially reduced by using a lower level programming language such as C++.
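As an illustration, the Python sketch below combines the default Rx and Tx buffer terms with placeholder values for the remaining contributors and an assumed host processing delay; none of the fixed numbers are measured figures.

    # Rough round-trip latency estimate: combine the default Rx and Tx
    # buffer terms with placeholder values for the remaining contributors.
    # All non-buffer numbers are illustrative, not measured figures.

    def tau_rtt(fs_hz, tau_app_process_s, tau_fixed_s=50e-6):
        """Round-trip latency estimate, in seconds, at sample rate fs_hz.

        tau_fixed_s lumps together the RFE, converter, DSP, framing,
        network, and OS terms (assumed value, for illustration only)."""
        tau_buf_rx = 400.0 / fs_hz      # default Rx buffer latency
        tau_buf_tx = 52429.0 / fs_hz    # default Tx buffer latency
        return tau_buf_rx + tau_buf_tx + tau_fixed_s + tau_app_process_s

    # Example: 10 MSPS stream with an assumed ~1 ms of host processing
    print(f"RTT ~= {tau_rtt(10e6, 1e-3) * 1e3:.2f} ms")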

Minimizing Latency

As the default product is not specifically optimized for latency (focusing instead on ensuring satisfactory operation over as wide a range of parameters as possible), operation at high sample rates requires a number of different considerations. One of the primary considerations when interfacing with a host PC is reducing the variation in latency. The greatest non-deterministic contributors to latency are \(\tau_{\text{os}}\) and \(\tau_{\text{net}}\). Moving to a real-time operating system (such as the real-time Linux kernel), using the latest kernel drivers (to ensure optimal network card performance), and configuring processor and core affinity provide the most immediate benefit. In addition, purchasing a network card optimized for low latency applications provides further benefit.

Lower latency applications may benefit from a modified stock image that uses a low latency IP core and reduces the sample buffer size; please contact us directly for more information. In addition, interfacing with another FPGA, or a real-time synchronous system, allows for reductions in payload size, which can also provide substantial opportunities to reduce transmission latencies. In ultra-low latency applications, custom interface protocols using the SFP+ connectors can be implemented to further reduce the latency between the SDR and the application.

Of course, for the lowest latency applications, you can consider embedding application logic on the FPGA; please contact us to discuss your specific requirements.