Choosing the appropriate DSP for your 3G wireless handset design
3G TD-SCDMA and WCDMA cell phones present complex design and partitioning issues that challenge current baseband processors.
Brendon Slade, LSI Logic
After many years of anticipation and false starts, 3G wireless handset deployment is now underway. However, the baseband processor designs have yet to provide the functionality promised by 3G and the price points and energy efficiency that customers have come to expect from 2G and 2.5G products.
Even as these expectations are met, the need for multi-mode support looms large. The protocols for 3G are an order of magnitude greater in complexity than those for 2G, resulting in a tremendous software burden on the companies in this market and more pressure on the platform architects to make the right choices in partitioning between programmable and fixed-function digital signal processors (DSPs). Designers need to select DSPs carefully, considering not only baseband processing requirements but also productivity, future-proofing, power, and cost.
In this article we will examine the 3G baseband challenges, programmable DSP requirements, and process-balancing issues that designers must address when trying to find the right DSP for their 3G wireless handset design. We’ll begin with an analysis of the main baseband processing steps for time-division synchronous code-division multiple access (TD-SCDMA) and wideband CDMA (WCDMA).
TD-SCDMA
TD-SCDMA supports multiple subscribers using a combination of time-division and code-division multiplexing. Figure 1 illustrates the main features of this standard.
TD-SCDMA occupies a 1.6-MHz band using 5-msec sub-frames within each band. Each sub-frame is divided into seven time slots. The sub-frame is further split among users by way of 16 spreading codes. Although 8-PSK can be used to obtain the highest data rates (up to 2 Mbits/sec per user), quadrature phase-shift keying (QPSK) is most commonly used and provides a user bandwidth up to 384 kbits/sec duplex.
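The sub-frame figures above can be turned into a quick piece of arithmetic. The sketch below uses only numbers quoted in this article (the 1.28-Mc/sec chip rate appears in the WCDMA comparison later on); the function names are illustrative, and the slot/code count ignores pilot and guard periods as well as the uplink/downlink split.

```python
# Figures taken from the article text; names are illustrative only.
CHIP_RATE_HZ = 1_280_000   # 1.28 Mc/sec TD-SCDMA chip rate
SUBFRAME_US = 5_000        # 5-msec sub-frame, expressed in microseconds
TIME_SLOTS = 7             # time slots per sub-frame
SPREADING_CODES = 16       # spreading codes sharing the sub-frame


def chips_per_subframe(chip_rate_hz=CHIP_RATE_HZ, subframe_us=SUBFRAME_US):
    """Number of chips the receiver must process in one sub-frame."""
    return chip_rate_hz * subframe_us // 1_000_000


def slot_code_pairs(slots=TIME_SLOTS, codes=SPREADING_CODES):
    """Upper bound on (time slot, spreading code) pairs in one sub-frame."""
    return slots * codes


if __name__ == "__main__":
    print(chips_per_subframe(), "chips per sub-frame")
    print(slot_code_pairs(), "slot/code pairs per sub-frame")
```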
Proponents of TD-SCDMA claim high performance with much less complexity than WCDMA, which is one reason why the standard is approved for and being deployed in China. The standard uses a different technique from most CDMA standards when it comes to detection and recovery of data at the receiver. Instead of attempting to extract one user-data channel and treating other subscribers as noise, all users are detected together, or “joint detected.” Joint detection can be accomplished using one of a handful of algorithms, some patented and some not. Whichever algorithm is chosen, it will still represent the bulk of the processing after down-sampling in the receiver.
Figure 2 shows the main processing blocks in a TD-SCDMA receiver. The blocks that could be reasonably considered for a programmable solution are shown in green.
Symbols received from the automatic gain control/burst splitting blocks are fed into a channel estimator and block linear equalizer (BLE), which also receive input from a matched filter correlator that computes the convolution of channel impulse responses and spreading sequences. The BLE block is the compute-intensive joint detection algorithm and computes the linear estimate of the data symbols of all users. A matrix-based technique can be used to compute this estimate, and Cholesky decomposition can be used to factor the matrix involved into the product of lower and upper triangular matrices. The resulting calculation is thus greatly reduced and is broken down into these equations:
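One common way to write this down is the zero-forcing form of the block linear equalizer, given here as a sketch rather than as the exact formulation used in any particular handset. The symbols are assumptions introduced for illustration: A is the system matrix built from the convolved channel impulse responses and spreading sequences, e is the received vector, and d-hat is the joint estimate of all users' data symbols. An MMSE variant would add a noise term to the matrix being factored.

```latex
\hat{d} = \left(A^{H} A\right)^{-1} A^{H} e
\qquad\text{with}\qquad
A^{H} A = L\,L^{H}
\\[4pt]
L\,y = A^{H} e \quad \text{(forward substitution, } L \text{ lower triangular)}
\\[4pt]
L^{H} \hat{d} = y \quad \text{(back substitution, } L^{H} \text{ upper triangular)}
```

The matrix inverse is never formed explicitly; the two triangular solves replace it.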
The basic calculations being performed are multiply-accumulate (MAC) operations on complex numbers. At first glance it seems that a hardware approach would best implement these functions; however, navigating the matrices to perform the operations is nontrivial and would be hard to implement in hardware blocks. In addition, a programmable processor is well suited to the channel-characterization functions in the other parts of the joint detection function. Hence, dealing with this part of the system in a programmable DSP provides much lower risk with more flexibility.
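To make the index navigation concrete, here is a minimal pure-Python sketch of a Cholesky-based solve on complex data. It is not the handset algorithm itself: the matrix R stands in for whatever Hermitian, positive-definite system matrix the joint detector builds, the function names are illustrative, and floating point replaces the fixed-point arithmetic a DSP would use. The inner loops are the complex MACs the text refers to; the triangular loop bounds are what makes a fixed hardware datapath awkward.

```python
import math


def cholesky(R):
    """Return lower-triangular L such that R = L * L^H.

    R must be Hermitian and positive definite (list of lists of complex).
    """
    n = len(R)
    L = [[0j] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry is real and positive for a positive-definite R.
        s = complex(R[j][j]).real - sum(abs(L[j][k]) ** 2 for k in range(j))
        L[j][j] = complex(math.sqrt(s), 0.0)
        for i in range(j + 1, n):
            acc = complex(R[i][j])
            for k in range(j):
                acc -= L[i][k] * L[j][k].conjugate()  # complex MAC
            L[i][j] = acc / L[j][j]
    return L


def solve_cholesky(R, b):
    """Solve R d = b using forward then back substitution on L and L^H."""
    n = len(R)
    L = cholesky(R)
    y = [0j] * n
    for i in range(n):  # forward substitution: L y = b
        acc = complex(b[i])
        for k in range(i):
            acc -= L[i][k] * y[k]
        y[i] = acc / L[i][i]
    d = [0j] * n
    for i in reversed(range(n)):  # back substitution: L^H d = y
        acc = y[i]
        for k in range(i + 1, n):
            acc -= L[k][i].conjugate() * d[k]
        d[i] = acc / L[i][i].conjugate()
    return d


if __name__ == "__main__":
    R = [[4, -2j], [2j, 5]]
    b = [6, 7j]
    print(solve_cholesky(R, b))
```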
Programming the DSP
Figure 3 shows the C code representation of the complex math calculations and the implementations in assembler on a dual (ZSP500) and quad (ZSP540) MAC DSP. A complex MAC requires four multiplies, three additions, and one subtraction. The ZSP cores used to illustrate the coding have specific instructions for complex multiplications that use additional arithmetic logic units (ALUs) to increase efficiency, enabling a complex MAC every two cycles on the dual MAC core and in a single cycle on the quad MAC core.
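The operation count quoted above can be checked by writing the complex MAC out on separate real and imaginary parts. This is an illustrative sketch in Python, not the code from Figure 3, and the function name is hypothetical.

```python
def complex_mac(acc_re, acc_im, a_re, a_im, b_re, b_im):
    """Return acc + a * b, with every real operation written out."""
    # Four real multiplies.
    rr = a_re * b_re
    ii = a_im * b_im
    ri = a_re * b_im
    ir = a_im * b_re
    # One subtraction and one addition form the complex product.
    prod_re = rr - ii
    prod_im = ri + ir
    # Two more additions accumulate it: three additions in total.
    return acc_re + prod_re, acc_im + prod_im


if __name__ == "__main__":
    # (1 + 2j) + (3 + 4j) * (5 + 6j)
    print(complex_mac(1, 2, 3, 4, 5, 6))
```

A core with dedicated complex-multiply instructions folds these eight operations into one or two issue slots, which is where the cycle counts quoted above come from.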
Figure 1. TD-SCDMA coding scheme.
In addition to strength in MAC operations, there are a number of other DSP features that need to be considered for the programmable part of the baseband system. Most calculations require 32-bit precision, with results truncating to 16 bits at the end of the calculations; hence DSPs with adequate 32-bit ALU resources are needed. DSPs that are targeted for this kind of application have three 32- or 40-bit ALUs available each clock cycle. Functions in the joint detection processing require fast access to interim calculation results, so the bandwidth to data memory of the processor needs to match the processing resources available. For a quad MAC DSP this would typically mean dual 64-bit ports to RAM.
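The precision requirement can be illustrated with a fixed-point dot product. The sketch below assumes Q15 operands (16-bit signed fractions) and a 32-bit accumulator; those formats, the saturation step, and the function name are assumptions chosen for illustration rather than details given in this article. Python integers do not overflow, so the 32-bit limit is checked explicitly.

```python
Q15_MIN, Q15_MAX = -32768, 32767
ACC_MIN, ACC_MAX = -(1 << 31), (1 << 31) - 1


def q15_dot(xs, ys):
    """Dot product of Q15 vectors with a 32-bit accumulator.

    Products are Q30; the sum is reduced to Q15 only once, at the end.
    """
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y  # Q15 * Q15 -> Q30, accumulated at full width
        if not ACC_MIN <= acc <= ACC_MAX:
            raise OverflowError("accumulator exceeds 32 bits")
    result = acc >> 15  # drop the 15 extra fraction bits (truncation)
    return max(Q15_MIN, min(Q15_MAX, result))  # saturate to 16 bits


if __name__ == "__main__":
    # 0.5 * 0.5 + 0.5 * (-0.25) = 0.125, which is 4096 in Q15
    print(q15_dot([16384, 16384], [16384, -8192]))
```

Truncating after every product instead of once at the end would discard up to one least-significant bit per term, which is why the wide ALUs matter.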
Figure 2. TD-SCDMA processing blocks.
FFTs are used extensively in the channel-estimation function, so the programmable DSP must support them efficiently. In addition to the MAC, parallel add/subtract and bit-reversed addressing are important requirements that differentiate high-performance DSPs from older architectures and microprocessors with basic DSP capabilities. Joint detection incorporates numerous decisions, another reason for a programmable approach. Predictability of execution time and efficient branching during the execution of these decisions mean that conditional execution is a very important feature, especially where high-performance, deeper pipeline processors are being considered. Finally, efficient bit-stream processing calls for bit field insertion/extraction capabilities to prevent wasting of precious processor cycles on masking and shifting operations.
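Two of the features named above are easy to show in software, which also makes clear what the hardware saves. The sketch below spells out bit-reversed indexing and bit-field extraction/insertion as explicit shift-and-mask code; the function names are illustrative. On a DSP with the corresponding hardware support, each of these would typically be an addressing mode or a single instruction rather than a loop or a mask sequence.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index (FFT input/output ordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def extract_field(word, lsb, width):
    """Return the `width`-bit field of word starting at bit `lsb`."""
    return (word >> lsb) & ((1 << width) - 1)


def insert_field(word, lsb, width, value):
    """Return word with the field at `lsb` replaced by value."""
    mask = ((1 << width) - 1) << lsb
    return (word & ~mask) | ((value << lsb) & mask)


if __name__ == "__main__":
    print([bit_reverse(i, 3) for i in range(8)])
    print(hex(extract_field(0xABCD, 4, 8)))
    print(hex(insert_field(0xABCD, 4, 8, 0x12)))
```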
As we saw in Figure 2, joint detection is the major function in the TD-SCDMA receiver that makes sense to implement on a programmable DSP. Some other components in the receiver are obvious candidates for hardware-based implementations. These are blocks that handle data rates too high for programmable approaches (such as the RX filter) or whose processing is well defined and fixed. Blocks used in forward-error correction (FEC), such as Viterbi or Turbo decoders, are fairly complex but well defined. Implementing these functions in a programmable DSP for a 3G system would demand the majority of the available processing bandwidth in even the most advanced DSPs, which would not be efficient in energy consumption or silicon area.
Figure 3. Complex math calculations and implementations in assembler.
Functional elements such as FEC decoders work with programmable DSPs in a loosely coupled fashion, working on blocks of data, with the two elements typically communicating via shared memory. Tuning a baseband design for an optimal balance between a programmable DSP and hardware is desirable: moving work into hardware lowers energy consumption and leaves more headroom for the programmable DSP. To facilitate this, the programmable DSP must be able to split algorithms efficiently with hardware using a tightly coupled approach. Configurable processors can be attractive for this but may introduce unknown factors in timing closure and potentially affect software compatibility between generations of baseband chips.
By adopting an approach that incorporates a coprocessor instruction with a configurable instruction field, the main instruction set is unaffected. However, the coprocessor(s) can be used to implement customized instructions.
Figure 4 shows a section of the pipeline from a processor that uses this approach. Data from the register file is readily accessible to the coprocessor, enabling the processor data path to be used effectively in conjunction with the resources added in the coprocessor block. By providing hooks to allow the coprocessor to stall the processor when necessary, operations can be pipelined and power saved by avoiding unnecessary wait loops. This approach also enables a full registered interface between the processor and coprocessor(s), ensuring timing closure for these two blocks can be treated in isolation.
Figure 4. Partial processor pipeline.
The system designer can define the opcode fed to the coprocessor. Therefore, some bits of this field can be used to select among multiple coprocessor blocks.
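As a sketch of how such a field might be carved up, the example below assumes an 8-bit designer-defined field whose top two bits select one of up to four coprocessor blocks and whose low six bits name the operation. The field width, the split, and all names are hypothetical; the article does not specify them.

```python
# Hypothetical layout of the designer-defined coprocessor field.
SELECT_BITS = 2   # which coprocessor block (up to 4)
OP_BITS = 6       # operation within that block (up to 64)


def decode_copro_field(field):
    """Split the field into (coprocessor select, operation code)."""
    select = (field >> OP_BITS) & ((1 << SELECT_BITS) - 1)
    op = field & ((1 << OP_BITS) - 1)
    return select, op


def dispatch(field, operand, coprocessors):
    """Route an operand to the selected coprocessor model.

    `coprocessors` maps a select value to a callable taking (op, operand).
    """
    select, op = decode_copro_field(field)
    return coprocessors[select](op, operand)


if __name__ == "__main__":
    blocks = {
        0: lambda op, x: x + op,   # placeholder behaviour
        2: lambda op, x: x << op,  # placeholder behaviour
    }
    print(decode_copro_field(0b10_000011))
    print(dispatch(0b10_000011, 5, blocks))
```

Because the split lives entirely inside the coprocessor field, the main instruction set and its binary encoding are untouched.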
Some processor architectures can use coprocessor extensions with the C compiler. However, at the data rates required in a mobile handset baseband design, it is most likely that coding using tightly coupled coprocessors would need to be written in assembler. C does not have the data types and language constructs to implement these types of algorithms efficiently enough. However, if the compiler enables full integration of in-line assembler, algorithm development can be done from a starting point in C.
The joint detection approach based on Cholesky decomposition is just one option, as mentioned before. Other approaches based on FFTs may also be suitable for handset designs with programmable DSPs. For these algorithmic approaches a more loosely coupled FFT engine can be implemented and a dual MAC programmable DSP could be considered. Dual MAC capability is realistically the baseline requirement, due to the demanding nature of the other processing in the system that would remain best suited for the programmable DSP. Selection of a DSP family that has binary compatibility enables easier migration of a design into a purely programmable solution, with more flexibility for multimode systems once smaller/higher-performance process geometries become available. For example, a dual MAC core with an FFT engine in 0.13 μm could be replaced at the 90-nm process node with a quad MAC member of the same DSP family.
Table 1 shows the processing requirement for a TD-SCDMA handset (based on the Cholesky decomposition algorithm) partitioned as shown in Figure 2 (receiver) with the transmitter added. The table includes the AMR voice codec and is based on a 384-kbit/sec duplex data plus 1 channel voice configuration.
WCDMA
The processing requirements and logical partitioning of WCDMA frequency-division duplex (FDD) handsets are somewhat different from those of TD-SCDMA devices. In WCDMA FDD a wider (5-MHz) bandwidth is used, with subscribers sharing that band via spreading codes. This standard requires a higher chip rate than TD-SCDMA (3.84 Mc/sec versus 1.28 Mc/sec), and hence places more demands on the receiver processing task. The main processing task in the receiver is the rake receiver, a block diagram of which is shown in Figure 5.
Receiver rake processing involves combining multiple receiver paths to detect the data from a specific subscriber. The basic arithmetic involves intensive operation loops on complex numbers, as for TD-SCDMA.
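The finger-and-combine structure can be sketched in a few lines. The example below assumes maximal-ratio combining, a common choice that this article does not name explicitly, and it assumes the path delays and channel estimates are already known (in a real receiver they come from the path searcher and channel estimator). Function names and the data layout are illustrative.

```python
def despread(rx, code, delay):
    """One rake finger: correlate a delayed path with the spreading code."""
    sf = len(code)  # spreading factor
    return sum(rx[delay + n] * code[n] for n in range(sf)) / sf


def rake_combine(rx, code, fingers):
    """Combine finger outputs weighted by the conjugate channel estimate.

    fingers: list of (delay_in_chips, complex_channel_estimate) pairs.
    Returns the estimated data symbol.
    """
    num = 0j
    energy = 0.0
    for delay, h in fingers:
        num += h.conjugate() * despread(rx, code, delay)  # complex MAC
        energy += abs(h) ** 2
    return num / energy


if __name__ == "__main__":
    code = [1, -1, 1, 1]
    symbol = 1 - 1j
    fingers = [(0, 0.8 + 0.2j), (2, 0.3 - 0.4j)]
    # Build a two-path received signal for one symbol.
    rx = [0j] * 6
    for delay, h in fingers:
        for n, c in enumerate(code):
            rx[delay + n] += h * symbol * c
    print(rake_combine(rx, code, fingers))
```

Each finger is an independent correlation loop, which is why the structure maps readily to either replicated hardware or a MAC-rich programmable core.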
Figure 5. WCDMA rake receiver.
The processing blocks in the rake receiver are simpler in nature and lend themselves more readily to hardware implementation, but they are also achievable in a modern, high-performance, programmable DSP. Hence the implementation decision comes down to power consumption versus time-to-market and flexibility tradeoffs. In scenarios that require multimode solutions or in which some higher layers of the baseband protocol stack are also to run on the DSP, the use of a coprocessor interface offers a compromise between these two options.
Table 2 shows the processing requirements for a WCDMA FDD receiver with four fingers. The table makes it easy to identify partitioning tradeoffs, such as using hardware acceleration for the path searcher to make room for more receiver fingers, other features, and/or lower power.
Partitioning of algorithm processing is a complex process that will normally evolve across product generations as process nodes and the options available within a processor core family change. The right initial choices in partitioning and efficient adaptation to new options require effective modeling tools and methodologies. Many designers start with a MATLAB algorithmic design from which C models are developed, either automatically or by manual translation into more optimal implementations. From this point designers may then proceed with implementation, but this is a high-risk approach that does not accurately model data flow. SystemC-based architectural modeling tools from the likes of CoWare, Synopsys, and Mentor Graphics enable rapid implementation of the bus structures incorporating standard DSP processor cores and peripherals. Custom acceleration can be modeled at the transaction level, and system performance can be observed in enough detail to ensure that the design is viable and efficient.
Models of tightly coupled coprocessors can be effectively achieved if the programmable processor is supported by a cycle-accurate simulator with an API for interfacing coprocessors. The coprocessor model is called at each simulated clock cycle, taking data inputs and generating return data, stall signals, and/or interrupts back to the processor simulator. This approach enables the rapid development of models for system design that can continue to be used throughout the software development phase (pre-silicon).
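The per-cycle calling convention can be sketched as follows. This is a toy model under stated assumptions, not any vendor's simulator API: the class name, the `clock` method, the fixed latency, and the squaring operation are all hypothetical placeholders. The point is the shape of the interface, with the model called once per simulated cycle and returning a stall flag plus any result data.

```python
class CoprocessorModel:
    """Hypothetical coprocessor whose result is ready after `latency` cycles."""

    def __init__(self, latency=3):
        self.latency = latency
        self.busy_cycles = 0
        self.result = None

    def clock(self, issue=False, operand=0):
        """Advance one simulated cycle; return (stall, data)."""
        if issue and self.busy_cycles == 0:
            self.result = operand * operand  # placeholder operation
            self.busy_cycles = self.latency
        if self.busy_cycles > 0:
            self.busy_cycles -= 1
            if self.busy_cycles > 0:
                return True, None       # still working: stall the processor
            return False, self.result   # result available this cycle
        return False, None              # idle


def run_instruction(copro, operand):
    """Stand-in for the processor simulator: issue, then wait out stalls."""
    stall, data = copro.clock(issue=True, operand=operand)
    cycles = 1
    while stall:
        stall, data = copro.clock()
        cycles += 1
    return data, cycles


if __name__ == "__main__":
    print(run_instruction(CoprocessorModel(latency=3), 7))
```

Counting the stalled cycles in such a model gives an early view of how much of the coprocessor latency the processor pipeline can hide.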
Power efficiency can be evaluated by observing bus loading and contention, processor stalls, and memory accesses during SystemC modeling. However, overall power management needs to be considered as a separate part of the design process, since opportunities for clock gating and power islands (if possible) need to be assessed.
Any programmable processor to be used in a modern handset design must be capable of supporting clock gating and have hardware- and software-controlled power management logic. In addition to checking for clock gating support, the system designer needs to consider the power consumption of the memory system associated with the programmable processor, which may consume significantly more power than the processor itself. Features such as pre-fetch buffers and/or first-level caches can significantly reduce this power consumption for DSPs. This is because much DSP processing is loop-based, and those loops can reside entirely in the buffer or cache, avoiding repeated instruction fetches from main instruction memory.
Programmable DSPs are now available that can enable 3G handsets to meet consumer demands, but the system designer still has to make smart choices about the hardware/programmable DSP tradeoffs. An excessive reliance on hardware DSP can affect time-to-market and flexibility during critical type approval and performance tuning. Conversely, an approach that relies too much on a programmable DSP alone may result in a design that has no headroom or is too power-hungry. By selecting a programmable DSP equipped to make the right tradeoffs and by using the appropriate modeling tools and methodologies, today’s system designers can meet the challenges of 3G.