# HIGH-PERFORMANCE AND ENERGY-EFFICIENT CSKA OPERATING UNDER A WIDE RANGE OF SUPPLY VOLTAGES

T. Sowmiya<sup>1</sup>, S. Jeya Anusuya<sup>2</sup>

<sup>1</sup>PG student, Dept. of ECE, T.J.S Engineering College, Chennai, Tamil Nadu, India. sowmiyasurya@gmail.com

<sup>2</sup>Associate Professor and HOD, Dept. of ECE, T.J.S Engineering College, Chennai, Tamil Nadu, India. jeyaanusuya@yahoo.com

Abstract - In this paper, we present a carry skip adder (CSKA) structure that has a higher speed yet lower energy consumption compared with the conventional one. The speed enhancement is achieved by applying concatenation and incrementation schemes to improve the efficiency of the conventional CSKA (Conv-CSKA) structure. In addition, instead of utilizing multiplexer logic, the proposed structure makes use of AND-OR-Invert (AOI) and OR-AND-Invert (OAI) compound gates for the skip logic. The structure may be realized with both fixed stage size and variable stage size styles, wherein the latter further improves the speed and energy parameters of the adder. Finally, a hybrid variable latency extension of the proposed structure, which lowers the power consumption without considerably impacting the speed, is presented. This extension utilizes a modified parallel structure for increasing the slack time, hence enabling further voltage reduction. The power-delay product of the proposed structure was the lowest among the structures considered in this paper, while its energy-delay product was almost the same as that of the Kogge-Stone parallel prefix adder with considerably smaller area and power consumption. Simulations on the proposed hybrid variable latency CSKA reveal a reduction in the power consumption compared with the latest works in this field while maintaining a reasonably high speed.

Keywords— Carry skip adder (CSKA), energy efficient, high performance, hybrid variable latency adders, voltage scaling.

#### I. INTRODUCTION

Adders are a key building block in arithmetic and logic units (ALUs), and hence increasing their speed and reducing their power/energy consumption strongly affect the speed and power consumption of processors. Many works on optimizing the speed and power of these units have been reported. It is highly desirable to achieve higher speeds at low power/energy consumption, which is a challenge for the designers of general-purpose processors.

One of the effective techniques to lower the power consumption of digital circuits is to reduce the supply voltage due to quadratic dependence of the switching energy on the voltage. Moreover, the subthreshold current, which is the main leakage component in OFF devices, has an exponential dependence on the supply voltage level through the drain-induced barrier lowering effect. Depending on the amount of the supply voltage reduction, the operation of ON devices may reside in the super threshold, near-threshold, or subthreshold regions. Working in the super threshold region provides us with lower delay and higher switching and leakage powers compared with the near/subthreshold regions. In the subthreshold region, the logic gate delay and leakage power exhibit exponential dependences on the supply and threshold voltages. Moreover, these voltages are (potentially) subject to process and environmental variations in the nanoscale technologies. The variations increase uncertainties in the aforesaid performance parameters. In addition, the small subthreshold current causes a large delay for the circuits operating in the subthreshold region.
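The quadratic dependence mentioned above can be illustrated with the usual first-order model of switching energy, E = C × V<sub>DD</sub><sup>2</sup>. The following is a minimal sketch under that assumption; the function names and the voltage values are illustrative placeholders and are not taken from this paper.

```python
def switching_energy(c_load_farads: float, vdd_volts: float) -> float:
    """First-order energy drawn from the supply to charge a load: E = C * Vdd^2."""
    return c_load_farads * vdd_volts ** 2


def energy_ratio(vdd_scaled: float, vdd_nominal: float) -> float:
    """Switching energy at vdd_scaled relative to vdd_nominal (the capacitance cancels)."""
    return (vdd_scaled / vdd_nominal) ** 2


if __name__ == "__main__":
    # Illustrative supply voltages only (assumed values, not measurements).
    for vdd in (1.0, 0.8, 0.6, 0.4):
        ratio = energy_ratio(vdd, 1.0)
        print(f"Vdd = {vdd:.1f} V -> {ratio:.2f} x nominal switching energy")
```

Under this model, halving the supply voltage reduces the switching energy to one quarter, which is why voltage scaling is such an effective knob. The model ignores leakage and the delay increase at low voltages discussed in the text.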

Recently, the near-threshold region has been considered as a region that provides a more desirable tradeoff point between delay and power dissipation compared with that of the subthreshold one, because it results in lower delay compared with the subthreshold region and significantly lowers switching and leakage powers compared with the super threshold region. In addition, near-threshold operation, which uses supply voltage levels near the threshold voltage of transistors, suffers considerably less from the process and environmental variations compared with the subthreshold region.

The dependence of the power (and performance) on the supply voltage has been the motivation for the design of circuits with the feature of dynamic voltage and frequency scaling. In these circuits, to reduce the energy consumption, the system may change the voltage (and frequency) of the circuit based on the workload requirement. For these systems, the circuit should be able to operate under a wide range of supply voltage levels. Of course, achieving higher speeds at lower supply voltages for the computational blocks, with the adder as one of the main components, could be crucial in the design of high-speed, yet energy-efficient, processors.

In addition to the knob of the supply voltage, one may choose between different adder structures/families for optimizing power and speed. There are many adder families with different delays, power consumptions, and area usages. Examples include the ripple carry adder (RCA), carry increment adder (CIA), carry skip adder (CSKA), carry select adder (CSLA), and parallel prefix adders (PPAs). Descriptions of each of these adder architectures along with their characteristics may be found in the literature. The RCA has the simplest structure with the smallest area and power consumption but with the worst critical path delay. In the CSLA, the speed, power consumption, and area usage are considerably larger than those of the RCA. The PPAs, which are also called carry look-ahead adders, exploit direct parallel prefix structures to generate the carry as fast as possible. Different types of parallel prefix algorithms lead to different PPA structures with different performances. As an example, the Kogge–Stone adder (KSA) is one of the fastest structures but results in large power consumption and area usage. It should be noted that the structural complexity of PPAs is higher than that of other adder schemes.
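To make the parallel prefix idea concrete, the following is a minimal behavioral sketch of a Kogge–Stone carry computation. It is a bit-level software model written for illustration only; the function name `kogge_stone_add` and its interface are assumptions and do not describe the hardware used in this paper.

```python
def kogge_stone_add(a: int, b: int, n: int, cin: int = 0):
    """Add two n-bit integers using a Kogge-Stone prefix network (behavioral model).

    Returns (sum modulo 2**n, carry_out).
    """
    # Bit-level propagate and generate signals, LSB first.
    p0 = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    g0 = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]

    g, p = g0[:], p0[:]
    # Fold the carry-in into the generate signal of bit 0.
    g[0] = g0[0] | (p0[0] & cin)

    # log2(n) prefix levels: each level combines groups that are `dist` bits apart.
    dist = 1
    while dist < n:
        new_g, new_p = g[:], p[:]
        for i in range(dist, n):
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2

    # After the prefix levels, g[i] is the carry out of bit i.
    carries = [cin] + g[:-1]
    total = 0
    for i in range(n):
        total |= (p0[i] ^ carries[i]) << i
    return total, g[n - 1]


if __name__ == "__main__":
    print(kogge_stone_add(0b1011, 0b0110, 4))
```

The number of prefix levels grows with log2(n), which is the source of the speed advantage, while the number of prefix cells per level is what drives the large area and power mentioned above.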

The CSKA is an efficient adder in terms of power consumption and area usage. The critical path delay of the CSKA is much smaller than that of the RCA, whereas its area and power consumption are similar to those of the RCA. In addition, the power-delay product (PDP) of the CSKA is smaller than those of the CSLA and PPA structures. Moreover, due to the small number of transistors, the CSKA benefits from relatively short wiring lengths as well as a regular and simple layout. The comparatively lower speed of this adder structure, however, limits its use for high-speed applications.

In this paper, given the attractive features of the CSKA structure, we have focused on reducing its delay by modifying its implementation based on the static CMOS logic. The concentration on the static CMOS originates from the desire to have a reliably operating circuit under a wide range of supply voltages in highly scaled technologies. The proposed modification increases the speed considerably while maintaining the low area and power consumption features of the CSKA. In addition, an adjustment of the structure, based on the variable latency technique, which in turn lowers the power consumption without considerably impacting the CSKA speed, is also presented. To the best of our knowledge, no work concentrating on the design of CSKAs operating from the superthreshold region down to the near-threshold region, or on the design of (hybrid) variable latency CSKA structures, has been reported in the literature. Hence, the contributions of this paper can be summarized as follows.

- 1) Proposing a modified CSKA structure by combining the concatenation and the incrementation schemes to the conventional CSKA (Conv-CSKA) structure for enhancing the speed and energy efficiency of the adder. The modification provides us with the ability to use simpler carry skip logics based on the AOI/OAI compound gates instead of the multiplexer.
- 2) Providing a design strategy for constructing an efficient CSKA structure based on analytical expressions presented for the critical path delay.
- 3) Investigating the impact of voltage scaling on the efficiency of the proposed CSKA structure (from the nominal supply voltage to the near-threshold voltage).
- 4) Proposing a hybrid variable latency CSKA structure based on the extension of the suggested CSKA, by replacing some of the middle stages in its structure with a PPA, which is modified in this paper.

#### II. PREVIOUS WORK

##### A. Modifying CSKAs for Improving Speed

The conventional structure of the CSKA consists of stages containing a chain of full adders (FAs) (RCA block) and a 2:1 multiplexer (carry skip logic). The RCA blocks are connected to each other through 2:1 multiplexers, which can be placed into one or more level structures [19]. The CSKA configuration (i.e., the number of FAs per stage) has a great impact on the speed of this type of adder [23]. Many methods have been suggested for finding the optimum number of FAs [18]-[26]. The techniques presented in [19]-[24] make use of variable stage sizes (VSSs) to minimize the delay of adders based on a single-level carry skip logic. In [25], some methods to increase the speed of multilevel CSKAs are proposed. These techniques, however, considerably increase the area and power and lead to a less regular layout. The design of a static CMOS CSKA where the stages have a variable size was suggested in [18]. In addition, to lower the propagation delay of the adder, carry look-ahead logic was utilized in each stage. Again, it had a complex layout as well as large power consumption and area usage. Moreover, the design approach, which was presented only for the 32-bit adder, was not general enough to be applied to structures with different bit lengths. Alioto and Palumbo [19] propose a simple strategy for the design of a single-level CSKA. The method is based on the VSS technique, where the near-optimal numbers of FAs are determined based on the skip time (delay of the multiplexer) and the ripple time (the time required by a carry to ripple through an FA). The goal of this method is to decrease the critical path delay by considering a noninteger ratio of the skip time to the ripple time, in contrast to most of the previous works, which considered an integer ratio [17], [20]. In all the works reviewed so far, the focus was on the speed, while the power consumption and area usage of the CSKAs were not considered. Even for the speed, the delay of the skip logics, which are based on multiplexers and form a large part of the adder critical path delay [19], has not been reduced.

##### B. Improving Efficiency of Adders at Low Supply Voltages

To improve the performance of the adder structures at low supply voltage levels, some methods have been proposed in [27]–[36]. In [27]–[29], an adaptive clock stretching operation has been suggested. The method is based on the observation that the critical paths in adder units are rarely activated. Therefore, the slack time between the critical paths and the off-critical paths may be used to reduce the supply voltage. Notice that the voltage reduction must not increase the delays of the noncritical timing paths beyond the clock period, which allows the original clock frequency to be kept at a reduced supply voltage level. When the critical timing paths in the adder are activated, the structure uses two clock cycles to complete the operation. This way, the power consumption is reduced considerably at the cost of a rather small throughput degradation. In [27], the efficiency of this method for reducing the power consumption of the RCA structure has been demonstrated.
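The throughput cost of adaptive clock stretching can be estimated with a simple expectation: an addition takes one cycle unless a critical path is activated, in which case it takes two. The sketch below assumes a single activation probability `p_critical`, which is a hypothetical parameter introduced here for illustration; the paper does not report such a value in this section.

```python
def average_cycles(p_critical: float) -> float:
    """Expected clock cycles per addition.

    One cycle for non-critical inputs, two cycles when a critical path is activated.
    """
    if not 0.0 <= p_critical <= 1.0:
        raise ValueError("p_critical must be a probability in [0, 1]")
    return 1.0 * (1.0 - p_critical) + 2.0 * p_critical


def relative_throughput(p_critical: float) -> float:
    """Throughput relative to an adder that always finishes in one cycle."""
    return 1.0 / average_cycles(p_critical)


if __name__ == "__main__":
    # Assumed activation probabilities, for illustration only.
    for p in (0.0, 0.01, 0.05, 0.10):
        print(f"p = {p:.2f}: {average_cycles(p):.2f} cycles, "
              f"throughput x{relative_throughput(p):.3f}")
```

Because the critical paths are rarely activated, the average stays close to one cycle, which matches the claim that the throughput degradation is small.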

The CSLA structure in [28] was enhanced to use the adaptive clock stretching operation, where the enhanced structure was called cascade CSLA (C<sup>2</sup>SLA). Compared with the common CSLA structure, C<sup>2</sup>SLA uses more RCA blocks, and of different sizes. Since the slack time between the critical timing paths and the longest off-critical path was small, the supply voltage scaling, and hence the power reduction, were limited. Finally, using a hybrid structure to improve the effectiveness of the adaptive clock stretching operation has been investigated in [31] and [33]. In the proposed hybrid structure, the KSA has been used in the middle part of the C<sup>2</sup>SLA, where this combination increases the positive slack time. However, the C<sup>2</sup>SLA and its hybrid version are not good candidates for low-power ALUs. This statement originates from the fact that, due to the logic duplication in this type of adder, the power consumption and also the PDP are still high even at low supply voltages.

#### III. CONVENTIONAL CARRY SKIP ADDER

The structure of an N-bit Conv-CSKA, which is based on blocks of the RCA (RCA blocks), is shown in Fig. 1.



Fig. 1. Conventional structure of the CSKA.

In addition to the chain of FAs in each stage, there is a carry skip logic. For an RCA that contains N cascaded FAs, the worst propagation delay of the summation of two N-bit numbers, A and B, belongs to the case where all the FAs are in the propagation mode. It means that the worst case delay belongs to the case where

$$P_i = A_i \oplus B_i = 1 \quad \text{for } i = 1, \ldots, N$$

where $P_i$ is the propagation signal related to $A_i$ and $B_i$. This shows that the delay of the RCA is linearly related to N [1]. In the case where a group of cascaded FAs is in the propagate mode, the carry output of the chain is equal to the carry input. In the CSKA, the carry skip logic detects this situation and makes the carry ready for the next stage without waiting for the operation of the FA chain to be completed. The skip operation is performed using the gates and the multiplexer shown in the figure. Based on this explanation, the N FAs of the CSKA are grouped in Q stages. Each stage contains an RCA block with $M_j$ FAs ($j = 1, \ldots, Q$) and a skip logic.
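The skip mechanism described above can be sketched as a bit-level behavioral model. This is a functional illustration only, with hypothetical function names; it models what the multiplexer selects, not the timing benefit, which comes from the hardware not having to wait for the ripple.

```python
def full_adder(a: int, b: int, cin: int):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | ((a ^ b) & cin)
    return s, cout


def conv_cska_add(a: int, b: int, stage_sizes, cin: int = 0):
    """Behavioral model of a conventional carry skip adder.

    stage_sizes lists the number of FAs in each stage, least significant stage first.
    Returns (sum modulo 2**N, carry_out) where N = sum(stage_sizes).
    """
    total = 0
    bit = 0
    carry = cin
    for m in stage_sizes:
        stage_cin = carry
        ripple = stage_cin
        p_all = 1  # AND of the propagate signals of this stage
        for i in range(bit, bit + m):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            p_all &= ai ^ bi
            s, ripple = full_adder(ai, bi, ripple)
            total |= s << i
        # 2:1 multiplexer: when every bit propagates, forward the stage carry-in;
        # otherwise use the carry output of the RCA block.
        carry = stage_cin if p_all else ripple
        bit += m
    return total, carry


if __name__ == "__main__":
    print(conv_cska_add(0b10110101, 0b01001011, [2, 3, 3]))
```

Both multiplexer inputs carry the same logical value whenever the select signal is 1, so the skip path changes only when the carry becomes available, not its value.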



Fig. 2. RTL diagram of the conventional CSKA structure.

In each stage, the inputs of the multiplexer (skip logic) are the carry input of the stage and the carry output of its RCA block (FA chain). In addition, the product of the propagation signals (P) of the stage is used as the selector signal of the multiplexer. The CSKA may be implemented using a fixed stage size (FSS) or a variable stage size (VSS), where the highest speed may be obtained for the VSS structure [19], [22]. Here, the stage size is the same as the RCA block size, as shown in Fig. 2. In Sections III-A and III-B, these two different implementations of the CSKA adder are described in more detail.

##### A. Fixed Stage Size CSKA

By assuming that each stage of the CSKA contains M FAs, there are Q = N/M stages, where for the sake of simplicity we assume Q is an integer. The input signals of the jth multiplexer are the carry output of the FA chain in the jth stage, denoted by $C_j^0$, and the carry output of the previous stage (carry input of the jth stage), denoted by $C_j^1$ (Fig. 1). The critical path of the CSKA contains three parts: 1) the path of the FA chain of the first stage, whose delay is equal to $M \times T_{CARRY}$; 2) the path of the intermediate carry skip multiplexers, whose delay is equal to $(Q-1) \times T_{MUX}$; and 3) the path of the FA chain in the last stage, whose delay is equal to $(M-1) \times T_{CARRY} + T_{SUM}$. Note that $T_{CARRY}$, $T_{SUM}$, and $T_{MUX}$ are the propagation delays of the carry output of an FA, the sum output of an FA, and the output of a 2:1 multiplexer, respectively. Hence, the critical path delay of an FSS CSKA is formulated by

$$T_D = [M \times T_{\text{CARRY}}] + \left[ \left( \frac{N}{M} - 1 \right) \times T_{\text{MUX}} \right] + \left[ (M - 1) \times T_{\text{CARRY}} + T_{\text{SUM}} \right]. \tag{1}$$

Based on (1), the optimum value of M ($M_{opt}$) that leads to the optimum propagation delay may be calculated as $(0.5N\alpha)^{1/2}$, where $\alpha$ is equal to $T_{MUX}/T_{CARRY}$. Therefore, the optimum propagation delay ($T_{D, opt}$) is obtained from

$$\begin{aligned} T_{D,\text{opt}} &= 2\sqrt{2NT_{\text{CARRY}}T_{\text{MUX}}} + (T_{\text{SUM}} - T_{\text{CARRY}} - T_{\text{MUX}}) \\ &= T_{\text{SUM}} + (2\sqrt{2N\alpha} - 1 - \alpha) \times T_{\text{CARRY}}. \end{aligned} \tag{2}$$

Thus, the optimum delay of the FSS CSKA is almost proportional to the square root of the product of N and  $\alpha$  [19].
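Expressions (1) and (2) can be checked numerically. Setting the derivative of (1) with respect to M to zero gives $2T_{CARRY} - N T_{MUX}/M^2 = 0$, which yields the stated $M_{opt}$. The following sketch evaluates both expressions; the function names are hypothetical and the unit delays used in the example are normalized placeholders, not values measured in this paper.

```python
import math


def fss_delay(n: int, m: int, t_carry: float, t_sum: float, t_mux: float) -> float:
    """Critical path delay of an FSS CSKA according to Eq. (1). Assumes m divides n."""
    return m * t_carry + (n / m - 1) * t_mux + (m - 1) * t_carry + t_sum


def m_opt(n: int, t_carry: float, t_mux: float) -> float:
    """Optimum (real-valued) stage size: sqrt(0.5 * N * alpha), alpha = T_MUX / T_CARRY."""
    alpha = t_mux / t_carry
    return math.sqrt(0.5 * n * alpha)


def fss_delay_opt(n: int, t_carry: float, t_sum: float, t_mux: float) -> float:
    """Optimum critical path delay according to Eq. (2)."""
    alpha = t_mux / t_carry
    return t_sum + (2 * math.sqrt(2 * n * alpha) - 1 - alpha) * t_carry


if __name__ == "__main__":
    # Normalized placeholder delays (assumption): every delay equals one unit.
    n, t_carry, t_sum, t_mux = 32, 1.0, 1.0, 1.0
    print("M_opt =", m_opt(n, t_carry, t_mux))
    print("T_D,opt =", fss_delay_opt(n, t_carry, t_sum, t_mux))
    for m in (1, 2, 4, 8, 16, 32):
        print(f"M = {m:2d}: T_D = {fss_delay(n, m, t_carry, t_sum, t_mux)}")
```

With N = 32 and all delays set to one unit, $M_{opt}$ evaluates to 4 and both (1) at M = 4 and (2) give 15 delay units, while M = 2 and M = 8 each give 19. In practice $M_{opt}$ is generally not an integer divisor of N, so the nearest feasible stage size would be chosen.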

#### IV. PROPOSED CSKA STRUCTURE

Based on the discussion presented in Section III, it is concluded that by reducing the delay of the skip logic, one may lower the propagation delay of the CSKA significantly. Hence, in this paper, we present a modified CSKA structure that reduces this delay.

##### A. General Description of the Proposed Structure

The structure is based on combining the concatenation and the incrementation schemes [13] with the Conv-CSKA structure, and hence is denoted by CI-CSKA. It provides us with the ability to use simpler carry skip logics, in which the 2:1 multiplexers are replaced by AOI/OAI compound gates (Fig. 3). The gates, which consist of fewer transistors, have lower delay, area, and power consumption compared with those of the 2:1 multiplexer [37]. Note that, in this structure, as the carry propagates through the skip logics, it becomes complemented. Therefore, at the output of the skip logic of even stages, the complement of the carry is generated. The structure has a considerably lower propagation delay with a slightly smaller area compared with those of the conventional one. Note that while the power consumption of the AOI (or OAI) gate is smaller than that of the multiplexer, the power consumption of the proposed CI-CSKA is a little more than that of the conventional one. This is due to the increase in the number of gates, which imposes a higher wiring capacitance (in the noncritical paths).
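The alternating carry polarity can be illustrated with truth-level models of the two compound gates. The sketch below shows one possible polarity assignment: a stage that receives the true carry uses an AOI gate and outputs the complemented carry, and the following stage receives that complemented carry and uses an OAI gate to restore the true polarity. It assumes that the OAI stage has the complemented propagate and block-carry signals available; that assumption, and all function names, are illustrative and not taken from the paper.

```python
def aoi(a: int, b: int, c: int) -> int:
    """AND-OR-Invert gate: NOT((a AND b) OR c)."""
    return 1 - ((a & b) | c)


def oai(a: int, b: int, c: int) -> int:
    """OR-AND-Invert gate: NOT((a OR b) AND c)."""
    return 1 - ((a | b) & c)


def stage_carry(p: int, cin: int, g: int) -> int:
    """Reference (true polarity) stage carry: block carry OR (stage propagate AND carry-in)."""
    return g | (p & cin)


def skip_with_true_input(p: int, cin: int, g: int) -> int:
    """AOI skip logic: true inputs, complemented carry output."""
    return aoi(p, cin, g)


def skip_with_complemented_input(p_n: int, cin_n: int, g_n: int) -> int:
    """OAI skip logic: complemented inputs (assumed available), true carry output."""
    return oai(p_n, cin_n, g_n)


if __name__ == "__main__":
    for p in (0, 1):
        for cin in (0, 1):
            for g in (0, 1):
                c_true = stage_carry(p, cin, g)
                c_bar = skip_with_true_input(p, cin, g)
                c_back = skip_with_complemented_input(1 - p, 1 - cin, 1 - g)
                print(p, cin, g, "->", c_true, c_bar, c_back)
```

Here `g` stands for the carry output of the RCA block of the stage and `p` for the product of its propagate signals. Chaining the two gate types avoids inserting an explicit inverter on the carry path after every stage.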



Fig. 3. Proposed CI-CSKA structure.

Stages 2 to Q each consist of two blocks: an RCA block and an incrementation block. The incrementation block uses the intermediate results generated by the RCA block and the carry output of the previous stage to calculate the final summation of the stage. The internal structure of the incrementation block, which contains a chain of half-adders (HAs), is shown in Fig. 4. In addition, note that, to reduce the delay considerably, the carry output of the incrementation block is not used for computing the carry output of the stage.
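The incrementation block described above can be modeled as a chain of half-adders that adds a single carry bit to the intermediate sum bits of the stage. This is a minimal behavioral sketch with hypothetical names; the bit lists are ordered least significant bit first.

```python
def half_adder(a: int, b: int):
    """One-bit half adder: returns (sum, carry_out)."""
    return a ^ b, a & b


def incrementation_block(intermediate_bits, carry_in: int):
    """Add carry_in to the intermediate sum bits of a stage using a half-adder chain.

    intermediate_bits: list of 0/1 values, least significant bit first.
    Returns (final_sum_bits, chain_carry_out). In the structure described in the
    text, the chain carry output is not used to form the stage carry output.
    """
    final_bits = []
    carry = carry_in
    for z in intermediate_bits:
        s, carry = half_adder(z, carry)
        final_bits.append(s)
    return final_bits, carry


if __name__ == "__main__":
    # Intermediate result 0b011 (LSB first: 1, 1, 0) plus an incoming carry of 1.
    print(incrementation_block([1, 1, 0], 1))
```

A half-adder suffices in each position because only one extra bit, the carry from the previous stage, is added to the already computed intermediate result.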



Fig. 4. Internal structure of the $j^{th}$ incrementation block.

The reason for using both AOI and OAI compound gates as the skip logics is the inverting functions of these gates in standard cell libraries. Another point to mention is that using the proposed skipping structure directly in the Conv-CSKA structure would increase the delay of the critical path considerably. This originates from the fact that, in the Conv-CSKA, the skip logic (AOI or OAI compound gates) is not able to bypass the zero-carry input until the zero-carry input propagates from the corresponding RCA block. In the proposed structure, the RCA blocks of stages 2 to Q instead have a carry input of zero (Section IV-B). This way, since the RCA block of a stage does not need to wait for the carry output of the previous stage, the output carries of the blocks are calculated in parallel.
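The overall behavior of the concatenation and incrementation scheme can be sketched as a functional model. The sketch assumes, as stated in Section IV-B, that only the first RCA block receives the real carry input while all other RCA blocks use a carry input of zero. It keeps the carry in true polarity throughout for readability, whereas the hardware alternates polarity through the AOI/OAI gates. The function name and interface are hypothetical.

```python
def ci_cska_add(a: int, b: int, stage_sizes, cin: int = 0):
    """Behavioral model of the concatenation/incrementation CSKA scheme.

    stage_sizes lists the number of bits per stage, least significant stage first.
    Returns (sum modulo 2**N, carry_out) where N = sum(stage_sizes).
    """
    total = 0
    bit = 0
    carry = cin  # carry entering the current stage
    for j, m in enumerate(stage_sizes):
        # RCA block: the first stage uses the real carry-in, later stages use zero,
        # so all RCA blocks can be evaluated in parallel in hardware.
        c = carry if j == 0 else 0
        z = []
        p_all = 1
        for i in range(bit, bit + m):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            p_all &= ai ^ bi
            z.append(ai ^ bi ^ c)
            c = (ai & bi) | ((ai ^ bi) & c)
        rca_cout = c

        if j == 0:
            sums = z
            stage_cout = rca_cout
        else:
            # Incrementation block: add the incoming stage carry with a HA chain.
            sums = []
            k = carry
            for zi in z:
                sums.append(zi ^ k)
                k = zi & k
            # Skip logic function: the HA chain carry output (k) is not used.
            stage_cout = rca_cout | (p_all & carry)

        for offset, s in enumerate(sums):
            total |= s << (bit + offset)
        carry = stage_cout
        bit += m
    return total, carry


if __name__ == "__main__":
    print(ci_cska_add(0b101101, 0b011011, [2, 2, 2]))
```

The stage carry is correct without the half-adder chain output because, with a zero carry input, the RCA block carry and the all-propagate condition can never be true at the same time.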

##### B. Area and Delay of the Proposed Structure

As mentioned before, the use of the static AOI and OAI gates (six transistors) instead of the static 2:1 multiplexer (12 transistors) decreases the area usage and delay of the skip logic [37], [38]. In addition, except for the first RCA block, the carry input for all other blocks is zero, and hence, for these blocks, the first adder cell in the RCA chain is an HA. This means that (Q-1) FAs in the conventional structure are replaced with the same number of HAs in the suggested structure, decreasing the area usage (Fig. 2). The incrementation blocks, however, may be implemented with about the same logic gates as those used for generating the select signal of the multiplexer in the conventional structure. Therefore, the area usage of the proposed CI-CSKA structure is decreased compared with that of the conventional one.

#### V. RESULTS AND DISCUSSION

In this section, we assess the efficacy of the proposed structure by comparing its delay, LUT count, and IO count with those of some other adders; the results are given in Table I. The adders considered here, with sizes from 4 to 32 bits (Table I), were synthesized and simulated using Xilinx ISE 9.2i and ModelSim, as shown in Fig. 5 and Fig. 6.

TABLE I COMPARISON OF VARIOUS ADDERS

| Adder   | LUTs | IOs | Delay (ns) |
|---------|------|-----|------------|
| RCA-4   | 8    | 14  | 11.786     |
| RCA-8   | 16   | 26  | 13.052     |
| RCA-32  | 64   | 129 | 40.879     |
| KSA-4   | 7    | 14  | 7.82       |
| KSA-8   | 28   | 26  | 7.995      |
| KSA-16  | 66   | 50  | 8.117      |
| KSA-32  | 141  | 98  | 8.277      |
| CSKA-4  | 2    | 14  | 4.632      |
| CSKA-16 | 29   | 50  | 4.722      |



Fig. 5. Simulation results



Fig. 6. Comparison of different adders.

#### VI. CONCLUSION

In this paper, a static CMOS CSKA structure called CI-CSKA was proposed, which exhibits a higher speed and lower energy consumption compared with those of the conventional one. The speed enhancement was achieved by modifying the structure through the concatenation and incrementation techniques. In addition, AOI and OAI compound gates were exploited for the carry skip logics. The efficiency of the proposed structure for both FSS and VSS was studied by comparing its power and delay with those of the Conv-CSKA, RCA, CIA, SQRT-CSLA, and KSA structures. The results revealed a considerably lower PDP for the VSS implementation of the CI-CSKA structure over a wide range of voltages from the superthreshold to the near-threshold region. The results also suggested the CI-CSKA structure as a very good adder for applications where both the speed and energy consumption are critical. In addition, a hybrid variable latency extension of the structure was proposed. It exploited a modified parallel adder structure at the middle stage for increasing the slack time, which provided the opportunity for lowering the energy consumption by reducing the supply voltage. The efficacy of this structure was compared with those of the variable latency RCA, C<sup>2</sup>SLA, and hybrid C<sup>2</sup>SLA structures.

#### REFERENCES

- [1] I. Koren, Computer Arithmetic Algorithms, 2nd ed. Natick, MA, USA: A K Peters, Ltd., 2002.
- [2] R. Zlatanovici, S. Kao, and B. Nikolic, "Energy–delay optimization of 64-bit carry-lookahead adders with a 240 ps 90 nm CMOS design example," *IEEE J. Solid-State Circuits*, vol. 44, no. 2, pp. 569–583, Feb. 2009.
- [3] S. K. Mathew, M. A. Anders, B. Bloechel, T. Nguyen, R. K. Krishnamurthy, and S. Borkar, "A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 40, no. 1, pp. 44–51, Jan. 2005.
- [4] V. G. Oklobdzija, B. R. Zeydel, H. Q. Dao, S. Mathew, and R. Krishnamurthy, "Comparison of high-performance VLSI adders in the energy-delay space," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 13, no. 6, pp. 754–758, Jun. 2005.
- [5] B. Ramkumar and H. M. Kittur, "Low-power and area-efficient carry select adder," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 20, no. 2, pp. 371–375, Feb. 2012.
- [6] M. Vratonjic, B. R. Zeydel, and V. G. Oklobdzija, "Low- and ultra low-power arithmetic units: Design and comparison," in *Proc. IEEE Int. Conf. Comput. Design, VLSI Comput. Process. (ICCD)*, Oct. 2005, pp. 249–252.
- [7] C. Nagendra, M. J. Irwin, and R. M. Owens, "Area-time-power tradeoffs in parallel adders," *IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process.*, vol. 43, no. 10, pp. 689–702, Oct. 1996.
- [8] Y. He and C.-H. Chang, "A power-delay efficient hybrid carry-lookahead/carry-select based redundant binary to two's complement converter," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 55, no. 1, pp. 336–346, Feb. 2008.
- [9] C.-H. Chang, J. Gu, and M. Zhang, "A review of 0.18-$\mu$m full adder performances for tree structured arithmetic circuits," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 13, no. 6, pp. 686–695, Jun. 2005.
- [10] D. Markovic, C. C. Wang, L. P. Alarcon, T.-T. Liu, and J. M. Rabaey, "Ultralow-power design in near-threshold region," *Proc. IEEE*, vol. 98, no. 2, pp. 237–252, Feb. 2010.
- [11] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-threshold computing: Reclaiming Moore's law through energy efficient integrated circuits," *Proc. IEEE*, vol. 98, no. 2, pp. 253–266, Feb. 2010.
- [12] S. Jain et al., "A 280 mV-to-1.2 V wide-operating-range IA-32 processor in 32 nm CMOS," in *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC)*, Feb. 2012, pp. 66–68.
- [13] R. Zimmermann, "Binary adder architectures for cell-based VLSI and their synthesis," Ph.D. dissertation, Dept. Inf. Technol. Elect. Eng., Swiss Federal Inst. Technol. (ETH), Zürich, Switzerland, 1998.
- [14] D. Harris, "A taxonomy of parallel prefix networks," in *Proc. IEEE Conf. Rec. 37th Asilomar Conf. Signals, Syst., Comput.*, vol. 2. Nov. 2003, pp. 2213–2217.
- [15] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," *IEEE Trans. Comput.*, vol. C-22, no. 8, pp. 786–793, Aug. 1973.
- [16] V. G. Oklobdzija, B. R. Zeydel, H. Dao, S. Mathew, and R. Krishnamurthy, "Energy-delay estimation technique for high performance microprocessor VLSI adders," in *Proc. 16th IEEE Symp. Comput. Arithmetic*, Jun. 2003, pp. 272–279.
- [17] M. Lehman and N. Burla, "Skip techniques for high-speed carry-propagation in binary arithmetic units," *IRE Trans. Electron. Comput.*, vol. EC-10, no. 4, pp. 691–698, Dec. 1961.
- [18] K. Chirca et al., "A static low-power, high-performance 32-bit carry skip adder," in *Proc. Euromicro Symp. Digit. Syst. Design (DSD)*, Aug./Sep. 2004, pp. 615–619.
- [19] M. Alioto and G. Palumbo, "A simple strategy for optimized design of one-level carry-skip adders," *IEEE Trans. Circuits Syst. I, Fundam. Theory Appl.*, vol. 50, no. 1, pp. 141–148, Jan. 2003.

- [20] S. Majerski, "On determination of optimal distributions of carry skips in adders," *IEEE Trans. Electron. Comput.*, vol. EC-16, no. 1, pp. 45–58, Feb. 1967.
- [21] A. Guyot, B. Hochet, and J.-M. Muller, "A way to build efficient carry-skip adders," *IEEE Trans. Comput.*, vol. C-36, no. 10, pp. 1144–1152, Oct. 1987.

