Hardware Implementation of Elliptic Curve Cryptography over Binary Field

1Sandeep S.V, 2Hameem Shanavas.I, 3Nallusamy.V, 4Brindha.M

1 PG Scholar, Department of ECE, M.V.J College of Engineering, Bangalore, India. Email: Sandeepsv88@gmail.com

2Assistant Professor, Department of ECE, M.V.J College of Engineering, Bangalore, India. Email:hameemshan@gmail.com

3,4Associate Professor, Department of ECE, M.V.J College of Engineering, Bangalore, India. Email:nallu1910@gmail.com,brindha_mo47@yahoo.co.in

Abstract—This paper presents high-performance Elliptic Curve Cryptography (ECC) architecture over binary field, based on the Montgomery scalar multiplication algorithm. The word-serial finite field arithmetic unit (AU) is proposed with the optimized operation scheduling and bit-parallel modular reduction. With a dedicated squarer, the 163-bit point scalar multiplication with coordinate conversion can be done in 20.9μs by the design of one AU, and can be further improved to 11.1μs by the one of three AUs, both using 0.13μm CMOS technology. The comparison with other ECC designs justifies the effectiveness of the proposed architecture in terms of performance and area-time efficiency.

Index Terms—Scalar Multiplication, Montgomery Modular Multiplication, Binary field, ECC.

1. INTRODUCTION

Elliptic Curve Cryptography (ECC) has been regarded mature to provide robustness for secure data transaction. Compared with RSA, ECC can supply equivalent level of security with a much smaller key length. Therefore, ECC has become an attractive alternative cryptosystem and many designs have been proposed in recent years. Among them, there are dual-field ECC implementations that support both binary field $GF(2^m)$ and prime field $GF(p)$. The scalable dual-field ECC processor with the high-speed Montgomery multiplier based on a unified word-based ($w\times w$-bit) dual-field multiplier has been proposed [1]. The parallel architecture and its optimization methodology for the dual-field ECC have been presented [2]. In addition, the hardware implementation and measured results have been shown in [3]. Instead of the dual-field approaches, ECC over binary field $GF(2^m)$ can achieve a high throughput inherently because there is no carry propagation in the arithmetic operations, resulting in fast and compact implementations proposed recently [4]. In [4], the high-performance architecture based on a pseudo-pipelined word-serial finite field multiplier has been shown. In addition to serial ones, parallel architectures with multiple arithmetic units (AUs) have
been presented in [5], [7]. Furthermore, the parallelism in different hierarchy levels of point scalar multiplication has been explored with the architecture of parallel processors to reduce the latency.

This paper, explains high-performance architecture for ECC over binary field with the operation scheduling for the Montgomery scalar multiplication algorithm. The scheduling is analyzed using one to four AUs and an extra squarer. Each AU consists of a word-serial multiplier, a squarer and an added. In addition, the formula for bit-parallel modular reduction is derived for the irreducible pentanomials. Therefore the reduction is much simplified and can be performed in one cycle. Using 0.13\(\mu m\) CMOS technology, the 163- bit point scalar multiplication with coordinate conversion can be done in 20.9\(\mu s\) by one AU, and can be further speeded up to 11.1\(\mu s\) by three AUs, respectively. The comparison shows that our approach achieves a very high performance with significant area-time efficiency.

Now days, many information and communication security systems are base on the public key cryptography. Most of the communication security systems are mostly deals with RSA and Elliptic curve cryptography (ECC). Elliptic curve cryptography (ECC) provides the same levels of security with less key size compare with key size of RSA. The Table I [1] shows key size comparison between RSA and Elliptic curve cryptography. Elliptic curve cryptography (ECC) [2], provide secure data transaction to personal identity verification, authentication, digital signature, and key management. ECC Algorithm has been implemented in software ASIC and FPGA.

Hardware implementation of public key cryptosystems is flexible compare with software. The software implementation is slow and not safe to store the private key in the computer’s memory. The advantage of ASIC implementation is secure and faster than the software, but not flexible. The FPGA implementation is suitable to most of the applications, secure, and faster.

<table>
<thead>
<tr>
<th>ECC Key Size</th>
<th>RSA Key Size</th>
<th>Key Size Ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>163</td>
<td>1024</td>
<td>1:6</td>
</tr>
<tr>
<td>256</td>
<td>3972</td>
<td>1:12</td>
</tr>
<tr>
<td>384</td>
<td>7680</td>
<td>1:20</td>
</tr>
<tr>
<td>512</td>
<td>15360</td>
<td>1:30</td>
</tr>
</tbody>
</table>

An elliptic curve (EC) over GF (p) for \(3>p\) is defined as [14]

\[ y^2 = x^3 + \alpha x + \beta \]  

(1)

Where \(x, y \in GF (p)\), and \(4\alpha^3 + 27\beta^2 \neq 0\) in the GF(p).

Each value of the ‘\(\alpha\)’ and ‘\(\beta\)’ gives a different EC. All points \((x, y)\) which satisfies the above equation plus a point at infinity lies on the elliptic curve. The public key is a point in the curve while the private key is a random number. The public key is generated by multiplying the private key with the generator point \(G\) in the curve. The generator point \(G\), the curve parameters ’\(\alpha\)’ and ’\(\beta\)’, together with few more constants constitutes the domain parameters of ECC. The fundamental operations on the ECC are point adding, doubling and scalar multiplication. A simple elliptic curve is shown in Fig.1.

Hence, the pair \((x, y)\), \(x, y \in GF (p)\), will be a point on the curve when \((x, y)\) satisfies (1), and the point at infinity, denoted by \(\infty\), is said to be on the...
The algebraic formulae for point addition and point doubling are given as

\[
\begin{align*}
X_3 &= \lambda^2 - X_1 - X_2 \pmod{p} \quad (2) \\
Y_3 &= \lambda (X_1 - X_3) - Y_1 \pmod{p} \\
\text{When } P \neq Q, &\quad \lambda = (y_2 - y_1) / (x_2 - x_1) \quad (3) \\
\text{When } P = Q, &\quad \lambda = (3x_1^2 + \alpha) / 2y_1 \quad (4)
\end{align*}
\]

The common arithmetic operations over GF (p) are the addition, subtraction, multiplication and inverse. In the EC operations, there are two type of coordinates, one is affine coordinate another is projective coordinate. One of the complex and time consuming operations on EC is field inversion. The affine coordinate contains the field inversion operation, while the alternative Jacobian’s projective coordinate is in the form \((x, y, z) \rightarrow (x/z^2, y/z^2)\) over GF (p).

The steps taken by two individuals in order to communicate a secret through an unsecured channel using ECC is significant. In this protocol both parts perform the same computation apart from the signal of the point multiplication. To encrypt the message \(M\), \(C = M + k_1(k_2G)\) is computed and to decrypt \(M = C - k_1(k_2G)\) is computed, with ‘\(k_i\)’ the private keys and ‘\(k_iG\)’ the public keys of the two individuals.

Performance Enhanced Co-Processor for Elliptic Curve Cryptography over GF (p)

An EC over GF (p) consists of the solutions \((x, y)\) as defined by (1) along with an additional element called the point at infinity, denoted by Q. The set of points \((x, y)\) are in the so-called affine coordinate representation. The operation of \(Q = kP\) is called point multiplication or scalar multiplication, where ‘\(k\)’ is an integer and ‘\(P\)’ is a point on an EC ‘E’ defined over a field GF(p), which dominates the execution time of ECC schemes. In the implementation of ECC, scalar multiplication and inversion are not only the basic computing but also the most time consuming operations. So the operational efficiency of inversion and scalar multiplication directly determines the performance of ECC.

Many ECC algorithm have been implemented [3],[9], and there were flexible and high speed either over GF(2m) or GF(p).The high speed parallel architecture have been implemented [3],[5] over GF(2m). In [6], high throughput and low power signs have been implemented over GF (p). The dual – field designs were proposed [7],[9], which are flexible, high throughput, parallel, and scalable for elliptic curve over GF (p) and GF (2m).

The proposed ECC processor has been synthesized using Xilinx ISE. Simulation was done with Modelsim XE 6.1e, and fabricated using TSMC 130-nm CMOS cell base technology. This coprocessor can be adapted both prime field and binary field, also contains a control unit with 256 bit serial and parallel operations , which provide integrated high throughput with low power consumptions. This test coprocessor can operate in the parallel mode with four arithmetic units (AUs) and serial mode with the single AU. This scalar Multiplier architecture operation is perform base on clock rate and produce better performance in term of time and area compared to similar works.

2. MATHEMATICAL BACKGROUND

In this work, we focus on ECC over binary field GF (2m) with polynomial basis defined in the IEEE 1363 Standard Specifications for Public-Key Cryptography [9]. The standardized Elliptic Curve (EC) is \(y^2 + xy = x^3 + ax^2 + b\), where \(a, b \in \text{GF} (2^m)\) and \(b \neq 0\). The most critical operation in ECC, i.e., the point scalar multiplication, can be defined as \(Q = kP\), where \(Q\) and \(P\) are points on the EC and \(k\) is a scalar. The operation \(KP\) can be performed by iterative point double and point addition. Our ECC architecture adopts the Montgomery scalar multiplication with projective coordinate presented in [10] for fast point scalar multiplication as shown in algorithm 1.
Algorithm 2.1 Montgomery scalar multiplication with projective coordinate representation.

**Input:** A point \( P = (x, y) \), an \( l \)-bit integer \( k = (k_l-1, \ldots, k_1, k_0) \).

**Output:** \( Q = kP \).

1. \( X_1 = x, Z_1 = 1, X_2 = x^4 + \beta, Z_2 = x^2. \)
2. for \( i = l-2 \) to 0 by \(-1\) do
3. if \( k_i = 1 \) then
4. \((X_1, Z_1) = M_{add}(X_1, Z_1, X_2, Z_2), (X_2, Z_2) = M_{double}(X_2, Z_2)\)
5. else
6. \((X_2, Z_2) = M_{add}(X_1, Z_1, X_2, Z_2), (X_1, Z_1) = M_{double}(X_1, Z_1)\)
7. end if
8. end for
9. \( Q = M_{xy}(X_1, Z_1, X_2, Z_2) \)
10. \( M_{add}(X_1, Z_1, X_2, Z_2) // Point Addition\)
11. \( Z_3 = (X_1 \times Z_2 + X_2 \times Z_1)2, X_3 = x \times Z_3 + (X_1 \times Z_2) \times (X_2 \times Z_1)\)
13. return \((X_3, Z_3)\)
14. \( M_{double}(X_1, Z_1) // Point Double\)
15. \( Z_2 = Z_2 \)
17. \( M_{xy}(X_1, Z_1, X_2, Z_2) // Coordinate Conversion\)
18. \( X = X_1/Z_1, Y = (x+X)\times(y+X+2+(X_2/Z_2 +X)\times(X_1/Z_1 +X))/x+y\)

3. PROPOSED ECC ARCHITECTURE

Fig. 2 shows the proposed ECC scalar multiplication Architecture, which contains control units, dual field 256 bit multiplier and adder, and output register. The control unit decoded the four 256 bit instructions and send to the Arithmetic unit, which is performing the adding and doubling operations. In the dual field co-processor based on [6] and performs the Montgomery scalar multiplication over GF (2m) and Montgomery inversion over GF (p) we used Spartan II device to build up the processor. The EC point scalar multiplication can be done by iteratively point double and/or point double with point additions. This approach shows the significant improvements over [9], in term of performance and power efficiency. To speed up the performance, the coprocessor adopted full word field adder and it reduces 85% of the cycle for multiplicative inversion over finite field. To reduce the power consumption, it provides both the parallel and serial power modes.

Fig. 3 shows the proposed processor, which contains advanced high performance bus (AHB) interface, input and output buffers, Main controller, EC data selector, ECC modular multiplier, Clock controller, Register File, and Montgomery Unit. The EC Data selector fetches the instructions from the main controller and decoded to ECC Modular Multiplication units. The Clock Control Unit used for schedule the cycle required to perform the scalar multiplications. The ECC Modular Multiplier performs point coordinate conversion, point double, point addition, scalar multiplication, Montgomery pre and post processing, and modular exponentiation.
The Montgomery unit contains Montgomery Scheduler and Montgomery Data selector. The Montgomery Scheduler decodes instruction from the Phase II of Algorithm 2.1 and performs the Montgomery inversion. The Data Selector fetches the input value from the input buffer, the input values are either prime or binary field based on the curve over GF (p) and GF (2m), respectively. Our processor contains the dual-field adders which consist of four parallel units and one serial unit.

The main features of our design is, we used Altera Stratix II device and TMSC 0.13μm CMOS technologies are together in order to produce the low power and less area. The 256 bit ECC Modular Multiplier is embedding in between ECC Data Selector and Clock Control unit, which is produce better result compare with [14].

These four parallel and serial units are used to reduce the power consumptions. The four parallel Arithmetic Units and one serial unit are decodes control instructions from the main controller and perform the modular addition and modular multiplication, the intermediate results are stores in the register file. The dual-field multipliers and dual-field adders are capable of performing arithmetic over both the prime and binary fields by a unified hardware. Each intermediate variable during the EC operation is stored in the register file, which consists of seven 256-bit buffers. Finally, the output buffer stores the results, which can be accessed via the bus.

4. PERFORMANCE COMPARISION

The proposed design has been captured in Verilog HDL, simulated by ModelSim and implemented on Spartan II device. The FPGA design presented is highly adaptable and easily reprogrammable for both prime field and finite field and our design also fabricated by using TSMC 130 nm 1.6 V 1P8M CMOS standard.

Since the design used both Spartan II device and TSMC 130-nm CMOS cell-base technology, the comparison table contains previous works with both technologies. The Table II shows the comparison result of our design and previous similar works with different application specific integrated circuit (ASIC) design for both GF (p) and GF (2m), as most of them reported synthesis or post layout results.

<table>
<thead>
<tr>
<th>Design</th>
<th>Device</th>
<th>Implementation of Modular Multiplication</th>
<th>Clock Cycles for One Operation</th>
<th>Time for Operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>[6]</td>
<td>0.13 ASIC</td>
<td>Systolic array</td>
<td>120 Cycles</td>
<td>1.2 µs</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>100 MHz</td>
</tr>
<tr>
<td>[7]</td>
<td>Xilinx Virtex-E V1000E</td>
<td>Systolic array</td>
<td>230 Cycles</td>
<td>2.6 µs</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>91.3 MHz</td>
</tr>
<tr>
<td>[8]</td>
<td>Xilinx Virtex2 pro XC2VI P125</td>
<td>256 embedded multipliers used</td>
<td>32 Cycles</td>
<td>0.7 µs</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>45.68 MHz</td>
</tr>
<tr>
<td>[9]</td>
<td>0.18 ASIC</td>
<td>Custom-designed 256-bit multiplier</td>
<td>3 Cycles</td>
<td>0.021 µs</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>142 MHz</td>
</tr>
<tr>
<td>Ours</td>
<td>Altera Cyclone3 EP3C40</td>
<td>81 embedded multipliers used</td>
<td>3 Cycles</td>
<td>0.1 µs</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>130.38 Hz</td>
</tr>
</tbody>
</table>
5. CONCLUSION

In this work, a new 256-bit Montgomery multiplier has been designed. With the 5-stage pipeline, it can perform a modular multiplication in only 3 cycles. To our knowledge, it is much faster than previous designs on FPGA devices. By using Karasuba-Ofman algorithm, the number of embedded multipliers required is reduced on FPGA devices. Furthermore, the design can be easily merged by some other IPs to build a full cryptographic SOC system.

REFERENCES


Sandeep S.V has graduated Bachelor degree in Electronics and Communication Engineering from Channabasaveshwara Institute of Technology affiliated to Visvesvaraya Technological University in the academic year of 2010.He is currently pursuing M.Tech in Digital Electronics and Communication Engineering from MVJCE, Bangalore. His research areas are VLSI implementation, Communication Networks and Microcontroller.

Hameem Shanavas .I is the Doctoral Research Scholar of Anna University, Coimbatore, India. He is currently working Assistant Professor, Department of ECE, M.V.J. College of Engineering, Bangalore, India. He has completed his Bachelor Degree in Electronics and Communication (2006), Masters in VLSI Design (2008) and also he
 completed Masters in Business Administration (2009). He worked for various institutions in electronics and communication department around many states in India. He has published many journals and attended many Conferences in National and International Level. He is in editorial committee of many International Journals and reviewer for many Journals like IEEE Transactions, Science Direct etc. He is the member of Professional bodies like ISECE, IACSIT, IAEng. His research areas are VLSI Physical Design and Testing.

V Nallusamy completed his diploma in Electronics and Communication Engineering from Government Polytechnic, Aranthangi, Tamilnadu in 2001 and subsequently received Bachelor in Engineering from Pavendar Bharathidasan College of Engineering under Bharathidasan University, Trichirappalli, Tamilnadu in 2004. He also obtained a Master of Engineering in Computers and Communication from Pavendar Bharathidasan College of Engineering under Anna University, Tamilnadu in 2010. He is currently leading the Department of ECE, M.V.J. College of Engineering, Bangalore, India. He has published many journals and attended many Conferences in National and International Level. His research areas are embedded Systems, Robotics and CAD Algorithms. email: nallu1910@gmail.com

M. Brindha completed Bachelor in Engineering (2004). Master of Engineering from Anna University, Chennai (2006). She is currently an Associate Professor in the Department of ECE, M.V.J. College of Engineering, Bangalore, India. She has published many journals and attended many Conferences in National and International Level. Her research areas are embedded Systems, FPGA Implementation and Algorithms. email: brindha_mo47@yahoo.co.in