基于FPGA的高效蒙哥马利模乘器设计

颜震; 黄亦成; 马仕成; 王雪岩

doi:10.20193/j.ices2097-4191.2024.0086

PDF(6807 KB)

集成电路与嵌入式系统 ›› 2025, Vol. 25 ›› Issue (4) : 1-9. DOI: 10.20193/j.ices2097-4191.2024.0086

封面文章

基于FPGA的高效蒙哥马利模乘器设计

作者信息 +

Design of an efficient Montgomery modular multiplier based on FPGA

Author information +

文章历史 +

摘要

为实现高并行计算,全同态加密硬件加速系统需要例化大量密码学原语操作单元。模乘作为全同态加密中最重要的原语操作,其电路实现面积对加速系统面积有较大影响。针对现有模乘器设计中资源使用较多、支持参数范围较小和依赖硬核IP等问题,本文设计了一种基于FPGA的高效蒙哥马利模乘器。该模乘器在算法层面通过NTT-Friendly模数特性、压缩和编码等方法降低计算量,在电路层面通过分时复用和数据整合等方法减少资源。此外,该模乘器支持参数配置以实现不同位宽的蒙哥马利模乘。实验结果表明,本文设计的蒙哥马利模乘器在位宽为32比特时,时钟频率223 MHz,延迟26.9 ns,LUT使用1 313个,FF使用213个,相较于对比对象,资源平均减少32%,延迟平均提高16%,但结构更灵活,具有较强的适用性。

Abstract

To achieve high parallel computing, fully homomorphic encryption hardware acceleration systems require the instantiation of a large number of cryptographic primitive operation units. As the most crucial primitive operation in fully homomorphic encryption, the circuit implementation area of modular multiplication significant impacts the overall area of the acceleration system. Addressing issues such as excessive resource usage, limited parameter, and dependency on macro core IPs in existing modular multiplier designs, this paper presents an efficient Montgomery modular multiplier based on FPGA. At the algorithmic level, the multiplier reduces the computational load through techniques such as NTT-Friendly modulus characteristics, compression, and encoding. At the circuit level, it minimizes resource through methods like time-division and data integration. Furthermore, the multiplier supports parameter configuration to implement Montgomery modular multiplication for different widths. Experimental results demonstrate that, for a 32-bit width, the designed Montgomery modular multiplier operates at a clock frequency of 223 MHz with a latency of 26.9 ns, utilizing 1 313 LUTs and 213 FFs. Compared to the baseline, the resource consumption is reduced by 32% on average, while the latency is improved by 16% on average, making the design more flexible and highly applicable.

导出引用

颜震, 黄亦成, 马仕成, 等. 基于FPGA的高效蒙哥马利模乘器设计[J]. 集成电路与嵌入式系统. 2025, 25(4): 1-9 https://doi.org/10.20193/j.ices2097-4191.2024.0086

YAN Zhen, HUANG Yicheng, MA Shicheng, et al. Design of an efficient Montgomery modular multiplier based on FPGA[J]. Integrated Circuits and Embedded Systems. 2025, 25(4): 1-9 https://doi.org/10.20193/j.ices2097-4191.2024.0086

中图分类号： TN47 (大规模集成电路、超大规模集成电路)

参考文献

列表( 原文顺序 | 文献年度倒序 | 文中引用次数倒序 ) 可视化分析

[1]	高献伟, 张晓楠, 董秀则. 基于FPGA的Montgomery模乘器的高效实现[J]. 计算机应用研究, 2017, 34(11):3424-3427. GAO X W, ZHANG X N, DONG X Z. Efficient implementation of Montgomery modular multiplier on FPGA[J]. Application Research of Computers, 2017, 34(11):3424-3427 (in Chinese). 本文引用 [2]

[2]

陈億, 杨萱, 曾涵, 等. 一种高速可伸缩的双域Montgomery模乘器架构[J]. 计算机工程, 2023, 49(8):283-290.

https://doi.org/10.19678/j.issn.1000-3428.0065018

摘要

为提高Montgomery模乘在硬件实现上的运算速度并保持较高的性能，提出一种适用于高速椭圆曲线密码处理器的高速可伸缩的双域Montgomery模乘算法及其硬件架构。通过迭代调用Karatsuba乘法，实现最大位宽为576 bit的Montgomery模乘，并利用Montgomery模乘相邻运算部分数据的无关性，通过提前计算部分数据，减少Montgomery模乘运算使用的时钟周期数。基于Karatsuba算法中多次使用大位宽加法运算带来资源消耗大和超长进位链的问题，设计基于双域4-2压缩变换的加法选择电路结构，将一个超大位宽的加法运算拆分成多个小位宽的加法，在一个时钟周期内同时得到所有加法运算的结果，并根据加法的进位输出进行最终输出结果的选择，有效缩短加法进位链的延时。实验结果表明，相比基于ASIC的Montgomery模乘实现方案，Montgomery模乘算法及硬件架构具有更高的灵活性，在65 nm的CMOS工艺下进行逻辑综合，最高时钟频率能够达到459 MHz，面积资源占用为480 254 µm<sup>2</sup>，完成0~145 bit、146~289 bit、290~435 bit和436~576 bit的Montgomery模乘分别仅需要8.72 ns、23.98 ns、58.86 ns和71.94 ns，且具有较低的面积时间积。

CHEN

, YANG

, ZENG

, et al. A high-speed scalable dual-field Montgomery modular multiplier architecture[J]. Computer Engineering, 2023, 49(8):283-290 (in Chinese).

https://doi.org/10.19678/j.issn.1000-3428.0065018

本文引用 [2] 摘要

A high-speed scalable dual-field Montgomery Modular Multiplier(MMM) algorithm and its hardware architecture are proposed for an Elliptic Curve Cryptography(ECC) processor to improve its performance. The 576 bit scalable MMM is implemented in this study by iteratively calling the Karatsuba algorithm.The number of clock cycles is reduced by calculating some of the data in advance, utilizing the data irrelevance in the adjacent operations of the MMM. To avoid large resource consumption and a long carry chain in the implementation of the Karatsuba algorithm, a carry-select addition approach based on dual-field 4-2 compressors is proposed. It splits a large addition into several small additions, simultaneously obtains all the addition results, and selects the final result according to the addition carries, shortening the delay of the carry chain.The experimental results show that, compared with other MMM implementations in Application-Specific Integrated Circuits(ASIC), the proposed MMM hardware architecture has higher flexibility.Synthesized with a 65 nm Complementary Metal-Oxide-Semiconductor(CMOS) process, the maximum frequency is 459 MHz and the area is 480 254 µm². Completing MMMs of 0-145 bit, 146-289 bit, 290-435 bit, and 436-576 bit takes 8.72 ns, 23.98 ns, 58.86 ns, and 71.94 ns, respectively, with a low area-time product.

[3]	LI B, WANG J, DING G, et al. A high-performance and low-cost montgomery modular multiplication based on redundant binary representation[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2021, 68(7):2660-2664. 本文引用 [2]

[4]	LONGA P, NAEHRIG M. Speeding up the number theoretic transform for faster ideal lattice-based cryptography[C]// Cryptology and Network Security:15th International Conference, Milan,Italy,2016: 124-139. 本文引用 [1]

[5]	ITABASHI Y, UENO R, HOMMA N. Efficient modular polynomial multiplier for NTT accelerator of Crystals-Kyber[C]//2022 25th Euromicro Conference on Digital System Design (DSD). IEEE,2022:528-533. 本文引用 [1]

[6]	DERYA K, MERT A C, ÖZTÜRK E, et al. CoHA-NTT:A configurable hardware accelerator for NTT-based polynomial multiplication[J]. Microprocessors and Microsystems, 2022,89:104451. 本文引用 [3]

[7]	HU X, TIAN J, LI M, et al. AC-PM:An area-efficient and configurable polynomial multiplier for lattice based cryptography[J]. IEEE Transactions on Circuitsand Systems I: Regular Papers, 2022, 70(2):719-732. 本文引用 [2]

[8]	TAN W, CASE B M, WANG A, et al. High-speed modular multiplier for lattice-based cryptosystems[J]. IEEE Transactions on Circuits and Systems II:Express Briefs, 2021, 68(8):2927-2931. 本文引用 [1]

[9]	HU X, LI M, TIAN J, et al. DARM:A low-complexity and fast modular multiplier for lattice-based cryptography[C]//2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2021:175-178. 本文引用 [3]

[10]	KOZIEL B, ACKIE A B, EL KHATIB R, et al. SIKE’d up:Fast hardware architectures for supersingular isogeny key encapsulation[J]. IEEE Transactions on Circuits and Systems I:Regular Papers, 2020, 67(12):4842-4854. 本文引用 [1]