TY - JOUR
T1 - Resource-aware Montgomery modular multiplication optimization for digital signal processing
AU - Tao, Qiqi
AU - Li, Liying
AU - Zhou, Junlong
AU - Cao, Guitao
AU - Meng, Dan
N1 - Publisher Copyright:
© 2024 Elsevier B.V.
PY - 2024/6
Y1 - 2024/6
N2 - Homomorphic encryption is an important technology for protecting data privacy, and the performance of modular multiplication directly affects the efficiency of homomorphic encryption. Currently, there are numerous FPGA-based acceleration techniques targeting modular multiplication. However, many of these implementations require substantial hardware resources or suffer from resource imbalance. This leads to a lower throughput. Therefore, we present a novel FPGA-based implementation of Montgomery Modular Multiplication aimed at addressing these challenges. Our design employs a suitable radix bit width and word size based on the digital signal processing (DSP) bit width rather than the conventional binary powers of two. We aim to instantiate more modular multipliers using limited resources while minimizing latency. We also introduce a novel DSP cascade structure, called parallel grouping cascade DSP, which reduces the number of clock cycles of internal multipliers. To balance the ratio of lookup table (LUT) and DSP usage, we also use multipliers implemented in the LUT to replace some DSPs. Our results, implemented on Xilinx Virtex-7 field-programmable gate array (FPGA), demonstrate more than 27% improvement in throughput on 1024-bit modular multiplication and more than 70% improvement on 2048-bit compared to the best previous state-of-the-art references.
AB - Homomorphic encryption is an important technology for protecting data privacy, and the performance of modular multiplication directly affects the efficiency of homomorphic encryption. Currently, there are numerous FPGA-based acceleration techniques targeting modular multiplication. However, many of these implementations require substantial hardware resources or suffer from resource imbalance. This leads to a lower throughput. Therefore, we present a novel FPGA-based implementation of Montgomery Modular Multiplication aimed at addressing these challenges. Our design employs a suitable radix bit width and word size based on the digital signal processing (DSP) bit width rather than the conventional binary powers of two. We aim to instantiate more modular multipliers using limited resources while minimizing latency. We also introduce a novel DSP cascade structure, called parallel grouping cascade DSP, which reduces the number of clock cycles of internal multipliers. To balance the ratio of lookup table (LUT) and DSP usage, we also use multipliers implemented in the LUT to replace some DSPs. Our results, implemented on Xilinx Virtex-7 field-programmable gate array (FPGA), demonstrate more than 27% improvement in throughput on 1024-bit modular multiplication and more than 70% improvement on 2048-bit compared to the best previous state-of-the-art references.
KW - Cryptosystem
KW - Digital signal processing
KW - Field Programmable Gate Array
KW - Large integer arithmetic
KW - Montgomery modular multiplication
UR - https://www.scopus.com/pages/publications/85191442575
U2 - 10.1016/j.sysarc.2024.103142
DO - 10.1016/j.sysarc.2024.103142
M3 - 文章
AN - SCOPUS:85191442575
SN - 1383-7621
VL - 151
JO - Journal of Systems Architecture
JF - Journal of Systems Architecture
M1 - 103142
ER -