Improving ML-KEM and ML-DSA on OpenTitan

Ruben Niederhagen, Hoang Nguyen Hien Pham

Cyber

IACR Transactions on Cryptographic Hardware and Embedded Systems

Introduction

The paper improves ML-KEM and ML-DSA performance on OpenTitan's OTBN through enhanced instruction set extensions, achieving up to 17% speedup in cycle count and a reduced ASIC time-area product for post-quantum cryptography.

Abstract

This work improves upon the instruction set extension proposed in the paper “Towards ML-KEM and ML-DSA on OpenTitan”, in short OTBNTW, for OpenTitan’s big number coprocessor OTBN. OTBNTW introduces a dedicated vector instruction for prime-field Montgomery multiplication, with a high multi-cycle latency and a relatively low utilization of the underlying integer multiplication unit. The design targets post-quantum cryptographic schemes ML-KEM and ML-DSA, which rely on 12-bit and 23-bit prime field arithmetic, respectively. We improve the efficiency of the Montgomery multiplication by fully exploiting existing integer multiplication resources and move modular multiplication from hardware back to software by providing more powerful and versatile integer-multiplication vector instructions. This enables us not only to reduce the overall computational overhead through lazy reduction in software but also to improve performance in other functions beyond finite-field arithmetic. We provide two variants of our instruction set extension, each offering different trade-offs between resource usage and performance. For ML-KEM and ML-DSA, we achieve a speedup of up to 17% in cycle count, with an ASIC area increase of up to 6% and an FPGA resource usage increase of up to 4% more LUT, 20% more CARRY4, 1% more FF, and the same number of DSP compared to OTBNTW. Overall, we significantly reduce the ASIC time-area product, if the designs are clocked at their individual maximum frequency, and at least match that of OTBNTW, if the designs are clocked at the same frequency.
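To make the phrase "move modular multiplication from hardware back to software" concrete, the following is a minimal Python sketch of Montgomery multiplication built only from integer multiplications, additions, masks, and shifts. It is not the paper's instruction sequence. The moduli 3329 and 8380417 are the standard ML-KEM and ML-DSA primes; the radix choices R = 2^16 and R = 2^32 and all function names are assumptions made for illustration.

```python
# Generic Montgomery multiplication sketch (hypothetical helper names).
# Assumption: R = 2**r_bits with R > q; the paper may use different radices.

def montgomery_setup(q, r_bits):
    """Precompute -q^(-1) mod R for an odd modulus q and R = 2**r_bits."""
    r = 1 << r_bits
    return (-pow(q, -1, r)) % r


def montgomery_reduce(t, q, r_bits, q_neg_inv):
    """Return t * R^(-1) mod q, valid for 0 <= t < q * R."""
    mask = (1 << r_bits) - 1
    m = ((t & mask) * q_neg_inv) & mask  # m = t * (-q^-1) mod R
    u = (t + m * q) >> r_bits            # exact: t + m*q is a multiple of R
    return u - q if u >= q else u        # u < 2q, one conditional subtraction


def montgomery_mul(a, b, q, r_bits, q_neg_inv):
    """Multiply two values given in Montgomery form (a*R mod q, b*R mod q)."""
    return montgomery_reduce(a * b, q, r_bits, q_neg_inv)


def to_montgomery(a, q, r_bits):
    """Convert a in [0, q) to Montgomery form a*R mod q."""
    return (a << r_bits) % q


# (modulus, assumed radix bits) per scheme
PARAMS = {"ML-KEM": (3329, 16), "ML-DSA": (8380417, 32)}


def modmul(a, b, scheme):
    """Compute a*b mod q via Montgomery form, for a, b in [0, q)."""
    q, r_bits = PARAMS[scheme]
    q_neg_inv = montgomery_setup(q, r_bits)
    am = to_montgomery(a, q, r_bits)
    bm = to_montgomery(b, q, r_bits)
    cm = montgomery_mul(am, bm, q, r_bits, q_neg_inv)   # a*b*R mod q
    return montgomery_reduce(cm, q, r_bits, q_neg_inv)  # a*b mod q
```

Each reduction costs two extra integer multiplications on top of the product itself, which is why the throughput of general-purpose integer-multiplication instructions directly determines the cost of a software Montgomery multiplication.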

Review

This paper advances the hardware acceleration of the post-quantum cryptographic schemes ML-KEM and ML-DSA on OTBN, the big number coprocessor of the OpenTitan platform. Building on the earlier OTBNTW design, the authors identify a critical inefficiency: the dedicated vector instruction for prime-field Montgomery multiplication has a high multi-cycle latency and makes relatively poor use of the underlying integer multiplication unit. This matters because ML-KEM and ML-DSA work in relatively small 12-bit and 23-bit prime fields, respectively. The core problem is how to use the existing hardware resources more effectively for this arithmetic without being constrained by a highly specialized, and thus underutilized, instruction.

The proposed solution is a significant architectural shift. Instead of a specialized hardware instruction for modular multiplication, the authors provide more powerful and versatile integer-multiplication vector instructions. Modular multiplication moves back into software, which enables lazy reduction to cut computational overhead. This generalized approach benefits not only finite-field arithmetic but also other functions. The two instruction set extension variants offer different trade-offs between resource usage and performance.

The reported improvements are clearly quantified. For ML-KEM and ML-DSA, the new instruction set extension yields a speedup of up to 17% in cycle count compared to OTBNTW. This gain comes with a modest increase in hardware resources: up to 6% more ASIC area and, on FPGA, up to 4% more LUTs, 20% more CARRY4, and 1% more FFs, with the same number of DSPs.

The paper also claims a significant reduction in the ASIC time-area product when each design is clocked at its individual maximum frequency, and a time-area product that at least matches that of OTBNTW when both designs are clocked at the same frequency. Overall, the work is a notable step in optimizing post-quantum cryptography on embedded platforms, balancing performance, resource efficiency, and instruction set flexibility.
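The lazy reduction mentioned in the review can be illustrated with a small Python sketch. It is a generic example, not the authors' implementation: the function names and the 32-bit accumulator width are assumptions, and a plain `%` stands in for whatever reduction routine the software actually uses. The point is only that unreduced products can be accumulated and reduced once, as long as the accumulator cannot overflow.

```python
Q = 3329  # ML-KEM modulus (12-bit prime)


def dot_eager(a, b):
    """Reduce modulo Q after every multiply-accumulate step."""
    acc = 0
    for x, y in zip(a, b):
        acc = (acc + x * y) % Q
    return acc


def max_lazy_terms(acc_bits):
    """Number of products of values in [0, Q) that fit in acc_bits bits."""
    return ((1 << acc_bits) - 1) // ((Q - 1) ** 2)


def dot_lazy(a, b, acc_bits=32):
    """Accumulate unreduced products and reduce modulo Q once at the end.

    acc_bits is a hypothetical accumulator width used only to bound the
    number of terms; Python integers themselves never overflow.
    """
    if len(a) > max_lazy_terms(acc_bits):
        raise ValueError("too many terms for the assumed accumulator width")
    acc = 0
    for x, y in zip(a, b):
        acc += x * y   # no reduction inside the loop
    return acc % Q     # single reduction at the end
```

Both functions return the same value, but `dot_lazy` performs one reduction instead of one per term, which is the kind of saving that becomes available once reduction is a software decision rather than part of a fixed hardware instruction.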
