Introduction
The security of traditional Public-Key Cryptography (PKC), such as Rivest–Shamir–Adleman (RSA) and elliptic-curve cryptography (ECC), relies on one of three hard mathematical problems: integer factorization, the discrete logarithm, or the elliptic-curve discrete logarithm. These hard problems can easily be solved on a sufficiently powerful quantum computer with Shor's algorithm [1], [2]. This creates the need for post-quantum PKC algorithms that can resist the threat from quantum computers in the near future.
The National Institute of Standards and Technology (NIST) is in the process of selecting one or more post-quantum cryptography algorithms through a public, competition-like process [3], in which the candidates specify digital signature and key encapsulation mechanism (KEM) algorithms. The evaluation criteria focus not only on the security aspects of an algorithm, but also on the performance of its implementation. Considering the performance aspects, an algorithm should be evaluated on various classical platforms to show its efficiency in practical applications. In November 2017, 82 candidate algorithms were submitted to the NIST post-quantum competition for consideration. Of those candidates, seven finalists and eight alternate candidates were selected for the third round, according to the announcement made by NIST in July 2020.
In the third round, five lattice-based cryptography algorithms were selected as finalists (i.e., KYBER, NTRU, SABER, DILITHIUM, and FALCON), while another two were chosen as alternate candidates (i.e., FrodoKEM and NTRU Prime). Compared with the other post-quantum cryptography candidates, lattice-based cryptography holds the majority share in the third round. To evaluate the practicality of cryptographic algorithms, much work has been devoted to improving their performance on various platforms, such as FPGAs, microcontrollers, and massively parallel processors (GPUs). Recently, an FPGA implementation based on approximate computation was proposed [4] to accelerate lattice-based cryptography. The first GPU implementation of NTRU was presented in 2010 [5] and showed that the GPU can achieve very high encryption and decryption throughput by utilizing the product form polynomial and some bit-packing techniques. Other works have been proposed to accelerate the performance of NTRU on a GPU [6]–[10]. There are also several attempts to accelerate post-quantum cryptography (PQC) on various GPU platforms [11]–[14], targeting the parameters in the NIST standardization process [15]. However, previous works paid little attention to the power of the new GPU tensor core, which can be a better choice than the ordinary GPU instruction set (i.e., integer/floating point units). The tensor core is a specialized unit released by NVIDIA in its latest GPU architectures (Volta, Turing and Ampere). Many deep neural network applications take advantage of NVIDIA's tensor core to improve training and inference performance. However, it is unclear how cryptography can exploit the tensor core to improve implementation performance.
In this paper, our aim is to exploit the tensor core to speed up lattice-based KEM implementations on a GPU, in order to achieve high-throughput KEM. Our main contributions are summarized below:
For the first time, a tensor-core-based polynomial convolution on GPU is presented. Experiments are carried out on two of the latest GPUs (RTX2060 and RTX3080) that support tensor cores. The proposed technique can handle polynomials with a degree in multiples of 16, and shows up to $3.41\times$ (RTX2060) and $5.77\times$ (RTX3080) faster performance compared to the conventional implementation using 32-bit integer units in the GPU, for polynomial degree $N=1024$.
The first NTRU [16] implementation based on tensor core is proposed in this paper. Since the polynomial degree in NTRU is not a multiple of 16, some modifications are required in order to use the tensor-core-based polynomial convolution. A series of parallel algorithms, including zero padding, sign conversion and type casting, is proposed to achieve this, resulting in a high-performance NTRU implementation on a GPU. For instance, the tensor-core-based ntruhps2048509 can achieve a throughput of 793651 encapsulations per second and 505051 decapsulations per second on an RTX2060 GPU. The results are $28.47\times/2.02\times$ and $66.50\times/1.56\times$ faster than the implementations on the Central Processing Unit (CPU)/integer units in the GPU, for encapsulation and decapsulation, respectively. The same experiments were conducted on the RTX3080, where similar speed-ups were obtained. Note that this is also the first NTRU implementation on GPU that follows the NIST PQC specifications [16].
The proposed tensor-core-based polynomial convolution can handle various polynomial sizes. To validate this point, we apply the proposed technique to another two lattice-based cryptosystems: LAC and two variant parameter sets of FrodoKEM. The tensor-core-based polynomial convolution in LAC and one variant of FrodoKEM outperformed the integer-units-based implementations by $3.10\times$ and $3.31\times$, respectively (RTX2060). Detailed steps to efficiently utilize the proposed technique for polynomial/matrix multiplication in these two schemes are described in Sections IV-E and IV-F.
The source code for the tensor-core-based polynomial convolution is placed in the public domain at https://github.com/benlwk/Tensorcrypto. This allows researchers to easily reproduce our results in their own development environments and to utilize the tensor-core-aided lattice-based cryptography implementation for their own purposes.
Here we establish the criteria for applying our technique to other lattice-based schemes. At a high level, our solution allows very high throughput key encapsulation/decapsulation using the same public-private key pair for each communication session. The proposed solution is applicable to all lattice-based cryptography with a small modulus, where multiplication is expressed in the form of a vector and a matrix. This can be either an ideal lattice construction, as in NTRU and LAC, or a generic lattice construction, as in FrodoKEM. However, schemes such as KYBER [12] and NewHope [13] already use NTT-based multiplications, so their polynomial convolution is already efficient; it would be more advantageous to use the NTT instead of the schoolbook multiplication presented in this paper. Moreover, it is also difficult to use tensor cores to accelerate NTT computations, since the size of the modulus (
Although this paper only uses the proposed tensor-core-based polynomial convolution for cryptography, it can be extended to support other applications on GPU. For instance, polynomial convolution is used to perform image reconstruction [17], feature extraction [18], data preprocessing [19] and signal processing [20]. These applications may also benefit from the tensor-core-based polynomial convolution to achieve high throughput.
The remainder of this paper is organized as follows. In Section II, we introduce the background and related prior works. In Section III, we present a novel tensor-core-based polynomial convolution and the implementation of two parameter sets in NTRU. Thereafter, we summarize our experimental results for NTRU, LAC, and FrodoKEM in Section IV. Finally, we conclude the paper in Section V.
Background and Related Works
A. Overview of GPU Architecture and CUDA Programming Model
A GPU consists of thousands of cores, enabling massively parallel computation in many interesting applications. From the hardware perspective, the GPU groups many cores (e.g., 64, 128, or 192) into a Streaming Multiprocessor (SM). Memory in the GPU can be categorized into on-chip and off-chip. On-chip memory refers to the register files and shared memory that reside close to the GPU cores. Registers are very fast, but come in small sizes (64–96K 32-bit words per SM). Shared memory is known as a “user-managed cache”, which is usually used to store frequently accessed values (e.g., look-up tables or pre-cached values). Like registers, shared memory is fast but small in size (48–164K 32-bit words per SM). Off-chip memory refers to global memory, which is essentially the DRAM. It comes in a large size (2–16 GB), but the access latency can be up to
From a programming perspective, many parallel threads form a block and multiple blocks form a grid. This allows flexible arrangement of software threads into the physical SM and cores across many different GPU architectures. The relationship between grids, blocks, and threads is illustrated in Figure 1. NVIDIA GPUs group 32 threads into a warp, wherein all 32 threads execute the same instruction in parallel. For this reason, the number of threads per block is usually set in multiples of 32 to avoid divergence in the instruction execution path. Besides that, shared memory also has 32 banks, allowing parallel access by all 32 threads within a warp. Additional features like warp shuffle instruction and tensor core are also designed to work at the warp level to maximize the efficiency of the GPU warp scheduler.
B. Tensor Core
In 2017, NVIDIA released the Volta GPU architecture, which introduced a specialized unit named the tensor core. The tensor core performs one half-precision matrix-multiply-and-accumulate (MMA) on a
Many deep neural network applications can take advantage of the NVIDIA tensor core. For instance, the convolutional neural network (CNN) requires many dot-product computations, which can easily be expressed as MMA operations. However, it is unclear how cryptography implementations can exploit the newly introduced tensor core. In this work, we present a series of algorithms to map the polynomial convolution in lattice-based cryptography to matrix multiplication in order to exploit tensor core for faster performance.
C. Lattice-Based Cryptography
Lattice-based cryptographic constructions are based on the hardness of the Shortest Vector Problem (SVP), which approximates the minimal Euclidean length of a lattice vector. Lattice-based cryptography is believed to be secure against both conventional and quantum computers. In the third round of the NIST post-quantum cryptography standardization process, five lattice-based algorithms were selected as finalists (CRYSTALS-KYBER, NTRU, SABER, CRYSTALS-DILITHIUM, and FALCON) and another two were selected as alternate candidates (FrodoKEM and NTRU Prime). Table 1 summarizes the parameters of these candidates. Most lattice-based schemes rely on polynomial convolution, which has high computational complexity. To improve the performance of polynomial convolution, we utilize the tensor core and show performance enhancements on lattice-based cryptography with a small modulus, such as NTRU, LAC, and two variant parameter sets of FrodoKEM.
NTRU encryption is a lattice-based one-way CPA-secure (OW-CPA) public-key encryption scheme that was invented in 1996 [29]. For the purposes of this paper, we do not go into the details of the scheme; interested readers may refer to [16]. The major computational bottleneck in NTRU is the polynomial multiplication over the ring
In the NIST PQC competition, there have been two flavors of NTRU, differing in the choice of. The original NTRU scheme, known as NTRU-HPS [29], [30], works over
D. Previous PQC Implementations on GPU
The first implementation of NTRU on a GPU dates back to 2010. Hermans et al. [5] showed that the GPU can achieve very high encryption and decryption throughput by utilizing the product form polynomial and bit-packing techniques. The product form polynomial is no longer used in the NTRU submission to NIST. Following up on this work, Lee et al. [7] proposed a sliding window technique to pre-compute some repeating patterns in the NTRU polynomial and store them in a lookup table. With this technique, some of the multiplication operations can be skipped. Although this work is able to achieve high throughput, it may not be secure against side-channel attacks, because the lookup table leaks timing information. The NTRU modular lattice signature (NTRU-MLS) scheme [32], [33], which requires operations on large vectors, was optimized with parallel polynomial multiplication on a GPU by Dai et al. [9]. Recently, Lee et al. [10] proposed utilizing the Karatsuba algorithm to speed up polynomial multiplication in NTRU. These previous works were all implemented on the integer units available in the GPU. Unlike previous NTRU implementations on GPU, we introduce the first tensor-core-based NTRU implementation.
Recently, there has also been interest in implementing other PQC schemes on GPU. Lee et al. [34] showed various ways to speed up NTT computation on GPU, targeting the polynomials used in the qTESLA signature scheme. A follow-up work [35] demonstrated that the Nussbaumer algorithm can be faster than the NTT for polynomial convolutions when executed on a GPU. Sun et al. [14] showed a parallel implementation of the SPHINCS signature scheme on GPUs with significantly higher throughput compared to a CPU implementation. Gupta et al. [11] analyzed the various forms of parallelism available in a GPU (batch mode and single mode), and evaluated three PQC schemes (FrodoKEM, NewHope and Kyber). Later on, Lee et al. [12] demonstrated a low-latency implementation of Kyber KEM, which can be beneficial to latency-sensitive applications. Another interesting work from Gao et al. [13] further improved the throughput achieved by NewHope on two different GPU platforms. Note that all these previous works do not consider the possibility of using tensor cores in their implementations. Besides high-performance implementation of PQC, another interesting research direction is to develop efficient implementations of various sieving algorithms to solve hard lattice problems [36].
Optimized Parallel Implementation of NTRU
A. Parallel Polynomial Convolution Through Schoolbook Convolution
Polynomial convolution is also known as “truncated polynomial multiplication”. It is the most time-consuming operation in NTRU PKC. A straightforward way to implement it is schoolbook multiplication, wherein the operation exhibits a high degree of parallelism. Referring to Algorithm 1, schoolbook polynomial convolution can be arranged in such a way that it processes one column at a time (the
Algorithm 1 Schoolbook Polynomial Convolution
Polynomial
Polynomial
for
for
end for
for
end for
end for
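For concreteness, Algorithm 1 can be sketched in plain Python. This is our own sequential reference (function name and variables are illustrative); the GPU version computes the column updates in parallel:

```python
def schoolbook_convolution(a, b, q):
    """Truncated polynomial multiplication c = a*b mod (x^N - 1, q).

    Column j of the result accumulates every product a[i]*b[k] with
    i + k == j (mod N); the modulo-q reduction is applied at the end.
    """
    N = len(a)
    c = [0] * N
    for i in range(N):
        for k in range(N):
            c[(i + k) % N] += a[i] * b[k]
    return [x % q for x in c]
```

For instance, `schoolbook_convolution([1, 2, 3], [4, 5, 6], 100)` returns `[31, 31, 28]`.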
Detailed illustrations are presented in Figure 2. One can observe that the operations within the
Parallel computation of polynomial convolution with integer units in GPU; operations from (1) to (4) are performed independently.
Algorithm 2 shows the parallel version of schoolbook polynomial convolution that can be implemented efficiently in a GPU. This implementation utilizes
Algorithm 2 Parallel Schoolbook Polynomial Convolution in NTRU
Polynomial
shared_a[
shared_b[
__syncthreads()
for
end for
for
end for
return c[
Note that we only need to perform the modulo operation (
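Because the NTRU modulus $q$ is a power of two, deferring the reduction is safe even if a 16-bit accumulator wraps around. The small Python check below (our own illustration, with hypothetical function names) confirms that reducing once at the end matches reducing after every addition:

```python
def conv_eager_mod(a, b, q):
    # Reduce modulo q after every accumulation step.
    N = len(a)
    c = [0] * N
    for i in range(N):
        for k in range(N):
            c[(i + k) % N] = (c[(i + k) % N] + a[i] * b[k]) % q
    return c

def conv_deferred_mod(a, b, q):
    # Accumulate with 16-bit wrap-around (as U16 registers would) and
    # reduce modulo the power-of-two q only once at the end; this works
    # because q divides 2**16.
    N = len(a)
    c = [0] * N
    for i in range(N):
        for k in range(N):
            c[(i + k) % N] = (c[(i + k) % N] + a[i] * b[k]) & 0xFFFF
    return [x % q for x in c]
```

Both functions agree for any inputs when `q` is a power of two dividing 2¹⁶, e.g. `q = 2048`.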
B. Proposed Polynomial Convolution Through Tensor Core
The tensor core was introduced into the GPU to accelerate MMA operations with much higher throughput. By taking a closer look at Algorithm 1, we find that the polynomial convolution can be expressed in the form of matrix multiplication. To achieve this, polynomial
Computing polynomial convolution using tensor core in GPU: Matrix
With this proposed technique, NTRU polynomial convolution can be formulated as matrix multiplication and accelerated through the use of the tensor core, which is faster than the conventional INT32 operations.
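The matrix formulation can be illustrated as follows: arranging one operand as a circulant matrix makes the matrix-vector product equal the convolution. This is a plain-Python sketch of ours, not the paper's GPU code:

```python
def circulant(b):
    # Row i holds b arranged so that (circulant(b) @ a)[i] equals
    # coefficient i of the convolution a*b mod (x^N - 1).
    N = len(b)
    return [[b[(i - j) % N] for j in range(N)] for i in range(N)]

def matvec(M, v):
    # Plain matrix-vector product; on the GPU this becomes MMA work.
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]
```

For example, `matvec(circulant([4, 5, 6]), [1, 2, 3])` gives `[31, 31, 28]`, matching the direct schoolbook result.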
C. TensorTRU: NTRU Implementation Based on Tensor Core
1) Representing a Polynomial in Floating Point
Referring to Table 1, NTRU requires modulus
The parameter sets ntruhps2048509 and ntruhps2048677 require
Another two NTRU parameters (ntruhps4096821 and ntruhrss701) can be implemented in tensor core with double precision using configuration 3. However, the performance of FP64 tensor core is much slower compared to FP32, and they only support a smaller matrix size (
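The safety of these floating-point configurations can be checked with simple integer-range arithmetic. The sketch below is our own sanity check (not from the paper): FP16 represents every integer up to 2¹¹ exactly, and FP32 accumulation is exact up to 2²⁴, which covers the power-of-two-modulus NTRU parameter sets with one ternary operand:

```python
FP16_EXACT = 2**11  # FP16 represents every integer up to 2**11 exactly
FP32_EXACT = 2**24  # FP32 represents every integer up to 2**24 exactly

def ntru_mma_is_exact(N, q):
    """Worst case for NTRU: N products of a ternary coefficient in
    {-1, 0, 1} with a coefficient in [0, q), accumulated in FP32."""
    inputs_fit = (q - 1) <= FP16_EXACT   # each operand is exact in FP16
    sum_fits = N * (q - 1) < FP32_EXACT  # the accumulator never rounds
    return inputs_fit and sum_fits
```

For instance, `ntru_mma_is_exact(509, 2048)` and `ntru_mma_is_exact(1024, 2048)` both hold, while much larger `N` would overflow the exact FP32 range.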
2) Parallel Polynomial Convolution Using Tensor Core
Tensor cores were developed to handle a small matrix with dimensions of
Matrix multiplication: dimensions
Referring to Algorithm 3, the tensor-core-based polynomial convolution requires the input matrices to be in multiples of
Algorithm 3 TC-PC: Parallel Polynomial Convolutions Using Tensor Core
and many polynomial
fragment<A, 16, 16, 16, half, row_major> a_frag
fragment<B, 16, 16, 16, half, col_major> b_frag
fragment< accumulator, 16, 16, 16, float> c_frag
tid = thread ID
bid = block ID
blockDim = block dimension
for
load_matrix_sync(a_frag,
load_matrix_sync(b_frag,
mma_sync(c_frag, a_frag, b_frag, c_frag)
end for
store_matrix_sync(
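The tile-by-tile accumulation loop of Algorithm 3 can be mimicked in plain Python. In this sketch of ours, exact integers replace the FP16/FP32 fragments, and a `T`×`T` tile plays the role of one warp-level MMA:

```python
def tiled_matmul(A, B, T=16):
    """Multiply square matrices A and B by iterating over T x T tiles.

    Mirrors the MMA loop of Algorithm 3: each output tile accumulates
    the products of a row of A-tiles with a column of B-tiles, the way
    a warp accumulates load_matrix_sync/mma_sync fragments.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, T):            # output tile row
        for bj in range(0, n, T):        # output tile column
            for bk in range(0, n, T):    # walk the shared dimension
                for i in range(T):
                    for j in range(T):
                        C[bi + i][bj + j] += sum(
                            A[bi + i][bk + k] * B[bk + k][bj + j]
                            for k in range(T))
    return C
```

The result is identical to a naive matrix multiplication; only the loop order changes to expose the tile-level parallelism the tensor core exploits.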
3) Handling a Matrix Not in a Multiple of $16\times16$
The polynomial degrees of two selected NTRU parameter sets (ntruhps2048509 and ntruhps2048677) are
There are two methods to overcome this limitation. The first method is through a hybrid algorithm that combines the tensor core and integer-based polynomial convolution. Figure 5a shows a high-level illustration of such a hybrid algorithm. In this example (parameter set ntruhps2048509), one can utilize tensor core to compute the polynomial convolution of
Handling a matrix that is not a multiple of
To achieve a high-performance NTRU implementation in GPU, we propose a series of parallel algorithms to efficiently perform the following tasks:
Organize the polynomial in cyclic form and pad the remaining parts with zeros to construct a matrix whose size is a multiple of $16\times 16$ (Algorithm 4).
Convert unsigned 16-bit integer (U16) polynomial elements to 16-bit floating point (FP16) format (Algorithm 5).
Convert 32-bit floating point (FP32) elements to unsigned 16-bit integers (U16) and perform modulo operations (Algorithm 6).
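The first of these tasks amounts to embedding the cyclic matrix into a zero-padded square whose side is a multiple of 16. A plain-Python sketch (helper names are ours):

```python
def next_multiple(n, m=16):
    return ((n + m - 1) // m) * m

def padded_cyclic(b, m=16):
    """Embed the cyclic matrix of b into a zero-padded Np x Np square,
    Np being the next multiple of m (cf. Algorithm 4 / ParCyc)."""
    N = len(b)
    Np = next_multiple(N, m)
    M = [[0] * Np for _ in range(Np)]
    for i in range(N):
        for j in range(N):
            M[i][j] = b[(i - j) % N]
    return M
```

Multiplying this matrix by the zero-padded coefficient vector leaves the first N outputs identical to the unpadded convolution, while the padding rows contribute only zeros.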
Algorithm 4 ParCyc: Parallel Algorithm to Arrange Polynomial in Cyclic Form
Polynomial in with degree
Matrix out with
in cyclic form and padded with zeros for unused elements.
if
else
end if
Algorithm 5 ParU16toFP16: Parallel Algorithm to Convert Polynomial Elements From U16 to FP16
Matrix in with
Matrix out with
if
else
end if
Algorithm 6 ParFP32toU16: Parallel Algorithm to Convert Polynomial Elements From FP32 to U16 and Perform Modulo ${q}$
matrix in with elements in U16 format and modulo
Referring to Algorithm 4 and Figure 6, the input polynomial (in) is read by
With these three proposed algorithms, one can perform highly parallel polynomial convolution for NTRU using tensor core, where the steps are given in Algorithm 7. Three floating point matrices are first initialized to zero in the CPU; this process is only performed once. Next, the two proposed algorithms are implemented in the GPU to perform pre-processing on Matrix A and Matrix B (lines 8-9). Subsequently, tensor core is used to perform the polynomial convolution, resulting in Matrix fp32_C in FP32 format (line 10). Lastly, this result is converted to Matrix C with U16 format and modulo with
Algorithm 7 Parallel Implementation of NTRU Polynomial Convolution Using Tensor Core in a GPU
polynomial
degree
and many different polynomial
ParCyc
ParU16toFP
TC-PC
ParFP32toU
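The flow of Algorithm 7 can be summarized in plain Python. In this end-to-end sketch of ours, exact integer arithmetic stands in for the U16→FP16 and FP32→U16 casts, and the matrix-vector product stands in for the batched 16×16 MMAs:

```python
def ntru_conv_pipeline(a, b, q, m=16):
    # Step 1 (ParCyc): cyclic matrix of b, zero-padded to a multiple of m.
    N = len(a)
    Np = ((N + m - 1) // m) * m
    B = [[b[(i - j) % N] if i < N and j < N else 0 for j in range(Np)]
         for i in range(Np)]
    av = a + [0] * (Np - N)
    # Steps 2-3 (ParU16toFP16 + TC-PC): on the GPU this is a batch of
    # 16x16 MMAs on FP16/FP32 fragments; here, an exact integer product.
    c = [sum(B[i][j] * av[j] for j in range(Np)) for i in range(Np)]
    # Step 4 (ParFP32toU16): drop the padding and reduce modulo q.
    return [ci % q for ci in c[:N]]
```

The output equals the direct truncated polynomial multiplication modulo $(x^N - 1, q)$, regardless of how much padding the multiple-of-16 constraint introduces.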
Another point to note is that when we use the proposed technique to implement NTRU, the polynomial convolution for decryption is slightly different from that for encryption. During the encryption process, one computes
D. Ephemeral Key Pair
The proposed tensor-core-based polynomial convolution can be more efficient than an integer-based implementation in a GPU. So far, we have only discussed situations that allow the same public/private key pair to be reused for a small number of encryption/decryption operations. For instance, one can perform
Referring to Algorithm 8, the number of warps required to perform the tensor-core-based polynomial convolution (TC-PC) is reduced from
Algorithm 8 NTRU Polynomial Convolution Using Tensor Core With Scalable Ephemeral Key Pair Configurations
polynomial
degree
and many different polynomial
(Same with Alg. 7)
ParCyc
ParU16toFP
TC-PC
ParFP32toU
E. Polynomial Addition
NTRU Encrypt involves polynomial convolution followed by addition to another polynomial (
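This final addition can be fused into the same elementwise pass that performs the modulo reduction, one thread per coefficient on the GPU. A minimal sketch (ours, with an illustrative function name):

```python
def add_mod_q(conv_out, h, q):
    # One pass, one "thread" per coefficient: fuse the polynomial
    # addition with the final modulo-q reduction.
    return [(c + x) % q for c, x in zip(conv_out, h)]
```

For example, `add_mod_q([2047, 5], [1, 2046], 2048)` yields `[0, 3]`.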
Evaluation
This section presents experimental results for the proposed tensor-core-based polynomial convolution and its application to three different lattice-based cryptographic schemes. Results are compared to the reference and AVX2-accelerated implementations found in the NIST PQC standardization submission package [15]. CPU implementations were evaluated on a machine with an Intel Core i7-9700F clocked at 4.7 GHz with 16 GB RAM. The GPUs used in this paper are the NVIDIA RTX2060 with 8 GB RAM and the RTX3080 with 10 GB; both devices are clocked at 1.71 GHz. The GPU implementations of NTRU follow the NIST submission package [15] closely, and the results are verified against the test vectors provided.
A. Performance Evaluation of Tensor-Core-Based Polynomial Convolution
The first experiment was aimed at demonstrating the superiority of tensor-core-based polynomial convolution (TC-PC) over the conventional implementation using 32-bit integer units (INT32-PC). In the INT32-PC implementation,
B. Performance Evaluation Under the Ephemeral Key Pair Scenario
By changing the dimensions of matrix multiplication from
C. Performance Evaluation of NTRU
To demonstrate the benefit of tensor core in accelerating lattice-based cryptographic schemes, we implemented NTRU public-key encryption and KEM scheme with parameter sets ntruhps2048509 and ntruhps2048677 using TC-PC and INT32-PC. In the experiment, 512 blocks are launched, where each block computes one NTRU operation. Results of our GPU implementation are presented in Table 5, where they are compared against the reference and AVX2 implementation in CPU. Note that the AVX2 implementation is heavily optimized for performance. On the other hand, the reference implementation aims at providing a clear description to the underlying operations, so it is not optimized for performance.
Considering the results on the RTX2060, for the ntruhps2048509 parameter set, the throughput of the TC-PC implementation was
Note that the GPU is a throughput-oriented accelerator that is only useful when there are many operations to be computed. Conversely, the AVX2 implementation is advantageous in improving the latency of a single NTRU operation. In other words, the reported speed-up against AVX2 in Table 5 is the full throughput achievable when there is a sufficient workload (512 encapsulations/decapsulations or encryptions/decryptions). Under such circumstances, the GPU can be an effective accelerator to assist the computation in the CPU, especially in server environments where CPU cores are usually busy handling many other tasks. With an insufficient workload, one can always fall back on the AVX2 implementation or employ the techniques for an ephemeral key pair (see Section III-D).
Fig. 7 and 8 show the throughput achieved by CPU (using AVX2) and GPU (using TC-PC) at various batch sizes, for two different parameter sizes. In this experiment, each block performs one encapsulation/decapsulation, so the batch size is essentially the number of blocks. When the batch size is small (2 to 32), AVX2 implementation is achieving higher or similar throughput compared to the GPU version. However, GPU can produce higher throughput when batch size increases beyond 128 (approximately). This shows that the proposed TensorTRU can be used in various applications to provide high throughput KEM.
Comparing the throughput of AVX2 and GPU implementation of ntruhps2048509 parameter set with various batch sizes.
Comparing the throughput of AVX2 and GPU implementation of ntruhps2048677 parameter set with various batch sizes.
D. Comparison With Other NTRU and PQC Implementations on a GPU
Existing NTRU implementations on GPU platforms are presented in Table 6. Note that the previous implementations do not follow the NIST NTRU specifications; they targeted different polynomial sizes and GPU devices, which are difficult to benchmark against our work directly. To allow a fair comparison, we scale the results from previous implementations to match our GPU, calculated as
TensorTRU achieved
Compared to the most recent work by Lee et al. [10], TensorTRU also achieved
Table 7 shows the comparison with existing PQC KEM implementations on GPU. Gupta et al. [11] showed that Kyber KEM can achieve a very high throughput when it is implemented in batch mode, which is essentially a serial implementation. The key exchange throughput achieved by TensorTRU is
E. TensorLAC: Application to LAC
LAC is a cryptosystem based on the poly-LWE variant of the Learning with Errors problem, and it was selected as a Round 2 candidate in the NIST PQC competition. The modulus of LAC is restricted to
In this paper, we have extended our idea of using tensor core to compute polynomial convolution in LAC. Since LAC uses modulus
The implementation of polynomial convolution in LAC is similar to NTRU, and is presented in Algorithm 9. Since polynomial elements in LAC are already represented in 8-bit integer (U8) form, we can use Configuration 5 in tensor core, and no type conversion is required. This reduces one step compared to Algorithm 7. Firstly, one
Algorithm 9 Parallel Implementation of LAC Polynomial Convolution Using Tensor Core in a GPU
polynomial
degree
LACParCyc
TC-PC
ParFP32toU
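For reference, the nega-cyclic convolution and the corresponding sign-converted matrix arrangement (cf. LACParCyc) can be sketched in plain Python. This is our illustration, not the paper's kernel:

```python
def negacyclic_conv(a, b, q):
    """Nega-cyclic convolution c = a*b mod (x^N + 1, q), as used in LAC.
    Products that wrap past degree N pick up a negative sign."""
    N = len(a)
    c = [0] * N
    for i in range(N):
        for k in range(N):
            j = i + k
            if j < N:
                c[j] += a[i] * b[k]
            else:
                c[j - N] -= a[i] * b[k]
    return [x % q for x in c]

def negacyclic_matrix(b):
    # Sign-converted matrix arrangement: entry (i, j) is b[i-j] for
    # j <= i and -b[N+i-j] for j > i, so a matrix-vector product with
    # the coefficient vector reproduces the nega-cyclic convolution.
    N = len(b)
    return [[b[i - j] if j <= i else -b[N + i - j] for j in range(N)]
            for i in range(N)]
```

For example, `negacyclic_conv([1, 2], [3, 4], 100)` returns `[95, 10]`, and the matrix-vector product with `negacyclic_matrix([3, 4])` gives the same result.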
Table 8 shows the implementation results for the nega-cyclic polynomial convolution of LAC on a CPU (reference and AVX2) and a GPU (integer units and tensor core), respectively. TC-NPC is shown
F. TensorFro: Application to FrodoKEM
FrodoKEM was selected as an alternate candidate in the third round of the NIST PQC competition. The official FrodoKEM parameter sets require that modulus
Polynomial degree
The error vectors in FrodoKEM span a larger distribution compared to the ternary values in NTRU and LAC. For instance, Frodo-II and TensorFro have error vectors with values in the range {−4, −3,…, 0,…, +3, +4}. When the proposed tensor-core-based technique is used, multiplication between an 11-bit (
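As with NTRU, whether these wider products and sums remain exact under FP32 accumulation can be checked with simple range arithmetic. The sketch below is our own sanity check, and the parameters in the usage note are illustrative:

```python
def fp32_accumulation_exact(n, q, emax):
    """Check that n products of a value in [0, q) with an error in
    [-emax, emax] never exceed the 2**24 exact-integer range of FP32."""
    return n * (q - 1) * emax < 2**24
```

For example, `fp32_accumulation_exact(1024, 2**11, 4)` holds, whereas pushing `n` or `q` much higher would not.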
Table 9 shows the results of matrix multiplication for Frodo variant parameter sets implemented on RTX2060. The achieved speed-up between INT32-MM and TC-MM is similar to cases in TensorTRU and TensorLAC.
G. Use Case: Secure Communication in IoT Applications
IoT applications typically employ symmetric encryption schemes (e.g., AES) to encrypt the IoT data. To reduce the risks of compromising symmetric encryption keys, we need to refresh these encryption keys frequently. Referring to Figure 9, this key refreshment task in a typical IoT application can be carried out in two ways:
Scenario 1: The IoT sensor nodes generate new symmetric encryption keys locally and forward them to the gateway device and cloud server via KEM. This process takes place in every communication session, in which a pseudorandom number generator (PRNG) can be employed to produce the new symmetric encryption key.
Scenario 2: The cloud server generates new symmetric encryption keys and forwards them to the gateway device and each sensor node via KEM. Similarly, a PRNG can be utilized to generate new keys for every session.
In scenario 1, the gateway device and cloud server need to handle many key decapsulations in real time. In scenario 2, on the other hand, the cloud server needs to generate and encapsulate many keys so that they can be sent to each sensor node in a timely manner. In an IoT application scenario, a gateway device typically needs to communicate with tens to hundreds of sensor nodes; on IoT cloud servers, the number of connections can go up to tens of thousands. With such massive numbers of connections in IoT communication, it is clear that both scenarios need high-throughput key encapsulation/decapsulation, which is difficult even for a high-end workstation. To mitigate this challenge, one can offload the KEM to a GPU, which is already found in many gateway devices (e.g., Jetson AGX Xavier [38]) and cloud server platforms. The proposed tensor-core-based technique can be very useful in handling this kind of massive key encapsulation/decapsulation workload.
Due to constrained energy storage, IoT sensor nodes usually transmit the collected sensor data intermittently, in coordinated sessions [39]. Instead of using a separate public-private key pair for each sensor node, the same key pair is used for key encapsulations/decapsulations in all sensor nodes for a particular session. This implies that performing hundreds to thousands of KEMs using the same public-private key pair for one communication session is common in IoT applications [12]. In subsequent sessions, the user can choose to use the same key pair or generate a new one (an ephemeral key pair).
Conclusion
In this paper, we present the first tensor-core-aided cryptography implementation on a GPU. The proposed tensor-core-based polynomial convolution is faster than conventional implementations that rely on the integer units in the GPU. Since the proposed tensor-core-based polynomial convolution is a generic algorithm, it can be applied to various sizes of matrices/polynomials. Although the current tensor core only supports limited floating-point precision and integer types, we believe the situation may change in the near future. In particular, the recent introduction of FP64 into the tensor core opened up its adoption in mainstream scientific computing applications, fostering the use of the GPU in a wider range of applications. As this trend persists, we believe the performance of the FP64 tensor core will increase and eventually support more parameters for lattice-based cryptography. On top of that, a recently released embedded GPU (the Jetson AGX Xavier) also offers tensor cores to accelerate deep learning inference. This embedded GPU can be used to implement gateway devices in IoT applications (e.g., roadside units in a smart city), in which our solution can be applied to enable high-throughput key encapsulations/decapsulations.
Advanced Matrix Extensions (AMX) is a new x86 instruction set architecture (ISA) extension released by Intel to support matrix multiplication, similar to the tensor core on GPUs. AMD has also released the MI100 accelerator recently, which has a matrix engine similar to the tensor core. Adapting the proposed tensor-core-based polynomial convolution to these platforms would be interesting future work that we wish to pursue. On the other hand, a recent work utilized Strassen's algorithm to speed up matrix multiplication [40], [41]. This can also be an interesting future direction, as the performance of such an approach on a GPU is still unknown.
The proposed tensor-core-based polynomial convolution can be implemented on consumer (RTX2060 and RTX3080) and server-grade (T4 and A100) GPU platforms with the Turing and Ampere architectures, with similar performance gains. However, the tensor core in the Volta architecture (e.g., V100) is not as powerful as the one in the Turing and Ampere architectures. Hence, we expect that the performance of our solution on a Volta GPU may not be as impressive compared to these two architectures.