I. INTRODUCTION
The Fast Fourier Transform (FFT) is a fundamental algorithm in digital signal processing and the standard tool for computing discrete Fourier transforms (DFTs) [1]. DFTs appear in almost every field that involves signal or data analysis, including image processing, audio analysis, broadcast communications, and scientific computing [2], and the FFT is a cornerstone of the computational techniques used in these applications. Algorithm design has long been a critical and difficult problem in digital signal processing because of computational complexity [3]. In recent years, however, the search for faster ways to compute FFTs has been quite successful, even for the most demanding problems, thanks to new hardware that enables massive parallelism. One of the most promising approaches to FFT acceleration uses Graphics Processing Units (GPUs). GPUs are highly parallel devices already known to be effective for many computationally intensive problems, such as certain matrix operations, image processing, and machine learning [4]. Harnessing the inherent parallelism of a GPU can yield significant speedups over CPU-only solutions and over traditional CPU-based solutions with only partial GPU acceleration. The benefit is greatest when the FFT computation is compute-bound, so that additional processors yield additional speedup, rather than memory-bound, where memory access is the limiting factor [5]. Minimizing memory access is therefore always a relevant design parameter. These considerations have motivated extensive research into GPU acceleration of FFTs.
Prior work ranges from [6], which studied a hybrid approach that searches for the optimal combination of CPUs and GPUs, to [7], which investigated how different memory access patterns affect GPU-accelerated FFTs, and [8], which studied FFTs optimized specifically for GPU architectures. Several of these works examine how to flexibly harness the available parallelism to obtain the best results. The GPU hardware and software landscape, however, changes constantly. As a result, FFT acceleration continually faces new opportunities and challenges [9].