I. Introduction
The in-place transposition of square matrices is a well-explored problem in scientific computing [1]–[3]. In-place transposition routines are widely used to optimize the memory access patterns of various FFT methods. In some cases, such as the "six-step" FFT variant, the transposition steps have been found to be the most significant performance bottleneck [4]. The in-place transposition of square matrices is also an important building block for methods that transpose rectangular matrices in place by partitioning the input matrix into square sub-matrices, each of which must itself be transposed in place. One example is the method based on Euclid's GCD algorithm proposed by Gustavson et al. [5].