I. Introduction
The rise of the power wall, caused by the end of Dennard scaling [1], [2] in the mid-2000s, has promoted the investigation of more energy-efficient technologies and, together with users' never-ending appetite for higher throughput, the integration of a larger number of cores in processor architectures. As a result, high performance computing (HPC) servers (for example, from AMD, Fujitsu, Huawei, IBM and Intel) currently integrate dozens of cores in a single socket (or chip), and comprise one or more of these sockets. While the memory wall [3], [4] has long posed a major performance bottleneck, the growing number of cores has added memory contention (i.e., conflicts due to simultaneous accesses to memory from two or more cores) to the problem. Hardware architects have responded to this new challenge with the design of NUMA (Non-Uniform Memory Access) systems [5], [6]. Unfortunately, the NUMA design principles place a supplemental burden on programmers, who now have to reduce remote memory accesses as well as control thread-to-core pinning [7]–[9].

For the particular domain of (dense) linear algebra, developing software for NUMA architectures presents many similarities with message-passing programming [10], which, to a certain extent, can be addressed via a domain-specific approach that raises the abstraction level, as done in [11]. Concretely, in that work we addressed the difficulties of attaining high performance in the execution of multi-threaded dense matrix factorization and inversion (DMFI) on NUMA architectures. For that purpose, we proposed a methodology that improves performance portability, combines a hybrid task/loop-level parallelization, and exploits locality at the expense of only minor code modifications.