Enrique S. Quintana-Ortí - IEEE Xplore Author Profile

Showing 1-25 of 92 results


Vision Transformers have demonstrated outstanding performance in Computer Vision tasks. Nevertheless, this superior performance for large models comes at the expense of increasing memory usage for storing the parameters and intermediate activations. To accelerate model inference, in this work we develop and evaluate integer and mixed-precision kernels in Triton for the efficient execution of two f...
Incorporating deep learning (DL) technologies into the edge is crucial for improving the security, privacy, and energy efficiency of the Internet of Things (IoT). In this scenario, the limitations of edge devices in terms of power dissipation, memory capacity, and processing power require a careful selection and optimization of algorithms for IoT DL applications. Along this line, our work focuses on th...
Photovoltaic systems are being used in almost every field, such as smart cities, Internet of Things paradigms, or remote Wireless Sensor Networks. In Internet of Things paradigms deployed in natural environments, energy harvesting technology is crucial to power the devices. For the energy management system, it is important to predict how much energy can be harvested from the environment. In this wor...
Global Navigation Satellite System (GNSS) is widely used today for both positioning and timing purposes. Many distinct receiver chips are available off-the-shelf, each tailored to match various applications’ requirements. Being implemented as Application-Specific Integrated Circuits, these chips provide good performance and low energy consumption but must be treated as "black boxes" by customers. ...
The remarkable positive impact of Deep Neural Networks on many Artificial Intelligence (AI) tasks has led to the development of various high performance algorithms as well as specialized processors and accelerators. In this paper we address this scenario by demonstrating that the principles underlying the modern realization of the general matrix multiplication (GEMM) in conventional processor arch...
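The loop structure below is a minimal, scalar sketch of the cache-blocking principle behind modern GEMM realizations (in the spirit of the GotoBLAS/BLIS family this line of work builds on); the block sizes and the inner micro-kernel are illustrative placeholders, not the tuned kernels of an actual library.

```c
/* Sketch of the cache-blocked GEMM loop nest: C += A * B, row-major.
 * The three outer loops partition the operands into blocks sized to
 * fit the cache hierarchy; the inner three loops stand in for the
 * packed, SIMD micro-kernel of a real library. MC/NC/KC are
 * illustrative, untuned block sizes. The caller initializes C. */
#include <stddef.h>

enum { MC = 64, NC = 64, KC = 64 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

void gemm_blocked(size_t m, size_t n, size_t k,
                  const float *A, const float *B, float *C)
{
    for (size_t jc = 0; jc < n; jc += NC)           /* columns of B, C */
        for (size_t pc = 0; pc < k; pc += KC)       /* shared k dim    */
            for (size_t ic = 0; ic < m; ic += MC) { /* rows of A, C    */
                size_t nb = min_sz(NC, n - jc);
                size_t kb = min_sz(KC, k - pc);
                size_t mb = min_sz(MC, m - ic);
                /* micro-kernel stand-in: (mb x kb) times (kb x nb) */
                for (size_t i = 0; i < mb; i++)
                    for (size_t p = 0; p < kb; p++) {
                        float a = A[(ic + i) * k + (pc + p)];
                        for (size_t j = 0; j < nb; j++)
                            C[(ic + i) * n + (jc + j)] +=
                                a * B[(pc + p) * n + (jc + j)];
                    }
            }
}
```

Production implementations additionally pack the A and B blocks into contiguous buffers and replace the inner loops with an architecture-specific vectorized micro-kernel.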
The convolution operator is a crucial kernel for many computer vision and signal processing applications that rely on deep learning (DL) technologies. As such, the efficient implementation of this operator has received considerable attention in the past few years for a fair range of processor architectures. In this paper, we follow the technology trend toward integrating long SIMD (single instruct...
We address the efficient design and implementation of dense matrix factorizations and inversion (DMFI) on modern multicore processors with several NUMA (non-uniform memory access) nodes. Our approach enhances the DMFI routines with a look-ahead strategy, in order to overcome the “panel factorization bottleneck”. In addition, it exploits both hybrid task- and loop-level parallelizations while takin...
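The paper's NUMA-aware scheme is not reproduced here, but the following self-contained sketch shows how task-level parallelism yields look-ahead: expressing a blocked Cholesky factorization as OpenMP tasks with dependences lets the runtime begin factorizing the next diagonal block while trailing updates of the current step are still in flight. The block layout, block sizes, and unblocked kernels are simplified assumptions for illustration only.

```c
/* Task-parallel blocked Cholesky (A = L L^T, lower triangle, no
 * pivoting; A assumed symmetric positive definite and stored block
 * by block, row-major inside each B x B block). The OpenMP task
 * dependences let the runtime start potrf of step k+1 while trailing
 * updates of step k are still running: look-ahead without explicit
 * scheduling code. Kernels are unblocked and unoptimized on purpose. */
#include <math.h>
#include <stddef.h>

#define NB 8   /* blocks per dimension (illustrative) */
#define B  64  /* block size (illustrative)           */
#define BLK(A, i, j) ((A) + ((size_t)(i) * NB + (j)) * B * B)

static void potrf(double *a) {                 /* a := chol(a), lower */
    for (int j = 0; j < B; j++) {
        for (int p = 0; p < j; p++) a[j*B+j] -= a[j*B+p] * a[j*B+p];
        a[j*B+j] = sqrt(a[j*B+j]);
        for (int i = j + 1; i < B; i++) {
            for (int p = 0; p < j; p++) a[i*B+j] -= a[i*B+p] * a[j*B+p];
            a[i*B+j] /= a[j*B+j];
        }
    }
}
static void trsm(const double *l, double *b) { /* b := b * l^{-T}     */
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++) {
            for (int p = 0; p < j; p++) b[i*B+j] -= b[i*B+p] * l[j*B+p];
            b[i*B+j] /= l[j*B+j];
        }
}
static void gemm_nt(const double *a, const double *b, double *c) {
    for (int i = 0; i < B; i++)                /* c := c - a * b^T    */
        for (int j = 0; j < B; j++)
            for (int p = 0; p < B; p++) c[i*B+j] -= a[i*B+p] * b[j*B+p];
}

void chol_tasks(double *A) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NB; k++) {
        #pragma omp task depend(inout: BLK(A,k,k)[0])
        potrf(BLK(A, k, k));
        for (int i = k + 1; i < NB; i++) {
            #pragma omp task depend(in: BLK(A,k,k)[0]) \
                             depend(inout: BLK(A,i,k)[0])
            trsm(BLK(A, k, k), BLK(A, i, k));
        }
        for (int i = k + 1; i < NB; i++) {
            #pragma omp task depend(in: BLK(A,i,k)[0]) \
                             depend(inout: BLK(A,i,i)[0])
            gemm_nt(BLK(A, i, k), BLK(A, i, k), BLK(A, i, i)); /* syrk */
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: BLK(A,i,k)[0], BLK(A,j,k)[0]) \
                                 depend(inout: BLK(A,i,j)[0])
                gemm_nt(BLK(A, i, k), BLK(A, j, k), BLK(A, i, j));
            }
        }
    }
}
```

The diagonal-block update touches the (unused) upper triangle as well, which keeps the kernel set down to three routines at the cost of some redundant flops.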
We take a step toward developing high-performance codes for the convolution, based on the Winograd transformation, that are easy to customize for different processor architectures. In our approach, portability is improved by introducing vector intrinsics that exploit the SIMD (single-instruction, multiple-data) capabilities of current proce...
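The vectorized, customizable kernels from the paper are not shown here; as a small scalar illustration of why the Winograd transformation pays off, the classic F(2,3) variant computes two outputs of a 3-tap filter with 4 data-dependent multiplications instead of the 6 required by direct evaluation (the filter-side factors are normally precomputed once per filter).

```c
/* Winograd F(2,3): two outputs of a 3-tap filter g applied to four
 * inputs d, using 4 data multiplications instead of 6. 2-D
 * convolutions nest this idea along both axes, and production
 * kernels vectorize the transforms with SIMD intrinsics. */
void winograd_f23(const float d[4], const float g[3], float y[2])
{
    /* (g0+g1+g2)/2 and (g0-g1+g2)/2 would be precomputed per filter */
    float m1 = (d[0] - d[2]) * g[0];
    float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
    float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
    float m4 = (d[1] - d[3]) * g[2];
    y[0] = m1 + m2 + m3;   /* equals d0*g0 + d1*g1 + d2*g2 */
    y[1] = m2 - m3 - m4;   /* equals d1*g0 + d2*g1 + d3*g2 */
}
```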
The efforts of the scientific community and hardware vendors to develop and optimize linear algebra codes have historically led to highly-tuned libraries, carefully adapted to the underlying processor architecture, with excellent (near-peak) performance. These optimization efforts, however, are commonly focused on obtaining the best performance possible when the involved operands are large and “sq...
We propose a hybrid parallelization scheme for matrix inversion on multicore processors that combines a look-ahead technique to extract task-parallelism, at a high level, with loop-level parallelism to ensure an efficient utilization of the processor memory subsystem. As a result, our scheme outperforms the conventional approach for dense linear algebra operations, which simply extracts parallelis...
We present a multi-threaded implementation of the matrix multiplication for deep learning on ARM multicore processors. Following standard practice for inference with convolutional neural networks, our GEMM kernel operates with 16-bit integer arithmetic, yielding significant performance acceleration and cutting the memory requirements with respect to IEEE (floating point) single precision by half, ...
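The ARM-specific multi-threaded kernel itself is not reproduced here; the scalar sketch below only captures the arithmetic idea described above, namely 16-bit integer operands with 32-bit accumulation so that intermediate products cannot overflow. Blocking, packing, threading, and NEON vectorization are omitted.

```c
/* Sketch of integer GEMM for quantized inference: 16-bit operands,
 * 32-bit accumulators. Row-major, C = A * B with A (m x k) and
 * B (k x n). Real kernels block for cache and use SIMD instead of
 * this plain triple loop. */
#include <stdint.h>
#include <stddef.h>

void gemm_i16(size_t m, size_t n, size_t k,
              const int16_t *A, const int16_t *B, int32_t *C)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;   /* widen to avoid overflow */
            for (size_t p = 0; p < k; p++)
                acc += (int32_t)A[i * k + p] * (int32_t)B[p * n + j];
            C[i * n + j] = acc;
        }
}
```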
We perform a theoretical analysis comparing the scalability of data versus model parallelism, applied to the distributed training of deep convolutional neural networks (CNNs), along five axes: batch size, node (floating-point) arithmetic performance, node memory bandwidth, network link bandwidth, and cluster dimension. Our study relies on analytical performance models that can be configured to rep...
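The paper's configurable models are not reproduced here; as a generic first-order illustration of the kind of analytical model involved, the time per training step under data parallelism with a ring all-reduce is often approximated as below, where all symbols (P nodes, global batch size b, per-sample cost W, node arithmetic rate F, model size M, link bandwidth β) are illustrative assumptions.

```latex
% Illustrative first-order model (not the paper's exact formulation):
% the second term is the standard ring all-reduce cost for the
% gradient exchange of a model of M bytes over links of beta bytes/s.
T_{\mathrm{step}}(P) \;\approx\;
  \underbrace{\frac{b\,W}{P\,F}}_{\text{computation}}
  \;+\;
  \underbrace{\frac{2\,(P-1)}{P}\,\frac{M}{\beta}}_{\text{gradient all-reduce}}
```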
Training deep neural networks is a costly procedure, often performed via sophisticated deep learning frameworks on clusters of computers. As faster processor technologies are integrated into these cluster facilities (e.g., NVIDIA’s graphics accelerators or Google’s tensor processing units), the communication component of the training process rapidly becomes a performance bottleneck. In this paper,...
Process malleability has proved to have a highly positive impact on the resource utilization and global productivity in data centers compared with the conventional static resource allocation policy. However, the non-negligible additional development effort this solution imposes has constrained its adoption by the scientific programming community. In this work, we present DMRlib, a library designed...
The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in this type of networks. One of these approaches leverages the IM2COL transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the...
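As a minimal sketch of the IM2COL idea for a single input channel, unit stride, and no padding (real implementations handle all of these), each kh × kw window of the image becomes one column of a matrix, after which the convolution reduces to a single GEMM against the flattened filters.

```c
/* IM2COL for one H x W channel, unit stride, no padding: window
 * (i,j) offsets index the rows of the output matrix, output pixels
 * index its columns, so col is (kh*kw) x (Ho*Wo), row-major. The
 * convolution is then a GEMM: filters (as rows) times col. */
#include <stddef.h>

void im2col(const float *img, size_t H, size_t W,
            size_t kh, size_t kw, float *col)
{
    size_t Ho = H - kh + 1, Wo = W - kw + 1;
    for (size_t i = 0; i < kh; i++)
        for (size_t j = 0; j < kw; j++)          /* row of col matrix */
            for (size_t y = 0; y < Ho; y++)
                for (size_t x = 0; x < Wo; x++)  /* output pixel      */
                    col[(i * kw + j) * (Ho * Wo) + (y * Wo + x)] =
                        img[(y + i) * W + (x + j)];
}
```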
In this paper, we describe and evaluate an extension of the CHAMELEON library to operate with hierarchical matrices (H-Matrices) and hierarchical arithmetic (H-Arithmetic), producing efficient solvers for linear systems arising in Boundary Element Methods (BEM). Our approach builds upon an open-source H-Matrices library from Airbus, named HMAT-OSS, that collects sequential numerical kernels for b...
With the appearance of multi-/many-core machines, applications and runtime systems have evolved in order to exploit the new on-node concurrency brought by new software paradigms. POSIX threads (Pthreads) was widely adopted for that purpose and remains the most used threading solution on current hardware. Lightweight thread (LWT) libraries emerged as an alternative offering lighter mechanisms...
The solution of sparse triangular linear systems is often the most time-consuming stage of preconditioned iterative methods to solve general sparse linear systems, where it has to be applied several times for the same sparse matrix. For this reason, its computational performance has a strong impact on a wide range of scientific and engineering applications, which has motivated the study of its eff...
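A minimal sequential reference for the operation in question, forward substitution with a sparse lower-triangular matrix in CSR format, makes the cross-row dependence that complicates parallelization explicit. The convention that the diagonal is the last entry of each row is an illustrative storage assumption.

```c
/* Forward substitution for L * x = b with L sparse lower-triangular
 * in CSR; each row is assumed sorted with the diagonal entry last.
 * Row i reads x[j] for j < i, so the loop carries a dependence that
 * parallel solvers must analyze (e.g., via level sets). */
#include <stddef.h>

void sptrsv_csr_lower(size_t n, const size_t *rowptr,
                      const size_t *colidx, const double *val,
                      const double *b, double *x)
{
    for (size_t i = 0; i < n; i++) {
        double s = b[i];
        size_t end = rowptr[i + 1] - 1;   /* last entry = diagonal */
        for (size_t p = rowptr[i]; p < end; p++)
            s -= val[p] * x[colidx[p]];
        x[i] = s / val[end];
    }
}
```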
We analyze the asymptotic performance of the training process of deep neural networks (NN) on clusters in order to determine its scalability. For this purpose, i) we assume a data parallel implementation of the training algorithm, which distributes the batches among the cluster nodes and replicates the model; ii) we leverage the roofline model to inspect the performance at the node level, taking i...
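For reference, the roofline model bounds the attainable node-level performance by the peak arithmetic rate and by the memory bandwidth scaled by the arithmetic intensity of the computation.

```latex
% Standard roofline bound: attainable performance at arithmetic
% intensity I (flops per byte of memory traffic) on a node with
% peak rate F_peak and memory bandwidth beta.
P(I) \;=\; \min\bigl( F_{\mathrm{peak}},\; \beta \cdot I \bigr)
```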
We propose two novel techniques for overcoming load-imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first tec...
The solution of sparse linear systems of large dimension is an important stage in problems that span a diverse range of applications. For this reason, a number of iterative solvers have been developed, among which ILUPACK integrates an inverse-based multilevel ILU preconditioner with appealing numerical properties. In this work we extend the iterative methods available in ILUPACK. Concretely, we dev...
We address the acceleration of the PageRank algorithm for web information retrieval on graphics processing units (GPUs) via a modular precision framework that adapts the data format in memory to the numerical requirements as the iteration converges. In detail, we abandon the IEEE 754 single- and double-precision number representation formats, employed in the standard implementation of PageRank, ...
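The modular-precision machinery is not reproduced here; the sketch below is one plain power-iteration sweep of PageRank over a CSR matrix, kept entirely in double for clarity, with a comment marking where an adaptive scheme could narrow the in-memory format of the vector. The storage layout and the damping factor are illustrative assumptions.

```c
/* One PageRank power-iteration sweep: x_next = (1-d)/n + d * P * x,
 * with the transition matrix P stored row-wise in CSR. An adaptive-
 * precision scheme would keep x in a narrower in-memory format early
 * on and widen it as the iteration converges; here it stays double. */
#include <stddef.h>

void pagerank_sweep(size_t n, const size_t *rowptr,
                    const size_t *colidx, const double *val,
                    double d,                       /* e.g., 0.85   */
                    const double *x, double *x_next)
{
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t p = rowptr[i]; p < rowptr[i + 1]; p++)
            s += val[p] * x[colidx[p]];  /* sparse row times vector */
        x_next[i] = (1.0 - d) / (double)n + d * s;
    }
}
```

The sweep is repeated, swapping x and x_next, until the difference between successive iterates falls below a tolerance.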
We revisit an alternative representation to the compact WY transform for the accumulation (blocking) of Householder reflectors that exhibits the same numerical stability and is composed of efficient computational kernels from Level-3 Basic Linear Algebra Subprograms (BLAS) in contrast with the Level-2 BLAS that are utilized for the construction of the conventional compact WY representation. For th...
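For context, the conventional compact WY representation mentioned above accumulates k Householder reflectors H_j = I − τ_j v_j v_jᵀ as follows; the paper's alternative representation is not reproduced here.

```latex
% Conventional compact WY accumulation of k Householder reflectors
% H_j = I - \tau_j v_j v_j^T (cf. LAPACK's larft):
Q \;=\; H_1 H_2 \cdots H_k \;=\; I - V\,T\,V^{T},
\qquad V = [\,v_1 \mid \cdots \mid v_k\,],
% with T upper triangular, built one column at a time -- the
% Level-2 BLAS step that motivates the alternative representation:
T_{jj} = \tau_j, \qquad
T_{1:j-1,\,j} \;=\; -\tau_j\,
  T_{1:j-1,\,1:j-1}\,\bigl(V_{:,\,1:j-1}\bigr)^{T} v_j .
```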
Numerous signal processing applications are emerging on both mobile and high-performance computing systems. These applications are subject to responsiveness constraints for user interactivity and, at the same time, must be optimized for energy efficiency. The increasingly heterogeneous power-versus-performance profile of modern hardware introduces new opportunities for energy savings as well as ch...
A large number of scientific and engineering problems currently require the solution of large and sparse linear systems of equations. In previous work, we applied a GPU accelerator to the solution of sparse linear systems of moderate dimension via ILUPACK, showing important reductions in the execution time while maintaining the quality of the solution. Unfortunately, the use of GPUs attached ...