
Achieving Scalability in a k-NN Multi-GPU Network Service with Centaur



Abstract:

Centaur is a GPU-centric architecture for building a low-latency approximate k-Nearest-Neighbors network server. We implement a multi-GPU distributed data flow runtime which enables efficient and scalable network request processing on GPUs. The runtime eliminates GPU management overheads from the CPU, making the server's throughput and response time largely agnostic to the CPU load, CPU speed, or the number of dedicated CPU cores. Our experiments show that our server achieves near-perfect scaling for 16 GPUs, exceeding the throughput of a highly optimized CPU-driven server by 35% while maintaining an average request latency of about 2 msec. Furthermore, it requires only a single CPU core to run, achieving over an order of magnitude higher throughput than the standard CPU-driven server architecture in this setting.
Date of Conference: 23-26 September 2019
Date Added to IEEE Xplore: 07 November 2019
Conference Location: Seattle, WA, USA

I. Introduction

High-concurrency, memory-demanding server applications are ubiquitous in high-performance computing systems and data centers [12]. They place three distinct requirements on developers: low, strictly bounded response time for client requests; high throughput for better server efficiency; and large physical memory so that the data set can remain resident, which is needed to meet these performance goals. Fulfilling all of these requirements together is a significant challenge.
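
To make the per-request work of such a k-NN service concrete, the sketch below is a minimal, self-contained CUDA program, not the Centaur runtime itself: a brute-force distance kernel scores one query against a GPU-resident data set, and the host then picks the k closest points. The data-set size, dimensionality, exact (rather than approximate) search, and host-side selection are illustrative assumptions only; Centaur serves approximate queries and keeps request handling on the GPU.

// Minimal brute-force k-NN sketch in CUDA. Illustrative only: toy sizes,
// exact L2 search, no error checking, and host-side top-k selection.
#include <cstdio>
#include <cstdlib>
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// One thread per database vector: squared L2 distance to the query.
__global__ void l2_distances(const float* db, const float* query,
                             float* dist, int n, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int d = 0; d < dim; ++d) {
        float diff = db[i * dim + d] - query[d];
        acc += diff * diff;
    }
    dist[i] = acc;
}

int main() {
    const int n = 100000, dim = 128, k = 10;          // toy problem sizes
    std::vector<float> h_db((size_t)n * dim), h_query(dim), h_dist(n);
    for (auto& x : h_db)    x = rand() / (float)RAND_MAX;
    for (auto& x : h_query) x = rand() / (float)RAND_MAX;

    // Copy the data set and the query to GPU memory.
    float *d_db, *d_query, *d_dist;
    cudaMalloc((void**)&d_db,    h_db.size() * sizeof(float));
    cudaMalloc((void**)&d_query, dim * sizeof(float));
    cudaMalloc((void**)&d_dist,  n * sizeof(float));
    cudaMemcpy(d_db, h_db.data(), h_db.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_query, h_query.data(), dim * sizeof(float), cudaMemcpyHostToDevice);

    // Score every database vector against the query in parallel.
    l2_distances<<<(n + 255) / 256, 256>>>(d_db, d_query, d_dist, n, dim);
    cudaMemcpy(h_dist.data(), d_dist, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Host-side partial sort selects the k nearest neighbors.
    std::vector<int> idx(n);
    for (int i = 0; i < n; ++i) idx[i] = i;
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return h_dist[a] < h_dist[b]; });
    for (int i = 0; i < k; ++i)
        printf("neighbor %d: id=%d dist=%.4f\n", i, idx[i], h_dist[idx[i]]);

    cudaFree(d_db); cudaFree(d_query); cudaFree(d_dist);
    return 0;
}

In a server setting, the per-request copy of the query, the kernel launch, and the result selection are exactly the CPU-driven steps that a GPU-centric design such as the one described in the abstract aims to remove from the critical path.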

References
[1] "Amazon EC2", 2006, [online] Available: https://aws.amazon.com/ec2/.
[2] "Google Cloud Platform", 2011, [online] Available: https://cloud.google.com/.
[3] "stress-ng", 2014, [online] Available: https://openbenchmarking.org/test/pts/stress-ng.
[4] "Mellanox BlueField", 2015, [online] Available: http://www.iptronics.com/page/products_dyn?product_family=256=soc_overview.
[5] "OpenCL 2.0", 2015, [online] Available: https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf.
[6] "CUDA Programming Guide", 2017, [online] Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/.
[7] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems", 2016.
[8] R. Agrawal, "K-nearest neighbor for uncertain data", International Journal of Computer Applications, vol. 105, no. 11, 2014.
[9] S. R. Agrawal et al., "Rhythm: Harnessing data parallel hardware for server workloads", ACM SIGARCH Computer Architecture News, vol. 42, no. 1, pp. 19-34, 2014.
[10] "Amazon EC2 P2 Instances", [online] Available: https://aws.amazon.com/ec2/instance-types/p2/.
[11] M. E. Belviranli, L. N. Bhuyan and R. Gupta, "A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures", ACM Transactions on Architecture and Code Optimization (TACO), vol. 9, no. 4, p. 57, 2013.
[12] R. Buyya, High Performance Cluster Computing, New Jersey: Prentice Hall, 1999.
[13] S. Chatterjee et al., "Dynamic task parallelism with a GPU work-stealing runtime system", International Workshop on Languages and Compilers for Parallel Computing, Springer, pp. 203-217, 2011.
[14] F. Daoud, A. Watad and M. Silberstein, "GPUrdma: GPU-side library for high performance networking from GPU kernels", Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers, p. 6, 2016.
[15] J. Dean and L. A. Barroso, "The tail at scale", Communications of the ACM, vol. 56, no. 2, pp. 74-80, 2013.
[16] D. Foley, "NVLink, Pascal and stacked memory: Feeding the appetite for big data", Nvidia.com, 2014.
[17] J. H. Friedman, J. L. Bentley and R. A. Finkel, "An algorithm for finding best matches in logarithmic time", ACM Trans. Math. Software, vol. 3, pp. 209-226, 1976.
[18] K. Fukunaga and P. M. Narendra, "A branch and bound algorithm for computing k-nearest neighbors", IEEE Transactions on Computers, no. 7, pp. 750-753, 1975.
[19] V. Garcia, E. Debreuve and M. Barlaud, "Fast k nearest neighbor search using GPU", 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1-6, 2008.
[20] V. Garcia et al., "K-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching", 2010 IEEE International Conference on Image Processing, pp. 3757-3760, 2010.
[21] A. Gionis et al., "Similarity search in high dimensions via hashing", VLDB, vol. 99, no. 6, pp. 518-529, 1999.
[22] "GPUs on Compute Engine", [online] Available: https://cloud.google.com/compute/docs/gpus/.
[23] K. Gupta, J. A. Stuart and J. D. Owens, "A study of persistent threads style GPU programming for GPGPU workloads", 2012 Innovative Parallel Computing (InPar), IEEE, pp. 1-14, 2012.
[24] T. Gysi, J. Bär and T. Hoefler, "dCUDA: Hardware supported overlap of computation and communication", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 52, 2016.
[25] T. H. Hetherington, M. O'Connor and T. M. Aamodt, "MemcachedGPU: Scaling-up scale-out key-value stores", Proceedings of the Sixth ACM Symposium on Cloud Computing, pp. 43-57, 2015.
[26] K. Jang et al., "SSLShader: Cheap SSL acceleration with commodity processors", Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI'11), pp. 1-14, 2011, [online] Available: http://dl.acm.org/citation.cfm?id=1972457.1972459.
[27] H. Jegou, M. Douze and C. Schmid, "Product quantization for nearest neighbor search", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, 2011.
[28] H. Jégou et al., "Searching in one billion vectors: Re-rank with source coding", 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 861-864, 2011.
[29] F. Khan, "The cost of latency", 2015, [online] Available: https://www.digitalrealty.com/blog/the-cost-of-latency/.
[30] S. Kim et al., "GPUnet: Networking abstractions for GPU programs", OSDI, vol. 14, pp. 6-8, 2014.
