
Achieving Scalability in a k-NN Multi-GPU Network Service with Centaur



Abstract:

Centaur is a GPU-centric architecture for building a low-latency approximate k-Nearest-Neighbors network server. We implement a multi-GPU distributed data flow runtime which enables efficient and scalable network request processing on GPUs. The runtime eliminates GPU management overheads from the CPU, making the server's throughput and response time largely agnostic to the CPU load, CPU speed, or the number of dedicated CPU cores. Our experiments show that our server achieves near-perfect scaling for 16 GPUs, exceeding the throughput of a highly optimized CPU-driven server by 35% while maintaining an average request latency of about 2 msec. Furthermore, it requires only a single CPU core to run, achieving over an order of magnitude higher throughput than the standard CPU-driven server architecture in this setting.
Date of Conference: 23-26 September 2019
Date Added to IEEE Xplore: 07 November 2019
Conference Location: Seattle, WA, USA

I. Introduction

High-concurrency, memory-demanding server applications are ubiquitous in high-performance computing systems and data centers [12]. They place three distinct requirements on developers: low, strictly bounded response time for client requests; high throughput for better server efficiency; and large physical memory so that the data set can remain resident, which is needed to meet these performance goals. Fulfilling all of these requirements together is a significant challenge.
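
To make the per-request work of such a k-NN service concrete, the sketch below is a minimal, self-contained CUDA program, not the Centaur runtime itself: a brute-force distance kernel scores one query against a GPU-resident data set, and the host then picks the k closest points. The data-set size, dimensionality, exact (rather than approximate) search, and host-side selection are illustrative assumptions only; Centaur serves approximate queries and keeps request handling on the GPU.

// Minimal brute-force k-NN sketch in CUDA. Illustrative only: toy sizes,
// exact L2 search, no error checking, and host-side top-k selection.
#include <cstdio>
#include <cstdlib>
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// One thread per database vector: squared L2 distance to the query.
__global__ void l2_distances(const float* db, const float* query,
                             float* dist, int n, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int d = 0; d < dim; ++d) {
        float diff = db[i * dim + d] - query[d];
        acc += diff * diff;
    }
    dist[i] = acc;
}

int main() {
    const int n = 100000, dim = 128, k = 10;          // toy problem sizes
    std::vector<float> h_db((size_t)n * dim), h_query(dim), h_dist(n);
    for (auto& x : h_db)    x = rand() / (float)RAND_MAX;
    for (auto& x : h_query) x = rand() / (float)RAND_MAX;

    // Copy the data set and the query to GPU memory.
    float *d_db, *d_query, *d_dist;
    cudaMalloc((void**)&d_db,    h_db.size() * sizeof(float));
    cudaMalloc((void**)&d_query, dim * sizeof(float));
    cudaMalloc((void**)&d_dist,  n * sizeof(float));
    cudaMemcpy(d_db, h_db.data(), h_db.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_query, h_query.data(), dim * sizeof(float), cudaMemcpyHostToDevice);

    // Score every database vector against the query in parallel.
    l2_distances<<<(n + 255) / 256, 256>>>(d_db, d_query, d_dist, n, dim);
    cudaMemcpy(h_dist.data(), d_dist, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Host-side partial sort selects the k nearest neighbors.
    std::vector<int> idx(n);
    for (int i = 0; i < n; ++i) idx[i] = i;
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return h_dist[a] < h_dist[b]; });
    for (int i = 0; i < k; ++i)
        printf("neighbor %d: id=%d dist=%.4f\n", i, idx[i], h_dist[idx[i]]);

    cudaFree(d_db); cudaFree(d_query); cudaFree(d_dist);
    return 0;
}

In a server setting, the per-request copy of the query, the kernel launch, and the result selection are exactly the CPU-driven steps that a GPU-centric design such as the one described in the abstract aims to remove from the critical path.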

References
[1] "Amazon EC2", 2006, [online] Available: https://aws.amazon.com/ec2/.
[2] "Google Cloud Platform", 2011, [online] Available: https://cloud.google.com/.
[3] "stress-ng", 2014, [online] Available: https://openbenchmarking.org/test/pts/stress-ng.
[4] "Mellanox BlueField", 2015, [online] Available: http://www.iptronics.com/page/products_dyn?product_family=256=soc_overview.
[5] "OpenCL 2.0", 2015, [online] Available: https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf.
[6] "CUDA Programming Guide", 2017, [online] Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/.
[7] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems", 2016.
[8] R. Agrawal, "K-nearest neighbor for uncertain data", International Journal of Computer Applications, vol. 105, no. 11, 2014.
[9] S. R. Agrawal et al., "Rhythm: Harnessing data parallel hardware for server workloads", ACM SIGARCH Computer Architecture News, vol. 42, no. 1, pp. 19-34, 2014.
[10] "Amazon EC2 P2 Instances", [online] Available: https://aws.amazon.com/ec2/instance-types/p2/.
[11] M. E. Belviranli, L. N. Bhuyan and R. Gupta, "A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures", ACM Transactions on Architecture and Code Optimization (TACO), vol. 9, no. 4, p. 57, 2013.
[12] R. Buyya, High Performance Cluster Computing, New Jersey: Prentice Hall, 1999.
[13] S. Chatterjee et al., "Dynamic task parallelism with a GPU work-stealing runtime system", International Workshop on Languages and Compilers for Parallel Computing, Springer, pp. 203-217, 2011.
[14] F. Daoud, A. Watad and M. Silberstein, "GPUrdma: GPU-side library for high performance networking from GPU kernels", Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers, p. 6, 2016.
[15] J. Dean and L. A. Barroso, "The tail at scale", Communications of the ACM, vol. 56, no. 2, pp. 74-80, 2013.
[16] D. Foley, "NVLink, Pascal and stacked memory: Feeding the appetite for big data", Nvidia.com, 2014.
[17] J. H. Friedman, J. L. Bentley and R. A. Finkel, "An algorithm for finding best matches in logarithmic time", ACM Trans. Math. Software, vol. 3, pp. 209-226, 1976.
[18] K. Fukunaga and P. M. Narendra, "A branch and bound algorithm for computing k-nearest neighbors", IEEE Transactions on Computers, no. 7, pp. 750-753, 1975.
[19] V. Garcia, E. Debreuve and M. Barlaud, "Fast k nearest neighbor search using GPU", 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1-6, 2008.
[20] V. Garcia et al., "K-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching", 2010 IEEE International Conference on Image Processing, pp. 3757-3760, 2010.
[21] A. Gionis et al., "Similarity search in high dimensions via hashing", VLDB, vol. 99, no. 6, pp. 518-529, 1999.
[22] "GPUs on Compute Engine", [online] Available: https://cloud.google.com/compute/docs/gpus/.
[23] K. Gupta, J. A. Stuart and J. D. Owens, "A study of persistent threads style GPU programming for GPGPU workloads", 2012 Innovative Parallel Computing (InPar), IEEE, pp. 1-14, 2012.
[24] T. Gysi, J. Bär and T. Hoefler, "dCUDA: Hardware supported overlap of computation and communication", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 52, 2016.
[25] T. H. Hetherington, M. O'Connor and T. M. Aamodt, "MemcachedGPU: Scaling-up scale-out key-value stores", Proceedings of the Sixth ACM Symposium on Cloud Computing, pp. 43-57, 2015.
[26] K. Jang et al., "SSLShader: Cheap SSL acceleration with commodity processors", Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI'11), pp. 1-14, 2011, [online] Available: http://dl.acm.org/citation.cfm?id=1972457.1972459.
[27] H. Jegou, M. Douze and C. Schmid, "Product quantization for nearest neighbor search", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, 2011.
[28] H. Jégou et al., "Searching in one billion vectors: Re-rank with source coding", 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 861-864, 2011.
[29] F. Khan, "The cost of latency", 2015, [online] Available: https://www.digitalrealty.com/blog/the-cost-of-latency/.
[30] S. Kim et al., "GPUnet: Networking abstractions for GPU programs", OSDI, vol. 14, pp. 6-8, 2014.
