
Enhancing Server Efficiency in the Face of Killer Microseconds


Abstract:

We are entering an era of “killer microseconds” in data center applications. Killer microseconds refer to μs-scale “holes” in CPU schedules caused by stalls to access fast I/O devices or brief idle times between requests in high throughput microservices. Whereas modern computing platforms can efficiently hide ns-scale and ms-scale stalls through micro-architectural techniques and OS context switching, they lack efficient support to hide the latency of μs-scale stalls. Simultaneous Multithreading (SMT) is an efficient way to improve core utilization and increase server performance density. Unfortunately, scaling SMT to provision enough threads to hide frequent μs-scale stalls is prohibitive and SMT co-location can often drastically increase the tail latency of cloud microservices. In this paper, we propose Duplexity, a heterogeneous server architecture that employs aggressive multithreading to hide the latency of killer microseconds, without sacrificing the Quality-of-Service (QoS) of latency-sensitive microservices. Duplexity provisions dyads (pairs) of two kinds of cores: master-cores, which each primarily executes a single latency-critical master-thread, and lender-cores, which multiplex latency-insensitive throughput threads. When the master-thread stalls, the master-core borrows filler-threads from the lender-core, filling μs-scale utilization holes of the microservice. We propose critical mechanisms, including separate memory paths for the master-thread and filler-threads, to enable master-cores to borrow filler-threads while protecting master-threads' state from disruption. Duplexity facilitates fast master-thread restart when stalls resolve and minimizes the microservice's QoS violation. Our evaluation demonstrates that Duplexity is able to achieve 1.9× higher core utilization and 2.7× lower iso-throughput 99th-percentile tail latency over an SMT-based server design, on average.
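
The following is a minimal, purely illustrative sketch of the dyad idea summarized above, written as a toy Python model rather than the paper's hardware mechanism: a master-core accumulates useful work during a request's service time and, whenever the master-thread stalls for a few microseconds, lends the core to a borrowed filler-thread so the stall time still produces throughput. The class names, timing constants, and stall distribution are hypothetical.

# Toy model of a Duplexity dyad (illustrative only, not the paper's design):
# a master-core runs a latency-critical master-thread and, while that thread
# stalls on a microsecond-scale event, borrows a latency-insensitive
# filler-thread from the paired lender-core instead of sitting idle.
import random


class FillerThread:
    def __init__(self, name):
        self.name = name
        self.progress_us = 0.0  # useful work done while borrowed


class Dyad:
    def __init__(self, filler_threads):
        self.fillers = list(filler_threads)  # lender-core's thread pool
        self.busy_us = 0.0                   # time the master-core spent doing useful work
        self.total_us = 0.0                  # total simulated time

    def serve_request(self, service_us, stall_us):
        # Master-thread computes, then stalls (e.g., on fast I/O or remote memory).
        self.busy_us += service_us
        self.total_us += service_us + stall_us
        if self.fillers:
            # Borrow a filler-thread for the duration of the stall; the
            # master-thread's state is assumed preserved for a fast restart.
            random.choice(self.fillers).progress_us += stall_us
            self.busy_us += stall_us


def utilization(with_fillers):
    random.seed(0)
    fillers = [FillerThread(f"batch-{i}") for i in range(4)] if with_fillers else []
    dyad = Dyad(fillers)
    for _ in range(10_000):
        dyad.serve_request(service_us=2.0, stall_us=random.uniform(1.0, 10.0))
    return dyad.busy_us / dyad.total_us


if __name__ == "__main__":
    print(f"utilization without filler-threads: {utilization(False):.2f}")
    print(f"utilization with filler-threads:    {utilization(True):.2f}")

Running the sketch contrasts the utilization of an isolated master-core with one that can borrow filler-threads during stalls, which is the effect Duplexity targets in hardware.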
Date of Conference: 16-20 February 2019
Date Added to IEEE Xplore: 28 March 2019

Conference Location: Washington, DC, USA

I. Introduction

We are entering the “killer microsecond” era in data center applications [1]. Due to advances in processor, memory, storage, and networking technologies, events that stall execution increasingly fall into a microsecond-scale latency range. Accesses to emerging storage-class memories [2]–[9], rack-scale memory disaggregation [10]–[14], 100+ gigabit network communication [15], and accelerator/GPU micro-offloads [16]–[18] are examples of program activities that incur microsecond-scale delays.
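
To see why such delays are hard to hide, a rough back-of-envelope comparison helps (the clock rate, context-switch cost, and out-of-order window size below are assumed round numbers for illustration, not measurements from this paper): a one-microsecond stall spans thousands of cycles, far more than an out-of-order core can overlap, yet it is short enough that a conventional OS context switch would consume a large fraction of, or more than, the stall itself. A small Python sketch of the arithmetic:

# Back-of-envelope arithmetic (assumed, round figures) for why microsecond-scale
# stalls fall into a gap: too long for out-of-order hardware to hide, too short
# for OS context switching to pay off.

CLOCK_GHZ = 3.0            # assumed core frequency
CTX_SWITCH_US = 5.0        # assumed round-trip OS context-switch cost
OOO_WINDOW_CYCLES = 300    # assumed out-of-order instruction window (order of a ROB)

for label, stall_us in [("100 ns", 0.1), ("1 us", 1.0), ("10 us", 10.0), ("1 ms", 1000.0)]:
    stall_cycles = stall_us * CLOCK_GHZ * 1000         # cycles spent waiting
    hidden_by_ooo = stall_cycles <= OOO_WINDOW_CYCLES  # can the core overlap it?
    switch_cost = CTX_SWITCH_US / stall_us             # context-switch cost relative to the stall
    print(f"{label:>7}: ~{stall_cycles:>9,.0f} cycles | "
          f"hidden by out-of-order: {str(hidden_by_ooo):5} | "
          f"context switch = {switch_cost:.1%} of the stall")

Under these assumptions, nanosecond-scale stalls fit within the out-of-order window, millisecond-scale stalls dwarf the context-switch cost, and microsecond-scale stalls are served poorly by both mechanisms.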

References
[1] L. Barroso, M. Marty, D. Patterson and P. Ranganathan, "Attack of the killer microseconds", Communications of the ACM, vol. 60, no. 4, pp. 48-54, 2017.
[2] N. Agarwal and T. F. Wenisch, "Thermostat: Application-transparent page management for two-tiered main memory", Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 631-644, 2017.
[3] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, et al., "Better I/O through byte-addressable persistent memory", ACM SIGOPS 22nd Symposium on Operating Systems Principles, pp. 133-146, 2009.
[4] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, et al., "System software for persistent memory", European Conference on Computer Systems, 2014.
[5] A. Mirhosseini, A. Agrawal and J. Torrellas, "Survive: Pointer-based in-DRAM incremental checkpointing for low-cost data persistence and rollback-recovery", IEEE Computer Architecture Letters, vol. 16, no. 2, pp. 153-157, 2017.
[6] A. Tavakkol, A. Kolli, S. Novakovic, K. Razavi, J. Gomez-Luna, H. Hassan, C. Barthels, Y. Wang, M. Sadrosadati, S. Ghose, et al., "Enabling efficient RDMA-based synchronous mirroring of persistent memory transactions", arXiv preprint, 2018.
[7] S. Pelley, P. M. Chen and T. F. Wenisch, "Memory persistency", ACM SIGARCH Computer Architecture News, 2014.
[8] A. Kolli, V. Gogte, A. Saidi, S. Diestelhorst, P. M. Chen, S. Narayanasamy, et al., "Language-level persistency", ACM/IEEE International Symposium on Computer Architecture, 2017.
[9] V. Gogte, S. Diestelhorst, W. Wang, S. Narayanasamy, P. M. Chen and T. F. Wenisch, "Persistency for synchronization-free regions", ACM SIGPLAN Conference on Programming Language Design and Implementation, 2018.
[10] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt and T. F. Wenisch, "Disaggregated memory for expansion and sharing in blade servers", ACM SIGARCH Computer Architecture News, 2009.
[11] A. Dragojević, D. Narayanan, O. Hodson and M. Castro, "FaRM: Fast remote memory", USENIX Conference on Networked Systems Design and Implementation, 2014.
[12] S. Novakovic, A. Daglis, E. Bugnion, B. Falsafi and B. Grot, "Scale-out NUMA", ACM SIGPLAN Notices, vol. 49, pp. 3-18, 2014.
[13] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury and K. G. Shin, "Efficient memory disaggregation with Infiniswap", NSDI, 2017.
[14] M. K. Aguilera, N. Amit, I. Calciu, X. Deguillard, J. Gandhi, P. Subrahmanyam, et al., "Remote memory in the age of fast networks", Symposium on Cloud Computing, 2017.
[15] C. Binnig, A. Crotty, A. Galakatos, T. Kraska and E. Zamanian, "The end of slow networks: It's time for a redesign", Proceedings of the VLDB Endowment, vol. 9, no. 7, pp. 528-539, 2016.
[16] D. Lustig and M. Martonosi, "Reducing GPU offload latency via fine-grained CPU-GPU synchronization", IEEE International Symposium on High Performance Computer Architecture, 2013.
[17] A. Caulfield, E. Chung, A. Putnam, et al., "A cloud-scale acceleration architecture", IEEE/ACM International Symposium on Microarchitecture, 2016.
[18] A. Mirhosseini, M. Sadrosadati, B. Soltani, H. Sarbazi-Azad and T. F. Wenisch, "BiNoCHS: Bimodal network-on-chip for CPU-GPU heterogeneous systems", IEEE/ACM International Symposium on Networks-on-Chip (NOCS), 2017.
[19] Y. Gan and C. Delimitrou, "The architectural implications of cloud microservices", IEEE Computer Architecture Letters (CAL), vol. 17, no. 2, Jul.-Dec. 2018.
[20] Staci D. Kramer, "The biggest thing Amazon got right: The platform."
[21] Tony Mauro, "Adopting microservices at Netflix: Lessons for architectural design."
[22] Yoni Goldberg, "Scaling Gilt: From monolithic Ruby application to distributed Scala micro-services architecture."
[23] Steven Ihde and Karan Parikh, "From a monolith to microservices + REST: The evolution of LinkedIn's service architecture."
[24] Phil Calcado, "Building products at SoundCloud, part I: Dealing with the monolith."
[25] B. Fitzpatrick, "Distributed caching with memcached", Linux Journal, 2004.
[26] B. Fan, D. G. Andersen and M. Kaminsky, "MemC3: Compact and concurrent MemCache with dumber caching and smarter hashing", NSDI, 2013.
[27] A. Likhtarov, R. Nishtala, R. McElroy, H. Fugal, A. Grynenko and V. Venkataramani, "Introducing mcrouter: A memcached protocol router for scaling memcached deployments", 2014.
[28] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, et al., "Scaling Memcache at Facebook", NSDI, pp. 385-398, 2013.
[29] A. Kalia, M. Kaminsky and D. G. Andersen, "Using RDMA efficiently for key-value services", ACM SIGCOMM Computer Communication Review, 2015.
[30] C. Mitchell, Y. Geng and J. Li, "Using one-sided RDMA reads to build a fast, CPU-efficient key-value store", USENIX Annual Technical Conference, pp. 103-114, 2013.
