I. Introduction
Among the many cloud computing platforms, Apache Spark [1] is a popular open-source platform that introduces the concept of resilient distributed datasets (RDDs) [2] to enable fast processing of large volumes of data by leveraging distributed memory. Its in-memory data operations make it well suited for iterative applications such as iterative machine learning and graph algorithms. However, the execution time of a job on the Apache Spark platform can vary significantly depending on the input data type and size, the design and implementation of the algorithm, and the computing capability (e.g., number of nodes, CPU speed, memory size), making it extremely difficult to predict job performance, which is often needed to optimize resource allocation [3], [4]. Performance prediction can also help locate execution stages with abnormal resource usage patterns [5]. While prior work has addressed performance prediction for cloud platforms such as Apache Hadoop [6] (an open-source implementation of the MapReduce [7] computing framework), these approaches are not suitable for the Apache Spark platform because of its different programming model and features such as in-memory data operations. Hence, to address this void, in this paper we focus on performance modeling for Apache Spark jobs.

Machine learning approaches are often used to predict system performance from past execution data [8]–[10] and can achieve reasonable prediction accuracy, but they require a training dataset. In contrast, modeling-based approaches predict performance by modeling system behavior [11]–[13] and often provide a better understanding of the internal execution of a program and the resulting performance. Therefore, in this paper we apply analytical approaches to predict the performance of Apache Spark jobs. Specifically, we leverage the multi-stage execution structure of Apache Spark jobs (illustrated by the sketch at the end of this section) to develop hierarchical models that effectively capture the execution behavior of the different stages. Using these models, we first measure job performance in a limited-scale execution that uses only a fraction of the real dataset. We then predict the full-scale job performance from the limited-scale measurements.

We evaluated our framework with four real-world applications. In each case, our model predicts the execution time of individual stages with high accuracy. It also predicts the memory requirement for RDD creation with high accuracy. However, the accuracy of I/O cost prediction varied across applications and simulation setups. We discuss our detailed findings in Section IV.
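To make the multi-stage execution structure concrete, the following minimal sketch shows a canonical word-count job; the application name and file paths are illustrative placeholders, not part of our evaluated workloads. Spark splits a job into stages at shuffle boundaries, such as the one introduced by reduceByKey, and it is exactly this per-stage structure that our hierarchical models capture.

```scala
// A minimal Spark job whose shuffle boundary splits execution into two stages.
// Paths and names are hypothetical, for illustration only.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountStages {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountStages")
    val sc = new SparkContext(conf)

    // Stage 1: narrow transformations (flatMap, map) are pipelined over
    // the input partitions without moving data between nodes.
    val counts = sc.textFile("hdfs:///input/sample.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      // reduceByKey requires a shuffle: Stage 1 ends here, Stage 2 begins.
      .reduceByKey(_ + _)

    // Stage 2: aggregation over the shuffled data; the action below
    // triggers execution of the whole job.
    counts.saveAsTextFile("hdfs:///output/counts")
    sc.stop()
  }
}
```

Each stage has its own resource usage profile (CPU for the narrow transformations, network and disk I/O around the shuffle), which is why modeling stages individually, rather than the job as a whole, can capture execution behavior more faithfully.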