I. Introduction
Big data workflows have become an important paradigm since the introduction of scientific workflows and the need to formalize complex data-intensive scientific processes. One common characteristic of big data workflow applications is the generation of intermediate data during the execution of workflow instances. This leads to a massive amount of intermediate results with data dependencies that must be hosted and managed over a cloud infrastructure. Handling large intermediate data dependencies in a cloud infrastructure is critical for such applications, and execution can take a long time since intermediate results must be gathered and processed from different storage locations. Because some intermediate data are too large to be relocated efficiently, placement decisions must take the dependencies between intermediate data into account when selecting their locality. Furthermore, scientific users share important intermediate data dependencies to cooperate and to reproduce new intermediate results. This enables researchers to collaborate with professionals and scientific users around the world and to handle and share intermediate workflow data of far larger size than before.

By offering storage services in several geographically distributed datacenters, cloud infrastructures enable big data workflow applications to provide low-latency access to scientific user data. However, the ever-increasing volume of scientific intermediate data raises the need to interpret, move, and store these data more efficiently in the most appropriate datacenter. A fundamental issue at such scales is how to efficiently place the intermediate results of a workflow application across distributed cloud datacenters, while preserving the dependencies and scalability of the placed data, such that the total storage cost of the cloud provider is minimized.

On another note, cloud storage providers operate geographically distributed datacenters offering several storage classes at different prices. Providers can collaborate by sharing their respective resources and dynamically adjusting their hosting capacities in response to the demands of their data applications. An important problem faced by cloud users is how to exploit these storage classes to serve an application with data requirements at minimum cost. A federation of existing cloud storage services provides scientific users with a unified, combined view of storage and data services across several providers and applications. Recently, several studies have taken advantage of the variety of pricing plans for different resources in a cloud storage federation, where cost can be optimized by trading, through negotiation, storage against compute and network resources, as well as by optimizing the distribution of data across cloud providers [1], [2]–[4] (here, we disregard profit improvements). However, none of these studies investigated the trade-off between network and storage costs to optimize the cost of data workflow placement across a federated cloud storage provider. Our study is motivated by these pioneering works, as none of them can simultaneously answer the aforementioned questions (i.e., the placement and cost saving of workflow data in a cloud storage federation).