I. Introduction
Big data workflows have become an important paradigm since the introduction of scientific workflows and the need to formalize complex data-intensive scientific processes. One common characteristic of big data workflow applications is the generation of intermediate data during the execution of workflow instances. This leads to a massive amount of intermediate results with data dependencies that must be hosted and managed over a cloud infrastructure. Handling large intermediate data dependencies in a cloud infrastructure is critical for such applications, and execution can take a long time since intermediate results must be gathered and processed from different storage locations. Because some intermediate data are too large to be relocated efficiently, placement decisions must take the dependencies between intermediate data into account when selecting their locality. Furthermore, scientific users share important intermediate data dependencies to cooperate and to reproduce new intermediate results. This enables researchers to collaborate with professionals and scientific users around the world and to handle and share intermediate workflow data of far larger size than before.

By offering storage services in several geographically distributed datacenters, cloud infrastructures enable big data workflow applications to provide low-latency access to scientific user data. However, the ever-increasing volume of scientific intermediate data raises the need to interpret, move, and store these data more efficiently in the most appropriate datacenter. A fundamental issue at such scales is how to efficiently place the intermediate results of a workflow application across distributed cloud datacenters, while preserving the dependencies and scalability of the placed data, such that the total storage cost of the cloud provider is minimized.

On another note, cloud storage providers operate geographically distributed datacenters offering several storage classes at different prices. Providers can collaborate by sharing their respective resources and dynamically adjusting their hosting capacities in response to the demands of their data applications. An important problem faced by cloud users is how to exploit these storage classes to serve an application with data requirements at minimum cost. A federation of existing cloud storage services provides scientific users with a unified, combined view of storage and data services across several providers and applications. Recently, several studies have taken advantage of the variety of pricing plans for different resources in a cloud storage federation, where cost can be optimized by trading, through negotiation, storage against compute and network resources, as well as by optimizing the distribution of data across cloud providers [1], [2]–[4] (here, we disregard profit improvements). However, none of these studies investigated the trade-off between network and storage costs to optimize the cost of data workflow placement across a federated cloud storage provider. Our study is motivated by these pioneering works, as none of them can simultaneously answer the aforementioned questions (i.e., the placement and cost saving of workflow data in a cloud storage federation).