Award Winners 2018
Abstract — Nikolas Herbst: Methods and Benchmarks for Auto-Scaling Mechanisms in Elastic Cloud Environments
A key functionality of cloud systems is automated resource management at the infrastructure level. As part of this, elastic scaling of allocated resources is realized by so-called auto-scalers, which are supposed to match the current demand such that performance remains stable while resources are used efficiently. The process of rating cloud infrastructure offerings in terms of the quality of their achieved elastic scaling remains undefined. Clear guidance for selecting and configuring an auto-scaler for a given context is not available. Thus, existing operating solutions are optimized in a highly application-specific way and are usually kept undisclosed.
The common state of practice is the use of simplistic threshold-based approaches. Due to their reactive nature, they incur performance degradation during provisioning delays, which typically last several minutes. In the literature, a large number of auto-scalers have been proposed that try to overcome the limitations of reactive mechanisms by employing proactive prediction methods. In this thesis, we identify potential for improvement in automated cloud resource management and its evaluation methodology. Specifically, we make the following contributions:
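The threshold-based state of practice mentioned above can be illustrated with a minimal sketch. The thresholds, step size, and bounds below are illustrative assumptions, not values from the thesis:

```python
def reactive_scale(current_vms, cpu_utilization,
                   upper=0.8, lower=0.3, step=1,
                   min_vms=1, max_vms=10):
    """Simplistic threshold-based reactive scaling rule (illustrative;
    all parameter values are assumptions, not taken from the thesis)."""
    if cpu_utilization > upper and current_vms < max_vms:
        return min(current_vms + step, max_vms)   # scale out
    if cpu_utilization < lower and current_vms > min_vms:
        return max(current_vms - step, min_vms)   # scale in
    return current_vms                            # stay within thresholds

# Such a rule only reacts after utilization has already crossed a threshold,
# and newly requested VMs become usable only after the provisioning delay.
```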
(I) We propose a descriptive load profile modeling framework together with automated model extraction from recorded traces to enable reproducible workload generation with realistic load intensity variations. The proposed Descartes Load Intensity Model (DLIM) with its Limbo framework provides key functionality to stress and benchmark resource management approaches in a representative and fair manner.
(II) We propose a set of intuitive metrics for quantifying the timing, stability, and accuracy aspects of elasticity. Based on these metrics, we propose a novel approach for benchmarking the elasticity of Infrastructure-as-a-Service (IaaS) cloud platforms independently of the performance exhibited by the provisioned underlying resources.
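To give an intuition for accuracy-style elasticity metrics, the following sketch measures average under- and over-provisioning from a demand and a supply curve. This is a simplified illustration (unit-length intervals, resources counted in whole units); the exact metric definitions in the thesis differ:

```python
def provisioning_accuracy(demand, supply):
    """Average under- and over-provisioning per interval.
    Illustrative simplification of accuracy-style elasticity metrics:
    demand[t] is the resource demand and supply[t] the allocated
    resources in interval t, all intervals of equal length."""
    assert len(demand) == len(supply)
    n = len(demand)
    under = sum(max(d - s, 0) for d, s in zip(demand, supply)) / n
    over  = sum(max(s - d, 0) for d, s in zip(demand, supply)) / n
    return under, over

# Example: demand [2, 3, 5, 4] vs. supply [2, 2, 6, 4]
# under-provisioned by 1 unit in one of 4 intervals -> 0.25
# over-provisioned by 1 unit in one of 4 intervals  -> 0.25
```

Lower values are better on both axes; a perfect auto-scaler would score (0, 0).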
(III) We tackle the challenge of reducing the risk of relying on a single proactive auto-scaler by proposing a new self-aware auto-scaling mechanism, called Chameleon, which combines multiple proactive methods with a reactive fallback mechanism. Chameleon employs on-demand, automated time-series forecasting methods to predict the arriving load intensity, in combination with run-time service demand estimation techniques to calculate the required resource consumption per work unit without the need for detailed application instrumentation. It can also leverage application knowledge by solving product-form queueing networks to derive optimized scaling actions. The Chameleon approach is the first to resolve conflicts between reactive and proactive scaling decisions in an intelligent way.
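The two core ideas, sizing from a load forecast plus a service demand estimate, and resolving reactive/proactive conflicts, can be sketched as follows. Both functions are simplified illustrations under assumed parameters; Chameleon's actual sizing uses queueing models and its conflict resolution is more elaborate:

```python
import math

def required_resources(forecast_arrival_rate, service_demand_s,
                       utilization_target=0.8):
    """Queueing-style sizing rule (illustrative): provision enough
    resources so that offered load (arrival rate x service demand)
    stays below the assumed target utilization per resource."""
    offered_load = forecast_arrival_rate * service_demand_s
    return max(1, math.ceil(offered_load / utilization_target))

def combine_decisions(reactive, proactive):
    """Resolve a conflict between a reactive and a proactive scaling
    decision, each in {-1, 0, +1} for scale-in/stay/scale-out.
    Simplified assumption: any vote for scale-out wins, since
    under-provisioning is usually costlier than over-provisioning."""
    if reactive == proactive:
        return reactive
    if +1 in (reactive, proactive):
        return +1          # scale-out wins the conflict
    return 0               # stay vs. scale-in disagreement: stay
```

For example, a forecast of 100 requests/s with an estimated service demand of 0.02 s per request yields an offered load of 2.0 busy resources, sized up to 3 at a 0.8 utilization target.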
We are confident that the contributions of this thesis will have a long-term impact on the way cloud resource management approaches are assessed. While this could result in an improved quality of autonomic management algorithms, we see and discuss arising challenges for future research in cloud resource management and its assessment methods: The adoption of containerization on top of virtual machine instances introduces another level of indirection. As a result, the nesting of virtual resources increases resource fragmentation and causes unreliable provisioning delays. Furthermore, virtualized compute resources tend to become increasingly inhomogeneous, associated with various priorities and trade-offs. Due to DevOps practices, cloud-hosted service updates are released more frequently, which impacts the dynamics of user behavior.
Abstract — Matteo Nardelli: QoS-aware Deployment and Adaptation of Data Stream Processing Applications in Geo-distributed Environments
Exploiting on-the-fly computation, Data Stream Processing (DSP) applications are widely used to extract valuable information in a near real-time fashion, thus enabling the development of new pervasive services. Nevertheless, running DSP applications is challenging, because they are subject to a varying workload, require long provisioning times, and express strict QoS requirements. Moreover, since data sources are, in general, geographically distributed (e.g., in Internet-of-Things scenarios), we have recently witnessed a paradigm shift towards the deployment and execution of DSP applications over distributed Cloud and Fog computing resources. This computing environment makes it possible to move applications closer to the data sources and consumers, thus reducing their expected response time. This diffused computing infrastructure also promises to reduce the stress upon the Internet infrastructure, by reducing the movement of large data sets, and to improve the scalability of DSP systems, by better exploiting the ever-increasing amount of resources at the network periphery. Nevertheless, such geo-distributed infrastructures pose new challenges, including the heterogeneity of computing and networking resources and the non-negligible network latencies.
In this thesis, we study the challenges of executing DSP applications over geo-distributed environments. A DSP application is represented as a directed graph, with data sources, operators, and final consumers as vertices, and streams as edges. For the execution, we need to solve the operator placement problem, which consists of determining the computing nodes that should host and execute each operator of a DSP application. Moreover, when the DSP application should efficiently process huge amounts of incoming load, we also need to solve the operator replication problem. It consists of determining the number of parallel instances (or replicas) for the operators, so that each instance can process a subset of the incoming data flow in parallel (i.e., data parallelism). We first present a taxonomy to classify the existing deployment and runtime adaptation approaches for DSP systems. Starting from the literature review, we provide several contributions to the initial deployment and runtime adaptation of DSP applications over heterogeneous resources. We propose a general unified formulation of the operator placement problem, which also provides a benchmark against which placement heuristics can be evaluated. Then, we present several new heuristics for efficiently solving the operator placement problem in a feasible amount of time. Unlike existing research efforts, the heuristics are also evaluated in terms of the quality of the computed placement solution. Afterwards, we study the operator replication problem and propose a general formulation that jointly optimizes the replication and placement of the DSP operators while considering the application QoS requirements. To preserve the application performance in highly changing execution environments, the deployment of DSP applications should be reconfigured accordingly at runtime.
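To make the operator placement problem concrete, here is a minimal greedy sketch that assigns each operator (in topological order) to the feasible node minimizing network latency to its upstream operators. This is an illustrative heuristic under assumed inputs, not one of the heuristics proposed in the thesis:

```python
def greedy_placement(operators, nodes, capacity, demand, latency, upstream):
    """Greedy operator placement sketch (illustrative only).
    operators: operator ids in topological order
    nodes:     candidate node ids
    capacity:  capacity[n] = resource units available on node n
    demand:    demand[op] = resource units required by operator op
    latency:   latency[(u, v)] = network latency between nodes u and v
    upstream:  upstream[op] = operators feeding op (empty for sources)"""
    placement, used = {}, {n: 0 for n in nodes}
    for op in operators:
        feasible = [n for n in nodes if used[n] + demand[op] <= capacity[n]]
        def cost(n):
            # total latency from the nodes hosting op's upstream operators
            return sum(latency[(placement[u], n)] for u in upstream[op])
        best = min(feasible, key=cost)
        placement[op] = best
        used[best] += demand[op]
    return placement
```

With an `edge` node of capacity 2 and a distant `cloud` node, a three-operator pipeline keeps the source and its successor on the edge and overflows the last operator to the cloud once the edge node is full, illustrating the latency/capacity tension that real placement heuristics must balance.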
Hence, we formulate the elastic replication and placement problem, which determines whether the application should be redeployed by explicitly considering the adaptation costs (i.e., state migration, application downtime). To efficiently deal with runtime adaptation over large-scale and geo-distributed infrastructures, we present a hierarchical approach for the autonomous control of elastic DSP applications. To evaluate the devised deployment and adaptation solutions on real DSP systems, we design and implement two extensions of Apache Storm, namely Distributed Storm and Elastic Storm, which are available as open-source projects.
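The core trade-off behind cost-aware runtime adaptation can be reduced to a single amortization check: redeploy only if the expected savings over the remaining horizon outweigh the one-time adaptation cost. The thesis formulates this as an optimization problem; the inequality below is a deliberately simplified illustration with assumed inputs:

```python
def should_redeploy(current_cost, new_cost, adaptation_cost, horizon):
    """Cost-aware redeployment check (illustrative amortization rule).
    current_cost, new_cost: per-interval running cost of the current
    and the candidate deployment; adaptation_cost: one-time cost of
    switching (state migration, downtime); horizon: number of
    intervals the new deployment is expected to remain valid."""
    expected_saving = (current_cost - new_cost) * horizon
    return expected_saving > adaptation_cost
```

For instance, a candidate deployment saving 2 cost units per interval over a 10-interval horizon justifies an adaptation cost of 15, whereas a saving of only 1 unit per interval does not.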
Our thesis work demonstrates the importance of models, prototypes, and empirical experiments to deeply understand and overcome the challenges of running DSP applications over geo-distributed environments. Indeed, a suitable representation of applications and of computing and network resources, one that explicitly considers their relevant QoS attributes, makes it possible to improve application performance while efficiently managing geo-distributed infrastructures.