Wednesday, 13 December 2017

Session 6: Big Data & Database

A Constraint Programming Based Hadoop Scheduler for Handling MapReduce Jobs with Deadlines on Clouds

Authors:

Norman Lim (Carleton University)
Shikharesh Majumdar (Carleton University)
Peter Ashwood-Smith (Huawei, Canada)

Abstract:

We devise MRCP, a novel constraint programming based matchmaking and scheduling algorithm that can handle MapReduce jobs with deadlines while achieving high system performance. The MRCP algorithm is incorporated into Hadoop, a widely used open source implementation of the MapReduce programming model, as a new scheduler called the CP-Scheduler. This paper originates from collaborative research with our industrial partner on engineering resource management middleware for high performance, and it describes our experiences and the challenges we encountered in designing and implementing the prototype CP-based Hadoop scheduler. A detailed performance evaluation of the CP-Scheduler is conducted on Amazon EC2 to determine its effectiveness and to obtain insights into system behaviour and performance. The CP-Scheduler’s performance is also compared with that of an earliest deadline first (EDF) Hadoop scheduler, implemented by extending Hadoop’s default FIFO scheduler. The experimental results demonstrate the CP-Scheduler’s effectiveness in handling an open stream of MapReduce jobs with deadlines in a Hadoop cluster.
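
The abstract does not spell out the MRCP model itself, but the core idea of matchmaking and scheduling as a constraint program with deadline constraints can be sketched. Below is a minimal illustration in Python using Google OR-Tools CP-SAT; the jobs, durations, deadlines, and slot count are hypothetical, and real MapReduce jobs would be decomposed into map and reduce tasks rather than scheduled as single blocks.

    # Minimal deadline-aware scheduling sketch with OR-Tools CP-SAT.
    # NOTE: hypothetical data; not the paper's actual MRCP formulation.
    from ortools.sat.python import cp_model

    jobs = [("j1", 40, 100), ("j2", 25, 60), ("j3", 70, 150)]  # (id, duration, deadline)
    num_slots = 2                       # task-execution slots in the cluster
    horizon = sum(d for _, d, _ in jobs)

    model = cp_model.CpModel()
    per_slot = [[] for _ in range(num_slots)]  # intervals placed on each slot
    starts, lates = {}, {}

    for name, dur, deadline in jobs:
        start = model.NewIntVar(0, horizon, f"start_{name}")
        end = model.NewIntVar(0, horizon, f"end_{name}")
        model.Add(end == start + dur)
        # Matchmaking: one optional interval per slot, exactly one active.
        presences = []
        for s in range(num_slots):
            pres = model.NewBoolVar(f"on_{name}_{s}")
            per_slot[s].append(model.NewOptionalIntervalVar(
                start, dur, end, pres, f"iv_{name}_{s}"))
            presences.append(pres)
        model.AddExactlyOne(presences)
        # Soft deadline: 'late' is true iff the job finishes past its deadline.
        late = model.NewBoolVar(f"late_{name}")
        model.Add(end <= deadline).OnlyEnforceIf(late.Not())
        model.Add(end > deadline).OnlyEnforceIf(late)
        starts[name], lates[name] = start, late

    for s in range(num_slots):
        model.AddNoOverlap(per_slot[s])    # a slot runs one task at a time

    model.Minimize(sum(lates.values()))    # minimize missed deadlines
    solver = cp_model.CpSolver()
    if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        for name, _, _ in jobs:
            print(name, "start:", solver.Value(starts[name]),
                  "late:", bool(solver.Value(lates[name])))

Minimizing the count of missed deadlines is just one plausible objective for this sketch; other formulations could minimize total lateness or combine deadline misses with throughput goals.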

DOI: 10.1145/2668930.2688058

An Empirical Performance Evaluation of Distributed SQL Query Engines

Authors:

Stefan van Wouw (Azavista & Delft University of Technology)
José Viña (Azavista)
Alexandru Iosup (Delft University of Technology)
Dick Epema (Delft University of Technology)

Abstract:

Distributed SQL Query Engines (DSQEs) are increasingly used in a variety of domains, but users, especially those in small companies with little in-house expertise, may find it challenging to select an appropriate engine for their specific applications. Although both industry and academia are attempting to come up with high-level benchmarks, the performance of DSQEs has never been explored or compared in depth. We propose an empirical method for evaluating the performance of DSQEs with representative metrics, datasets, and system configurations. We implement a micro-benchmarking suite of three classes of SQL queries for both a synthetic and a real-world dataset, and we report response time, resource utilization, and scalability. We use our micro-benchmarking suite to analyze and compare three state-of-the-art engines, viz. Shark, Impala, and Hive. We gain valuable insights for each engine and present a comprehensive comparison of these DSQEs. We find that the query engines differ widely in performance: Hive is consistently outperformed by the other two engines, but whether Impala or Shark is the best performer depends strongly on the query type.
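
As a rough illustration of the micro-benchmarking approach described above, the sketch below times repeated runs of representative query classes and reports response-time statistics. The query texts and the execute() hook are hypothetical stand-ins; a real harness would dispatch to the Shark, Impala, or Hive client drivers.

    # Sketch of a response-time micro-benchmark harness (hypothetical queries;
    # execute() is a stub standing in for a real engine client).
    import statistics
    import time

    QUERY_CLASSES = {
        "scan":      "SELECT * FROM logs WHERE status = 500",
        "aggregate": "SELECT status, COUNT(*) FROM logs GROUP BY status",
        "join":      "SELECT l.url FROM logs l JOIN users u ON l.uid = u.uid",
    }

    def execute(sql: str) -> None:
        """Placeholder: submit `sql` to the engine under test and await results."""
        time.sleep(0.01)  # stand-in for real query latency

    def benchmark(runs: int = 5) -> None:
        for name, sql in QUERY_CLASSES.items():
            execute(sql)  # warm-up run, excluded from the statistics
            samples = []
            for _ in range(runs):
                t0 = time.perf_counter()
                execute(sql)
                samples.append(time.perf_counter() - t0)
            print(f"{name}: median={statistics.median(samples):.3f}s "
                  f"max={max(samples):.3f}s")

    if __name__ == "__main__":
        benchmark()

Keeping everything engine-specific behind the single execute() hook is what makes per-engine comparisons apples-to-apples; a fuller harness would also record resource utilization and scalability, as the paper's method does.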

DOI: 10.1145/2668930.2688053

IoTAbench: An Internet of Things Analytics Benchmark

Authors:

Martin Arlitt (HP Labs)
Manish Marwah (HP Labs)
Gowtham Bellala (HP Labs)
Amip Shah (HP Labs)
Jeff Healey (HP Vertica)
Ben Vandiver (HP Vertica)

Abstract:

The commoditization of sensors and communication networks is enabling vast quantities of data to be generated by and collected from cyber-physical systems. This “Internet-of-Things” (IoT) makes possible new business opportunities, from usage-based insurance to proactive equipment maintenance. While many technology vendors now offer “Big Data” solutions, a challenge for potential customers is understanding quantitatively how well these solutions will work for IoT use cases. This paper describes IoTAbench, a benchmark toolkit for IoT Big Data scenarios. The toolkit facilitates repeatable testing and can easily be extended to new IoT use cases or tailored to a user’s specific needs, interests, or datasets. We demonstrate the benchmark via a smart metering use case involving an eight-node cluster running the HP Vertica analytics platform. The use case involves generating, loading, repairing, and analyzing synthetic meter readings. The intent of IoTAbench is to provide the means to perform “apples-to-apples” comparisons between different sensor data and analytics platforms. We illustrate the capabilities of IoTAbench via a large experimental study in which we store 22.8 trillion smart meter readings, totaling 727 TB of data, in our eight-node cluster.
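
To make the data-generation step concrete, here is a minimal sketch of a synthetic smart-meter reading generator in the spirit of IoTAbench; the schema, meter count, 15-minute interval, and diurnal load shape are assumptions for illustration, not the toolkit's actual design.

    # Sketch of a synthetic smart-meter reading generator (assumed schema:
    # meter_id, timestamp, kWh; interval and load shape are hypothetical).
    import csv
    import math
    import random
    from datetime import datetime, timedelta

    def generate(num_meters=10, readings_per_meter=96, path="readings.csv"):
        start = datetime(2015, 1, 1)
        step = timedelta(minutes=15)           # 15-minute interval data
        with open(path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(["meter_id", "ts", "kwh"])
            for m in range(num_meters):
                base = random.uniform(0.2, 0.6)    # per-meter baseline load
                for i in range(readings_per_meter):
                    ts = start + i * step
                    hour = ts.hour + ts.minute / 60
                    # Diurnal cycle peaking mid-day, plus Gaussian noise.
                    kwh = base + 0.4 * max(0.0, math.sin((hour - 6) * math.pi / 12))
                    kwh += random.gauss(0, 0.02)
                    w.writerow([m, ts.isoformat(), round(max(kwh, 0.0), 4)])

    if __name__ == "__main__":
        generate()

The generated CSV could then be bulk-loaded into the database under test, e.g. with Vertica's COPY statement, before the repair and analytics stages of the use case.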

DOI: 10.1145/2668930.2688055