Thursday, 14 December 2017

Session 11: Benchmarks and Empirical Studies - Workloads, Scenarios and Implementations

NUPAR: A Benchmark Suite for Modern GPU Architectures

Authors:

Yash Ukidave (Northeastern University)
Fanny Nina Paravecino (Northeastern University)
Leiming Yu (Northeastern University)
Charu Kalra (Northeastern University)
Amir Momeni (Northeastern University)
Zhongliang Chen (Northeastern University)
Nick Materise (Northeastern University)
Brett Daley (Northeastern University)
Perhaad Mistry (Advanced Micro Devices Inc.)
David Kaeli (Northeastern University)

Abstract:

Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelerators have gained widespread use by application developers and data-center platform developers. Modern-day heterogeneous systems have evolved to include advanced hardware and software features to support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all introduced new interfaces to enable developers to utilize new features on these platforms. In emerging applications, performance optimization is not limited to effectively exploiting data-level parallelism, but also includes leveraging new degrees of concurrency and parallelism to accelerate the entire application. To aid hardware architects and application developers in effectively tuning performance on GPUs, we have developed the NUPAR benchmark suite. The NUPAR applications belong to a number of different scientific and commercial computing domains. These benchmarks exhibit a range of GPU computing characteristics that consider memory-bandwidth limitations, device occupancy and resource utilization, synchronization latency, and device-specific compute optimizations. The NUPAR applications are specifically designed to stress new hardware and software features that include: nested parallelism, concurrent kernel execution, shared host-device memory, and new instructions for precise computation and data movement. In this paper, we focus our discussion on applications developed in CUDA and OpenCL, targeting high-end server-class GPUs. We describe these benchmarks and evaluate their interaction with different architectural features on a GPU. Our evaluation examines the behavior of the advanced hardware features on recently-released GPU architectures.
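
To make one of these features concrete: concurrent kernel execution lets independent kernels share the device. The C++/OpenCL host sketch below is a hand-written illustration, not code from the NUPAR suite; the kernel source, buffer sizes, and scale factors are placeholders. It enqueues two independent kernels on separate command queues, leaving the runtime free to overlap their execution.

    // Sketch: two independent kernels on separate command queues, so the
    // device may run them concurrently (error handling omitted for brevity).
    #include <CL/cl.h>
    #include <cstdio>

    static const char* kSrc =
        "__kernel void scale(__global float* d, float f) {"
        "  size_t i = get_global_id(0); d[i] *= f; }";

    int main() {
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
        cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);

        // Kernels submitted to different in-order queues have no ordering
        // constraint between them, so the runtime is free to overlap them
        // if the device has spare resources.
        cl_command_queue q0 = clCreateCommandQueue(ctx, device, 0, nullptr);
        cl_command_queue q1 = clCreateCommandQueue(ctx, device, 0, nullptr);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
        clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
        cl_kernel k0 = clCreateKernel(prog, "scale", nullptr);
        cl_kernel k1 = clCreateKernel(prog, "scale", nullptr);

        const size_t n = 1 << 20;  // placeholder problem size
        cl_mem b0 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), nullptr, nullptr);
        cl_mem b1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), nullptr, nullptr);
        float f0 = 2.0f, f1 = 0.5f;
        clSetKernelArg(k0, 0, sizeof(cl_mem), &b0); clSetKernelArg(k0, 1, sizeof(float), &f0);
        clSetKernelArg(k1, 0, sizeof(cl_mem), &b1); clSetKernelArg(k1, 1, sizeof(float), &f1);

        clEnqueueNDRangeKernel(q0, k0, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
        clEnqueueNDRangeKernel(q1, k1, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
        clFinish(q0); clFinish(q1);  // both complete; order between them unspecified
        std::printf("done\n");
        return 0;
    }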

DOI: 10.1145/2668930.2688046

Full text: PDF


Automated Workload Characterization for I/O Performance Analysis in Virtualized Environments

Authors:

Axel Busch (Karlsruhe Institute of Technology)
Qais Noorshams (Karlsruhe Institute of Technology)
Samuel Kounev (University of Würzburg)
Anne Koziolek (Karlsruhe Institute of Technology)
Ralf Reussner (Karlsruhe Institute of Technology)
Erich Amrehn (IBM Research & Development)

Abstract:

Next-generation IT infrastructures are highly driven by virtualization technology, which enables flexible and efficient resource sharing that improves system agility and reduces the costs of IT services. Due to the sharing of resources and the increasing I/O-processing requirements of modern applications, the performance of storage systems is becoming a crucial factor. In particular, when migrating or consolidating different applications, the impact on their performance behavior is often an open question. Performance modeling approaches help to answer such questions; a prerequisite, however, is a workload characterization that is both easy to obtain from applications and sufficient to capture their important characteristics. In this paper, we present an automated workload characterization approach that extracts a workload model representing the main aspects of I/O-intensive applications in virtualized environments using relevant workload parameters, e.g., request size and read/write ratio. Once extracted, workload models can be used to emulate the performance behavior of the workload in real-world scenarios such as migration and consolidation. We demonstrate our approach in the context of two case studies of representative system environments. We present an in-depth evaluation of our workload characterization approach, showing its effectiveness in workload migration and consolidation scenarios. We use an IBM System z equipped with an IBM DS8700 and a Sun Fire system as state-of-the-art virtualized environments. Overall, the evaluation of our workload characterization approach shows promising results in capturing the relevant factors of I/O-intensive applications.
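
As a rough illustration of the idea (this is not the authors' tooling; the trace format and field names are assumptions), a workload model built from the two parameters named above can be extracted from an observed request trace as follows:

    // Sketch: characterize an I/O trace by mean request size and read ratio.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct IoRequest {
        std::uint32_t size_bytes;  // request size
        bool is_read;              // read or write
    };

    struct WorkloadModel {
        double mean_request_size = 0.0;  // bytes
        double read_ratio = 0.0;         // fraction of requests that are reads
    };

    // Extract the model parameters from an observed trace of I/O requests.
    WorkloadModel characterize(const std::vector<IoRequest>& trace) {
        WorkloadModel m;
        if (trace.empty()) return m;
        std::size_t reads = 0;
        double total = 0.0;
        for (const IoRequest& r : trace) {
            total += r.size_bytes;
            if (r.is_read) ++reads;
        }
        m.mean_request_size = total / trace.size();
        m.read_ratio = static_cast<double>(reads) / trace.size();
        return m;  // parameters can now drive an emulated replay of the workload
    }

    int main() {
        std::vector<IoRequest> trace = {{4096, true}, {8192, false}, {4096, true}};
        WorkloadModel m = characterize(trace);
        std::printf("mean size: %.0f B, read ratio: %.2f\n",
                    m.mean_request_size, m.read_ratio);
        return 0;
    }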

DOI: 10.1145/2668930.2688050

Full text: PDF


Can Portability Improve Performance? An Empirical Study of Parallel Graph Analytics

Authors:

Ana Lucia Varbanescu (University of Amsterdam)
Merijn Verstraaten (University of Amsterdam)
Cees de Laat (University of Amsterdam)
Ate Penders (Delft University of Technology)
Alexandru Iosup (Delft University of Technology)
Henk Sips (Delft University of Technology)

Abstract:

Due to increasingly large datasets, graph analytics (traversals, all-pairs shortest path computations, centrality measures, etc.) are becoming the focus of high-performance computing (HPC). Because HPC is currently dominated by many-core architectures (both CPUs and GPUs), new graph processing solutions have to be defined to use such computing resources efficiently. Prior work focuses on platform-specific performance studies and on platform-specific algorithm development, successfully proving that algorithms highly tuned to GPUs or multi-core CPUs can provide high-performance graph analytics. However, the portability of such algorithms remains an important concern for many users, especially the many companies without the resources to invest in HPC or concerned about lock-in to single-use parallel techniques. In this work, we investigate the functional portability and performance of graph analytics algorithms. We conduct an empirical study measuring the performance of 3 graph analytics algorithms (a single code base implemented in OpenCL and targeted at many-core CPUs and GPUs) on 3 different platforms, using 11 real-world and synthetic datasets. Our results show that the code is functionally portable, that is, the applications can run unchanged on both CPUs and GPUs. The large variation in their observed performance indicates that portability is necessary not only for productivity but, surprisingly, also for performance. We conjecture that the impact of datasets on performance is too high to allow platform-specific algorithms to outperform portable algorithms by large margins in all cases. Our conclusion is that portable parallel graph analytics is feasible without significant performance loss, and provides a productive alternative to the expensive trial-and-error selection of one algorithm for each (graph, platform) pair.
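
What "functionally portable" means in practice: the same OpenCL kernel source runs unchanged on a CPU or a GPU, and only the host's device-type flag changes. The C++ sketch below is a hand-written illustration of that property, not the authors' benchmark code; the kernel and sizes are placeholders.

    // Sketch: one binary, two targets ("./portable cpu" or "./portable gpu").
    // The kernel is identical in both cases; only the device selection differs.
    #include <CL/cl.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    static const char* kSrc =
        "__kernel void inc(__global int* d) { d[get_global_id(0)] += 1; }";

    int main(int argc, char** argv) {
        cl_device_type type = (argc > 1 && std::string(argv[1]) == "cpu")
                                  ? CL_DEVICE_TYPE_CPU : CL_DEVICE_TYPE_GPU;

        cl_uint np = 0;
        clGetPlatformIDs(0, nullptr, &np);
        std::vector<cl_platform_id> platforms(np);
        clGetPlatformIDs(np, platforms.data(), nullptr);

        cl_device_id dev = nullptr;  // first device of the requested type
        for (cl_platform_id p : platforms)
            if (clGetDeviceIDs(p, type, 1, &dev, nullptr) == CL_SUCCESS) break;
        if (!dev) { std::fprintf(stderr, "no such device\n"); return 1; }

        cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
        clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);  // JIT for this device
        cl_kernel k = clCreateKernel(prog, "inc", nullptr);

        const size_t n = 1024;
        std::vector<int> host(n, 0);
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(int), host.data(), nullptr);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(int), host.data(),
                            0, nullptr, nullptr);
        std::printf("host[0] = %d\n", host[0]);  // 1 on CPU and GPU alike
        return 0;
    }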

DOI: 10.1145/2668930.2688042

Full text: PDF


Utilizing Performance Unit Tests to Increase Performance Awareness

Authors:

Vojtěch Horký (Charles University)
Peter Libič (Charles University)
Lukáš Marek (Charles University)
Antonín Steinhauser (Charles University)
Petr Tůma (Charles University)

Abstract:

Many decisions taken during software development impact the resulting application performance. The key decisions whose potential impact is large are usually weighed carefully. In contrast, the same care is not applied to the many decisions whose individual impact is likely to be small, simply because the costs would outweigh the benefits. Developer opinion is the common deciding factor in these cases, and our goal is to provide developers with information that helps form such opinion, thus preventing the performance loss caused by the accumulated effect of many poor decisions. Our method turns performance unit tests into recipes for generating performance documentation: when the developer selects an interface and a workload of interest, the relevant performance documentation is generated interactively. This increases performance awareness; with performance information available alongside standard interface documentation, developers should find it easier to make informed decisions even in situations where an expensive performance evaluation is not practical. We demonstrate the method on multiple examples, which show how equipping code with performance unit tests works in practice.
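
As a rough sketch of what a performance unit test can look like (written here in C++ purely for illustration; this is not the authors' tooling, and the measured function and the 1 ms budget are invented), such a test measures one interface under a declared workload and emits an observation from which documentation can be generated:

    // Sketch: a performance unit test that times one interface under a
    // declared workload and both checks a budget and records the result.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Code under test: the interface whose performance we want documented.
    static long sum(const std::vector<long>& v) {
        long s = 0;
        for (long x : v) s += x;
        return s;
    }

    int main() {
        using clock = std::chrono::steady_clock;
        const std::size_t workload = 1000000;   // declared workload size
        std::vector<long> data(workload, 1);

        // Warm-up run so caches and branch predictors settle before timing.
        volatile long sink = sum(data);

        auto t0 = clock::now();
        sink = sum(data);
        auto t1 = clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();

        // The observation feeds documentation, e.g.
        // "sum over 1e6 elements: <measured> us on the test machine".
        std::printf("sum/%zu elements: %.1f us\n", workload, us);
        return us < 1000.0 ? 0 : 1;  // fail if the assumed 1 ms budget is exceeded
    }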

DOI: 10.1145/2668930.2688051

Full text: PDF


On the Road to Benchmarking BPMN 2.0 Workflow Engines

Authors:

Marigianna Skouradaki (University of Stuttgart)
Dieter H. Roller (University of Stuttgart)
Frank Leymann (University of Stuttgart)
Vincenzo Ferme (University of Lugano)
Cesare Pautasso (University of Lugano)

Abstract:

Workflow Management Systems (WfMSs) provide platforms for delivering complex service-oriented applications that need to satisfy enterprise-grade quality-of-service requirements such as dependability and scalability. In this paper we focus on benchmarking the performance of Workflow Engines, the core of WfMSs, that are compliant with the Business Process Model and Notation 2.0 (BPMN 2.0) standard. We first explore the main challenges that need to be met when designing such a benchmark and describe the approaches we designed for tackling them in the BenchFlow project. We discuss our approach to distilling the essence of real-world processes into processes for the benchmark, and to ensuring that the benchmark finds wide applicability.

DOI: 10.1145/2668930.2695527

Full text: PDF
