Wednesday, 13 December 2017

Session 3: Performance Measurements and Experimental Analysis

Accurate and Efficient Object Tracing for Java Applications

Authors:

Philipp Lengauer (Johannes Kepler Universität Linz)
Verena Bitto (Johannes Kepler Universität Linz)
Hanspeter Mössenböck (Johannes Kepler Universität Linz)

Abstract:

Object allocations and garbage collection can have a considerable impact on the performance of Java applications. Without monitoring tools, such performance problems are hard to track down, and if such tools are applied, they often cause a significant overhead and tend to distort the behavior of the monitored application. In this paper, we present a new light-weight memory monitoring approach in which we trace allocations, deallocations and movements of objects using VM-specific knowledge. We strive for utmost compactness of the trace by using a binary format with optimized encodings for different cases of memory events and by omitting all information that can be reconstructed offline when the trace is processed. Our approach allows us to reconstruct the heap for any point in time and to do offline analyses both on the heap and on the trace. We evaluated our tracing technique with more than 30 benchmarks from the DaCapo 2009, the DaCapo Scala, the SPECjvm 2008, and the SPECjbb 2005 benchmark suites. The average run-time overhead is 4.68%, which seems low enough to keep tracing switched on even in production mode.
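
A minimal Java sketch of the kind of tag-plus-varint event encoding the abstract alludes to is shown below; the event set, field layout, and names (TraceWriter, TAG_ALLOC, and so on) are illustrative assumptions, not the trace format defined in the paper.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical compact binary trace writer: one tag byte per event plus
// variable-length (varint) operands. Illustrative only; the paper's actual
// VM-internal format and event set differ.
final class TraceWriter {
    private static final int TAG_ALLOC = 0x01;  // object allocated
    private static final int TAG_MOVE  = 0x02;  // object moved by the GC
    private static final int TAG_FREE  = 0x03;  // object reclaimed

    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();

    void alloc(int allocationSiteId, int typeId, int size) {
        buf.write(TAG_ALLOC);
        writeVarInt(allocationSiteId);
        writeVarInt(typeId);
        writeVarInt(size);
    }

    void move(long fromAddr, long toAddr) {
        buf.write(TAG_MOVE);
        writeVarLong(fromAddr);
        writeVarLong(toAddr);
    }

    void free(long addr) {
        buf.write(TAG_FREE);
        writeVarLong(addr);
    }

    // Unsigned LEB128-style encoding: small values take a single byte.
    private void writeVarInt(int v) { writeVarLong(v & 0xFFFFFFFFL); }

    private void writeVarLong(long v) {
        while ((v & ~0x7FL) != 0) {
            buf.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        buf.write((int) v);
    }

    byte[] snapshot() { return buf.toByteArray(); }
}
```

With such an encoding, small allocation-site and type identifiers occupy a single byte each, which is the same compactness argument the abstract makes for its optimized event encodings.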

DOI: 10.1145/2668930.2688037


Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures

Authors:

Thomas R. W. Scogland (Virginia Tech)
Wu-chun Feng (Virginia Tech)

Abstract:

As core counts increase and as heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly efficient concurrent algorithms and data structures can become bottlenecks unless they are designed from the ground up with throughput as their primary goal. In this paper, we present three contributions: (1) a characterization of queue designs in terms of modern multi- and many-core architectures, (2) the design of a high-throughput, linearizable, blocking, concurrent FIFO queue for many-core architectures that avoids the bottlenecks and pitfalls common in modern queue designs, and (3) a thorough evaluation of concurrent queue throughput across CPU, GPU, and co-processor devices. Our evaluation shows that focusing on throughput, rather than progress guarantees, allows our queue to outperform lock-free and combining queues by as much as three orders of magnitude (1000x) on GPU platforms and by a factor of two (2x) on CPU devices. These results deliver critical insights into the design of data structures for highly concurrent systems: (1) progress guarantees do not guarantee scalability, and (2) allowing an algorithm to block can increase throughput.
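
As a rough illustration of the throughput-over-progress-guarantees argument, the following Java sketch shows a ticket-based, bounded, blocking FIFO queue in which producers and consumers spin on per-slot sequence numbers; it is a generic CPU-side sketch, not the many-core queue designed and evaluated in the paper.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative ticket-based bounded FIFO queue. Each operation takes a ticket
// with getAndIncrement() and then spins (blocks) until "its" slot is ready,
// trading progress guarantees for throughput.
final class TicketQueue<T> {
    private final int capacity;                  // must be a power of two
    private final AtomicReferenceArray<T> slots;
    private final AtomicLongArray seq;           // per-slot sequence numbers
    private final AtomicLong enqTicket = new AtomicLong();
    private final AtomicLong deqTicket = new AtomicLong();

    TicketQueue(int capacityPowerOfTwo) {
        capacity = capacityPowerOfTwo;
        slots = new AtomicReferenceArray<>(capacity);
        seq = new AtomicLongArray(capacity);
        for (int i = 0; i < capacity; i++) seq.set(i, i); // slot i accepts enqueue ticket i
    }

    void put(T item) {
        long ticket = enqTicket.getAndIncrement();
        int slot = (int) (ticket & (capacity - 1));
        while (seq.get(slot) != ticket) Thread.onSpinWait();     // wait until slot is free
        slots.set(slot, item);
        seq.set(slot, ticket + 1);                               // publish: ready for dequeue
    }

    T take() {
        long ticket = deqTicket.getAndIncrement();
        int slot = (int) (ticket & (capacity - 1));
        while (seq.get(slot) != ticket + 1) Thread.onSpinWait(); // wait until item is published
        T item = slots.get(slot);
        slots.set(slot, null);                                   // clear before recycling
        seq.set(slot, ticket + capacity);                        // slot reusable on the next lap
        return item;
    }
}
```

Blocking here means that a producer whose slot is still occupied simply spins until a consumer frees it; there is no lock-freedom, but every operation is one ticket fetch plus a single publishing store, which is where the throughput argument comes from.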

DOI: 10.1145/2668930.2688048


Lightweight Java Profiling with Partial Safepoints and Incremental Stack Tracing

Authors:

Peter Hofer (Johannes Kepler Universität Linz)
David Gnedt (Johannes Kepler Universität Linz)
Hanspeter Mössenböck (Johannes Kepler Universität Linz)

Abstract:

Sampling profilers are popular because of their low and adjustable overhead and because they do not distort the profile by modifying the application code. A typical sampling profiler periodically suspends the application threads, walks their stacks, and merges the resulting stack traces into a calling context tree. Java virtual machines offer a convenient interface to accomplish this, but rely on safepoints, a synchronization mechanism that requires all threads to park in a safe location. However, a profiler is primarily interested in the running threads, and waiting for all threads to reach a safe location significantly increases the overhead. In most cases, taking a complete stack trace is also unnecessary because many stack frames remain unchanged between samples. We present three techniques that reduce the overhead of sampling Java applications. Partial safepoints require only a certain number of threads to enter a safepoint and can be used to sample only the running threads. With self-sampling, we parallelize taking stack traces by having each thread take its own stack trace. Finally, incremental stack tracing constructs stack traces lazily and examines each stack frame only once instead of walking the entire stack for each sample. Our techniques require no support from the operating system or hardware. With our implementation in the popular HotSpot virtual machine, we show that we can significantly reduce the overhead of sampling without affecting the accuracy of the profiles.
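
The merging step can be illustrated in plain Java: the sketch below samples only RUNNABLE threads and folds their stack traces into a calling context tree. The class and method names are assumptions, and the paper's actual techniques (partial safepoints, self-sampling, incremental stack tracing) operate inside the VM; indeed, Thread.getAllStackTraces() itself stops all threads at a global safepoint, which is precisely the cost the paper avoids.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch: periodically take stack traces of RUNNABLE threads only
// and merge them into a calling context tree (CCT). Only the merging step of
// a sampling profiler is mirrored here.
final class CctSampler {
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        long samples;
    }

    private final Node root = new Node();

    void sampleOnce() {
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            if (e.getKey().getState() != Thread.State.RUNNABLE) continue; // skip parked threads
            merge(e.getValue());
        }
    }

    private void merge(StackTraceElement[] frames) {
        Node node = root;
        // Walk from the bottom frame (main/run) towards the top so shared
        // prefixes of the call chain map to shared CCT paths.
        for (int i = frames.length - 1; i >= 0; i--) {
            String key = frames[i].getClassName() + "." + frames[i].getMethodName();
            node = node.children.computeIfAbsent(key, k -> new Node());
        }
        node.samples++;
    }

    Node tree() { return root; }
}
```

In a real profiler, sampleOnce() would be driven by a periodic timer (for example a ScheduledExecutorService), and the per-node sample counts would be reported at the end of the run.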

DOI: 10.1145/2668930.2688038


Sampling-based Steal Time Accounting under Hardware Virtualization

Authors:

Peter Hofer (Johannes Kepler Universität Linz)
Florian Hörschläger (Johannes Kepler Universität Linz)
Hanspeter Mössenböck (Johannes Kepler Universität Linz)

Abstract:

Virtualization enables the efficient sharing of hardware resources among multiple virtual machines (VMs). Because the physical resources are limited, the scheduler must often suspend one VM to allow some other VM to run. The operating system in a VM is typically unaware of the suspension and accounts periods of suspension as CPU time to the executing application thread. This misrepresentation of resource usage makes it difficult to tell whether a performance problem is caused by an actual bottleneck in the application or by the virtualization infrastructure. We present a novel approach to compute to what degree the threads of an application in a virtual machine are affected by suspension. Our approach does not require any modifications to the operating system or to the virtualization software. It periodically samples the system-wide amount of “steal time” that is reported by the virtualization infrastructure, and divides it among the monitored threads according to their CPU usage. With a prototype implementation, we demonstrate that our approach accounts accurate amounts of steal time to application threads, that it can be used to compute the true resource usage of an application, and that it incurs only negligible performance overhead.
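
The proportional attribution step can be sketched in plain Java using ThreadMXBean for per-thread CPU time; how the system-wide steal time is obtained (for example, the "steal" column of /proc/stat in a Linux guest) is deliberately left as an assumed supplier, and none of the names below are an API from the paper.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Sketch of proportional steal-time attribution: the system-wide steal-time
// delta observed in a sampling interval is split among threads according to
// their CPU-time deltas in the same interval.
final class StealTimeAccounting {
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    private final LongSupplier readSteal;                    // cumulative steal time, nanoseconds
    private final Map<Long, Long> lastCpu = new HashMap<>(); // thread id -> cumulative CPU ns
    private final Map<Long, Long> stolen = new HashMap<>();  // thread id -> attributed steal ns
    private long lastSteal;

    StealTimeAccounting(LongSupplier readSteal) {
        this.readSteal = readSteal;
        this.lastSteal = readSteal.getAsLong();
    }

    void sample() {
        long steal = readSteal.getAsLong();
        long stealDelta = steal - lastSteal;
        lastSteal = steal;

        // Per-thread CPU-time deltas for this interval.
        Map<Long, Long> cpuDelta = new HashMap<>();
        long totalCpuDelta = 0;
        for (long id : threads.getAllThreadIds()) {
            long cpu = threads.getThreadCpuTime(id);
            if (cpu < 0) continue;                           // thread died or measurement disabled
            long delta = cpu - lastCpu.getOrDefault(id, cpu);
            lastCpu.put(id, cpu);
            cpuDelta.put(id, delta);
            totalCpuDelta += delta;
        }
        if (stealDelta <= 0 || totalCpuDelta == 0) return;

        // Attribute the steal delta proportionally to each thread's CPU share.
        for (Map.Entry<Long, Long> e : cpuDelta.entrySet()) {
            long share = stealDelta * e.getValue() / totalCpuDelta;
            stolen.merge(e.getKey(), share, Long::sum);
        }
    }

    Map<Long, Long> stealPerThread() { return stolen; }
}
```

For instance, if a 100 ms sampling interval reports 20 ms of steal and two monitored threads consumed 60 ms and 20 ms of CPU time, they are charged 15 ms and 5 ms of steal time, respectively.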

DOI: 10.1145/2668930.2695524


Landscaping Performance Research at the ICPE and its Predecessors: A Systematic Literature Review

Authors:

Alexandru Danciu (fortiss GmbH)
Johannes Kroß (fortiss GmbH)
Andreas Brunnert (fortiss GmbH)
Felix Willnecker (fortiss GmbH)
Christian Vögele (fortiss GmbH)
Anand Kapadia (Technische Universität München)
Helmut Krcmar (Technische Universität München)

Abstract:

This paper presents a systematic literature review of papers published in the proceedings of the International Conference on Performance Engineering (ICPE) and its predecessors. It provides an overview of prevailing topics within the community over time and examines the research and contribution facets used to address these topics. Trends in the evaluation methods used to validate contributions are outlined. The results are complemented with a geographical and organizational dimension: the paper concludes with an overview of the top ten contributing countries and organizations.

DOI: 10.1145/2668930.2688039
