Theme for HPM 2015:

The first workshop in the series will focus on appropriate power metrics for all stakeholders: procurement managers, facilities and infrastructure personnel, IT personnel, and computing resource users. Stakeholders currently lack consensus on the most relevant and appropriate power metrics for data centers. This workshop will provide a forum for the HPC community to hold a meaningful discussion and find common ground on power metrics.
The 2015 workshop will include a keynote talk and cover three areas of interest: Power Metrics, Monitoring Tools, and Future Technologies.

Monday, September 21, 2015

7:30pm-9:00pm Working Reception: Discussion of Potential Questions for Panelists

Tuesday, September 22, 2015

7:30am-8:15am Breakfast (45min)

8:15am-8:30am Opening Remarks (15min)
Neena Imam, Oak Ridge National Laboratory

8:30am-9:15am Keynote Talk (45min)

Metrics for HPC Data Center Power Proportionality and Efficiency
Natalie Bates, Chair of the Energy Efficient HPC Working Group

9:15am-9:30am Coffee Break (15min)

9:30am-11:30am Session 1: Metrics (2hr)

Area 1: Power Metrics: Representatives from large data center facilities will discuss their power requirements and relevant metrics. We will bring together leaders from large data center facilities to discuss the power requirements that are the deciding factors in procurement decisions, data center management, DoD and DoE mission needs, and end-user satisfaction. These speakers will present their definitions of power usage effectiveness. The focus of the discussion will be to answer the key question, "is there a be-all and end-all metric for data centers?"

Optimizing High Performance Compute by Identifying Trapped Power Capacity
Craig Barclay, DoD
Abstract: High-performance computing is growing exponentially. Infrastructure power for an HPC system is designed and provisioned to meet the required computational power to support the full capability of each rack. The workload of an HPC system uses its components dynamically, and as such it under-utilizes allocated power resources, which increases the installation and operating costs of the data center. This under-utilization is known as trapped capacity. Trapped capacity is not a widely known term and is often confused with stranded capacity. This presentation highlights the differences between these two phenomena, describes their effects on data center management, and gives a perspective on how to control them.

Designing an HPC Facility for Modern Systems
Jim Rogers, ORNL
Abstract: In November 2014, Oak Ridge National Laboratory (ORNL) announced a subcontract with IBM for Summit, a 150PF 10MW supercomputer that presents significant challenges and opportunities. ORNL currently operates Titan, a 27PF Cray XK7 that has remained the fastest system in the United States for nearly three years. Summit will eventually supersede Titan, but in an operating environment that will dramatically shift. Gone are the traditional chillers, cold water temperatures and low flow rates. In their place, ORNL will provision a facility that can provide up to 20MW of electrical distribution capacity in a stepwise manner, 20C supply temperatures, and significant reductions in operating costs through evaporative cooling. This talk will describe the anticipated mechanical and electrical designs, the control systems, and the measurement systems not only for Summit, but also for the facility as a whole. 

HPC Utility Metrics and Facility Controls
Thomas Durbin, NCSA
Abstract: HPC facility electricity consumption is affected by computing system utilization and scheduling. NCSA metrics reporting includes electricity consumption and cooling equipment utilization metrics. Implementation of building automation systems enables HPC facilities to minimize electricity consumption and improve metrics such as PUE. Control of flow rates and differential pressure as well as cooling assets improves system efficiency and reduces energy cost. Integration of building controls with HPC system controls may provide energy cost savings and system management opportunities, enabling facility managers to work with utility providers to minimize effects of large load swings and control peak demand.

Trends in HPC Power Metrics and where to from here?
Ramkumar Nagappan, Intel
Abstract: This talk will explore current trends in power metrics such as PUE, ITUE, Green500, and other energy efficiency metrics. The talk will also review system and CPU energy efficiency trends from 2012 to 2018 and how these trends can inform future power capacity planning. It will also discuss today's challenges in measuring energy efficiency metrics, which stem from a multitude of factors including insufficient infrastructure instrumentation, lack of awareness of existing IT-based instrumentation, and gaps in procurement processes that should drive the right behavior. We will discuss new and upcoming usage models, such as peak power shedding and running jobs under power limit and power ramp controls, and whether these create the need for new power metrics. To close, the talk will propose new directions and opportunities for developing metrics that drive the right behaviors: lowering TCO and increasing HPC efficiency.

11:30am-12:30pm Lunch Break (60min)

12:30pm-2:30pm Session 2: Monitoring Tools (2hr)

Area 2: Monitoring Tools: Representatives from industry will discuss the monitoring, modeling, and evaluation tools that are available today to model the entire energy stack. Can these monitoring tools make the power consumption behavior of large-scale data centers transparent to the user? This will be an excellent forum for industry to get feedback from data center managers and users regarding the adequacy of the available tools and existing gaps in technology.

Power Monitoring on High Performance Computers
Benjamin Payne, DoD
Abstract: Electrical power monitoring is the first necessary step towards understanding utilization and enabling better use of infrastructure and IT resources. The number of DoD HPCs is increasing, and all require live power monitoring so that key decision makers can make informed decisions. Power monitoring solutions vary by HPC vendor. This presentation provides the background on the work performed by DoD personnel to provide power monitoring data, highlights the various techniques to obtain power data, and suggests a statement of work template for power monitoring to include in an HPC RFP.

Monitoring and Improving Application Energy Efficiency
Mark O'Connor, Allinea
Abstract: For the past two years Allinea has been working with teams around the world on the role of the application in energy consumption. This talk outlines the findings so far in moving from pure system-level monitoring to application-level monitoring, and the integration of energy profiling and optimization in Allinea's development and benchmarking tools. Ongoing international projects on energy-centric workload scheduling with SLURM and autotuning compilers will also be presented.

POWER8 On Chip Controller - Measuring and Managing Power Consumption
Todd Rosedahl, IBM
Abstract: The On Chip Controller (OCC) is a co-processor that is embedded directly on the main POWER8 processor die. The OCC can be used to control the processor frequency, power consumption, and temperature in order to maximize performance and minimize energy usage. Additionally, a rich set of sensor data, including power, temperature, and performance indicators, is collected by the OCC and made available for external consumption. This presentation will include an overview of the power, thermal, and performance data that the OCC can access as well as the various control knobs, including adjusting the processor frequency and memory bandwidth. Details about the OCC processor, firmware structure, loop timings, off-load engines, and bus accesses will be given along with descriptions of sensor data.

Experiences Developing and Deploying Per Node Power Monitoring at Scale
Phil Pokorny, Penguin Computing
Abstract: This presentation covers the design goals, design issues, and deployment of a system for monitoring power inside multiple compute nodes in an HPC cluster. Specifics of Penguin Computing's Power Insight hardware versions 1.0 and 2.1 will be discussed. Ideas for future development of the Power Insight product will also be presented.

2:30pm-2:45pm Coffee Break (15min)

2:45pm-4:45pm Session 3: Future Technologies (2hr)

Area 3: Future Technologies: Since there is interest in power awareness at the facility level, we will discuss how the current state of job scheduling research can naturally be extended to address some of the concerns of facility managers. Until customers can implement methods for more accurate power estimation, or a more flexible facility power management model becomes available, the job scheduler can facilitate efficient system power usage by scheduling jobs for maximum power utilization.

Intelligent Job Scheduling: Applying Metrics and Analytics to Resource Management
Greg Koenig, ORNL
Abstract: For the past five years, Oak Ridge National Laboratory has been engaged in a project to develop an Intelligent Job Scheduling system. This project combines research in data movement power models, predictive analytics for anticipating upcoming workloads, machine learning techniques for task identification and classification, and scheduling heuristics that understand tradeoffs between the time-decaying value of a job and the energy required to complete the job on heterogeneous resources. The capabilities of such a system are becoming critical for managing facility-wide needs such as stranded capacity, trapped capacity, and the demand-response requirements of energy service providers. This talk will describe the current state of the art in the Intelligent Job Scheduling project, challenges and opportunities encountered by the project, and future directions for the work.

From Facility to Component Level - A High Performance Computing Power Application Programming Interface
James H. Laros III, SNL
Abstract: Power measurement and control capabilities are essential from the facility level down to the individual component level in order to achieve practical, efficient exascale computing. Coarse-level guarantees of maximum power and energy consumption can be enforced by platform-level power caps, but maximizing computational productivity within facility-established boundaries will require more fine-grained approaches such as power/energy-aware scheduling and runtime and/or application involvement. Multi-level power/energy awareness of this kind requires interfaces at each of these levels that are appropriate for use by each individual level or interface. The High Performance Computing - Power Application Programming Interface Specification strives to standardize power measurement and control from facility to component level to address the challenges of energy efficient computing.

Is Green Exascale Computing an Oxymoron?
Wu-chun Feng, Virginia Tech
Abstract: Advances in system architecture - both in hardware and software - for high-performance computing (HPC) have enabled the scientific community to run increasingly accurate (or higher fidelity) simulations. However, the traditional post-processing approach to visualization, where large amounts of data are written to and read from disks, will not scale in the originally projected 2015 exascale time frame due to power and storage constraints. Both of these constraints arise from a cost perspective, i.e., the cost to power a 20-MW supercomputer and the cost to purchase sufficient storage to store such large volumes of data. As a result, these constraints have led to the desire to minimize the "cost per insight" from an exascale visualization. This talk seeks to address the above constraints by exposing the most power-consuming and energy-consuming aspects of large-scale systems and addressing them via intelligent application of visualization under specific power or energy constraints. Viewed from another perspective, we (loosely) seek to do for power and energy what "The Case of the Missing Supercomputer Performance" from DOE did for performance.

Performance and Power of Emerging Heterogeneous Architectures
Ke Wang, University of Virginia
Abstract: Heterogeneous architectures show great potential to improve performance and power efficiency by supplementing general-purpose processors with specialized processors that accelerate specific tasks. In addition to GPUs, which target data-parallel (SIMD or SPMD) computations, other types of accelerators, such as the Automata Processor (AP) and the FPGA, are emerging to accelerate additional problem types. This talk will describe these accelerators and give examples of successful applications on the AP and FPGA. We will also briefly discuss implications of 3D IC techniques for future heterogeneous architectures. 3D integration offers benefits in performance, power efficiency, and manufacturing, but poses challenges in thermal control and power delivery.

4:45pm-5:30pm Panel Discussion (45min)
Moderator: Chung-Hsing Hsu, Oak Ridge National Laboratory