The debugging tool is a significant milestone in LLNL's multi-year collaboration with the University of Wisconsin (UW), Madison and the University of New Mexico (UNM) to ensure supercomputers run more efficiently.
Playing a significant role in scaling up the Sequoia supercomputer, STAT, a 2011 R&D 100 Award winner, has helped both early access users and system integrators quickly isolate a wide range of errors, including particularly perplexing issues that only manifested at extremely large scales up to 1,179,648 compute cores. During the Sequoia scale-up, bugs in applications as well as defects in system software and hardware have manifested themselves as failures in applications. It is important to quickly diagnose errors so they can be reported to experts who can analyze them in detail and ultimately solve the problem.
"STAT has been indispensable in this capacity, helping the multi-disciplined integration team keep pace with the aggressive system scale-up schedule," said LLNL computer scientist Greg Lee.
"While testing a subsystem of Blue Gene/Q, my test program consistently failed only when scaled to 1,179,648 MPI processes. Although the test program was simple, the sheer scale at which this program ran made debugging efforts highly challenging. But when I applied STAT, it quickly revealed that one particular rank process was consistently stuck in a system call," said Dong Ahn, a computer scientist in Livermore Computing.
Based on this finding, a system expert took a close look at the compute core on which this rank process was running and discovered a hardware defect. "Replacing the component suddenly got the entire Sequoia system back to life," Ahn said. "Putting this exercise into perspective, this error was due to a defect in a tiny hardware unit, the decrementor, of a single hardware thread out of a total of 4.7 million hardware threads. I felt it was like finding a needle in a haystack over a coffee break."
Sequoia delivers 20 petaflops of peak performance and was ranked No. 1 on the TOP500 list in June of this year. It is currently ranked No. 2, behind Oak Ridge National Laboratory's Titan.
LLNL plans to use Sequoia's impressive computational capability to advance understanding of fundamental physics and engineering questions that arise in the National Nuclear Security Administration's (NNSA) program to ensure the safety, security and effectiveness of the United States' nuclear deterrent without testing. Sequoia also will support NNSA/DOE programs at LLNL that focus on nonproliferation, counterterrorism, energy, security, health and climate change.
As LLNL takes delivery of the Sequoia system and works to move it into production, computer scientists will migrate applications that have been running on earlier systems to this newer architecture. This is a period of intense activity for LLNL's application teams as they gain experience with the new hardware and software environment.
"Having a highly effective debugging tool that scales to the full system is vital to the installation and acceptance process for Sequoia. It is critical that our development teams have a comprehensive parallel debugging tool set as they iron out the inevitable issues that come up with running on a new system like Sequoia," said Kim Cupps, leader of the Livermore Computing Division at LLNL.
STAT is particularly important for LLNL because supercomputer simulations are essential in virtually every mission area of the Laboratory. The tool has also been used at other sites and has proved effective on a wide range of supercomputer platforms, including Linux clusters and Cray systems.
The team is actively pursuing further optimization of STAT technologies and is exploring commercialization strategies. More information about STAT, including a link to the source code, is available on the Web.
More Information
LLNL news release, Nov. 9, 2012
"Venturing into the heart of high-performance computing simulations"
Anne Stark | EurekAlert!