Vitus Leung

Machine Learning Helps Diagnose Supercomputer Problems

Dec. 5, 2017
Engineers are leveraging machine learning to both uncover problems with supercomputers and fix them, all without human intervention.

Computer scientists and engineers from Sandia National Laboratories and Boston University recently earned the Gauss Award at the International Supercomputing Conference. They were honored for their work on using machine learning to automatically diagnose, and potentially fix, problems in supercomputers.

It turns out that supercomputers, which are relied on for everything from forecasting the weather to cancer research to ensuring U.S. nuclear weapons are safe and reliable, can have bad days. They contain a complex collection of interconnected parts and processes that can go wrong. For example, parts can break, previous programs can leave “zombie processes” running that gum up the works, network traffic can cause bottlenecks, or a computer code revision can instigate problems. These problems often result in programs not running to completion and wasting valuable supercomputer time.

So the team came up with a list of issues they have encountered when working with supercomputers and then wrote code to re-create those problems, or anomalies. They ran a variety of programs with and without the anomaly codes on two systems: a supercomputer at Sandia and a public cloud system operated by Boston University.

While the programs were running, researchers collected data on each run, monitoring how much energy, processor power, and memory each node used. Monitoring more than 700 criteria used less than 0.005% of the supercomputer’s processing power. This is where machine learning comes in.

Machine learning is a broad collection of computer algorithms that find patterns without being explicitly told which features are important. The team wrote several machine learning algorithms that detect anomalies by comparing data from normal program runs with data from runs that include anomalies. They then tested the algorithms to see which was best at correctly diagnosing the anomalies. For example, one technique, called Random Forest, was particularly adept at analyzing vast quantities of the monitored data, deciding which metrics were important, and then determining whether the supercomputer was being affected by an anomaly.
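To make the Random Forest idea concrete, here is a heavily simplified sketch in plain Python. It is not the team's code: each "tree" is a single split (a decision stump), and the metrics, values, and labels are invented for illustration. What it does keep is the core recipe: train many small trees, each on a random resample of the runs and a random subset of the metrics, then let them vote.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels, features):
    """Pick the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between neighboring observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority label
        label = majority(labels)
        return (features[0], float("inf"), label, label)
    return best[1:]

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each seeing a random subset of features."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))  # common default: sqrt of the feature count
    forest = []
    for _ in range(n_trees):
        picks = rng.choices(range(len(rows)), k=len(rows))  # sample with replacement
        features = rng.sample(range(n_features), k)
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], features))
    return forest

def predict(forest, row):
    """Each stump votes; the majority label wins."""
    return majority([left if row[f] <= t else right for f, t, left, right in forest])

def feature_usage(forest):
    """Count how often each feature was chosen for a split (a crude importance hint)."""
    return Counter(f for f, _, _, _ in forest)

# Hypothetical per-run features: [mean memory use, memory trend, noisiness, unrelated flag]
normal = [[50 + i, 0.01 * i, 1.0 + 0.1 * i, i % 2] for i in range(10)]
anomalous = [[80 + i, 2.0 + 0.01 * i, 3.0 + 0.1 * i, i % 2] for i in range(10)]
rows = normal + anomalous
labels = ["normal"] * 10 + ["anomaly"] * 10

forest = train_forest(rows, labels)
print(predict(forest, [55, 0.05, 1.5, 1]))
print(predict(forest, [85, 2.05, 3.5, 0]))
print(feature_usage(forest))
```

Counting which features the trees actually split on gives a rough sense of how such a model can indicate which metrics matter. A production library would grow deeper trees and use more careful importance measures.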

To accelerate the analysis, the team calculated various statistics for each metric. Simple statistical values (such as the average and the fifth and 95th percentiles), as well as more complex values (such as noisiness, trends over time, and symmetry), can suggest abnormal behavior and thus serve as potential warning signs. Calculating these values takes little computing power, and it streamlined the rest of the analysis.
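As a rough illustration of what such per-metric statistics might look like, the sketch below reduces one monitored time series to a handful of numbers. The specific formulas are assumptions chosen for simplicity, not the team's definitions: noisiness is taken as the mean absolute change between consecutive samples, the trend as a least-squares slope, and symmetry as the skewness of the samples.

```python
import statistics

def percentile(ordered, fraction):
    """Linearly interpolated percentile of an already sorted list (fraction in [0, 1])."""
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

def summarize_metric(samples):
    """Reduce one metric's time series to a few summary statistics."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples)
    mean = statistics.mean(samples)
    std = statistics.pstdev(samples)

    # Trend: least-squares slope of the samples against their index.
    index_mean = (n - 1) / 2
    covariance = sum((i - index_mean) * (x - mean) for i, x in enumerate(samples))
    index_variance = sum((i - index_mean) ** 2 for i in range(n))
    trend = covariance / index_variance

    # Noisiness (assumed definition): mean absolute step between neighbors.
    noisiness = sum(abs(b - a) for a, b in zip(samples, samples[1:])) / (n - 1)

    # Symmetry: skewness, zero for a perfectly symmetric distribution.
    skewness = 0.0 if std == 0 else sum(((x - mean) / std) ** 3 for x in samples) / n

    return {
        "mean": mean,
        "p05": percentile(ordered, 0.05),
        "p95": percentile(ordered, 0.95),
        "noisiness": noisiness,
        "trend": trend,
        "skewness": skewness,
    }

# Hypothetical memory readings from a node whose usage climbs steadily.
climbing = [100 + 2 * i for i in range(50)]
print(summarize_metric(climbing))
```

Each statistic is a single pass (or a sort) over the samples, which is consistent with the article's point that these values are cheap to compute. A steadily climbing series shows up as a clearly nonzero trend, while a flat one does not.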

The team is now working with more artificial anomalies and more useful algorithms. A major future task is to validate diagnostic techniques on real anomalies discovered during normal runs.

Thanks to the relatively low computational cost of running the machine learning algorithms, diagnostics could be run in real time, although that too still needs to be tested. The hope is that diagnostics will eventually be able to inform users and operations staff of anomalies as they occur, or even autonomously take action to fix or work around them.
