Computer scientists and engineers from Sandia National Laboratories and Boston University recently earned the Gauss Award at the International Supercomputing conference. They were honored for their work using machine learning to automatically diagnose, and potentially fix, problems in supercomputers.
It turns out that supercomputers, which are relied on for everything from forecasting the weather to cancer research to ensuring U.S. nuclear weapons are safe and reliable, can have bad days. They contain a complex collection of interconnected parts and processes that can go wrong. For example, parts can break, previous programs can leave “zombie processes” running that gum up the works, network traffic can cause bottlenecks, or a computer code revision can instigate problems. These problems often result in programs not running to completion and wasting valuable supercomputer time.
So the team compiled a list of issues they have encountered when working with supercomputers and then wrote code to re-create those problems, or anomalies. They ran a variety of programs with and without the anomaly codes on two supercomputers: one at Sandia and a public cloud system operated by Boston University.
While the programs were running, researchers collected data on the process, monitoring how much energy, processor power, and memory were used by each node. Monitoring more than 700 criteria used less than 0.005% of the supercomputer’s processing power. This is where machine learning comes in.
Machine learning is a broad collection of computer algorithms that find patterns without being explicitly told which features are important. The team wrote several machine learning algorithms that detect anomalies by comparing data from normal program runs with data from runs that had anomalies. They tested the algorithms to see which was best at correctly diagnosing the anomalies. For example, one technique, called Random Forest, was particularly adept at analyzing vast quantities of the monitoring data, deciding which metrics were important, and then determining whether the supercomputer was being affected by an anomaly.
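To make the idea concrete, here is a toy sketch in the spirit of Random Forest, not the team’s code. It trains many tiny one-split trees ("decision stumps"), each on a random resample of the runs and a random subset of the metrics, and lets them vote. A real Random Forest grows deeper trees and typically comes from an established library. The metric names, the "memleak" label, and every number below are made up for illustration.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Find the single threshold split with the fewest training errors."""
    best, best_errors = None, None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best_errors is None or errors < best_errors:
                best, best_errors = (f, threshold, left_label, right_label), errors
    return best

def train_forest(rows, labels, n_trees=25, n_candidates=2, seed=0):
    """Train stumps on bootstrap samples, each seeing a random subset of metrics."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        sample_rows = [rows[i] for i in picks]
        sample_labels = [labels[i] for i in picks]
        candidates = rng.sample(range(n_features), n_candidates)
        stump = fit_stump(sample_rows, sample_labels, candidates)
        if stump is None:  # no usable split: always vote for the majority label
            majority = Counter(sample_labels).most_common(1)[0][0]
            stump = (candidates[0], float("inf"), majority, majority)
        forest.append(stump)
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]

def feature_usage(forest):
    """Crude importance measure: how often each metric was chosen for a split."""
    return Counter(f for f, _, _, _ in forest)

# Made-up per-run summaries: [memory used (GB), swap used (GB), CPU load (%)].
FEATURES = ["mem_used_gb", "swap_gb", "cpu_pct"]
cpu = [50, 60, 70, 80, 55, 65, 75, 85]  # same CPU values in both classes
healthy = [[2.0 + 0.1 * i, 0.05 * i, c] for i, c in enumerate(cpu)]
leaky = [[6.0 + 0.2 * i, 1.0 + 0.1 * i, c] for i, c in enumerate(cpu)]
rows = healthy + leaky
labels = ["healthy"] * len(healthy) + ["memleak"] * len(leaky)

forest = train_forest(rows, labels)
print(predict(forest, [8.0, 2.0, 70]))
print({FEATURES[f]: n for f, n in feature_usage(forest).items()})
```

In this made-up data the CPU load is identical across both classes, so the stumps should rarely split on it. That mirrors, very loosely, how a forest can indicate which of many monitored metrics matter for a diagnosis.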
To accelerate the analysis, the team calculated various statistics for each metric. Simple statistical values (such as the average and the fifth and 95th percentiles), as well as more complex values (such as noisiness, trends over time, and symmetry), suggested abnormal behavior and thus potential warning signs. Calculating these values doesn’t take much computing power, and it streamlined the rest of the analysis.
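As a rough illustration of this step, the sketch below reduces one metric’s time series to a handful of numbers. The specific formulas are assumptions chosen for simplicity, not necessarily what the team used: a least-squares slope stands in for "trend," the spread of sample-to-sample changes for "noisiness," and skewness for "symmetry." The sample readings are invented.

```python
import statistics

def percentile(values, q):
    """Percentile with linear interpolation; q is between 0 and 100."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def summarize(series):
    """Reduce one metric's time series (at least 2 samples) to a few statistics."""
    n = len(series)
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    # Trend: least-squares slope of the value against the sample index.
    t_mean = (n - 1) / 2
    slope = (sum((t - t_mean) * (x - mean) for t, x in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    # Noisiness: spread of the changes between consecutive samples.
    diffs = [b - a for a, b in zip(series, series[1:])]
    noise = statistics.pstdev(diffs)
    # Symmetry: skewness, zero when values are balanced around the mean.
    skew = 0.0 if std == 0 else sum((x - mean) ** 3 for x in series) / (n * std ** 3)
    return {
        "mean": mean,
        "p05": percentile(series, 5),
        "p95": percentile(series, 95),
        "trend": slope,
        "noise": noise,
        "skew": skew,
    }

# Invented memory readings (GB) from two hypothetical runs.
steady = [4.0, 4.1, 3.9, 4.0, 4.1, 3.9, 4.0, 4.1]
leaky = [4.0, 4.5, 5.1, 5.4, 6.0, 6.6, 7.0, 7.5]

print(summarize(steady))
print(summarize(leaky))
```

A steadily growing memory reading shows up as a clearly positive trend value, while the steady run stays near zero. Each series is collapsed to six numbers, which is what keeps the downstream analysis cheap.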
The team is now working with more artificial anomalies and more useful algorithms. A major future task is to validate diagnostic techniques on real anomalies discovered during normal runs.
Thanks to the relatively low computational cost of running the machine learning algorithms, diagnostics could be used in real time, though that also needs to be tested. The hope is that diagnostics will eventually be able to inform users and operations staff of anomalies as they occur, or even autonomously take action to fix or work around them.