To help engineers in large and small companies analyze Big Data, computer engineers at the National Institute of Standards and Technology (NIST) have released broad specifications on how to build useful tools for tackling that task. The release, the final version of the NIST Big Data Interoperability Framework, is the result of several years of development work between NIST and more than 800 experts from industry, academia, and government.
Filling nine volumes, the framework is intended to guide developers on how to deploy software tools that can analyze data using any type of computing platform, from a single laptop to the most powerful cloud-based environment. Just as important, it lets analysts move work from one platform to another and substitute more advanced algorithms without retooling the computing environment.
“We want to let data scientists do effective work on whatever platform they choose or have available, and however their operation grows or changes,” says Wo Chang, a NIST computer scientist. “This framework is a reference for how to create an ‘agnostic’ environment for tool creation. If software vendors use the framework’s guidelines when developing analytical tools, then analysts’ results can flow uninterruptedly, even as their goals change and technology advances.”
The framework meets a longstanding need among data engineers and scientists who are asked to extract usable information from ever-larger and more varied datasets while navigating shifting technologies. Interoperability is increasingly important as huge amounts of data pour in from a growing number of platforms, ranging from telescopes and physics experiments to the countless tiny sensors and devices linked to the Internet of Things (IoT) and the Industrial Internet of Things (IIoT). Although several years ago the world was generating 2.5 exabytes (an exabyte is a billion billion bytes) of data daily, that number is predicted to reach 463 exabytes by 2025. (That amount of data would fill 212 million DVDs.)
The NIST Big Data Interoperability Framework (NBDIF) is intended to help create software tools (represented here as a spreadsheet page) that can analyze data using any type of computing platform and be moved easily from one platform to another. (Credit: N. Hanacek/NIST)
Computer specialists use the term “Big Data analytics” to refer to the systematic approaches that try to extract usable information from these large datasets. With the rapid growth in the number and variety of tools built for that task, data scientists can now scale up their work from a single, small desktop computing setup to a large, distributed cloud-based environment with a host of processor nodes. But often, this shift puts enormous demands on analysts. For example, tools may have to be rebuilt from scratch using a different computer language or algorithm, costing staff time and potentially time-critical insights.
The NIST framework is an effort to address these problems. It includes consensus definitions and taxonomies to help ensure developers are all on the same page when they discuss plans for new tools. It also includes key requirements for the data security and privacy protections these tools should have. There is also a new reference architecture interface specification to guide the use of these tools.
“The architecture interface will let vendors build flexible environments that any tool can operate in,” Chang says. “Before, there was no specification on how to create interoperable solutions.”
This interoperability would help analysts address a number of data-intensive problems, such as weather forecasting. Meteorologists section the atmosphere into small blocks and apply analytics models to each block, using Big Data techniques to keep track of changes that hint at the future. As these blocks get smaller and our ability to analyze finer details grows, forecasts can improve—if computational components can be swapped for more advanced tools.
“You model these blocks with several equations whose variables move in parallel,” Chang says. “It’s hard to keep track of them all. The agnostic environment of the framework means a meteorologist can swap in improvements to an existing model. It will give forecasters a lot of flexibility.”
Another potential application is drug discovery, where scientists must explore the behavior of several candidate drug proteins in a round of tests and then feed the results into the next round. Unlike weather forecasting, where an analytical tool must track several variables changing simultaneously, drug development generates long strings of data where changes come in sequence. Although this problem demands a different Big Data approach, it would still benefit from the ability to change easily, because drug development is already a time-consuming and expensive process.
Whether applied to one of these or other Big Data-related problems—from spotting health-care fraud to identifying animals from a DNA sample—the value of the framework will be in helping analysts speak to one another and more easily apply all the data tools they need to reach their goals.
“Performing analytics with the newest machine learning and AI techniques while still employing older statistical methods will all be possible,” Chang says. “Any of these approaches will work and the reference architecture will let you choose.”
Eight of the nine volumes are available to download, and the last should be online soon.