Federal agencies generate or collect large volumes of data from such sources as earth-observing satellites, sensor networks, and genomics research. Much of that information is useful to commercial and academic institutions, which can usually access this publicly generated data from agency servers at no charge.
As the volume of data continues to expand, however, many agencies are considering using commercial cloud services to store it and make it available to users. Although agencies may have different strategies, these new partnerships could result in user fees levied on downloads and analyses performed on data while it remains in the cloud.
A researcher at the Georgia Institute of Technology who studies such data use, Mariel Borowitz, urges caution about the design of these commercial cloud partnerships and possible imposition of user fees.
“Under the current system, free and open government data is used by scientists to conduct research, by entrepreneurs to create new businesses, and by citizens and other organizations to promote government transparency,” says Borowitz. “If users must pay fees to download or analyze data, it will decrease the ability of users to access and work with data. Past experience suggests that the effects of the decrease in data use could be large both for individuals and society as a whole.”
Moving data to commercial cloud systems would likely provide broader access and more efficient analysis options, but Borowitz cautions that those advantages could be offset by the cost, particularly for organizations with small budgets.
“Agencies risk losing some of the benefits of this transition by not budgeting for the costs associated with data downloads and analysis,” Borowitz says. “Many who would be interested in using the data may not be able to pay the associated fees. Researchers, nonprofit organizations, and others who do not directly profit from the use of this data are most likely to be affected.”
Borowitz recently spent two years at NASA, where she watched both the development of systems that will increase data collection and the debates about future data storage. She says she would like to see the agencies that provide data continue to shoulder the costs, up to some “reasonable level,” to ensure data continues to be readily available to all users.
As an alternative to commercial services, some agencies are considering developing their own custom-built cloud solutions. They will have to weigh the costs and benefits of the different options. There will also be technical, organizational, and policy issues to consider.
“Agencies are taking the issues of security and long-term preservation of data seriously,” Borowitz adds. “When working with commercial providers, some are concerned about getting ‘locked in’ to one provider, due to the costs of migrating data from one system to another. It is possible that costs and capabilities could change over time. On the other hand, commercial cloud providers have large workforces and extensive infrastructure that let them provide services and capabilities well beyond what any one agency would be able to maintain.”
Borowitz notes that most agencies have not made final decisions about cloud-based programs, so there should be enough time to work through these issues.
“Most agencies that make data publicly available, particularly science agencies, are discussing and/or beginning to transition to cloud systems,” she says. “However, these programs, at agencies such as NSF, NIH, NASA, and NOAA, are still in their early phases, and there is still time for feedback to be provided and adjustments to the programs to be made.”
The existence of fees to access government data is not without precedent, but Borowitz argues past experience suggests that user fees result in significantly less use. Before Landsat data (satellite images of Earth) was made freely available in 2008, no more than 25,000 images a year were purchased from the collection. “Within a few years of implementing the free and open data policy, the government was distributing 250,000 images a month,” she says.
That surge in demand hints at the scale cash-strapped agencies are dealing with. The National Oceanic and Atmospheric Administration, for example, houses more than 100 petabytes (PB) of data and generates more than 30 PB per year from satellites, radars, computer models, and other sources. NASA projects its archive will grow to 250 PB by 2025. And the amount of genomic data at the National Institutes of Health is growing exponentially.
A petabyte is 1,024 terabytes, or roughly a million gigabytes. A gigabyte is 1,024 megabytes. For scale, an average photograph taken by a high-end cell phone camera can be in the neighborhood of 10 megabytes. Laptop computers may be able to store as much as a few terabytes of data.
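To make those comparisons concrete, the short Python sketch below converts the archive sizes quoted above into phone photos and laptop drives. It is only a back-of-the-envelope illustration: it uses the 1,024-based units described here, takes a photo as 10 megabytes, and assumes that "a few terabytes" means a 2-terabyte laptop drive. The function names are hypothetical and chosen for this example.

```python
# Unit sizes in bytes, using the 1,024-based convention described in the article.
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4
PB = 1024 ** 5

PHOTO_BYTES = 10 * MB   # article's rough size for a high-end phone photo
LAPTOP_BYTES = 2 * TB   # assumption: "a few terabytes" taken as 2 TB


def photos_in(petabytes):
    """Number of whole 10 MB photos that fit in the given number of petabytes."""
    return (petabytes * PB) // PHOTO_BYTES


def laptops_needed(petabytes):
    """Number of 2 TB laptop drives needed to hold the data (rounded up)."""
    return -(-(petabytes * PB) // LAPTOP_BYTES)  # ceiling division


if __name__ == "__main__":
    print("Gigabytes per petabyte:", PB // GB)
    for label, size_pb in [("NOAA archive (more than)", 100),
                           ("NOAA yearly growth (more than)", 30),
                           ("NASA projection for 2025", 250)]:
        print(f"{label}: {size_pb} PB is about "
              f"{photos_in(size_pb):,} photos or "
              f"{laptops_needed(size_pb):,} laptop drives")
```

Under these assumptions, NOAA's 100 petabytes alone correspond to more than 10 billion phone photos, or upward of 50,000 laptop drives.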
Borowitz sees the transition to cloud computing as both an opportunity and a challenge for the future availability of government data. “Decisions being made right now about the structure of these programs will significantly affect researchers and society, so it is important to raise awareness and increase engagement on these issues.”