CMU’s DeltaFS Team Aims To Create Smarter Ways To Organize, Store Supercomputer Data

Trinity occupies a footprint the size of an entire floor of most office buildings, but its silently toiling workers are not flesh and blood. Trinity is a supercomputer at Los Alamos National Laboratory in New Mexico, made up of row upon row of CPUs stacked from the white-tiled floor to the fluorescent ceiling.

The machine is responsible for helping to maintain the United States’ nuclear stockpile, but it is also a valuable tool for researchers from a broad range of fields. The supercomputer can run huge simulations, modeling some of the most complex phenomena known to science.

However, continued advances in computing power have raised new issues for researchers.

"If you find a way to double the number of CPUs that you have, you still have a problem of building software that will scale to use them efficiently," said George Amvrosiadis, an assistant research professor in Carnegie Mellon University’s Parallel Data Lab.

Amvrosiadis was part of a team that recently lent a hand to a cosmologist from Los Alamos working to simulate complex plasma phenomena. The problem wasn’t that Trinity lacked the power to run the simulations, but rather that it could not create and store the massive amounts of data quickly and efficiently. That’s where Amvrosiadis and the DeltaFS team came in.

DeltaFS is a file system designed to alleviate the significant burden placed on supercomputers by data-intensive simulations like the plasma simulation.

When it comes to supercomputing, efficiency is the name of the game. If a task can’t be completed within the amount of time allotted, then the simulation will go incomplete, and precious time will have been wasted. With researchers vying for limited computing resources, any time wasted is a major loss.

DeltaFS streamlined the plasma simulation by changing two aspects of how Trinity processed and moved the data, bringing a task that had once been too resource-demanding within the supercomputer’s capabilities.

First, DeltaFS changed the size and quantity of files the simulation program created. Rather than taking large snapshots that each encompassed every particle in the simulation (more than a trillion in all), DeltaFS created a much smaller file for each individual particle. This made it much easier for the scientists to track the activity of individual particles.

Through DeltaFS, Trinity was able to create a record-breaking trillion files in just two minutes.
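The difference between the two layouts can be illustrated with a toy example. The Python sketch below is not DeltaFS code; the function name, the particle IDs and the single-number "state" are all assumptions made for illustration. It simply regroups a series of whole-simulation snapshots into one small record per particle, which is the kind of per-particle view the scientists wanted.

```python
from collections import defaultdict


def regroup_by_particle(snapshots):
    """Turn per-timestep snapshots into per-particle trajectories.

    `snapshots` is a list with one dict per timestep, mapping a
    (hypothetical) particle ID to that particle's state at that step.
    """
    trajectories = defaultdict(list)
    for step, snapshot in enumerate(snapshots):
        for particle_id, state in snapshot.items():
            trajectories[particle_id].append((step, state))
    return dict(trajectories)


# Three tiny "snapshots", each covering every particle in the toy simulation.
snapshots = [
    {"p0": 1.0, "p1": 5.0},
    {"p0": 1.5, "p1": 4.0},
    {"p0": 2.0, "p1": 3.0},
]

trajectories = regroup_by_particle(snapshots)
print(trajectories["p1"])  # the full history of one particle, in one place
```

With the snapshot layout, following particle p1 means opening every snapshot; with the per-particle layout, its history sits in a single small record.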

Additionally, DeltaFS was able to take advantage of the roughly 10 percent of simulation time usually spent storing the data created, during which Trinity’s CPUs are sitting idle. The system tagged data as it flowed to storage and created searchable indices that eliminated hours of time that scientists would have had to spend combing through data manually. This allowed the scientists to retrieve the information they needed 1,000 to 5,000 times faster than prior methods.
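The general idea of indexing data on its way to storage can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the DeltaFS implementation: the `IndexedLog` class, its record format and the in-memory buffer standing in for storage are all hypothetical. Each write records where the data landed, so a later lookup can jump straight to the relevant records instead of scanning everything.

```python
import io
import struct
from collections import defaultdict

# Hypothetical fixed-size record: particle ID (int64) and one value (float64).
RECORD = struct.Struct("<qd")


class IndexedLog:
    """Append-only log that builds an index as records are written."""

    def __init__(self):
        self.storage = io.BytesIO()      # stand-in for the storage system
        self.index = defaultdict(list)   # particle ID -> byte offsets

    def write(self, particle_id, value):
        offset = self.storage.seek(0, io.SEEK_END)  # always append at the end
        self.storage.write(RECORD.pack(particle_id, value))
        self.index[particle_id].append(offset)      # tag the record on the way in

    def lookup(self, particle_id):
        """Read only the records the index points to."""
        values = []
        for offset in self.index.get(particle_id, []):
            self.storage.seek(offset)
            _, value = RECORD.unpack(self.storage.read(RECORD.size))
            values.append(value)
        return values

    def scan(self, particle_id):
        """Baseline without an index: check every record in storage."""
        data = self.storage.getvalue()
        values = []
        for offset in range(0, len(data), RECORD.size):
            pid, value = RECORD.unpack_from(data, offset)
            if pid == particle_id:
                values.append(value)
        return values


log = IndexedLog()
for step in range(3):
    for pid in range(4):
        log.write(pid, pid + step / 10)

# Both paths return the same values; the indexed one touches far fewer records.
assert log.lookup(2) == log.scan(2)
```

In this toy version the indexed lookup reads three records while the scan reads all twelve; the gap widens as the data set grows, which is the effect the article's speedup figures describe at a much larger scale.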

The team could not have been more thrilled with the success of DeltaFS’ first real-world test run and is already looking ahead.

"We’re looking to get it into production and have the cosmologist who originally contacted us use it in his latest experiment," Amvrosiadis said. "To me that’s more of a success story than anything else. Often a lot of the work ends with just publishing a paper and then you’re done; that’s just anticlimactic."

But he and the rest of the team aren’t limiting their efforts to cosmological simulations. They are looking at ways to expand DeltaFS for use with everything from earthquake simulations to crystallography. With countries across the globe striving to create machines that can compute at the exascale, meaning 10^18 calculations per second, there’s a growing need to streamline these demanding processes.

The trick to finding a one-size-fits-all (or at least one-size-fits-most) replacement for the purpose-built systems currently in use is designing the file system to be flexible enough for scientists and researchers to tailor it to their own specific needs.

"What researchers end up doing is stitching a solution together that is customized to exactly what they need, which takes a lot of developer hours," Amvrosiadis said. "As soon as something changes they have to sit back down to the drawing board and start from scratch and redesign all their code."

Amvrosiadis and the team have demonstrated a couple of ways that efficiency can be improved, such as indexing or altering file size and quantity. Now they are looking into further ways to take advantage of potential inefficiencies, like using in-process analysis to eliminate unneeded data before it ever reaches storage or compressing information in preparation for transfer to other labs.

Solutions like these center around repurposing CPU downtime to perform tasks that will contribute back into the information pipeline and creating smarter ways to organize and store data, increasing overall efficiency.

The idea is to let the expert scientists identify the areas where they have room for improvement or untapped resources, and to take advantage of the toolkit and versatile framework DeltaFS can provide.

As the world moves toward exascale computing, the pace software development must maintain to keep up with hardware improvements will only increase. Amvrosiadis said he hopes that one day more advanced AI techniques could be incorporated to do much of the observational work performed by scientists, cutting down on observation time and freeing them to focus on analysis and study. But for him and the rest of the DeltaFS team, all of that starts with finding little solutions to improve huge processes.

"I don’t know if there’s one framework to rule them all yet - but that’s the goal." 

The DeltaFS project includes Professors George Amvrosiadis, Garth Gibson, and Greg Ganger, Systems Scientist Chuck Cranor, and Ph.D. student Qing Zheng. Also involved were Los Alamos National Lab’s Brad Settlemyer and Gary Grider.

