If you think your company is overwhelmed supporting terabytes of data, the computing power behind the Large Hadron Collider (LHC) might put things into perspective.
As the world's largest particle accelerator, the LHC in Geneva generates about 30 petabytes (30 million gigabytes) of data each year.
The LHC accelerates particle beams in opposing directions to 99.99 per cent of the speed of light and collides them. High-tech digital cameras capture the collisions as they happen, taking 40 million photographs per second. The raw data is automatically filtered to keep only the photographs of interest, with 6 gigabytes per second written to permanent storage.
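Those headline figures can be cross-checked with some simple arithmetic. The sketch below is a back-of-envelope calculation assuming only the rates quoted in this article (6 gigabytes per second written, 30 petabytes per year), not official CERN numbers; it estimates how many days of running per year those two figures jointly imply:

```python
# Back-of-envelope check on the article's quoted figures (illustrative
# only; assumes the stated rates, not CERN's actual duty cycle).
WRITE_RATE_GB_PER_S = 6          # filtered data written to permanent storage
YEARLY_DATA_PB = 30              # annual archive growth

# Convert petabytes to gigabytes, then divide by the write rate.
seconds_of_running = YEARLY_DATA_PB * 1_000_000 / WRITE_RATE_GB_PER_S
days_of_running = seconds_of_running / 86_400   # seconds per day

print(f"{seconds_of_running:,.0f} s of data-taking "
      f"~ {days_of_running:.0f} days per year")
# -> 5,000,000 s of data-taking ~ 58 days per year
```

The implied two months or so of data-taking is consistent with the machine spending much of the year between runs, which is when, as Jones explains below, the data centres switch to simulation work.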
At ADMA's Advancing Analytics in Sydney today, Bob Jones, a project manager at CERN, which operates the LHC, talked about the enormous computing power behind the machine, which has an annual budget of A$1.5 billion.
“We have a dozen or so major data centres around the world - one in each in the major European countries, several in North America, one in Taiwan, one in South Korea. We also have a Tier Two centre in Melbourne.”
“They serve a whole array of smaller centres, where the analytics is sitting. There are 170 computer centres around the world, which are all collaborating to make this global infrastructure operate.”
Its two major data centres, in Geneva and Budapest, are linked by the world's fastest network, running at 100 gigabit per second speeds, he said.
“What this means is we can operate that infrastructure as a single cloud deployment; it's now operated as a single OpenStack.”
Using neural network algorithms, the data centres process the photographs and identify good candidates to forward to the computer centres for analysis.
“We have a fraction of a second at this point to decide if we are going to keep that photograph or throw it away,” he said.
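That keep-or-discard decision can be illustrated with a minimal sketch. The code below is a deliberately simplified stand-in: a single threshold on a hypothetical “interest” score produced by an upstream classifier. The real LHC trigger is a multi-stage hardware and software pipeline, so treat this purely as an illustration of the filtering idea:

```python
# Toy trigger: keep a photograph only if its "interest" score clears a
# threshold. The score and threshold here are hypothetical; the LHC's
# actual trigger system is far more elaborate.
def keep_photograph(score: float, threshold: float = 0.9) -> bool:
    """Return True if the event is interesting enough to store."""
    return score >= threshold

# Hypothetical scores assigned by an upstream classifier.
scores = [0.12, 0.95, 0.40, 0.99, 0.88]
kept = [s for s in scores if keep_photograph(s)]
print(kept)  # -> [0.95, 0.99]
```

The point of the design is that the cheap decision happens first, so only a small fraction of the 40 million snapshots per second ever reaches permanent storage.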
When the LHC is not running an experiment, the data centres are not sitting idle. Instead, they are used to create simulated data to compare with the actual data and verify its quality, Jones said.
“There are so many things that could go wrong inside an accelerator, so if you are not able to compare it to simulation, then you don't know the quality of your data.
“When the machine stops, those data centres suddenly become free. They start analysing past data and running simulations, so they are never down,” he said.
Online disk storage holds the petabytes of data used for analysis. Tape is used for long-term archiving, storing up to 10 terabytes per cassette.
“But tape is a magnetic medium, which means it deteriorates over time, so you have to actually manage it. We have to repack this data every two years, because that's the time we know we've got the best chance of still being able to read the data and … that's a suitable timeframe during which we can pack even more data onto each one of those cassettes,” Jones added.
For the data scientists to make a discovery in the data, like the Higgs boson in 2012, they have to narrow some 600 million events down to about 400, said Jones.
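To put that selection in perspective, a quick calculation using only the figures quoted in the article shows just how rare the discovery candidates are:

```python
# Fraction of recorded events that end up as discovery candidates,
# using the figures quoted above (illustrative arithmetic only).
candidate_events = 400
total_events = 600_000_000

fraction = candidate_events / total_events
print(f"about 1 in {total_events // candidate_events:,} events")
# -> about 1 in 1,500,000 events
```

In other words, roughly one event in 1.5 million survives the full selection, which is why the filtering and simulation infrastructure described above matters so much.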
“When the Higgs discovery was made it was running at 8 teraelectronvolts. Now it's running at 13 teraelectronvolts, so we are going to get even more data. We have to scale up to handle bigger quantities of data in the future.”
To support this growing amount of data, Jones said CERN is partnering with commercial cloud providers such as Rackspace to leverage their infrastructure, as well as plugging into the NeCTAR federated research cloud model.