Researchers in California today unveiled what they describe as the world's largest repository for cancer genomes. The database will make it easier for scientists to analyze the vast amounts of sequencing data pouring out of the U.S. National Cancer Institute's (NCI's) genome projects.
Cancer Genomics Hub (CGHub), built by a team at the University of California, Santa Cruz (UCSC), will hold raw sequencing data from The Cancer Genome Atlas (TCGA). The atlas is NCI's mammoth effort to sequence the DNA of normal cells and tumor cells from 10,000 people with 20 types of cancer. (In some cases the project is sequencing whole genomes; in other cases, only the 1% of the genome that codes for proteins.) CGHub will also hold data from NCI's childhood- and HIV-associated cancer genome projects. It will take over for NIH's National Center for Biotechnology Information, which had been collecting cancer sequencing data through last August.
Physically based at the San Diego Supercomputer Center, the CGHub computer system is ready to store 5 petabytes of DNA and RNA data from cancer patients. (TCGA is generating 10 terabytes of data a month, and will eventually produce 10 petabytes [10,000 terabytes] of data.)
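To put those storage figures in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 petabyte = 1,000 terabytes, as in the bracketed conversion above) and a constant output of 10 terabytes a month; the function name `months_to_reach` is hypothetical and used only for illustration.

```python
# Back-of-the-envelope arithmetic on the storage figures quoted above.
# Assumption: decimal units, so 1 petabyte = 1,000 terabytes.
TB_PER_PB = 1_000


def months_to_reach(target_pb: float, rate_tb_per_month: float) -> float:
    """Months needed to accumulate target_pb petabytes at a constant monthly rate."""
    return target_pb * TB_PER_PB / rate_tb_per_month


# Assumption: TCGA's output stays flat at 10 terabytes a month.
months_to_fill_cghub = months_to_reach(5, 10)    # CGHub's 5-petabyte capacity
months_to_full_tcga = months_to_reach(10, 10)    # TCGA's projected 10 petabytes

print(f"5 PB at 10 TB/month: {months_to_fill_cghub:.0f} months")
print(f"10 PB at 10 TB/month: {months_to_full_tcga:.0f} months")
```

At a flat rate the totals would take decades to reach, so the projected figures presumably anticipate faster data production than today's; that reading is an inference from the arithmetic, not something the project has stated.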
TCGA is building a catalog of key cancer-driving genetic changes that researchers can use to develop treatments tailored to the genetics of an individual's tumor. A central database will allow researchers to compare mutations and miswired pathways across cancer types, says UCSC bioinformatician David Haussler, who is leading the project, which is funded with a $10.3 million contract from NCI: "What's very important is to gather the data in one place and make it easy for researchers to do cross-dataset comparisons." CGHub will not hold data from other international cancer genome projects, however.
For now, researchers will only be able to download the data. But sending genome data across the Internet is becoming impractical as datasets balloon in size (see our 2011 story "Will Computers Crash Genomics?"). Haussler says that eventually, researchers will be able to work on the data remotely on CGHub's servers through cloud computing, as NIH is doing with Amazon for data from its 1000 Genomes Project.