Big Data Analytics – The Power of Petabytes
Fifteen years ago, it was a landmark achievement. Ten years ago, it
was an intriguing but highly expensive research tool. Now, falling
costs, soaring accuracy and a steadily expanding base of scientific
knowledge have brought genome sequencing to the cusp of routine clinical
care.
The US$1,000 genome, initially conceived as a price point at which
sequencing could become a component of personalized medicine, has
arrived. “Our capacity for data generation relative to price has
increased in a way that is almost unprecedented in science — roughly six
orders of magnitude in the past seven or eight years,” says Paul
Flicek, a specialist in computational genomics at the European Molecular
Biology Laboratory's European Bioinformatics Institute in Cambridge,
UK. For example, the HiSeq X Ten system developed by Illumina of San Diego, California, can sequence more than 18,000 human genomes per year.
The biomedical research community is diving in wholeheartedly, with
population-scale programmes that are intended to explore the clinical
power of the genome. In 2014 the United Kingdom launched the 100,000
Genomes Project, and both the United States (under the Precision
Medicine Initiative) and China (in a programme to be run by BGI of
Shenzhen) have unveiled plans to analyse genomic data from one million
individuals.
Many other programmes are under way that, although more regional in
focus, are still 'big data' operations. A partnership between Geisinger
Health System, based in Danville, Pennsylvania, and biotech firm
Regeneron Pharmaceuticals of Tarrytown, New York, for instance, aims to
generate sequence data for more than 250,000 people. Meanwhile, a
growing number of hospitals and service providers worldwide are
sequencing the genomes of people with cancers or rare hereditary
disorders (see 'DNA sequencing soars').
Some researchers worry that the flood of data could overwhelm the
computational pipelines needed for analysis and generate unprecedented
demand for storage — one article estimated that the output from genomics
may soon dwarf data heavyweights such as YouTube. Many also worry that
today's big data lacks the richness to provide clinical value. “I don't
know if a million genomes is the right number, but clearly we need more
than we've got,” says Marc Williams, director of the Geisinger Genomic
Medicine Institute.
The meaning of mutations
Clinical genomics today is largely focused on identifying
single-nucleotide variants — individual 'typos' in the genomic code that
can disrupt gene function. And rather than looking at the full genome,
many centres focus instead on the exome — the subset of sequences
containing protein-coding genes. This reduces the amount of data being
analysed nearly 100-fold, but the average exome still contains more than
13,000 single-nucleotide variants. Roughly 2% of these are predicted to
affect the composition of the resulting protein, and finding the
culprit for a given disease is a daunting challenge.
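To give a sense of the arithmetic, the short Python sketch below uses only the figures quoted above (no real annotation data) to show how little is left after that first cut, and how much interpretation work still remains.

    # Minimal sketch: the rough arithmetic of exome filtering described above.
    # The only inputs are the figures quoted in the text (~13,000 single-
    # nucleotide variants per exome, ~2% predicted to alter the protein).
    total_snvs = 13_000                  # typical SNV count for one exome
    protein_altering_fraction = 0.02     # ~2% predicted to change the protein
    candidates = int(total_snvs * protein_altering_fraction)
    print(f"~{candidates} protein-altering candidates per exome")   # ~260
    # Each of those ~260 candidates still needs frequency, segregation and
    # functional evidence before it can be tied to a patient's disease.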
For decades, biomedical researchers have dutifully deposited their
discoveries of single-nucleotide variants in public resources such as
the Human Gene Mutation Database, run by the Institute of Medical
Genetics at Cardiff University, UK, or dbSNP, maintained by the US
National Center for Biotechnology Information. However, the effects of
these mutations were often determined from cell culture or animal
models, or even theoretical predictions, providing insufficient guidance
for clinical diagnostic tools. “In many cases, associations were made
with relatively low levels of evidence,” says Williams.
The situation is even more complicated for structural variants, such
as duplicated or missing chunks of genome sequence, which are far more
difficult to detect with existing sequencing technologies than
single-nucleotide variants. At the whole-genome scale, each person has
millions of variants. Many of these are in sequences that do not encode
proteins but instead regulate gene activity, so they can still
contribute to disease. However, the extent and function of these
regulatory regions are poorly defined. Although capturing all this
variability is desirable, it may not offer the best short-term returns
for clinical sequencing. “You're shooting yourself in the foot if you're
collecting data you don't know how to interpret,” says Russ Altman, a bioinformatician at Stanford University in California.
Efforts are now under way to rectify this problem. The Clinical
Genome Resource, which was set up by the US National Human Genome
Research Institute, is a database of disease-related variants, and
contains information that could guide medical responses to these
variants as well as the evidence supporting those associations. Genomics
England, which runs the 100,000 Genomes Project, aims to bolster
progress in this area by establishing 'clinical interpretation
partnerships': doctors and researchers will collaborate to establish
robust models of diseases that can potentially be mapped to specific
genetic alterations.
However, quantity is as important as quality. Mutations that exert a
strongly detrimental effect confer an evolutionary disadvantage, so they
tend to be exceedingly rare and require large sample sizes to detect.
Establishing statistically meaningful disease associations for variants
with weak effects also needs large numbers of people.
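A back-of-envelope calculation illustrates the point. The Python sketch below estimates the chance of observing a rare variant at all as the cohort grows, assuming random sampling of 2n chromosomes; the allele frequency and cohort sizes are illustrative choices, not figures from the article.

    # Illustrative only: chance of seeing at least one copy of a variant with
    # population allele frequency p among n sequenced individuals (2n alleles).
    def prob_variant_observed(p, n):
        return 1 - (1 - p) ** (2 * n)

    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"n = {n:>9,}: P(seen at p = 1e-5) = {prob_variant_observed(1e-5, n):.3f}")
    # A variant carried on 1 in 100,000 chromosomes is almost never seen in a
    # 1,000-person study but is near-certain to appear in a million-person cohort.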
In Iceland, deCODE Genetics has demonstrated the power of
population-scale genomics, combining extensive genealogy and
medical-history records with genome data from 150,000 people (including
15,000 whole-genome sequences). These data have allowed deCODE to
extrapolate the population-wide distribution of known genetic risk
factors, including gene variants linked to breast cancer, diabetes and
Alzheimer's disease.
They have also enabled studies in humans that normally require the
creation of genetically modified animals. “We have established that
there are about 10,000 Icelanders who have loss-of-function mutations in
both copies of about 1,500 different genes,” says Kári Stefánsson, the
company's chief executive. “We're putting significant effort into
figuring out what impact the knockout of these genes has on
individuals.”
This work was helped by the homogeneous nature of the Icelandic
population, but other projects require a broadly representative spectrum
of donors. Efforts such as the international 1000 Genomes Project have
catalogued some of the world's genetic diversity, but most data are
heavily skewed towards Caucasian populations, making them less useful
for clinical discovery. “Because they come from the genetic mother ship,
so to speak, people of African ancestry carry a lot more genetic
variants than non-Africans,” says Isaac Kohane, a bioinformatician at
Harvard Medical School in Boston, Massachusetts. “Variants that seem
unusual in Caucasians might be common in Africans, and may not actually
cause disease.”
Part of the problem stems from the reference genome — the yardstick
sequence by which scientists identify apparent abnormalities, developed
by the multinational Genome Reference Consortium. The first version was
cobbled together from a few random donors of undefined ethnicity, but
the latest iteration, known as GRCh38, incorporates more information
about human genomic diversity.
Into the cloud
Harvesting genomes or even exomes at the population scale produces a
vast amount of data, perhaps up to 40 petabytes (40 million gigabytes)
each year. Nevertheless, raw storage is not the primary computational
concern. “Genomicists are a tiny fraction of the people who need bigger
hard drives,” says Flicek. “I don't think storage is a significant
problem.”
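As a rough illustration of where a figure of that order comes from (the per-genome size and yearly volume below are assumed ballparks, not numbers from the article):

    # Back-of-envelope only. Assumes ~100 GB of sequence data per 30x whole
    # genome -- a common ballpark that varies with coverage, file format and
    # compression -- and a hypothetical worldwide volume of 400,000 genomes/year.
    bytes_per_genome = 100e9
    genomes_per_year = 400_000
    petabytes = bytes_per_genome * genomes_per_year / 1e15
    print(f"~{petabytes:.0f} PB of sequence data per year")   # ~40 PB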
A greater concern is the amount of variant data being analysed from
each individual. “The computation scales linearly with respect to the
number of people,” says Marylyn Ritchie, a genomics researcher at
Pennsylvania State University in State College. “But as you add more
variables, it becomes exponential as you start to look at different
combinations.” This becomes particularly problematic if there are
additional data related to clinical symptoms or gene expression.
Processing data of this magnitude from thousands of people can paralyse
tools for statistical analysis that might work adequately in a small
laboratory study.
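The combinatorics behind Ritchie's point are easy to make concrete. The short Python sketch below uses illustrative variant counts to show how quickly exhaustive interaction testing outpaces any linear growth in sample size.

    # Illustrative numbers: exhaustive pairwise and three-way interaction tests
    # grow combinatorially with the number of variables, while each extra
    # sample only adds a constant amount of work per test.
    from math import comb

    for n_variants in (1_000, 10_000, 100_000):
        print(f"{n_variants:>7,} variants: {comb(n_variants, 2):>15,} pairwise tests, "
              f"{comb(n_variants, 3):>21,} three-way tests")
    # 100,000 variants already imply ~5 billion pairwise tests; folding in
    # clinical or expression variables multiplies the space further.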
Scaling up requires improvisation, but there is no need to start from
scratch. “Fields like meteorology, finance and astronomy have been
integrating different types of data for a long time,” says Ritchie.
“I've been to meetings where I talk to people from Google and Facebook,
and our 'big data' is nothing like their big data. We should talk to
them, figure out how they've done it and adopt it into our field.”
Unfortunately, many talented programmers with the skills to wrangle
big data sets are lured away by Silicon Valley. Philip Bourne, associate
director for data science at the US National Institutes of Health
(NIH), believes that this is partly due to a lack of recognition and
advancement within a publication-driven system of scientific credit that
leaves software creators and data managers out in the cold. “Some of
these people truly want to be scholars, but they can't get the stature
of faculty — that's just not right,” says Bourne.
Processing power is another limiting factor. “This is not a desktop
game — the real practitioners are proficient in massively parallel
computation with hundreds if not thousands of CPUs, each with large
memory,” says Kohane. Many groups that analyse massive amounts of
sequence data are moving to 'cloud'-based architectures, in which the
data are deposited within a large pool of computational resources and
can then be analysed with whatever processing power is required.
“There's been a gradual evolution towards this idea that you bring
your algorithms to the data,” says Tim Hubbard, head of bioinformatics
at Genomics England. For Genomics England, this architecture is
contained in a secure government facility, with strict control over
external access. Other research groups are turning to commercial cloud
systems, such as those provided by Amazon or Google.
Privacy protection
In principle, cloud-based hosting can encourage sharing and
collaboration on data sets. But regulations on patient consent and
privacy rights surrounding highly sensitive clinical information pose
tricky ethical and legal issues.
In the European Union, collaboration is impeded by member states
having different rules on data handling. Sharing with non-EU nations
relies on cumbersome mechanisms to establish adequacy of data
protection, or restrictive bilateral agreements with individual
organizations. To help solve this problem, a multinational coalition,
the Global Alliance for Genomics and Health, developed the Framework for
Responsible Sharing of Genomic and Health-Related Data. The Framework
includes guidelines on privacy and consent, as well as on accountability
and legal consequences for those who break the rules.
“In data-transfer agreements, you could save yourself pages and pages
of rules if the institution, researcher and funder agree to follow the
Framework,” says Bartha Knoppers, a bioethicist at McGill University in
Montreal, Canada, who chairs the Alliance's regulatory and ethics
working group. The Framework also calls for 'safe havens' that allow the
research community to analyse centralized banks of genomic data that
have been identity-masked but not fully 'de-identified', so they remain
useful. “We want to link it to clinical data and to medical records,
because we're never going to get to precision medicine otherwise, so
we're going to have to use coded data,” explains Knoppers.
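One common way to produce such 'coded' data, shown below purely as an illustration (it is not the Framework's prescribed mechanism), is to replace direct identifiers with a keyed hash, so that a data custodian holding the key can re-link genomes to medical records while analysts see only opaque codes.

    # Illustration of identity-masked ("coded") records. The key, identifier
    # format and variant shown are all hypothetical.
    import hmac, hashlib

    SECRET_KEY = b"held-only-by-the-data-custodian"

    def pseudonym(patient_id: str) -> str:
        """Deterministic code for a patient ID; not reversible without the key."""
        return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

    record = {"patient": pseudonym("NHS-1234567890"), "variant": "BRCA2 c.5946delT"}
    print(record)   # analysts see the code, never the raw identifier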
Integrating genomics into electronic health records is becoming
increasingly important for many European nations. “Our objective is to
put this into the standard National Health Service,” says Hubbard. The
UK 100,000 Genomes Project may be the furthest along at the moment, but
other countries are following. Belgium recently announced an initiative
to explore medical genomics, for example.
All these nations benefit from having centralized, government-run
health-care systems. In the United States, the situation is more
fragmented, with different providers relying on distinct health-record
systems, supplied by different vendors, that are generally not designed
to handle complex genomic data. The NIH launched the Electronic Medical
Records and Genomics (eMERGE) Network in 2007 to define best practices.
From data to diagnosis
The immediate goal of genomically enriched health records is to explain the implications of gene variants to physicians, and one of the earliest applications is pharmacogenetics. The Clinical
Pharmacogenetics Implementation Consortium has translated known
drug–gene interactions reported in PharmGKB (a database run by Altman
and his colleagues) for clinical use. For example, people with certain
variants may respond poorly to particular anticoagulants, leading to
increased risk of heart attack. “The issue there is, how do you take a
practitioner who has 12 minutes per patient and about 45 seconds of time
allocated for prescribing drugs, and influence their practice in a
meaningful way?” says Altman.
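In practice, those 45 seconds argue for automated prompts at the point of prescribing. The sketch below shows the general shape of such an alert; the gene-drug table is a hypothetical stand-in rather than CPIC guidance or PharmGKB content.

    # Hypothetical prescribing alert; the lookup table is a stand-in, not real
    # CPIC or PharmGKB data.
    PGX_FLAGS = {
        ("CYP2C19", "poor metabolizer"): {
            "clopidogrel": "Reduced drug activation; consider an alternative agent.",
        },
    }

    def prescribing_alert(patient_genotypes, drug):
        """Return an alert string if any genotype flags this drug, else None."""
        for genotype in patient_genotypes:
            advice = PGX_FLAGS.get(genotype, {}).get(drug)
            if advice:
                return f"ALERT ({genotype[0]}, {genotype[1]}): {advice}"
        return None

    print(prescribing_alert([("CYP2C19", "poor metabolizer")], "clopidogrel"))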
As long as deciding how to adapt care to genetic findings remains a
job for humans, this process will remain time- and labour-intensive.
Nevertheless, combining genotype and phenotype information is proving
fruitful from a research perspective. Most clinically relevant gene
variants were identified through genome-wide association studies, in
which large populations of people with a given disease were examined to
identify closely associated genetic signatures. Researchers can now work
backwards from health records to determine what clinical manifestations
are prevalent among individuals with a given genetic variant.
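This 'work backwards' approach amounts to a phenome-wide tally: group patients by genotype, then count which diagnoses are over-represented among carriers. A toy sketch, using made-up records rather than real health-record data:

    # Toy example of tallying diagnoses among carriers of a given variant.
    from collections import Counter

    # Each record: (carries_variant, set of diagnosis codes) -- entirely made up.
    records = [
        (True,  {"E11"}),            # type 2 diabetes
        (True,  {"E11", "I10"}),
        (False, {"I10"}),            # hypertension only
        (False, set()),
    ]

    carrier_dx = [dx for has_variant, dx in records if has_variant]
    counts = Counter(code for dx in carrier_dx for code in dx)
    for code, n in counts.items():
        print(f"{code}: {n}/{len(carrier_dx)} carriers")
    # A real analysis would compare carriers with non-carriers and correct for
    # the thousands of diagnosis codes tested.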
And the genome is only part of the story — other '-omes' may also be
useful barometers of health. In July, Jun Wang stepped down as chief
executive of BGI to start up an organization to analyse BGI's planned
million-genome cohort alongside equivalent data sets from the proteome,
transcriptome and metabolome. “I will be initiating a new institution to
focus on using artificial intelligence to explore this kind of big
data,” he says.
It takes patients
As researchers strive to integrate data from health records and
clinical trials with genomic and other physiological data, patients are
starting to contribute. “When we're focused on things like behaviour,
nutrition, exercise, smoking and alcohol, you can't get better data than
what patients report,” says Ritchie.
Mobile and wearable devices, such as smartphones and Fitbit trackers, are collecting
data on exercise and heart rate, and the volume of such data is soaring
as it can be gathered with minimal effort on the wearer's part.
Each patient may become a big-data producer. “The data we generate at
home or in the wild will vastly exceed what we accumulate in clinical
care,” says Kohane. “We're trying to create these big collages of
different data modalities — from the genomic to the environmental to the
clinical — and link them back to the patient.” As these developments
materialize, they could create computational crunches that will make
today's 'big data' struggles seem like pocket-calculator problems. And
as scientists find ways to crunch the data, patients will be the
ultimate winners.