Shovel-ready biomedical whole-genome analysis

September 10th, 2010 by Steve Chervitz

If there ever were a “shovel-ready” project fit for the 21st century, it is biomedical whole genome analysis. Built from investments made over the past 20 years by the NIH/DOE and untold numbers of public and private institutions, the Human Genome Project has unleashed an arsenal of molecular biological technologies such as massively parallel DNA sequencers, DNA microarrays, mass spectrometers and other tools for collecting data at a whole-{genome, transcriptome, proteome, metabolome} scale. It’s now quite easy for researchers and lab techs to amass mountains of data, but it’s not so easy to sift through it to extract biologically and medically meaningful insights and relationships. Forget the shovels; we need bulldozers.

It is a milestone achievement for humanity to have decoded its complete genome and to be on-track for developing technologies that can do this cheaply for any individual. We will soon be inundated with multitudes of personal genomes, and multitudes of persons wanting help interpreting their (or their patients’) genome — primarily to learn about potential health risks for themselves and their children or to seek guidance during diagnosis and treatment of existing health conditions.

Though we’re rapidly getting quite good at collecting our whole-genome DNA sequence data, we still have a long way to go in analyzing and managing it in clinically meaningful ways. Metaphorical shovels and bulldozers are needed to build better tools and resources, and to develop service providers with the expertise in genome analysis and the ability to translate this complex data into informative reports that are digestible to both doctors and their patients.

So where specifically should we start “digging”? Francis Collins in 2009 proposed a number of ideas for relevant projects that would strengthen our command of whole genome information, including (1) creating a resource to give consumers a way to obtain objective information about the clinical validity and utility of genome analyses, and (2) expanding studies of gene-environment interactions.

Another shovel-ready task is to train more genetic counselors and clinical geneticists so that we’ll have the capacity to effectively communicate to patients and their doctors about the implications of individual whole genome sequences. Creating the decision support systems for these counselors, clinicians, and patients to help them interpret whole genome sequence data and to keep it up-to-date with latest scientific understanding is another area deserving of expanded funding.

The genetic testing industry has been under pressure lately from US government agencies. At the heart of this matter is ensuring accurate interpretation and communication of DNA test results to patients and medical personnel. Improved regulatory oversight, if done properly, could really help society realize the promise of genome-guided medicine. But regardless of how this industry and the underlying data are regulated, it doesn’t change the fact that we have tons of important work yet to do on the analysis side.

It is Omicia’s mission to focus on this analysis work and deliver more effective medically-relevant personal genome interpretation; to fill a much-needed niche by providing the expertise and services for clinicians to better understand how variations identified in personal whole genome sequences impact health conditions and predispositions.

Improvements in our ability to grok whole genome sequences of individuals will help move our healthcare industry in the direction it sorely needs to go: towards an emphasis on predictive, preventive, personalized healthcare and away from the reactive, one-size-fits-all “disease care” system we largely have now.

My internship at Omicia

September 10th, 2010 by Anirudh Todi

I walked in and thought “I hope I’m in the right place!” I had a million questions running through my head. What does an intern do? How much is expected of me? Will they like me? Would they even acknowledge me? How does a budding computer science engineer fit into a biotechnology company? Despite having completed my freshman year in college, I felt I was the new kid at school all over again. I was absolutely terrified that I’d have that job where I was just…“the intern.”

My name is Anirudh Todi and I am a Computer Science major at UC Berkeley. As it turned out, I worried too much…far too much! People already knew who I was before I even walked in the door. I was a real person – I was Anirudh. Everybody was incredibly welcoming and friendly and it felt great to be there. Only a few weeks into my internship, I have well and truly found my place in this office. For a company like Omicia that delivers personalized genetic analysis, the internal culture of communicating and helping others couldn’t be better. At Omicia, I truly believe, I am surrounded by brilliant teammates fueled by a desire to make a difference in this world.

Michael Levitt, a professor of structural biology at Stanford University, once said – “Computing has changed biology forever; most biologists just don’t know it yet.” A few days into my internship, I realized I wasn’t working with “most biologists.” At Omicia, I found a combination of two fairly divergent areas – biology and information technology – coming together for a common goal. I realized here that just as the tool for understanding the physical sciences is math/calculus, the tool for understanding biology at a systems level is computer science. At Omicia, I have primarily been working on the normalization of all medical vocabularies used in Omicia’s disease gene database. A side project that I have also been involved in is setting up a multi-node Hadoop cluster to speed up annotation pipelines. Both projects involve new and fascinating concepts for me and have kept me excited enough to be the first to arrive at the office and the last to leave. The best part of it all, however, is that I am helping develop a new product – something that I did not get to do very often in school.

This internship is my “hello world” to working life, and I already have many great memories and stories to tell my family and friends back home in India. A few days ago my CEO took the entire office out to lunch to watch the FIFA World Cup semis. We had the best seats in the house…surrounded by 250 wild, cheering, dancing football fans…it was incredible. We collectively sighed when the Germans lost in a terrific, nail-biting game…but it was the most amazing day! Noon-day bike rides along the beautiful Emeryville waterfront and Friday Socials round out the work week. One of the best perks of working at Omicia is the easy access to the leadership team, including the CEO.

Despite the casual dress code and laid-back office, Omicia can be very intense in its commitment to meeting the high standards that it sets for itself. The problems they solve make long hours a necessity; the people they hire make the long hours fly by. They have been quietly building up their technology to interpret human genomes. Gradually, they are starting to make a formidable impact. I believe that Omicia is on the verge of changing medicine forever and I am glad to know that I have been able to contribute my bytes to making people’s lives better.

World of Omicia: an Intern’s Perspective

July 8th, 2010 by Arvind Vepa

Hey guys, I’m Arvind, the new intern at Omicia. I’m a student from the UC Berkeley EECS department and have been helping Omicia develop some of the software for the company’s genomic analysis and interpretation platform this summer. While I’m a programmer, I have been interested in biology for some time. Ever since the completion of the Human Genome Project, I’ve always been interested in DNA’s implications; how much do we really know about base pair 133,984,058 in chromosome 8 and what can we really say about the life of a person from just that single base pair?

I was something of a skeptic until I joined Omicia: I felt the complexity of the body, compounded by the lack of full genome sequences, would mean that real benefits from this technology would come decades from now. However, after working here for the last month and a half, I’ve realized that’s not the case.

Not only does a variation in that base pair inform you of major consequences to your body (susceptibility to autoimmune thyroid disease), but there are millions of SNPs (single nucleotide polymorphisms) that have been cataloged, analyzed definitively, and can be as helpful, if not more helpful. What’s really cool about Omicia is it’s on the cutting edge of genomic information. Currently, scientists from around the globe are testing different genes and recording variants that could have different consequences in your body. The funny thing is, as soon as information like this is published (regardless of how deep the information is buried), it somehow finds its way into Omicia’s system.

By the way, when I mentioned scientists from around the globe, I forgot to mention that the Omicia employees sitting with me right now actually represent five of the six inhabited continents (unfortunately Australia got left by the wayside). Especially with the World Cup fervor passing through, the global perspective in our company has been glaringly obvious (maybe, at times, too obvious…).

From a programmer’s perspective, there are a lot of good things I’ve seen here. What really stands out, though, is Omicia’s vision for the future. Omicia envisions a world where individuals can have their DNA sequenced at an affordable cost and then analyzed by Omicia’s system, also at an affordable price, so that each individual can live his/her life knowing a significant portion of his/her body’s strengths and weaknesses. It’s a world where we can move towards a realistic appraisal of ourselves. It’s a world that will improve lives right from the start. It’s a world, with Omicia’s help, that is hopefully coming soon.

Genome Sequence Data Explosion?

May 10th, 2010 by Archie Russell

Next-generation sequencing technologies (or, perhaps, current-generation sequencing technologies) do produce an incredible amount of data. High-quality genomes have 20x-30x depth of the 3 gigabase (Gb) genome, leaving us 60-90 gigabytes (GB) to work with. Throw in quality metrics and we’re looking at 100 GB of data per genome. Multiply this across a project like the 1000 Genomes Project and we get to 100 terabytes. Storing 100 TB is an endeavor, and mirroring 100 TB over a reasonably-high-speed link would take several days. Ouch.
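The back-of-the-envelope numbers in this paragraph can be reproduced with a few lines of Python. This is only a rough sketch: it assumes one byte per sequenced base for the uncompressed reads, and the 1 Gbit/s link speed is an illustrative assumption, not a figure from the post.

```python
# Rough storage and transfer estimates for raw sequencing reads.
GENOME_BASES = 3_000_000_000        # ~3 Gb human genome (from the text)
BYTES_PER_BASE = 1                  # assumption: 1 byte per base, uncompressed
BYTES_PER_GENOME = 100 * 10**9      # ~100 GB per genome incl. quality metrics (from the text)
GENOMES_IN_PROJECT = 1000           # e.g. the 1000 Genomes Project


def raw_read_bytes(depth):
    """Bytes needed to store reads covering the genome `depth` times."""
    return GENOME_BASES * depth * BYTES_PER_BASE


def transfer_days(total_bytes, link_bits_per_second):
    """Days needed to move `total_bytes` over a link of the given speed."""
    seconds = total_bytes * 8 / link_bits_per_second
    return seconds / 86_400


PROJECT_BYTES = BYTES_PER_GENOME * GENOMES_IN_PROJECT  # 100 TB

if __name__ == "__main__":
    print(raw_read_bytes(20) / 10**9, "GB at 20x")   # 60.0
    print(raw_read_bytes(30) / 10**9, "GB at 30x")   # 90.0
    print(PROJECT_BYTES / 10**12, "TB for the project")
    # Assumed link speed: 1 Gbit/s, fully saturated, no protocol overhead.
    print(round(transfer_days(PROJECT_BYTES, 10**9), 1), "days to mirror")
```

At the assumed 1 Gbit/s, mirroring 100 TB takes roughly nine days, which is consistent with the "several days" estimate.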

I just downloaded the third pilot dataset from the 1000 Genomes Project. It’s only a fraction of the eventual data, but the numbers are substantial: around 1,000 gene regions sequenced in about 700 people. Depending on precise details, this is around 0.1% of the eventual dataset. Hundreds of gigabytes of data, hours of progress-bars, and thumb-twiddling, right?  Of course not. Supporting data aside, the pilot set, compressed, is 9.5 MB — the size of a couple of old-school mp3s — downloadable in minutes. Extrapolating out to the eventual complete dataset, we’d still only need several gigabytes in total (compressed). Any way you spin it, it’s very manageable. What’s going on here?  There are two explanations:
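The extrapolation can be sketched in the same way. It assumes the compressed size scales linearly with the fraction of the dataset covered, which is a simplification; the function name is made up for this illustration.

```python
# Scale the compressed pilot download up to the full dataset.
PILOT_COMPRESSED_MB = 9.5   # size of the compressed pilot set (from the text)
PILOT_FRACTION = 0.001      # pilot is ~0.1% of the eventual dataset (from the text)


def extrapolate_full_size_gb(pilot_mb, fraction):
    """Estimate the full compressed size in GB, assuming linear scaling."""
    full_mb = pilot_mb / fraction
    return full_mb / 1000   # decimal units: 1 GB = 1000 MB


if __name__ == "__main__":
    estimate = extrapolate_full_size_gb(PILOT_COMPRESSED_MB, PILOT_FRACTION)
    print(f"Estimated full compressed dataset: {estimate:.1f} GB")
```

Under that assumption the estimate comes out just under 10 GB, in line with "several gigabytes".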

  • First off, all those sequence reads really aren’t that interesting once digested. They are process artifacts. Other than researchers in specialized areas, no one really wants to work with reads. What we really want are genomes, which are dramatically smaller.
  • Secondly, through sequence homology, evolution has provided us the basis for very efficient data size compression. On average, any two of us are only 0.1% different from each other, with about three million, mostly small, differences. The other 99.9% of the genome we can take for granted. Three million datapoints in a single variant file per genome is something we can work with. Formats such as GVF (in beta) and VCF have been created to manage such variant files.
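The second point, storing only the differences from a shared reference, can be illustrated with a toy example. This is a minimal sketch that handles single-base substitutions between equal-length sequences only; the tuple layout and function names are invented for illustration and are not the GVF or VCF formats.

```python
# Toy reference-based compression: keep only the positions where a sample
# differs from the reference, then rebuild the sample from that list.
GENOME_BASES = 3_000_000_000
# 0.1% of the genome differs between two people (from the text).
EXPECTED_DIFFERENCES = GENOME_BASES // 1000


def call_substitutions(reference, sample):
    """Return (1-based position, ref base, alt base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (i + 1, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]


def apply_substitutions(reference, variants):
    """Rebuild a sample sequence from the reference plus its variants."""
    bases = list(reference)
    for position, ref, alt in variants:
        if bases[position - 1] != ref:
            raise ValueError(f"reference mismatch at position {position}")
        bases[position - 1] = alt
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    sample = "ACGTTCGTAA"
    variants = call_substitutions(reference, sample)
    print(variants)
    print(apply_substitutions(reference, variants) == sample)
    print(EXPECTED_DIFFERENCES)  # about three million per genome
```

Real variant files also have to represent insertions, deletions, and larger rearrangements, which this sketch ignores.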

And this really isn’t any different from the past. All genomics resources already collapse source data into usable information. Even supposedly ‘raw’ reads are the result of algorithmic digestion of much larger digital images. With resources like UCSC and ENSEMBL it’s easy to forget that the reference genome itself was once many short, disorganized reads with little higher-level annotation, impractical to download and work with.

Lincoln Stein, in a recent paper titled “The case for cloud computing in genome informatics”, made an excellent argument for a migration of genome informatics to the cloud.

Cloud? Yes. Worry? Not yet.