Making sense of the transcriptome

By Kate McDonald
Monday, 06 July, 2009


This feature appeared in the May/June 2009 issue of Australian Life Scientist.

Associate Professor Sean Grimmond’s name seems to be popping up all over the place at the moment. Last year, he and his colleagues had an excellent paper published in Nature Methods on profiling the transcriptome of mouse embryonic stem cells using massive-scale RNA sequencing, work that was recently repeated with human ES cells.

This year, his lab was heavily involved in the work published by the FANTOM consortium in Nature Genetics, which has produced remarkable information on the regulatory networks underlying gene expression in a range of different cells in the body.

Then it was announced that Grimmond would direct Australia’s contribution to the International Cancer Genome Consortium, which will involve deep sequencing of 500 pancreatic and ovarian tumour samples as part of a worldwide effort to generate high-quality genomic data on up to 50 types of cancer.

Grimmond and his lab are certainly making a big name for themselves, but it is the culmination of a lot of very difficult work. The advance of technology has helped enormously, Grimmond says, so much so that it is moving beyond even his “psychedelic optimism”.

Grimmond returned to the University of Queensland in 2000 after some years in the UK and set about establishing an expression genomics lab. Since then, the lab has concentrated on capturing whatever information it can from the transcriptome and then trying to make sense of the biology.

In the seemingly olden days of 2000, that was achieved through microarrays. Now, the lab has two of Applied Biosystems’ new SOLiD 3 sequencing platforms, and the revolution has begun.

“In about 2003 we started branching out into looking at the transcriptional output and complexity of mammalian systems with the FANTOM consortium – that was FANTOM 3, between 2003 and 2005,” Grimmond says. “That posed a dilemma for us because it became clear that the tools we were using either weren’t scalable or weren’t amenable for studying all of the output that we were seeing in the transcriptome, so an array just wouldn’t cut it.”

In mid-2006, Grimmond’s lab was working on FANTOM 4 and using data from the 454 and Solexa platforms, which were generating up to 10 million tags per run. The team wanted to have a go at doing shotgun transcriptomes with 20 to 30 million tags, which is where Applied Biosystems comes in.

“We teed up with them and since 2007 we’ve been working on designing approaches to do shotgun transcriptome sequencing, and now we are talking about doing 100, 150 million reads per sample to get a complete picture of everything that’s there. The speed of its evolution has been phenomenal.”

The expression genomics lab received the third SOLiD system in the world, and the first outside of the US. “Our first run was just under a gigabase and now the runs are pushing beyond 20 gig a run,” he says. “A current upgrade is going on now and we expect to double that. There is the expectation that we could be pushing 100 gig by the end of the year.”

What all of this firepower enables the lab to do is to measure every transcript in a cell and to assess all of the complexity that entails. “The logistics of handling it is rather terrifying, but what you need to do is develop a pragmatic approach to asking specific questions.

“You can’t just get 10 gigabases and then say show me everything that is important. You have to have your pre-existing hypotheses and logic, and you ask the question of the data to see if it will support one or another.”


FANTOM

Grimmond and his team have been heavily involved in the FANTOM – or Functional Annotation of the Mammalian Genome – consortium, which is spearheaded by Yoshihide Hayashizaki of RIKEN. It was Hayashizaki who took up the idea of trying to get full-length cDNA sequences from all of the tissues of mouse and man, Grimmond says. “He orchestrated how we would go about generating a massive amount of sequencing data by traditional methods, and then got an international consortium together to work out how we annotate this output and how do we value-add to that.”

FANTOM 1 was completed prior to the human genome sequence, and FANTOM 2 was published at the same time as the mouse genome paper.

“So you had the genomic view of the mouse as well as the transcriptomic coming out together in Nature,” Grimmond says.

“FANTOM 3 was a systematic review of the complexity of output in mouse and man, which suggested we’ve got six to seven transcripts per locus, which everyone thought was crazy.

“There were a lot of non-coding RNAs, which everyone thought was also crazy, but that has stood up to the test of time and has been shown through ENCODE and other projects to be legit.”

FANTOM 3 was helped along by the development of Cap Analysis of Gene Expression, or CAGE, which Grimmond describes as just like SAGE (Serial Analysis of Gene Expression) but from the five prime (5’) end.

“If you sequence the extreme 5’ end, you get a SAGE-like tag, but you also work out where the initiation of transcription is, so we now know exactly where the promoter starts,” he says.

“We don’t just assume that what it says in RefSeq is correct, and we know now that each gene has two to three promoters on average, rather than the one that we’d assumed prior to the likes of FANTOM 3.

“CAGE started giving us insights into expression but it is also telling us about promoters and the architecture and all of that interesting genome biology.”
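To make the CAGE idea concrete, here is a minimal sketch, in Python, of how clustered 5’-tag positions might be turned into promoter calls. The tag positions and the function call_promoters are hypothetical, invented purely for illustration, and this is not the FANTOM pipeline; the logic simply mirrors the principle Grimmond describes, in which counting tags at each 5’ position quantifies expression while clustering nearby positions locates where transcription initiates.

# Illustrative sketch only: each sequenced CAGE tag marks the 5' end of a
# capped transcript, so clustering mapped tag positions both quantifies
# expression and reveals where transcription initiates (candidate promoters).
from collections import Counter

# Hypothetical mapped 5'-tag positions on one chromosome strand
# (in practice these come from aligning millions of CAGE tags).
tag_positions = [10012, 10013, 10012, 10015, 10498, 10499, 10500, 10499, 10013]

def call_promoters(positions, max_gap=20):
    """Group nearby 5'-tag positions into promoter clusters.

    Returns (start, end, tag_count) tuples; tag_count serves as the
    expression estimate for transcripts initiated at that promoter.
    """
    counts = Counter(positions)
    clusters = []
    current = None
    for pos in sorted(counts):
        if current and pos - current["end"] <= max_gap:
            current["end"] = pos
            current["tags"] += counts[pos]
        else:
            if current:
                clusters.append((current["start"], current["end"], current["tags"]))
            current = {"start": pos, "end": pos, "tags": counts[pos]}
    if current:
        clusters.append((current["start"], current["end"], current["tags"]))
    return clusters

for start, end, tags in call_promoters(tag_positions):
    print(f"promoter ~{start}-{end}: {tags} tags")

Run on the toy data above, this reports two promoter clusters with their tag counts, which is the sense in which a CAGE experiment reads out both expression level and promoter position at once.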

With FANTOM 4, which aims to understand how the many components discovered work together in the context of a biological network, new technology has allowed the development of deepCAGE, which enables researchers to sequence up to tenfold deeper.

“FANTOM 4 is different,” Grimmond says. “We are trying to take a true systems biology approach, where we’ve got a model of cell differentiation and we are bashing at it every way we can, with every genomic and transcriptomic tool we’ve got, to try to see if we can build up the entire network of transcription factors which are controlling that process.

“We would never have been able to do this without next-generation, transcriptome-based approaches.”


Digging deeper

The new technology has allowed Grimmond and his team to develop a massive-scale RNA sequencing protocol that they call SQRL, or short quantitative random RNA libraries. This is a method for making a directional cDNA library to survey the complexity, dynamics and sequence content of transcriptomes in a near-complete fashion.

In the Nature Methods paper last year, the team showed that it could survey the transcriptomes of undifferentiated mouse embryonic stem cells and embryoid bodies. The technique can screen for single nucleotide polymorphisms (SNPs), the transcriptional activity of repeat elements and alternative splicing events, a combination that was previously out of reach.

“One of the things that we’ve discovered is that probably 10 to 20 per cent of all the transcription we see is outside known exons,” Grimmond says. “So we also want to screen for expressed SNPs or mutations or RNA editing events and things like that.

“We are also now looking at dividing up the transcriptome into more and more refined fractions, so we’ll look at the RNA associated with the translational machinery, or we’ll look at the very large and very small molecular weight RNAs, and then try to tease out the true nature of these complexities.”
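As a rough illustration of the kind of analysis behind that 10 to 20 per cent figure, the sketch below classifies aligned RNA-seq reads as falling inside or outside annotated exons. The exon coordinates, read positions and the 50 bp read length are made up for the example and do not reflect Grimmond’s actual pipeline; a real analysis would use a genome annotation such as RefSeq and millions of aligned reads.

# Illustrative sketch only: estimate the fraction of RNA-seq reads that
# fall outside annotated exons, using a simple interval-overlap check.
import bisect

# Known exons on one chromosome as sorted, non-overlapping (start, end) intervals.
known_exons = [(1000, 1200), (1500, 1650), (4000, 4300)]
exon_starts = [s for s, _ in known_exons]

# Hypothetical aligned read start positions (reads assumed 50 bp long).
read_starts = [1010, 1100, 1600, 2500, 2550, 4100, 5000, 5050, 5100, 1190]
READ_LEN = 50

def overlaps_known_exon(read_start):
    """True if the read overlaps any annotated exon."""
    read_end = read_start + READ_LEN
    # Check the exon starting at or before the read, plus the next one.
    i = bisect.bisect_right(exon_starts, read_start) - 1
    for j in (i, i + 1):
        if 0 <= j < len(known_exons):
            s, e = known_exons[j]
            if read_start < e and read_end > s:
                return True
    return False

outside = sum(1 for r in read_starts if not overlaps_known_exon(r))
print(f"{outside}/{len(read_starts)} reads "
      f"({100 * outside / len(read_starts):.0f}%) lie outside known exons")

The same classification, applied genome-wide, is what lets a lab flag reads mapping to novel transcribed regions for follow-up, alongside screens for expressed SNPs, RNA editing and splicing events.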

A recent paper in Nature Methods by Azim Surani and colleagues from the University of Cambridge shows just how far this technology has come. They used the SOLiD system to analyse the whole transcriptome of a single cell, in this case a single mouse blastomere.

“That’s a great paper that really shows you that the next-gen approaches at the moment are only limited by the quality of the molecular biology that makes the library,” Grimmond says. “That’s fundamentally important and something we are looking at very closely at the moment, because whether it be developmental biology or cancer biology, being able to get down to a very small sample size will be of critical importance.”

There are still some hurdles to clear, as the approaches cannot yet discriminate between sense and antisense strands and the ends of very long transcripts are lost, but as Grimmond says, it is like the early days of microarrays. “With the first arrays there were no amplification methods for the material, so we needed 50 micrograms in total. Now we only need 50 nanograms. It’s just a matter of time before this becomes standard practice, I believe.”


Cancer genome

Grimmond and his team have worked for many years on profiling cancer, collaborating with a number of groups around Australia but primarily with teams from the Queensland Institute of Medical Research (QIMR). While looking at the transcriptome is very important, he says, tools have also been developed to study cell division and differentiation and pathological states such as breast and pancreatic cancer.

In what is the largest ever grant awarded by the NH&MRC, $27.5 million has been pledged to fund Australia’s role in the International Cancer Genome Consortium (ICGC). The ICGC was established last year by research organisations throughout the world to coordinate cancer genome mapping.

Every country involved has committed to sequencing at least one specific type or sub-type of cancer. For example, France is concentrating on alcohol-related liver cancer and HER2-positive breast cancer; India is leading oral cavity cancer; Spain is leading leukaemia; Japan is leading virus-related liver cancer; the UK is studying several sub-types of breast cancer; and China will look at stomach cancer.

Australia has chosen pancreatic and ovarian cancer, for a number of reasons. Pancreatic cancer is one of the hardest to tackle, but Australia has a great deal of expertise in the area, Grimmond says. The pancreatic work will be led by Professor Andrew Biankin, who heads a pancreatic cancer research group at Sydney’s Garvan Institute.

Ovarian cancer will be tackled by Professor David Bowtell’s renowned research team at the Peter MacCallum Cancer Centre, which has been running the Australian Ovarian Cancer Study for several years and which has built the largest repository of tumour samples in the world.

The idea behind the consortium is to sequence 500 samples each from up to 50 types of cancer. Australia will sequence the transcriptome and epigenome of 380 pancreatic and 150 ovarian samples and try to get the highest quality data possible.

The project is a long one – it is expected to last five years – and with technology moving at such a pace, it is possible that the plan could change, Grimmond says. “If so, we can scale it up and reassess,” he says.

“If the technology overtakes what we have been doing then we will just use more samples. Technology is developing so fast that it could all change in six months. It is even beyond my psychedelic optimism.”
