Automation and parallelisation

MathWorks Australia
Monday, 05 September, 2011

Processing the data generated in quantitative high-throughput gene expression analysis

Pharmaceutical and biomedical research are evolving to take advantage of bioinformatic research programs that incorporate data from new high-resolution assays and technologies such as microarrays and fluorescence in situ hybridisation.

Information supplied by these methods concerning the physiological functions of genes can provide a molecular understanding of the mechanism of diseases, leading to effective therapies. It can also assist in basic scientific research, such as improving our understanding of genetic networks or the processes of embryological development.

The value added by these new assay technologies resides in the detailed insight provided by the high-resolution data they output. However, this blessing is also a curse: the huge volume of data must be processed, assessed for quality and have its relevant features extracted before any visualisation or statistical analysis is possible.

The time when it was possible for a lab scientist to manually perform this processing is now past; researchers would prefer to put their education and skills to use by concentrating on the science, rather than wasting their days repetitively annotating images, or manipulating tables in a spreadsheet program. Manual processing is also error-prone and often subjective.

For these new technologies, automated processing is essential. Effective development of data processing algorithms demands tools that enable rapid prototyping and implementation of these algorithms; ideally, this will be the same environment that is subsequently used for visualisation and statistical analysis of the processed data.

However, automated processing is only a first step. When assay technologies are deployed in a high-throughput environment, even an automated data processing step can become the bottleneck. In this case, researchers can take advantage of the increasing availability of compute clusters to parallelise their data processing, releasing the bottleneck so that research can take place at the speed of science, not of analysis.

The following case study explores these issues. It details the automated parallel processing and subsequent visualisation and analysis of data from the FlyEx database.

The FlyEx database case study

The FlyEx database (http://flyex.uchicago.edu) is a web repository of segmentation gene expression images. It contains images of fruit fly (Drosophila melanogaster) embryos at various stages of development (cleavage cycles 10-14A) and quantitative data extracted from the images. Such work is invaluable in understanding the developmental processes in the embryo. Spatial and temporal patterns of gene expression are fundamental to these processes.

The images are created using immunofluorescent histochemistry. Embryos are stained with fluorescently tagged antibodies that bind to the product of an individual gene, staining only those parts of the embryo in which the gene is expressed. In this case, the genes studied are those involved in the segmentation of the embryo, such as the even_skipped, caudal and bicoid genes. The embryo is then imaged using confocal microscopy.

The database currently contains more than two million data records and thousands of images of embryos. Clearly, the automated processing of these images is a necessary step in the construction of the database.

Development of an image processing algorithm

MathWorks developed an algorithm for processing the images using MATLAB and Image Processing Toolbox.

MATLAB is an environment for technical computing, data analysis and visualisation. As an interactive tool, it enables researchers to rapidly prototype algorithms and analyses; its underlying programming language, optimised for handling large scientific datasets, allows these prototypes to be automated for high-throughput applications. Toolboxes provide application-specific add-on functionality, such as image and signal processing, multivariate statistics and bioinformatics.

The image processing algorithm involves a number of steps. The embryo image is first rotated into a standard configuration: centred, with the anterior-posterior axis horizontally oriented. The second step removes noise from the image and equalises the brightness across the entire image using adaptive histogram equalisation. Finally, the boundaries of each cell in the embryo are found using a brightness thresholding step and the pixels within each cell boundary are separated into their red, green and blue channels. These channels correspond to the levels of expression of three segmentation-related genes: even_skipped, caudal and bicoid. Figure 1 shows a profile of the three expression levels along the anterior-posterior axis.
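The noise removal, thresholding and channel separation steps can be sketched in a few lines. The following is a minimal illustration in plain Python, not the MATLAB code described in this article: the function names are hypothetical, the image is assumed to be a list of rows of (red, green, blue) tuples, and the rotation and adaptive histogram equalisation steps are left out for brevity.

```python
from statistics import median


def median_filter(channel, radius=1):
    """Replace each pixel with the median of its neighbourhood.

    The neighbourhood is a (2*radius+1) square, clipped at the image border.
    """
    height, width = len(channel), len(channel[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                channel[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = median(window)
    return out


def threshold_mask(channel, level):
    """True where a pixel is brighter than `level` (treated as inside a cell)."""
    return [[value > level for value in row] for row in channel]


def axis_profile(channel, mask):
    """Mean masked intensity per column, i.e. along the horizontal axis.

    Columns that contain no masked pixels yield None.
    """
    height, width = len(channel), len(channel[0])
    profile = []
    for x in range(width):
        values = [channel[y][x] for y in range(height) if mask[y][x]]
        profile.append(sum(values) / len(values) if values else None)
    return profile


def expression_profiles(rgb_image, level):
    """Return one horizontal-axis profile per colour channel."""
    height, width = len(rgb_image), len(rgb_image[0])
    # Split the image into three single-channel images and denoise each one.
    channels = [[[pixel[c] for pixel in row] for row in rgb_image] for c in range(3)]
    smoothed = [median_filter(channel) for channel in channels]
    # Assumption: a pixel counts as "in a cell" if its brightest channel
    # exceeds the threshold.
    brightness = [
        [max(smoothed[c][y][x] for c in range(3)) for x in range(width)]
        for y in range(height)
    ]
    mask = threshold_mask(brightness, level)
    names = ("red", "green", "blue")
    return {name: axis_profile(smoothed[c], mask) for c, name in enumerate(names)}
```

Because the image is treated simply as an array of numbers, each step is an ordinary array operation, and the per-channel profiles returned by expression_profiles are the kind of curve plotted in Figure 1.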

The algorithm is available from MATLAB Central, an online community that features a newsgroup and, at the File Exchange, user-contributed code available for download. The use of prebuilt methods, such as brightness thresholding, histogram equalisation and median filtering for noise reduction (supplied as building-block algorithms in Image Processing Toolbox), together with the ability of MATLAB to treat images simply as numerical arrays, allows the prototype to be compact, readable and rapidly constructed.

Statistical analysis and visualisation

The processed results can then be analysed and visualised using the same MATLAB environment. This approach reduces the number of tools needed for the overall analysis, which streamlines the process, reduces the training needs of scientists and removes a source of error in the transfer of data to a separate statistical package.

The results could be analysed according to the nature of the experimental context. For example, you could use:

Time series methods to examine how the gene expression profiles vary over time
ANOVA to examine the effects of different treatments or disease conditions
Multivariate statistical or machine learning methods, such as Principal Component Analysis or Cluster Analysis, to examine complex relationships within the profiles and between samples
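As an illustration of the second item, a one-way ANOVA reduces to comparing the variation between treatment groups with the variation within them. The sketch below is plain Python rather than the MATLAB statistics functions a reader of this article would normally use; the function name is hypothetical, and each group is assumed to be a list of one summary value per embryo, such as a peak expression level.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numbers.

    Requires at least two groups and more observations than groups. If every
    group has zero internal variance the within-group mean square is zero and
    the division below raises ZeroDivisionError.
    """
    k = len(groups)
    n = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - mean(group)) ** 2 for group in groups for x in group)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within
```

A large F value indicates that the groups differ by more than their internal scatter would explain; turning it into a p-value requires the F distribution with (k - 1, n - k) degrees of freedom, which a statistics package provides.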

Analysis results can also be visualised with other data sources and annotations, such as:

Publicly available information on the genes themselves
Public repositories of genetic data, such as GenBank
Genetic pathway visualisations

Automation

The prototype algorithm described above automates the processing of a single image. By taking advantage of the underlying programming language of MATLAB, it is a simple matter to apply this algorithm in a batch process to analyse thousands of images automatically, or in a continuous process to analyse images as they are generated by a high-throughput process.
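A batch process of this kind is little more than a loop around the single-image routine. The following Python sketch shows the pattern under stated assumptions: summarise_file is a hypothetical placeholder for the real algorithm (here it only reports the file size), and the "*.tif" pattern is an assumed naming convention, not something taken from the FlyEx database.

```python
from pathlib import Path


def summarise_file(path):
    """Placeholder for the single-image algorithm: reports only the file size."""
    return {"file": path.name, "bytes": path.stat().st_size}


def process_batch(folder, pattern="*.tif"):
    """Apply summarise_file to every matching file in `folder`, in name order."""
    results = []
    for path in sorted(Path(folder).glob(pattern)):
        try:
            results.append(summarise_file(path))
        except OSError as error:
            # Record the failure and carry on, so that one unreadable file
            # does not stop the whole batch.
            results.append({"file": path.name, "error": str(error)})
    return results
```

Recording failures instead of stopping matters at this scale: in a run of thousands of images, a single corrupt file should produce one flagged record, not an aborted job.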

Depending on IT infrastructure, generated image data might be accessed by MATLAB from files on disk, from a database or even directly streamed in from microscopes. Processing and analysis results can be output to a report file, a database or a web page.

Parallel computing

Depending on the particular steps taken in the image processing algorithm, an individual image could take from a few seconds to a minute to process on a typical desktop computer. Processing the few thousand images in the FlyEx database might then take from half a day to a few days, a time frame that is perhaps acceptable in the context of a research program in developmental biology.

However, in a pharmaceutical context, a typical high-throughput screening library contains one or two million compounds that are tested by an HTS robot at a rate of up to 100,000 compounds per day. A time frame of several months to process these generated images would be an unacceptable delay.
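The arithmetic behind this estimate is easy to check. The short Python sketch below assumes one image per compound and an idealised cluster in which the work divides perfectly among the workers; both are simplifying assumptions made for illustration, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def processing_days(n_images, seconds_per_image, n_workers=1):
    """Days needed to process n_images, assuming the work divides evenly
    among n_workers with no communication or scheduling overhead."""
    return n_images * seconds_per_image / n_workers / SECONDS_PER_DAY
```

For a library of two million compounds at five seconds per image, processing_days(2_000_000, 5) gives about 116 days, or roughly four months, on a single computer; at one minute per image the figure is close to four years. Under the same assumptions, 100 workers would bring the five-second case down to a little over one day.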

Parallel Computing Toolbox enables scientists to solve computationally intensive problems using MATLAB on multicore and multiprocessor computers, or to scale up to a cluster using MATLAB Distributed Computing Server (Figure 2). By distributing the processing of high-throughput screening data across multiple computers, researchers can decrease analysis time by orders of magnitude.

Figure 1: Results of the automated image processing and statistical analysis of a fruit fly embryo. The graph depicts the expression levels of three genes involved in the segmentation of the embryo as a function of anterior-posterior position along the blastoderm.

Simple parallel programming constructs, such as parallel for-loops, allow scientists to convert algorithms to run in parallel with minimal code changes and at a high level without programming for specific hardware and network architectures (Figure 3).

Figure 2: Parallel Computing Toolbox used to accelerate computationally intensive tasks.

It is crucial that scientists are able to easily convert their algorithms to work in parallel while remaining at this high level, without needing to become experts in the traditionally complex techniques of programming for high-performance computing.
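The same idea can be shown outside MATLAB. The sketch below uses the process pool in Python's standard library as an analogue of a parallel for-loop; it is not the parfor construct itself, and analyse_image is a hypothetical stand-in for the per-image algorithm. The key requirement is the same in both settings: each iteration must be independent of the others.

```python
from concurrent.futures import ProcessPoolExecutor


def analyse_image(image_id):
    """Stand-in for the per-image algorithm; each call is independent."""
    return image_id, sum(i * i for i in range(1000))


def process_serial(image_ids):
    """The ordinary loop: one image after another."""
    return [analyse_image(image_id) for image_id in image_ids]


def process_parallel(image_ids, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Same result as process_serial, with the loop body farmed out to a pool.

    executor_cls is a parameter so that a thread pool can be substituted,
    for example when the work is dominated by reading files.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with image_ids.
        return list(pool.map(analyse_image, image_ids))
```

The only change between the two versions is how the loop is dispatched; the body of the loop is untouched. When run as a script on platforms that start worker processes afresh, the call to process_parallel should sit under an `if __name__ == "__main__":` guard.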

Future trends and requirements

It is already a cliché that the new technologies expanding the boundaries of pharmaceutical and biomedical research generate ever-increasing amounts of data, and that automated analysis algorithms and efficient parallelisation on compute clusters are vital if we want to avoid an analysis bottleneck and allow science to proceed at its own rapid pace.

Tools are therefore needed to enable the prototyping, automation and easy parallelisation of these algorithms. The use of technical computing environments, such as those described above, allows scientists to move rapidly from interactively visualising data, to prototyping a processing algorithm, to automation, to a high-throughput solution, all in a single environment.

Adoption of these techniques will allow science to avoid analysis bottlenecks and to continue to proceed at its own pace, even in a high-throughput context.

By Sam Roberts, Application Engineer, Computational Biology, MathWorks

Figure 3: The parfor (a parallel for-loop) keyword from Parallel Computing Toolbox enables simple conversion of a batch processing application to run in parallel, taking advantage of a computer cluster.
