Computational analysis of gene expression data

Kerr, Gráinne (2009) Computational analysis of gene expression data. PhD thesis, Dublin City University.


Abstract

Gene expression is central to the function of living cells. While advances in sequencing and expression measurement technology over the past decade have greatly facilitated understanding of the genome and its functions, the characterisation of functional groups of genes remains one of the most important problems in modern biology. Technological advancements have resulted in massive information output, with the priority shifting to the development of data analysis methods. As such, a large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments and, consequently, there is confusion regarding the best approach to take. Common techniques applied are not necessarily the most applicable for the analysis of patterns in microarray data. This confusion is addressed by providing a framework for the analysis of clustering techniques and an investigation of how well they apply to gene expression data. To this end, the properties of microarray data are examined, followed by an examination of the properties of clustering techniques and how well they apply to gene expression. Each technique will find patterns even if the structures are not meaningful in a biological context, and these structures are not usually the same for different algorithms. Also, these algorithms are inherently biased, as the properties of clusters reflect built-in clustering criteria. From these considerations, it is clear that cluster validation, which is usually based on a manual, lengthy and subjective exploration process, is critical for algorithm development and verification of results. Consequently, it is key to the interpretation of gene expression data. We carry out a critical analysis of current methods used to evaluate clustering results. Clusters obtained from real and synthetic datasets are compared between algorithms.
To understand the properties of complex gene expression datasets, graphical representations can be used. Intuitively, the data can be represented in terms of a bipartite graph, with weighted edges between gene-sample node couples corresponding to significant expression measurements of interest. In this research, this method of representation is extensively studied, and methods are used, in combination with probabilistic models, to develop new clustering techniques for the analysis of gene expression data in this mode of representation. Performance of these techniques can be influenced both by the search algorithm and by the graph weighting scheme, and both merit vigorous investigation. A novel edge-weighting scheme, based on empirical evidence, is presented. The scheme is tested using several benchmark datasets at various levels of granularity, and comparisons are provided with a currently popular data analysis method used in the bioinformatics community. The analysis shows that the new empirically based scheme outperforms current edge-weighting methods by accounting for the subtleties in the data through a data-dependent threshold analysis, and by selecting ‘interesting’ gene-sample couples based on relative values. The graphical theme of gene expression analysis is further developed by construction of a one-mode gene expression network which specifically focuses on local interactions among genes. Classical network theory is used to identify and examine organisational properties in the resulting graphs. A new algorithm, GraphCreate, is presented which finds functional modules in the one-mode graph, i.e. sets of genes which are coherently expressed over subsets of samples, and a scoring scheme is developed (using bipartite graph properties as a basis) to weight these modules. This representation is used to study published gene expression datasets extensively and to identify functional modules of genes with GraphCreate.
This work is important as it advances research in the area of transcriptome analysis, beyond simply finding groups of coherently expressed genes, by developing a general framework to understand how and when gene sets are interacting.
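
As a rough illustration of the graph representation described in the abstract, the sketch below builds a weighted bipartite gene-sample graph from a toy expression table and then projects it onto a one-mode gene graph. It is not the thesis's edge-weighting scheme and not GraphCreate: the per-gene z-score cutoff (`z_cut`), the shared-sample projection rule, the function names and the toy data are all assumptions made for this example only.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean, stdev


def bipartite_edges(expr, z_cut=1.0):
    """Return {(gene, sample): weight} for 'interesting' measurements.

    Assumption: a measurement is kept when its per-gene z-score has
    magnitude >= z_cut; the z-score itself is used as the edge weight.
    """
    edges = {}
    for gene, row in expr.items():
        values = list(row.values())
        mu, sd = mean(values), stdev(values)
        if sd == 0:
            continue  # a flat profile has no relative highs or lows
        for sample, value in row.items():
            z = (value - mu) / sd
            if abs(z) >= z_cut:
                edges[(gene, sample)] = z
    return edges


def one_mode_projection(edges):
    """Project the bipartite graph onto genes.

    Assumption: two genes are linked when they share a sample in which
    both are up or both are down; the link weight counts such samples.
    """
    by_sample = defaultdict(list)
    for (gene, sample), weight in edges.items():
        by_sample[sample].append((gene, weight > 0))

    gene_graph = defaultdict(int)
    for members in by_sample.values():
        for (g1, up1), (g2, up2) in combinations(sorted(members), 2):
            if up1 == up2:
                gene_graph[(g1, g2)] += 1
    return dict(gene_graph)


# Hypothetical toy data: genes x samples expression values.
toy_expr = {
    "geneA": {"s1": 1.0, "s2": 1.0, "s3": 1.0, "s4": 9.0},
    "geneB": {"s1": 2.0, "s2": 2.0, "s3": 2.0, "s4": 8.0},
    "geneC": {"s1": 5.0, "s2": 5.0, "s3": 5.0, "s4": 5.0},
}

toy_edges = bipartite_edges(toy_expr)
toy_gene_graph = one_mode_projection(toy_edges)
```

In this toy table, geneA and geneB are both strongly expressed only in sample s4, so each gets a single bipartite edge to s4 and the projection links the two genes; geneC is flat and contributes no edges.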

Metadata

Item Type:	Thesis (PhD)
Date of Award:	November 2009
Refereed:	No
Supervisor(s):	Ruskin, Heather J. and Crane, Martin
Uncontrolled Keywords:	microarray data analysis; gene expression data; supervised and unsupervised clustering methods; graph theory;
Subjects:	Biological Sciences > Bioinformatics; Computer Science > Computer simulation
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
Funders:	National Institute for Cellular Biotechnology (NICB)
ID Code:	14837
Deposited On:	17 Nov 2009 15:11 by Martin Crane. Last Modified 27 Sep 2019 11:35

Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader (5MB)
PDF (3rd party copyright material has been removed) - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader (4MB)
