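Abstract: | In recent years, cluster analysis has been widely applied to microarray data. Its aim is to group genes or tissue samples with similar expression patterns, in the hope of discovering, or providing useful information about, groups of genes related to metabolic pathways or cellular regulatory structures. However, no single clustering algorithm gives the best results on all datasets, and even on the same data, different parameter settings affect the clustering result. For example, the choice of distance measure is a key factor in whether clustering succeeds, and because cluster analysis is an unsupervised learning method, choosing the correct number of clusters is also one of the important factors determining the clustering outcome. In addition, different clustering algorithms may produce different clusterings on the same data. These problems arise mainly because cluster analysis is unsupervised: the clusters are induced by the machine, and there is no ground-truth answer. Most of the current literature discusses how to optimize and tune clustering algorithms to obtain clusters that carry biological annotation information (e.g., D’haeseleer et al., 2000; Brazma et al., 2002; Steuer et al., 2002). This study adopts the concept of a cluster ensemble to reduce the effect of different parameter settings on the clustering. In this project, we will propose a new cluster ensemble algorithm that integrates the clustering results obtained under different parameter settings, improving the quality and robustness of the clustering and avoiding trial and error over parameters. The integration weights are determined mainly by the quality and diversity of each clustering in the ensemble; a similarity matrix is computed from them, and a hierarchical clustering algorithm is then used to obtain the final clustering. We will also investigate whether different parameter settings affect this cluster ensemble, using several cluster validation indices for the evaluation. Each step of the method will be compared with other methods through computer simulation and on several public microarray datasets.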
Several clustering algorithms have been applied to the analysis of gene expression data, each using different distance/similarity measures and objective functions. The primary goal is to cluster together genes or tissues that exhibit similar expression patterns. Similar expression patterns might provide insight for the discovery of novel classes associated with transcriptional and biological processes. However, no single clustering algorithm performs best on all datasets. Applying different methods, or the same method with different parameter choices, to the same data can lead to varying clustering results. Choosing a proper clustering algorithm usually requires expertise and insight, and this choice is crucial for the success of clustering co-expression data. In addition, most clustering algorithms depend heavily on the distance measures (or similarity measures) that quantify the degree of association between expression profiles, and it is widely recognized that the choice of distance measure may be as crucial as the choice of the clustering algorithm itself. The choice of the number of clusters is another key factor in obtaining successful clustering results. Different clustering algorithms and distance measures are therefore likely to lead to different clusterings, even when based on the same expression data. The difficulty comes from the fact that no ground truth is available for validating the results. Many publications focus on the optimization and justification of the clustered biological processes and the clustering algorithms (e.g., D’haeseleer et al., 2000; Brazma et al., 2002; Steuer et al., 2002). Instead of running the risk of choosing an unstable clustering algorithm or distance measure, this study employs a cluster ensemble to reduce the influence of these choices on the clustering results.
In this study, we plan to propose a variant of the generic ensemble approach in which the number of clusters is chosen randomly for each ensemble member. In addition, a general cluster ensemble framework for gene expression analysis will be proposed to generate high-quality and robust clustering results. First, we extract the information from the clustering result of each ensemble member in the form of a distance/similarity matrix. These matrices are then combined according to the quality and diversity of the clustering solutions. Finally, a hierarchical clustering algorithm is applied to the combined matrix to generate the final clustering result. In this empirical study, we will also investigate the effect of distance measures when using cluster ensembles. The performance of the proposed approach will be evaluated using artificial datasets and several public gene expression datasets. The motivation for developing such cluster ensembles is to improve the quality and robustness of clustering results, with the capability of detecting novel cluster structures in gene expression data. |
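The ensemble procedure outlined in the abstract (a random number of clusters per member, a weighted similarity matrix, then hierarchical clustering) can be illustrated with a minimal, dependency-free Python sketch. It is not the proposed algorithm itself. Several choices below are assumptions made only for illustration: k-means with squared Euclidean distance as the base clusterer, uniform member weights standing in for the quality/diversity weighting (which the abstract does not specify), average linkage for the final step, and all function and parameter names.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance (an assumed choice of distance measure)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (assumed base clusterer); returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering with average linkage on a distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    labels = [0] * len(dist)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def cluster_ensemble(points, n_members=20, k_range=(2, 5), n_final=2,
                     weights=None, seed=0):
    """Combine ensemble members into a similarity matrix and cut it hierarchically.

    `weights` is a hypothetical hook for quality/diversity-based weights;
    uniform weights are used when it is omitted.
    """
    rng = random.Random(seed)
    n = len(points)
    weights = weights or [1.0 / n_members] * n_members
    sim = [[0.0] * n for _ in range(n)]
    for w in weights:
        k = rng.randint(*k_range)  # random number of clusters for this member
        labels = kmeans(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += w  # weighted co-association of points i and j
    dist = [[1.0 - s for s in row] for row in sim]
    return average_linkage(dist, n_final)
```

Calling `cluster_ensemble(points, n_final=2)` on a list of expression profiles (each a list of floats) returns one final cluster label per profile. The upper bound of `k_range` must not exceed the number of profiles. To experiment with quality- or diversity-based weighting, pass one weight per member through `weights`.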