Research School of Biological Sciences
ANU COLLEGE OF SCIENCE
Information regarding the expression analysis

The main page | The log page | The results page | The clusters page | The clustering procedure | License & Conditions of Use | References

The main page
Image of the page (description below):

The log page
Image of the page (description below):
This page is used to display the status of the process being worked on (i.e. clustering, subclustering, finding correlating genes or extracting the values for specific identifiers).
You can refresh this page by pressing the button. Should an error occur, a hopefully informative error message will be displayed on this page.
The results page
Image of the page (description below):

The clusters page
Image of the page (description below):
The clusters page provides a list of the detected clusters from which one or more clusters can be selected. The sequence identifiers contained in the currently selected clusters are displayed in the text area to the right.
Pressing the "Show Expression for Cluster" button displays the expression values for all of the identifiers in the text area (--> see results page).
The clustering procedure

Many tools for the analysis of expression data use either K-means or hierarchical clustering to identify relevant or correlating subsets of sequences within the available data. Both approaches have distinct drawbacks. K-means clustering requires you to specify in advance the number of clusters into which the data is to be partitioned. This is suboptimal, as ideally the clustering procedure would itself adjust the number of clusters it returns to best reflect the data provided. Hierarchical clustering does not require you to specify a number of clusters, but it does not return any clusters as such. Hierarchical approaches are well suited to showing how different sequences are related to one another in their expression, and a subsequent partitioning of the data can certainly be based on these relations, but the actual partitioning still has to be performed by the user. Instead of using either of the above, we use a clustering procedure derived from the program CLANS that automatically partitions the data into clusters without the user having to specify any unknowns (such as the number of clusters present in the data).
Typical expression data looks a bit like the below:

probe_name  Cond1.1  Cond1.2  Cond1.3  Cond2.1  Cond2.2  Cond2.3  Cond3.1  Cond3.2  Cond3.3  Cond4.1  Cond4.2  Cond4.3
probe_1_at  4.244    4.420    4.597    3.765    3.670    3.481    4.992    4.968    4.677    4.801    4.738    5.191
probe_2_at  3.652    3.601    3.006    3.501    3.271    2.753    3.873    4.149    3.821    3.432    3.574    3.573
probe_3_at  7.938    7.981    7.928    9.385    8.286    7.338    7.834    7.967    7.997    8.057    8.100    7.951
probe_4_at  5.107    5.405    5.495    6.381    4.840    4.713    4.974    5.023    4.891    4.634    4.627    4.561
probe_5_at  2.144    2.144    2.144    2.144    2.144    2.266    2.144    2.144    2.144    2.144    2.144    2.144
probe_6_at  6.028    5.846    6.737    6.439    6.598    6.112    6.512    6.320    6.330    6.037    5.732    5.630
probe_7_at  2.559    2.288    2.144    2.144    2.151    2.144    2.144    2.144    2.144    2.144    2.147    2.144
probe_8_at  2.144    2.144    2.144    2.144    2.144    2.144    2.144    2.144    2.144    2.144    2.144    2.144
probe_9_at  2.184    2.184    2.264    2.184    2.184    2.184    2.184    2.184    2.184    2.184    2.155    2.184
etc.

From such data we can then calculate the correlation coefficients for all pairwise comparisons. A visualization of a large set of such pairwise relationships would resemble the picture below. Dots represent individual sequences or probe sets and the lines connecting them represent the pairwise similarities or correlation values. The darker a line, the more similar the connected dots are.
It is easy for us to see that there are clusters present, but automatically detecting them is a bit more tricky.
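As a rough illustration of the pairwise correlation step, the sketch below computes a correlation coefficient for every pair of probes. It assumes the Pearson coefficient (this page only says "correlation coefficients", and the actual calculation is done by the Java programs from the CLANS package), and the names `pearson` and `pairwise` are made up for this example. Three rows of the example data are used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles, or None if undefined."""
    # A completely flat profile has zero variance, so no correlation is defined.
    if max(x) == min(x) or max(y) == min(y):
        return None
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def pairwise(profiles):
    """Correlation for every unordered pair of probe names."""
    names = sorted(profiles)
    return {(a, b): pearson(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

# Three rows copied from the example table.
profiles = {
    "probe_1_at": [4.244, 4.420, 4.597, 3.765, 3.670, 3.481,
                   4.992, 4.968, 4.677, 4.801, 4.738, 5.191],
    "probe_2_at": [3.652, 3.601, 3.006, 3.501, 3.271, 2.753,
                   3.873, 4.149, 3.821, 3.432, 3.574, 3.573],
    "probe_8_at": [2.144] * 12,
}

for pair, r in pairwise(profiles).items():
    print(pair, "undefined" if r is None else round(r, 3))
```

Pairs involving probe_8_at come back as undefined because its profile never changes; such pairs simply have no usable similarity value.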
Such data can also be represented as a linear arrangement of nodes with edges connecting them. In (A) all connections have the same weight and every node is connected to every other. To reflect the differences in similarity between the sequences and provide a more accurate representation of the data, we remove non-existent edges (i.e. missing or undefined data) and weight the remaining edges according to the similarity of the connected nodes (B) (for example: black = correlation of 0.9; grey = correlation of 0.7). Each node is then assigned its own cluster (C) (clusters A-J).
For each node we then determine the cluster with the maximum weight based on the connections to that node. In the above example:
Node 0:
ClusterB: weight 0.9
ClusterC: weight 0 (no connection)
ClusterD: weight 0 (no connection)
ClusterE: weight 0.7
ClusterF: weight 0.7
ClusterG: weight 0 (no connection)
ClusterH: weight 0 (no connection)
ClusterI: weight 0 (no connection)
ClusterJ: weight 0 (no connection)
--> cluster with greatest weight=ClusterB
Node 1: (not showing the "no connection" data)
ClusterA: weight 0.9
ClusterE: weight 0.7
ClusterF: weight 0.7
--> cluster with greatest weight=ClusterA
Node 2:
ClusterD: weight 0.9
--> cluster with greatest weight=ClusterD
etc.
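The per-node bookkeeping listed above can be sketched in a few lines. The edge list below contains only the connections named in the listing for nodes 0, 1 and 2; the function names are hypothetical, and adding up the weights of several links into the same cluster is an assumption that only matters once clusters hold more than one node.

```python
from collections import defaultdict

# Only the links named in the listing above: (node, node) -> weight.
edges = {
    (0, 1): 0.9, (0, 4): 0.7, (0, 5): 0.7,
    (1, 4): 0.7, (1, 5): 0.7,
    (2, 3): 0.9,
}

# Every node starts in its own cluster: node 0 -> "A", ..., node 9 -> "J".
cluster_of = {node: "ABCDEFGHIJ"[node] for node in range(10)}

def cluster_weights(node, edges, cluster_of):
    """Add up the node's link weights, grouped by the neighbour's cluster."""
    weights = defaultdict(float)
    for (a, b), w in edges.items():
        if a == node:
            weights[cluster_of[b]] += w
        elif b == node:
            weights[cluster_of[a]] += w
    return dict(weights)  # clusters without a connection are simply absent

def heaviest_cluster(weights):
    """Cluster with the greatest weight; ties go to the letter closest to 'A'."""
    return min(weights, key=lambda c: (-weights[c], c))

for node in (0, 1, 2):
    w = cluster_weights(node, edges, cluster_of)
    print(f"Node {node}: {w} --> {heaviest_cluster(w)}")
```

For nodes 0, 1 and 2 this selects clusters B, A and D respectively, as in the listing.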
For nodes 4-7 there are of course multiple cluster assignments of equal weight (D). However, we only retain the cluster assignment with the smallest identifier (in this case the cluster letter closest to "A"). To keep nodes from repeatedly swapping cluster assignments back and forth, we then still need to post-process the cluster assignments for those nodes that would be assigned a cluster letter further away from "A" than they currently have (E, darker circles).
For the post-processing we check whether any of the nodes linked to the node we are post-processing would be swapping cluster assignments with our node. If that is the case, we reduce the weight of the new cluster assignment for our node by the weight of that link,
i.e.
Post processing Node0:
-->current assignment: clusterA
-->new assignment: clusterB, weight 0.9
checking Node1: -->part of clusterB, assigned clusterA -->is swapping!
reducing weight of assignment of node0 to clusterB by 0.9
checking Node2: -->part of cluster C -->skip
checking Node3: -->part of cluster D -->skip
checking Node4: -->part of cluster E -->skip
checking Node5: -->part of cluster F -->skip
checking Node6: -->part of cluster G -->skip
checking Node7: -->part of cluster H -->skip
checking Node8: -->part of cluster I -->skip
checking Node9: -->part of cluster J -->skip
...similar for nodes 2,4 and 8
This post-processing changes the weights assigned to the clusters. Node0 is now assigned clusterB with a weight of 0 (weight=0 assignments are disregarded).
The new assignments for Nodes 2,4 and 8 will similarly be disregarded as their weights also drop to zero.
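The swap check traced above for Node0 might look roughly like the following. This is only a sketch of the rule as described on this page, not the CLANS code: the function name `damp_swaps` and the data layout are invented for the example, and a node without an entry in `proposed` is assumed to keep its current cluster.

```python
def damp_swaps(node, current, proposed, edges):
    """Weight of the node's proposed assignment after subtracting swapping links.

    current:  node -> cluster the node holds now
    proposed: node -> (cluster, weight) picked in this iteration
    edges:    (node, node) -> link weight
    """
    target, weight = proposed[node]
    # Only moves to a letter further away from "A" are post-processed.
    if target <= current[node]:
        return weight
    for (a, b), w in edges.items():
        if node not in (a, b):
            continue
        other = b if a == node else a
        # Assumption: a neighbour without a proposal keeps its current cluster.
        other_target = proposed.get(other, (current[other], 0.0))[0]
        # Swap: the neighbour sits in our target cluster and wants our current one.
        if current[other] == target and other_target == current[node]:
            weight -= w
    return weight

# The situation traced above, restricted to Node0 and its three neighbours.
current = {0: "A", 1: "B", 4: "E", 5: "F"}
proposed = {0: ("B", 0.9), 1: ("A", 0.9)}
edges = {(0, 1): 0.9, (0, 4): 0.7, (0, 5): 0.7}

print(damp_swaps(0, current, proposed, edges))  # 0.0, so the move to clusterB is disregarded
print(damp_swaps(1, current, proposed, edges))  # 0.9, Node1 moves towards "A" and is not post-processed
```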
Prior to the next iteration the new cluster assignments are transferred. (F-H show the next iteration of the clustering procedure)
Nodes 0 and 1 are both assigned to cluster E, as the two connections to Nodes 4 and 5 now link to the same cluster assignment. (2x 0.7 for cluster E is greater than 1x 0.9 for cluster A).
NOTE: In the post-processing in (G) there is no swap in assignment as nodes 4 and 5 are not assigned to clusterA but remain assigned to clusterE. Therefore the assignment of nodes 0 and 1 to clusterE is not down-weighted.
This iterative process is repeated until no further changes in cluster assignments are observed (in this case after (H) the clustering remains stable). The final cluster assignments for our data are: Nodes (0,1,4,5,6,7) in one cluster, (2,3) in another and (8,9) in a third.
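Put together, the iteration can be sketched as below. This is a minimal illustration, not the CLANS implementation. The links for nodes 0-3 follow the walk-through, whereas the links among nodes 4-7 and between nodes 8 and 9 are assumptions chosen to be consistent with the described outcome, since the figure is not reproduced in this text. The function name `cluster`, the `max_iter` cap and the small tolerance for "weight zero" are likewise made up for the example.

```python
from collections import defaultdict

def cluster(edges, n_nodes, max_iter=100):
    """Iterate cluster assignments until stable; returns sorted lists of nodes."""
    neighbours = defaultdict(dict)
    for (a, b), w in edges.items():
        neighbours[a][b] = w
        neighbours[b][a] = w
    # Every node starts in its own cluster: "A", "B", "C", ...
    current = {n: chr(ord("A") + n) for n in range(n_nodes)}
    for _ in range(max_iter):
        # Step 1: each node proposes its heaviest cluster (ties: letter nearest "A").
        proposed = {}
        for n in range(n_nodes):
            weights = defaultdict(float)
            for m, w in neighbours[n].items():
                weights[current[m]] += w
            if weights:
                best = min(weights, key=lambda c: (-weights[c], c))
                proposed[n] = (best, weights[best])
            else:
                proposed[n] = (current[n], 0.0)  # unconnected node stays put
        # Step 2: post-process moves away from "A" by subtracting swapping links.
        updated = {}
        for n in range(n_nodes):
            target, weight = proposed[n]
            if target > current[n]:
                for m, w in neighbours[n].items():
                    if current[m] == target and proposed[m][0] == current[n]:
                        weight -= w
            # Assignments whose weight dropped to zero are disregarded.
            updated[n] = target if weight > 1e-9 else current[n]
        if updated == current:
            break
        current = updated  # transfer all new assignments at once
    groups = defaultdict(list)
    for n, c in current.items():
        groups[c].append(n)
    return sorted(groups.values())

EDGES = {
    # Links taken from the walk-through (nodes 0-3).
    (0, 1): 0.9, (0, 4): 0.7, (0, 5): 0.7, (1, 4): 0.7, (1, 5): 0.7, (2, 3): 0.9,
    # ASSUMED links: inferred from the described result, not read off the figure.
    (4, 5): 0.9, (4, 6): 0.9, (4, 7): 0.9, (5, 6): 0.9, (5, 7): 0.9, (6, 7): 0.9,
    (8, 9): 0.9,
}

print(cluster(EDGES, 10))  # [[0, 1, 4, 5, 6, 7], [2, 3], [8, 9]]
```

With these edges the first pass leaves nodes 0, 2, 4 and 8 where they are (their proposed moves are swaps), the second pass moves nodes 0 and 1 into clusterE, and the third pass changes nothing.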
Coming back to the clustering image shown at the top, this approach automatically identifies 25 different clusters containing more than 2 sequences (highlighted in different colors). Clusters containing two or fewer sequences were not colored.
License & Conditions of Use

These programs and scripts rely on both the SVGKit and MochiKit libraries (http://svgkit.sourceforge.net and http://mochikit.com) (many thanks to the developers for these great tools!). Both these libraries are released under the MIT license. As I find the "Information wants to be free" mentality highly appealing, these additional programs and scripts are also released under the MIT license (see below).

NOTE: ONLY the scripts used to create and display the pages are released under the MIT license. The Java programs used to calculate the correlation coefficients and various statistics have previously been released under the GNU public license and are part of the CLANS package.

Should these utilities be used as a resource leading to a publication, please refer to this site and the corresponding publication in your references.

The MIT License

Copyright (c) <2009, Tancred Frickey>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
References

The example data I use on this page was not generated by myself, but rather taken from a variety of sources:

Illumina_Data_GSE10782: Peter A. C. Hoen, Yavuz Ariyurek, Helene H. Thygesen, Erno Vreugdenhil, Rolf H. A. M. Vossen, Rene X. de Menezes, Judith M. Boer, Gert-Jan B. van Ommen and Johan T. den Dunnen. Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Research 2008 36(21):e141; doi:10.1093/nar/gkn705

Medicago_expression_atlas: Vagner A. Benedito, Ivone Torres-Jerez, Jeremy D. Murray, Andry Andriankaja, Stacy Allen, Klementina Kakar, Maren Wandrey, Jerome Verdier, Helene Zuber, Thomas Ott, Sandra Moreau, Andreas Niebel, Tancred Frickey, Georg Weiller, Ji He, Xinbin Dai, Patrick X. Zhao, Yuhong Tang and Michael K. Udvardi. A gene expression atlas of the model legume Medicago truncatula. The Plant Journal 2008 Volume 55 Issue 3, Pages 504 - 513.

Arabidopsis_expression_atlas: Schmid, M., Davison, T. S., Henz, S. R., Pape, U. J., Demar, M., Vingron, M., Scholkopf, B., Weigel, D., and Lohmann, J. (2005) A gene expression map of Arabidopsis development. Nature Genetics 37, 501-506.

Drosophila_Hox_overexpression: Hueber SD, Bezdan D, Henz SR, Blank M, Wu H, Lohmann I. Comparative analysis of Hox downstream genes in Drosophila. Development. 2007 Jan;134(2):381-92.

Parts of the scripts and programs used herein were derived from the utilities present in CLANS:

Analysing microarray data using CLANS: Frickey T, Weiller G. (2007) Bioinformatics. May 1;23(9):1170-1.

CLANS: A Java application for visualizing protein families based on pairwise similarity. Frickey T, Lupas A (2004) Bioinformatics, 20(18): 3702-4.
This work was developed as part of the ARC Centre for Integrative Legume Research (CILR)
Page last updated: 7 June 2007
Please direct all enquiries to: RSBS Webmaster
Page Authorised by: Director, RSBS
The Australian National University, CRICOS Provider Number 00120C