The Australian National University
Research School of Biological Sciences
ANU COLLEGE OF SCIENCE
  
    

Information regarding the expression analysis


The main page
The log page
The results page
The clusters page
The clustering procedure
License & Conditions of Use
References

The main page

Image of the page (description below):

Data Area: Description
Select Database: Select the dataset to use for this analysis. Changing the selection reloads the page to reflect the types of experimental conditions represented in the chosen dataset.
View Expression Data for Identifiers: Paste a list of identifiers (one per line) into the text area to display their expression values in a graph. NOTE: Gene names or gi numbers will NOT work. The identifiers have to be the EXACT identifiers used in the data file (for example 245076_at for the Arabidopsis ATH121506 Affymetrix microarray chip). Upper- and lower-case characters are regarded as equivalent.
Find Clusters of Co-Expressed Genes: Here you can perform a data discovery and group all of the expression data into clusters of sequences with greater than average similarity in expression. Four parameters influence the clustering:
Parameter: Effect
Minimum cluster size: Specifies the minimum number of sequences a cluster must contain to be returned. This has no effect on the clustering procedure itself; it only removes clusters with fewer than the specified number of sequences from the output. Clusterings performed for smaller values (for example "2") contain ALL of the clusters that would be returned for higher values (i.e. "3" or "4").
ANOVA P-value: Analysis of Variance (ANOVA) can be used to remove data that does not change (and therefore presumably is not of interest), thereby reducing the computation time. The value entered here is the P-value for the one-sided ANOVA. A value of -1 disables the ANOVA filtering step; otherwise, permissible values are between 0 and 1. The smaller the value, the more stringent the filtering (i.e. the more certain you can be that all of the data used in the clustering actually showed a change under one condition or the other).
Foldchange cutoff: The foldchange cutoff can be used to remove sequences that the user would consider as not changing significantly (generally set to between 1.5 and 2). This also helps focus on those sequences that show the most drastic change in expression. NOTE: The foldchange is not calculated in relation to the wild-type, but rather compares all conditions with all others. If there is any combination of conditions in which the expression values differ by a factor greater than or equal to the foldchange cutoff, then this sequence passes the filtering step. (If any condition is "0" and any other condition is non-"0", then the sequence passes the filtering step.)
Correlation cutoff: The clustering procedure uses the linear (Pearson) correlation value of the expression values between two entries. Permissible values are between 0 and 1 (negative correlations are treated as specified by the radio buttons described below). The more stringent your cutoff (i.e. closer to 1), the smaller and more consistent the groups returned by the clustering procedure will be. Changing the value can greatly influence the clustering procedure. The "best" value will be highly dependent on the number of conditions in your dataset (the more the better) and the type of results you wish to receive (small clusters of sequences with highly similar expression = increase the value; larger clusters of sequences with similar trends in expression = decrease the value). NOTE: The program performing the clustering is limited to using 300 MB of RAM on this machine. If your values are too lax you will receive an "OutOfMemory Error". In that case try being more stringent to reduce the amount of memory used for the clustering to below 300 MB.
Negative correlation radio-buttons: These radio buttons let you specify how to deal with negative correlations in the clustering:
I) Use negative correlations: Use the negative correlation values calculated from the expression values AS-IS, without any change. This helps separate groups of sequences with inversely correlated expression in the clustering.
II) Invert negative correlations: In some cases you will want to group sequences with inversely correlated expression together into one group (for example, when searching for genes in a metabolic pathway producing metabolites with related but inverse effects). This option causes only the magnitude of the correlation, NOT the sign, to be taken into account for the clustering.
III) Discard negative correlations: Selecting this will disregard any correlation values below zero. This helps reduce the amount of RAM used for the clustering and generally does not negatively affect the results.
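The foldchange filter and the three negative-correlation options described above can be sketched in a few lines of Python. This is an illustration only, not the code used by the site: the function names are hypothetical, expression values are assumed to be non-negative and on a linear scale (one value per condition), and treating the correlation cutoff as "greater than or equal" is an assumption.

```python
def passes_foldchange(values, cutoff):
    """True if any two conditions differ by a factor >= cutoff.

    Assumes non-negative, linear-scale values, one per condition.
    Comparing the largest with the smallest value covers all pairs,
    because no other pair can have a larger ratio.
    """
    lo, hi = min(values), max(values)
    if lo == 0:
        # A zero next to any non-zero value passes; all-zero rows do not.
        return hi != 0
    return hi / lo >= cutoff


def apply_negative_mode(r, mode):
    """Transform a Pearson correlation r according to the radio-button choice."""
    if mode == "use":        # I) keep the value as-is
        return r
    if mode == "invert":     # II) magnitude only, sign ignored
        return abs(r)
    if mode == "discard":    # III) drop values below zero
        return r if r >= 0 else None
    raise ValueError(f"unknown mode: {mode}")


def keeps_link(r, cutoff, mode):
    """True if the (possibly transformed) correlation reaches the cutoff."""
    r = apply_negative_mode(r, mode)
    return r is not None and r >= cutoff
```

For example, `keeps_link(-0.8, 0.7, "invert")` is true, while the same pair is dropped under either of the other two options.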
Search for genes with expression values correlating to the graph below: This lets you find sequences whose expression correlates with a specific pattern you have in mind. Four options influence this approach:
Option: Effect
The graph: Use the graph to draw the type of expression pattern you are looking for. All values in the data file will be compared to the graph, and those with a linear correlation better than the specified cutoff (see below) will be returned.
By pressing shift while clicking, you can specify the amount of variance you are willing to accept for each condition. This allows you to further filter the genes whose expression values correlate with your drawn graph, keeping only those that have a specific range of values in certain conditions.
The checkboxes below the graph: The checkboxes let you select and de-select which of the conditions present in the data file you want to take into account when calculating the linear (Pearson) correlation values (for example, if you are looking for genes up-regulated under salt stress but your dataset has 50 other stress conditions in addition to salt stress, you might want to disregard the others).
Correlation coefficient: This specifies the minimum correlation coefficient for which you want results returned. The more conditions you have in your dataset (and take into account), the lower you can set this value without the results becoming irrelevant.
Negative correlation radio-buttons: These radio buttons let you specify how to deal with negative correlations:
I) Use negative correlations: Use the negative correlation values calculated from the expression values. Only sequences with POSITIVE linear correlation to the specified graph will be returned.
II) Invert negative correlations: In some cases you will want to view sequences with inversely correlated expression to the graph as well as those with positive correlation (for example, when you are uncertain whether a given effect will be due to up-regulation or repression of a gene). This option causes only the magnitude of the correlation, NOT the sign, to be taken into account when comparing sequences to the graph.
Filter by above specified variances (Checkbox): Provided you have drawn a set of variances into the above graph, this lets you further filter the data returned for the specified correlation cutoff, keeping only those entries that ALSO match the variance cutoffs. By default this option is NOT selected.
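The search against a drawn pattern can be pictured roughly as follows. This is a hedged sketch rather than the site's implementation: `pearson` and `find_matching` are hypothetical names, `selected` stands for the checkboxes, and the variance filter is interpreted here as a maximum allowed distance from the drawn value per condition, which is an assumption about what the page does.

```python
import math


def pearson(x, y, eps=1e-12):
    """Linear (Pearson) correlation; None if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx < eps or syy < eps:
        return None  # correlation is undefined for a flat profile
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def find_matching(profiles, template, selected, cutoff,
                  invert=False, tolerance=None):
    """Return identifiers whose profile correlates with the drawn template.

    profiles  : dict of identifier -> list of values, one per condition
    template  : drawn values, one per condition
    selected  : list of booleans (the checkboxes), one per condition
    tolerance : optional list; None entries mean "no limit" (assumed meaning)
    """
    idx = [i for i, use in enumerate(selected) if use]
    drawn = [template[i] for i in idx]
    hits = []
    for name, values in profiles.items():
        r = pearson([values[i] for i in idx], drawn)
        if r is None:
            continue
        if invert:
            r = abs(r)  # option II: ignore the sign
        if r < cutoff:
            continue
        if tolerance is not None and any(
            tolerance[i] is not None
            and abs(values[i] - template[i]) > tolerance[i]
            for i in idx
        ):
            continue  # outside the accepted range for some condition
        hits.append(name)
    return hits
```

With option I only positively correlated entries are returned; passing `invert=True` additionally returns entries that mirror the drawn pattern.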

The log page

Image of the page (description below):

This page is used to display the status of the process being worked on (i.e. clustering, subclustering, finding correlating genes or extracting the values for specific identifiers).
You can refresh this page by pressing the button. Should an error occur, a hopefully informative error message will be displayed in this page.

The results page

Image of the page (description below):
Data Area: Description
Selectable list of identifiers: Click on an entry in this list to see it displayed as a red line in the graph below. Only one entry at a time can be selected.
The "Draw relative" checkbox draws the expression values for the entries either as their absolute values (if deselected) or scaled between 0 and 1 for each entry (if selected), for easier comparison of trends in expression.
Text version of list: This provides a copy/paste-able text version of the list to the right. Changing the contents of this text area has no effect on this page itself, but editing the text (i.e. removing or adding entries) can be used to include or exclude entries prior to further sub-clustering of the data.
The "Identify Subclusters" button starts a clustering of all of the data in this text field. For the subclustering, all data from all conditions is taken into account. No data is disregarded, irrespective of which checkboxes were selected or de-selected in the main page.
Graph: The graph displays the average expression value for each entry in each condition. The coloring from blue to yellow reflects the position of the sequences in the list (blue=top, yellow=bottom). A red line highlights the currently selected entry. For the currently selected entry, the standard deviation of the replicate values for each condition is shown as a red vertical bar.
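As a rough illustration of what the graph plots, the per-condition average and standard deviation of the replicates, and the 0-to-1 scaling applied by "Draw relative", could be computed as below. The function names are hypothetical, and two details are assumptions not stated on this page: the sample standard deviation is used, and a completely flat entry is drawn at 0.

```python
import statistics


def condition_summary(replicates_per_condition):
    """Mean and standard deviation of the replicates for each condition.

    Expects a list of lists, e.g. [[c1r1, c1r2, c1r3], [c2r1, ...], ...],
    with at least two replicates per condition.
    """
    return [(statistics.mean(reps), statistics.stdev(reps))
            for reps in replicates_per_condition]


def scale_relative(values):
    """Scale one entry's values to the range 0..1 ("Draw relative")."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # flat entry: assumed to be drawn at 0
    return [(v - lo) / (hi - lo) for v in values]


means = [m for m, _ in condition_summary([[4.0, 4.0, 4.0], [1.0, 2.0, 3.0]])]
relative = scale_relative(means)
```

Scaling each entry separately is what makes trends comparable between a weakly and a strongly expressed sequence.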

The clusters page

Image of the page (description below):

The clusters page provides a list of the detected clusters from which one or more clusters can be selected. The sequence identifiers contained in the currently selected clusters are displayed in the text area to the right.
Pressing the "Show Expression for Cluster" button displays the expression values for all of the identifiers in the text area (--> see results page).

The clustering procedure

Many tools for analysis of expression data use either K-means or hierarchical clustering procedures to identify relevant or correlating subsets of sequences within the available data. These two approaches however have distinct drawbacks.
K-means clustering requires you to specify the number of clusters into which you want to partition the data. This is suboptimal, as you would ideally like the clustering procedure to automatically adjust the number of clusters it returns to best reflect the data that was provided.
Hierarchical clustering does not require you to specify a number of clusters, but it does not return any clusters as such. Hierarchical clustering approaches are very nice to identify how different sequences are related in their expression to one another and subsequent partitioning of the data can certainly be based on these relations, but the actual partitioning of the data still needs to be performed by the user.

Instead of using either of the above, we use a clustering procedure derived from the program CLANS that automatically partitions the data into clusters without the user having to specify any unknowns (such as the number of clusters present in the data).
Typical expression data looks a bit like the below:
probe_name	Cond1.1	Cond1.2	Cond1.3	Cond2.1	Cond2.2	Cond2.3	Cond3.1	Cond3.2	Cond3.3	Cond4.1	Cond4.2	Cond4.3
probe_1_at	4.244	4.420	4.597	3.765	3.670	3.481	4.992	4.968	4.677	4.801	4.738	5.191
probe_2_at	3.652	3.601	3.006	3.501	3.271	2.753	3.873	4.149	3.821	3.432	3.574	3.573
probe_3_at	7.938	7.981	7.928	9.385	8.286	7.338	7.834	7.967	7.997	8.057	8.100	7.951
probe_4_at	5.107	5.405	5.495	6.381	4.840	4.713	4.974	5.023	4.891	4.634	4.627	4.561
probe_5_at	2.144	2.144	2.144	2.144	2.144	2.266	2.144	2.144	2.144	2.144	2.144	2.144
probe_6_at	6.028	5.846	6.737	6.439	6.598	6.112	6.512	6.320	6.330	6.037	5.732	5.630
probe_7_at	2.559	2.288	2.144	2.144	2.151	2.144	2.144	2.144	2.144	2.144	2.147	2.144
probe_8_at	2.144	2.144	2.144	2.144	2.144	2.144	2.144	2.144	2.144	2.144	2.144	2.144
probe_9_at	2.184	2.184	2.264	2.184	2.184	2.184	2.184	2.184	2.184	2.184	2.155	2.184
etc.
From such data we can then calculate the correlation coefficients for all pairwise comparisons. A visualization of a large set of such pairwise relationships would resemble the picture below. Dots represent individual sequences or probe sets and the lines connecting them represent the pairwise similarities or correlation values. The darker a line, the more similar the connected dots are.

It is easy for us to see that there are clusters present, but automatically detecting them is a bit more tricky.
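The pairwise correlation step can be sketched as follows, using three rows from the example table above. This is a minimal illustration, not the CLANS code: the function names are hypothetical, and correlating the individual replicate columns (rather than per-condition averages) is an assumption. Note that a completely flat row such as probe_8_at has no defined correlation with anything, so it produces no links.

```python
import math
from itertools import combinations

# Three rows copied from the example table (12 replicate columns each).
DATA = {
    "probe_1_at": [4.244, 4.420, 4.597, 3.765, 3.670, 3.481,
                   4.992, 4.968, 4.677, 4.801, 4.738, 5.191],
    "probe_2_at": [3.652, 3.601, 3.006, 3.501, 3.271, 2.753,
                   3.873, 4.149, 3.821, 3.432, 3.574, 3.573],
    "probe_8_at": [2.144] * 12,
}


def pearson(x, y, eps=1e-12):
    """Linear (Pearson) correlation; None if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx < eps or syy < eps:
        return None  # undefined: one profile does not vary at all
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(data):
    """Correlation for every pair of entries; undefined pairs are left out."""
    links = {}
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if r is not None:
            links[(a, b)] = r
    return links


links = pairwise_correlations(DATA)
```

Each entry of `links` corresponds to one line in the picture; pairs left out of the dictionary are the "non-existent edges" mentioned below.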
Such data can also be represented as a linear arrangement of nodes with edges connecting them. In (A) all connections have the same weight and every node is connected to every other. To reflect the differences in similarity between the sequences and provide a more accurate representation of the data, we remove non-existent edges (i.e. missing or undefined data) and weight the remaining edges according to the similarity of the connected nodes (B) (for example: black = correlation of 0.9; grey = correlation of 0.7). Each node is then assigned its own cluster (C) (clusters A-J).

For each node we then determine the cluster with the maximum weight based on the connections to that node. In the above example:
Node 0:
ClusterB: weight 0.9
ClusterC: weight 0 (no connection)
ClusterD: weight 0 (no connection)
ClusterE: weight 0.7
ClusterF: weight 0.7
ClusterG: weight 0 (no connection)
ClusterH: weight 0 (no connection)
ClusterI: weight 0 (no connection)
ClusterJ: weight 0 (no connection)
--> cluster with greatest weight=ClusterB
Node 1: (not showing the "no connection" data)
ClusterA: weight 0.9
ClusterE: weight 0.7
ClusterF: weight 0.7
--> cluster with greatest weight=ClusterA
Node 2:
ClusterD: weight 0.9
--> cluster with greatest weight=ClusterD
etc.
For nodes 4-7 there are of course multiple cluster assignments of equal weight (D). However, we only retain the cluster assignment with the smallest identifier (in this case the cluster letter closest to "A"). To keep nodes from repeatedly swapping cluster assignments back and forth, we then still need to post-process the cluster assignments for those nodes that would be assigned a cluster letter further away from "A" than they currently have (E, darker circles).

For the post-processing we check whether any of the nodes linked to the node we are post-processing would be swapping cluster assignments with our node. If that is the case, we reduce the weight of the new cluster assignment for our node by the weight of that link.
For example:
Post processing Node0:
-->current assignment: clusterA
-->new assignment: clusterB, weight 0.9
checking Node1: -->part of clusterB, assigned clusterA -->is swapping!
reducing weight of assignment of node0 to clusterB by 0.9
checking Node2: -->part of cluster C -->skip
checking Node3: -->part of cluster D -->skip
checking Node4: -->part of cluster E -->skip
checking Node5: -->part of cluster F -->skip
checking Node6: -->part of cluster G -->skip
checking Node7: -->part of cluster H -->skip
checking Node8: -->part of cluster I -->skip
checking Node9: -->part of cluster J -->skip
...similar for nodes 2,4 and 8

This post-processing changes the weights assigned to the clusters. Node0 is now assigned clusterB with a weight of 0 (weight=0 assignments are disregarded). The new assignments for Nodes 2,4 and 8 will similarly be disregarded as their weights also drop to zero.
Prior to the next iteration the new cluster assignments are transferred. (F-H show the next iteration of the clustering procedure.)

Nodes 0 and 1 are both assigned to cluster E, as the two connections to Nodes 4 and 5 now link to the same cluster assignment. (2x 0.7 for cluster E is greater than 1x 0.9 for cluster A). NOTE: In the post-processing in (G) there is no swap in assignment as nodes 4 and 5 are not assigned to clusterA but remain assigned to clusterE. Therefore the assignment of nodes 0 and 1 to clusterE is not down-weighted.
This iterative process is repeated until no further changes in cluster assignments are observed (in this case after (H) the clustering remains stable). The final cluster assignments for our data are: Nodes (0,1,4,5,6,7) in one cluster, (2,3) in another and (8,9) in a third.
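The procedure walked through above can be condensed into a short Python sketch. It is a reading of the description on this page, not the CLANS source: the function names are hypothetical, cluster letters A-J are represented by the numbers 0-9, and a move whose weight is reduced but stays above zero is assumed to go ahead. The links among nodes 4-7 and between nodes 8 and 9 are also assumptions (the figure is not reproduced here); only the links of nodes 0, 1 and 2 are taken from the weights listed in the text.

```python
from collections import defaultdict


def cluster_nodes(n_nodes, edges, max_iter=100):
    """Iteratively relabel nodes; returns one cluster id per node."""
    adj = defaultdict(dict)
    for a, b, w in edges:
        adj[a][b] = w
        adj[b][a] = w

    assign = list(range(n_nodes))  # every node starts in its own cluster
    for _ in range(max_iter):
        # Step 1: sum link weights per neighbouring cluster and propose the
        # heaviest one; ties go to the smallest cluster id.
        proposal = []
        for node in range(n_nodes):
            weights = defaultdict(float)
            for nb, w in adj[node].items():
                weights[assign[nb]] += w
            if weights:
                best = min(weights, key=lambda c: (-round(weights[c], 9), c))
                proposal.append((best, weights[best]))
            else:
                proposal.append((assign[node], 0.0))  # unlinked node

        # Step 2: post-process moves to a larger id. Each linked node that
        # would swap places with us lowers the weight of our move by the
        # weight of that link; moves left with weight 0 are disregarded.
        new_assign = []
        for node in range(n_nodes):
            new_c, new_w = proposal[node]
            cur_c = assign[node]
            if new_c > cur_c:
                for nb, w in adj[node].items():
                    if assign[nb] == new_c and proposal[nb][0] == cur_c:
                        new_w -= w
            new_assign.append(new_c if new_w > 1e-9 else cur_c)

        if new_assign == assign:
            break  # stable: no assignment changed
        assign = new_assign  # transfer assignments before the next iteration
    return assign


def group(assign, min_size=1):
    """Collect nodes per cluster, dropping clusters below min_size."""
    groups = defaultdict(list)
    for node, c in enumerate(assign):
        groups[c].append(node)
    return sorted(g for g in groups.values() if len(g) >= min_size)


# Links of nodes 0-2 follow the text; the remaining links are assumed.
EDGES = [(0, 1, 0.9), (0, 4, 0.7), (0, 5, 0.7), (1, 4, 0.7), (1, 5, 0.7),
         (2, 3, 0.9), (4, 5, 0.9), (4, 6, 0.9), (4, 7, 0.9), (5, 6, 0.9),
         (5, 7, 0.9), (6, 7, 0.9), (8, 9, 0.9)]

print(group(cluster_nodes(10, EDGES)))
# prints [[0, 1, 4, 5, 6, 7], [2, 3], [8, 9]]
```

With this assumed graph the sketch settles on the same three clusters as the walkthrough. The `min_size` argument mirrors the "Minimum cluster size" parameter: it only filters the output and does not change the clustering itself.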

Coming back to the clustering image shown at the top, this approach automatically identifies 25 different clusters containing more than 2 sequences (highlighted in different colors). Clusters containing two or fewer sequences were not colored.


License & Conditions of Use


These programs and scripts rely on both the SVGKit and MochiKit libraries (http://svgkit.sourceforge.net and http://mochikit.com) (many thanks to the developers for these great tools!). Both these libraries are released under the MIT license. As I find the "Information wants to be free" mentality highly appealing, these additional programs and scripts are also released under the MIT license (see below).
NOTE: ONLY the scripts used to create and display the pages are released under the MIT license. The Java programs used to calculate the correlation coefficients and various statistics have previously been released under the GNU public license and are part of the CLANS package.
Should these utilities be used as a resource leading to a publication, please refer to this site and the corresponding publication in your references.
The MIT License

Copyright (c) <2009, Tancred Frickey>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

References:


The example data I use on this page was not generated by me, but rather taken from a variety of sources:
Illumina_Data_GSE10782:
Peter A. C. Hoen, Yavuz Ariyurek, Helene H. Thygesen, Erno Vreugdenhil, Rolf H. A. M. Vossen, Rene X. de Menezes, Judith M. Boer, Gert-Jan B. van Ommen and Johan T. den Dunnen. Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Research 2008 36(21):e141; doi:10.1093/nar/gkn705
Medicago_expression_atlas:
Vagner A. Benedito, Ivone Torres-Jerez, Jeremy D. Murray, Andry Andriankaja, Stacy Allen, Klementina Kakar, Maren Wandrey, Jerome Verdier, Helene Zuber, Thomas Ott, Sandra Moreau, Andreas Niebel, Tancred Frickey, Georg Weiller, Ji He, Xinbin Dai, Patrick X. Zhao, Yuhong Tang and Michael K. Udvardi. A gene expression atlas of the model legume Medicago truncatula. The Plant Journal 2008 Volume 55 Issue 3, Pages 504-513.
Arabidopsis_expression_atlas:
Schmid, M., Davison, T. S., Henz, S. R., Pape, U. J., Demar, M., Vingron, M., Scholkopf, B., Weigel, D., and Lohmann, J. (2005) A gene expression map of Arabidopsis development. Nature Genetics 37, 501-506.
Drosophila_Hox_overexpression:
Hueber SD, Bezdan D, Henz SR, Blank M, Wu H, Lohmann I. Comparative analysis of Hox downstream genes in Drosophila. Development. 2007 Jan;134(2):381-92.

Parts of the scripts and programs used herein were derived from the utilities present in CLANS:
Analysing microarray data using CLANS:
Frickey T, Weiller G. (2007) Bioinformatics. May 1;23(9):1170-1.
CLANS: A Java application for visualizing protein families based on pairwise similarity.
Frickey T, Lupas A (2004) Bioinformatics, 20(18): 3702-4.

This work was developed as part of the ARC Centre for Integrative Legume Research (CILR)