CLANS (CLuster ANalysis of Sequences)
Originally developed at the Max-Planck-Institute for Developmental Biology Tuebingen
V1: Tancred Frickey, Andrei Lupas (MPI Tuebingen)
V2: Tancred Frickey, Georg Weiller
Bioinformatics, Genomic Interaction Group
Research School of Biological Sciences
Australian National University, Canberra
Original version:
Frickey T., Lupas A.N. (2004) CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics 20:3702-3704
Extended version:
Frickey T., Weiller G. (2007) Analyzing microarray data using CLANS. Bioinformatics 23:1170-1171
CLANS can be downloaded HERE.
Help is available HERE.
A tutorial is available HERE.
The files used in the tutorial are available HERE.
An outline of the "Network-clustering" approach can be found HERE.
Frequently, homology is used as a reason to transfer knowledge about function or structure from known to unknown proteins. Although phylogenies are the method of choice when attempting to determine homology, the most frequently used marker is pairwise sequence similarity. Similarity search programs, such as BLAST or PSIBLAST (Altschul et al. 1997), can efficiently work with enormous datasets while phylogenetic inference and the prerequisite sequence alignments rapidly reach a point where they become unsuitable due to prohibitive calculation costs and loss of resolution. On the other hand, pairwise similarity searches are plagued by false positive matches and problems arising from amino acid composition bias causing, in many cases, the best BLAST hits not to be the closest sequence relatives (Koski & Golding 2001).
Aiming for the best of both worlds, we have implemented a version of the Fruchterman-Reingold graph-layout algorithm (Fruchterman & Reingold 1991). The program takes unaligned FASTA format sequences as input, performs all-against-all BLAST searches and displays the pairwise similarities in either 2D or 3D graphs. Unlike phylogenetic inference methods, this approach uses unaligned sequences and works better the more sequences are provided, because a larger number of pairwise similarities better averages out the chance hits that plague standard BLAST comparisons.
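To make the layout idea concrete, here is a minimal 2D sketch of a Fruchterman-Reingold style force-directed layout in Python. It is not the CLANS implementation (CLANS is written in Java and uses its own parameters): the function name, the edge weights, the cooling schedule and the constants are all assumptions chosen for illustration.

```python
import math
import random

def fruchterman_reingold(n, edges, iterations=200, seed=0):
    """Return 2D positions for n vertices.

    edges maps (i, j) index pairs to a positive attraction weight
    (a hypothetical weighting; CLANS derives its own from BLAST P-values).
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    k = math.sqrt(4.0 / n)  # ideal vertex spacing for a 2 x 2 area
    temp = 0.2              # cap on how far a vertex may move per step
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n)]
        # Every pair of vertices repels with force k^2 / distance.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = k * k / dist
                disp[i][0] += dx / dist * push
                disp[i][1] += dy / dist * push
                disp[j][0] -= dx / dist * push
                disp[j][1] -= dy / dist * push
        # Connected vertices attract with force weight * distance^2 / k.
        for (i, j), weight in edges.items():
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = weight * dist * dist / k
            disp[i][0] -= dx / dist * pull
            disp[i][1] -= dy / dist * pull
            disp[j][0] += dx / dist * pull
            disp[j][1] += dy / dist * pull
        # Move each vertex along its net force, limited by the temperature.
        for i in range(n):
            length = math.hypot(disp[i][0], disp[i][1]) or 1e-9
            step = min(length, temp)
            pos[i][0] += disp[i][0] / length * step
            pos[i][1] += disp[i][1] / length * step
        temp *= 0.98  # cool down so the layout settles
    return pos
```

With a single edge between two vertices, repulsion and attraction balance at a distance close to k; in larger graphs, groups of strongly connected vertices are pulled together while unconnected vertices drift apart.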
The first version of CLANS was developed in 2004. Since then, it has been extended with numerous features (and a few bells and whistles) in response to user feedback and wish-lists. The biggest change, however, was the added ability of CLANS to analyze microarray data.
The problem of finding groups of co-regulated genes across a number of microarray experiments is quite similar to the problem of finding groups of homologous proteins in a large dataset. In both cases we have huge amounts of data and are looking for those few genes that show some kind of significant similarity. Putative co-regulation as well as putative homology are generally inferred from similarities in the feature set of a gene. In the first case the feature set consists of the expression levels across the experiments, in the second case of the amino-acid sequences of the proteins. A certain similarity is also apparent when you look at the graphs generated for either protein sequence data (right, top graph) or microarray data (right, bottom graph).
For protein data, the knowledge of "homology" generally helps in formulating a hypothesis because, most likely, some functional annotation will be available for at least some of the close or more distant homologs. For microarray data, the knowledge of apparent co-expression helps a lot less. This is due to two problems. The first is the huge number of genes present on the microarray chip combined with the generally low number of experiments and/or replicates that are performed: the probability of two genes showing co-expression, even though they are not causally related in any way, is relatively high. The second is that knowing that a group of genes is co-expressed does not help without having at least an inkling of what some of them do. Therefore we need to include additional data such as functional annotation, cellular localization or pathway information.
Overlaying such annotation onto a cluster map greatly enhances its value. The map then not only tells you which genes appear co-expressed (i.e. might be causally related), but also shows which genes are thought or known to be involved in the same pathway and what their function might be.
Although it is quite possible to use other programs to find plausible "clusters" of co-expressed genes and then extract the annotation data for these genes from another source, the interactivity of CLANS is a great help. You can vary the clustering cutoffs at will and observe how the various groups associate or dissociate, what the annotations for the genes are, etc., and thereby find the most relevant clustering cutoffs for the question YOU want to answer. This matters when sifting through large amounts of data, as no static clustering is likely to give the best results in all cases. Sometimes more lenient settings may be beneficial, other times more stringent cutoffs may produce better results.
Example: Three groups of sequences are selected and the corresponding expression plots are drawn (each of the datapoints in the small, colored graphs represents one experiment for one gene; vertical bars represent the standard deviation across the replicates; the datapoints are connected by lines to better show "trends" in the change of expression). Two groups are highlighted by black circles while the third is highlighted by red dots. The known functional annotation for the sequences in the third group is shown on the right (I had to shift the annotation a bit so that the sequence names were no longer visible, as this data has not yet been made publicly available). It is possible to search for all genes involved in a specific pathway (by selecting the pathway/bin of interest and pressing the "show bins in clans" button) or to look up the pathways/bins in which a given set of sequences (for example group 3) is involved (select a group of sequences and press the "show clans in bins" button). For a more detailed explanation read through the tutorial available HERE.
Running the program:
To run CLANS you need to have Java 1.5 or later installed (Java can be downloaded HERE). For full functionality you will also need the NCBI BLAST, PSI-BLAST and formatdb executables (NCBI). For command line parameters and basic help please refer to the README or tutorial file.
Examples:
Graphs calculated using FASTA format sequences as input:
AAA+ ATPases
Graph layout for 5101 sequences identified as putative AAA-ATPases by PSIBLAST and Hidden Markov Model (HMM) searches. The set consists largely of AAA+-ATPases (the superfamily that includes the AAA-ATPases). ABC-transporters, a known outgroup to AAA+-ATPases, can be found as a separate cluster (bottom-left). BLAST hits are displayed using a color gradient from red (good) to pale blue (less good). Edges with P-values worse than 1e-10 are not shown.
Outer membrane proteins
Graph layout of 14092 sequences that were the result of recursive PSIBLAST searches using outer membrane proteins as seeds. Many large clusters are visible as well as many low confidence hits (dots with no connections). Analysis in progress.
Other Data
Graphs calculated from precomputed similarities:
Example analysis of microarray data (approximately 75 experiments). Three groups are selected. The functional annotation for the sequences in the blue group shows that they are all thought to be involved in bin 28.1: "DNA synthesis/chromatin structure". The purple group is involved in bin 1.01.02.02: "Photosystem I reaction center" (data not shown). The red group contains a variety of annotations, many of which are proteins of "unknown" function.
Graph layout of amino acid similarities according to the BLOSUM62 substitution matrix (only edges with positive values are shown).
Files:
Standard Input:
Standard input is a file of fasta format sequences:
>name1
sequence1
>name2
sequence2
etc.
Command line parameters can be used to specify additional sequence databases, search parameters, the location of the BLAST executables and much more (see the README.txt file).
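For readers who want to check or preprocess their input before handing it to CLANS, the FASTA layout shown above can be read with a few lines of Python. This is only an illustrative sketch; the function name `read_fasta` is hypothetical and not part of CLANS.

```python
def read_fasta(lines):
    """Return a list of (name, sequence) tuples from FASTA-formatted lines."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

Called on an open file (for example `read_fasta(open("input.fasta"))`, where the file name is a placeholder), it returns the records in file order.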
Alternative input:
A CLANS savefile can be loaded.
Format:
sequences=3 #number of sequences
<param>#optional
parameters used for graph layout
</param>
<rotmtx>#optional
current rotation of the graph
</rotmtx>
<seqs>
>sequence name1
sequence1
>sequence name2
sequence2
>sequence name3
sequence3
</seqs>
<pos>#X,Y and Z coordinates of vertices
0 -2.359319 -3.0919282 1.6470909
1 2.5697038 -3.258371 -1.4772072
2 -2.1152546 -3.3181837 -1.3644718
</pos>
<hsp>#P-values for edges between vertices (smaller value=better)(missing values=P-value of 1)
0 0:0
0 1:1e-3
0 2:1e-60
1 1:0
1 2:1e-10
2 0:1e-57
2 1:1e-8
2 2:0
</hsp>
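The block structure of the savefile can be read with a small amount of Python. The sketch below assumes that every block starts with `<tag>` and ends with `</tag>` on their own lines, as in the listing above, and that anything following a tag on the same line is a comment; the function names are hypothetical and not part of CLANS.

```python
import re

TAG = re.compile(r"^<(/?)(\w+)>")

def parse_clans_blocks(text):
    """Collect the non-empty lines found between <tag> and </tag> markers."""
    blocks, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        match = TAG.match(line)
        if match:
            closing, name = match.groups()
            if closing:
                current = None
            else:
                current = name
                blocks[name] = []
        elif current is not None and line:
            blocks[current].append(line)
    return blocks

def parse_hsp(lines):
    """Map (i, j) vertex index pairs to P-values from lines like '0 2:1e-60'."""
    edges = {}
    for line in lines:
        pair, value = line.split(":")
        i, j = pair.split()
        edges[(int(i), int(j))] = float(value)
    return edges
```

For example, `parse_hsp(parse_clans_blocks(text)["hsp"])` yields a dictionary of edges; pairs that are absent from it correspond to a P-value of 1, as noted in the format description.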
Or, as an alternative, a file containing precomputed similarities can be used as input.
Format:
sequences=3 #number_of_sequences
<seqs>
>vertex_name_1
>vertex_name_2
>vertex_name_3
</seqs>
<mtx> #positive and negative values possible (neg. values = additional repulsive interaction)
0 0.2 -0.4
0.1 0 0.1
-0.1 0 0
</mtx>
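As an illustration of how such a matrix might be interpreted, the following Python sketch reads the rows of the `<mtx>` block and separates positive (attractive) from negative (repulsive) entries. The function names are hypothetical, and skipping the diagonal and zero entries is an assumption made for this example rather than documented CLANS behavior.

```python
def parse_mtx(lines):
    """Turn whitespace-separated rows into a square matrix of floats."""
    matrix = [[float(v) for v in line.split()] for line in lines if line.strip()]
    size = len(matrix)
    if any(len(row) != size for row in matrix):
        raise ValueError("similarity matrix must be square")
    return matrix

def split_interactions(matrix):
    """Return (attract, repel) dictionaries keyed by (row, column)."""
    attract, repel = {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if i == j or value == 0:
                continue  # assumption: self-pairs and zeros carry no force
            if value > 0:
                attract[(i, j)] = value
            else:
                repel[(i, j)] = value
    return attract, repel
```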
Or, to simplify things, a simple tabular format file can be loaded (either two or three column format)
Three column format:
name1 name2 value
name2 name1 value
name1 name3 value
name2 name4 value
etc.
Two column format:(the missing third value will be set to "1")
name1 name2
name2 name1
name1 name3
name2 name4
etc.
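The two tabular variants can be handled by one reader. The Python sketch below assumes the columns are separated by whitespace (the separator is not specified here) and uses a hypothetical function name; it fills in the value 1 when the third column is missing, as described above.

```python
def read_pairs(lines):
    """Return {(name_a, name_b): value}; a missing third column becomes 1.0."""
    pairs = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 2:
            name_a, name_b = fields
            value = 1.0
        elif len(fields) == 3:
            name_a, name_b = fields[0], fields[1]
            value = float(fields[2])
        else:
            raise ValueError("expected two or three columns: %r" % line)
        pairs[(name_a, name_b)] = value
    return pairs
```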
Or, if you prefer loading microarray data; you will need the files in the following format (also see the help and tutorial):
Name value
244901_at ; 6.05567352352854
244902_at ; 6.52863702461879
244903_at ; 8.61443604229016
244904_at ; 4.25407793580797
244905_at ; 1.93260943278271
244906_at ; 7.0035897602782
244907_at ; 2.37812316412652
244908_at ; 2.13222740232501
etc.
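A file in this layout can be read into a dictionary as sketched below in Python. The sketch assumes one `identifier ; value` pair per line and silently skips lines that do not fit that pattern (such as the header); the function name is hypothetical.

```python
def read_expression(lines):
    """Return {identifier: expression value} from 'name ; value' lines."""
    values = {}
    for line in lines:
        if ";" not in line:
            continue  # header or other non-data line
        name, raw = (part.strip() for part in line.split(";", 1))
        try:
            values[name] = float(raw)
        except ValueError:
            continue  # second field is not a number
    return values
```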
If you want to use a mapping file, the format is the following (also see the example mapping files in the tutorial folder):
BINCODE';'NAME';'IDENTIFIER';'DESCRIPTION'
'11.6';'lipid metabolism.lipid transfer proteins etc';'At1g43666';'lipid transfer protein-related|'
'11.6';'lipid metabolism.lipid transfer proteins etc';'At3g12545';'lipid transfer protein-related| low similarity
to Lipid transfer protein (Brassica rapa) GI:3062791; contains Pfam profile PF00152: tRNA synthetases class II (D, K and N)'
'11.6';'lipid metabolism.lipid transfer proteins etc';'At1g27950';'lipid transfer protein-related| low similarity
to lipid transfer protein Picea abies GI:2627141; contains Pfam profile: PF00234: Protease inhibitor/seed storage/LTP family'
'11.6';'lipid metabolism.lipid transfer proteins etc';'At2g44300';'lipid transfer protein-related| low similarity
to lipid transfer protein Picea abies GI:2627141; contains Pfam profile: PF00234: Protease inhibitor/seed storage/LTP family'
'15.1';'metal handling.acquisition';'At5g50160';'ferric chelate reductase, putative, FRO8'
'15.1';'metal handling.acquisition';'AT3g08040';'ferric reductase deficient 3 gene, MATE family, FRD3'
etc.
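Because the description field can itself contain semicolons, splitting a mapping record on `;` alone is not safe. The Python sketch below splits on the three-character sequence `';'` instead. It assumes that each record sits on a single line in the actual file and that fields do not contain that sequence; the function names are hypothetical.

```python
from collections import defaultdict

FIELDS = ("bincode", "name", "identifier", "description")

def parse_mapping_line(line):
    """Split one quoted, semicolon-separated record into its four fields."""
    parts = line.strip().strip("'").split("';'", 3)
    if len(parts) != 4:
        return None  # not a complete record
    return dict(zip(FIELDS, parts))

def genes_by_bin(lines):
    """Group gene identifiers by bin code, skipping the header line."""
    bins = defaultdict(list)
    for line in lines:
        record = parse_mapping_line(line)
        if record is None or record["bincode"] == "BINCODE":
            continue
        bins[record["bincode"]].append(record["identifier"])
    return dict(bins)
```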
And, if the mapping identifiers (e.g. At1g43666) do not match the microarray identifiers (e.g. 244901_at), you also need a lookup file telling CLANS which names correspond to which (the tutorial gives an example of how to generate one of these at the bottom):
At1g01100 261578_at
At1g01110 261580_at
At1g01120 261570_at
At1g01130 261575_at
At1g01140 261581_at
At1g01150 261571_at
At1g01160 261582_at
At1g01170 261572_at
At1g01180 261573_at
etc.
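A lookup file of this shape is a two-column table, which can be read as sketched below in Python. The function name is hypothetical, and lowercasing the gene identifiers (so that spellings such as At3g08040 and AT3g08040 match) is an assumption of this example, not a statement about how CLANS compares names.

```python
def read_lookup(lines):
    """Return {gene identifier (lowercased): microarray identifier}."""
    lookup = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        gene, probe = fields
        lookup[gene.lower()] = probe
    return lookup
```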
References:
Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402
Enright A.J., Ouzounis C.A. (2001) BioLayout - an automatic graph layout algorithm for similarity visualization. Bioinformatics 17:853-854
Fruchterman T.M., Reingold E.M. (1991) Graph drawing by force-directed placement. Softw. Pract. Exp. 21:1129-1164
Koski L.B., Golding G.B. (2001) The closest BLAST hit is often not the nearest neighbor. J. Mol. Evol. 52:540-542