The Australian National University
Research School of Biological Sciences

CLANS (CLuster ANalysis of Sequences)
Originally developed at the Max-Planck-Institute for Developmental Biology Tuebingen

V1: Tancred Frickey, Andrei Lupas (MPI Tuebingen)
V2: Tancred Frickey, Georg Weiller
Bioinformatics, Genomic Interaction Group
Research School of Biological Sciences
Australian National University, Canberra

Original version: Frickey T., Lupas A.N. (2004) CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics 20:3702-3704
Extended version: Frickey T., Weiller G. (2007) Analyzing microarray data using CLANS. Bioinformatics 23(9):1170-1171

CLANS can be downloaded HERE.
Help is available HERE.
A tutorial is available HERE.
The files used in the tutorial are available HERE.
An outline of the "Network-clustering" approach can be found HERE.

Frequently, homology is used as a reason to transfer knowledge about function or structure from known to unknown proteins. Although phylogenies are the method of choice when attempting to determine homology, the most frequently used marker is pairwise sequence similarity. Similarity search programs, such as BLAST or PSIBLAST (Altschul et al. 1997), can work efficiently with enormous datasets, while phylogenetic inference and the prerequisite sequence alignments rapidly reach a point where they become unsuitable due to prohibitive calculation costs and loss of resolution. On the other hand, pairwise similarity searches are plagued by false positive matches and by problems arising from amino acid composition bias, so that in many cases the best BLAST hit is not the closest sequence relative (Koski & Golding 2001).
Aiming for the best of both worlds, we have implemented a version of the Fruchterman-Reingold graph-layout algorithm (1991). The program takes unaligned FASTA format sequences as input, performs all-against-all BLAST searches and displays the pairwise similarities in either 2D or 3D graphs. In contrast to phylogenetic inference methods, this approach uses unaligned sequences and works better the more sequences are provided, as a larger number of pairwise similarities better averages out the chance hits that plague standard BLAST comparisons.
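To give a feel for how a force-directed layout of this kind works, here is a minimal 2D sketch in Python in the spirit of Fruchterman and Reingold (1991). It is not the CLANS implementation: the function name layout, the edge weights, the starting temperature and the cooling factor are all assumptions made for illustration, and how CLANS converts BLAST P-values into attraction values is not shown here.

```python
import math
import random


def layout(n, edges, iterations=200, seed=0):
    """Place n vertices in 2D; edges maps (i, j) to an attraction weight."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    k = math.sqrt(4.0 / n)   # ideal distance between vertices in a 2 x 2 area
    temp = 0.2               # largest allowed step; lowered every iteration
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n)]
        # All pairs of vertices repel each other with force k^2 / d.
        for i in range(n):
            for j in range(i + 1, n):
                dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[i][0] += dx / d * f
                disp[i][1] += dy / d * f
                disp[j][0] -= dx / d * f
                disp[j][1] -= dy / d * f
        # Vertices joined by an edge attract each other with force w * d^2 / k.
        for (i, j), w in edges.items():
            dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
            d = math.hypot(dx, dy) or 1e-9
            f = w * d * d / k
            disp[i][0] -= dx / d * f
            disp[i][1] -= dy / d * f
            disp[j][0] += dx / d * f
            disp[j][1] += dy / d * f
        # Move every vertex along its net force, capped by the temperature.
        for i in range(n):
            d = math.hypot(disp[i][0], disp[i][1]) or 1e-9
            step = min(d, temp)
            pos[i][0] += disp[i][0] / d * step
            pos[i][1] += disp[i][1] / d * step
        temp *= 0.98
    return pos
```

Calling layout(6, edges) with two fully connected triples of vertices should pull each triple together while the repulsion between unconnected vertices pushes the two triples apart.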

The first version of CLANS was developed in 2004. Since then, it has been extended with numerous features (and a few bells and whistles) in response to user feedback and wish-lists. The biggest change, however, was the ability of CLANS to analyze microarray data.
The problem of finding groups of co-regulated genes across a number of microarray experiments is quite similar to the problem of finding groups of homologous proteins in a large dataset. In both cases we have huge amounts of data and are looking for those few genes that show some kind of significant similarity. Putative co-regulation as well as putative homology are generally inferred from similarities in the feature set of a gene. In the first case the feature set consists of the expression levels across the experiments, in the second case of the amino acid sequences of the proteins. A certain similarity is also apparent when you look at the graphs generated for either protein sequence data (right, top graph) or microarray data (right, bottom graph).

While for protein data the knowledge of "homology" generally helps in formulating a hypothesis, since some functional annotation will most likely be available for at least some of the close or more distant homologs, the knowledge of apparent co-expression helps a lot less for microarray data. This is due to two problems. The first is the large number of genes present on the microarray chip combined with the generally low number of experiments and/or replicates that are performed: the probability of two genes showing co-expression, even though they are not causally related in any way, is relatively high. The second is that knowing that a group of genes is co-expressed does not help without having at least an inkling of what some of them do. Therefore we need to include additional data such as functional annotation, cellular localization or pathway information.
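A back-of-the-envelope calculation shows why chance co-expression is a concern. The sketch below only counts gene pairs; the chip size of 20000 genes and the per-pair chance probability of 1e-4 are made-up numbers for illustration, not values taken from any real dataset.

```python
def expected_chance_pairs(n_genes, p_chance):
    """Expected number of gene pairs that look co-expressed purely by chance.

    n_genes  -- number of genes on the chip (hypothetical value below)
    p_chance -- assumed probability that one unrelated pair looks co-expressed
    """
    n_pairs = n_genes * (n_genes - 1) // 2   # every unordered pair of genes
    return n_pairs * p_chance


# 20000 genes give 199990000 pairs; at p = 1e-4 that is roughly 20000
# spurious pairs, even though each individual pair is very unlikely.
print(expected_chance_pairs(20000, 1e-4))
```

Even a small per-pair probability adds up quickly because the number of pairs grows with the square of the number of genes.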
Overlaying such annotation onto a cluster map greatly enhances its value. The map then not only tells you which genes appear co-expressed (i.e. might be causally related), but also shows which genes are thought/known to be involved in the same pathway or what their function might be.
Although it is quite possible to use other programs to find plausible "clusters" of co-expressed genes and then extract the annotation data for these genes from another source, the interactivity of CLANS is a great help. You can vary the clustering cutoffs at will and observe how the various groups associate or dissociate, what the annotations for the genes are, etc., and thereby find the most relevant clustering cutoffs for the question YOU want to answer. This matters when sifting through large amounts of data, as no static clustering is likely to give the best results in all cases. Sometimes more lenient settings are beneficial; at other times more stringent cutoffs produce better results.
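The effect of varying a cutoff can be illustrated with a small Python sketch. It simply groups vertices that are linked by any edge passing the cutoff (connected components via union-find). This is an assumption chosen for illustration and not the clustering that CLANS itself performs; the function name clusters_at_cutoff and the example P-values are hypothetical.

```python
def clusters_at_cutoff(n, pvalues, cutoff):
    """Group vertices linked by edges whose P-value is at most cutoff."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the group representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), p in pvalues.items():
        if i != j and p <= cutoff:
            parent[find(i)] = find(j)   # merge the two groups

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# Hypothetical P-values for three sequences.
pvalues = {(0, 1): 1e-3, (0, 2): 1e-60, (1, 2): 1e-10}
for cutoff in (1e-5, 1e-20, 1e-100):
    print(cutoff, clusters_at_cutoff(3, pvalues, cutoff))
```

With a lenient cutoff all three sequences end up in one group; as the cutoff becomes more stringent, the groups dissociate.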
Example: Three groups of sequences are selected and the corresponding expression plots are drawn (each of the datapoints in the small, colored graphs represents one experiment for one gene; vertical bars represent the standard deviation across the replicates; the datapoints are connected by lines to better show "trends" in the change of expression). Two groups are highlighted by black circles while the third is highlighted by red dots. The known functional annotation for the sequences in the third group is shown on the right (I had to shift the annotation a bit so that the sequence names were no longer visible, as this data has not yet been made publicly available). It is possible to search for all genes involved in a specific pathway (by selecting the pathway/bin of interest and pressing the "show bins in clans" button) or to look up in which pathways/bins a given set of sequences (for example group 3) is involved (select a group of sequences and press the "show clans in bins" button). For a more detailed explanation read through the tutorial available HERE.

Running the program:

To run CLANS you need to have Java 1.5 or later installed (Java can be downloaded HERE). For full functionality you will also need the NCBI BLAST, PSI-BLAST and formatdb executables (NCBI). For command line parameters and basic help please refer to the README or tutorial file.

Graphs calculated using FASTA format sequences as input:
AAA+ ATPases
Graph layout for 5101 sequences identified as putative AAA-ATPases by PSIBLAST and Hidden Markov Model (HMM) searches. The set consists largely of AAA+-ATPases (the superfamily that contains the AAA-ATPases). ABC-transporters, a known outgroup to AAA+-ATPases, can be found as a separate cluster (bottom-left). BLAST hits are displayed using a color gradient from red (good) to pale blue (less good). Edges with P-values worse than 10^-10 are not shown.

Outer membrane proteins
Graph layout for 14092 sequences that were the result of recursive PSIBLAST searches using outer membrane proteins as seeds. Many large clusters are visible as well as many low confidence hits (dots with no connections). Analysis in progress.

Other Data
Graphs calculated from precomputed similarities:
Example analysis of microarray data (approximately 75 experiments). Three groups are selected. The functional annotation for the sequences in the blue group shows that they are all thought to be involved in bin 28.1: "DNA synthesis/chromatin structure". The purple group is involved in bin "Photosystem I reaction center" (data not shown). The red group contains a variety of annotations, many of which are proteins of "unknown" function.
Graph layout of amino acid similarities according to the BLOSUM62 substitution matrix (only edges with positive values are shown).

Standard Input:
Standard input is a file of fasta format sequences:
Command line parameters can be used to specify additional sequence databases, search parameters, the location of the BLAST executables and much more (see the README.txt file).

Alternative input:
A CLANS savefile can be loaded.
	sequences=3 #number of sequences
	parameters used for graph layout
	current rotation of the graph

	>sequence name1
	>sequence name2
	>sequence name3
	<pos>#X,Y and Z coordinates of vertices
	0 -2.359319 -3.0919282 1.6470909 
	1 2.5697038 -3.258371 -1.4772072
	2 -2.1152546 -3.3181837 -1.3644718
	<hsp>#P-values for edges between vertices (smaller value=better)(missing values=P-value of 1)
	0 0:0
	0 1:1e-3
	0 2:1e-60
	1 1:0
	1 2:1e-10
	2 0:1e-57
	2 1:1e-8
	2 2:0
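For readers who want to post-process such a file, the following Python sketch reads only the two blocks shown above. It assumes the layout exactly as in the example (a tag line starting with "<", "#" introducing a comment, "index x y z" lines for positions and "i j:value" lines for P-values); real savefiles may contain further sections, which this sketch skips. The function name parse_pos_and_hsp is hypothetical.

```python
def parse_pos_and_hsp(text):
    """Return (positions, pvalues) from the <pos> and <hsp> blocks of text."""
    positions, pvalues = {}, {}
    section = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if line.startswith("<"):
            # A tag line opens a new section; a closing tag ends the current one.
            section = None if line.startswith("</") else line.strip("<>")
            continue
        if section == "pos":
            idx, x, y, z = line.split()
            positions[int(idx)] = (float(x), float(y), float(z))
        elif section == "hsp":
            pair, value = line.split(":")
            i, j = pair.split()
            pvalues[(int(i), int(j))] = float(value)
    return positions, pvalues


example = """
<pos>#X,Y and Z coordinates of vertices
0 -2.359319 -3.0919282 1.6470909
1 2.5697038 -3.258371 -1.4772072
<hsp>#P-values for edges between vertices
0 1:1e-3
1 0:1e-10
"""
positions, pvalues = parse_pos_and_hsp(example)
# Missing edges are treated as a P-value of 1, as stated in the format notes.
print(pvalues.get((0, 1), 1.0), pvalues.get((5, 6), 1.0))
```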

Or, as an alternative, a file containing precomputed similarities can be used as input.
	sequences=3   #number_of_sequences

	<mtx> #positive and negative values possible (neg. values = additional repulsive interaction)
	0    0.2 -0.4
	0.1  0    0.1
	-0.1 0    0
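A matrix file of this shape can be produced or checked with a few lines of Python. The sketch below assumes the layout exactly as in the example above (a "sequences=" line, a "<mtx>" tag, then one whitespace-separated row per sequence, with "#" starting a comment); the function name parse_matrix is hypothetical.

```python
def parse_matrix(text):
    """Return the <mtx> block as a list of rows of floats."""
    rows, n, in_matrix = [], None, False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if line.startswith("sequences="):
            n = int(line.split("=", 1)[1])
        elif line.startswith("<mtx>"):
            in_matrix = True
        elif line.startswith("<"):
            in_matrix = False                 # any other tag ends the matrix
        elif in_matrix:
            rows.append([float(v) for v in line.split()])
    # The matrix should be square and match the declared number of sequences.
    if n is not None and (len(rows) != n or any(len(r) != n for r in rows)):
        raise ValueError("matrix size does not match sequences=%d" % n)
    return rows


example = """sequences=3   #number_of_sequences

<mtx> #positive and negative values possible
0    0.2 -0.4
0.1  0    0.1
-0.1 0    0
"""
matrix = parse_matrix(example)
print(matrix[0][2])   # negative value: an additional repulsive interaction
```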

Or, to simplify things, a simple tabular format file can be loaded (either two or three column format)
Three column format:
	name1	name2	value
	name2	name1	value
	name1	name3	value
	name2	name4	value
Two column format (the missing third value will be set to "1"):
	name1	name2
	name2	name1
	name1	name3
	name2	name4
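As a rough illustration of how the two tabular variants relate, the Python sketch below reads either form and fills in the default value of 1 when the third column is missing. It assumes that names contain no whitespace and that columns are separated by tabs or spaces; the function name parse_tabular is hypothetical.

```python
def parse_tabular(text):
    """Return {(name_a, name_b): value} from two or three column lines."""
    edges = {}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if len(fields) == 2:
            name_a, name_b = fields
            value = 1.0                      # missing third value is set to 1
        elif len(fields) == 3:
            name_a, name_b = fields[:2]
            value = float(fields[2])
        else:
            raise ValueError("expected 2 or 3 columns: %r" % raw)
        edges[(name_a, name_b)] = value
    return edges


print(parse_tabular("name1\tname2\t0.5\nname2\tname4\n"))
```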

Or, if you prefer loading microarray data, you will need the files in the following format (also see the help and tutorial):
Name value
	244901_at ; 6.05567352352854
	244902_at ; 6.52863702461879
	244903_at ; 8.61443604229016
	244904_at ; 4.25407793580797
	244905_at ; 1.93260943278271
	244906_at ; 7.0035897602782
	244907_at ; 2.37812316412652
	244908_at ; 2.13222740232501
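To check such a file before loading it, something like the following Python sketch can be used. It assumes one "name ; value" pair per line as in the example above and skips any line without a semicolon (such as a header line); the function name parse_expression is hypothetical.

```python
def parse_expression(text):
    """Return {probe_name: expression_value} from 'name ; value' lines."""
    values = {}
    for raw in text.splitlines():
        if ";" not in raw:
            continue                          # skip headers and blank lines
        name, value = raw.split(";", 1)
        values[name.strip()] = float(value)   # float() ignores outer spaces
    return values


example = """Name value
244901_at ; 6.05567352352854
244902_at ; 6.52863702461879
"""
print(parse_expression(example))
```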

If you want to use a mapping file, the format is the following (also see the example mapping files in the tutorial folder):

'11.6';'lipid metabolism.lipid transfer proteins etc';'At1g43666';'lipid transfer protein-related|'
'11.6';'lipid metabolism.lipid transfer proteins etc';'At3g12545';'lipid transfer protein-related| low similarity to Lipid transfer protein (Brassica rapa) GI:3062791; contains Pfam profile PF00152: tRNA synthetases class II (D, K and N)'
'11.6';'lipid metabolism.lipid transfer proteins etc';'At1g27950';'lipid transfer protein-related| low similarity to lipid transfer protein Picea abies GI:2627141; contains Pfam profile: PF00234: Protease inhibitor/seed storage/LTP family'
'11.6';'lipid metabolism.lipid transfer proteins etc';'At2g44300';'lipid transfer protein-related| low similarity to lipid transfer protein Picea abies GI:2627141; contains Pfam profile: PF00234: Protease inhibitor/seed storage/LTP family'
'15.1';'metal handling.acquisition';'At5g50160';'ferric chelate reductase, putative, FRO8'
'15.1';'metal handling.acquisition';'AT3g08040';'ferric reductase deficient 3 gene, MATE family, FRD3'

And, if the mapping identifiers (e.g. At1g43666) do not match the microarray identifiers (e.g. 244901_at), you also need a lookup file telling CLANS which names correspond to which (the tutorial gives an example of how to generate one of these at the bottom).
	At1g01100       261578_at
	At1g01110       261580_at
	At1g01120       261570_at
	At1g01130       261575_at
	At1g01140       261581_at
	At1g01150       261571_at
	At1g01160       261582_at
	At1g01170       261572_at
	At1g01180       261573_at
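The role of the lookup file can be sketched in a few lines of Python: it is read into a dictionary that translates mapping identifiers into microarray identifiers. The sketch assumes two whitespace-separated columns per line, as in the example above; the function name parse_lookup is hypothetical and this is not how CLANS itself is implemented.

```python
def parse_lookup(text):
    """Return {mapping_identifier: microarray_identifier}."""
    lookup = {}
    for raw in text.splitlines():
        fields = raw.split()
        if len(fields) == 2:                  # ignore blank or odd lines
            gene_id, probe_id = fields
            lookup[gene_id] = probe_id
    return lookup


lookup = parse_lookup("At1g01100       261578_at\nAt1g01110       261580_at\n")
# Translate identifiers from a mapping file; unknown names give None.
print([lookup.get(gene) for gene in ("At1g01110", "At9g99999")])
```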


Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402

Enright A.J., Ouzounis C.A. (2001) BioLayout - an automatic graph layout algorithm for similarity visualization. Bioinformatics 17:853-854

Fruchterman T.M.J., Reingold E.M. (1991) Graph drawing by force-directed placement. Softw. Pract. Exp. 21:1129-1164

Koski L.B., Golding G.B. (2001) The closest BLAST hit is often not the nearest neighbor. J. Mol. Evol. 52:540-542