This document should provide a slightly deeper insight into the workings of Mclip and Mmatch than the very short article and help-page provides.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The web-interface is limited to motif-searches covering less than 50.000 characters. This is due to our limited computer capacities and not an inherent limitation of the program. A local installation of Mclip is only restricted by the amount of RAM accessible to the Java program (specified by the "-Xmx" parameter (i.e. "java -Xmx300m -jar mclip.jar" would allow mclip to use up to 300 megabytes of RAM))
1. Introduction.
2. Basic layout.
3. Parameters.
4. Profile-profile alignment.
5. Clique detection.
6. Motif-sequence alignment.
1. Introduction
I started work on these programs shortly after trying to detect motifs in the upstream UTR of a set of co-expressed sequences (determined by microarray).
Although there are a multitude of programs published and available (see "http://www.stud.uni-potsdam.de/~haussler/master/list.php" for a nice (non-exhaustive) list of available programs (unfortunately the list seems to have been contaminated with non-relevant data since, but the original motif program entries are still available towards the end)
I ran into major difficulties trying to analyze hundreds of small datasets and/or a few large ones arising from my above mentioned analysis.
A major problem was that many easy to use programs were available on the web, but the web-interfaces would not accept some of my input files due to their size.
The other major problem was that I was unable to get the source code for many of the programs and for those I did manage to find the sources for, many did not compile properly or required extensive dependencies that would have made installation of that tool a major hassle.
I did manage to find a few programs that I was able to install locally without any difficulty and that would analyze my larger datasets, but I noticed that the results they returned were highly variable.
Some methods appeared too stringent, finding only a very few motifs in the large number of sequence sets I had, others would find at least one motif for most of the sets, but were unable to "generalize" to detect highly similar motifs in other sequences (i.e. they were over-learning) and others again would return large numbers of motifs for almost every set (most of which unfortunately appeared to be randomly selected from the sequences; i.e. converting the aligned "hit-regions" back to profiles did not show any kind of conservation pattern).
One other aspect that disturbed me was that many of the programs required me to tell them things I did not know (i.e. the length of the motif I was looking for, how often it was present in the set of sequences, etc.).
Coming from the protein world and having written a number of alignment programs, I decided to cannibalize some of the programs and create sth. that would let me search for motifs using an alignment based approach (making it deterministic; running the program multiple times would return the same results (not sth. all motif finders do)) and allowing for gaps (I did not expect many gaps in my motifs, but I did not want to exclude the possibility of them being present either).
Over the course of programming I found I had to add a number of bounding parameters (I was getting way too many results) but mostly these are "lower bounds" (i.e. I want my motif to be longer than X and be conserved) (for example, nobody would be interested in motifs of length "1" no matter how conserved)
The following is what I came up with:
2. Basic layout
The program (called MClip for Motif detection based on CLIques of Profile-profile alignments) consists of a number of consecutive steps.
1. All-against-all alignment of the input sequences. This provides mclip with the local similarities the sequences have to one another.
2. Next, Mclip derives a profile for each of the sequences by mapping the local similarities back on to the sequences.The profile provides a numerical representation of the conservation and number of residues involved in the local alignments covering that sequence. Using a profile has the advantage of reducing the noise produced by chance local similarities.
-At this point mclip has information about the regions of local similarity and the number of time they occur in my set. However, that information is still distributed over a wide range of profiles (as every input sequence was converted into a profile). Mclip needs to discard profile regions showing no similarity to others (based on the assumption that these are unlikely to be responsible for the observed co-expression) and combine profile regions showing similarities to produce a motif.
Now for the tricky part.
3. Aligning the profiles generated in "2" to one another produces sets of local alignment traces stating what profiles are similar in what regions and how they align to one another (see below for more details on the profile-profile alignment procedure).
It may seem superfluous to first align sequences, then derive profiles only to realign the profiles; why not simply use the sequence-sequence alignment traces? It would be possible to do so, using a sum of pairs score or similar, but the advantage of aligning profiles is increased sensitivity (similar to the gain in sensitivity going from BLAST to PSIBLAST).
4. Mclip then has to determine which regions of which profiles are alignable to one another without one profile alignment contradicting another. Combining these, produces the final motifs.
5. As I was not really interested in the motifs themselves, but rather in which sequences shared these motifs, I added a small routine than would take the motifs found and try to align the sequences to them. Using the alignment score of the local sequence-to-motif matches it is reasonably straight forward to estimate how likely each match is due to chance similarity (using Z-statistics).
The output of the program consists of:
The motif
Motif position; freq(A); freq(C); freq(G); freq(T); freq(gap); The number of residues from which the frequencies were derived(i.e. the number of profiles involved)
and the corresponding sequence hits
sequence name
start; motif hit; end; alignment score; Z-score; Expect-value
and, possibly, other hits for that sequence
As I also wanted to compare the motifs found in one group with the number of times the motif ocurred outside that group I made the motif-to-sequence matching utility available separately (Mmatch).This utility has two matching modes; in match=0 it tries to match all the sequences to the given motifs, returning the motif name and the sequences that contained that motif (similarly to mclip); in match=1 it tries to match the motifs to the given sequences, giving an easier overview of what sequences contained what motifs.
Match=0:(match sequences to motifs)
3. Parameters: A list of the possible parameters used in Mclip and Mmatch and a short explanation of what they are for.
Mclip parameters
Parameter
Value
Info
-Infile
Filename
The path to the file containing the set of sequences to be searched for shared motifs (FASTA).
-Complement
True/False (def: true)
Should only the sequences provided be analyzed, or also their reverse complements?
-Countcomplement
True/False (def: false)
Differentiate between hits to the reverse complement and other sequences?
If "true", mclip will not count a motif-hit to a sequence and it's reverse complement as two separate hits.
This can help eliminate inverted repeat motifs hitting only in one sequence and it's reverse complement which are not "motifs" as such.
-Alnmode
0-2 (def: 1)
Will use different profile-profile alignment modes. (see below: profile-profile alignment).ONLY AVAILABLE IN THE STANDALONE VERSION; NOT VIA THE WEB_INTERFACE!
-Endgaps
True/False (def: true)
Should endgaps be allowed (i.e. should the motifs be matched globally(false) or locally(true))
-E-value
>=0 (def: 1e-3)
Up to what E-value should motif hits be returned.
-Minlength
>=1 (def: 7)
Minimal length of a motif.
-Minmotifnum
>=1 (def: 2)
The minimal number of times a motif has to be recovered in the set of input sequences.
-Mincoverage
>=0 (def: 0.75)
How much of the motif must a valid hit cover (only relevant if -endgaps is "true")
-Matchvals
"estimate" (default) or float[ ] (in order AA,AC,AG,AT,CC,CG,CT,GG,GT,TT)
Either estimate the match scores for each residue from the input data, or use values specified by the user. NOTE: If values are provided, then the parameter "frequencies" also needs to be specified. If "estimate" is used then both of these are calculated from the input sequences.
-Frequencies
float[ ] in order A,C,G,T
If Match values were provided, this tells the program what relative residue frequencies it should use to calculate scores.
-Gapopen
float (def: -5)
The penalty for opening a gap in the sequence-sequence local alignments.
-Gapextend
float (def: -4)
The penalty for extending a gap in the sequence-sequence local alignments.
-Minalnlength
int (def: 7)
Only local alignments of length greater than minalnlength are used to generate the profiles.
-Minalnscore
float (def:7)
Only local alignments with scores greater than minalnscore are used to generate the profile.
-Minprofilescore
float (def: 7)
Only profile traces with scores greater than minprofilescore are used to generate the motifs.
-Minprofilenum
int (def:2)
Cliques with less than minprofilenum members are discarded.
-Iterate
int (def: 0)
How many round of iteration should be performed (the optional updating of profiles based on the profile-profile matches).
-Pseudocounts
float (def: 1e-3)
The amount of pseudocounts to add to each profile prior to profile-profile comparisons (see profile-profile alignment procedure desribed below)
Mmatch parameters
Parameter
Value
Info
-Infile
Filename(s)
The path(s) to the file(s) containing the fasta sequences to analyze (space delimited)
-Profile
Filename(s)
The path(s) to the file(s) contining the motifs to match (space delimited)
-Evalue
float (def: 0.01)
The E-value up to which motif hits are returned
-Pvalue
float (def:-1)
If >=0 overrides E-value settings and causes hits to be returned for P-values instead of E-values (useful if database size changes across different comparisons (in which case you are better off comparing P-values instead of E-values), or motifs are very short)
-Mincoverage
float (def:0.75)
The minimum coverage a hit has to have to a motif
-Complement
True/False (def:true)
Search the reverse-complement of the sequences for motifs as well?
-Endgaps
True/False (def:true)
Should endgaps be allowed (i.e. should the motifs be matched globally (no engaps, false) or locally (endgaps, true))
-Pseudocounts
float (def:1e-3)
Determines the amount of pseudocounts to add to the Motifs. Motifs based on few sequences get added proportionally more pseudocounts than those based on more sequences (see mclip "-pseudocount" parameter). "-1" or any value greater than 0 are acceptable.(also see profile-profile alignment below for how pseudocounts affects a motif. The score-profiles for the motif-sequence alignment are calculated in a very similar manner). If -1 is selected, Mmatch uses an amount of pseudocounts that sets the average score of a sequence-motif alignment to -scoretarget (see below) (From the tests I performed a value of "-5" for -scoretarget appears to give reasonable results in most cases, but you are still better off optimizing the amount of pseudocounts you add individually for each profile).
-Scoretarget
float (def:-5)
The score an average hit to the motif should generate. This will influence the amount of pseudocounts that are added to the motifs. The more negative the value, the less pseudocounts are added. This is useful when dealing with many highly diverse profiles at the same time as it automatically selects comparable amounts of pseudocounts to add to each profile. This only has an effect if -pseudocounts is set to "-1" (see above)
-Limit
float (def:0.01)
The amount of pseudocounts to add is optimized until the average score to the motif lies in the range of scoretarget-limit to scoretarget+limit. The smaller the limit, the longer it takes to optimize the pseudocounts (a value of 0.01 takes less than a second). This only has an effect if -pseudocounts is set to "-1".
-Match
0/1 (def:0)
0:search profiles against sequences; 1: search sequences against profiles (0: returns one entry per profile and the matching sequences; 1: returns one entry per sequence and the matching profiles)
Runtime and motif detection:
To provide a rough estimate on how Mclip compares to other programs in terms of speed and usability, Aglam, AlignACE, Meme and Mclip were compared. The datasets used consist the upstream UTR of groups of co-expressed sequences as determined by microarray experiments. These programs were used as they are readily available, easy to compile/install and do not require training-sets containing true-positive and true-negative sequences (since we do not know from the microarray experiments which co-expressions are due to direct co-regulation (i.e. these sequences should share a common motif), more general co-regulation (they are causally related, but may not share a motif) or chance similarity (the co-expression we see is due to chance or the limited number of conditions we examine and NOT due to actual co-regulation of these genes)).
Meme was run under the "Any Number of Repeats" (ANR) model to make the results more comparable to Mclip.
NOTE: Aglam only returns one motif, all other programs return multiple motifs
Commands used:
All sequences were 1kb long. The number of sequences per dataset varied from 2 to 700 with the overwhelming majority of the datasets containing less than 10 sequences (see histogram).
The run-times for the various programs are are shown in the plot.(For datasets containing less than 200 sequences. I chose this limit as both Meme and Mclip generally need more than 2 days to derive the motifs for a dataset containing 200 sequences (on my desktop), and I was not willing to wait that long for the results from 5 extra datasets).
Aglam scales best and generally takes between 1 and 2 minutes to detect motifs.
AlignACE scales second best, taking between 10 seconds and 15 minutes to detect motifs.
Mclip scales slightly better than Meme, taking between 2 seconds and 9 hours to detect motifs.
Meme scales worst and takes between 1 second and 10 hours to detect motifs.
Stars "*" mark the datapoints; the lines correspond to the least-squares linear regressions.
Aglam is shown in blue, AlignACE in green, Mclip in red and Meme in black.
Quality:
It is difficult to estimate the "prediction quality" of any of these programs, as it is unknown which of the genes really are co-regulated (and might therefore share regulatory motifs in the upstream UTR) and which only appear co-regulated over the small number of microarrays we had at our disposal (8 conditions, 3 replicates each).
To get an approximate estimate for the motif-detection ability of the various programs, we used the setup depicted in the figure on the right:
-The output of the motif detection programs consists of sets of sequence regions containing a motif. For programs producing gapless motifs we can convert these sequence regions to a profile simply by counting the residue frequencies at each position. For Mclip results we have to take end-gaps and inserts into account.
-The set of profiles is then compares to all sequences. There are three possibilities:
1. The profile finds no significant matches at all. This is most likely due to inconsistencies in the set of sequence regions used as input. No "conserved" features could be detected, causing no high-scoring hits to be found (it is, however, also possible that the motif was very short, thereby causing poor E-values).(BAD)
2. Significant matches are found and found only in the set of sequences from which the motif was derived. This shows the motif could be recovered and the motif is specific to that group of sequences.(GOOD)
3. Significant matches are found, but also to sequences outside the set from which the profile was derived. Here we cannot say much. Maybe the motif we found is used to co-regulate these genes in a condition not covered by our microarray experiments or maybe the motif is not specific enough.(UNKNOWN)
The expression plot on the right shows what I mean. The expression values for 3 sequences sharing a motif are depicted. The vertical bars provide an estimate of the variance over the replicate datasets. Two sequences show a strong correlation of their expression values. The third shows a clear anti-correlation. It is quite possible, that the motif we find is involved in the expression of these sequences, but other motifs or regulatory elements provide a modulatory effect.
Program
Motifs found
in X datasets
GOOD
UNKNOWN
BAD
Aglam
103
103
69 (67%)
11 (11%)
23 (22%)
AlignACE
993
103
3 (0.3%)
23 (2.3%)
967 (97%)
Meme
46
21
6 (13%)
9 (20%)
31 (67%)
Mclip
1058
101
155 (15%)
246 (23%)
657 (62%)
Note: This is BY NO MEANS a rigorous test. The set of data I use as well as the very rough analysis are certainly biased in a large number of ways. This is just meant as an examples of how the various programs performed under default conditions on this set of sequences (see above ("commands used") for the parameters I did change from their default values).
Aglam: Aglam returns only the best motif for a sequence set. It was able to detect 103 motifs in 103 of the 106 sets of co-expressed sequences. In most of the motifs predicted by Aglam, a clear "conserved" signal was present. After deriving profiles from the sequence regions returned by aglam, 69% of the profiles returned hits specific to the group of sequences they were derived from, 11% had hits to other sequence groups and 22% did not return any high scoring hits.
The large discrepancy between Aglam and the rest of the methods is most likely due to Aglam returning only the best motif for each set. In general, the aglam motifs are also longer and cover less sequences than those returned by the other methods. These effects bias towards longer profiles with clear-cut "conservation patterns". This produces nice results in a test such as this, but there is the risk of over-learning, i.e. producing motifs that are not general enough and are only able to detect the actual input sequences. This effect may be visible in the low number of "unknowns" that aglam generates. The other programs have a ratio of approx. 2:1 "unknowns" (i.e. unspecific) to "good" (i.e. specific) motifs, while for aglam the ratio is about 1:6.
Alignace: Alignace returns multiple motifs for a sequence set and was able to detect 993 motifs in 103 of the 106 sequence sets. Unfortunately only very few showed any "conserved pattern" that could be converted into a profile, causing 97% of the predicted motifs to return no high scoring hits.
Alignace has a tendency to return short motifs. This affected its results negatively in this example, as the shorter motifs have a tendency to produce worse P-values and are less specific.
Meme: Meme was told to return up to 100 motifs for each sequence set (see above, "commands used:"). It was able to detect 46 motifs in 21 of the 106 datasets. 33% of these could be converted into profiles that produced high-scoring hits with 13% being specific to the group of sequences they were originally derived from.
I am astounded that Meme found motifs in less than 50% of the datasets. This may be due to use of suboptimal parameters or an unfavorable set of data.
Mclip: Mclip returns multiple motifs for each sequence set and was able to detect 1058 motifs in 101 of the 106 datasets. 38% of these could be converted into profiles returning high-scoring hits with 15% of the motifs appearing specific to the group of sequences they were derived from.
Being the author of Mclip, I doubt I can provide an truly impartial comparison. In short: it does what it's supposed to do appears to works fine. One disadvantage of Mclip is the large number of motifs it finds. In the above example, 401 motifs have to be examined. This increases the chance of finding a putative motif for any given set of sequneces, but also increases the number of motifs you have to sift through to find that motif.
This is meant as an example application to a single set of data. A more thorough test (such as performed by Tompa M. 2005, Nature Biotechnology 1(23):137-144) would require sets of "true positive" sequences containing known motifs (for example taking randomized sequences and adding motifs from transfac) as well as sets of "true negative" sequences to compare the results against. Such a test would however go far beyond the scope of a simple example application (and would take more time and resources than I have at my disposal for this project).
4. Profile profile alignment
A lot of literature is available on how to perform sequence to sequence alignment and I will not go into details on this subject; suffice it to say mclip uses a smith-waterman approach to determine all possible local alignments greater or equal to the specified minimum alignment score and length.
If "estimate" was used as a parameter instead of specifying nucleotide matching scores, then the program reads in all sequences and calculates the background frequencies for the nucleic acids.
The scores for nucleotide matches are then calculated as:
AA = (1 - &radic(freqA * freqA));
AC = (&radic(freqA * freqC) - 1);
AG = (&radic(freqA * freqG) - 1);
AT = (&radic(freqA * freqT) - 1);
CC = (1 - &radic(freqC * freqC));
CG = (&radic(freqC * freqG) - 1);
CT = (&radic(freqC * freqT) - 1);
GG = (1 - &radic(freqG * freqG));
GT = (&radic(freqG * freqT) - 1);
TT = (1 - &radic(freqT * freqT));
gap-open and gap-extend penalties are defined by the user
The Profile-profile alignment is slightly more complicated as individual match-scores are no longer applicable. Mclip uses the the log-odds ratio of one profile emitting the residue counts observed in another profile as the match metric. The more likely one profile column is to generate the counts observed in another profile column, the better the score.
Alignment steps:
1. The background residue frequencies for each profile are calculated.
2. A score-profile is derived for each of the two profiles (i and j) we want to align. They are generated anew prior to each profile-profile alignment as the residue frequencies can change quite dramatically and this influences the corresponding scores. The score profiles are derived by taking each "frequency profile" position and converting the values to scores according to the formulae below:
There are three alignment modes; default is "Mode1".
"fikx= frequency of residue x at position k in profile i."
"Nik=Number of residues from which the residue frequencies at position k in profile i were derived."
"bjx= background frequency of residue x in profile j."
"Sikx= Score for residue x at position k in profile i."
Mode0:
Multiplies the pseudocounts added with the background frequency of the character
IF("x" is a gap-character ('-')){
Sikx=log2(((fikx*Nik+bjx*pseudocount)/(Nik+bjx*pseudocount))/1)
}ELSE{(if "x" is a non-gap ('A','C','G','T')) Sikx=log2(((fikx*Nik+bjx*pseudocount)/(Nik+bjx*pseudocount))/bjx)
} Mode1:
Uses the same amount of pseudocounts for each character
IF("x" is a gap-character ('-')){
Sikx=log2(((fikx*Nik+pseudocount)/(Nik+pseudocount))/1)
}ELSE{(if "x" is a non-gap ('A','C','G','T')) Sikx=log2(((fikx*Nik+pseudocount)/(Nik+pseudocount))/bjx)
} Mode2:
No differentiation between gaps and A,C,G,T.
Sikx=log2(((fikx*Nik+pseudocount)/(Nik+pseudocount))/bjx)
This gives the alignment of two rare residues a higher weight than the alignment of two frequently occurring residues. The gap-specific formula makes sure that gaps are always penalized. However, should the alignment routine assign an alignment-gap to a profile column consisting mostly of gaps, the penalty will be close to zero.
When calculating the score-profile, the nucleic acid background frequencies of the profile to align are used in the denominator instead of the more common background frequencies over the whole dataset. This is because the residue frequencies of a profile can differ quite substantially from those in the input dataset depending on the residue frequencies of the local alignments from which the profile was derived and how many (optional) iterations were performed.
Mode0:
Use this if you don't know what pseudocounts are, or just want a quick, "first guess" motif-finding without having to worry about what parameters to use.
The reason the pseudocounts are multiplied with the background frequency of the "other" profile (the "j" profile) is that, in practice, this makes the motif results a lot more stable in regards to the pseudocounts setting.
Using this approach, changing the "-pseudocount" value from 1e-2 to 1e-10, for example, resulted in only minor changes in the motifs returned for most of my test-sets.
Not using this approach (for example adding the same amount of pseudocounts to each residue) made the results quite sensitive to the pseudocount setting and the "optimal" pseudocount value varied greatly across the test-sets and motifs.
Mode1:
Here the same amount of pseudocounts is used for each residue. This gives the best results when the right amount of pseudocounts are used, but it takes some more fiddling to find the best setting. This mode is quite sensitive to the amount of pseudocounts used (but values between 1e-3 to 1e-5 should give reasonable results in most cases).
Mode2:
In Mode2 Mclip does not distinguish between gaps and alignable characters in the profiles. This makes the alignment procedure quite a bit slower (see below: gapping) but provides the statistically most "correct" way of the three of dealing with gaps in the profile-profile alignments.
The profile-profile alignment can then proceed by standard smith-waterman, using the following as "match score":
"Pikx=Frequency of residue x at position k in profile i"
"Sikx=Score value for residue x at position k in score_profile i"
"Pjlx=Frequency of residue x at position l in profile j"
"Sjlx=Score value for residue x at position l in profile j"
"M(k,l)=Match score for aligning column k of profile i with column l of profile j"
M(k,l)=(&sum(Pikx*Sjlx+Sikx*Pjlx))/2;
Gapping:
Gapping of an alignment in Mode0 or Mode1 is done by using the gap-score of the alternate profile as gap-open/extend penalty (i.e. if "profile i" is assigned a gap, the gap-score of "profile j" in the current column is added).
As my gap-scores are always negative (i.e. penalizing) the score profile provides a position specific gap-penalty. Profile columns containing many gaps have a lower gap-penalty that those containing less.
The only gapping restriction is that a gap in one profile may not occur immediately after a gap in the other profile.
While this is an unusual way of dealing with gaps (i.e. it does not differentiate between gaps and insertions or gap-open and gap-extention, not is it quite statistically "correct") it makes the alignment considerably faster and simpler.
We do not expect this simplification to have a great influence on the results as we expect to find neither many nor long stretches of gaps in motifs.
For people uncomfortable with this way of treating gaps there is the Mode2 option. This will treat gap-characters like any other character when calculating the score profiles. Gapping in a profile-profile alignment is then performed by hypothetically inserting a column consisting only of gaps into one of the profiles (seting Nik=Nik+1 for that column). Scores are calculated for this colums as specified above, again treating gaps the same as any other character. The penalty for inserting a gap in that profile at that position is the "match" score of this "gap"-column to the other profile (see above). Unfortunately this approach requires three times as many multiplications during the pairwise alignments, making motif detection quite a bit slower. It also has the side effect of making the alignment procedure a lot more sensitive to the -pseudocounts setting. In general, if you plan to use Mode2 you will also want to reduce the amount of pseudocounts you add.
Note: only Mode1 is available via the web-interface.
All this allows the local alignments to be generated using a standard smith-waterman approach. In addition, the sequence signal in the profiles is not watered down by pseudocounts as these are only added to the score profiles used in the alignment. The original profiles are then combined to motifs according to the alignment traces. Once more: Pseudocounts are only included during the local alignment of profiles; the final motifs reflect only the frequencies of the aligned profiles and NOT the pseudocounts that were added!.
Due to the pseudocounts it is possible that two profiles of low residue count will appear to align quite well, but the resulting agglomerate motif will not contain any conservation pattern (as the alignment was due to the pseudocounts). These motifs will not produce any good hits to the original sequences and will be discarded.
5. Clique detection
Clique detection is an iterative approach.
1. A trace is selected at random and used to "seed" a set ("traces" contain the information about the local alignments of a profile pair).
2. All other traces sharing at least one aligned residue with that trace are added to the set. All traces sharing residues with those traces are also added, etc. until no new traces are found (i.e. single linkage clustering). This is to reduce the complexity of the problem a bit by disregarding traces that have nothing in common.
3. The set is then examined for cliques: If "profile A" aligns certain residues with "profile B", and "profile B" aligns these same residues with "profile C", then "profile A" has to align the same residues with "profile C" for them to be in a clique. (repeat the process to add more sequences to the clique).
Mclip determines whether all sequences in the single-linkage set belong to the same clique by counting the number of overlapping alignments. If not all sequences overlap in their alignment with all others, the set does not consist of a single clique. Mclip then removes those alignment traces with the highest number of overlaps (most likely the ones bridging two or more cliques). From the leftover traces a new one is selected at random and steps 2 & 3 are repeated until the single-linkage set has been whittled down to a single clique.
This process is repeated until all traces have either been "removed" or are part of a clique.
4. A motif is then derived from each clique (and the regions in the "removed" traces aligning to the clique).
Motifs look just like profiles and give the residue frequencies at the various motif positions as well as the number of residues these frequencies were derived from.
Position
A
C
G
T
-
Number of residues
0
0.2
0.8
0
0
0
100
1
0.05
0
0.9
0
0.05
95
2
1
0
0
0
0
111
3
0
0
0
1
0
115
4
0
0
0.8
0.2
0
115
5
0
1
0
0
0
115
6
0.1
0.1
0.1
0.1
0.6
115
7
1
0
0
0
0
115
8
0.1
0
0
0.9
0
115
9
0
0
0.9
0.1
0.1
110
6. Motif-sequence alignment
This is once again fairly straight forward and corresponds to a simple sequence-profile alignment (many implementations exist).
Just a short explanation on how the P-values (and subsequently the E-values) are calculated.
1. Get the background frequencies for the residues for each sequence.
2. For each position in the motif, get the average score and the standard deviation of aligning the background frequencies of a given sequence to that position.
3. Based on the summed averages and standard-deviations we can define a bell-curve which provides an estimate of what alignment scores are how likely to be generated by chance when aligning that sequence to the current motif.
4. Based on the Z-score ((alignment_score-avg_score)/score_stdev)) it is possible to calculate an approximate P-value.
5. The E-value is then calculated as P-value*number of comparisons performed (i.e. P-value * number_of_residues_in_sequences) (This overestimates the number of comparisons performed and biases the results towards "conservative" estimates, but, based on eyeballing the output, does not seem to have a negative effect on the results).
NOTE: At no point do I estimate how probable it is for me to get a certain motif from the given set of sequences. That is way beyond my statistical abilities. Instead, I generate a list of possible motifs using the approach described above and then estimate how likely each sequence is to match each of these motifs (A question for which it is far easier to get approximate statistical estimates).
The example motif below shows that the distribution of scores generated by aligning random sequences to the motif resembles a gaussian (the distribution was generated from 10.000 random sequences with the same nucleotide frequency as the sequence to align)
Z-score= how far away form the average, in number of standard deviations, is my alignment score.
P-value= How likely am I to get an alignment score greater or equal to the observed score from the given score distribution. (corresponds to the surface area of the little blue bit on the right).
Overview of what Z-score corresponds to what P-value (and a few other measures. http://en.wikipedia.org/wiki/Image:Normal_distribution_and_scales.gif").
As expected, the overwhelming majority of putative motifs found by this program are ungapped.
To show that detection of gapped putative motifs is possible, I appended some examples that I came across during my analyses.
(Capital letters correspond to match-states, lower-case letters correspond to insert-states, dots are end-gaps)
This program should work on all machines for which a java1.5 or better runtime environment is available (i.e. Mac, Windows, Linux/Unix) (http://java.sun.com)
And, as a short help to those people who don't regularly use java programs via the terminal (double clicking on it will NOT work as the program needs input parameters):
On a Windows machine the terminal can be found under "Start->All Programs->Accessories->Command prompt"
On a Mac the terminal can be found under "Applications->Utilities->Terminal.app".
I trust everyone working under Linux/Unix to know where to find a terminal :-).
Assuming you have downloaded Mclip to a directory on your desktop and placed a set of sequences you want to analyze there as well:
Once you are in the terminal, use the "cd" command to go to the directory containing Mclip (i.e. "cd C:\Documents and Settings\my_username\Desktop\mclip_directory\" on a computer running windows). You can then start the program using the command "java -jar mclip.jar -i input_file_name". If you have many sequences you want to compare and run into "OutOfMemoryError's" you have two options. Either allow java to use more than the default 70MB (or so, varies across architectures) or use the -lowmem option in mclip. To allow java to use more RAM, use the -Xmx parameter (i.e. "java -Xmx500m -jar mclip.jar" will allow mclip to use up to 500 MB of RAM). The -lowmem setting in Mclip (i.e. "java -jar mclip.jar -lowmem t") trades execution speed for memory efficiency. Mclip will use less memory but take about twice as long to finish.
I hope you enjoy using the program and it proves useful. For any comment, must-have or nice-to-have feature requests, email Tancred.Frickey at rsbs.anu.edu.au.