The Australian National University
Research School of Biological Sciences
ANU COLLEGE OF SCIENCE
  
    

SSalign
Developed at the Research School for Biological Sciences,
Australian National University, Canberra, Australia.
Download

This program was developed to examine all possible stem-loop structures that a large number of DNA/RNA sequences could form. Aligning each sequence with its reverse using available programs proved too slow and quite problematic, because extensive post-processing of the results was required to remove redundant alignments and to find (and clean up) "impossible" alignments.
Example "impossible" alignment:
Sequence:

AAAAACGCTTTTT
If this sequence is aligned with its reverse using a standard Smith-Waterman approach, the following alignment is generated:
F:AAAAACGC-TTTTT
R:TTTTT-CGCAAAAA
However, this alignment is "impossible": we are trying to predict stem-loop structures formed within ONE sequence, yet this alignment uses the same residues more than once and at different positions in the alignment (in this case, every residue would have to be present twice to form the alignment).
A correct alignment would be:
AAAAA
TTTTT
(and a 3-residue loop formed by CGC)
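
One way to detect such "impossible" alignments is to map every aligned column back to positions in the original sequence and check that each pair joins an earlier residue to a later one. The sketch below is an illustration only, not SSalign's code: the function names and the i < j test are assumptions introduced here, and the row offsets default to the start of each sequence.

```python
def paired_positions(fwd_row, rev_row, seq_len, fwd_start=0, rev_start=0):
    """Return (i, j) pairs of 0-based positions in the ORIGINAL sequence.

    fwd_row and rev_row are the two rows of an alignment between a sequence
    and its reverse ('-' marks a gap). fwd_start and rev_start are the offsets
    of the first residue of each row within the forward and reversed sequence.
    """
    i, k = fwd_start, rev_start
    pairs = []
    for f, r in zip(fwd_row, rev_row):
        if f != "-" and r != "-":
            # Position k of the reversed sequence is position
            # seq_len - 1 - k of the original sequence.
            pairs.append((i, seq_len - 1 - k))
        if f != "-":
            i += 1
        if r != "-":
            k += 1
    return pairs


def is_possible_stem(pairs):
    """Assumed criterion: every pair joins a residue to a LATER residue."""
    return all(i < j for i, j in pairs)


seq = "AAAAACGCTTTTT"
bad = paired_positions("AAAAACGC-TTTTT", "TTTTT-CGCAAAAA", len(seq))
good = paired_positions("AAAAA", "TTTTT", len(seq))
print(is_possible_stem(bad))   # False
print(is_possible_stem(good))  # True
```

In the first alignment the pairs (6, 7) and (7, 6) both appear, and the five A-T pairs are listed once from each side, which is exactly the double use of residues described above.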

I decided it would probably be easier and faster to write a DNA/RNA alignment program of my own, specifically geared towards aligning a sequence with its reverse. This program takes into account that each sequence is being aligned to itself, generates only the alignments we are interested in, and thereby obviates the need for complex post-processing of the results.
During the course of testing, we came across a number of alignments where the Smith-Waterman alignments generated by "mlocals" (part of the USC seqaln package) differed from those my program generated. The reason was that mlocals filled the alignment matrix starting at 0/0 and going to N/N, while SSalign filled the matrix in the inverse manner, starting at N/N and going to 0/0.
In the majority of cases this produces the same results, but in a few cases the ends of the alignments differed (see below for an explanation why). To resolve this inconsistency, SSalign was changed so that it produces consistent alignments, irrespective of whether a sequence or its reverse is being aligned.

This shows a region of an alignment matrix filled from 0/0 to N/N.
The red circle shows where the alignment trace is started; green circles show alternative starting points of equal score. The grey lines show the path(s) that would be traced in a normal Smith-Waterman alignment. The light-blue/grey lines show the alignable region that is missed using this approach, yet is alignable when the matrix is filled from the other end.

Alignment trace 1:
TATATATGA
ATATAT-CT

Alignment trace 2:
ATATAT
TATATA
This shows the same region as above in an alignment matrix filled from N/N to 0/0 (equivalent to using the reverse of a sequence).
The red circle shows where the alignment trace is started; green circles show alternative starting points of equal score. The grey lines show the path(s) that would be traced in a normal Smith-Waterman alignment. The light-blue/grey lines show the alignable region that is missed using this approach, yet is alignable when the matrix is filled from the other end.

Alignment trace 1:
AC-TATATA
TGTATATAT

Alignment trace 2:
ATATAT
TATATA
In both cases the problem is that the alignment trace stops as soon as a "zero" score is reached. This causes "neutral" stretches of alignment (i.e. stretches that align residues but do not increase the overall score) to be included or excluded depending on which side of the alignable region they lie on.
There are two ways to make the alignment strategy consistent. The first is to start at one of the alternative (green) starting points, selecting the shortest alignment that produces the same, maximum, score. The second is to allow the alignment to traverse regions of "zero" score. This can be achieved by starting new alignment traces only for negative values, so that match and gap-open/gap-extend states may produce scores of "0" and the backtrace procedure can include "neutral" alignment regions.
Both approaches have been implemented in SSalign: the first as the short-tracing mode and the second as the long-tracing mode. Either mode can be used.
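
The second approach can be illustrated with a minimal sketch. It is not SSalign's implementation: it aligns two given strings rather than a sequence with its reverse, it replaces the gap-open/gap-extend scores with a single linear gap cost of -5 (an assumption made to keep the code short), and the names `local_align` and `through_zero` are hypothetical. With `through_zero=False` a cell whose best score is 0 always starts a new trace, as in a normal Smith-Waterman fill. With `through_zero=True` a new trace is started only when the best score is negative, so the backtrace can pass through "neutral" regions.

```python
# Default pair scores (T stands for T/U); every other pair scores -3.
PAIRS = {("A", "T"): 2, ("T", "A"): 2, ("C", "G"): 3, ("G", "C"): 3,
         ("T", "G"): 1, ("G", "T"): 1}


def local_align(a, b, gap=-5, through_zero=False):
    """Local alignment with pointer-based traceback.

    Returns (score, top_row, bottom_row). ASSUMPTION: linear gap cost.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]      # scores
    P = [[None] * (m + 1) for _ in range(n + 1)]   # None = trace ends here
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [
                (H[i - 1][j - 1] + PAIRS.get((a[i - 1], b[j - 1]), -3), "D"),
                (H[i - 1][j] + gap, "U"),
                (H[i][j - 1] + gap, "L"),
            ]
            val, ptr = max(cands, key=lambda c: c[0])  # first maximum wins
            # Start a new trace for negative values; a value of exactly 0
            # also restarts unless zero-score regions may be traversed.
            if val < 0 or (val == 0 and not through_zero):
                val, ptr = 0, None
            H[i][j], P[i][j] = val, ptr
            if val > best[0]:
                best = (val, i, j)

    score, i, j = best
    top, bottom = [], []
    while P[i][j] is not None:
        move = P[i][j]
        if move == "D":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move == "U":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("ACTATATA", "TGTATATAT"))
# (12, 'TATATA', 'ATATAT')
print(local_align("ACTATATA", "TGTATATAT", through_zero=True))
# (12, 'AC-TATATA', 'TGTATATAT')
```

Both calls return the same score of 12. The only difference is the neutral stretch at the far end of the trace (A-T and C-G matches worth 5, cancelled by one gap of -5), which is dropped in the first call and kept in the second.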

(long tracing):
SSalignment trace 1:
AC-TATATATGA
TGTATATAT-CT

SSalignment trace 2:
ATATAT
TATATA

(short tracing):
SSalignment trace 1:
TATATA
ATATAT

SSalignment trace 2:
ATATAT
TATATA

Program Options:
Option  Default         Description
-i      standard input  Name of the file to read the sequence data from. Leave unspecified to read from the standard input.
-aa     -3              The score for a match of A with A
-at     2               The score for a match of A with T/U (or T/U with A)
-ac     -3              The score for a match of A with C (or C with A)
-ag     -3              The score for a match of A with G (or G with A)
-tt     -3              The score for a match of T/U with T/U
-tc     -3              The score for a match of T/U with C (or C with T/U)
-tg     1               The score for a match of T/U with G (or G with T/U)
-cc     -3              The score for a match of C with C
-cg     3               The score for a match of C with G (or G with C)
-gg     -3              The score for a match of G with G
-go     -5              Gap-open score
-ge     -1              Gap-extend score
-min    8               Minimum score a local alignment has to achieve to be returned
-v      1               Verbose (1) or non-verbose (0) output
-mode   0               The mode of tracing to perform: 0=long-tracing, 1=short-tracing
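
To see how the default scores combine, the following sketch scores a finished alignment by hand. It is an illustration, not part of SSalign: the names are hypothetical, only A, C, G and T/U are handled, and it assumes that a gap of length k costs one gap-open score plus (k - 1) gap-extend scores. That assumption is consistent with the "neutral" stretches in the examples above, where a single gap (-5) cancels a C-G match (3) and an A-T match (2).

```python
# Default pair scores from the options table; T stands for T/U.
PAIR_SCORES = {
    "AA": -3, "AT": 2, "AC": -3, "AG": -3,
    "TT": -3, "TC": -3, "TG": 1,
    "CC": -3, "CG": 3, "GG": -3,
}
GAP_OPEN = -5    # -go
GAP_EXTEND = -1  # -ge
MIN_SCORE = 8    # -min


def pair_score(x, y):
    """Look up a symmetric pair score, treating U as T."""
    x = x.upper().replace("U", "T")
    y = y.upper().replace("U", "T")
    return PAIR_SCORES.get(x + y, PAIR_SCORES.get(y + x))


def alignment_score(top, bottom):
    """Score two equal-length alignment rows ('-' marks a gap).

    ASSUMPTION: a gap of length k costs GAP_OPEN + (k - 1) * GAP_EXTEND.
    Consecutive gap columns count as one gap, whichever row they are in.
    """
    total = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += GAP_EXTEND if in_gap else GAP_OPEN
            in_gap = True
        else:
            total += pair_score(x, y)
            in_gap = False
    return total


print(alignment_score("AAAAA", "TTTTT"))                # 10
print(alignment_score("TGA", "-CT"))                    # 0
print(alignment_score("AC-TATATATGA", "TGTATATAT-CT"))  # 12
print(alignment_score("ATATAT", "TATATA") >= MIN_SCORE) # True
```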

To run the program try: java [-Xmx???m] -jar ssalign.jar [options].