The Australian National University
Research School of Biological Sciences
ANU COLLEGE OF SCIENCE
  
    
PhyloGenie
Tancred Frickey(2) and Andrei Lupas(1)
Originally developed at (1) the Max-Planck-Institute for Developmental Biology, Tuebingen
(2) Research School of Biological Sciences
Australian National University, Canberra

PhyloGenie can be downloaded HERE.

Basics:
PhyloGenie is a Perl script that combines the steps needed to produce a phylome. BLAST or PSIBLAST searches are performed for all FASTA-format sequences in an input file; the HSPs that meet user-defined selection criteria (E-value, coverage, score per column, identity) are extracted and used as the basis for a multiple sequence alignment. Phylogenies can then be inferred using the supplied standard NJ (neighbor-joining) method or any other program that accepts aligned FASTA sequences as input and generates Newick-format trees. Because analyzing phylogenies is far more complex than analyzing BLAST results, we also provide a tool that filters a set of phylogenetic trees for those matching user-defined topological selection criteria.
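The HSP selection step can be illustrated with a minimal Python sketch. This is not code from the PhyloGenie package: the `Hsp` record, the `passes` and `select_hsps` functions, the threshold defaults, and the exact definitions of coverage, score per column and identity are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """Hypothetical record holding the values needed for filtering."""
    subject_id: str
    evalue: float
    score: float       # alignment score (e.g. bit score)
    identities: int    # number of identical columns
    align_len: int     # number of alignment columns, gaps included
    query_start: int   # 1-based, inclusive
    query_end: int     # 1-based, inclusive


def passes(hsp, query_len, max_evalue=1e-5, min_coverage=0.5,
           min_score_per_col=0.5, min_identity=0.25):
    """Return True if the HSP meets all four (assumed) selection criteria."""
    coverage = (hsp.query_end - hsp.query_start + 1) / query_len
    score_per_col = hsp.score / hsp.align_len
    identity = hsp.identities / hsp.align_len
    return (hsp.evalue <= max_evalue
            and coverage >= min_coverage
            and score_per_col >= min_score_per_col
            and identity >= min_identity)


def select_hsps(hsps, query_len, **criteria):
    """Keep only the HSPs that pass every criterion."""
    return [h for h in hsps if passes(h, query_len, **criteria)]


hsps = [
    Hsp("a", 1e-20, 150.0, 80, 100, 1, 100),  # passes all criteria
    Hsp("b", 1e-2, 30.0, 20, 40, 10, 49),     # E-value too high
    Hsp("c", 1e-10, 90.0, 10, 100, 1, 100),   # identity too low
]
selected = select_hsps(hsps, query_len=120)
print([h.subject_id for h in selected])
```

Requiring all criteria at once is only one possible design; the criteria could equally be applied independently or in sequence.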

Introduction:
The number of sequences being generated by genome projects far exceeds our ability to manually assign them meaningful biological annotation. Automated methods are essential to analyze the flood of "unknown" or "hypothetical" sequences in a reasonable timeframe. A frequent assumption facilitating automated analysis is that sequences have the same function as their closest relative. Using best BLAST hits to find these close relatives may be a viable option in many cases. However, it has been shown that best BLAST hits are not necessarily the closest sequence relatives (Koski & Golding 2001), which casts doubt on the reliability of this approach. Calculating a phylogeny for each sequence and using the trees to find the closest relatives is a more robust, if computationally intensive, approach.
A good alignment is the basis for a good phylogeny, as misaligned regions, wrong gapping or an unfortunate selection of sequence representatives can lead to erroneous trees. Sequence selection and alignment therefore seem to us the most critical steps on the path from starting sequence to phylogeny. When producing alignments it is necessary to decide between aligning full-length sequences and aligning only conserved regions for which sequence similarity, presumably due to shared descent, is unambiguously determinable. Using conserved regions only may cause remotely related but still alignable regions to be missed, but it also greatly reduces the likelihood of aligning nonhomologous regions, as may be found in multiple-domain, fused or circularly permuted proteins.
BLAST and PSIBLAST are currently the methods of choice for local sequence similarity detection, as these programs are fast, reliable and sensitive. Unfortunately, the alignments they generate are subject to excessive and inconsistent gapping, caused by converting pairwise, local alignments to multiple sequence alignments.
We have implemented a few post-processing steps that alleviate the above-mentioned problems. First, full-length sequences for all HSPs (High-scoring Segment Pairs) are extracted. Then the HSPs with E-values better than a specified cutoff are converted to a multiple alignment. To remedy the excessive and inconsistent gapping that this step generates, the gapped regions in the resulting alignment are realigned using a global alignment program (here clustalw).
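To make the gapping problem concrete, the following Python sketch stacks pairwise HSPs into a query-anchored multiple alignment. It is an illustration under stated assumptions, not PhyloGenie's own conversion code: the function name `stack_hsps` and the input format (0-based query offset, gapped query string, gapped subject string) are hypothetical.

```python
def stack_hsps(query, hsps):
    """Stack pairwise HSPs into a query-anchored multiple alignment.

    query: ungapped query sequence.
    hsps:  list of (q_start, q_aln, s_aln), where q_start is the 0-based
           query offset of the HSP and q_aln / s_aln are the gapped
           query and subject strings of equal length.
    Returns a list of equal-length rows, the query row first.
    """
    n = len(query)
    max_ins = [0] * (n + 1)  # widest insertion before each query position
    parsed = []
    for q_start, q_aln, s_aln in hsps:
        match, ins = {}, {}
        pos = q_start
        for qc, sc in zip(q_aln, s_aln):
            if qc == "-":              # subject residue inserted before pos
                ins[pos] = ins.get(pos, "") + sc
            else:                      # column aligned to query position pos
                match[pos] = sc
                pos += 1
        for p, residues in ins.items():
            max_ins[p] = max(max_ins[p], len(residues))
        parsed.append((match, ins))

    def row(match, ins):
        out = []
        for p in range(n + 1):
            # Insertions are only left-justified, not aligned to each other.
            out.append(ins.get(p, "").ljust(max_ins[p], "-"))
            if p < n:
                out.append(match.get(p, "-"))
        return "".join(out)

    query_row = row(dict(enumerate(query)), {})
    return [query_row] + [row(m, i) for m, i in parsed]


rows = stack_hsps("ACDEF", [(0, "AC-DEF", "ACXDEF"),
                            (1, "CD--EF", "CDYYEF")])
for r in rows:
    print(r)
# AC-D--EF
# ACXD--EF
# -C-DYYEF
```

Every insertion in any one HSP opens gap columns in all other rows, and inserted residues from different HSPs are never compared with each other. Regions like these are the ones a global alignment program would then be asked to realign.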
To increase sensitivity and better define alignable sequence regions, a profile Hidden Markov Model (HMM) search against the full-length sequences extracted in the first step can be performed. Generating an HMM is beneficial for two reasons: (1) assigning insert states to highly variable regions removes most of the edge-creep effect present in PSIBLAST searches; (2) the increased sensitivity of profile search methods can recover sequence regions that a simple BLAST search missed.
Once all alignments are generated, phylogenies can be inferred by any tree construction program that produces Newick-format (New Hampshire bracket format) trees.
A large repository of phylogenetic trees is mostly useless unless there is a way to separate the data relevant to the question at hand from the irrelevant. We provide a tool that reduces the number of trees that have to be examined manually by extracting from the database all 'interesting' trees, i.e. all those containing specific topological features (see the README for PHAT (Phylome Analysis Tool) included in the PhyloGenie package).
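As a rough idea of what a topological selection criterion might look like, the Python sketch below keeps only those trees in which the sister group of the seed sequence consists entirely of leaves from a chosen set. It does not reproduce PHAT's actual query syntax or algorithm (see its README for that). The tree representation (rooted trees as nested tuples of leaf names), the function names and the `arch_`/`bact_` naming scheme are assumptions made for the example.

```python
def leaves(node):
    """Return all leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def sister_leaves(node, seed):
    """Return the leaves of the seed's sister group(s), or None if the
    seed is not found below this node."""
    if isinstance(node, str):
        return None
    for i, child in enumerate(node):
        if child == seed:
            return [leaf
                    for j, sibling in enumerate(node) if j != i
                    for leaf in leaves(sibling)]
    for child in node:
        found = sister_leaves(child, seed)
        if found is not None:
            return found
    return None


def keep_tree(tree, seed, is_target):
    """True if every leaf in the seed's sister group satisfies is_target."""
    sisters = sister_leaves(tree, seed)
    return bool(sisters) and all(is_target(leaf) for leaf in sisters)


trees = {
    "t1": (("seed", ("arch_1", "arch_2")), "bact_1"),
    "t2": (("seed", "bact_2"), "arch_3"),
}


def is_archaeal(name):
    # Hypothetical naming convention used only in this example.
    return name.startswith("arch_")


kept = [name for name, tree in trees.items()
        if keep_tree(tree, "seed", is_archaeal)]
print(kept)
```

Note that the result depends on where the tree is rooted, since the sister group of a leaf is only defined once the tree has a direction.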
Closely related to the analysis of phylogenies is the problem of rooting them in an automated manner. Because unrooted trees lack directionality, their analysis is far more complex than that of rooted trees. However, correctly rooting phylogenies is a nontrivial problem. We also present a rooting approach that guarantees correct directionality for at least the branch containing the seed sequence (i.e. the sequence for which the tree was computed).