This page provides information about the homologs of the Affymetrix exemplar sequences on the various plant Affymetrix microarray chips. I was quite unhappy with the information provided by Affymetrix, so I decided to create a small web-interface that would let me look up the closest homologs for every probe set a bit more easily, and help me decide how much to trust the annotations provided.
The program used to generate all of this data was PhyloGenie (Frickey T., Lupas A. Nucleic Acids Research 2004 32(17):5231-5238).
To be able to use PhyloGenie to calculate all of this data, a few preparatory steps were necessary:
-The exemplar sequences had to be translated in 6 frames (PhyloGenie requires protein sequences).
-The reading frame with the longest ORF was assumed to be the coding frame, and the frame was appended to the sequence name (e.g. *_+1 for the +1 reading frame). NOTE: This works well for the overwhelming majority of cases, but not for all. RNA genes, for example, have no coding frame. In such cases DO NOT TRUST the trees: an incorrect amino acid translation will most likely disrupt the alignment procedure and produce bad trees. This applies only to a minority of cases and is stated here for completeness.
-A set of databases had to be generated to search against. As we wanted to be able to look up the closest homologs present on Affymetrix microarray chips for other plants, we generated files containing the 6-frame translations of the exemplar sequences for the various plant chips. The NCBI nonredundant protein database "nr" was used to provide outgroups and flesh out the phylogenies, so as to avoid problems arising from sparse sequence data and to simplify determining the degree of homology of two probes.
Databases used:
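Returning to the first two preparatory steps listed above (6-frame translation and picking the frame with the longest ORF), the idea can be sketched roughly as follows. This is only an illustration and not the script that was actually used: the function names are made up, the standard genetic code is assumed, and "longest ORF" is taken here to mean the longest stretch without a stop codon (whether a start codon was required is not stated on this page).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# standard code applies to all exemplar sequences).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Reverse complement of a DNA string (characters other than ACGT are kept)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; codons with unknown bases become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return a dict mapping '+1', '+2', '+3', '-1', '-2', '-3' to translations."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def longest_orf_frame(dna):
    """Return (frame, peptide) for the longest stop-free stretch in any frame."""
    best = ("", "")
    for frame, protein in six_frames(dna).items():
        for segment in protein.split("*"):
            if len(segment) > len(best[1]):
                best = (frame, segment)
    return best


def framed_name(name, dna):
    """Append the chosen frame to the sequence name, e.g. 'name_+1'."""
    frame, _ = longest_orf_frame(dna)
    return f"{name}_{frame}"
```

Note that on short or non-coding sequences the longest stop-free stretch can easily fall in the wrong frame or on the wrong strand, which is exactly the situation the NOTE above warns about.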
PhyloGenie performs:
-BLAST search to detect putative homologs (i.e. sequences of high local similarity to the query).
-Extraction and multiple sequence alignment (MSA) of the HSPs. This is needed because BLAST only aligns each HSP to the query sequence and does not take into account similarities between the HSPs, which can lead to excessive gapping and/or misalignment of residues (see the PhyloGenie paper for examples).
-Deriving a Hidden-Markov-Model (HMM) from the MSA (makes further searches and realignments easier and faster).
-Re-searching with the HMM (will find more homologs, give a better estimate of the E-values and better define the actual start and end of the homologous region).
-Aligning all HMM-search hits to the HMM, which converts the search results back to a multiple sequence alignment. In this case no re-alignment of the results is needed: the HMM already contains the sequence information of a large number of sequences, so excessive gapping and misalignment are not much of an issue.
-Inferring a phylogenetic tree (find a likely path of how this set of sequences may have arisen from a common ancestor).
Each of the individual steps creates a file that can be viewed via this interface: *.bls for the BLAST results, *.cln for the ClustalW multiple sequence alignment of the HSPs, *.hmm for the derived HMM (HMMER), *.hms for the HMM-search results, *.hln for the final HMM-based alignment and *.tre for the Newick-format Neighbor-Joining phylogeny. For most users the *.hln alignment and/or the phylogenetic tree will be of most interest. However, as with all fully automated methods, things can go wrong, so we provide the other files as a means of double-checking the results rather than having to trust us blindly.
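If you download the files and want to sort them by pipeline step in a script, the extensions listed above can be kept in a small lookup table. The sketch below is only a convenience: the helper name and the example file name are made up, and the descriptions simply restate what is said on this page.

```python
from pathlib import Path

# Extension -> content, as described on this page.
OUTPUT_FILES = {
    ".bls": "BLAST results (HTML)",
    ".cln": "ClustalW multiple sequence alignment of the HSPs (aligned FASTA)",
    ".hmm": "HMM derived from the alignment (HMMER)",
    ".hms": "HMM-search results (HMMER output)",
    ".hln": "final HMM-based alignment (aligned FASTA)",
    ".tre": "Neighbor-Joining phylogeny (Newick)",
}


def describe(filename):
    """Return a short description of a result file based on its extension."""
    return OUTPUT_FILES.get(Path(filename).suffix.lower(), "unknown file type")


# Hypothetical file name, used for illustration only.
print(describe("261271_at_+1.tre"))
```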
The BLAST results are in HTML format and can be viewed in any web-browser.
All alignments are in aligned-fasta format and can be viewed in anything capable of displaying text and in most alignment viewers/editors. I recommend ALNEdit (alternative download HERE), but I doubt I am being impartial :-)
The HMM-search results are in the HMMER output format and can be viewed with anything capable of displaying text.
The phylogenetic trees are in New-Hampshire-Bracket-Format (Newick), a format most tree viewers support (I recommend JTreeview (alternative download HERE) as I know it can read these files and has a nice range of tree-analysis features).
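If you only need the names of the sequences in a tree (for example to feed them into another script), a full tree library is not strictly necessary. The following is a minimal sketch, assuming plain unquoted labels; it is not a complete Newick parser, and the function name is made up. Leaf names are the labels that directly follow "(" or ",", whereas labels following ")" belong to internal nodes and are skipped.

```python
import re

# A leaf label directly follows "(" or "," and runs until the next
# structural character. Quoted labels are NOT handled by this sketch.
LEAF_PATTERN = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_names(newick):
    """Return the leaf labels of a Newick string in order of appearance."""
    return LEAF_PATTERN.findall(newick)


# Toy tree for illustration only (not taken from the repository).
example = "((seqA:0.1,seqB:0.2)90:0.05,seqC:0.3);"
print(leaf_names(example))
```

For anything beyond listing leaves (rerooting, distances, subtrees), a proper tree viewer or library is the better choice.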
All files are available for download HERE
Complex Queries:
While this sort of repository is quite useful for looking up the homologs of individual genes, it is not very user-friendly with regard to the larger questions that might be asked of such a dataset, such as:
-Since Glycine max is a tetraploid organism, which of the trees reflect the 2:1 relationship in the number of genes expected when comparing tetraploid and diploid organisms, and which do not?
-For which of my exemplar sequences can I not find homologs in plants/legumes/other taxonomic groups?
and many many more...
To make it possible to ask these kinds of questions as well, we set up a web-interface to PHAT (Phylome Analysis Tool (part of the PhyloGenie package)) which allows complex queries to be formed. A list of all exemplars with trees that matched the query is returned.
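To give a feel for the kind of test such a query applies to each tree, here is a rough sketch of the 2:1 check mentioned above. It is not how PHAT is implemented and does not use PHAT's query syntax. It assumes you already have the leaf names of a tree, and that the species can be read from a prefix such as "Gma|..." (a hypothetical naming scheme chosen purely for this example).

```python
from collections import Counter


def species_from_prefix(leaf):
    """Hypothetical naming scheme: 'Species|identifier'."""
    return leaf.split("|", 1)[0]


def species_counts(leaves, species_of=species_from_prefix):
    """Count how many leaves of a tree belong to each species."""
    return Counter(species_of(leaf) for leaf in leaves)


def shows_ratio(counts, polyploid, diploid, ratio=2):
    """True if the tree has exactly `ratio` polyploid leaves per diploid leaf."""
    return counts[diploid] > 0 and counts[polyploid] == ratio * counts[diploid]


# Toy leaf list for illustration only.
leaves = ["Gma|probe1", "Gma|probe2", "Ath|probe3", "Les|probe4"]
counts = species_counts(leaves)
print(shows_ratio(counts, polyploid="Gma", diploid="Ath"))
```

A plain count like this ignores where the sequences sit in the tree, so it is a first filter at best; trees that pass still need to be inspected.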
Automated assignment of orthologs:
While it is very helpful to have phylogenetic trees to base assignment of orthology or paralogy on, it remains cumbersome to actually have to look at the trees one by one to decide whether or not two sequences really are orthologous.
In addition, trees are difficult to parse, pass on to other programs, or transform into spreadsheet-compatible data. To simplify this a bit, we provide a way to look for putative orthologs for a given species.
The user selects the chip on which to base the predictions (for example the Arabidopsis ATH1-121501 chip), and specifies one or multiple species for which the orthologous genes are to be found (for example Lycopersicon esculentum).
The output is a tab-delimited list containing the following entries:
Query: The sequence for which the tree was generated (e.g. 261271_at [Arabidopsis ATH1-121501]).
Paralogs: Other sequences from the query species (e.g. Arabidopsis) that are more closely related to the query than the closest homolog from the species we want to find (e.g. Lycopersicon).
Orthologs: The most closely related sequence(s) from the species we are looking for (e.g. Lycopersicon).
Query-homologs: Other sequences from the query species that are present in the tree.
Find-species-homologs: Other sequences from the species we want to find that are present in the tree.
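Since the output is tab-delimited, it can be loaded into a spreadsheet directly or read with a few lines of code. The sketch below is one way to do the latter, under assumptions that are not stated on this page: there is no header row, the five columns appear in the order listed above, and multiple entries within one column are separated by ";" (pass a different item_sep if the real output differs). The function name and the placeholder identifiers in the example are made up.

```python
import csv
import io

# Column order as listed above (assumed to match the output file).
COLUMNS = ["query", "paralogs", "orthologs",
           "query_homologs", "find_species_homologs"]


def read_ortholog_table(handle, item_sep=";"):
    """Read the tab-delimited output into a list of dicts, one per query."""
    rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Pad short lines so that missing trailing columns become empty.
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        row = {"query": fields[0]}
        for name, value in zip(COLUMNS[1:], fields[1:]):
            row[name] = [item.strip() for item in value.split(item_sep)
                         if item.strip()]
        rows.append(row)
    return rows


# Toy input; everything except the query name is a placeholder.
example = io.StringIO(
    "261271_at [Arabidopsis ATH1-121501]\tparalog_1;paralog_2\tortholog_1\t\thomolog_1\n"
)
rows = read_ortholog_table(example)
print(rows[0]["orthologs"])
```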
For more details on how we determine orthologs, go HERE
Any questions, comments, bug-reports, must-have or nice-to-have feature requests are welcome at Tancred.Frickey at rsbs.anu.edu.au