Retrieval, alignment and analysis  of 16S rDNA sequences using  the GreenGenes and Ribosomal Database Project (RDP-II) web servers

The problem

Suppose that you have just assembled the sequence reads that you obtained from your new bacterial isolates (or clone libraries). First of all you would like to know to what organisms or sequences yours are related to. Secondly, we would like to retrieve some reference sequences that are closely related to our new ones and perform a global multiple sequence alignment with that dataset. Thirdly, we would like to obtain a phylogeny using a powerful inference method based on the previously obtained multiple sequence alignment.

Standard, generally applicable solutions most suitable for the analysis protein-coding sequences:

For many people the obvious way of finding the answer to the first problem is to run BLAST searches against databases held at NCBI, EMBL or DDBJ). This can be easily done by using the online web interfaces provided by these institutions, such as the NCBI's BLAST site. Depending on the type of search you want to perform, you will choose one of the different BLAST protocols (BLASTN, BLASTP, BLASTX ...) Notice that the defauld BLASTN scoring parameters used by the NCBI's and WU BLAST servers are quite different, the former being adecuate to find highly similar DNA sequences (typically with > 95% identity), while the latter is more adequate to find more distantly related nucleotide sequence homologs. For more information about the Basic Local Alignment Search Tool (the different BLAST protocols, search parameters, output formats etc.) we recommend having a look at NCBI's BLAST tutorial and the Blast book by Ian Korf and colleagues, as well as the the key paper by Stephen Altschul and colleagues describing Gapped-BLAST and PSI-Blast.

The online BLAST search engines are very useful when you have only a few sequences to analyze, but rapidly gets tedious or even impractical if you have many. In that case you are much better off using software tools such as netblast, or writing your own scripts to run batch BLAST searches locally or over the net. The latter tasks are very much facilitated with open-source bioinformatics resources such as BioPerl, BioPython or BioRuby modules.

The BLAST-based solutions given above are generally applicable to any DNA or protein sequences you may wish to analyze. It is important to stress that although general and very powerful, BLAST searches produce only pair-wise local alignments. For phylogenetic analyses we need global multiple sequence alignments. Furthermore, you can not a priory restrict the BLAST programs to output only "high quality" subject sequences corresponding only to non-redundant sequences from named organisms, or perhaps corresponding only to type strains.  These would have to be selected  manually before or after downloading the subject sequences found after performing the BLAST searches. This can be a very tedious task, partiuclarly for SSU rDNA sequences because there are so many of them deposited on a dayly basis in GenBank and other sequence databases. Once a dataset has been assembled, generally in the form of a FASTA-formatted file, we need to perform a multiple sequence alignment. The most popular of the multiple sequence alignment softwares available are the different forms of Clustal, basically ClustalW (non-interactive, menu-driven or programmable version) and ClustalX (interactive, Xwindows-based graphical display). These programs can be downloaded for free and binaries (as well as source code) exist for all major platforms, and run also on several web servers. However, more powerful, accurate and faster multiple sequence alignment algorithms are implemented in software programs such as muscle and probcons. Multiple sequence alignment is a very intense bioinformatics research area, and many new and more powerful and accurate algorithms and software implementations are constantly being published, as a PubMed or Google search will show.

If you are willing to perform a multiple sequence alignment of protein-coding sequences sequences , the best way to do this is by 1) translating the DNA to their corresponding protein sequences, 2) performing the multiple alignment on the later ones and 3) finally aligning the nucleotide sequences using the alignment of the translation products as an alignment mask. This ensures that the codon structure of CDSs is preserved. Software packages such as DAMBE or web servers such as RevTrans (among others) allow to perform these tasks easily.

For the particular case of ribosomal gene sequences, we need to pay attention to secondary structure motifs in order to generate a trustful alignment. The good news is that for the latter case, there are several excellent web servers that porvide online access to powerful tools that will greatly aid in the tasks of analysing rrs sequences, particularly retrieving SSU rDNA sequence data, identification of closely related sequences and performing multiple sequence alignments and checking for chimeras. The two sites that will be considered in this tutorial for these tasks are the GreenGenes and RDP-II web servers.


The proposed solution for finding closely related SSU rDNA sequences, retrieving selected sequences and performing a multiple sequence alignment:

The NAST program provides a very convenient and easy-to-use solution to this particular problem. If you have something up to 500 query rRNA (rDNA) sequences, you can upload the FASTA file containing them and ask to align them to each other, or with a number of non-redundant close relatives. Each query sequence in your uploaded file will be searched for 16S rRNA gene sequences and aligned according to a Core Set of alignment templates. You can even upload a whole fasta genome and NAST will find, extract, and align the 16S rRNA genes for you. Additionally, Simrank (an N-mer comparison tool) will be used to find nearest-neighbors (non-chimeric) as well as nearest-isolates for each of your sequences from the entire Greengenes database. Then, the query sequences can be returned by email either aligned or unaligned, with or without the near-neighbor sequences included in supplementary files. Please refrain from uploading files containing more than 500 sequences. If you have a larger project, try breaking it up into several files, or contact to make alternate arrangements. This version of NAST will align at a rate of ~10 sequences per minute. How to perform this is shown next.

Important note:  Before submitting your sequences make sure that you have filetered out redundant sequences. For phylogenetic analyses you should always use only distinct haplotypes, that is remove all repeated sequences. This can be easily done on a Windows machine using programs such as DAMBE. Simply open your FASTA file with DAMBE and it will automatically identify repeated sequences, allowing the user to retain only distinct haplotypes. A second major argument to remove redundant sequences is that you should never overload a server with unnecessary jobs.


Protocol 1. Performing a multiple sequence alignment of a set of newly obtained SSU rRNA gene sequences along with a set of non-redundant, closely related and non-chimeric reference sequences, using the GreenGenes site.

1. Navigate to the green-genes site and select the align option from the menu.

2. Upload the file containing your FASTA-formatted rrs sequences and define minimum query to subject alignment parameters, i.e. min. length of the subject sequence and min. query-subject pairwise alignment identity level for a match to be considered.

Make sure that you have removed any repeated sequences from your FASTA file before submitting your job to the server! (see the important note in the previous paragraph).

3. Define the output files you desire.

There are several files that the green-genes server can return to you. For most standard purposes, you may want to use either the first and second options (only your uploaded sequences are returned in the alignment format chosen, along with an excel file containing all the information regarding the fate of the alignment of each of the uploaded sequences). Alternatively, you may perhaps wish to select only the fourth or fith (last) option. The latter will output your sequences in a file of the desired format, containing the selected number of nearest-neighbor non-chimeric sequences from named isolates (redundant neighbors removed). File wil be named "xyz_nn_NAST.fasta", for example. This last option is what most rhizobiologists will like to chose for their diversity studies, as indicated below.


 4. Define the format for the output files.

Here you should select different options among two different classes of output formats (alignment gaps and sequence format).

First chose how to deal with alignment gaps. To use the aligned sequences in standard phylogeny reconstruction programs or packages such as MEGA3, PAUP*, PHYLIP or PHYML , the most convenient way to deal with alignment gaps is the one provided by the first check-box option.

Secondly, chose your preferred sequence format output. Here you can chose among several popular formats such as FASTA, MEGA, NEXUS and PHYLIP.

I would recommend that you use the FASTA with description format, which contains an informative header and is easyly interconveted into other formats.

5. Define the preferred output file(s) delivery option.

Provide the email address where you wish the server to deliver the results. These will be found in a compressed tar (*.tgz) file, which can be opened on all major platforms. On Unix/Linux machines open a terminal and type the following at the command line: tar -xvzf my_greenGenes_file.tgz to unzip and untar the file. On WindowsXP/2000 MacOSX systems, simply 2X click on the file and it should decompress.

There are many more very useful functions implemented at the GreenGenes site. Take time to explore them in depth. It will be rewarding, I promise. The site alsohas  tutorial material (GreenGenes-tutorial). You should at some time also read the relevant papers describing this great site.

The returned alignment files should be ready now to proceed to the next step in our analysis: finding the best-fitting model for our alignment, as shown in tutorial 2.

Note: very similar analyses and sequence manipulations as those performed above can also be done at the RDP-II site, which we will use in the following Protocol. 

General Greengenes Database and Tools Citation:

DeSantis, T. Z., P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen. 2006. Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl Environ Microbiol 72:5069-72.

NAST Alignment Tool Citation:

DeSantis, T. Z., P. Hugenholtz, K. Keller, E. L. Brodie, N. Larsen, Y. M. Piceno, R. Phan, and G. L. Andersen. 2006. NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res 34:W394-9.

Original Publication

DeSantis, T. Z., I. Dubosarskiy, S. R. Murray, and G. L. Andersen. 2003. Comprehensive aligned sequence construction for automated design of effective probes (CASCADE-P) using 16S rDNA. Bioinformatics 19:1461-8.


Protocol 2. Retrieval and multiple sequence alignment of  some rhizobial type strain SSU rDNA sequences using the Ribosomal Database Project II (RDP-II) site and required edition work of the retrieved alignment

1. Navigate to the RDPII site and select the hierarchy browser option.
2. Set the browser filter options.
Note that we have chosen to retrieve only the sequences from cultured type strains for which almost full-lenght sequences of good quality are available, using the NCBI taxonomy scheme. 


3. Start navigating the lineage tree (displayed following the NCBI organism classification scheme, as found at the NCBI's taxonomy page)  and selecting the taxa you're interested in.

After hitting the Browse button, we get a classification tree displayed. Originally there are 0 sequences selected. We proceed to select our rhizobial sequences as follows: Click on phylum Proteobacteria, class Alphaproteobacteria, order Rhizobiales.

truncated output. Clicking on the grey boxex  in front of the taxa you want to select changes their color to red and includes those taxa in the selection. You can go deeper into the lineage tree, by chosing a particular genus and selecting ontly those type strains you really want to include in your analysis. There's a lot of flexibility provided by the RDP-II site to navigate these lineage trees.
Note that on the upper right corner you have the summary of the taxa you have selected and there is the button to order the download process of those sequences to your computer


4. Choose an alginment model and format for download.
After hitting the download option you will get the following  display, where you an choose the alignment model and format. For most users and for the purpose of the following tutorials on model selection and phylogeny estimation, we will download the alignments of the selected rhizobial type strains as a FASTA file where all common gaps (gap-only columns) are removed, as shown below. Hit the download button and the corresponding file will be downloaded to your computer.

5. Edit the FASTA headers and alignments.
If you open the downloaded file with a text editor (Wordpad, vi, emacs, nedit ... NEVER USE WORD PROCESSORS such as Word to process sequence files !!!) you will see that the FASTA header of each sequence contains symbols like ( ) and ; while some also contain other simbols such as = : ? etc, which should be eliminated before proceeding with downstream analyses. Particularly the ,  ( ) : and ; symbols need to be removed because they are reserved characters for Newick tree formats used by all current phylogeny programs.
>S000011152 Bradyrhizobium japonicum (T); DSM 30131 T; X87272
>S000364392 Rhizobium sullae (T); "type strain:IS 123=USDA 4950=DSM 14623"; Y10170
Use your text editor (find-replace function) or other tools such as sed or Perl scripts to eliminate those disturbing characters before proceeding with downstream sequence analyses User your favourite sequence editor or ClustalX to view the alignment. In my experience, you will encounter sequences that have some residues misaligned at several positions. Two obvious cases from our example is shown below for sequences S000380772 (T at column 180 should be moved to col. 181) and S000381164, where the cg residues at pos. 210 on the alignment should be moved two positons to the right and the resulting gapped columns 210 and 211 removed, as shown in the second, edited alignment that follows the first one.

6. Edit the 5' and 3' termini of the alignments.
If you now look at the 5' and 3' ends of the alignments you will see that the different sequences are of different lengths. A multiple alignment algorithm will automatically consider missing sequence stretches as the result of an indel (insertion/deletion) event and will consequently fill-up those sites with gaps ( - ). That´s why the shorter sequences are filled up with gaps at their termini so that all sequences contain the same number of  columns or characters. 

However, these terminal gaps at the shorter sequences do not correspond to evolutionary insertion or deletion events. These differences in sequence length are simply the result of different PCR amplification and/or sequencing protocols. Therefore, the correct way to handle these sequence termini is to trim-off the sequence overhangs of the longest sequences and to fill up with ? characters the shorter ones, as shown below. All modern phylogeny programs use the - symbol to represent evolutionary informative indel  events, while the ? character is interpreted as missing data. That´s the reason why it is important to edit the 3' and 5' sequence ends in multiple alignments as explained above.

Final remarks:
The take-home message here is that a careful look at the multiple sequence alignments returned by whatever alignment program is always required to make manual correctios of misaligned residues and to perform further sequence editing, such as replacement of terminal gaps with question marks. A close inspection of the aligned sequences will often uncover sequences with obvious sequencing errors, as is the case in our dataset. What´s even more problematic is the fact that there are redundant sequence entries for the same type strains but unfortunately in too many cases these sequences are not identical. So please, before submitting sequences to GenBank or other sequence databases, check carefully your sequences in the context of a good multiple sequence alignment. If you suspect that your sequence could have mistakes, get some more and better sequence reads to assemble a correct contig before submitting to the databases. The entire community will be grateful for that effort.

There are many more very useful functions implemented at the RDP-II site. Take time to explore them in depth. It will be rewarding. The site also has diverse tutorial materials ( RDP-II-tutorial-HTMLRDP-II-tutorial-VIDEO ). You should at some time also read the relevant papers describing this great site.

How to cite services from the RDP-II site

If you download data or use software available from RDP, please cite us in the following way:

The ribosomal database project (RDP-II): introducing myRDP space and quality controlled public data. J. R. Cole; B. Chai; R. J. Farris; Q. Wang; A. S. Kulam-Syed-Mohideen; D. M. McGarrell; A. M. Bandela; E. Cardenas; G. M. Garrity; J. M. Tiedje. Nucleic Acids Research 2007 35 (Database issue): D169-D172; doi: 10.1093/nar/gkl889 [Abstract]

Cole JR, Chai B, Farris RJ, Wang Q, Kulam SA, McGarrell DM, Garrity GM, Tiedje JM. The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res. 2005 Jan 1;33(Database Issue):D294-D296. doi: 10.1093/nar/gki038 [PMID: 15608200Oxford University Press]

You may also want to visit my page on bioinformatics resources and tutorials, which may contain further material of your interest.