This tutorial will show how to use the get_homologues (Contreras-Moreira & Vinuesa, 2013) and get_phylomarkers (Vinuesa et al. 2018) software packages to perform a thorough pan-genomic and phylogenomic analyses of a small dataset of conjugative multiresistance plasmids of the pIncC (formerly IncA/C) incompatibility group of great concern in the clinical setting Ambrose et al. 2018. This is an updated version of a similar tutorial published by Vinuesa & Contreras-Moreira 2015.
Both software packages can be run on any platform using the Docker container, or directly on UNIX and GNU/Linux servers.
In any case, you will have to work using Linux \(Shell\) commands. Therefore it is important that you have some proficiency working on the command line.
Here is a complete tutorial on biocomputing in the GNU/Linux environment, with lots of examples.
The whole exercises will be run on buluc. You need to log into the server with the \(ssh\) command, available in any Linux machine or from the \(mobaXterm\) package that you should have installed on your Windows machine.
type your password
ssh -X vinuesa@*.*.*.* vinuesa@buluc.*.*.*'s password: Last login: Wed Nov 11 17:46:45 2020 from akbal.*.*.* ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo Bienvenidos a BULUC ... [vinuesa@buluc ~]$
In this section we will compute consensus core- and pan-genomes of the pIncA/C dataset as the intersection of clusters computed by the tree clustering algorithms implemented in get_homologues, as shown in the flowchart below:
GET_HOMOLOGUES is a sophisticated piece of software, with several external dependencies and auxiliary scripts, as depicted in Fig. 1.
To make sure that get_homlogues.pl is in PATH and that the kit is complete tiype the following command
# 0. make sure get_homlogues.pl in PATH and that the kit is complete /export/apps/get_homologues/get_homologues.pl version 26022020 Program written by Bruno Contreras-Moreira (1) and Pablo Vinuesa (2). 1: http://www.eead.csic.es/compbio (Estacion Experimental Aula Dei/CSIC/Fundacion ARAID, Spain) 2: http://www.ccg.unam.mx/~vinuesa (Center for Genomic Sciences, UNAM, Mexico) Primary citations: Contreras-Moreira B, Vinuesa P (2013) GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl Environ Microbiol 79(24):7696-701. (PubMed:24096415) Vinuesa P and Contreras-Moreira B (2015) Robust Identification of Orthologues and Paralogues for Microbial Pan-Genomics Using GET_HOMOLOGUES: A Case Study of pIncA/C Plasmids. In Bacterial Pangenomics, Methods in Molecular Biology Volume 1231, 203-232, edited by A Mengoni, M Galardini and M Fondi. This software employs code, binaries and data from different authors, please cite them accordingly: OrthoMCL v1.4 (www.orthomcl.org , PubMed:12952885) COGtriangles v2.1 (sourceforge.net/projects/cogtriangles , PubMed=20439257) NCBI Blast-2.2 (blast.ncbi.nlm.nih.gov , PubMed=9254694,20003500) Bioperl v1.5.2 (www.bioperl.org , PubMed=12368254) HMMER 3.1b2 (hmmer.org) Pfam (http://pfam.xfam.org , PubMed=26673716) Diamond v0.8.25 (https://github.com/bbuchfink/diamond, PubMed=25402007) Checking required binaries and data sources, all set in phyTools.pm : EXE_BLASTP : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/blastp) EXE_BLASTN : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/blastn) EXE_FORMATDB : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/makeblastdb) EXE_MCL : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/mcl-14-137/src/shmcl/mcl) EXE_MAKEHASH : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGmakehash/COGmakehash ) EXE_READBLAST : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGreadblast/COGreadblast ) EXE_COGLSE : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGlse/COGlse ) EXE_COGTRI : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGtriangles/COGtriangles ) EXE_HMMPFAM : OK (/export/apps/get_homologues-x86_64-20200226//bin/hmmer-3.1b2/binaries/hmmscan --noali --acc --cut_ga /export/apps/get_homologues-x86_64-20200226/db/Pfam-A.hmm) EXE_DMNDP : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/diamond-0.8.25/diamond blastp) EXE_DMNFT : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/diamond-0.8.25/diamond makedb) EXE_INPARA : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeInparalog.pl) EXE_ORTHO : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeOrtholog.pl) EXE_HOMOL : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeHomolog.pl) EXE_SPLITBLAST : OK (path:/export/apps/get_homologues-x86_64-20200226/_split_blast.pl) EXE_SPLITHMMPFAM : OK (path:/export/apps/get_homologues-x86_64-20200226/_split_hmmscan.pl)
The main pipeline depicted above (Fig. 1) is run with the core \(script\) get_homologues.pl.
The options of the \(script\) are displayed by simply calling it with any of the following commands
usage: /export/apps/get_homologues/get_homologues.pl [options] -h this message -v print version, credits and checks installation -d directory with input FASTA files ( .faa / .fna ), (overrides -i, GenBank files ( .gbk ), 1 per genome, or a subdirectory use of pre-clustered sequences ( subdir.clusters / subdir_ ) with pre-clustered sequences ignores -c, -g) ( .faa / .fna ); allows for new files to be added later; creates output folder named 'directory_homologues' -i input amino acid FASTA file with [taxon names] in headers, (required unless -d is set) creates output folder named 'file_homologues' Optional parameters: -o only run BLAST/Pfam searches and exit (useful to pre-compute searches) -c report genome composition analysis (follows order in -I file if enforced, ignores -r,-t,-e) -R set random seed for genome composition analysis (optional, requires -c, example -R 1234, required for mixing -c with -c -a runs) -m runmode [local|cluster] (default local) -n nb of threads for BLAST/HMMER/MCL in 'local' runmode (default=2) -I file with .faa/.gbk files in -d to be included (takes all by default, requires -d) Algorithms instead of default bidirectional best-hits (BDBH): -G use COGtriangle algorithm (COGS, PubMed=20439257) (requires 3+ genomes|taxa) -M use orthoMCL algorithm (OMCL, PubMed=12952885) Options that control sequence similarity searches: -X use diamond instead of blastp (optional, set threads with -n) -C min %coverage in BLAST pairwise alignments (range [1-100],default=75) -E max E-value (default=1e-05,max=0.01) -D require equal Pfam domain composition (best with -m cluster or -n threads) when defining similarity-based orthology -S min %sequence identity in BLAST query/subj pairs (range [1-100],default=1 [BDBH|OMCL]) -N min BLAST neighborhood correlation PubMed=18475320 (range [0,1],default=0 [BDBH|OMCL]) -b compile core-genome with minimum BLAST searches (ignores -c [BDBH]) Options that control clustering: -t report sequence clusters including at least t taxa (default t=numberOfTaxa, t=0 reports all clusters [OMCL|COGS]) -a report clusters of sequence features in GenBank files (requires -d and .gbk files, instead of default 'CDS' GenBank features example -a 'tRNA,rRNA', NOTE: uses blastn instead of blastp, ignores -g,-D) -g report clusters of intergenic sequences flanked by ORFs (requires -d and .gbk files) in addition to default 'CDS' clusters -f filter by %length difference within clusters (range [1-100], by default sequence length is not checked) -r reference proteome .faa/.gbk file (by default takes file with least sequences; with BDBH sets first taxa to start adding genes) -e exclude clusters with inparalogues (by default inparalogues are included) -x allow sequences in multiple COG clusters (by default sequences are allocated to single clusters [COGS]) -F orthoMCL inflation value (range [1-5], default=1.5 [OMCL]) -A calculate average identity of clustered sequences, (optional, creates tab-separated matrix, by default uses blastp results but can use blastn with -a recommended with -t 0 [OMCL|COGS]) -P calculate percentage of conserved proteins (POCP), (optional, creates tab-separated matrix, by default uses blastp results but can use blastn with -a recommended with -t 0 [OMCL|COGS]) -z add soft-core to genome composition analysis (optional, requires -c [OMCL|COGS]) This program uses BLAST (and optionally HMMER/Pfam) to define clusters of 'orthologous' genomic sequences and pan/core-genome gene sets. Several algorithms are available and search parameters are customizable. It is designed to process (in a HPC computer cluster) files contained in a directory (-d), so that new .faa/.gbk files can be added while conserving previous BLAST results. In general the program will try to re-use previous results when run with the same input directory.
For the details of each option, please have a look a the extensive get_homologues manual. Here we can review only some of the most important ones.
GET_HOMOLOGUES allows users to run 3 different clustering algorithms (\(BDBH\), \(COGtriangles\) and \(OMCL\)) to define homologous gene families, as defined in Table 1.
|BDBH||default||Starting from a reference genome, keep adding genomes stepwise, while storing the sequence clusters that result of merging the best bidirectional best hits|
|COGS||-G||Merges triangles of inter-genomic symmetrical best matches, as described in PubMed=20439257|
|OMCL||-M||OrthoMCL v1.4, uses the Markov Cluster Algorithm to group sequences, with inflation (-F) controlling cluster granularity, as described in PubMed=12952885|
To learn all details around these algorithms, please read the online GET_HOMOLOGUES manual
The only required option is either -i, used to choose an input file, or preferably -d instead, which indicates an input folder.
In previous versions only files with extensions .fa[a] and .gb[k] were considered when parsing the -d directory. Currently, GZIP- or BZIP2-compressed input files are also accepted!
By using .faa input files in theory you might only calculate clusters of protein sequences. In contrast, the advantage of using .gbk files is that you obtain both nucleotide and protein clusters. If both types of input files are combined, only protein clusters will be produced. However, if each input .faa file has a twin .fna file in place, containing the corresponding nucleotide sequences in the same order, the program will attempt to produce the corresponding clusters of nucleotide sequences. The possible input file combinations are summarized in Table 1:
|input file extensions||output clusters|
|.gbk||amino acid + DNA sequence|
|.faa||amino acid sequence|
|.gbk & .faa||amino acid sequence|
|.faa & .fna||amino acid + DNA sequence|
|.gbk & .faa & .fna||amino acid + DNA sequence|
In summary, the use of an input folder or directory (-d) containing GenBank files is recommended, as it allows for new files to be added there in the future, reducing the computing required for updated analyses. For instance, if a user does a first analysis with 5 input genomes today, it is possible to check how the resulting clusters would change when adding an extra 10 genomes tomorrow, by copying these new 10 .faa / .gbk input files to the pre-existing -d folder, so that all previous BLAST searches are re-used.
To run \(get\_homologues.pl\ -d\) you need to be in the directory on top of the one holding the GenBank files. In our case this is the one saved in \(\$top\_dir\), holding the \(pIncAC\) directory with the *gbk files.
The BDBH algorithm implemented in GET_HOMOLOGUES is a heuristic version that does NOT cluster the results of the all-against-all blast data, just those against the smallest genome (default) or a (\(-R\ refGenome\)), as shown in Fig. 2. Therefore it is not suitable for computing pan-genomes. Use it only for core-genome calculations!
Let’s instruct get_homologues.pl to compute the BDBH clusters for the 12 genomes. Make sure you are in \(\$top\_dir\) and issue the following command, which will compute a core-genome based on the BDBH (default) clustering algorithm, excluding inparaloges (-e)
get_homologues.pl -d pIncAC -t 12 -e -n 2#
Note: the command above is equivalent to
get_homologues.pl -d pIncAC -e -n 2, as get_homologues.pl will by default use -t = number_of_genomes available in the \(\$gbk\_dir\)
# 1. run GET_HOMOLOGUES to compute the set of core-genome clusters using the BDBH, # get_homologues.pl -d pIncAC &> log.BDBH_def & get_homologues.pl -d pIncAC -t 12 -e -n 2 # BDBH clusters containing sequences for the 12 genomes (-t 12) # excluding those with inparalogues (-e): # use 2 processors for the job # This will take ~ 3 minutes # version 26022020 # results_directory=/home/vinuesa/pangenomics/pIncAC_homologues # parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1 # diamond job:0 # checking input files... # Salmonella_enterica_33676_pIncAC.gbk 185 # Salmonella_enterica_YU39_pIncAC.gbk 178 # Salmonella_enterica_pAM04528.gbk 180 # Salmonella_enterica_sv_Kentucky.gbk 164 # _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk 158 # _Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk 173 # _Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk 172 # _Escherichia_coli_plasmid_pAR060302_NC_012692.gbk 189 # _Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk 224 # _Escherichia_coli_plasmid_pPG010208_NC_019065.gbk 148 # _Escherichia_coli_plasmid_peH4H_NC_012690.gbk 173 # _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk 137 # 12 genomes, 2081 sequences # taxa considered = 12 sequences = 2081 residues = 545474 MIN_BITSCORE_SIM = 16.9 # mask=KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_ (_algBDBH) # running makeblastdb with /home/vinuesa/pangenomics/pIncAC_homologues/Salmonella_enterica_33676_pIncAC.gbk.fasta # running makeblastdb with /home/vinuesa/pangenomics/pIncAC_homologues/Salmonella_enterica_YU39_pIncAC.gbk.fasta # ... # running BLAST searches ... # done # ... # concatenating and sorting BLAST/DIAMOND results... # sorting _Salmonella_enterica_33676_pIncAC.gbk results (0.097MB) # sorting _Salmonella_enterica_YU39_pIncAC.gbk results (0.091MB) # sorting _Salmonella_enterica_pAM04528.gbk results (0.1MB) # clustering inparalogues in _Escherichia_coli_plasmid_peH4H_NC_012690.gbk # ... # done # parsing blast result! (/home/vinuesa/cursos/pangenomics_PDCBq_Nov20/test_sequences/pIncAC_homologues/tmp/all.blast , 1.1MB) # parsing file finished # creating indexes, this might take some time (lines=2.25e+04) ... # construct_taxa_indexes: number of taxa found = 12 # number of file addresses/BLAST queries = 2.1e+03 # clustering orthologous sequences # clustering inparalogues in _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk (reference) # 0 sequences # clustering inparalogues in Salmonella_enterica_33676_pIncAC.gbk # 4 sequences # finding BDBHs between _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk and Salmonella_enterica_33676_pIncAC.gbk # 59 sequences # clustering inparalogues in Salmonella_enterica_YU39_pIncAC.gbk # 3 sequences # finding BDBHs between _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk and Salmonella_enterica_YU39_pIncAC.gbk # 84 sequences # ... # looking for valid ORF clusters (n_of_taxa=12)... # number_of_clusters = 33 # cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_.cluster_list # cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_ # runtime: 218 wallclock secs ( 4.40 usr 0.75 sys + 174.83 cusr 17.86 csys = 197.84 CPU) # RAM use: 33.7 MB
The results of the previous and following get_homologue.pl runs can be found in the pIncAC_homologues/ directory, which includes the all-against-all blast results.
ls -1d * pIncAC pIncAC_homologues
We will explore its contents later on.
Let’s now compute the pan-genome clusters with the COGtriangles and OMCL algorithms, which will use the blastp results already available in pIncAC_homologues/ from the previous BDBH run.
Run with \(-t\ 0\) to compute clusters of all sizes, which is required for the pan-genome analysis.
get_homologues.pl -d pIncAC -G -t 0
get_homologues.pl -d pIncAC -G -t 0 # COGtriangles, computing clusters of all sizes (-t 0) # finds 408 clusters in 14 wallclock secs, as it recycles # the blast results of the previous run. # version 26022020 # results_directory=/home/vinuesa/pangenomics/pIncAC_homologues # parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1 # diamond job:0 # checking input files... # Salmonella_enterica_33676_pIncAC.gbk 185 # Salmonella_enterica_YU39_pIncAC.gbk 178 # Salmonella_enterica_pAM04528.gbk 180 # Salmonella_enterica_sv_Kentucky.gbk 164 # _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk 158 # _Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk 173 # _Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk 172 # _Escherichia_coli_plasmid_pAR060302_NC_012692.gbk 189 # _Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk 224 # _Escherichia_coli_plasmid_pPG010208_NC_019065.gbk 148 # _Escherichia_coli_plasmid_peH4H_NC_012690.gbk 173 # _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk 137 # 12 genomes, 2081 sequences # taxa considered = 12 sequences = 2081 residues = 545474 MIN_BITSCORE_SIM = 16.9 # mask=KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_ (_algCOG) # skipped genome parsing (pIncAC_homologues/tmp/selected.genomes) # running BLAST searches ... # done # parsing file (COG) # parsing file (COG) finished # creating indexes, this might take some time (lines=2.25e+04) ... # construct_taxa_indexes: number of taxa found = 12 # number of file addresses/BLAST queries = 2.1e+03 # clustering orthologous sequences # checking lineage-specific expansions # making COGs # prunning COGs # done # add_unmatched_singletons : 0 sequences, 0 taxa # looking for valid ORF clusters (n_of_taxa=0)... # number_of_clusters = 408 # cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_.cluster_list # cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_ # runtime: 15 wallclock secs ( 0.81 usr 0.14 sys + 2.46 cusr 0.58 csys = 3.99 CPU) # RAM use: 28.3 MB
Run with \(-t\ 0\) to compute clusters of all sizes, which is required for the pan-genome analysis
Option \(-c\) for generating pan-genome composition tables to be used for plotting the core decay and pan-genome growth curveas as a function of increasing number of genomes.
get_homologues.pl -d pIncAC/ -M -t 0 -c
get_homologues.pl -d pIncAC/ -M -t 0 -c # OMCL, finds 393 clusters in 5 wallclock secs. # ... # identifying inparalogs in _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk # 0 sequences # running MCL (inflation=1.5) ... # running MCL finished # find_OMCL_clusters: parsing clusters (/home/vinuesa/pangenomics/pIncAC_homologues/tmp/all_ortho.mcl) # adding _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk: core=32 pan=336 # pan-genome (number of genes, can be plotted with plot_pancore_matrix.pl) # file=pIncAC_homologues/pan_genome_algOMCL.tab genomes mean stddev | samples 0 167 22 | 172 170 183 137 164 217 170 169 137 154 1 215 22 | 215 206 184 203 239 262 229 208 200 201 2 242 15 | 246 245 236 247 258 269 242 243 216 220 3 273 22 | 274 281 292 282 301 288 270 262 217 263 4 291 21 | 295 322 294 288 301 312 307 268 251 268 5 307 15 | 300 329 302 307 303 317 315 315 270 309 6 316 17 | 302 335 307 307 303 336 334 323 282 330 7 317 18 | 303 336 307 312 303 340 335 324 282 331 8 324 17 | 340 336 317 320 309 341 335 324 282 331 9 330 10 | 343 337 336 321 312 341 335 331 314 331 10 334 9 | 343 345 336 321 331 341 343 332 317 331 11 341 4 | 348 346 339 340 338 341 344 340 335 336 # core-genome (number of genes, can be plotted with plot_pancore_matrix.pl) # file=pIncAC_homologues/core_genome_algOMCL.tab genomes mean stddev | samples 0 167 22 | 172 170 183 137 164 217 170 169 137 154 1 112 22 | 123 106 165 97 134 116 89 90 94 109 2 93 11 | 90 83 105 94 117 81 83 88 92 95 3 75 8 | 74 62 83 81 74 74 62 72 91 75 4 65 13 | 55 58 78 57 72 53 43 69 89 74 5 57 13 | 53 57 75 49 69 53 35 42 69 72 6 49 10 | 53 42 54 49 61 41 35 34 66 51 7 44 8 | 53 35 46 43 61 40 35 34 41 51 8 42 6 | 51 34 44 41 45 40 35 34 40 51 9 38 5 | 50 34 33 41 40 39 34 34 35 43 10 36 4 | 42 32 33 40 32 39 32 34 34 42 11 32 0 | 32 32 32 32 32 32 32 32 32 32 # add_unmatched_singletons : 117 sequences, 12 taxa # looking for valid ORF clusters (n_of_taxa=0)... # number_of_clusters = 393 # cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_.cluster_list # cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ # runtime: 3 wallclock secs ( 1.82 usr 0.13 sys + 1.79 cusr 0.26 csys = 4.00 CPU) # RAM use: 30.3 MB
Remember that for the OMCL clustering we ran the following code:
get_homologues.pl -d pIncAC/ -M -t 0 -c
The \(-c\) flag tells get_homologues.pl to compute the core- and pan-genome composition data by randomly sampling 10 times pairs of genomes to compute the observed numbers of common and new genes. The results are written to the core_genome_algOMCL.tab and pan_genome_algOMCL.tab files, found in the pIncAC_homologues results directory. These tables are used as input for plot_pancore_matrix.pl, which renders the graphics and fits different functions to the data.
[options]: -h this message -i input .tab data file (required, generated by get_homologues.pl) -f type of fit [pan|core_Tettelin|core_Willenbrock|core_both] (default: core_Tettelin, PubMed:16172379,18088402) -F font scale for fitted formulas (optional, default=0.8) -a save snapshots for animations in dir (optional, example: -a animation)
cd pIncAC_homologues/ # lests save the absolute path to this directory in the blast_dir variable blast_dir=$(pwd) # these are the relevant input files: core_genome_algOMCL.tab and pan_genome_algOMCL.tab ls *.tab # plot core-decay curves (Tettelin & Willenbrock fits) plot_pancore_matrix.pl -i core_genome_algOMCL.tab -f core_both # plot pan-genome growth curves (Tettelin's exponential fit) plot_pancore_matrix.pl -i pan_genome_algOMCL.tab -f pan
Figure 3. The core-genome decay curves for 12 pIncA/C plasmids
Figure 4. The pan-genome growth curve for 12 pIncA/C plasmids
Based on the output shown above, what type of pan-genome (closed or open) do you think best represents this collection of pIncA/C plasmids?
Before exploring the contents of the get_homologues.pl runs stored in the [pIncAC]_homologues directory, a brief note on how to setup multiple sequential get_homologues.pl runs in non-interactive mode. This is the best option to run large datasets which may take hours or days to finish.
This is easy. just type the get_homologues.pl invocations that you would like to run into a text file (\(script\)), which i have named run_get_hom.batch holding the code shown below:
get_homologues.pl -d pIncAC -t 12 -e -n 2 &> log.BDBH get_homologues.pl -d pIncAC -G -t 0 &> log.COG get_homologues.pl -d pIncAC/ -M -t 0 -c &> log.OMCL
The run_get_hom.batch \(script\) would run exactly the same analyses that we have run interactively, but in a non-interactive mode.
Note that the output that each get_homologues.pl invocation would normally print to STDOUT (the screen) is redirected to a file called log.[ALGORITHM], making use of the “&>” syntax which instructs to redirect STDOUT and STDERR to the file indicated after the symbols.
Now we can call the batch script in the following manner:
nohup bash run_get_hom.batch &> /dev/null &
The \(nohup\) (no hang-up) command combined with the strange-looking “&> /dev/null &” tells the server to run the script in the background and to make the run independent of the login session. This means that after issuing the command, the script runs detached from our terminal. We can shutdown our computer and the process will continue executing on the server.After some time, we can check the results of the runs by listing the files and directories as shown:
ls -1 log.BDBH log.COG log.OMCL pIncAC pIncAC_homologues run_get_hom.batch
Another useful command to know is
tail -f log.BDBH, which can be called on any of the logfiles while they are being written to disk, showing their last 10 lines as they’re appended to the tail of the file in real time. Nice!
Al previously mentioned, GET_HOMOLOGUES writes the results of all the previous runs in the **[pIncAC]_homologues/ directory**, stored in the \(\$blast\_dir\) variable.
Therein you will find the following files and directories
The run-specific output directories are named using the following components to make them unique for each run configuration:
BuchaphCc_f0_alltaxa_algBDBH_e0_ | | | | | | | | | -e option was not used (inparalogues are in) | | | the clustering algorithm is BDBD (default) | | all clusters contain at least 1 sequence from each taxa (default -t behavior) | -f option not used (no length filtering) reference proteome
Let’s explore the results directory pIncAC_homologues/ using basic \(Shell\) commands:
# assumes that we are in $blast_dir (the results directory pIncAC_homologues/) cd $blast_dir # list ls | less ls *faa *fna *fasta ls *faa *fna *fasta | wc head -5 Salmonella_enterica_YU39_pIncAC.gbk.f*a ls Salmonella_enterica_YU39_pIncAC.gbk.f*a.* # find the subdirectories holding the run-specific clusters of homologous sequences (gene families) find . -type d # exlore results (clusters) in the KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_ # which hods the BDBH results for the 12 genomes (-t 12), without inparalogues (-e) ls ls *fna | wc -l ls *faa | wc -l grep -c '>' *fna grep -c '>' *faa
GET_HOMOLOGUES is unique among available pangenomics software in many respects. One of them is its capacity to compute consensus core-genome clusters from the solutions generated by the three clustering algorithms it implements, and consensus pan-genome clusters from COGtriangles and OMCL clustering results, as demonstrated below.
[options]: -h this message -d comma-separated names of cluster directories (min 1 required, example -d dir1,dir2) -o output directory (required, intersection cluster files are copied there) -n use nucleotide sequence .fna clusters (optional, uses .faa protein sequences by default) -r take first cluster dir as reference set, which might contain (optional, by default cluster dirs are expected a single representative sequence per cluster to be derived from the same taxa; overrides -t,-I) -s use only clusters with syntenic genes (optional, parses neighbours in FASTA headers) -t use only clusters with single-copy orthologues from taxa >= t (optional, default takes all intersection clusters; example -t 10) -I produce clusters with single-copy seqs from ALL taxa in file (optional, example -I include_list; overrides -t) -m produce intersection pangenome matrices (optional, ideally expects cluster directories generated with get_homologues.pl -t 0) -x produce cluster report in OrthoXML format (optional) -T produce parsimony-based pangenomic tree
# Compute consensus core-genome clusters using compare_clusters.pl of the GET_HOMOLOGUES suite. # Note that we will run the compare_clusters.pl script twice, once with the -n flag # (to compute the consesus clusters at the DNA level, producing *.fna files) # and a second instance without the flag, to get the protein clusters (*.faa files) # assumes we are in $blast_dir cd $blast_dir # find the subdirectories holding the run-specific results find . -type d # Compute the core genome; use -t 12 to keep only clusters containing a copy of the gene in each of the 12 genomes analyzed compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\ ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\ ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12 -n # repeat command, without -n, to compute the clusters at the protein level (*faa files) compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\ ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\ ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12 # Lets explore the core_BCM directory cd core_BCM ls *fna | wc -l ls *faa | wc -l grep -c '>' *faa # confirm that all 31 clusters contain only one sequence from each plasmid/proteome grep '>' *faa | cut -d\| -f2,3 | sort | uniq -c # explore the contents of the *venn_t12.txt files for f in *txt; do echo "#$f"; cat $f; echo '---------------------'; done
Figure 5. Venn diagram depicting the consensus core-genome of 31 genes for the 12 input pIncA/C sequences.