1 Aim of the tutorial

This tutorial will show how to use the get_homologues (Contreras-Moreira & Vinuesa, 2013) and get_phylomarkers (Vinuesa et al. 2018) software packages to perform a thorough pan-genomic and phylogenomic analyses of a small dataset of conjugative multiresistance plasmids of the pIncC (formerly IncA/C) incompatibility group of great concern in the clinical setting Ambrose et al. 2018. This is an updated version of a similar tutorial published by Vinuesa & Contreras-Moreira 2015.

1.1 GET_HOMOLOGUES and GET_PHYLOMARKERS source code and documentation


2 Running GET_HOMOLOGUES and GET_PHYLOMARKERS on buluc using the Linux \(Shell\)

Both software packages can be run on any platform using the Docker container, or directly on UNIX and GNU/Linux servers.

In any case, you will have to work using Linux \(Shell\) commands. Therefore it is important that you have some proficiency working on the command line.

Here is a complete tutorial on biocomputing in the GNU/Linux environment, with lots of examples.

2.1 Connecting to buluc

The whole exercises will be run on buluc. You need to log into the server with the \(ssh\) command, available in any Linux machine or from the \(mobaXterm\) package that you should have installed on your Windows machine.

ssh username@...

type your password

ssh -X vinuesa@*.*.*.*
vinuesa@buluc.*.*.*'s password: 
Last login: Wed Nov 11 17:46:45 2020 from akbal.*.*.*
ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

Bienvenidos a BULUC

...

[vinuesa@buluc ~]$

3 GET_HOMOLOGUES tutorial: computing core- and pan-genomes

In this section we will compute consensus core- and pan-genomes of the pIncA/C dataset as the intersection of clusters computed by the tree clustering algorithms implemented in get_homologues, as shown in the flowchart below:

3.1 GET_HOMOLOGUES options

GET_HOMOLOGUES is a sophisticated piece of software, with several external dependencies and auxiliary scripts, as depicted in Fig. 1.

3.1.1 Checking that all dependencies are in place

To make sure that get_homlogues.pl is in PATH and that the kit is complete tiype the following command

get_homologues.pl -v

# 0. make sure get_homlogues.pl in PATH and that the kit is complete
/export/apps/get_homologues/get_homologues.pl version 26022020

Program written by Bruno Contreras-Moreira (1) and Pablo Vinuesa (2).

 1: http://www.eead.csic.es/compbio (Estacion Experimental Aula Dei/CSIC/Fundacion ARAID, Spain)
 2: http://www.ccg.unam.mx/~vinuesa (Center for Genomic Sciences, UNAM, Mexico)

Primary citations:

 Contreras-Moreira B, Vinuesa P (2013) GET_HOMOLOGUES, a versatile software package for scalable and
 robust microbial pangenome analysis. Appl Environ Microbiol 79(24):7696-701. (PubMed:24096415)

 Vinuesa P and Contreras-Moreira B (2015) Robust Identification of Orthologues and Paralogues for 
 Microbial Pan-Genomics Using GET_HOMOLOGUES: A Case Study of pIncA/C Plasmids. In Bacterial Pangenomics,
 Methods in Molecular Biology Volume 1231, 203-232, edited by A Mengoni, M Galardini and M Fondi.

This software employs code, binaries and data from different authors, please cite them accordingly:
 OrthoMCL v1.4 (www.orthomcl.org , PubMed:12952885)
 COGtriangles v2.1 (sourceforge.net/projects/cogtriangles , PubMed=20439257)
 NCBI Blast-2.2 (blast.ncbi.nlm.nih.gov , PubMed=9254694,20003500)
 Bioperl v1.5.2 (www.bioperl.org , PubMed=12368254)
 HMMER 3.1b2 (hmmer.org)
 Pfam (http://pfam.xfam.org , PubMed=26673716)
 Diamond v0.8.25 (https://github.com/bbuchfink/diamond, PubMed=25402007)

Checking required binaries and data sources, all set in phyTools.pm :
        EXE_BLASTP : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/blastp)
        EXE_BLASTN : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/blastn)
      EXE_FORMATDB : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/ncbi-blast-2.8.1+/bin/makeblastdb)
           EXE_MCL : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/mcl-14-137/src/shmcl/mcl)
      EXE_MAKEHASH : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGmakehash/COGmakehash )
     EXE_READBLAST : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGreadblast/COGreadblast )
        EXE_COGLSE : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGlse/COGlse )
        EXE_COGTRI : OK (path:/export/apps/get_homologues-x86_64-20200226//bin/COGsoft/COGtriangles/COGtriangles )
       EXE_HMMPFAM : OK (/export/apps/get_homologues-x86_64-20200226//bin/hmmer-3.1b2/binaries/hmmscan --noali --acc --cut_ga  /export/apps/get_homologues-x86_64-20200226/db/Pfam-A.hmm)
         EXE_DMNDP : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/diamond-0.8.25/diamond blastp)
         EXE_DMNFT : OK (path:/export/apps/get_homologues-x86_64-20200226/bin/diamond-0.8.25/diamond makedb)
        EXE_INPARA : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeInparalog.pl)
         EXE_ORTHO : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeOrtholog.pl)
         EXE_HOMOL : OK (path:/export/apps/get_homologues-x86_64-20200226/_cluster_makeHomolog.pl)
    EXE_SPLITBLAST : OK (path:/export/apps/get_homologues-x86_64-20200226/_split_blast.pl)
  EXE_SPLITHMMPFAM : OK (path:/export/apps/get_homologues-x86_64-20200226/_split_hmmscan.pl)

3.1.2 Running the main \(script\) get_homologues.pl

The main pipeline depicted above (Fig. 1) is run with the core \(script\) get_homologues.pl.

The options of the \(script\) are displayed by simply calling it with any of the following commands

  • get_homologues.pl
  • get_homologues.pl -h
usage: /export/apps/get_homologues/get_homologues.pl [options]

-h this message
-v print version, credits and checks installation
-d directory with input FASTA files ( .faa / .fna ),           (overrides -i,
   GenBank files ( .gbk ), 1 per genome, or a subdirectory      use of pre-clustered sequences
   ( subdir.clusters / subdir_ ) with pre-clustered sequences   ignores -c, -g)
   ( .faa / .fna ); allows for new files to be added later;    
   creates output folder named 'directory_homologues'
-i input amino acid FASTA file with [taxon names] in headers,  (required unless -d is set)
   creates output folder named 'file_homologues'

Optional parameters:
-o only run BLAST/Pfam searches and exit                       (useful to pre-compute searches)
-c report genome composition analysis                          (follows order in -I file if enforced,
                                                                ignores -r,-t,-e)
-R set random seed for genome composition analysis             (optional, requires -c, example -R 1234,
                                                                required for mixing -c with -c -a runs)
-m runmode [local|cluster]                                     (default local)
-n nb of threads for BLAST/HMMER/MCL in 'local' runmode        (default=2)
-I file with .faa/.gbk files in -d to be included              (takes all by default, requires -d)

Algorithms instead of default bidirectional best-hits (BDBH):
-G use COGtriangle algorithm (COGS, PubMed=20439257)           (requires 3+ genomes|taxa)
-M use orthoMCL algorithm (OMCL, PubMed=12952885)

Options that control sequence similarity searches:
-X use diamond instead of blastp                               (optional, set threads with -n)
-C min %coverage in BLAST pairwise alignments                  (range [1-100],default=75)
-E max E-value                                                 (default=1e-05,max=0.01)
-D require equal Pfam domain composition                       (best with -m cluster or -n threads)
   when defining similarity-based orthology
-S min %sequence identity in BLAST query/subj pairs            (range [1-100],default=1 [BDBH|OMCL])
-N min BLAST neighborhood correlation PubMed=18475320          (range [0,1],default=0 [BDBH|OMCL])
-b compile core-genome with minimum BLAST searches             (ignores -c [BDBH])

Options that control clustering:
-t report sequence clusters including at least t taxa          (default t=numberOfTaxa,
                                                                t=0 reports all clusters [OMCL|COGS])
-a report clusters of sequence features in GenBank files       (requires -d and .gbk files,
   instead of default 'CDS' GenBank features                    example -a 'tRNA,rRNA',
                                                                NOTE: uses blastn instead of blastp,
                                                                ignores -g,-D)
-g report clusters of intergenic sequences flanked by ORFs     (requires -d and .gbk files)
   in addition to default 'CDS' clusters
-f filter by %length difference within clusters                (range [1-100], by default sequence
                                                                length is not checked)
-r reference proteome .faa/.gbk file                           (by default takes file with
                                                                least sequences; with BDBH sets
                                                                first taxa to start adding genes)
-e exclude clusters with inparalogues                          (by default inparalogues are
                                                                included)
-x allow sequences in multiple COG clusters                    (by default sequences are allocated
                                                                to single clusters [COGS])
-F orthoMCL inflation value                                    (range [1-5], default=1.5 [OMCL])
-A calculate average identity of clustered sequences,          (optional, creates tab-separated matrix,
 by default uses blastp results but can use blastn with -a      recommended with -t 0 [OMCL|COGS])
-P calculate percentage of conserved proteins (POCP),          (optional, creates tab-separated matrix,
 by default uses blastp results but can use blastn with -a      recommended with -t 0 [OMCL|COGS])
-z add soft-core to genome composition analysis                (optional, requires -c [OMCL|COGS])

 This program uses BLAST (and optionally HMMER/Pfam) to define clusters of 'orthologous'
 genomic sequences and pan/core-genome gene sets. Several algorithms are available
 and search parameters are customizable. It is designed to process (in a HPC computer
 cluster) files contained in a directory (-d), so that new .faa/.gbk files can be added
 while conserving previous BLAST results. In general the program will try to re-use
 previous results when run with the same input directory.

For the details of each option, please have a look a the extensive get_homologues manual. Here we can review only some of the most important ones.

3.2 Running GET_HOMOLOGUES with different clustering algorithms to compute consensus core- and pan-genomes

GET_HOMOLOGUES allows users to run 3 different clustering algorithms (\(BDBH\), \(COGtriangles\) and \(OMCL\)) to define homologous gene families, as defined in Table 1.

  • Table 1. Overview of the three clustering algorithms implemented in get_homologues
name option comment
BDBH default Starting from a reference genome, keep adding genomes stepwise, while storing the sequence clusters that result of merging the best bidirectional best hits
COGS -G Merges triangles of inter-genomic symmetrical best matches, as described in PubMed=20439257
OMCL -M OrthoMCL v1.4, uses the Markov Cluster Algorithm to group sequences, with inflation (-F) controlling cluster granularity, as described in PubMed=12952885

To learn all details around these algorithms, please read the online GET_HOMOLOGUES manual

3.2.1 The sequence data for get_homologues

The only required option is either -i, used to choose an input file, or preferably -d instead, which indicates an input folder.

In previous versions only files with extensions .fa[a] and .gb[k] were considered when parsing the -d directory. Currently, GZIP- or BZIP2-compressed input files are also accepted!

By using .faa input files in theory you might only calculate clusters of protein sequences. In contrast, the advantage of using .gbk files is that you obtain both nucleotide and protein clusters. If both types of input files are combined, only protein clusters will be produced. However, if each input .faa file has a twin .fna file in place, containing the corresponding nucleotide sequences in the same order, the program will attempt to produce the corresponding clusters of nucleotide sequences. The possible input file combinations are summarized in Table 1:

  • Table 2: Valid input file combinations.
input file extensions output clusters
.gbk amino acid + DNA sequence
.faa amino acid sequence
.gbk & .faa amino acid sequence
.faa & .fna amino acid + DNA sequence
.gbk & .faa & .fna amino acid + DNA sequence

In summary, the use of an input folder or directory (-d) containing GenBank files is recommended, as it allows for new files to be added there in the future, reducing the computing required for updated analyses. For instance, if a user does a first analysis with 5 input genomes today, it is possible to check how the resulting clusters would change when adding an extra 10 genomes tomorrow, by copying these new 10 .faa / .gbk input files to the pre-existing -d folder, so that all previous BLAST searches are re-used.

To run \(get\_homologues.pl\ -d\) you need to be in the directory on top of the one holding the GenBank files. In our case this is the one saved in \(\$top\_dir\), holding the \(pIncAC\) directory with the *gbk files.

3.2.2 Computing BDBH clusters with GET_HOMOLOGUES (default clustering mode)

The BDBH algorithm implemented in GET_HOMOLOGUES is a heuristic version that does NOT cluster the results of the all-against-all blast data, just those against the smallest genome (default) or a (\(-R\ refGenome\)), as shown in Fig. 2. Therefore it is not suitable for computing pan-genomes. Use it only for core-genome calculations!

  • Figure 2. GET_HOMOLOGUE’s implementation of the BDBH algorithm using a reference genome


Let’s instruct get_homologues.pl to compute the BDBH clusters for the 12 genomes. Make sure you are in \(\$top\_dir\) and issue the following command, which will compute a core-genome based on the BDBH (default) clustering algorithm, excluding inparaloges (-e)

  • get_homologues.pl -d pIncAC -t 12 -e -n 2 #

Note: the command above is equivalent to get_homologues.pl -d pIncAC -e -n 2, as get_homologues.pl will by default use -t = number_of_genomes available in the \(\$gbk\_dir\)

# 1. run GET_HOMOLOGUES to compute the set of core-genome clusters using the BDBH, 
# get_homologues.pl -d pIncAC &> log.BDBH_def &
get_homologues.pl -d pIncAC -t 12 -e -n 2  # BDBH clusters containing sequences for the 12 genomes (-t 12) 
                                           # excluding those with inparalogues (-e): 
                                           # use 2 processors for the job
                                           # This will take ~ 3 minutes

# version 26022020
# results_directory=/home/vinuesa/pangenomics/pIncAC_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1
# diamond job:0

# checking input files...
# Salmonella_enterica_33676_pIncAC.gbk 185
# Salmonella_enterica_YU39_pIncAC.gbk 178
# Salmonella_enterica_pAM04528.gbk 180
# Salmonella_enterica_sv_Kentucky.gbk 164
# _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk 158
# _Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk 173
# _Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk 172
# _Escherichia_coli_plasmid_pAR060302_NC_012692.gbk 189
# _Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk 224
# _Escherichia_coli_plasmid_pPG010208_NC_019065.gbk 148
# _Escherichia_coli_plasmid_peH4H_NC_012690.gbk 173
# _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk 137

# 12 genomes, 2081 sequences

# taxa considered = 12 sequences = 2081 residues = 545474 MIN_BITSCORE_SIM = 16.9

# mask=KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_ (_algBDBH)

# running makeblastdb with /home/vinuesa/pangenomics/pIncAC_homologues/Salmonella_enterica_33676_pIncAC.gbk.fasta

# running makeblastdb with /home/vinuesa/pangenomics/pIncAC_homologues/Salmonella_enterica_YU39_pIncAC.gbk.fasta

# ...

# running BLAST searches ...
# done
# ...

# concatenating and sorting BLAST/DIAMOND results...
# sorting _Salmonella_enterica_33676_pIncAC.gbk results (0.097MB)
# sorting _Salmonella_enterica_YU39_pIncAC.gbk results (0.091MB)
# sorting _Salmonella_enterica_pAM04528.gbk results (0.1MB)
# clustering inparalogues in _Escherichia_coli_plasmid_peH4H_NC_012690.gbk
# ...
# done

# parsing blast result! (/home/vinuesa/cursos/pangenomics_PDCBq_Nov20/test_sequences/pIncAC_homologues/tmp/all.blast , 1.1MB)
# parsing file finished

# creating indexes, this might take some time (lines=2.25e+04) ...

# construct_taxa_indexes: number of taxa found = 12
# number of file addresses/BLAST queries = 2.1e+03

# clustering orthologous sequences

# clustering inparalogues in _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk (reference)
# 0 sequences

# clustering inparalogues in Salmonella_enterica_33676_pIncAC.gbk
# 4 sequences

# finding BDBHs between _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk and Salmonella_enterica_33676_pIncAC.gbk
# 59 sequences

# clustering inparalogues in Salmonella_enterica_YU39_pIncAC.gbk
# 3 sequences

# finding BDBHs between _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk and Salmonella_enterica_YU39_pIncAC.gbk
# 84 sequences

# ...

# looking for valid ORF clusters (n_of_taxa=12)...


# number_of_clusters = 33
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_

# runtime: 218 wallclock secs ( 4.40 usr  0.75 sys + 174.83 cusr 17.86 csys = 197.84 CPU)
# RAM use: 33.7 MB

The results of the previous and following get_homologue.pl runs can be found in the pIncAC_homologues/ directory, which includes the all-against-all blast results.

ls -1d *

pIncAC
pIncAC_homologues

We will explore its contents later on.

Let’s now compute the pan-genome clusters with the COGtriangles and OMCL algorithms, which will use the blastp results already available in pIncAC_homologues/ from the previous BDBH run.

3.2.3 Clustering sequences with the COGtriangles algorithm (\(-G\))

Run with \(-t\ 0\) to compute clusters of all sizes, which is required for the pan-genome analysis.

  • get_homologues.pl -d pIncAC -G -t 0
get_homologues.pl -d pIncAC -G -t 0  # COGtriangles, computing clusters of all sizes (-t 0)
                                     # finds 408 clusters in 14 wallclock secs, as it recycles
                                     # the blast results of the previous run.
                                     
                                     
# version 26022020
# results_directory=/home/vinuesa/pangenomics/pIncAC_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=100 KEEPSCNDHSPS=1
# diamond job:0

# checking input files...
# Salmonella_enterica_33676_pIncAC.gbk 185
# Salmonella_enterica_YU39_pIncAC.gbk 178
# Salmonella_enterica_pAM04528.gbk 180
# Salmonella_enterica_sv_Kentucky.gbk 164
# _Aeromonas_hydrophila_plasmid_pRA1_NC_012885.gbk 158
# _Escherichia_coli_UMNK88_plasmid_pUMNK88_NC_017645.gbk 173
# _Escherichia_coli_plasmid_pAPEC1990_61_NC_019066.gbk 172
# _Escherichia_coli_plasmid_pAR060302_NC_012692.gbk 189
# _Escherichia_coli_plasmid_pNDM-1_Dok01_NC_018994.gbk 224
# _Escherichia_coli_plasmid_pPG010208_NC_019065.gbk 148
# _Escherichia_coli_plasmid_peH4H_NC_012690.gbk 173
# _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk 137

# 12 genomes, 2081 sequences

# taxa considered = 12 sequences = 2081 residues = 545474 MIN_BITSCORE_SIM = 16.9

# mask=KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_ (_algCOG)

# skipped genome parsing (pIncAC_homologues/tmp/selected.genomes)


# running BLAST searches ...
# done


# parsing file (COG)
# parsing file (COG) finished

# creating indexes, this might take some time (lines=2.25e+04) ...

# construct_taxa_indexes: number of taxa found = 12
# number of file addresses/BLAST queries = 2.1e+03

# clustering orthologous sequences
# checking lineage-specific expansions
# making COGs
# prunning COGs
# done

# add_unmatched_singletons : 0 sequences, 0 taxa


# looking for valid ORF clusters (n_of_taxa=0)...


# number_of_clusters = 408
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_

# runtime: 15 wallclock secs ( 0.81 usr  0.14 sys +  2.46 cusr  0.58 csys =  3.99 CPU)
# RAM use: 28.3 MB                             

3.2.4 Clustering sequences with the OMCL algorithm (\(-M\))and compute pan-genome composition (\(-c\))

Run with \(-t\ 0\) to compute clusters of all sizes, which is required for the pan-genome analysis

Option \(-c\) for generating pan-genome composition tables to be used for plotting the core decay and pan-genome growth curveas as a function of increasing number of genomes.

  • get_homologues.pl -d pIncAC/ -M -t 0 -c
get_homologues.pl -d pIncAC/ -M -t 0 -c     # OMCL, finds 393 clusters in 5 wallclock secs.

# ...
# identifying inparalogs in _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk
# 0 sequences

# running MCL (inflation=1.5) ...
# running MCL finished

# find_OMCL_clusters: parsing clusters (/home/vinuesa/pangenomics/pIncAC_homologues/tmp/all_ortho.mcl)


# adding _Klebsiella_pneumoniae_plasmid_pNDM-KN_NC_019153.gbk: core=32 pan=336

# pan-genome (number of genes, can be plotted with plot_pancore_matrix.pl)
# file=pIncAC_homologues/pan_genome_algOMCL.tab
genomes mean    stddev  |   samples
0   167 22  |   172 170 183 137 164 217 170 169 137 154 
1   215 22  |   215 206 184 203 239 262 229 208 200 201 
2   242 15  |   246 245 236 247 258 269 242 243 216 220 
3   273 22  |   274 281 292 282 301 288 270 262 217 263 
4   291 21  |   295 322 294 288 301 312 307 268 251 268 
5   307 15  |   300 329 302 307 303 317 315 315 270 309 
6   316 17  |   302 335 307 307 303 336 334 323 282 330 
7   317 18  |   303 336 307 312 303 340 335 324 282 331 
8   324 17  |   340 336 317 320 309 341 335 324 282 331 
9   330 10  |   343 337 336 321 312 341 335 331 314 331 
10  334 9   |   343 345 336 321 331 341 343 332 317 331 
11  341 4   |   348 346 339 340 338 341 344 340 335 336 

# core-genome (number of genes, can be plotted with plot_pancore_matrix.pl)
# file=pIncAC_homologues/core_genome_algOMCL.tab
genomes mean    stddev  |   samples
0   167 22  |   172 170 183 137 164 217 170 169 137 154 
1   112 22  |   123 106 165 97  134 116 89  90  94  109 
2   93  11  |   90  83  105 94  117 81  83  88  92  95  
3   75  8   |   74  62  83  81  74  74  62  72  91  75  
4   65  13  |   55  58  78  57  72  53  43  69  89  74  
5   57  13  |   53  57  75  49  69  53  35  42  69  72  
6   49  10  |   53  42  54  49  61  41  35  34  66  51  
7   44  8   |   53  35  46  43  61  40  35  34  41  51  
8   42  6   |   51  34  44  41  45  40  35  34  40  51  
9   38  5   |   50  34  33  41  40  39  34  34  35  43  
10  36  4   |   42  32  33  40  32  39  32  34  34  42  
11  32  0   |   32  32  32  32  32  32  32  32  32  32  


# add_unmatched_singletons : 117 sequences, 12 taxa


# looking for valid ORF clusters (n_of_taxa=0)...


# number_of_clusters = 393
# cluster_list = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_.cluster_list
# cluster_directory = pIncAC_homologues/KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_

# runtime:  3 wallclock secs ( 1.82 usr  0.13 sys +  1.79 cusr  0.26 csys =  4.00 CPU)
# RAM use: 30.3 MB

3.2.4.1 Plotting core- and pan-genome decay/growth curves with plot_pancore_matrix.pl

Remember that for the OMCL clustering we ran the following code:

get_homologues.pl -d pIncAC/ -M -t 0 -c

The \(-c\) flag tells get_homologues.pl to compute the core- and pan-genome composition data by randomly sampling 10 times pairs of genomes to compute the observed numbers of common and new genes. The results are written to the core_genome_algOMCL.tab and pan_genome_algOMCL.tab files, found in the pIncAC_homologues results directory. These tables are used as input for plot_pancore_matrix.pl, which renders the graphics and fits different functions to the data.

  • The plot_pancore_matrix.pl options
    • plot_pancore_matrix.pl -h
[options]: 
-h   this message
-i   input .tab data file                   (required, generated by get_homologues.pl)
-f   type of fit [pan|core_Tettelin|core_Willenbrock|core_both]
                                            (default: core_Tettelin, PubMed:16172379,18088402)
-F   font scale for fitted formulas         (optional, default=0.8)
-a   save snapshots for animations in dir   (optional, example: -a animation)
  • Let’s cd into pIncAC_homologues/, locate the core_genome_algOMCL.tab and pan_genome_algOMCL.tab files and compute the graphs with plot_pancore_matrix.pl
cd pIncAC_homologues/

# lests save the absolute path to this directory in the blast_dir variable
blast_dir=$(pwd)

# these are the relevant input files: core_genome_algOMCL.tab and pan_genome_algOMCL.tab
ls *.tab

# plot core-decay curves (Tettelin & Willenbrock fits)
plot_pancore_matrix.pl -i core_genome_algOMCL.tab -f core_both 

# plot pan-genome growth curves (Tettelin's exponential fit)
plot_pancore_matrix.pl -i pan_genome_algOMCL.tab -f pan

Figure 3. The core-genome decay curves for 12 pIncA/C plasmids

Figure 4. The pan-genome growth curve for 12 pIncA/C plasmids

Based on the output shown above, what type of pan-genome (closed or open) do you think best represents this collection of pIncA/C plasmids?

3.3 Configuring a batch script to run multiple get_homologues.pl analyses in non-interactive mode for large datasets

Before exploring the contents of the get_homologues.pl runs stored in the [pIncAC]_homologues directory, a brief note on how to setup multiple sequential get_homologues.pl runs in non-interactive mode. This is the best option to run large datasets which may take hours or days to finish.

This is easy. just type the get_homologues.pl invocations that you would like to run into a text file (\(script\)), which i have named run_get_hom.batch holding the code shown below:

  • Contents of the run_get_hom.batch \(script\)
    get_homologues.pl -d pIncAC -t 12 -e -n 2 &> log.BDBH
    get_homologues.pl -d pIncAC -G -t 0 &> log.COG
    get_homologues.pl -d pIncAC/ -M -t 0 -c &> log.OMCL
    

The run_get_hom.batch \(script\) would run exactly the same analyses that we have run interactively, but in a non-interactive mode.

Note that the output that each get_homologues.pl invocation would normally print to STDOUT (the screen) is redirected to a file called log.[ALGORITHM], making use of the “&>” syntax which instructs to redirect STDOUT and STDERR to the file indicated after the symbols.

3.3.1 Running a batch command \(script\) with \(nohup\) in the background

Now we can call the batch script in the following manner:

  • nohup bash run_get_hom.batch &> /dev/null &

The \(nohup\) (no hang-up) command combined with the strange-looking “&> /dev/null &” tells the server to run the script in the background and to make the run independent of the login session. This means that after issuing the command, the script runs detached from our terminal. We can shutdown our computer and the process will continue executing on the server.

After some time, we can check the results of the runs by listing the files and directories as shown:
ls -1
log.BDBH
log.COG
log.OMCL
pIncAC
pIncAC_homologues
run_get_hom.batch

Another useful command to know is tail -f log.BDBH, which can be called on any of the logfiles while they are being written to disk, showing their last 10 lines as they’re appended to the tail of the file in real time. Nice!

3.4 Explore the results of get_homologues.pl runs stored in the [pIncAC]_homologues directory

Al previously mentioned, GET_HOMOLOGUES writes the results of all the previous runs in the **[pIncAC]_homologues/ directory**, stored in the \(\$blast\_dir\) variable.

Therein you will find the following files and directories

  • Regular files
    • genome-specific CDSs: .gbk.fna
    • genome-specific proteomes: .gbk.faa
    • genome-specific proteomes wit simplified header for blastp analyses: .gbk.fasta
    • genome-specific blastp databases: .gbk.fasta.phr .gbk.fasta.pin .gbk.fasta.psq
    • pair-wise (all-against-all) blastp results as gzip compressed files: genome1.gbk.fasta_genome2.gbk.fasta.blast.gz
    • pan-genome composition tables (when using option \(-c\): core_genome_algOMCL.tab pan_genome_algOMCL.tab
  • Directories
    • tmp: which holds diverse log-files, combined and sorted blast-tables, genome indices …
    • run specific directories like KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_, holding the core and|or pan-genome clusters at the DNA (.fna) and protein (.faa) levels

The run-specific output directories are named using the following components to make them unique for each run configuration:

BuchaphCc_f0_alltaxa_algBDBH_e0_
|         |  |       |       |   
|         |  |       |       -e option was not used (inparalogues are in)
|         |  |       the clustering algorithm is BDBD (default)
|         |  all clusters contain at least 1 sequence from each taxa (default -t behavior)
|         -f option not used (no length filtering)
reference proteome

Let’s explore the results directory pIncAC_homologues/ using basic \(Shell\) commands:

# assumes that we are in $blast_dir (the results directory pIncAC_homologues/)
cd $blast_dir

# list
ls | less

ls *faa *fna *fasta
ls *faa *fna *fasta | wc

head -5 Salmonella_enterica_YU39_pIncAC.gbk.f*a

ls Salmonella_enterica_YU39_pIncAC.gbk.f*a.*

# find the subdirectories holding the run-specific clusters of homologous sequences (gene families)
find . -type d

# exlore results (clusters) in the KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_
#   which hods the BDBH results for the 12 genomes (-t 12), without inparalogues (-e)
ls

ls *fna | wc -l
ls *faa | wc -l

grep -c '>' *fna
grep -c '>' *faa

3.5 Computing a consensus core-genome from the BDBH, COGtriangles and OMCL clustering solutions using compare_clusters.pl

GET_HOMOLOGUES is unique among available pangenomics software in many respects. One of them is its capacity to compute consensus core-genome clusters from the solutions generated by the three clustering algorithms it implements, and consensus pan-genome clusters from COGtriangles and OMCL clustering results, as demonstrated below.

  • The compare_clusters.pl options
    • compare_clusters.pl -h
[options]: 
-h   this message
-d   comma-separated names of cluster directories                  (min 1 required, example -d dir1,dir2)
-o   output directory                                              (required, intersection cluster files are copied there)
-n   use nucleotide sequence .fna clusters                         (optional, uses .faa protein sequences by default)
-r   take first cluster dir as reference set, which might contain  (optional, by default cluster dirs are expected
     a single representative sequence per cluster                   to be derived from the same taxa; overrides -t,-I)
-s   use only clusters with syntenic genes                         (optional, parses neighbours in FASTA headers)
-t   use only clusters with single-copy orthologues from taxa >= t (optional, default takes all intersection clusters; example -t 10)
-I   produce clusters with single-copy seqs from ALL taxa in file  (optional, example -I include_list; overrides -t)
-m   produce intersection pangenome matrices                       (optional, ideally expects cluster directories generated
                                                                    with get_homologues.pl -t 0)
-x   produce cluster report in OrthoXML format                     (optional)
-T   produce parsimony-based pangenomic tree
  • Compute consensus core-genome clusters using compare_clusters.pl
# Compute consensus core-genome clusters using compare_clusters.pl of the GET_HOMOLOGUES suite.
# Note that we will run the compare_clusters.pl script twice, once with the -n flag 
# (to compute the consesus clusters at the DNA level, producing *.fna files)
# and a second instance without the flag, to get the protein clusters (*.faa files)

# assumes we are in $blast_dir
cd $blast_dir

# find the subdirectories holding the run-specific results
find . -type d

# Compute the core genome; use -t 12 to keep only clusters containing a copy of the gene in each of the 12 genomes analyzed
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12 -n

# repeat command, without -n, to compute the clusters at the protein level (*faa files)
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\
./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12

# Lets explore the core_BCM directory
cd core_BCM

ls *fna | wc -l
ls *faa | wc -l

grep -c '>' *faa


# confirm that all 31 clusters contain only one sequence from each plasmid/proteome
grep '>' *faa | cut -d\| -f2,3 | sort | uniq -c

# explore the contents of the *venn_t12.txt files
for f in *txt; do echo "#$f"; cat $f; echo '---------------------'; done

Figure 5. Venn diagram depicting the consensus core-genome of 31 genes for the 12 input pIncA/C sequences.