PGAweb Manual

What's PGAweb?

PGAweb is a web server for bacterial pan-genome analysis. It integrates our previously published stand-alone tools PGAP, PanGP and PGAP-X into one server and simplifies pan-genome analysis.

What's the difference between PGAP and PGAP-X?

The main difference between these two pipelines is the algorithm used to identify orthologous genes. PGAP provides two methods, the GeneFamily method and the MultiParanoid method, while PGAP-X introduces a novel algorithm for orthologous cluster identification. This algorithm can also distinguish paralogs by their genomic location in the whole genome alignment. Other differences and features are as follows:

  • PGAP supports the following five analytical modules:
    1. Ortholog cluster identification across multiple genome datasets
    2. Pan-genome analysis for given input strains
    3. Functional gene variation identification and SNP calling across given strains
    4. Evolutionary analysis based on pan-genome and SNP data
    5. Ortholog cluster function analysis
  • PGAP-X is a platform for microbial comparative genomics analysis and visualization, with the following modules:
    1. Genome Alignment for given strains
    2. Orthologous gene identification for given strains
    3. Genome variation analysis
    4. Pan-genome analysis

Input data format

Input data for PGAP pipeline (Complete genomes or draft genomes)

Four input file formats are supported.

  1. NCBI new data format

    For NCBI new format data, each strain requires three files with the extensions _cds_from_genomic.fna.gz, _protein.faa.gz and _feature_table.txt.gz. The three files must share the same prefix. Downloadable samples of this format, e.g.:




  2. NCBI old data format

    For NCBI old format data, each strain requires three files with the extensions .faa, .ffn and .ptt. The three files must share the same prefix. Downloadable samples of this format, e.g.:




  3. PGAP raw data format

    For PGAP raw data, each strain requires three files with the extensions .nuc, .pep and .function. The three files must share the same prefix. More information about PGAP raw data is available in the manual of the PGAP pipeline. Downloadable sample of this format, e.g.:




  4. NCBI GenBank(full) format

    For NCBI GenBank (full) format data, each strain requires only one file with the .gb extension. Downloadable sample of this format, e.g.:
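The four formats above can be told apart by file suffix alone. The following is a minimal sketch of that idea, not PGAweb's actual implementation: the function names `split_name` and `group_by_strain` and the format labels are hypothetical, and only the suffixes come from the descriptions above.

```python
from collections import defaultdict

# Suffixes are taken from the format descriptions above; the labels are hypothetical.
FORMAT_SUFFIXES = {
    "ncbi_new": ("_cds_from_genomic.fna.gz", "_protein.faa.gz", "_feature_table.txt.gz"),
    "ncbi_old": (".faa", ".ffn", ".ptt"),
    "pgap_raw": (".nuc", ".pep", ".function"),
    "genbank": (".gb",),
}


def split_name(filename):
    """Return (prefix, format label, suffix), or None if the suffix is unknown."""
    for label, suffixes in FORMAT_SUFFIXES.items():
        for suffix in suffixes:
            if filename.endswith(suffix):
                return filename[: -len(suffix)], label, suffix
    return None


def group_by_strain(filenames):
    """Map (prefix, format label) to True when all required files are present."""
    found = defaultdict(set)
    for name in filenames:
        parts = split_name(name)
        if parts is not None:
            prefix, label, suffix = parts
            found[(prefix, label)].add(suffix)
    return {
        key: suffixes == set(FORMAT_SUFFIXES[key[1]])
        for key, suffixes in found.items()
    }
```

Calling `group_by_strain` on the list of uploaded file names gives one entry per strain and format, so an incomplete set (for example a .faa file without its .ptt file) can be spotted before submission.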

All of these input datasets can be easily downloaded from NCBI, and multiple files with different formats can be uploaded in one job submission. PGAweb recognizes the four file types by their extensions. When users choose the PGAP pipeline, it also checks whether the length of each nucleotide sequence equals three times the length of the corresponding amino acid sequence plus 3. All problematic data are listed on the page during upload. Get more details about problematic data.
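The length check can be reproduced locally before uploading. The sketch below only restates the rule given above (nucleotide length = 3 × amino acid length + 3); the function names and the dictionary layout are hypothetical, and the extra 3 nucleotides presumably correspond to the stop codon, which is an assumption rather than something stated here.

```python
def is_length_consistent(nucleotide_seq, protein_seq):
    """True when len(nucleotide) == 3 * len(protein) + 3, the rule stated above."""
    return len(nucleotide_seq) == 3 * len(protein_seq) + 3


def find_problematic(genes):
    """genes maps a gene ID to (nucleotide sequence, protein sequence).

    Returns the IDs whose sequence pair fails the length rule.
    """
    return [
        gene_id
        for gene_id, (nuc, pep) in genes.items()
        if not is_length_consistent(nuc, pep)
    ]
```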

Input data for PGAP-X pipeline (Complete genomes only)

Complete genome sequence files (.fna) and the corresponding genome annotation files (.ptt) are required, e.g.:



Example Data


Steps for PGAP

  • Method selection

    For ortholog cluster identification, two methods are available in this pipeline: the GeneFamily method (GF, the default) and the MultiParanoid method (MP).

  • Function selection

    Cluster analysis of homologous genes is required. The other functional modules can be chosen according to users’ needs.

    Homologous gene clustering, searching homologs or orthologs among multiple genomes.

    Pan-genome analysis, finding out the relation between pan-genome size and genome number.

    Homologous clusters variation analysis, detecting mutations and indels in each gene cluster.

    Evolution analysis, constructing phylogenetic trees with different methods (NJ, UPGMA or ML) and data (gene distance matrix or core gene variation data).

    Function analysis, annotating each cluster with function description and COG classification and calculating the function distribution of core, dispensable and unique clusters according to COG classification.

    Note that when the uploaded data contain NCBI new format data, function analysis is disabled because COG annotation is missing from the feature_table file. If you need the function analysis module, please convert the NCBI new format data to PGAP raw data, add COG annotation to the .function file and upload again.

  • Parameter settings

    The following parameters adjust the cutoffs for homolog identification.

    GF method of PGAP

    Score, minimum score in blastp.

    Evalue, maximum E-value in blastp.

    Coverage, minimum alignment coverage for two homologous proteins.

    Identity, minimum alignment identity for two homologous proteins.

    Bootstrap, number of bootstrap replicates for the phylogenetic tree.

    MP method of PGAP

    Score, minimum score in blastp.

    Local, minimum local alignment overlap.

    Global, minimum global alignment overlap.

    Bootstrap, number of bootstrap replicates for the phylogenetic tree.
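To illustrate how the four GF cutoffs listed above act together, the sketch below keeps a blastp hit only when it passes all of them. It is an illustration under assumptions, not the PGAP code: the `Hit` class and its field names are hypothetical, coverage and identity are assumed to be fractions between 0 and 1, and how PGAP computes them is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One blastp hit between two proteins; the field names are hypothetical."""
    score: float     # blastp score
    evalue: float    # blastp E-value
    coverage: float  # aligned fraction of the protein, assumed 0..1
    identity: float  # identical fraction of the alignment, assumed 0..1


def passes_gf_cutoffs(hit, min_score, max_evalue, min_coverage, min_identity):
    """Keep a hit only if it satisfies all four GF cutoffs."""
    return (
        hit.score >= min_score
        and hit.evalue <= max_evalue
        and hit.coverage >= min_coverage
        and hit.identity >= min_identity
    )
```

Raising Score, Coverage or Identity, or lowering Evalue, makes the filter stricter, so fewer protein pairs are treated as homologs.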

Steps for PGAP-X

  • Function selection

    Genome alignment is the basis of the whole pipeline and is required. The other functional modules can be chosen according to users’ needs.

    Genome alignment, performing whole genome sequences alignment and visualizing genome structure.

    Orthologs analysis, clustering orthologous genes and visualizing gene distribution by conservation level.

    Variation analysis, performing genetic variation analysis based on whole genome alignment.

    Pan-genome analysis, indicating whether the pan-genome is open or closed and the diversity of gene content in the bacterial population.

  • Parameter settings

    Orthologs clusters can be identified based on nucleotide sequences or protein sequences.

    Coverage, minimum alignment coverage for two homologous proteins

    Identity, minimum alignment identity for two homologous proteins

    Parameters for Variation Analysis

    Variation frequency, the frequency of variations in 1 kb regions.

    Variation number, the number of variations in 1 kb regions.

    Reference, a reference genome is required if users select the pairwise substitution region analysis model.

    Select a query for visualization, select one representative genome for visualization in pairwise substitution region analysis.
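The two variation parameters are cutoffs applied to 1 kb regions: a region is reported as a high substitution region when it contains at least the given number of variation sites, or when its variation frequency reaches the given value. The sketch below shows one way to apply such cutoffs. It assumes fixed, non-overlapping 1 kb windows and 0-based coordinates, which may differ from how PGAP-X defines its regions, and the function name is hypothetical.

```python
from collections import Counter


def high_substitution_windows(positions, min_number=None, min_frequency=None, window=1000):
    """Return the start coordinates of windows that pass either cutoff.

    positions: 0-based coordinates of variation sites on one genome.
    A window passes when its site count >= min_number, or when
    count / window >= min_frequency. Fixed non-overlapping windows
    are an assumption of this sketch.
    """
    counts = Counter(pos // window for pos in positions)
    selected = []
    for index in sorted(counts):
        count = counts[index]
        by_number = min_number is not None and count >= min_number
        by_frequency = min_frequency is not None and count / window >= min_frequency
        if by_number or by_frequency:
            selected.append(index * window)
    return selected
```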


All analysis results can be downloaded from the task result page.

Outputs of PGAP

  1. Error message

    0.error.message: if problems are found before the analysis starts, the PGAP pipeline is terminated and all error messages are reported in this file.

    Problematic_data.txt: the pipeline checks the input files for missing COG annotation and checks whether the length of each nucleotide sequence equals three times the length of the corresponding amino acid sequence plus 3. Gene sequences and their corresponding protein sequences that do not meet these criteria are removed from the input files and recorded in the problematic data file. The check results are shown on the submission page; users can correct the data before submitting, or ignore the warnings and submit anyway.

  2. Homologous gene cluster results

    1.Orthologs_Cluster.txt: the detailed cluster list. If a strain has no gene in a cluster, it is marked with “-”.

    1.Gene_Distribution_By_Conservation.txt: the number of genes in each strain, grouped by cluster conservation level.
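Because a missing gene is marked with “-”, the conservation level of a cluster can be read directly from its row. The sketch below classifies a cluster as core, dispensable or unique under the usual convention (present in all strains, in some strains, or in exactly one strain); that convention and the function names are assumptions, and the row is taken as a plain list with one entry per strain rather than parsed from the file.

```python
from collections import Counter


def classify_cluster(row):
    """row: one entry per strain, "-" when the strain has no gene in the cluster."""
    present = sum(1 for cell in row if cell != "-")
    if present == 0:
        raise ValueError("a cluster must contain at least one gene")
    if present == len(row):
        return "core"
    if present == 1:
        return "unique"
    return "dispensable"


def count_by_category(rows):
    """Count how many clusters fall into each category."""
    return dict(Counter(classify_cluster(row) for row in rows))
```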

  3. Pan-genome analysis results

    2.PanGenome.Profile.txt: the fitted function models of the pan-genome and core genome.

    2.PanGenome.Data.txt: the intermediate data used for fitting the pan-genome function.

  4. Genome variation and SNP calling

    3.CDS.variation.txt: variation details in CDS regions are listed in this file. There are eight columns (an illustrative figure is shown below).

    The 1st column is the Cluster ID, which is consistent with the ID in 1.Orthologs_Cluster.txt.

    The 2nd column is the cluster conservation level of current cluster.

    The 3rd column is the gene number of current cluster.

    The 4th column is the variation position, which is obtained from the alignment of the protein sequences in this cluster. For InDel events, the locus is an integer. For synonymous and nonsynonymous mutations, the locus is a floating-point number: the integer part gives the position of the amino acid in the protein sequence alignment, and the decimal part gives the position within the codon. For example, 53.3 means that the variation is located at the 3rd codon position of the 53rd amino acid.

    The 5th column shows the amino acid types on current position.

    The 6th column shows the nucleotide types on current position. For InDel, only “-” will be given.

    The 7th column shows all gene nucleotide profiles in current position (for InDel, amino acid will be listed). The order of nucleotide/amino acid is consistent with the gene order in current cluster in 1.Orthologs_Cluster.txt.

    The 8th column shows the variation type (InDel, synonymous and nonsynonymous).

    3.CDS.variation.for.evolution.txt: a temporary DNA sequence alignment file in phylip format. This file records the variation in the core CDS region. Another difference between 3.Core.CDS.variation.txt and 3.CDS.variation.txt is that, if there is a variation in an amino acid, the corresponding three nucleotides are output in this file.

    3.CDS.variation.analysis.txt: this file stores a summary of the results in 3.CDS.variation.txt.
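The position encoding in the 4th column of 3.CDS.variation.txt can be decoded with a few lines of code. The sketch below is an illustration with a hypothetical function name. It treats the value as text rather than as a float so that the digit after the decimal point is read exactly.

```python
def parse_variation_position(value):
    """Split a 4th-column value into (amino_acid_position, codon_position).

    "53.3" gives (53, 3). An integer value, used for InDel events,
    gives (position, None).
    """
    text = value.strip()
    if "." in text:
        amino_acid, codon = text.split(".", 1)
        return int(amino_acid), int(codon)
    return int(text), None
```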

  5. Evolutionary analysis based on pan-genome and SNP

    These files are generated when the evolution analysis function is selected. In the results section, all phylogenetic trees are calculated by Phylip and saved in files with the .tree suffix. Dynamic visual phylogenetic trees are displayed based on these text files.

  6. Orthologs clusters function analysis

    5.Orthologs_Cluster_Function.txt: COG classification results for each gene cluster.

    5.Orthologs_Whole_Cluster_COG_Distribution.txt: COG enrichment for all clusters.

    5.Orthologs_Core_Cluster_COG_Distribution.txt: COG enrichment for core gene clusters.

    5.Orthologs_Dispensable_Cluster_COG_Distribution.txt: COG enrichment for dispensable gene clusters.

    5.Orthologs_specifc_Cluster_COG_Distribution.txt: COG enrichment for strain-specific gene clusters.

Outputs of PGAP-X

  1. Genome Alignment

    The genome structure is visualized based on the genome alignment result. Homologous DNA fragments across strains are marked in the same color.

  2. Orthologs analysis

    Based on the result of the orthologs analysis, gene distributions are visualized on their genomes. Each row represents a genome. Genes whose orthologous gene clusters have the same conservation value are shown as blocks of the same color.

  3. Genetic Variation Analysis
    • Multiple genome substitution region analysis

      The filtered genomic or genic regions are displayed on the genome structure at genome scale.

    • Pairwise substitution region analysis

      For pairwise variation analysis, all variation sites of each genome are detected based on the whole genome alignment result. A region with no fewer than m substitution sites (m is the variation number), or a region whose substitution frequency is no lower than the filter condition, is identified as a high substitution region. All variation sites between pairwise genomes are detected and reported in the output text files, but only the high substitution regions of the selected strain are displayed.

  4. Pan-genome Analysis

    In the pan-genome analysis module, the curves for pan-genome size and core gene size are displayed in the same window.
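The two curves plot pan-genome size and core gene size against the number of genomes. The sketch below shows how one such pair of curves can be derived from the gene clusters present in each genome, adding genomes in a single fixed order. This is a simplification for illustration: the function name is hypothetical, and how PGAP-X samples or averages over different genome orders is not described here.

```python
def pan_core_curve(strain_clusters):
    """strain_clusters: list of sets of cluster IDs, one set per genome, in the order added.

    Returns a list of (genome_count, pan_size, core_size), where pan_size is the
    number of clusters seen in any genome so far and core_size is the number of
    clusters shared by all genomes so far.
    """
    pan = set()
    core = None
    curve = []
    for count, clusters in enumerate(strain_clusters, start=1):
        pan |= clusters
        core = set(clusters) if core is None else core & clusters
        curve.append((count, len(pan), len(core)))
    return curve
```

As genomes are added, the pan-genome size can only grow or stay flat and the core size can only shrink or stay flat.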

Supported Browsers

Chrome, Firefox, Safari, Opera, IE 9+.

For a better user experience, IE versions below 9 are not supported; please upgrade to IE 9 or later.
