gVolante

Completeness Assessment of Genome/Transcriptome Sequences

Tutorial for completeness assessment of genome/transcriptome sequences

    STEP 1. Prepare your file
    Prepare a single multi-fasta file containing your genome or transcriptome assembly. Compressed multi-fasta files are also accepted (.gz, .tgz, .bz2, .tbz, .tar or .zip). For this tutorial, we have prepared a test_file that you can use for a trial run.
    Information about the test_file:
    - File format: zipped multi-fasta
    - Species: human
    - Sequence Type: peptide (amino acid) - selected from a comprehensive sequence set
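    Before uploading, it can help to confirm that the file really is a well-formed multi-fasta file. The following is a minimal sketch, not part of gVolante: the function name summarize_fasta and the in-memory example records are assumptions made for illustration only.

```python
import io


def summarize_fasta(handle):
    """Return (number of records, total residues) for a FASTA text stream.

    Hypothetical helper for a local sanity check; it is not a gVolante API.
    """
    n_records = 0
    total_length = 0
    for line_number, raw in enumerate(handle, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            n_records += 1  # every header line starts a new record
        elif n_records == 0:
            raise ValueError(
                f"line {line_number}: sequence data before the first '>' header"
            )
        else:
            total_length += len(line)
    if n_records == 0:
        raise ValueError("no FASTA records found")
    return n_records, total_length


# Made-up two-record example
example = io.StringIO(">seq1\nACGTAC\nGT\n>seq2\nTTGA\n")
print(summarize_fasta(example))  # prints (2, 12)
```

    For a real file, pass an opened text handle instead, for example open("assembly.fasta") for a hypothetical file name.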
    STEP 2. Upload the file
    We recommend compressing your fasta file before submitting it, because a slow upload often causes the submission to fail. After selecting the compressed fasta file, push the [UPLOAD FILE] button. DO NOT press the button twice: doing so restarts the upload and can cause problems.
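    One way to produce a .gz file locally is sketched below with Python's standard gzip module. The helper name gzip_file and the file name assembly.fasta are assumptions for illustration; any tool that writes one of the accepted compressed formats should work equally well.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(source, destination=None):
    """Write a gzip-compressed copy of `source` and return the new path.

    Hypothetical helper; by default the copy is named `<source>.gz`.
    """
    source = Path(source)
    if destination is None:
        destination = source.with_name(source.name + ".gz")
    destination = Path(destination)
    # Stream the bytes so that large assemblies are not loaded into memory
    with open(source, "rb") as src, gzip.open(destination, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return destination


if __name__ == "__main__":
    # Demonstration on a small throwaway file
    with tempfile.TemporaryDirectory() as tmp:
        fasta = Path(tmp) / "assembly.fasta"
        fasta.write_text(">seq1\nACGTACGT\n")
        print(gzip_file(fasta).name)  # prints assembly.fasta.gz
```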
    STEP 3. Enter your project information
    The 'E-mail address' field is optional, but we recommend filling it in: you will then receive the results by e-mail, even for a time-consuming analysis. The cut-off length is used in computing the N50 statistics. To compute them using all the sequences in the given file, enter '1'.
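    To illustrate how a cut-off length can change N50, here is a minimal sketch using the common definition of N50 (the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches half of the overall total). It assumes that sequences shorter than the cut-off are simply left out; the exact procedure used by the server may differ, and the function name and lengths are made up.

```python
def n50(lengths, cutoff=1):
    """Return N50 of `lengths`, ignoring sequences shorter than `cutoff`.

    Assumption for illustration: a sequence is kept when length >= cutoff,
    so cutoff=1 keeps every non-empty sequence.
    """
    kept = sorted((n for n in lengths if n >= cutoff), reverse=True)
    if not kept:
        raise ValueError("no sequences at or above the cut-off length")
    half = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half:
            return length


# Made-up sequence lengths
lengths = [800, 600, 400, 100, 50, 50]
print(n50(lengths))              # prints 600 (all sequences used)
print(n50(lengths, cutoff=500))  # prints 800 (only the two longest kept)
```

    In this example, raising the cut-off removes the short sequences from the total, so the reported N50 increases.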
    STEP 4. Choose an analysis pipeline and an ortholog set
    For assessing genome assemblies, we recommend CEGMA as the analysis pipeline, although it takes a long time (up to days) to finish. For a vertebrate sequence set, we recommend CVG as the ortholog set; it should give you more accurate completeness scores than other gene sets (see Hara et al., 2015, for more details). For an accurate analysis, also select the category that fits your species ("Mammal, Vertebrate, or other"). This information helps improve gene prediction on the submitted sequences.
    STEP 5. Start the analysis
    Push the [START YOUR ANALYSIS] button. After the submitted file has been validated, the job information page is shown and the server starts the analysis. If you did not enter an e-mail address, save the Job_ID or the hyperlink to the results before you leave the page.
    Approximate time required for an analysis when no other jobs are queued:
    - CEGMA on genome: 2-3 days
    - BUSCO v2 on genome: 1-2 days
    - BUSCO v2 on transcriptome: 1-2 hours
    - BUSCO v1 on genome: 2 hours
    - BUSCO v1 on transcriptome: 30 minutes
    If the submitted file includes many duplicate genes or sequences, processing time will be significantly longer.
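    To get a rough idea of redundancy before submitting, you could count exactly identical sequences in the file. The sketch below is an illustration only: the helper names are hypothetical, it detects only exact (case-insensitive) duplicates, and it says nothing about duplicate genes that differ in sequence.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_exact_duplicates(lines):
    """Count records whose sequence repeats one seen earlier in the file."""
    counts = Counter(seq.upper() for _, seq in read_fasta(lines))
    return sum(n - 1 for n in counts.values() if n > 1)


# Made-up records: 'b' repeats the sequence of 'a' in lower case
records = [">a", "ACGT", ">b", "acgt", ">c", "TTTT"]
print(count_exact_duplicates(records))  # prints 1
```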
    STEP 6. Check the results
    You can access the analysis results via the e-mail message or by checking the results page. There, gVolante reports the project information, completeness scores, and N50 sequence statistics. Completeness assessment results are classified into 'Complete', 'Duplication', 'Partial', and 'Missing'. For the definitions of these categories, please refer to the original articles introducing the individual programs. The 'Ortholog detail' page records, for each gene in the given set of reference genes, whether it was retrieved or missing. To further analyze the absence of a certain gene from a phylogenetic viewpoint, you can proceed to the aLeaves web server.
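    If you copy the per-gene classifications out of the results, a simple tally gives the share of each category. The sketch below assumes, purely for illustration, that every reference gene carries exactly one of the four labels; the actual relationship between the categories depends on the program used, the function name is hypothetical, and the labels in the example are made up.

```python
from collections import Counter

CATEGORIES = ("Complete", "Duplication", "Partial", "Missing")


def summarize_categories(labels):
    """Return {category: (count, percentage)} for a list of per-gene labels.

    Assumption: one label per reference gene, taken from CATEGORIES.
    """
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels given")
    return {c: (counts[c], 100.0 * counts[c] / total) for c in CATEGORIES}


# Made-up labels for ten reference genes
labels = ["Complete"] * 6 + ["Duplication"] + ["Partial"] * 2 + ["Missing"]
for category, (count, percent) in summarize_categories(labels).items():
    print(f"{category}: {count} ({percent:.1f}%)")
```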


    Core Vertebrate Genes (CVG)
    Optimizing and benchmarking de novo transcriptome sequencing: from library preparation to assembly evaluation.
    Hara Y, Tatsumi K, Yoshida M, Kajikawa E, Kiyonari H, Kuraku S.
    BMC Genomics. 2015. 16: 977.
    Assessing the gene space in draft genomes.
    Parra G, Bradnam K, Ning Z, Keane T, Korf I.
    Nucleic Acids Res. 2009. 37: 289-97.
    BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.
    Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM.
    Bioinformatics. 2015. 31: 3210-2.
    BUSCO applications from quality assessments to gene prediction and phylogenomics.
    Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, Kriventseva EV, Zdobnov EM.
    Mol Biol Evol. 2017. doi: 10.1093/molbev/msx319.