gVolante - Completeness Assessment of Genome/Transcriptome Assembly

Frequently Asked Questions

About Database

Q: What is the criterion for choosing one out of multiple assembly versions for a particular species?

A: In general, we chose the latest assembly version available in NCBI GenBank (as of November 2016). For some species, multiple assembly versions are listed so that completeness scores can be compared between those versions.

About Completeness Assessment

Q: Can I fully trust the assessment results?

A: The completeness assessment you can perform on this server covers only the protein-coding gene space, which is just one aspect of the composition of your sequence set. To complement the assessment, this web server also provides basic statistics on sequence lengths (e.g., N50 scaffold length). For more details on the pitfalls of completeness assessment, please refer to the commentary by Veeckman et al. (The Plant Cell, Vol. 28: 1759-68. 2016).

Q: What is CVG?

A: CVG (Core Vertebrate Genes) is our original reference gene set of 233 ortholog groups to be fed into a completeness assessment program (Hara et al., 2015). Every ortholog group in CVG consists of one-to-one orthologs (with neither paralogs generated in the vertebrate lineage nor gene loss) from the included vertebrate genomes, which cover Chondrichthyes and Cyclostomata. Our pilot assessments demonstrated that evaluations referring to CVG achieved higher accuracy and resolution than those referring to other reference gene sets. The CVG data set is available online on our laboratory's web site.

Q: What do the categories 'Complete', 'Partial', 'Duplicate' and 'Missing' mean in the completeness assessment results?

A: For details of these metrics, please refer to the following definitions, taken from the original articles (listed on the REFERENCE page).

CEGMA

'Complete' refers to those predicted proteins in the set of 248 CEGs that when aligned to the HMM for the KOG for that protein-family, give an alignment length that is 70% of the protein length. I.e. if CEGMA produces a 100 amino acid protein, and the alignment length to the HMM to which that protein should belong is 110, then we would say that the protein is 'complete' (91% aligned). If a protein is not complete, but if it still exceeds a pre-computed minimum alignment score, then we call the protein 'partial'.

BUSCO

Classification of BUSCO-matching genes that meet the ‘expected-score’ cut-off employs the protein length distribution of each BUSCO to determine whether the ortholog is ‘Complete’ or ‘Fragmented’. Orthologs are considered to be ‘Complete’ if the length of their aligned sequence is within two standard deviations (2σ) of the BUSCO group’s mean length (i.e. 95% expectation), otherwise they are classified as ‘Fragmented’ recoveries (Figure S1). A BUSCO is classified as ‘Duplicated’ when multiple BUSCO-matching genes meet both the ‘expected-score’ and the ‘expected-length’ cut-offs, i.e. multiple copies of full-length orthologs are found in the gene set being assessed. Lastly, any BUSCO without a BUSCO-matching gene that meets the ‘expected-score’ cut-off is classified as ‘Missing’.
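
The BUSCO rules quoted above can be illustrated with a minimal Python sketch. This is not BUSCO's own code: the function name classify_busco and its arguments are hypothetical, the 2σ window is a literal reading of the quoted definition, and every length passed in is assumed to belong to a hit that already met the ‘expected-score’ cut-off.

```python
from typing import List


def classify_busco(hit_lengths: List[float], mean_length: float, sd_length: float) -> str:
    """Classify one BUSCO group from the aligned lengths of its hits.

    Assumption: every entry in hit_lengths already met the
    'expected-score' cut-off; hits below that score are not listed.
    """
    if not hit_lengths:
        # No hit reached the expected score.
        return "Missing"

    # Hits whose aligned length lies within two standard deviations
    # of the group's mean length count as full-length.
    full_length = [
        n for n in hit_lengths if abs(n - mean_length) <= 2 * sd_length
    ]

    if len(full_length) > 1:
        return "Duplicated"
    if len(full_length) == 1:
        return "Complete"
    # Hits exist, but none has the expected length.
    return "Fragmented"


if __name__ == "__main__":
    # Hypothetical group with mean length 300 and standard deviation 20.
    for hits in ([], [295], [295, 310], [150]):
        print(hits, classify_busco(hits, mean_length=300, sd_length=20))
```

In this sketch a group with one full-length hit and one short hit is still reported as ‘Complete’, because only full-length copies count toward ‘Duplicated’.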

Q: My sequence file includes original information that led to currently unpublished results, so I am hesitant to upload it. How will the uploaded files be handled?

A: The submitted file and the e-mail address that you entered will be erased from the server immediately after the analysis has finished or has failed, and they will not be used for any purpose other than the requested completeness assessment. Moreover, SSL encryption ensures that files transferred from your PC to the gVolante server are always encrypted.

Q: My analysis has not finished yet after one week. Is anything wrong?

A: The computational resources of the web server are limited, and the gene prediction that runs inside the program is particularly demanding in processing power. In addition, the computation time increases when you submit a sequence file with high redundancy and/or large size.

Q: I found that one sequence per ortholog group has a link to the aLeaves web server on the 'Ortholog Detail' page. How was this sequence chosen?

A: When using BUSCO, the sequence is the hypothetical ancestral sequence provided by BUSCO for each ortholog. When using CEGMA, the human sequence is selected as the representative of each ortholog group.

Q: The result of my completeness assessment shows several partial and missing orthologs. Is this result always accurate?

A: Not necessarily. Ideally, for each species, one should choose an appropriate set of parameters for gene prediction, which can largely influence completeness scores. However, our collection of completeness scores was based on analyses under a uniform assessment criterion.

Q: I noticed that I am asked to enter a cut-off length on the "Analysis" page. What impact does this cut-off setting have on the analysis results?

A: It affects the length-based metrics that are computed. In general, length-based metrics such as N50 length are strongly influenced by the length cut-off; for example, when abundant short sequences are excluded, the N50 length and the mean length increase substantially.
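
The effect of the cut-off can be seen in a small Python sketch. It is an illustration only, not the code that runs on the server: the function names, the toy sequence lengths, and the choice to keep sequences whose length is equal to or greater than the cut-off are all assumptions.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Return the N50: the length of the sequence at which the running
    total, taken from longest to shortest, reaches half of the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def length_stats(lengths: List[int], cutoff: int = 0) -> Dict[str, float]:
    """Compute count, mean and N50 after discarding short sequences.

    Assumption: sequences with length >= cutoff are kept.
    """
    kept = [n for n in lengths if n >= cutoff]
    mean = sum(kept) / len(kept) if kept else 0.0
    return {"count": len(kept), "mean": mean, "n50": n50(kept)}


if __name__ == "__main__":
    # Toy data: three long scaffolds plus fifty 100-bp fragments.
    lengths = [5000, 3000, 2000] + [100] * 50
    print(length_stats(lengths))              # all sequences
    print(length_stats(lengths, cutoff=500))  # short fragments excluded
```

With these toy lengths, the N50 is 3000 when all sequences are counted and 5000 once the 100-bp fragments are excluded, while the mean length rises from about 283 to about 3333.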

Q: What are the parameter values for BUSCO and CEGMA?

A: For BUSCO, the default values are applied. In CEGMA analysis, the following parameters are used for gene prediction from genome sequences.

Max intron length: allowable distance between separate exons
Gene flanks: length of genomic regions in which flanking exons are searched for

Because the genes of elasmobranchs (sharks, rays, and skates) tend to harbor relatively long introns, we offer a customized parameter set ‘Elasmo’ for elasmobranchs, in addition to the parameter sets introduced originally as options for CEGMA (‘Mammal’, ‘Vertebrate’, and ‘Non-vertebrate’) that have been modified by us (see below for details).

Gene Set	Taxon	Max intron length	Gene flanks
CVG (Core Vertebrate Genes, optimized)	Mammal	100,000	10,000
	Elasmo	200,000	10,000
	Vertebrate	50,000	10,000
CEG (Core Eukaryotic Genes, CEGMA defaults)	Mammal	40,000	5,000
	Elasmo	200,000	10,000
	Vertebrate	20,000	5,000
	Non-vertebrate	5,000	2,000

Q: How long will my results be stored in gVolante?

A: Results are stored for four weeks after submission. Within that period, you can download the assessment data from the YOUR RESULTS page.