Protein coding gene prediction using QIAGEN CLC Genomics Workbench


QIAGEN Digital Insights

Ab initio gene finding is a central step in genome analysis, and it must account for the biology of the genome(s) under investigation to perform adequately. The relevant signals are manifold and include coding potential, hexamer distributions, and RNA polymerase-binding and spliceosome-binding sequences, all of which depend on GC content.

The GeneMark family of algorithms has been in continuous use for genome annotation, starting with the first complete genome (Haemophilus influenzae) sequenced in 1995. Currently, an algorithm of the GeneMark family is used by NCBI as part of its prokaryotic genome annotation pipeline. Two algorithms, MetaGeneMark and GeneMark-ES, are available as plugins in QIAGEN CLC Genomics Workbench and QIAGEN CLC Genomics Server. MetaGeneMark has proven to deliver accurate gene predictions in metagenomes. GeneMark-ES is an automatic ab initio gene prediction tool for compact eukaryotic genomes. Gene finding in whole genome-sequenced microbial genomes can also be performed using the “Find Prokaryotic Genes” tool of QIAGEN CLC Microbial Genomics Module.


The MetaGeneMark plugin represents a new release of the gene finding algorithm for metagenomic sequences. For each metagenomic contig, MetaGeneMark uses the GC content of each ORF in the contig to select sets of gene model parameters (1,2). For a given GC content value, the algorithm uses parameters that differ between the archaeal and bacterial domains. With this approach, the user does not have to select or adjust any parameters. The algorithm is fast; it can process 1 GB of metagenomic contigs on a single CPU in less than half an hour.
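
To illustrate the idea of keying model parameters on GC content, here is a minimal Python sketch that computes the GC content of a sequence and maps it to an integer percentage bin. The function names and the 30 to 70 percent bin range are assumptions made for this illustration; they are not taken from the MetaGeneMark plugin, whose actual parameter tables are not described here.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences with no unambiguous bases (e.g. all N) get 0.0 by convention.
    return gc / total if total else 0.0


def gc_bin(seq: str, low: int = 30, high: int = 70) -> int:
    """Map a sequence to an integer GC percentage, clamped to [low, high].

    The bin range is a hypothetical choice for this sketch; a real tool
    would use the bin as a key into precomputed model parameters.
    """
    pct = round(100 * gc_content(seq))
    return max(low, min(high, pct))
```

For example, `gc_bin(orf_sequence)` could serve as a lookup key into a table of precomputed parameter sets, one per GC bin and domain.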


The GeneMark-ES plugin delivers ab initio predictions of protein-coding genes in eukaryotic genomes (3,4). The GeneMark.hmm algorithm employs a hidden semi-Markov model whose parameters are determined iteratively using Viterbi training. The most probable parse of a genomic sequence into exons, introns and intergenic regions is thus determined simultaneously with unsupervised training of the model parameters from the genomic sequence itself, which makes GeneMark-ES a fully automatic tool. GeneMark-ES has been shown to achieve high gene prediction accuracy for genomes shorter than 400 MB. Longer genomes present a challenge because their intergenic regions are longer on average. The unsupervised training procedure is computationally expensive and may take several hours.
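
The general idea of Viterbi training can be sketched on a toy model. The Python example below alternates between decoding the most probable state path and re-estimating emission probabilities from that path until the path stops changing. It is a deliberately simplified illustration: it uses a plain two-state hidden Markov model (the state names, fixed transition probabilities and pseudocount are assumptions), not the hidden semi-Markov model with exon, intron and intergenic states that GeneMark-ES uses.

```python
import math

STATES = ("coding", "noncoding")  # toy states, assumed for this sketch
BASES = "ACGT"


def viterbi(seq, start, trans, emit):
    """Return the most probable state path for seq, computed in log space."""
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(trans[p][s]))
            new_score[s] = (score[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def reestimate_emissions(seq, path, pseudocount=1.0):
    """Re-estimate emission probabilities from a decoded state path."""
    counts = {s: {b: pseudocount for b in BASES} for s in STATES}
    for base, state in zip(seq, path):
        counts[state][base] += 1
    return {s: {b: c / sum(counts[s].values()) for b, c in counts[s].items()}
            for s in STATES}


def viterbi_training(seq, start, trans, emit, max_iter=20):
    """Alternate decoding and re-estimation until the path is stable."""
    path = None
    for _ in range(max_iter):
        new_path = viterbi(seq, start, trans, emit)
        if new_path == path:
            break
        path = new_path
        emit = reestimate_emissions(seq, path)
    return path, emit
```

In this sketch only the emission probabilities are re-estimated and the transition probabilities stay fixed, which keeps the example short; a fuller implementation would update both.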

Find prokaryotic genes

Ab initio gene finding for microbial genomes can be performed using the “Find Prokaryotic Genes” tool of QIAGEN CLC Microbial Genomics Module. The tool builds a gene prediction model from the input sequence. The model captures GC content, conserved sequences corresponding to ribosomal binding sites, start and stop codon usage, and a statistical model (an Interpolated Markov Model) that estimates the probability that a sequence is part of a gene rather than background. The model is then used to predict coding sequences in the input sequence. This tool is inspired by Glimmer3 (5).
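
Scoring a sequence as "gene versus background" can be illustrated with a log-odds ratio between two Markov models. The Python sketch below is a simplification and makes several assumptions: it uses a single fixed context length (`order=2`) rather than an Interpolated Markov Model, which combines predictions from several context lengths; the function names and pseudocount are hypothetical; and it does not reflect the implementation of the “Find Prokaryotic Genes” tool or Glimmer3.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: {b: pseudocount for b in BASES})
    for seq in seqs:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in BASES:
                counts[context][base] += 1
    return {ctx: {b: c / sum(row.values()) for b, c in row.items()}
            for ctx, row in counts.items()}


def log_odds(seq, coding, background, order=2):
    """Sum of log P_coding / P_background over all positions of seq.

    Contexts never seen in training fall back to a uniform distribution.
    A positive score means the coding model explains seq better.
    """
    uniform = {b: 0.25 for b in BASES}
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        if base not in BASES:
            continue  # skip ambiguous bases such as N
        p_coding = coding.get(context, uniform)[base]
        p_background = background.get(context, uniform)[base]
        score += math.log(p_coding) - math.log(p_background)
    return score
```

Given a coding model trained on candidate genes and a background model trained on other sequence, a positive `log_odds` value for a candidate region suggests it resembles the coding training set more than the background.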


The MetaGeneMark manual and GeneMark-ES manual provide detailed instructions on plugin usage. The use of the algorithms has been documented in more than 2000 research publications. The QIAGEN CLC Microbial Genomics Module manual has extensive documentation on the “Find Prokaryotic Genes” tool, its settings and downstream analysis capabilities.

When RNA-seq data are available, the QIAGEN CLC Genomics Workbench toolbox enables easy verification of ab initio gene predictions, as described in the application note ‘Improving structural annotation in complex genomes with QIAGEN CLC Genomics Workbench’.

See also our blog on transcript discovery using QIAGEN CLC Genomics Workbench.


  1. Besemer J. and Borodovsky M. (1999) Heuristic approach to deriving models for gene finding. Nucleic Acids Research 27 (19): 3911.
  2. Zhu W., Lomsadze A. and Borodovsky M. (2010) Ab initio gene identification in metagenomic sequences. Nucleic Acids Research 38 (12): e132. doi: 10.1093/nar/gkq275
  3. Lomsadze A., et al. (2005) Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research 33: 6494.
  4. Ter-Hovhannisyan V., et al. (2008) Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Research 18: 1979.
  5. Delcher A.L., Bratke K.A., Powers E.C. and Salzberg S.L. (2007) Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23 (6): 673. doi: 10.1093/bioinformatics/btm009