Abstract
Many of the technical issues involved in sequencing complete genomes are essentially solved. Adequate technologies already exist for ascertaining sequencing error rates and for assembling sequence data. Standards or rules for the annotation process, however, remain an outstanding problem.
How should genomes be annotated, what should be annotated, which computational tools are most effective, how reliable are the resulting annotations, how organism-specific do the tools have to be, and ultimately how should the computational results be presented to the community? All of these questions remain open. This tutorial will give an overview and assessment of the current state of annotation based upon experience gained at the Drosophila melanogaster genome project.
In the tutorial we will do three things. First, we will break down the annotation process and discuss the various aspects of the problem. This will serve to clarify the term "annotation", which is often used collectively for a process that in fact consists of a number of discrete steps. Second, with the participation of computational biologists from the community, we will compare existing tools for sequence annotation. We will do this by providing a 3-megabase sequence that has already been well characterized at our center as a testbed for evaluating other feature-finding algorithms. This is similar to what has been done at the CASP (Critical Assessment of Techniques for Protein Structure Prediction) conferences (http://predictioncenter.llnl.gov). Third, we will discuss which annotation problems are essentially solved and which problems remain.