Computational Science Manager II, Cold Spring Harbor Laboratory
High-quality gene models with accurate gene structure are essential for scientists designing experiments that target specific genes. Traditionally, validating gene models requires manual curation, a labor-intensive and time-consuming process in which one or more individuals evaluate and correct computational predictions using all available experimental evidence. Before manual curation, automated gene-finding pipelines are used to identify genes, regulatory sequences, and repetitive elements. These pipelines are constantly evolving: their algorithms have become faster, more accurate, and more sensitive, and they draw on large volumes of diverse data types. Even with these improvements, the pipelines still make errors, such as fused or split genes. In a pilot study (Tello-Ruiz et al., 2019), we tested two distinct approaches for identifying low-quality predictions from automated gene finders: MAKER-P quality scores, and alignments between translated protein sequences and their homologs across species, viewed in the Gramene gene tree visualizer. During our first maize annotation jamboree (CSHL, Nov. 2017), we selected a subset of B73_RefV4 classical maize genes for curation and found that both methods identified low-quality gene models. Annotation errors included incorrectly predicted exon structure (exon loss, gain, or incorrect length). Students also identified concerns with both the 5’ and 3’ UTRs. Following the initial review of the candidate loci, a group of eight students corrected the models using the Apollo gene editor. The curated models are available to the plant community as a track on the Gramene genome browser. Ideally, this method could enable students and citizen scientists to participate in the annotation of any sequenced eukaryotic genome. To date, we have organized four maize annotation jamborees.
These approaches were successfully piloted with eighteen sophomore-level undergraduate students in an Honors Genetics course at Middle Tennessee State University, and that implementation will be discussed.
Coauthors: Cristina Marco – CSHL DNA Learning Center; Kapeel Chougule – Cold Spring Harbor Laboratory; Andrew Olson – Cold Spring Harbor Laboratory; Rebecca Seipelt-Thiemann – Middle Tennessee State University; David Micklos – CSHL DNA Learning Center; Doreen Ware – Cold Spring Harbor Laboratory