Gene Model Curation

From DictyWiki

Curated Models

Determining the correct gene model

Curated Model = manually curated gene model
To determine the correct gene model
  1. Perform a pairwise BLAST of the CDS from the GenBank record(s) against the CDS from the Sequencing Center Gene Prediction.
  2. If the CDS are 100% identical, great. If not, record number of nucleotide differences and residue number of amino acid substitutions/insertions/deletions.
  3. For sequence curation of unpublished genes, run BLASTP or BLASTX at NCBI and/or UniProt, compare with ESTs, and curate with either EST or similarity support. Genes that have neither kind of support can be curated using genomic context.
  4. Perform pairwise BLAST when necessary.
  5. View gene models in GBrowse, zooming out to see general gene structure, ESTs, neighboring genes.
  6. Check for ESTs also with BLASTN CDS vs. EST sequences. This is an important step as ESTs can potentially align non-specifically in GBrowse.
  7. If one EST differs from the genomic sequence, we assume the genomic sequence is correct (unless sequence similarity strongly agrees with the EST). If two ESTs agree with each other and differ from the genomic sequence, make a private note and list them on Genes with Sequence Discrepancies.
  8. Check splice donors [consensus for Dicty: (C/A)AG | GT(A/G)AGT], splice acceptors [consensus for Dicty: (T/C)NN(C/T)AG | (A/G)], and the start site (ATG; -3, -6, and -9 are typically A; upstream is AT rich with CG islands). Dump a decorated FASTA file from GBrowse to look at introns and upstream sequence.
  9. If the gene has an RNAseq alignment that supports exon/introns, beginning/end, or simply 'bridge' support between EST fragments, this is legitimate support and marked as 'unpublished transcript'. RNAseq data from Harry is available in contigs that can be loaded into GBrowse. See a good example gene that's only supported by RNAseq from Harry [1].
  10. See also the guidelines for similarity-based curation.
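
Steps 1 and 2 above can be illustrated with a small Python sketch. It is not part of the dictyBase tooling: the function names (translate, compare_cds) are hypothetical, the standard genetic code is assumed, and only CDS of equal length are handled (substitutions). Insertions and deletions still require the pairwise BLAST alignment.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering (assumed here).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a CDS; codons with unknown bases become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X")
                   for i in range(0, usable, 3))


def compare_cds(genbank_cds, predicted_cds):
    """Return (nucleotide difference positions, amino acid substitutions).

    Positions are 1-based. Substitutions are (residue, GenBank aa, predicted aa).
    """
    a, b = genbank_cds.upper(), predicted_cds.upper()
    if len(a) != len(b):
        raise ValueError("lengths differ; use a pairwise BLAST alignment for indels")
    nt_diffs = [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]
    aa_subs = [(i + 1, p, q)
               for i, (p, q) in enumerate(zip(translate(a), translate(b)))
               if p != q]
    return nt_diffs, aa_subs
```

For example, compare_cds("ATGAAATTTTAA", "ATGAAACTTTAA") reports one nucleotide difference at position 7 and an F to L substitution at residue 3, which is the information step 2 asks the curator to record.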

Creating a Curated Model

To create a Curated Model
  1. Go to Curate Gene from dictyBase Curator Central. Enter Gene Name.
  2. Scroll down to the Features section and click 'Edit' for the Gene Prediction (Source = Sequencing Center; Deleted? = N). A new window will open.
  3. Click 'Create dictyBase Curated Gene.'
  4. A new feature will be created and will be identical to the Gene Prediction (gene sequence and structure). It is automatically the primary feature. Record feature number of old and new features (sometimes features can get lost, so it is a good idea to have these numbers just in case).
  5. Click on 'Curate New Feature' to add information to the Curated Model (see Feature Curation Tool for details).
  6. If the Sequencing Center Gene Prediction is the correct gene model, you may skip ahead to Step 8.
  7. If the Sequencing Center Gene Prediction is NOT the correct gene model, load the Curated Model in Apollo and make changes accordingly.
  8. After your satisfactory Curated Model has been created, return to the Gene Curation Page and refresh the page; there should now be at least two features (Sources = Sequencing Center and dictyBase Curator). Write a private Curator Note: "Verified date/initials" and Commit.
  9. Write public Curator Notes when applicable.
  10. Refresh the Gene Page for the gene. The gene should now have the Curated Model as its primary feature.

Incomplete support

Curated Models with incomplete support
  1. Sometimes the evidence for a gene model is incomplete. In cases where a Curated Model can be created (i.e., there is not a sequence problem), create a Curated Model.
  2. In the Feature Curation page, check the 'Incomplete Support' box. A note will appear next to the Curated Model on the Gene Page that says, "The supporting evidence for this gene model is incomplete."

Special cases

Splice variants

Creating an alternative transcript
Description

In general, the criterion for an alternative transcript is the presence of at least one convincing EST showing a different intron/exon structure. A convincing EST is one that has a 100% match to the genomic sequence and, for example, contains at least one additional intron. A publication showing that the alternative transcript exists is also sufficient.

Example genes
capA, cbpD2, yipf1.
dictyBase curator
  1. To create a second Curated Model, repeat the procedure for creating a Curated Model after creating an initial Curated Model. Both Curated Model features will be primary.
  2. On the Gene Page, in the Associated Sequences section, two Curated Models will appear: Curated Model A and Curated Model B. Take note of which dictyBaseID corresponds to which Curated Model (splice variant).
  3. Record the splice variant in the shared Excel file on cgm-1.



Splitting a gene

Description
When an automated gene prediction has merged two or more ORFs, it is necessary to split the gene prediction by modifying intron/exon boundaries and the relationships between genes and features. In this case, we obsolete the DDB_G ID by creating two new genes.
dictyBase curator
  1. Create TWO OR MORE new genes (using any gene name, this can be changed later).
  2. Make Curated Models ONE by ONE from the Gene Prediction that codes for two or more genes (if the Gene Prediction has fused several genes).
  3. Create the first Curated Model, and open in Apollo.
  4. Edit the first Curated Model according to plan and save.
  5. Open the Feature Curation Page for the Curated Model. Change the gene name to the gene name of the first gene you just created and Commit.
  6. Repeat steps 3 to 5 until all genes (2 or more) have been taken care of.
  7. Doing it sequentially creates less confusion in Apollo, as genes are only partially overlapping when you start to edit. Make sure you always know which gene you are modifying, as indicated in the lower left-hand corner when dealing with partially overlapping sequences!
  8. Associate the gene prediction with the largest of the newly created genes or with the 'first' one, whichever seems more appropriate.
  9. Delete the original gene by REPLACING the gene ID with all newly created gene IDs and make a curation note.
  10. Add any notes and annotations to the new genes and you are DONE!

Merging genes together

Description
When an automated gene prediction has split a gene into two or more ORFs, it is necessary to merge the gene predictions by modifying intron/exon boundaries and the relationships between genes and features.
dictyBase curator
  1. Select the Gene Prediction that represents a larger portion of the actual gene and create a Curated Model.
  2. Open the Feature Curation Page for the other Gene Prediction. Change the gene name to the name of the gene for which you just created a Curated Model and Commit. That gene now has three associated features: two Gene Predictions and one Curated Model.
  3. In the now feature-less gene curation page, under 'Replace Gene ID By', add the DDB_G ID of the other gene. This will map the old gene ID to the new one.
  4. Load the Curated Model into Apollo.
  5. Go to the Exon Detail Editor and modify the intron/exon boundaries.
  6. Save in Apollo and you are DONE!


Flipping a gene

dictyBase curator

DDB_G ID should be updated

  1. Delete old gene model and gene
  2. Create new gene
  3. Create curated gene model
  4. Map old DDB_G to new DDB_G

5' and 3' UTRs

Description
Currently we are not representing untranslated regions (UTRs) in the database, graphically or otherwise. When an intron in a UTR exists, the display is potentially confusing for users.
dictyBase curator
Add public and/or private notes about the UTR intron:
  • An intron exists in the 5'UTR of this gene, which accounts for the apparent difference between the Curated Model and the GenBank mRNA.
  • An intron exists in the 5'UTR of this gene, which accounts for the apparent difference between the Curated Model and the ESTs.




Chromosome 2 duplication

Genes on the Chromosome 2 duplication will be named with a hyphen (-) and the number 1 (first repeat) and the number 2 (second repeat). For example: carA-1, carB-2. If the gene on the first repeat does not have a gene name but is named after the primary ID and is referred to as DDB_G....... , use the primary ID of the duplicated gene preceded by 'gene' for the gene on the duplication.
Reciprocal links should always be added to both gene pages in the description as follows:

there is a second copy of this gene, [gene name with hyperlink]

Hyperlink examples:

<a href="/db/cgi-bin/gene_page.pl?primary_id=DDB_G0273915"><i>iliE-2</i></a>
<a href="/db/cgi-bin/gene_page.pl?primary_id=DDB_G0273907"><i>DDB_G0273907</i></a>

Sequence support, such as ESTs and similarity, is always identical on both duplicated genes.
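
The reciprocal description text can be generated rather than typed by hand, which avoids stray spaces in the primary_id parameter. The following Python sketch is only an illustration: the helper names (gene_link, second_copy_note) are hypothetical, and the URL pattern is copied from the hyperlink examples above.

```python
from html import escape


def gene_link(primary_id, display_name):
    """Build a gene page hyperlink in the format of the examples above."""
    primary_id = primary_id.strip()  # guard against stray whitespace in the ID
    return ('<a href="/db/cgi-bin/gene_page.pl?primary_id={}"><i>{}</i></a>'
            .format(escape(primary_id), escape(display_name.strip())))


def second_copy_note(primary_id, display_name):
    """Description sentence pointing to the other copy of a duplicated gene."""
    return "there is a second copy of this gene, " + gene_link(primary_id, display_name)
```

Calling second_copy_note once for each of the two genes, with the ID and name of the other copy, gives the text for both gene pages.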



Genes with Sequence Discrepancies

The following is a list of genes in which curators have found putative sequence problems, either by comparison with ESTs or because of the presence/absence of particular sequence features.

Gene name || dictyBaseID || Chromosome || Start Position || Stop Position || Strand || Notes || Curator


Sequence discrepancies with supporting evidence

Supporting evidence includes independent sequence (ESTs, GenBank records, unpublished DNA) and sequence similarity with other spp.


  • vps26 || DDB0205116 || 3 || 5326311 || 5328458 || -1 || ESTs show different 3' sequence || KP

atcaattattGGTAGGTGTTacaagcgaaaaattagg

  • DDB0238065

According to similarity to DDB0238064, the 2nd intron should not exist; however if we remove it the ORF would go out of frame (in/del near the end of the 2nd intron?)

  • DDB_G0282649_ps || EST ddc9k17 (DDB0110048) has a 1 nt insertion compared with the genomic sequence; an artificial 2 nt intron has been created in this gene model to compensate.
  • sslA_1/sslA_2 || two good ESTs (ddc4i05, dda5d10) show there is a 1 nt insertion in the genomic sequence || PF
  • DDB0233185 || two decent ESTs show the start should be 11 nt upstream || PG
  • ucr (DDB0238608) || several ESTs show there are many mismatches in the 5'UTR and N-terminus of this gene, also affecting the start codon; this also belongs to the next category - missing start || PF
  • trxD || there is just one EST that has several mismatches, and there is a non-consensus splice site, which could be the result of sequence mistakes in the genomic || PF
  • DDB0238630 || 2 genes merged using two arbitrary, wrong introns (Intron 3 and 4) to stay in frame. This is a conserved gene and sequence around these introns needs to be checked || PF
  • DDB0238663 || has good ESTs that extend into gene prediction intron, but only second ATG of what's now a one exon gene could be used. Check sequence of 5' end of this gene || PF
  • cpiC (DDB0252666) || GenBank mRNA (AB189920) has 5 T's not 6 (ttatttttaa, not ttattttttaa) and PMID 16328887 shows the phenylalanine at position 94 is the C-terminal residue. || PF
  • DDB0266832 || gene merger due to insertion of one T in the genomic sequence (agtgaaaatataatgTttaactgaaaaag), which created a premature stop. The one EST and D. purpureum both confirm the merger, so the T is certainly wrong. || PF
  • cdc73 (DDB0267069) || 1 nt (C) insertion in first exon (ACTTTAAATcATGGTGCTT), supported by one EST (DDB0084207). || PF
  • DDB_G0293832 || is at the end of a contig, and about 300 nt at the 5' end of the gene are missing. The gap between contigs is longer than represented in dictyBase. Supported by ESTs SSE559. || PF
  • DDB_G0271812 || User request by D. Veltman. No ESTs, but very similar to the Dp sequence, with whose help I could determine that there is a likely gc insertion in the genomic sequence. Douwe sequenced the gene and confirmed sequence problems in the genomic sequence. || PF 10-2009
  • DDB_G0278193 || Many problems in stretch of genomic sequence where wrong introns were in the gene predictions. Evidence for correct model by ESTs. || PF
  • DDB_G0275563 || 7 changes in comparison to Gareth's AX4 seq, one leads to premature stop; artificial gap added. || PF
  • DDB_G0282637 || This gene has several confirmed mistakes (Gareth's AX4 seq); made artificial gap where there is a wrong stop. || PF
  • DDB_G0280837 || This gene has two N where Gareth has a T or C: gtggagggttattttTaacCccaataccatttt; the curated model does not have a stop codon because the genomic sequence contains an N instead of a T, resulting in NAA where there should be TAA || PF

Sequence discrepancies with missing features

Features include start codon, stop codon, splice donor/acceptor.

  • DDB_G0282649_ps: To make a gene model the second intron must splice on AA rather than AG. This one is either a pseudogene or has a genome sequence error. PG
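
A check for the features listed above can be sketched in Python. This is an illustration only, not a dictyBase tool: the function names are hypothetical, the regular expressions encode only the intron side of the Dicty splice consensus quoted in the curation steps (GT(A/G)AGT at the donor, (T/C)NN(C/T)AG at the acceptor), and introns are assumed to be passed as separate strings on the coding strand.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}
DONOR_INTRON_START = re.compile(r"^GT[AG]AGT")             # GT(A/G)AGT
ACCEPTOR_INTRON_END = re.compile(r"[TC][ACGT]{2}[CT]AG$")  # (T/C)NN(C/T)AG


def missing_features(cds, introns):
    """List missing start codon, stop codon, and GT/AG splice sites."""
    cds = cds.upper()
    problems = []
    if not cds.startswith("ATG"):
        problems.append("no ATG start codon")
    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")
    if cds[-3:] not in STOP_CODONS:
        problems.append("no stop codon")
    for n, intron in enumerate(introns, start=1):
        intron = intron.upper()
        if not intron.startswith("GT"):
            problems.append("intron {}: donor is not GT".format(n))
        if not intron.endswith("AG"):
            problems.append("intron {}: acceptor is not AG".format(n))
    return problems


def fits_dicty_consensus(intron):
    """True if both intron ends match the fuller consensus patterns."""
    intron = intron.upper()
    return bool(DONOR_INTRON_START.search(intron)
                and ACCEPTOR_INTRON_END.search(intron))
```

An intron ending in AA rather than AG, as described for DDB_G0282649_ps, would be reported as "acceptor is not AG"; a CDS ending in NAA would be reported as having no stop codon. Failing fits_dicty_consensus alone is weaker evidence, since real splice sites do not always match the full consensus.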

Sequence discrepancies with no evidence

This group includes sequences that have been flagged as problematic based on curator inference.

  • DDB0204606: the first intron appears to be coding. Moreover, this gene has no obvious 'sister gene' so unlikely to be a pseudogene. PG
  • DDB0238423: 1 nt deletion at or around nt 36; curated as a pseudogene. PG
  • DDB0238737: has no ESTs but sequence similarity suggests the 3'end is missing PG 11-9-2007
  • DDB0238738: ESTs have many differences, but the ESTs look to be of poor quality PG 11-9-2007
  • DDB_G0268964: no ESTs, premature stop, only 1 mutation. Created 1 nt gap to stay in frame. PF 15-03-10
  • DDB_G0283785: no EST coverage in this spot, but there is one EST that supports the overall existence of the gene. Made a 2 nt gap to extend 7th exon 5' by similarity, still not sure if this is all, but it's possible. PF 26-03-10
  • DDB_G0305545: gene resulted from a split, but has no start in frame. Made a 1 nt deletion in a TTTTT string to put in frame. Beginning now matches Dpur, though I can't get them to align in BLAST as it's very low complexity and then breaks up a bit. Not sure sequence is correct, but best I could do for now, and has good Dpur ortholog. PF 19-04-1010
  • DDB_G0278613: one mutation (likely deletion) puts this out of frame - deleted 2 nt to put in frame, but arbitrary, no support.
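
Several entries above mention a premature stop or a frame shift. As a rough aid, in-frame stop codons ahead of the final codon can be located with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, the CDS is read in frame from its first base, and the standard stop codons are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(cds):
    """Return 1-based codon numbers of in-frame stops before the last codon."""
    cds = cds.upper()
    # Full codons only; a trailing partial codon is ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return [n for n, codon in enumerate(codons[:-1], start=1)
            if codon in STOP_CODONS]
```

A non-empty result only says where the reading frame is interrupted. Whether the cause is a genomic sequence error, a wrong intron, or a genuine pseudogene still has to be judged from ESTs and similarity, as in the entries above.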

return to SOPs Index
