Sequence Curation
GenBank Records
GenBank Loader
- Description
- New GenBank records are imported automatically on a weekly basis. Check the GenBank Loader weekly to see if new records are in the database. See the criteria for reconciling GenBank records with a Gene Prediction.
- dictyBase curator
- Check to see that the GenBank record is aligning with the correct gene.
- Check all information for accuracy and compatibility with existing information in the database.
- Click 'Load' and information will be available on production.
Checking GenBank records
- To determine whether a GenBank sequence should be reconciled with a Gene Prediction
- BLAST the CDS of the GenBank record against all dictyBase CDS: the top hits should be the GenBank sequence itself and the Gene Prediction identified in the BLAST report; make sure the gene being linked is the best hit.
- Likewise, BLASTing the Gene Prediction against all dictyBase CDS should return the same top hits; all other hits should be insignificant.
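The top-hit check above can be sketched in Python, assuming the BLAST report was saved in the standard 12-column tabular format (`-outfmt 6`). The function name, the sequence IDs, and the E-value cutoff used to decide what counts as "insignificant" are illustrative assumptions; this SOP does not define a numeric threshold.

```python
def check_top_hits(blast_tab, query_id, prediction_id, max_other_evalue=1e-5):
    """Check that the two best hits are the query itself and the Gene Prediction.

    blast_tab: text of a BLAST tabular report (-outfmt 6 column order).
    max_other_evalue: assumed cutoff; any other hit at or below it is reported.
    Returns (top_two_ok, list_of_other_significant_subject_ids).
    """
    best = {}  # subject id -> (bitscore, evalue) of its best HSP
    for line in blast_tab.strip().splitlines():
        fields = line.split("\t")
        if len(fields) < 12 or fields[0] != query_id:
            continue
        subject = fields[1]
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if subject not in best or bitscore > best[subject][0]:
            best[subject] = (bitscore, evalue)

    # Rank subjects by their best bit score, highest first.
    ranked = sorted(best, key=lambda s: best[s][0], reverse=True)
    top_two_ok = set(ranked[:2]) == {query_id, prediction_id}
    significant_others = [s for s in ranked[2:] if best[s][1] <= max_other_evalue]
    return top_two_ok, significant_others


# Hypothetical IDs and values, for illustration only.
rows = [
    ["GB_CDS", "GB_CDS", "100.0", "900", "0", "0", "1", "900", "1", "900", "0.0", "1663"],
    ["GB_CDS", "PRED_CDS", "99.8", "900", "2", "0", "1", "900", "1", "900", "0.0", "1650"],
    ["GB_CDS", "OTHER_CDS", "80.0", "40", "8", "0", "1", "40", "5", "44", "2.1", "30.1"],
]
report = "\n".join("\t".join(r) for r in rows)
ok, others = check_top_hits(report, "GB_CDS", "PRED_CDS")
print(ok, others)
```

Running the same function on the report from the reciprocal BLAST (Gene Prediction as query) covers the second check. A non-empty second return value means some other CDS hit strongly enough to deserve a manual look.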
Public notes for GenBank records
- Description
- This note may be used for any gene in which the Sequencing Center sequence has been compared to sequences in GenBank or EST sequences (gene may or may not have a Curated Model). Typically we do not report sequence differences in non-coding regions (introns and upstream/downstream sequences). Use the note that is most appropriate for the gene.
- Notes
- Note regarding this sequence: the sequences from the Sequencing Center and GenBank record [XXXXX] are identical.
- [one GenBank record]
- Note regarding this sequence: the sequences from the Sequencing Center and GenBank records [XXXXX] and [YYYYY] are identical.
- [two or more GenBank records]
- Note regarding this sequence: there is a discrepancy between the sequence from the Sequencing Center and the sequence in GenBank record [XXXXX], however, the sequence from the Sequencing Center has been verified.
- [This note is used when two or more ESTs from independent libraries confirm the Sequencing Center sequence. Amino acid substitutions are not reported in this case. "Discrepancy" is always singular even if multiple nucleotide differences exist.]
- Note regarding this sequence: there is a discrepancy between the sequence from the Sequencing Center and the EST sequences, however, the sequence from the Sequencing Center has been verified.
- [This note is used when one of the Sequencing Centers (Jena or Baylor) confirms the Sequencing Center sequence. Amino acid substitutions are not reported in this case. "Discrepancy" is always singular even if multiple nucleotide differences exist.]
- Note regarding this sequence: there is a(n) X nt difference between the sequence from the Sequencing Center and the sequence in GenBank record [XXXXX], resulting in X amino acid substitution(s) at position(s) Y and Z.
- Note regarding this sequence: there is a(n) X nt difference between the sequence from the Sequencing Center and the sequence in GenBank record [XXXXX]; the encoded proteins are identical.
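As a rough illustration of how the "identical", "X nt difference ... amino acid substitution(s)" and "encoded proteins are identical" notes relate to the underlying comparison, here is a minimal Python sketch. It assumes the two CDS are already the same length (substitutions only; insertions or deletions need a real alignment) and uses the standard genetic code. All function names are hypothetical, and the output is a draft for the curator to review, not a replacement for the "verified" notes, which depend on EST or Sequencing Center evidence.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: aa for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a CDS; unknown codons become 'X'."""
    cds = cds.upper()
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3))


def compare_cds(center_cds, genbank_cds):
    """Return (number of nt differences, 1-based positions of aa substitutions)."""
    if len(center_cds) != len(genbank_cds):
        raise ValueError("CDS lengths differ; align the sequences first")
    nt_diffs = sum(a != b for a, b in zip(center_cds.upper(), genbank_cds.upper()))
    pairs = zip(translate(center_cds), translate(genbank_cds))
    aa_positions = [i + 1 for i, (a, b) in enumerate(pairs) if a != b]
    return nt_diffs, aa_positions


def public_note(center_cds, genbank_cds, accession):
    """Draft the public note wording for one GenBank record."""
    nt, aa = compare_cds(center_cds, genbank_cds)
    prefix = "Note regarding this sequence: "
    if nt == 0:
        return (prefix + "the sequences from the Sequencing Center and "
                f"GenBank record {accession} are identical.")
    # Simplified a/an rule; check the article by eye for unusual counts.
    article = "an" if str(nt).startswith("8") or nt in (11, 18) else "a"
    body = (f"there is {article} {nt} nt difference between the sequence from the "
            f"Sequencing Center and the sequence in GenBank record {accession}")
    if not aa:
        return prefix + body + "; the encoded proteins are identical."
    if len(aa) == 1:
        where = f"1 amino acid substitution at position {aa[0]}"
    else:
        listed = ", ".join(map(str, aa[:-1])) + f" and {aa[-1]}"
        where = f"{len(aa)} amino acid substitutions at positions {listed}"
    return prefix + body + f", resulting in {where}."


# Placeholder accession, as in the note templates above.
print(public_note("ATGAAATTT", "ATGAATTTT", "XXXXX"))
```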
Curated Models
Determining the correct gene model
- Curated Model = manually curated gene model
- To determine the correct gene model
- Perform a pairwise BLAST of the CDS from the GenBank record(s) against the CDS from the Sequencing Center Gene Prediction.
- If the CDS are 100% identical, great. If not, record the number of nucleotide differences and the residue numbers of any amino acid substitutions/insertions/deletions.
- View gene models on GBrowse, zooming out to see general gene structure, ESTs, neighboring genes.
- Check for ESTs with BLASTN CDS vs. EST sequences (especially important for GenBank records that are genomic sequences). This is an important step as ESTs can potentially align non-specifically in GBrowse.
- For GenBank records that contain genomic sequences (especially if no ESTs exist), perform a pairwise BLAST of the CDS against the genomic sequence from the Sequencing Center. Check splice donors [consensus for Dicty: (C/A)AG | GT(A/G)AGT] and splice acceptors [consensus for Dicty: (T/C)NN(C/T)AG | (A/G)] and start site (ATG; -3, -6, and -9 are typically A, upstream is AT rich with CG islands). Alternatively, "dump" a decorated FASTA file from GBrowse to look at introns and upstream sequence.
- For genomic sequences that do not have ESTs, BLASTP or BLASTX at NCBI against nr or swissprot to see if protein is conserved.
- If enough data exists, create a Curated Model.
- See also the guidelines for similarity-based curation.
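The splice-site and start-site checks described above can be sketched as simple pattern matches. This is a minimal Python sketch that takes the consensus sequences exactly as quoted in this SOP. The function names and the coordinate convention (0-based, intron end exclusive) are assumptions made for the example. Real introns often deviate from the consensus at one or more positions, so a failed match is a prompt for a closer look rather than proof of a wrong model.

```python
import re

# Consensus patterns as given in the SOP; "|" marks the exon/intron boundary.
DONOR = re.compile(r"[CA]AGGT[AG]AGT")             # (C/A)AG | GT(A/G)AGT
ACCEPTOR = re.compile(r"[TC][ACGT]{2}[CT]AG[AG]")  # (T/C)NN(C/T)AG | (A/G)


def check_intron(genomic, intron_start, intron_end):
    """Compare one intron's boundaries with the Dicty splice consensus.

    intron_start: 0-based index of the first intron base.
    intron_end:   0-based index one past the last intron base.
    """
    if intron_start < 3 or intron_end + 1 > len(genomic):
        raise ValueError("not enough flanking exon sequence to check")
    genomic = genomic.upper()
    donor = genomic[intron_start - 3:intron_start + 6]   # 3 exon + 6 intron bases
    acceptor = genomic[intron_end - 6:intron_end + 1]    # 6 intron + 1 exon base
    return {
        "donor": donor,
        "donor_ok": bool(DONOR.fullmatch(donor)),
        "acceptor": acceptor,
        "acceptor_ok": bool(ACCEPTOR.fullmatch(acceptor)),
    }


def start_context(genomic, atg_index):
    """Report whether the start codon is ATG and whether -3, -6 and -9 are A."""
    genomic = genomic.upper()
    is_atg = genomic[atg_index:atg_index + 3] == "ATG"
    a_at = {off: genomic[atg_index + off] == "A"
            for off in (-3, -6, -9) if atg_index + off >= 0}
    return is_atg, a_at


# Toy sequence built for illustration: 9 upstream bases, ATG, exon, intron, exon.
upstream = "A" * 9
exon1 = "ATGCAG"
intron = "GTAAGT" + "T" * 8 + "TTTTAG"
exon2 = "GAAA"
toy = upstream + exon1 + intron + exon2
intron_start = len(upstream) + len(exon1)
intron_end = intron_start + len(intron)

print(check_intron(toy, intron_start, intron_end))
print(start_context(toy, len(upstream)))
```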
Creating a Curated Model
- To create a Curated Model
- Go to Curate Gene from dictyBase Curator Central. Enter Gene Name.
- Scroll down to the Features section and click 'Edit' for the Gene Prediction (Source = Sequencing Center; Deleted? = N). A new window will open.
- Click 'Create dictyBase Curated Gene.'
- A new feature will be created and will be identical to the Gene Prediction (gene sequence and structure). It is automatically the primary feature. Record feature number of old and new features (sometimes features can get lost, so it is a good idea to have these numbers just in case).
- Click on 'Curate New Feature' to add information to the Curated Model (see Feature Curation Tool for details).
- If the Sequencing Center Gene Prediction is the correct gene model, you may skip ahead to Step 8 (returning to the Gene Curation Page).
- If the Sequencing Center Gene Prediction is NOT the correct gene model, load the Curated Model in Apollo and make changes accordingly.
- After a satisfactory Curated Model has been created, return to the Gene Curation Page and refresh the page; there should now be at least two features (Sources = Sequencing Center and dictyBase Curator). Write a private Curator Note: "Verified date/initials" and Commit.
- Write public Curator Notes when applicable.
- Refresh the Gene Page for the gene. The gene should now have the Curated Model as its primary feature.
Incomplete support
- Curated Models with incomplete support
- Sometimes evidence for a gene model is not 100%. In cases where a Curated Model can be created (i.e., there is not a sequence problem), create a Curated Model.
- In the Feature Curation page, check the 'Incomplete Support' box. A note will appear next to the Curated Model on the Gene Page that says, "The supporting evidence for this gene model is incomplete."
When a Curated Model cannot be created
- When NOT to make a Curated Model
- When a gene has a major sequence flaw, such as premature stop or internal stop, do not create a Curated Model.
- When a gene lies at the end of a contig, do not create a Curated Model.
- If the gene structure is clearly wrong, but you cannot determine an alternate gene model, do not create a Curated Model.
- However, if part of a gene model is incorrect and cannot be fixed, but a different part of the gene model can be improved upon with a Curated Model, create a Curated Model.
- If a Curated Model cannot be determined
- Write a public Curator Note describing the reason for not creating a Curated Model.
- Record (best if flagged in red) in your personal table or a separate "unverifiable" table the genes for which you cannot create a Curated Model. These genes may have more data in the future or may be corrected in subsequent versions of the genome.
PUBLIC NOTES: Use the following Curator Notes when a Curated Model cannot be created due to various reasons.
Sequence problems (GenBank)
- Due to a discrepancy between the sequences from the Sequencing Center and GenBank record [XXXXX], a Curated Model cannot be added at this time.
- Optional second sentence:
- ESTs confirm the sequence in the GenBank record.
- OR
- Sequence similarity suggests the sequence in the GenBank record is correct.
Sequence problems (Second Generation)
- Due to a sequence discrepancy, a Curated Model cannot be added at this time.
End of contig (rare case)
- The abcD gene extends beyond the end of a chromosomal contig, and therefore a Curated Model cannot be added at this time.
Unverifiable Generation 2 (gene models without GenBank records, curation by ISS)
- The available data are inconclusive to determine the correct gene model. The gene model presented here was obtained from the Dictyostelium Genome Consortium.
Conflicting EST(s) but not enough evidence for another Curated Model
- 1 EST conflicts with the Curated Model but the available data are insufficient to create an alternative Curated Model.
Special cases
Splice variants
- Creating an alternative transcript
- Description
- In general, the criterion for an alternative transcript is the presence of at least one convincing EST showing a different intron/exon structure. A convincing EST is one that matches the sequence 100% and, for example, contains at least one additional intron (or is supported by a publication, where one exists).
- dictyBase curator
- To create a second Curated Model, repeat the procedure for creating a Curated Model after creating an initial Curated Model. Both Curated Model features will be primary.
- On the Gene Page, in the Associated Sequences section, two Curated Models will appear: Curated Model A and Curated Model B. Take note of which dictyBaseID corresponds to which Curated Model (splice variant).
- Record the splice variant in the shared Excel file on cgm-1.
Splitting a gene
- Description
- When an automated gene prediction has merged two or more ORFs, it is necessary to split the gene prediction by modifying intron/exon boundaries and the relationships between genes and features.
- dictyBase curator
- Make TWO Curated Models from the Gene Prediction that codes for two genes (or three Curated Models if the Gene Prediction has fused three genes).
- Create a new gene (using any gene name, this can be changed later).
- Open the Feature Curation Page for one of the Curated Models. Change the gene name to the gene name of the gene you just created and Commit.
- You now have two overlapping genes, each of which has a Curated Model feature.
- Load the Curated Models in Apollo.
- Go to the Exon Detail Editor. Note that there are two sequences; make sure you know which gene you are modifying, as indicated in the lower left-hand corner.
- Edit the intron/exon boundaries of the first gene.
- Edit the intron/exon boundaries of the second gene.
- Save in Apollo and you are DONE!
Merging genes together
- Description
- When an automated gene prediction has split a gene into two or more ORFs, it is necessary to merge the gene predictions by modifying intron/exon boundaries and the relationships between genes and features.
- dictyBase curator
- Select the Gene Prediction that represents a larger portion of the actual gene and create a Curated Model.
- Open the Feature Curation Page for the other Gene Prediction. Change the gene name to the name of the gene for which you just created a Curated Model and Commit. That gene now has three associated features: two Gene Predictions and one Curated Model.
- Delete the now feature-less gene. Be sure to make a private curator note explaining the deletion.
- Load the Curated Model in Apollo.
- Go to the Exon Detail Editor and modify the intron/exon boundaries.
- Save in Apollo and you are DONE!
5' and 3' UTRs
- Description
- Currently we are not representing untranslated regions (UTRs) in the database, graphically or otherwise. When an intron in a UTR exists, the display is potentially confusing for users.
- dictyBase curator
- Add public and/or private notes about the UTR intron:
- An intron exists in the 5'UTR of this gene, which accounts for the apparent difference between the Curated Model and the GenBank mRNA.
- An intron exists in the 5'UTR of this gene, which accounts for the apparent difference between the Curated Model and the ESTs.
Chromosome 2 duplication
Genes on the Chromosome 2 duplication are named with a hyphen (-) followed by the number 1 (first repeat) or the number 2 (second repeat). For example: carA-1, carB-2. If the gene on the first repeat has no gene name and is instead named after its primary ID (that is, it is referred to as geneDDB.......), name the gene on the duplication the same way: use the primary ID of the duplicated gene preceded by 'gene'.
Reciprocal links should always be added to both gene pages in the description as follows:
there is a second copy of this gene, [gene name with hyperlink]
- Hyperlink examples
<a href="/db/cgi-bin/gene_page.pl?dictybaseid=DDB0238222">rnrB-2</a>
<a href="/db/cgi-bin/gene_page.pl?dictybaseid=DDB0238702">geneDDB0238702</a>
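The reciprocal-link text can be generated consistently with a small helper. The Python sketch below follows the URL pattern and note wording shown in the examples above; the function names are hypothetical and are not part of any dictyBase tool.

```python
from html import escape

# URL pattern taken from the hyperlink examples above.
GENE_PAGE = "/db/cgi-bin/gene_page.pl?dictybaseid={ddb_id}"


def gene_link(ddb_id, label):
    """Build the HTML link to a gene page."""
    href = GENE_PAGE.format(ddb_id=ddb_id)
    return f'<a href="{escape(href, quote=True)}">{escape(label)}</a>'


def second_copy_note(ddb_id, gene_name=None):
    """Build the description text pointing at the other copy of the gene.

    If the other copy has no gene name, it is labelled 'gene' + primary ID.
    """
    label = gene_name if gene_name else f"gene{ddb_id}"
    return f"there is a second copy of this gene, {gene_link(ddb_id, label)}"


print(second_copy_note("DDB0238222", "rnrB-2"))
print(second_copy_note("DDB0238702"))
```

Remember to add the note to both gene pages, each pointing at the other copy.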
