Next curator meeting

Revisit curation priorities (After March 1st)

Proposed on Jan 27th:

  • Pascale: 20% + 50% = 70% GO (ref genome etc); 20% Literature (2-3 papers); 10% Other
  • Petra: 10% GO; 40% gene models (10 models); 40% Literature (4-6 papers); 10% Other
  • Bob: 90% Gene models (18-25 models); 10% Literature (gene associations)

Curation status note

Pascale proposes changing the guidelines so that the curation status is marked comprehensive once all experimental (EXP) data has been annotated.

Update on GOA's protein2GO curation tool

See screenshots: Image:Protein2GO.ppt


  • Need to annotate to UniProt IDs but can export to dictyBase IDs (so our IDs will need to be in sync)
  • Can annotate to post-translationally modified proteins and splice variants (and soon to complexes)
  • Notes field, where the curator can describe issues preventing annotation, such as needing to contact the author or obtain the paper. The notes are emailed monthly to the curator as a reminder to complete the annotation.
  • Unique annotation constraint: cannot annotate the same protein, GO term, and paper twice (i.e., cannot make the same annotation with two evidence codes).
  • deleted annotations are saved.
  • external annotations are uploaded every night.
  • The tool sends QC reports every week, for example if a qualifier is used in the wrong GO aspect (such as contributes_to with component).
  • new annotations are uploaded weekly into QuickGO (can be downloaded or accessed by webservice)
  • GAFs are exported every month to the FTP site (contains new manual and IEA annotations).
  • SQL interface where annotations can be downloaded, for example by curator
  • they have a dedicated programmer
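The unique annotation constraint and the weekly qualifier check described above can be illustrated with a small sketch. This is not Protein2GO code: the `Annotation` record, the function names, and the qualifier-to-aspect table are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Assumed rule table: GO aspects in which each qualifier is allowed.
# F = molecular function, P = biological process, C = cellular component.
ALLOWED_ASPECTS = {
    "contributes_to": {"F"},
    "colocalizes_with": {"C"},
}


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record; the field names are illustrative."""
    protein: str
    go_term: str
    aspect: str
    paper: str
    evidence: str
    qualifier: str = ""


def find_duplicates(annotations):
    """Return annotations that repeat an earlier (protein, GO term, paper) key.

    The evidence code is deliberately left out of the key, so the same
    annotation made with a second evidence code counts as a duplicate.
    """
    seen = set()
    duplicates = []
    for ann in annotations:
        key = (ann.protein, ann.go_term, ann.paper)
        if key in seen:
            duplicates.append(ann)
        else:
            seen.add(key)
    return duplicates


def qualifier_errors(annotations):
    """Return annotations whose qualifier is used in a disallowed GO aspect."""
    errors = []
    for ann in annotations:
        allowed = ALLOWED_ASPECTS.get(ann.qualifier)
        if allowed is not None and ann.aspect not in allowed:
            errors.append(ann)
    return errors
```

Running both functions over an exported annotation list would give a local approximation of the duplicate constraint and of one kind of QC report line.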

Future features:

  • annotation page will show annotations of all proteins in a paper
  • annotation to complexes

Update on automated gene products

  • Yulia removed all gene products and inserted new ones based on inparanoid data.
       Number of genes with automated gene products   RTEs    Wrong/uninformative
Old    998                                            ~300    ~400
New    649                                            0       see list Image:Automated GP 2010-03.txt

Colleague search

See Dan Lusche - and others: people are confused.

Textpresso GO component annotation

  • Need to check with Kimberly - I think we're still waiting for their verification of the first analysis.
  • Pascale: emailed Kimberly Jan 19, 2010.

Display of EST IDs in GBrowse versus Gene page versus BLAST results

The problem is that in GBrowse we see the 'Japanese' IDs, while on the gene page and in BLAST we see the DDB IDs. I (Pascale) would argue that we should avoid displaying internal IDs for external resources.

Improving phenotype annotations

  1. There is a problem when there are multiple, similar strains (the Stock Center has multiple versions of the same strain): example:
  2. Another issue is marking 'wild type' when the researchers only tested one specific phenotype: example:

Gene Names

Ugly gene names:

  • V4-7, 29C, 2C (clone names) , ACC1 (AAC-rich mRNA), rcd* (random cDNA), rsc* (random slug DNA): are those names useful? Should we move them to synonyms and re-name the genes DDB_G?
  • Easier?: Bill's DG mutants: several genes are still named like the strains (i.e., DG####). Should we change that to DDB_G? Marc already started (Petra)
  • Ted's V#### strains: for those that are linked to genes, should we rename to geneX- and put the V#### strain name in the systematic name?
  • We should probably put the V#### names as the systematic name anyway.


  • There is no PKG in Dicty (see kinase paper), so maybe we should rename the pkg* genes? THERE is pkgB!! There is also pkgA, pkgC - Petra

There is a gene called pkg but it's not a protein kinase G. - Pascale

  • In particular pgkB which is called pkbr1 in the literature (ref 11841, 4186)- this is not pgkB, but pkgB Petra

Related sequences

  • Do we want to link to BLINK?

    • The URL is easy; pastes the gi of the protein
  • Or maybe we want to wait for the Panther families and show that? Or show both?

SOPs for gene names and gene product names

Manual gene products

We are updating, but Pascale would like to see if we can retrofit what we've got and stay relatively consistent even if we change our practices.

-- Distinct manual (non-automated) gene products linked to a locus in v_gene_features
SELECT DISTINCT gene_product
  FROM gene_product
 WHERE is_automated = 0
   AND gene_product_no IN (
         SELECT g.gene_product_no
           FROM gene_product g
           JOIN locus_gp gp ON g.gene_product_no = gp.gene_product_no
          WHERE gp.locus_no IN (SELECT feature_id
                                  FROM cgm_chado.v_gene_features)
       )
 ORDER BY gene_product

  • -like?
  • -family?
  • xx Dicty-specific gene family / * amoeboid? (some genes are nicely conserved in purpureum)
  • domain-containing?
  • What to do about the relationship between numbers in gene names and numbers in gene products:
    • car1 -> cAMP receptor 1?
    • gata: eg, gtaD = GATA zinc finger domain-containing protein 4
    • uch1, uch2 : ubiquitin C-terminal hydrolase 1, 2?
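One way to reason about the numbering question is to sketch what a mechanical rule would produce. The sketch below assumes a hypothetical prefix-to-stem table (`PRODUCT_STEMS`) and a hypothetical helper name; it is not an existing dictyBase tool. It only handles names that end in digits, which shows why letter-suffixed names such as gtaD need a separate decision.

```python
import re

# Hypothetical table mapping a gene-name prefix to a gene product stem.
# The two entries mirror the car and uch examples in the list above.
PRODUCT_STEMS = {
    "car": "cAMP receptor",
    "uch": "ubiquitin C-terminal hydrolase",
}


def product_from_gene_name(gene_name):
    """Suggest a gene product that carries over the number in the gene name.

    Returns None when the name has no trailing number or the prefix is
    unknown, so such genes still need a manual decision.
    """
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", gene_name)
    if match is None:
        return None
    prefix, number = match.groups()
    stem = PRODUCT_STEMS.get(prefix)
    if stem is None:
        return None
    return f"{stem} {number}"
```

Under this rule car1 and uch2 keep their numbers in the product name, while gtaD yields no suggestion because the name carries a letter rather than a number.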

Automated gene products

-- Distinct automated gene products linked to a locus in v_gene_features
SELECT DISTINCT gene_product
  FROM gene_product
 WHERE is_automated = 1
   AND gene_product_no IN (
         SELECT g.gene_product_no
           FROM gene_product g
           JOIN locus_gp gp ON g.gene_product_no = gp.gene_product_no
          WHERE gp.locus_no IN (SELECT feature_id
                                  FROM cgm_chado.v_gene_features)
       )
 ORDER BY gene_product

SOPs for 'Genomic context'

More examples:

  • sir2D: supported by genomic context? Similarity is only in the ~3' half of the gene
    • I think if one side is supported, incomplete support should be used (Petra).
  • DDB_G0270306: supported by genomic context?
    • If there is no similarity that helps with the gene model, then yes, I would say it's genomic context (Petra).

SOPs for 'Supported by sequence similarity'

Pascale: I've never been comfortable marking a gene as supported by sequence similarity if the support comes only from other Dicty genes. I.e., sometimes it's only two duplicated genes and you're not sure of either, but they both support each other (probably need examples).
Petra: I agree you should not use it in the example you give. But if one or more genes have good support, and there is another without support, I find it legit to use sequence similarity.

  • Pascale : need some examples

Bob: An example of the first type, as Pascale described above, is: [[1]]. This is one member of a large gene family which only matches other Dicty proteins. I don't have any examples such as Petra describes.

Displaying only 'relevant' papers on Gene page

  • One possibility would be for us to use 'Function/process' for important papers and have those shown in priority
  • Or: create a new Topic
  • Or: could put the papers with the most topics clicked
  • Or: could exclude 'review' and Genome-wide analysis
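To compare the options above, a rough sketch of how they could combine into one ordering may help. The paper records, the `relevant_papers` name, and the `limit` parameter are assumptions for illustration; only the topic names 'Function/process', 'review' and 'Genome-wide analysis' come from the list above.

```python
# Topics that would hide a paper from the short list (compared in lower case).
EXCLUDED_TOPICS = {"review", "genome-wide analysis"}
# Topic that would push a paper to the top of the list.
PRIORITY_TOPIC = "function/process"


def relevant_papers(papers, limit=10):
    """Return up to `limit` papers, most relevant first.

    `papers` is a list of dicts with an "id" and a list of "topics".
    Papers with an excluded topic are dropped; the rest are ordered by
    priority topic first, then by the number of topics clicked.
    """
    def topics_of(paper):
        return {topic.lower() for topic in paper["topics"]}

    kept = [p for p in papers if not (topics_of(p) & EXCLUDED_TOPICS)]
    kept.sort(key=lambda p: (PRIORITY_TOPIC not in topics_of(p),
                             -len(p["topics"])))
    return kept[:limit]
```

Each criterion is a separate term in the filter or the sort key, so any one option can be tried on its own by removing the others.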

Finding the number of genes in a family

  • This is not possible unless all genes in the family share the same prefix and no related prefix appears in the synonyms
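The prefix condition can be made concrete with a small sketch. It assumes gene names of the form prefix plus an uppercase letter or number suffix, and a hypothetical mapping from each gene name to its synonyms; neither reflects the actual database layout.

```python
import re


def family_members(genes, prefix):
    """Split prefix matches into primary-name hits and synonym-only hits.

    `genes` maps a primary gene name to a list of synonyms. The second
    list holds genes that match the prefix only through a synonym, which
    is the case that makes a simple count unreliable.
    """
    pattern = re.compile(re.escape(prefix) + r"[A-Z0-9]+")
    by_name = sorted(name for name in genes if pattern.fullmatch(name))
    by_synonym_only = sorted(
        name
        for name, synonyms in genes.items()
        if not pattern.fullmatch(name)
        and any(pattern.fullmatch(syn) for syn in synonyms)
    )
    return by_name, by_synonym_only
```

A count based on the first list alone is only trustworthy when the second list is empty, which matches the condition stated above.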

Annotation issues in need of further discussion

  • Curation of papers using inhibitors (issue #0041).
  • Descriptions containing "similar to" (issue #0042).
  • Abbreviations in the gene product and description (issue #0043).
  • "Disease gene-related" literature topic (issue #0044).
  • Pseudogene annotation (issue #0056).
  • "Homolog" in gene product (issue #0059).
  • "Unpublished" sequences for Curated Model annotation (issue #0060).

