MAY 3 2010

From DictyWiki



EBI Roadshow summary

  • ArrayExpress: Microarray data repository, uses MO (Microarray Ontology). We have the Baylor Expression tool, so no real need. But there are 9 Dicty entries: 7 by different Dicty people who collaborated with Gareth, and two by Anup, probably to test it out [1]. It's a powerful tool for organisms with a lot of data.
  • Atlas: New expression tool (1 year old). See the example of expression of genes involved in the neuropeptide signaling pathway in the mouse brain [2]
  • Reactome: Dicty in Reactome [3]. Dicty pathways are all electronically inferred from human. Question: would it be useful to have Dicty annotated where it's strong (G-protein coupled signaling) and link to it? This would mean trying to help them annotate, I guess.
  • Ensembl:
    • Many species, Dicty under protists [4], and hard to find. Asked Dan Lawson to move it up to be included in main organisms, like yeast.
    • Genome browser includes 5' and 3' UTRs, variant, regulatory regions, and SNPs, all dependent on data availability.
    • Dicty gene annotations were taken from our GFF3 file OCT-14-1009. Discussion with Dan Lawson about updates. He said WormBase has an Ensembl database and updates happen regularly, every 6 months. They are firming up their process to generate Ensembl from GFF3, but in the absence of us doing anything they will update when time and resources allow.
    • Ensembl Dicty Biomart [5] is nice and provides data we don't have: domains including transmembrane domains, gene types (pseudogenes, ncRNAs, etc.), homologs of other protists (currently; I asked for more / human), paralogs. No second data set as we have for ESTs, but definitely some good stuff. We should take it as inspiration to really work on ours. Or link out to theirs?

Rapid gene model curation update

  • Number of models done
    • Petra: 590 genes, in 16 sessions and a total of 924 minutes (15 h, 24 min); 1.57 min/gene; 504 annotated, 348 complete, 156 incomplete, 86 skipped.
    • Pascale: 95 genes, 78 annotated, 42 complete, 36 incomplete, 17 skipped
    • Bob: 355 genes reviewed: 303 curated (152 complete, 151 incomplete) 52 skipped; did not record total minutes spent, so cannot provide time stats
  • Number of models skipped:
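
The per-curator counts above can be cross-checked with a few lines of Python. This is only a sketch: the dictionary layout and the function name `summarize` are made up for illustration, and the numbers are the ones reported in the list. Minutes are set to `None` where no total time was recorded.

```python
# Counts as reported above; the data layout is an illustrative assumption.
curators = {
    "Petra": {"reviewed": 590, "complete": 348, "incomplete": 156,
              "skipped": 86, "minutes": 924},
    "Pascale": {"reviewed": 95, "complete": 42, "incomplete": 36,
                "skipped": 17, "minutes": None},
    "Bob": {"reviewed": 355, "complete": 152, "incomplete": 151,
            "skipped": 52, "minutes": None},
}


def summarize(stats):
    """Return annotated count, skip rate and (if known) minutes per gene."""
    annotated = stats["complete"] + stats["incomplete"]
    # Every reviewed gene should be either annotated or skipped.
    if annotated + stats["skipped"] != stats["reviewed"]:
        raise ValueError("counts do not add up to the number reviewed")
    minutes = stats["minutes"]
    return {
        "annotated": annotated,
        "skip_rate": stats["skipped"] / stats["reviewed"],
        "min_per_gene": None if minutes is None else minutes / stats["reviewed"],
    }


for name, stats in curators.items():
    print(name, summarize(stats))
```

The same function could be reused for the skipped-model and category 4 numbers, as long as the counts are entered in the same shape.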

Skipped Model stats

  • Petra: so far curated: 13; ranged from very complicated (2 hours+) to relatively easy after blast (15 min). Needed almost 12 hours on these genes, or an average of 55 min/gene.
  • Bob: Reviewed 31 gene models. Curated 26 in 800 min. which is 30.8 minutes per gene model. (time spent ranged from 20 min. to 60 min.) Skipped 5 gene models which I could not resolve (time spent on these ranging from 15 to 50 minutes.)

Category 4 gene trial - Petra (Genes with ESTs but No Dpur hit - 359 genes)

  • 30 min trial
  • 24 genes inspected: 7 complete, 7 incomplete, 10 skipped
  • Inspection is faster than for cat. 6 genes, as one does not need to check Dpur Gbrowse. I typically clicked on the protein tab to see if there are domains, especially for incomplete 1 exon genes. EST counts ranged from 1 to 5, with one gene at 130.
  • Complete genes have complete EST coverage and intron/exon, start/end all looks good and unambiguous
  • Incomplete are genes that look good but don't have complete EST support.
  • Skip rate is very high, 42%! 2 genes need a merger, some otherwise disagreed with ESTs, and some are probably OK but there was just not enough confidence to approve without further investigation (blast, reprediction).
  • Question: Is it worth it given the high skip rate? Skipping takes about a minute. Though with my notes I get a pretty good skip list of genes that have some ESTs and need fixing.
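
To help with the question above, here is a rough back-of-the-envelope projection in Python. It assumes, purely for illustration, that the 30 min trial is representative of all 359 category 4 genes and that time scales linearly; the variable names are made up and the result is an estimate, not a measurement.

```python
# Numbers from the 30 min trial above.
trial_minutes = 30
inspected = 24
complete, incomplete, skipped = 7, 7, 10
category_size = 359  # genes with ESTs but no Dpur hit

# The three outcomes should account for every inspected gene.
assert complete + incomplete + skipped == inspected

skip_rate = skipped / inspected               # fraction of genes skipped
minutes_per_gene = trial_minutes / inspected  # average inspection time

# Naive linear extrapolation to the whole category (an assumption).
projected_minutes = category_size * minutes_per_gene
projected_skipped = round(category_size * skip_rate)
projected_approved = category_size - projected_skipped

print(f"skip rate: {skip_rate:.0%}")
print(f"projected time: {projected_minutes / 60:.1f} h")
print(f"projected skipped / approved: {projected_skipped} / {projected_approved}")
```

Under these assumptions the whole category would take roughly seven and a half hours of inspection and leave a skip list of about 150 genes for later, slower curation.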

Curation tool

New reports for gene model curation. This app will be separate from the existing curation interface and will require a separate login. The page will contain:
  1. Login interface Done
  2. GBrowse display of a region (gene + 1 kb 5' and 3') (SC Predictions, geneID repredictions, ESTs) Done
  3. Decorated FASTA of a region Done
  4. Blink Done
  5. 'supported by' checkbox with ESTs, sequence similarity, genomic context, incomplete support, conflicting evidence Done
  6. 'Approve' button: Done
    1. creates curated model derived from gene prediction
    2. inserts curation note
  7. Display BLAST alignment
  8. Display protein sequence
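
The "gene + 1 kb" display region from item 2 could be computed along these lines. This is a minimal sketch, not the tool's actual code: the function name `display_region`, the 1-based inclusive coordinates, and the clamping to the chromosome ends are all assumptions.

```python
def display_region(gene_start, gene_end, chrom_length, flank=1000):
    """Return (start, end) of the gene extended by `flank` bases on each side.

    Coordinates are assumed to be 1-based and inclusive; the region is
    clamped so it never runs off either end of the chromosome.
    """
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    region_start = max(1, gene_start - flank)
    region_end = min(chrom_length, gene_end + flank)
    return region_start, region_end


# A gene in the middle of a chromosome gets the full 1 kb on both sides.
print(display_region(5000, 7000, 100000))
# A gene near the chromosome ends is clamped to the available sequence.
print(display_region(300, 900, 1500))
```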

Left for the next round:

  1. Domains (with SignalP and TMHMM data) In process
    1. Would it be domains for SC Prediction(s) only or for repredictions as well?
    2. Display: graphical/tabular/decorated FASTA?


  • Done: Thomas sent annotations for the remaining 55 that were annotated by Pascale plus 20 from OrthoMCL, resulting in a total of 487: 421 _RTE genes and 66 _TE genes.
  • Sidd adapted the script for the gene list.

Anup: RNA seq browser

  • What should we do? Add links? Add a single link to the tool?
  • How much effort should we invest in getting assemblies?

From the information Anup sent, here are some stats about ESTs and RNA-seq data:

  • 6,721 genes have ESTs according to dictyMart
  • 8,435 genes have RNA seq

  • RNAseq + ESTs: 4,736 genes
  • RNAseq; No ESTs: 3,690 genes
  • EST, no RNAseq: 1,985
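
The three categories above are plain set operations on the two gene lists. The sketch below shows the idea with made-up placeholder gene IDs; the function name `partition_support` is hypothetical and the real lists would come from dictyMart and from Anup's data.

```python
def partition_support(est_genes, rnaseq_genes):
    """Split genes into: both kinds of support, RNA-seq only, EST only."""
    est_genes, rnaseq_genes = set(est_genes), set(rnaseq_genes)
    return {
        "both": est_genes & rnaseq_genes,
        "rnaseq_only": rnaseq_genes - est_genes,
        "est_only": est_genes - rnaseq_genes,
    }


# Placeholder IDs for illustration only, not real Dicty gene IDs.
est = {"gene_1", "gene_2", "gene_3"}
rnaseq = {"gene_2", "gene_3", "gene_4", "gene_5"}
parts = partition_support(est, rnaseq)

# By construction the parts add back up to the two input lists.
assert len(parts["both"]) + len(parts["est_only"]) == len(est)
assert len(parts["both"]) + len(parts["rnaseq_only"]) == len(rnaseq)
print({name: sorted(genes) for name, genes in parts.items()})
```

Applying the same additive check to the reported counts, the EST side adds up (4,736 + 1,985 = 6,721), while the RNA-seq side gives 4,736 + 3,690 = 8,426 rather than 8,435; the reason for the difference of 9 genes is not recorded here.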

Visualizing RNAseq data

GFF file: 9,745,162 nucleotides with a positive value, which have to be stored.
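
One possible way to reduce the number of rows to store, offered only as a sketch and not necessarily what the demo does, is to collapse runs of adjacent nucleotides that share the same value into intervals. The function name `collapse_runs` and the input format (position, value pairs sorted by position, 1-based) are assumptions.

```python
def collapse_runs(per_base):
    """Merge adjacent positions with equal values into (start, end, value).

    `per_base` is an iterable of (position, value) pairs sorted by position.
    Positions with no value are simply absent, so a gap always ends a run.
    """
    runs = []
    for pos, value in per_base:
        if runs and runs[-1][2] == value and runs[-1][1] == pos - 1:
            # Same value and directly adjacent: extend the current run.
            start = runs[-1][0]
            runs[-1] = (start, pos, value)
        else:
            # New value or a gap in positions: start a new run.
            runs.append((pos, pos, value))
    return runs


# Toy coverage values for illustration only.
coverage = [(1, 2), (2, 2), (3, 5), (5, 5), (6, 5)]
print(collapse_runs(coverage))
```

How much this saves depends entirely on how often neighbouring positions share a value in the real data, which would need to be measured.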

Have set up a working demo on my machine (Sidd); will show it.
Will try to set up a demo on testdb by tomorrow to get some feedback and see how it performs.

Further Strategy?

NAR paper 2011

We need to tell the editor (Michael Galperin?) by July 1st whether we intend to submit. We don't need to discuss this today, but it should remain on the agenda until we've done this.
