June 9 2011

From DictyWiki

Revision as of 19:04, 6 June 2011 by YuliaBushmanova (Importing D.fasciculatum, P.pallidum)

GO prep for Protein2GO

General issues

  1. Well on track. Emily just sent another list of problematic annotations; I sent the corrections to Sidd to implement.

Textpresso GO annotations

Skype call Summary

Review dicty Paper Pipeline and Textpresso

dictyBase: Papers have been added to the Textpresso corpus as they were curated; last year fewer were added because the curation focus was on gene models. PubMed searches using keywords (e.g. Dictyostelium) find papers; PDFs are then downloaded manually and the relevant genes attached. This is a bottleneck. Is there an easier way to upload, for example multiple papers at once?

Other Options for Paper Download

We can use scp or another file transfer protocol, or give Arun an account on a machine at dictyBase so he can fetch the papers via a script.

Preprints are okay for Textpresso, so the full text can be downloaded as soon as it is available.

Alternatively, automated downloads from PMC or directly from journal web sites could be put in place, although downloading from journal sites involves more site-specific details.

dictyBase could provide Textpresso with the relevant PMIDs and Textpresso could set up the download pipeline.

PMC has a six-month delay, but right now that should not be a problem; we can revisit it in the future.

Future Plans

Once new papers are in the corpus, perform CCC search on dicty papers from April 2010 - November 2010. This will allow dicty curators to look over the results and determine how well the searches are working for them.

Update dicty gene list (last update was August 2008), add synonyms, and consider if there are any gene names that might contribute to false positives (e.g., ER for TAIR)

Time frame: mid-June for generating source files for testing search results
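
A minimal sketch of how the updated gene list could be screened for names likely to produce false positives before it is sent to Textpresso. Everything here is an assumption for illustration: the function name, the tiny stop list, the length threshold, and the sample names are hypothetical and not part of any dictyBase or Textpresso pipeline.

```python
# Hypothetical stop list; a real one would be derived from the paper corpus.
COMMON_WORDS = {"and", "can", "cap", "end", "far", "not", "was"}


def flag_ambiguous(names, common_words=COMMON_WORDS, min_length=3):
    """Return {name: reason} for gene names or synonyms that may cause false positives.

    A name is flagged when it matches a common English word (case-insensitive)
    or is shorter than min_length characters.
    """
    flagged = {}
    for name in names:
        if name.lower() in common_words:
            flagged[name] = "common English word"
        elif len(name) < min_length:
            flagged[name] = "very short"
    return flagged


# Illustrative input only; unflagged names pass through untouched.
candidates = ["carA", "cap", "ER", "acaA"]
report = flag_ambiguous(candidates)
for name, reason in report.items():
    print(f"{name}: {reason}")
```

Curators would still review the flagged names by hand; the sketch only narrows the list to check.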

Since then

  1. Petra sent an up-to-date list with gene names and synonyms
  2. Petra sent PMIDs for uncurated 2010 papers

Gene Curation Update

  • About 30 complicated genes with changes still to go to Bob. Almost there!
  • Writing abstract for meeting, thinking about slides and numbers needed that I can't get myself easily, thinking about paper to write...


Release 2-20

  • Set up GAF workflow
  • Fix caching issue?
  • Add Harry's data?
  • Add ortholog EC numbers to the gene page display

Software development future

Q: How do we fit these into the timeline, in sync with our plan chart?

Importing D.fasciculatum, P.pallidum

loaded scaffolds (supercontigs) (D.fasciculatum, P.pallidum, P.pallidum mitochondrion)
  • named after GenBank record
  • added GenBank dbxref
  • added description from GenBank (e.g. "Polysphondylium pallidum PN500 unplaced genomic scaffold PPL_scaffold2, whole genome shotgun sequence.")
  • TODO: add reference
  • NOTE: not searchable by dbxref/name
loaded contigs (fake for mitochondria)
  • named after GenBank record
  • added GenBank dbxref
  • TODO: add reference
  • NOTE: not searchable by dbxref/name
  • NOTE: the mitochondrial genome does not have contigs, so an artificial one has to be created to display it in GBrowse
loaded genes
  • added gene product (excl. "hypothetical protein")
  • TODO: add reference
loaded mRNA & polypeptide features
  • added SACGB dbxref
  • added GenBank dbxref
  • added EC dbxref (mitochondrial genes)
  • added 'codon start/translation_start' prop
  • TODO: add reference
loaded tRNA, rRNA features
  • TODO: add reference
imported ESTs (ppal) [1]
  • aligned to genome (~50%)
  • TODO: add reference (Gray,M.W. TBestDB [2] Polysphondylium pallidum)
created blast databases (ppal, dfas)
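
The "not searchable by dbxref/name" notes above amount to a lookup that accepts any identifier attached to a feature, not only its sequence ID. A rough sketch of that idea follows, as an in-memory index rather than a change to the real search code. The Feature class, the internal ID PPA0000001, and the pairing of that ID with GenBank accession GL290984 are hypothetical and used only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    uniquename: str          # internal sequence ID (hypothetical value below)
    name: str                # display name, here taken from the GenBank record
    type: str                # e.g. "supercontig", "contig", "gene"
    dbxrefs: tuple = ()      # external ids written as "DB:accession"


def build_index(features):
    """Map every identifier of a feature (lower-cased) to its uniquename."""
    index = defaultdict(set)
    for feature in features:
        keys = [feature.uniquename, feature.name]
        for dbxref in feature.dbxrefs:
            keys.append(dbxref)                     # "GenBank:GL290984"
            keys.append(dbxref.split(":", 1)[-1])   # bare accession
        for key in keys:
            index[key.lower()].add(feature.uniquename)
    return index


def search(index, term):
    """Return the sorted uniquenames matching an ID, name, or dbxref."""
    return sorted(index.get(term.strip().lower(), ()))


scaffold = Feature(
    uniquename="PPA0000001",          # hypothetical internal ID
    name="GL290984",
    type="supercontig",
    dbxrefs=("GenBank:GL290984",),
)
index = build_index([scaffold])
print(search(index, "genbank:gl290984"))
```

With such an index, a scaffold or contig that is not tied to a gene would still be found by its GenBank accession.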

To discuss

Organism
  • naming: where to store strain names (Polysphondylium pallidum PN500, Dictyostelium fasciculatum SH3)? In the species column? (It is also used in the URL.)
  • ESTs, mitochondria loading? (ppal)
  • abbreviations (3 letters, as for DDB/DDB_G): DFA/DFA_G, PPA/PPA_G
Scaffolds
  • conventional naming is "scaffold_#", which can be confusing for multiple species
  • used the more meaningful GenBank locus/accession, e.g. GL290984
  • added GenBank external id; search has to be modified to allow search by dbxrefs on features not tied to genes (currently such features can be searched only by sequence ID, e.g. DDB0231574)
Contigs
  • not imported for D.purpureum, although data is available for all organisms
  • used species-specific id for name; should it be the GenBank locus/accession instead (e.g. ADBJ01000006)?
Genes
  • linkouts: to GenBank (via protein id); to SACGB (http://sacgb.fli-leibniz.de/cgi/annsheet.pl?gid=16876&ssi=free), either via locus tag (results in search, e.g. http://sacgb.fli-leibniz.de/cgi/freesearch.pl?ssi=free&word=PPL_03507&.submit=Go) or via inner id (can be derived from fasta)?
  • skip "hypothetical protein" gene product?
  • import references and tie them to features, to genes, or to both?
GBrowse
  • citations, track names
D.purpureum
  • submitted to GenBank, should gene models be updated?
  • has assembly information (we do not show gaps/contigs currently)
  • GenBank linkouts?
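
The abbreviation proposal above (DFA/DFA_G and PPA/PPA_G alongside DDB/DDB_G) could be captured in one small mapping. This is only a sketch under stated assumptions: the seven-digit zero padding is inferred from the example DDB0231574, and the function and dictionary names are hypothetical.

```python
# Proposed per-species prefixes: (feature prefix, gene prefix).
ID_PREFIXES = {
    "D. discoideum": ("DDB", "DDB_G"),
    "D. fasciculatum": ("DFA", "DFA_G"),
    "P. pallidum": ("PPA", "PPA_G"),
}


def make_id(species, number, gene=False, width=7):
    """Build an ID such as DDB0231574; width=7 is an assumption, not a spec."""
    feature_prefix, gene_prefix = ID_PREFIXES[species]
    prefix = gene_prefix if gene else feature_prefix
    return f"{prefix}{number:0{width}d}"


def species_of(identifier):
    """Return the species whose prefix matches the ID, or None if none does."""
    for species, (feature_prefix, gene_prefix) in ID_PREFIXES.items():
        # Check the gene prefix first: "DDB_G..." also starts with "DDB".
        if identifier.startswith(gene_prefix):
            return species
        if identifier.startswith(feature_prefix):
            return species
    return None


print(make_id("D. discoideum", 231574))        # DDB0231574
print(make_id("P. pallidum", 42, gene=True))   # PPA_G0000042
print(species_of("DFA_G0000007"))              # D. fasciculatum
```

Keeping the prefixes in one place would also let URLs and searches resolve the species from the ID alone.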

Stock center strains

The list of REMI strains from Christopher Quang Dung Dinh/Adam Kuspa contains links to chromosomal locations (e.g. [5]). Chromosome 2 coordinates are already incorrect due to the shift. External data is out of our control, but it is possible to store this data on our side by creating a new feature representing a single mutation point; this feature would have a location on the chromosome and would be linked to the strain. Plan and estimate:

Write middleware for handling new feature
Figure out data model (5 days)
Figure out software interface (5 days)
Write software adaptor working with two schemas (3 weeks)
Display (gbrowse?, strain page) (2 weeks)
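
As a starting point for the data model step, a mutation-point feature could look roughly like the sketch below. It is an illustration, not a proposed schema: the class and field names, the strain ID, and the coordinates are hypothetical, and apply_shift only shows how locally stored positions could be corrected after a coordinate shift such as the one on chromosome 2.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InsertionSite:
    """A single mutation point, located on a chromosome and linked to a strain."""
    strain_id: str     # stock center strain identifier (hypothetical value below)
    chromosome: str
    position: int      # 1-based coordinate on the chromosome


def apply_shift(site, chromosome, start, offset):
    """Return a site moved by offset if it lies on chromosome at or after start.

    Sites on other chromosomes, or upstream of start, are returned unchanged.
    """
    if site.chromosome == chromosome and site.position >= start:
        return replace(site, position=site.position + offset)
    return site


site = InsertionSite(strain_id="STRAIN_X", chromosome="2", position=5000)
shifted = apply_shift(site, chromosome="2", start=3000, offset=150)
print(shifted.position)   # 5150
```

Storing the location on our side means a later assembly change can be applied once to the stored features instead of leaving external links stale.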