June 20 2011

From DictyWiki

Decisions of this meeting

GenBank Loader

  • Will not be fixed or updated
  • Curators approve records in the table, and Yulia updates them in the database once or twice a year, depending on how many there are
  • After each update, the records get wiped from the database

Sidd's Projects (after GO release)

  1. GenBank Update
  2. Gene Page Cache
  3. Harry's data and new gene tab
  4. (Estimate for) GBrowse 2 with new VM
  5. (Estimate for) Intermine
  6. (Estimate for) Biomart

Yulia's Projects

  1. Finish Genome update
  2. Gene Page Cache
  3. Harry's data and new gene tab

Petra Action Item

  • Email Adam, Joan, and Dave to ask whether they agree with the Baldauf Dicty tree for taxonomy in Chado

Load GenBank Records tool

  • The tool has been broken since the last release, and it seems to break all the time because the code is so old
  • There are not a lot of GenBank records, but time and again there is something to load, and loading is still worthwhile if controlled by a curator.
  • It's probably better to create a new tool (Yulia) than to keep fixing the old one
  • If we create a new tool, the question is whether to filter records more stringently by strain, to limit what gets in and what we choose from. Currently we have many records that we cannot link, and these, so far, stay as 'corpses'. However, filtering is not simple and I'm not sure it can be done; otherwise we should have a 'reject' function. Examples:
    • Mating type genes: AX4 is mating type I and only has matA, but GenBank FN543123 is from strain WS205, because that is the strain they sequenced. In the paper they say AX2 and all mating type I strains are the same: "We confirmed that the ORF is present in all the other type I strains used in the microarray study (100% identical in amino acid sequence in all cases; table S3), and sequenced the entire locus from another type I strain, WS205, to confirm that no other obvious coding sequences are present." However, mating types II and III have two extra genes each (matB/matC and matS/matT, respectively), and the sequences are from AC4 strains, see e.g. CBA34789. We cannot associate those. If we had filtered by strain, the good WS205 record would not have been presented to us, and it would be bad not to have it.
    • Sequences from companies/patents: One company submits all of its patented sequences to GenBank. We associated them for a while, but nowadays we check how many records a gene already has and don't add more if there is already one. Several genes already have two, e.g. mrpl13. These cannot be filtered by strain because they are just D. discoideum records, e.g. CBF89765, a third record for mrpl13. A reject option would be nice for these cases.
    • 152 records of wild-strain cox1/2: Strassmann/Queller have published a paper about genetic diversity in wild dicty strains and, as far as I can see, submitted partial mitochondrial cox1/2 genes (22 aa) from all their wild strains, e.g. AEG75504. I think those should all be rejected, but I need to check whether they have an axenic control (I didn't see one in a quick look). Plus, we already have a full and a partial GenBank record for cox1/2.
    • 'Normal record': Right now there is one new regular, single GenBank submission, FAA00712, that cannot be pulled in without the tool. Plus, they name the gene, so the tool would come in handy, as we could also accept names from there.
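
A minimal sketch of the 'reject' idea, assuming a new tool keeps a persistent set of rejected accessions instead of filtering by strain. The class and method names are hypothetical, and rejecting CBF89765 here is only an illustration of the patent case above, not a recorded decision.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class GenBankRecord:
    accession: str
    strain: Optional[str] = None  # None when the record names no strain


@dataclass
class CuratorQueue:
    """Hypothetical queue: every record is shown unless a curator rejected it."""
    rejected: Set[str] = field(default_factory=set)

    def reject(self, accession: str) -> None:
        # Remember the decision so the record is not presented again.
        self.rejected.add(accession)

    def pending(self, records: Iterable[GenBankRecord]) -> List[GenBankRecord]:
        # No strain filter: a record such as the WS205 one still reaches curators.
        return [r for r in records if r.accession not in self.rejected]


records = [
    GenBankRecord("FN543123", strain="WS205"),
    GenBankRecord("CBF89765"),
    GenBankRecord("AEG75504"),
]
queue = CuratorQueue()
queue.reject("CBF89765")
print([r.accession for r in queue.pending(records)])
```

With this design nothing is hidden automatically; a record only disappears from the list once a curator has looked at it and rejected it.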

GBrowse2

  1. Timeline to make public?
  2. It would be good to plan a general GBrowse update, to make one nicer version in a few months. It can be after the meeting, but this year. If someone asks at the meeting, it would be nice to have a response.

Release 2-20

  • Modify GAF dumping script to skip ncRNAs
    • So that our GAF file gets loaded in protein2goa
    • Curation can be started without any further delay
  • Working on OBO update script.
  • Polish GAF loading script to include GAF references and DOIs that have no PubMed ID.
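
A rough sketch of the ncRNA skip, assuming the dump is a GAF 2.0 file and that ncRNA rows can be recognised by column 12 (DB Object Type). The set of type strings, the function names, and the sample IDs are assumptions for illustration, not the actual dumping script.

```python
from typing import Iterable, Iterator, Set

# GAF 2.0 has 17 tab-separated columns; column 12 is "DB Object Type".
OBJECT_TYPE_COL = 11  # zero-based index of column 12

# Assumption: the type strings that mark ncRNA rows in the dump.
NCRNA_TYPES: Set[str] = {"ncRNA", "tRNA", "rRNA", "snRNA", "snoRNA"}


def skip_ncrna(lines: Iterable[str]) -> Iterator[str]:
    """Yield GAF lines, dropping annotation rows whose object type is an ncRNA."""
    for line in lines:
        if line.startswith("!"):  # header/comment lines pass through unchanged
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > OBJECT_TYPE_COL and cols[OBJECT_TYPE_COL] in NCRNA_TYPES:
            continue
        yield line


def example_row(object_id: str, object_type: str) -> str:
    """Build a 17-column row with placeholder values (hypothetical data)."""
    cols = [""] * 17
    cols[0], cols[1], cols[OBJECT_TYPE_COL] = "dictyBase", object_id, object_type
    return "\t".join(cols) + "\n"


sample = [
    "!gaf-version: 2.0\n",
    example_row("DDB_G0000001", "protein"),
    example_row("DDB_G0000002", "ncRNA"),
]
kept = list(skip_ncrna(sample))
print(len(kept))  # 2
```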

Importing D.fasciculatum, P.pallidum

GenBank records
  • D.fasciculatum
  • P.pallidum, P.pallidum mitochondrion, P.pallidum ribosomal
loaded scaffolds (supercontigs)
  • named after genbank record
  • added genbank dbxref
  • added description from genbank (e.g. "Polysphondylium pallidum PN500 unplaced genomic scaffold PPL_scaffold2, whole genome shotgun sequence.")
  • TODO: add reference
  • NOTE: not searchable by dbxref/name
loaded contigs (fake for mitochondrial and ribosomal)
  • named after genbank record
  • added genbank dbxref
  • TODO: add reference
  • NOTE: not searchable by dbxref/name. The mitochondrial genome does not have contigs, so we need to create an artificial one to display it in gbrowse.
loaded genes
  • added gene product (excl. "hypothetical protein")
  • TODO: add reference
loaded mRNA & polypeptide features
  • added SACGB dbxref
  • added genbank dbxref
  • added EC dbxref (mitochondrial genes)
  • added 'codon start/translation_start' prop
  • TODO: add reference
loaded tRNA, rRNA features
  • TODO: add reference
imported ESTs (ppal) [1]
  • aligned to genome 90% (4034 of 4452)
  • TODO: add reference (Gray,M.W. TBestDB [2] Polysphondylium pallidum)
created blast databases (ppal, dfas)

To discuss

Organism
  • abbreviations (3 letters for DDB/DDB_G) - DFA/DFA_G, PPA/PPA_G
  • where to store strain names (Polysphondylium pallidum PN500, Dictyostelium fasciculatum SH3), if we store them at all. Common practice is to use the species column. That column is currently used to identify the organism (as is the common name), i.e. it is used in URLs, for example:
genomes.dictybase.org/pallidum/gene => genomes.dictybase.org/pallidum PN500/gene  
genomes.dictybase.org/pallidum/gene => genomes.dictybase.org/pallidum/PN500/gene

- We don't have the D. discoideum strain anywhere! And for discoideum it's probably most likely that we get another strain. [Petra]

  • the mitochondrial genome for pallidum belongs to a different strain (CK8 vs PN500 for the genome), in the same way that the dicty mitochondrial genome [3] belongs to the AX3 strain
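
To compare the two URL options listed above under Organism, here is a small sketch that builds the gene URL with the strain as its own path segment, and shows what happens when the strain is embedded in the species segment instead. The function name is hypothetical; only the host and path shapes come from the notes.

```python
from typing import Optional
from urllib.parse import quote

BASE = "genomes.dictybase.org"


def gene_url(species: str, strain: Optional[str] = None) -> str:
    """Build species[/strain]/gene, percent-encoding each path segment."""
    parts = [species] + ([strain] if strain else []) + ["gene"]
    return BASE + "/" + "/".join(quote(p, safe="") for p in parts)


print(gene_url("pallidum"))           # current form, no strain
print(gene_url("pallidum", "PN500"))  # strain as a separate segment
print(gene_url("pallidum PN500"))     # strain inside the species segment
```

The third call shows that keeping the strain in the species column puts a space into the path, which has to be percent-encoded as `%20`; the separate-segment form avoids that.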
Contigs
not imported for D.purpureum, data available for all organisms
Genes
linkouts: to GenBank (via protein id), to SACGB [4]: via locus tag (results in search) [5] or via SACGB internal id (can be derived from fasta)?
D.purpureum
  • submitted to GenBank [6]
  • has assembly information (we do not have/show gaps/contigs for dpur)
Search
  • existing search is hardcoded to search discoideum or purpureum data:
    • Gene Names/Synonyms - discoideum only
    • Gene IDs - any
    • ESTs - any
    • dictyBaseDPIDs - any but name comes from SITE_NAME env variable
    • external ids - dicty or all, depending on SITE_NAME env variable
    • Gene Product - not activated on multigenome, searches dicty only.
  • search results display is not suited for cross-species search
Existing search
  • can be rewritten to use species parameter, making search species-specific
  • would take less time but limit functionality
  • would require both sites to use the same search
New search
  • can be written to use both databases in order to make cross-species search
  • would require complete rethinking of search strategies
  • would allow main site to use old search for the transition period
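
A toy sketch of what a species parameter could look like, assuming gene names are available per species. The in-memory index and its contents are hypothetical stand-ins for the real databases; passing no species gives the cross-species behaviour.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-species gene name index (placeholder data).
GENES: Dict[str, List[str]] = {
    "discoideum": ["matA", "mrpl13"],
    "purpureum": ["mrpl13"],
    "pallidum": ["mrpl13"],
}


def search_genes(term: str, species: Optional[str] = None) -> List[Tuple[str, str]]:
    """Return (species, gene name) hits; species=None searches every species."""
    targets = [species] if species else sorted(GENES)
    needle = term.lower()
    return [
        (sp, name)
        for sp in targets
        for name in GENES.get(sp, [])
        if needle in name.lower()
    ]


print(search_genes("mrpl", species="discoideum"))  # species-specific search
print(search_genes("mrpl"))                        # cross-species search
```

Tagging every hit with its species is one way to make a result list readable across species, which the current display does not do.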

Dicty Meeting 14-18 Aug

  1. Registered: Rex, Petra, Sidd, Bob
  2. Gene curation abstract: submitted
  3. SAB meeting: will start towards end of June.
  4. Drink coasters [7] 2500: $378.00 plus shipping. One side designed so far

GO prep for Protein2GO

  1. Sidd is looking into extending our table to accept DOI numbers. I will then fix these annotations, updating our internal refs to DOI numbers; this solution would also work for any new annotations in this category.
  2. Planned release at the end of June, so that dictyBase can accept the file from GOA and the appending of tRNA, trxA/B and some other annotations goes smoothly.

Textpresso GO annotations

  1. WormBase will implement a trial run with 2010 papers; from Kimberly: "This will require us running a search on the dictyBase 2010 corpus, cloning the curation form for dicty, and with you, working out the requirements for exporting a GAF file."
  2. To load papers (in the future): we need to manually load them into our folder, and Arun gets access to regularly upload what's new from there


Stock center strains

List of REMI strains from Christopher Quang Dung Dinh/Adam Kuspa contains links to chromosomal locations (e.g. [8]). Chromosome 2 coordinates are already incorrect due to the shift. External data is out of our control, but it is possible to store this data on our side by creating a new feature representing a single mutation point; this feature would have a location on the chromosome and would be linked to the strain. Plan and estimate:

Write middleware for handling new feature
Figure out data model (5 days)
Figure out software interface (5 days)
Write software adaptor working with two schemas (3 weeks)
Display (gbrowse?, strain page) (2 weeks)
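
As a starting point for the data model discussion, a minimal sketch of a mutation-point feature that carries a chromosome location and a strain link, plus a helper that re-applies a coordinate shift. All names, the strain ID, and the numbers are hypothetical; the real model would live in the two schemas mentioned above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InsertionSite:
    """Single mutation point linked to a strain (hypothetical model)."""
    strain_id: str
    chromosome: str
    position: int  # coordinate on the chromosome


def apply_shift(site: InsertionSite, chromosome: str, start: int, offset: int) -> InsertionSite:
    """Move a site by offset if it lies on chromosome at or after start."""
    if site.chromosome == chromosome and site.position >= start:
        return replace(site, position=site.position + offset)
    return site


site = InsertionSite(strain_id="STRAIN_0001", chromosome="2", position=5000)
shifted = apply_shift(site, chromosome="2", start=1000, offset=250)
print(shifted.position)  # 5250
```

Storing the site on our side means a shift like the chromosome 2 one can be corrected in one place rather than depending on the external list.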