Homologue Data Sources

InterMine comes with several data converters for homologue data, e.g. TreeFam, PANTHER, OrthoDB and HomoloGene. Follow the instructions below to include these datasets in your InterMine.

Identifiers

By default, bio-InterMine puts the MOD identifier (e.g. MGI:XXX or ZDB-GENE-XXX) in the primaryIdentifier field. This is tricky because some homologue sources use Ensembl identifiers instead, and Ensembl identifiers belong in the Gene.crossReferences collection.

To solve this problem, each homologue source uses the NCBI identifier resolver. This resolver takes the Ensembl ID and replaces it with the corresponding MOD identifier.
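To illustrate the kind of mapping involved, here is a minimal sketch that looks up an Ensembl ID in an NCBI gene_info file and prints the other cross-references listed for the same gene. It is not InterMine's resolver code. The function name lookup_xrefs is hypothetical, and the sketch assumes gene_info is tab-separated with pipe-delimited cross-references (such as Ensembl:... and MGI:...) in its sixth column.

```shell
# Sketch only: print the non-Ensembl cross-references that share a
# gene_info row with the given Ensembl gene ID.
# Assumes column 6 (dbXrefs) holds entries separated by "|".
lookup_xrefs() {
    ensembl_id=$1
    gene_info=$2
    awk -F '\t' -v id="Ensembl:$ensembl_id" '
        /^#/ { next }                      # skip the header line
        {
            n = split($6, xrefs, "|")
            found = 0
            for (i = 1; i <= n; i++)
                if (xrefs[i] == id) found = 1
            if (found)
                for (i = 1; i <= n; i++)
                    if (xrefs[i] != id) print xrefs[i]
        }
    ' "$gene_info"
}
```

For example, `lookup_xrefs ENSMUSGXXX /DATA_DIR/ncbi/gene_info` would list any MOD identifiers recorded alongside that Ensembl ID, and print nothing if the ID is absent.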

How to use an ID resolver

Download the identifier file - ftp://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz
Unzip the file to /DATA_DIR/ncbi/gene_info

Warning

Make sure the file's permissions allow the build process to read it.

Download the identifier file for humans - http://www.flymine.org/download/idresolver/humangene - to another directory, e.g. /DATA_DIR/human/identifiers
Create a subdirectory, /DATA_DIR/idresolver/, to serve as the file root path, and add symbolic links to the two files.

$ cd /DATA_DIR/idresolver/
$ ln -s /DATA_DIR/ncbi/gene_info entrez
$ ln -s /DATA_DIR/human/identifiers humangene

Add the root path to your properties file, ~/.intermine/MINE.properties

resolver.file.rootpath=/DATA_DIR/idresolver/
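The linking steps above can be collected into a small shell function. This is a sketch under stated assumptions, not part of InterMine: the name setup_idresolver is hypothetical, and it assumes both identifier files have already been downloaded to the locations used in the commands above. It recreates the two links, checks that each is readable (the permissions issue mentioned in the warning), and prints the property line to add.

```shell
# Sketch only: create the resolver root under a data directory, link the
# two identifier files into it, and print the matching property line.
setup_idresolver() {
    data_dir=$1
    root="$data_dir/idresolver"

    mkdir -p "$root" || return 1

    # Remove stale links first so the function can be re-run safely.
    rm -f "$root/entrez" "$root/humangene"
    ln -s "$data_dir/ncbi/gene_info" "$root/entrez" || return 1
    ln -s "$data_dir/human/identifiers" "$root/humangene" || return 1

    # The build process must be able to read both link targets.
    for f in "$root/entrez" "$root/humangene"; do
        if [ ! -r "$f" ]; then
            echo "not readable: $f" >&2
            return 1
        fi
    done

    echo "resolver.file.rootpath=$root/"
}
```

Running `setup_idresolver /DATA_DIR` prints `resolver.file.rootpath=/DATA_DIR/idresolver/` on success, and returns a non-zero status if either file is missing or unreadable.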

See Id Resolvers for details on how ID resolvers work in InterMine.

Warning

The entrez identifiers file appears to contain only the sequence identifier for worm genes, not the WBgene identifier.

Alternatively, you can load identifier sources.

Here are the download scripts we use at InterMine:

Data Download

We use WormMart but are happy to hear of a better source for worm identifiers.

Here are the project XML entries used by FlyMine:

FlyMine Project XML
