The DataDownloader system uses a plugin architecture that makes it straightforward to download data from arbitrary sources and to add new sources to the system.
The system is a package located in our scripts repo here: https://github.com/intermine/intermine-scripts/tree/master/bio/DataDownloader
The package contains:
If you are using Ubuntu (tested on 12.10), you can install the required Perl packages with the following command:
$ sudo apt-get install libpath-class-perl libmoosex-types-path-class-perl liblog-handler-perl liblog-report-perl libdatetime-perl libmoosex-followpbp-perl libyaml-perl libmodule-find-perl libperlio-gzip-perl libouch-perl libnumber-format-perl
The remaining Perl modules need to be installed via CPAN:
$ cpan
cpan[1]> install MooseX::ABC
cpan[2]> install MooseX::FileAttribute
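Before running the downloader, it can help to confirm that the CPAN modules are visible to Perl. The sketch below assumes `perl` is on the `PATH`; the function name `check_perl_modules` is illustrative and not part of DataDownloader. It tries to load each module and reports whether that worked.

```shell
# Report, for each named Perl module, whether `perl` can load it.
# check_perl_modules is a hypothetical helper, not part of DataDownloader.
check_perl_modules() {
    for module in "$@"; do
        # `perl -MSome::Module -e 1` exits non-zero if the module cannot be loaded.
        if perl -M"$module" -e 1 2>/dev/null; then
            echo "$module: installed"
        else
            echo "$module: missing"
        fi
    done
}

check_perl_modules MooseX::ABC MooseX::FileAttribute
```

Any module reported as missing can then be installed from the `cpan` shell as shown above.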
To learn how to configure the data sources for your mine, see the examples in:
DataDownloader/config
The data download script reads its instructions from the YAML file for your mine.
To run a set of data downloads, the following call should suffice:
perl DataDownloader/bin/download_data -e intermine
The script does not depend on the current working directory.
Specific sources can be run by naming them on the command line:
perl DataDownloader/bin/download_data -e intermine Uniprot GOAnnotation
Source names are case-sensitive. You can get a list of the available sources with the switch ‘--sources’.
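When several sources are requested, it can be useful to run them one at a time so that a failure in one is reported without hiding the others. The following is a minimal sketch, not part of DataDownloader: `run_sources` and the `DOWNLOAD_CMD` variable are hypothetical names, and it assumes that `download_data` exits with a non-zero status when a download fails, which should be checked against the script itself.

```shell
# Command used to start a download; override DOWNLOAD_CMD to test the wrapper.
DOWNLOAD_CMD="${DOWNLOAD_CMD:-perl DataDownloader/bin/download_data}"

# run_sources MINE SOURCE... (hypothetical helper)
# Runs each source separately and returns non-zero if any of them failed.
# Assumes the downloader signals failure through its exit status.
run_sources() {
    mine="$1"
    shift
    failed=0
    for src in "$@"; do
        # DOWNLOAD_CMD is left unquoted on purpose so it splits into command and arguments.
        if $DOWNLOAD_CMD -e "$mine" "$src"; then
            echo "ok: $src"
        else
            echo "FAILED: $src"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `run_sources intermine Uniprot GOAnnotation` would run the two sources named above in turn and print one status line for each.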
A source is a class in the ‘DataDownloader::Source’ package that implements the following method:
And accepts the following arguments in its constructor:
A template for creating a source is available in the form of an abstract class that all sources are expected to inherit from. This class, DataDownloader::Source::ABC, makes it simple to add straightforward source downloaders, and provides helpers that make it convenient to add complex ones.
For a minimal source, see bio/scripts/DataDownloader/lib/DataDownloader/Source/FlyAnatomyOntology.pm:
package DataDownloader::Source::FlyAnatomyOntology;
use Moose;
extends 'DataDownloader::Source::ABC';
use constant {
    TITLE => 'Fly Anatomy Ontology',
    DESCRIPTION => "Drosophila Anatomy ontology from FlyBase",
    SOURCE_LINK => "http://www.flybase.net/",
    SOURCE_DIR => 'ontologies/fly-anatomy',
    SOURCES => [{
        FILE => 'fly_anatomy.obo',
        SERVER => 'http://obo.cvs.sourceforge.net/*checkout*/obo/obo/ontology/anatomy/gross_anatomy/animal_gross_anatomy/fly',
    }],
};
1;
This source fully inherits the behaviour of the ‘DataDownloader::Source::ABC’ abstract class, and only adds configuration. In this case, it defines a set of constants that describe this source:
And some constants that define the data to fetch:
Each entry in SOURCES is a hash reference with the following keys:
Further keys that can be defined include: