
Types of data loaded

genes, proteins, GO annotation, protein domains, publications, UniProt features, comments, synonyms, cross references, EC numbers, components

How to download the data

This source loads data from the UniProt website here:

The UniProt source expects the data files to be in a special format:


To download a single taxon, you can use this URL:

parameter   value
taxonomy    taxon ID, e.g. 9606 for human
reviewed    yes for Swiss-Prot, no for TrEMBL
compress    if yes, the download is zipped

How to load the data into your mine


Gene identifier fields

You can specify which gene fields are assigned when UniProt data is loaded. An example entry:

10116.uniqueField = primaryIdentifier
10116.primaryIdentifier.dbref = RGD
10116.secondaryIdentifier.dbref = Ensembl
10116.symbol.name = primary

The format for the file is:


An example

A rat UniProt entry:

The second line of that configuration would set the ID value as the gene.primaryIdentifier:

<dbReference type="RGD" id="619834" key="33">
        <property type="gene designation" value="Acf"/>
</dbReference>

The third line would set this ID value as gene.secondaryIdentifier:

<dbReference type="Ensembl" id="ENSRNOG00000033195" key="30">
        <property type="organism name" value="Rattus norvegicus"/>
</dbReference>

The last line would set the value between the <name/> tags as gene.symbol:

        <name type="primary">A1cf</name>
        <name type="synonym">Acf</name>
        <name type="synonym">Asp</name>

The value can be primary, ORF or ordered locus, matching the type attribute of the <name/> tag.
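The mapping described in this example can be illustrated with a short Python sketch. It is not the loader's actual implementation: the function names are hypothetical, the `10116.symbol.name` key is assumed from the description of the last configuration line, and the XML fragment is a simplified, namespace-free stand-in for a real UniProt entry.

```python
import xml.etree.ElementTree as ET

# Configuration lines as in the example above (symbol.name key is assumed).
CONFIG = """
10116.uniqueField = primaryIdentifier
10116.primaryIdentifier.dbref = RGD
10116.secondaryIdentifier.dbref = Ensembl
10116.symbol.name = primary
"""

# Simplified entry; real UniProt XML is namespaced and much larger.
ENTRY = """
<entry>
  <gene>
    <name type="primary">A1cf</name>
    <name type="synonym">Acf</name>
    <name type="synonym">Asp</name>
  </gene>
  <dbReference type="RGD" id="619834" key="33">
    <property type="gene designation" value="Acf"/>
  </dbReference>
  <dbReference type="Ensembl" id="ENSRNOG00000033195" key="30">
    <property type="organism name" value="Rattus norvegicus"/>
  </dbReference>
</entry>
"""


def parse_config(text, taxon_id):
    """Return {gene_field: (kind, value)} for one taxon ID."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        parts = key.strip().split(".")
        # Only three-part keys (taxon.field.kind) describe a gene field.
        if len(parts) == 3 and parts[0] == taxon_id:
            rules[parts[1]] = (parts[2], value.strip())
    return rules


def gene_fields(entry_xml, rules):
    """Pick the value for each configured gene field from the entry."""
    root = ET.fromstring(entry_xml)
    fields = {}
    for field, (kind, value) in rules.items():
        if kind == "dbref":
            node = root.find(f"dbReference[@type='{value}']")
            if node is not None:
                fields[field] = node.get("id")
        elif kind == "name":
            node = root.find(f"gene/name[@type='{value}']")
            if node is not None:
                fields[field] = node.text
    return fields


result = gene_fields(ENTRY, parse_config(CONFIG, "10116"))
```

With the rat configuration, `result` holds the RGD ID as primaryIdentifier, the Ensembl ID as secondaryIdentifier and the primary gene name as symbol.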

Protein feature types

You can also configure which protein features to load.

To load specific feature types only, specify them like so:

# in
feature.types = helix

To load NO feature types:

# in
feature.types = NONE

To load ALL feature types, do not specify any: remove the feature.types line from this config file. Loading all feature types is the default behaviour.
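The three cases (specific types, NONE, and no setting at all) can be summarised in a small Python sketch. The function names are hypothetical, and the assumption that several types are separated by whitespace is not confirmed by this page, which only shows a single type.

```python
def parse_feature_types(setting):
    """Interpret a feature.types value.

    Returns None when every feature type should be loaded (the
    setting is absent), an empty set for NONE, and otherwise the
    set of listed type names.
    """
    if setting is None:
        return None  # line missing from the config: load everything
    names = setting.split()  # assumption: whitespace-separated list
    if names == ["NONE"]:
        return frozenset()
    return frozenset(names)


def should_load(feature_type, allowed):
    """Decide whether one feature of the given type is loaded."""
    return allowed is None or feature_type in allowed


only_helix = parse_feature_types("helix")
nothing = parse_feature_types("NONE")
everything = parse_feature_types(None)
```

Representing "load everything" as None keeps the default case distinct from an explicitly empty selection.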


<source name="uniprot" type="uniprot" >
  <property name="uniprot.organisms" value="7227 9606"/>
  <property name="" location="/data/uniprot"/>
  <property name="creatego" value="true"/>
  <property name="creategenes" value="true"/>
  <property name="allowduplicates" value="false"/>
  <property name="loadfragments" value="false"/>
  <property name="loadtrembl" value="true"/>
</source>

property         description                                                                  default
creategenes      if true, process genes                                                       true
creatego         if true, process GO annotation                                               false
allowduplicates  if true, allow proteins with duplicate sequences to be processed             false
loadfragments    if true, load all proteins, even those with isFragment = true                false
loadtrembl       if false, skip TrEMBL data for the given organisms and load Swiss-Prot only  true
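To see which settings are in effect for a given source entry, the defaults from the table can be merged with the properties actually present. The following Python sketch is only an illustration of that reading of the table; `effective_settings` is a hypothetical helper, not part of the build system.

```python
import xml.etree.ElementTree as ET

# Defaults as listed in the table above.
DEFAULTS = {
    "creategenes": True,
    "creatego": False,
    "allowduplicates": False,
    "loadfragments": False,
    "loadtrembl": True,
}


def effective_settings(source_xml):
    """Merge the boolean properties of a <source> element with the defaults."""
    root = ET.fromstring(source_xml)
    settings = dict(DEFAULTS)
    for prop in root.findall("property"):
        name, value = prop.get("name"), prop.get("value")
        # Ignore properties that are not boolean flags from the table.
        if name in settings and value is not None:
            settings[name] = value.strip().lower() == "true"
    return settings


SOURCE = """
<source name="uniprot" type="uniprot">
  <property name="uniprot.organisms" value="7227 9606"/>
  <property name="creatego" value="true"/>
  <property name="loadtrembl" value="false"/>
</source>
"""

settings = effective_settings(SOURCE)
```

Here `settings` enables GO annotation and disables TrEMBL loading, while the three unspecified flags keep their defaults.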


This source loads FASTA data for isoforms, because the UniProt entry does not contain the sequences for isoforms.

<source name="uniprot-fasta" type="fasta">
  <property name="fasta.taxonId" value="7227 9606"/>
  <property name="fasta.className" value=""/>
  <property name="fasta.classAttribute" value="primaryAccession"/>
  <property name="fasta.dataSetTitle" value="UniProt data set"/>
  <property name="fasta.dataSourceName" value="UniProt"/>
  <property name="" location="/data/uniprot/current"/>
  <property name="fasta.includes" value="uniprot_sprot_varsplic.fasta"/>
  <property name="fasta.sequenceType" value="protein" />
  <property name="fasta.loaderClassName" value=""/>
</source>
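As a rough illustration of what an isoform FASTA file provides, the Python sketch below reads accession and sequence pairs from UniProt-style FASTA text. It is not the loader's code: the function name is hypothetical, the two records are invented placeholders, and it assumes the usual `>db|ACCESSION|ENTRY_NAME description` header layout, with the accession matched against primaryAccession.

```python
def read_isoform_sequences(lines):
    """Yield (accession, sequence) pairs from UniProt-style FASTA lines."""
    accession, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if accession is not None:
                yield accession, "".join(chunks)
            # Assumed header layout: >db|ACCESSION|ENTRY_NAME description
            accession = line[1:].split("|")[1]
            chunks = []
        elif line:
            chunks.append(line)
    if accession is not None:
        yield accession, "".join(chunks)


# Invented records for illustration only, not real UniProt data.
SAMPLE = """>sp|P00001-2|EXAMPLE_HUMAN Isoform 2 of Example protein
MKTAYIAKQR
QISFVKSHFS
>sp|P00002-3|OTHER_HUMAN Isoform 3 of Other protein
MADEEKLPPG
"""

records = dict(read_isoform_sequences(SAMPLE.splitlines()))
```

Sequence lines are concatenated per record, so a sequence wrapped over several lines ends up as a single string keyed by its isoform accession.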

UniProt keywords

Loads the names for the UniProt keywords contained in the main UniProt converter.

<source name="uniprot-keywords" type="uniprot-keywords">
  <property name="" location="/data/uniprot/current"/>
  <property name="" value="keywlist.xml"/>
</source>