Data Integration
======================

Data integration works by using keys for each class of object to define equivalence for objects of that class.  For example:

* `primaryIdentifier` is used as a key for `Gene`
* `taxonId` is used as a key for `Organism`

For each `Gene` object loaded, a query is performed in the database to find any existing `Gene` objects with the same `primaryIdentifier`.  If any are found, the fields from both objects are merged and the resulting object is stored.
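
The lookup-and-merge step can be pictured with a minimal Python sketch. It is an illustration only, not the actual implementation: the function name ``integrate``, the use of plain dictionaries for objects, and the rule that a loaded value only fills a field that is still empty are all assumptions made for the example. How conflicting values are resolved is outside the scope of this sketch.

.. code-block:: python

    def integrate(existing, incoming, key_field):
        """Merge incoming objects into existing ones that share a key value.

        Hypothetical helper: objects are plain dicts, and `key_field` is the
        name of the field used as the key (e.g. "primaryIdentifier").
        """
        # Work on copies so the caller's data is left untouched.
        merged = [dict(rec) for rec in existing]
        # Index the stored objects by key value for fast lookup.
        index = {rec[key_field]: rec for rec in merged
                 if rec.get(key_field) is not None}

        for obj in incoming:
            key_value = obj.get(key_field)
            match = index.get(key_value) if key_value is not None else None
            if match is None:
                # No equivalent object found: store the new object as-is.
                new_rec = dict(obj)
                merged.append(new_rec)
                if key_value is not None:
                    index[key_value] = new_rec
            else:
                # Equivalent object found: fill in fields it does not have yet
                # (an assumption of this sketch, not a statement of policy).
                for field, value in obj.items():
                    if match.get(field) is None:
                        match[field] = value
        return merged

Calling ``integrate(stored, loaded, "primaryIdentifier")`` with two lists of gene dictionaries returns one list in which genes sharing a `primaryIdentifier` have been combined into a single record.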

Several performance optimisations are applied to this process.  A query is not actually run for each object loaded: requests are batched, and queries can be avoided completely if the system can work out that no integration will be needed.
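
To illustrate why batching helps, the sketch below looks up equivalents for a whole batch of objects with one query instead of one query per object. Everything in it is hypothetical: ``FakeStore``, ``find_by_keys``, ``find_equivalents_batched`` and the ``batch_size`` parameter are invented for the example and do not describe the real API or batch sizes.

.. code-block:: python

    class FakeStore:
        """Stand-in for a database that counts how many queries it receives."""

        def __init__(self, rows):
            self.rows = rows
            self.query_count = 0

        def find_by_keys(self, key_field, values):
            # One call represents one database query for many key values.
            self.query_count += 1
            wanted = set(values)
            return {row[key_field]: row for row in self.rows
                    if row.get(key_field) in wanted}


    def find_equivalents_batched(store, objects, key_field, batch_size=1000):
        """Find stored equivalents for `objects`, one query per batch."""
        matches = {}
        for start in range(0, len(objects), batch_size):
            batch = objects[start:start + batch_size]
            values = [obj[key_field] for obj in batch
                      if obj.get(key_field) is not None]
            if values:  # nothing to look up means no query at all
                matches.update(store.find_by_keys(key_field, values))
        return matches

With five objects and ``batch_size=2`` this sketch issues three queries rather than five; with realistic batch sizes the saving is much larger.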

We may also load data from another source that provides information about genes but doesn't use the identifier scheme we have chosen for `primaryIdentifier`.  If that source only knows about the `symbol`, we would want it to use the `symbol` to define equivalence for `Gene`.

Important points:

* A `primary key` defines a field or fields of a class that can be used to search for equivalent objects.
* Multiple primary keys can be defined for a class; sources can use different keys for a class if they provide different identifiers.
* One source can use multiple primary keys for a class if the objects of that class don't consistently have the same identifier type.
* `null`: if a source has no value for a field that is defined as a primary key, then the key is not used and the data is loaded without being integrated.
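
The points above can be sketched in Python as follows. This is a conceptual illustration, not the real configuration format: the ``SOURCE_KEYS`` mapping, the source names ``source_a``, ``source_b`` and ``source_c``, and the ``usable_keys`` function are all assumptions made for the example.

.. code-block:: python

    # Hypothetical per-source key configuration. Each key is a tuple of field
    # names, because a key may be made up of one or more fields.
    SOURCE_KEYS = {
        "source_a": {"Gene": [("primaryIdentifier",)]},
        "source_b": {"Gene": [("symbol",)]},
        # One source may use several keys for the same class.
        "source_c": {"Gene": [("primaryIdentifier",), ("symbol",)]},
    }


    def usable_keys(source, class_name, obj):
        """Return the keys this source can use to integrate `obj`.

        A key is skipped when any of its fields has no value. If no key is
        usable, the object is loaded without being integrated.
        """
        keys = SOURCE_KEYS.get(source, {}).get(class_name, [])
        return [key for key in keys
                if all(obj.get(field) is not None for field in key)]

For a gene from ``source_c`` that has a `symbol` but no `primaryIdentifier`, only the `symbol` key is returned; for a gene from ``source_b`` with no `symbol`, the result is an empty list and no integration takes place.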

See :doc:`/database/database-building/primary-keys` for more information.

.. index:: data integration