Enrichment widgets are located on the list analysis page. There are a number of different types of enrichment widgets, but all list a term, a count and an associated p-value. The term can be something like a publication name or a GO term. The count is the number of times that term appears for objects in your list. The p-value is the probability that the result occurs by chance, so a lower p-value indicates greater enrichment.

The p-value is calculated using the Hypergeometric distribution. Four numbers are used to calculate each p-value:

```
    (M choose k) (N-M choose n-k)
P = -----------------------------
             (N choose n)
```

- n: the number of objects in your list
- N: the number of objects in the reference population
- k: the number of objects annotated with this item in your list
- M: the number of objects annotated with this item in the reference population
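The formula above can be evaluated directly with binomial coefficients. The following is a minimal sketch using only the Python standard library; the function names are illustrative and are not part of InterMine. The second function sums the probabilities for k or more annotated objects, which is a common convention in enrichment testing but is an assumption here, since the formula above gives only the point probability.

```python
from math import comb


def hypergeom_pmf(k, n, M, N):
    """Probability of exactly k annotated objects in a list of n objects,
    drawn from a reference population of N objects of which M are annotated."""
    # Outside this range the event is impossible.
    if k < max(0, n - (N - M)) or k > min(n, M):
        return 0.0
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


def hypergeom_upper_tail(k, n, M, N):
    """Probability of k or more annotated objects (assumed convention,
    not stated in the text above)."""
    return sum(hypergeom_pmf(i, n, M, N) for i in range(k, min(n, M) + 1))


# Hypothetical numbers: 10 objects in the reference population, 4 of them
# annotated with the term, and a list of 3 objects of which 2 are annotated.
p_exact = hypergeom_pmf(k=2, n=3, M=4, N=10)
p_tail = hypergeom_upper_tail(k=2, n=3, M=4, N=10)
```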

Apache library - Hypergeometric Distribution

When multiple tests (statistical inferences) are run in parallel, the probability of false positive (Type I) errors increases. To address this issue, many multiple test corrections have been developed to take into account the number of tests being carried out and to correct the p-values accordingly. Enrichment widgets have three different multiple test corrections: Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg.

In enrichment widgets the number of “tests run” is the number of terms associated with objects in the “reference list”. Please note that in earlier versions of InterMine (0.95 and below) the number of “tests run” was the number of terms associated with objects in the “query list”. This change has made the multiple test correction more rigorous, and will reduce the occurrence of spuriously low p-values.
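As an illustration of how the number of tests could be derived, the sketch below counts the distinct terms attached to objects in a reference list. The `annotations` mapping and the function name are hypothetical and only serve the example; they do not reflect InterMine's internal data model.

```python
def count_tests(annotations):
    """Count the distinct terms associated with any object in the reference list.

    `annotations` is a hypothetical mapping of object identifier -> terms.
    """
    return len({term for terms in annotations.values() for term in terms})


# Hypothetical reference list of three genes.
reference = {
    "gene1": ["GO:0000001", "GO:0000002"],
    "gene2": ["GO:0000002"],
    "gene3": [],
}
number_of_tests = count_tests(reference)
```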

Each enrichment widget has four test correction options:

None: no test correction is performed; these are the raw results. These p-values will be lower (more significant) than if a test correction were applied.

Bonferroni is the simplest and most conservative method of multiple test correction. The number of tests run (the number of terms associated with objects in the reference list) is multiplied by the un-corrected p-value of each term to give the corrected p-value.
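A minimal sketch of the Bonferroni adjustment described above. Capping the result at 1 is an assumption made here so that the output remains a valid probability; the text does not say whether the widgets do this. The function name and the example values are illustrative.

```python
def bonferroni(p_values, num_tests):
    """Multiply each raw p-value by the number of tests.

    `p_values` maps term -> raw p-value. The cap at 1.0 is an assumption.
    """
    return {term: min(1.0, p * num_tests) for term, p in p_values.items()}


# Hypothetical raw p-values for two terms, with 10 tests run.
adjusted = bonferroni({"term A": 0.01, "term B": 0.2}, num_tests=10)
```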

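Holm-Bonferroni is a step-down variant of Bonferroni: the p-values are ranked from smallest to largest, and each one is multiplied by the number of tests reduced by its rank.

```
Adjusted p-value = p-value x (number of tests - rank)
```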
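The rank-based formula above corresponds to the Holm-Bonferroni step-down procedure. The sketch below uses the textbook form, in which the smallest p-value has rank 1 and the multiplier is (number of tests - rank + 1); this equals the formula above if ranks are counted from zero, which is an assumption. Taking a running maximum to keep the adjusted values in order, and capping them at 1, are standard parts of the procedure that the text does not mention.

```python
def holm_bonferroni(p_values, num_tests):
    """Holm-Bonferroni adjustment; `p_values` maps term -> raw p-value.

    Assumes num_tests is at least the number of p-values supplied.
    """
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    adjusted = {}
    running_max = 0.0
    for rank, (term, p) in enumerate(ordered, start=1):
        value = min(1.0, p * (num_tests - rank + 1))
        # An adjusted p-value never drops below that of a smaller raw p-value.
        running_max = max(running_max, value)
        adjusted[term] = running_max
    return adjusted


# Hypothetical raw p-values for three terms, with 3 tests run.
holm = holm_bonferroni({"a": 0.01, "b": 0.04, "c": 0.03}, num_tests=3)
```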

Benjamini-Hochberg: this correction is less stringent than Bonferroni, and therefore tolerates more false positives.

```
Adjusted p-value = p-value x (number of tests/rank)
```

- The p-values of the terms are ranked from smallest to largest.
- The p-value is multiplied by the total number of tests divided by its rank.
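The two steps above can be sketched as follows. Working from the largest p-value downwards and keeping a running minimum is the usual way to keep Benjamini-Hochberg adjusted values in order (and, starting from 1, bounded by 1); that step is an assumption, as the text only describes the ranking and the multiplication.

```python
def benjamini_hochberg(p_values, num_tests):
    """Benjamini-Hochberg adjustment; `p_values` maps term -> raw p-value."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    adjusted = {}
    running_min = 1.0
    # Walk from the largest p-value (highest rank) down to the smallest.
    for rank in range(len(ordered), 0, -1):
        term, p = ordered[rank - 1]
        running_min = min(running_min, p * num_tests / rank)
        adjusted[term] = running_min
    return adjusted


# Hypothetical raw p-values for three terms, with 3 tests run.
bh = benjamini_hochberg({"a": 0.01, "b": 0.04, "c": 0.03}, num_tests=3)
```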

The probability of a given set of genes being hit in a ChIP experiment is, amongst other things, proportional to their length: very long genes are much more likely to be randomly hit than very short genes. This is an issue for some widgets. For example, if a given GO term (such as gene expression regulation) is associated with very long genes in general, these genes will be much more likely to be hit in a ChIP experiment than genes belonging to a GO term with very short genes on average. The p-values should be scaled to take this into account. There are a number of different implementations of such corrections; we have chosen the simplest one. The algorithm was developed by Taher and Ovcharenko (2009) for correcting GO enrichment. The corrected probability of observing a given GO term is equal to the original GO probability times the correction coefficient CCGO defined for each GO term.

```
Adjusted P = P x CCGO
```

where the correction coefficient CCGO is calculated as:

```
       LGO/LWG
CCGO = -------
       NGO/NWG
```

- LGO: average gene length of genes associated with a GO term
- LWG: average length of the genes in the whole genome
- NGO: number of genes in the genome associated with this GO term
- NWG: total number of genes in the whole genome
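The correction coefficient is a ratio of two ratios, so it can be computed in one line. This sketch follows the formula exactly as given above; the function names and the example numbers are hypothetical and are not taken from InterMine or from any real genome.

```python
def correction_coefficient(l_go, l_wg, n_go, n_wg):
    """CCGO = (LGO / LWG) / (NGO / NWG), as defined above."""
    return (l_go / l_wg) / (n_go / n_wg)


def length_corrected_p(p, cc_go):
    """Adjusted P = P x CCGO."""
    return p * cc_go


# Hypothetical values: genes for this GO term average 3000 bp against a
# genome-wide average of 2000 bp; 100 of 20000 genes carry the term.
cc_go = correction_coefficient(l_go=3000, l_wg=2000, n_go=100, n_wg=20000)
adjusted_p = length_corrected_p(1e-6, cc_go)
```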

Note

The relevant InterMine source.

The reference population is by default the collection of **all the genes with
annotation** for the given organism. This can be changed to any available
list of genes.

Beissbarth T, Speed TP.

Bioinformatics. 6.2004; 20(9): 1464-1465.

PubMed id: 14962934

Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G.

Bioinformatics. 2004 Dec 12;20(18):3710-5. Epub 2004 Aug 5.

PubMed id: 15297299

Benjamini, Yoav; Hochberg, Yosef

Journal of the Royal Statistical Society. 1995, Series B (Methodological) 57 (1): 289–300.

van der Laan, Mark J.; Dudoit, Sandrine; and Pollard, Katherine S.

Statistical Applications in Genetics and Molecular Biology: Vol. 3 : Iss. 1, Article 15, 2004.

Taher, L. and Ovcharenko, I. (2009), Bioinformatics <http://bioinformatics.oxfordjournals.org/content/25/5/578> Vol. 25 : Iss. 5: 578–584.

Note

You can read more about **Hypergeometric Distribution** at Simple Interactive Statistical Analysis or Wolfram MathWorld. **Bonferroni Correction** is discussed in this Wolfram MathWorld article.