# Documentation

## Illustrated user guide

Download as PDF |
---|
guide.pdf |

## Data sources

### Biological categories

Database | Version | Retrieval date |
---|---|---|
Gene Ontology | - | June 2019 |
KEGG | - | June 2019 |
miRBase | 22 | June 2019 |
miRCarta | 1.1 | June 2019 |
Reactome | - | June 2019 |
WikiPathways | - | June 2019 |

### miRNA targets

Database | Version | Retrieval date |
---|---|---|
miRanda | 3.3a | June 2019 |
miRTarBase | 7 | June 2019 |
TargetScan | 7.1 | June 2019 |

## Statistical analysis

All compute-intensive tasks were performed using the GeneTrail2 C++ library [1] and GNU Parallel [2]. The results of the enrichment analysis were evaluated using the freely available statistical programming environment R, version 3.5.

### Parameter overview

Parameter | Value |
---|---|
Statistical test | Over-representation analysis |
P-value adjustment | Benjamini-Hochberg |
$\alpha$-level | 0.05 |
Minimal category size | 2 |
Maximal category size | 1000 |

### Over-representation analysis

To assess whether a biological category is significantly enriched for a given miRNA, we use a test called *over-representation analysis (ORA)*.
This approach has been employed by many authors, e.g. [3], [4], [5], [6], [7].
Here we use the version of ORA that was presented by Backes et al. [3].
It is based on the hypergeometric distribution and tests whether a set of selected biological entities is significantly over- or under-represented in a biological category compared with what is expected by chance.

We use ORA to assess whether a biological pathway contains more targets of a given miRNA than expected by chance. To quantify this expectation, ORA relies on a reference set R (the background). In our case, R is the list of all miRNA targets at the corresponding confidence level.

Assume a biological category C has k entries in the test set $T = (t_{1},t_{2},\ldots,t_{n})$ and l entries in the reference set $R=(r_{1},r_{2},\ldots,r_{m})$. Based on this information, we expect to find $k'=\frac{n \cdot l}{m}$ elements of the test set T in category C on average.
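To make these quantities concrete, the following minimal Python sketch computes the expected count $k'$ and the upper-tail hypergeometric probability of observing at least k test-set entries in C, which is the p-value that ORA assigns to the category. It uses only the standard library. The function names and the toy numbers are illustrative assumptions; the analysis itself was run with the GeneTrail2 C++ library, not with this code.

```python
from math import comb


def expected_hits(n: int, l: int, m: int) -> float:
    """Expected number of test-set entries in the category: k' = n * l / m."""
    return n * l / m


def ora_upper_tail(k: int, n: int, l: int, m: int) -> float:
    """P(X >= k) for a hypergeometric X.

    m: size of the reference set R
    l: entries of R that belong to category C
    n: size of the test set T (assumed to be a subset of R)
    k: entries of T that belong to category C
    """
    # Terms with i > l vanish because comb(l, i) is 0, so stop at min(n, l).
    upper = min(n, l)
    favourable = sum(comb(l, i) * comb(m - l, n - i) for i in range(k, upper + 1))
    return favourable / comb(m, n)


# Toy numbers (hypothetical): 20 reference targets, 5 of them in C,
# a test set of 4 targets, 2 of which fall into C.
print(expected_hits(n=4, l=5, m=20))                 # 1.0
print(round(ora_upper_tail(k=2, n=4, l=5, m=20), 4))  # 0.2487
```

In this toy case, two hits are observed where one is expected, but the tail probability of about 0.25 is far above the $\alpha$-level of 0.05, so the category would not be reported as enriched.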

If T is a subset of R, the hypergeometric test is applied to compute a p-value for C:

$$P_C(k)=\sum\limits_{i=k}^{n} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}}$$

### Benjamini-Hochberg adjustment

The Benjamini-Hochberg method [8], [9] is a step-up approach to control the false discovery rate. It assumes that all p-values are independent. Given $n$ p-values sorted in increasing order, $\{p_1,\ldots,p_n\}$, we can compute the adjusted p-values using the following formula:

$$\tilde p_{i}\ =\ \begin{cases} p_{i} & \text{for } i=n\\ \min \left( \tilde p_{i+1}, \frac{n}{i}p_{i} \right) & \text{for } i=n-1,\ldots,1 \end{cases}$$

### Bibliography

1. Multi-omics Enrichment Analysis using the GeneTrail2 Web Service. Bioinformatics, Oxford University Press.
2. GNU Parallel - The Command-Line Power Tool. ;login: The USENIX Magazine.
3. GeneTrail—advanced gene set enrichment analysis. Nucleic Acids Research, Oxford University Press.
4. Global functional profiling of gene expression. Genomics, Elsevier.
5. Identifying biological themes within lists of genes with EASE. Genome Biology.
6. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, Oxford University Press.
7. GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics, BioMed Central Ltd.
8. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), JSTOR.
9. More powerful procedures for multiple significance testing. Statistics in Medicine, Wiley Online Library.