Documentation

Illustrated user guide

Download as PDF
guide.pdf

Data sources

Biological categories

Database Version Retrieval date
Biocarta - 11. Jan. 2016
Cytogenetic bands 38 11. Jan. 2016
Gene Ontology - 11. Jan. 2016
KEGG - 11. Jan. 2016
miRBase 21 11. Jan. 2016
NCI Pathway interaction database (PID) - 11. Jan. 2016
Pfam - 11. Jan. 2016
Reactome - 11. Jan. 2016
SMPDB - 11. Jan. 2016
WikiPathways - 28. Feb. 2016

miRNA targets

Database Version Retrieval date
DIANA microT-CDS 5 09. Sep. 2016
miRDB 5.0 09. Sep. 2016
miRTarBase 6 11. Jan. 2016
TargetScan 7.1 09. Sep. 2016

Statistical analysis

All compute-intensive tasks were performed using the GeneTrail2 C++ library [1] and GNU Parallel [2]. The results of the enrichment analysis were evaluated using the freely available statistical programming environment R, version 3.0.2.

Parameter overview

Statistical test Over-representation analysis
P-value adjustment Benjamini-Hochberg
$\alpha$-level 0.05
Minimal category size 2
Maximal category size 1000
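
As a rough illustration of how the size limits above could be applied before testing, the following Python sketch keeps only categories within the size bounds. It is not taken from the GeneTrail2 code base: the function name `filter_categories`, the constant names, and the assumptions that size means the number of category members and that both bounds are inclusive are all hypothetical.

```python
# Hypothetical constants mirroring the parameter overview above.
ALPHA = 0.05               # significance level for adjusted p-values
MIN_CATEGORY_SIZE = 2      # assumed to be an inclusive lower bound
MAX_CATEGORY_SIZE = 1000   # assumed to be an inclusive upper bound


def filter_categories(categories):
    """Keep categories whose member count lies within the size bounds.

    `categories` maps a category name to a collection of member identifiers.
    """
    return {
        name: members
        for name, members in categories.items()
        if MIN_CATEGORY_SIZE <= len(members) <= MAX_CATEGORY_SIZE
    }
```

Under these assumptions, a category with a single member or with more than 1000 members would be dropped before any test is computed.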

Over-representation analysis

In order to judge if a certain biological category is significantly enriched for a certain miRNA, we use a test called over-representation analysis (ORA). This approach has been employed by many authors, e.g. [3], [4], [5], [6], [7]. Here we use the version of ORA that was presented by Backes et al. [3]. This approach is based on the hypergeometric distribution and can be used to test if a set of selected biological entities is significantly more or less present in a biological category than expected by chance.

We use ORA to judge if a biological pathway contains more targets of a certain miRNA than expected by chance. In order to calculate this chance, ORA relies on a reference set R (background). In our case this is a list of all miRNA targets for the corresponding confidence.

Assume a biological category $C$ has $k$ entries in the test set $T = (t_{1},t_{2},\ldots,t_{n})$ and $l$ entries in the reference set $R=(r_{1},r_{2},\ldots,r_{m})$. Based on this information, we expect to find $k'=\frac{n \cdot l}{m}$ elements of $T$ in category $C$ on average.
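
As a purely illustrative example with made-up numbers (not taken from any of the databases listed above), suppose a miRNA has $n = 200$ targets, the reference set contains $m = 10000$ targets, and $l = 100$ of the reference targets belong to the category. The expected number of targets in the category is then

$$k' = \frac{n \cdot l}{m} = \frac{200 \cdot 100}{10000} = 2$$

Observing, say, $k = 7$ targets in the category would exceed $k'$, so the category would be tested for enrichment rather than depletion.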

If T is a subset of R, the hypergeometric test is applied to compute a p-value for C:

$$P_C(k)=\begin{cases} \sum\limits_{i=k}^{n} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}} ,& \text{if }k' < k\\ \sum\limits_{i=0}^{k} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}} ,& \text{if }k'\ge k \end{cases}$$
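
The case distinction above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the GeneTrail2 implementation; the function name `ora_p_value` and its argument order are assumptions made for this example. The symbols `k`, `n`, `l`, and `m` have the same meaning as in the formula.

```python
from math import comb


def ora_p_value(k: int, n: int, l: int, m: int) -> float:
    """One-sided hypergeometric p-value for a category.

    k: category members found in the test set T
    n: size of the test set T
    l: category members found in the reference set R
    m: size of the reference set R (T is assumed to be a subset of R)
    """
    if not (0 <= n <= m and 0 <= l <= m and 0 <= k <= min(n, l)):
        raise ValueError("inconsistent counts for k, n, l, m")

    expected = n * l / m  # k' in the text
    if expected < k:
        terms = range(k, n + 1)   # upper tail: more hits than expected
    else:
        terms = range(0, k + 1)   # lower tail: at most as many hits as expected

    # Sum the numerators exactly as integers, then divide once.
    # math.comb returns 0 whenever the lower index exceeds the upper one.
    numerator = sum(comb(l, i) * comb(m - l, n - i) for i in terms)
    return numerator / comb(m, n)


# Small example: 2 of 3 test entries fall into a category that covers
# 4 of 10 reference entries (expected 1.2, so the upper tail is used).
print(ora_p_value(k=2, n=3, l=4, m=10))  # 40/120, about 0.333
```

Summing integer binomial coefficients before dividing avoids accumulating rounding errors, at the cost of working with large integers for big reference sets.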

Benjamini-Hochberg adjustment

The Benjamini-Hochberg method [8], [9] is a step-up approach to control the false discovery rate. It assumes all p-values to be independent. Given $n$ p-values sorted in increasing order, $\{p_1,...,p_n\}$, we can compute the adjusted p-values using the following formula:

$$\tilde p_{i}\ =\ \begin{cases} p_{i} & \text{for } i=n\\ \min \left( \tilde p_{i+1}, \frac{n}{i}p_{i} \right) & \text{for }i=n-1 ,...,1 \end{cases}$$
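
The step-up recursion can be sketched in Python as follows. This is an illustrative implementation, not the code used to produce the results; the function name `benjamini_hochberg` is an assumption. Unlike the formula, it accepts p-values in any order and returns the adjusted values in the input order.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted in increasing order.
    order = sorted(range(n), key=lambda idx: p_values[idx])

    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value (rank n) down to the smallest (rank 1),
    # carrying the minimum of the scaled values seen so far.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, n / rank * p_values[idx])
        adjusted[idx] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

For the four p-values in the example, the adjusted values are approximately 0.02, 0.04, 0.04, and 0.02. A category would then be reported as significant if its adjusted p-value is below the chosen $\alpha$-level.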

Bibliography

  1. Stöckel, Daniel and Kehl, Tim and Trampert, Patrick and Schneider, Lara and Backes, Christina and Ludwig, Nicole and Gerasch, Andreas and Kaufmann, Michael and Gessler, Manfred and Graf, Norbert and Meese, Eckart and Keller, Andreas and Lenhof, Hans-Peter. Multi-omics Enrichment Analysis using the GeneTrail2 Web Service. Bioinformatics, Oxford University Press.
  2. Tange, O. GNU Parallel - The Command-Line Power Tool. ;login: The USENIX Magazine.
  3. Backes, Christina and Keller, Andreas and Kuentzer, Jan and Kneissl, Benny and Comtesse, Nicole and Elnakady, Yasser A and Müller, Rolf and Meese, Eckart and Lenhof, Hans-Peter. GeneTrail—advanced gene set enrichment analysis. Nucleic Acids Research, Oxford Univ Press.
  4. Draghici, Sorin and Khatri, Purvesh and Martins, Rui P. and Ostermeier, G. Charles and Krawetz, Stephen A. Global functional profiling of gene expression. Genomics, Elsevier.
  5. Hosack, Douglas A and Dennis Jr, Glynn and Sherman, Brad T and Lane, H Clifford and Lempicki, Richard A and others. Identifying biological themes within lists of genes with EASE. Genome Biol.
  6. Khatri, Purvesh and Draghici, Sorin. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, Oxford Univ Press.
  7. Zhang, Bing and Schmoyer, Denise and Kirov, Stefan and Snoddy, Jay. GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics, BioMed Central Ltd.
  8. Benjamini, Yoav and Hochberg, Yosef. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), JSTOR.
  9. Hochberg, Yosef and Benjamini, Yoav. More powerful procedures for multiple significance testing. Statistics in Medicine, Wiley Online Library.