---
title: "Data Visualizations"
author: "Author: Daniela Cassol (danielac@ucr.edu) and Thomas Girke (thomas.girke@ucr.edu)"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
output:
BiocStyle::html_document:
toc_float: true
code_folding: show
BiocStyle::pdf_document: default
package: systemPipeTools
vignette: |
%\VignetteEncoding{UTF-8}
%\VignetteIndexEntry{systemPipeTools: Data Visualizations}
%\VignetteEngine{knitr::rmarkdown}
fontsize: 14pt
bibliography: bibtex.bib
type: docs
weight: 1
---
```{css, echo=FALSE, eval=TRUE}
pre code {
white-space: pre !important;
overflow-x: scroll !important;
word-break: keep-all !important;
word-wrap: initial !important;
}
```
```{r style, echo = FALSE, results = 'asis', eval=TRUE}
BiocStyle::markdown()
options(width = 80, max.print = 1000)
knitr::opts_chunk$set(
eval = as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache = as.logical(Sys.getenv("KNITR_CACHE", "TRUE")),
tidy.opts = list(width.cutoff = 80), tidy = TRUE
)
```
```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE, eval=TRUE}
suppressPackageStartupMessages({
library(systemPipeTools)
library(systemPipeR)
})
```
# Data Visualization with `systemPipeR`
*systemPipeTools* package extends the widely used *[systemPipeR](https://systempipe.org/)* (SPR) [@H_Backman2016-bt]
workflow environment with enhanced toolkit for data visualization,
including utilities to automate the analysis of differentially expressed genes (DEGs).
*systemPipeTools* provides functions for data transformation and data exploration via
scatterplots, hierarchical clustering heatMaps, principal component analysis,
multidimensional scaling, generalized principal components, t-Distributed
Stochastic Neighbor embedding (t-SNE), and MA and volcano plots.
All these utilities can be integrated with the modular design of the *systemPipeR*
environment that allows users to easily substitute any of these features and/or
custom with alternatives.
## Metadata and Reads Counting Information
The first step is importing the `targets` file and raw reads counting table.
- The `targets` file defines all FASTQ files and sample comparisons of the analysis workflow.
- The raw reads counting table represents all the reads that map to gene (row) for each sample (columns).
```{r targets_counts, eval=TRUE}
## Targets file
targetspath <- system.file("extdata", "targets.txt", package = "systemPipeR")
targets <- read.delim(targetspath, comment = "#")
cmp <- systemPipeR::readComp(file = targetspath, format = "matrix", delim = "-")
## Count table file
countMatrixPath <- system.file("extdata", "countDFeByg.xls", package = "systemPipeR")
countMatrix <- read.delim(countMatrixPath, row.names = 1)
showDT(countMatrix)
```
## Data Transformation
For gene differential expression, raw counts are required, however for data
visualization or clustering, it can be useful to work with transformed count data.
`exploreDDS` function is convenience wrapper to transform raw read counts using the
[`DESeq2`](@Love2014-sh) package transformations methods. The input file
has to contain all the genes, not just differentially expressed ones. Supported
methods include variance stabilizing transformation (`vst`) (@Anders2010-tp), and
regularized-logarithm transformation or `rlog` (@Love2014-sh).
```{r exploreDDS, eval=TRUE, warning=FALSE}
exploredds <- exploreDDS(countMatrix, targets, cmp = cmp[[1]], preFilter = NULL,
transformationMethod = "rlog")
exploredds
```
Users are strongly encouraged to consult the [`DESeq2`](@Love2014-sh) vignette for
more detailed information on this topic and how to properly run `DESeq2` on data
sets with more complex experimental designs.
## Scatterplot
To decide which transformation to choose, we can visualize the transformation effect
comparing two samples or a grid of all samples, as follows:
```{r exploreDDSplot, eval=TRUE, warning=FALSE}
exploreDDSplot(countMatrix, targets, cmp = cmp[[1]], preFilter = NULL,
samples = c("M12A", "M12A", "A12A", "A12A"), scattermatrix = TRUE)
```
The scatterplots are created using the log2 transform normalized reads count,
variance stabilizing transformation (VST) (@Anders2010-tp), and
regularized-logarithm transformation or `rlog` (@Love2014-sh).
## Hierarchical Clustering Dendrogram
The following computes the sample-wise correlation coefficients using the `stats::cor()`
function from the transformed expression values. After transformation to a
distance matrix, hierarchical clustering is performed with the `stats::hclust`
function and the result is plotted as a dendrogram, as follows:
```{r hclustplot, eval=TRUE}
hclustplot(exploredds, method = "spearman")
```
The function provides the utility to save the plot automatically.
## Hierarchical Clustering HeatMap
This function performs hierarchical clustering on the transformed expression
matrix generated within the `DESeq2` package. It uses, by default, a `Pearson`
correlation-based distance measure and complete linkage for cluster join.
If `samples` selected in the `clust` argument, it will be applied the `stats::dist()`
function to the transformed count matrix to get sample-to-sample distances. Also,
it is possible to generate the `pheatmap` or `plotly` plot format.
```{r heatMaplot_samples, eval=TRUE}
## Samples plot
heatMaplot(exploredds, clust = "samples", plotly = TRUE)
```
If `ind` selected in the `clust` argument, it is necessary to provide the list of
differentially expressed genes for the `exploredds` subset.
```{r heatMaplot_DEG, eval=TRUE, warning=FALSE}
## Individuals genes identified in DEG analysis
### DEG analysis with `systemPipeR`
degseqDF <- systemPipeR::run_DESeq2(countDF = countMatrix, targets = targets,
cmp = cmp[[1]], independent = FALSE)
DEG_list <- systemPipeR::filterDEGs(degDF = degseqDF,
filter = c(Fold = 2, FDR = 10))
### Plot
heatMaplot(exploredds, clust = "ind",
DEGlist = unique(as.character(unlist(DEG_list[[1]]))))
```
The function provides the utility to save the plot automatically.
## Principal Component Analysis
This function plots a Principal Component Analysis (PCA) from transformed expression
matrix. This plot shows samples variation based on the expression values and
identifies batch effects.
```{r PCAplot, eval=TRUE}
PCAplot(exploredds, plotly = FALSE)
```
The function provides the utility to save the plot automatically.
## Multidimensional scaling with `MDSplot`
This function computes and plots multidimensional scaling analysis for dimension
reduction of count expression matrix. Internally, it is applied the `stats::dist()`
function to the transformed count matrix to get sample-to-sample distances.
```{r MDSplot, eval=TRUE}
MDSplot(exploredds, plotly = FALSE)
```
The function provides the utility to save the plot automatically.
## Dimension Reduction with `GLMplot`
This function computes and plots generalized principal components analysis for
dimension reduction of count expression matrix.
```{r GLMplot, eval=TRUE, warning=FALSE}
exploredds_r <- exploreDDS(countMatrix, targets, cmp = cmp[[1]],
preFilter = NULL, transformationMethod = "raw")
GLMplot(exploredds_r, plotly = FALSE)
```
The function provides the utility to save the plot automatically.
## MA plot
This function plots log2 fold changes (y-axis) versus the mean of normalized
counts (on the x-axis). Statistically significant features are colored.
```{r MAplot, eval=TRUE}
MAplot(degseqDF, comparison = "M12-A12", filter = c(Fold = 1, FDR = 20),
genes = "ATCG00280")
```
The function provides the utility to save the plot automatically.
## t-Distributed Stochastic Neighbor embedding with `tSNEplot`
This function computes and plots t-Distributed Stochastic Neighbor embedding (t-SNE)
analysis for unsupervised nonlinear dimensionality reduction of count expression
matrix. Internally, it is applied the `Rtsne::Rtsne()` [@Rtsne] function, using the exact
t-SNE computing with `theta=0.0`.
```{r tSNEplot, eval=TRUE}
tSNEplot(countMatrix, targets, perplexity = 5)
```
## Volcano plot
A simple function that shows statistical significance (`p-value`) versus magnitude
of change (`log2 fold change`).
```{r volcanoplot, eval=TRUE}
volcanoplot(degseqDF, comparison = "M12-A12", filter = c(Fold = 1, FDR = 20),
genes = "ATCG00280")
```
# Version information
```{r sessionInfo, eval=TRUE}
sessionInfo()
```
# Funding
This project is funded by NSF award [ABI-1661152](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1661152).
# References