Asuragen Glossary of Terms


MicroRNA (miRNA)


miRNA Biology

Transcription
miRNAs are initially expressed as part of transcripts termed primary miRNAs (pri-miRNAs) (Lee 2002). They are apparently transcribed by RNA Polymerase II, and include 5' caps and 3' poly(A) tails (Smalheiser 2003, Cai 2004). The miRNA portion of the pri-miRNA transcript likely forms a hairpin with signals for dsRNA-specific nuclease cleavage.


Hairpin release in the nucleus
The dsRNA-specific ribonuclease Drosha digests the pri-miRNA in the nucleus to release the hairpin precursor miRNA (pre-miRNA) (Lee 2003). Pre-miRNAs appear to be approximately 70 nt RNAs with 1–4 nt 3' overhangs, 25–30 bp stems, and relatively small loops. Drosha also generates either the 5' or 3' end of the mature miRNA, depending on which strand of the pre-miRNA is selected by RISC (Lee 2003, Yi 2003).


Export to the cytoplasm
Exportin-5 (Exp5) seems to be responsible for the export of pre-miRNAs from the nucleus to the cytoplasm. Exp5 has been shown to bind directly and specifically to correctly processed pre-miRNAs. It is required for miRNA biogenesis, with a probable role in coordination of nuclear and cytoplasmic processing steps (Lund 2003, Yi 2003).


Dicer processing
Dicer is a member of the RNase III superfamily of bidentate nucleases that has been implicated in RNA interference in nematodes, insects, and plants. Once in the cytoplasm, Dicer cleaves the pre-miRNA approximately 19 bp from the Drosha cut site (Lee 2003, Yi 2003). The resulting double-stranded RNA has 1–4 nt 3' overhangs at either end (Lund 2003). Only one of the two strands is the mature miRNA; some mature miRNAs derive from the leading strand of the pri-miRNA transcript, and with other miRNAs the lagging strand is the mature miRNA.


Strand selection by RISC
To control the translation of target mRNAs, the double-stranded RNA produced by Dicer must separate into single strands, and the single-stranded mature miRNA must associate with the RISC (Hutvagner 2002). Selection of the active strand from the dsRNA appears to be based primarily on the stability of the two ends of the dsRNA (Schwarz 2003, Khvorova 2003). The strand with lower stability base pairing of the 2–4 nt at the 5' end of the duplex preferentially associates with RISC and thus becomes the active miRNA (Schwarz 2003).


miRNA Regulation of Translation
Virtually all of the miRNAs that have been studied in animals reduce steady state protein levels for the targeted gene(s) without impacting the corresponding levels of mRNA (Olsen 1999). The mechanism by which miRNAs reduce protein levels is not fully understood, but one study involving the C. elegans lin-4 miRNA/lin-14 mRNA pair indicates that lin-4 miRNA does not affect the poly(A) tail length, transport to the cytoplasm, or entry into polysomes of the lin-14 mRNA (Olsen 1999). If this observation holds true for all animal miRNAs, then downstream steps such as translational elongation, translational termination, or protein stability are likely influenced by miRNAs. Mounting evidence suggests that miRNAs function via an enzyme complex similar to the one used by siRNAs.


miRNA Data Analysis


Deliverables

Project Description:
The Project Description, which includes the Asuragen RNA description (RNA_desc), hybridization ID (hyb_ID), and the experimental parameters for your samples, is provided. It also provides a key that can be used to associate your sample names with the Asuragen ID for all data, results, and figure files.


BioArray QC:
Information on the sample background, signal intensities, and threshold values for each array is reported. The percentage of miRNAs determined to be present in a given sample is also provided.


Raw Signal:
The “raw signal” for each miRNA is obtained by subtracting the maximum of the local background and negative control signals from the foreground signal, averaged across the two technical replicate spots for each miRNA on the array.
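
The calculation above can be sketched roughly as follows. This is only an illustration: the order of operations (subtract per spot, then average), the use of a single negative control value per array, and all function and parameter names are assumptions, not the actual pipeline.

```python
def raw_signal(foreground, local_background, negative_control):
    """Background-subtracted signal averaged over replicate spots.

    foreground, local_background: one value per technical replicate spot.
    negative_control: negative control signal for the array (assumed scalar).
    """
    corrected = [
        # Subtract whichever background estimate is larger for this spot.
        fg - max(bg, negative_control)
        for fg, bg in zip(foreground, local_background)
    ]
    return sum(corrected) / len(corrected)


# Two replicate spots for one hypothetical miRNA.
signal = raw_signal([1200.0, 1100.0], [150.0, 210.0], 200.0)
print(signal)
```

In this example the first spot is corrected by the negative control (200) and the second by its local background (210), because each is the larger of the two estimates for that spot.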


Array Normalized Signal:
The median-scaled, log2-transformed intensities for each miRNA on the array. The median of the present signals for each array (spots above the threshold value) is used for the scaling. Blank cells correspond to miRNAs that were below the signal threshold.
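
One way to picture this step is the sketch below. It assumes that "median-scaled" means dividing each present signal by the median of the present signals before taking log2; the exact scaling used in the deliverable is not specified here, and the function name and miRNA labels are hypothetical.

```python
import math
from statistics import median


def array_normalize(raw, threshold):
    """Median-scale and log2-transform the present signals on one array.

    raw: dict mapping miRNA name to raw signal.
    Signals at or below the threshold are returned as None (a blank cell).
    """
    present = [value for value in raw.values() if value > threshold]
    scale = median(present)  # median of the present signals only
    return {
        name: (math.log2(value / scale) if value > threshold else None)
        for name, value in raw.items()
    }


normalized = array_normalize(
    {"miR-a": 400.0, "miR-b": 100.0, "miR-c": 1600.0, "miR-d": 20.0},
    threshold=50.0,
)
print(normalized)
```

Under this assumption a miRNA at the array median gets a value of 0, and each unit corresponds to a two-fold difference from the median.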


Global Normalized Signal:
Global normalization is generated by computing the Variance Stabilization Normalization (VSN; Huber et al., 2002) of all the arrays within the project. These numbers provide the basis of further figures and analysis. Please note that the values are in a generalized logarithm base 2 (glog2). To convert to a generalized fold-change, differences in glog2 values should be exponentiated base 2.
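
As a small illustration of the conversion described above, the difference between two glog2 values can be exponentiated base 2 to give a generalized fold-change. The function name and the example values are hypothetical.

```python
def generalized_fold_change(glog2_a, glog2_b):
    """Convert a difference in glog2 values to a generalized fold-change."""
    return 2.0 ** (glog2_a - glog2_b)


# A difference of 2 glog2 units corresponds to a four-fold change.
up = generalized_fold_change(9.5, 7.5)
down = generalized_fold_change(7.5, 9.5)
print(up, down)
```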


DEM (Differentially Expressed microRNA):
The miRNAs for which significant differences in expression are identified between groups. Significance is defined by statistical analysis (ANOVA or t-test), with the false discovery rate set to 0.05. The mean values, differences in expression in glog2 scale, p-values with significance flags, and miRNA annotations for a complete pair-wise comparison of those miRNAs are reported.
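
The text above does not say which false discovery rate procedure is applied. The sketch below assumes the widely used Benjamini-Hochberg step-up procedure, purely to illustrate how p-values can be turned into significance flags at an FDR of 0.05. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag p-values as significant at the given false discovery rate.

    Assumes the Benjamini-Hochberg procedure: find the largest rank k whose
    sorted p-value satisfies p(k) <= (k / m) * fdr, then flag ranks 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    flags = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            flags[index] = True
    return flags


flags = benjamini_hochberg([0.001, 0.02, 0.045, 0.30, 0.008])
print(flags)
```

Note that 0.045 is below 0.05 but is not flagged here, because the FDR threshold tightens as more tests are considered.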


Statistics:
The VSN-transformed expression values (glog2) within the project are summarized. Also reported are the mean, maximum, and minimum expression intensities of the samples within each experimental group, along with the percentage of present calls for all miRNAs in the experiment.


mRNA


mRNA Normalization/Summarization

MAS 5.0
The Affymetrix MAS 5.0 Algorithm calculates the signal value from the combined, background-adjusted PM and MM values of the probes in one probe set.


The process is outlined as follows:

  • Cell intensities are preprocessed for global background
  • An ideal mismatch value is calculated and subtracted to adjust the PM intensity
  • The adjusted PM intensities are log-transformed to stabilize the variance
  • Tukey’s biweight estimator is used to provide a robust mean of the resulting values
  • Signal is output as the antilog of the resulting value
  • Finally, the signal is scaled using a trimmed mean

The MAS 5.0 algorithm is applied on a chip-by-chip basis and is not applied across an entire set of chips.
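
The middle steps of the outline above (log transformation, robust averaging, antilog) can be sketched as follows. This is a simplified illustration, not the Affymetrix implementation: the ideal mismatch values are taken as given, background adjustment and the final trimmed-mean scaling are omitted, and the tuning constants (c = 5, epsilon = 0.0001, a floor of 2^-20) and function names are assumptions.

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a robust mean that down-weights outliers."""
    center = median(values)
    mad = median([abs(v - center) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - center) / (c * mad + epsilon)
        # Values far from the center (|u| >= 1) receive zero weight.
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def probe_set_signal(pm, ideal_mm, floor=2.0 ** -20):
    """Signal for one probe set from PM and (precomputed) ideal mismatch values."""
    log_values = [math.log2(max(p - m, floor)) for p, m in zip(pm, ideal_mm)]
    return 2.0 ** tukey_biweight(log_values)  # antilog of the robust mean


robust = tukey_biweight([10.0, 10.1, 9.9, 10.0, 20.0])
signal = probe_set_signal([1100.0] * 4, [76.0] * 4)
print(robust, signal)
```

In the first example the outlying value of 20 receives zero weight, so the robust mean stays close to 10, whereas the ordinary mean would be 12.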


RMA
Robust Multi-array Average (RMA), proposed by Irizarry et al. (2003), adjusts gene expression values obtained from hybridization of Affymetrix® GeneChip® arrays. The method fits a robust linear model to the probe-level data, analyzing each hybridized chip in the context of other chips in the experiment. The algorithm consists of three steps: a model-based background correction stage that neutralizes the effects of background noise, a subsequent quantile normalization stage that aligns expression values to a common distribution, and finally, an iterative median polishing procedure that summarizes the data and generates a single expression value for each probe set.
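
The second and third steps described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the published implementation: the model-based background correction is omitted, ties are not handled specially in the quantile step, and the function names and input values are hypothetical.

```python
from statistics import median


def quantile_normalize(arrays):
    """Give every array the same distribution (the mean of the sorted values).

    arrays: list of equal-length lists, one list of intensities per chip.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = reference[rank]  # replace value by reference quantile
        result.append(normalized)
    return result


def median_polish(matrix, iterations=10):
    """Summarize one probe set: rows are probes, columns are chips (log2 scale).

    Fits value = overall + probe effect + chip effect by alternately sweeping
    out row and column medians, and returns one expression value per chip.
    """
    resid = [row[:] for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(iterations):
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return [overall + c for c in col_eff]


normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
chip_values = median_polish([[7.0, 9.0, 11.0], [8.0, 10.0, 12.0], [9.0, 11.0, 13.0]])
print(normalized, chip_values)
```

After quantile normalization both example chips contain the same set of values (1.5, 3.5, 5.5), each assigned according to the rank of the original intensity. The median polish example is perfectly additive, so the chip-level values are recovered exactly.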


GC-RMA
GC-RMA is a modification of the RMA algorithm. It replaces the model used in the background correction stage with a more sophisticated computation that uses each probe’s sequence information to adjust the measured intensity for the effects of non-specific binding, which arise from the differences in bond strength between the two types of base pairs. It also takes into account the optical noise present in data acquisition for greater accuracy and sensitivity. The two steps of the RMA algorithm following background correction, namely the global, cross-chip normalization and summarization through median polishing, remain unchanged.


PLIER
The PLIER method improves expression estimates by accounting for experimentally observed patterns in probe behavior and by handling error appropriately at low and high signal values.


Benefits include:

  • Higher reproducibility of signal (lower coefficient of variation) without loss of accuracy
  • Higher differential sensitivity for low expressors, specifically below 2 picomolar concentration
  • Dynamic estimation of most informative probes to determine signal

This method was developed by building upon many of the concepts that have been published recently in the field of GeneChip data analysis, including model-based expression analysis and robust multichip analysis. It also builds upon the signal algorithm provided in MAS 5.0 by taking into account experimental data in weighting probes to determine the overall probe set signal. Like other model-based approaches, PLIER accounts for the difference between probes by means of a parameter called probe affinity. (Probe affinity represents the strength of signal produced at a specific concentration for a given probe.) PLIER estimates the signal for the entire probe set more accurately by utilizing these inherent probe affinities and empirical probe performance, and by handling error appropriately across low and high concentrations. Probe affinities are calculated using experimental data across multiple arrays. PLIER also utilizes an error model that assumes error depends on the probe, rather than on the signal alone.


All of the methods listed above seek to normalize the signal values across arrays.

Bioinformatics: Clustering/Classification


Clustering

Clustering is a method that partitions genes or samples into groups such that units within a cluster are more similar to each other than they are to units in other clusters. Clustering is a useful exploratory technique for gene-expression data when patterns of gene expression are expected but their exact nature is unknown, as it groups similar objects together and allows the biologist to identify potentially meaningful relationships between the objects (either genes or experiments or both). This differs from classification, where the identity (or the within-group pattern) of the groups is known beforehand.


The following clustering methods can be employed:

Hierarchical Clustering
Hierarchical clustering creates a hierarchical tree of similarities between the samples, called a dendrogram, which is often displayed alongside a heatmap. The most common implementation is agglomerative hierarchical clustering, which starts with one cluster per sample and iteratively merges the closest clusters, based on some distance measure, until only one cluster is left. Array quality can be roughly assessed using hierarchical clustering. Ideally, similar samples (for example, replicates) should cluster together.
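
The agglomerative procedure described above can be sketched as follows. The choice of Euclidean distance and average linkage is an assumption made for illustration (other distance measures and linkage rules are common), and the sample names and values are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(samples):
    """Merge clusters pairwise until one is left.

    samples: dict mapping sample name to its expression profile.
    Returns the merges in order as (cluster_a, cluster_b, distance); read from
    first to last, they describe the dendrogram from the leaves to the root.
    """
    clusters = {name: [name] for name in samples}  # cluster label -> members
    merges = []
    while len(clusters) > 1:
        best = None
        labels = list(clusters)
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                a, b = labels[i], labels[j]
                # Average linkage: mean distance over all cross-cluster pairs.
                dists = [
                    euclidean(samples[x], samples[y])
                    for x in clusters[a]
                    for y in clusters[b]
                ]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters["(" + a + "," + b + ")"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges


merges = agglomerative({
    "A1": [1.0, 1.0],
    "A2": [1.0, 2.0],
    "B1": [8.0, 8.0],
    "B2": [8.0, 8.5],
})
for step in merges:
    print(step)
```

In this example the two "B" samples merge first, then the two "A" samples, and the two groups are joined only at the final step, which is the pattern expected when replicates are of good quality.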


K-means Clustering
The K-means method is known as a partitional method. The user specifies the number of clusters (K) in advance; the algorithm then iteratively assigns each item to the nearest cluster center and recomputes the centers until the assignments no longer change.
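
A minimal sketch of this iteration is shown below. It assumes squared Euclidean distance and, to keep the example deterministic, uses the first K points as starting centers; practical implementations usually choose the starting centers randomly and repeat the run several times. The function name and data are hypothetical.

```python
def k_means(points, k, max_iter=100):
    """Partition points into k clusters by alternating two steps.

    Returns (assignment, centers), where assignment[i] is the cluster index
    of points[i].
    """
    centers = [list(p) for p in points[:k]]  # simplistic starting centers
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest center.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so a solution has been found
        assignment = new_assignment
        # Step 2: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


points = [[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [8.0, 9.0], [2.0, 1.0], [9.0, 8.0]]
labels, centers = k_means(points, k=2)
print(labels, centers)
```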


Principal Components Analysis (PCA)
PCA is designed to capture the variance in a dataset in terms of principal components, reducing the dimensionality of the data from many thousands to only a handful of the most informative components.
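
To make the idea concrete, the sketch below finds only the first principal component of a tiny dataset using power iteration on the covariance matrix. This is an illustrative simplification: real expression datasets are normally analyzed with a singular value decomposition routine from a numerical library, and the function name and data here are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, fraction of variance explained, per-sample scores).

    rows: one list of measurements per sample.
    Assumes the starting vector is not orthogonal to the leading component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        # Repeated multiplication by the covariance matrix converges to the
        # direction of greatest variance.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    total = sum(cov[a][a] for a in range(d))
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, variance / total, scores


# Four samples measured on two variables that move together (y = 2x).
direction, explained, scores = first_principal_component(
    [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
)
print(direction, explained, scores)
```

Because the two variables in this example are perfectly correlated, a single component captures essentially all of the variance, so each sample can be described by one score instead of two measurements.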


Affymetrix File Terms

GCOS: GeneChip Operating Software automates the control of GeneChip Fluidics Stations and Scanners, acquires data, and manages sample and experimental information. It also performs analysis of gene expression files utilizing the Affymetrix Statistical Algorithm. This software generates the files listed below.


Experiment File (*.EXP): This file contains the parameters of the experiment such as Array Type, Experiment Name, Equipment Parameters, Sample Description, and others.


Image Data File (*.DAT): This file is the pixelated image file generated by the scanner from the array after processing on the Fluidics Station.


Cell Intensity File (*.CEL): The cell file contains the processed cell intensities from the primary image in the *.DAT file. Asuragen uses this file for further analysis.


Probe Array Results File (*.CHP): The .chp file is the output file from the GeneChip Operating Software expression analysis of the probe array utilizing the Affymetrix Statistical Algorithm. The .chp file contains the data that can be used for other analyses.


Report File (*.RPT): The report file is generated from the .chp file. This expression report summarizes information about expression analysis settings, probe set hybridization intensity data, and other quality metrics for sample and array performance.


Data Transfer Tool File (*.DTT): This file contains your complete raw data, packaged and ready to be imported into your copy of Affymetrix GCOS software. It consists of compressed files containing any combination of .dat, .cel, and .chp files of a database / project / sample / experiment.


Library Files (.cif, .cdf, .psi): These files are unique to each probe array type and contain information for scanning and analysis parameters, array design, and probe information.