Introduction to Microarray Gene Expression Shyamal D. Peddada

Biostatistics Branch, National Institute of Environmental Health Sciences (NIH), Research Triangle Park, NC

Outline of the four talks
- A general overview of microarray data
  - Some important terminology and background
  - Various platforms
  - Sources of variation
  - Normalization of data
- Analysis of gene expression data - Nominal explanatory variables
  - Two types of explanatory variables
  - Scientific questions of interest
  - A brief discussion on false discovery rate (FDR) analysis
  - Some existing methods of analysis
- Analysis of ordered gene expression data
  - Common experimental designs
  - Some existing statistical methods
  - An example
  - Demonstration of ORIOGEN
  - Some open research problems
- Analysis of data from cell-cycle experiments
  - Some background on cell-cycle experiments
  - Modeling the data
  - Data from multiple experiments
  - Some open research problems

Talk 1: An overview of microarray data
To perform statistical analysis of any given data:
- It is important to understand all sources of (i) bias and (ii) variability.
- Some basic understanding of the underlying technology!
- Understand the sampling/experimental design

Some Important Terminology and Background
- Central Dogma of Molecular Biology

Some background terminology: DNA and RNA
- DNA (deoxyribonucleic acid): contains the genetic code, or instructions, for the development and function of living organisms. It is double stranded.
- Four nucleotides (building blocks of DNA): Adenine (A), Guanine (G), Thymine (T), Cytosine (C)
- Base pairs: (A, T), (G, C)
- E.g.
    5' ---AAATGCAT--- 3'
    3' ---TTTACGTA--- 5'
- RNA (ribonucleic acid): transcribed (or copied) from DNA. It is single stranded (a complementary copy of one of the strands of DNA).
- RNA polymerase: an enzyme that helps in the transcription of DNA to form RNA.
- Four nucleotides (building blocks of RNA): Adenine (A), Guanine (G), Uracil (U), Cytosine (C)
- Base pairs: (A, U), (G, C)

Some background terminology: Types of RNA
- Types of RNA: transfer RNA (tRNA), ribosomal RNA (rRNA), etc.
- mRNA: messenger RNA. Carries information from DNA to the ribosomes, where protein synthesis takes place (less stable than DNA).
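
The base-pairing rules above (A-T and G-C in DNA, A-U and G-C when RNA is copied from DNA) can be illustrated with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the sequence is the 8-base example from the DNA slide.

```python
# Pairing rules from the slides: DNA pairs A-T and G-C; in RNA, U replaces T.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_FROM_TEMPLATE = {"A": "U", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand_5to3):
    """Return the opposite DNA strand, written 3'->5', base by base."""
    return "".join(DNA_PAIR[base] for base in strand_5to3)


def transcribe(template_3to5):
    """Return the mRNA (written 5'->3') copied from a template strand read 3'->5'."""
    return "".join(RNA_FROM_TEMPLATE[base] for base in template_3to5)


top = "AAATGCAT"                 # 5' ---AAATGCAT--- 3'
bottom = complement_strand(top)  # 3' ---TTTACGTA--- 5'
mrna = transcribe(bottom)        # same as the top strand, with U in place of T

print(bottom)
print(mrna)
```

The mRNA is complementary to the template (bottom) strand, so it reads like the top strand with U substituted for T.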

Some background terminology: Oligos
- Oligonucleotide: a short segment of DNA consisting of a few base pairs. It is commonly called an oligo.
- mer: unit of measurement for an oligo. It is the number of base pairs, so a 30 base pair oligo is a 30-mer.

Some background terminology: Probes
- cDNA: complementary DNA. A DNA sequence that is complementary to the given mRNA, obtained using an enzyme called reverse transcriptase.
- Probe: a short segment of DNA (about 100-mer or longer) used to detect DNA or RNA that complements the sequence present in the probe.

Some background terminology: Blots - Origins of Microarrays
- Southern blot (Edwin Southern, 1975, J. Molec. Biol.): a method used to identify the presence of a DNA sequence in a sample of DNA.
- Western blot (immunoblot): used to identify a specific protein from a tissue extract.
- Southwestern blot: used to identify and characterize DNA-binding proteins.
- Northern blot: a method used to study gene expression from a sample of mRNA.

Microarrays

Northern blot vs Microarray

|                             | Microarray                                                                       | Northern blot       |
|-----------------------------|----------------------------------------------------------------------------------|---------------------|
| Rate of expression analysis | Thousands of genes at a time (high throughput)                                   | Few genes at a time |
| Automation                  | Automation possible                                                              | Manual              |
| Scope                       | Allows exploring relationships among several hundreds of genes at the same time | Limited             |

What is a Microarray?

- Sequences from thousands of different genes are immobilized, or attached, at fixed locations.
- Spotted, or actually synthesized directly, onto the support.

Microarray Technology
- Two color dye array (spotted array)
  - Spotted cDNA microarrays
  - Spotted oligo microarrays
- Single dye array
  - In situ oligo microarrays

Microarray Technology: Spotted Microarrays

Spotted DNA Microarray
A spotted DNA array is typically home made, so you need to think about:
- cDNA or oligo
- Location of the oligo in a given gene
- Oligo length: number of bp?

Spotted DNA Microarray: gene expression
$Y = \log_2(\mathrm{Red}/\mathrm{Green})$
- Y < 0: the gene is over-expressed in the green-labeled sample compared to the red-labeled sample
- Y = 0: the gene is equally expressed in both samples
- Y > 0: the gene is over-expressed in the red-labeled sample compared to the green-labeled sample

Single Dye Microarrays

Major Commercial Platforms
- More than 50 companies are currently offering various DNA microarray platforms, reagents and software
- Affymetrix dominated the market for many years

| Manufacturer       | Code | Protocol             | Platform                                    | # of Probes |
|--------------------|------|----------------------|---------------------------------------------|-------------|
| Applied Biosystems | ABI  | One-color microarray | Human Genome Survey Microarray v2.0         | 32878       |
| Affymetrix         | AFX  | One-color microarray | HG-U133 Plus 2.0 GeneChip                   | 54675       |
| Agilent*           | AG1  | One-color microarray | Whole Human Genome Oligo Microarray, G4112A | 43931       |
| Eppendorf          | EPP  | One-color microarray | DualChip Microarray                         | 294         |
| GE Healthcare      | GEH  | One-color microarray | CodeLink Human Whole Genome, 300026         | 54359       |
| Illumina           | ILM  | One-color microarray | Human-6 BeadChip, 48K v1.0                  | 47293       |

*Agilent has one- and two-color microarray platforms.

Affymetrix GeneChip
- Each gene is represented by 11 to 20 oligos of 25-mers
- Probe: an oligo of 25-mer
- Probe pair: a PM and MM pair
- Perfect match (PM): a 25-mer complementary to a reference sequence of interest (part of the gene)
- Mismatch (MM): same as PM, with a single base change at the middle (13th) base (G <-> C, A <-> T)
- Probe set: a collection of probe pairs (11 to 20) related to a fraction of a gene

Affymetrix call for the presence of a signal
- The Affymetrix detection algorithm uses probe pair intensities to obtain a detection p-value
- Using this p-value, they decide whether the signal is present, marginal or absent

Affy call: detection p-value
- Calculate the discrimination score T for each probe pair: T = (PM - MM) / (PM + MM)
- Determine the statistical significance of the gene by computing the p-value.
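
A minimal sketch of how such a call could be computed from the probe-pair scores follows. The slides do not say which test produces the p-value or which cutoffs are used, so this sketch assumes a one-sided Wilcoxon signed-rank test of the scores against a small threshold `tau`, with cutoffs `alpha1` and `alpha2` for the present/marginal/absent decision. The values 0.015, 0.04 and 0.06, the function names and the intensities are illustrative assumptions, not taken from the slides.

```python
from itertools import product


def discrimination_scores(pm, mm):
    """Score (PM - MM) / (PM + MM) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def signed_rank_pvalue(diffs):
    """Exact one-sided Wilcoxon signed-rank p-value for H1: median(diffs) > 0.

    Enumerates all 2**n sign patterns, so it is only meant for the small
    number of probe pairs in one probe set (it becomes slow near n = 20).
    """
    diffs = [d for d in diffs if d != 0]  # zeros carry no sign information
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:  # assign average ranks to tied absolute values
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    as_extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= observed
    )
    return as_extreme / 2 ** n


def detection_call(pm, mm, tau=0.015, alpha1=0.04, alpha2=0.06):
    """Return (call, p_value); tau, alpha1 and alpha2 are assumed values."""
    scores = discrimination_scores(pm, mm)
    p_value = signed_rank_pvalue([t - tau for t in scores])
    if p_value < alpha1:
        call = "Present"
    elif p_value < alpha2:
        call = "Marginal"
    else:
        call = "Absent"
    return call, p_value


# Made-up intensities for a probe set with 11 probe pairs.
pm = [1200, 980, 1500, 1100, 1340, 900, 1250, 1420, 1010, 1180, 1300]
mm = [300, 310, 280, 350, 290, 330, 310, 300, 320, 340, 295]

expressed = detection_call(pm, mm)  # PM well above MM in every pair
swapped = detection_call(mm, pm)    # roles reversed: PM below MM in every pair
print(expressed, swapped)
```

When every PM exceeds its MM, only one of the 2^11 sign patterns is as extreme as the observed one, so the p-value is 1/2048 and the call is "Present"; with the roles reversed the p-value is 1 and the call is "Absent".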

Ref: Affymetrix Technical Manual

Affymetrix vs Illumina
Ref: Pan Du & Simon Lin

Microarray Data Analysis

Why Normalize Data?
- To calibrate or adjust the data so as to reduce or eliminate effects that arise from variation in technology and other sources rather than from true biological differences between test groups.

Sources of bias/variation
- Tissue or cell lines
- mRNA
  - It can degrade over time, so there is a potential batch effect if portions of the experiment are performed at different times
  - Purity and quantity
- Dye color effect (spotted arrays)
- Variation due to technology (substantially reduced with improved technology)
- Etc.

A useful graphical representation of data
- Data matrix: $X_{m \times n}$, with $\mathrm{Rank}(X) = r \le \min(m, n) \le n$, where $m$ is the number of genes and $n$ is the number of samples.
- Let $S$ be the $m \times m$ sample covariance matrix.
- Let its spectral decomposition be given by $S = \Gamma \Lambda \Gamma'$, where
  - $\Gamma$: $m \times r$ matrix of eigenvectors
  - $\Lambda$: $r \times r$ diagonal matrix of non-zero eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0$.
- Then $Z = \Gamma' X$ is the $r \times n$ matrix of "eigengenes", and $Z_i = \gamma_i' X$ is the $i$th eigengene.
- Plot $Z_1$ vs $Z_2$.

Common Normalization Methods
- Internal Control Normalization
- Global Normalization
- Linear Normalization (spotted arrays)
- Non-linear Normalization Method (spotted arrays): LOWESS curve
- ANOVA
- ComBat (for batch effect)

Internal control normalization (housekeeping gene(s))
- Expression of each gene is measured relative to the average of the housekeeping genes.
- Basic assumption: expression of the housekeeping genes does not change.
- Disadvantages:
  - Housekeeping genes may be highly expressed sometimes.
  - Unexpected regulation of housekeeping gene(s) leads to misinterpretation.
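
As a rough illustration of internal control normalization, the sketch below expresses every gene on one array relative to the average of the housekeeping genes. It assumes the comparison is made on the log2 scale (so "relative to the average" becomes a subtraction of the mean log2 value); the slides do not specify the scale. The gene names and intensities are made up for illustration.

```python
import math


def housekeeping_normalize(log2_expr, housekeeping_ids):
    """Subtract the mean log2 expression of the housekeeping genes from every gene."""
    baseline = sum(log2_expr[g] for g in housekeeping_ids) / len(housekeeping_ids)
    return {gene: value - baseline for gene, value in log2_expr.items()}


# Illustrative raw intensities for one array (not real data).
raw = {"GAPDH": 1024, "ACTB": 4096, "geneX": 8192, "geneY": 512}
log2_expr = {gene: math.log2(value) for gene, value in raw.items()}

normalized = housekeeping_normalize(log2_expr, ["GAPDH", "ACTB"])
print(normalized)
```

Here the housekeeping baseline is (10 + 12) / 2 = 11 on the log2 scale, so geneX (log2 value 13) ends up at +2 and geneY (log2 value 9) at -2. If the housekeeping genes themselves respond to treatment, every normalized value shifts with them, which is the misinterpretation risk noted above.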

Global Normalization
- Basic assumption: the mean/median expression ratio of all monitored mRNAs is constant across a chip.
- Regression of $\log(R/G)$ on a constant.
- In simple terms, the log ratios are corrected by a common mean or median.
- This method can also be applied to single dye data.

Linear Normalization (for spotted arrays)
- Basic assumption: the mean/median expression ratio of all monitored mRNAs depends upon the average intensity.
- Regression of $\log(R/G)$ on $\tfrac{1}{2}\log(RG)$.

Non-Linear Normalization (for spotted arrays)
- Basic assumption: the mean/median expression ratio of all monitored mRNAs depends upon the average intensity.
- Regression of $\log(R/G)$ on $C(\log(RG))$, where $C(\log(RG))$ is estimated by the robust scatter plot smoother LOWESS (LOcally WEighted Scatterplot Smoothing).

Analysis of Variance (ANOVA)

- Standard Analysis of Variance model
- Response variable: gene expression
- Explanatory variables:
  - Dye color
  - Batch
  - Other potential effects?
- Advantage: statistically significant genes can be identified while controlling for the various experimental conditions/factors.

Some important experimental designs
Pooled samples versus separate samples
- Sometimes there may not be sufficient biological sample/specimen from a given animal. In such cases, biological samples are pooled from several identical animals to form a sample.

An example of a pooling design (for each treatment group)
[Figure: Subjects -> Pool -> Observations (microarray chips)]

The pooling design

|                | Subjects | # Pools | Observations (microarray chips) | Subjects per pool |
|----------------|----------|---------|---------------------------------|-------------------|
| Example        | 9        | 3       | 6                               | 3                 |
| More generally | n        | p       | m                               | r = n/p           |

The standard design

|                | Subjects | # Pools | Observations (microarray chips) | Subjects per pool |
|----------------|----------|---------|---------------------------------|-------------------|
| Example        | 9        | 9       | 9                               | r = 1             |
| More generally | n        | p = n   | m = n                           | r = 1             |

Some issues

- What are the underlying parameters?
- Effect of pooling on power.
- The basic assumption.
- Validity of the assumption.

Parameters
- Total variation in the expression of a gene can be decomposed into:
  - Biological variation
  - Technical variation
- Biological samples (n)
- Number of pools (p)
- Biological samples per pool (r = n/p)
- Observed number of samples (e.g. microarrays) (m)

Some comments about pooling
- Variance of the estimated mean expression of a gene depends on:
  - number of pools (p)
  - number of biological samples per pool (r)
  - number of arrays (m)
  - biological variation
  - technical variation
- Pooling works well when the biological variation in the gene expression is substantially larger than the technical variation.

Power comparisons
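
Because power is driven by the variance of the estimated mean, it may help to see how that variance depends on n, p, r and m. The sketch below is an illustration under stated assumptions, not a formula taken from the slides: it assumes independent subjects, equal pool sizes, arrays split evenly among the pools, and that a pool's expression is the average of its subjects' expression. Under those assumptions the variance works out to var_bio/n + var_tech/m. The variance values 4 and 1 are illustrative.

```python
def variance_of_mean(n_subjects, n_pools, n_arrays, var_bio, var_tech):
    """Variance of the estimated mean expression of one gene in one group.

    Assumes equal pool sizes, arrays split evenly among pools, and that a pool's
    expression is the average of the expression of the subjects in it.
    """
    subjects_per_pool = n_subjects / n_pools   # r = n / p
    arrays_per_pool = n_arrays / n_pools       # m / p
    # One pool: averaged biology plus averaged technical noise.
    var_one_pool = var_bio / subjects_per_pool + var_tech / arrays_per_pool
    # The group mean averages over the p independent pools.
    return var_one_pool / n_pools


# Illustrative variances: biological variance 4, technical variance 1.
standard = variance_of_mean(5, 5, 5, var_bio=4.0, var_tech=1.0)   # 5 subjects, 5 arrays
pooled = variance_of_mean(10, 5, 5, var_bio=4.0, var_tech=1.0)    # 10 subjects, 5 pools, 5 arrays
print(standard, pooled)
```

With the same five arrays, pooling ten subjects instead of using five lowers the variance from 1.0 to 0.6 in this example. The gain comes entirely from the biological term, which is why pooling helps most when biological variation dominates technical variation.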

| # Bio    | # Micro | Pool size                | Power |
|----------|---------|--------------------------|-------|
| 5/group  | 5/group | 1 (standard design)      | 0.81  |
| 6/group  | 6/group | 1 (standard design)      | 0.95  |
| 6/group  | 3/group | 2 (i.e. 3 pools/group)   | 0.30  |
| 8/group  | 4/group | 2 (i.e. 4 pools/group)   | 0.80  |
| 10/group | 5/group | 2 (i.e. 5 pools/group)   | 0.98  |

- Zhang and Gant (2005)

Power comparisons: conditions of the simulation study
- Biological variation is 4 times the technical variation.
- False positive rate is 0.001.
- Detect 2-fold expression.
- Data are normally distributed.

A fundamental assumption
- Biological averaging: suppose an experiment consists of pooling r samples. Then the expression of a gene in the pooled sample is assumed to be the average of the gene's expression in the r samples.
- This assumption need not be true, especially if the expression values are transformed non-linearly.

Some important experimental designs

- Reference designs (spotted array): each treatment sample is hybridized against a common reference control.
- Loop designs (spotted array): suppose we have a control and three experimental groups A, B and C. Then hybridize Control with A, A with B, B with C and C with A.

Data Analysis - Preliminaries
- Normalization
- Transformation of data (usual methods)
  - Perhaps first fit ANOVA and plot the residuals
  - Log transformation
  - Square root
  - More generally, Box-Cox family of transformations
- Identify potential outliers in the data (again, perhaps use the residuals)

Data Analysis
- The method of analysis depends upon the scientific question of interest.
- In the next three lectures we describe several general methods and illustrate some using real data!
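
The transformations listed under the preliminaries (log, square root, Box-Cox) belong to one family, which the following sketch makes explicit. It uses the usual one-parameter Box-Cox form, (x^lambda - 1)/lambda for lambda != 0 and log(x) for lambda = 0. The function name and the example values are illustrative, and how lambda should be chosen for a given data set is not covered here.

```python
import math


def box_cox(x, lam):
    """One-parameter Box-Cox transform of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)           # limiting case: log transformation
    return (x ** lam - 1) / lam


identity_like = box_cox(8.0, 1.0)    # lambda = 1: x - 1, a shift of the raw scale
sqrt_like = box_cox(4.0, 0.5)        # lambda = 1/2: 2 * (sqrt(x) - 1)
log_case = box_cox(math.e, 0)        # lambda = 0: natural log
near_log = box_cox(8.0, 1e-6)        # small lambda approaches log(8)

print(identity_like, sqrt_like, log_case, near_log)
```

Up to a shift and a rescaling, lambda = 1/2 corresponds to the square root transformation and lambda = 0 to the log transformation, which is why the two appear as special cases in the list above.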
