Intuitive Introduction to the Important Ideas of Inference

Robin Lock, St. Lawrence University
Patti Frazer Lock, St. Lawrence University
Kari Lock Morgan, Duke / Penn State
Eric F. Lock, Duke / U Minnesota
Dennis F. Lock, Iowa State / Miami Dolphins

ICOTS9, Flagstaff, AZ, July 2014

The Lock5 Team: Robin & Patti (St. Lawrence), Dennis (Iowa State / Miami Dolphins), Kari (Duke / Penn State), Eric (Duke / UMinn)

Outline
Estimating with confidence (Bootstrap)
Understanding p-values (Randomization)
Implementation
  Organization of simulation methods?
  Role for distribution-based methods?
  Textbook/software support?

U.S. Common Core Standards (Grades 9-12)
Statistics: Making Inferences & Justifying Conclusions
HSS-IC.A.1 Understand statistics as a process for making inferences about population parameters based on a random sample from that population.
HSS-IC.A.2 Decide if a specified model is consistent with results from a given data-generating process, e.g., using simulation.
HSS-IC.B.3 Recognize the purposes of and differences among sample surveys, experiments, and observational studies; explain how randomization relates to each.

HSS-IC.B.4 Use data from a sample survey to estimate a population mean or proportion; develop a margin of error through the use of simulation models for random sampling.
HSS-IC.B.5 Use data from a randomized experiment to compare two treatments; use simulations to decide if differences between parameters are significant.

Example #1: Body Temperatures
Sample of body temperatures (in °F) for n = 50 students (Shoemaker, JSE, 1996)
$n = 50$, $\bar{x} = 98.26$, $s = 0.765$
Goal: Find an interval that is likely to contain the mean body temperature for all students.
Key concept: How much should we expect the sample means to vary just by random chance?
Can we estimate this using ONLY data from this sample?

Bootstrapping (Brad Efron, Stanford University)
"Let your data be your guide."
Basic Idea: Create simulated samples, based only on the original sample data, to approximate the sampling distribution and standard error of the statistic.

To create a bootstrap distribution:
Assume the population is many, many copies of the original sample.

Simulate many new samples from the population by sampling with replacement from the original sample.
Compute the sample statistic for each bootstrap sample.

Finding a Bootstrap Sample
[Figure: an original sample (n = 6), a simulated population to sample from, and a bootstrap sample drawn with replacement from the original sample]

[Figure: the original sample of 50 body temperatures ($\bar{x} = 98.26$) beside two bootstrap samples ($\bar{x} = 98.35$ and $\bar{x} = 98.22$)]
Repeat 1,000s of times!

[Diagram: Original Sample → Sample Statistic; many Bootstrap Samples → Bootstrap Statistics → Bootstrap Distribution]
We need technology! StatKey
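The bootstrap recipe above (sample with replacement, compute the statistic, repeat many times) can be sketched in a few lines of Python. This is only an illustration, not how StatKey works: the function name `bootstrap_means`, the seed, and the six temperatures are assumptions chosen for the example, and the six values are not the full n = 50 data set.

```python
import random
import statistics

def bootstrap_means(sample, n_boot=5000, seed=2014):
    """Return the means of n_boot bootstrap samples drawn from `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    means = []
    for _ in range(n_boot):
        # Sample n values WITH replacement from the original sample.
        resample = rng.choices(sample, k=n)
        means.append(statistics.mean(resample))
    return means

# Illustrative values only (assumed for this sketch), not the full data set.
temps = [97.6, 98.9, 98.4, 96.9, 97.7, 98.2]
boot = bootstrap_means(temps)

print("original mean:", round(statistics.mean(temps), 2))
print("bootstrap SE :", round(statistics.stdev(boot), 3))
```

The standard deviation of the bootstrap means is the bootstrap estimate of the standard error. A histogram of `boot` would play the role of the bootstrap distribution.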

StatKey
www.lock5stat.com/statkey
Freely available web apps with no login required
Runs in (almost) any browser (incl. smartphones/tablets)
Google Chrome App available (no internet needed)
Standalone or supplement to existing technology
* ICOTS talk on StatKey: Session 9B, Thursday 7/17 at 10:55

Bootstrap Distribution for Body Temp Means
[Figure: bootstrap distribution of body temperature means, with the standard error marked]

How do we get a CI from the bootstrap distribution?
Method #1: Standard Error
Find the standard error (SE) as the standard deviation of the bootstrap statistics.
Find an interval with Statistic ± 2 · SE.
Method #2: Percentile Interval
For a 95% interval, find the endpoints that cut off 2.5% of the bootstrap means from each tail, leaving 95% in the middle.

95% Confidence Interval
[Figure: bootstrap distribution with 2.5% chopped from each tail and the middle 95% kept]
We are 95% sure that the mean body temperature for all students is between 98.04°F and 98.49°F.

Bootstrap Confidence Intervals
Version 1 (Statistic ± 2 · SE): Great preparation for moving to traditional methods
Version 2 (Percentiles): Great at building understanding of confidence intervals
Same process works for different parameters
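Both interval methods can be written as short functions that take a list of bootstrap statistics. The sketch below is a minimal illustration under stated assumptions: the helper names `se_interval` and `percentile_interval` are made up for this example, and the ten-value sample is hypothetical rather than the 50 body temperatures from the talk, so the printed endpoints will not match 98.04 and 98.49.

```python
import random
import statistics

def se_interval(original_stat, boot_stats):
    """Method #1: statistic +/- 2 * SE, where SE is the SD of the bootstrap statistics."""
    se = statistics.stdev(boot_stats)
    return original_stat - 2 * se, original_stat + 2 * se

def percentile_interval(boot_stats, level=0.95):
    """Method #2: cut off (1 - level) / 2 of the bootstrap statistics in each tail."""
    ordered = sorted(boot_stats)
    n_cut = round((1 - level) / 2 * len(ordered))  # number chopped from each tail
    return ordered[n_cut], ordered[len(ordered) - 1 - n_cut]

# Hypothetical small sample, assumed only for this illustration.
sample = [97.6, 98.9, 98.4, 96.9, 97.7, 98.2, 97.4, 99.3, 98.5, 96.4]
xbar = statistics.mean(sample)

rng = random.Random(9)  # fixed seed for reproducibility
boot_means = [
    statistics.mean(rng.choices(sample, k=len(sample)))  # resample with replacement
    for _ in range(5000)
]

print("SE method        :", se_interval(xbar, boot_means))
print("percentile method:", percentile_interval(boot_means))
```

When the bootstrap distribution is roughly symmetric and bell-shaped, the two intervals should come out close to each other, which is the bridge to the traditional formulas.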

Why does the bootstrap work?

Sampling Distribution
[Figure: a Population "tree" producing many sample "seeds"]
BUT, in practice we don't see the tree or all of the seeds; we only have ONE seed.

Bootstrap Distribution
What can we do with just one seed?
Grow a NEW tree! (Bootstrap "Population")
Estimate the distribution and variability (SE) of $\bar{x}$'s from the bootstraps.

Chris Wild: Use the bootstrap errors that we CAN see to estimate the sampling errors that we CAN'T see.

Golden Rule of Bootstraps
The bootstrap statistics are to the original statistic as the original statistic is to the population parameter.

Example #2: Sleep vs. Caffeine
Volunteers shown a list of 25 words.
Before recall: Randomly assign to either Sleep (1.5 hour nap) OR Caffeine (and awake).
Measure number of words recalled.

Group      n    mean    stdev
Sleep      12   15.25   3.31
Caffeine   12   12.25   3.55

Does this provide convincing evidence that the mean number of words recalled after sleep is higher than after caffeine, or could this difference be just due to random chance?
Mednick, Cai, Kannady, and Drummond, "Comparing the Benefits of Caffeine, Naps and Placebo on Verbal, Motor and Perceptual Memory," Behavioural Brain Research (2008)

Example #2: Sleep vs. Caffeine
$H_0: \mu_S = \mu_C$
$H_a: \mu_S > \mu_C$
$\mu$ = mean number of words recalled
Based on the sample data:

$\bar{x}_S - \bar{x}_C = 15.25 - 12.25 = 3.0$
Is this a significant difference?
How do we measure significance? ...

KEY IDEA
P-value: The proportion of samples, when $H_0$ is true, that would give results as extreme as (or more extreme than) the original sample.
Say what????

Traditional Inference
1. Check conditions
2. Which formula?
   $t = \dfrac{\bar{x}_S - \bar{x}_C}{\sqrt{\dfrac{s_S^2}{n_S} + \dfrac{s_C^2}{n_C}}}$
3. Calculate numbers and plug into formula
   $t = \dfrac{15.25 - 12.25}{\sqrt{\dfrac{3.31^2}{12} + \dfrac{3.55^2}{12}}}$
4. Chug with calculator
   $t = 2.14$
5. Which theoretical distribution?
6. df?
7. Find p-value
   $0.025 < \text{p-value} < 0.050$
8. Interpret a decision
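The "plug and chug" steps can be checked with a short calculation. This sketch uses only the summary statistics from the slide. The degrees of freedom are an assumption: the slide leaves "df?" open, and the conservative choice df = min(12, 12) - 1 = 11 is used here because it is consistent with the stated p-value bracket. The two critical values are standard t-table entries for 11 degrees of freedom, and the function name `two_sample_t` is made up for the example.

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with unpooled standard error."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / se

# Summary statistics from the Sleep vs. Caffeine table.
t = two_sample_t(15.25, 3.31, 12, 12.25, 3.55, 12)
print(round(t, 2))  # 2.14

# Assumed df = 11 (conservative). Upper-tail critical values from a t-table:
T_CRIT_05 = 1.796   # cuts off 0.05 in the upper tail
T_CRIT_025 = 2.201  # cuts off 0.025 in the upper tail

# t falls between the two critical values, so 0.025 < p-value < 0.05.
print(T_CRIT_05 < t < T_CRIT_025)  # True
```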

Randomization Approach
Create a randomization distribution by simulating many samples from the original data, assuming $H_0$ is true, and calculating the sample statistic for each new sample.
Estimate the p-value directly as the proportion of these randomization statistics that are at least as large as the original sample statistic.

Randomization Approach: Original Sample
Number of words recalled
Sleep:    9 11 13 14 14 15 16 17 17 18 18 21    ($\bar{x}_S = 15.25$)
Caffeine: 6 7 10 10 12 12 13 14 14 15 16 18     ($\bar{x}_C = 12.25$)
$\bar{x}_S - \bar{x}_C = 3.0$

To simulate samples under $H_0$ (no difference): Re-randomize the values into Sleep & Caffeine groups.
[Figure: animation pooling all 24 values and dealing them back into Sleep and Caffeine groups]

One re-randomized sample (StatKey):
$\bar{x}_S = 14.00$, $\bar{x}_C = 13.50$, $\bar{x}_S - \bar{x}_C = 0.50$
Compute $\bar{x}_S - \bar{x}_C$ for each re-randomized sample.
Repeat this process 1000s of times to see how unusual the original difference of 3.0 is.
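The re-randomization steps above can be sketched directly with the 24 recall scores from the slide. This is a minimal illustration, not StatKey's implementation: the function name `randomization_pvalue`, the number of simulations, and the seed are assumptions made for the example.

```python
import random
import statistics

# Number of words recalled, from the original sample.
sleep = [9, 11, 13, 14, 14, 15, 16, 17, 17, 18, 18, 21]
caffeine = [6, 7, 10, 10, 12, 12, 13, 14, 14, 15, 16, 18]

def randomization_pvalue(group1, group2, n_sims=10000, seed=1):
    """One-sided randomization test for mean(group1) - mean(group2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = statistics.mean(group1) - statistics.mean(group2)
    pooled = group1 + group2   # under H0 the group labels are arbitrary
    n1 = len(group1)
    count = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)    # re-randomize the values into two groups
        diff = statistics.mean(pooled[:n1]) - statistics.mean(pooled[n1:])
        if diff >= observed:   # as extreme as (or more extreme than) observed
            count += 1
    return observed, count / n_sims

observed, p_value = randomization_pvalue(sleep, caffeine)
print("observed difference:", observed)
print("estimated p-value  :", p_value)
```

The estimated p-value varies a little with the seed and the number of simulations, so it is best read as an approximation that tightens as `n_sims` grows.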

p-value = proportion of samples, when $H_0$ is true, that are as extreme as (or more extreme than) the original sample.
[Figure: randomization distribution with the p-value marked]

Implementation Issues
What about traditional (distribution-based) methods?
Intervals first or tests?
One Crank or Two?
Textbooks?
Technology/Software?

How does everything fit together?
We use simulation methods to build understanding of the key ideas of inference.
We then cover traditional normal and t-based procedures as "short-cut formulas."
Students continue to see all the standard methods but with a deeper understanding of the meaning.

Intro Stat: Revise the Topics
Descriptive Statistics: one and two samples
Normal distributions
Bootstrap confidence intervals
Data production (samples/experiments)
Randomization-based hypothesis tests
Sampling distributions (mean/proportion)
Normal distributions
Confidence intervals (means/proportions)
Hypothesis tests (means/proportions)
ANOVA for several means, Inference for regression, Chi-square tests

Transition to Traditional Inference
Confidence Interval: [formula not preserved]
Hypothesis Test: [formula not preserved]
Need to know:
Formula for SE
Conditions to use a traditional distribution

One Crank or Two?
John Holcomb (ICOTS8)
Crank #1: Reallocation
Example: Scramble the sleep/caffeine labels in the word memory experiment

Crank #2: Resample
Example: Sample body temps with replacement to get bootstrap samples
Example: Suppose we sampled 12 nappers and 12 caffeine drinkers to compare word memory...

Textbooks?
Statistical Reasoning in Sports (WH Freeman), Tabor & Franklin
Statistics: Unlocking the Power of Data (Wiley), Lock, Lock, Lock Morgan, Lock, Lock
Statistical Thinking: A Simulation Approach to Modeling Uncertainty (Catalyst Press), Zieffler & Catalysts for Change
Introduction to Statistical Investigations (Wiley), Tintle, Chance, Cobb, Rossman, Roy, Swanson and VanderStoep

Software?
StatKey: www.lock5stat.com/statkey
Rossman/Chance Applets: www.rossmanchance.com
VIT: Visual Inference Tools (Chris Wild): www.stat.auckland.ac.nz/~wild/VIT/
Mosaic (R package; Kaplan, Horton, Pruim): http://mosaic-web.org/r-packages/
Fathom/TinkerPlots (Finzer, Konold)

Thanks for Listening! Questions?
Robin [email protected]
Patti [email protected]
Kari [email protected]
Eric [email protected]
Dennis [email protected]
All [email protected]