Two-sample tests: binary or categorical outcomes (proportions)

Are the observations correlated?

Independent:
- Chi-square test: compares proportions between two or more independent groups
- Relative risks: odds ratios or risk ratios
- Logistic regression: multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios
- Alternative to the chi-square test if sparse cells: Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells <5)

Correlated:
- McNemar's chi-square test: compares a binary outcome between correlated groups (e.g., before and after)
- Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
- GEE modeling: multivariate regression technique for a binary outcome when groups are correlated
- Alternative if sparse cells: McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells <5)

Recall: the odds ratio (two samples = cases and controls)

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)           15             35             50
No Stroke (~D)        8             42             50

OR = ad/bc = (15 * 42)/(35 * 8) = 2.25

Interpretation: there is a 2.25-fold higher odds of stroke in smokers vs. non-smokers.

Inferences about the odds ratio: Does the sampling distribution follow a normal distribution?
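The 2x2 arithmetic above is easy to check with a short sketch (the slides use SAS; Python is used here purely for illustration):

```python
# 2x2 table from the slides (case-control study of stroke and smoking)
#                 Smoker (E)   Non-smoker (~E)
# Stroke (D)          15             35
# No Stroke (~D)       8             42
a, b = 15, 35   # cases:    exposed, unexposed
c, d = 8, 42    # controls: exposed, unexposed

odds_ratio = (a * d) / (b * c)   # OR = ad/bc
print(round(odds_ratio, 2))      # 2.25
```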

What is the standard error?

Simulation:
1. In SAS, assume an infinite population of cases and controls with equal proportion of smokers (exposure), p=.23 (UNDER THE NULL!).
2. Use the random binomial function to randomly select n=50 cases and n=50 controls, each with p=.23 chance of being a smoker.
3. Calculate the observed odds ratio for the resulting 2x2 table.
4. Repeat this 1000 times (or some other large number of times).
5. Observe the distribution of odds ratios under the null hypothesis.

Properties of the OR (simulation; 50 cases/50 controls/23% exposed): under the null, this is the expected variability of the sample OR. Note the right skew.

Properties of the lnOR: normal! From the simulation, we can get the empirical standard error (~0.5) and p-value (~.10).
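The SAS simulation described above can be mimicked in Python. This is a sketch of the same idea, not the original code; exact numbers depend on the random seed:

```python
import math
import random
import statistics

random.seed(1)

def sample_exposed(n, p):
    """Count of exposed subjects among n draws, each exposed with probability p."""
    return sum(random.random() < p for _ in range(n))

ln_ors = []
for _ in range(1000):
    a = sample_exposed(50, 0.23)   # exposed cases
    c = sample_exposed(50, 0.23)   # exposed controls (same p: the null is true)
    b, d = 50 - a, 50 - c
    if 0 in (a, b, c, d):
        continue                   # skip degenerate tables (OR undefined)
    ln_ors.append(math.log((a * d) / (b * c)))

# Under the null, lnOR is roughly normal; its empirical SD (the standard
# error) comes out near 0.5, matching the slides' simulation.
print(round(statistics.stdev(ln_ors), 2))
```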

Properties of the lnOR. Or, in general:

standard error = sqrt(1/a + 1/b + 1/c + 1/d)

Inferences about the ln(OR), using the stroke data (a=15, b=35, c=8, d=42; 50 cases, 50 controls):

OR = 2.25; ln(OR) = ln(2.25) = 0.81

Z = (0.81 - 0) / sqrt(1/15 + 1/35 + 1/8 + 1/42) = 0.81/0.494 = 1.64; p = .10
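The Z calculation above, as a hedged Python sketch (the normal CDF is built from `math.erf`):

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

a, b, c, d = 15, 35, 8, 42              # stroke/smoking 2x2 table
ln_or = math.log((a * d) / (b * c))     # ln(2.25) = 0.81
se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # 0.494
z = (ln_or - 0) / se                    # 1.64
p = 2 * (1 - norm_cdf(z))               # two-sided p, about .10
print(round(z, 2), round(p, 2))
```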

Confidence interval, for the same stroke data:

95% CI for the ln(OR): 0.81 +/- 1.96 * 0.494 = (-0.16, 1.78)
95% CI for the OR: (e^-0.16, e^1.78) = (0.85, 5.92)

Final answer: 2.25 (0.85, 5.92)

Practice problem: Suppose the following data were collected in a case-control study of brain tumor and cell phone usage:

                         Brain tumor   No brain tumor
Own a cell phone             20              60
Don't own a cell phone       10              40

Is there sufficient evidence for an association between cell phones and brain tumor?

Answer:
1. What is your null hypothesis?
   Null hypothesis: OR = 1.0; lnOR = 0
   Alternative hypothesis: OR =/= 1.0; lnOR =/= 0 (TWO-SIDED TEST)
2. What is your null distribution?
   lnOR ~ N(0, 1/20 + 1/60 + 1/10 + 1/40); SD(lnOR) = sqrt(1/20 + 1/60 + 1/10 + 1/40) = .44
3. Empirical evidence: OR = (20*40)/(60*10) = 800/600 = 1.33; lnOR = .288
4. Z = (.288 - 0)/.44 = .65
   TWO-SIDED TEST: it would be just as extreme if the sample lnOR were .65 standard deviations or more below the null.
   p-value = P(Z > .65 or Z < -.65) = .26 * 2 = .52
5. Not enough evidence to reject the null hypothesis of no association.

Key measures of relative risk: 95% CIs for OR and RR

For an odds ratio, 95% confidence limits:

OR * exp(-1.96 * sqrt(1/a + 1/b + 1/c + 1/d)),  OR * exp(+1.96 * sqrt(1/a + 1/b + 1/c + 1/d))

For a risk ratio, 95% confidence limits:

RR * exp(-1.96 * sqrt((1 - a/(a+b))/a + (1 - c/(c+d))/c)),  RR * exp(+1.96 * sqrt((1 - a/(a+b))/a + (1 - c/(c+d))/c))
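The two interval formulas can be wrapped in small helper functions. This is an illustrative sketch (the function names are my own); a, b, c, d are laid out as in the 2x2 tables above, with rows as the two comparison groups:

```python
import math

def or_ci(a, b, c, d, z=1.96):
    """95% CI for the odds ratio: OR * exp(+/- z * sqrt(1/a + 1/b + 1/c + 1/d))."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    return or_ * math.exp(-z * se), or_ * math.exp(z * se)

def rr_ci(a, b, c, d, z=1.96):
    """95% CI for the risk ratio; rows are exposure groups with risks a/(a+b), c/(c+d)."""
    rr = (a / (a + b)) / (c / (c + d))
    se = math.sqrt((1 - a/(a + b))/a + (1 - c/(c + d))/c)
    return rr * math.exp(-z * se), rr * math.exp(z * se)

lo, hi = or_ci(15, 35, 8, 42)      # stroke example
print(round(lo, 2), round(hi, 2))  # close to the slides' (0.85, 5.92)
```

Note the risk-ratio formula only makes sense for cohort-style data where rows are exposure groups; for the case-control tables here, the odds ratio is the appropriate measure.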

Continuous outcome (means): Are the observations independent or correlated?

Independent:
- T-test: compares means between two independent groups
- ANOVA: compares means between more than two independent groups
- Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
- Linear regression: multivariate regression technique

Correlated:
- Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
- Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
- Mixed models/GEE modeling: regression techniques to compare changes over time between two or more groups

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
- Wilcoxon sign-rank test: non-parametric alternative to the paired t-test
- Wilcoxon sum-rank test (= Mann-Whitney U test): non-parametric alternative to the t-test
- Kruskal-Wallis test: non-parametric alternative to ANOVA
- Spearman rank correlation: non-parametric alternative to Pearson's correlation

The two-sample t-test: Is the difference in means that we observe between two groups more than we'd expect to see based on chance alone?

The standard error of the difference of two means. Recall: Var(A - B) = Var(A) + Var(B) if A and B are independent! So:

SE(Xbar - Ybar) = sqrt(sigma_x^2/n + sigma_y^2/m)

**First add the variances, then take the square root of the sum to get the standard error.

Shown by simulation:
- One sample of 30 (with SD=5): SE = 5/sqrt(30) = .91
- A second sample of 30 (with SD=5): SE = 5/sqrt(30) = .91
- Difference of the two samples: SE(diff) = sqrt(25/30 + 25/30) = 1.29

Distribution of differences: if Xbar and Ybar are the averages of n and m subjects, respectively:

Xbar_n - Ybar_m ~ N(mu_x - mu_y, sigma_x^2/n + sigma_y^2/m)
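The simulation result above (SE of each mean about .91, SE of the difference about 1.29) can be reproduced with a quick sketch; this is my own Python, not the slides' SAS, and the numbers wobble a little with the seed:

```python
import random
import statistics

random.seed(2)

def sample_mean(n, mu=0.0, sd=5.0):
    """Mean of n draws from a normal distribution with the given mu and sd."""
    return statistics.fmean(random.gauss(mu, sd) for _ in range(n))

# Sampling distribution of one mean (n=30, SD=5): SE = 5/sqrt(30) = .91
means = [sample_mean(30) for _ in range(2000)]

# Sampling distribution of a difference of two independent means:
# Var(A - B) = Var(A) + Var(B), so SE(diff) = sqrt(25/30 + 25/30) = 1.29
diffs = [sample_mean(30) - sample_mean(30) for _ in range(2000)]

print(round(statistics.stdev(means), 2), round(statistics.stdev(diffs), 2))
```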

But... as before, you usually have to use the sample SD, since you won't know the true SD ahead of time. So, again, it becomes a T-distribution.

Estimated standard error of the difference:

SE(Xbar - Ybar) = sqrt(s_x^2/n + s_y^2/m)

Just plug in the sample standard deviations for each group.

Case 1: unpooled variance. Question: What are your degrees of freedom here? Answer: Not obvious!

Case 1: t-test, unpooled variances:

T = (Xbar_n - Ybar_m) / sqrt(s_x^2/n + s_y^2/m) ~ t

It is complicated to figure out the degrees of freedom here! A good approximation is given by the harmonic mean (or SAS will tell you!):

df ~ 2 / (1/n + 1/m)

Case 2: pooled variance. If you assume that the standard deviation of the characteristic (e.g., IQ) is the same in both groups, you can pool all the data to estimate a common standard deviation. This maximizes your degrees of freedom (and thus your power).

Pooling: since

s_x^2 = SUM_i (x_i - xbar_n)^2 / (n - 1), so (n - 1)s_x^2 = SUM_i (x_i - xbar_n)^2
s_y^2 = SUM_i (y_i - ybar_m)^2 / (m - 1), so (m - 1)s_y^2 = SUM_i (y_i - ybar_m)^2

the pooled variance is:

s_p^2 = [(n - 1)s_x^2 + (m - 1)s_y^2] / (n + m - 2)
      = [SUM_i (x_i - xbar_n)^2 + SUM_i (y_i - ybar_m)^2] / (n + m - 2)

The n + m - 2 in the denominator is the degrees of freedom!

Estimated standard error (using the pooled variance estimate):

SE = sqrt(s_p^2/n + s_p^2/m)

The degrees of freedom are n + m - 2.

Case 2: t-test, pooled variances:

T = (Xbar_n - Ybar_m) / sqrt(s_p^2/n + s_p^2/m) ~ t(n+m-2),
where s_p^2 = [(n - 1)s_x^2 + (m - 1)s_y^2] / (n + m - 2)

Alternate calculation formula, t-test, pooled variance:

T = (Xbar_n - Ybar_m) / (s_p * sqrt(1/n + 1/m)),
since s_p^2/n + s_p^2/m = s_p^2 (m + n)/(mn) = s_p^2 (1/n + 1/m)

Pooled vs. unpooled variance. Rule of thumb: use pooled unless you have a reason not to. Pooled gives you more degrees of freedom. Pooled has an extra assumption: variances are equal between the two groups. SAS automatically tests this assumption for you ("Equality of Variances" test). If p < .05, this suggests unequal variances, and it is better to use the unpooled t-test.

Example: two-sample t-test. In 1980, some researchers reported that men have more mathematical ability than women, as evidenced by the 1979 SATs, where a sample of 30 random male adolescents had a mean score +/- 1 standard deviation of 436 +/- 77, and 30 random female adolescents scored lower: 416 +/- 81 (genders were similar in educational backgrounds, socio-economic status, and age). Do you agree with the authors' conclusions?

Data summary:

Group            n    Sample mean   Sample standard deviation
Group 1: women   30   416           81
Group 2: men     30   436           77

Two-sample t-test:
1. Define your hypotheses (null, alternative):
   H0: mean male math SAT - mean female math SAT = 0
   Ha: mean male math SAT - mean female math SAT =/= 0 (two-sided)
2. Specify your null distribution: F and M have similar standard deviations/variances, so make a pooled estimate of variance:
   s_p^2 = [(29)77^2 + (29)81^2] / 58 = 6245
   Mbar_30 - Fbar_30 ~ T58(0, sqrt(6245/30 + 6245/30)) = T58(0, 20.4)
3. Observed difference in our experiment = 20 points.
4. Calculate the p-value of what you observed:
   T58 = (20 - 0)/20.4 = .98
   data _null_; pval=(1-probt(.98, 58))*2; put pval; run;

Example 2: Difference in means. Example: Rosenthal, R. and Jacobson, L. (1966) "Teachers' expectancies: Determinants of pupils' IQ gains." Psychological Reports, 19, 115-118.

The experiment (note: exact numbers have been altered): Grade 3 at Oak School were given an IQ test at the beginning of the academic year (n=90). Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as "academic bloomers" (n=18). BUT: the children on the teachers' lists had actually been randomly assigned to the list.

At the end of the year, the same IQ test was re-administered.

Example 2. Statistical question: Do students in the treatment group have more improvement in IQ than students in the control group? What will we actually compare? One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.

Results:

Group                       Change in IQ score (SD)
Academic bloomers (n=18)    12.2 (2.0)
Controls (n=72)             8.2 (2.0)

Difference = 4 points. The standard deviation of change scores was 2.0 in both groups; this affects statistical significance.

What does a 4-point difference mean? Before we perform any formal statistical analysis on these data, we already have a lot of information. Look at the basic numbers first; THEN consider statistical significance as a secondary guide.

Is the association statistically significant? This 4-point difference could reflect a true effect or it could be a fluke. The question: is a 4-point difference bigger or smaller than the expected sampling variability?
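One way to quantify that expected sampling variability is the standard error of the difference in change scores. A minimal Python sketch, assuming the reported group sizes (18 and 72) and a common SD of change scores of 2.0, plus a noisier hypothetical SD of 10.0 considered later in the slides:

```python
import math

def se_diff(sd, n1, n2):
    """SE of a difference in means when both groups share the same SD."""
    return math.sqrt(sd**2 / n1 + sd**2 / n2)

# 18 "academic bloomers" vs. 72 controls, SD of change scores = 2.0
print(round(se_diff(2.0, 18, 72), 2))   # ~0.53 (the slides quote ~0.52)

# With a much noisier outcome (SD = 10.0), the SE balloons
print(round(se_diff(10.0, 18, 72), 2))  # ~2.6 (the slides quote ~2.58)
```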

Hypothesis testing. Step 1: Assume the null hypothesis. Null hypothesis: there is no difference between academic bloomers and normal students (= the difference is 0).

Hypothesis testing. Step 2: Predict the sampling variability assuming the null hypothesis is true. These predictions can be made by mathematical theory or by computer simulation.

Step 2, math theory: with s_p^2 = 4.0 (SD = 2.0 in each group):

(change in "gifted") - (change in control) ~ T88(0, sqrt(4/18 + 4/72)) = T88(0, 0.52)

Step 2, computer simulation: in computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability. I used computer simulation to take 1000 samples of 18 treated and 72 controls. Computer simulation results: the standard error is about 0.52.

3. Empirical data: observed difference in our experiment = 12.2 - 8.2 = 4.0.

4. P-value: a t-curve with 88 df has slightly wider cut-offs for 95% area (t = 1.99) than a normal curve (Z = 1.96).

t88 = (12.2 - 8.2)/.52 = 4/.52 ~ 7.7; p-value < .0001

Visually: if we ran this study 1000 times, we wouldn't expect to get even 1 result as big as a difference of 4 (under the null hypothesis).

5. Reject the null! Conclusion: IQ scores can bias expectancies in the teachers' minds and cause them to unintentionally treat "bright" students differently from those seen as less bright.

Confidence interval (more information!!): 95% CI for the difference: 4.0 +/- 1.99(.52) = (3.0, 5.0). (The t-curve with 88 df has slightly wider cut-offs for 95% area, t = 1.99, than a normal curve, Z = 1.96.)

What if our standard deviation had been higher? The standard deviation for change scores in treatment and control was 2.0 in each group. What if change scores had been much more variable, say, a standard deviation of 10.0 (for both)?

Std. dev. in change scores = 2.0: standard error is 0.52
Std. dev. in change scores = 10.0: standard error is 2.58

With a std. dev. of 10.0: LESS STATISTICAL POWER! With a standard error of 2.58, if we ran this study 1000 times, we would expect to get +4.0 or -4.0 about 12% of the time. P-value = .12.

Don't forget: the paired t-test. Did the control group in the previous experiment improve at all during the year? Do not apply a two-sample t-test to answer this question! After - Before yields a single sample of differences: a within-group rather than between-group comparison.

Data summary:

Group 1 (change scores in controls): n = 72, sample mean = +8.2, sample standard deviation = 2.0.

Did the control group in the previous experiment improve at all during the year? One-sample (paired) t-test on the control group's change scores:

t71 = (8.2 - 0) / (2/sqrt(72)); p-value < .0001

Normality assumption of the t-test. If the distribution of the trait is normal, it is fine to use a t-test. But if the underlying distribution is not normal and the sample size is small, the Central Limit Theorem takes some time to kick in and you cannot use the t-test (rule of thumb: n > 30 per group if not too skewed; n > 100 if the distribution is really skewed). Note: otherwise, the t-test is very robust against the normality assumption!

Alternative tests when normality is violated: non-parametric tests.

Non-parametric tests: t-tests require your outcome variable to be normally distributed (or close enough), for small samples. Non-parametric tests are based on RANKS instead of means and standard deviations (= population parameters).

Example: non-parametric tests. 10 dieters following the Atkins diet vs. 10 dieters following Jenny Craig. Hypothetical RESULTS: the Atkins group loses an average of 34.5 lbs.

The J. Craig group loses an average of 18.5 lbs. Conclusion: Atkins is better?

Example: non-parametric tests. BUT, take a closer look at the individual data:

Atkins, change in weight (lbs):   +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
J. Craig, change in weight (lbs): -8, -10, -12, -16, -18, -20, -21, -24, -26, -30

[Histograms of weight change: the Jenny Craig values cluster between -30 and -8; the Atkins values cluster near 0 with one extreme outlier at -300.]

t-test inappropriate. Comparing the mean weight loss of the two groups is not appropriate here. The distributions do not appear to be normally distributed. Moreover, there is an extreme outlier (this outlier influences the mean a great deal).

Wilcoxon rank-sum test: RANK the values, 1 being the least weight loss and 20 being the most weight loss.

Atkins:   +4, +3, 0, -3, -4, -5, -11, -14, -15, -300  ->  ranks 1, 2, 3, 4, 5, 6, 9, 11, 12, 20
J. Craig: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30  ->  ranks 7, 8, 10, 13, 14, 15, 16, 17, 18, 19

Wilcoxon rank-sum test:

Sum of Atkins ranks: 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
Sum of Jenny Craig's ranks: 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137

Jenny Craig clearly ranked higher! P-value* (from computer) = .018
*For details of the statistical test, see the appendix of these slides.

Difference in proportions (special case of the chi-square test)

Null distribution of a difference in proportions:

Standard error of a proportion = sqrt(p(1 - p)/n)

The standard error can be estimated by sqrt(phat(1 - phat)/n) (still normally distributed).

Standard error of the difference of two proportions:

sqrt(p1(1 - p1)/n1 + p2(1 - p2)/n2), or

sqrt(pbar(1 - pbar)/n1 + pbar(1 - pbar)/n2), where pbar = (n1*p1 + n2*p2)/(n1 + n2)
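These two standard-error formulas, as a small Python sketch (the helper names are mine):

```python
import math

def se_diff_prop(p1, n1, p2, n2):
    """Unpooled SE: sqrt(p1(1-p1)/n1 + p2(1-p2)/n2)."""
    return math.sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)

def se_diff_prop_pooled(p1, n1, p2, n2):
    """Pooled SE, using the average proportion (appropriate under the null)."""
    p_bar = (n1*p1 + n2*p2) / (n1 + n2)
    return math.sqrt(p_bar*(1 - p_bar)/n1 + p_bar*(1 - p_bar)/n2)

# Stroke example used below: 15/50 exposed cases vs. 8/50 exposed controls
print(round(se_diff_prop_pooled(15/50, 50, 8/50, 50), 3))   # ~0.084
```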

The variance of a difference is the sum of the variances (as with the difference in means). The pooled proportion is analogous to the pooled variance in the t-test.

Null distribution of a difference in proportions:

p1hat - p2hat ~ N(p1 - p2, p(1 - p)/n1 + p(1 - p)/n2)

This follows a normal distribution because the binomial can be approximated with the normal.

Difference in proportions test. Null hypothesis: the difference in proportions is 0.

Z = (p1hat - p2hat) / sqrt(pbar(1 - pbar)/n1 + pbar(1 - pbar)/n2),
where pbar = (n1*p1hat + n2*p2hat)/(n1 + n2) (just the average proportion)

Recall: the variance of a proportion is p(1 - p)/n. Here p1hat = proportion in group 1, p2hat = proportion in group 2, n1 = number in group 1, n2 = number in group 2. Use the average (or pooled) proportion in the standard error formula because, under the null hypothesis, the groups share a common proportion.

Recall the case-control example:

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)           15             35             50
No Stroke (~D)        8             42             50

Absolute risk: difference in proportions exposed:

P(E|D) - P(E|~D) = 15/50 - 8/50 = 30% - 16% = 14%

Z = (.14 - 0) / sqrt(.23*.77/50 + .23*.77/50) = .14/.084 = 1.67

95% CI: 0.14 +/- 1.96 * .084 = (-0.03, 0.31)
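The interval above, sketched in Python. Note the slides use the pooled SE (.084) for the confidence interval, which is a simplification (a textbook CI would use the unpooled SE), and they round the endpoints outward:

```python
import math

# Stroke example: risk difference .30 - .16 = .14, pooled SE ~ .084
diff = 15/50 - 8/50
p_bar = (15 + 8) / 100
se = math.sqrt(2 * p_bar * (1 - p_bar) / 50)
lo, hi = diff - 1.96*se, diff + 1.96*se
print(round(lo, 2), round(hi, 2))   # ~ (-0.02, 0.30); slides round to (-0.03, 0.31)
```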

Example 2: Difference in proportions. Research question: Are antidepressants a risk factor for suicide attempts in children and adolescents? Example modified from: "Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults"; Olfson et al. Arch Gen Psychiatry. 2006;63:865-872.

Example 2: Difference in proportions. Design: case-control study. Methods: researchers used Medicaid records to compare prescription histories between 263 children and teenagers (6-18 years) who had attempted suicide and 1241 controls who had never attempted suicide (all subjects suffered from depression). Statistical question: Is a history of use of antidepressants more common among cases than controls?

Example 2. What will we actually compare? The proportion of cases who used antidepressants in the past vs. the proportion of controls who did.

Results:

Any antidepressant drug ever:   No. (%) of cases (n=263)   No. (%) of controls (n=1241)
                                120 (46%)                  448 (36%)

46% vs. 36%: difference = 10%.

Is the association statistically significant? This 10% difference could reflect a true association or it could be a fluke in this particular sample. The question: is 10% bigger or smaller than the expected sampling variability?

Hypothesis testing. Step 1: Assume the null hypothesis. Null hypothesis: there is no association between antidepressant use and suicide attempts in the target population (= the difference is 0%).

Hypothesis testing. Step 2: Predict the sampling variability assuming the null hypothesis is true:

pbar = 568/1504
p_cases - p_controls ~ N(0, sqrt(pbar(1 - pbar)/263 + pbar(1 - pbar)/1241)) = N(0, .033)
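Steps 2-4 for this example can be sketched end to end in Python (the normal CDF comes from `math.erf`; the slides' Z = 3.0 uses the rounded 10% difference, while the raw counts give about 2.9):

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

cases_exposed, n_cases = 120, 263          # 46% of cases used antidepressants
controls_exposed, n_controls = 448, 1241   # 36% of controls did

p1, p2 = cases_exposed/n_cases, controls_exposed/n_controls
p_bar = (cases_exposed + controls_exposed) / (n_cases + n_controls)   # 568/1504
se = math.sqrt(p_bar*(1 - p_bar)*(1/n_cases + 1/n_controls))          # ~.033
z = (p1 - p2) / se                                                    # ~2.9
p_value = 2 * (1 - norm_cdf(z))
print(round(se, 3), round(z, 1), round(p_value, 3))
```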

Also: computer simulation results. The standard error is about 3.3%.

Hypothesis testing. Step 3: Do an experiment. We observed a difference of 10% between cases and controls.

Hypothesis testing. Step 4: Calculate a p-value:

Z = .10/.033 = 3.0; p = .003

P-value from our simulation: when we ran this study 1000 times, we got 1 result as big or bigger than 10%, and 3 results as small or smaller than -10%. From our simulation, we estimate the p-value to be 4/1000, or .004.

Hypothesis testing. Step 5: Reject or do not reject the null hypothesis. Here we reject the null. Alternative hypothesis: there is an association between antidepressant use and suicide attempts in the target population.

What would a lack of statistical significance mean? If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher, as shown in this computer simulation:

263 cases and 1241 controls: standard error is about 3.3%
50 cases and 50 controls: standard error is about 10%

With only 50 cases and 50 controls (standard error about 10%): if we ran this study 1000 times, we would expect to get values of 10% or higher 170 times (or 17% of the time). Two-tailed p-value = 17% x 2 = 34%.

Practice problem. An August 2003 research article in Developmental and Behavioral Pediatrics reported the following about a sample of UK kids: when given a choice of a non-branded chocolate cereal vs. CoCo Pops, 97% (36) of 37 girls and 71% (27) of 38 boys preferred the CoCo Pops. Is this evidence that girls are more likely to choose brand-named products?

Answer:
1. Hypotheses: H0: p_girls - p_boys = 0; Ha: p_girls - p_boys =/= 0 (two-sided).
2. Null distribution of the difference of two proportions: the null says the p's are equal, so estimate the standard error using the overall observed proportion, pbar = 63/75 = .84:

p_f - p_m ~ N(0, sqrt(.84(.16)/37 + .84(.16)/38)) = N(0, .085)

3. Observed difference in our experiment = .97 - .71 = .26.
4. Calculate the p-value of what you observed:

Z = (.26 - 0)/.085 = 3.06

data _null_; pval=(1-probnorm(3.06))*2; put pval; run;

Key two-sample hypothesis tests.

Test for H0: mu_x - mu_y = 0 (sigma^2 unknown, but roughly equal):

t(nx+ny-2) = (xbar - ybar) / sqrt(s_p^2/nx + s_p^2/ny);
s_p^2 = [(nx - 1)s_x^2 + (ny - 1)s_y^2] / (nx + ny - 2)

Test for H0: p1 - p2 = 0:

Z = (p1hat - p2hat) / sqrt(pbar(1 - pbar)/n1 + pbar(1 - pbar)/n2);
pbar = (n1*p1hat + n2*p2hat)/(n1 + n2)

Corresponding confidence intervals.

For a difference in means, 2 independent samples (sigma^2's unknown but roughly equal):

(xbar - ybar) +/- t(nx+ny-2, alpha/2) * sqrt(s_p^2/nx + s_p^2/ny)

For a difference in proportions, 2 independent samples:

(p1hat - p2hat) +/- Z(alpha/2) * sqrt(pbar(1 - pbar)/n1 + pbar(1 - pbar)/n2)

Appendix: details of the rank-sum test.

Wilcoxon rank-sum test. Rank all of the observations in order from 1 to n.
T1 = the sum of the ranks from the smaller group (n1)
T2 = the sum of the ranks from the larger group (n2)

U1 = n1*n2 + n1(n1 + 1)/2 - T1
U2 = n1*n2 + n2(n2 + 1)/2 - T2

U0 = min(U1, U2). For small groups, find P(U <= U0) in Mann-Whitney U tables, with n2 = the bigger of the 2 groups. For n1 >= 10 and n2 >= 10:

Z = (U0 - n1*n2/2) / sqrt(n1*n2(n1 + n2 + 1)/12)

Example:

For example, if team 1 and team 2 (two gymnastic teams) are competing, and the judges rank all the individuals in the competition, how can you tell if team 1 has done significantly better than team 2, or vice versa?

T1 = sum of ranks of group 1 (smaller); T2 = sum of ranks of group 2 (larger).

Intuition: under the null hypothesis of no difference between the two groups, if n1 = n2, the sums T1 and T2 should be equal. But if n1 =/= n2, then T2 (n2 = bigger group) should automatically be bigger. But how much bigger under the null?

For example, if team 1 has 3 people and team 2 has 10, we could rank all 13 participants from 1 to 13 on individual performance. If team 1 (X) and team 2 don't differ in talent, the ranks ought to be spread evenly among the two groups, e.g.:

1 2 X 4 5 6 X 8 9 10 X 12 13 (exactly even distribution if team 1 ranks 3rd, 7th, and 11th)

Remember this? The total sum of ranks is:

T1 + T2 = SUM(i = 1 to n1+n2) i = (n1 + n2)(n1 + n2 + 1)/2

The sum of within-group ranks for the smaller group alone is SUM(i = 1 to n1) i = n1(n1 + 1)/2, and for the larger group alone SUM(i = 1 to n2) i = n2(n2 + 1)/2. Expanding the total:

(n1 + n2)(n1 + n2 + 1)/2 = (n1^2 + n1*n2 + n1 + n1*n2 + n2^2 + n2)/2
                         = n1(n1 + 1)/2 + n2(n2 + 1)/2 + n1*n2

Take-home point:

T1 + T2 = n1(n1 + 1)/2 + n2(n2 + 1)/2 + n1*n2

e.g., here: T1 + T2 = SUM(i = 1 to 13) i = (13)(14)/2 = 91 = 6 + 55 + 30.

It turns out that, if the null hypothesis is true, the difference between the larger-group sum of ranks and the smaller-group sum of ranks is exactly equal to the difference between the within-group sums:

Under the null: T2 - T1 = n2(n2 + 1)/2 - n1(n1 + 1)/2

In the example: the sum of ranks 1 to 10 is 10(11)/2 = 55; the sum of ranks 1 to 3 is 3(4)/2 = 6; the difference between the within-group sums is 55 - 6 = 49. And with evenly interspersed ranks, T1 = 3 + 7 + 11 = 21 and T2 = 1 + 2 + 4 + 5 + 6 + 8 + 9 + 10 + 12 + 13 = 70, so T2 - T1 = 70 - 21 = 49. Magic! The difference between the sums of ranks of the two groups is also 49 when the ranks are evenly interspersed (null is true).

Define new statistics:

define U2 = n1*n2 + n2(n2 + 1)/2 - T2
define U1 = n1*n2 + n1(n1 + 1)/2 - T1

Their sum equals n1*n2: U1 + U2 = n1*n2.

Here, under the null: U2 = 55 + 30 - 70 = 15 and U1 = 6 + 30 - 21 = 15, so U1 + U2 = 30.

Under the null hypothesis, U1 should equal U2:

E(U2 - U1) = E[(n2(n2 + 1)/2 - n1(n1 + 1)/2) - (T2 - T1)] = 0

The U's should be equal to each other, and each will equal n1*n2/2: since U1 + U2 = n1*n2 and, under the null, U1 = U2 = U0:

E(U1 + U2) = 2E(U0) = n1*n2, so E(U0) = n1*n2/2

So the test statistic here is not quite the difference in the sums of ranks of the 2 groups; it is the smaller observed U value, U0. For small n's, take U0 and get the p-value directly from a U table. For large enough n's (>10 per group):

E(U0) = n1*n2/2
Var(U0) = n1*n2(n1 + n2 + 1)/12
Z = (U0 - E(U0)) / sqrt(Var(U0)) = (U0 - n1*n2/2) / sqrt(n1*n2(n1 + n2 + 1)/12)
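The U bookkeeping above can be wrapped in a small sketch (the function name is mine):

```python
def mann_whitney_u(t_small, n_small, t_big, n_big):
    """U statistics from rank sums: U = n1*n2 + n*(n+1)/2 - T for each group."""
    u1 = n_small*n_big + n_small*(n_small + 1)//2 - t_small
    u2 = n_small*n_big + n_big*(n_big + 1)//2 - t_big
    return u1, u2, min(u1, u2)

# Gymnastics example from these slides: T1 = 13 (n=3), T2 = 78 (n=10)
print(mann_whitney_u(13, 3, 78, 10))   # (23, 7, 7)
```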

Add observed data to the example. Example: if the girls on the two gymnastics teams were ranked as follows:

Team 1: 1, 5, 7 -> observed T1 = 13
Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13 -> observed T2 = 78

Are the teams significantly different? Total sum of ranks = 13*14/2 = 91; n1*n2 = 3*10 = 30. Under the null hypothesis: expect U1 - U2 = 0 and U1 + U2 = 30 (each should equal about 15 under the null), so U0 should be near 15.

U1 = 30 + 6 - 13 = 23
U2 = 30 + 55 - 78 = 7
U0 = 7

Not quite statistically significant in the U table: p = .1084 (x 2 for a two-tailed test).

Example problem 2. A study was done to compare the Atkins diet (low-carb) vs. Jenny Craig (low-cal, low-fat). The following weight changes were obtained; note they are very skewed because someone lost 100 pounds. The mean loss for Atkins is going to look bigger because of that one person, but does that mean the diet is better overall? Conduct a Mann-Whitney U test to compare ranks.

Weight change (lbs) and ranks:

Atkins:      -100, -8, -4, +5, +8, +2   ->  ranks 1, 5, 7, 9, 11, 8
Jenny Craig: -11, -15, -5, +6, -20      ->  ranks 4, 3, 6, 10, 2

Answer: Sum of ranks for JC = 25 (n=5); sum of ranks for Atkins = 41 (n=6); n1*n2 = 5*6 = 30.

Under the null hypothesis: expect U1 - U2 = 0 and U1 + U2 = 30, so U0 should be near 15.

U1 = 30 + 15 - 25 = 20
U2 = 30 + 21 - 41 = 10
U0 = 10; n1 = 5, n2 = 6

Go to the Mann-Whitney chart: p = .2143 x 2 = .42
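As a closing check, the whole rank-sum computation for example problem 2 can be run from the raw weight changes. This is a sketch under the slides' assumptions: the first argument is the smaller group, and ties are not handled (there are none here):

```python
def rank_sum_u(group1, group2):
    """Rank all values jointly (1 = smallest); return rank sums and U0.

    group1 is the smaller group. No tie handling (no ties in this data)."""
    pooled = sorted(group1 + group2)
    rank = {v: i + 1 for i, v in enumerate(pooled)}   # value -> joint rank
    t1 = sum(rank[v] for v in group1)
    t2 = sum(rank[v] for v in group2)
    n1, n2 = len(group1), len(group2)
    u1 = n1*n2 + n1*(n1 + 1)//2 - t1
    u2 = n1*n2 + n2*(n2 + 1)//2 - t2
    return t1, t2, min(u1, u2)

jenny_craig = [-11, -15, -5, 6, -20]       # n = 5 (smaller group)
atkins = [-100, -8, -4, 5, 8, 2]           # n = 6

print(rank_sum_u(jenny_craig, atkins))     # (25, 41, 10)
```

This reproduces the rank sums (25 and 41) and U0 = 10 worked out by hand above.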