In total, 242 participants were invited, 175 began the battery, 157 completed the battery, and 150 provided data that passed QC at both time points. Our target sample size was determined in advance of data collection, and data collection continued until this number of participants with data that survived QC was reached.
Retest data collection took place on average 115 days after the completion of the initial battery, with a range of 60 to 228 days.
The variables included in this report are:
- Meaningful variables (includes only HDDM parameters)
- EZ diffusion parameters
- Raw RT and Accuracy measures
- Variables found in the literature (for comparison)
For reference, here are the variables that are not included in the analyses in the remainder of this report because they were not of theoretical interest for factor structure analyses of these data. They include: drift diffusion and other model parameters for specific conditions within a task; survey variables that are not part of the dependent variables for that survey in the literature; and demographics (these are saved for prediction analyses).
test_data2 <- read.csv(paste0(retest_data_path, 't1_data/variables_exhaustive.csv'))
df <- data.frame(names(test_data2)[which(names(test_data2) %in% names(test_data) == FALSE)])
names(df) = c('vars')
df %>%
filter(vars != "X")
rm(test_data2, df)
summary(test_demog %>%
select(Sex, Age))
Sex Age
Min. :0.00 Min. :21.0
1st Qu.:0.00 1st Qu.:28.0
Median :1.00 Median :33.0
Mean :0.52 Mean :34.1
3rd Qu.:1.00 3rd Qu.:38.8
Max. :1.00 Max. :59.0
summary(retest_demog %>%
select(Sex, Age))
Sex Age
Min. :0.000 Min. :21.0
1st Qu.:0.000 1st Qu.:29.0
Median :1.000 Median :33.0
Mean :0.527 Mean :34.5
3rd Qu.:1.000 3rd Qu.:38.8
Max. :1.000 Max. :60.0
Here’s what our literature review looks like
lit_review
tmp = lit_review[duplicated(lit_review$reference)==FALSE,]
The literature review included 151 papers, 17167 participants and 574 data points on reliability.
rm(tmp)
An interactive version of this plot can be found here.
Takeaways from this review are:
- Survey measures have been reported to have higher reliability than task measures
- Survey measures have less variability in the reported reliability estimates than task measures
Bootstrapped estimates of reliability (median and 95% CI) for the demographic variables.
demog_boot_df %>%
group_by(dv) %>%
summarise(median_icc2.1 = quantile(icc2.1, probs=0.5),
icc2.1_2.5 = quantile(icc2.1, probs = 0.025),
icc2.1_97.5 = quantile(icc2.1, probs = 0.975)) %>%
arrange(-median_icc2.1)
Point estimates of reliability for the demographic items.
demog_rel_df = make_rel_df(test_demog, t2_df = retest_demog, metrics=c('icc2.1'))
demog_rel_df %>%
select(dv, icc2.1) %>%
arrange(-icc2.1)
Plotting point estimates of reliability for the demographic items, with items that refer to a limited time period (e.g. the last month) shown darker in the histogram.
timed_demogs <- c("DoctorVisitsLastMonth","DaysHalfLastMonth","Worthless","DaysPhysicalHealthFeelings","Depressed","EverythingIsEffort","Hopeless","RestlessFidgety","Nervous","DaysLostLastMonth","Last30DaysUsual")
demog_rel_df %>%
mutate(timed = ifelse(dv %in% timed_demogs, 1, 0)) %>%
ggplot(aes(icc2.1, fill = factor(timed)))+
geom_histogram(position = "identity", alpha = 0.7, color = NA)+
xlab("icc2.1")+
ylab("Count")+
scale_fill_manual(values = c("grey", "grey25"))+
theme(legend.position = "none")
Instead of a summary metric like the retest reliability estimate for each measure, we can check whether the difference score distribution for each measure depends on the delay between the two measurements. Since the difference scores are at the subject level, we can check whether the order of subjects in this distribution depends on their order in the distribution of days between completing the two tests.
To check this we build a data frame with the difference between the two scores for each measure and each subject. Since the scores for different measures are on different scales, the difference scores are normalized for comparability (i.e., demeaned and divided by the standard deviation of the difference score distribution). Note that the variance of the difference score distribution accounts for the variance at both time points by summing them. Normalization sets the mean of each difference score distribution to 0, which would mask any meaningful change between the two time points; however, the analysis here does not interpret the mean of the difference score distributions but only their relation to the days between completion. We check whether the variables show systematic differences between the two time points later.
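Concretely, for each measure the normalized difference score for subject \(i\) restates the scaling applied in the code below:

\[
d_i = \frac{(x_{i,2} - x_{i,1}) - \overline{(x_{2} - x_{1})}}{SD(x_{2} - x_{1})}
\]

where \(x_{i,1}\) and \(x_{i,2}\) are subject \(i\)'s scores on that measure at the first and second time point.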
Here we check if the difference is larger the longer the delay.
numeric_cols = get_numeric_cols(test_data)
t1_2_difference = data.frame()
for(i in 1:length(numeric_cols)){
tmp = match_t1_t2(numeric_cols[i],t1_df=test_data ,t2_df=retest_data,format='wide')
tmp = tmp %>%
mutate(difference = scale(`2` - `1`))
t1_2_difference = rbind(t1_2_difference, tmp)
}
# scale() returns a one-column matrix; flatten it back into a numeric vector
t1_2_difference$difference = as.data.frame(t1_2_difference$difference)$V1
t1_2_difference = t1_2_difference %>% separate(dv, c("task", "dv2"), sep="\\.", remove=FALSE)
Add completion dates to this data frame.
t1_2_difference = merge(t1_2_difference, task_comp_times[,c('sub_id', 'task','days_btw')], by=c('sub_id', 'task'))
What does the distribution of differences between the two time points look like for each measure?
t1_2_difference %>%
ggplot(aes(difference, alpha=dv))+
geom_histogram(position='identity')+
theme(legend.position = 'none')
What do the difference score distributions look like with respect to the days between completion?
t1_2_difference %>%
ggplot()+
geom_smooth(aes(as.numeric(days_btw), abs(difference), group=factor(dv)), method='lm', se=FALSE, color = 'grey', alpha = 0.5)+
geom_smooth(aes(as.numeric(days_btw), abs(difference)), method='lm', color = "black", se=FALSE)+
theme(legend.title = element_blank())+
xlab('Days between completion')+
ylab('Scaled difference score')
To test whether the slope of the black line is significant we run a mixed effects model with a fixed effect for days between completion, a random slope for days between completion for each dv, and a random intercept for each dv.
Previously we used subjects as a random effect, but the days between the two time points for each measure depend on the subject. What varies randomly is which dv's distribution of differences we are examining in relation to the days between the time points. So the model has a fixed effect for days between completion, a random slope for it by dependent variable (dependent variables can be differentially sensitive to the effect of the delay), and a random intercept for each dependent variable.
The significant fixed effect suggests that, on average, the longer the delay the smaller the difference.
summary(lmerTest::lmer(abs(difference) ~ scale(days_btw)+(scale(days_btw) | dv), data=t1_2_difference))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: abs(difference) ~ scale(days_btw) + (scale(days_btw) | dv)
Data: t1_2_difference
REML criterion at convergence: 132273
Scaled residuals:
Min 1Q Median 3Q Max
-1.127 -0.719 -0.260 0.402 15.807
Random effects:
Groups Name Variance Std.Dev. Corr
dv (Intercept) 0.003190 0.0565
scale(days_btw) 0.000149 0.0122 1.00
Residual 0.468366 0.6844
Number of obs: 63454, groups: dv, 446
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.72218 0.00382 457.00000 189.05 <2e-16 ***
scale(days_btw) -0.00985 0.00278 3236.00000 -3.55 0.0004 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
scl(dys_bt) 0.146
# summary(tmp <- MCMCglmm(abs(difference) ~ scale(days_btw), random = ~us(scale(days_btw)):dv, data=t1_2_difference, verbose=FALSE))
A single mixed effects model, however, does not give a sense of the simple effects (for how many variables the effect of days between completion is significant, and in which direction). We can run the regression separately for each dv to see whether all difference score distributions are affected the same way by the days between completion. Since we are running so many tests we need to correct for multiple comparisons. How many of these tests are significant after FDR correction? None.
get_delay_effect = function(df){
mod = lm(abs(difference) ~ scale(days_btw), data = df)
out = data.frame(estimate=coef(summary(mod))["scale(days_btw)","Estimate"], pval=coef(summary(mod))["scale(days_btw)","Pr(>|t|)"])
return(out)
}
days_effect = t1_2_difference %>%
group_by(dv) %>%
do(get_delay_effect(.))
#Correct p-values for multiple comparisons
sum(p.adjust(days_effect$pval, method = "fdr") < 0.05)
[1] 0
Conclusion: The average difference between the scores of a measure does not change systematically with increasing delay.
Some surveys have overlapping questions. Do these correlate within and across sessions?
First determine the overlapping questions.
duplicate_items
#surveys to read in
extract_items = c('worker',unique(with(duplicate_items, c(item1_ID, item2_ID))))
#correlations to compute:
#item1_t1 - item2_t1,
#item1_t2 - item2_t2,
#item1_t1 - item2_t2,
#item1_t2 - item2_t1
duplicate_items_data_t1 = duplicate_items_data_t1 %>%
filter(worker %in% duplicate_items_data_t2$worker) %>%
select(extract_items)
duplicate_items_data_t2=duplicate_items_data_t2 %>%
filter(worker %in% duplicate_items_data_t1$worker) %>%
select(extract_items)
duplicate_items = duplicate_items %>%
mutate(t1_t1_cor = NA,
t2_t2_cor = NA,
t1_t2_cor = NA,
t2_t1_cor = NA,
t1_t1_polycor = NA,
t2_t2_polycor = NA,
t1_t2_polycor = NA,
t2_t1_polycor = NA)
for(i in 1:nrow(duplicate_items)){
duplicate_items$t1_t1_cor[i] = abs(cor(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t2_t2_cor[i] = abs(cor(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t1_t2_cor[i] = abs(cor(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t2_t1_cor[i] = abs(cor(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t1_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t2_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t1_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t2_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])
}
for(i in 1:nrow(duplicate_items)){
duplicate_items$t1_t1_cor[i] = abs(cor(scale(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])]),
scale(duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])])))
duplicate_items$t2_t2_cor[i] = abs(cor(scale(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])]),
scale(duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])])))
duplicate_items$t1_t2_cor[i] = abs(cor(scale(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])]),
scale(duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])])))
duplicate_items$t2_t1_cor[i] = abs(cor(scale(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])]),
scale(duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])])))
duplicate_items$t1_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t2_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t1_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t2_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])
}
duplicate_items %>%
arrange(-t1_t1_polycor, -t2_t2_polycor)
# summary(duplicate_items$t1_t1_cor)
# summary(duplicate_items$t2_t2_cor)
# summary(duplicate_items$t1_t2_cor)
# summary(duplicate_items$t2_t1_cor)
summary(duplicate_items$t1_t1_polycor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.284 0.355 0.583 0.567 0.742 0.822
summary(duplicate_items$t2_t2_polycor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.101 0.446 0.612 0.580 0.700 0.904
# summary(duplicate_items$t1_t2_polycor)
# summary(duplicate_items$t2_t1_polycor)
tmp = duplicate_items %>%
select(similarity, t1_t1_polycor, t2_t2_polycor) %>%
gather(key, value, -similarity)
tmp %>%
ggplot(aes(similarity, value, col=key))+
geom_smooth(method="lm", alpha = 0.15)+
ylab("Polychoric correlation")+
xlab("Levenshtein distance")+
scale_color_discrete(breaks = c("t1_t1_polycor", "t2_t2_polycor"),
labels = c("T1 correlations", "T2 correlations"),
name = element_blank())+
theme_bw()
summary(lm(value~key*scale(similarity),tmp))
Call:
lm(formula = value ~ key * scale(similarity), data = tmp)
Residuals:
Min 1Q Median 3Q Max
-0.4483 -0.1631 0.0159 0.1343 0.3367
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5675 0.0448 12.66 2e-14
keyt2_t2_polycor 0.0127 0.0634 0.20 0.84
scale(similarity) 0.0924 0.0454 2.03 0.05
keyt2_t2_polycor:scale(similarity) -0.0515 0.0643 -0.80 0.43
(Intercept) ***
keyt2_t2_polycor
scale(similarity) *
keyt2_t2_polycor:scale(similarity)
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.195 on 34 degrees of freedom
Multiple R-squared: 0.128, Adjusted R-squared: 0.051
F-statistic: 1.66 on 3 and 34 DF, p-value: 0.193
Summarize bootstrapped results and merge to lit review data
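The merge itself happens in a chunk not shown here; a minimal sketch of the idea, assuming the bootstrapped estimates and the literature review share dv labels (an assumption for illustration only), would be:
# Sketch only: average the bootstrapped estimates per measure and attach the
# literature estimates for the matching measures.
rel_comp_sketch = boot_df %>%
  group_by(dv) %>%
  summarise(mean_pearson = mean(pearson),
            mean_icc2.1 = mean(icc2.1)) %>%
  inner_join(lit_review, by = "dv")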
Here’s what our data looks like: (583 data points for 171 measures)
rel_comp
Distribution of reliabilities, sample sizes and delays in the literature
rel_comp %>%
select(dv, task, retest_reliability, sample_size, days) %>%
filter(days < 3600) %>%
gather(key, value, -dv, -task) %>%
ggplot(aes(value, fill=task))+
geom_density(alpha=0.5, position='identity')+
facet_wrap(~key, scales='free')+
theme(legend.title = element_blank())
summary(rel_comp$sample_size[rel_comp$task == "survey"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
11 38 73 126 153 791
summary(rel_comp$sample_size[rel_comp$task == "task"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.0 30.0 45.5 76.6 79.5 1120.0
rel_comp %>%
group_by(task) %>%
summarise(mean_sample_size = mean(sample_size),
sem_sample_size = sem(sample_size)) %>%
ggplot(aes(task,mean_sample_size,col=task))+
geom_point(size=5)+
geom_errorbar(aes(ymin=mean_sample_size-sem_sample_size, ymax=mean_sample_size+sem_sample_size), width = 0, size=2)+
xlab("")+
ylab("Sample size")+
theme(legend.position = 'none',
axis.text.x = element_text(size=14),
axis.title.y = element_text(size=14))
Studies in the literature that report retest reliability have smaller samples for task measures than for survey measures.
# summary(lm(sample_size ~ task,rel_comp))
summary(m1 <- MCMCglmm(sample_size ~ task, data=rel_comp, verbose=FALSE))
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 7197
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 16224 14178 17956 1000
Location effects: sample_size ~ task
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 126.6 110.9 143.4 1000 <0.001 ***
tasktask -50.2 -69.8 -28.8 1000 <0.001 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What predicts retest reliability in the literature? Task vs. survey, sample size, and days between sessions.
mod1 = lmer(retest_reliability ~ task + (1|dv), rel_comp)
mod2 = lmer(retest_reliability ~ task + sample_size + (1|dv), rel_comp)
anova(mod1, mod2)
refitting model(s) with ML (instead of REML)
mod3 = lmer(retest_reliability ~ task + sample_size + days+ (1|dv), rel_comp)
anova(mod2, mod3)
refitting model(s) with ML (instead of REML)
#summary(mod2)
summary(MCMCglmm(retest_reliability ~ task + sample_size, random = ~ dv, data = rel_comp, verbose=FALSE))
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: -482.3
G-structure: ~dv
post.mean l-95% CI u-95% CI eff.samp
dv 0.0177 0.0115 0.0246 1000
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.0208 0.0176 0.0238 1000
Location effects: retest_reliability ~ task + sample_size
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.7239222 0.6768852 0.7610282 1000 <0.001 ***
tasktask -0.1399050 -0.1914627 -0.0884644 1000 <0.001 ***
sample_size -0.0001260 -0.0002276 -0.0000173 1000 0.022 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rel_comp %>%
group_by(task) %>%
summarise(mean_rr = mean(retest_reliability))
Are the residuals of this model heteroscedastic? No; the residuals do not depend on the predictor.
tmp = data.frame(rel_comp$sample_size, resid(mod2)) %>%
rename(sample_size = rel_comp.sample_size, resid = resid.mod2.)
tmp %>%
ggplot(aes(sample_size, resid))+
geom_point()+
geom_smooth(method = "lm")
summary(lm(resid ~ sample_size, tmp))
Call:
lm(formula = resid ~ sample_size, data = tmp)
Residuals:
Min 1Q Median 3Q Max
-0.4573 -0.0702 0.0183 0.0783 0.3691
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.54e-16 6.71e-03 0 1
sample_size 3.71e-18 4.16e-05 0 1
Residual standard error: 0.129 on 572 degrees of freedom
Multiple R-squared: 1.34e-29, Adjusted R-squared: -0.00175
F-statistic: 7.65e-27 on 1 and 572 DF, p-value: 1
Tasks have significantly lower reliability and reliability decreases with increasing sample size.
rel_comp %>%
ggplot(aes(sample_size, retest_reliability, color=task))+
geom_smooth(method='lm')+
# geom_point(alpha = 0.2)+
theme(legend.title = element_blank(),
axis.title.y = element_text(size=14),
axis.title.x = element_text(size=14))+
xlab("Sample Size")+
ylab("Retest reliability")+
ylim(-0.15, 1)
Highlighting the difference between survey and task reliability for comparability to our results.
rel_comp %>%
ggplot(aes(task, retest_reliability, fill = task))+
geom_boxplot()+
theme(legend.position = 'none',
axis.text.x = element_text(size=14),
axis.title.y = element_text(size=14))+
xlab("")+
ylab("Mean icc2.1")+
ylim(-0.4,1)
Note: We checked whether our results diverge most from studies with smaller sample sizes. The squared difference between our mean estimate and the reliability reported in the literature decreases roughly exponentially with sample size: the smaller the sample size in the literature, the more the reliability estimate differs from our results. This was a weak result, however, because most studies in the literature have small sample sizes and these studies show both small and large deviations (the effects were not significant either).
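The code for this check is not shown; a minimal sketch of the idea, using the mean_pearson and retest_reliability columns of rel_comp, could look like:
# How far is each literature estimate from our mean bootstrapped estimate,
# and does that squared deviation shrink with the literature sample size?
rel_comp %>%
  mutate(sq_diff = (retest_reliability - mean_pearson)^2) %>%
  ggplot(aes(sample_size, sq_diff))+
  geom_point(alpha = 0.3)+
  geom_smooth()+
  xlab("Sample size in the literature")+
  ylab("Squared deviation from our mean estimate")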
Correlation between our mean estimates from bootstrapped samples and the literature review for task variables
n_df = rel_comp %>%
group_by(dv) %>%
tally()
lit_emp_cor = function(){
boot_comp = data.frame()
for(i in 1:length(unique(rel_comp$dv)) ){
cur_dv = unique(rel_comp$dv)[i]
n = n_df$n[n_df$dv == cur_dv]
sample_df = boot_df %>% filter(dv == cur_dv)
tmp = sample_n(sample_df, n)
boot_comp = rbind(boot_comp, tmp)
}
rm(cur_dv, n, sample_df, tmp)
#check if cbind is ok
# sum(boot_comp$dv == rel_comp$dv)
#cbinding pearson because that is the most common metric in the lit
rel_comp = cbind(rel_comp, boot_comp$pearson)
#rename new column
names(rel_comp)[which(names(rel_comp) == "boot_comp$pearson")] = "pearson"
out = data.frame(task = NA, survey = NA)
out$task = with(rel_comp %>% filter(task == "task"), cor(pearson, retest_reliability))
out$survey = with(rel_comp %>% filter(task == "survey"), cor(pearson, retest_reliability))
rel_comp = rel_comp[,-16]
return(out)
}
lit_emp_cor_out = plyr::rdply(100, lit_emp_cor)
write.csv(lit_emp_cor_out,'/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/lit_emp_cor_out.csv')
X .n task survey
Min. : 1.0 Min. : 1.0 Min. :0.176 Min. :-0.0236
1st Qu.: 25.8 1st Qu.: 25.8 1st Qu.:0.233 1st Qu.: 0.0414
Median : 50.5 Median : 50.5 Median :0.247 Median : 0.0663
Mean : 50.5 Mean : 50.5 Mean :0.247 Mean : 0.0627
3rd Qu.: 75.2 3rd Qu.: 75.2 3rd Qu.:0.262 3rd Qu.: 0.0858
Max. :100.0 Max. :100.0 Max. :0.297 Max. : 0.1644
Model comparisons: building a model to predict the reliabilities in the literature from a sample of our results versus a sample from the literature.
- Sample one row per measure out of the literature review
- Compare the r^2 of retest_reliability ~ sampled_reliability vs. the r^2 of retest_reliability ~ mean_icc2.1
- Compare the coefficient of retest_reliability ~ sampled_reliability vs. the coefficient of retest_reliability ~ mean_icc2.1
comp_lit_pred <- function(df){
sample_from_dv <- function(df){
if(nrow(df)>1){
row_num = sample(1:nrow(df),1)
sample_row = df[row_num,]
df = df[-row_num,]
df$lit_predictor = sample_row$retest_reliability
}
return(df)
}
sampled_df = df %>%
group_by(dv) %>%
do(sample_from_dv(.)) %>%
na.omit()
if(length(unique(sampled_df$task))>1){
mod_lit = lm(retest_reliability ~ lit_predictor+scale(sample_size)+task, data=sampled_df)
mod_boot = lm(retest_reliability ~ mean_pearson+scale(sample_size)+task, data=sampled_df)
}
else{
mod_lit = lm(retest_reliability ~ lit_predictor+scale(sample_size), data=sampled_df)
mod_boot = lm(retest_reliability ~ mean_pearson+scale(sample_size), data=sampled_df)
}
out = data.frame(r2_lit = summary(mod_lit)$adj.r.squared,
r2_boot = summary(mod_boot)$adj.r.squared,
m_lit = coef(summary(mod_lit))["lit_predictor","Estimate"],
m_boot = coef(summary(mod_boot))["mean_pearson","Estimate"])
return(out)
}
comp_lit_pred_out = plyr::rdply(1000, comp_lit_pred(rel_comp))
write.csv(comp_lit_pred_out,'/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/comp_lit_pred_out.csv')
with(comp_lit_pred_out, t.test(r2_lit, r2_boot, paired=T))
Paired t-test
data: r2_lit and r2_boot
t = 13, df = 1000, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.03962 0.05431
sample estimates:
mean of the differences
0.04696
tmp = comp_lit_pred_out %>%
select(r2_lit, r2_boot) %>%
gather(key, value)
summary(MCMCglmm(value ~ key, data = tmp, verbose=F))
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: -4132
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.00741 0.00699 0.00788 664
Location effects: value ~ key
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.1654 0.1595 0.1705 1000 <0.001 ***
keyr2_lit 0.0469 0.0389 0.0540 1000 <0.001 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We checked whether survey and task measures differed in how close our results were to the literature (i.e., is the literature better estimated using our data for surveys or for tasks?). We found an interaction: the literature does a better job of predicting itself for tasks than for surveys, whereas our data are better at predicting the literature estimates for surveys than for tasks.
One might wonder why the reliability estimates from surveys are worse than those from tasks at predicting the literature. It is because they are less variable: in this procedure, having high variance in what is being predicted (i.e., the literature reliability estimates) is better.
We also conducted LOOCV on these data, checking what it looks like if we estimate the left-out value using the remaining ones for the noise ceiling model. The procedure followed these steps:
- Sample one row of the literature,
- fit a model of sample size + task on the remaining data,
- use that model to predict the reliability estimate,
- compare that prediction to the left-out reliability estimate from the literature versus the average reliability estimate from our data.
We found that the squared difference is slightly larger when predicting the left-out value from our data than when predicting it from the literature. For surveys the squared differences were larger when predicting the literature than when predicting our data, while for tasks the squared differences were larger when predicting our data than when predicting the literature. (This fits with the other results: our results differ from the literature more for the survey measures.)
We do not present these results in further detail as they only provide converging evidence with the analyses above; a minimal sketch of the procedure is given below for reference.
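A sketch of one way to implement the first comparison described above (not the exact code used; purrr is part of the tidyverse used throughout):
# For each held-out literature estimate: predict it from a sample_size + task
# model fit to the remaining literature rows, and compare that squared error to
# the squared error of using our own mean estimate (mean_pearson) directly.
loo_errors = purrr::map_dfr(1:nrow(rel_comp), function(i){
  tmp_fit = lm(retest_reliability ~ sample_size + task, data = rel_comp[-i,])
  data.frame(task = rel_comp$task[i],
             sq_err_lit = (rel_comp$retest_reliability[i] - predict(tmp_fit, rel_comp[i,]))^2,
             sq_err_emp = (rel_comp$retest_reliability[i] - rel_comp$mean_pearson[i])^2)
})
loo_errors %>%
  group_by(task) %>%
  summarise(mean_sq_err_lit = mean(sq_err_lit),
            mean_sq_err_emp = mean(sq_err_emp))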
Literature vs. our data on a measure basis
rel_comp %>%
group_by(dv, task) %>%
summarise(mean_lit = mean(retest_reliability),
mean_emp = unique(mean_pearson)) %>%
ggplot(aes(mean_emp, mean_lit, col=task, shape=task))+
geom_smooth(method="lm")+
geom_point()+
xlim(-0.25, 1)+
ylim(-0.25, 1)+
ylab("Average Literature Reliability Estimate")+
xlab("Average Empirical Reliability Estimate")+
theme(legend.title = element_blank(),
legend.position = 'bottom',
plot.caption = element_text(family = "Times", lineheight = 2, hjust = 0))+
labs(caption = str_wrap("FIGURE 2: Correlation between mean reliability estimates for each measure found in the literature with the mean reliability from our data.", width = 120))
Though we primarily report icc2.1 as our metric of reliability, the results do not change depending on the metric chosen. Here we plot point estimates of three different reliability metrics against each other (icc2.1, Pearson, Spearman). The bootstrapped version is essentially the same, but the plots are busier due to the larger number of data points.
rel_df = make_rel_df(t1_df = test_data, t2_df = retest_data, metrics = c('spearman', 'icc2.1','icc3.k', 'pearson', 'var_breakdown', 'partial_eta', 'sem'))
rel_df$task = 'task'
rel_df[grep('survey', rel_df$dv), 'task'] = 'survey'
rel_df[grep('holt', rel_df$dv), 'task'] = "task"
rel_df = rel_df %>%
select(dv, task, spearman, icc2.1,icc3.k, pearson, partial_eta, sem, var_subs, var_ind, var_resid)
We calculated 74 measures for surveys and 372 measures for cognitive tasks.
As the scatter plots depict, the correlations between the different reliability metrics were very high.
cor(rel_df[,c('spearman', 'icc2.1', 'icc3.k', 'pearson')])
spearman icc2.1 icc3.k pearson
spearman 1.0000 0.9529 0.9324 0.9577
icc2.1 0.9529 1.0000 0.9788 0.9983
icc3.k 0.9324 0.9788 1.0000 0.9800
pearson 0.9577 0.9983 0.9800 1.0000
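The scatter plots referred to above are generated in a chunk not shown here; a quick equivalent using base graphics would be:
# Pairwise scatter plots of the point-estimate reliability metrics
pairs(rel_df[,c('spearman', 'icc2.1', 'icc3.k', 'pearson')])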
Note: Some variables have negative icc2.1 values. This occurs when \(MS_{error} > MS_{between}\); for these variables the data show no relationship between the two time points.
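For reference, in the standard two-way random-effects ANOVA formulation (with \(n\) subjects and \(k = 2\) sessions) the estimate is

\[
ICC(2,1) = \frac{MS_{between} - MS_{error}}{MS_{between} + (k-1)\,MS_{error} + \frac{k}{n}\left(MS_{sessions} - MS_{error}\right)}
\]

so the numerator, and therefore the estimate, is negative whenever \(MS_{error} > MS_{between}\).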
Summarized bootstrapped reliabilities
boot_df %>%
group_by(dv) %>%
summarise(icc2.1_median = quantile(icc2.1, probs = 0.5),
icc2.1_2.5 = quantile(icc2.1, probs = 0.025),
icc2.1_97.5 = quantile(icc2.1, probs = 0.975),
spearman_median = quantile(spearman, probs = 0.5),
spearman_2.5 = quantile(spearman, probs = 0.025),
spearman_97.5 = quantile(spearman, probs = 0.975)) %>%
datatable() %>%
formatRound(columns=c('icc2.1_median', 'icc2.1_2.5', 'icc2.1_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
Variable level summary of bootstrapped reliabilities.
Example of the overlaying procedure.
Comparison of survey measures to cognitive task measures in the bootstrapped results. Multilevel model with random intercepts for each measure and fixed effect of survey versus cognitive measure.
boot_df = boot_df %>%
mutate(task = ifelse(grepl("survey",dv), "survey","task"),
task = ifelse(grepl("holt",dv), "task", task))
boot_df %>%
group_by(task) %>%
summarise(icc2.1_median = quantile(icc2.1, probs = 0.5),
icc2.1_2.5 = quantile(icc2.1, probs = 0.025),
icc2.1_97.5 = quantile(icc2.1, probs = 0.975),
spearman_median = quantile(spearman, probs = 0.5),
spearman_2.5 = quantile(spearman, probs = 0.025),
spearman_97.5 = quantile(spearman, probs = 0.975),
num_vars = n()/1000) %>%
datatable() %>%
formatRound(columns=c('icc2.1_median', 'icc2.1_2.5', 'icc2.1_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
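The icc2.1_by_task_model object summarized below is fit in a chunk not shown here; given the description above, it is presumably of the form sketched in the commented line (an assumption, not the saved model object):
# Fixed effect of survey vs. task, random intercept for each measure (dv),
# fit on the bootstrapped reliability estimates:
# icc2.1_by_task_model <- lmerTest::lmer(icc2.1 ~ task + (1 | dv), data = boot_df)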
summary(icc2.1_by_task_model)
summary(MCMCglmm(icc2.1 ~ task, data=rel_df, verbose=F))
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: -193
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.0379 0.0327 0.0426 1000
Location effects: icc2.1 ~ task
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.781 0.738 0.825 1000 <0.001 ***
tasktask -0.432 -0.482 -0.384 869 <0.001 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The quantitative explanation for the difference in reliability estimates between surveys and tasks, as recently detailed by Hedge et al. (2017), lies in the difference in sources of variance between these measures. Specifically, icc2.1 is calculated as the ratio of between-subjects variance to the sum of all sources of variance. Thus, measures with high between-subjects variance have high test-retest reliability. Intuitively, measures with high between-subjects variance are also better suited for individual difference analyses, as they capture the differences between the subjects in a sample.
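In terms of the variance components used in this report (var_subs, var_ind and var_resid, i.e. between-subject, between-session and residual variance), this ratio is

\[
icc2.1 = \frac{\sigma^2_{subjects}}{\sigma^2_{subjects} + \sigma^2_{sessions} + \sigma^2_{residual}}
\]

which is why the percentage of variance attributable to each component, computed below, is directly informative about reliability.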
Here we first plot the percentage of variance explained by the three sources of variance for the point estimates of measure reliabilities. The plot only includes raw measures (no DDM parameters) and the measures are ranked by percentage of between subject variability for each task/survey (i.e. the best to worst individual difference measure for each task/survey). Then we compare statistically whether the percentage of variance explained by these sources differ between tasks and surveys.
Comparing types of variance for survey vs task measures: Survey measures have higher between subject variability
Note: This analysis includes DDM variables too.
We run separate models for the different sources of variance because an interactive model with a variance type by task interaction seemed overly complicated.
First we find that task measures have a smaller percentage of their overall variance explained by variability between subjects compared to survey measures.
tmp = boot_df %>%
select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct)
# summary(var_subs_pct_by_task_model)
rel_df = rel_df %>%
mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100,
var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100)
summary(MCMCglmm(var_subs_pct ~ task, data=rel_df, verbose=F))
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 3808
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 298 260 336 1000
Location effects: var_subs_pct ~ task
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 79.5 75.7 83.2 1124 <0.001 ***
tasktask -30.2 -34.9 -26.1 1000 <0.001 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We also find that task measures have a significantly larger percentage of their variance explained by between-session variability. Larger between-session variability suggests systematic differences between the two sessions. Such systematic effects can be due to, for example, learning effects, as explored later.
# summary(var_ind_pct_by_task_model)
summary(MCMCglmm(var_ind_pct ~ task, data=rel_df, verbose=F))
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 4043
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 505 438 570 1000
Location effects: var_ind_pct ~ task
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 10.80 5.74 15.68 1000 <0.001 ***
tasktask 16.18 10.68 21.84 1000 <0.001 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# summary(var_resid_pct_by_task_model)
summary(MCMCglmm(var_resid_pct ~ task, data=rel_df, verbose=F))
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 3369
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 111 96.9 126 1000
Location effects: var_resid_pct ~ task
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 9.76 7.23 11.91 1094 <0.001 ***
tasktask 13.91 11.46 16.57 1000 <0.001 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rm(var_subs_pct_by_task_model, var_ind_pct_by_task_model, var_resid_pct_by_task_model)
tmp_save = tmp%>%
gather(key, value, -dv, -task) %>%
group_by(task, key) %>%
summarise(median = median(value),
sd = sd(value)) %>%
mutate(key = ifelse(key == 'var_ind_pct', 'Between session variance', ifelse(key == 'var_subs_pct', 'Between subjects variance', ifelse(key == 'var_resid_pct', 'Residual variance',NA)))) %>%
rename(Median = median, SD = sd) %>%
arrange(task, key)
tmp_save
Summarizing for clearer presentation. This graph currently uses the bootstrapped reliabilities and is therefore messier than if it used only the point estimates.
Here we summarize the results at the task level to make them more digestible and easier to connect with the literature.
We reduce the list of task measures to one per task by averaging only the measures that are extracted and used from these tasks in the literature. We call these the 'meaningful measures.'
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'task',
dv %in% meaningful_vars) %>%
left_join(boot_df[,c("dv", "icc2.1")], by = 'dv') %>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2) %>%
group_by(task_name) %>%
summarise(median_icc2.1 = median(icc2.1),
icc2.1_2.5 = quantile(icc2.1, probs = 0.025),
icc2.1_97.5 = quantile(icc2.1, probs = 0.975),
num_measures = n()/1000,
mean_num_trials = round(mean(num_all_trials)))%>%
arrange(-median_icc2.1)
tmp %>%
datatable() %>%
formatRound(columns=c('median_icc2.1','icc2.1_2.5','icc2.1_97.5'), digits=3)
Does the number of trials in a task have a significant effect on the icc2.1 of the meaningful measures (using all trials from a task)? No.
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'task',
dv %in% meaningful_vars) %>%
# left_join(boot_df[,c("dv", "icc2.1")], by = 'dv') %>%
left_join(rel_df[,c("dv", "icc2.1")], by = 'dv') %>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2)
#summary(icc2.1_by_num_trials_model)
summary(MCMCglmm(icc2.1 ~ num_all_trials, data=tmp, verbose=F))
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: -11.77
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.0495 0.0357 0.0662 1000
Location effects: icc2.1 ~ num_all_trials
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.305240 0.252702 0.368419 1000 <0.001 ***
num_all_trials 0.000188 -0.000196 0.000573 1000 0.34
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tmp %>%
ggplot(aes(num_all_trials, icc2.1))+
geom_point()+
geom_smooth(method="lm")+
theme_bw()+
xlab("Number of trials")+
ylab("icc2.1")
The above analysis looked at the effect of the number of trials across tasks. But some tasks might be poor individual difference measures regardless of how many trials they contain, whereas for others fewer trials might already yield a sufficiently reliable measure.
For tasks whose dependent variables are estimated using many trials one can ask: does the same measure become less reliable if fewer trials are used to estimate it?
This question does not make sense for all tasks. For example, to estimate a risk aversion parameter you need all trials of Holt and Laury. For Kirby and Bickel there are specific conditions that look at fewer trials. The Cognitive Reflection Task might be more appropriate to analyze item by item. The writing task does not have trial numbers. For all other tasks it might be interesting to investigate.
These kinds of analyses are too task-specific and in-depth for a paper that aims to give a global sense of how self-regulation measures differ in their suitability for individual difference analyses based on their stability across time. Such analyses would provide a detailed examination of how to extract the most reliable individual difference measure from tasks whose variables are mediocre to begin with. Though we do not provide a comprehensive analysis of this sort in this paper, we provide a single example of this approach and hope that the open access we provide to the data spurs further work.
For this example we look at the retest reliability of dependent measures from the threebytwo task, which has >400 trials. Here is a graph of how the point estimates of retest reliability change for each dependent measure when different numbers of trials are used to estimate them.
Takeaways from this graph:
- Reliability estimates stabilize after about an eighth of the trials (note that this might not be true for other model parameter estimates; here we only look at raw response time and accuracy measures)
- Only three measures that are not contrast measures have reliabilities >0
- All the other measures are essentially completely unreliable (including the two 'meaningful variables' in the literature: the cue switch cost RTs)
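The chunk that builds trial_num_rel_df is not shown here. A conceptual sketch of the subsampling idea, assuming trial-level data frames t1_trials and t2_trials with worker, trial_num and rt columns (illustrative names, not actual objects in this workspace), would be:
# Recompute the ICC(2,1) of each subject's mean RT using only the first n trials
# of each session (psych::ICC reports this as ICC2, the single random-raters ICC).
icc_first_n_trials = function(n, t1_trials, t2_trials){
  t1_means = t1_trials %>% filter(trial_num <= n) %>%
    group_by(worker) %>% summarise(score_t1 = mean(rt))
  t2_means = t2_trials %>% filter(trial_num <= n) %>%
    group_by(worker) %>% summarise(score_t2 = mean(rt))
  wide = inner_join(t1_means, t2_means, by = "worker")
  res = psych::ICC(wide[, c("score_t1", "score_t2")])$results
  res$ICC[res$type == "ICC2"]
}
# e.g. sapply(seq(10, 400, by = 10), icc_first_n_trials, t1_trials, t2_trials)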
trial_num_rel_df %>%
gather(key, value, -breaks) %>%
filter(key %in% c("avg_rt_icc", "learning_to_learn_icc", "nonperseverative_errors_icc")) %>%
ggplot(aes((breaks+1)*10, value, shape=key))+
geom_point()+
geom_line(alpha = 0.5)+
theme_bw()+
xlab("Number of trials")+
ylab("ICC")+
theme(legend.title = element_blank(),
legend.position = 'bottom',
plot.caption = element_text(family = "Times", lineheight = 2, hjust = 0)) +
scale_shape_manual(values = c(0:9),
breaks = c("avg_rt_icc","learning_to_learn_icc", "nonperseverative_errors_icc"),
labels = c("Average Response Time", "Learning to Learn", "Nonperseverative Errors")) +
labs(caption = str_wrap("FIGURE 6: Change in point estimates of ICC for three dependent measures selected from the Shift task using increasing numbers of trials to estimate them. These measures show the different types of relationships reliability estimates have based on the trial numbers used to estimate them.", width = 180))
Checking DDM results in the bootstrapped estimates. Variables using all trials are significantly more reliable compared to difference scores. Raw measures don’t differ from DDM parameters. Which DDM is better depends on whether all trials are used.
tmp = measure_labels %>%
mutate(dv = as.character(dv),
contrast = ifelse(overall_difference == "difference", "contrast", "non-contrast")) %>%
filter(ddm_task == 1,
rt_acc != 'other') %>%
drop_na() %>%
# left_join(boot_df[,c("dv", "icc2.1")], by = 'dv')
left_join(rel_df[,c("dv", "icc2.1")], by = 'dv')
tmp_save = tmp %>%
drop_na() %>% #try removing this in final release
group_by(contrast, raw_fit, rt_acc) %>%
summarise(icc2.1_median = quantile(icc2.1, probs = 0.5),
icc2.1_2.5 = quantile(icc2.1, probs = 0.025),
icc2.1_97.5 = quantile(icc2.1, probs = 0.975),
num_vars = n()/1000)
tmp_save %>%
datatable() %>%
formatRound(columns=c('icc2.1_median', 'icc2.1_2.5', 'icc2.1_97.5'), digits=3)
Comparing contrast vs. non-contrast measures: non-contrast (overall) measures have higher reliability than contrast (difference) scores.
# summary(lmerTest::lmer(icc2.1 ~ contrast + (1|dv) ,tmp))
# summary(MCMCglmm(icc2.1 ~ contrast, random = ~ dv, data=tmp, verbose=FALSE))
summary(MCMCglmm(icc2.1 ~ contrast, data=tmp, verbose=FALSE))
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: -261.1
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.0219 0.0184 0.0257 1000
Location effects: icc2.1 ~ contrast
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.154 0.127 0.184 1000 <0.001 ***
contrastnon-contrast 0.288 0.249 0.326 1000 <0.001 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing raw vs ddm in overall estimates: EZ is significantly better than HDDM and comparable to raw estimates.
# summary(icc2.1_by_rawfit_noncon_model)
summary(MCMCglmm(icc2.1 ~ raw_fit, data=tmp %>% filter(contrast == "non-contrast"), verbose=FALSE))
Warning in MCMCglmm(icc2.1 ~ raw_fit, data = tmp %>% filter(contrast
== : some fixed effects are not estimable and have been removed. Use
singular.ok=TRUE to sample these effects, but use an informative prior!
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: -150.2
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.0223 0.0173 0.0274 1000
Location effects: icc2.1 ~ raw_fit
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.4810 0.4354 0.5257 1000 <0.001 ***
raw_fithddm -0.0971 -0.1577 -0.0268 1000 0.002 **
raw_fitraw -0.0281 -0.0895 0.0276 1000 0.358
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing raw vs ddm in difference scores: EZ is significantly worse than HDDM and comparable to raw estimates.
# summary(icc2.1_by_rawfit_con_model)
summary(MCMCglmm(icc2.1 ~ raw_fit, data=tmp %>% filter(contrast == "contrast"), verbose=FALSE))
Warning in MCMCglmm(icc2.1 ~ raw_fit, data = tmp %>% filter(contrast
== : some fixed effects are not estimable and have been removed. Use
singular.ok=TRUE to sample these effects, but use an informative prior!
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: -119.2
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.0189 0.0135 0.0239 1000
Location effects: icc2.1 ~ raw_fit
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.118627 0.081777 0.157888 1000 <0.001 ***
raw_fithddm 0.095471 0.018056 0.164123 1000 0.014 *
raw_fitraw 0.051217 -0.000661 0.106071 1000 0.068 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tmp %>%
group_by(contrast) %>%
summarise(mean_icc2.1 = mean(icc2.1),
sd_icc2.1 = sd(icc2.1))
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'survey') %>%
# left_join(boot_df[,c("dv", "icc2.1")], by = 'dv') %>%
left_join(rel_df[,c("dv", "icc2.1")], by = 'dv') %>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2)
tmp_save = tmp %>%
group_by(task_name) %>%
summarise(median_icc2.1 = median(icc2.1),
icc2.1_2.5 = quantile(icc2.1, probs = 0.025),
icc2.1_97.5 = quantile(icc2.1, probs = 0.975),
num_measures = n()/1000,
mean_num_trials = round(mean(num_all_trials)))%>%
arrange(-median_icc2.1)
tmp_save %>%
datatable() %>%
formatRound(columns=c('median_icc2.1', 'icc2.1_2.5','icc2.1_97.5'), digits=3)
Does number of items in a survey have a significant effect on the average icc2.1 of survey measures? No.
# summary(lmerTest::lmer(icc2.1 ~ num_all_trials + (1|dv), data = tmp))
summary(MCMCglmm(icc2.1 ~ num_all_trials, data=tmp, verbose=FALSE))
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: -144.7
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.00767 0.00528 0.0103 1038
Location effects: icc2.1 ~ num_all_trials
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.76555 0.73139 0.80103 725 <0.001 ***
num_all_trials 0.00179 -0.00126 0.00437 1000 0.21
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tmp %>%
ggplot(aes(num_all_trials, icc2.1))+
geom_point()+
geom_smooth(method="lm")+
theme_bw()+
xlab("Number of trials")+
ylab("icc2.1")