Replicability Crisis in Science?
18-22 September 2023
Credibility of scientific claims is established with evidence for their replicability using new data (Nosek & Errington, 2020)
Replication is repeating a study’s procedure and observing whether the prior finding recurs (Jeffreys, 1973)
Replication is a study for which any outcome would be considered diagnostic evidence about a claim from prior research (Nosek & Errington, 2020).
Replication is often intended as conditioned to the original result. The original result could be a false positive or a biased result. Also the replication attempt could be a false positive or a false negative (Nosek & Errington, 2020).
To be a replication, two things must be true. Outcomes consistent with a prior claim would increase confidence in the claim, and outcomes inconsistent with a prior claim would decrease confidence in the claim (Nosek & Errington, 2020).
This is somehow similar with a Bayesian reasoning where evidence about a phenomenon is updated after collecting more data.
Exact replications are commonly considered as the gold-standard but in practice (especially in Social Sciences, Psychology, etc.) are rare.
Let’s imagine, an original study \(y_{or}\) finding a result.
A direct replication is defined as the repetition of an experimental procedure.
A conceptual replication is defined as testing the same hypothesis with different methods.
Let’s imagine an extreme example: testing the physiological reaction to arousing situation:
Exact replication is often not feasible. Even using the same experimental setup (direct replication) does not assure that we are studying the same phenomenon.
Even when an experiment use almost the same setup of the original study there is a source of unknown uncertainty. Which is the impact of a slightly change in the experimental setup on the actual result?
How to evaluate the actual impact?
For the purpose of notation and simplicity we can define a meta-analytical-based replication model (Hedges & Schauer, 2019c; Schauer, 2022; Schauer & Hedges, 2021)
\[ y_i = \mu_{\theta} + \delta_i + \epsilon_i \]
\[ \delta_i \sim \mathcal{N}(0, \tau^2) \]
\[ \epsilon_i \sim \mathcal{N}(0, \sigma^2_i) \]
For the examples we are going to simulate (unstandardized) effect sizes (see Slide “Extra - Simulating Meta-Analysis” of the meta-analysis workshop). Basically:
\[ \Delta = \overline{X_1} - \overline{X_2} \]
\[ SE_{\Delta} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}} \]
With \(X_{1_j} \sim \mathcal{N}(0, 1)\) and \(X_{2_j} \sim \mathcal{N}(\Delta, 1)\)
continue…
Thus our observed effect sizes \(y_i\) is sampled from: \[ y_i \sim \mathcal{N}(\mu_\theta, \tau^2 + \frac{1}{n_1} + \frac{1}{n_2}) \]
Where \(\frac{1}{n_1} + \frac{1}{n_2}\) is the sampling variability (\(\sigma^2_i\)).
The sampling variances are sampled from:
\[ \sigma_i^2 \sim \frac{\chi^2_{n_1 + n_2 - 2}}{n_1 + n_2 - 2} (\frac{1}{n_1} + \frac{1}{n_2}) \]
Everything is implemented into the sim_studies() function:
yi vi sei
1 -0.448917264 0.05470588 0.2338929
2 0.320404862 0.04946823 0.2224146
3 -0.005809151 0.06306609 0.2511296
4 -0.119407165 0.06343033 0.2518538
5 0.020454832 0.08837627 0.2972815
6 0.602803505 0.10279809 0.3206214
7 0.990654574 0.05977646 0.2444923
8 0.330797921 0.06472821 0.2544174
9 -0.398616671 0.05947599 0.2438770
10 0.241340611 0.06802436 0.2608148
This distinction (see Brandt et al., 2014 for a different terminology) refers to parameters \(\theta_i\). With exact are considering a case where:
\[ \theta_1 = \theta_2 = \theta_3, \dots, \theta_k \]
Thus the true parameters of \(k\) replication studies are the same. Thus the variability among true effects \(\tau^2 = 0\).
Similarly, due to (often not controllable) differences among experiments (i.e., lab, location, sample, etc.) we could expect a certain degree of variability \(\tau^2\). In other terms \(\tau^2 < \tau^2_0\) where \(\tau^2_0\) is the maximum variability (that need to be defined). In this way studies are replicating:
\[ \theta_i \sim \mathcal{N}(\mu_\theta, \tau^2_0) \]
Coarsely, we can define a replication success when two or more studies obtain the “same” result. The definion of sameness it is crucial:
The different methods that we are going to see are focused on a specific type of aggreement. For example, we could consider \(\theta_1 = 3x\) and \(\theta_2 = x\) to have the same sign but the replication study is on a completely different scale. Is this considered a successful replication?
This refers to how the replication setup is formulated. With \(k = 2\) studies where \(k_1\) is the original study and \(k_2\) is the replication we have a one-to-one setup. In this setup we compare the replication with the original and according to the chosen method and expectation we conclude if \(k_1\) has been replicated or not.
When \(k > 2\) we could collapse the replication studies into a single value (e.g., using a meta-analysis method) and compare the results using a one-to-one or we can use a method for one-to-many designs.
Regardless the method, falsification approaches compared the original with the replicate(s) obtaining a yes-no answer or a continuous result. On the other side consistency methods are focused on evaluating the degree of similarity (i.e., consistency) among all studies.
The simplest method is called vote counting (Hedges & Olkin, 1980; Valentine et al., 2011). A replication attempt \(\theta_{rep}\) is considered successful if the result has the same direction of the original study \(\theta_{orig}\) and it is statistically significant i.e., \(p_{\theta_{rep}} \leq \alpha\). Similarly we can count the number of replication with the same sign as the original study.
Let’s simulate an exact replication:
## original study
n_orig <- 30
theta_orig <- theta_from_z(2, n_orig)
orig <- data.frame(
yi = theta_orig,
vi = 4/(n_orig*2)
)
orig$sei <- sqrt(orig$vi)
orig <- summary_es(orig)
orig yi vi sei zi pval ci.lb ci.ub
1 0.5163978 0.06666667 0.2581989 2 0.04550026 0.01033725 1.022458
## replications
k <- 10
reps <- sim_studies(k = k, theta = theta_orig, tau2 = 0, n_orig, n_orig, summary = TRUE)
head(reps) yi vi sei zi pval ci.lb ci.ub
1 0.4106045 0.04883118 0.2209778 1.8581255 0.0631511926 -0.02250404 0.8437130
2 0.5925199 0.07018358 0.2649218 2.2365837 0.0253135619 0.07328260 1.1117571
3 0.4861510 0.07065960 0.2658187 1.8288815 0.0674173580 -0.03484417 1.0071461
4 0.2306294 0.08292434 0.2879659 0.8008913 0.4231945639 -0.33377336 0.7950321
5 0.6164282 0.06172083 0.2484368 2.4812278 0.0130930662 0.12950111 1.1033554
6 0.7851106 0.04562832 0.2136079 3.6754762 0.0002374062 0.36644691 1.2037744
Let’s compute the proportions of replication studies are statistically significant:
Let’s compute the proportions of replication studies with the same sign as the original:
We could also perform some statistical tests. See Bushman & Wang (2009) and Hedges & Olkin (1980) for vote-counting methods in meta-analysis.
Let’s imagine an original experiment with \(n_{orig} = 30\) and \(\hat \theta_{orig} = 0.5\) that is statistically significant \(p \approx 0.045\). Now a direct replication (thus assuming \(\tau^2 = 0\)) study with \(n_{rep} = 350\) found \(\hat \theta_{rep_1} = 0.15\), that is statistically significant \(p\approx 0.047\).
Another approach check if the replication attempt \(\theta_{rep}\) is contained in the % confidence interval of the original study \(\theta_{orig}\). Formally:
\[ \theta_{orig} - \Phi(\alpha/2) \sqrt{\sigma^2_{orig}} < \theta_{rep} < \theta_{orig} + \Phi(\alpha/2) \sqrt{\sigma^2_{orig}} \]
Where \(\Phi\) is the cumulative standard normal distribution, \(\alpha\) is the type-1 error rate.
One potential problem of this method regards that low precise original studies are “easier” to replicate due to larger confidence intervals.
The same approach can be applied checking if the original effect size is contained within the replication confidence interval. Clearly these methods depends on the precision of studies. Formally:
\[ \theta_{rep} - \Phi(\alpha/2) \sqrt{\sigma_{rep}^2} < \theta_{orig} < \theta_{rep} + \Phi(\alpha/2) \sqrt{\sigma_{rep}^2} \]
The method has the same pros and cons of the previous approach. One advantage is that usually replication studies are more precise (higher sample size) thus the parameter and the % CI is more reliable.
One problem of the previous approaches is taking into account only the uncertainty of the original or the replication study. Patil et al. (2016) and Spence & Stanley (2016) proposed a method to take into account both sources of uncertainty.
If the original and replication studies comes from the same population, the sampling distribution of the difference is centered on 0 with a certain standard error \(\theta_{orig} - \theta_{rep_0} \sim \mathcal{N}\left( 0, \sqrt{\sigma^2_{\hat \theta_{orig} - \hat \theta_{rep}}} \right)\) (subscript \(0\) to indicate that is expected to be sampled from the same population as \(\theta_{orig}\))
\[ \hat \theta_{orig} \pm z_{95\%} \sqrt{\sigma^2_{\theta_{orig} - \theta_{rep}}} \]
If factors other than standard error influence the replication result, \(\theta_{rep_0}\) is not expected to be contained within the 95% prediction interval.
In the case of a unstandardized mean difference we can compute the prediction interval as:
\[ \sqrt{\sigma^2_{\epsilon_{\hat \theta_{orig} - \hat \theta_{rep_0}}}} = \sqrt{\left( \frac{\hat \sigma^2_{o1}}{n_{o1}} +\frac{\hat \sigma^2_{o2}}{n_{o2}}\right) + \left(\frac{\hat \sigma^2_{o1}}{n_{r1}} + \frac{\hat\sigma^2_{o2}}{n_{r2}}\right)} \]
The first term is just the standard error of the difference between the two groups in the original study and the second term is the standard error of the hypothetical replication study assuming the same standard deviation of the original but a different \(n\).
In this way we estimate an interval where, combining sampling variance from both studies and assuming that they comes from the same population, the replication should fall.
set.seed(2025)
o1 <- rnorm(50, 0.5, 1) # group 1
o2 <- rnorm(50, 0, 1) # group 2
od <- mean(o1) - mean(o2) # effect size
se_o <- sqrt(var(o1)/50 + var(o2)/50) # standard error of the difference
n_r <- 100 # sample size replication
se_o_r <- sqrt(se_o^2 + (var(o1)/100 + var(o2)/100))
od + qnorm(c(0.025, 0.975)) * se_o_r[1] 0.325520 1.298378
Mathur & VanderWeele (2020) proposed a new method based on the prediction interval to calculate a p value \(p_{orig}\) representing the probability that \(\theta_{orig}\) is consistent with the replications. This method is suited for many-to-one replication designs. Formally:
\[ P_{orig} = 2 \left[ 1 - \Phi \left( \frac{|\hat \theta_{orig} - \hat \mu_{\theta_{rep}}|}{\sqrt{\hat \tau^2 + \sigma^2_{orig} + \hat{SE}^2_{\hat \mu_{\theta_{rep}}}}} \right) \right] \]
It is interpreted as the probability that \(\theta_{orig}\) is equal or more extreme that what observed. A very low \(p_{orig}\) suggest that the original study is inconsistent with replications.
The code is implemented in the Replicate and MetaUtility R packages:
tau2 <- 0.05
theta_rep <- 0.2
theta_orig <- 0.7
n_orig <- 30
n_rep <- 100
k <- 20
replications <- sim_studies(k, theta_rep, tau2, n_rep, n_rep)
original <- sim_studies(1, theta_orig, 0, n_orig, n_orig)
fit_rep <- metafor::rma(yi, vi, data = replications) # random-effects meta-analysis
Replicate::p_orig(original$yi, original$vi, fit_rep$b[[1]], fit_rep$tau2, fit_rep$se^2)[1] 0.5563241
# standard errors assuming same n and variance 1
se_orig <- sqrt(4/(n_orig * 2))
se_rep <- sqrt(4/(n_rep * 2))
se_theta_rep <- sqrt(1/((1/(se_rep^2 + tau2)) * k)) # standard error of the random-effects estimate
sep <- sqrt(tau2 + se_orig^2 + se_theta_rep^2) # z of p-orig denominator
curve(dnorm(x, theta_rep, sep), theta_rep - 4*sep, theta_rep + 4*sep, ylab = "Density", xlab = latex2exp::TeX("\\theta"))
points(theta_orig, 0.02, pch = 19, cex = 2)
abline(v = qnorm(c(0.025, 0.975), theta_rep, sep), lty = "dashed", col = "firebrick")Another related metric is the \(\hat P_{> 0}\), representing the proportion of replications following the same direction as the original effect. Before simply computing the proportions we need to adjust the estimated \(\theta_{rep_i}\) with a shrinkage factor:
\[ \tilde{\theta}_{rep_i} = (\theta_{rep_i} - \mu_{\theta_{rep_i}}) \sqrt{\frac{\hat \tau^2}{\hat \tau^2 + v_{rep_i}}} \]
This method is somehow similar to the vote counting but we are adjusting the effects taking into account \(\tau^2\).
The authors suggest a bootstrapping approach for making inference on \(\hat P_{> 0}\)
nboot <- 1e4
theta_boot <- matrix(0, nrow = nboot, ncol = k)
for(i in 1:nboot){
idx <- sample(1:nrow(replications), nrow(replications), replace = TRUE)
replications_boot <- replications[idx, ]
theta_cal <- MetaUtility::calib_ests(replications_boot$yi,
replications_boot$sei,
method = "REML")
theta_boot[i, ] <- theta_cal
}
# calculate
p_greater_boot <- apply(theta_boot, 1, function(x) mean(x > 0))Instead of using 0 as threshold, we can use meaningful effect size to be considered as low but different from 0. \(\hat P_{\gtrless q*}\) is the proportion of (calibrated) replications greater or lower than the \(q*\) value. This framework is similar to equivalence and minimum effect size testing (Lakens et al., 2018).
q <- 0.2 # minimum non zero effect
fit <- metafor::rma(yi, vi, data = replications)
# see ?MetaUtility::prop_stronger
MetaUtility::prop_stronger(q = q,
M = fit$b[[1]],
t2 = fit$tau2,
tail = "above",
estimate.method = "calibrated",
ci.method = "calibrated",
dat = replications,
yi.name = "yi",
vi.name = "vi") est se lo hi bt.mn shapiro.pval
1 0.35 0.1137901 0.1 0.55 0.3573 0.5224829
Another approach is to combine the original and replication results (both one-to-one and many-to-one) using a meta-analysis model. Then we can test if the pooled estimate is different from 0 or another meaningful value.
Fixed-Effects Model (k = 20)
logLik deviance AIC BIC AICc
20.7415 -0.0000 -39.4829 -38.4872 -39.2607
I^2 (total heterogeneity / total variability): 0.00%
H^2 (total variability / sampling variability): 0.00
Test for Heterogeneity:
Q(df = 19) = 0.0000, p-val = 1.0000
Model Results:
estimate se zval pval ci.lb ci.ub
0.2000 0.0316 6.3246 <.0001 0.1380 0.2620 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Random-Effects Model (k = 20; tau^2 estimator: REML)
logLik deviance AIC BIC AICc
19.7044 -39.4088 -35.4088 -33.5199 -34.6588
tau^2 (estimated amount of total heterogeneity): 0 (SE = 0.0065)
tau (square root of estimated tau^2 value): 0
I^2 (total heterogeneity / total variability): 0.00%
H^2 (total variability / sampling variability): 1.00
Test for Heterogeneity:
Q(df = 19) = 0.0000, p-val = 1.0000
Model Results:
estimate se zval pval ci.lb ci.ub
0.2000 0.0316 6.3246 <.0001 0.1380 0.2620 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The previous approach can be also implemented combining replications into a single effect and then compare the original with the combined replication study.
This is similar to using the CI or PI approaches but the replication effect will probably by very precise due to pooling multiple studies.
An interesting proposal is using the Q statistics (Hedges & Schauer, 2019a, 2019b, 2019c, 2021; Schauer, 2022; Schauer & Hedges, 2020; Schauer & Hedges, 2021), commonly used in meta-analysis to assess the presence of heterogeneity. Formally:
\[ Q = \sum_{i = 1}^{k} \frac{(\theta_i - \bar \theta_w)^2}{\sigma^2_i} \]
Where \(\bar \theta_w\) is the inverse-variance weighted average (i.g., fixed-effect model). The Q statistics is essentially a weighted sum of squares. Under the null hypothesis where all studies are equal \(\theta_1 = \theta_2, ... = \theta_i\) the Q statistics has a \(\chi^2\) distribution with \(k - 1\) degrees of freedom. Under the alternative hypothesis the distribution is a non-central \(\chi^2\) with non centrality parameter \(\lambda\). The expected value of the \(Q\) is \(E(Q) = v + \lambda\), where \(v\) are the degrees of freedom.
Hedges & Schauer proposed to use the Q statistics to evaluate the consistency of a series of replications:
This approach is testing the consistency (i.e., homogeneity) of replications. A successful replication should minimize the heterogeneity and the presence of a significant Q statistics should bring evidence for not replicating the effect1.
The method has been expanded and formalized in several papers with different objectives:
In the case of evaluating an exact replication we can use the Qrep() function that simply calculate the p-value based on the Q sampling distribution.
Qrep <- function(yi, vi, lambda0 = 0, alpha = 0.05){
fit <- metafor::rma(yi, vi)
k <- fit$k
Q <- fit$QE
df <- k - 1
Qp <- pchisq(Q, df = df, ncp = lambda0, lower.tail = FALSE)
pval <- ifelse(Qp < 0.001, "p < 0.001", sprintf("p = %.3f", Qp))
lambda <- Q - df
res <- list(Q = Q, lambda = lambda, pval = Qp, df = df, k = k, alpha = alpha, lambda0 = lambda0)
H0 <- ifelse(lambda0 != 0, paste("H0: lambda <", lambda0), "H0: lambda = 0")
title <- ifelse(lambda0 != 0, "Q test for Approximate Replication", "Q test for Exact Replication")
cli::cli_rule()
cat(cli::col_blue(cli::style_bold(title)), "\n\n")
cat(sprintf("Q = %.3f (df = %s), lambda = %.3f, %s", res$Q, res$df, lambda, pval), "\n")
cat(H0, "\n")
cli::cli_rule()
class(res) <- "Qrep"
invisible(res)
}Q test for Exact Replication
Q = 367.321 (df = 99), lambda = 268.321, p < 0.001
H0: lambda = 0
In case of approximate replication we need to set \(\lambda_0\) to a meaningful value but the overall test is the same. The critical \(Q\) is no longer evaluated with a central \(\chi^2\) but a non-central \(\chi^2\) with \(\lambda_0\) as non-centrality parameter.
Hedges & Schauer (2019c) provide different strategies to choose \(\lambda_0\). They found that under some assumptions, \(\lambda = (k - 1) \frac{\tau^2}{\tilde{v}}\)
Given that we introduced the \(I^2\) statistics we can derive a \(\lambda_0\) based in \(I^2\). F. L. Schmidt & Hunter (2014) proposed that when \(\tilde{v}\) is at least 75% of total variance \(\tilde{v} + \tau^2\) thus \(\tau^2\) could be considered neglegible. This corresponds to a \(I^2 = 25%\) and a ratio \(\frac{\tau^2}{\tilde{v}} = 1/3\) thus \(\lambda_0 = \frac{(k - 1)}{3}\) can be considered a neglegible heterogeneity
Simonsohn (2015) introduced 3 main questions when evaluating replicability:
meta-analysis
meta-analysis and standard tests, but problematic in terms of statistical power
small telescopes
The idea is simple but quite powerful and insightful. Let’s assume that an original study found an effect of \(y_{orig} = 0.7\) on a two-sample design with \(n = 20\) per group.
If the \(y_{rep}\) is lower (i.e., the upper bound of the confidence interval) than the small effect (\(\theta_{small} = 0.5\)) we conclude that the effect is probably so low that could not have been detected by the original study. Thus there is no evidence for a replication.
We can use the custom small_telescope() function on simulated data:
small_telescope <- function(or_d,
or_se,
rep_d,
rep_se,
small,
ci = 0.95){
# quantile for the ci
qs <- c((1 - ci)/2, 1 - (1 - ci)/2)
# original confidence interval
or_ci <- or_d + qnorm(qs) * or_se
# replication confidence interval
rep_ci <- rep_d + qnorm(qs) * rep_se
# small power
is_replicated <- rep_ci[2] > small
msg_original <- sprintf("Original Study: d = %.3f %s CI = [%.3f, %.3f]",
or_d, ci, or_ci[1], or_ci[2])
msg_replicated <- sprintf("Replication Study: d = %.3f %s CI = [%.3f, %.3f]",
rep_d, ci, rep_ci[1], rep_ci[2])
if(is_replicated){
msg_res <- sprintf("The replicated effect is not smaller than the small effect (%.3f), (probably) replication!", small)
msg_res <- cli::col_green(msg_res)
}else{
msg_res <- sprintf("The replicated effect is smaller than the small effect (%.3f), no replication!", small)
msg_res <- cli::col_red(msg_res)
}
out <- data.frame(id = c("original", "replication"),
d = c(or_d, rep_d),
lower = c(or_ci[1], rep_ci[1]),
upper = c(or_ci[2], rep_ci[2]),
small = small
)
# nice message
cat(
msg_original,
msg_replicated,
cli::rule(),
msg_res,
sep = "\n"
)
invisible(out)
}set.seed(2025)
d <- 0.2 # real effect
# original study
or_n <- 20
or_d <- 0.7
or_se <- sqrt(1/20 + 1/20)
d_small <- pwr::pwr.t.test(or_n, power = 0.33)$d
# replication
rep_n <- 100 # sample size of replication study
g0 <- rnorm(rep_n, 0, 1)
g1 <- rnorm(rep_n, d, 1)
rep_d <- mean(g1) - mean(g0)
rep_se <- sqrt(var(g1)/rep_n + var(g0)/rep_n)Here we are using the pwr::pwr.t.test() to compute the effect size \(\theta_{small}\) (in code d) associated with 33% power.
Original Study: d = 0.700 0.95 CI = [0.080, 1.320]
Replication Study: d = 0.214 0.95 CI = [-0.061, 0.490]
────────────────────────────────────────────────────────────────────────────────
The replicated effect is smaller than the small effect (0.493), no replication!
And a (quite over-killed) plot:
Verhagen & Wagenmakers (2014) proposed a method to estimate the evidence of a replication study. The core topics to understand the method are:
Bayesian inference is the statistical procedure where prior beliefs about a phenomenon are combined, using the Bayes theorem, with evidence from data to obtain the posterior beliefs.
The interesting part is that the researcher express the prior beliefs in probabilistic terms. Then after collecting data, evidence from the experiment is combined increasing or decreasing the plausibility of prior beliefs.
Let’s make an (not a very innovative 😄) example. We need to evaluate the fairness of a coin. The crucial parameter is \(\theta\) that is the probability of success (e.g., head). We have our prior belief about the coin (e.g., fair but with some uncertainty). We toss the coin \(k\) times and we observe \(x\) heads. What are my conclusions?
\[ p(\theta|D) = \frac{p(D|\theta) \; p(\theta)}{p(D)} \] Where \(\theta\) is our parameter and \(D\) the data. \(p(\theta|D)\) is the posterior distribution that is the product between the likelihood \(p(D|\theta)\) and the prior \(p(\theta)\). \(p(D)\) is the probability of the data (aka marginal likelihood) and is necessary only for the posterior to be a proper probability distribution.
We can “read” the formula as: The probability of the parameter given the data is the product between the likelihood of the data given the parameter and the prior probability of the parameter.
Let’s express our prior belief in probabilistic terms:
Now we collect data and we observe \(x = 40\) tails out of \(k = 50\) trials thus \(\hat{\theta} = 0.8\) and compute the likelihood:
Finally we combine, using the Bayes rule, prior and likelihood to obtain the posterior distribution:
The idea of the Bayes Factor is computing the evidence of the data under two competing hypotheses, \(H_0\) and \(H_1\) (~ \(\theta\) in our previous example):
\[ \frac{p(H_0|D)}{p(H_1|D)} = \frac{f(D|H_0)}{f(D|H_1)} \times \frac{p(H_0)}{p(H_1)} \]
Where \(f\) is the likelihood function, \(y\) are the data. The \(\frac{p(H_0)}{p(H_1)}\) is the prior odds of the two hypothesis. The Bayes Factor is the ratio between the likelihood of the data under the two hypotheses.
Calculating the BF can be problematic in some condition. The SDR is a convenient shortcut to calculate the Bayes Factor (Wagenmakers et al., 2010). The idea is that the ratio between the prior and posterior density distribution for the \(H_1\) is an estimate of the Bayes factor calculated in the standard way.
\[ BF_{01} = \frac{p(D|H_0)}{p(D|H_1)} = \frac{p(\theta = x|D, H_1)}{p(\theta = x, H_1)} \]
Where \(\theta\) is the parameter of interest and \(x\) is the null value under \(H_0\) e.g., 0. and \(D\) are the data.
Following the previous example \(H_0: \theta = 0.5\). Under \(H_1\) we use a completely uninformative prior by setting \(\theta \sim Beta(1, 1)\).
We flip again the coin 20 times and we found that \(\hat \theta = 0.75\).
The ratio between the two black dots is the Bayes Factor.
The idea is using the posterior distribution of the original study as prior for a Bayesian hypothesis testing where:
If \(H_0\) is more likely after seeing the data, there is evidence against the replication (i.e., \(BF_{r0} > 1\)) otherwise there is evidence for a successful replication (\(BF_{r1} > 1\)).
Warning
Disclaimer: The actual implementation of Verhagen & Wagenmakers (2014) is different (they use the \(t\) statistics). The proposed implementation for the current workshop use a standard linear model.
Let’s assume that the original study (\(n = 30\)) estimate a \(y_{orig} = 0.4\) and a standard error of \(\sigma^2/n\).
Note
The assumption of Verhagen & Wagenmakers (2014) is that the original study performed a Bayesian analysis with a completely flat prior. Thus the confidence interval is the same as the Bayesian credible interval.
For this reason, the posterior distribution of the original study can be approximated as:
Let’s imagine that a new study tried to replicate the original one. They collected \(n = 100\) participants with the same protocol and found and effect of \(y_{rep} = 0.1\).
We can analyze these data with an intercept-only regression model setting as prior the posterior distribution of the original study:
# setting the prior on the intercept parameter
prior <- rstanarm::normal(location = yorig,
scale = se)
# fitting the bayesian linear regression
fit <- stan_glm(y ~ 1,
data = dat,
prior_intercept = prior,
refresh = FALSE)
summary(fit)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: y ~ 1
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 100
predictors: 1
Estimates:
mean sd 10% 50% 90%
(Intercept) 0.1 0.1 0.0 0.1 0.3
sigma 1.0 0.1 0.9 1.0 1.1
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.1 0.1 0.0 0.1 0.3
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 2639
sigma 0.0 1.0 2631
mean_PPD 0.0 1.0 3447
log-posterior 0.0 1.0 1615
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
We can use the bayestestR::bayesfactor_pointnull() to calculate the BF using the Savage-Dickey density ratio.
You can also use the bf_replication() function:
bf_replication <- function(mu_original,
se_original,
replication){
# prior based on the original study
prior <- rstanarm::normal(location = mu_original, scale = se_original)
# to dataframe
replication <- data.frame(y = replication)
fit <- rstanarm::stan_glm(y ~ 1,
data = replication,
prior_intercept = prior,
refresh = 0) # avoid printing
bf <- bayestestR::bayesfactor_pointnull(fit, null = 0, verbose = FALSE)
title <- "Bayes Factor Replication Rate"
posterior <- "Posterior Distribution ~ Mean: %.3f, SE: %.3f"
replication <- "Evidence for replication: %3f (log %.3f)"
non_replication <- "Evidence for non replication: %3f (log %.3f)"
if(bf$log_BF > 0){
replication <- cli::col_green(sprintf(replication, exp(bf$log_BF), bf$log_BF))
non_replication <- sprintf(non_replication, 1/exp(bf$log_BF), -bf$log_BF)
}else{
replication <- sprintf(replication, exp(bf$log_BF), bf$log_BF)
non_replication <- cli::col_red(sprintf(non_replication, 1/exp(bf$log_BF), -bf$log_BF))
}
outlist <- list(
fit = fit,
bf = bf
)
cat(
cli::col_blue(title),
cli::rule(),
sprintf(posterior, fit$coefficients, fit$ses),
"\n",
replication,
non_replication,
sep = "\n"
)
invisible(outlist)
}Bayes Factor Replication Rate
────────────────────────────────────────────────────────────────────────────────
Posterior Distribution ~ Mean: 0.137, SE: 0.097
Evidence for replication: 0.324652 (log -1.125)
Evidence for non replication: 3.080222 (log 1.125)
A better custom plot:
bfplot <- data.frame(
prior = rnorm(1e5, yorig, se),
posterior = rnorm(1e5, fit$coefficients, fit$ses)
)
plt <- ggplot() +
stat_function(geom = "line",
aes(color = "Original Study (Prior)"),
linewidth = 1,
alpha = 0.3,
fun = dnorm, args = list(mean = yorig, sd = se)) +
stat_function(geom = "line",
linewidth = 1,
aes(color = "Replication Study (Posterior)"),
fun = dnorm, args = list(mean = fit$coefficients, sd = fit$ses)) +
xlim(c(-0.5, 1.2)) +
geom_point(aes(x = c(0, 0), y = c(dnorm(0, yorig, sd = se),
dnorm(0, fit$coefficients, sd = fit$ses))),
size = 3) +
xlab(latex2exp::TeX("\\delta")) +
ylab("Density") +
theme(legend.position = "bottom",
legend.title = element_blank())A better custom plot:
Self-control has been regarded as an individual’s capacity to actively override or inhibit impulses; suppress urges; resist temptations; and break ingrained, well-learned behaviors or habits (Hagger et al., 2016).
Baumeister et al. (2007) proposed a limited-resource model of self-control
The ego-depletion effect is a behavioral effect where being engaged in a task requiring self-control reduce the strenght of self-control on a subsequent task (Baumeister et al., 2007).
Hagger et al. (2010) published a meta-analysis finding a standardized effect size of \(d = 0.62\) (\(SE = 0.02\), \(95\% CI = [0.57, 0.67]\)). They analyzed \(k = 198\) studies with a total of \(n = 10782\) participants.
However, Hagger et al. (2016) suggest that the effect could be inflated by publication bias and questioned the presence of the effect
Hagger et al. (2016) conducted a multi-lab replication study of the effect. See the OSF project https://osf.io/4zy8k/
The dataset can be found in 04-replication-methods/objects/hagger2016.rds:
id m_exp sd_exp m_ctrl sd_ctrl n_exp n_ctrl yi vi
1 0 NA NA NA NA NA NA 0.620000000 0.00040000
12 lab1 0.855 0.147 0.862 0.189 49 51 -0.040924886 0.04002438
19 lab2 0.904 0.153 0.896 0.141 62 58 0.053955545 0.03338254
110 lab3 0.888 0.207 0.849 0.239 50 49 0.173214156 0.04055969
13 lab4 0.857 0.177 0.864 0.198 75 71 -0.037138032 0.02742256
2 lab5 0.934 0.224 0.848 0.247 64 66 0.362328081 0.03128144
3 lab6 0.941 0.221 0.906 0.215 87 91 0.159900032 0.02255508
14 lab7 0.966 0.177 0.945 0.169 101 102 0.120915277 0.01974092
4 lab8 0.832 0.216 0.845 0.238 63 64 -0.056835612 0.03151073
15 lab9 1.000 0.192 0.956 0.180 50 52 0.234809077 0.03950104
5 lab10 0.855 0.271 0.922 0.244 79 80 -0.258679911 0.02536865
6 lab11 0.789 0.215 0.789 0.207 54 55 0.000000000 0.03670034
7 lab12 0.954 0.162 0.917 0.169 54 63 0.221687125 0.03460156
16 lab13 0.878 0.153 0.875 0.188 60 60 0.017391826 0.03333459
8 lab14 0.923 0.201 0.873 0.198 65 65 0.249147654 0.03100798
9 lab15 0.815 0.237 0.766 0.221 71 68 0.212507870 0.02895283
17 lab16 1.009 0.188 1.068 0.173 54 55 -0.324418361 0.03718312
18 lab17 0.881 0.173 0.919 0.182 48 50 -0.212226656 0.04106313
10 lab18 0.948 0.158 0.949 0.207 43 43 -0.005382076 0.04651180
11 lab19 0.943 0.223 0.885 0.219 47 49 0.260382433 0.04203788
20 lab20 0.948 0.185 0.898 0.188 53 52 0.266150985 0.03843601
21 lab21 0.908 0.204 0.916 0.200 59 56 -0.039328158 0.03481302
22 lab22 0.916 0.185 0.875 0.122 57 57 0.259891849 0.03538396
23 lab23 0.901 0.198 0.917 0.158 53 55 -0.088878058 0.03708631
24 lab24 0.896 0.174 0.860 0.184 99 99 0.200268475 0.02030330
type
1 original
12 replication
19 replication
110 replication
13 replication
2 replication
3 replication
14 replication
4 replication
15 replication
5 replication
6 replication
7 replication
16 replication
8 replication
9 replication
17 replication
18 replication
10 replication
11 replication
20 replication
21 replication
22 replication
23 replication
24 replication
id m_exp sd_exp m_ctrl sd_ctrl n_exp n_ctrl yi vi
1 0 NA NA NA NA NA NA 0.62000000 0.00040000
12 lab1 0.855 0.147 0.862 0.189 49 51 -0.04092489 0.04002438
19 lab2 0.904 0.153 0.896 0.141 62 58 0.05395555 0.03338254
110 lab3 0.888 0.207 0.849 0.239 50 49 0.17321416 0.04055969
13 lab4 0.857 0.177 0.864 0.198 75 71 -0.03713803 0.02742256
2 lab5 0.934 0.224 0.848 0.247 64 66 0.36232808 0.03128144
type
1 original
12 replication
19 replication
110 replication
13 replication
2 replication
id is an identifier for the study. Note that 0 is the original meta-analysis while others are the multi-lab replication studies*_exp and *_ctrl are the m mean, sd standard deviation and n sample size of the studies.yi and vi are the effect size and the sampling variance of the studiesThis is an interesting situation because we have a highly precise but probably biased original study and a probably unbiased multi-lab replications.
The dataset 04-replication-methods/objects/ghelfi2020.rds contains the original study and the replications:
id m_exp s_exp n_exp m_ctrl s_ctrl n_ctrl type yi
1 0 78.34000 10.83000 15 61.58000 16.880000 21 original 1.1152
2 1 75.28500 10.92395 20 65.36667 17.056364 19 replication 0.6822
3 2 67.08341 11.11722 22 67.51758 12.487700 22 replication -0.0361
4 3 62.25564 17.09394 19 64.25439 14.590814 19 replication -0.1231
5 4 68.99107 11.93297 40 70.70188 15.700467 24 replication -0.1256
6 5 76.52801 13.97302 17 76.64502 11.059971 22 replication -0.0092
7 6 78.22348 11.95674 9 64.04607 20.920061 9 replication 0.7925
8 7 67.02917 14.27363 24 65.93908 12.962434 29 replication 0.0791
9 8 70.20217 11.27462 23 72.54924 12.626843 22 replication -0.1929
10 9 70.19935 12.35952 51 71.70000 14.364429 55 replication -0.1109
11 10 71.79252 14.64755 147 68.30610 13.783661 142 replication 0.2444
12 11 70.20106 17.25785 9 64.76190 8.879799 8 replication 0.3691
vi
1 0.1316
2 0.1045
3 0.0877
4 0.1011
5 0.0652
6 0.1001
7 0.2190
8 0.0740
9 0.0863
10 0.0373
11 0.0139
12 0.2167