
A Bayesian Reanalysis of a Trial of Psilocybin versus Escitalopram for Depression

This preprint (2022) reanalyses data from a clinical trial that compared the effects of psilocybin with those of the SSRI escitalopram for major depressive disorder. The Bayesian secondary analysis found indeterminate evidence that psilocybin is superior to escitalopram on the QIDS SR-16, strong evidence favouring psilocybin on the BDI-1A and MADRS, and extremely strong evidence on the HAMD-17. The results support the inference that psilocybin outperformed escitalopram, though not to a clinically meaningful extent, and that psilocybin is almost certainly non-inferior to escitalopram.

Authors

Bari, B. A.
Carhart-Harris, R. L.
Erritzoe, D.

Published

July 1, 2022

PsyArXiv

Individual Study


Abstract

Objectives: To perform a Bayesian reanalysis of a recent trial of psilocybin (COMP360) versus escitalopram for Major Depressive Disorder (MDD) in order to provide a more informative interpretation of the indeterminate outcome of a previous frequentist analysis. Design: Reanalysis of a two-arm double-blind placebo-controlled trial. Participants: Fifty-nine patients with MDD. Interventions: Two doses of psilocybin 25 mg and daily oral placebo versus daily escitalopram and 2 doses of psilocybin 1 mg, with psychological support for both groups. Outcome measures: Quick Inventory of Depressive Symptomatology-Self-Report (QIDS SR-16), and three other depression scales as secondary outcomes: HAMD-17, MADRS, and BDI-1A. Results: Using Bayes factors and ‘sceptical priors’ which bias estimates towards zero, for the hypothesis that psilocybin is superior by any margin, we found indeterminate evidence for QIDS SR-16, strong evidence for BDI-1A and MADRS, and extremely strong evidence for HAMD-17. For the stronger hypothesis that psilocybin is superior by a ‘clinically meaningful amount’ (using literature-defined values of the minimally clinically important difference), we found moderate evidence against it for QIDS SR-16, indeterminate evidence for BDI-1A and MADRS, and moderate evidence supporting it for HAMD-17. Furthermore, across the board, we found extremely strong evidence for psilocybin’s non-inferiority versus escitalopram. These findings were robust to prior sensitivity analysis. Conclusions: This Bayesian reanalysis supports the following inferences: 1) that psilocybin did indeed outperform escitalopram in this trial, but not to an extent that was clinically meaningful, and 2) that psilocybin is almost certainly non-inferior to escitalopram. 
The present results provide a more precise and nuanced interpretation to previously reported results from this trial, and support the need for further research into the relative efficacy of psilocybin therapy for depression with respect to current leading treatments.


Research Summary of 'A Bayesian Reanalysis of a Trial of Psilocybin versus Escitalopram for Depression'

Introduction

A recent randomised trial comparing psilocybin therapy to escitalopram for major depressive disorder reported a non-significant difference on its pre-specified primary outcome, the QIDS SR-16 score from baseline to 6 weeks, yet found significant between-group differences favouring psilocybin on three secondary depression scales. Because multiple comparisons were not pre-specified for correction, the original frequentist interpretation treated the primary outcome as indeterminate and the secondary outcomes as uninterpretable. This pattern motivated a secondary analytic approach that could provide more informative statements about the strength of evidence across all outcomes. Nayak and colleagues therefore performed a Bayesian reanalysis of the original two-arm, double-blind trial to estimate the probability that psilocybin was superior to escitalopram, to test whether any observed differences reached literature-defined minimally clinically important differences (MCIDs), and to assess non-inferiority. The authors argued that Bayesian methods can overcome several limitations of the original frequentist analysis by providing direct probabilistic statements about hypotheses (any effect, clinically meaningful effect, and non-inferiority), by being less vulnerable to multiple comparisons when using appropriate priors, and by distinguishing indeterminate from genuinely null findings.

Methods

The reanalysis used Bayesian linear regression models for each depression outcome collected in the trial: QIDS SR-16, HAMD-17, MADRS, and BDI-1A. Each model predicted the 6-week follow-up score using baseline score and treatment condition (psilocybin versus escitalopram) as predictors, so that the primary quantity of interest was the posterior distribution of the group difference in follow-up scores after adjusting for baseline. Two classes of priors were compared. Flat priors approximated non-informative analyses and are broadly comparable to frequentist estimates. Skeptical priors were deliberately conservative, shrinking estimates toward zero; they were tuned so the 95% highest density interval of the prior predictive distribution for the group difference spanned benchmark values for "very much improved" derived from the literature. Full prior specifications and additional tuning details were reported in supplemental material (not reproduced here). The authors report that model fitting used Hamiltonian Markov Chain Monte Carlo (Stan via the brms and rethinking packages in R), with posterior predictive checks and trace plots inspected for adequacy. The extracted text does not clearly report the trial sample size within the Methods section. Using the posterior group-difference distributions, the investigators computed three posterior-based probabilities for each scale: the probability that psilocybin was superior by any amount (proportion of posterior > 0), the probability that psilocybin exceeded the MCID (proportion of posterior > MCID), and the probability of non-inferiority (proportion of posterior > non-inferiority margin). Scale-specific MCID values used were: QIDS 28.5% group difference, HAMD-17 4 points, MADRS 4.5 points, and BDI-1A 29.64% group difference. Non-inferiority margins were QIDS −0.3 standardized difference, MADRS −2.5 points, HAMD-17 −2.5 points, and a conservative −1 point for BDI-1A. 
The authors also computed Bayes factors (BF10) to quantify evidence for H1 versus H0 for both the "any superiority" and "clinically meaningful superiority" hypotheses, and performed prior sensitivity analyses varying the prior predictive 95% interval to span 50% and 150% of the MCID. Analyses were performed independently by two authors to ensure reproducibility of the code and results. The reanalysis therefore combines model-based posterior estimation with Bayes factor comparisons and sensitivity checks to assess robustness to prior choices.

Results

Under skeptical priors, posterior estimates and Bayes factor results differed across scales. For QIDS SR-16 the median group difference favoured psilocybin by 2.0 points (95% credible interval −0.8 to 5.0). The posterior probability of any positive effect was 92.0%, while the probability of exceeding the MCID was 5.4%. The Bayes factor for any positive effect was 1.2, which the authors interpret as indeterminate evidence; the Bayes factor for a clinically meaningful difference was 0.14, interpreted as moderate evidence for the null of no clinically meaningful effect. For HAMD-17 the median group difference under the skeptical prior was 5.3 points (95% credible interval 2.6 to 8.0) in favour of psilocybin. The posterior probability of any positive effect was effectively 100% and the probability of exceeding the MCID was 81.7%. The Bayes factor for any positive effect was 363 (classified as extremely strong evidence for superiority) and the Bayes factor for a clinically meaningful difference was 6.1 (moderate evidence for a clinically meaningful effect). MADRS produced a median group difference reported as 7.0 points in favour of psilocybin (the credible interval is not clearly reported in the extracted text). The posterior probability of any positive effect was 99.7% and the probability of exceeding the MCID was 36.5%. The Bayes factor for any positive effect was 25 (strong evidence), while the Bayes factor for a clinically meaningful difference was 1.3, which the authors label indeterminate. For BDI-1A the median group difference was 7.0 points (95% credible interval 1.6 to 12.2) favouring psilocybin. The posterior probability of any positive effect was 99.4% and the probability of exceeding the MCID was 28.7%. The Bayes factor for any positive effect was 12.6 (strong evidence), whereas the Bayes factor for a clinically meaningful difference was 1.0 (indeterminate). 
Across all scales the authors report strong evidence for non-inferiority of psilocybin relative to escitalopram: QIDS non-inferiority probability 99.67% (BF10 ≈197), HAMD-17 100% (reported as infinite BF in the extracted text), MADRS 99.98% (BF10 ≈2831), and BDI-1A 99.78% (BF10 ≈398). Sensitivity analyses varying prior width to 50% and 150% of the MCID did not substantially change these conclusions according to the authors; full tables and supplemental results are referenced but not reproduced in the extracted text.

Discussion

Nayak and colleagues interpret the Bayesian reanalysis as providing greater clarity than the original frequentist report. They conclude that, within this trial, psilocybin likely outperformed escitalopram on several clinician- and self-rated depression scales, but that the superiority did not consistently reach literature-defined thresholds for clinical meaningfulness across all measures. Specifically, the HAMD-17 results provided the strongest evidence both for any superiority and for a clinically meaningful superiority, while the QIDS SR-16 (the original primary outcome) yielded indeterminate evidence for any superiority and moderate evidence against a clinically meaningful effect. The MADRS and BDI-1A showed strong evidence for any superiority but indeterminate evidence regarding clinically meaningful superiority. The authors argue that Bayesian methods allow these distinctions to be made explicitly and yield intuitive probabilistic statements that can distinguish indeterminate from likely-null results, quantify non-inferiority, and avoid some problems of multiple comparisons when using conservative priors. They also advocate for Bayesian approaches in trial design, noting that sequential Bayesian designs can be more flexible and efficient than fixed-sample frequentist designs by allowing prespecified evidence thresholds for stopping for benefit or futility. Limitations acknowledged by the authors mirror those of the original trial: potential unblinding and expectancy effects that could inflate group differences. The authors note that the use of skeptical priors mitigates but does not eliminate these design-related biases. They further recognise that differences across rating scales merit additional work to understand scale properties and their implications for trial interpretation. 
The discussion reiterates that the Bayesian reanalysis does not conflict with the original manuscript's data but offers a more nuanced, probabilistic framing of what the trial data support and where further research is required.


INTRODUCTION

A recent trial investigating psilocybin's efficacy, relative to escitalopram, for major depressive disorder reported no significant benefit relative to the standard of care. Specifically, psilocybin did not show a significant difference with respect to the change in Quick Inventory of Depressive Symptomatology-Self-Report (QIDS SR-16) scores from 7-10 days preintervention to a 6-week endpoint, which was the primary outcome of this trial. However, a closer look at the results reveals that psilocybin significantly outperformed escitalopram on all secondary outcomes, including three clinically-validated depression scales. Because there was no pre-specified plan for multiple comparisons corrections, the formally allowable frequentist interpretation was that the primary outcome was indeterminate and that the secondary outcomes were uninterpretable. A Bayesian approach has the potential to extract more interpretable information from the results of this trial, overcoming some key limitations of the previous frequentist analysis.

FREQUENTIST AND BAYESIAN APPROACHES IN CLINICAL TRIALS

The results of the original trial highlight several drawbacks of frequentist methods. First, frequentist methods suffer from several problems arising from multiple comparisons. Because p-values are uniformly distributed when the null hypothesis is true, 5% of tests will be positive by chance alone when α = .05. This necessitates special procedures to correct for multiple comparisons when multiple outcome measures are administered, a number of which can be arbitrary. Second, frequentist methods do not convey the probability of any particular hypothesis, dealing instead with the probability of the data (or more extreme data) assuming the null hypothesis is true. Because of this, p-values cannot be interpreted as measures of confidence in the findings. Third, these methods rigidly separate hypothesis testing from effect size estimation, and results are often reported that are statistically significant but clinically meaningless. Fourth, fixed sample sizes are chosen on the basis of a priori assumptions about the true effect size. If the actual effect size is smaller than anticipated, the trial is underpowered and may miss a real effect; hence, a null result provides no insight into whether this is due to a lack of power or due to a genuine absence of effect. On the other hand, if the actual effect size is much greater, then the trial enrolls more participants than it needs. An alternative approach is to employ methods of Bayesian inference. Although these methods are still less often used, they address many of the limitations of frequentist methods. First, with appropriately chosen priors, Bayesian inference can bypass the multiple comparisons problem. Fewer false positive claims are made with confidence, which allows for more flexible use of multiple comparisons. Second, the Bayesian posterior distribution naturally allows for effect size estimation and hypothesis testing to be conducted simultaneously. 
Third, and importantly for the specific case of clinical trials, Bayesian inference is flexible, modular, and allows for intuitive and meaningful clinical interpretations, rather than the simple black/white dichotomization imposed by frequentist methods. In effect, the probability that a new intervention has any effect and the probability that it has a clinically meaningful effect (i.e., above an established criterion) can be determined naturally from the same posterior distribution. Additionally, frequentist analyses can often be interpreted as special cases of Bayesian inference (i.e., when using uniform or 'flat' priors), suggesting the two approaches are not entirely divorced from one another. Another important benefit of Bayesian analysis is that it allows us to quantify evidence for a hypothesis, rather than just evidence against a null, an advantage which we leverage here. Unlike p-values, which are simply positive or null, Bayes factors are tripartite, allowing us to distinguish positive, indeterminate, and null results. Under a frequentist paradigm, null results may be truly null or may represent an underpowered study, and differentiating the two can be highly non-trivial. Because of this, no conclusions can be drawn in general from a null result in a frequentist trial. In contrast, Bayes factors naturally allow us to calculate the probability that a finding is truly negative vs indeterminate (requiring more data). This information can prove critical in determining whether to continue trials on a particular intervention (with a larger sample size) or to cease trials of said intervention altogether. For these reasons, Bayesian analyses are becoming increasingly common in clinical medicine. One useful example comes from the COVID STEROID 2 trial, which tested two different doses of dexamethasone in treating severe COVID-19 pneumonia. The study reported a non-significant primary outcome, which was interpreted as null. 
A Bayesian reanalysis concluded that the probability of any benefit of the higher dose was 95%, of clinically important benefit was 62%, and of clinically important harm was 0.2%. While not conflicting with the original frequentist study, this reanalysis offers a more complete and clinically informative picture of the data. Other examples include the ANDROMEDA-SHOCK trial and a trial of Extra-Corporeal Membrane Oxygenation vs conventional ventilation, each of which initially reported inconclusive primary outcomes with frequentist analyses, yet Bayesian reanalysis demonstrated high probability of benefit in each. Each of these examples illustrates the usefulness of Bayesian reanalyses in better understanding clinical trial results that appeared ambiguous from the frequentist perspective. Notably, it is not the case that Bayesian reanalyses simply convert null findings from frequentist trials into positive effects. On the contrary, a systematic review of Bayesian reanalyses of 82 studies in high-impact critical care journals found that discordance between frequentist and Bayesian results is uncommon. Of the 82 trials, 78 were negative or indeterminate under frequentist criteria, and Bayesian reanalysis found clinically meaningful effects probable in only 7 of these (9%). In the 4 trials with statistical significance favouring the intervention group, Bayesian reanalysis found positive results improbable in 2 (50%). As these findings demonstrate, Bayesian reanalyses are often more informative than the initial frequentist analysis, but they do not represent a less conservative test of the purported benefit of a given intervention.
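The multiple comparisons problem described in this section can be made concrete with a small worked calculation. The sketch below assumes the tests are independent, which is a simplification (depression scales measured on the same patients are correlated), and the function name is illustrative rather than taken from the paper. Under that assumption, the chance of at least one false positive among k tests at level α is 1 − (1 − α)^k.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among n_tests
    independent tests, each run at significance level alpha,
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


# One primary outcome versus all four depression scales at alpha = .05.
for k in (1, 4):
    print(f"{k} test(s): {familywise_error_rate(0.05, k):.4f}")
```

With four uncorrected outcomes the chance of at least one spurious "significant" result is roughly 19% rather than 5%, which is why uncorrected secondary outcomes are treated as uninterpretable under frequentist conventions.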

THE PRESENT STUDY

Given the success of Bayesian reanalyses, we suggest that the findings of the original trial can be better understood by subjecting them to a Bayesian reanalysis. Here, we perform a Bayesian reanalysis of that trial to quantify the efficacy of psilocybin versus escitalopram in treating major depressive disorder. We test the hypothesis that psilocybin is superior to escitalopram using all four clinically-validated depression inventories administered in the study, under both flat priors (largely equivalent to frequentist analyses) and skeptical priors (which bias effects towards zero and represent a more conservative approach). Our results show that psilocybin indeed outperforms escitalopram, but not to an extent that is 'clinically meaningful', defined using literature-derived, scale-specific values of the minimally clinically important difference (MCID, see Methods). Importantly, this reanalysis also provides additional insight into the seemingly incongruous "null" result on the QIDS, by distinguishing where evidence is truly indeterminate and where it is in favor of the null. These results enrich and add context to the original findings, and support the need for further research into the relative efficacy of psilocybin therapy for depression, versus standard of care or any other viable active comparator with an evidence base.

BAYESIAN LINEAR REGRESSION

Bayesian linear models (McElreath 2020) were performed with each of the depression scales that were used as outcome measures in the trial: the 16-item Quick Inventory of Depressive Symptomatology-Self-Report (QIDS SR-16), the 17-item Hamilton Depression Rating Scale (HAMD-17), the Montgomery and Asberg Depression Rating Scale (MADRS), and the Beck Depression Inventory 1A (BDI-1A). All models took the following form, similar to the original analysis: $\mathrm{SCALE}_{FU} = \beta_{BL} \cdot \mathrm{SCALE}_{BL} + \beta_{cond} \cdot \mathrm{condition} + v$, where $\mathrm{SCALE}_{BL}$ and $\mathrm{SCALE}_{FU}$ are the values of a given scale at baseline and final follow-up, $\beta_{BL}$ and $\beta_{cond}$ are the coefficients of a linear relationship between $\mathrm{SCALE}_{BL}$ and condition (psilocybin or escitalopram group) as predictors of $\mathrm{SCALE}_{FU}$, and $v$ is the residual of the regression. Put simply, the outcome variable was the follow-up score for each scale at 6 weeks, while condition and baseline depression scale score were used as independent variables. Bayesian regression models need to specify prior distributions for their coefficients, in our case for $\beta_{BL}$ and $\beta_{cond}$. For each outcome measure, two variants of the model were assessed that differed in the definition of their priors: a flat prior variant (which approximates frequentist methods) and a skeptical prior variant (which shrinks estimates closer to 0). Flat priors posit that any effect size is possible, and simply allow each parameter to take any value with uniform prior probability. Flat priors often produce results equivalent to frequentist approaches. Skeptical priors instead posit that large effect sizes are unlikely. The skeptical priors were tuned such that the 95% highest density interval of the prior predictive distribution for group difference spans the magnitude of benchmark values for "very much improved". In other words, this prior constrains effect sizes to be within a range that is considered clinically possible, and penalizes effects that are large. This skeptical prior signifies a belief that there is likely no group difference. 
Skeptical priors hence shrink estimates toward zero and are more conservative than flat priors and typical frequentist methods. Full details of these priors are available in the supplementary materials. For constructing the skeptical priors, the following benchmark values for "very much improved" were used. These criteria are based on values previously identified in the literature: QIDS 75% change from baseline; HAMD-17 78% change from baseline, after averaging values from several citations; MADRS 82% change from baseline. Finally, for BDI-1A a 75% change from baseline was considered "very much improved", following the benchmarks used for the other measures, since benchmark values of "very much improved" were not readily available in the literature for this scale. Posterior distributions of depression scale scores were calculated for both psilocybin (COMPASS Pathways proprietary synthetic psilocybin, COMP360) and escitalopram at the final follow-up (6-week timepoint), and the posterior distribution of their difference was calculated by subtracting one distribution from the other, yielding the "posterior group difference". This posterior distribution can be summarized by its median value and by the upper and lower limits of the credible interval, which contains a given percentage (often 95%) of the posterior density. Note that frequentist confidence intervals are often misinterpreted as denoting the probability that the interval contains the true value of a parameter of interest, or as capturing the number of times the true value would lie within the given interval if the study were run multiple times. In contrast, the Bayesian credible interval can be interpreted more simply: given the data and the model, there is (for example) a 95% probability that the true value lies within the interval. 
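To illustrate how a skeptical prior shrinks an estimate toward zero while a flat prior leaves it essentially unchanged, the following sketch uses the textbook conjugate normal-normal update. This is a deliberate simplification and not the regression model fitted in the paper: it assumes a single group-difference estimate with a known standard error, and the numbers (an estimate of 6.0 with standard error 2.0, a skeptical prior SD of 4.0) are hypothetical.

```python
import math


def normal_posterior(estimate, std_error, prior_mean, prior_sd):
    """Conjugate normal update with known standard error.

    Returns the posterior mean and posterior SD of the group difference
    as a precision-weighted combination of prior and data.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return post_mean, math.sqrt(post_var)


def credible_interval_95(mean, sd):
    """Central 95% credible interval of a normal posterior
    (for a normal distribution this coincides with the 95% HDI)."""
    half_width = 1.959964 * sd
    return mean - half_width, mean + half_width


# Hypothetical inputs, not values from the trial.
ESTIMATE, STD_ERROR = 6.0, 2.0

for label, prior_sd in (("near-flat prior", 1000.0), ("skeptical prior", 4.0)):
    mean, sd = normal_posterior(ESTIMATE, STD_ERROR, prior_mean=0.0, prior_sd=prior_sd)
    low, high = credible_interval_95(mean, sd)
    print(f"{label}: posterior mean {mean:.2f}, 95% CrI [{low:.2f}, {high:.2f}]")
```

The near-flat prior reproduces the raw estimate, whereas the zero-centred skeptical prior pulls the posterior mean toward zero and narrows the interval.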
Using the posterior group differences, the probabilities that psilocybin had 1) any superiority, 2) clinically meaningful superiority, and 3) non-inferiority to escitalopram were calculated by taking the percent of the posterior distribution greater than 1) 0, 2) the minimally clinically important difference (MCID), and 3) the non-inferiority margin, respectively. The MCID and non-inferiority margins were taken from the literature. The following values were used for MCID: QIDS 28.5% group difference; HAMD-17 4 points; MADRS 4.5 points; BDI-1A 29.64% group difference. The following non-inferiority margins were used: QIDS -0.3 standardized difference from control; MADRS -2.5 points; HAMD-17 -2.5 points. As non-inferiority margins were not readily available in the literature for BDI-1A, a conservative margin of -1 point was chosen. All analyses were performed in R (R Core Team 2020) independently by two authors (SMN and BAB) to ensure similar results. Model parameters were estimated using Hamiltonian Markov Chain Monte Carlo simulations using both the brms and rethinking (McElreath 2020) packages, which are wrappers for the probabilistic programming language Stan. Visual inspection of posterior predictive checks (demonstrating that simulated data adequately approximate real data) and trace plots (showing adequate chain mixing) suggested reasonable model specification. Analysis scripts are available at.
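The three posterior probabilities described above are simple proportions of posterior draws. The sketch below shows that calculation in Python for illustration only (the authors worked in R with brms and rethinking). The draws are simulated from a normal distribution as a stand-in for real MCMC output, its mean and spread are assumptions chosen only for illustration, and the function name is hypothetical. The MCID (4 points) and non-inferiority margin (-2.5 points) are the HAMD-17 values quoted in the text.

```python
import random


def posterior_probabilities(draws, mcid, ni_margin):
    """Share of posterior group-difference draws above each threshold.

    Positive differences favour psilocybin, so each probability is the
    proportion of draws greater than 0, the MCID, or the non-inferiority
    margin, respectively.
    """
    if not draws:
        raise ValueError("draws must not be empty")
    n = len(draws)
    return {
        "any_superiority": sum(d > 0 for d in draws) / n,
        "meaningful_superiority": sum(d > mcid for d in draws) / n,
        "non_inferiority": sum(d > ni_margin for d in draws) / n,
    }


# Stand-in draws (assumption): NOT the study's posterior samples.
rng = random.Random(0)
stand_in_draws = [rng.gauss(5.3, 1.4) for _ in range(20_000)]

for name, prob in posterior_probabilities(stand_in_draws, mcid=4.0, ni_margin=-2.5).items():
    print(f"{name}: {prob:.3f}")
```

Because the non-inferiority margin lies below zero and the MCID above it, the three probabilities are always ordered: non-inferiority ≥ any superiority ≥ clinically meaningful superiority.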

BAYES FACTORS

We computed Bayes factors for two sets of hypotheses: that psilocybin outperforms escitalopram 1) by any amount and 2) by at least the MCID. Bayes factors comparing a specific H1 ("experimental" hypothesis) to H0 ("null" hypothesis) quantify the degree of evidence for H1 versus H0. For a given prior and posterior distribution, this Bayes factor (henceforth BF10) can distinguish between null results and underpowered results, a useful property that p-values lack. For the hypothesis that psilocybin outperforms escitalopram by any amount, the experimental hypothesis is that the group difference is greater than zero, while the null is that the group difference is zero. Mathematically: $\mathrm{diff} = \mathrm{SCALE}_{FU}^{\mathrm{condition=escitalopram}} - \mathrm{SCALE}_{FU}^{\mathrm{condition=psilocybin}}$, $H_1: \mathrm{diff} > 0$, $H_0: \mathrm{diff} = 0$. To calculate BF10, we take advantage of the following relationship: $\frac{P(H_1 \mid \mathrm{data})}{P(H_0 \mid \mathrm{data})} = \frac{P(\mathrm{data} \mid H_1)}{P(\mathrm{data} \mid H_0)} \times \frac{P(H_1)}{P(H_0)}$, where the first term is the posterior odds, the second term is the Bayes factor, and the third term is the prior odds. We calculate the Bayes factor by dividing the posterior odds by the prior odds. The prior odds can be interpreted as "the odds of H1 prior to seeing the data", and the posterior odds can be interpreted as "the odds of H1 after seeing the data". Greater values of the prior and posterior odds reflect greater plausibility of H1 under those distributions. BF10 is the ratio of these odds, where numbers greater than 1 indicate more plausibility for H1 after seeing the data, and numbers between 0 and 1 indicate more plausibility for H0. For example, a BF10 of 5 means the data are 5 times more likely under H1 than H0. Using common convention, values of BF10 in the range 3-10 indicate moderate evidence, values in the range of 10-30 indicate strong evidence, 30-100 very strong, and greater than 100 extremely strong evidence for H1. 
These values can be inverted and interpreted similarly as evidence for H0: a BF10 of 1/3-1/10 can be interpreted as moderate evidence for H0, with strength of evidence increasing as numbers approach 0. BF10 values from 0.5-2 are usually considered to be indeterminate, requiring more evidence. For the hypothesis that psilocybin is greater than escitalopram by a clinically meaningful amount (MCID), the following experimental and null hypotheses were used: $H_1: \mathrm{diff} > \mathrm{MCID}$, $H_0: \mathrm{diff} \le \mathrm{MCID}$. Bayes factors were also computed for non-inferiority, using the following experimental and null hypotheses relative to the non-inferiority margin (NI): $H_1: \mathrm{diff} > \mathrm{NI}$, $H_0: \mathrm{diff} \le \mathrm{NI}$.
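The odds relationship and the verbal evidence labels described in this section can be sketched as two small helpers. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the odds-ratio helper applies to hypotheses expressed as probabilities of a region (such as exceeding the MCID), and the label "weak" for Bayes factors between 2 and 3 (or between 1/3 and 1/2) is an assumption, because the text does not name that band.

```python
def bayes_factor_from_probs(posterior_prob_h1, prior_prob_h1):
    """BF10 as posterior odds of H1 divided by prior odds of H1."""
    if not 0.0 < prior_prob_h1 < 1.0:
        raise ValueError("prior probability must lie strictly between 0 and 1")
    if posterior_prob_h1 >= 1.0:
        # Every posterior draw supports H1, so the posterior odds are unbounded.
        return float("inf")
    posterior_odds = posterior_prob_h1 / (1.0 - posterior_prob_h1)
    prior_odds = prior_prob_h1 / (1.0 - prior_prob_h1)
    return posterior_odds / prior_odds


def classify_bf10(bf10):
    """Map BF10 to the verbal categories quoted in the text."""
    if bf10 <= 0:
        raise ValueError("BF10 must be positive")
    if 0.5 <= bf10 <= 2:
        return "indeterminate"
    favoured = "H1" if bf10 > 1 else "H0"
    strength = bf10 if bf10 > 1 else 1.0 / bf10  # invert to read evidence for H0
    if strength < 3:
        label = "weak"  # assumption: this band is not named in the text
    elif strength < 10:
        label = "moderate"
    elif strength < 30:
        label = "strong"
    elif strength <= 100:
        label = "very strong"
    else:
        label = "extremely strong"
    return f"{label} evidence for {favoured}"


# Bayes factors reported in the Results section.
for bf in (1.2, 0.14, 6.1, 25, 363):
    print(bf, "->", classify_bf10(bf))
```

Inverting values below 1 lets the same thresholds serve for both hypotheses, which is what makes the Bayes factor tripartite: evidence for H1, evidence for H0, or indeterminate.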

PRIOR SENSITIVITY ANALYSIS

To ensure that results were not excessively impacted by the choice of priors, sensitivity analyses were performed using two additional sets of priors, in which the 95% highest density interval of the prior predictive distribution for group difference spanned 50% and 150% of the MCID. Further details about this procedure can be found in the supplemental material.
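The logic of a prior sensitivity analysis can be sketched by repeating one calculation under priors of different widths and checking whether the conclusion moves. The sketch below is a simplification under stated assumptions, not the procedure in the supplemental material: it places a zero-centred normal prior directly on the group difference, sets the prior SD so that 1.96 SDs equal a given fraction of the MCID, and uses a hypothetical estimate (6.0, standard error 2.0) with a hypothetical MCID of 4.0.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def prob_exceeds_mcid(estimate, std_error, prior_sd, mcid):
    """Posterior probability that the group difference exceeds the MCID,
    using a conjugate normal update with a zero-centred normal prior."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    post_sd = math.sqrt(1.0 / (prior_precision + data_precision))
    post_mean = data_precision * estimate / (prior_precision + data_precision)
    return 1.0 - normal_cdf(mcid, post_mean, post_sd)


# Hypothetical inputs, not values from the trial.
MCID, ESTIMATE, STD_ERROR = 4.0, 6.0, 2.0

for fraction in (0.5, 1.0, 1.5):
    prior_sd = fraction * MCID / 1.96  # 95% of prior mass within +/- fraction * MCID
    p = prob_exceeds_mcid(ESTIMATE, STD_ERROR, prior_sd, MCID)
    print(f"prior spanning {fraction:.0%} of MCID: P(diff > MCID) = {p:.3f}")
```

A result is considered robust when the qualitative conclusion is the same across the range of priors; in this toy example the narrowest prior dominates the data, which shows why the chosen range should be reported.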

QIDS SR-16

The median [95% CI] for QIDS SR-16 group difference under a skeptical prior was 2.0 [-0.8, 5.0] in favor of psilocybin, with a 92.0% probability for any positive effect and a 5.4% probability for a clinically meaningful difference. The Bayes factor for any positive effect was 1.2, indicating indeterminate evidence, which implies that the data are insufficient with respect to this question. The Bayes factor for a clinically meaningful difference was 0.14, indicating moderate evidence for the null of no clinically meaningful difference.

HAMD-17

The median [95% CI] for HAMD-17 group difference under a skeptical prior was 5.3 [2.6, 8.0] in favor of psilocybin, with a 100% probability for any positive effect and a 81.7% probability for a clinically meaningful difference. The Bayes factor for any positive effect was 363, indicating extremely strong evidence. The Bayes factor for a clinically meaningful difference was 6.1, indicating moderate evidence for a clinically meaningful difference.

MADRS

The median for MADRS group difference under a skeptical prior was 7.0 in favor of psilocybin, with a 99.7% probability for any positive effect and a 36.5% probability for a clinically meaningful difference. The Bayes factor for any positive effect was 25, indicating strong evidence. The Bayes factor for a clinically meaningful difference was 1.3, indicating indeterminate evidence.

BDI-1A

The median [95% CI] for BDI-1A group difference under a skeptical prior was 7.0 [1.6, 12.2] in favor of psilocybin, with a 99.4% probability for any positive effect and a 28.7% probability for a clinically meaningful difference. The Bayes factor for any positive effect was 12.6, indicating strong evidence, while the Bayes factor for a clinically meaningful difference was 1.0, indicating indeterminate evidence.

Estimates for all four depression scales under skeptical and flat (not shown in text) priors are available in Table. The probabilities (Bayes factors) for non-inferiority were QIDS: 99.67% (197), HAMD-17: 100% (infinite), MADRS: 99.98% (2831), BDI-1A: 99.78% (398). Sensitivity analyses using different priors did not substantially alter these results. Details of these analyses can be found in the supplementary material.

DISCUSSION

This study presents a Bayesian reanalysis of data from a recently published study comparing psilocybin to escitalopram for the treatment of depression. Of the four depression scales included in this study, one failed to find a significant between-condition difference (QIDS SR-16) under the original frequentist analysis, while the remaining three found a significant difference in favor of psilocybin (BDI-1A, MADRS, HAMD-17). As the QIDS SR-16 was the pre-determined primary outcome, the trial was considered indeterminate overall. The Bayesian reanalysis presented here provides further insight into this trial's data, enabling clearer inferences to be drawn from them, and offers suggestions for future studies. Specifically, the results of the presented reanalysis suggest that psilocybin did indeed outperform escitalopram in this trial, but not to an extent that was clinically meaningful, while clarifying that more data are needed before these conclusions can be adopted with high confidence. In addition, the results also support that psilocybin is almost certainly non-inferior to escitalopram, as administered in this study. Null hypothesis significance testing in the standard Neyman-Pearson methodology asks how probable the data are under the assumption that H0 is true, and is blind to the experimental hypothesis, H1. Such a method therefore cannot directly estimate the probability of H0, or any other hypothesis. Alternatively, Bayesian methods can quantify the evidence for specific alternative and null hypotheses in intuitive, probabilistic terms. This allows more direct answers to questions relevant to clinicians (e.g. "what is psilocybin's effect on depression, how likely is that effect, and how certain can we be about it?") rather than offering a mere dichotomous answer. Harnessing this capacity, the current analysis investigated three hypotheses. 
For the hypothesis of any amount of superiority of psilocybin, there is indeterminate evidence (QIDS SR-16), strong evidence for H1 (BDI-1A and MADRS), and extremely strong evidence for H1 (HAMD-17). For the hypothesis that psilocybin is superior by a clinically meaningful amount, there is moderate evidence for H0 (QIDS SR-16), indeterminate evidence (BDI-1A and MADRS), and moderate evidence for H1 (HAMD-17). Across the board there is extremely strong evidence for non-inferiority of psilocybin with respect to escitalopram. Taken together, we can conclude that in this study population psilocybin is probably superior to escitalopram, but not clearly to a degree that is clinically meaningful, and that psilocybin is almost certainly non-inferior to escitalopram. While none of these conclusions conflicts with the results of the original manuscript, they are much more informative and nuanced than the conclusions of the frequentist analysis. In the original trial, the primary outcome measure (QIDS SR-16) yielded a non-significant result, while psilocybin was superior in every contrast using secondary efficacy outcome measures (including HAMD-17, MADRS, and BDI-1A). Nevertheless, frequentist conventions required this to be reported as a null trial (i.e. that "the primary outcome is indeterminate and the secondary outcomes uninterpretable"). As a thought experiment, imagine an alternative, plausible outcome: the primary outcome significantly favored psilocybin and yet every secondary outcome was null. Although such results could be reported as proof of psilocybin's superiority over escitalopram, we suspect many readers would be skeptical of this interpretation, suspecting it to be a false positive. Under a Bayesian analysis, the individual scales continue to offer contrasting evidence. 
For example, for the hypothesis of clinically meaningful superiority of psilocybin, there is moderate evidence against (i.e., H0) according to the QIDS SR-16, while there is moderate evidence for (H1) according to the HAMD-17. Future work could address the relative strengths and weaknesses of the depressive symptom severity rating scales used in this trial, which may further aid our ability to draw inferences from this trial's results and may also contribute to the design of future trials. However, a Bayesian reanalysis with skeptical priors allows us to analyze the findings from each of the scales in their totality. This provides a more informative picture of the results of the trial by considering all of the available data while remaining robust to problems resulting from multiple comparisons.

Bayesian methods have been critiqued as unnecessarily subjective, given the need for a prior distribution. We view this argument as a red herring, as frequentist clinical trials typically use substantial prior information in the design of the trial, particularly in estimating the number of subjects that must be enrolled to avoid an underpowered result. In addition, some frequentist methods are equivalent to Bayesian inference with uniform priors, demonstrating that priors are implicitly a feature of frequentism. The implicit flat prior distributions that characterize frequentist analyses are often inappropriate statistically (causing problems with model convergence) and logically (rendering extreme effect sizes as probable as small ones) (Van Dongen 2006).

Bayesian principles extend far beyond inference performed at the end of data collection, offering important advantages in the design of clinical trials. In powering a trial, frequentist methods typically establish a fixed sample size based on a prior assumption of effect size, which is often uncertain.
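To make the dependence of a fixed sample size on the assumed effect size concrete, the following sketch applies the standard normal-approximation formula for a two-sided, two-sample comparison of means, n per group = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2. The function name `n_per_group` and all input values are hypothetical and are not drawn from the trial's actual power calculation.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample z-test.

    delta: assumed true difference in means (the design-stage guess)
    sigma: assumed common standard deviation of the outcome
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Hypothetical design values: the same outcome SD, three guesses at the effect.
for delta in (2.0, 3.0, 4.0):
    print(f"assumed difference {delta}: n per group = {n_per_group(delta, sigma=5.0)}")
```

Because the required sample size scales with 1/delta^2, halving the assumed difference roughly quadruples the number of participants, so a modest error in the design-stage guess can leave a fixed-size trial substantially underpowered.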
If a null result is obtained, it can be unclear whether the result is truly null or underpowered, despite best attempts at collecting an appropriate number of subjects. Frequentist sequential designs are possible, and occasionally used, though they require a rigid design with prespecified looks at the data. A more flexible and intuitive approach is a Bayesian sequential trial. A Bayesian sequential trial might, for example, target a specified strength of evidence (applicable to H1 or H0) using Bayes factors, and continue enrolling participants until that strength of evidence is reached. This method not only allows continued data collection if results are indeterminate, but also permits ending trials earlier with lower sample sizes when effects are larger than expected. Had the original trial taken this approach, data collection would have been allowed to continue until the evidence for QIDS SR-16 was no longer indeterminate. Equally, a trial can be terminated early if there is sufficient evidence of no benefit (i.e., in support of H0), which is often not possible with standard frequentist designs. Bayesian sequential design also obviates problems related to findings that are statistically significant but not clinically significant, as the choice of H1 can be a clinically meaningful difference.

Overall, this article illustrates several of the advantages of Bayesian methods for the design and analysis of clinical trials. Firstly, specific alternative and null hypotheses can be clearly specified as the subject of the analysis. The evidence for these hypotheses can be presented in intuitive, probabilistic terms, or via Bayes factors that provide a quantitative assessment of the strength of one hypothesis over another. When there is limited prior information to go on, as in the case of a psilocybin trial directed at a novel therapeutic indication, Bayesian sequential trials allow a more flexible trial design that may on average save resources while remaining rigorous and principled.
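The stopping rule described above can be sketched with a deliberately simple model. The code assumes each observation is a treatment difference drawn from a normal distribution with known standard deviation `sigma`, tests H0 (difference = 0) against H1 (difference drawn from a normal prior with standard deviation `tau`), and stops once BF10 crosses a threshold in either direction. The model, the function names, the threshold of 10, and the simulated data are all illustrative assumptions, not the design of any actual trial.

```python
import math
import random


def bf10_normal(mean, n, sigma, tau):
    """BF10 for H1: delta ~ N(0, tau^2) versus H0: delta = 0, with known sigma.

    Under H0 the sample mean is N(0, sigma^2/n); under H1 it is
    N(0, sigma^2/n + tau^2). The Bayes factor is the ratio of those densities.
    """
    v0 = sigma ** 2 / n
    v1 = v0 + tau ** 2
    return math.sqrt(v0 / v1) * math.exp(0.5 * mean ** 2 * (1.0 / v0 - 1.0 / v1))


def sequential_trial(true_delta, sigma, tau, threshold=10.0,
                     n_min=10, n_max=500, seed=0):
    """Simulate enrolment until BF10 >= threshold, BF10 <= 1/threshold, or n_max."""
    rng = random.Random(seed)
    total = 0.0
    bf = 1.0
    for n in range(1, n_max + 1):
        total += rng.gauss(true_delta, sigma)  # simulated observation
        if n < n_min:
            continue  # do not evaluate evidence before a minimum sample
        bf = bf10_normal(total / n, n, sigma, tau)
        if bf >= threshold or bf <= 1.0 / threshold:
            return n, bf
    return n_max, bf


# Simulated scenarios only: one with a real effect, one with none.
for label, delta in (("effect present", 0.8), ("no effect", 0.0)):
    n, bf = sequential_trial(true_delta=delta, sigma=1.0, tau=1.0, seed=1)
    print(f"{label}: stopped at n={n} with BF10={bf:.3g}")
```

In this sketch the same rule can end the simulated trial early in favour of either hypothesis, and otherwise keeps enrolling up to `n_max` while the evidence remains indeterminate.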
Given these advantages, we believe Bayesian methods deserve greater use in psychedelic clinical trials in particular and clinical trials in general.

Values of BF10 in the range of 0.33-0.1 can be interpreted as strong evidence for H0, with strength of evidence increasing as numbers approach 0.
BF10 from 0.5-2 are usually considered to be indeterminate, requiring more evidence.
Clinically meaningful superiority refers to a group difference greater than the Minimally Clinically Important Difference (MCID).
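A Bayes factor can also be read as the factor by which the data shift the odds between two hypotheses: posterior odds equal BF10 multiplied by prior odds. The small sketch below converts a BF10 into a posterior probability of H1 under that identity. The function name and the default of equal prior probabilities are assumptions chosen for illustration, and the example values are not results from the trial.

```python
def posterior_prob_h1(bf10, prior_prob_h1=0.5):
    """Posterior probability of H1, from posterior odds = BF10 * prior odds."""
    prior_odds = prior_prob_h1 / (1.0 - prior_prob_h1)
    posterior_odds = bf10 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Illustrative Bayes factors only.
for bf in (0.1, 1.0, 10.0):
    print(f"BF10={bf}: P(H1 | data) = {posterior_prob_h1(bf):.3f}")
```

A BF10 of 1 leaves the prior probability unchanged, which is why values near 1 are treated as indeterminate, while values approaching 0 shift belief toward H0.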

Study Details

Study Type
individual
Population
humans
Characteristics
randomized, re-analysis, double blind, placebo controlled
Journal
Psyarxiv
Compounds
Psilocybin, Placebo

Related Clinical Trial

Completed, Phase II

Psilocybin vs Escitalopram for Major Depressive Disorder: Comparative Mechanisms

Randomised double-blind Phase II trial (n=59) comparing psilocybin versus escitalopram for major depressive disorder, assessing efficacy and mechanisms.

Started: July 1, 2019
Type: interventional
Blinding: double
Randomized: Yes
Registry ID: NCT03429075