This document might be revised in the future. Please note the document date above and check back later for an updated version. Re-load this document in your browser to be sure you’re not viewing an old cached version.

Getting oriented

This tutorial guides you through a Shiny app that puts frequentist and Bayesian analysis side by side.

This tutorial is best viewed in a wide window so the dynamic table of contents (TOC) appears on the left of the text. With the TOC visible, you can click in it to navigate to any section you like. In a narrow window, however, the TOC appears at the top of the screen and disappears when you scroll down.

Core structure of the app

The app is organized as a 2 \(\times\) 2 table: There is one column for frequentist analysis and a second column for Bayesian analysis; there is one row for estimation with uncertainty and a second row for null hypothesis tests. The cells of the 2 \(\times\) 2 table indicate the typical information provided by each type of analysis, as noted in the figure below:

The app’s 2 x 2 table of analyses.

Why this structure?

The framework is explained in the article:
Kruschke, J. K. and Liddell, T. M. (2018). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25, 178-206. DOI: https://doi.org/10.3758/s13423-016-1221-4
This article will be referred to in this tutorial as The Bayesian New Statistics. Readers who are familiar with the layout of analyses in The Bayesian New Statistics will notice that the rows of the app are reversed relative to the table in that article. This reversal is intentional, to emphasize that hypothesis testing is not the default analysis and need not be done at all.

Interactive sliders

The app has lots of interactive sliders with which you can specify aspects of the data and assumptions of the analyses, as suggested by the schematic figure below. The sliders are great for learning what happens to all of the analyses simultaneously. The sliders are also great for focusing on what to think about when translating a real-world situation to analysis assumptions.

Schematic of sliders in the app.

Learning objectives

The interactive sliders provide a natural framework for your learning objectives, and for how to assess your learning. If you have mastered the ideas, you should be able to predict the (qualitative) effect of every slider on every cell in the app, and be able to explain why the effect happens. Moreover, you should be able to apply the analyses to the real world, which means you should be able to translate real-world information to appropriate settings of the sliders. These objectives are summarized in the figure below:

Learning objectives suggested by arrows to and from sliders.

Layout of the app

The previous figures emphasized the structure of the app without displaying all of its details. The actual appearance of the app when first invoked is shown below. The app may have changed since this tutorial was written, but its basic layout should be similar.

Appearance of app when opened.

Notice that the left edge of the screen is where you specify the data. To the right of the data panel, the screen shows aspects of the analyses. The analysis part of the screen is arranged as a 2 \(\times\) 2 table, as described in the previous sections.

Organization of this tutorial

This tutorial steps through all the zones of the app. The figure below shows the app (with all its features displayed), annotated with this tutorial’s sequence of topics and their locations on the app screen.

Sequence of topics in this tutorial.

The sequence of topics is
  1. Data
  2. Analysis model: A normal distribution with its parameters
  3. Frequentist estimation: The maximum likelihood estimate
  4. Bayesian estimation with uncertainty: Posterior distribution on the parameters
  5. Bayesian hypothesis testing: The Bayes factor
  6. Frequentist hypothesis testing: The p value
  7. Frequentist uncertainty: The confidence interval

This tutorial provides many interactive examples en route. The interactive sections are marked by “Try It!” headers.

Opening the Shiny App

You can open the active app in your web browser by clicking this link.

You will need a large window to view the full app. A smart-phone screen will not be comfortable. On a computer screen, you might need to maximize the size of the window in which the app appears. You might also need to reduce your font magnification to see it all at once; if so, use <ctrl>+\(-\) (i.e., the control key with the minus key) to decrease the app size within the window.

Data

Prerequisites: This section assumes you already know the meanings of arithmetic mean, standard deviation, and histogram.

For this app, the data are a single group of metric values. That is to say, the data are numerical values from a scale like height, temperature, or time. To make the values concrete, we will suppose the values are intelligence quotient (IQ) scores from a group of people, and therefore the values are in the general vicinity of 100 with a standard deviation in the vicinity of 15.

The figure below highlights the data panel of the app. The app generates random data values with characteristics specified by you. Within the data panel are three sliders with which you specify the data mean, standard deviation, and sample size (i.e., the number of values in the sample). The sample of data is constructed so it has exactly the mean, standard deviation, and sample size specified by the slider values. A histogram of the data is shown below the sliders.

The data panel is highlighted.
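
For concreteness, below is a minimal R sketch of one way such a sample could be constructed; the app’s own code may differ, and the helper name makeData is purely illustrative.

```r
# A minimal sketch (not the app's actual code): draw arbitrary random values,
# then shift and rescale them so the sample has exactly the requested mean and SD.
makeData <- function(m, s, n) {      # hypothetical helper, not an app function
  x <- rnorm(n)                      # arbitrary random draws
  x <- (x - mean(x)) / sd(x)         # standardize to mean 0, SD 1
  m + s * x                          # shift and rescale to the requested mean and SD
}

y <- makeData(m = 111.0, s = 20.0, n = 18)   # the app's default slider settings
c(mean(y), sd(y))                            # exactly 111 and 20 (up to rounding)
```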

(1) Try It!

In the active version of the app, which you opened via the link at the beginning of this tutorial, try moving the sliders and watch what happens. The examples below guide you through some options.

Delayed Response. When you move the sliders, you will notice a delay in the app’s response. A delay is intentionally built into the app so that it does not respond to every quick, consecutive adjustment; instead, the app waits for the slider position to be stationary before doing the computations. Moreover, the computations churning “under the hood” take a few seconds.

• (1.1) Try sliding the data mean. Find the slider for the data mean. Move it to a different value, such as 85. Watch what happens to the histogram below the sliders; it will change and will show data values near the mean you selected.

In terms of IQ scores, you can imagine that different types of groups would have different means. Perhaps one group is students from an advanced-placement high-school economics course, and so you might think their mean IQ score would be above the general population mean of 100. Or perhaps a group of people from the general population has been given a “smart drug” that is supposed to increase IQ scores (at least while under the influence of the drug). Or perhaps a group has been given a depressant (such as alcohol) that might decrease IQ scores.

In this app, the data are generated randomly but are designed always to have exactly the sample mean and standard deviation selected by the user. In other words, if you set the slider to a mean of 110, then the data will have a sample mean of 110. This feature makes it easy to understand the influence of data properties on downstream analysis results.

• (1.2) Try sliding the data standard deviation. Make the standard deviation bigger or smaller, and watch the width of the histogram change accordingly.

In terms of IQ scores, you can imagine that different types of groups would have different standard deviations. For example, a group of students from a particular advanced-placement high-school economics course might have IQ scores less spread out than scores in the general population. Or consider a group of people from the general population who are taking an IQ test while being watched by other people. Under this social stressor, some people might be facilitated but other people might “choke” and perform less well, and therefore the standard deviation of the group would increase.

• (1.3) Try sliding the data sample size. Try manipulating the slider for the sample size. The histogram will become smoother for larger sample sizes, chunkier for smaller sample sizes.

The default slider settings are a mean of 111.0, a standard deviation of 20.0, and a sample size of 18. You might want to reset the sliders to their default values. An easy way to reset the app is by clicking the re-load button in your browser.

Analysis Model

Data are described by mathematical models

The foundation of data analysis is describing data with mathematical models. You can think of a mathematical model as a machine that generates patterns of random data values. The machine has control knobs that determine details of the pattern of data. The control knobs are called parameters and a knob’s position is called the value of the parameter.

One simple example of such a machine is an ordinary bathroom shower head. Think of each water droplet’s location on the floor as a datum generated by the machine. One “control knob” on the machine is the direction of the shower head. If you point the shower head leftward, the data (i.e., droplets) land on the left. If you set the shower head rightward, the data (i.e., droplets) land on the right. The angle of the shower head is one parameter that controls the location of droplets on the floor. Many shower heads also have a knob that controls the spread of the spray. When set narrow, the droplets land close together on the floor, but when set wide the droplets cover a wide area of the floor. These ideas are illustrated in the figure below. The location knob is labeled μ (mu) and the spread knob is labeled σ (sigma).

A shower head as machine that generates random data.

The bathroom shower head is just one device for generating a pattern of random water droplets. There are others, such as different types of lawn sprinklers. Each type of sprinkler generates a different pattern of droplets, and each type of sprinkler has different control knobs.

Different mathematical models of data are like different shower heads or sprinklers, each with their own control knobs. Each mathematical model generates a particular type of data, and each mathematical model has particular knobs – called parameters – that control the specific details of the pattern of data.

The essential idea of data analysis is that we try to imitate the pattern of data generated by the world with a pattern of data generated by a model. If the model imitates the world’s data well, then we can summarize and describe the data in terms of the model and its parameters. Moreover, we can understand the data in terms of the model and its parameters.

For example, suppose we come across a wet area of the lawn. There are different models for what may have created the wet zone. There’s the oscillating sprinkler, the circular spray sprinkler, and the rotary sprinkler. Each of those models has parameter values including direction and spread. When we infer that the wet zone was made by an oscillating sprinkler set to one side, we are summarizing the data in terms of a particular sprinkler type (the model) and its settings (the parameter values) and we understand the data in those terms.

Using a different metaphor, suppose the data we’re trying to understand are paw prints in the mud. There are different models for what may have created the paw prints. There’s the rabbit model, the cat model, the dog model, and the deer model. And each of those models has parameter values including speed (e.g., walking, running, jumping, galloping) and direction. When we infer that the prints were made by a dog walking north, we are summarizing and describing the data in terms of a dog (the model) and its speed and direction (the parameters) and we understand the data in those terms.

In data analysis, the data are not water marks or paw prints, but are measurements of some sort. For example, IQ scores are numerical values. Such data are modeled by mathematical distributions, not by shower heads or roaming animals.

The model: A normal distribution with parameters μ and σ

The app uses a normal distribution to describe the data. A normal distribution specifies the probability of each possible data value. When plotted, a normal distribution is a symmetric bell-shaped curve, indicating that the data value under the peak of the curve is the most probable, and data values less than or greater than that central value are less probable. The figures below show examples of a normal distribution with different settings of its parameters. Notice that the horizontal axis is data values, and the vertical axis is the probability of those data values. The settings of μ (mu) and σ (sigma) are specified in the title of each plot:

As marked in the plots above, the μ parameter controls the location of the central tendency of the normal distribution, and the σ parameter controls the width of the normal distribution.

Technical digression: Probability density. The vertical axis of the plots above is labelled “probability density.” You can just think of it intuitively as probability, because it’s not crucial for us here to carefully define the mathematical notion of “density.” But if you’d like a very simple explanation of probability density, see Chapter 4 of Doing Bayesian Data Analysis, 2nd Edition.
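
To make the roles of μ and σ concrete, here is a small R sketch that evaluates the normal density over a range of data values; the particular parameter settings are arbitrary illustrations, not necessarily the ones in the app’s figures.

```r
# Sketch: evaluate and plot the normal probability density for a few settings of mu and sigma.
x <- seq(40, 180, length.out = 301)                 # candidate data values (IQ-like scale)
plot(x, dnorm(x, mean = 100, sd = 15), type = "l",
     xlab = "data value", ylab = "probability density")
lines(x, dnorm(x, mean = 120, sd = 15), lty = 2)    # larger mu: the curve shifts to the right
lines(x, dnorm(x, mean = 100, sd = 30), lty = 3)    # larger sigma: the curve is wider and lower
```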

The figure below highlights the region of the app screen that displays the model. The app shows a normal distribution superimposed on the histogram of data:

The analysis model is highlighted: a normal distribution.

The key idea is that we’re using the normal distribution to describe the data. The data are a heap of numbers, like water droplets on the ground. The normal distribution is a model of the data, like a sprinkler that might have made a similar pattern of water droplets. We would like to infer what settings of μ and σ are plausible descriptions of the data. There may be lots of settings of μ and σ that plausibly mimic the data. Statistical methods quantify the plausibility across all possible values of the parameters, so we know not only the most plausible values but also the range of other reasonably plausible values. As we’ll see, the more data we have, the tighter is our estimate of the parameter values.

Pedagogical digression: Why did I use a normal model in the app? In designing the app, one key goal was to illustrate frequentist corrections for multiple tests. This requires a model with at least two parameters (to conduct a test of each parameter). The normal distribution is the most commonly known model with two parameters.

Frequentist estimation

There are lots of settings of μ and σ that may make the normal distribution mimic the data reasonably well, but what values of μ and σ are the best? The classic answer to this question in the frequentist framework is the values of μ and σ that maximize the probability of the data. Another technical term for probability is “likelihood,” and so the values that maximize the probability of the data are called the maximum likelihood estimate or MLE. It turns out that for the normal distribution, the MLE value for μ is just the arithmetic mean of the data, and the MLE value for σ is just the standard deviation of the data.

Technical digression: Formula for \(\sigma_{MLE}\). The MLE for σ uses the formula for standard deviation that divides by \(N\). However, most software packages compute standard deviation using a formula that divides by \(N-1\). Such software packages presume the user wants the “unbiased” estimate of SD, which is a frequentist concept that assumes a fixed-\(N\) sampling distribution for SD. The MLE makes no assumptions about sampling distributions.

Philosophical digression: Is the MLE frequentist? I have chosen to put the MLE in the frequentist column of the app because the MLE is clearly not Bayesian: The MLE does not take into account any prior probabilities of parameter values and the MLE does not refer to the posterior probabilities of the parameter values. On the other hand, the MLE might not be claimed by frequentists either because the MLE does not take into account sampling intentions (as mentioned in the previous paragraph). In particular, the MLE is not always an unbiased estimator of a parameter, where bias is determined by considering a sampling distribution of the estimator. In other words, the notion of bias is inherently frequentist, and the MLE ignores any consideration of bias. I put the MLE in the frequentist column because, if desired, frequentists can derive sampling distributions of MLE values and determine their bias.

To illustrate the MLE of parameter values in the normal distribution, consider a sample of data that has mean of 111.0 and standard deviation of 20.0, shown in the histogram in the figures below. Superimposed on the histogram are normal distributions with different candidate values of μ and σ. Visual inspection reveals that the best mimicry of the data is achieved when μ equals the sample mean and σ equals the sample standard deviation.
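
As a quick numerical check of these claims (and of the divisor-\(N\) point in the technical digression above), here is a small R sketch that constructs a sample like the one just described and maximizes the likelihood numerically. The fitted σ follows the divisor-\(N\) convention, so it is slightly below the sample SD computed with divisor \(N-1\); how the app defines the SD controlled by its slider is a detail of the app itself.

```r
# Sketch: confirm numerically that the MLE of mu is the sample mean and the MLE of
# sigma is the standard deviation computed with a divisor of N (not N - 1).
set.seed(1)
y <- rnorm(18)
y <- 111 + 20 * (y - mean(y)) / sd(y)      # a sample with mean 111 and SD 20 (N-1 divisor)

negLogLik <- function(par) -sum(dnorm(y, mean = par[1], sd = par[2], log = TRUE))
fit <- optim(c(100, 15), negLogLik, method = "L-BFGS-B", lower = c(-Inf, 1e-6))

fit$par                                                 # numerical MLE of mu and sigma
c(mean(y), sd(y) * sqrt((length(y) - 1) / length(y)))   # analytical MLEs: 111 and about 19.4
```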

The figure below highlights where the app displays the values of the MLE:

Frequentist estimation: The maximum likelihood estimate.

(2) Try It!

In the active version of the app, which you opened via the link at the beginning of this tutorial, do the following:

• (2.1) Try moving the slider for the data mean and watch the MLE’s. The data histogram will change, of course. Notice that the MLE of μ changes too! In fact, the MLE of μ always matches the data mean because, as was mentioned above, it turns out that the MLE of μ is the mean of the data. Also watch what happens to the MLE of σ: nothing. Nothing happens to the MLE of σ when you change the data mean because the specification of the data standard deviation has not changed.

• (2.2) Try moving the slider for the data standard deviation and watch the MLE’s. Notice that the MLE of σ changes too! In fact, the MLE of σ always matches the data standard deviation because, as was mentioned above, it turns out that the MLE of σ is the standard deviation of the data. Also watch what happens to the MLE of μ: nothing. Nothing happens to the MLE of μ when you change the data standard deviation because the specification of the data mean has not changed.

• (2.3) Try moving the slider for the data sample size and watch the MLE’s. The sample size does not affect the MLE values of μ and σ because the specification of the data mean and standard deviation have not changed.

The initial slider settings for the data are a mean of 111.0, a standard deviation of 20.0, and a sample size of 18, in case you want to reset to the initial values. An easy way to reset the app is to click the re-load button in your web browser.

Why are no confidence intervals shown? The default display of the app does not show confidence intervals around the MLE; that is, the values are shown as “NA” (not available). Why? Because a confidence interval is tantamount to a hypothesis test: Any value outside the confidence interval is rejected and any value inside the confidence interval is not rejected. In fact, that’s the most coherent and general way to define a confidence interval: a confidence interval is the range of parameter values not rejected by a hypothesis test. Thus, you can’t really specify a confidence interval without testing hypotheses, and the default mode of the app does not test hypotheses. Later in this tutorial we will address hypothesis testing and confidence intervals.
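
To see this equivalence concretely, here is a small R sketch using the familiar one-sample t test for μ as an illustration (the app’s own tests and intervals are discussed later in this tutorial): the usual 95% CI coincides with the set of null values that the test does not reject at α = .05.

```r
# Sketch: a 95% confidence interval for mu is the set of null values mu0 that a
# test at alpha = .05 does NOT reject, illustrated with the one-sample t test.
set.seed(1)
y <- rnorm(18)
y <- 111 + 20 * (y - mean(y)) / sd(y)       # a sample with mean 111 and SD 20

t.test(y)$conf.int                          # the usual 95% CI for mu

mu0Grid <- seq(90, 130, by = 0.5)           # candidate null values to test
notRejected <- sapply(mu0Grid, function(mu0) t.test(y, mu = mu0)$p.value >= 0.05)
range(mu0Grid[notRejected])                 # matches the CI, up to the grid resolution
```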

Bayesian estimation and uncertainty

Bayesian reasoning is just re-allocation of credibility across possibilities. In Bayesian data analysis, the possibilities are values of parameters.

Sherlock Holmes was using Bayesian reasoning when he said, “How often have I told you, when you have eliminated the impossible, whatever remains, no matter how improbable, must be the truth?” Sherlock started with a set of possible explanations for a crime, with each explanation having some initial credibility. He gathered data that made all but one explanation extremely unlikely. He therefore re-allocated credibility to the remaining possible explanation, even though the initial credibility of that explanation may have been small.

We do the same sort of reasoning in Bayesian data analysis. We start with a set of possible parameter values in a model, with initial credibilities of those parameter values. We gather data that make some parameter values more or less credible. We re-allocate credibility to the parameter values that are more consistent with the data, and re-allocate credibility away from parameter values that are less consistent with the data.

In our present application using a normal model, there are two parameters, μ and σ. Each combination of μ and σ values represents a possible description of our set of data. Recall the figures, above, that superimposed different normal distributions onto a histogram of data. Each normal distribution was a specific combination of μ and σ, and some combinations were more consistent with the data than other combinations. The figures showed only three candidate combinations; there are in fact an infinity of other combinations of μ and σ. The set of all combinations of μ and σ is a two-dimensional “parameter space”.

Just like Sherlock starts an investigation with prior credibilities of the possible explanations, in Bayesian analysis we start with prior credibilities of all combinations of the parameters. The prior credibilities are defined mathematically as a probability distribution over the parameter space. The figure below shows where the prior probability distribution is specified in the app:

Bayesian prior probability distribution.

The app shows two small graphs of probability distributions over the parameters μ and σ. Notice that these are probabilities of parameter values; they are not probabilities of data values. The horizontal axis of each plot is labeled with its parameter. The upper plot has μ on its horizontal axis and the title “Prior p(μ)” to indicate that it displays the prior probability of μ values. The lower plot has σ on its horizontal axis and the title “Prior p(σ)” to indicate that it displays the prior probability of σ values.

The prior probability distributions default to being very broad and noncommittal, meaning that a very wide range of values for μ and σ are roughly equally plausible at the beginning. The plots show only a limited range of each distribution, which extends far beyond both sides of the plot window. The breadth implies that the prior distribution has little impact on the subsequent re-allocation of probabilities, as you’ll see below.

Philosophical digression: For Bayesians, the prior is part of the model. In the app layout, the upper left corner shows the normal distribution superimposed on the data histogram. And this tutorial has, so far, described “the model” as being (merely) the normal distribution with its parameters. Bayesians, however, think of the prior distribution as also being part of “the model.” This perspective becomes important later when doing model comparison. When comparing the ability of different models to describe data, the analyst must specify the priors assumed by each model, so the prior becomes an inherent part of the model.

When the data are taken into account, the probabilities of parameter values are re-allocated toward values that are consistent with the data. The re-allocated probabilities are called the posterior distribution. The figure below indicates where in the app the posterior distribution is displayed:

Bayesian posterior distribution.

Within the cell that displays the posterior distribution, the upper plot has μ on its horizontal axis and the title “Posterior p(μ|D)” to indicate that it displays the posterior probability of μ values. The notation “p(μ|D)” means the probability of μ given the data D. The lower plot has σ on its horizontal axis and the title “Posterior p(σ|D)” to indicate that it displays the posterior probability of σ values given the data.

Mode of the posterior distribution: Take a close look at the posterior distributions of μ and σ. You’ll see that they are annotated with their modal value, that is, the value of the parameter that is most probable. The mode of the posterior distribution is one way of defining the “best” estimate of the parameter.

Uncertainty of the posterior distribution: The highest density interval (HDI). One attraction of Bayesian methods is that the posterior distribution inherently reveals the uncertainty of the estimated parameter value. When the posterior distribution is wide, the estimate is uncertain. When the posterior distribution is narrower, the posterior estimate is more certain. One way of quantifying the width of a distribution is with the 95% highest density interval (HDI). Any point inside the HDI has higher probability density than any point outside the HDI, and the 95% HDI covers 95% of the distribution. Thus we can say the 95% HDI includes the 95% most probable or most credible values.
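
Here is a minimal R sketch of how a 95% HDI can be computed from a large sample of representative parameter values, such as the MCMC sample discussed below; for a unimodal distribution the HDI is the narrowest interval that contains 95% of the sample.

```r
# Sketch: 95% highest density interval (HDI) from a sample of representative values.
hdiOfSample <- function(samp, credMass = 0.95) {
  sorted <- sort(samp)
  n <- length(sorted)
  width <- ceiling(credMass * n)             # number of sample points inside the interval
  lows  <- sorted[1:(n - width + 1)]         # candidate lower limits
  highs <- sorted[width:n]                   # corresponding upper limits
  i <- which.min(highs - lows)               # choose the narrowest candidate interval
  c(lower = lows[i], upper = highs[i])
}

postSample <- rnorm(20000, mean = 111, sd = 5)   # stand-in for an MCMC sample of mu
hdiOfSample(postSample)                          # roughly 111 +/- 1.96 * 5
```

Because the interval is computed from a random sample, re-running the last two lines gives slightly different endpoints; that is the same kind of wobble discussed below for the app’s MCMC output.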

(3) Try It!

For these exercises, leave the prior distributions broad (as they are by default when the app is invoked).

• (3.1) Slide the data mean to a different value and watch the posterior mode of μ. The posterior mode of μ should be close to the data mean. This makes sense because credible values of μ should be near the data mean when there is not strong prior information to suggest otherwise.

• (3.2) Slide the data standard deviation to a different value and watch the posterior mode of σ. The posterior mode of σ should be close to the data standard deviation. This makes sense because credible values of σ should be near the data SD when there is not strong prior information to suggest otherwise.

• (3.3) Slide the data standard deviation to a different value and watch the posterior HDI of μ. When the data SD gets smaller, the HDI on μ gets smaller. Does that make sense? Answer: Yes, because when there is less “noise” in the data, our estimate of μ becomes more stable and more precise.

• (3.4) Slide the data sample size to different values and watch the width of the posterior HDIs. When the sample size is smaller, the HDI’s are wider. When the sample size is larger, the HDI’s are narrower. This makes intuitive sense: With more data, we are more certain about the estimated values of the parameters.

The displayed numerical values of the mode and HDI limits are a bit wobbly. This wobble happens because “under the hood” the posterior distribution is represented by an algorithm called Markov chain Monte Carlo (MCMC). MCMC represents a probability distribution over a parameter with an enormous random sample of values from that distribution. MCMC merely makes a picture of the probability distribution, with higher resolution the longer it runs. The app runs MCMC just long enough to get reasonably stable numerical values. For research applications, the MCMC can be run as long as desired to get a more stable and highly accurate representation of the posterior distribution. For our purposes all you need to know about MCMC is that it creates a picture of the posterior distribution, and its numerical values have some wobble. You can learn more about MCMC in Chapter 7 of Doing Bayesian Data Analysis, 2nd Edition.

(4) Try It!

• (4.1) Notice MCMC wobble in the results. As mentioned above, the algorithm running under the hood uses MCMC methods to represent the posterior distribution, thereby creating random wobble in the numerical details of the posterior distribution. To get a feeling for how much MCMC wobble there is, try the following. First note the current numerical details of the mode and HDI for the posterior distribution of μ. Then quickly slide the Data Sample Size to any other value and back to its starting value. The app will re-run its analysis because it has detected slider movement, even though there is actually no change in the data mean, standard deviation, and sample size. Notice in the posterior distribution of μ there are slightly different numerical values for the mode and HDI. The underlying posterior distribution is unchanged; only its MCMC representation has wobble.

There are some applications in which you may have strong prior knowledge about the likely values of the parameters. For instance, suppose you are measuring IQ scores and you know you are sampling from the general population which, by definition of IQ scores, should have a μ near 100 and a σ near 15. You might want to take into account the prior knowledge when estimating μ and σ for your new sample. Therefore you would set the prior distribution on μ to be centered on 100.0 and very narrow to reflect your prior certainty, and you would set the prior distribution on σ to be centered on 15.0 and very narrow to reflect your prior certainty. What do you think the posterior distribution will look like? Will the posterior distribution depend on the amount of data?

(5) Try It!

To make this set of experiments most dramatic, start with these settings:
  • Set the data mean to 120 (which is much different than 100).
  • Set the data standard deviation to 25 (which is much different than 15).
  • Set the data sample size to 5 (which is small).

With the prior distribution broad, notice the form of the posterior distribution. The modal estimates of μ and σ are near the data mean and SD, but the posterior distributions have wide HDI’s because the sample size is small.

• (5.1) Find the pair of sliders for the μ Prior Mode & SD. Leave the mode slider at 100.0, but move the SD slider down to 3. This change will make the prior distribution of μ be narrow around 100. What is the consequence for the posterior distribution of μ? Answer: The posterior mode is now much closer to the prior mode of 100 than it was when the prior was broad, and the posterior distribution of μ is also much narrower. In other words, the strong prior knowledge dominates the small amount of new information from the data.

• (5.2) Find the pair of sliders for the σ Prior Mode & SD. Leave the mode slider at 15.0, but move the SD slider down to 1. This change will make the prior distribution of σ be narrow around 15. What is the consequence for the posterior distribution of σ? Answer: The posterior mode is now closer to the prior mode of 15 than it was when the prior was broad, and the posterior distribution is also much narrower. In other words, the strong prior knowledge dominates the small amount of new information from the data.

• (5.3) Change the data sample size to its largest possible value. For this app, the slider for sample size only goes up to a modest value, so the largest possible value is not really very big. Nevertheless, you can see that with a larger sample, the posterior distributions for μ and σ move away from the prior modes closer to the data mean and SD. In other words, even strong priors can be overcome by sufficient amounts of data.

While manipulating the prior distribution you may have noticed there was no change of the MLE in the frequentist column. This is because the maximum likelihood estimate does not take into account prior knowledge.
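
The prior-data trade-off explored in the exercises above can be sketched in a few lines of R by treating σ as known, so that the normal-normal conjugate formulas apply. The app estimates σ as well and uses MCMC, so its numbers will differ; this sketch only shows the qualitative pattern, and the sample size of 100 is just for illustration (the app’s slider tops out at a more modest value).

```r
# Sketch of the prior-data trade-off for mu, assuming sigma is known.
posteriorOfMu <- function(dataMean, sigma, n, priorMean, priorSD) {
  postPrecision <- 1 / priorSD^2 + n / sigma^2        # precisions (1 / variance) add
  postMean <- (priorMean / priorSD^2 + n * dataMean / sigma^2) / postPrecision
  c(mean = postMean, sd = sqrt(1 / postPrecision))
}

posteriorOfMu(120, 25, n = 5,   priorMean = 100, priorSD = 3)   # narrow prior, small N: posterior hugs the prior
posteriorOfMu(120, 25, n = 100, priorMean = 100, priorSD = 3)   # more data: posterior moves toward the data mean
posteriorOfMu(120, 25, n = 5,   priorMean = 100, priorSD = 90)  # broad prior: posterior sits near the data mean
```

In every case the MLE of μ is simply the data mean (120 here), untouched by the prior.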

(6) Try It!

• (6.1) Try implementing some strong prior knowledge. Suppose you’re investigating the short-term effect of alcohol consumption on IQ test performance. In many studies you’ve had volunteers drawn at random from the general population consume three alcoholic drinks in an hour and then take an IQ test. Suppose sample after sample has shown a mean near 90 (a fictitious number). Now you’re collecting data from a new alcohol-dosed sample and you want to use your prior knowledge to help estimate the mean of the new sample. Use the sliders to set the prior on μ so it’s centered on 90 and is fairly narrow to reflect your prior certainty. Play around with different settings of the data to see the effect of the strong prior knowledge on the posterior estimate of μ. Notice, however, there is no effect of your strong prior knowledge on the MLE.

• (6.2) Change the prior SD’s of μ and σ back to large values such as 90. You’ll see that the posterior modes are now very close to the data mean and SD. Play around with values for the SD’s of the priors. You’ll see that as long as the prior is fairly broad relative to the scale of the data, the posterior is virtually unaffected by exactly how broad the prior is.

After playing around with the sliders, if you want to re-set the default values, simply re-load the page in your browser window.

The posterior distribution inherently reveals the uncertainty of the parameter estimates. The 95% HDI is one way of quantifying the uncertainty. In the frequentist column, however, again notice that there is no confidence interval (CI) displayed. The absence of CI is the result of not (yet) doing any hypothesis tests. The CI is the range of parameter values not rejected by a hypothesis test, and without hypothesis testing there is no CI. (CI’s defined by coverage are tantamount to, and less general than, CI’s defined by tests.)

For further explanation and discussion, see Bayesian Data Analysis for Newcomers.

Decision using ROPE and HDI

Sometimes researchers are interested in assessing particular candidate parameter values, to decide if the candidate parameter value can be rejected as implausible or even accepted as a good-enough representation of the data. For example, consider investigating a new “smart drug” and wanting to know if the mean of the test sample is clearly above the general population mean of 100. You should check if all the most credible values of μ are sufficiently far above 100 for you to be sure that there is a clinically important effect. That is, checking all the most credible values is about being sure, and checking that they are sufficiently far above 100 is about clinical importance.

One way to specify clinical importance is by defining a region of practical equivalence (ROPE) around the target value of the parameter. The ROPE specifies values that are only negligibly different from the target value. For example, with IQ scores, we might say that IQ scores within 2 points of 100 are negligibly different from 100. Therefore, for us to declare that a sample mean is non-negligibly different from 100, we need to know that all the most credible values of μ are at least 2 points away from 100.

This reasoning leads to the following decision rule: If the 95% HDI falls outside the ROPE around the target value, then reject the target value — it’s not a credible description of the data because the 95% most credible values are all non-negligibly different from the target value. On the other hand, the target value can be accepted for practical purposes if the 95% HDI falls completely inside the ROPE, because all the 95% most credible values are only negligibly different from the target value. If the HDI falls partially inside the ROPE and partially outside the ROPE, then no decision about the target value can be made.

The app does not display ROPEs and does not implement the HDI+ROPE decision rule. But it is easy for you to do visually; just specify your ROPE and check where the HDI falls relative to your ROPE. (The ROPE is a decision threshold, and the app avoids options for setting decision thresholds because the app layout would get even more cluttered. The only decision threshold explicitly used by the app is the conventional threshold for frequentist p values, which we will encounter later.)
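
Here is a minimal R sketch of the HDI+ROPE decision rule described above, applied to hypothetical HDI limits; the app itself does not compute this, so you would read the HDI limits off the app’s display and supply your own ROPE.

```r
# Sketch of the HDI+ROPE decision rule; hdi and rope are each (lower, upper) limits.
hdiRopeDecision <- function(hdi, rope) {
  if (hdi[2] < rope[1] || hdi[1] > rope[2]) {
    "reject the target value"                          # 95% HDI falls entirely outside the ROPE
  } else if (hdi[1] >= rope[1] && hdi[2] <= rope[2]) {
    "accept the target value for practical purposes"   # 95% HDI falls entirely inside the ROPE
  } else {
    "no decision"                                      # HDI straddles a ROPE limit
  }
}

# Hypothetical HDIs, with a ROPE of 100 +/- 2 IQ points around the null value:
hdiRopeDecision(hdi = c(104.3, 118.2), rope = c(98, 102))   # "reject the target value"
hdiRopeDecision(hdi = c(99.0, 110.5),  rope = c(98, 102))   # "no decision"
```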

The HDI+ROPE decision rule is not called “hypothesis testing”, which is terminology reserved for different techniques discussed in subsequent sections. To learn more about the HDI+ROPE decision rule, see the article:
Kruschke, J. K. (2018). Rejecting or Accepting Parameter Values in Bayesian Estimation, Advances in Methods and Practices in Psychological Science, 1, 270-280. DOI: https://doi.org/10.1177/2515245918771304
with important Supplementary Material here: https://osf.io/fchdr/

Hypothesis Testing

It’s not necessary to test hypotheses, and the app defaults to not displaying hypothesis tests. There’s a classic book titled, “What if there were no significance tests?” (Harlow, L. L., Mulaik, S. A., & Steiger, J. H.; Routledge, 2016; originally Psychology Press, 1997). The app’s default view shows what a world without significance tests would look like!

But if you want to test hypotheses about the parameters, you can specify which parameter values you want to test. The bottom row of the app holds the controls and results of hypothesis testing, as shown in the figure below:

Controls for selecting hypotheses to test.

(7) Try It!

• (7.1) Click the buttons under “Null Hyp. Test” to select which parameters you want to test. You’ll see corresponding results panels appear under the frequentist and Bayesian columns. You’ll also see values appear for the confidence intervals (CIs) for frequentist uncertainty.

All of the details will be explained in subsequent sections of this tutorial. In the next section, we’ll examine Bayesian hypothesis testing. After that we’ll consider frequentist hypothesis testing, and then frequentist confidence intervals.

Bayesian hypothesis testing

Like Bayesian parameter estimation, Bayesian hypothesis testing involves re-allocation of credibility across possibilities. But for hypothesis testing the possibilities are different hypotheses about the parameters. One hypothesis restricts a parameter to its “null” value, and the hypothesis is called a “null hypothesis”. Another hypothesis allows all parameters to range more widely, and this hypothesis is called the “alternative” hypothesis. Bayesian hypothesis testing re-allocates credibility across the null hypothesis and the alternative hypothesis.

To make the ideas concrete, suppose we administer a “smart drug” to a group of people and then have them take an IQ test. If there were no effect of the smart drug on the central tendency of IQ scores, then the mean IQ of the group should be the same as the general population mean of 100.0. In other words, the null value of μ is 100 and the null hypothesis claims that μ is set at 100. But the null hypothesis about μ doesn’t care about the value of σ and puts a broad prior on σ. The null hypothesis for μ therefore puts a prior distribution on the <μ,σ> parameter space that looks like a sharp ridge as shown in the left panel below:

Priors for Bayesian hypothesis test of μ.

The ridge in the left panel above indicates that only the null value of μ has non-zero probability in the prior distribution, while a broad range of σ values has non-zero prior probability.

The right panel above shows the prior distribution for the alternative hypothesis. That prior is broadly spread out over both parameters, indicating that lots of values of μ have non-zero prior probability.

(The app does not show plots like the figure above because there is not enough screen space. The app does not show the null-hypothesis prior distribution at all. The app does show aspects of the alternative-hypothesis prior, but not in perspective like the figure above. The app shows the marginal distributions of μ and σ in the alternative hypothesis.)

In Bayesian hypothesis testing, the two priors in the figure above are conceived as alternative models. Each model has a prior probability. For example, if we think that the null hypothesis is unlikely, that is, if we think that the smart drug really will have an effect, then we might say that the null-hypothesis model has only a 20% prior probability and the alternative-hypothesis model has an 80% prior probability. A display of the entire prior probability distribution, across model indices and parameters within models, is shown below:

Priors for Bayesian hypothesis test of μ with prior probabilities of the hypotheses.

In the figure above, the upper panel displays the probabilities of the model indices (in this case labeled “Null” and “Alternative”). The model-index variable is itself a parameter in an overarching model of the data. The model-index variable has values that correspond to each model, and there is a probability distribution over those values, as displayed in the upper panel of the figure above.

When data are taken into account, Bayesian reasoning re-allocates probabilities across all the parameters in the overarching model simultaneously. Probabilities are shifted across μ and σ parameters within the alternative-hypothesis model, and across parameters within the null-hypothesis model, and across the model-index parameter.

In Bayesian hypothesis testing, the focus is on the model-index parameter: How much do the probabilities on the model indices shift? If the data are not consistent with the null-hypothesis prior, then probability shifts toward the alternative-hypothesis index. Interestingly, if the data are consistent with the null-hypothesis prior, then probability shifts toward the null-hypothesis index.

The degree of shift across hypotheses, from prior probabilities to posterior probabilities, is called the Bayes factor. Usually the Bayes factor is framed as a shift toward the null hypothesis away from the alternative hypothesis, and is denoted BFnull. When BFnull is greater than 1.0 there has been a shift of probability toward the null hypothesis away from the alternative hypothesis, and when BFnull is less than 1.0 there has been a shift of probability away from the null hypothesis toward the alternative hypothesis. (The Bayes factor is always greater than zero.)

(8) Try It!

• (8.1) In the app, under the header Null Hyp. Test specify your Test Intention by selecting the button for “mu only”. After clicking the button, new information appears in the bottom row of the app.

Bayesian hypothesis testing with ‘mu only’ selected.

  • Under the header “Null Hyp. Test” there appears a slider for specifying the null value of μ, which is denoted μ0. It defaults to μ0=100.0, but you can set it to other values.

  • In the Frequentist column there appears information for frequentist null hypothesis significance tests. We’ll cover this information later.

  • In the Bayesian column there appears information about the Bayes factor. In particular, the panel displays μ: BFnull = …. As mentioned before, when BFnull>1.0 then credibility has shifted toward the null, but when BFnull<1.0 then credibility has shifted away from the null. The numerical value of the BFnull has MCMC wobble.

  • In the Bayesian column there also appears a slider for specifying the prior probability of the null hypothesis. The prior-probability distribution is called a null “model” to distinguish it from the value of μ0 in the null hypothesis. Beneath the information about the BFnull is shown the posterior probability of the null model. When BFnull>1.0 then the posterior probability of the null model is greater than its prior probability; when BFnull<1.0 then the posterior probability of the null model is less than its prior probability.

(9) Try It!

Manipulate the Data sliders and watch what happens to the BFnull’s and posterior probabilities of the models.

• (9.1) Manipulate the Data Mean; observe effect on BFnull for μ0. When the data mean is close to the value of μ0, then BFnull for μ0 gets large (greater than 1.0). When the data mean is far from the value of μ0, then BFnull for μ0 gets small (less than 1.0). In other words, when the data mean is near the null value then the BF suggests a shift of credibility toward the null hypothesis, but when the data mean is far from the null value then the BF suggests a shift of credibility toward the alternative hypothesis.

• (9.2) Manipulate the Data SD; observe effect on BFnull for σ0. Again, when the data SD is near the null value σ0 then the BF suggests a shift of credibility toward the null hypothesis, but when the data SD is far from the null value σ0 then the BF suggests a shift of credibility toward the alternative hypothesis.

• (9.3) Manipulate the Data Mean; observe the effect on BFnull for μ0 when the data SD is large vs when the data SD is small. When the data SD is small, that is when there is less noise in the data, there is a bigger change in the BF when the data mean is changed.

(10) Try It!

Manipulate the Null Hyp Values and watch what happens to the BFnull’s and posterior probabilities of the models.

• (10.1) Manipulate the Null Hyp μ0 Value; observe effect on BFnull for μ0. When the value of μ0 is close to the data mean, then BFnull for μ0 gets large (greater than 1.0). When the value of μ0 is far from the data mean, then BFnull for μ0 gets small (less than 1.0).

• (10.2) Manipulate the Null Hyp σ0 Value; observe effect on BFnull for σ0. The analogous statement is true for the relation of the data SD and the BFnull for σ0.

(11) Try It!

• (11.1) Move the slider for μ Null Model Prior Prob. and notice what happens to the posterior probability of the null model: If the prior probability of the null model is increased, its posterior probability goes up.

• (11.2) Notice also that changing the prior probability of the null model does not change the Bayes factor. The Bayes factor determines how much the model probabilities shift from prior to posterior.

• (11.3) Notice what happens if you set the prior probability to 0.0 or to 1.0: If the prior probability is 0.0 the posterior probability is 0.0, and if the prior probability is 1.0 then the posterior probability is 1.0.

This paragraph provides a little more mathematical detail about the Bayes factor. You can skip this paragraph without loss of dignity. Denote the prior for the null hypothesis as \(H_{null}\). In other words, \(H_{null}\) is the label for the ridge-shaped prior distribution, \(p_{null}(\mu,\sigma)\), shown in the left side of the figure above. Denote the alternative hypothesis as \(H_{alt}\). In other words, \(H_{alt}\) is the label for the broad prior distribution, \(p_{alt}(\mu,\sigma)\), shown in the right side of the figure above. The prior probability of the null hypothesis is denoted \(p(H_{null})\). Consequently the prior probability of the alternative hypothesis is \(p(H_{alt}) = 1 - p(H_{null})\). With all that notation established, we can define the Bayes factor exactly by the following relationship between prior odds and posterior odds: \[{ p(H_{null}|D) \over p(H_{alt}|D) } = BF_{null} \,\, { p(H_{null}) \over p(H_{alt}) } \] The BFnull indicates how much the prior odds shift to get to the posterior odds. Notice that if the prior odds are 50/50 then BFnull equals the posterior odds. But if the prior odds are not 50/50, then BFnull does not equal the posterior odds. For further explanation and discussion, see Bayesian Data Analysis for Newcomers.
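
The odds relation above is easy to apply in practice. Here is a small R sketch that converts a BFnull and a prior probability of the null model into the posterior probability of the null model; the second example anticipates the distracted-test-takers exercise later in this tutorial, where a BFnull of roughly 5 combined with a prior probability of 0.05 yields a posterior probability of only about 0.2.

```r
# Sketch: posterior probability of the null model from BFnull and its prior probability.
postProbNull <- function(bfNull, priorProbNull) {
  priorOdds <- priorProbNull / (1 - priorProbNull)
  postOdds  <- bfNull * priorOdds               # posterior odds = BFnull * prior odds
  postOdds / (1 + postOdds)                     # convert the odds back to a probability
}

postProbNull(bfNull = 5, priorProbNull = 0.50)  # 50/50 prior: posterior prob of null about 0.83
postProbNull(bfNull = 5, priorProbNull = 0.05)  # skeptical prior: posterior prob of null about 0.21
```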

Technical digression: Bayesian hypothesis testing is a special case of model comparison. As mentioned in a previous digression, the prior distribution is considered to be an inherent part of the model. In hypothesis testing, we’re comparing two different priors, and therefore this is a case of model comparison. We’re comparing a null-hypothesis model against an alternative-hypothesis model, and the hypotheses are expressed by the shapes of the prior distributions. For more info, see this blog post, or Bayesian Data Analysis for Newcomers.

The BF is sensitive to the breadth of the alternative prior

The Bayes factor is very sensitive to the breadth of the alternative prior. Essentially, the wider the alternative prior, the less will be its posterior probability. When the alternative prior is made wider, the credibility of values in its midrange goes down; that is, the credibilities are getting spread out and diluted across a wider range of parameter values. Therefore all those values have effectively smaller prior probability, and the alternative prior is penalized for being spread out.
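
This sensitivity can be sketched in R by treating σ as known, so that the Savage-Dickey density ratio BFnull = p(μ0 | D, alt) / p(μ0 | alt) has a closed form. This is a simplification of the app’s model (which also estimates σ and uses MCMC), so the numbers will differ from the app’s, but the qualitative pattern is the same: broadening the alternative prior pushes BFnull upward.

```r
# Sketch of BFnull's sensitivity to the breadth of the alternative prior on mu,
# assuming sigma is known so that the Savage-Dickey ratio has a closed form.
bfNullSavageDickey <- function(dataMean, sigma, n, mu0, priorMean, priorSD) {
  postPrecision <- 1 / priorSD^2 + n / sigma^2
  postMean <- (priorMean / priorSD^2 + n * dataMean / sigma^2) / postPrecision
  dnorm(mu0, postMean, sqrt(1 / postPrecision)) / dnorm(mu0, priorMean, priorSD)
}

# Default-like data (mean 111, SD 20, N = 18) with null value mu0 = 100:
bfNullSavageDickey(111, 20, 18, mu0 = 100, priorMean = 100, priorSD = 30)    # below 1: favors the alternative
bfNullSavageDickey(111, 20, 18, mu0 = 100, priorMean = 100, priorSD = 300)   # well above 1: favors the null
```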

(12) Try It!

In the upper right of the app, where the prior on μ is specified, locate the slider for the SD of μ. It’s the second slider under “μ: Prior Mode & SD”.

  • (12.1) Set the prior SD at 30. Note the resulting BF, also note the HDI on μ. If all the other values are set at their defaults, you’ll find that the BF is approximately 0.8 (subject to MCMC wobble), slightly favoring the alternative hypothesis. And the 95% HDI on μ is approximately 100 to 122.
  • (12.2) Set the prior SD at 300. Note the resulting BF, also note the HDI on μ. If all the other values are set at their defaults, you’ll find that the BF is now approximately 6.0 (subject to MCMC wobble), favoring the null hypothesis. But the 95% HDI on μ is virtually unchanged, still approximately 100 to 122.

Because the BF is so sensitive to the breadth of the alternative prior, if you use a BF you must be very careful to use an alternative prior that is (i) theoretically meaningful, (ii) well justified, and (iii) checked for sensitivity when changed. When used appropriately and carefully the BF approach can be meaningfully informative, but when used thoughtlessly with default priors the BF approach can be misleading. For further discussion of cautions when using the BF approach, see Bayesian Data Analysis for Newcomers.

Deciding to reject or accept the null hypothesis

In some applications, analysts might want to decide to “accept” or “reject” a hypothesis. In the Bayesian framework, accepting one hypothesis means rejecting the other hypothesis, so the decision to accept or reject a null hypothesis is always relative to the specific alternative hypothesis under consideration. You can’t simply reject or accept a null hypothesis in isolation; you can only reject or accept it relative to a specific alternative.

What criterion should be used to accept or reject a hypothesis? The obvious sensible answer is to consider the posterior probability of the hypothesis. If the posterior probability of the hypothesis is sufficiently high (relative to the posterior probability of the other hypothesis) then the hypothesis should be accepted and the other hypothesis rejected. For example, we might set the decision threshold at 0.95, so the decision rule would be: If the posterior probability of a hypothesis exceeds 0.95, accept it and reject the other hypothesis.

But some practitioners use the BF to make a decision instead of the posterior probability. According to this decision rule, if BFnull is greater than some critical threshold C (such as 3 or 10), then accept the null hypothesis and reject the alternative hypothesis, and if BFnull is less than 1/C, then reject the null and accept the alternative.

The problem with using the BF to make a decision is that the BF ignores the prior and posterior probabilities of the hypotheses (which makes the Bayes factor essentially non-Bayesian and misleadingly named). As you noted in a previous Try It! exercise, when the prior probability of the null hypothesis is smaller, its posterior probability is also smaller. Ignoring the prior and posterior probabilities of the hypotheses can be a serious blunder.

(13) Try It!

Suppose we test the IQ scores of a small group of people, drawn from the general population, when they are exposed to a very distracting environment (e.g., loud TV commercials) and cognitive load (e.g., having to remember different 7-digit random numbers throughout solving every question on the test). Presumably it would be very difficult to perform up to one’s potential in such circumstances. Therefore, even though the people were drawn from the general population, there is small prior probability that their mean score is the general population mean of 100. In other words, we should set the prior probability of the null hypothesis, μ=100, to a small value, such as 0.05. Let’s set up this scenario in the app.

In the Data panel: set the Data Mean to 90.0 (to reflect a small drop in IQ scores from 100 due to distraction), keep the Data SD at 20.0, and set the Data Sample Size to 5 (i.e., a small sample).

Set the Null Hyp. μ Value at its default value of 100.0.

Set the alternative hypothesis priors at their default broad values; that is, μ prior mode of 100 and SD of 90, σ prior mode of 15 and SD of 90.

Importantly, set the μ Null Model Prior Prob to 0.05, to reflect the small prior probability of the null hypothesis.

(13.1) What is the resulting BFnull for μ?

Answer: Approximately 5 (subject to MCMC wobble). That means there is a moderate shift of credibility away from the alternative hypothesis toward the null hypothesis that μ=100. That is, the BF suggests we should accept the null hypothesis.

(13.2) But what is the posterior probability of the null hypothesis?

Answer: Approximately 0.2 (subject to MCMC wobble). That is, the posterior probability of the null hypothesis is still much less than the posterior probability of the alternative hypothesis, suggesting we should not accept the null hypothesis.
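
Using the odds relation from the earlier mathematical digression, this result can be checked by hand (taking BFnull \(\approx\) 5, subject to MCMC wobble): \[ { p(H_{null}|D) \over p(H_{alt}|D) } \approx 5 \times { 0.05 \over 0.95 } \approx 0.26 , \qquad p(H_{null}|D) \approx { 0.26 \over 1 + 0.26 } \approx 0.21 \]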

In summary, when doing Bayesian hypothesis testing: Make sure the alternative prior distribution is theoretically meaningful and well justified, and check that the results are robust against changes in the alternative prior. Moreover, don’t just use the BF to make a decision, consider the prior probabilities of the hypotheses and use the posterior probabilities of the hypotheses to make a decision. When all of these steps are done thoughtfully, Bayesian hypothesis testing can be very informative. But mindlessly using default settings for BF’s can be perilous.

For further explanation and discussion, see Bayesian Data Analysis for Newcomers.

Bayesian hypothesis test of σ

In many applications, the primary interest might be the value of σ. Consider IQ scores of people put under some sort of stress, such as being carefully watched by spectators while taking the IQ test. Some test takers might be socially facilitated by such attention while other test takers might be bothered and distracted. In other words, the standard deviation of the scores might increase relative to the general population SD of 15. We might want to test if the estimated σ in the data is different than the null hypothesis value of 15.0.

All the logic previously explained for testing μ applies analogously to testing σ. Instead of a “ridge” null hypothesis on μ, we consider a “ridge” null hypothesis on σ, as shown in the diagram below:

Priors for Bayesian hypothesis test of σ.

The ridge-shaped prior distribution for σ means that only a specific value of σ has non-zero prior probability while a broad range of μ values is possible.

(14) Try It!

Under Null Hyp Test, for Test Intention click “sigma only”. In the left column you’ll see a slider for specifying the value of σ0, and in the right column you’ll see information about the BF for the null hypothesis on σ along with its prior and posterior probabilities.

In the Data panel, set Data Mean to 100.0, set Data SD to 21.0 (to reflect the larger SD when socially stressed), and set Data Sample Size to 30. Leave everything else at their default values (including the Null Hyp σ0 value at 15.0).

What is the result? For σ, the BFnull is approximately 0.2 (subject to MCMC wobble), and the posterior probability of the null hypothesis will be lower than its prior probability. Notice also that the posterior estimate of σ (displayed in the Estimate-with-Uncertainty row of the app) has a 95% HDI that clearly excludes the null value of σ.

You can also test null hypotheses of both μ and σ. Under Test Intention click the button for “both mu and sigma”. This merely displays the information for both tests, and nothing in the Bayesian computations changes when you test hypotheses on both parameters. (In principle there could be a null hypothesis prior shaped like a “spike” at <μ0,σ0> with zero prior probability at all other combinations of μ and σ, but the app does not implement this possibility.) As will be explained later, in the frequentist framework computations do change when you test more than one hypothesis.

Comparing Bayesian estimation and Bayesian hypothesis testing

Bayesian parameter estimation and Bayesian hypothesis testing are both mathematically correct procedures for re-allocating credibility across possibilities. They differ only in what possibilities they focus on. Bayesian hypothesis testing focuses on two prior distributions over the parameter space and asks which prior distribution better accommodates the data. Bayesian parameter estimation focuses on a non-null prior distribution and asks what the posterior distribution looks like when the data are taken into account. The two procedures focus on different information, from which different conclusions might arise.

Importantly, the two procedures are differently sensitive to different details of the prior distribution. Bayesian parameter estimation tends to be very robust under changes in the breadth of the prior distribution: As long as the prior is broad relative to the posterior, the exact breadth of the prior is immaterial. Bayesian hypothesis testing, on the other hand, can be very sensitive to the breadth of the alternative-hypothesis prior. Moreover, Bayesian hypothesis testing should use the prior (and posterior) probabilities of the models themselves, which are not used by Bayesian estimation.

When it comes to making a decision about null values, a decision from using the HDI+ROPE with the parameter estimate, and a decision from using the BF in hypothesis testing, will often agree with each other. Sometimes, in borderline cases, they will disagree. Such disagreements are not a logical paradox; they are merely a consequence of using different decision procedures on different information.

(15) Try It!

Set the sliders to the default values of the app: just click the re-load button on the browser window of the app, and after it reloads click the “mu and sigma” button under Null Hyp Test.

Consider the cell for Bayesian estimation, which shows the posterior distribution on the parameters. The mode of the posterior distribution on μ is near the sample mean (i.e., near 111), and the mode of the posterior distribution on σ is near the sample SD (i.e., near 20). But because the sample size is modest (only 18), the 95% HDI’s are fairly wide. The posterior distribution on the parameters is richly informative because it provides the complete distribution of credibilities of all parameter values.

Suppose we wanted to make a decision about null values by using the HDI+ROPE procedure. The lower bound of the 95% HDI on μ is near μ0=100, and the lower bound of the 95% HDI on σ is near σ0=15. Therefore if we put a modest ROPE around the null values, say +/- 1.5 IQ points, we would not decide to reject the null values because the 95% HDI does not fall outside the ROPE.

Consider the cell for Bayesian hypothesis testing. It shows that the BFnull for μ is around 1.8 (subject to MCMC wobble), and the BFnull for σ is close to 1.5 (subject to MCMC wobble), which means that credibility shifts a little toward the null-hypothesis prior away from the alternative-hypothesis prior. This shift toward the null hypothesis occurs despite the fact that the most credible value of μ is about 111 (not 100) and the most credible value of σ is about 20 (not 15).

• (15.1) Now make the prior distributions broader and watch what happens to the parameter estimate and to the BF’s. Set the SD of μ Prior to 300 and set the SD of σ Prior to 300. You’ll see that the posterior distribution mode and HDI hardly change, but the BF’s for both μ and σ have increased to about 5, meaning that now the null hypotheses are more strongly preferred.

• (15.2) Now make the prior distributions a bit narrower but still reasonably broad and watch what happens to the estimates and to the BF’s. Set the SD of μ Prior to 30 and set the SD of σ Prior to 30. You’ll see that the posterior distribution mode and HDI hardly change, but the BF’s for both μ and σ have decreased to about 0.7, meaning that now the null hypotheses are not preferred.

The previous two examples merely point out that if you’re going to do Bayesian hypothesis testing, you must specify the prior distribution carefully and meaningfully.

• (15.3) An application with an informed prior. Suppose a manufacturer of a “smart drug” claims that its product raises IQ scores by a certain amount. We will collect new data of our own to assess the manufacturer’s claim.

The manufacturer has conducted big in-house studies and has a fairly precise estimate of μ, namely μ=110 with a SD of 5. We will set the prior distribution to represent this information: in the upper right cell of the app, set sliders for μ Prior Mode & SD to 110 and 5.

Suppose we are skeptical that smart drugs can really increase IQ, perhaps because a lot of previous claims by other producers have been debunked. Therefore we will set the prior probability of the null hypothesis (μ0=100) to 0.95, reflecting our prior belief that there’s only a 1 in 20 chance that a smart drug could increase IQ that much. In the lower-right cell, set the slider for μ Null Model Prior Prob to 0.95.

Suppose we administer the smart drug to our own sample of people from the general population, with sample size of 50. Suppose our results show a data mean of 107 and a data SD of 17. Set the data sliders appropriately!

The posterior distribution on μ shows a mode a little greater than the sample mean of 107; this is the influence of the informed prior on the estimate. But the modal estimate of μ is also less than the manufacturer’s prior belief. This posterior distribution shows what the manufacturer should believe about the μ value in light of the new data. In other words, this is the result of starting with the manufacturer’s prior beliefs and re-allocating credibility based on our new data.

In the lower right cell, the BFnull for μ is approximately 0.1, which suggests a notable shift away from the null hypothesis, toward the manufacturer’s hypothesis. But because of our strong prior belief in the null hypothesis, the posterior probability of the null hypothesis remains middling. In other words, even though the BF shifts credibility toward the manufacturer’s hypothesis, the posterior probability of the manufacturer’s hypothesis is not very high.
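Taking the displayed values at face value, the arithmetic behind this “middling” posterior probability is the same posterior-odds relation as before:

\[ \frac{p(\text{null} \mid D)}{p(\text{alt} \mid D)} \;\approx\; 0.1 \times \frac{0.95}{0.05} \;=\; 1.9, \qquad p(\text{null} \mid D) \;\approx\; \frac{1.9}{1+1.9} \;\approx\; 0.66 . \]

So even though the BF shifts credibility toward the manufacturer’s hypothesis by roughly a factor of ten, the skeptical prior leaves about two thirds of the posterior probability on the null hypothesis.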

Frequentist hypothesis testing

We now focus on the column of the app headed Frequentist. The frequentist approach asks a very different question than the Bayesian approach. Whereas the Bayesian approach asks about the allocation of credibility across possible parameter values given the observed data, the frequentist approach asks about the probability of getting different possible data values given a hypothesized parameter value.

Because the frequentist approach depends on hypothesizing particular parameter values, the app does not show much information in the frequentist column until particular hypothesis tests are specified along with their hypothesized parameter values.

Suppose you randomly sample some people from the general population and give them a “smart drug” and then measure their IQ. Suppose it turns out that their mean score is 111, apparently well above the general population mean of 100. Is this batch of high IQ scores definitely attributable to the smart drug, or could this sample of high IQ’s have happened merely by luck of the draw in a random sample, with no effect of the smart drug at all? If there is a high probability that a random sample would show high IQ scores by chance alone, then we would not want to attribute an effect to the drug. But if there is a very low probability that a random sample would show IQ scores as extreme as what we observed, then we might reject the idea that the smart-drug sample had unchanged scores.

The term “frequentist” derives, loosely speaking, from the emphasis on how frequently there would be data as or more extreme than what was observed if a particular hypothesis were true.

Sampling distribution of a summary statistic

To address the frequentist questions, we must consider how often different types of random samples would occur if drawn from a particular hypothetical world. The idea is that we simulate data from a hypothetical world, sampled the same way we intended to collect data from the real world.

Suppose we draw a random sample of people from the general population and measure their IQ. Their scores will tend to be around 100, because we’ve hypothesized that they come from the general population, which by definition of IQ has a mean of 100 and a standard deviation of 15. If we repeatedly sample from the population, sometimes the sample will, by chance, have a bunch of high scores, and other times the sample will, by chance, have a bunch of low scores. For every sample we draw from the population, we compute a summary statistic such as the arithmetic mean, and then we consider the distribution of that summary statistic across all the different random samples. The result is the distribution of the summary statistic of the sample, or just “sampling distribution” for short.

What summary statistic should we use? We want the summary statistic to indicate how far the sample deviates from what we’d expect to get from the hypothesis. If we’re considering the central tendency of the sample (as opposed to its variance) then a natural measure is the difference between the sample mean and the hypothesized mean, denoted \(m_x - \mu_0\), where \(m_x\) is the mean of sample \(x\) and \(\mu_0\) is the mean of the null hypothesis. However, an infelicity of that measure is that its magnitude depends on the scale of the data. For example, if we are measuring temperatures and we switch from Fahrenheit to Celsius, the numerical value of \(m_x - \mu_0\) changes even though the actual temperatures haven’t changed. Therefore it’s convenient to consider the number of standard deviations between the sample mean and the hypothesized mean, because that number doesn’t change when the measurement scale changes. This value is called the effect size (or “Cohen’s d”), and is denoted \[d_x = (m_x - \mu_0) / s_x\] where \(s_x\) is the standard deviation of sample \(x\).

Technical digression: Effect size and the t statistic. In the definition of effect size, \(s_x\) is computed using \(N-1\). The effect size \(d\) is very similar to the traditional t statistic: \(t = d \cdot \sqrt{N}\). The app uses effect size instead of t for technical reasons when later encountering non-fixed sample sizes. For fixed sample sizes, p values from \(d\) are the same as p values from \(t\).

To create a sampling distribution of effect size, we start with a specific hypothesis about the scores. Suppose the scores come from the general population and are normally distributed with \(\mu_0 = 100.0\) and \(\sigma_0 = 15.0\). Then we randomly draw some scores and compute \(d_x = (m_x - \mu_0) / s_x\) for the sample, and we record the sample’s \(d_x\). Then draw another random sample and compute and record its \(d_x\). Then do it again, and again, and again. Ultimately we make a histogram of all the samples’ effect sizes, and the result is the sampling distribution of the effect size.

The figure below illustrates the process of creating a sampling distribution for the effect size, \(d\). At the left of the figure is the hypothesis: A normal distribution of scores with mean of μ and standard deviation of σ. From that hypothetical distribution we draw random samples, and for each sample compute the value of the effect size, \(d_{hyp}\). The symbol for the effect size is given a subscript “hyp” to indicate that these values come from the hypothetical world, not from the real world. The process of drawing an infinite number of random samples is indicated in the middle of the figure below. Finally, at the right of the figure below, we make a smoothed histogram of the infinite set of \(d_{hyp}\) values from the samples. This is called the sampling distribution of \(d_{hyp}\). Notice its horizontal axis is labelled with “\(d_{hyp}\)”. The sampling distribution of \(d_{hyp}\) is centered at zero, because sometimes the random samples have a mean above μ and other times the random samples have a mean below μ.


Process for creating a sampling distribution of dhyp.
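To make the simulation concrete, here is a minimal R sketch of the same process (this is not the app’s actual code; the hypothesized values, the sample size of 18, and the 20,000 replications are illustrative choices):

```r
# Simulate the sampling distribution of the effect size d_hyp under a
# hypothesized normal world with mean mu0 and standard deviation sigma0.
set.seed(1)
mu0    <- 100     # hypothesized mean
sigma0 <- 15      # hypothesized SD
N      <- 18      # fixed sample size (illustrative)
reps   <- 20000   # stand-in for the "infinite" set of simulated samples

d_hyp <- replicate(reps, {
  x <- rnorm(N, mean = mu0, sd = sigma0)   # one simulated sample
  (mean(x) - mu0) / sd(x)                  # its effect size d (sd uses N-1)
})

hist(d_hyp, breaks = 60, xlab = "d_hyp",
     main = "Sampling distribution of d_hyp")
mean(d_hyp)   # near zero, as described above
```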

Definition of p value

Let’s return now to the example that motivated this venture into sampling distributions. In the example, a sample of people from the general population was given a “smart drug” and then they took IQ tests. It turned out that their mean score was 111, apparently well above the general population mean of 100. Could this sample of high IQ’s have happened merely by luck of the draw in a random sample, with no effect of the smart drug at all? If there is a high probability that a random sample would show high IQ scores by chance alone, then we would not want to attribute an effect to the drug. But if there is a very low probability that a random sample would show IQ scores as extreme as what we observed, then we might reject the idea that the smart-drug sample had unchanged scores.

We now formalize this reasoning in terms of effect size. Suppose that the sample standard deviation is \(s_{obs} = 20.0\). The effect size of the observed data, relative to the null hypothesis of \(\mu_0 = 100.0\), is \(d_{obs} = (m_{obs} - \mu_0 ) / s_{obs} = (111-100)/20 = 0.55\). The question is, how often would random data from the null hypothesis show an effect size as or more extreme than the observed effect size? The question is answered by the sampling distribution of the effect size: We simply see how much of the sampling distribution is more extreme than \(d_{obs}\). If the probability of being more extreme than \(d_{obs}\) is very small, we reject the notion that the observed data came from the hypothesized distribution.

The probability that the effect size from the null hypothesis is as or more extreme than the observed effect size is called the p value of the effect size, and is denoted in the app as p(dhyp≽dobs|μ0,intent), which can be read aloud as “the probability that a random effect size from the hypothesis is as or more extreme than the observed effect size, given the hypothesized value of μ and the intended sampling and testing procedure.” The role of the intended sampling and testing procedure will be explained later.

The p value defined here is the “two-tailed” p value, because it counts departures of mobs from μ0 in either direction (greater than or less than) as evidence against μ0. In other words, μ0 can be rejected by mobs being too far above or too far below, so we count both tails of the sampling distribution. The p value displayed by the app is the two-tailed p value.
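For a concrete calculation, the following self-contained R sketch (again, not the app’s code) approximates the two-tailed p value for the running example with \(d_{obs} = 0.55\); the sample size of 18 is an illustrative assumption:

```r
# Two-tailed p value for d_obs, estimated from a simulated sampling
# distribution of d_hyp under the null hypothesis mu0 = 100.
set.seed(2)
mu0 <- 100; sigma0 <- 15   # sigma0 is immaterial here: d is scale-free
N <- 18; reps <- 20000

d_hyp <- replicate(reps, {
  x <- rnorm(N, mu0, sigma0)
  (mean(x) - mu0) / sd(x)
})

d_obs <- (111 - 100) / 20                        # 0.55
p_two_tailed <- mean(abs(d_hyp) >= abs(d_obs))   # count both tails
p_two_tailed                                     # roughly .03 for these settings
```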

For further discussion of sampling distributions and p values, see The Bayesian New Statistics.

Decisions with p values

Traditionally, the p value is used to make a decision about the hypothesis. When the p value is small enough, the hypothesis is rejected because it probably could not have generated the observed summary statistic. But how small of a p value is small enough to decide to reject? In the social sciences, the conventional criterion is p < .05. When p < .05 the hypothesis is rejected and the p value is sometimes marked with an asterisk to indicate that it marks an effect that is “significantly different from the hypothesized value” or just “significant” for short. When p is not less than .05, it is marked “n.s.” for “not significant.”

On the other hand, when p ≥ .05 the hypothesis is not rejected and, importantly, it is also not accepted. The hypothesized value is not accepted because there are lots of other values of μ that could also have generated the data. This realization, that there are lots of μ values that are not rejected, leads to the notion of confidence interval, which will be explained later.

The .05 criterion is merely conventional. The physical sciences, for example, routinely use a stricter criterion called “five sigma,” which requires p < .0000003 to decide an effect is significant.

The choice of decision threshold is usually linked to one’s tolerance for a false alarm, that is, deciding to reject a null value even when it’s true. (The technical term for this is a Type I error.) Notice that the decision threshold for p marks the false-alarm rate a person is willing to tolerate. For example, if the decision threshold is .05, then 5% of the time we’ll reject the null value even when the data were generated from the null hypothesis. This allowed false-alarm rate is denoted α (alpha), and is conventionally set at 5% in the social sciences. The app displays the alpha level used for deciding significance.
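As a quick check on the meaning of α, the following R sketch (not part of the app) simulates many samples from a world in which the null hypothesis is exactly true and shows that p falls below .05 in roughly 5% of samples. It uses t.test because, as noted in the technical digression above, p values from \(d\) and from \(t\) agree for fixed sample sizes:

```r
# False-alarm rate: when the null is true, p < .05 about 5% of the time.
set.seed(3)
mu0 <- 100; sigma0 <- 15; N <- 18; reps <- 5000

p_vals <- replicate(reps, {
  x <- rnorm(N, mu0, sigma0)      # data generated from the null hypothesis
  t.test(x, mu = mu0)$p.value     # two-tailed p value
})

mean(p_vals < 0.05)   # close to 0.05, the allowed false-alarm rate (alpha)
```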

The figure below highlights where the app displays p and α values:

The panel that displays p and α values is highlighted.

(16) Try It!

In the app, under Null Hyp Test, click the button for “mu only”. In the Frequentist column, you’ll see displayed the \(p\) value for the effect size.

• (16.1) Manipulate the Data Mean and watch \(p\). Move the slider for the Data Mean to the same value as the Null Hyp μ0 Value. For example, if the Null Hyp μ0 Value is set to 100, then slide the Data Mean to 100. Notice that the p value becomes 1. Does that make sense? Answer: Yes, because when the data mean equals the hypothesized mean, the observed effect size is zero, and there is 100% probability that an effect size from the hypothesis will have magnitude at least zero. Now move the Data Mean slider higher or lower than the Null Hyp μ0 Value. You’ll notice that the p value gets smaller. At some point, p will drop below .05 and the p value will be marked with an asterisk instead of n.s.

• (16.2) Manipulate the Data Standard Deviation and watch \(p\) (for the mean). Set the Null Hyp μ0 Value to 100. Set the Data Mean to 110, and the Data Sample Size to 18. Leave those values fixed, and slide the Data SD to different values. For example, when Data SD is set to 15, then the p value for μ is less than .05, but when Data SD is set to 25, then the p value for μ is greater than .05. Does that make sense? Answer: Yes, because when there is more “noise” in the data values, it’s easier for the sample mean to land far above or below μ0 by chance alone.

• (16.3) Manipulate \(\mu_0\) and watch \(p\). Set the data values to something reasonable and leave them fixed there for now; for instance set Data Mean to 111, set Data SD to 20, and set Data Sample Size to 18. Now slide Null Hyp μ0 Value to various values and monitor the p value for μ0. You’ll see that when μ0 is near the data mean, then p is large and μ0 cannot be rejected. But when μ0 is sufficiently far from the data mean, then p is less than .05 and μ0 can be rejected. The range of parameter values that cannot be rejected is called the “confidence interval,” as will be discussed later.

Test of standard deviation

There are times when the focus of research is on the variance of the scores, not their central tendency. As has been mentioned previously, consider IQ scores of people put under some sort of stress, such as being carefully watched by spectators while taking the IQ test. Some test takers might be socially facilitated by such attention while other test takers might be bothered and distracted. In other words, the standard deviation of the scores might increase relative to the general population SD of 15. We might want to test if we can reject the null hypothesis value of σ0=15.0.

For testing the magnitude of the sample standard deviation, we consider its magnitude relative to the hypothesized standard deviation. For reasons of mathematical convenience, the values are squared and we consider the variance ratio \(vr_x = s_x^2 / \sigma_{0}^2\).
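For example, with a sample SD of 20 (the value used in other exercises of this tutorial) and a hypothesized \(\sigma_0 = 15\),

\[ vr_x \;=\; \frac{s_x^2}{\sigma_0^2} \;=\; \frac{20^2}{15^2} \;\approx\; 1.78 . \]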

Technical digression regarding the \(\chi^2\) statistic. The variance ratio is very similar to the traditional \(\chi^2\) statistic: \(\chi^2 = vr \cdot (N-1)\). The app uses variance ratio instead of \(\chi^2\) for technical reasons when later encountering non-fixed sample sizes. For fixed sample sizes, p values from \(vr\) are the same as p values from \(\chi^2\).

The figure below shows the process for generating a sampling distribution of the variance ratio. It starts on the left side with a hypothesis for where the data came from. Notice the hypothesis includes both μ and σ values. Then random samples are generated repeatedly from the hypothesis. For each simulated sample of data, the variance ratio \(vr\) is computed. This is the only change from the procedure illustrated previously for generating a sampling distribution of the effect size; instead of computing \(d_{hyp}\) for each simulated sample we compute \(vr_{hyp}\). Finally, at the right of the figure below, we see the smoothed histogram of the \(vr_{hyp}\) values. This is the sampling distribution of \(vr_{hyp}\).


Process for creating a sampling distribution of vrhyp.

To compute a p value for a hypothesized σ0, the app determines how much of the sampling distribution of \(vr_{hyp}\) is more extreme than the observed \(vr_{obs}\). (The app computes the two-tail p value, i.e., the extreme tail value multiplied by 2.) The app displays the p value as p(vrhyp≽vrobs|σ0,intent).
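Here is a minimal R sketch of that computation (not the app’s code; the data SD of 20 and the sample size of 18 are illustrative):

```r
# Two-tailed p value for the observed variance ratio, estimated from a
# simulated sampling distribution of vr_hyp under the null sigma0 = 15.
set.seed(4)
mu0 <- 100; sigma0 <- 15; N <- 18; reps <- 20000

vr_hyp <- replicate(reps, {
  x <- rnorm(N, mu0, sigma0)
  var(x) / sigma0^2              # s_x^2 / sigma0^2 (var uses N-1)
})

vr_obs <- 20^2 / 15^2
# "extreme tail value multiplied by 2", capped at 1:
p_two_tailed <- min(1, 2 * min(mean(vr_hyp >= vr_obs),
                               mean(vr_hyp <= vr_obs)))
p_two_tailed   # compare with the value displayed by the app
```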

(17) Try It!

In the app, under Null Hyp Test, click the button for “sigma only”. In the Frequentist column, you’ll see displayed the \(p\) value for the variance ratio.

• (17.1) Manipulate the Data Standard Deviation and watch \(p\). Slide the Data SD to different values, leaving Null Hyp σ0 Value at 15. You’ll see that when the data SD is sufficiently far from σ0, then p is small and σ0 can be rejected. You might notice that when the data SD equals σ0, the p value is large but not 1. This is because the sampling distribution for the variance ratio is skewed, and the variance ratio must be slightly less than 1.0 to land exactly halfway up the sampling distribution, with 50% above and 50% below.

• (17.2) Manipulate the Data Mean and watch \(p\) for σ. Leave the Data SD fixed at some value such as 20, and leave Null Hyp σ0 Value fixed at some value such as 15. Now slide around the value of the Data Mean and watch what happens to \(p\) for σ. You’ll see that nothing happens. That’s because the variance ratio does not depend on the sample mean, so the \(p\) value for the variance ratio does not depend on the sample mean.

• (17.3) Manipulate \(\sigma_0\) and watch \(p\). Set the data values to something reasonable and leave them fixed there for now; for instance set Data Mean to 111, set Data SD to 20, and set Data Sample Size to 18. Now slide Null Hyp σ0 Value to various values and monitor the p value for σ0. You’ll see that when σ0 is near the data SD, then p is large and σ0 cannot be rejected. But when σ0 is sufficiently far from the data SD, then p is less than .05 and σ0 can be rejected. The range of parameter values that cannot be rejected is called the “confidence interval,” as will be discussed later.

Testing intention affects decision threshold for p value

There are many research situations in which interesting questions can be asked of multiple parameters. For example, consider the effect of a “smart drug” on IQ scores. It would be interesting to find out how much the central tendency changes, and it would also be interesting to find out how much the variance changes. An earnest researcher would want to test two hypotheses, one about the mean and another about the variance.

With each test there is an opportunity for committing a false alarm. That is, when testing μ0 there is an opportunity for falsely rejecting the null hypothesis when it is true, and when testing σ0 there is also an opportunity for falsely rejecting the null hypothesis when it is true. When testing both hypotheses, the overall opportunity for committing a false alarm has increased.

Recall from the earlier section about decision thresholds for p values that the decision threshold (i.e., the α level) is set at one’s tolerance for false alarms. When making multiple tests, researchers usually want to control the overall false alarm rate. Therefore the α level for each test is “corrected” to take into account the full set of tests intended to be done. The exact form of the correction depends on the structural relationships of the specific tests. (In the present situation the two tests are independent, so the app uses the Sidak correction.)
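For \(K\) independent tests with overall false-alarm rate \(\alpha\), the Sidak-corrected per-test threshold is

\[ \alpha_{\text{per test}} \;=\; 1 - (1-\alpha)^{1/K}, \qquad \text{e.g.,} \quad 1 - (1-0.05)^{1/2} \approx 0.0253 \]

for the two tests here, which the app displays (at its limited precision) as the corrected alpha of 0.025 seen in the next Try It! exercise.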

Notice that the overall false-alarm rate depends only on the set of tests the researcher intends to conduct. The false-alarm rate is only about the hypotheses and their implied sampling distributions; the false-alarm rate has nothing to do with the actual data. Thus, the overall false-alarm rate does not depend on the data in any way; in fact the data don’t even have to be collected yet. Therefore you cannot look at the data to decide which tests you intend to conduct; the intention is supposed to be formed a priori from theoretical interests.

(18) Try It!

Set the sliders to the default values of the app: just click the re-load button on the browser window of the app.

Be sure to click the “mu and sigma” button under Null Hyp Test. In the Frequentist column, you’ll see displayed the \(p\) values for both the effect size and the variance ratio.

• (18.1) Change from “both mu and sigma” to “mu only” or “sigma only” and watch what happens to alpha (and * vs n.s.). You’ll see that when both tests are selected, the output says “corrected alpha = 0.025” instead of “alpha = 0.05”. You’ll also see that because the p values for both tests are between 0.025 and 0.05, they are significant for a single test but not significant when conducting both tests.

• (18.2) Change the intended tests and watch what happens to the Bayes factors. You’ll see that nothing happens to the Bayes factors, or anything else in the Bayesian column. This is because the Bayesian analysis does not use sampling distributions, and therefore doesn’t consider false alarm rates.

Stopping intention affects p value

When creating a sampling distribution, each sample of simulated data is supposed to mimic the way the real data were collected. That’s the whole point of the sampling distribution: To show what data would have looked like if they were collected just like the real data were collected but the null hypothesis were true.

Sometimes when data are collected, there is a pre-decided sample size. The researcher might decide in advance to collect data from exactly \(N=25\) people. Assuming all the resulting data can be kept, then the sampling distribution should be based on simulated samples all of which have exactly \(N=25\) scores.

But relatively little research actually has a predetermined sample size. In much research that starts with a fixed sample size, some random number of data points can’t be used because of procedural errors or other issues that disqualify particular data. In many cases of survey research, a large number of surveys are mailed out but only some random number of people actually respond. And in a variety of applications, researchers might recruit people by advertising available testing sessions at particular times, and some random number of people actually show up.

In all situations in which the sample size was not fixed in advance, the sampling distribution should be generated with simulated data that have random sample sizes that mimic the random sample sizes of the real data. Doing this properly requires having a way to mimic the random sample sizes, and because this can be challenging it is rarely done.

The app implements one simple situation with random sample sizes. Suppose we are conducting a survey at the exit of a busy grocery store and ask people to participate as they walk out of the store. We decide in advance that we will stand at the exit for exactly one hour. Therefore the number of participants we recruit depends on the rate of willing participants per hour. For example, suppose the underlying rate of willing participants is 23 per hour; the actual number who volunteer during any given hour is some random number near 23. We might get 18 participants, or 28, or some other number. A typical mathematical distribution for describing random counts is a Poisson distribution. The app has an option for generating the sampling distribution with random \(N\) based on a Poisson distribution with specified rate.
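A minimal R sketch of the random-N procedure is below (not the app’s code; in particular, how the app handles the rare draws of N < 2 is not specified here, so this sketch simply redraws until at least two scores are obtained):

```r
# Sampling distribution of d_hyp when the sample size is random,
# N ~ Poisson(rate), mimicking a "collect for one hour" stopping intention.
set.seed(5)
mu0 <- 100; sigma0 <- 15
rate <- 23        # typical number of volunteers per hour (illustrative)
reps <- 20000

d_hyp <- replicate(reps, {
  N <- rpois(1, lambda = rate)
  while (N < 2) N <- rpois(1, lambda = rate)   # need >= 2 scores for sd()
  x <- rnorm(N, mu0, sigma0)
  (mean(x) - mu0) / sd(x)
})

d_obs <- (111 - 100) / 20
mean(abs(d_hyp) >= abs(d_obs))   # p value under this stopping intention
```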

(19) Try It!

Set the sliders to the default values of the app: just click the re-load button on the browser window of the app.

In the row for Null Hyp Test, be sure to click the Test Intention button for “mu and sigma”. At the top of the Frequentist column, there are two buttons under Stop Intention. The default button is “fixed N”.

• (19.1) Switch the Stop Intention from “fixed N” to “random N (Poisson)” and watch what happens to the p values. You’ll see that the p values change. Thus, p values depend on the intended reason for stopping collection of data.

• (19.2) With the Stop Intention set at “random N (Poisson)”, slide the Poisson Typical N to different values and watch what happens to the p values. When the typical N is set larger, the p values are smaller. Why does that happen? The answer takes a few steps to explain. First, the specification of Typical N is the sample size you would expect to get when conducting the research over and over, so it’s specifying the typical N for simulated samples in the sampling distribution. Second, and this concept is key: when N is larger, the sampling distribution becomes narrower. When the sample size is large, random noise tends to cancel, and therefore \(m_x\) tends to more closely reflect \(\mu_0\) and \(s_x\) tends to more closely reflect \(\sigma_0\). Therefore the effect size \((m_x-\mu_0)/s_x\) tends to be close to zero and the variance ratio \(s_x^2/\sigma_0^2\) tends to be close to one. Third, when the sampling distribution is narrower, the observed effect size and variance ratio are sitting farther out in the tails of their sampling distributions, hence their p values are smaller. (The small sketch after this list illustrates the narrowing.)

• (19.3) Suppose a manufacturer of a “smart drug” collected volunteers for an on-the-spot test at the exit of a busy grocery store. They happened to get 18 volunteers during the hour, with data as in the default settings of the app. Suppose they analyze the data as if the sample size were fixed at N=18. You can find the corresponding p values with the app. Now suppose it turns out that their sample size was extremely lucky, and the typical N is 8. Analyze the data in the app, setting the “Poisson Typical N” to 8. Are the p values significant? Now instead suppose that their sample size was terribly unlikely, and the typical N is 28. Re-analyze with the “Poisson Typical N” at 28. Are the p values significant?

• (19.4) With the Stop Intention set at “random N (Poisson)”, slide the Poisson Typical N to equal the sample size of the data, then switch back and forth between “fixed N” and “random N (Poisson)”. When the “Poisson Typical N” equals the observed sample size, the p value is slightly bigger than for fixed N of the same size. The reason is that the increase in p for smaller sample sizes is slightly bigger than the decrease in p for larger sample sizes. In other words, the benefits of increasing sample size are negatively accelerating (i.e., the benefits of increasing sample size decelerate).
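The small R sketch below (not the app’s code) illustrates the key step in (19.2): the sampling distribution of \(d_{hyp}\) narrows as the sample size grows.

```r
# The width of the sampling distribution of d_hyp shrinks as N increases.
set.seed(6)
mu0 <- 100; sigma0 <- 15; reps <- 20000

d_width <- function(N) {
  d <- replicate(reps, {
    x <- rnorm(N, mu0, sigma0)
    (mean(x) - mu0) / sd(x)
  })
  sd(d)   # spread of the sampling distribution
}

d_width(10)    # wider
d_width(100)   # narrower
```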

For further discussion of stopping intentions in sampling distributions, see The Bayesian New Statistics.

Frequentist uncertainty: The confidence interval (CI)

In the Bayesian framework, uncertainty is inherently represented by the posterior distribution over the parameters. In the frequentist framework there is no such representation, and so other concepts must be created to represent uncertainty.

In the frequentist framework, uncertainty is quantified as the confidence interval, which is the range of parameter values not rejected. This concept has been mentioned a few times in previous Try It! exercises. The idea is simply to conduct a hypothesis test on every value of the parameter, not only the “null” value of the parameter. All those parameter values that are rejected are outside the confidence interval. Any value not rejected is inside the confidence interval.
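Here is a minimal R sketch of this “invert the test” construction for μ (not the app’s code). It assumes a fixed-N, single-test intention, for which the p value from the effect size equals the familiar t-test p value, so p can be computed analytically; the data values are the defaults used in earlier exercises:

```r
# 95% CI for mu as the set of candidate mu0 values NOT rejected at alpha = .05.
m <- 111      # data mean
s <- 20       # data SD
N <- 18       # data sample size
alpha <- 0.05

p_for <- function(mu0) {
  t_stat <- (m - mu0) / (s / sqrt(N))
  2 * pt(-abs(t_stat), df = N - 1)   # two-tailed p value (fixed N, one test)
}

mu0_grid     <- seq(80, 140, by = 0.01)               # candidate null values
not_rejected <- mu0_grid[sapply(mu0_grid, p_for) >= alpha]
range(not_rejected)   # approximate 95% CI limits for mu
```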

Because the decision to reject is based on the overall alpha level, we need to indicate what alpha level we’re using when we indicate a confidence interval. When the overall alpha level is 5%, we call the corresponding confidence interval the 95% confidence interval. (In general, if the overall false alarm rate is limited at α, the corresponding confidence interval is the 1-α confidence interval.)

Notice that because the CI is defined in terms of hypothesis tests, there is no CI for parameters that aren’t being tested. Importantly, expressing a CI for a parameter is tantamount to doing a test on the parameter because any value outside the 95% CI is automatically rejected at p<.05. This correspondence of CI and hypothesis test is why the app only shows CI’s for parameters that have tests specified for them.

Technical digression for readers who are used to CI’s being defined in terms of coverage: The most general definition of CI is as the parameter values not rejected. This definition always works, and the resulting \((1-\alpha)100\)% intervals have the property that they cover the generating parameter value \((1-\alpha)100\)% of the time. On the other hand, trying to define CI’s in terms of coverage leads to difficult problems of construction and uniqueness for a number of models.

Because CI’s are defined as the parameter values not rejected, and rejection is affected by testing and stopping intentions, CI’s are affected by testing and stopping intentions. In particular, when there are multiple tests and it gets “harder” to reject individual parameters, there are more parameter values not rejected and hence CI’s get wider. And when the stopping intention is a random N (not fixed N), the p values of candidate parameter values depend on the typical N, and the CI changes accordingly.

The figure below highlights where the app displays the CI’s:

The panel that displays confidence intervals is highlighted.

Notice in the display of the CI there is no distribution shown. This is because there is no distribution. Parameter values inside the CI are not rejected, while parameter values outside the CI are rejected. Many people are tempted to “see” a distribution over the confidence interval that suggests the credibility of parameter values, but that is a Bayesian concept delivered by the Bayesian method one column to the right.

(20) Try It!

• (20.1) Set the value of Null Hyp μ0 Value to anything inside the CI limits. Notice p is greater than (or equal to) 0.05. Those values inside the CI are not rejected.

• (20.2) Set the value of Null Hyp μ0 Value to anything outside the CI limits. Notice p is less than 0.05. Those values outside the CI are rejected.

• (20.3) Slide the value of Null Hyp σ0 Value inside and outside its CI limits. Notice p ≥ .05 and is marked “n.s.” when σ0 is inside the CI, but p < .05 and is marked “*” when σ0 is outside the CI.

Did you actually try the exercises in the previous Try It! section? If not, you should, because it illustrates the definition of confidence interval.

(21) Try It!

• (21.1) Change the testing intentions, and watch the CI’s change. Switch from “mu only” to “mu and sigma”, watch the change in CI on μ. Switch from “sigma only” to “mu and sigma”, watch the change in CI on σ. Do the CI’s get wider or narrower when more tests are intended?

• (21.2) Change the stopping intention, and watch the CI’s change. Switch from “fixed N” to “random N (Poisson)”. Use different settings of Poisson Typical N. The CI’s will change accordingly. Do the CI’s get wider or narrower when the Typical N is increased?

• (21.3) Increase the Data Sample Size, and watch the CI’s get narrower. When the sample size increases, the sampling distributions become narrower, as was described in a previous section.

• (21.4) Slide the Data Mean higher, and watch the CI on μ increase with it. This makes sense: μ values near the data mean should not be rejected.

• (21.5) Slide the Data SD higher, and watch the CI on σ increase with it. This makes sense: σ values near the data SD should not be rejected.

• (21.6) Slide the Data SD higher, and watch the CI on μ get wider. This makes sense: noisier data implies less confident estimates of parameters.

Which analysis when?

The app juxtaposes Bayesian and frequentist analyses so you can (i) see the difference in information they deliver and (ii) interactively experience their different assumptions and dependencies. Putting the two approaches side by side helps to highlight exactly what each approach does by showing how the other approach addresses the questions differently.

Consider parameter estimation, in the middle row of the app. The information delivered by the frequentist approach is the MLE of the parameters and the two limits of the CI. The CI limits depend on the testing and stopping intentions of the analyst. On the other hand, the information delivered by the Bayesian approach is a complete posterior distribution over the parameter space, which can be summarized by the modal values of the parameters and the HDI. The posterior distribution is not affected by testing or stopping intentions. The posterior distribution is essentially unaffected by any broad prior, but the posterior distribution can be affected by the prior distribution if it is strongly informed (i.e., very narrow). The frequentist MLE and CI do not take into account any prior information.

When to use estimation, when to use hypothesis testing

There are cases in which the question of primary interest is a discrete decision about a parameter value or about a pair of hypotheses. In those cases, obviously, hypothesis testing can provide an answer. The danger of hypothesis testing is that it is far too easy for people to slip into the fallacies of “black and white thinking.” In the frequentist framework, people easily slip from not rejecting the null hypothesis to accepting the null hypothesis. In either the frequentist or the Bayesian framework, people easily remember the decision (to reject, not to reject, or to accept a null hypothesis) and easily forget the magnitude and uncertainty of the result. For example, a trivially small magnitude effect can be “significant” when N is large (because there is great certainty about the small magnitude), and a large magnitude effect can be not significant when N is small (because there is wide uncertainty about the magnitude of the effect).

The results of hypothesis testing do not reveal anything about the magnitude or uncertainty of the parameter. Therefore, avoid hypothesis testing if it’s really not central to the research. If hypothesis testing is done, always also include results from estimation with uncertainty, and try not to commit fallacies of black and white thinking when presenting the results.

When to use Bayesian, when to use frequentist

As is evident from this app and the many examples presented in this tutorial, frequentist and Bayesian analyses provide different information. Which is more appropriate to use when? The answer is that it depends on what question you want to ask. Are you asking what parameter values and models are most credible given the data? To answer that question, look at the Bayesian analysis. Are you asking about error rates for imaginary data from hypothetical worlds? To answer that question, look at the frequentist analysis. However, if you are asking about error rates, you are inherently obliged to consider testing and stopping intentions, which leads down a rabbit hole that many researchers are loath to go down. I think that most researchers, most of the time, are intuitively seeking to know about the credibility of parameter values in their models. The Bayesian framework is the natural way to address this yearning.

An extended discussion of these issues is provided in The Bayesian New Statistics.

Mastery of learning objectives

As described at the beginning of this tutorial, if you have mastered the ideas, you should be able to predict the (qualitative) effect of every slider on every cell in the app, and be able to explain why the effect happens. That also means knowing why some sliders do not affect some aspects of the results. Moreover, you should be able to apply the analyses to the real world, which means you should be able to translate real-world information to appropriate settings of the sliders. These objectives are summarized in the figure below (repeated from earlier in this tutorial). Have you achieved mastery?


Learning objectives suggested by arrows to and from sliders.

Next steps

Pedagogical tool

The app is intended as a pedagogical tool, not as a general purpose tool for research use. The app’s underlying computations are (hopefully!) completely accurate, but the sliders provide only limited control and the output shows only limited precision.

Reproducible analysis

Real research analysis should be reproducible. That is, the analyst should be able to reproduce the analysis in the future, as should other analysts. Reproducibility comes from well documented scripts that include thorough explanatory notes and the exact computer code and data files for the analysis. The app does not provide a script for reproducible analysis; it does not record the settings of the sliders or why you set them that way.

Cogent and replicable findings

Like any analysis software, the app does not protect you from poorly collected data. Suppose your sample of data has a particular mean, standard deviation, and sample size and you set the sliders accordingly. If your data are not representative of the population, but are biased in some way, the app cannot know that. If your data points are not independently sampled (as is assumed by the model) the app cannot know that. The app only does computations based on the data you enter; the app does not know the quality of the data.

For more complex data structures and models, the possibilities for interpretive errors are myriad. There are design issues such as misinterpreting confounded factors or inferring causation from merely observed correlation. There are analysis issues such as biased selection of variables as predictors. There are other biases such as stopping collection of data when “statistical significance” is achieved, or stopping variations of experiment designs when “statistical significance” is achieved. There is bias in publication practice that favors only “statistically significant” results, leaving valid but “not significant” findings unpublished. All of these problems and others conspire to make many published findings unreplicable. Many of the problems must be addressed with procedural incentives, but a subset of the problems can be addressed by statistical analyses. For a short overview of Bayesian methods in replication analysis, see this video.

More complex applications

Because the goal of the app is to explain frequentist and Bayesian analyses, the app uses a very simple model: a two-parameter normal distribution. But all the ideas can be generalized to more complex models. The same principles apply, for example, to all the cases of the generalized linear model, including multiple linear regression, ANOVA-style models, logistic regression, multinomial regression, negative-binomial regression, etc. The same principles apply to other models found in item-response theory, or in survival analysis, etc. You can learn lots more from the book, Doing Bayesian Data Analysis, 2nd Edition.