Statistically Speaking E4: The Science of Confidence — Understanding Level of Significance and P-Values

About the Episode:

In this episode of Statistically Speaking, Kerry Cunningham and Sara Boostani take another step into inferential statistics by breaking down two concepts that sound technical but are fundamentally about one simple question: how much evidence do you need before you can trust what your data is telling you? Kerry and Sara explain significance levels, p-values, and why the term “statistically significant” can actually be more misleading than helpful — using a real example from their own B2B buying research to bring it all to life. 

Topics Covered

  • The core principle behind hypothesis testing 
    • We assume two groups behave the same way until the data gives us enough evidence to conclude otherwise — and we never decide in advance what we’re going to find. 
  • Significance levels: setting your evidence threshold in advance 
    • The significance level is a predetermined standard for how much evidence you need before concluding that a finding is not due to random chance.
    • The academic standard is a 95% confidence level (a significance level of .05): you accept no more than a 5% chance of seeing a difference at least this large if there were actually no real difference in the population.
    • That threshold can go higher (medicine, aviation) or lower (an A/B test on email subject lines) depending on what’s at stake.
  • P-values: the evidence your specific test actually produced 
    • Once you’ve set your threshold, the p-value tells you what your specific sample delivered.
    • A p-value of .04 means that, if there were actually no real difference, there would be only a 4% chance of seeing a result at least this large. That clears the 5% threshold set in advance.
    • Because you can never prove something is definitively true, researchers state results in the negative: the burden of proof is always on demonstrating that the null hypothesis (no difference) can be rejected.
  • A real research example: 
    • Kerry and Sara tested whether buyers of new business capabilities engage with sellers at a different point in the buying journey than buyers making renewals or replacements.
    • The result (67% vs. 64% of the way through the buying journey) came in at p < .001, meaning there’s less than a 0.1% chance a difference at least that large would appear in the data if there were no real difference in the population.
    • But a 3-point difference? Probably not something you need to act on.
  • Statistical reliability vs. meaningful difference 
    • A finding can clear the reliability threshold (the data is unlikely to be a fluke) while still being completely unimportant in practice.
    • Kerry and Sara use the term “statistically reliable” instead of “statistically significant” to avoid confusion.
    • Whether a finding matters is a separate question entirely, and that’s where effect size comes in (covered in the next episode).
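
For readers who want to see the logic of a p-value in action, here is a minimal Python sketch of a permutation test. It is an illustration only, not the method used in Kerry and Sara's research: the function name `permutation_p_value`, the two small groups of scores, and the number of shuffles are all made up for this example. The idea is to assume the two groups behave the same way, shuffle the group labels many times, and count how often chance alone produces a difference at least as large as the one observed.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    Hypothetical helper for illustration: if the null hypothesis
    (no real difference) holds, group labels are interchangeable,
    so we shuffle them and see how often a gap this big shows up.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


ALPHA = 0.05  # significance level, chosen before looking at the data

# Made-up scores, not data from the episode.
group_a = [72, 65, 70, 68, 74, 66, 71, 69]
group_b = [60, 63, 58, 66, 61, 59, 64, 62]

p = permutation_p_value(group_a, group_b)
print(f"p = {p:.4f}; statistically reliable at alpha={ALPHA}: {p < ALPHA}")
```

The threshold (`ALPHA`) is fixed first, and the p-value is then compared against it. Note that nothing in this calculation says whether the difference is big enough to matter; that is the separate effect-size question.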

Key Takeaways

  • Significance levels are set before you run your tests. They reflect how much confidence you need given what’s at stake.
  • P-values tell you the probability of seeing a result at least as large as yours if there were actually no real difference. The smaller the p-value, the stronger the evidence against the null hypothesis.
  • The more data you have, the easier it is to find statistically reliable results that don’t actually matter.
  • When a result doesn’t clear the 95% threshold but gets close, it’s still worth reporting — just be transparent about where it landed.
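
The point about sample size can be sketched with a few lines of Python. This is a rough illustration under stated assumptions, not a reconstruction of the actual analysis: it reuses the 67 vs. 64 figures from the episode, but the standard deviation (`ASSUMED_SD`), the sample sizes, and the helper `two_sided_p_value` are hypothetical, and it uses a simple normal approximation with equal group sizes and equal spread.

```python
import math


def two_sided_p_value(mean_a, mean_b, sd, n_per_group):
    """Two-sided p-value for a difference in means (normal approximation).

    Hypothetical helper: assumes both groups share the same standard
    deviation and the same number of respondents.
    """
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = abs(mean_a - mean_b) / standard_error
    # For a standard normal, 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


ALPHA = 0.05
ASSUMED_SD = 20.0  # made-up spread, not a figure from the research

# Same 3-point gap each time; only the (hypothetical) sample size changes.
for n in (50, 500, 5000):
    p = two_sided_p_value(67, 64, ASSUMED_SD, n)
    print(f"n per group = {n:>4}: p = {p:.3g}, reliable at {ALPHA}: {p < ALPHA}")
```

Under these assumptions the 3-point gap never changes, yet it goes from not clearing the threshold with small samples to clearing it easily with large ones. The size of the difference, and whether it is worth acting on, stays exactly the same throughout.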

Kerry Cunningham and Sara Boostani