ARCHE

Levels of Evidence - Not All Studies Are Created Equal

Published on October 27, 2025


"Not everything that counts can be counted, and not everything that can be counted counts." – William Bruce Cameron

Introduction

Imagine you're standing in front of a pharmacy shelf, looking at dozens of cold remedies. Each box makes bold claims: "Clinically proven!" "Doctor recommended!" "Backed by science!" But what does that actually mean? How do you know which claims are solid and which are just marketing fluff?

This is exactly the challenge doctors face every day when reading medical research. Not all evidence is equal. Some studies give us rock-solid answers we can trust, while others are more like educated guesses. Understanding the levels of evidence is like having a quality detector for medical research—it helps you figure out which studies deserve your attention and which ones to take with a grain of salt.(1)

The concept of levels of evidence isn't about dismissing certain types of research. Every study has its place and purpose. It's about recognizing that different study designs have different strengths and weaknesses, and some are better suited to answer specific questions than others. Think of it like tools in a toolbox: you wouldn't use a hammer to tighten a screw, even though both are perfectly good tools.(2)

In this chapter, we'll explore the hierarchy of evidence—often visualized as a pyramid—and understand why some studies sit at the top while others form the foundation. More importantly, we'll learn how to use this knowledge in real clinical situations to make better decisions for our patients.

The Evidence Pyramid: A Visual Guide

Picture a pyramid. At the broad base, you have lots of studies, but they're not the strongest. As you climb toward the peak, the amount of available research decreases, but the quality and reliability increase dramatically. This is the evidence pyramid, and it's one of the most useful mental models in evidence-based medicine.(3)

Figure 1. Levels of evidence hierarchy
The pyramid illustrates the hierarchy of evidence strength in biomedical research. At the apex are systematic reviews and meta-analyses, followed by randomized controlled trials, cohort studies, case-control studies, and at the base, expert opinion, case series and case reports. Higher levels indicate stronger evidence with reduced bias.

Let's start our climb from the bottom up.

Level 5: Expert Opinion and Case Reports

At the foundation of our pyramid sit expert opinion, case reports, and case series. These are the stories of medicine—individual patient experiences or the collective wisdom of experienced clinicians.(4)

A case report might describe a patient with an unusual presentation of a disease or an unexpected reaction to a treatment. These are valuable for generating hypotheses and alerting the medical community to rare events. But here's the catch: what happened to one patient might not happen to another. Without a comparison group, we can't know if the outcome was due to the treatment, the natural course of the disease, or pure chance.(3)

Expert opinion—what experienced doctors think based on their years of practice—certainly has value. These are people who've seen thousands of patients and have developed clinical wisdom. However, even experts can be influenced by personal biases, recent memorable cases, or outdated training. As the saying goes, "The plural of anecdote is not data."

Think of it this way: if your neighbor tells you about a miracle diet that worked for them, that's valuable information, but it's not proof the diet will work for everyone (or anyone else, for that matter).

Level 4: Case-Control Studies

Moving up the pyramid, we encounter case-control studies. These studies look backward in time—they start with people who already have a disease (cases) and compare them to similar people without the disease (controls), searching for differences in their past exposures.(5)

Case-control studies are brilliant for studying rare diseases or conditions with a long latency between exposure and outcome. For instance, if you wanted to understand what causes a rare cancer, you couldn't wait decades following thousands of people hoping some develop the disease. Instead, you'd find people with the cancer, match them with similar people without cancer, and look back at their histories.(6)

The limitation? Recall bias is a real problem: people with a disease often remember their past exposures differently than healthy people do. Also, without following people forward in time, it's harder to establish that the exposure truly came before the disease.(7)

Real-world example: Researchers used case-control studies to establish the link between smoking and lung cancer in the 1950s. They compared people with lung cancer to those without and found that a much higher proportion of cancer patients had been smokers.
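The arithmetic behind a case-control comparison is a simple odds ratio. Here is a minimal sketch; the 2x2 counts are hypothetical, chosen only to illustrate the calculation, and are not taken from the historical smoking studies:

```python
# Hypothetical 2x2 table from a case-control study (illustrative numbers only)
cases_exposed = 80       # patients with the disease who had the exposure
cases_unexposed = 20     # patients with the disease without the exposure
controls_exposed = 40    # disease-free controls with the exposure
controls_unexposed = 60  # disease-free controls without the exposure

# Odds of exposure among cases, and among controls
odds_cases = cases_exposed / cases_unexposed           # 80/20 = 4.0
odds_controls = controls_exposed / controls_unexposed  # 40/60, about 0.67

# The odds ratio: how much more common the exposure is among cases
odds_ratio = odds_cases / odds_controls
print(round(odds_ratio, 2))  # 6.0
```

An odds ratio of 6 here would mean the exposure is six times more common, in odds terms, among cases than controls. Because case-control studies sample by disease status rather than following a population forward, this odds ratio is the measure they can estimate directly.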

Level 3: Cohort Studies

Cohort studies are like watching a movie instead of looking at photographs. Researchers identify a group of people (a cohort) who share certain characteristics and follow them forward in time to see what happens.(5)

For example, you might identify 1,000 people who have hypertension and 1,000 similar people who don't, then follow both groups for five years to see who develops complications. Because you're watching events unfold in real time, you can be more confident about the timing: the exposure definitely came before the outcome.

Cohort studies are useful for studying prognosis (what's likely to happen over time) and can examine multiple outcomes from a single exposure. They also allow direct calculation of absolute and relative risk, something case-control studies cannot do (case-control designs yield only odds ratios).(7)

The downside? They're expensive, time-consuming, and people drop out over time (which can introduce bias). They also need large numbers of participants, especially if the outcome you're studying is rare.
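Because everyone in a cohort is followed from the start, the denominators are known, so risks can be computed directly. A minimal sketch, using hypothetical event counts for the 1,000-person hypertension example above:

```python
# Hypothetical five-year cohort follow-up (illustrative numbers only)
exposed_total = 1000     # people with hypertension
exposed_events = 150     # of whom developed a complication
unexposed_total = 1000   # comparable people without hypertension
unexposed_events = 50    # of whom developed a complication

# Absolute risk in each group: events divided by people followed
risk_exposed = exposed_events / exposed_total        # 0.15, i.e. 15%
risk_unexposed = unexposed_events / unexposed_total  # 0.05, i.e. 5%

relative_risk = risk_exposed / risk_unexposed    # about 3.0
risk_difference = risk_exposed - risk_unexposed  # about 0.10 (10 percentage points)
print(round(relative_risk, 2), round(risk_difference, 2))
```

The risk difference (here, 10 percentage points over five years) is often more useful at the bedside than the relative risk, and it is exactly the quantity a case-control study cannot give you.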

Level 2: Randomized Controlled Trials (RCTs)

Now we're approaching the top of the pyramid. Randomized controlled trials are often called the "gold standard" for testing whether a treatment works.(8)

Here's what makes RCTs special: randomization. When participants are randomly assigned to receive either the new treatment or a comparison treatment (or placebo), it creates groups that are balanced not just for characteristics we know about (age, sex, disease severity), but also for things we haven't even thought to measure. This random allocation is like nature's way of creating a fair comparison.
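As a sketch of what random allocation looks like in its simplest form (a hypothetical toy scheme, not the block or stratified randomization real trials often use):

```python
# Simple randomization: each participant's group is decided by chance alone.
import random

random.seed(42)  # fixed seed only so this illustration is reproducible
participants = [f"P{i:03d}" for i in range(1, 21)]
allocation = {p: random.choice(["treatment", "control"]) for p in participants}

# With enough participants, chance alone balances the groups on both
# measured and unmeasured characteristics.
counts = {
    "treatment": sum(1 for g in allocation.values() if g == "treatment"),
    "control": sum(1 for g in allocation.values() if g == "control"),
}
print(counts)
```

Real trials add safeguards on top of this principle: block randomization to guarantee near-equal group sizes, and allocation concealment so no one can predict the next assignment.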

Good RCTs also use blinding—neither the patients nor the doctors know who's getting which treatment. This prevents expectations from influencing the results. If you know you're getting the new wonder drug, you might feel better just from the psychological boost, regardless of whether the drug actually works.

RCTs give us the strongest evidence about whether a treatment causes a particular outcome. They minimize bias and allow us to isolate the effect of the treatment from other factors.(8)

However, RCTs aren't perfect for everything. They're expensive, often have strict inclusion criteria (which means participants might not represent typical patients), and usually only follow people for a limited time. Some questions can't be answered with RCTs for ethical reasons—you can't randomize people to smoke cigarettes to prove smoking causes cancer.(9)

Level 1: Systematic Reviews and Meta-Analyses

At the peak of our evidence pyramid sit systematic reviews and meta-analyses—the crème de la crème of evidence.(8)

A systematic review is like a research project about research projects. Researchers systematically search for all high-quality studies on a specific question, critically appraise each one, and synthesize the findings. They use explicit, reproducible methods to minimize bias in selecting and interpreting studies.(10)

The word "systematic" is crucial. This isn't just someone's opinion about what studies say. It's a rigorous, transparent process following a protocol established before the review begins. Every decision about which studies to include or exclude is documented and justified.

A meta-analysis takes this a step further by statistically combining results from multiple studies to calculate an overall effect. Imagine you have ten small studies, each suggesting a treatment might work but none quite reaching statistical significance on its own. By combining them mathematically, you increase the statistical power—like turning up the volume so you can hear the signal over the noise.
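The "combining them mathematically" step can be sketched with the simplest pooling method, fixed-effect inverse-variance weighting. The effect estimates and standard errors below are hypothetical, chosen so that no single trial is conclusive on its own:

```python
import math

# (effect estimate, standard error) from four small hypothetical trials,
# each suggesting a benefit but none statistically significant alone
studies = [(0.30, 0.20), (0.25, 0.15), (0.40, 0.25), (0.20, 0.18)]

# Inverse-variance weights: more precise studies count for more
weights = [1 / se ** 2 for _, se in studies]
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

# 95% confidence interval for the pooled effect
ci_low = pooled - 1.96 * pooled_se
ci_high = pooled + 1.96 * pooled_se
print(round(pooled, 3), round(ci_low, 3), round(ci_high, 3))
```

With these numbers, every individual trial's 95% confidence interval crosses zero, but the pooled interval is narrower and sits entirely above zero: the "turning up the volume" effect in action. Real meta-analyses go further, checking heterogeneity between studies and often using random-effects models instead of this fixed-effect sketch.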

Why are systematic reviews at the top? Because they synthesize evidence from multiple studies, they're less likely to be misled by the quirks or biases of any single study. They can identify patterns, resolve controversies, and provide more precise estimates of effects. They also highlight gaps in the research, showing where more studies are needed.

The caveat? A systematic review is only as good as the studies it includes. If you systematically review poor-quality studies, you get a systematic summary of poor evidence. As researchers say, "garbage in, garbage out."

Understanding the Limitations

Before we get too excited about our neat pyramid, we need to acknowledge some important nuances.

Not All Studies of the Same Type Are Equal

A poorly designed RCT can be worse than a well-designed cohort study. The hierarchy tells us about the potential for providing strong evidence based on study design, but execution matters enormously.(11)

For example, an RCT with:

  • High dropout rates

  • Poor allocation concealment

  • No blinding

  • Outcome measures that don't matter to patients

...might provide weaker evidence than a carefully conducted cohort study with:

  • Complete follow-up

  • Valid outcome measures

  • Appropriate adjustment for confounders

  • Large sample size

Different Questions Need Different Designs

The evidence pyramid works best for questions about treatment effectiveness. But not every clinical question is about treatment.

For questions about:

  • Diagnosis: You need studies comparing a new test to a gold standard, with blinding to prevent bias

  • Prognosis: Cohort studies following patients over time are ideal

  • Harm: Sometimes cohort or case-control studies are more appropriate than RCTs (for ethical reasons)

  • Patient experiences: Qualitative studies provide depth that numbers cannot capture

Using the wrong study design for your question is like bringing a ladder to cross a river—it's not that the ladder is bad, it's just the wrong tool for the job.

Context Matters

A study conducted in a tertiary referral hospital in Sweden might not directly apply to a rural clinic in India. Patient populations differ, healthcare systems vary, and resources aren't equal everywhere. Even the highest-level evidence needs to be applied thoughtfully, considering local context.

The GRADE System: Adding Nuance

Because the simple pyramid has limitations, experts developed more sophisticated systems for rating evidence quality. The most widely used is GRADE (Grading of Recommendations Assessment, Development and Evaluation).(11)

GRADE starts with the study design (RCTs start high, observational studies start low) but then adjusts the quality rating up or down based on several factors:(12)

Factors that lower evidence quality:

  • Study limitations (risk of bias)

  • Inconsistency between studies (conflicting results)

  • Indirectness (studies don't quite match your question)

  • Imprecision (small sample sizes, wide confidence intervals)

  • Publication bias (negative studies hidden in file drawers)

Factors that can raise evidence quality:

  • Large magnitude of effect (the treatment works really well)

  • Dose-response relationship (more exposure = more effect)

  • All plausible confounders would reduce the effect (if anything, we're underestimating the benefit)

Under GRADE, evidence is rated as:

  • High quality: Very confident the true effect is close to the estimate

  • Moderate quality: Moderately confident, but the true effect might be substantially different

  • Low quality: Limited confidence; true effect may be substantially different

  • Very low quality: Very uncertain about the estimate

This system recognizes that evidence quality exists on a spectrum, not just in discrete categories.

Practical Applications

So how do you use this knowledge in real clinical practice?

1. Start at the Top

When you have a clinical question, begin your search looking for systematic reviews or meta-analyses. These synthesize existing evidence and save you from having to track down and evaluate dozens of individual studies.

Resources like the Cochrane Library specialize in high-quality systematic reviews. If you find a recent, well-conducted systematic review that answers your question, you've hit the jackpot.

2. Work Your Way Down if Needed

If no systematic review exists, look for high-quality RCTs. If those aren't available, well-designed observational studies can still provide valuable evidence.

The key is being honest about the strength of the evidence you're using. It's perfectly acceptable to base decisions on lower-level evidence when that's all that exists—just acknowledge the uncertainty.

3. Consider the Question Type

Match your evidence level to your question:

  • Treatment effectiveness → RCTs or systematic reviews

  • Rare adverse effects → Case-control studies or cohort studies

  • Long-term outcomes → Cohort studies

  • Diagnostic accuracy → Cross-sectional studies with appropriate reference standards

  • Patient experiences → Qualitative research

4. Assess More Than Just Study Design

Use the levels of evidence as a starting point, but then dig deeper:

  • Was the study well-conducted?

  • Were the results clinically meaningful (not just statistically significant)?

  • Do the participants resemble your patient?

  • Are the outcomes important to patients?

5. Integrate Evidence with Expertise and Patient Values

Evidence-based medicine has three components: best available evidence, clinical expertise, and patient preferences and values. The evidence pyramid tells you about the first component, but you still need the other two.

A treatment supported by Level 1 evidence might not be right for your patient if it conflicts with their values, isn't feasible in your setting, or your clinical judgment suggests they're an exception to the rule.

1. Elsevier Author Services. Levels of evidence in research [Internet]. Elsevier; 2021 [cited 2025 Oct 26]. Available from: https://scientific-publishing.webshop.elsevier.com/research-process/levels-of-evidence-in-research/

2. Burns PB, Rohrich RJ, Chung KC. The Levels of Evidence and their role in Evidence-Based Medicine. Plast Reconstr Surg [Internet]. 2011 July [cited 2025 Oct 26];128(1):305–10. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC3124652/

3. Pruka A. Research Hub: Evidence Based Practice Toolkit: Levels of Evidence [Internet]. [cited 2025 Oct 26]. Available from: https://libguides.winona.edu/ebptoolkit/Levels-Evidence

4. Understanding the Levels of Evidence in Medical Research | Journal of Orthopaedic Case Reports [Internet]. [cited 2025 Oct 26]. Available from: https://jocr.co.in/wp/2025/05/understanding-the-levels-of-evidence-in-medical-research/

5. Gamble JM. An Introduction to the Fundamentals of Cohort and Case–Control Studies. Can J Hosp Pharm [Internet]. 2014 [cited 2025 Oct 26];67(5):366–72. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC4214579/

6. Purnell M. Health Library: Evidence-Based Practice: Step 3: Critical Appraisal [Internet]. [cited 2025 Oct 26]. Available from: https://library.health.nt.gov.au/EBP/appraisal

7. Song JW, Chung KC. Observational Studies: Cohort and Case-Control Studies. Plast Reconstr Surg [Internet]. 2010 Dec [cited 2025 Oct 26];126(6):2234–42. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC2998589/

8. Murad MH, Asi N, Alsawas M, Alahdab F. New evidence pyramid. BMJ Evidence-Based Medicine [Internet]. 2016 Aug 1 [cited 2025 Oct 26];21(4):125–7. Available from: https://ebm.bmj.com/content/21/4/125

9. Study designs [Internet]. [cited 2025 Oct 26]. Available from: https://www.cebm.ox.ac.uk/resources/ebm-tools/study-designs

10. Abbott B. Research Guides: Systematic Reviews: Levels of Evidence [Internet]. [cited 2025 Oct 26]. Available from: https://guides.library.ucdavis.edu/systematic-reviews/levels-of-evidence

11. Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, Alonso-Coello P, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ [Internet]. 2008 Apr 26 [cited 2025 Oct 26];336(7650):924–6. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC2335261/

12. Baker A, Young K, Potter J, Madan I. A review of grading systems for evidence-based guidelines produced by medical specialties. Clin Med (Lond) [Internet]. 2010 Aug [cited 2025 Oct 26];10(4):358–63. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC4952165/

Summary

Key Points to Remember:

  • The evidence pyramid is a hierarchy that ranks studies from weakest (expert opinion, case reports) to strongest (systematic reviews, meta-analyses) based on their ability to provide reliable evidence about treatment effectiveness.

  • Study design determines potential strength: Systematic reviews and meta-analyses sit at the top, followed by RCTs, cohort studies, case-control studies, and finally case reports and expert opinion at the base.

  • Not all studies of the same type are equal: A poorly conducted RCT can provide weaker evidence than a well-designed observational study. Execution and quality matter as much as design.

  • Different questions need different designs: The pyramid works best for treatment questions. Diagnostic, prognostic, and harm questions may require different study designs.

  • Systematic reviews synthesize multiple studies using explicit, reproducible methods, providing the most comprehensive and reliable evidence when available.

  • RCTs minimize bias through randomization, creating balanced comparison groups and allowing us to isolate the effect of treatments.

  • Observational studies (cohort and case-control) have important roles, especially for rare outcomes, long-term effects, and situations where RCTs are unethical or impractical.

  • Case reports and expert opinion generate hypotheses but cannot prove cause and effect. They're valuable for identifying rare events and unusual presentations.

  • The GRADE system adds nuance by starting with study design but adjusting quality ratings based on factors like bias, consistency, and effect size.

  • Context and applicability matter: Even high-quality evidence must be applied thoughtfully, considering your patient population, setting, and available resources.

  • Start your search at the pyramid's top: Look for systematic reviews first, then work down to individual studies if needed.

  • Evidence-based medicine integrates research with clinical expertise and patient values: The evidence pyramid tells you about study quality, but you still need judgment and shared decision-making.

Multiple Choice Questions

Question 1: A 45-year-old woman with newly diagnosed hypertension asks you about starting medication. You want to find the best available evidence about first-line antihypertensive therapy. Which type of study should you look for first?

A. A case report describing successful treatment with a new drug
B. An expert opinion article by a renowned cardiologist
C. A randomized controlled trial comparing two antihypertensive medications
D. A systematic review and meta-analysis of RCTs on first-line antihypertensive therapy
E. A cohort study following patients on various antihypertensive medications

Correct Answer: D

Explanation: When looking for evidence about treatment effectiveness, systematic reviews and meta-analyses of RCTs provide the highest level of evidence. They synthesize findings from multiple high-quality studies, minimize bias, and provide more precise estimates than individual studies. While a single RCT (option C) would be valuable, a systematic review that synthesizes multiple RCTs is superior. Expert opinions (B) and case reports (A) sit at the bottom of the evidence hierarchy. Cohort studies (E), while useful for some questions, provide weaker evidence for treatment effectiveness than RCTs.

Question 2: You read about a new cancer treatment in a case series of 15 patients, all of whom showed improvement. Your colleague is excited and wants to start using it immediately. What is the most important limitation of this evidence?

A. The sample size is too large to be practical
B. Case series cannot establish causation because there's no comparison group
C. Case series are always unreliable and should be ignored
D. The treatment definitely doesn't work because case series are low-level evidence
E. Case series are only useful for studying common conditions

Correct Answer: B

Explanation: The fundamental limitation of case series is the absence of a comparison group. Without knowing what would have happened to similar patients who didn't receive the treatment, we cannot determine if the improvement was due to the treatment, natural disease progression, other interventions, or regression to the mean. The sample size (A) is actually small, not large. Option C is too extreme—case series do have value for generating hypotheses and identifying rare adverse events. Option D incorrectly equates low-level evidence with proof of ineffectiveness. Option E is incorrect; case series are actually most useful for rare, not common, conditions.

Question 3: Which of the following factors would raise your confidence in the quality of evidence from an observational study, according to the GRADE system?

A. Wide confidence intervals indicating imprecision
B. Inconsistent results across different studies
C. A very large magnitude of effect (e.g., relative risk of 10)
D. High risk of selection bias
E. Indirectness of evidence

Correct Answer: C

Explanation: Under the GRADE system, three factors can upgrade the quality of evidence from observational studies: large magnitude of effect, dose-response relationship, and situations where all plausible confounders would reduce the observed effect. A very large effect size (like a relative risk of 10) suggests that even if there were some residual confounding, the treatment likely has a real effect. Options A, B, D, and E all describe factors that would decrease confidence in the evidence quality, not increase it.

Question 4: You're designing a study to investigate whether a rare birth defect is associated with a maternal medication exposure during pregnancy. Which study design would be most appropriate and efficient?

A. Randomized controlled trial
B. Case-control study
C. Systematic review (when no studies exist yet)
D. Expert opinion survey
E. Cohort study with 10-year follow-up

Correct Answer: B

Explanation: For rare outcomes, case-control studies are the most efficient design. You would identify children with the birth defect (cases), match them with children without the defect (controls), and look back at maternal medication exposure during pregnancy. An RCT (A) would be unethical (you can't randomize pregnant women to potentially harmful exposures) and impractical for rare outcomes. A systematic review (C) cannot be done without existing studies. Expert opinion (D) provides the weakest evidence. While a cohort study (E) could work, it would require following thousands of pregnant women exposed and unexposed to the medication for years, making it much more expensive and time-consuming than a case-control study.

Question 5: A well-conducted randomized controlled trial shows that a new diabetes medication reduces HbA1c by 0.3% compared to placebo, with a p-value of 0.001. However, guidelines suggest that a reduction of at least 0.5% is clinically meaningful. What does this tell you about the evidence?

A. The result is not statistically significant and should be dismissed
B. The result is both statistically and clinically significant
C. The result is statistically significant but may not be clinically meaningful
D. The study must have been poorly designed
E. The p-value proves the treatment is effective for all patients

Correct Answer: C

Explanation: This question highlights the crucial distinction between statistical significance and clinical significance. The small p-value (0.001) indicates statistical significance—the result is unlikely to be due to chance. However, the magnitude of effect (0.3% reduction) is below what clinical guidelines consider meaningful (0.5%). A treatment can be statistically significant without being clinically important, especially in large studies where even tiny effects can achieve statistical significance. Option A is incorrect because the result IS statistically significant. Option B is wrong because clinical significance is questionable. Option D is unjustified—the study may be well-designed. Option E misunderstands p-values, which don't prove anything about individual patient responses.
