# Bayes’ Theorem
## What is Bayes’ Theorem?
Bayes’ Theorem is a rule for updating what you believe when you get new information.
- You start with an initial belief about how likely something is.
- You then observe new evidence.
- Bayes’ Theorem tells you how to revise your belief in light of that evidence.
An example: Imagine a medical test for a disease.
- The disease is rare.
- The test is pretty accurate, but not perfect.
- You test positive.
Your immediate reaction might be:
“I tested positive, so I probably have the disease.”
Bayes’ Theorem helps answer the real question:
“Given that I tested positive, how likely is it that I actually have the disease?”
It forces you to consider:
- How common the disease is
- How accurate the test is
- How often false positives occur
## Concepts and Terminology
Bayes’ Theorem combines four concepts:
1. **Prior probability**
What you believed before seeing the evidence.
2. **Likelihood**
How probable the evidence is if your hypothesis were true.
3. **Evidence** (normalization)
How common the evidence is overall.
4. **Posterior probability**
What you believe after seeing the evidence.
Let’s define two events for our example:
- H = a hypothesis (e.g., “the patient has the disease”)
- E = observed evidence (e.g., “the test is positive”)
## Conditional Probability
We want to know:
What is the probability of H, given that E occurred?
This is written as:
$P(H \mid E)$: “probability of H given E”
Out of all situations where E is observed, in what fraction is H also true?
- $P(H \cap E)$: joint probability that H is true and E is observed
- $P(E)$: probability that E is observed at all, whether or not H is true
This takes us into the field of conditional probability.
By definition:
$P(H \mid E) = \frac{P(H \cap E)}{P(E)}$
Similarly:
$P(E \mid H) = \frac{P(H \cap E)}{P(H)}$
$P(E \mid H)$ is known as the **likelihood**: “If my hypothesis were true, how likely is it that I would see this evidence?”
Both expressions involve the same joint probability $P(H \cap E)$.
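These definitions can be checked numerically with a tiny screening table. All counts below are assumed for illustration; they are not taken from the text:

```python
# Conditional probability from counts: out of all cases where E is observed,
# in what fraction is H also true? All counts are illustrative.
n_H = 100              # people who have the disease (H)
n_E = 590              # people who test positive (E), true and false positives
n_H_and_E = 95         # have the disease AND test positive (H ∩ E)

p_H_given_E = n_H_and_E / n_E   # P(H | E) = P(H ∩ E) / P(E)
p_E_given_H = n_H_and_E / n_H   # P(E | H) = P(H ∩ E) / P(H)

print(round(p_H_given_E, 3))   # ≈ 0.161
print(round(p_E_given_H, 3))   # 0.95
```

Note how the same joint count, 95, appears in both numerators; only the denominator (the conditioning event) changes.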
## Deriving Bayes’ Theorem
From the two definitions above:
$P(H \cap E) = P(H \mid E)P(E)$
$P(H \cap E) = P(E \mid H)P(H)$
Set them equal:
$P(H \mid E)P(E) = P(E \mid H)P(H)$
Now solve for $P(H \mid E)$ to derive the mathematical definition of Bayes’ Theorem:
$\boxed{ P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)} }$
Where:
- $P(H)$ = prior probability
- $P(E \mid H)$ = likelihood
- $P(E)$ = evidence
- $P(H \mid E)$ = posterior probability
Bayes’ Theorem provides a formal, rational way to update a prior belief $P(H)$ into a posterior belief $P(H \mid E)$ when new evidence E becomes available.
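The formula translates directly into code. The disease-test numbers below are assumed for illustration (1% prevalence, 95% sensitivity, 5% false-positive rate), not specified in the text:

```python
def bayes_posterior(prior: float, likelihood: float, evidence: float) -> float:
    """P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# Illustrative numbers: prior P(H) = 0.01, likelihood P(E|H) = 0.95,
# evidence P(E) = P(E|H)P(H) + P(E|not H)P(not H)
evidence = 0.95 * 0.01 + 0.05 * 0.99   # = 0.059
posterior = bayes_posterior(prior=0.01, likelihood=0.95, evidence=evidence)
print(round(posterior, 3))  # ≈ 0.161
```

Even with a 95%-accurate test, a single positive result leaves only about a 16% chance of disease when the disease is rare, which is exactly the point of the opening example.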
## What is the likelihood?
Likelihood answers this question:
“If my assumption were true, how plausible is the data I am seeing?”
It does not ask whether the assumption is true.
It asks how well the assumption explains the observed evidence.
For example:
- Assumption (hypothesis): “This coin is fair”
- Observation (data): “I flipped it 10 times and got 8 heads”
This is a critical distinction:
- Probability: “How likely is the hypothesis?”
- Likelihood: “How compatible is the data with the hypothesis?”
This number is **measured or assumed, not derived from Bayes’ Theorem**.
## Counting the likelihood
If you can directly observe cases where the hypothesis is true:
$P(E \mid H) = \frac{\text{Number of times E occurs when H is true}}{\text{Total number of times H is true}}$
Example: You observe 950 positives out of 1,000 diseased patients.
$P(E \mid H) = \frac{950}{1000} = 0.95$
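In code, this count-based estimate is a single division, using the numbers from the example:

```python
positives_when_H_true = 950   # positive tests among diseased patients
total_H_true = 1000           # diseased patients observed

p_E_given_H = positives_when_H_true / total_H_true
print(p_E_given_H)  # 0.95
```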
## Modeling the likelihood
When direct counting is impossible, you use a model.
Example: coin flips.
- Hypothesis: coin has probability p of heads
- Data: 8 heads out of 10 flips
Likelihood (via the binomial model of Bernoulli trials):
$P(E \mid H=p) = \binom{10}{8} p^8 (1-p)^2$
For a fair coin $(p = 0.5)$:
$P(E \mid H=0.5) = \frac{10!}{8!\times 2!} 0.5^{10}$
$P(E \mid H=0.5) = 45 \times 0.000976$
$P(E \mid H=0.5) \approx 0.044$
So the likelihood of getting 8 heads out of 10 flips with a fair coin is below 5%. (5 heads out of 10 has a 24.6% likelihood with a fair coin.)
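The binomial likelihood above can be computed with Python's standard library (`math.comb` gives the binomial coefficient):

```python
from math import comb

def binomial_likelihood(k: int, n: int, p: float) -> float:
    """P(E | H=p): probability of exactly k heads in n flips of a p-coin."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(binomial_likelihood(8, 10, 0.5), 3))  # 0.044
print(round(binomial_likelihood(5, 10, 0.5), 3))  # 0.246
```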
$P(E)$ and $P(E \mid H)$ **coincide only in the degenerate case of a single hypothesis** (e.g., “fair coin”) held with probability 1; in real Bayesian inference, $P(E)$ aggregates likelihoods across all competing hypotheses.
## Hypothesis testing
### Setup: define hypotheses and data
Observed data E:
Exactly 8 heads out of 10 flips
Competing hypotheses:
- $H_1$: the coin is fair, with probability of heads $p = 0.5$
- $H_2$: the coin is biased toward heads, with $p = 0.8$
### Step 1: Assign prior beliefs
Suppose before seeing any data you believed:
- $P(H_1) = 0.7$ (fair coin)
- $P(H_2) = 0.3$ (biased coin)
### Step 2: Compute likelihoods $P(E \mid H_i)$
Use the binomial distribution:
$P(E \mid p) = \binom{10}{8} p^8 (1-p)^2$
Fair coin (p = 0.5):
$P(E \mid H_1) = 45 \cdot (0.5)^{10} = \frac{45}{1024} \approx 0.04395$
Biased coin (p = 0.8):
$P(E \mid H_2) = 45 \cdot (0.8)^8 (0.2)^2 \approx 0.302$
### Step 3: Compute evidence $P(E)$
The total evidence across all hypotheses is:
$P(E) = \sum_i P(E \mid H_i) P(H_i)$
$P(E) = P(E \mid H_1)P(H_1) + P(E \mid H_2)P(H_2)$
$P(E) = (0.04395 \times 0.7) + (0.302 \times 0.3)$
$P(E) = 0.0308 + 0.0906 = 0.1214$
### Step 4: Update beliefs (posterior probabilities)
Apply Bayes’ Theorem.
Posterior for fair coin
$P(H_1 \mid E) = \frac{0.04395 \times 0.7}{0.1214} \approx 0.254$
Posterior for biased coin
$P(H_2 \mid E) = \frac{0.302 \times 0.3}{0.1214} \approx 0.746$
### Step 5: Interpret the update
- You started believing the coin was probably fair (70%)
- Seeing 8 heads is much more compatible with a biased coin
- After updating your belief is:
- Fair coin: 25.4%
- Biased coin: 74.6%
The data shifted belief toward the hypothesis that better explains it. This step is distinctly Bayesian: frequentist methods do not assign probabilities to competing hypotheses.
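The five steps can be run end-to-end in a few lines, using the numbers from this section:

```python
from math import comb

def binomial_likelihood(k, n, p):
    """P(E | p): probability of exactly k heads in n flips of a p-coin."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Step 1: prior beliefs over the two hypotheses
hypotheses = {"fair (p=0.5)": 0.5, "biased (p=0.8)": 0.8}
priors = {"fair (p=0.5)": 0.7, "biased (p=0.8)": 0.3}

# Step 2: likelihood of 8 heads in 10 flips under each hypothesis
likelihoods = {h: binomial_likelihood(8, 10, p) for h, p in hypotheses.items()}

# Step 3: evidence P(E) = sum_i P(E | H_i) P(H_i)
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Step 4: posterior P(H_i | E) via Bayes' Theorem
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}
for h, post in posteriors.items():
    print(f"{h}: {post:.3f}")
```

Up to rounding of the intermediate values, this reproduces the posteriors computed above.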
When multiple hypotheses exist, P(E) becomes a weighted average of likelihoods, and Bayes’ Theorem redistributes belief toward the hypotheses that best explain the observed data.
When you repeat experiments to gather more data, your posteriors become the next priors. Sequential Bayesian updates naturally converge toward hypotheses that explain all observed data, not just the most recent result.
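Sequential updating can be sketched by feeding batches of flips through the same computation, with each posterior serving as the next prior. The first batch is the 8-heads result from this section; the second batch (7 heads in 10 flips) is a hypothetical follow-up experiment:

```python
from math import comb

def binomial_likelihood(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

beliefs = [0.7, 0.3]    # priors for [fair p=0.5, biased p=0.8], as above
coin_ps = [0.5, 0.8]

# Each batch is (heads, flips); the second batch is hypothetical.
for heads, flips in [(8, 10), (7, 10)]:
    likelihoods = [binomial_likelihood(heads, flips, p) for p in coin_ps]
    evidence = sum(l * b for l, b in zip(likelihoods, beliefs))
    # Posterior becomes the next prior:
    beliefs = [l * b / evidence for l, b in zip(likelihoods, beliefs)]
    print([round(b, 3) for b in beliefs])
```

Each additional heads-heavy batch pushes belief further toward the biased-coin hypothesis, because the posterior after one batch is exactly the prior for the next.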