Statistics Cheat Sheet

1. Descriptive Statistics

1.1 Measures of Central Tendency

Measures of central tendency include the mean, median, and mode, which summarize the center of a data set:

Mean: The average of the data points, \( \bar{x} = \frac{\sum x_i}{n} \).
Median: The middle value when the data is ordered.
Mode: The most frequently occurring value in the data set.

Example:

Find the mean, median, and mode of the data set: 2, 3, 5, 5, 7, 8, 9.

Mean: \( \bar{x} = \frac{2 + 3 + 5 + 5 + 7 + 8 + 9}{7} = \frac{39}{7} \approx 5.57 \).
Median: The data is already ordered, so the middle (4th of 7) value is 5.
Mode: Most frequent value is 5.
Result: Mean = 5.57, Median = 5, Mode = 5.
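
The same calculation can be sketched with Python's standard `statistics` module. The variable names are illustrative and not part of the original example.

```python
import statistics

data = [2, 3, 5, 5, 7, 8, 9]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print(round(mean, 2), median, mode)
```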

1.2 Measures of Dispersion

Measures of dispersion include the range, variance, and standard deviation, which describe the spread of the data:

Range: Difference between the maximum and minimum values.
Variance: Average of the squared differences from the mean. For a full population, \( \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n} \); the sample variance \( s^2 \) divides by \( n - 1 \) instead of \( n \).
Standard Deviation: Square root of the variance, \( \sigma = \sqrt{\sigma^2} \).

Example:

Calculate the range, variance, and standard deviation of the data set: 2, 4, 4, 4, 5, 5, 7, 9.

Range: \( 9 - 2 = 7 \).
Variance: The mean is \( \bar{x} = \frac{40}{8} = 5 \), so \( \sigma^2 = \frac{(2-5)^2 + (4-5)^2 + \cdots + (9-5)^2}{8} = \frac{32}{8} = 4 \).
Standard Deviation: \( \sigma = \sqrt{4} = 2 \).
Result: Range = 7, Variance = 4, Standard Deviation = 2.
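
A minimal sketch of this example using the standard `statistics` module. It treats the data as a full population (dividing by n), matching the formula above. The sample variance, which divides by n - 1, is included only for comparison.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)
variance = statistics.pvariance(data)        # population variance, divides by n
std_dev = statistics.pstdev(data)            # square root of the population variance
sample_variance = statistics.variance(data)  # sample variance, divides by n - 1

print(data_range, variance, std_dev, round(sample_variance, 3))
```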

2. Probability Theory

2.1 Probability Basics

Probability measures the likelihood of an event occurring. When all outcomes are equally likely, it is expressed as:

\[ P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}} \]

Example:

What is the probability of rolling a sum of 7 with two six-sided dice?

Favorable outcomes: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1).
Total outcomes: 36.
Probability: \( P(\text{Sum of 7}) = \frac{6}{36} = \frac{1}{6} \).
Result: \( P = \frac{1}{6} \).
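
The counting argument can be checked by enumerating all outcomes. This is a small sketch using only the standard library. `Fraction` keeps the result exact.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of rolling two six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes whose faces sum to 7.
favorable = [roll for roll in outcomes if sum(roll) == 7]

probability = Fraction(len(favorable), len(outcomes))
print(len(favorable), len(outcomes), probability)
```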

2.2 Conditional Probability

Conditional probability is the probability of an event occurring given that another event has already occurred, denoted \( P(A \mid B) \). For \( P(B) > 0 \):

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

Example:

If the probability of event A is 0.4 and the probability of event B is 0.5, with \( P(A \cap B) = 0.2 \), what is \( P(A \mid B) \)?

Conditional Probability: \( P(A \mid B) = \frac{0.2}{0.5} = 0.4 \).
Result: \( P(A \mid B) = 0.4 \).
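
A short sketch of the same calculation. The helper name `conditional_probability` is an illustrative assumption. Note that here \( P(A \mid B) = P(A) \), so in this particular example A and B happen to be independent; the code checks this as well.

```python
from math import isclose

def conditional_probability(p_a_and_b, p_b):
    """Return P(A | B) = P(A and B) / P(B); requires P(B) > 0."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b

p_a, p_b, p_a_and_b = 0.4, 0.5, 0.2
p_a_given_b = conditional_probability(p_a_and_b, p_b)

# A and B are independent exactly when P(A | B) equals P(A).
independent = isclose(p_a_given_b, p_a)
print(p_a_given_b, independent)
```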

3. Probability Distributions

3.1 Binomial Distribution

The binomial distribution models the number of successes in a fixed number of independent trials, each with the same success probability \( p \). The probability of \( k \) successes in \( n \) trials is:

\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]

Example:

What is the probability of getting exactly 3 heads in 5 flips of a fair coin?

Set \( n = 5 \), \( k = 3 \), and \( p = 0.5 \).
Calculate the probability: \( P(X = 3) = \binom{5}{3} (0.5)^3 (0.5)^{2} = 10 \times 0.125 \times 0.25 = 0.3125 \).
Result: \( P = 0.3125 \).
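
The formula translates directly into code. A minimal sketch follows; `binomial_pmf` is an illustrative helper name, not a library function.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 3 heads in 5 flips of a fair coin.
print(binomial_pmf(3, 5, 0.5))
```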

3.2 Normal Distribution

The normal distribution is a continuous probability distribution that is symmetric around the mean. The probability density function is:

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

Example:

If a dataset has a mean \( \mu = 100 \) and standard deviation \( \sigma = 15 \), what is the probability that a value is between 85 and 115?

Standardize the values: \( z = \frac{X - \mu}{\sigma} \).
Calculate the z-scores for 85 and 115: \( z_1 = \frac{85-100}{15} = -1 \), \( z_2 = \frac{115-100}{15} = 1 \).
Use the standard normal distribution table to find the probability: \( P(-1 \leq z \leq 1) = \Phi(1) - \Phi(-1) = 0.8413 - 0.1587 = 0.6826 \).
Result: \( P = 0.6826 \).
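
As a cross-check, the probability can be computed without a table using `statistics.NormalDist` from the standard library. Direct evaluation gives about 0.6827; the table-based figure of 0.6826 differs only through rounding of the table entries.

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)

# P(85 <= X <= 115) is the difference of the cumulative probabilities.
probability = dist.cdf(115) - dist.cdf(85)
print(round(probability, 4))
```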

3.3 Poisson Distribution

The Poisson distribution models the number of events occurring within a fixed interval of time or space, with a given mean rate \( \lambda \). The probability of observing \( k \) events is:

\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \]

Example:

If a website receives an average of 4 hits per minute, what is the probability of receiving exactly 6 hits in a minute?

Set \( \lambda = 4 \) and \( k = 6 \).
Calculate the probability: \( P(X = 6) = \frac{4^6 e^{-4}}{6!} = \frac{4096 \times 0.018316}{720} \approx 0.1042 \).
Result: \( P = 0.1042 \).
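
A minimal sketch of the Poisson formula in code. `poisson_pmf` is an illustrative helper name.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean rate is lam."""
    return lam**k * exp(-lam) / factorial(k)

# Exactly 6 hits in a minute when the average is 4 per minute.
print(round(poisson_pmf(6, 4), 4))
```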

4. Confidence Intervals

4.1 Confidence Interval for the Mean

A confidence interval for the mean \( \mu \) is an interval estimate: a range of values constructed so that, over repeated sampling, a stated proportion of such intervals contain the true mean. For a sample of size \( n \) with mean \( \bar{x} \) and standard deviation \( s \), the \( 100(1-\alpha)\% \) confidence interval (with \( \alpha = 0.05 \) for 95%) is:

\[ \bar{x} \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}} \]

Example:

Compute a 95% confidence interval for the mean if \( \bar{x} = 100 \), \( s = 15 \), and \( n = 25 \).

Find the critical value for \( \alpha/2 = 0.025 \) and \( n - 1 = 24 \) degrees of freedom: \( t_{0.025, 24} = 2.064 \).
Calculate the margin of error: \( ME = 2.064 \times \frac{15}{\sqrt{25}} = 6.192 \).
Compute the confidence interval: \( 100 \pm 6.192 \).
Result: \( [93.808, 106.192] \).
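
The interval can be sketched in code as below. The standard library has no t-distribution, so the critical value 2.064 is passed in as a table lookup, exactly as in the worked example. The function name is an illustrative assumption.

```python
from math import sqrt

def t_confidence_interval(x_bar, s, n, t_crit):
    """Return (lower, upper) for x_bar +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# t_crit = 2.064 is the tabulated value for 24 degrees of freedom at 95%.
lower, upper = t_confidence_interval(x_bar=100, s=15, n=25, t_crit=2.064)
print(round(lower, 3), round(upper, 3))
```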

5. Hypothesis Testing

5.1 Null and Alternative Hypotheses

The null hypothesis \( H_0 \) is a statement of no effect or no difference, while the alternative hypothesis \( H_1 \) is a statement indicating the presence of an effect or difference. Hypothesis testing determines whether to reject \( H_0 \) in favor of \( H_1 \).

Example:

Test whether a new drug is more effective than the standard treatment. \( H_0: \mu_{\text{new}} \leq \mu_{\text{standard}} \), \( H_1: \mu_{\text{new}} > \mu_{\text{standard}} \).

Perform the hypothesis test using appropriate test statistics.
Determine if the null hypothesis can be rejected.
Result: Based on the test, decide whether the new drug is more effective.

5.2 p-Value and Significance Level

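The p-value is the probability, assuming \( H_0 \) is true, of observing a result at least as extreme as the one obtained; the smaller it is, the stronger the evidence against the null hypothesis. If the p-value is less than the significance level \( \alpha \), reject \( H_0 \). Common significance levels are 0.05 and 0.01.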

Example:

If the p-value is 0.03 and the significance level is 0.05, should you reject the null hypothesis?

Compare the p-value to \( \alpha = 0.05 \).
Since \( 0.03 < 0.05 \), reject \( H_0 \).
Result: The null hypothesis is rejected.

6. Correlation and Regression

6.1 Correlation Coefficient

The correlation coefficient \( r \) measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1:

\( r = 1 \): Perfect positive correlation.
\( r = -1 \): Perfect negative correlation.
\( r = 0 \): No linear correlation (a nonlinear relationship may still exist).

Example:

Compute the correlation coefficient for the data pairs (1, 2), (2, 3), (3, 4), (4, 5), (5, 6).

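Calculate the means: \( \bar{x} = 3 \), \( \bar{y} = 4 \).
Compute the sum of cross-products and the sums of squares: \( \sum (x_i - \bar{x})(y_i - \bar{y}) = 10 \), \( \sum (x_i - \bar{x})^2 = 10 \), \( \sum (y_i - \bar{y})^2 = 10 \).
Result: The correlation coefficient \( r = \frac{10}{\sqrt{10 \times 10}} = 1 \), indicating a perfect positive correlation.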
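
A minimal sketch of the Pearson correlation computed from deviations about the means. `pearson_r` is an illustrative helper name. The factors of 1/n in the covariance and standard deviations cancel, so sums are used directly.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 5, 6]
print(pearson_r(xs, ys))
```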

6.2 Simple Linear Regression

Simple linear regression models the relationship between a dependent variable \( y \) and an independent variable \( x \) using the equation:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

where \( \beta_0 \) is the y-intercept, \( \beta_1 \) is the slope, and \( \epsilon \) is the error term.

Example:

Find the linear regression equation for the data: (1, 2), (2, 3), (3, 5), (4, 4), (5, 6).

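Calculate the slope: with \( \bar{x} = 3 \) and \( \bar{y} = 4 \), \( \beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{9}{10} = 0.9 \).
Calculate the y-intercept: \( \beta_0 = \bar{y} - \beta_1 \bar{x} = 4 - 0.9 \times 3 = 1.3 \).
Result: The regression equation is \( y = 0.9x + 1.3 \).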
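
The least-squares estimates can be verified with a short sketch. `least_squares_line` is an illustrative helper name. For this data it evaluates to a slope of 0.9 and an intercept of 1.3.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
slope, intercept = least_squares_line(xs, ys)
print(round(slope, 4), round(intercept, 4))
```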

7. Analysis of Variance (ANOVA)

7.1 One-Way ANOVA

One-way ANOVA tests whether there are statistically significant differences between the means of three or more independent groups. The null hypothesis states that all group means are equal.

Example:

Test whether three different teaching methods lead to different student performance scores.

Calculate the between-group and within-group variances.
Compute the F-statistic and compare it to the critical value.
Result: Determine whether to reject the null hypothesis based on the F-test.
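
The steps above can be sketched as follows. The scores are hypothetical values invented purely for illustration, and `one_way_anova_f` is an assumed helper name. The sketch stops at the F-statistic and its degrees of freedom; the critical value would still come from an F table or a statistics library.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores for three teaching methods (illustrative only).
method_a = [78, 82, 85, 80]
method_b = [88, 90, 85, 91]
method_c = [70, 75, 72, 74]

f_stat, df_b, df_w = one_way_anova_f([method_a, method_b, method_c])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```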

8. Chi-Square Tests

8.1 Chi-Square Test for Independence

The Chi-square test for independence assesses whether two categorical variables are independent of each other. The test statistic is:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

where \( O \) is the observed frequency and \( E \) is the expected frequency.

Example:

Test whether there is an association between gender (male, female) and preference for a product (like, dislike).

Set up the contingency table and calculate expected frequencies.
Compute the Chi-square statistic and compare it to the critical value.
Result: Determine whether to reject the null hypothesis of independence.
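
A minimal sketch of the procedure, using a hypothetical 2x2 table of counts invented for illustration. Expected frequencies are computed as (row total x column total) / grand total. The critical value 3.841 is the standard tabulated chi-square value for 1 degree of freedom at \( \alpha = 0.05 \).

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Hypothetical counts: rows = male, female; columns = like, dislike.
observed = [[30, 20],
            [20, 30]]

chi2, dof = chi_square_independence(observed)
critical_value = 3.841  # chi-square table, dof = 1, alpha = 0.05
print(chi2, dof, chi2 > critical_value)
```

With these illustrative counts every expected frequency is 25, the statistic is 4.0, and the null hypothesis of independence would be rejected at the 5% level.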

9. Non-Parametric Tests

9.1 Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test is a non-parametric test used to compare two related samples or repeated measurements on a single sample. It assesses whether the paired differences are symmetrically distributed about zero, without assuming that the differences are normally distributed.

Example:

Compare the median scores of two treatments applied to the same subjects.

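Rank the absolute values of the differences between the paired observations (dropping zero differences), then attach the sign of each difference to its rank.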
Calculate the test statistic and compare it to the critical value.
Result: Determine whether to reject the null hypothesis based on the test statistic.
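
The ranking step can be sketched as below, under common conventions that are assumptions here: zero differences are dropped, tied absolute differences share their average rank, and the smaller of the two rank sums is used as the test statistic. The scores are hypothetical and the helper name is illustrative.

```python
def signed_rank_sums(first, second):
    """Return (W_plus, W_minus): rank sums of positive and negative differences."""
    diffs = [b - a for a, b in zip(first, second) if b != a]  # drop zeros
    ordered = sorted(diffs, key=abs)

    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Find the run of tied absolute differences starting at position i.
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 through j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j

    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus

# Hypothetical scores for the same subjects under two treatments.
treatment_1 = [10, 12, 9, 11]
treatment_2 = [12, 11, 9, 14]

w_plus, w_minus = signed_rank_sums(treatment_1, treatment_2)
statistic = min(w_plus, w_minus)
print(w_plus, w_minus, statistic)
```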

10. Time Series Analysis

10.1 Moving Averages

Moving averages smooth out short-term fluctuations and highlight longer-term trends in time series data. The simple moving average (SMA) at period \( t \), taken over a window of the \( n \) most recent observations, is:

\[ SMA_t = \frac{1}{n} \sum_{i=0}^{n-1} X_{t-i} \]

Example:

Calculate the 3-period moving average for the following time series: 10, 20, 30, 40, 50.

Compute the SMA with \( n = 3 \) for each period that has three observations available: \( SMA_3 = \frac{10 + 20 + 30}{3} = 20 \), \( SMA_4 = \frac{20 + 30 + 40}{3} = 30 \), \( SMA_5 = \frac{30 + 40 + 50}{3} = 40 \).
Result: Moving averages are 20, 30, and 40.
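
A minimal sketch of the simple moving average. `simple_moving_average` is an illustrative helper name. It returns one value for each period that has a full window of observations.

```python
def simple_moving_average(series, window):
    """Return the SMA for every position with a full window of data."""
    return [
        sum(series[end - window + 1 : end + 1]) / window
        for end in range(window - 1, len(series))
    ]

series = [10, 20, 30, 40, 50]
print(simple_moving_average(series, window=3))
```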

11. Bayesian Statistics

11.1 Bayes' Theorem

Bayes' Theorem relates the conditional probability of event \( A \) given \( B \) to the conditional probability of \( B \) given \( A \), and the individual probabilities of \( A \) and \( B \):

\[ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} \]

Example:

If the probability of having a disease is 1%, the probability of testing positive given the disease is 90%, and the probability of testing positive without the disease is 5%, what is the probability of having the disease given a positive test result?

Use Bayes' Theorem to calculate: \( P(\text{disease} \mid \text{positive}) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} = 0.154 \).
Result: \( P = 0.154 \) or 15.4%.
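
The calculation can be sketched in code. The denominator expands \( P(\text{positive}) \) over the two cases, disease and no disease. The function and parameter names are illustrative assumptions.

```python
def posterior_given_positive(prior, true_positive_rate, false_positive_rate):
    """P(disease | positive test) via Bayes' Theorem."""
    # Total probability of a positive test across both cases.
    p_positive = (
        true_positive_rate * prior + false_positive_rate * (1 - prior)
    )
    return true_positive_rate * prior / p_positive

p = posterior_given_positive(
    prior=0.01, true_positive_rate=0.90, false_positive_rate=0.05
)
print(round(p, 3))
```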

12. Sampling Methods

12.1 Simple Random Sampling

Simple random sampling is a method where every possible sample of a given size has the same chance of being selected, so each individual in the population has an equal chance of inclusion. It is the most basic sampling technique.

Example:

Select 5 students randomly from a class of 30.

Use a random number generator to select 5 unique numbers between 1 and 30.
Select the corresponding students based on the generated numbers.
Result: 5 randomly selected students.
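
A short sketch of this selection using the standard `random` module. `random.sample` draws without replacement, so the numbers are unique. The seed value is an arbitrary assumption, included only to make the run repeatable.

```python
import random

rng = random.Random(2024)  # arbitrary seed, for repeatability only

# Students are numbered 1 through 30; draw 5 without replacement.
selected = sorted(rng.sample(range(1, 31), k=5))
print(selected)
```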

13. Central Limit Theorem

13.1 Central Limit Theorem

The Central Limit Theorem states that the distribution of the mean of independent, identically distributed observations approaches a normal distribution as the sample size increases, regardless of the shape of the population's distribution, provided the population has finite variance.

Example:

Explain how the Central Limit Theorem applies to the average heights of a large sample of students.

Even if the population of heights is not normally distributed, the distribution of the sample mean will be approximately normal for a large sample size.
Result: The sample mean can be treated as approximately normally distributed due to the Central Limit Theorem.
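
A small simulation can illustrate the theorem. As a stand-in for a non-normal population, this sketch draws from a strongly skewed exponential distribution with mean 1 and standard deviation 1 (an assumption chosen for illustration, not a model of heights). The sample means should cluster around 1 with a spread close to \( 1/\sqrt{n} \).

```python
import random
import statistics
from math import sqrt

random.seed(0)  # arbitrary seed, for repeatability only

n = 50              # size of each sample
num_samples = 2000  # number of repeated samples

# Mean of each sample drawn from a skewed (exponential) population.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
print(round(mean_of_means, 3), round(spread_of_means, 3), round(1 / sqrt(n), 3))
```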

14. Experimental Design

14.1 Randomized Controlled Trials (RCT)

Randomized Controlled Trials (RCT) are experiments where participants are randomly assigned to different treatment groups to test the effect of an intervention.

Example:

Design an RCT to test the effectiveness of a new educational method on student performance.

Randomly assign students to a control group and a treatment group.
Apply the new educational method to the treatment group.
Compare the performance of the two groups using statistical analysis.
Result: Determine whether the new method significantly improves performance.
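
The random assignment step can be sketched as below. The student identifiers are hypothetical, the helper name is illustrative, and the seed is arbitrary. Shuffling and then splitting gives every student the same chance of landing in either group.

```python
import random

def randomize(participants, seed=None):
    """Shuffle participants and split them into (control, treatment)."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant identifiers.
students = [f"student_{i}" for i in range(1, 21)]
control, treatment = randomize(students, seed=42)
print(len(control), len(treatment))
```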

15. Quality Control

15.1 Control Charts

Control charts are used in quality control to monitor whether a process is in statistical control. Common types include X-bar charts (for the mean) and R-charts (for range).

Example:

Create an X-bar chart to monitor the diameter of manufactured bolts.

Calculate the mean diameter for each sample.
Plot the means on the X-bar chart with control limits.
Interpret the chart to determine if the process is in control.
Result: Detect any signals indicating that the process is out of control.
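
A minimal sketch of the X-bar chart calculation, assuming subgroups of 5 bolts and the range-based limits \( \bar{\bar{x}} \pm A_2 \bar{R} \), where \( A_2 = 0.577 \) is the standard tabulated constant for subgroups of size 5. The diameters (in mm) are hypothetical, and plotting is omitted.

```python
def x_bar_limits(samples, a2):
    """Return (center line, lower limit, upper limit, sample means)."""
    means = [sum(s) / len(s) for s in samples]
    ranges = [max(s) - min(s) for s in samples]
    center = sum(means) / len(means)    # grand mean
    r_bar = sum(ranges) / len(ranges)   # average range
    return center, center - a2 * r_bar, center + a2 * r_bar, means

A2_N5 = 0.577  # tabulated constant for subgroup size 5

# Hypothetical bolt diameters in mm, five bolts per sample.
samples = [
    [10.02, 9.98, 10.01, 10.00, 9.99],
    [10.03, 10.00, 9.97, 10.01, 10.02],
    [9.99, 10.01, 10.00, 9.98, 10.02],
    [10.00, 10.04, 9.99, 10.01, 9.98],
]

center, lcl, ucl, means = x_bar_limits(samples, A2_N5)

# A sample mean outside the limits signals a possible out-of-control process.
signals = [i for i, m in enumerate(means, start=1) if not lcl <= m <= ucl]
print(round(center, 4), round(lcl, 4), round(ucl, 4), signals)
```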