NajiElKotob/Awesome-Statistics-For-Data-Science

Awesome Statistics for Data Science

{Awesome Works in Progress}

Let's Talk Data


Data Bias

  • Selection bias Bias that occurs when the sample is not representative of the population due to non-random selection.
  • Berkson’s Bias - statology.org | Berkson’s bias is a type of bias that occurs in research when two variables appear to be negatively correlated in sample data, but are actually positively correlated in the overall population.
  • Historical Bias Bias that results from data reflecting outdated or skewed social, cultural, or historical norms.
  • Outlier Bias Bias introduced by extreme values that disproportionately influence statistical analysis or visual representations.
  • Visualization Bias Bias caused by the way data is visually presented, potentially misleading the viewer's interpretation.
  • Simpson's Paradox A phenomenon where a trend appears in several different groups of data but disappears or reverses when these groups are combined.
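
Simpson's Paradox is easier to see with numbers. The sketch below uses made-up success counts (the groups, options, and figures are hypothetical, not from any study): option A has the higher success rate inside each group, yet option B looks better once the groups are pooled, because the two options are spread unevenly across the groups.

```python
# Hypothetical (successes, trials) counts; all numbers are invented for illustration.
data = {
    "group_1": {"A": (9, 10), "B": (80, 100)},
    "group_2": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials


# Success rate of each option inside each group.
per_group = {
    group: {option: rate(*counts) for option, counts in options.items()}
    for group, options in data.items()
}

# Pool the groups by summing successes and trials per option.
combined = {}
for options in data.values():
    for option, (s, n) in options.items():
        total_s, total_n = combined.get(option, (0, 0))
        combined[option] = (total_s + s, total_n + n)
combined_rates = {option: rate(*counts) for option, counts in combined.items()}

print(per_group)       # A beats B in both groups
print(combined_rates)  # B beats A after pooling
```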

Numbers and Statistics

Statistical Analysis

Samples and Populations

  • Samples & Populations - stat.psu.edu
  • Sampling Methods
    • Probability sampling; Non-probability sampling
  • Sample Size
  • Law Of Large Numbers (LLN) - investopedia.com | The law of large numbers, in probability and statistics, states that as a sample size grows, its mean gets closer to the average of the whole population.
  • Central Limit Theorem (CLT) - investopedia.com | In probability theory, the central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (i.e., a “bell curve”) as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the population's actual distribution shape.
  • Confidence Intervals
    • Confidence intervals explained - scribbr.com
    • Understanding Confidence Intervals 📺 - Dr Nic's Maths and Stats
    • The correct interpretation of a 95% confidence interval is that "we are 95% confident that the population parameter is between X and Y." Learn more
    • A confidence interval gives a range of plausible values for the population mean, centered on the sample mean.
    • Critical Values (Z) 99%=2.576; 95%=1.960; 90%=1.645; 85%=1.440; 80%=1.282
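
The critical values above can be reproduced with Python's standard library, and the same value gives a normal-approximation interval for a mean. This is a minimal sketch: `z_critical`, `mean_confidence_interval`, and the sample data are hypothetical names and numbers chosen for illustration, and for small samples a t-based interval is usually more appropriate than the z-based one shown here.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_critical(confidence):
    """Two-sided critical value: leaves (1 - confidence) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval: sample mean +/- z * s / sqrt(n)."""
    centre = mean(sample)
    margin = z_critical(confidence) * stdev(sample) / sqrt(len(sample))
    return centre - margin, centre + margin


# Reproduce the table of critical values.
for level in (0.99, 0.95, 0.90, 0.85, 0.80):
    print(f"{level:.0%}: {z_critical(level):.3f}")

# Hypothetical measurements, invented for this example.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
low, high = mean_confidence_interval(sample, confidence=0.95)
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")
```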

Descriptive Statistics

Describe, show or summarize data in a meaningful way

  • Measures of Central Tendency - are single values that attempt to describe the central position of a set of data.
    • Mean (Average) - Most meaningful with normally distributed data
      • Arithmetic Mean Σx/n, Geometric Mean ⁿ√(∏x), Harmonic Mean n/Σ(1/x)
      • The Greek letter μ (mu) is used in statistics to represent the population mean of a distribution.
    • Median (The "middle" of a sorted list of numbers) - Diminishes the effect of outliers (aka Med, M, x̃ 'x-tilde')
    • Mode (Most Often) - bi-modal distribution; categorical data
    • Numerical Summarization - stat.psu.edu
    • Sensitivity to skewness
  • Measures of Variability (Dispersion)
    • Range
    • Interquartile Range (IQR)
    • Variance σ²
    • Standard Deviation σ
      • Standard deviation (S) = square root of the variance
    • Standard Error | The standard deviation of a statistic's sampling distribution; for the sample mean it is σ/√n, so it shrinks as the sample size grows
  • Shapes of Distribution
    • Normal Distribution
      • The empirical rule, also referred to as the three-sigma rule or 68-95-99.7 rule, is a statistical rule which states that for a normal (aka Gaussian or Gauss or Laplace–Gauss) distribution, almost all observed data will fall within three standard deviations (denoted by σ) of the mean or average (denoted by µ).
      • The quincunx (or Galton Board) - mathsisfun.com | Simulator
    • Non-normal Distribution (Flat, Bi-modal, Parabolic)
    • Skewness (-ve Skewness/Skewed Left, +ve Skewness/Skewed Right)
      • When data are skewed right, the mean is larger than the median.
      • When data are skewed left, the mean is smaller than the median.
    • Kurtosis (Leptokurtic, Mesokurtic, Platykurtic)
  • Percentiles, Quartiles, Quintile and Decile
    • Percentiles
      • The 30th percentile is the value from the data set greater than 30% of observations, and therefore less than 70% of observations.
      • Median = 50th percentile
      • 1st Quartile = 25th percentile
      • 3rd Quartile = 75th percentile
      • IQR = The difference between Q3 and Q1. IQR contains the middle 50% of data
    • Quartiles are the values that divide a list of numbers into quarters - mathsisfun.com
      • Interquartile range = 3rd quartile - 1st quartile
      • Exclusive method vs inclusive method - The exclusive method excludes the median when identifying Q1 and Q3, while the inclusive method includes the median in identifying the quartiles.
      • Quartile Deviation - byjus.com | Quartile Deviation for Ungrouped/Grouped Data
    • Quintile - "A quintile is one of five values that divide a range of data into five equal parts, each being 1/5th (20 percent) of the range." Investopedia
    • Decile - Divides a range of data into ten equal parts, each being 1/10th (10 percent) of the range. The 1st and 9th deciles are equal to the 10th and 90th percentiles.
    • Outliers
  • Percentage
  • Binning
  • Measures of Association
  • Normalization vs Standardization
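
Most of the measures listed above are available in Python's built-in `statistics` module. The sketch below runs them on a small hypothetical data set with one extreme value. Two assumptions to note: the 1.5 × IQR fence for flagging outliers is a common convention rather than something stated above, and `statistics.quantiles` defines its "exclusive" and "inclusive" methods by interpolation, which agrees with the exclude/include-the-median description for this data set but may differ for other sample sizes.

```python
import statistics as st

# Hypothetical right-skewed sample; 40 is a deliberate extreme value.
data = [2, 3, 3, 5, 7, 8, 9, 12, 40]

summary = {
    "arithmetic_mean": st.mean(data),
    "geometric_mean": st.geometric_mean(data),
    "harmonic_mean": st.harmonic_mean(data),
    "median": st.median(data),
    "mode": st.mode(data),
    "range": max(data) - min(data),
    "population_variance": st.pvariance(data),  # sigma squared
    "population_std_dev": st.pstdev(data),      # sigma
    "sample_std_dev": st.stdev(data),           # s
}
# Standard error of the mean: s / sqrt(n).
summary["standard_error"] = st.stdev(data) / len(data) ** 0.5

# Quartiles under both methods offered by the statistics module.
q1_exc, q2_exc, q3_exc = st.quantiles(data, n=4, method="exclusive")
q1_inc, q2_inc, q3_inc = st.quantiles(data, n=4, method="inclusive")
iqr = q3_exc - q1_exc

# Flag values outside the 1.5 * IQR fences (a common convention).
lower_fence = q1_exc - 1.5 * iqr
upper_fence = q3_exc + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
print("exclusive quartiles:", q1_exc, q2_exc, q3_exc)
print("inclusive quartiles:", q1_inc, q2_inc, q3_inc)
print("outliers:", outliers)
```

Because the sample is skewed right, the mean lands above the median, and the median barely moves if the extreme value is removed.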

Probabilities

Probability is the measure of the likelihood that a particular event will occur. It quantifies uncertainty and is a fundamental concept in statistics and mathematics. Probabilities are expressed as numbers between 0 and 1, where 0 indicates that an event will not occur, and 1 indicates that an event will certainly occur.

Probability Essentials

Probability Basics

Percentiles

Percentiles are measures that indicate the relative position of a value within a data set. A percentile represents the percentage of values in the data set that fall below a given value. For example, the 50th percentile (median) is the value below which 50% of the data points lie.
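
As a minimal sketch of that definition, the hypothetical helper `percentile_rank` below returns the percentage of values that fall strictly below a given value. Other conventions exist (for example, counting ties as half), so treat this as one reasonable reading rather than the only one.

```python
def percentile_rank(values, x):
    """Percentage of values in the data set that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# Hypothetical exam scores, invented for this example.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_rank(scores, 80))  # 50.0: half of the scores are below 80
```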

Permutations

Permutations refer to the different ways in which a set of items can be arranged in a specific order. The order of arrangement is important in permutations. For example, the permutations of the set {A, B, C} are ABC, ACB, BAC, BCA, CAB, and CBA.
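
The example above can be checked with `itertools.permutations` from the standard library; the count of arrangements for n distinct items is n!.

```python
from itertools import permutations
from math import factorial

items = ["A", "B", "C"]

# Every ordering of the three items, joined into strings.
arrangements = ["".join(p) for p in permutations(items)]

print(arrangements)           # ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']
print(factorial(len(items)))  # 6, the number of orderings of 3 items
```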

Combinations

Combinations are the different ways of selecting items from a larger set where the order of selection does not matter. For example, the combinations of choosing 2 items from the set {A, B, C} are AB, AC, and BC.

  • To calculate the probability of winning a lottery where you need to guess 6 correct numbers out of 42, we can use combinations (also known as "combinatorics").
    • The total number of possible combinations of 6 numbers out of 42: 42! / (6! × (42-6)!) = 5,245,786
    • The probability of winning with a single ticket is 1 in 5,245,786, or approximately 0.0000001906
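
Both the small {A, B, C} example and the lottery count can be reproduced with the standard library. The sketch assumes a single ticket and that every set of 6 numbers is equally likely.

```python
from itertools import combinations
from math import comb

# Choosing 2 items from {A, B, C}; order does not matter.
pairs = ["".join(c) for c in combinations("ABC", 2)]
print(pairs)  # ['AB', 'AC', 'BC']

# Number of ways to choose 6 numbers out of 42: 42! / (6! * 36!).
total = comb(42, 6)
p_win = 1 / total

print(f"{total:,}")    # 5,245,786
print(f"{p_win:.10f}")  # 0.0000001906
```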

Multiple Events Probabilities

Multiple Events Probabilities refer to the likelihood of various combinations of events occurring in a given scenario. It involves calculating the probability of two or more events happening together or in sequence.

Probabilities of two events

Conditional Probabilities

Law of Total Probability

Multiplication rule

Probability trees

Bayes Theorem

Discrete and Continuous Probabilities

Discrete and Continuous Probabilities refer to the types of random variables and their associated probability distributions. Discrete probabilities deal with discrete random variables, which have a countable number of possible values. Examples include the number of heads in coin tosses or the number of students in a class. Continuous probabilities deal with continuous random variables, which have an infinite number of possible values within a given range. Examples include heights, weights, and time.
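
The difference shows up in how probabilities are computed. In the sketch below, a discrete variable (heads in coin tosses) gets a probability for each exact value, while a continuous variable (height) only gets probabilities for ranges. The height distribution Normal(170, 10) and the helper `binomial_pmf` are assumptions made for illustration.

```python
from math import comb
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Discrete: probability of exactly 5 heads in 10 fair coin tosses.
p_five_heads = binomial_pmf(5, 10, 0.5)
print(p_five_heads)  # 0.24609375

# Continuous: hypothetical heights in cm, assumed Normal(mean=170, sd=10).
heights = NormalDist(mu=170, sigma=10)

# Probability of a range: within one standard deviation of the mean.
p_between = heights.cdf(180) - heights.cdf(160)
print(round(p_between, 4))

# Z-score: how many standard deviations 185 cm lies above the mean.
z = (185 - heights.mean) / heights.stdev
print(z)  # 1.5
```

The range probability comes out near 68%, matching the first band of the 68-95-99.7 rule.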

Discrete vs Continuous

Binomials

Bell-shaped curve

Z-Score


Inferential Statistics

Core concepts
  • T-distribution vs. z-distribution
Hypothesis
  • Hypothesis tests attempt to provide an answer to questions such as "How likely is it that an observation is due to random chance alone?"
  • Null Hypothesis
    • The null hypothesis, H0 is the commonly accepted fact; it is the opposite of the alternate hypothesis. Researchers work to reject, nullify or disprove the null hypothesis. Researchers come up with an alternate hypothesis, one that they think explains a phenomenon, and then work to reject the null hypothesis. learn more
    • The null statement must always contain some form of equality (=, ≤ or ≥). Always write the alternative hypothesis, typically denoted with Ha or H1, using less than, greater than, or not equals symbols, i.e., (≠, >, or <). learn more
  • Confidence Intervals
  • Confidence Level
  • Alpha value (aka significance level)
  • Type I and Type II errors - scribbr.com
    • Which is more dangerous for a smoke detector? A type I (false positive) or type II error (false negative)?
  • t-test - compares the means of two groups
  • Chi-Square test - determines whether categorical variables are associated
  • z-test
    • Z-Scores
      • Simply put, a z-score (also called a standard score) gives you an idea of how far from the mean a data point is. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is. learn more
      • Z-Table - z-table.net
  • Probability
  • P-value ⬆⬇
    • Meet P. Value (aka p-value) - Alteryx Community Team
      • P-Values, clearly explained (Video) - StatQuest
      • In general, P values larger than 0.01 should be reported to two decimal places, those between 0.01 and 0.001 to three decimal places; P values smaller than 0.001 should be reported as P<0.001. learn more
      • A p-value at or below 0.05 is conventionally treated as statistically significant. It means that, if the null hypothesis were true, results at least as extreme as those observed would occur less than 5% of the time; it is not the probability that the null hypothesis is correct. In that case we reject the null hypothesis in favor of the alternative hypothesis. learn more
    • Cheatsheet
  • Univariate, bivariate, multivariate analysis and multivariate multiple regression (MMR)
  • Upsampling and Downsampling
Error Types
  • Fail to reject H0 AND H0 is True = Correct
  • Reject H0 AND H0 is False = Correct
  • Reject H0 AND H0 is True = Type I Error (false positive)
  • Fail to reject H0 AND H0 is False = Type II Error (false negative)
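
To tie the z-score, p-value, and alpha level together, here is a minimal sketch of a two-sided one-sample z-test. The function name `one_sample_z_test` and the numbers passed to it are hypothetical, and the test assumes the population standard deviation is known; when it is not, a t-test is the usual choice.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability, under H0, of a z at least this far from 0 in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    reject_h0 = p_value <= alpha
    return z, p_value, reject_h0


# Hypothetical inputs: H0 says the mean is 100, sigma is 15, and n = 100.
z, p, reject = one_sample_z_test(sample_mean=103, mu0=100, sigma=15, n=100)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")

# A smaller observed difference does not reach significance at alpha = 0.05.
z2, p2, reject2 = one_sample_z_test(sample_mean=101, mu0=100, sigma=15, n=100)
print(f"z = {z2:.2f}, p = {p2:.4f}, reject H0: {reject2}")
```

If H0 is actually true, rejecting it in the first call would be a Type I error; if H0 is actually false, failing to reject it in the second call would be a Type II error.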
Statistical Tests and Analysis
Error Estimations
Effect Size

Other Topics


Tools


Write-Up


Surveys

Books


Mathematics for Machine Learning

  • Linear Algebra: Fundamental concepts such as matrices, vectors, dot products, and matrix multiplication are crucial. Linear algebra provides the language and framework for describing and manipulating data in ML.
  • Calculus: Understanding of derivatives and gradients is important for optimization problems in ML, including gradient descent.
  • Probability and Statistics: Basic understanding of probabilities, probability distributions, means, variances, and expectation values is essential for understanding models, making predictions, and evaluating model performance.
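
To show how derivatives drive optimization, here is a minimal gradient descent sketch on a one-variable function. The function f(x) = (x - 3)², the learning rate, and the step count are all assumptions chosen so the example is easy to follow; real ML problems apply the same update to many parameters at once.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to move toward a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# f(x) = (x - 3)^2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 6))  # 3.0
```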

Pre-Algebra

Linear Algebra

YouTube 📺
Tools
  • Graphing Calculator ⭐ - desmos.com | Explore math with our beautiful, free online graphing calculator.

Calculus

Precalculus

Matrices

Gradient

Optimization Methods

Extra Knowledge

Books

Python

  • manim - Mathematical Animation Engine

YouTube 📺


Learning


Special Videos 📺

Special Channels 📺


Articles


Related Topics
