{Awesome Works in Progress}
- What is data?
- What is Data - Data represent facts: things that have actually taken place and been observed and measured.
- Data > (beats) Opinion
- Data literacy
Data literacy is the ability to read, work with, analyze and communicate with data.
- How to build data literacy in your company - mit.edu
- Boost Your Team’s Data Literacy - hbr.org
- Data Analytics vs Data Analysis - bmc.com
- Data literacy training - statcan.gc.ca
- Data Literacy Preview: Study Hall: ASU + Crash Course 📺 ⭐ - Arizona State University
- A Data Culture is the collective behaviors and beliefs of people who value, practice, and encourage the use of data to improve decision-making.
- Data Types
- Qualitative
- Nominal, Ordinal, Binary
- Quantitative
- Discrete, Continuous
- Learn more
- Types of Data & Measurement Scales: Nominal, Ordinal, Interval and Ratio - mymarketresearchmethods.com
- Types of Variable - Laerd.com
- Interval scale vs Ratio scale: Interval scales have no true zero and can represent values below zero, e.g., you can measure temperature below 0 degrees Celsius, such as -10 degrees. Ratio variables never fall below zero: height and weight are measured from 0 upward.
- Understanding Ratios & Proportions - libguides.com
- When a Variable’s Level of Measurement Isn’t Obvious - theanalysisfactor.com
- Foot Size vs Shoe Size - Shoe size is discrete, but the underlying measure, foot length, is continuous (measurement) data. learn more
- Discrete vs. Continuous Variables: Meaning and Differences - outlier.org
- Is Age Discrete or Continuous?
- The numbers on the basketball players' t-shirts?
- Selection bias
Bias that occurs when the sample is not representative of the population due to non-random selection.
- Berkson’s Bias - statology.org | Berkson’s bias is a type of bias that occurs in research when two variables appear to be negatively correlated in sample data, but are actually positively correlated in the overall population.
- Historical Bias
Bias that results from data reflecting outdated or skewed social, cultural, or historical norms.
- Outlier Bias
Bias introduced by extreme values that disproportionately influence statistical analysis or visual representations.
- Visualization Bias
Bias caused by the way data is visually presented, potentially misleading the viewer's interpretation.
- Simpson's Paradox
A phenomenon where a trend appears in several different groups of data but disappears or reverses when these groups are combined.
- Samples & Populations - stat.psu.edu
- Sampling Methods
- Probability sampling; Non-probability sampling
- Sample Size
- Sample Size Calculator - calculator.net
- Calculate your sample size - surveymonkey.com
- A general rule of thumb for the Large Enough Sample Condition is that n≥30, where n is your sample size. Learn more - statisticshowto.com
- Margin of Error & Sample size Calculator - aytm.com
- Number of Samples (You can Afford) = Budget / Cost per Sample
- Power Analysis
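The budget rule above can be sketched in a few lines, alongside a common margin-of-error formula for a proportion. The formula n = z² · p(1 − p) / e² is a standard textbook approximation that this list does not state itself, and the function names and example figures are illustrative assumptions.

```python
import math
from statistics import NormalDist


def affordable_samples(budget: float, cost_per_sample: float) -> int:
    """Number of samples you can afford = budget / cost per sample, rounded down."""
    return int(budget // cost_per_sample)


def sample_size_for_proportion(margin_of_error: float,
                               confidence: float = 0.95,
                               p: float = 0.5) -> int:
    """Assumed textbook formula: n = z^2 * p * (1 - p) / e^2, rounded up.

    p = 0.5 is the most conservative guess for the unknown proportion.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


print(affordable_samples(5000, 12))
print(sample_size_for_proportion(0.05))
```

Comparing the two numbers shows quickly whether the budget covers the precision you want.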
- Law Of Large Numbers (LLN) - investopedia.com | The law of large numbers, in probability and statistics, states that as a sample size grows, its mean gets closer to the average of the whole population.
- Central Limit Theorem (CLT) - investopedia.com | In probability theory, the central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (i.e., a “bell curve”) as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the population's actual distribution shape.
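A small simulation can make the law of large numbers and the central limit theorem more concrete. This is only a sketch: the die, the exponential population, the sample sizes, and the seed are all arbitrary choices made for illustration.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

# Law of large numbers: the mean of fair die rolls drifts toward 3.5
# as more rolls are included.
rolls = [random.randint(1, 6) for _ in range(100_000)]
for size in (10, 1_000, 100_000):
    print(size, mean(rolls[:size]))

# Central limit theorem: draw many samples of size n from a skewed
# (exponential) population with mean 1 and standard deviation 1.
# The sample means cluster around 1 with spread close to 1 / sqrt(n).
n = 50
sample_means = [
    mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(5_000)
]
print(mean(sample_means), stdev(sample_means), 1 / n ** 0.5)
```

Plotting a histogram of `sample_means` would show the bell shape even though the underlying population is strongly skewed.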
- Confidence Intervals
- Confidence intervals explained - scribbr.com
- Understanding Confidence Intervals 📺 - Dr Nic's Maths and Stats
- The correct interpretation of a 95% confidence interval is that "we are 95% confident that the population parameter is between X and Y." Learn more
- A confidence interval gives a range of plausible values for the population mean, centered on the sample mean.
- Critical Values (Z): 99% = 2.576; 95% = 1.960; 90% = 1.645; 85% = 1.440; 80% = 1.282
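The critical values above can be approximately recovered from the standard normal distribution, and then used to build an interval around a sample mean. The sketch below assumes a z-based interval; for small samples a t critical value is usually more appropriate. The function names and the data are made up for illustration.

```python
from statistics import NormalDist, mean, stdev


def z_critical(confidence: float) -> float:
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def mean_confidence_interval(data, confidence=0.95):
    """z-based interval: sample mean +/- z * (sample sd / sqrt(n))."""
    m = mean(data)
    standard_error = stdev(data) / len(data) ** 0.5
    margin = z_critical(confidence) * standard_error
    return m - margin, m + margin


for level in (0.99, 0.95, 0.90, 0.85, 0.80):
    print(level, round(z_critical(level), 3))

print(mean_confidence_interval([2, 4, 4, 4, 5, 5, 7, 9]))
```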
Describe, show or summarize data in a meaningful way
- Measures of Central Tendency - are single values that attempt to describe the central position of a set of data.
- Mean (Average) - Most meaningful with normally distributed data
- Arithmetic Mean: Σx / n; Geometric Mean: ⁿ√(∏x); Harmonic Mean: n / Σ(1/x)
- The Greek letter μ (mu) is used in statistics to represent the population mean of a distribution.
- Median (The "middle" of a sorted list of numbers) - Diminishes the effect of outliers (aka Med, M, x̃ 'x-tilde')
- Mode (Most Often) - bi-modal distribution; categorical data
- Numerical Summarization - stat.psu.edu
- Sensitivity to skewness
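As a quick illustration of the measures of central tendency listed above, Python's standard `statistics` module covers all of them. The data set is invented, with one deliberate outlier to show how the mean and the median react differently.

```python
from statistics import mean, median, mode, geometric_mean, harmonic_mean

data = [2, 3, 3, 4, 5, 6, 40]  # 40 is a deliberate outlier

data_mean = mean(data)      # pulled upward by the outlier
data_median = median(data)  # middle value of the sorted list
data_mode = mode(data)      # most frequent value

print(data_mean, data_median, data_mode)

# Geometric mean suits growth factors; harmonic mean suits rates.
print(geometric_mean([1, 4, 16]))  # nth root of the product
print(harmonic_mean([40, 60]))     # n / sum(1/x)
```

Here the outlier drags the mean well above the median, which is the sensitivity to skewness noted in the list.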
- Measures of Variability (Dispersion)
- Range
- Interquartile Range (IQR)
- Variance σ²
- Standard Deviation σ
- Standard deviation (S) = square root of the variance
- Standard Error | The standard deviation of a sampling distribution; for the sample mean, SE = s / √n
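The measures of variability above can be computed as follows. This is a minimal sketch on invented data; it uses the population formulas for σ² and σ, and the sample standard deviation for the standard error of the mean.

```python
from statistics import pvariance, pstdev, stdev, quantiles

data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

population_variance = pvariance(data)  # sigma squared
population_sd = pstdev(data)           # sigma = sqrt(variance)

# Standard error of the mean: sample standard deviation / sqrt(n).
standard_error = stdev(data) / len(data) ** 0.5

print(data_range, iqr, population_variance, population_sd, standard_error)
```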
- Shapes of Distribution
- Normal Distribution
- The empirical rule, also referred to as the three-sigma rule or 68-95-99.7 rule, is a statistical rule which states that for a normal (aka Gaussian or Gauss or Laplace–Gauss) distribution, almost all observed data will fall within three standard deviations (denoted by σ) of the mean or average (denoted by µ).
- The quincunx (or Galton Board) - mathsisfun.com | Simulator
- Non-normal Distribution (Flat, Bi-modal, Parabolic)
- Skewness (-ve Skewness/Skewed Left, +ve Skewness/Skewed Right)
- When data are skewed right, the mean is larger than the median.
- When data are skewed left, the mean is smaller than the median.
- Kurtosis (Leptokurtic, Mesokurtic, Platykurtic)
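Two of the claims above are easy to check numerically: the 68-95-99.7 rule for a normal distribution, and the mean sitting above the median in right-skewed data. The sketch uses the standard normal distribution and a small invented data set.

```python
from statistics import NormalDist, mean, median

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k: float) -> float:
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))

# Right-skewed data: a long tail of large values pulls the mean above the median.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
print(mean(right_skewed), median(right_skewed))
```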
- Percentiles, Quartiles, Quintile and Decile
- Percentiles
- The 30th percentile is the value from the data set greater than 30% of observations, and therefore less than 70% of observations.
- Median = 50th percentile
- 1st Quartile = 25th percentile
- 3rd Quartile = 75th percentile
- IQR = The difference between Q3 and Q1. IQR contains the middle 50% of data
- Quartiles are the values that divide a list of numbers into quarters - mathsisfun.com
- Interquartile range = 3rd quartile - 1st quartile
- Exclusive method vs inclusive method - The exclusive method excludes the median when identifying Q1 and Q3, while the inclusive method includes the median in identifying the quartiles.
- Quartile Deviation - byjus.com | Quartile Deviation for Ungrouped/Grouped Data
- Quintile - "A quintile is one of five values that divide a range of data into five equal parts, each being 1/5th (20 percent) of the range." Investopedia
- Decile - Divides a range of data into ten equal parts, each being 1/10th (10 percent) of the range. The 1st and 9th deciles are equal to the 10th and 90th percentiles.
- Outliers
- Identifying outliers with the 1.5xIQR rule - khanacademy.org
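The quartile and 1.5×IQR ideas above might be sketched like this. One caveat: the `method` argument of `statistics.quantiles` accepts "exclusive" and "inclusive", but these names refer to Python's own interpolation conventions and may not match the median-based definitions given in the list. The helper name and the data are illustrative.

```python
from statistics import quantiles


def iqr_outliers(data, method="exclusive"):
    """Return values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _median, q3 = quantiles(data, n=4, method=method)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(quantiles(data, n=4, method="exclusive"))
print(quantiles(data, n=4, method="inclusive"))
print(iqr_outliers(data))
```

The two methods give slightly different quartiles, yet both flag the same extreme value here.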
- Percentage
- Percentage Difference, Percentage Error, Percentage Change - mathsisfun.com
- Percentage Change and Percent Difference - sumn.org
- Change = ((New - Old) / |Old|) * 100
- Difference = |(First - Second)/((First + Second)/2)| * 100
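The two formulas above translate directly into code. The function names are illustrative; note that percentage change is undefined when the old value is 0, and percentage difference is undefined when the two values sum to 0.

```python
def percentage_change(old: float, new: float) -> float:
    """Change = ((New - Old) / |Old|) * 100."""
    return (new - old) / abs(old) * 100


def percentage_difference(first: float, second: float) -> float:
    """Difference = |(First - Second) / ((First + Second) / 2)| * 100."""
    return abs((first - second) / ((first + second) / 2)) * 100


print(percentage_change(50, 75))      # an increase from 50 to 75
print(percentage_change(-50, -25))    # |Old| keeps the sign meaningful
print(percentage_difference(40, 60))  # symmetric: the order does not matter
```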
- Binning
- Freedman–Diaconis rule
- Sturges' rule - Optimal Bins = ⌈log₂ n + 1⌉
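Both binning rules can be sketched in a few lines. The Sturges formula is the one given above; the Freedman–Diaconis bin width, 2 · IQR / n^(1/3), is the usual textbook form and is not spelled out in this list. Function names and data are illustrative.

```python
import math
from statistics import quantiles


def sturges_bins(n: int) -> int:
    """Sturges' rule: ceil(log2(n) + 1) bins for n observations."""
    return math.ceil(math.log2(n) + 1)


def freedman_diaconis_width(data) -> float:
    """Freedman-Diaconis bin width: 2 * IQR / n^(1/3)."""
    q1, _median, q3 = quantiles(data, n=4)
    return 2 * (q3 - q1) / len(data) ** (1 / 3)


print(sturges_bins(100))
print(freedman_diaconis_width([1, 2, 3, 4, 5, 6, 7, 8]))
```

Sturges' rule depends only on the number of observations, while Freedman–Diaconis also reacts to the spread of the data.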
- Measures of Association
- Correlation and Causation
- Correlation vs Causation: Understand the Difference for Your Product - amplitude.com
- Correlation means there is a relationship or pattern between the values of two variables. A scatterplot displays data about two variables as a set of points in the xy-plane and is a useful tool for determining if there is a correlation between the variables.
- Causation means that one event causes another event to occur. Causation can only be determined from an appropriately designed experiment. In such experiments, similar groups receive different treatments, and the outcomes of each group are studied. We can only conclude that a treatment causes an effect if the groups have noticeably different outcomes.
- Confounding Variables - scribbr.com
- Correlation Coefficients
- What Do Correlation Coefficients Positive, Negative, and Zero Mean? - investopedia.com
- Covariance
- What Is Covariance? - investopedia.com
- Covariance (CFI) - corporatefinanceinstitute.com | A measure of the relationship between random variables
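Covariance and the Pearson correlation coefficient can be computed by hand to see how they relate. This sketch uses the sample (n − 1) formulas on invented data; the helper names are illustrative.

```python
from statistics import mean


def covariance(x, y) -> float:
    """Sample covariance: sum((x - mean_x) * (y - mean_y)) / (n - 1)."""
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_r(x, y) -> float:
    """Correlation = covariance scaled by both standard deviations."""
    return covariance(x, y) / (covariance(x, x) * covariance(y, y)) ** 0.5


hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]    # rises with hours: r = +1
score_down = [10, 8, 6, 4, 2]  # falls with hours: r = -1

print(covariance(hours, score_up), pearson_r(hours, score_up))
print(covariance(hours, score_down), pearson_r(hours, score_down))
```

Covariance depends on the units of the variables, while the correlation coefficient always falls between -1 and +1.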
- Normalization vs Standardization
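The list does not define the two terms, so the sketch below assumes the common usage: normalization as min-max rescaling to the range 0 to 1, and standardization as conversion to z-scores with mean 0 and standard deviation 1. Function names are illustrative.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values to [0, 1]: (x - min) / (max - min)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Convert values to z-scores: (x - mean) / standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


values = [10, 20, 30, 40, 100]
print(min_max_normalize(values))
print(standardize(values))
```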
Probability is a measure of the likelihood that a particular event will occur. It quantifies uncertainty and is a fundamental concept in statistics and mathematics. Probabilities are expressed as numbers between 0 and 1, where 0 indicates that an event will not occur, and 1 indicates that an event will certainly occur.
- Probability Basics - 365 Data Science
Percentiles are measures that indicate the relative position of a value within a data set. A percentile represents the percentage of values in the data set that fall below a given value. For example, the 50th percentile (median) is the value below which 50% of the data points lie.
Permutations refer to the different ways in which a set of items can be arranged in a specific order. The order of arrangement is important in permutations. For example, the permutations of the set {A, B, C} are ABC, ACB, BAC, BCA, CAB, and CBA.
Combinations are the different ways of selecting items from a larger set where the order of selection does not matter. For example, the combinations of choosing 2 items from the set {A, B, C} are AB, AC, and BC.
- To calculate the probability of winning a lottery where you need to guess 6 correct numbers out of 42, we can use combinations (a topic in combinatorics).
- The total number of possible combinations of 6 numbers out of 42: 42! / (6! × (42-6)!) = 5,245,786
- The probability: approximately 0.0000001906, or 1 in 5,245,786
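The permutation, combination, and lottery examples above can be reproduced with the standard library. This is a small sketch using `itertools` and `math.comb` (available from Python 3.8).

```python
import math
from itertools import combinations, permutations

# Order matters: all arrangements of A, B, C.
arrangements = ["".join(p) for p in permutations("ABC")]
print(arrangements)

# Order does not matter: all ways to choose 2 of A, B, C.
selections = ["".join(c) for c in combinations("ABC", 2)]
print(selections)

# Lottery: choose 6 numbers out of 42.
total_tickets = math.comb(42, 6)
print(total_tickets, 1 / total_tickets)
```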
Multiple Events Probabilities refer to the likelihood of various combinations of events occurring in a given scenario. It involves calculating the probability of two or more events happening together or in sequence.
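As a concrete case of events happening together, consider rolling two fair dice. The sketch below enumerates all 36 outcomes and checks two standard rules that the text does not state explicitly: P(A and B) = P(A) · P(B) for independent events, and P(A or B) = P(A) + P(B) − P(A and B). The dice example is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))


def probability(event) -> Fraction:
    """Share of outcomes for which the event function returns True."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))


p_first_six = probability(lambda o: o[0] == 6)
p_second_six = probability(lambda o: o[1] == 6)
p_both_six = probability(lambda o: o[0] == 6 and o[1] == 6)
p_either_six = probability(lambda o: o[0] == 6 or o[1] == 6)

print(p_both_six, p_first_six * p_second_six)
print(p_either_six, p_first_six + p_second_six - p_both_six)
```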
Discrete and Continuous Probabilities refer to the types of random variables and their associated probability distributions. Discrete probabilities deal with discrete random variables, which have a countable number of possible values. Examples include the number of heads in coin tosses or the number of students in a class. Continuous probabilities deal with continuous random variables, which have an infinite number of possible values within a given range. Examples include heights, weights, and time.
- T-distribution vs. z-distribution
- Hypothesis tests attempt to provide an answer to questions such as "How likely is it that an observation is just random chance?"
- Null Hypothesis
- The null hypothesis, H0, is the commonly accepted fact; it is the opposite of the alternate hypothesis. Researchers propose an alternate hypothesis, one that they think explains a phenomenon, and then work to reject the null hypothesis. learn more
- The null statement must always contain some form of equality (=, ≤ or ≥). Always write the alternative hypothesis, typically denoted with Ha or H1, using less than, greater than, or not equals symbols, i.e., (≠, >, or <). learn more
- Confidence Intervals
- Confidence Level
- Alpha value (aka significance level)
- Type I and Type II errors - scribbr.com
- Which is more dangerous for a smoke detector? A type I (false positive) or type II error (false negative)?
- t-test - compares the means of two groups
- Chi-Square test - determines whether categorical variables are associated
- z-test
- Z-Scores
- Simply put, a z-score (also called a standard score) gives you an idea of how far from the mean a data point is. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is. learn more
- Z-Table - z-table.net
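A z-score and the matching z-table lookup can be sketched as follows. The population mean of 100 and standard deviation of 15 are invented example values; `NormalDist().cdf` plays the role of the z-table.

```python
from statistics import NormalDist


def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


z = z_score(130, mu=100, sigma=15)

# Share of a normal population that falls below this z-score.
share_below = NormalDist().cdf(z)

print(z, round(share_below, 4))
```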
- Probability
- Probability Line - dcp.edu.gov.on.ca
- Intro to Probability for Data Science (Free e-book) - probability4datascience.com
- Statistical significance ♟
- Statistical significance refers to the claim that a result from data generated by testing or experimentation is not likely to occur randomly or by chance but is instead likely to be attributable to a specific cause learn more
- A Refresher on Statistical Significance - Amy Gallo (Harvard Business Review)
- P-value ⬆⬇
- Meet P. Value (aka p-value) - Alteryx Community Team
- P-Values, clearly explained (Video) - StatQuest
- In general, P values larger than 0.01 should be reported to two decimal places, those between 0.01 and 0.001 to three decimal places; P values smaller than 0.001 should be reported as P<0.001. learn more
- A p-value of 0.05 or less is conventionally treated as statistically significant. It means that, if the null hypothesis were true, results at least as extreme as those observed would be expected less than 5% of the time; it is not the probability that the null hypothesis is correct. In that case we reject the null hypothesis in favor of the alternative hypothesis. learn more
- Cheatsheet
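To make the p-value notes above more tangible, here is a sketch of a two-sided one-sample z-test, plus a small helper that applies the reporting rule quoted above. It assumes the population standard deviation is known, which is rarely true in practice; the function names and numbers are illustrative.

```python
from statistics import NormalDist


def two_sided_p_value(sample_mean: float, mu0: float,
                      sigma: float, n: int) -> float:
    """P(result at least this extreme | H0 true) for a one-sample z-test."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def format_p(p: float) -> str:
    """Reporting rule: 2 decimals above 0.01, 3 decimals down to 0.001."""
    if p < 0.001:
        return "P<0.001"
    if p < 0.01:
        return f"P={p:.3f}"
    return f"P={p:.2f}"


p = two_sided_p_value(sample_mean=103, mu0=100, sigma=15, n=100)
alpha = 0.05
print(round(p, 4), format_p(p), "reject H0" if p <= alpha else "fail to reject H0")
```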
- Univariate, bivariate, multivariate and multivariate multiple analysis (MMR)
- Upsampling and Downsampling
- Fail to reject H0 AND H0 is True = Correct
- Reject H0 AND H0 is False = Correct
- Reject H0 AND H0 is True = Type I Error
- Fail to reject H0 AND H0 is False = Type II Error
- Statistical tests: which one should you use? - scribbr.com
- Choosing the correct statistical test in SAS, Stata, SPSS and R - stats.idre.ucla.edu
- How to choose the right statistical test? - Barun Nayak and Avijit Hazra
- Frequency Distribution
- Cross Tabulation
- Correspondence analysis
- Multinomial Logistic Regression
- Cluster Analysis
- One-hot encoding
- Numerical encoding
- Ordinal encoding
- Cohen's d is designed for comparing two groups. It takes the difference between two means and expresses it in standard deviation units. It tells you how many standard deviations lie between the two means learn more
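Cohen's d, as described above, might be computed like this. The sketch assumes the pooled standard deviation as the denominator, which is one common variant; the groups are invented data.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b) -> float:
    """Difference between two means in pooled standard deviation units."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_variance ** 0.5


treatment = [2, 4, 6]
control = [1, 3, 5]
print(cohens_d(treatment, control))
```

Here the two means are half a pooled standard deviation apart.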
- Power Analysis, Clearly Explained! 📺 - StatQuest
- Equations and Functions
- The Distributive Property (Video) - Algebra Basics: The Distributive Property - Math Antics
- Aggregation
- The perils of calculating an Average of Averages - Darren Gosbell
- Calculating a Weighted Average (Average of Averages) - Analytics Edge
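The pitfall behind an average of averages shows up as soon as the groups differ in size. The sketch below uses two invented groups and compares the naive mean of group means with the weighted average.

```python
from statistics import mean

# Hypothetical groups: average score and number of people in each.
group_means = {"A": 90.0, "B": 60.0}
group_sizes = {"A": 10, "B": 90}

# Naive: treats both groups as equally important.
average_of_averages = mean(group_means.values())

# Weighted: each group mean counts in proportion to its size.
total_size = sum(group_sizes.values())
weighted_average = sum(
    group_means[g] * group_sizes[g] for g in group_means
) / total_size

print(average_of_averages, weighted_average)
```

The weighted figure matches what you would get by averaging all 100 individual scores directly.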
- Anscombe's Quartet
- The Datasaurus Dozen - Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing.
- Derivatives and Optimizations
- Vectors and Matrices
- Wide and Long data
- Pairwise vs. Listwise deletion
- Monte Carlo
- Monte Carlo Simulation 📺 - MIT OpenCourseWare
- Introduction to Monte Carlo simulation in Excel
- A/B Testing
- Bayes' Theorem
- Types of Distribution | Normal; Binomial; Uniform; Poisson; Beta; Gamma; Log
- Time series
- Forecasting: Principles and Practice ⭐ - Rob J Hyndman and George Athanasopoulos | Monash University, Australia
- Statistics Kingdom - statskingdom.com
- DrawMyData - a tool for teaching stats and data science by Robert Grant
- Desmos - Graphing Calculator - desmos.com
- Standard Normal Distribution Table - mathsisfun.com
- Probability Distribution Applets - divms.uiowa.edu (Matt Bognar, Ph.D.)
- Simulated Sampling Distributions ⭐ - albany.edu
- One Sample T-Test Calculator - statskingdom.com
- Comparing Two Independent Samples (Sample Size) ⭐ - stat.ubc.ca
- Poisson Distribution Calculator - statology.org | Example: if a restaurant's normal average is 100 (λ = 100), the probability of receiving exactly 130 is P(X = 130) ≈ 0.00058
- Online Statistics Calculator ⭐ - datatab.net
- Dice Roller - random.org
- Analytical Report – What Is It and How to Write It? - whatagraph.com
- APA
- Numbers & Statistics - APA Formatting And Style Guide (7th Edition)
- Reporting a multiple linear regression in APA
- Survey testing - abs.gov.au
- Pattern Recognition and Machine Learning (Free)
- Analysis of Multiple Dependent Variables
- Introductory Statistics - saylordotorg.github.io
- Introduction to Statistics - courses.lumenlearning.com
- Mathspace ⭐ - mathspace.co | We bring all of your learning tools together in one place, from video lessons, textbooks, to adaptive practice. Encourage your students to become self-directed learners.
- Introduction to Probability for Data Science - probability4datascience.com
- Introduction to Statistical Learning 📺 ~12 hours
- Linear Algebra: Fundamental concepts such as matrices, vectors, dot products, and matrix multiplication are crucial. Linear algebra provides the language and framework for describing and manipulating data in ML.
- Calculus: Understanding of derivatives and gradients is important for optimization problems in ML, including gradient descent.
- Probability and Statistics: Basic understanding of probabilities, probability distributions, means, variances, and expectation values is essential for understanding models, making predictions, and evaluating model performance.
- Factors and multiples
- Prime numbers
- Khan Academy - Linear algebra
- Linear Algebra for Machine Learning - Jon Krohn | (48 videos) This is a complete course on linear algebra for machine learning.
- Essence of linear algebra - 3Blue1Brown
- The Art of Linear Programming 📺 ~19min ⭐ - Tom S
- Graphing Calculator ⭐ - desmos.com | Explore math with our beautiful, free online graphing calculator.
- Calculus made EASY! 5 Concepts you MUST KNOW before taking calculus! 📺 ~24min - Dr Ji Tutoring
- Precalculus by Richard Wright - andrews.edu
- Precalculus & Trigonometry - math.libretexts.org
- Precalculus (Cuemath) - cuemath.com
- Essential Math for Data Science: Introduction to Matrices and the Matrix Product
- The Applications of Matrices | What I wish my teachers told me way earlier 📺 ~25min - Zach Star
- Introduction to Matrices ⭐ - math.libretexts.org
- Matrix Calculator - DESMOS - desmos.com How to
- Scalars, Vectors and Matrices
- A scalar is a number, like 3, -5, 0.368 | A vector is a list of numbers (can be in a row or column) | A matrix is an array of numbers (one or more rows, one or more columns).
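The definitions above map naturally onto plain Python values: a number for a scalar, a list for a vector, and a list of rows for a matrix. This is a minimal sketch without any external library; the helper names are illustrative.

```python
scalar = 3
vector = [1, 2, 3]
matrix = [[1, 2],
          [3, 4]]


def scale(k, v):
    """Multiply every element of a vector by a scalar."""
    return [k * x for x in v]


def dot(u, v):
    """Dot product: sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))


def matmul(a, b):
    """Matrix product: each entry is a row of a dotted with a column of b."""
    columns = list(zip(*b))
    return [[dot(row, col) for col in columns] for row in a]


print(scale(scalar, vector))
print(dot(vector, [4, 5, 6]))
print(matmul(matrix, [[5, 6], [7, 8]]))
```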
- What Is a Gradient in Machine Learning? - machinelearningmastery.com
- Understanding Gradient Descent Algorithm and the Maths Behind It - analyticsvidhya.com
- Mathematics is the queen of Sciences (Video)
- What Is The Fibonacci Sequence? - The Fibonacci Sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55… (Xn = Xn-1 + Xn-2)
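The recurrence Xn = Xn-1 + Xn-2 above can be turned into a short generator of the sequence. The function name is illustrative.

```python
def fibonacci(count: int) -> list:
    """Return the first `count` Fibonacci numbers, starting from 0, 1."""
    sequence = []
    previous, current = 0, 1
    for _ in range(count):
        sequence.append(previous)
        # Each new term is the sum of the two before it.
        previous, current = current, previous + current
    return sequence


print(fibonacci(11))
```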
- manim - Mathematical Animation Engine
- MIT 18.650 Statistics for Applications, Fall 2016 - MIT OpenCourseWare
- Joshua Emmanuel
- zedstatistics ⭐
- Stephanie Glen - statisticshowto.com
- Khan Academy - The average, Descriptive statistics, Probability and Statistics
- Evidence-Based Practice - Rich Simpson
- How Imaginary Numbers Were Invented
- Statistics Tutorial - w3schools.com
- Praxis Core Math - khanacademy.org
- Data Science for Beginners (Microsoft)
- Introduction to Data Science - umich.edu
- The Data Journey - statcan.gc.ca
- Free Data Science Courses - Harvard University - harvard.edu
- Data analysis: hypothesis testing - open.edu
- The Trillion Dollar Equation - Veritasium
- Dr Nic's Maths and Stats
- statisticsfun
- Statistics (Khan Academy)
- Statistics (CrashCourse)
- Statistics - A Full Lecture to learn Data Science ~4 hours - DATAtab
- What This Graph of a Dinosaur Can Teach Us about Doing Better Science - scientificamerican.com
- Machine Learning - Awesome ML
- Numbers - Awesome Numbers
- KPIs - Awesome KPIs
- Python - Awesome Python