Skip to content

Commit

Permalink
Thiel-Sen estimators
Browse files Browse the repository at this point in the history
  • Loading branch information
howardjp committed Jan 3, 2025
1 parent 31dfd3d commit 33723aa
Show file tree
Hide file tree
Showing 3 changed files with 208 additions and 0 deletions.
4 changes: 4 additions & 0 deletions _includes/jsloads.html
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,10 @@
};
</script>

<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/styles/panda-syntax-dark.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/highlight.min.js"></script>
<script>hljs.highlightAll();</script>

<script>
imagesLoaded( document.querySelector('#masonry-container'), function( instance ) {
var msnry = new Masonry('#masonry-container', {
Expand Down
204 changes: 204 additions & 0 deletions _posts/2025/2025-01-03-thiel-sen-estimator.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,204 @@
---
id: theil-sen-estimator
layout: post
title: "The Theil-Sen Estimator: A Robust Tool for Real-World Data"
date: 2025-01-03
author: k3jph
permalink: /2025/01/03/theil-sen-estimator
featured_image: /assets/img/2025/theil-sen-estimator.webp
categories:
- Blog
tags:
- statistics
- data analysis
- robust methods
- time-series analysis
---

Linear regression is one of the most widely used tools for analyzing
relationships between variables. Its simplicity and power make it an
essential part of every data scientist's toolkit, but its effectiveness
hinges on data adhering to strict assumptions. In the messy,
unpredictable real world, datasets are often riddled with outliers and
noise that can distort results and obscure meaningful trends.
Traditional methods like [ordinary least squares
(OLS)](https://www.xlstat.com/en/solutions/features/ordinary-least-squares-regression-ols)
regression, which is sensitive to extreme values, often falter in these
scenarios. The Theil-Sen Estimator, however, offers a robust
alternative, excelling in extracting reliable trends even from
challenging datasets.

## What Is the Theil-Sen Estimator?

The Theil-Sen Estimator is a [non-parametric
method](https://www.sciencedirect.com/topics/mathematics/nonparametric-method)
for estimating a linear trend. Unlike OLS regression, which minimizes
the sum of squared residuals, the Theil-Sen method focuses on the median
slope of all possible pairs of points in the dataset. This approach
makes it inherently robust to outliers and extreme values. In a world
where clean, well-behaved data are the exception rather than the rule,
this robustness is invaluable.

Mathematically, the estimator works by calculating the slopes between
every pair of data points:

$$
\text{slope}_{ij} = \frac{y_j - y_i}{x_j - x_i}, \quad \text{for } x_i \neq x_j.
$$

Once all slopes are calculated, the Theil-Sen Estimator takes the median
of these values as the final slope. This simple yet powerful technique
reduces the influence of extreme values, providing a more reliable
estimate of the true trend.

## Why Is It Useful?

OLS regression is optimal when data conform to its assumptions:
linearity,
[homoscedasticity](https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/homoscedasticity/),
and normally distributed residuals. But real-world data often, even
usually, violate these assumptions. A single extreme value can
dramatically affect the slope calculated by OLS, potentially obscuring
meaningful trends.

The Theil-Sen Estimator addresses this limitation directly. By focusing
on medians rather than means, it minimizes the impact of anomalies. This
makes it particularly effective in:

- **Environmental Science:** Long-term climate datasets often contain
extreme values due to [rare weather
events](/2015/10/11/1000-year-storms-in-south-carolina/) or
measurement errors. The Theil-Sen method helps identify genuine trends
amidst this noise.
- **Economics:** Economic data frequently exhibit outliers due to
recessions, booms, or one-time shocks. Robust methods ensure that
these events do not distort long-term analyses.
- **Finance:** Financial markets are inherently volatile. A robust trend
estimator like Theil-Sen can reveal underlying trends despite
short-term fluctuations.

## The Mechanics Behind Its Robustness

The secret to the Theil-Sen Estimator's robustness lies in its use of
medians. Medians are less sensitive to outliers because they consider
only the relative ranking of data points, not their magnitude. In
contrast, means--central to OLS regression--are directly influenced by
every value in the dataset. For example, if you introduce an extreme
value into a dataset, the mean may shift significantly, but the median
remains largely unaffected.

This property makes the Theil-Sen Estimator particularly appealing in
situations where the data include:

- Measurement errors or anomalies.
- Non-Gaussian distributions with heavy tails.
- Sparse datasets where a few points carry disproportionate weight.

## Revitalizing the Theil-Sen Estimator

While the Theil-Sen Estimator has been known since the mid-20th century,
it has gained renewed attention in the era of big data and advanced
analytics. Today, its applications extend beyond academic curiosity,
finding practical use in time-series analysis, machine learning
preprocessing, and even anomaly detection.

Consider time-series data, such as temperature records or stock prices.
Traditional methods often struggle to separate signal from noise in
these datasets, particularly when outliers are present. The Theil-Sen
Estimator's robustness ensures that trends are identified accurately,
making it an ideal choice for:

- **Climate Change Research:** Detecting long-term trends in global
temperatures despite short-term fluctuations caused by natural
variability or measurement errors.
- **Economic Forecasting:** Analyzing GDP growth trends while accounting
for occasional shocks like pandemics or financial crises.
- **Predictive Analytics:** Enhancing the stability of models by
preprocessing data with robust estimators to remove the influence of
anomalies.

## A Simple Example

Let us consider a toy dataset to illustrate the difference between OLS
regression and the Theil-Sen Estimator. Suppose we have five data points
representing the relationship between two variables, [latex]x[/latex]
and [latex]y[/latex]:

$$
\{(1, 2), (2, 4), (3, 6), (4, 8), (10, 50)\}
$$

Clearly, the first four points suggest a linear relationship, while the
fifth point is an outlier. Applying OLS regression to this dataset would
result in a slope heavily influenced by the outlier. However, the
Theil-Sen Estimator calculates the slopes between every pair of points,
then selects the median slope, effectively ignoring the outlier.

Here is how to implement the Theil-Sen Estimator in both R and Python:

### Example in R
```R
# Load necessary library
library(mblm)

# Example dataset
x <- c(1, 2, 3, 4, 10)
y <- c(2, 4, 6, 8, 50)

# Apply Theil-Sen Estimator
result <- mblm(y ~ x)

# Print results
summary(result)
```

### Example in Python
```python
# Import necessary libraries
from sklearn.linear_model import TheilSenRegressor
import numpy as np

# Example dataset
x = np.array([1, 2, 3, 4, 10]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 50])

# Apply Theil-Sen Estimator
model = TheilSenRegressor()
model.fit(x, y)

# Print results
print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
```

## Practical Considerations

While the Theil-Sen Estimator is robust and versatile, it is not without
limitations:

1. **Computational Cost:** Calculating all pairwise slopes can be
computationally intensive for large datasets. However, modern
algorithms and computational tools have mitigated this issue.
2. **Applicability:** The method assumes a linear relationship between
variables. Nonlinear trends require other robust techniques.
3. **Interpretability:** While the median slope is robust, it may lack
the nuanced interpretability of OLS coefficients in complex models.

Despite these considerations, the Theil-Sen Estimator remains a vital
tool in any analyst's arsenal, particularly for preliminary data
exploration and robust trend analysis.

The Theil-Sen Estimator exemplifies how simplicity and robustness can
coexist in statistical methods. In an era where data are increasingly
messy and noisy, its ability to deliver reliable results makes it more
relevant than ever. Whether you are an environmental scientist tracking
climate trends, an economist analyzing volatile markets, or a data
scientist preprocessing data for machine learning, the Theil-Sen
Estimator offers a dependable way to cut through the noise and find the
signal.

By embracing robust methods like the Theil-Sen Estimator, we equip
ourselves to better understand the complex, unpredictable world
reflected in our data. For those seeking clarity amidst the chaos, this
technique is not just an alternative--it is an essential part of the
toolkit.
Binary file added assets/img/2025/theil-sen-estimator.webp
Binary file not shown.

0 comments on commit 33723aa

Please sign in to comment.