
What is Skewness in Pandas?

Published in Descriptive Statistics

In Pandas, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean, indicating the extent to which a dataset's distribution deviates from a symmetrical shape. Pandas provides the skew() function to effortlessly calculate this statistical property for Series and DataFrames, offering crucial insights into the shape of your data's distribution.

Understanding Skewness

Skewness quantifies how much a distribution leans to one side or the other. A perfectly symmetrical distribution, like a normal distribution, has zero skewness. When a distribution is not symmetrical, it is considered skewed.

Types of Skewness

Skewness can be categorized into three main types:

  • Positive Skew (Right-Skewed):
    • The tail of the distribution extends further to the right.
    • The mean is typically greater than the median.
    • Indicates a concentration of data points on the lower side with some unusually high values.
    • Example: Income distribution, where most people earn a moderate amount, but a few earn very high amounts.
  • Negative Skew (Left-Skewed):
    • The tail of the distribution extends further to the left.
    • The mean is typically less than the median.
    • Indicates a concentration of data points on the higher side with some unusually low values.
    • Example: Exam scores, where most students perform well, but a few score very low.
  • Zero Skew (Symmetrical):
    • The distribution is perfectly symmetrical around its mean.
    • The mean and median coincide; for a unimodal distribution, the mode does too.
    • Example: A perfectly balanced normal distribution.
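
The three cases above can be checked on small samples. The following is a minimal sketch using hand-made values (the series names and numbers are illustrative assumptions, not data from a real source). It prints the mean, median, and skewness of each sample so the sign of the skew can be compared with the mean/median relationship.

```python
import pandas as pd

# Illustrative, made-up samples: one per skew type.
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 30])    # one large value stretches the right tail
left_skewed = pd.Series([-20, 6, 7, 7, 8, 8, 8, 9, 9, 10])   # one small value stretches the left tail
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])           # mirror image around 5

samples = [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]
for name, s in samples:
    # Positive skew: mean > median; negative skew: mean < median; zero skew: equal.
    print(f"{name:>9}: mean={s.mean():.2f}  median={s.median():.2f}  skew={s.skew():.3f}")
```

In this sketch a single extreme value is enough to flip the sign of the skewness, which is why the mean drifts toward the tail while the median barely moves.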

How Pandas Calculates Skewness

The Pandas skew() function computes the unbiased skewness of a dataset. Specifically, it returns the unbiased skew over the requested axis, normalized by N-1, where N is the number of observations. This small-sample correction reduces the bias of the estimate, which matters most when N is small. At least three non-missing observations are required; with fewer, the result is NaN.

The formula generally used for sample skewness is:

$$g_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$

where:

  • $n$ is the number of data points.
  • $x_i$ is each individual data point.
  • $\bar{x}$ is the sample mean.
  • $s$ is the sample standard deviation (computed with $n-1$ in the denominator).

Pandas implements this calculation efficiently, allowing data scientists and analysts to quickly assess the symmetry of their data.
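
To make the formula concrete, here is a minimal sketch that evaluates it by hand with NumPy and compares the result with skew(). The helper name adjusted_skew is hypothetical (it is not part of Pandas), and it assumes the sample standard deviation uses n-1 in the denominator. Constant input, where s is zero, is not handled.

```python
import numpy as np
import pandas as pd

def adjusted_skew(values):
    """Hypothetical helper: sample skewness computed directly from the formula above."""
    x = np.asarray(values, dtype=float)
    n = x.size
    if n < 3:
        return float("nan")       # the (n - 2) factor leaves the estimate undefined
    s = x.std(ddof=1)             # sample standard deviation (n - 1 denominator)
    z = (x - x.mean()) / s        # standardized deviations; assumes s != 0
    return n / ((n - 1) * (n - 2)) * np.sum(z ** 3)

data = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 10])
manual = adjusted_skew(data)
builtin = data.skew()
print("manual :", manual)
print("pandas :", builtin)
print("match  :", np.isclose(manual, builtin))
```

The two values should agree up to floating-point rounding, since Pandas arranges the same quantity differently internally.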

Practical Applications of Skewness

Understanding skewness is vital for various data analysis tasks:

  • Data Preprocessing: Many statistical models assume a normal distribution (zero skew). Highly skewed data might require transformations (e.g., logarithmic transformation) to meet these assumptions, improve model performance, or ensure validity of statistical tests.
  • Outlier Detection: Extreme skewness can sometimes indicate the presence of outliers, which pull the mean away from the median towards the tail.
  • Descriptive Statistics: It provides a key descriptive statistic, alongside mean, median, and standard deviation, to fully characterize the shape of a data distribution.
  • Risk Management: In finance, positively skewed returns might indicate frequent small losses and a few large gains, while negatively skewed returns could suggest frequent small gains but potential for large losses.
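
As an illustration of the preprocessing point above, the sketch below applies a logarithmic transformation to a right-skewed sample and compares the skewness before and after. The values are made up for the example (think of incomes in thousands), and np.log1p is just one common choice of transformation, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, invented for illustration.
incomes = pd.Series([22, 25, 27, 30, 31, 35, 38, 42, 55, 250])

# log1p computes log(1 + x), which stays defined if zeros are present.
log_incomes = np.log1p(incomes)

skew_before = incomes.skew()
skew_after = log_incomes.skew()
print("skew before transform:", round(skew_before, 3))
print("skew after transform: ", round(skew_after, 3))
```

With a single very large value like this, the transformation reduces the skew but does not remove it, so it is worth re-checking skew() after transforming rather than assuming the result is symmetric.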

Using the skew() Function in Pandas

Pandas makes calculating skewness straightforward. You can apply the skew() method directly to a Pandas Series or DataFrame.

Let's illustrate with examples:

import pandas as pd
import numpy as np

# Create a Pandas Series
data_series = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 10])
print("Original Series:\n", data_series)
print("\nSkewness of the Series:", data_series.skew())

# Create a DataFrame with different distributions
# (all columns must have the same length, here 11 values each)
data_df = pd.DataFrame({
    'Positive_Skew': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50],
    'Negative_Skew': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, -10],
    'Symmetrical': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'With_NaNs': [1, 2, np.nan, 4, 5, 6, np.nan, 8, 9, 10, 11]
})
print("\nOriginal DataFrame:\n", data_df)

# Calculate skewness for the entire DataFrame (column-wise by default)
print("\nSkewness of DataFrame columns:\n", data_df.skew())

# Calculate skewness along rows (axis=1)
print("\nSkewness of DataFrame rows:\n", data_df.skew(axis=1))

Key Parameters of skew()

| Parameter | Description | Default Value |
|---|---|---|
| axis | Axis to reduce over. 0 or 'index' computes one skewness value per column; 1 or 'columns' computes one value per row. | 0 |
| skipna | Boolean, excludes NA/null values when computing the result. If False, and there are NA values, the result for that axis will be NaN. | True |
| level | For MultiIndex, determines which level(s) to compute skewness for. Deprecated in Pandas 1.3 and removed in 2.0; in newer versions, use groupby(level=...) instead. | None |
| numeric_only | Boolean, if True, only includes float, int, and boolean data. If False, will attempt to use non-numeric data, which may result in an error. (Note: the default and the handling of object dtypes have varied across Pandas versions; before 2.0 the default was None.) | False |
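
The parameters above can be tried on a small frame. This is a minimal sketch with made-up column names and values. It passes numeric_only=True explicitly so the text column is skipped regardless of the Pandas version's default.

```python
import numpy as np
import pandas as pd

# Illustrative frame: three numeric columns (one with a NaN) and one text column.
df = pd.DataFrame({
    "a": [1.0, 2.0, 4.0, 8.0, 16.0],
    "b": [1.0, np.nan, 3.0, 4.0, 20.0],
    "c": [2.0, 3.0, 3.0, 5.0, 9.0],
    "label": ["v", "w", "x", "y", "z"],
})

# axis=0 (default): one value per numeric column; NaNs are skipped by default.
per_column = df.skew(numeric_only=True)

# skipna=False: any column containing a NaN yields NaN.
strict = df.skew(numeric_only=True, skipna=False)

# axis=1: one value per row, computed across the numeric columns only.
per_row = df[["a", "b", "c"]].skew(axis=1)

print(per_column, strict, per_row, sep="\n\n")
```

Note that the second row has only two non-missing values, so its row-wise skewness is NaN even with skipna=True.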

For more detailed information, refer to the official Pandas DataFrame.skew() documentation.

Conclusion

Skewness is a fundamental statistical measure that helps data professionals understand the shape and symmetry of their data distributions. Pandas simplifies its calculation through the skew() function, providing a quick and efficient way to uncover valuable insights, guide data preprocessing, and inform modeling decisions.