An Approach to Determining Asymmetric Confidence Interval
Updated: Sep 25, 2020
A confidence interval is a range of values, computed from statistics of the observed data, that is likely to include a population value with a certain degree of confidence. It is widely used in data science to determine whether or not a sample belongs to an expected population.
In this post, we will discuss an approach to determining asymmetric upper and lower bounds, at a chosen degree of confidence, for a time series signal in order to identify outliers. Before we begin, let's emphasize that the upper and lower bounds of the confidence interval must be applied on top of a version of the signal that has been cleared of fluctuations or, in Machine Learning cases, on top of the model's prediction.
This is the sample signal that we will work with.
To proceed, we first need to smooth out the short-term fluctuations (noise) in our signal. The classical, simple way of doing that is a moving average. A Hamming-windowed moving average responds to cyclical tendencies in the data better than a conventional moving average. In Python, pandas provides it as a built-in function (the win_type option requires SciPy to be installed).
df["smooth"] = df["real"].rolling(11,win_type='hamming',center=True).mean()
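To make the smoothing step concrete, here is a minimal sketch on a synthetic signal. The data, the seed and the names `manual`, `half` and `edge_nans` are assumptions for illustration only, since the post's own dataset is not included. It also checks the pandas result against a hand-rolled weighted average that uses normalized Hamming weights.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in signal (hypothetical; not the data used in the post).
rng = np.random.default_rng(0)
t = np.arange(200)
df = pd.DataFrame(
    {"real": 100 + 20 * np.sin(2 * np.pi * t / 50) + rng.normal(0, 5, t.size)}
)

window = 11
# Centered, Hamming-weighted rolling mean (win_type needs SciPy installed).
df["smooth"] = df["real"].rolling(window, win_type="hamming", center=True).mean()

# The same computation by hand: a weighted average with normalized weights.
weights = np.hamming(window)
manual = np.convolve(df["real"].to_numpy(), weights / weights.sum(), mode="valid")

# With center=True the first and last `half` rows have no full window,
# so they come out as NaN.
half = window // 2
edge_nans = int(df["smooth"].isna().sum())
```

Because of those NaN edges, the noise column computed next will also be NaN at both ends of the series, and those rows drop out of the later sign-based filtering.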
As mentioned before, we will apply the confidence interval on top of the "smoothed" (or, in Machine Learning cases, predicted) value of the signal. So the confidence interval should be computed from the fluctuating part ("noise") of the data: its upper bound defines the limit for upward fluctuations and its lower bound defines the limit for downward fluctuations.
df["noise"] = df["real"] - df["smooth"]
The distance between the actual value of the signal and its smoothed value may differ between the upward and downward directions, and this difference is reflected in the noise. It indicates the signal's upward and downward appetite. So we will examine the positive part of the noise for the upper bound and the negative part for the lower bound separately. This way, we obtain a confidence interval with asymmetric upper and lower boundaries.
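A small sketch of what this asymmetry looks like in practice. The noise series below is synthetic and deliberately skewed (upward swings about twice as large as downward ones); the scales and seed are assumptions, not values from the post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Hypothetical noise: upward deviations are drawn with a larger scale
# than downward deviations.
up = rng.exponential(10.0, n)
down = -rng.exponential(5.0, n)
noise = pd.Series(np.where(rng.random(n) < 0.5, up, down), name="noise")

# Split by sign; the negative side is stored as magnitudes so that both
# parts live on the positive axis.
upper_noise = noise[noise > 0]
lower_noise = noise[noise < 0].abs()

print(f"mean upward deviation:   {upper_noise.mean():.2f}")
print(f"mean downward deviation: {lower_noise.mean():.2f}")
```

A single symmetric band around the smoothed signal would be too tight on one side and too loose on the other for data like this, which is the motivation for treating the two sides separately.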
Statistical computation part
For the statistical calculation, we need a probability distribution defined on positive values, to which we can fit our data points in order to generate a probability density function.
Why don't we just use a percentile instead of fitting a probability distribution?
Because we want to model the sample space of our data points, and thus be able to determine a boundary value for the chosen degree of confidence even when that value does not exist in our sample set.
We will use the exponential distribution. It is available in the SciPy library from the Python universe.
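The difference between the two options can be seen on a tiny example. The six sample values below are made up for illustration; the point is only that an empirical percentile can never exceed the largest observed value, while a quantile taken from a fitted distribution can.

```python
import numpy as np
from scipy.stats import expon

# Hypothetical positive noise values (not from the post).
sample = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0])
q = 0.98

# Empirical percentile: interpolates between observed points only.
empirical = np.quantile(sample, q)

# Model-based quantile: fit an exponential distribution, then invert its CDF.
params = expon.fit(sample)
fitted = expon.ppf(q, *params)

print(f"sample max:          {sample.max():.2f}")
print(f"empirical 98% point: {empirical:.2f}")
print(f"fitted 98% point:    {fitted:.2f}")
```

With so few points the fitted value is of course only as good as the assumption that the noise is roughly exponential.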
from scipy.stats import expon as dist

upper_noise = df[df["noise"] > 0]["noise"]
upper_params = dist.fit(upper_noise)
upper_pdf_x = upper_noise.sort_values(ascending=True).copy().values
upper_pdf_y = dist.pdf(upper_pdf_x, *upper_params)

lower_noise = (df[df["noise"] < 0]["noise"]).abs()
lower_params = dist.fit(lower_noise)
lower_pdf_x = lower_noise.sort_values(ascending=True).copy().values
lower_pdf_y = dist.pdf(lower_pdf_x, *lower_params)
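It may help to see what the fitted parameters actually are. For the exponential distribution, SciPy's fit returns a (loc, scale) pair, and with default settings the maximum likelihood estimate places loc at the smallest observation and sets scale to the sample mean minus that minimum. The sketch below checks this on synthetic data; the data is hypothetical, and the `floc=0` variant is an optional alternative that the post does not use.

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(3)
upper_noise = rng.exponential(scale=12.0, size=500)  # hypothetical positive noise

# Default fit, as in the code above: both loc and scale are estimated.
loc, scale = expon.fit(upper_noise)

# Variant (an assumption, not the post's choice): pin the origin at zero,
# so that scale is simply the sample mean.
loc0, scale0 = expon.fit(upper_noise, floc=0)

print(f"default fit: loc={loc:.4f}, scale={scale:.4f}")
print(f"floc=0 fit:  loc={loc0:.4f}, scale={scale0:.4f}")
```

Since the noise magnitudes start just above zero by construction, the two variants usually give very similar bounds; pinning loc only matters when the smallest observed value is noticeably far from zero.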
What will the degree of confidence be?
What is our tolerance for noise?
The answer to these questions is a hyperparameter that should be chosen according to the requirements of the solution we need.
Let's say 98%. This means that 98% of the noise is acceptable for us. In other words, we can move up to the point where the cumulative probability of the fitted distribution reaches 98%, that is, the point where the area under the distribution curve reaches 0.98.
SciPy has a wonderful built-in inverse cumulative distribution function (ppf: percent point function) that gives us the exact point where the cumulative probability reaches 0.98.
upper_bound_value = dist.ppf(98/100, *upper_params)
upper_bound_value
45.108445989466084

lower_bound_value = dist.ppf(98/100, *lower_params)
lower_bound_value
24.429620291197793
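For the exponential distribution, the percent point function has a simple closed form: the point x where the cumulative probability equals q is x = loc - scale * ln(1 - q). The sketch below checks that against SciPy and confirms the round trip through the CDF. The loc and scale values are hypothetical placeholders, not the parameters fitted in the post.

```python
import numpy as np
from scipy.stats import expon

loc, scale = 0.5, 11.4  # hypothetical fitted parameters
q = 0.98                # degree of confidence

# Bound from SciPy's inverse CDF.
bound = expon.ppf(q, loc=loc, scale=scale)

# Same bound from the closed form for the exponential distribution.
closed_form = loc - scale * np.log(1 - q)

# Feeding the bound back into the CDF should return q.
round_trip = expon.cdf(bound, loc=loc, scale=scale)

print(bound, closed_form, round_trip)
```

At q = 0.98 the factor -ln(1 - q) is about 3.91, so the bound sits roughly 3.9 scale units above loc.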
For legibility, the vertical distribution plots on the right of the last graph are not drawn to the same scale as the noise plot. Anyway, you get the idea. :)
Finally, let's apply our confidence interval to the smoothed values and see how it looks.
In some cases, you may want to extend the confidence interval a little beyond the calculated one, for example when there are too few samples to estimate the boundaries well. In that case, you can widen it using the mean value, another parameter obtained from the data, scaled by a chosen ratio.
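Applying the interval comes down to adding the upper bound to, and subtracting the lower bound from, the smoothed series, then flagging the points that fall outside the band. Here is a minimal sketch with made-up numbers; the small frame and the rounded bound values are assumptions, and the "upper", "lower" and "outlier" column names are hypothetical.

```python
import pandas as pd

# Hypothetical observed and smoothed values (not the post's data).
df = pd.DataFrame(
    {
        "real": [100.0, 104.0, 160.0, 98.0, 60.0, 101.0],
        "smooth": [100.0, 101.0, 102.0, 101.0, 100.0, 100.0],
    }
)

# Bounds rounded from the values computed above, for illustration only.
upper_bound_value = 45.0
lower_bound_value = 24.0

# Asymmetric band around the smoothed signal.
df["upper"] = df["smooth"] + upper_bound_value
df["lower"] = df["smooth"] - lower_bound_value

# A point is an outlier when it leaves the band on either side.
df["outlier"] = (df["real"] > df["upper"]) | (df["real"] < df["lower"])

print(df)
```

Note that the lower bound was fitted on absolute values of the negative noise, which is why it is subtracted here rather than added.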
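One possible reading of this idea, sketched under stated assumptions: take the mean of the positive noise, multiply it by a ratio you choose, and add that margin to the calculated bound. Which mean to use and how to pick the ratio are not specified in the post, so `ratio`, `margin` and `extended_upper_bound` are hypothetical names and the numbers are made up.

```python
import numpy as np

upper_noise = np.array([2.0, 5.0, 9.0, 14.0, 20.0])  # hypothetical positive noise
upper_bound_value = 45.0                              # hypothetical fitted bound
ratio = 0.5                                           # hypothetical tuning knob

# Extra margin proportional to the mean upward deviation.
margin = ratio * upper_noise.mean()
extended_upper_bound = upper_bound_value + margin

print(extended_upper_bound)
```

The same adjustment could be applied to the lower bound with the mean of the downward noise magnitudes, keeping the interval asymmetric.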
Special thanks to Ferit Buyukkececi.
Hope you like it. See you in another post.