allendowney (@allendowney@fosstodon.org):

The latest installment in the Data Q&A series is about estimating percentiles, the limits of bootstrapping, and quantifying uncertainty due to missing data.

https://www.allendowney.com/blog/2024/04/26/small-percentiles-and-missing-data/

avehtari (@avehtari@bayes.club):

@allendowney When computing the uncertainty, should you take into account that the measurements have high autocorrelation? The effective sample size for that quantile estimate seems to be about 3k, which is much less than the total sample size of 53k.
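
For readers who want to reproduce this kind of check, here is a minimal sketch using ArviZ in Python as a stand-in for tools like the R posterior package (mentioned later in the thread); "temps.txt" is a hypothetical data file.

```python
# Sketch: effective sample size (ESS) for an autocorrelated series,
# computed with ArviZ; a 1-D array is treated as a single chain.
import numpy as np
import arviz as az

temps = np.loadtxt("temps.txt")   # hypothetical data file

# ESS for the 0.2% quantile; this accounts for autocorrelation.
ess_q = az.ess(temps, method="quantile", prob=0.002)
print(f"total n = {temps.size}, quantile ESS = {ess_q:.0f}")
```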

allendowney (@allendowney@fosstodon.org):

@avehtari Good question -- I'm not sure. Some of the reduction in ESS is because we're estimating such a small percentile, I think. But yes, there's a ton of structure in the data that the bootstrap is ignoring. Hmm...

avehtari (@avehtari@bayes.club):

@allendowney I get an ESS of about 3050 for the 0.2% quantile and about 3540 for the 50% quantile, so the difference between quantiles is not big; both are far below the total sample size.

allendowney (@allendowney@fosstodon.org):

@avehtari I was thinking about this on my morning run and I have a new theory -- the reduced ESS is a consequence of using KDE. Any values more than a few bandwidths away from the estimate contribute nothing.

Still not sure how much better we would do with a model that takes into account the autocorrelation. Might have to do the experiment.
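
For concreteness, here is a minimal sketch of a KDE+bootstrap estimate of a small percentile — one reading of the approach under discussion, not necessarily the blog post's exact code. The file name and grid size are arbitrary choices.

```python
# Sketch: bootstrap CI for a small percentile, smoothing each resample
# with a Gaussian KDE. "temps.txt" is a hypothetical data file.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(17)
temps = np.loadtxt("temps.txt")
grid = np.linspace(temps.min(), temps.max(), 501)

def kde_percentile(sample, p=0.002):
    """Estimate the p-quantile from a Gaussian KDE of the sample."""
    pdf = gaussian_kde(sample)(grid)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]
    return np.interp(p, cdf, grid)

# Bootstrap: resample, re-estimate, take percentiles of the estimates.
# (Slow for large samples; this is an illustration, not optimized code.)
boot = [kde_percentile(rng.choice(temps, size=len(temps)))
        for _ in range(201)]
low, high = np.percentile(boot, [5, 95])
print(f"90% CI for the 0.2% quantile: ({low:.2f}, {high:.2f})")
```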

avehtari (@avehtari@bayes.club):

@allendowney I estimated ESS using the posterior R package, which was made for analysing Markov chains; interestingly, the term "effective sample size" was first used for climatological time series (Laurmann and Gates, 1977)!

The posterior package's estimated 90% uncertainty interval for the 0.2% quantile is (-1.5, 3.5).

I didn't use KDE or bootstrap, but I did smooth out the discreteness in the tail with Pareto smoothing to make the MCSE computation easier.
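
For concreteness, here is a rough sketch of that smoothing idea: fit a generalized Pareto distribution (GPD) to the left-tail exceedances and replace the discrete tail order statistics with smooth GPD quantiles. This is an illustration of the general technique, not the posterior package's exact algorithm; the tail size and file name are arbitrary.

```python
# Rough sketch of Pareto smoothing: replace the discrete left-tail
# order statistics with quantiles of a fitted GPD. Illustration only,
# not the posterior package's algorithm.
import numpy as np
from scipy.stats import genpareto

temps = np.sort(np.loadtxt("temps.txt"))   # hypothetical data file
n_tail = 200                               # tail size, an arbitrary choice
cutoff = temps[n_tail]                     # left-tail threshold

# Exceedances below the cutoff, flipped to be positive.
exceed = cutoff - temps[:n_tail]
shape, _, scale = genpareto.fit(exceed, floc=0.0)

# Smoothed replacements for the tail order statistics.
probs = (np.arange(n_tail) + 0.5) / n_tail
smoothed = cutoff - genpareto.ppf(1 - probs, shape, loc=0.0, scale=scale)
temps[:n_tail] = np.sort(smoothed)
```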

avehtari (@avehtari@bayes.club):

@allendowney It seems your use of KDE is biasing the results toward lower temperatures: the bootstrap distribution is not centered on 1. That could be caused by the kernels being symmetric, and perhaps also Gaussian-shaped; both would make the KDE have a thicker tail than the data. A Pareto distribution can match the tail shape better.
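
A quick way to probe the claimed bias, reusing the hypothetical kde_percentile from the sketch above: compare the raw sample quantile with the KDE-based estimate and see which direction they differ.

```python
# Quick check of the claimed downward bias: compare the raw 0.2% sample
# quantile with the KDE-based estimate (kde_percentile from the sketch
# above; "temps.txt" is the same hypothetical data file).
import numpy as np

temps = np.loadtxt("temps.txt")
print(f"sample 0.2% quantile: {np.quantile(temps, 0.002):.2f}")
print(f"KDE-based estimate:   {kde_percentile(temps):.2f}")
```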

allendowney (@allendowney@fosstodon.org):

@avehtari Thanks for looking into this! There are a couple of things I'm finding confusing here. One is that the CI you got is substantially wider than the one I got. Why is that?

allendowney (@allendowney@fosstodon.org):

@avehtari The other is what you said about the tails -- I expected the Gaussian tail of the KDE kernel to match the tail of the data pretty well, and the attached figure suggests that it does.

allendowney (@allendowney@fosstodon.org):

@avehtari A Pareto tail would be much thicker, wouldn't it?

avehtari (@avehtari@bayes.club):

@allendowney A generalized Pareto distribution fitted to the tail of the data has the same tail thickness as the data. Theory says that many distributions (including most textbook distributions) have a Pareto tail when you go far enough out. The Pareto distribution is popular in extreme value analysis because the theory often seems to work in practice, too.
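
The "far enough in the tail" claim is the Pickands–Balkema–de Haan theorem. A quick sketch with simulated Gaussian data (my own illustration, not the thread's dataset) shows what "same tail thickness" means in practice: the GPD shape parameter fitted to the tail exceedances comes out near zero, so the fit tracks a thin tail rather than forcing a heavy one.

```python
# Sketch: fit a generalized Pareto distribution (GPD) to tail
# exceedances of simulated Gaussian data. The fitted shape parameter
# comes out near zero, matching the thin Gaussian tail.
import numpy as np
from scipy.stats import genpareto, norm

rng = np.random.default_rng(42)
x = norm.rvs(size=53_000, random_state=rng)  # simulated stand-in data

threshold = np.quantile(x, 0.99)             # peaks-over-threshold cutoff
exceed = x[x > threshold] - threshold
shape, _, scale = genpareto.fit(exceed, floc=0.0)
print(f"fitted GPD shape: {shape:.3f}")      # near 0 for a Gaussian tail
```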

allendowney (@allendowney@fosstodon.org):

@avehtari Ok, but doesn't the figure in my previous message indicate that the Gaussian tail of the KDE fits the data well over the range of the data? If the values below that range are a little smaller or a lot smaller, that would not affect the 0.2% quantile.

avehtari (@avehtari@bayes.club):

@allendowney It's wider because I take into account the autocorrelation, which reduces the effective sample size to about 3050. If I replace the effective sample size with the total sample size (that is, assuming independent observations), I get about the same width for the CI (but it is centered on 1, unlike your KDE+bootstrap interval).
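
One way to see why the interval stretches: in the standard large-sample approximation, the variance of a sample p-quantile scales inversely with the (effective) sample size, so replacing n with n_eff widens the interval by sqrt(n / n_eff). With the approximate numbers quoted in this thread, that factor is roughly four:

```latex
\operatorname{Var}(\hat{q}_p) \approx \frac{p(1-p)}{n_{\text{eff}}\, f(q_p)^2},
\qquad
\sqrt{\frac{n}{n_{\text{eff}}}} \approx \sqrt{\frac{53000}{3050}} \approx 4.2
```

Here f is the density at the quantile; both ESS figures are estimates, so the factor is indicative rather than exact.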

allendowney (@allendowney@fosstodon.org):

@avehtari Hmm. I think the number of things not making sense to me has exceeded the number of things that can be cleared up in this medium :(
