HydrePrever,
@HydrePrever@mathstodon.xyz

Mastodon, what is a good (freely available) dataset to give an example of principal component analysis?
Thanks for answering or sharing! #statistics #pca #rstats

ma_delsuc,
@ma_delsuc@fediscience.org

@HydrePrever
I teach PCA (via scikit-learn) using the penguin dataset found here:
https://allisonhorst.github.io/palmerpenguins/
It's fun and easy to use, with several variables of several types.

(I know I should publish the notebook that carries this teaching...)

Mehrad,
@Mehrad@fosstodon.org

@HydrePrever
The {datasets} package that ships with R by default contains many toy datasets. The easiest one for PCA is the iris data.

krazykitty,
@krazykitty@mamot.fr

@HydrePrever well everybody uses the decathlon data and I think it's a fun example. The Olivetti faces data set is a bit harder to interpret but fun as well.

rmflight,
@rmflight@mastodon.social
krazykitty,
@krazykitty@mamot.fr

@rmflight I love this dataset but I feel that it's a bit of a stretch to get something meaningful out of a PCA on it. It only has 4 features and you can learn a lot from looking at them without the PCA @HydrePrever

rmflight,
@rmflight@mastodon.social

@krazykitty @HydrePrever Ok, sure. But Iris gets used all the time too, and it only has 4 features as well.

4 is enough to get the point across for showing classes. Otherwise, go to something much higher-dimensional like -omics, in which case start poking around Bioconductor datasets.

I personally think a set of points describing the 3D shape of a banana also helps to demonstrate what PCA is actually doing (still only 3 features). 😉
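
The banana idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the "banana" is a synthetic cloud of points scattered around a circular arc (the radius, arc length, and noise levels are made up, not measured from a real banana), and PCA is done by hand with NumPy's SVD rather than with any particular PCA library.

```python
import numpy as np

# Hypothetical banana: points scattered around a circular arc in the x-y plane,
# with a little thickness in every direction. All sizes are illustrative guesses.
rng = np.random.default_rng(0)
n = 500
radius = 10.0
t = rng.uniform(-0.6, 0.6, size=n)  # position along the arc, in radians

points = np.column_stack([
    radius * np.cos(t) + rng.normal(0, 0.3, n),  # x: direction of the bend
    radius * np.sin(t) + rng.normal(0, 0.3, n),  # y: long axis of the banana
    rng.normal(0, 0.2, n),                       # z: thickness only
])

# PCA by hand: center the data, then take the SVD.
# The rows of `components` are the principal axes, ordered by variance.
centered = points - points.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

explained_variance = singular_values**2 / (n - 1)
explained_ratio = explained_variance / explained_variance.sum()

for i, (axis, ratio) in enumerate(zip(components, explained_ratio), start=1):
    print(f"PC{i}: direction {np.round(axis, 2)}, {ratio:.1%} of the variance")
```

With these made-up proportions, the first axis should line up with the long direction of the banana, the second with the direction of the bend, and the third with its thickness, which is the geometric picture of what PCA does.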

krazykitty,
@krazykitty@mamot.fr

@rmflight @HydrePrever I find the decathlon data more fun to work with, personally, and it's the one with which I've had the most success with my students, but YMMV.

In any case, Palmer Penguins >> Iris!

CCochard,
@CCochard@mastodon.social

@HydrePrever Isn't the iris dataset the one that scikit-learn uses to demonstrate it?

krazykitty,
@krazykitty@mamot.fr

@CCochard @HydrePrever The iris dataset only has four features, PCA is barely necessary...
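
For anyone who wants to check how far four features get you, here is a minimal sketch assuming scikit-learn is installed. It uses the copy of iris bundled with scikit-learn; standardizing the features first is a choice made for this example, not something discussed in the thread.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the four iris measurements (150 flowers x 4 features).
iris = load_iris()

# Put the features on a common scale so none dominates by its units alone.
X = StandardScaler().fit_transform(iris.data)

# Keep all four components to see how the variance is distributed.
pca = PCA(n_components=4)
scores = pca.fit_transform(X)
ratios = pca.explained_variance_ratio_

for i, r in enumerate(ratios, start=1):
    print(f"PC{i}: {r:.1%} of the variance")
```

The first two components carry most of the variance, so the 2D score plot loses little, but with only four columns a scatterplot matrix of the raw features shows much the same thing.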

CCochard,
@CCochard@mastodon.social

@krazykitty @HydrePrever Fair... However, it's almost easier if you are trying to explain how it works to be able to back rationalise what you have done, although the data set is ridiculously boring.

krazykitty,
@krazykitty@mamot.fr

@CCochard @HydrePrever that's true too, but in that case I'd recommend the Palmer Penguins dataset, which has no association with Fisher and Annals of Eugenics.

HydrePrever,
@HydrePrever@mathstodon.xyz

@krazykitty @CCochard you know that I have the whole collection of the Annals of Eugenics in my office 😁

krazykitty,
@krazykitty@mamot.fr

@HydrePrever @CCochard I know how you love citing nothing else^^

LeafyEricScott,
@LeafyEricScott@fosstodon.org

@krazykitty @CCochard @HydrePrever and it was originally published in a eugenics journal, so... eww

CCochard,
@CCochard@mastodon.social

@LeafyEricScott @krazykitty @HydrePrever Yes, Krazy Kitty mentioned it.
I have been wondering whether I should remove my post or keep it, since it has interesting information under it (I should probably at least edit it...)

HydrePrever,
@HydrePrever@mathstodon.xyz

@CCochard @LeafyEricScott @krazykitty no worries... my students think that I am kind of nuts, reminding them all the time that correlation was defined by Galton who was... chi-square by Pearson who was a bloody... and so on, but I don't refrain from using, e.g., Galton's height data.

LeafyEricScott,
@LeafyEricScott@fosstodon.org

@HydrePrever @CCochard @krazykitty I've had this paper in my "to-read" list for a while, which I assume validates this approach based on the title: https://www.tandfonline.com/doi/full/10.1080/26939169.2023.2224407
