gagejustins's HN analysis has inspired me to take a crack at classifying Hacker News front-page stories by type.
Whilst he'd manually assessed each front-page story, I'm classifying by site, so that an NY Times article on, say, quantum computing would still be described as "general news".
I've classified 10,200 of 52,642 domains, the first 300 or so manually, much of the rest using regexes and imputation (e.g., ".edu", ".gov", and sites on Blogspot, Substack, Medium, etc.).
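A minimal sketch of what this kind of site-based classification could look like. It is not the actual code behind these numbers. The `MANUAL` table, the `RULES` list, the category assigned to each pattern, and the `classify` and `domain_of` helpers are all assumptions made for illustration; only the NY Times example and the pattern targets (".edu", ".gov", Blogspot, Substack, Medium) come from the description above.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical hand-labelled domains (standing in for the manually classified ones).
MANUAL = {
    "nytimes.com": "general news",
    "github.com": "software",      # assumed label
    "techcrunch.com": "tech news",  # assumed label
}

# Hypothetical regex rules that impute a category from the shape of the domain.
# The category chosen for each pattern is an assumption.
RULES = [
    (re.compile(r"\.edu$"), "academic / science"),
    (re.compile(r"\.gov$"), "government"),
    (re.compile(r"(^|\.)(blogspot|substack|medium)\.com$"), "blog"),
]


def domain_of(url):
    """Return the lowercased host without a leading 'www.', or None if there is none."""
    host = urlparse(url).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def classify(url):
    """Classify a story by its site, not by its topic."""
    if not url:
        return "n/a"  # no site, e.g. an Ask HN post
    domain = domain_of(url)
    if domain is None:
        return "n/a"
    if domain in MANUAL:
        return MANUAL[domain]
    for pattern, label in RULES:
        if pattern.search(domain):
            return label
    return "unclassified"


# Tally stories per category for a small made-up sample of URLs.
sample_urls = [
    "https://www.nytimes.com/2024/quantum-computing",
    "https://cs.stanford.edu/some-paper",
    "https://someone.substack.com/p/a-post",
    None,
]
counts = Counter(classify(u) for u in sample_urls)
```

Manual labels are checked before the regex rules, so a hand-assigned category always wins over an imputed one.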
Results by story count:
 1  13782  general news
 2  13398  software
 3  10473  tech news
 4   8677  blog
 5   7651  academic / science
 6   7294  n/a
 7   4750  ???
 8   4600  business news
 9   3546  corporate comm.
10   1504  general magazine
11   1291  general information
12   1162  general interest
13   1132  technology
14   1099  videos
15   1073  social media
16    975  government
17    568  corporate comm
18    559  tech discussion
19    505  tech law
20    251  tech publications
21    171  tech blog
22    170  science news
23    136  business education
24    104  corporate comm.
25    103  video
26     99  corporate commm.
27     96  general discussion
28     80  misc
29     71  technology / security
30     61  law
31     59  webcomic
32     49  translation
33     48  health news
34     47  images
35     46  podcast
36     32  law
37      7  legal news

Unclassified: 93213

"n/a" indicates no site, e.g., an Ask, Tell, or Show HN post.

"???" indicates I couldn't (quickly) assess a domain. Examples: 37signals.com, readwriteweb.com, thenextweb.com, archive.org, anandtech.com, avc.com, docs.google.com, righto.com, slideshare.net, infoq.com, hackaday.com, gamasutra.com, marco.org, smashingmagazine.com, highscalability.com, catonmat.net, centernetworks.com, jvns.ca, scribd.com, about.gitlab.com, cloud.google.com, alleyinsider.com, msn.com, firstround.com, axios.com, openculture.com, onstartups.com, ejohn.org, dadgum.com, shkspr.mobi, mixergy.com, geek.com, gmane.org, foundread.com.

"corporate commm." is an obvious typo. This is very rough code & classification.

#HackerNewsAnalytics #MediaAnalysis #HackerNews