dredmorbius, to random

Hacker News front-page analytics

A question about which states are most frequently represented on the HN homepage had me do some quick querying via Hacker News's Algolia search ... which is NOT limited to the front page. Those results were ... surprising (Maine and Iowa outstrip the more probable results of California and, say, New York). Results are further confounded by other factors.

Thread: https://news.ycombinator.com/item?id=36076870

HN provides an interface to historical front-page stories (https://news.ycombinator.com/front), and that can be crawled by providing a list of corresponding date specifications, e.g.:

https://news.ycombinator.com/front?day=2023-05-25

Easy enough.
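A minimal sketch of building that list of date specifications, in Python. The start date is the earliest front page in the archive (20 February 2007) and the end date is assumed to be the 2023-05-25 from the example URL; the function name is made up for illustration. Fetching is deliberately left out, since any real crawl needs a polite delay between requests.

```python
from datetime import date, timedelta

BASE_URL = "https://news.ycombinator.com/front?day="

def front_page_urls(start, end):
    """Return one front-page URL per calendar day from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [BASE_URL + (start + timedelta(days=i)).isoformat() for i in range(n_days)]

# Assumed date range: first archived front page through the example date.
urls = front_page_urls(date(2007, 2, 20), date(2023, 5, 25))
print(len(urls), urls[0], urls[-1])
```

For that range the list holds 5,939 URLs, one per day.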

So I'm crawling that and compiling a local archive. Rate-limiting and other factors mean that's only about halfway complete, and a full pull will take another day or so.

But I'll be able to look at story titles, sites, submitters, time-based patterns (day of week, day of month, month of year, yearly variations), and other patterns. I can also look at mean points and comments along various dimensions.
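As a rough sketch of one such breakdown, here is mean points by day of week in Python. The record layout (ISO date, points, comments) and the function name are assumptions, and the sample records are placeholders, not real HN data.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def mean_points_by_weekday(stories):
    """stories: iterable of (iso_date, points, comments) tuples."""
    totals = defaultdict(lambda: [0, 0])  # weekday -> [sum of points, story count]
    for iso_date, points, _comments in stories:
        day = WEEKDAYS[date.fromisoformat(iso_date).weekday()]
        totals[day][0] += points
        totals[day][1] += 1
    return {day: total / n for day, (total, n) in totals.items()}

# Placeholder records, NOT real front-page data.
sample_stories = [
    ("2023-05-25", 100, 40),  # a Thursday
    ("2023-05-18", 200, 60),  # a Thursday
    ("2023-05-22", 50, 10),   # a Monday
]
weekday_means = mean_points_by_weekday(sample_stories)
print(weekday_means)
```

The same shape works for month of year or year by swapping the key that is derived from the date.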

One surprise: as of January 2015, The Guardian is among the most consistently highly voted sites. I'd thought HN leaned less liberal than that.

The full archive will probably be < 1 GB (raw HTML), currently 123 MB on disk.

Contents are the 30 top-voted stories for each day since 20 February 2007.

If anyone has suggestions for other questions to ask of this, fire away.

And, as of early 2015, top state mentions are:

 1. new york:         150
 2. california:       101
 3. texas:             39
 4. washington:        38
 5. colorado:          15
 6. florida:           10
 7. georgia:           10
 8. kansas:            10
 9. north carolina:     9
10. oregon:             9

NY is highly overrepresented (NY Times, NY Post, NY City), likewise Washington (Post, Times, DC). Adding in "Silicon Valley" and a few other toponyms boosts California's score markedly. I've also got some city-based analytics.
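A minimal Python sketch of this kind of title matching, which also shows why newspaper names inflate the counts. The state list is a small subset, the sample titles are invented, and the matching rule (case-insensitive, whole words, one count per title) is an assumption rather than the method actually used.

```python
import re
from collections import Counter

# Subset of states, for illustration only.
STATES = ["new york", "california", "texas", "washington", "colorado"]
STATE_RE = re.compile(
    r"\b(" + "|".join(re.escape(s) for s in STATES) + r")\b",
    re.IGNORECASE,
)

def count_state_mentions(titles):
    """Count titles mentioning each state; a title counts once per state."""
    counts = Counter()
    for title in titles:
        for state in {m.lower() for m in STATE_RE.findall(title)}:
            counts[state] += 1
    return counts

# Invented titles, NOT real HN stories.
sample_titles = [
    "The New York Times adds a paywall",
    "Washington Post launches new API",
    "Colorado startup raises seed round",
    "New York City opens transit data",
]
state_counts = count_state_mentions(sample_titles)
print(state_counts.most_common())
```

Both "New York" hits and the "Washington" hit here come from a newspaper or a city, not the state as such, which is the overrepresentation described above.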

dredmorbius,

How Much Colorado Love? Or: 16 Years of Hacker News Front-Page Analytics

I've pulled 5,939 front pages from Hacker News, dating from 20 February 2007 to 25 May 2023, initially to answer the question "how often is Colorado mentioned on the front page?" (38 times, 5th most frequent US state). This also affords the opportunity to ask and answer other questions.

Preliminary report: https://news.ycombinator.com/item?id=36098749

dredmorbius,

OK, current stats are 63.5% of posts classified, with 29.8% of sites classified, a/k/a the old 65/30 rule. The mean posts per unclassified site is 1.765, so my returns for further classification will be ... small.

Full breakdown of unclassified sites (number of sites, then posts per site):

    4 20
   14 19
   13 18
   23 17
   32 16
   37 15
   48 14
   55 13
   96 12
  120 11
  122 10
  168 9
  247 8
  315 7
  396 6
  622 5
 1052 4
 2016 3
 5103 2
26494 1

A ... large number of sites w/ <= 20 posts are actually classified, mostly by regexp rules & patterns. Oh, hey, I can dump that breakdown as well:

   35 20
   27 19
   47 18
   31 17
   33 16
   41 15
   51 14
   45 13
   42 12
   29 11
   46 10
   46 9
   47 8
   91 7
  138 6
  178 5
  269 4
  524 3
 1624 2
11472 1

I could pick up just under 4% more posts by classifying another 564 sites (the unclassified ones with 10 or more posts each) but ... that sounds a bit too much like work at the moment. Compromises and trade-offs.
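The arithmetic behind those figures can be checked with a short Python sketch. It reads the first breakdown as (number of sites, posts per site) for unclassified sites, and it assumes the total post count is 5,939 front pages times 30 stories each; that total is an approximation, not a figure stated in this post.

```python
# (number of sites, posts per site), copied from the first breakdown
UNCLASSIFIED = [
    (4, 20), (14, 19), (13, 18), (23, 17), (32, 16),
    (37, 15), (48, 14), (55, 13), (96, 12), (120, 11),
    (122, 10), (168, 9), (247, 8), (315, 7), (396, 6),
    (622, 5), (1052, 4), (2016, 3), (5103, 2), (26494, 1),
]

TOTAL_POSTS = 5939 * 30  # assumption: 30 stories on every archived front page

sites = sum(n for n, _ in UNCLASSIFIED)
posts = sum(n * p for n, p in UNCLASSIFIED)
mean_posts = posts / sites

# Sites worth classifying next: those with 10 or more posts each.
big_sites = sum(n for n, p in UNCLASSIFIED if p >= 10)
big_posts = sum(n * p for n, p in UNCLASSIFIED if p >= 10)
gain = big_posts / TOTAL_POSTS

print(sites, posts, round(mean_posts, 3))
print(big_sites, big_posts, f"{gain:.2%}")
```

This gives 36,977 unclassified sites carrying 65,252 posts (mean 1.765), and 564 sites with 10 or more posts carrying 7,117 posts, or about 3.99% of the assumed total.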

Now to try to turn this into an analysis over time.

I've been working with a summary of activity by site, so running analysis has been pretty quick (52k records and gawk running over that).

To do full date analysis requires reading nearly 180k records, and ... hopefully not having to loop through 52k sites for each of those. Gawk's runtimes start to asplode when running tens of millions of loop iterations, especially if regexes are involved.
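One way to avoid the records-times-sites loop is to classify each distinct site once and then do a keyed lookup per record. The sketch below is in Python rather than gawk, and everything in it is hypothetical: the class names, the rules, the record layout, and the sample records are placeholders, since the post does not give the actual ones.

```python
import re
from collections import Counter
from functools import lru_cache

# Hypothetical rules; the real classes and patterns are not given in the post.
EXACT_RULES = {"nytimes.com": "news", "github.com": "code"}
REGEX_RULES = [
    (re.compile(r"\.edu$"), "academic"),
    (re.compile(r"(^|\.)blogspot\.com$"), "blog"),
]

@lru_cache(maxsize=None)
def classify(site):
    """Classify a site once; repeat lookups are served from the cache."""
    if site in EXACT_RULES:
        return EXACT_RULES[site]
    for pattern, label in REGEX_RULES:
        if pattern.search(site):
            return label
    return "unclassified"

def tally_by_year(records):
    """records: iterable of (iso_date, site). Returns counts keyed by (year, class)."""
    counts = Counter()
    for iso_date, site in records:
        counts[(iso_date[:4], classify(site))] += 1
    return counts

# Placeholder records, NOT real front-page data.
sample_records = [
    ("2015-01-03", "nytimes.com"),
    ("2015-06-10", "cs.stanford.edu"),
    ("2016-02-01", "nytimes.com"),
    ("2016-02-02", "example.org"),
]
yearly = tally_by_year(sample_records)
print(sorted(yearly.items()))
```

With this shape the regex rules run once per distinct site, and each record costs one cache lookup. In gawk the equivalent would be an associative array keyed by site.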
