mfi, to academicchatter German

Does anyone have experience with (AI) tools to assist with print or social media analyses? Any resources you could recommend?

@politicalscience @sociology @academicchatter

news_sniffer, to Ukraine

Andre Damon writes here how the New York Times admits massive Ukraine casualties then later changes the article to cover them up:

https://www.wsws.org/en/articles/2023/07/26/kpcx-j26.html

News Sniffer caught those changes and a couple more too. See them here:

https://www.newssniffer.co.uk/articles/2513855/diff/0/6

news_sniffer, to journalism

News Sniffer is now monitoring 2.2 million news articles and has detected 4.4 million changes!

https://www.newssniffer.co.uk/

dredmorbius, to random

Hacker News front-page analytics

A question about what states were most-frequently represented on the HN homepage had me do some quick querying via Hacker News's Algolia search ... which is NOT limited to the front page. Those results were ... surprising (Maine and Iowa outstrip the more probable results of California and, say, New York). Results are further confounded by other factors.

Thread: https://news.ycombinator.com/item?id=36076870

HN provides an interface to historical front-page stories (https://news.ycombinator.com/front), and that can be crawled by providing a list of corresponding date specifications, e.g.:

https://news.ycombinator.com/front?day=2023-05-25

Easy enough.

So I'm crawling that and compiling a local archive. Rate-limiting and other factors mean that's only about halfway complete, and a full pull will take another day or so.

But I'll be able to look at story titles, sites, submitters, time-based patterns (day of week, day of month, month of year, yearly variations), and other patterns. There's also looking at mean points and comments by various dimensions.

Among the surprises: as of January 2015, one of the most consistently highly-voted sites is The Guardian. I'd thought HN leaned consistently less liberal.

The full archive will probably be < 1 GB (raw HTML), currently 123 MB on disk.

Contents are the 30 top-voted stories for each day since 20 February 2007.
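A minimal sketch of how the list of date specifications for such a crawl could be generated. The function name `front_page_urls` is hypothetical, the end date is simply the example date used above, and the actual fetching (with whatever delay the rate limits call for) is deliberately left out.

```python
from datetime import date, timedelta

# URL pattern taken from the example above.
BASE_URL = "https://news.ycombinator.com/front"


def front_page_urls(start, end):
    """Yield one front-page archive URL per day from start to end, inclusive."""
    day = start
    while day <= end:
        yield f"{BASE_URL}?day={day.isoformat()}"
        day += timedelta(days=1)


# Start date is the first archived day mentioned above; the end date is an assumption.
urls = list(front_page_urls(date(2007, 2, 20), date(2023, 5, 25)))
print(len(urls), "pages to fetch")
print(urls[0])
print(urls[-1])
```

Each URL would then be fetched once and saved locally, so that later analysis runs against the archive rather than the live site.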

If anyone has suggestions for other questions to ask of this, fire away.

And, as of early 2015, top state mentions are:

 1. new york:         150
 2. california:       101
 3. texas:             39
 4. washington:        38
 5. colorado:          15
 6. florida:           10
 7. georgia:           10
 8. kansas:            10
 9. north carolina:     9
10. oregon:             9

NY is highly overrepresented (NY Times, NY Post, NY City), likewise Washington (Post, Times, DC). Adding in "Silicon Valley" and a few other toponyms boosts California's score markedly. I've also got some city-based analytics.

dredmorbius,

HN Front Page / Global Cities Mentions

One question I've had about HN is how well or poorly it represents non-US (or even non-Silicon Valley) viewpoints and issues.

Pulling from the Globalization and World Cities Research Network list, the top 50 global cities names appearing in HN front-page titles:

  1   191  San Francisco
  2   164  London
  3   117  Boston
  4    86  Seattle
  5    60  Tokyo
  6    58  Paris
  7    56  Chicago
  8    56  Hong Kong
  9    55  New York City
 10    50  Berlin
 11    50  Phoenix
 12    45  Rome
 13    40  Detroit
 14    36  Singapore
 15    31  Vancouver
 16    30  Los Angeles
 17    27  Austin
 18    23  Beijing
 19    20  Dubai
 20    19  Shenzhen
 21    19  Toronto
 22    17  Amsterdam
 23    16  Copenhagen
 24    16  Houston
 25    16  Moscow
 26    15  Atlanta
 27    14  Barcelona
 28    14  Denver
 29    13  Baltimore
 30    13  San Jose
 31    13  Stockholm
 32    12  San Diego
 33    12  Sydney
 34    11  Cairo
 35    10  Munich
 36    10  Wuhan
 37     9  Helsinki
 38     9  Miami
 39     9  Mumbai
 40     9  Philadelphia
 41     9  Shanghai
 42     9  Vienna
 43     8  Montreal
 44     7  Beirut
 45     7  Dublin
 46     7  Istanbul
 47     6  Bangalore
 48     6  Dallas
 49     6  Kansas City
 50     6  Minneapolis

(Best viewed in original on toot.cat.)

Note that some idiosyncrasies affect this, e.g., "New York City" appears rarely, whilst "New York" may refer to the city, state, any of several newspapers, universities, etc. "New York" appears 315 times in titles (mostly as "New York Times").
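The ambiguity described here is easy to reproduce. The sketch below counts whole-phrase, case-insensitive matches in titles; the titles are invented for illustration and `count_mentions` is a hypothetical helper, not the code used for the tables above.

```python
import re
from collections import Counter


def count_mentions(titles, names):
    """Count how many titles mention each name as a whole phrase."""
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in names
    }
    counts = Counter()
    for title in titles:
        for name, pattern in patterns.items():
            if pattern.search(title):
                counts[name] += 1
    return counts


# Hypothetical titles, for illustration only.
titles = [
    "The New York Times redesigns its homepage",
    "Why New York City is rebuilding its subway signals",
    "A startup moves from San Francisco to London",
]
counts = count_mentions(titles, ["New York", "New York City", "London"])
print(counts)
```

Here "New York" matches two titles, one of which is a newspaper and the other the city, while "New York City" matches only one. That is the same effect that inflates the state count and deflates the city count.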

I've independently verified that, for example, "Ho Chi Minh City" doesn't appear, though "Ho Chi Minh" alone does:

https://news.ycombinator.com/item?id=15374051, on the 2017-9-30 front page: https://news.ycombinator.com/front?day=2017-09-30

So apply salt liberally.

Edits: tyops & speling.

dredmorbius,

Things about which Hacker News cares being down, and of which it has noticed:

Skype network is down, possibly under viral DoS attack. Lessons?
Is this why Twitter is down? Their Engineer Speaks
Amazon is down ... implications for AWS?
The Website Is Down (Hilarious 10 Minute Video)
Matthew Simmons: The only way is down
GitHub is down
KK on Unabomber: pounce on [technology] when it is down and kill it before it rises again
Yes, Rackspace Is Down And So Are Many Of Your Favorite Sites
Tell HN: Authorize.net is down
Dreamhost is down. All of it.
Most of Slicehost is Down
Ubisoft DRM authentification server is down, Assassin's Creed 2 unplayable
Dropbox is down
Heroku is down for the third time today
Tumblr is Down – Fans Angry
Great. Skype is down.
Reddit Is Down To One Developer
AWS is down, but here's why the sky is falling
Amazon EC2 EU-West is down
Reddit is down for 12 hours protest SOPA and PIPA.
Java.sun.com is down again - breaking bad apps across the land
Heroku is down
Tell HN: Heroku is Down (update: recovering as of 10PM PST)
AWS is down due to an electrical storm in the US
Heroku is down again
Google Talk is down
GoDaddy's DNS Service is Down
Github is down
Netflix is Down
Hacker News is down, so we made five issues free
This site is down because the owner stiffed the web designer
Dropbox is down
WhatsApp is down
DreamObjects is down
Facebook is down (09:08AM PDT Aug 1, 2014)
YTMND is down for temporary maintenance
Google Cloud Is Down
GitHub is down
DigitalOcean block storage is down
Firefox usage is down despite Mozilla's top exec pay going up
Slack is down
[dupe] Slack is down
Tell HN: GitHub is down again
Kiwi Farms is down across all domains as DDoS-Guard terminates service
Twitter's API is down?

dredmorbius,

The Hacker News front page has noted that 282 people have died.

dredmorbius,

Hacker News "Leaders" front-page activity

So, more on that thing I said I wouldn't do but did anyway ...

Backstory: a dumb question led me to crawl the HN front-page (FP) archive from 2007-present, just shy of 6,000 pages, representing 178,162 stories, 52,400 distinct sites, and 43,491 distinct submitters. Each page has up to 30 stories, such that a fully-populated year has 10,950 or 10,980 (leap year) stories.

HN also provides a "leaders" page showing the top-100 members and "karma" (overall votes) --- latter being obscured for the top-10 members, though that can be found on their profile page. (https://news.ycombinator.com/leaders)

So ... I can get a summary of front-page activity for all leaders. It's ... interesting.

To assuage my guilt somewhat I'm only reporting overall / summary or anonymised stats. My goal isn't to out anyone specifically, but to give a sense of what HN front-page and "leader" member activity is like.

Seven leaders have no front-page posts at all, 17 have single-digit counts. The range is from 0 (obv!) to 1,183, mean 175.7, median 129, st.dev. 201.32, 10%ile: 3, 25%ile: 11, 75%ile: 253.5, 90%ile: 493.5.

Active years (years in which there is nonzero front-page activity) is ... all over the map -- there are members with results over 17 years, and with none at all.

What's ... peculiar ... is the points/karma% ratios. "Points" are votes on stories, "karma" is supposedly overall points (sum of story + comment moderation, less some for negative votes). The percentage of votes to overall karma ranges from 0 (no front-page activity) ... to 150.94%: more votes than cumulative karma. Points > overall karma (ratio > 100%) happens sixteen times, which is ... odd.

(Well, I mean, 16 is an even number, but the fact is odd-as-in-strange.)

One reason I've been doing this is to come up with some sense of overall quality metric. Engagements (votes and comments) are a highly-imperfect indicator, but looking at the arithmetic mean of votes and comments is interesting. I'm looking here at the average over all a member's front-page submissions:

Votes range from 0 to 634, mean 196.50, median 105.91, st.dev. 101.92, 25%ile: 150, 75%ile: 239.95.

Comments range from 0 to 323.75, mean 102.06, median: 96.38, 25%ile: 60.67, 75%ile: 123.16.

As might be expected, several members with lower-than-average submissions see high averages (there's more variance in small-n stats). One of the top-10 submitters (by average points and comments) has 514 FP stories, with an average of 236.37 points and 176.96 comments, and the most prolific submitter is very nearly median by votes and comments.

It's also possible to look at who's submitting a small or large range of sites by calculating a sites/stories% ratio. I'm finding, for example, one leader with 414 FP stories, from only 30 distinct sites, with the top site representing over half their submissions. (The site in question is legit and interesting, this does not appear to be spammy.) Several appear to favour their own personal sites / blogs, though again, not in a noxious way that I see. And 18 leaders have posted only a single item per site (each post is its own site), ranging from 1 to 20 FP items overall.

The ratio ranges from 0 (obv!) to 100 (obv!), mean 67.03%, median 71.83%, 25%ile: 51.82%, 75%ile: 89.72%.
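As a rough illustration of the sites/stories% ratio, here is a sketch assuming it is simply distinct sites divided by front-page stories, as a percentage. The function name and the `.example` domains are hypothetical; only the 414-stories / 30-sites figures come from the post above.

```python
def sites_per_story_pct(story_sites):
    """Distinct sites as a percentage of stories; 0 for a member with no stories."""
    if not story_sites:
        return 0.0
    return 100.0 * len(set(story_sites)) / len(story_sites)


# One story per site: every submission is from a different domain.
all_distinct = ["a.example", "b.example", "c.example"]

# Stand-in for the leader with 414 FP stories from only 30 distinct sites.
concentrated = [f"site{i % 30}.example" for i in range(414)]

print(f"{sites_per_story_pct(all_distinct):.2f}%")
print(f"{sites_per_story_pct(concentrated):.2f}%")
```

Under that reading, the 414-story leader comes out at a little over 7%, far below the 25th percentile quoted above.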

dredmorbius,

HN Front Page: Foreign Policy Top 100 Global Thinkers (2014)

I pulled a copy of the "global thinkers" list I'd used as an indicator of website salience in a 2015 study.

The HN front page offers a limited opportunity for matches --- titles are 80 characters only, and HN's editorial policy is to not list authors of works, so what will show here is likely a subset of actual mentions.

That said: nearly a quarter of the list (23 entries) appear, from 1 to 11 times each. Paul Krugman (11), Lawrence Lessig (10), and Richard Dawkins (10) top the list.

     1  Paul Krugman:  11
     2  Lawrence Lessig:  10
     3  Richard Dawkins:  10
     4  Freeman Dyson:  9
     5  Daniel Kahneman:  8
     6  Noam Chomsky:  8
     7  Jaron Lanier:  6
     8  Steven Pinker:  5
     9  Daniel Dennett:  4
    10  Christopher Hitchens:  2
    11  Craig Venter:  2
    12  Edward O. Wilson:  2
    13  Jared Diamond:  2
    14  Richard Posner:  2
    15  Steven Weinberg:  2
    16  Thomas Friedman:  2
    17  Gary Becker:  1
    18  Hernando de Soto:  1
    19  James Lovelock:  1
    20  Larry Summers:  1
    21  Martha Nussbaum:  1
    22  Peter Singer:  1
    23  Salman Rushdie:  1

The 2015 post, "Tracking the Conversation", is here: https://old.reddit.com/r/dredmorbius/comments/3hp41w/tracking_the_conversation_fp_global_100_thinkers/

dredmorbius,

Hacker News Analytics: ~3% of submissions reach front page, with half of comments on FP articles

This is a finding based on maths and a previous study by Whaly in 2022 based on HN 2021 activity, rather than my own crawl, though it's informed by the latter.

https://whaly.io/posts/hacker-news-2021-retrospective

The HN front page is a limited resource --- there are 365 * 30 == 10,950 front-page slots in a year, another 30, or 10,980, in a leap year, and regardless of site activity over a year, those slots are fixed. It's somewhat of a reminder that regardless of how much information we can access, our time to process that information is finite. Or as Herbert Simon observed: what information consumes is attention.

Whaly saw 386,663 total story submissions for 2021. I'm pretty sure that this is net of moderation (user flags, auto-kills, spam detection, voting-ring detection and the like). Either way, it works out to a hair under 3% of the stories that clear those tripwires going on to land on the HN front page.

Mind that that's actually a somewhat low estimate, as a story may appear for part of the day on the front page but not be represented on the end-of-day front-page archive.

I'm now thinking of doing some spot checks to see what kinds of success rates individual submitters have in landing on the front page. From what I've seen, even well-known and popular members have at best a modest chance of success.

Whaly also gives a total number of comments: 3,769,520. That I can compare to my own front-page stats for 2021: 1,859,933, or 49.34% of all comments. That is, half of HN comments appear on the 3% of stories which reach the front page. That percentage is lower than what I'd have expected, though it's still a very strong bias toward the front page.
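The two headline percentages can be checked directly from the figures quoted in this post. This is just the arithmetic; the variable names are mine.

```python
# Figures as quoted above: Whaly's 2021 totals and the front-page crawl.
slots_2021 = 365 * 30            # front-page slots in a non-leap year
submissions_2021 = 386_663       # Whaly: total story submissions, 2021
all_comments_2021 = 3_769_520    # Whaly: total comments, 2021
fp_comments_2021 = 1_859_933     # comments on archived front-page stories, 2021

fp_story_share = 100 * slots_2021 / submissions_2021
fp_comment_share = 100 * fp_comments_2021 / all_comments_2021

print(f"front-page share of submissions: {fp_story_share:.2f}%")   # 2.83%
print(f"front-page share of comments:    {fp_comment_share:.2f}%")  # 49.34%
```

Note that the first figure treats every slot as a distinct submission, so it inherits the caveat above about stories that only appear on the front page for part of a day.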

(Now I want to complete another analysis I'd thought of: mean votes and comments by story position (1--30), by year. Hrm...)

dredmorbius,

gagejustins's HN analysis has inspired me to take a crack at classifying Hacker News front-page stories by type.

Whilst he'd manually assessed each front-page story, I'm classifying by site, so that an NY Times article on, say, quantum computing would still be described as "general news".

I've classified 10,200 of 52,642 domains, the first 300 or so manually, much of the rest using regexes and imputation (e.g., ".edu", ".gov", and sites on Blogspot, Substack, Medium, etc.).
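A minimal sketch of what "manual first, then regexes and imputation" could look like. The rule set, the manual entry, and the mapping of each pattern to a category are assumptions for illustration; the category names are borrowed from the results in this thread, but this is not the actual rule list.

```python
import re

# Hand-assigned classifications take priority (hypothetical entry).
MANUAL = {
    "nytimes.com": "general news",
}

# Pattern rules, tried in order. The pattern-to-category mapping is assumed.
RULES = [
    (re.compile(r"\.edu$"), "academic / science"),
    (re.compile(r"\.gov$"), "government"),
    (re.compile(r"(^|\.)(blogspot\.com|substack\.com|medium\.com)$"), "blog"),
]


def classify(domain):
    """Return a classification for a domain, or None if nothing matches."""
    domain = domain.lower()
    if domain in MANUAL:
        return MANUAL[domain]
    for pattern, label in RULES:
        if pattern.search(domain):
            return label
    return None


for d in ["nytimes.com", "mit.edu", "someone.substack.com", "unknown.example"]:
    print(d, "->", classify(d))
```

Anchoring the hosting-platform patterns on a leading dot (or start of string) keeps a domain such as `notmedium.com` from being swept into "blog" by accident.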

Results by story count:

     1  13782  general news
     2  13398  software
     3  10473  tech news
     4   8677  blog
     5   7651  academic / science
     6   7294  n/a
     7   4750  ???
     8   4600  business news
     9   3546  corporate comm.
    10   1504  general magazine
    11   1291  general information
    12   1162  general interest
    13   1132  technology
    14   1099  videos
    15   1073  social media
    16    975  government
    17    568  corporate comm
    18    559  tech discussion
    19    505  tech law
    20    251  tech publications
    21    171  tech blog
    22    170  science news
    23    136  business education
    24    104  corporate comm.
    25    103  video
    26     99  corporate commm.
    27     96  general discussion
    28     80  misc
    29     71  technology / security
    30     61  law
    31     59  webcomic
    32     49  translation
    33     48  health news
    34     47  images
    35     46  podcast
    36     32  law
    37      7  legal news

  Unclassified: 93213

"n/a" indicates no site, e.g., an Ask, Tell, or Show HN post.

'???' indicates I couldn't (quickly) assess a domain. Examples: 37signals.com, readwriteweb.com, thenextweb.com, archive.org, anandtech.com, avc.com, docs.google.com, righto.com, slideshare.net, infoq.com, hackaday.com, gamasutra.com, marco.org, smashingmagazine.com, highscalability.com, catonmat.net, centernetworks.com, jvns.ca, scribd.com, about.gitlab.com, cloud.google.com, alleyinsider.com, msn.com, firstround.com, axios.com, openculture.com, onstartups.com, ejohn.org, dadgum.com, shkspr.mobi, mixergy.com, geek.com, gmane.org, foundread.com.

"cproorate commm." is an obvious typo. This is very rough code & classification.

#HackerNewsAnalytics #MediaAnalysis #HackerNews

dredmorbius,

I'm continuing to play with this, and have classified a whole mess more sites (reminder to self: update that count) (response to self: 13,150 sites classified).

So that's about 25% of all sites that are classified. Looking by story count ... it's about 55% of all FP stories. (Power laws are your friend here...)

Looking at my current breakdowns (and again, this is all VERY ROUGH):

     1   15770  8.82%  blog
     2   15034  8.40%  general news
     3   13899  7.77%  software
     4   12889  7.21%  tech news
     5    7960  4.45%  academic / science
     6    7294  4.08%  n/a
     7    6025  3.37%  corporate comm.
     8    4859  2.72%  business news
     9    2120  1.19%  social media
    10    2031  1.14%  general interest
    11    1557  0.87%  general magazine
    12    1397  0.78%  general information
    13    1239  0.69%  technology
    14    1099  0.61%  videos
    15     975  0.55%  government
    16     607  0.34%  ???
    17     559  0.31%  tech discussion
    18     505  0.28%  tech law
    19     497  0.28%  misc documents
    20     420  0.23%  science news
    21     316  0.18%  mailing list
    22     251  0.14%  tech publications
    23     171  0.10%  tech blog
    24     149  0.08%  literature
    25     136  0.08%  business education
    26     133  0.07%  cryptocurrency
    27     126  0.07%  law
    28     118  0.07%  webcomic
    29     109  0.06%  entertainment news
    30     103  0.06%  health news
    31     103  0.06%  video
    32      96  0.05%  general discussion
    33      80  0.04%  misc
    34      71  0.04%  technology / security
    35      49  0.03%  translation
    36      47  0.03%  images
    37      46  0.03%  podcast
    38      42  0.02%  journalism
    39      30  0.02%  propaganda
    40      29  0.02%  healthcare / medicine
    41      18  0.01%  medicine
    42       7  0.00%  legal news

Classified:    98966
Unclassified:  79916
Total:        178882
Ratio:             0.553

My classifications are rough and I may revisit these. "blog" covers a lot of sins, though most are tech blogs (which makes "technology blog" redundant).

What I'd really like to do is to look at how trends vary over the years. Perhaps also by day of week / month of year. Finally answer that age-old question of whether HN is turning into Reddit....

As noted above, this is based on classifying the site rather than interpreting the title or reading the source article, so it's all a bit wobbly.

(This post formats better on toot.cat or on sites that render Markdown.)

dredmorbius,

OK, current stats are 63.5% of posts classified, with 29.8% of sites classified, a/k/a the old 65/30 rule. The mean posts per unclassified site is 1.765, so my returns for further classification will be ... small.

Full breakdown:

Sites  Posts per site
    4  20
   14  19
   13  18
   23  17
   32  16
   37  15
   48  14
   55  13
   96  12
  120  11
  122  10
  168  9
  247  8
  315  7
  396  6
  622  5
 1052  4
 2016  3
 5103  2
26494  1

A ... large number of sites w/ <= 20 posts are actually classified, mostly by regexp rules & patterns. Oh, hey, I can dump that breakdown as well:

Sites  Posts per site
   35  20
   27  19
   47  18
   31  17
   33  16
   41  15
   51  14
   45  13
   42  12
   29  11
   46  10
   46  9
   47  8
   91  7
  138  6
  178  5
  269  4
  524  3
 1624  2
11472  1

I could pick up just under 4% more posts by classifying another 564 sites but ... that sounds a bit too much like work at the moment. Compromises and trade-offs.
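The 564-site figure can be recovered from the first breakdown, assuming that breakdown lists unclassified sites as (number of sites, posts per site) and that the cut-off is sites with 10 or more posts. The total of 178,882 posts is taken from the earlier classification summary and is assumed to still apply.

```python
# First breakdown above, read as {posts per site: number of unclassified sites}.
unclassified = {
    20: 4, 19: 14, 18: 13, 17: 23, 16: 32, 15: 37, 14: 48, 13: 55,
    12: 96, 11: 120, 10: 122, 9: 168, 8: 247, 7: 315, 6: 396, 5: 622,
    4: 1052, 3: 2016, 2: 5103, 1: 26494,
}
TOTAL_POSTS = 178_882  # total from the earlier summary (assumed unchanged)
THRESHOLD = 10         # assumed cut-off: sites with at least this many posts

sites = sum(n for posts, n in unclassified.items() if posts >= THRESHOLD)
posts_gained = sum(posts * n for posts, n in unclassified.items() if posts >= THRESHOLD)
share = 100 * posts_gained / TOTAL_POSTS

print(f"{sites} sites, {posts_gained} posts, {share:.2f}% of all posts")
# 564 sites, 7117 posts, 3.98% of all posts
```

Lowering the threshold shows how quickly the returns fall off: each step down adds more sites to classify for fewer posts apiece.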

Now to try to turn this into an analysis over time.

I've been working with a summary of activity by site, so running analysis has been pretty quick (52k records and gawk running over that).

To do full date analysis requires reading nearly 180k records, and ... hopefully not having to loop through 52k sites for each of those. Gawk's runtimes start to asplode when running tens of millions of loop iterations, especially if regexes are involved.
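One common way to avoid the nested loop is to resolve each site's classification once, store it in a hash table, and then do a single lookup per story record. The sketch below uses Python for illustration, though gawk's associative arrays support the same approach; the record layout, the sample data, and the function name are all assumptions.

```python
from collections import Counter

# Site -> classification, built once (e.g. from the per-site summary).
# Entries here are hypothetical.
site_class = {
    "nytimes.com": "general news",
    "github.com": "programming",
}

# Story records as (ISO date, site) pairs; invented sample data.
records = [
    ("2015-01-03", "nytimes.com"),
    ("2015-06-11", "nytimes.com"),
    ("2016-02-20", "github.com"),
    ("2016-02-21", "unknown.example"),
]


def count_by_year(records, site_class):
    """Count stories per (year, classification) with one dict lookup per record."""
    counts = Counter()
    for day, site in records:
        year = int(day[:4])
        counts[(year, site_class.get(site, "unclassified"))] += 1
    return counts


counts = count_by_year(records, site_class)
for (year, label), n in sorted(counts.items()):
    print(year, label, n)
```

With this layout the regex work scales with the number of distinct sites rather than with sites times stories, so the full pass over roughly 180k records stays a single linear scan.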

dredmorbius,

So ... I'm starting to get the reporting by site classification across years down and ... it is interesting.

Preliminary and buggy code yet. Also this is highly dependent on how I've actually classified sites.

I've got a few classifications I'd wanted to keep an eye on:

  • Programming-specific sites. A lot of this is github and gitlab, basically, software projects with code. I'm distinguishing software (which is mostly about use) and programming which involves, or at least anticipates, actual development.
  • "Political commentary". I used this as a description for ... highly political sites (spot-checking to see what stories actually hit the front page, though I should be more robust in that). The list: reason.com, rt.com, bostonreview.net, alternet.org, cato.org, rootsofprogress.org, breitbart.com, dailykos.com, mises.org, dailycaller.com, jacobinmag.com, rawstory.com, tribunemag.co.uk, hoover.org, heritage.org, theroot.com, wsws.org, adamsmith.org, manhattan-institute.org, theblaze.com.

And there's "academic / science" which is mostly university and academic press / journal sites.

Anywho....

... at least from initial takes, the trending on these does not suggest a shift toward sensationalistic topics and/or sites, but the opposite. Many more programming FP stories in recent years, less political commentary, and more academic/science items.

Presuming this holds up as I code further.

This is one of the fun things about data analysis: stuff jumps out at you, sometimes confirming hunches, but often radically violating preconceptions.

I want to look more closely at what happens in the lead-up and follow-on to the 2016 US elections cycle in particular....

Hrm. What does spike is cryptocurrency-specific sites in 2014. Though that falls off again. (I suspect as that discussion enters more mainstream sources.)

And "general info" and "general interest" sites seem to rise in recent years.
