dredmorbius, 1 year ago
So ... I'm playing with a report showing how often F500 companies are mentioned in HN submission titles.

As I've noted, most of my scripting is in awk (gawk), and it's ... usually pretty good.

I'm toying with a couple of loops where I read all 178k titles, and all 500 company names, into arrays, then check to see if the one appears in the other.

The first iteration of that was based on the index() function, which is a simple string match. Problem is that there are substring matches, for example "Lear" (the company) will match on "Learn", "Learning", etc., and so is strongly overrepresented.
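For illustration, here's a rough sketch of that kind of index() loop, not the actual script. The function name count_substring and the sample data are made up for the example; the real inputs are the 178k titles and 500 names.

```shell
# Hypothetical helper: count titles containing each name as a plain substring.
count_substring() {   # usage: count_substring NAMES_FILE TITLES_FILE
    awk '
        NR == FNR { names[$0]; next }      # first file: one company name per line
        {
            for (n in names)
                if (index($0, n) > 0)      # substring test, no word boundaries
                    count[n]++
        }
        END { for (n in count) print n ": " count[n] }
    ' "$1" "$2" | sort
}

# Made-up sample data to show the overcount.
dir=$(mktemp -d)
printf '%s\n' 'Lear' 'Apple' > "$dir/names"
printf '%s\n' 'Learn awk in a weekend' 'Lear posts record quarter' \
    'Apple ships new Mac' > "$dir/titles"
count_substring "$dir/names" "$dir/titles"
# Apple: 1
# Lear: 2
```

"Lear" gets counted twice here because index() is just as happy with "Learn" as with the company name.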

So I swapped in match(), which is a regular-expression match, and added \W as word-boundaries.
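A sketch of the word-bounded variant, again with a made-up function name (count_words) and sample data. To keep it portable across awk implementations it spells the boundaries out as bracket expressions; in gawk specifically, \W (non-word character), \y (word boundary), and \< / \> (start / end of word) are available as shorthand. It also assumes the names contain no regex metacharacters, which won't hold for something like "Amazon.com".

```shell
# Hypothetical helper: count titles containing each name as a whole word.
count_words() {   # usage: count_words NAMES_FILE TITLES_FILE
    awk '
        NR == FNR { names[$0]; next }      # first file: one company name per line
        {
            for (n in names) {
                # Name must sit between line start/end or non-word characters.
                re = "(^|[^A-Za-z0-9_])" n "([^A-Za-z0-9_]|$)"
                if (match($0, re))
                    count[n]++
            }
        }
        END { for (n in count) print n ": " count[n] }
    ' "$1" "$2" | sort
}

# Made-up sample data: "Learn" no longer counts towards "Lear".
dir=$(mktemp -d)
printf '%s\n' 'Lear' 'Apple' > "$dir/names"
printf '%s\n' 'Learn awk in a weekend' 'Lear posts record quarter' \
    'Apple ships new Mac' > "$dir/titles"
count_words "$dir/names" "$dir/titles"
# Apple: 1
# Lear: 1
```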

The index-based search ran in about 20 seconds. That's a brief wait, but doable.

The match (regex) based search ... just finished as I'm writing this. 13 minutes 40 seconds.

Regexes are useful, but can be awfully slow.

Which means that my first go at this --- still using gawk but having it generate grep searches and printing the match count only ... is much faster whilst being accurate. That runs in just under a minute here. I'd looked for another solution as awk is "dumb" re the actual output: it doesn't read or capture the actual counts, so I'll either have to tweak that program or feed its output to an additional parser. Neither of which is a big deal, mind.
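One possible tweak, sketched under assumptions: awk can read a command's output back in with `cmd | getline var`, so the generated grep can hand its count to the same program. The function name count_with_grep is made up, and the sketch uses `grep -F` (fixed strings) rather than egrep so names aren't treated as regexes. It assumes no name or path contains a single quote.

```shell
# Hypothetical helper: run grep -wc per name and capture the count inside awk.
count_with_grep() {   # usage: count_with_grep NAMES_FILE TITLES_FILE
    awk -v titles="$2" '
        {
            # -F fixed string, -w whole word, -c count of matching lines.
            # \047 is a single quote, used to shell-quote name and path.
            cmd = "grep -Fwc -- \047" $0 "\047 \047" titles "\047"
            cmd | getline n            # read the one line grep prints
            close(cmd)
            print $0 ": " n
        }
    ' "$1"
}

# Made-up sample data.
dir=$(mktemp -d)
printf '%s\n' 'Lear' 'Apple' > "$dir/names"
printf '%s\n' 'Learn awk in a weekend' 'Lear posts record quarter' \
    'Apple ships new Mac' > "$dir/titles"
count_with_grep "$dir/names" "$dir/titles"
# Lear: 1
# Apple: 1
```

With the counts in awk's hands, sorting or ranking can happen in the END block or by piping to `sort -t: -k2 -nr`.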

Oh, and Apple seems to be the most-mentioned company, though the F500 list omits Google (or YouTube, or Android), listing only Alphabet, which probably results in a severe undercount.

Top 10 using the F100 list:
 1 Apple: 2447
 2 Microsoft: 1517
 3 Amazon: 1457
 4 Intel: 554
 5 Tesla: 404
 6 Netflix: 322
 7 IBM: 309
 8 Adobe: 180
 9 Oracle: 167
10 AT&T: 143
Add to those:
$ egrep -wc '(Google|Alphabet|You[Tt]ube|Android)' hn-titles
7163
$ egrep -wc '(Apple|iPhone|iPad|iPod|Mac[Bb]ook)' hn-titles
3656
$ egrep -wc '(Facebook|Instagram)' hn-titles
2512
Note I didn't even try "Meta", though let's take a quick look ... yeah, that's a mess.

Up until 2021-10-28, "Meta" is a concept, with 33 entries. That was the day Facebook announced its name change. 82 total matches (so low overall compared to the earlier numbers above), 49 post-announcement, of which two are not related to Facebook a/k/a Meta. Several of the titles mention both FB & Meta ... looks like that's four of 'em.

So "Meta" boosts FB's count by 45.

There are another 296 mentions of Steve Jobs and Tim Cook which don't also include "Apple".

And "Alphabet" has 54 matches, six of which don't relate to the company.

Of the MFAANG companies:
Google: 5796
Apple: 2447
Facebook: 2371
Microsoft: 1517
Amazon: 1457
Netflix: 322
(Based on grep.)

#DataAnalysis #awk #grep #bash #HackerNewsAnalytics