stevensanderson, to datascience
@stevensanderson@mstdn.social avatar

If you're ready to level up your data manipulation skills, give intersect() a spin and let your insights shine! 🌈 Embrace the world of R and keep growing as a data wizard! 🧙‍♂️ Happy coding! 🎉
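A minimal sketch of intersect() in action, using invented example vectors rather than anything from the linked post:

# Two example vectors (hypothetical data)
q1_customers <- c("Ann", "Bob", "Cara", "Dev")
q2_customers <- c("Bob", "Dev", "Elle")

# intersect() returns the elements the two vectors have in common
intersect(q1_customers, q2_customers)
# "Bob" "Dev"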

Post: https://www.spsanderson.com/steveondata/posts/2023-07-28/

eric_ma, to datascience
@eric_ma@techhub.social avatar

Looking for a recommendation (website, Substack, any other material...) where I can improve my SQL knowledge. I am looking for something I can read (theory) and practice (exercises). I really enjoy learning Python on Substack, but so far I have not found anything similar for SQL.

Any advice or recommendations?

stevensanderson, to datascience
@stevensanderson@mstdn.social avatar

📢 Master the Art of List Subsetting in R! 🚀 Or: Lists...again

📝 Lists in R are versatile data structures, capable of holding various elements like vectors, matrices, and even other lists. But what makes them truly magical is the ability to extract specific data efficiently through subsetting. 🎯
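As a rough sketch (example list invented here, not taken from the blog post), the three subsetting operators behave like this:

my_list <- list(nums = c(1, 2, 3), words = c("a", "b"))

my_list[1]       # single brackets return a sub-list (still a list)
my_list[[1]]     # double brackets return the element itself (the numeric vector)
my_list$words    # $ extracts a named element, same as my_list[["words"]]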

Blog Post: https://www.spsanderson.com/steveondata/posts/2023-07-19/

stevensanderson, to random
@stevensanderson@mstdn.social avatar

I encourage you to roll up your sleeves and give it a try yourself. 💪🔍

Read the full blog post and start your exploration. Let's dive in and level up your data analysis game! 🚀📊

https://www.spsanderson.com/steveondata/posts/2023-07-17/

stevensanderson, to statistics
@stevensanderson@mstdn.social avatar

Let's unlock the true potential of your data together! Read the blog post, try the cov() function, and let's embark on an exciting journey of discovery. 💡
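For reference, a minimal cov() example on R's built-in mtcars data (my own illustration, not code from the post):

# Covariance between two columns
cov(mtcars$mpg, mtcars$wt)

# Covariance matrix for several columns at once
cov(mtcars[, c("mpg", "wt", "hp")])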

Post: https://www.spsanderson.com/steveondata/posts/2023-07-14/

#r

stevensanderson, to opensource
@stevensanderson@mstdn.social avatar

# Path to the file we want to check (assumed to be in the working directory)
file_path <- "data.csv"

if (file.exists(file_path)) {
  print("The file exists!")
} else {
  print("The file does not exist.")
}

In this example, we check if the file named "data.csv" exists. Depending on the outcome, it will print either "The file exists!" or "The file does not exist."

Post: https://www.spsanderson.com/steveondata/posts/2023-07-13/

#R

MarkRubin, to science
@MarkRubin@fediscience.org avatar

“Turning all the knobs!”

In Part 2 of a two-part series of articles, Michael Höfler and colleagues consider “how to explore data to modify existing claims and create new ones”

https://doi.org/10.15626/MP.2022.3270



BarbChamberlain, to random
@BarbChamberlain@toot.community avatar
purplepadma, to random

Morning, work today but that’s all good. Back to my ! I slept better but had some ker-AAAA-zee dreams. How did you sleep? What plans do you have?

stevensanderson, to statistics
@stevensanderson@mstdn.social avatar

🔬📊 Mastering Data Grouping with R's ave() Function 📊🔬

Are you tired of manually calculating statistics for different groups in your data analysis projects? Look no further! R's ave() function is here to revolutionize your data grouping experience. 🚀
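A tiny sketch of the idea, with invented data (the post itself has fuller examples): ave() returns a vector the same length as its input, with each element replaced by a statistic computed over that element's group.

sales  <- c(10, 20, 30, 40)
region <- c("east", "east", "west", "west")

# Group-wise mean, aligned with the original rows
ave(sales, region, FUN = mean)
# 15 15 35 35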

Post: https://www.spsanderson.com/steveondata/posts/2023-06-27/

#r

stevensanderson, to opensource
@stevensanderson@mstdn.social avatar

📊🔬 Exciting news! Learn bootstrap resampling in R with lapply, rep, and sample functions. Estimate uncertainty, analyze data variability, and unlock insights. #R 🎉💻
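A compact sketch of that bootstrap pattern (a minimal version written here for illustration, not the post's code):

set.seed(123)
x <- rnorm(100)

# Draw 1000 bootstrap resamples and record the mean of each
boot_means <- unlist(lapply(1:1000, function(i) mean(sample(x, replace = TRUE))))

# Rough 95% interval for the mean from the bootstrap distribution
quantile(boot_means, c(0.025, 0.975))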

Post: https://www.spsanderson.com/steveondata/posts/2023-06-23/

SimonMolinsky, to python
@SimonMolinsky@fosstodon.org avatar

Call for REVIEWERS!

I'm looking for reviewers specialized in:

  • network science 🕸
  • higher-order networks (hypergraph, simplicial complex) 😲
  • data analysis and graph algorithms 🔗
  • network visualization 🌐
  • and Python 🐍

Package XGI was submitted to @pyOpenSci and is awaiting review (link to the submission in the first comment)! If you don't have time but know someone who might be interested, please share this post!

dredmorbius, to random

Hacker News front-page analytics

A question about what states were most-frequently represented on the HN homepage had me do some quick querying via Hacker News's Algolia search ... which is NOT limited to the front page. Those results were ... surprising (Maine and Iowa outstrip the more probable results of California and, say, New York). Results are further confounded by other factors.

Thread: https://news.ycombinator.com/item?id=36076870

HN provides an interface to historical front-page stories (https://news.ycombinator.com/front), and that can be crawled by providing a list of corresponding date specifications, e.g.:

https://news.ycombinator.com/front?day=2023-05-25

Easy enough.
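For illustration, one way to generate that list of date-based URLs in R (the author's crawler isn't shown in the thread; the date range is the one given later in this thread):

days <- seq(as.Date("2007-02-20"), as.Date("2023-05-25"), by = "day")
urls <- paste0("https://news.ycombinator.com/front?day=", days)
head(urls, 3)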

So I'm crawling that and compiling a local archive. Rate-limiting and other factors mean that's only about halfway complete, and a full pull will take another day or so.

But I'll be able to look at story titles, sites, submitters, time-based patterns (day of week, day of month, month of year, yearly variations), and other patterns. There's also looking at mean points and comments by various dimensions.

Among surprises are that as of January 2015, among the highest consistently-voted sites is The Guardian. I'd thought HN leaned consistently less liberal.

The full archive will probably be < 1 GB (raw HTML), currently 123 MB on disk.

Contents are the 30 top-voted stories for each day since 20 February 2007.

If anyone has suggestions for other questions to ask of this, fire away.

And, as of early 2015, top state mentions are:

 1. new york:         150
 2. california:       101
 3. texas:             39
 4. washington:        38
 5. colorado:          15
 6. florida:           10
 7. georgia:           10
 8. kansas:            10
 9. north carolina:     9
10. oregon:             9

NY is highly overrepresented (NY Times, NY Post, NY City), likewise Washington (Post, Times, DC). Adding in "Silicon Valley" and a few other toponyms boosts California's score markedly. I've also got some city-based analytics.

dredmorbius,

How Much Colorado Love? Or: 16 Years of Hacker News Front-Page Analytics

I've pulled 5,939 front pages from Hacker News, dating from 20 February 2007 to 25 May 2023, initially to answer the question "how often is Colorado mentioned on the front page?" (38 times, 5th most frequent US state). This also affords the opportunity to ask and answer other questions.

Preliminary report: https://news.ycombinator.com/item?id=36098749

dredmorbius,

I'm wanting to test some reporting / queries / logic based on a sampling of data.

Since my file-naming convention follows ISO-8601 (YYYY-MM-DD), I can just lexically sort those.

And to grab a random year's worth (365 days) of reports from across the set:

ls rendered-crawl/* | sort -R | head -365 | sort

(I've rendered the pages, using w3m's -dump feature, to speed processing).

The full dataset is large enough and my awk code sloppy enough (several large sequential lists used in pattern-matching) that a full parse takes about 10 minutes, so the sampling shown here speeds development better than 10x while still providing representative data across time.

dredmorbius,

HN Front Page / Global Cities Mentions

One question I've had about HN is how well or poorly it represents non-US (or even non-Silicon Valley) viewpoints and issues.

Pulling from the Globalization and World Cities Research Network list, the top 50 global cities names appearing in HN front-page titles:

  1   191  San Francisco
  2   164  London
  3   117  Boston
  4    86  Seattle
  5    60  Tokyo
  6    58  Paris
  7    56  Chicago
  8    56  Hong Kong
  9    55  New York City
 10    50  Berlin
 11    50  Phoenix
 12    45  Rome
 13    40  Detroit
 14    36  Singapore
 15    31  Vancouver
 16    30  Los Angeles
 17    27  Austin
 18    23  Beijing
 19    20  Dubai
 20    19  Shenzhen
 21    19  Toronto
 22    17  Amsterdam
 23    16  Copenhagen
 24    16  Houston
 25    16  Moscow
 26    15  Atlanta
 27    14  Barcelona
 28    14  Denver
 29    13  Baltimore
 30    13  San Jose
 31    13  Stockholm
 32    12  San Diego
 33    12  Sydney
 34    11  Cairo
 35    10  Munich
 36    10  Wuhan
 37     9  Helsinki
 38     9  Miami
 39     9  Mumbai
 40     9  Philadelphia
 41     9  Shanghai
 42     9  Vienna
 43     8  Montreal
 44     7  Beirut
 45     7  Dublin
 46     7  Istanbul
 47     6  Bangalore
 48     6  Dallas
 49     6  Kansas City
 50     6  Minneapolis

(Best viewed in original on toot.cat.)

Note that some idiosyncrasies affect this, e.g., "New York City" appears rarely, whilst "New York" may refer to the city, state, any of several newspapers, universities, etc. "New York" appears 315 times in titles (mostly as "New York Times").

I've independently verified that, for example, "Ho Chi Minh City" doesn't appear, though "Ho Chi Minh" alone does:

https://news.ycombinator.com/item?id=15374051, on the 2017-9-30 front page: https://news.ycombinator.com/front?day=2017-09-30

So apply salt liberally.

Edits: tyops & speling.

dredmorbius,

So ... I'm playing with a report showing how often F500 companies are mentioned in HN submission titles.

As I've noted, most of my scripting is in awk (gawk), and it's ... usually pretty good.

I'm toying with a couple of loops where I read all 178k titles, and all 500 company names, into arrays, then check to see if the one appears in the other.

The first iteration of that was based on the index() function, which is a simple string match. Problem is that there are substring matches, for example "Lear" (the company) will match on "Learn", "Learning", etc., and so is strongly overrepresented.

So I swapped in match(), which does a regular-expression match, and added word-boundary anchors around the company names.
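The same substring-versus-word-boundary distinction, sketched in R rather than awk (my illustration, not the author's script):

titles <- c("Lear announces layoffs", "Learning Rust in 2023", "Learn You a Haskell")

# Plain substring match, analogous to awk's index(): over-matches
grepl("Lear", titles, fixed = TRUE)
# TRUE TRUE TRUE

# Regex with word boundaries, analogous to the match() approach: only the company
grepl("\\bLear\\b", titles, perl = TRUE)
# TRUE FALSE FALSE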

The index-based search ran in about 20 seconds. That's a brief wait, but doable.

The match (regex) based search ... just finished as I'm writing this. 13 minutes 40 seconds.

Regexes are useful, but can be awfully slow.

Which means that my first go at this (still using gawk, but having it generate grep searches and print only the match count) is much faster whilst being accurate: it runs in just under a minute here. I'd looked for another solution because the awk script is "dumb" about the actual output: it doesn't read or capture the counts itself, so I'll either have to tweak that program or feed its output to an additional parser. Neither of which is a big deal, mind.

Oh, and Apple seems to be the most-mentioned company, though the F500 list omits Google (or YouTube, or Android), listing only Alphabet, which probably results in a severe undercount.

Top 10 using the F100 list:

 1  Apple:      2447
 2  Microsoft:  1517
 3  Amazon:     1457
 4  Intel:       554
 5  Tesla:       404
 6  Netflix:     322
 7  IBM:         309
 8  Adobe:       180
 9  Oracle:      167
10  AT&T:        143

Add to those:

$ egrep -wc '(Google|Alphabet|You[Tt]ube|Android)' hn-titles
7163
$ egrep -wc '(Apple|iPhone|iPad|iPod|Mac[Bb]ook)' hn-titles
3656
$ egrep -wc '(Facebook|Instagram)' hn-titles
2512

Note I didn't even try "Meta", though let's take a quick look ... yeah, that's a mess.

Up until 2021-10-28, "Meta" is a concept, with 33 entries. That was the day Facebook announced its name change. 82 total matches (so low overall compared to the earlier numbers above), 49 post-announcement, of which two are not related to Facebook a/k/a Meta. Several of the titles mention both FB & Meta ... looks like that's four of 'em.

So "Meta" boosts FB's count by 45.

There are another 296 mentions of Steve Jobs and Tim Cook which don't also include "Apple".

And "Alphabet" has 54 matches, six of which don't relate to the company.

Of the MFAANG companies:

Google:     5796
Apple:      2447
Facebook:   2371
Microsoft:  1517
Amazon:     1457
Netflix:     322

(Based on grep.)

dredmorbius,

OK, current stats are 63.5% of posts classified, with 29.8% of sites classified, a/k/a the old 65/30 rule. The mean posts per unclassified site is 1.765, so my returns for further classification will be ... small.

Full breakdown:

sites  posts/site
    4  20
   14  19
   13  18
   23  17
   32  16
   37  15
   48  14
   55  13
   96  12
  120  11
  122  10
  168   9
  247   8
  315   7
  396   6
  622   5
 1052   4
 2016   3
 5103   2
26494   1

A ... large number of sites w/ <= 20 posts are actually classified, mostly by regexp rules & patterns. Oh, hey, I can dump that breakdown as well:

sites  posts/site
   35  20
   27  19
   47  18
   31  17
   33  16
   41  15
   51  14
   45  13
   42  12
   29  11
   46  10
   46   9
   47   8
   91   7
  138   6
  178   5
  269   4
  524   3
 1624   2
11472   1

I could pick just under 4% more posts by classifying another 564 sites but ... that sounds a bit too much like work at the moment. Compromises and trade-offs.

Now to try to turn this into an analysis over time.

I've been working with a summary of activity by site, so running analysis has been pretty quick (52k records and gawk running over that).

To do full date analysis requires reading nearly 180k records, and ... hopefully not having to loop through 52k sites for each of those. Gawk's runtimes start to asplode when running tens of millions of loop iterations, especially if regexes are involved.

TheMemeticist, to random

A small fun exercise for using : Most common times to see a / according to 22k data points. 🛰️ 🛸

charlesdebarros, to programming

This week, Week 9 of our Data Citizen Bootcamp at Cambridge Spark, was a gentle intro to Python for data analysis, including an intro to the Pandas library.

I found a very nice article called "A Beginner’s Guide to Data Analysis in Python" written by Natassha Selvaraj. It is a very well-written article with plenty of examples to follow along with. Most definitely worth a read.
Enjoy! 😉

https://towardsdatascience.com/a-beginners-guide-to-data-analysis-in-python-188706df5447

Sherbet_dibdab, to random

No-one in particular.

Mainly stuff about British comics (ok it's ), from the early to mid 80s UK microcomputer boom, popular science / rationalism, will share the UK popular cultural archive but not a poem remembering the Milk In A Bottle, always, and for work and sometimes for fun. Formerly @mstdn.social, migrated to @mastodonapp.uk 26/04/23 -- I find the Local timeline is a bit more UK-relatable

pascalschulthess, to cycling
@pascalschulthess@mstdn.science avatar

I switched instances in the hope of having a more relevant local feed at mstdn.science.

So, let’s do this thing again.

I’m Pascal, father to 4 kids, and enthusiast. I live in , the cycling capital of the world and work as a at the institute.

A couple of hashtags describing my work:

gerald_leppert, to statistics German

To everyone who is in , , :

There is a new group about on Mastodon: @rstats
Follow the group to get all group posts.

Share your post with the group by tagging the group name.

Tip: The "Lists" feature is handy to see group posts. Create a list for @rstats and "+pin" the list.

Feel free to boost this toot.

#R
