if you're ready to level up your data manipulation skills, give intersect() a spin and let your insights shine! 🌈 Embrace the world of R and keep growing as a data wizard! 🧙‍♂️ Happy coding! 🎉
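If you want to try it right away, here's a minimal sketch (the customer vectors are invented for illustration):

```r
# intersect() returns the elements two vectors have in common
q1_customers <- c("Ana", "Ben", "Caro", "Dev")
q2_customers <- c("Ben", "Dev", "Elif")

# Which customers appear in both quarters?
returning <- intersect(q1_customers, q2_customers)
print(returning)  # "Ben" "Dev"
```

Note that intersect() also drops duplicates, so the result behaves like a set.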
Looking for a recommendation (website, Substack, or any other material...) where I can improve my SQL knowledge. I'm looking for something I can read (theory) and practice with (exercises). I really enjoy learning Python on Substack, but so far I haven't found anything similar for SQL.
📢 Master the Art of List Subsetting in R! 🚀 Or: Lists...again
📝 Lists in R are versatile data structures, capable of holding various elements like vectors, matrices, and even other lists. But what makes them truly magical is the ability to extract specific data efficiently through subsetting. 🎯
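The difference between [, [[, and $ is the heart of list subsetting. A minimal sketch, with an invented example list:

```r
# A list can hold elements of different types side by side
info <- list(name = "Ada", scores = c(90, 85, 77), active = TRUE)

info[["scores"]]  # [[ extracts the element itself: 90 85 77
info["scores"]    # [ returns a sub-list of length 1
info$name         # $ is shorthand for [[ with a literal name: "Ada"
```

A handy mnemonic: [ gives you a smaller list; [[ gives you what's inside.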
Let's unlock the true potential of your data together! Read the blog post, try the cov() function, and let's embark on an exciting journey of discovery. 💡
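A quick, self-contained way to try cov() (the vectors here are made up for illustration):

```r
# cov() estimates the covariance between two numeric vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)  # y = 2x, so the two move together perfectly

cov(x, y)  # 5
```

A positive value means the vectors rise and fall together; a negative value means they move in opposite directions.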
file_path <- "data.csv"

if (file.exists(file_path)) {
  print("The file exists!")
} else {
  print("The file does not exist.")
}
In this example, we check if the file named "data.csv" exists. Depending on the outcome, it will print either "The file exists!" or "The file does not exist."
Morning, work today but that’s all good. Back to my #DataAnalysis! I slept better but had some ker-AAAA-zee dreams. How did you sleep? What plans do you have?
🔬📊 Mastering Data Grouping with R's ave() Function 📊🔬
Are you tired of manually calculating statistics for different groups in your data analysis projects? Look no further! R's ave() function is here to revolutionize your data grouping experience. 🚀
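A minimal sketch of what ave() does, using invented sales data:

```r
# ave() returns a vector the same length as its input, with the
# group statistic repeated for every member of that group
sales  <- c(10, 20, 30, 40)
region <- c("N", "N", "S", "S")

ave(sales, region)             # group means: 15 15 35 35
ave(sales, region, FUN = sum)  # group sums:  30 30 70 70
```

Because the result lines up with the original rows, it's perfect for adding a "group average" column without any merging.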
📊🔬 Exciting news! Learn bootstrap resampling in R with lapply, rep, and sample functions. Estimate uncertainty, analyze data variability, and unlock insights. #DataAnalysis #R #RStats #OpenSource #RProgramming 🎉💻
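A minimal sketch of the idea, with invented data (the post names lapply, rep, and sample; the exact pipeline below is my own):

```r
set.seed(42)                 # reproducibility
x <- rnorm(100, mean = 5)    # some sample data

# Draw 1000 bootstrap resamples (sampling with replacement)
# and compute the mean of each one
boot_means <- unlist(lapply(
  rep(list(x), 1000),
  function(v) mean(sample(v, length(v), replace = TRUE))
))

# The spread of the resampled means estimates the uncertainty of mean(x)
sd(boot_means)                        # bootstrap standard error
quantile(boot_means, c(0.025, 0.975)) # 95% percentile interval
```

The appeal of the bootstrap is that you get an uncertainty estimate without assuming any particular distribution for the data.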
Package XGI was submitted to @pyOpenSci and is awaiting review (link to the submission in the first comment)! If you don't have time but know someone who might be interested, please share this post!
A question about what states were most-frequently represented on the HN homepage had me do some quick querying via Hacker News's Algolia search ... which is NOT limited to the front page. Those results were ... surprising (Maine and Iowa outstrip the more probable results of California and, say, New York). Results are further confounded by other factors.
HN provides an interface to historical front-page stories (https://news.ycombinator.com/front), and that can be crawled by providing a list of corresponding date specifications, e.g.:
So I'm crawling that and compiling a local archive. Rate-limiting and other factors mean that's only about halfway complete, and a full pull will take another day or so.
But I'll be able to look at story titles, sites, submitters, time-based patterns (day of week, day of month, month of year, yearly variations), and other patterns. There's also looking at mean points and comments by various dimensions.
Among surprises are that as of January 2015, among the highest consistently-voted sites is The Guardian. I'd thought HN leaned consistently less liberal.
The full archive will probably be < 1 GB (raw HTML), currently 123 MB on disk.
Contents are the 30 top-voted stories for each day since 20 February 2007.
If anyone has suggestions for other questions to ask of this, fire away.
NY is highly overrepresented (NY Times, NY Post, NY City), likewise Washington (Post, Times, DC). Adding in "Silicon Valley" and a few other toponyms boosts California's score markedly. I've also got some city-based analytics.
How Much Colorado Love? Or: 16 Years of Hacker News Front-Page Analytics
I've pulled 5,939 front pages from Hacker News, dating from 20 February 2007 to 25 May 2023, initially to answer the question "how often is Colorado mentioned on the front page?" (38 times, 5th most frequent US state). This also affords the opportunity to ask and answer other questions.
I'm wanting to test some reporting / queries / logic based on a sampling of data.
Since my file-naming convention follows ISO-8601 (YYYY-MM-DD), I can just lexically sort those.
And to grab a random year's worth (365 days) of reports from across the set:
ls rendered-crawl/* | sort -R | head -365 | sort
(I've rendered the pages, using w3m's -dump feature, to speed processing).
The full dataset is large enough and my awk code sloppy enough (several large sequential lists used in pattern-matching) that a full parse takes about 10 minutes, so the sampling shown here speeds development better than 10x while still providing representative data across time.
Note that some idiosyncrasies affect this, e.g., "New York City" appears rarely, whilst "New York" may refer to the city, state, any of several newspapers, universities, etc. "New York" appears 315 times in titles (mostly as "New York Times").
I've independently verified that, for example, "Ho Chi Minh City" doesn't appear, though "Ho Chi Minh" alone does:
So ... I'm playing with a report showing how often F500 companies are mentioned in HN submission titles.
As I've noted, most of my scripting is in awk (gawk), and it's ... usually pretty good.
I'm toying with a couple of loops where I read all 178k titles, and all 500 company names, into arrays, then check to see if the one appears in the other.
The first iteration of that was based on the index() function, which is a simple string match. Problem is that there are substring matches, for example "Lear" (the company) will match on "Learn", "Learning", etc., and so is strongly overrepresented.
So I swapped in match(), which is a regular-expression match, and added \W as word boundaries.
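The same pitfall is easy to reproduce in R (the real matching runs in gawk; these titles are made up for illustration):

```r
titles <- c("Lear reports earnings", "Learning Rust in 2023", "Learn SQL fast")

# Plain substring match: "Lear" hits all three titles
sum(grepl("Lear", titles, fixed = TRUE))  # 3

# Word-boundary regex: only the company name matches
sum(grepl("\\bLear\\b", titles))          # 1
```

The fixed-string match is fast but over-counts; the word-boundary regex is accurate but, as the timings above show, the regex engine can cost you dearly at scale.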
The index-based search ran in about 20 seconds. That's a brief wait, but doable.
The match (regex) based search ... just finished as I'm writing this. 13 minutes 40 seconds.
Regexes are useful, but can be awfully slow.
Which means that my first go at this (still using gawk, but having it generate grep searches and print only the match count) is much faster whilst being accurate. That runs in just under a minute here. I'd looked for another solution as awk is "dumb" re the actual output: it doesn't read or capture the actual counts, so I'll either have to tweak that program or feed its output to an additional parser. Neither of which is a big deal, mind.
Oh, and Apple seems to be the most-mentioned company, though the F500 list omits Google (or YouTube, or Android), listing only Alphabet, which probably results in a severe undercount.
Note I didn't even try "Meta", though let's take a quick look ... yeah, that's a mess.
Up until 2021-10-28, "Meta" is a concept, with 33 entries. That was the day Facebook announced its name change. 82 total matches (so low overall compared to the earlier numbers above), 49 post-announcement, of which two are not related to Facebook a/k/a Meta. Several of the titles mention both FB & Meta ... looks like that's four of 'em.
So "Meta" boosts FB's count by 45.
There are another 296 mentions of Steve Jobs and Tim Cook which don't also include "Apple".
And "Alphabet" has 54 matches, six of which don't relate to the company.
OK, current stats are 63.5% of posts classified, with 29.8% of sites classified, a/k/a the old 65/30 rule. The mean posts per unclassified site is 1.765, so my returns for further classification will be ... small.
I could pick just under 4% more posts by classifying another 564 sites but ... that sounds a bit too much like work at the moment. Compromises and trade-offs.
Now to try to turn this into an analysis over time.
I've been working with a summary of activity by site, so running analysis has been pretty quick (52k records and gawk running over that).
To do full date analysis requires reading nearly 180k records, and ... hopefully not having to loop through 52k sites for each of those. Gawk's runtimes start to asplode when running tens of millions of loop iterations, especially if regexes are involved.
This week, Week 9 of our Data Citizen Bootcamp at Cambridge Spark, was a gentle intro to Python for Data Analysis, including an intro to the Pandas library.
I found a very nice article called "A Beginner's Guide to Data Analysis in Python" written by Natassha Selvaraj. It is a very well-written article with plenty of examples to work through. Most definitely worth a read.
Enjoy! 😉
Mainly stuff about British comics (ok it's #2000AD), #retrogaming from the early to mid 80s UK microcomputer boom, popular science / rationalism, will share the UK popular cultural archive but not a poem remembering the Milk In A Bottle, #DoctorWho always, #DataAnalysis and #DataScience for work and sometimes for fun. Formerly @mstdn.social, migrated to @mastodonapp.uk 26/04/23 -- I find the Local timeline is a bit more UK-relatable #Introduction