if you're ready to level up your data manipulation skills, give intersect() a spin and let your insights shine! 🌈 Embrace the world of R and keep growing as a data wizard! 🧙‍♂️ Happy coding! 🎉
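If you want to try it right away, here's a minimal sketch (the customer vectors are invented for illustration):

```r
# intersect() returns the elements two vectors have in common
q1_customers <- c("Ana", "Ben", "Caro", "Dev")
q2_customers <- c("Ben", "Dev", "Elif")

# Which customers appear in both quarters?
returning <- intersect(q1_customers, q2_customers)
print(returning)  # "Ben" "Dev"
```

Note that intersect() also drops duplicates, so the result behaves like a set.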
Looking for a recommendation (website, Substack, or any other material...) where I can improve my SQL knowledge. I'm looking for something I can read (theory) and practice with (exercises). I really enjoy learning Python on Substack, but so far I haven't found anything similar for SQL.
📢 Master the Art of List Subsetting in R! 🚀 Or: Lists...again
📝 Lists in R are versatile data structures, capable of holding various elements like vectors, matrices, and even other lists. But what makes them truly magical is the ability to extract specific data efficiently through subsetting. 🎯
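The difference between [, [[, and $ is the heart of list subsetting. A minimal sketch, with an invented example list:

```r
# A list can hold elements of different types side by side
info <- list(name = "Ada", scores = c(90, 85, 77), active = TRUE)

info[["scores"]]  # [[ extracts the element itself: 90 85 77
info["scores"]    # [ returns a sub-list of length 1
info$name         # $ is shorthand for [[ with a literal name: "Ada"
```

A handy mnemonic: [ gives you a smaller list; [[ gives you what's inside.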
Let's unlock the true potential of your data together! Read the blog post, try the cov() function, and let's embark on an exciting journey of discovery. 💡
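A quick, self-contained way to try cov() (the vectors here are made up for illustration):

```r
# cov() estimates the covariance between two numeric vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)  # y = 2x, so the two move together perfectly

cov(x, y)  # 5
```

A positive value means the vectors rise and fall together; a negative value means they move in opposite directions.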
file_path <- "data.csv"

if (file.exists(file_path)) {
  print("The file exists!")
} else {
  print("The file does not exist.")
}
In this example, we check if the file named "data.csv" exists. Depending on the outcome, it will print either "The file exists!" or "The file does not exist."
Morning, work today but that’s all good. Back to my #DataAnalysis! I slept better but had some ker-AAAA-zee dreams. How did you sleep? What plans do you have?
🔬📊 Mastering Data Grouping with R's ave() Function 📊🔬
Are you tired of manually calculating statistics for different groups in your data analysis projects? Look no further! R's ave() function is here to revolutionize your data grouping experience. 🚀
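A minimal sketch of what ave() does, using invented sales data:

```r
# ave() returns a vector the same length as its input, with the
# group statistic repeated for every member of that group
sales  <- c(10, 20, 30, 40)
region <- c("N", "N", "S", "S")

ave(sales, region)             # group means: 15 15 35 35
ave(sales, region, FUN = sum)  # group sums:  30 30 70 70
```

Because the result lines up with the original rows, it's perfect for adding a "group average" column without any merging.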
📊🔬 Exciting news! Learn bootstrap resampling in R with lapply, rep, and sample functions. Estimate uncertainty, analyze data variability, and unlock insights. #DataAnalysis #R #RStats #OpenSource #RProgramming 🎉💻
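A minimal sketch of the idea, with invented data (the post names lapply, rep, and sample; the exact pipeline below is my own):

```r
set.seed(42)                 # reproducibility
x <- rnorm(100, mean = 5)    # some sample data

# Draw 1000 bootstrap resamples (sampling with replacement)
# and compute the mean of each one
boot_means <- unlist(lapply(
  rep(list(x), 1000),
  function(v) mean(sample(v, length(v), replace = TRUE))
))

# The spread of the resampled means estimates the uncertainty of mean(x)
sd(boot_means)                        # bootstrap standard error
quantile(boot_means, c(0.025, 0.975)) # 95% percentile interval
```

The appeal of the bootstrap is that you get an uncertainty estimate without assuming any particular distribution for the data.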
Package XGI was submitted to @pyOpenSci and is awaiting review (link to the submission in the first comment)! If you don't have time but know someone who might be interested, please share this post!
A question about what states were most-frequently represented on the HN homepage had me do some quick querying via Hacker News's Algolia search ... which is NOT limited to the front page. Those results were ... surprising (Maine and Iowa outstrip the more probable results of California and, say, New York). Results are further confounded by other factors.
HN provides an interface to historical front-page stories (https://news.ycombinator.com/front), and that can be crawled by providing a list of corresponding date specifications, e.g.:
So I'm crawling that and compiling a local archive. Rate-limiting and other factors mean that's only about halfway complete, and a full pull will take another day or so.
But I'll be able to look at story titles, sites, submitters, time-based patterns (day of week, day of month, month of year, yearly variations), and other patterns. There's also looking at mean points and comments by various dimensions.
Among surprises are that as of January 2015, among the highest consistently-voted sites is The Guardian. I'd thought HN leaned consistently less liberal.
The full archive will probably be < 1 GB (raw HTML), currently 123 MB on disk.
Contents are the 30 top-voted stories for each day since 20 February 2007.
If anyone has suggestions for other questions to ask of this, fire away.
NY is highly overrepresented (NY Times, NY Post, NY City), likewise Washington (Post, Times, DC). Adding in "Silicon Valley" and a few other toponyms boosts California's score markedly. I've also got some city-based analytics.
How Much Colorado Love? Or: 16 Years of Hacker News Front-Page Analytics
I've pulled 5,939 front pages from Hacker News, dating from 20 February 2007 to 25 May 2023, initially to answer the question "how often is Colorado mentioned on the front page?" (38 times, 5th most frequent US state). This also affords the opportunity to ask and answer other questions.
I'm wanting to test some reporting / queries / logic based on a sampling of data.
Since my file-naming convention follows ISO-8601 (YYYY-MM-DD), I can just lexically sort those.
And to grab a random year's worth (365 days) of reports from across the set:
ls rendered-crawl/* | sort -R | head -365 | sort
(I've rendered the pages, using w3m's -dump feature, to speed processing).
The full dataset is large enough and my awk code sloppy enough (several large sequential lists used in pattern-matching) that a full parse takes about 10 minutes, so the sampling shown here speeds development better than 10x while still providing representative data across time.
Note that some idiosyncrasies affect this, e.g., "New York City" appears rarely, whilst "New York" may refer to the city, state, any of several newspapers, universities, etc. "New York" appears 315 times in titles (mostly as "New York Times").
I've independently verified that, for example, "Ho Chi Minh City" doesn't appear, though "Ho Chi Minh" alone does:
So ... I'm playing with a report showing how often F500 companies are mentioned in HN submission titles.
As I've noted, most of my scripting is in awk (gawk), and it's ... usually pretty good.
I'm toying with a couple of loops where I read all 178k titles, and all 500 company names, into arrays, then check to see if the one appears in the other.
The first iteration of that was based on the index() function, which is a simple string match. Problem is that there are substring matches, for example "Lear" (the company) will match on "Learn", "Learning", etc., and so is strongly overrepresented.
So I swapped in match(), which is a regular-expression match, and added \W as word boundaries.
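The same pitfall is easy to reproduce in R (the real matching runs in gawk; these titles are made up for illustration):

```r
titles <- c("Lear reports earnings", "Learning Rust in 2023", "Learn SQL fast")

# Plain substring match: "Lear" hits all three titles
sum(grepl("Lear", titles, fixed = TRUE))  # 3

# Word-boundary regex: only the company name matches
sum(grepl("\\bLear\\b", titles))          # 1
```

The fixed-string match is fast but over-counts; the word-boundary regex is accurate but, as the timings above show, the regex engine can cost you dearly at scale.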
The index-based search ran in about 20 seconds. That's a brief wait, but doable.
The match (regex) based search ... just finished as I'm writing this. 13 minutes 40 seconds.
Regexes are useful, but can be awfully slow.
Which means that my first go at this (still using gawk, but having it generate grep searches and print only the match count) is much faster whilst being accurate. That runs in just under a minute here. I'd looked for another solution as awk is "dumb" re the actual output: it doesn't read or capture the actual counts, so I'll either have to tweak that program or feed its output to an additional parser. Neither of which is a big deal, mind.
Oh, and Apple seems to be the most-mentioned company, though the F500 list omits Google (or YouTube, or Android), listing only Alphabet, which probably results in a severe undercount.
Note I didn't even try "Meta", though let's take a quick look ... yeah, that's a mess.
Up until 2021-10-28, "Meta" is a concept, with 33 entries. That was the day Facebook announced its name change. 82 total matches (so low overall compared to the earlier numbers above), 49 post-announcement, of which two are not related to Facebook a/k/a Meta. Several of the titles mention both FB & Meta ... looks like that's four of 'em.
So "Meta" boosts FB's count by 45.
There are another 296 mentions of Steve Jobs and Tim Cook which don't also include "Apple".
And "Alphabet" has 54 matches, six of which don't relate to the company.
OK, current stats are 63.5% of posts classified, with 29.8% of sites classified, a/k/a the old 65/30 rule. The mean posts per unclassified site is 1.765, so my returns for further classification will be ... small.
I could pick just under 4% more posts by classifying another 564 sites but ... that sounds a bit too much like work at the moment. Compromises and trade-offs.
Now to try to turn this into an analysis over time.
I've been working with a summary of activity by site, so running analysis has been pretty quick (52k records and gawk running over that).
To do full date analysis requires reading nearly 180k records, and ... hopefully not having to loop through 52k sites for each of those. Gawk's runtimes start to asplode when running tens of millions of loop iterations, especially if regexes are involved.
This week, Week 9 of our Data Citizen Bootcamp at Cambridge Spark, was a gentle intro to Python for Data Analysis, including an intro to the Pandas library.
I found a very nice article called "A Beginner's Guide to Data Analysis in Python" written by Natassha Selvaraj. It is a very well-written article with plenty of examples to work through. Most definitely worth a read.
Enjoy! 😉
Mainly stuff about British comics (ok it's #2000AD), #retrogaming from the early to mid 80s UK microcomputer boom, popular science / rationalism, will share the UK popular cultural archive but not a poem remembering the Milk In A Bottle, #DoctorWho always, #DataAnalysis and #DataScience for work and sometimes for fun. Formerly @mstdn.social, migrated to @mastodonapp.uk 26/04/23 -- I find the Local timeline is a bit more UK-relatable #Introduction