A question about what states were most-frequently represented on the HN homepage had me do some quick querying via Hacker News's Algolia search ... which is NOT limited to the front page. Those results were ... surprising (Maine and Iowa outstrip the more probable results of California and, say, New York). Results are further confounded by other factors.
HN provides an interface to historical front-page stories (https://news.ycombinator.com/front), and that can be crawled by supplying a date specification for each day.
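For instance, a quick sketch of generating one URL per day (assuming the archive takes a `day=YYYY-MM-DD` query parameter, which is what the page appears to use):

```python
from datetime import date, timedelta

def front_page_urls(start, end):
    """Yield one historical front-page URL per day, inclusive of both ends."""
    d = start
    while d <= end:
        yield f"https://news.ycombinator.com/front?day={d.isoformat()}"
        d += timedelta(days=1)

# First three days of the archive (HN's front-page history starts 2007-02-20).
urls = list(front_page_urls(date(2007, 2, 20), date(2007, 2, 22)))
```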
So I'm crawling that and compiling a local archive. Rate-limiting and other factors mean that's only about halfway complete, and a full pull will take another day or so.
But I'll be able to look at story titles, sites, submitters, time-based patterns (day of week, day of month, month of year, yearly variations), and other patterns. There's also looking at mean points and comments by various dimensions.
One surprise: as of January 2015, among the most consistently high-voted sites is The Guardian. I'd thought HN leaned consistently less liberal.
The full archive will probably be < 1 GB (raw HTML), currently 123 MB on disk.
Contents are the 30 top-voted stories for each day since 20 February 2007.
If anyone has suggestions for other questions to ask of this, fire away.
NY is highly overrepresented (NY Times, NY Post, NY City), likewise Washington (Post, Times, DC). Adding in "Silicon Valley" and a few other toponyms boosts California's score markedly. I've also got some city-based analytics.
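A minimal sketch of the kind of toponym matching described above (the phrase-to-state mapping and the titles are invented for illustration; a real pass would use a much fuller list):

```python
import re
from collections import Counter

# Hypothetical toponym -> state mapping.
TOPONYMS = {
    "NY Times": "New York",
    "NY Post": "New York",
    "Washington Post": "Washington",
    "Silicon Valley": "California",
    "San Francisco": "California",
}

def state_counts(titles):
    """Count how many titles mention a phrase associated with each state."""
    counts = Counter()
    for title in titles:
        for phrase, state in TOPONYMS.items():
            if re.search(re.escape(phrase), title, re.IGNORECASE):
                counts[state] += 1
    return counts

counts = state_counts([
    "NY Times on startups",
    "Silicon Valley hiring trends",
    "San Francisco rents climb",
])
```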
@jessie is a lover of #languages and helps run #CommonVoice, @mozilla's open #voice #data set, which now supports over 100 languages. She also teaches #WebDev and loves #hiking. She's awesome; you should follow her 🇬🇧
That's all for now! Please do share your own lists so we can create deeper connections and a tightly-connected community here.
I'm reminded here of @maryrobinette's short story - "Red Rockets" - "She built something better than fireworks. She built community."
Question: someone I know is doing a data science project for university and needs to scrape some tabular data from a website to analyze for an assignment.
Is there anything open source or GNOME-related that is publicly listed as tabular data somewhere that could be interesting for them to analyze? Ideally something with at least 100 data points and multiple columns per data point, if that makes sense.
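If they go the scraping route, pandas can parse HTML tables directly via `read_html`. A sketch (the inline HTML stands in for a real page such as a release-history table; note `read_html` needs an HTML parser like lxml or BeautifulSoup installed):

```python
import io

import pandas as pd

# A literal HTML string stands in for a fetched page; pd.read_html parses
# every <table> it finds and returns a list of DataFrames.
html = """
<table>
  <tr><th>release</th><th>year</th></tr>
  <tr><td>GNOME 40</td><td>2021</td></tr>
  <tr><td>GNOME 41</td><td>2021</td></tr>
</table>
"""
tables = pd.read_html(io.StringIO(html))
df = tables[0]
```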
Two #Coding questions, from restarting #Python after doing mostly Matlab for a while.
1. I really liked Tables in Matlab - what’s the best (fastest, simplest) equivalent in Python nowadays? #Pandas?
2. With Matlab you can use ‘webread’ to load the contents of a public Google spreadsheet as a table in one line - very cool! What’s the simplest equivalent in Python?
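For what it's worth, pandas is the usual answer to both questions; a sketch (the Google Sheets URL pattern is shown in a comment with a placeholder ID, since `read_csv` accepts URLs directly):

```python
import io

import pandas as pd

# For a sheet published to the web as CSV, this is a one-liner:
#   df = pd.read_csv("https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv")
# (<SHEET_ID> is a placeholder.) The same call works on any CSV source:
csv_text = "name,score\nAda,91\nGrace,95\n"
df = pd.read_csv(io.StringIO(csv_text))
```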
Morning, work today but that’s all good. Back to my #DataAnalysis! I slept better but had some ker-AAAA-zee dreams. How did you sleep? What plans do you have?
Feeling stuck with Excel for data analysis? You're not alone! Excel is fantastic, but for truly powerful insights and visualizations, it can fall short.
Here's what you'll gain:
* 🧐 Advanced data manipulation & cleaning
* 💻 Powerful statistical analysis & modeling
* 📉 Eye-catching data visualizations
* 🌟 Seamless integration back to Excel
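As a taste of the first two bullets, a minimal pandas sketch (toy data invented for illustration; the Excel round-trip is shown as a comment because it needs openpyxl installed):

```python
import io

import pandas as pd

# Toy sales data standing in for an Excel export.
raw = io.StringIO("region,units,price\nEast,10,2.5\nWest,4,3.0\nEast,6,2.5\n")
df = pd.read_csv(raw)

# Manipulation: derive revenue, then summarise per region.
df["revenue"] = df["units"] * df["price"]
summary = df.groupby("region")["revenue"].sum()

# Round-trip back to Excel is one call (requires openpyxl):
# summary.to_excel("summary.xlsx")
```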
5 Latest Tools You Should Be Using With Python for Data Science.
🗂️ The article provides insightful details on tools like ConnectorX, DuckDB, Optimus, Polars, and Snakemake, which can enhance your data wrangling, querying, manipulation, and workflow automation capabilities.
Looking for a recommendation (website, Substack, any other material...) where I can improve my SQL knowledge. I'm looking for something where I can read (theory) and practice (exercises). I really enjoy learning Python on Substack, but so far I haven't found something similar for SQL.
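Not a resource recommendation, but for the practice half: Python's built-in sqlite3 module gives a zero-setup SQL sandbox. A sketch with made-up data:

```python
import sqlite3

# An in-memory database: nothing to install, nothing to clean up afterwards.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 10.0), ("ana", 5.0), ("bo", 7.5)],
)

# Practice query: total spend per customer, biggest spender first.
rows = con.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY total DESC"
).fetchall()
```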
🔬📊 Mastering Data Grouping with R's ave() Function 📊🔬
Are you tired of manually calculating statistics for different groups in your data analysis projects? Look no further! R's ave() function is here to revolutionize your data grouping experience. 🚀
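For anyone following along from Python rather than R: the closest pandas analogue to ave() (a per-group statistic broadcast back onto every row) is groupby().transform(). A small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Like R's ave(value, group): every row receives its own group's mean.
df["group_mean"] = df.groupby("group")["value"].transform("mean")
```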
Morning! I’ve been into town to get fresh bread, had breakfast and I’m ready for work. All four of us are WFH today, it’s going to be hard to keep out of each other’s way. More #DataAnalysis for me, I think I’ll work on category of offence and how that correlates with participants’ scoring of satisfaction with different areas of their life #criminology. Oh wait, Miss Cinnamon has just arrived and says that we must have cuddle first :blobcatreach: Have a great day everyone!
We dive deep into simplifying outlier detection in R using #easystats to follow good practices and make your data analysis more robust and replicable. Check it out! #Rstats #DataAnalysis @rstats
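The linked post covers #easystats; as a language-agnostic illustration only (the classic 1.5 × IQR rule, one common convention, not the easystats method itself), in plain standard-library Python:

```python
import statistics

def iqr_outliers(values):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# One obvious outlier planted in otherwise tight data.
outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 95])
```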
📊🔬 Exciting news! Learn bootstrap resampling in R with lapply, rep, and sample functions. Estimate uncertainty, analyze data variability, and unlock insights. #DataAnalysis #R #RStats #OpenSource #RProgramming 🎉💻
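The same bootstrap idea translates to a few lines of standard-library Python (sample data invented; 2,000 replicates chosen arbitrarily):

```python
import random
import statistics

random.seed(42)
data = [4.1, 5.3, 3.8, 6.0, 5.5, 4.9, 5.1, 4.4]

# Bootstrap: resample with replacement, recompute the statistic each time.
boot_means = [
    statistics.mean(random.choices(data, k=len(data)))
    for _ in range(2000)
]

# The spread of the bootstrap means approximates the standard error of the mean.
se_estimate = statistics.stdev(boot_means)
```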
file_path <- "data.csv"

if (file.exists(file_path)) {
  print("The file exists!")
} else {
  print("The file does not exist.")
}
In this example, we check if the file named "data.csv" exists. Depending on the outcome, it will print either "The file exists!" or "The file does not exist."
Learn how to set a data frame column as the index for faster data access and streamlined operations.
In R, use setkey() or setindex() from #datatable (after converting with setDT()), or column_to_rownames() from #tibble, to seamlessly set your desired column as the index. Try it out with your datasets and experience the boost in productivity!
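For readers on the Python side, the pandas equivalent is set_index(), which promotes a column to the index for fast label-based lookups (example data invented):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "value": [10, 20, 30],
})

# Promote the "id" column to the index, then look rows up by label.
df = df.set_index("id")
value = df.loc["a2", "value"]
```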
Imagine you have a bunch of data points and you want to know how many belong to different categories. This is where grouped counting comes in. We've got three fantastic methods for you to explore, each with its own flair: aggregate(), dplyr, and data.table.
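For comparison with those three R routes, here is the same grouped count in pandas (toy data; not one of the methods the post itself covers):

```python
import pandas as pd

df = pd.DataFrame({"category": ["x", "y", "x", "x", "z", "y"]})

# Two equivalent spellings of a grouped count in pandas:
counts = df["category"].value_counts()   # Series ordered by count
counts2 = df.groupby("category").size()  # Series ordered by key
```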
I'll give you a quick rundown on creating horizontal boxplots in R using both base R and ggplot2. We'll work with the "palmerpenguins" dataset to keep things interesting!
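A rough matplotlib analogue for Python folks (made-up flipper-length-style samples stand in for palmerpenguins; the Agg backend keeps it headless):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import matplotlib.pyplot as plt

# Invented samples, two groups standing in for penguin species.
samples = [
    [181, 186, 190, 193, 195],  # "Adelie"
    [195, 199, 210, 217, 220],  # "Gentoo"
]

fig, ax = plt.subplots()
bp = ax.boxplot(samples, vert=False)  # vert=False makes the boxes horizontal
ax.set_yticklabels(["Adelie", "Gentoo"])
ax.set_xlabel("flipper length (mm)")
fig.savefig("boxplot.png")
```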