
dlakelan

@dlakelan@mastodon.sdf.org

Applied Mathematician, Julia programmer, father of two amazing boys, official coonhound mix mutt-walker.

PhD in Civil Engineering. Debian Linux user since ca. 1994.

Bayesian data analysis iconoclast


dlakelan, to random

I give you civilian labor force with a disability (16yo or older)

https://fred.stlouisfed.org/series/LNU01074597

dlakelan, to random

@GhostOnTheHalfShell @economics@a.gup.pe

Here's a post I made this morning at Gelman's blog explaining why much of what economists publish in the media is gaslighting: it fundamentally fails to address the question of interest because it never forms the appropriate dimensionless ratio: https://statmodeling.stat.columbia.edu/2024/05/17/how-to-think-about-the-effect-of-the-economy-on-political-attitudes-and-behavior/#comment-2372708

ChrisMayLA6, to random

How do you know you've been gaslighted?

when a Bank of England director tells you it's 'possible' interest rates will be reduced over the summer....

Of course it's possible they'll be reduced, but my guess is they'll just want to keep them high a little longer... just to make sure those pesky workers & their demands for a return to past standards of living have been firmly dampened down.

Perhaps, by some strange coincidence, they'll fall the month before an Autumn election?

dlakelan,

@GhostOnTheHalfShell @ChrisMayLA6

Roughly the most important issues were

  1. time and money
  2. difficulty of fitting 30,000 parameter models
  3. lack of technology for diagnosis of very large Bayesian models
  4. ambitiousness of the project

Data management was actually not too bad. MariaDB has a way to query CSV files as if they were tables, so I was able to use that for slicing and dicing.

dlakelan,

@GhostOnTheHalfShell @ChrisMayLA6

If I remember correctly I was working at the level of "public use microdata areas" (PUMAs). These are geographic regions of roughly 100k people, so there are about 350M/100k = 3500 of them. Each one involved estimating a nonlinear function of household composition (number of people and their ages), so roughly 5-10 parameters for that function, and then tying them together via regional relationships... so that's where the roughly 30,000 parameters come from.
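For concreteness, here's a toy Turing.jl sketch of that partial-pooling structure. The names and scales are hypothetical and this is not the actual model: the real one had the 5-10 nonlinear household-composition parameters per PUMA rather than a single cost parameter.

```julia
using Turing

# Toy partial-pooling sketch (hypothetical names and scales, not the real model):
# one cost parameter per PUMA, tied together through shared hyperparameters.
@model function puma_costs(cost, puma_id, n_puma)
    μ ~ Normal(30_000, 10_000)                    # overall cost level
    σ ~ truncated(Normal(0, 10_000); lower = 0)   # spread across PUMAs
    θ ~ filldist(Normal(μ, σ), n_puma)            # one parameter per PUMA (~3500 in the real case)
    σ_obs ~ truncated(Normal(0, 10_000); lower = 0)
    for i in eachindex(cost)
        cost[i] ~ Normal(θ[puma_id[i]], σ_obs)    # household-level observation
    end
end

# chain = sample(puma_costs(cost, puma_id, 3500), NUTS(), 1_000)
```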

dlakelan,

@GhostOnTheHalfShell @ChrisMayLA6

Diagnosis means roughly figuring out why the Markov chain Monte Carlo either didn't converge, or, if it did converge, whether the results "made sense", so that the parameter estimates were reliable.
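For a rough idea of what those checks look like in a Turing.jl / MCMCChains.jl workflow (these are standard functions in those packages, not necessarily the exact tooling used at the time):

```julia
using Turing, MCMCChains

# `chain` stands for the result of something like sample(model, NUTS(), 1_000).
summarystats(chain)   # effective sample size and R-hat per parameter; R-hat near 1.0 suggests convergence
describe(chain)       # summary statistics plus quantiles, for a "do these estimates make sense?" look
```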

dlakelan,

@GhostOnTheHalfShell @ChrisMayLA6

Iterative development of the model involves fitting the model, figuring out whether there are important considerations it fails to address, re-defining the model, and re-fitting it.

Roughly, fitting it might take, say, 4 hours of computing, so you get maybe 2 iterations of this process per day. Sometimes there are simple software bugs, sometimes there are "logical bugs". There have been a number of important methodological advances since 2017 or so.

dlakelan,

@GhostOnTheHalfShell @ChrisMayLA6

Some of those advances could give you "rough fits" in perhaps 10-20 minutes, so you could potentially iterate much more quickly today.
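As one hypothetical example of a fast rough fit (not necessarily one of the specific advances meant here), variational inference via Turing's vi interface, reusing the toy model sketched above:

```julia
using Turing

# Approximate the posterior with ADVI instead of running full MCMC:
# much faster, rougher answers, useful for quick model-iteration passes.
q = vi(model, ADVI(10, 1_000))   # 10 Monte Carlo samples per gradient step, 1000 optimization steps
z = rand(q, 1_000)               # draws from the variational approximation for a quick look
```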

dlakelan,

@GhostOnTheHalfShell

It's absolutely wonderful; I would love to do this full time. Roughly, the ideal situation would be something like $500k a year to fund me, a PhD economist friend, and 3 students with undergraduate degrees: one in an engineering, physics, or biophysics discipline, one in a CS discipline with database experience, and one in an econ discipline.

dlakelan,

@GhostOnTheHalfShell

Probably a lot we could do. The obvious way to parallelize it just gives you, say, 5x as many MCMC samples from 5 different chains in the same 4-hour run, which is great for final results but bad for iterative development.
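With Turing.jl, for instance, the multi-chain version looks roughly like this (a sketch; `model` stands for the toy model above):

```julia
using Turing

# 5 chains on 5 threads: roughly the same wall-clock time as one chain,
# but ~5x the posterior samples. Doesn't shorten a single development fit.
chains = sample(model, NUTS(), MCMCThreads(), 2_000, 5)
```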

But like I said, a lot has changed since then. One thing is I now work in Julia and it is truly a blessing for this kind of stuff.

dlakelan,

@GhostOnTheHalfShell

A number of reasons. One is that ideally we'd do things like food budget as a function of age, and heating and cooling budget as a function of location and climate. You want to estimate minimal budgets, so you don't just want to look at what people actually spend, because some of that is "disposable income".

dlakelan,

@GhostOnTheHalfShell
The Markov chain processes the entire dataset at each iteration, so it's quite intensive. But there are reasons you might be able to do regions in parallel and sacrifice some minor amount of information (costs in, say, Fresno don't really inform costs in, say, Kansas City or Atlanta that much).
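A sketch of that region-parallel idea with Julia's Distributed standard library, using the toy puma_costs model from earlier as a stand-in for a per-region model; the data layout here is hypothetical.

```julia
using Distributed
addprocs(8)   # hypothetical: one worker per core

@everywhere using Turing
# The model definition (e.g. the toy puma_costs above) would also need an @everywhere.

@everywhere function fit_region(region)
    # Fit each region independently; no pooling of information across regions.
    sample(puma_costs(region.cost, region.puma_id, region.n_puma), NUTS(), 1_000)
end

# region_datasets: hypothetical collection with one entry per state or metro area.
results = pmap(fit_region, region_datasets)
```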

So, yeah, it's a real honest-to-goodness research project that something like the NSF should fund for 4 years.

dlakelan,

@GhostOnTheHalfShell

One of my motivations for the whole project was to understand the differences between rural and urban regions and how stark those were likely to be in CA.

dlakelan,

@GhostOnTheHalfShell

It's a good question. Usually grants are not available to any organization that isn't a 501(c)(3) nonprofit, so there's a chicken-and-egg issue. I could create a 501(c)(3) and try to get funding for it, but the organizations funding things prefer existing track records to new orgs... the overhead of forming a 501(c)(3) isn't insane, but it's not zero either. You and I think the topic is very compelling, but it's directly opposed to an established Econ power structure.

dlakelan,

@GhostOnTheHalfShell

Probably the best way to go about it is to get some kind of "preliminary" grant to show that the project is kind of possible and compelling and that the organization can handle it. But even to realistically get started we're talking $120k. That's what a CS undergrad earns in the LA area.

dlakelan,

@GhostOnTheHalfShell

You might enjoy diving into some preliminary analysis at the level of just slicing and dicing the ACS (American Community Survey) data and making plots.

If you're interested I could mentor on that. I'd recommend installing Julia on a Linux box with 16 or 32 GB of RAM and a terabyte SSD, and setting up a GitHub project... I can't really do the multi-hour-per-day work on the projects right now, but I could absolutely coach.
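For example, a minimal sketch of that slicing-and-plotting workflow with CSV.jl, DataFrames.jl, and StatsPlots.jl. The file name is hypothetical, and the column names (HINCP for household income, NP for number of persons) should be checked against the PUMS data dictionary that ships with the download.

```julia
using CSV, DataFrames, Statistics, StatsPlots

# Hypothetical file name for an ACS PUMS household-level CSV from the download.
hh = CSV.read("psam_husa.csv", DataFrame)

# Median household income by household size (verify column names against the data dictionary).
bysize = combine(groupby(dropmissing(hh, [:HINCP, :NP]), :NP),
                 :HINCP => median => :median_income)

@df bysize bar(:NP, :median_income,
               xlabel = "Household size", ylabel = "Median household income",
               legend = false)
```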

dlakelan,

@GhostOnTheHalfShell
It's enough to get started and see whether you find this stuff compelling. I'd say rather than doing MariaDB and SQL queries you'd be doing more by-hand filtering in Julia, but it's still enough to know whether you want to do this stuff or not.

Start by making a GitHub project on the web with just a README and then cloning the empty project to your Mac. That's the easiest way.

dlakelan,

@GhostOnTheHalfShell
Your biggest issue is running MariaDB on the same machine and having it take up RAM. It's very reasonable for a 32 GB machine, doable with 16 GB, but maybe not a great idea with 8 GB.

Your best bet is a CONNECT table https://mariadb.com/kb/en/connect-csv-and-fmt-table-types/
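A sketch of what that looks like, driven from Julia via MySQL.jl so it fits the rest of the workflow; the connection details, file path, and column names are hypothetical, and the CONNECT options follow the linked docs.

```julia
using MySQL, DBInterface, DataFrames

# Hypothetical local MariaDB server and database.
conn = DBInterface.connect(MySQL.Connection, "127.0.0.1", "user", "password"; db = "acs")

# Expose a CSV file as a table through the CONNECT engine (options per the linked docs);
# column definitions can also be listed explicitly instead of letting CONNECT infer them.
DBInterface.execute(conn, """
    CREATE TABLE hh
    ENGINE=CONNECT TABLE_TYPE=CSV
    FILE_NAME='/data/acs/psam_husa.csv'
    HEADER=1 SEP_CHAR=',' QUOTED=1;
""")

# Then slice and dice with ordinary SQL.
df = DataFrame(DBInterface.execute(conn, "SELECT PUMA, AVG(HINCP) AS avg_income FROM hh GROUP BY PUMA"))
```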

dlakelan,

@GhostOnTheHalfShell
The less traditional stats you know the better 😉.

Don't add the datasets to the git repo; they're too big! But a script to do the download and extraction would help document what we downloaded and help anyone who wants to replicate.

git add, git commit, git push, and git pull are the most important getting started commands.

Did you get VSCodium and the Julia extension? I'll write a quick getting-started notebook tomorrow that you can use as a template for sorting and plotting.

dlakelan,

@GhostOnTheHalfShell

macOS version of Codium:

https://github.com/VSCodium/vscodium/releases/download/1.88.1.24104/VSCodium-darwin-x64-1.88.1.24104.zip

You can get the Julia extension from the extension manager. I recommend Codium because of the Julia extension, which nicely captures plots, reads and edits Jupyter notebooks, and has a debugger and data inspector for Julia objects, etc.

dlakelan,

@GhostOnTheHalfShell
Stats is like Econ: it's full of bog-standard stuff in all the textbooks that is wrong.

Not wrong mathematically, just inappropriately applied with bad assumptions.

If it involves tests or p-values, feel free to avoid any of that. For now focus on making good plots of interesting facts.

dlakelan,

@GhostOnTheHalfShell
Here's an example of an interesting question: what's a typical ratio of rent+utilities to mortgage+utilities in each county each year? To form that ratio you need to decide which households you can compare. Probably match on county, year, and household size, then randomly pair them, form the ratios, and histogram that.
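A rough DataFrames.jl sketch of that matching-and-ratio procedure; the column names (:county, :year, :hhsize, :tenure, :housing_cost) are hypothetical stand-ins for the actual ACS variables.

```julia
using DataFrames, Random, StatsPlots

# Match households on county, year, and size; randomly pair renters with owners
# within each matched cell; form the cost ratios.
function rent_to_own_ratios(hh::DataFrame; rng = MersenneTwister(1))
    ratios = Float64[]
    for g in groupby(hh, [:county, :year, :hhsize])
        renters = shuffle(rng, g[g.tenure .== "rent", :housing_cost])
        owners  = shuffle(rng, g[g.tenure .== "own",  :housing_cost])
        n = min(length(renters), length(owners))
        append!(ratios, renters[1:n] ./ owners[1:n])
    end
    return ratios
end

# histogram(rent_to_own_ratios(hh), xlabel = "rent+utilities / mortgage+utilities")
```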

dlakelan,

@GhostOnTheHalfShell
I think the ACS has an estimate of taxes paid on income? So another one would be distribution of the ratio of annual income minus taxes to annual rent+utilities, and you could split out by household size.

I don't expect you to be able to do these manipulations right off the bat, but I'm pretty sure your coding background will make it easy for you to read the documentation and examples and work out how to do what you need.

dlakelan,

@GhostOnTheHalfShell
Richard McElreath's "Statistical Rethinking" is my recommendation.

I started putting together some example stuff for videos here

https://github.com/dlakelan/JuliaDataYouTube

But then I opted out of the YouTube ecosystem. I'm thinking of doing them on my own PeerTube instance instead.

dlakelan,

@GhostOnTheHalfShell
I don't think you can track families in the ACS data. Maybe each family is interviewed twice? But it might just be once.

Yeah, questions like the ones you raise are good. Sometimes we just start simple and add complexity. Square footage is not available, I think, but house type might be (single family, townhouse, trailer, whatever). A good place to start is the data dictionary they give you in the ZIP file (I think it's included?). Understand what's been collected first.

dlakelan,

@GhostOnTheHalfShell
I've got a gigabit connection, and PeerTube distributes the videos via BitTorrent, so it might be OK to run it out of my closet. But yeah, Archive.org is good stuff; I love them.
