On multiple occasions I've listened to instance admins speak about high S3... - Random

devnull, 3 months ago (edited 3 months ago)

On multiple occasions I've listened to instance admins speak about high S3 costs. The sheer amount of data absolutely balloons the more activity your server sees, I get it.

What I don't get is whether there's some unknown fedi ethical reason everybody insists on setting up an S3 cache (followed immediately by complaining about it).

Y'all want to know what the rest of the web does? Hosts their own uploaded media, and links out to the rest...

#fediadmin #s3 #devops #sysadmin

jenniferplusplus, 3 months ago

@devnull I think there's value in doing large scale caching of external media. But only caching, not hoarding. Keeping your own permanent copy of every jpg your server has ever heard about is completely bonkers. Caching remote media (and respecting or overriding their cache control signals) should be a trust decision, just like it is to accept, reject, hide, or show their text.
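
As a rough illustration of "respecting or overriding their cache control signals" as a per-domain trust decision, here is a minimal Python sketch. The domain names, the `TRUST_POLICY` table, the `cache_ttl` helper, and the 24-hour fallback are all hypothetical and not taken from any fediverse software. It is also simplified: it only looks at `no-store`, `private`, and `max-age`, and ignores `s-maxage` and the other directives a real shared cache would consider.

```python
import re

# Hypothetical values for illustration only.
DEFAULT_TTL = 24 * 3600  # fallback lifetime when we do not use the origin's signal
TRUST_POLICY = {
    "trusted.example": "respect",   # honour the origin's Cache-Control header
    "flaky.example": "override",    # ignore the header and use DEFAULT_TTL
    "blocked.example": "reject",    # never cache media from this domain
}


def cache_ttl(domain, cache_control, policy=TRUST_POLICY, default_ttl=DEFAULT_TTL):
    """Return how many seconds to keep a remote file; 0 means do not cache."""
    # Unknown domains are treated as "override" in this sketch.
    decision = policy.get(domain, "override")
    if decision == "reject":
        return 0
    if decision == "override" or not cache_control:
        return default_ttl

    directives = [d.strip().lower() for d in cache_control.split(",")]
    # A shared cache should not keep responses marked no-store or private.
    if "no-store" in directives or "private" in directives:
        return 0
    for directive in directives:
        match = re.fullmatch(r"max-age=(\d+)", directive)
        if match:
            return int(match.group(1))
    return default_ttl
```

With this table, `cache_ttl("trusted.example", "public, max-age=600")` keeps the file for 600 seconds, while the same header from `flaky.example` is overridden with the 24-hour default.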

devnull, 3 months ago

@jenniferplusplus 💯 completely agreed. There's merit in multiple layers of caching (despite having to deal with the rather unfortunate problem of cache invalidation), so I am entirely on board with caching remote media locally so as to not inundate smaller servers with huge volumes of traffic.

dt, 3 months ago

@devnull media from other sites (images, audio, video, etc.) often comes with third-party cookies. that said, i agree that caching media takes too much space. it's a systemic problem (people got used to uploading humongous files without thinking about it). on my (experimental) social network, i “solved” this by not allowing any images. it works in my context but it's not a universal solution obviously. still, i wish people thought about the footprint of their media a bit more.

hrefna, 3 months ago

@devnull Short answer: My instinct is that you are correct in doing it for optimization reasons, though I think the optimization reasons extend slightly.

There's an assumption in the fediverse that we are single individuals acting individually. This is where the expression that there are no "servers" in ActivityPub comes from (I have opinions about this expression, but setting that aside).

So the assumption is that your system that served the media may not be always online. So there's a cache.

hrefna, 3 months ago

@devnull The small local server may not always be online, it may not be able to support Fediverse Load™, it may be flaky, etc.

Hosting it locally removes those as concerns.

I'm personally of the view that we should act more like the rest of the internet here and just work with links, however.

(I consider the "not leaking IP addresses" mostly a BS justification and that anyone who needs that really needs Tor and to not use most fediverse software already, but it does exist as well).

hrefna, 3 months ago

@devnull There is some argument to be made for doing it for the purposes of locality—for example if Alice's server is in the US and Bob's is in Singapore—but this extends to the entire internet and a better solution here, IMO, are CDNs (and it should be optional if we're going to go that route, and servers should be able to not have their media stored like that, etc).

kevinriggle, 3 months ago

@hrefna @devnull the real reason to do it is that it aligns instances’ hosting costs with the number of users they have rather than how popular their users’ posts get. Nobody wants their CDN bill to spike to an unpayable level because a big user retweeted one of their users’ posts. This is why cross linking images died elseweb as well. Ppl would just block the image once it got too much traffic and now your thing 404s.

hrefna, 3 months ago

@kevinriggle

That's a variant of what I already said in the second post: you have a theoretically more reliable connection to your local instance than to the remote instance, the remote instance isn't set for traffic, etc

That said, to the degree it is specifically about that cost, I'd argue that it's optimizing the wrong part of the problem. Or perhaps more accurate to say: optimizing a part of the problem with a niche use case when there are other concerns that need to be weighed.

@devnull

devnull, 3 months ago

@hrefna @kevinriggle Putting a CDN in front and taking the brunt of your local instance's traffic ought to be an expectation if you have the capacity to generate large volumes of traffic.

The rationale falls apart a fair bit when the amount of traffic most single-user (and low-user) instances generate to other instances is closer to a rounding error.

So my intention was to learn more (mission accomplished) but also to object to the blanket assumption that S3 is always necessary in all cases.

FenTiger, 3 months ago

@devnull @hrefna @kevinriggle It's also about how much traffic other instances can generate to you. Even a tiny single user instance can see a huge amount of inbound traffic if someone like Gargron boosts one of your posts.

kevinriggle, 3 months ago

@FenTiger @devnull @hrefna @kevinriggle yeah this. Tiny single user instances shouldn’t bear the burden of someone else deciding to pay attention to them, the people with the attention (or their instances) should

kevinriggle, 3 months ago

@FenTiger @devnull @hrefna @kevinriggle one of many, many things we didn’t understand about the internet thirty years ago and had to learn repeatedly, face first

FenTiger, 3 months ago

@kevinriggle We're still trying to learn: https://mastodon.social/@jwz/109411593248255294

@devnull @hrefna

kevinriggle, 2 months ago

@FenTiger @kevinriggle @devnull @hrefna or at least JWZ is. I love him but man has very clearly made his choices

kevinriggle, 2 months ago

@FenTiger @kevinriggle @devnull @hrefna (if 1000 simultaneous hits takes down your web server what potato are you running it on and why are they even touching the database, c’mon man)

hrefna, 2 months ago

@kevinriggle

The webserver (probably)

@FenTiger @devnull

portal 2 potato GIF

alex, 3 months ago

@devnull

Just checked today, and my 1 user instance has consumed 40G of its 50G storage allotment. This is a default setup.

Related: I use and recommend Hugo's hosting service https://masto.host

devnull, 3 months ago

@alex will respond at-large to other commenters, but I specifically wanted to call this one out.

Your single-user instance has consumed 40GB of storage. Yet you as a single-user have not consumed 40GB of media — at least, I certainly hope not!

Which means you're paying for storage of 99.9% unseen content, simply because it's been passed along to your instance, for the off chance you'll stumble upon it (statistical probability after 1 day: ~0%)

devnull, 3 months ago

Am I wrong for thinking that this established expectation (especially for smaller bootstrapped instances) is perfectly cromulent from an ops perspective? Honestly asking because I come from a time before DevOps and Microservices were a thing, and we all hosted our crud on servers we had physical access to (though VPSes are great!)

Yes, I totally get the benefits of having a CDN. Especially with global access, but nobody's setting up a globally distributed CDN for their dinky Mastodon instance.

devnull, 3 months ago (edited 3 months ago)

Yes, sure, it reduces the load on the origin server, if access to the media was distributed via other federated servers' CDNs, but one neat trick to reducing your transit costs is to... not carte blanche host every piece of media your instance stumbles onto.

If anything, the rationale seems rather contrived.

I just don't want to make a misstep here and come off as a selfish fediverse implementor.

devnull, 3 months ago

Oh, yes, also "separation of duties". I get that farming out that work to a specialized host is fantastic and easy and just works, but the calculus doesn't work for me I guess?

I'll be the first to admit that I'm not an ops guy and can't argue my points all that eloquently, but maybe I'm comfortable with my servers falling over under extreme load. I'd rather wake up to a downed service over a $200,000 egress bill. My two cents.

devnull, 3 months ago

Years and years ago, when we got our first big customer, NodeBB fell over at the drop of a hat.

Our approach to this wasn't to give up and throw more servers at it (microservices, YAY!), though we definitely did support that eventually... it was to buckle down and optimize the bejeezus out of our code so that it ran faster than anything else on the market.

You could load NodeBB on the crappiest VPS and it'll handle 100 concurrent users, easy.

devnull, 3 months ago

Anyway, back to the main point... perhaps I'm not seeing the benefit of S3 over a cheap volume (NFS export if horiz. scaled) with nginx plopped in front to serve those static assets.

Like I said, maybe I'm missing something.

boris, 3 months ago

@devnull the convention is to not load hotlinked images.

So when a server (I’ll use Mastodon as my example) has member accounts who follow a remote account, it takes on the cost of eg caching media for all of its local members.

Rather than putting 100% of the load on the remote server.

This means a small server can have many followers, and the bigger server can do the caching.

Let me tackle setup in my next post.

boris, 3 months ago

@devnull Why S3 + CDN?

Hmm. Well vs what?

We go through 2TB of cached media per month on our server, on S3. Provisioning a 2TB physical (or probably 4TB) would be way more expensive.

We have about 120 members, whose local store of uploaded media is at ~150GB after a year

We have CloudFront CDN in front of that, because running CF is way cheaper than egress charges.

VPS is on DigitalOcean.

We’re doing some exploring around bare metal.

devnull, 3 months ago

@boris it's worth exploring to see just how often those 2TB of assets are called upon, and how those values change as the assets go stale.

Where exactly is the cutoff point before storing it in S3 becomes financially unsuitable?

My thinking is that that cutoff point is far sooner than you'd expect. If we saved all encountered media to disk and persisted it only 24h, that'd likely handle 99% of use cases.

Thank you for sharing!! I wonder how much of your cached media is < 24h old.
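
A rough sketch of what that 24h policy could look like on a plain disk cache, in Python. The `prune_cache` helper and its parameters are hypothetical, and it assumes each file's modification time is the moment the media was fetched. Run with `dry_run=True`, it only reports how many bytes are younger or older than the cutoff, which is one way to answer the "how much is < 24h old" question before deleting anything.

```python
import os
import time


def prune_cache(root, max_age_seconds=24 * 3600, now=None, dry_run=True):
    """Split cached files under `root` into fresh and stale by mtime.

    Returns (fresh_bytes, stale_bytes). Stale files are deleted only
    when dry_run is False.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    fresh_bytes = 0
    stale_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            if info.st_mtime < cutoff:
                # Older than the cutoff: count it, and remove it if asked to.
                stale_bytes += info.st_size
                if not dry_run:
                    os.remove(path)
            else:
                fresh_bytes += info.st_size
    return fresh_bytes, stale_bytes
```

This keys on fetch time rather than last view. Tracking last access instead would keep popular files around longer, but access times are often disabled on the filesystem, so that would need separate bookkeeping.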

olives, 3 months ago

@devnull I've heard of that happening to instances. They'll wake up one day and they'll have a nasty surprise $10k bill from something stupid.

devnull, 2 months ago

@olives I don't know about you but I'd rather not be financially ruined because a scraper decided to blow up my instance overnight.

I don't know when exactly it became normalized to have an always-up site, and to absorb the associated costs of it.
