(Done) Migrating Everything to new Hosting Company - 27th March - 00:00 UTC

Basically, I’m sick of these network problems, and I’m sure you are too. We’ll be migrating everything (pictrs, frontends & backends, database & webservers) onto a single server at OVH.

First it was a CPU issue, so we worked around that by moving pictrs to another server, which left just enough CPU to keep us all okay. Everything was fine until the spammers attacked. Then we couldn’t process the activities fast enough, and now we can’t catch up.

We were having constant network dropouts/lag spikes where all the networking connections get “pooled”, with a CPU steal of 15%. So we bought more vCPU and threw resources at the problem. That fixed it temporarily, but our “NVMe” VPS, which housed our database and lemmy applications, was still showing an IOWait of 10-20% half the time. Unbeknown to me, that was not IO related but network related.

So we moved the database off to another server, but unfortunately that caused another issue (the unintended side effects of cheap hosting?). Now we have 1 main server accepting all network traffic, which then has to contact the NVMe DB server and the pict-rs server as well, and then send all that information back to the users. This was part of the network problem.
Adding backend & frontend lemmy containers to the pict-rs server helped alleviate this, and is what you are seeing at the time of this post. Now a good 50% of the required database and web traffic is split across two servers, which keeps our servers from being completely saturated with requests.

On top of the recent nonsense, it looks like we are limited to 100Mb/s, which is roughly 12MB/s. So downloading a 20MB video via pictrs would, in this example, follow the current flow:

User requests image via Cloudflare.
(It’s not already cached, so Cloudflare requests it from our servers.)
Cloudflare proxies the request to our server (app1).
Our app1 server connects to the pictrs server.
Our app1 server downloads the file from pictrs at a maximum of 100Mb/s.
At the same time, the app1 server is uploading the file via Cloudflare to you at a maximum of 100Mb/s.
During this time our connection is completely saturated and no other network queries can be handled.
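
To put rough numbers on the flow above, here is a minimal sketch of the arithmetic. It assumes decimal units (1 MB = 8 Mb) and an ideal link with no protocol overhead, and the function and constant names are made up for illustration.

```python
def transfer_seconds(size_mb: float, link_mbit_per_s: float) -> float:
    """Seconds needed to move size_mb megabytes over a link rated in megabits/s."""
    return size_mb * 8 / link_mbit_per_s


LINK_MBIT_PER_S = 100  # the cap mentioned in the post
VIDEO_MB = 20          # the example video size from the post

link_mb_per_s = LINK_MBIT_PER_S / 8                        # megabytes per second
video_seconds = transfer_seconds(VIDEO_MB, LINK_MBIT_PER_S)

print(f"link ceiling: {link_mb_per_s} MB/s")
print(f"one {VIDEO_MB}MB video holds the link for about {video_seconds:.1f}s")
```

Real transfers would be somewhat slower than this ideal figure, so the window in which a single large file can crowd out other requests is probably longer in practice.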

This is just one example of the network issue I found out we had after moving to the multi-server system. It is of course not a problem when you have everything on one beefy server.

Those are the broad strokes of the problems.

Thus we are completely ripping everything out and migrating to a HUGE OVH box. I say huge in capital letters because the OVH server is $108/m and has 8 vCPU, 32GB RAM, & 160GB of NVMe. This amount of RAM allows for the whole database to fit into memory. If this doesn’t help then I’d be at a loss as to what will.
Currently (assuming we kept paying for the standalone postgres server) our monthly costs would have been around $90/m. ($60/m (main) + $9/m (pictrs) + $22/m (db))

Migration plan:

The biggest downtime will be the database migration, as we need to take it offline to ensure consistency. Which is just simpler than

DB:

stop everything
start postgres
take a backup (20-25 mins)
send that backup to the new server (5-6 mins, limited to 12MB/s)
restore (10-15 mins)

pictrs

syncing the file store across to the new server

app(s)

regular deployment

This is the same process I recently did here, so I have the steps already cemented in my brain. As you can see, taking a backup ends up taking longer than restoring. That’s because, when testing the restore process on our OVH box, we were nowhere near any IO/CPU limits and it was, to my amazement, seriously fast. Now we’ll have heaps of room to grow with a stable donation goal for the next 12 months.
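
The DB step estimates in the plan can be tallied with a short sketch like the one below. It assumes the three steps run strictly one after another and that the transfer runs at the full 12MB/s cap. The backup size it prints is only what those transfer times would imply, not a measured figure, and the names are illustrative.

```python
# (fastest, slowest) estimate in minutes for each DB step, as quoted in the plan
DB_STEPS = {
    "backup": (20, 25),
    "transfer": (5, 6),
    "restore": (10, 15),
}

LINK_MB_PER_S = 12  # the transfer cap quoted in the plan


def downtime_range(steps):
    """Sum the fastest and slowest estimates, assuming the steps run one after another."""
    fastest = sum(low for low, _ in steps.values())
    slowest = sum(high for _, high in steps.values())
    return fastest, slowest


def implied_backup_mb(transfer_minutes, link_mb_per_s=LINK_MB_PER_S):
    """Backup size (MB) that would take transfer_minutes at the capped rate."""
    return transfer_minutes * 60 * link_mb_per_s


fastest, slowest = downtime_range(DB_STEPS)
print(f"DB downtime if run sequentially: {fastest}-{slowest} mins")
print(f"implied backup size: {implied_backup_mb(5)}-{implied_backup_mb(6)} MB")
```

Any steps that can overlap, such as syncing pictrs while the database is being exported, would bring the total below the sequential sum.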

See you on the other side.

Tiff

reddthat_209, 1 month ago

Thank you for all the hard work!

LocustOfControl, 1 month ago

Legend.

ticoombs, 1 month ago

/gif legen-dary

ticoombs, 1 month ago (edited 1 month ago)

I managed to streamline the exports and syncs so that we performed them concurrently, allowing us to finish in just under 40 minutes! Enjoy the new hardware!

So it begins: (Federation “Queue”)
(Image: federation queue showing an upward trend, then down, then slightly back up again)

Blaze, 1 month ago

Well done!

Stimmed, 1 month ago

OMG, posts load instantly now, used to take 3 to 15 seconds. I’m in US East Coast for reference.

ticoombs, 1 month ago

That’s what I love to hear! 🎉

PeepinGoodArgs, 1 month ago

I’m glad you mentioned this. It’s snappy!

ticoombs, 1 month ago

🐊 Snap Snap!

Blaze, 1 month ago

I just had a look at the graph, it looked good until now, but now it’s up again :(

…lem.rocks/…/federation-health-activities-behind?…

ticoombs, 1 month ago (edited 1 month ago)

That’s when the US timezones wake up. We physically cannot accept more than 3 requests per second. “Physically” here means the actual physical limits of the network (3 x 287ms = 861ms; we used to be at 930ms+. The server move got us 21ms closer!). LW generates more than 3 activities per second during US “awake” time zones. So we have a period of 8 hours where we need to catch up.

Like I’ve said in our forcing federation post, there isn’t anything to worry about: we are completely up-to-date on posts and comments because of our sync script.

It’s just the sequential nature of Lemmy. I’m going to test a new container in the next 12 hours which removes the blocking metadata generation from the accepting of activities. That way we can guarantee at least 3 activities a second.

Realistically, that is a minor fix and it won’t help with those graphs in the long term. We will need to have parallel sending for it to ever scale.

On a side note, while we were on our old server and using our forcing federation script, we had it set to 10 parallel requests. The server didn’t even worry about it: I saw no increase in server load. That is good news for the lemmyverse in general, as everyone will be able to accept the new parallel sending without needing to increase their hardware.
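
As a rough illustration of why sequential delivery caps out at about 3 activities a second, and what parallel sending could change, here is a minimal sketch. The 287ms round trip and the 8 hour window are the figures quoted above. The incoming rate of 4 per second is purely hypothetical (the only figure given is "more than 3"), and the parallel number is an upper bound that assumes latency stays the same and the receiving server keeps up.

```python
def sequential_rate(round_trip_ms: float) -> float:
    """Activities per second when each one waits for the previous round trip."""
    return 1000 / round_trip_ms


def parallel_rate(round_trip_ms: float, in_flight: int) -> float:
    """Upper bound with in_flight requests open at once (ignores server limits)."""
    return in_flight * sequential_rate(round_trip_ms)


def backlog_after(hours: float, incoming_per_s: float, accepted_per_s: float) -> float:
    """Activities left queued after `hours` of incoming traffic above the ceiling."""
    return max(0.0, incoming_per_s - accepted_per_s) * hours * 3600


ROUND_TRIP_MS = 287   # per-activity latency quoted above
BUSY_HOURS = 8        # assumed length of the busy period, from the 8 hour window above
INCOMING_PER_S = 4.0  # hypothetical; the only figure given is "more than 3"

ceiling = sequential_rate(ROUND_TRIP_MS)
print(f"sequential ceiling: {ceiling:.2f} activities/s")
print(f"backlog after {BUSY_HOURS}h at {INCOMING_PER_S}/s: "
      f"{backlog_after(BUSY_HOURS, INCOMING_PER_S, ceiling):.0f}")
print(f"ceiling with 10 in flight: {parallel_rate(ROUND_TRIP_MS, 10):.1f} activities/s")
```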

Tiff

Blaze, 1 month ago

Thank you for the detailed answer!

“There isn’t anything to worry about because we are completely up-to-date on posts and comments because of our sync script.”

Sorry, it’s a bit late for me on this side, but if I understand correctly, posts and comments are indeed up-to-date, but upvotes are synchronized later, is this correct?

Thank you for the work as always!

ticoombs, 1 month ago (edited 1 month ago)

“but upvotes are synchronized later”

Correct. All votes are synchronised eventually.

Blaze, 1 month ago

Good luck for the migration!

ticoombs, 1 month ago

It starts… Soon. 😎

MonsiuerPatEBrown, 1 month ago

Did I ever let you know that I pissed off a CIA asset that launders Russian oligarch monies using the FBI, Filipino organized crime, the Albanians, and other US based law enforcement via FedEx, UPS, USPS, and Joann’s Fabrics?

Should I contribute more monthly to cover their probable sabotaging reddthat ?

ticoombs, 1 month ago (edited 1 month ago)

Between you and me, you personally probably don’t need to donate more in the short term, but I’m not going to stop you! 😛

We need about A$40-50/month extra to cover everything now. We have A$77.22 set up in recurring donations on OpenCollective, and our server bills alone are A$115 (converted from US$74.80), plus domain renewal (1.5 Euro/m) and Wasabi storage (~US$8/m). This will be covered in an updated Funding post. With the money on Ko-Fi, OpenCollective and the recurring donations on OpenCollective, we have at least 12 months of runway before we run out of money. So it isn’t critical at the moment.

Thanks! 🤎

Edit: Actual prices

doctortofu, 1 month ago

Fingers crossed that everything goes well!

In the meantime, here’s a counter until the event that should work for any timezone

Blaze, 1 month ago

Thanks!

Facebones, 1 month ago

Idk crap about lemmy backend stuff, I’m just here for Legends of the Hidden Temple.

ticoombs, 1 month ago

Glad the link worked! It’s always risky posting mp4 links. I’ll be glad once the new front end patches come through so that, by default, it shows an image of the video (iirc).

Facebones, 1 month ago

FWIW I didn’t know it was a video until you said something haha. The video did work though also when I clicked on it.

the_rogue, 1 month ago

“until the fire nation spammers attacked.”

Hehe

ticoombs, 1 month ago

😁

AFC1886VCC, 1 month ago

I’ve noticed some issues since moving to reddthat. Glad to see a fix is being worked on, keep up the good work :)

ticoombs, 1 month ago

How are your issues now? 🧐

AFC1886VCC, 1 month ago

Things seem okay now, no weird behaviour like random logouts and communities not loading 😁

Blaze, 1 month ago

If you moved recently, you are quite unlucky with the timeframe ha ha

Blaze, 1 month ago

Great news! Thank you so much for this!

ticoombs, 1 month ago

PS. Everyone enjoying this new wide layout?

Blaze, 1 month ago

Did anything change? If so, I didn’t notice it ha ha

ticoombs, 1 month ago

I changed the default theme to be the “Compact” version. Which makes it wide screen, but if you’ve set your own then it doesn’t change it. If you open up reddthat.com in a private browser you should see it.
