Post-Mortem: The massive lemmy.world -> lemmy.dbzer0.com federation delays.

nyakojiru, 2 months ago

Yep, I can confirm it is massively faster now to comment and post. Even faster than other instances and other corporate products.

count0, 2 months ago

Great writeup, thank you so much for sharing!

Nothing more frustrating than googling an issue and (only) finding forum threads ending in “nvm it works now” 😬

dessalines, 2 months ago (edited 2 months ago)

Glad you were able to figure this one out, I never know whether to be mad at myself or proud of my persistence when I spend like a day trying to fix something that turned out to be really simple and almost always unrelated to what I thought the problem was 😂

Edit: also, if you found any performance-related config improvements, either to the postgresql.conf, nginx.conf, or lemmy.hjson, please contribute them to lemmy-ansible so that all instances can benefit from what you’ve learned.

db0, 2 months ago

Already sent a big PR for lemmy-doc 😊

nutomic, 2 months ago (edited 2 months ago)

As someone hosting a service like this, especially when it has 12K people in it, this is very scary! While 2 lemmy core developers were in the chat, the help they provided was very limited overall and this session mostly relied on my own skills to troubleshoot.

This reinforced in my mind that as much as I like the idea of lemmy (or any of the other threadiverse SW), this is only something experts should try hosting. Sadly, this will lead to more centralization of the lemmy community to few big servers instead of many small ones, but given the nature of problems one can encounter and the lack of support to fix them if they’re not experts, I don’t see an option.

I disagree with this conclusion. If you had installed Lemmy according to the official instructions, you would have the database, backend and everything else on the same server and would never have run into this particular issue. And any problems you’d have would likely be noticed (and debugged) by many other instances too. Your setup is heavily customized, so it is only natural that there are few people who can help with it.

Anyway, it’s an interesting journey. Thanks for writing down your experience and for improving the documentation!

KairuByte, 2 months ago

I’m curious how you think “everything on the same box” scales? You can’t load balance, you can’t ensure resources are being used efficiently, you can’t even reboot a machine without the entire thing going dark.

nutomic, 2 months ago

Lemmy.ml runs on a single server and is much bigger than db0. Sure you can’t get 100% availability this way but no one expects that.

KairuByte, 2 months ago

Do you have a link to something describing their infrastructure?

taaz, 2 months ago (edited 2 months ago)

Edit: this comment is not written well, and is not describing the issue I wanted to actually comment on, I am tired and sorry

I will hop on to this to also point out that there actually were people willing to actively help (me included, see the original post on this community) but, to put it bluntly, we were not “invited in on the show”. Let me expand on that.

The problem is, as @nutomic points out here, that we don’t have the slightest idea how exactly your infrastructure looks. Without that, there is only the most general stuff we can help with.

From my point of view, joining the matrix chat later in the process, I watched you do/post stuff without knowing where it came from. I didn’t have the full context of what had already been tried and crossed out, or what the current plan was.
You @db0 would have to stop chopping and start networking with the people. That is definitely not easy to do effectively, especially if more people join later (and also have to be updated on the state), but we could have fast-tracked the docker/compilation stuff, ruling lemmy out sooner.

In retrospect, if we had had the full picture of how the infrastructure looks, the chance that someone would go “oh you have split backend and database servers, check the latency” would definitely have been a lot higher, but we didn’t know (hell, I actually assumed your deployment was the same as or close to the lemmy ansible one). I am aware this is easy to say after the solution has been found, but hopefully you get the networking/communication idea.
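
As a rough illustration of the “check the latency” idea, here is a minimal Python sketch that times TCP handshakes from the backend host to the database host and then multiplies one round trip by a number of sequential queries. The host name `db.example.internal`, the port 5432 (the PostgreSQL default) and the figure of 50 queries per activity are assumptions chosen for illustration, not values from this incident.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host: str, port: int, samples: int = 5,
                           timeout: float = 2.0) -> list[float]:
    """Time `samples` TCP handshakes to host:port and return each in milliseconds.

    A handshake costs roughly one network round trip. Pass an IP address
    rather than a host name to keep DNS lookup time out of the measurement.
    """
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        # Open the connection and close it again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        results.append((time.perf_counter() - start) * 1000.0)
    return results


def estimated_serial_cost_ms(round_trip_ms: float, queries: int) -> float:
    """Network wait alone if `queries` run one after another over the link."""
    return round_trip_ms * queries


if __name__ == "__main__":
    # Hypothetical address: replace with your own database host and port.
    timings = tcp_connect_latency_ms("db.example.internal", 5432)
    median = statistics.median(timings)
    print(f"median handshake time: {median:.2f} ms")
    # Hypothetical workload: 50 sequential queries per incoming activity.
    print(f"network wait per activity: {estimated_serial_cost_ms(median, 50):.1f} ms")
```

Run from the backend machine, this gives a quick feel for whether the link to the database is the bottleneck: a small per-query delay adds up quickly when many queries are issued back to back.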

db0, 2 months ago (edited 2 months ago)

Wait, hold on, how was help not accepted? I talked with everyone who replied to me and followed every suggestion. If someone asked for infra information, I gave it.

You know, it’s really frustrating to open myself up and write about my experiences honestly and then have people try to say that it’s actually my fault I didn’t ask for help “the right way”. What kind of effect do you think this might have on other potential lemmy hosters?

taaz, 2 months ago (edited 2 months ago)

I didn’t want to devalue your communication. I think I worded my previous comment very badly in that regard, and I am sorry about that. (I also really need to go to sleep so I will be blunt here.)

There is a nuance to internet communication when it comes to asking an OSS community for support, at least speaking from my own experience as someone working in tech.
Getting one or two people to actively bounce ideas off of is already a big success. The quality of OSS support is often very spotty across projects, which is understandable because people do it in their free time, which is limited (also, if the project is complex, there are often fewer people experienced with it and less total free time for support; I think this currently applies to Lemmy a lot).
With that in mind, when I come asking for support I am mostly prepared to not get any; I am prepared to have to dive into the codebase, debug, deconstruct, debug, swear, swear some more. Maybe this is just me and I have mostly had really bad luck, but I don’t know.
Should the devs/owners of any OSS project be ready to provide (some) support for their product if they want it to survive? Probably yes, and how much is enough depends on the project, you, anyone.

So

What kind of effect do you think this might have on other potential lemmy hosters?

My opinion is that currently, lemmy is simply not ready for non-tech people. (And I can’t really imagine it ever will be, unless there are a lot of people active in development who are willing to help others. At least currently there are just too many moving parts that require at least some amount of technical experience. Also, lemmy is not something like a GUI application meant to be used by non-tech people: if you want to deploy your own lemmy instance, you, the admin, are the target user of that software. I am not talking about UX/UI here.)

Also, as someone else has commented here, hosting something for myself is easy, hosting for friends is just slightly harder, but hosting something for the public, with hundreds to thousands of people, is an order of magnitude more difficult (now you need active monitoring, durable backups, …).

db0, 2 months ago

You surely noticed that I was more than prepared to get my hands dirty during this incident. 😉

When I speak about support, I don’t mean having people do it for me.

But overall you don’t seem to disagree with me that hosting lemmy is not for the non-technical. Which is what nutomic took issue with.

taaz, 2 months ago (edited 2 months ago)

But overall you don’t seem to disagree with me that hosting lemmy is not for the non-technical. Which is what nutomic took issue with.

I read it as them taking issue with you having different infra than recommended/expected, more than with lemmy (not) being non-tech friendly. (I am going to sleep right now, I will check in tomorrow, well, later today.)

kbotc, 2 months ago

Tossing stuff on the same server is not great, as I don’t want to pay for fast storage for my image store, but I do want fast storage for my DB. My web server should have extra CPU and network but is otherwise ephemeral. This is the same stuff people have been running for years and is microservices 101.

The correct thing to do here is build in tracing and profiling hooks. For example, OpenTracing lets something like Jaeger consume traces and show problems, and it would have lit this up like a Christmas tree; Pyroscope can show changes over time in where CPU goes; and logs get shuffled off into Graylog or some other centralized service for correlation.

nutomic, 2 months ago

Images can be stored in S3 so that’s not an issue. And Lemmy has some tracing logs as well as Prometheus stats, not sure if db0 tried looking into those.

db0, 2 months ago

I don’t think I’ve seen mention of these anywhere, or how to use them.

Simon, 2 months ago

This is my job, so I’ll counter that this isn’t realistic, and in a professional situation it would probably be hosted in Kubernetes, which spans multiple servers and sometimes multiple regions. I don’t think the devs have a readme for that… (or maybe they do). The point being that the official docs are geared for a hobbyist setting up a node, and not having separate VMs makes sense in that scenario. However, I would say that it’s plain that mister db0 has a much larger instance than could be considered hobbyist at this point.

db0, 2 months ago (edited 2 months ago)

The official instructions do not scale, nor do they work for all situations. But besides that, the problem is not that my bad setup caused a problem. Shit happens and I didn’t blame anyone but myself. The problem is that when a problem occurs, one has to get lucky to get support. I don’t even have to prove this. I know for a fact that there are lemmy instances that were decommissioned because they followed the default setup, ran into issues, got no support and gave up.

Edit: Also, man, from one FOSS developer to another: You really have to learn to stop the instinct to say ‘it broke because you did it wrong’. I know it feels unfair, but trust me, this is not the way.

nutomic, 2 months ago

I’m not saying you did it wrong, it’s open source so of course you can use it in any way you like. But some ways have a higher risk of breaking than others.

suppenloeffel, 2 months ago

Very interesting read, thank you!

I (self)host a lot of stuff as well as developing and deploying some of my software via docker containers and dabbled in Full-Stack territory quite a few times.

Exposing stuff to the internet still scares the shit out of me. Debugging sucks. There’s so much that can go wrong, and every layer multiplies the possibilities of stuff that can go wrong or behave in unexpected ways. Your journey describes the pain of debugging perfectly. Yeah, in hindsight, it’s often something that probably should have been checked first. But that’s hindsight for you.

And that’s not even accounting for staying ahead of the game while securing your 24/7 publicly accessible service, running on ever-changing software, with infrastructural requirements you basically have no control over. In your spare time.

Hosting something for yourself can be a lot of fun; hosting something for others, potentially many thousands of people, makes you kind of responsible. That can be rewarding and fun at times as well, but is also a prime source of headaches.

Deploying stuff is the easy part; knowing what to do when stuff inevitably breaks is where it’s at. Therefore, IMHO, it’s probably a good thing that most Lemmy admins at least know where to ask/start when shit hits the fan. This unfortunately leads to more centralization, but for good reasons: teams of volunteers taking care of fewer instances will almost always lead to a better experience than a lot of lone wolves curating a lot of small instances. Improving scalability, monitoring and documentation is always nice, but will never replace a capable admin such as yourself.

Flatworm7591, 2 months ago

Well that was an entertaining read! Thanks for all your efforts to keep our instance running smoothly. I have noticed it seems a bit snappier since you fixed the problem.

cephus, 2 months ago

Very fascinating and informative. Thank you for sharing.

Blaze, 2 months ago

Thanks for the write up!

WeirdGoesPro, 2 months ago

Thank you for your hard work.

henfredemars, 2 months ago

This reinforced in my mind that as much as I like the idea of lemmy (or any of the other threadiverse SW), this is only something experts should try hosting. Sadly, this will lead to more centralization of the lemmy community to few big servers instead of many small ones, but given the nature of problems one can encounter and the lack of support to fix them if they’re not experts, I don’t see an option.

This also gave me an insight about how the federation of lemmy will eventually break when a single server (say, lemmy.world) grows big enough to start overwhelming even servers who are not badly setup like mine was.

Lemmy has many scalability problems to solve, and not all of these problems are slow database queries. I believe your experience is going to become increasingly common as the community grows because that increased centralization will compound the scalability problems and continue to drive up the technical know-how required to host a successful instance. The software eventually needs to do more to detect and present operational problems to administrators in a friendly way. I2P is an example of a distributed network that’s quite good at reporting issues with the node.

With that said, not everything is doom and gloom. The community has proven itself highly resilient and smart people like yourself are finding solutions. It’s going to be a tough road ahead.

JoMiran, 2 months ago

That was like reading Homer’s “The Iliad”.

sunaurus, 2 months ago (edited 2 months ago)

Nice post, I enjoyed the storytelling. Glad it’s all sorted now 😁

Btw, regarding this point:

All in all, this has been a fairly frustrating experience and I can’t imagine anyone who’s not doing IT Infrastructure as their day job being able to solve this. As helpful as the other lemmy admins were, they were relying a lot on me knowing my shit around Linux, networking, docker and postgresql at the same time. I had to do extended DB analysis, fork repositories, compile docker containers from scratch and deploy them ad-hoc etc. Someone who just wants to host a lemmy server would give up way earlier than this.

I think you’re totally right, but at the same time, I think the collaborative troubleshooting that happened on Matrix (and has happened many times in the past for other issues) is pretty healthy, and not something that is always possible for other open source software.

Vilian, 2 months ago

People interested in hosting their own instance are probably already interested in Linux, or already using it, so I don’t think it’s that bad.
