steve,
@steve@social.technoetic.com avatar

If you were measuring performance of servers (let’s say, delivery latency under load and active user scalability, to start) how would you propose doing it? What else would you measure? Are there any existing benchmarks?

VolatileDream,
@VolatileDream@adulthood.lol avatar

@steve For delivery latency, I'd measure from when the server finishes processing the request that creates the item to when the last follower server has received all the request bytes. I'd look specifically for implementations that don't scale up properly with shared inbox delivery, e.g. with 10,000 followers on the same instance, and for implementations that don't parallelize their deliveries, e.g. serial delivery, where one server can time out and delay the other deliveries.
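
To make the parallel-vs-serial point concrete, here is a minimal sketch of concurrent fan-out with per-inbox timing, assuming you already have the follower inbox URLs and a signed activity payload. The function names and the 10-second timeout are illustrative assumptions, not taken from any real implementation:

```python
# Hypothetical sketch: parallel delivery to follower inboxes with per-inbox timing.
import asyncio
import time
import aiohttp

async def deliver(session: aiohttp.ClientSession, inbox: str, payload: bytes) -> float:
    """POST the activity to one inbox and return the wall-clock latency in seconds."""
    start = time.monotonic()
    try:
        async with session.post(inbox, data=payload,
                                headers={"Content-Type": "application/activity+json"},
                                timeout=aiohttp.ClientTimeout(total=10)) as resp:
            await resp.read()  # wait until all response bytes arrive
    except (aiohttp.ClientError, asyncio.TimeoutError):
        pass  # a slow or dead peer costs only its own slot, not the whole batch
    return time.monotonic() - start

async def deliver_all(inboxes: list[str], payload: bytes) -> list[float]:
    """Fan out all deliveries concurrently; one timing-out server can't stall the rest."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(deliver(session, u, payload) for u in inboxes))
```

With serial delivery, the same timeout would instead add up to 10 seconds of delay for every follower queued behind the misbehaving server.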

nikclayton,
@nikclayton@mastodon.social avatar

@VolatileDream @steve "Time to last follower server receiving the bytes" puts you at the mercy of their availability.

Time-to-first-delivery-attempt might be better, coupled with time-between-retry-attempts.

Better still, start at the other end of the problem and write user-centric SLAs focused on the experience you want users to have, and derive metrics that can capture that.
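
For instance, a user-centric SLO might be "99% of posts reach followers within 30 seconds of being created", and the metric falls out of timestamps the server already has. A hypothetical sketch (the record fields and the 30-second threshold are assumptions for illustration):

```python
# Illustrative sketch of deriving a metric from a user-centric SLO.
from dataclasses import dataclass

@dataclass
class DeliveryRecord:
    created_at: float            # when the user's post was accepted
    first_attempt_at: float      # time-to-first-delivery-attempt starts here
    delivered_at: float | None   # None if no successful delivery yet

def slo_compliance(records: list[DeliveryRecord], threshold_s: float = 30.0) -> float:
    """Fraction of deliveries that landed within the threshold; compare against 0.99."""
    if not records:
        return 1.0
    ok = sum(1 for r in records
             if r.delivered_at is not None
             and r.delivered_at - r.created_at <= threshold_s)
    return ok / len(records)
```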

VolatileDream,
@VolatileDream@adulthood.lol avatar

@nikclayton @steve Framing it as user-centric SLAs/SLOs ++

I interpreted the request as a protocol-level test, where an implementation can be scored on how it handles another implementation's unavailability or latency, with the goal being to handle it in a reliable and performant manner. I oversimplified in communicating the idea the first time. Oops.

steve,
@steve@social.technoetic.com avatar

@VolatileDream @nikclayton Yes, I think both incoming delivery latency (from request to the post being available to view) and outgoing delivery performance could be interesting. Outgoing delivery failure handling isn't exactly performance per se, but it could be important: I think some Mastodon attacks have exploited a server's inability to handle (intentionally) misbehaving peers.

nikclayton,
@nikclayton@mastodon.social avatar

@steve @VolatileDream that's why time-between-retry-attempts is important.

Say you set the goal to be "a failed delivery is retried at least every 60 minutes". That's the metric you alert on, not, e.g., size-of-outbound-queue: the outbound queue might be large, but if you're still processing it frequently enough to meet the every-60-minutes goal, that doesn't matter (from a user performance perspective).
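
Concretely, the alerting check could be as simple as the following sketch. The queue entry fields and the "failed" status value are assumptions for illustration, since real implementations store this state differently:

```python
# Minimal sketch: flag failed deliveries whose last retry is older than the SLO,
# regardless of how large the outbound queue is.
import time

RETRY_SLO_SECONDS = 60 * 60  # "a failed delivery is retried at least every 60 minutes"

def stale_deliveries(queue: list[dict]) -> list[dict]:
    """Return queue entries violating the retry SLO; alert if this is non-empty."""
    now = time.time()
    return [item for item in queue
            if item["status"] == "failed"
            and now - item["last_attempt_at"] > RETRY_SLO_SECONDS]
```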

nikclayton,
@nikclayton@mastodon.social avatar

@steve @VolatileDream To put it another way, "internal" metrics like queue size, database IOPS, memory use, etc., can help with troubleshooting when you know there's an active problem impacting the quality of service users are experiencing.

But they don't tell you whether there's a user facing problem in the first place. For that you need metrics that capture the aspects of the user experience that you care about.
