I actually have some post-mortem on this now. The problem was/is
essentially cycle-starvation.
  (SQUID) <-> (REDIRECTORS)    <-> (DBCACHE)
          <-> (AUTHENTICATORS) <-> (DBCACHE)
The redirectors and authenticators have internal caching, and spend the
majority of their lives CPU-bound when things get busy. The dbcache and
squid aren't always CPU-bound, presumably waiting sometimes on network
data and filesystem I/O.
The redirectors and authenticators get their process priorities pushed
down, because they're hoggy. Squid stays right up there, able to accept
more inbound requests. The dbcache initially stays high, until looping
around waiting on non-blocking sockets back to the redirectors and
authenticators causes it to start getting penalised by the scheduler
too (its front end does not use a select()/poll() model, alas).
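For reference, here's roughly the difference I mean (just a sketch with
made-up names, not the actual dbcache front end): the spinning version
burns CPU on EAGAIN until the scheduler decides it's a hog, while the
select() version sleeps in the kernel until the socket is ready.

    /* Sketch only. read_spin() busy-polls a non-blocking fd and
     * chews cycles; read_select() blocks until data is available. */
    #include <sys/select.h>
    #include <errno.h>
    #include <unistd.h>

    /* Busy-poll: loops flat-out until the read succeeds or fails hard. */
    ssize_t read_spin(int fd, char *buf, size_t len)
    {
        for (;;) {
            ssize_t n = read(fd, buf, len);
            if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
                return n;
            /* EAGAIN: nothing there yet, spin around again */
        }
    }

    /* select()-based: sleep in the kernel until fd is readable. */
    ssize_t read_select(int fd, char *buf, size_t len)
    {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0)
            return -1;
        return read(fd, buf, len);
    }

Retrofitting the dbcache front end to the second model is probably the
"right" fix, but it's a much bigger job than the sched_yield() experiment
below.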
If this were a web server, we'd be seeing a sort of sinusoidal behaviour
in capacity handling, but since everything's double-ended for
data sources, it looks like we get asymptotic behaviour instead [Err.. I
think. Insufficient caffeine in my blood at this point. Don't shoot me
if I misapplied the term].
My thoughts earlier about sched_yield() would seem to be good ones, in
this scenario. I'm going to try experimental versions to see. In this
setup, if there's a CPU famine, everything should be starving more or
less equally (actually, the redirectors and authenticators should get a
little more in the way of cycles, since they are synchronous).
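Something like this is what I have in mind for the experimental helpers.
It's only a sketch; match_acl() and rewrite_url() are made-up stand-ins
for the real CPU-heavy routines in the redirector:

    /* Hypothetical helper request handler. The yield halfway through
     * gives squid/dbcache a shot at the CPU without the helper
     * sleeping outright. */
    #include <sched.h>
    #include <stddef.h>

    extern int  match_acl(const char *url);                       /* first CPU-bound chunk */
    extern void rewrite_url(const char *in, char *out, size_t len); /* second CPU-bound chunk */

    void handle_request(const char *url, char *out, size_t len)
    {
        int hit = match_acl(url);

        sched_yield();          /* let squid/dbcache run if they're runnable */

        if (hit)
            rewrite_url(url, out, len);
    }

If the CPU is genuinely saturated, the yield costs the helper almost
nothing (it goes straight back on the run queue); if the other processes
are runnable, they get a turn before the helper's second chunk.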
Ah, it's all a wonderful big learning experience. I've jotted some huge
polynomials out in an effort to describe the behaviour, but it's far FAR
beyond my meagre skills in that area.
D
Clifton Royston wrote:
>
> On Wed, Oct 20, 1999 at 02:38:19AM +0000, Dancer wrote:
> ...
> > Yesterday, we _did_ manage to jam one of them up, but good...the
> > incoming request rate was on the order of 90-100/second, and the
> > service-times shot up, and things began to bog down. Still doing
> > post-mortem analysis on the logs, but it looks like what happened was
> > that (with the incoming request rate exceeding the handling capacity)
> > the number of simultaneous connections climbed, causing further
> > overhead, and shortchanged the filtering and authentication processes
> > (who are large CPU consumers).
>
> My experience is that this kind of phenomenon can happen with pretty
> much any connection-oriented IP service - or more generally, any
> transaction-processing type application on nearly any OS - when the OS
> and hardware are simply pushed to the limits of what they can handle.
> You see a phenomenon where the response time slows fairly linearly and
> continuously until it hits some critical level and then falls right off
> a cliff.
>
> Right now we're going through similar issues on one of our main
> servers, which handles both a lot of mail and many web servers. Similar
> story - once it gets past a certain point, mail starts stacking up,
> more and more HTTP connects start piling up, and it totally thrashes
> until the connection rate drops enough.
>
> It's a testimony to BSD UNIX that I've seen it go through loads of
> around 400 and come back to normal without crashing. Most OSes I've
> worked with over the years simply crash and burn at that point.
>
> > I'm not yet quite sure how to avoid this, but I have some ideas about
> > having the helper apps call sched_yield() about halfway through certain
> > CPU-bound routines, to help share the cycles...I might be talking crazy,
> > though. Some experiments are in order.
>
> The only real solution I know of once you hit this kind of point is
> to start spreading out the load more, or, in some cases, total
> application redesign. You can usually tweak a bit more performance out
> of even well-designed apps, but it'll only get you so far and it's not
> repeatable.
>
> In our case, we're trying to split the workload off onto several
> other servers, hopefully without breaking anything in the process.
>
> -- Clifton
>
> --
> Clifton Royston -- LavaNet Systems Architect -- cliftonr@lava.net
> "An absolute monarch would be absolutely wise and good.
> But no man is strong enough to have no interest.
> Therefore the best king would be Pure Chance.
> It is Pure Chance that rules the Universe;
> therefore, and only therefore, life is good." - AC