RE: Cacheoff results published. from Andres Kroonmaa on 2000-10-12 (squid-dev)

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Fri, 13 Oct 2000 01:23:47 +0200

On 12 Oct 2000, at 13:52, Chemolli Francesco (USI) <ChemolliF@GruppoCredit.it> wrote:

> > Also worth noting is that the simulated environment includes delays
> > that cause file descriptors to run at well over 1000...I think our box
> > was topping 2000 in the peak phases.
>
> That could be an important factor. I heard that poll is one of the biggest
> CPU hogs in squid. The bigger the FDset, the more it hogs.
> This is why I explicitly specified my FD usage info.

I think this is a misconception. poll() itself is not the biggest CPU hog in
squid. That said, for quite some time squid has had a bug that caused it to
poll the incoming DNS socket after every operation on a normal socket,
burning quite a lot of CPU for nothing. It seems that 2.4Devel4 has this
issue.

Squid spends a lot of time in poll() for many reasons, but mostly it is
waiting for I/O, or for the OS to do I/O on squid's behalf. A lot of work is
done by the system during squid's CPU time even if a second CPU is idle, but
that is an issue of SMP scaling, not poll overhead. Usually what people mean
by poll hogging CPU is the pure handling and parsing of a large FD set, and
this is wrong. The pure overhead of poll reaches some 10% of system CPU
(depending on poll frequency, obviously; here I assume 100 times/sec) only
after reaching some 1000-2000 open files (depending on CPU speed).
Of course, time spent in poll is lost to squid, and this can be solved with
some sort of async notification or a separate thread, moving the work the OS
does for squid "into the background" and leaving more CPU time available for
squid itself, but this is a matter of SMP scaling again.
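
One rough way to check what scanning a large, idle FD set costs is to time
poll() directly. The following is only a sketch in Python, not squid's C
code: pipes stand in for idle sockets, and the helper name time_poll is made
up for illustration. Whatever numbers it prints depend on the machine and
say nothing about squid itself.

```python
import os
import select
import time


def time_poll(num_fds, rounds=100):
    """Return the average seconds per poll() call over num_fds idle FDs."""
    # Each pipe contributes one idle read end; nothing is ever written.
    pipes = [os.pipe() for _ in range(num_fds)]
    poller = select.poll()
    for read_end, _write_end in pipes:
        poller.register(read_end, select.POLLIN)
    try:
        start = time.perf_counter()
        for _ in range(rounds):
            poller.poll(0)  # timeout 0: scan the set and return at once
        elapsed = time.perf_counter() - start
    finally:
        for read_end, write_end in pipes:
            os.close(read_end)
            os.close(write_end)
    return elapsed / rounds


if __name__ == "__main__":
    for n in (10, 100, 400):
        print(n, "FDs:", time_poll(n), "sec per poll")
```

Each pipe uses two descriptors, so going to thousands of FDs will probably
need a raised open-files limit first.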

All this is quite a non-issue under high loads, because poll is called only
when there are no serviceable FDs left. As load increases, poll frequency
drops, so the overhead of poll diminishes. poll overhead is worst when there
are thousands of FDs open and only one at a time becomes ready. Then the
poll rate skyrockets. Under realistic and high loads most time is spent
servicing ready sockets, and poll is called at quite a low rate.
In fact, dropping poll in favour of async notification could result in more
overhead under such high loads than poll currently adds.
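
The relation between load and poll rate can be shown with a toy model. This
is a sketch under a simplifying assumption (every poll returns the same
number of ready FDs), not squid's actual event loop, and polls_needed is a
name invented for the example.

```python
def polls_needed(total_events, ready_per_poll):
    """Count poll calls when poll runs only after the ready list is drained."""
    polls = 0
    remaining = total_events
    while remaining > 0:
        polls += 1  # one poll call hands back a batch of ready FDs
        remaining -= min(ready_per_poll, remaining)  # service the whole batch
    return polls


print(polls_needed(1000, 1))   # one FD ready at a time: prints 1000
print(polls_needed(1000, 50))  # 50 FDs ready per poll: prints 20
```

With one FD ready per call, every serviced event costs a full scan of the
FD set; with 50 ready per call, the same scan is shared among 50 events.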

For a relatively idle Squid one of the biggest CPU hogs is the preparation of
FDs for poll, especially Defer checking. With high-res timing I see that it
consistently takes at least 5-6 times more CPU time to prepare the FD array
for poll than it takes poll to update the FD set states and return if any FD
is ready. This shows that poll's own overhead is much less than that of
preparing for poll. This preparation takes up to 15% of total system CPU time
(under my load patterns) on average. And this overhead grows with the number
of open files, faster than the overhead of poll does. But as it is called
exactly as often as poll, its total overhead diminishes under high loads just
as poll's does.
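
To make the preparation step concrete, here is a minimal sketch of what
building the poll array with a per-FD defer check might look like. The table
layout, the key names and build_poll_list are hypothetical and chosen for
illustration; this is not squid's fd table or its comm code.

```python
import select


def build_poll_list(fd_table):
    """Walk every open FD and decide whether it goes into the poll array.

    fd_table maps fd -> dict with optional 'read_handler', 'write_handler'
    and 'defer_check' entries (a hypothetical layout for this sketch).
    """
    poll_list = []
    for fd, entry in fd_table.items():
        events = 0
        if entry.get("read_handler") is not None:
            defer = entry.get("defer_check")
            # The defer check is a callback run once per FD per poll round,
            # so its cost grows with the number of open files.
            if defer is None or not defer(fd):
                events |= select.POLLIN
        if entry.get("write_handler") is not None:
            events |= select.POLLOUT
        if events:
            poll_list.append((fd, events))
    return poll_list
```

The whole walk is repeated before every poll call, which is why this cost
follows the poll rate.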

The most CPU time goes to handling reads from the network, which most
probably has to do with parsing headers, and also to time spent in the
system. This could also be offloaded somewhat, but a lot of effort should go
into optimising this part of squid.
Under stress tests with some 3000 concurrent sessions I see that handling
network reads takes up to 50% of system CPU and handling writes up to 20%.
The overhead of each poll call goes up, but total CPU time spent in poll is
reduced because of the lower poll rate, as also happens with the FD set
preparation overhead.
Currently, squid needs to poll between a read and a write for the same
client. When this is redesigned, the poll rate will drop even more, leaving
more CPU for handling network reads.

The next biggest CPU hog seems to be ACL matching. With some 30 ACLs in total
on my box, ACL matching seems to take up to 10% of total CPU on the system.
As these are not very complex ACLs, this seems quite excessive. Under stress
tests ACL checks showed up to 20% of total system CPU.
While looking into it I noticed that for some reason the same ACLs are
evaluated multiple times over. It seems to happen when squid must resolve
DNS for the request to proceed. What is weird is that urlpath_regex type ACLs
are reevaluated many times (I've seen up to 9-11 times per regex ACL per
request).
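
One possible way to avoid rerunning the same regex for the same request is
to remember the result per request. The sketch below shows plain
memoization; it is an assumption about how this could be done, not how squid
handles ACLs, and the Request and RegexAcl classes are invented for the
example.

```python
import re


class Request:
    def __init__(self, urlpath):
        self.urlpath = urlpath
        self.acl_cache = {}  # ACL name -> bool, valid for this request only


class RegexAcl:
    def __init__(self, name, pattern):
        self.name = name
        self.regex = re.compile(pattern)
        self.evaluations = 0  # counts real regex runs, for illustration

    def match(self, request):
        cached = request.acl_cache.get(self.name)
        if cached is not None:
            return cached  # repeated check: no regex work at all
        self.evaluations += 1
        result = self.regex.search(request.urlpath) is not None
        request.acl_cache[self.name] = result
        return result


acl = RegexAcl("cgi", r"^/cgi-bin/")
req = Request("/cgi-bin/search")
for _ in range(10):
    acl.match(req)
print(acl.evaluations)  # prints 1
```

This is only safe for ACLs whose answer cannot change while the request is
in progress. The URL path qualifies; an ACL that depends on a pending DNS
answer would not.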

Quite amazing is the number of memory allocs/frees per request. Currently at
about 120 allocs/request and 120 frees/request, the alloc/free rate can very
easily go into the high thousands per second. (How about 50000 mallocs/sec?)
Surprisingly, it doesn't have a very high impact on CPU; I've seen up to 5%
of total system CPU with 50K mallocs/sec. Yet this is definitely burning
a lot of CPU besides the mallocs themselves, and can also become a limiting
factor under very high loads. So optimisations to reduce memory allocation
and to speed up release should be undertaken.
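
To put the two figures side by side, the request rate implied by 50000
mallocs/sec at about 120 allocs per request is simple division. The sketch
below only restates that arithmetic; the function name is made up.

```python
ALLOCS_PER_REQUEST = 120   # figure quoted above
MALLOCS_PER_SEC = 50_000   # figure quoted above


def requests_per_sec(mallocs_per_sec, allocs_per_request):
    """Request rate at which the given malloc rate would be reached."""
    return mallocs_per_sec / allocs_per_request


rate = requests_per_sec(MALLOCS_PER_SEC, ALLOCS_PER_REQUEST)
print(f"{rate:.0f} requests/sec")  # prints "417 requests/sec"
```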

That's what I observe on squid 2.3 with async-io.

------------------------------------
Andres Kroonmaa <andre@online.ee>
Delfi Online
Tel: 6501 731, Fax: 6501 708
Pärnu mnt. 158, Tallinn,
11317 Estonia
Received on Thu Oct 12 2000 - 17:27:19 MDT
