On 31 May 2000, at 11:13, Henrik Nordstrom <hno@hem.passagen.se> wrote:
> > Since this is part of the commloops development, I'm going to
> > start removing these and replacing them with suitable interfaces.
> > Can anyone think of a reason to keep deferred reads?
>
> Not if you have properly scheduled read/writes like we have discussed
> before a couple of times (See for example
> http://www.squid-cache.org/mail-archive/squid-dev/199911/0049.html).
These posts made me think about why and how polling is actually
useful. ...Sorry for a long rant, I sort of tend to think aloud...
and I say it more to learn than anything else, so if I'm talking
rubbish, beat me. ;)
What does poll() as an OS function really do? As I understand it,
all it does is find the file structure related to the FD, test
whether the buffers are filled or empty, and return that fact. All
of this is a fast memory-traversal task. What does read() do? A lot
of the same work, i.e. it has to find the file structure related to
the FD, test whether there is any data ready or buffer space free,
and either return with EAGAIN or block, or copy data to/from the
buffers and update the pointers (sure, it does a lot more, but it
also does those tests first).
So, in my understanding, the OS read() is really poll()+_read()
anyway. In that case, if we find that using poll() on a single FD
followed by read()/write() makes Squid run faster than polling all
open FDs together, then we could ask why we need the poll() at all.
While thinking about it I came to a few thoughts.
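To make the point concrete, here is a minimal sketch (plain
illustration only, not Squid code) of touching a non-blocking FD
without polling it first; read() itself answers the readiness
question through its return value and EAGAIN:

    #include <errno.h>
    #include <unistd.h>

    /* Illustration: on a non-blocking fd, read() performs the same
     * readiness test that poll() would, and reports the result. */
    static void
    optimistic_read(int fd)
    {
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            /* data was ready and has already been copied out */
        } else if (n == 0) {
            /* peer closed the connection */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* not ready: the same answer a prior poll() would give */
        } else {
            /* real error (ECONNRESET etc.) */
        }
    }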
> A typical call pattern for a data forwarding operation today is
> basically:
>
> 0. poll the receiving socket for read
> 1. Read from the receiving socket
> 2. Mark the sending socket for poll
> 3. poll the sending socket for write
> 4. write to the sending socket
> 5. Mark the receiving socket for poll
> 6. poll the receiving socket for read
> [now back to 1]
Perhaps we should loop in commSelect and optimistically launch the
reader handlers, followed by the related writer handlers? We skip
poll entirely: we just try to read from the origin server, and if
that succeeds, we try to write to the client. If that also succeeds,
we have a very fast forwarding of data; if it does not succeed, we
just skip to the next receiving FD and come back to this one on the
next loop run.
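Roughly, one such optimistic forwarding step might look like the
sketch below (server_fd, client_fd and the buffer size are made-up
illustration, not Squid's code; the real thing would have to keep
any unwritten remainder in squid's own buffers rather than drop it):

    #include <errno.h>
    #include <unistd.h>

    /* One optimistic forwarding step, no poll() involved.  Both fds
     * are assumed non-blocking.  Returns bytes forwarded, 0 if
     * nothing could be moved right now, -1 on a real write error. */
    static ssize_t
    forward_once(int server_fd, int client_fd)
    {
        char buf[8192];
        ssize_t written = 0;
        ssize_t n = read(server_fd, buf, sizeof(buf));

        if (n <= 0)
            return 0;   /* EAGAIN, EOF or error: skip to next FD */

        while (written < n) {
            ssize_t w = write(client_fd, buf + written, n - written);
            if (w < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    break;      /* client is slower: defer the rest */
                return -1;      /* real write error */
            }
            written += w;
        }
        return written;
    }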
Well, this could result in one hell of a fast proxy, but obviously
with 100% CPU usage at all times. If it is a dedicated box, that
might even be OK; if it is not dedicated, we could run the squid
process under "nice".
Other than "not elegant", are there any more important reasons not
to do so?
Obviously, we burn CPU uselessly most of the time. But does such
waste have any additional bad side effects that would hurt
performance?
For one thing, context switches skyrocket. How much overhead does
that add in reality? Does a high context-switch rate erode usable
CPU time measurably (with modern CPUs)?
Then, we probably can't hope that disk io returns with EAGAIN and
no delay. In most cases the process is blocked for the duration of
the real disk io. This means that network io is handled only between
disk io operations, and total squid throughput becomes directly tied
to the amount of disk io (even for sessions that don't need disk io).
That could be a needless limiting factor. If we move disk io to
async-io with either threads or helper processes, squid will have
more time to service sessions with network-only traffic.
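As a rough illustration of the thread variant (names and structure
made up here, this is not the existing async-io code), a worker
thread could take the blocking read and signal completion through a
pipe, which the main network loop can then watch like any other FD:

    #include <pthread.h>
    #include <unistd.h>

    /* A disk read handed to a worker thread; the network loop never
     * blocks on disk, it only sees the pipe fd become readable. */
    struct disk_req {
        int fd;             /* disk file to read from */
        int notify_fd;      /* write end of a pipe the main loop polls */
        char buf[8192];
        ssize_t result;
    };

    static void *
    disk_worker(void *arg)
    {
        struct disk_req *req = arg;
        char done = 1;

        req->result = read(req->fd, req->buf, sizeof(req->buf));
        (void)write(req->notify_fd, &done, 1);  /* wake the main loop */
        return NULL;
    }

    /* usage sketch: pipe(p); pthread_create(&tid, NULL, disk_worker, req);
     * then include p[0] in the set of FDs the comm loop polls. */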
Still, if all consumers are slower than squid's server side, then
we'll quickly move all available data from server-side buffers into
client-side buffers and end up in a situation where the server-side
handlers can't read more data from the network but have to wait for
the slower client side to catch up. So we'd like some sort of
internal blocking on write (to buffers) inside squid (defer) until
the slower parts catch up. Also, if the server side is slow, there
seems to be no point in trying to read from the socket a thousand
times only to get EAGAIN. So we'd like to move such slow sockets
into some sort of waiting state.
If we have no sockets ready for io at all, we don't want to burn CPU
uselessly; we'd rather relinquish the CPU to other processes, yet we
still want to be notified as soon as we have work to do.
poll() is the obvious choice for that.
We could build the commSelect loop not around poll, but around
optimistic io with a fallback to poll. Suppose we start servicing
sockets with optimistic io, and if we detect EAGAIN a few times in a
row, we add that FD to a polling list. At the start of each loop, we
poll this list with a zero timeout to see whether any FDs are coming
out of the waiting state. Those that are ready for io we take off
the polling list and start servicing optimistically again. If we
have no FDs left for optimistic io, we can increase the poll
timeout, reducing the impact on CPU and leaving more time to
"gather work to do".
Basically, we should classify sockets into busy and idle ones, omit
polling for sockets known to be busy, and poll the idle sockets
together with those that were busy until recently.
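A minimal sketch of the bookkeeping such a split might need (the
struct, names and the EAGAIN limit are made up for illustration, not
existing Squid code):

    #define EAGAIN_LIMIT 3  /* consecutive EAGAINs before an FD is "idle" */

    /* Per-FD state for the busy/idle classification described above. */
    struct fd_state {
        int fd;
        int eagain_count;   /* consecutive EAGAINs from optimistic io */
        int idle;           /* 1 = on the poll list, 0 = optimistic io */
    };

    /* After each optimistic read/write attempt, reclassify the FD. */
    static void
    reclassify(struct fd_state *s, int got_eagain)
    {
        if (got_eagain) {
            if (++s->eagain_count >= EAGAIN_LIMIT)
                s->idle = 1;    /* stop hammering it, poll it instead */
        } else {
            s->eagain_count = 0;
            s->idle = 0;        /* back onto the optimistic list */
        }
    }

    /* At the top of each loop: poll only the idle set, with a zero
     * timeout (or a longer one when the optimistic list is empty). */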
We'll end up with pretty much the same implementation as we have now,
but somewhat more complex. Is there any point in doing so? Would we
gain anything compared to the current design?
I think that by omitting the poll of busy sockets we can reduce the
load on the kernel, which has to check and return the status of each
and every socket passed to poll(), especially when there are a few
busy sockets and lots of idle ones. I believe that a poll of a
single FD or a small subset of FDs is faster because the kernel has
to find and traverse far fewer file structures and critical sections
to determine whether a socket is ready. On the other hand, if we
simply split the pollable sockets into groups and polled them in
sequence, the end result shouldn't perform much differently from
what we have now.
1. If we know a socket is busy and probably ready for io, we want
to avoid polling all the other idle sockets just to make sure.
2. We should resort to poll only if we are idle, or if we think we
last rechecked the other idle sockets too long ago.
In other words, we should try to avoid polls that return immediately
with very few ready sockets. We would like poll to return with many
sockets ready.
Perhaps there is even a point in adding an artificial few-mSec
sleep before polling all sockets in the current design. The
reasoning would be to let the OS gather incoming traffic from the
network and flush outgoing traffic from its buffers, so that more
sockets are ready at the next poll and data moves in larger chunks.
The effect should be much like what we see with disks: you get
better throughput by doing fewer ops on larger chunks of data.
If I read the "Server-side network read() size histograms" in
cachemgr right, over 60% of reads from the network are under 2KB.
It seems there is no need for much larger socket buffers in squid.
At the same time we know that to get decent performance the tcp
stack should be able to accept windows of 64K or more (for SAT links
much more), and a real squid cache would most probably be tuned that
way. So we can quite safely assume that the tcp stack can buffer at
least 32K without any trouble. Similarly, on the client side squid
would be tuned to buffer up to 64K of data in the tcp stack. So we
really don't need to pump data a few bytes at a time; we could
reduce the rate and increase the amount of data pumped each time.
A quick look at tcpdump output suggests that most sessions proceed
a few packets at a time, only 0.5-5 mSec apart, before a tcp ACK is
awaited. This means we can decide either to handle every packet as
soon as it arrives, or to "wait" a little until a bunch has arrived
and then handle them together. The current code seems to do it ASAP,
meaning that we jump back and forth for every slightest activity on
the sockets. If we took it calmly, we would possibly increase
forwarding latency somewhat, but whether that is critical is
questionable.
What kind of bad impact could it have if we polled all sockets no
more often than, say, every 2-5 mSec? Given that we'd flush up to
4-32K in one go, bandwidth for large objects isn't a problem. For
small objects we might increase latency from 1 mSec to 6; is that
detectable by a human at a browser?
Btw, if we polled and serviced sockets at a constant rate, we could
implement real-time traffic shaping and rate limiting. Also,
deferring actual io would have much less impact, as we'd only need
to reevaluate the deferral state after a constant interval in the
future.
Say we decide to poll all sockets every 5 mSec. We start the loop,
note the sub-second time, poll all sockets with a zero timeout,
service everything that's ready, note the sub-second time again,
work out how much time is left before the next poll is due, and
sleep for that long, then close the loop on the next iteration. The
timing of the loop itself we measure separately.
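A rough sketch of such a fixed-rate loop (service_ready_fds() is
just a placeholder for the existing handler dispatch, and the 5 mSec
figure is the assumption from above):

    #include <poll.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define LOOP_INTERVAL_USEC 5000     /* service sockets every 5 mSec */

    static void
    fixed_rate_loop(struct pollfd *fds, int nfds)
    {
        for (;;) {
            struct timeval start, end;
            long spent;

            gettimeofday(&start, NULL);
            poll(fds, nfds, 0);     /* zero timeout: just collect status */
            /* service_ready_fds(fds, nfds);  -- run the ready handlers */
            gettimeofday(&end, NULL);

            spent = (end.tv_sec - start.tv_sec) * 1000000L
                  + (end.tv_usec - start.tv_usec);
            if (spent < LOOP_INTERVAL_USEC)
                usleep(LOOP_INTERVAL_USEC - spent); /* sleep the remainder */
        }
    }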
?
------------------------------------
Andres Kroonmaa <andre@online.ee>
Network Development Manager
Delfi Online
Tel: 6501 731, Fax: 6501 708
Pärnu mnt. 158, Tallinn,
11317 Estonia