On 3 Jun 2000, at 12:18, Henrik Nordstrom <hno@hem.passagen.se> wrote:
> > In this case, if we are finding that using poll() on a single FD
> > and successive read()/write() makes Squid run faster than polling
> > all open FDs together, then we could ask why we need poll() at
> > all. While thinking about it I came to a few thoughts.
>
> The benefit of poll() is that it batches together the poll part for
> multiple file descriptors far more efficiently than calling read()
> on them individually. However, with lots of file descriptors even
> poll() becomes a bottleneck. That said, it is a CPU bottleneck, and
> doing a similar amount of work in user space won't do you any good.
I'm just wondering if there might be some difference depending on
whether the CPU bottleneck happens in userland or in the kernel. If
the kernel is written with only a few shared locks, then pushing too
much work onto it may block other tasks from proceeding. In that
sense a bottleneck has less impact on the whole system if it happens
in userland.
> Didn't you see the message about comm_read/write?
Guess not.
> My idea is that poll() should only be called if it is known that
> the file descriptor is "blocking".
>
> write(client) (no more data to send)
> read(server)
> write(client) (again full buffer)
> poll(client)
> write(client)
> read(server) (partial reply only..)
> write(client) (not full buffer this time)
We'd need to address a problem with that: what if neither socket ever
blocks? Will this cause a spike of forwarding for a single session,
leaving all others to wait? True, it's supposed to be very rare, but
if a large file is pumped from a fast server to a fast client, this
can happen. All it takes is some load on the system that makes writes
and reads take longer than the network traffic needs to fill/empty
the buffers. Then by the time we are done with the write, the read
socket is ready for reading again. So we need to break up the
read/write/read/write sequence to let others speak too. Currently it
is broken after every operation, naturally, by polling. I think we
should settle somewhere in between. One way to do it is to limit the
amount of work to a few read+writes per socket in a pass: loop around
all sockets, and if there is no one left for I/O, go poll() them all
together.
> One simple way to make this happen is to have comm_read/write schedule
> for poll() when required, instead of having comm_poll() call
> comm_read/write.
Yes, this is what I mean by optimistic I/O with a fallback to poll.
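Roughly what I have in mind, as a sketch only (struct conn,
MAX_OPS_PER_PASS and the return-value convention are all invented
here, not actual Squid code):

  #include <errno.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>

  #define MAX_OPS_PER_PASS 4    /* cap on read+write rounds per socket, per pass */

  /* Hypothetical per-connection state, invented for this sketch. */
  struct conn {
      int server_fd;            /* non-blocking socket towards the server */
      int client_fd;            /* non-blocking socket towards the client */
      char buf[16384];
      size_t len;               /* bytes buffered, not yet written out */
  };

  /* Forward data optimistically.  Returns 1 if the connection blocked
   * or hit the per-pass cap and must be scheduled for poll(), 0 if it
   * is finished (EOF or error) and the caller should close it. */
  static int
  forward_pass(struct conn *c)
  {
      int ops;

      for (ops = 0; ops < MAX_OPS_PER_PASS; ops++) {
          ssize_t n, w;

          if (c->len == 0) {
              n = read(c->server_fd, c->buf, sizeof(c->buf));
              if (n > 0)
                  c->len = (size_t) n;
              else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                  return 1;     /* server blocked: poll for POLLIN later */
              else
                  return 0;     /* EOF or hard error */
          }
          w = write(c->client_fd, c->buf, c->len);
          if (w > 0) {
              memmove(c->buf, c->buf + w, c->len - (size_t) w);
              c->len -= (size_t) w;
          } else if (w < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
              return 1;         /* client blocked: poll for POLLOUT later */
          } else {
              return 0;         /* write error */
          }
      }
      return 1;                 /* cap reached: yield so other sockets get a turn */
  }

The main loop would run such a pass over every active socket and only
fall back to one big poll() once each of them has returned 1.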
>> Still, if all consumers are slower than Squid's server side, then
>> we'll quickly move all available data from the server-side buffers
>> into the client-side buffers and end up in a situation where the
>> server-side handlers can't read more data from the network but have
>> to wait for the slower client side to catch up.
> In real life this is rarely the case. Why:
> a) Most requests are for small objects which fit fully in the TCP/IP
> window or transmit buffers.
> b) Internet connectivity to many servers is poor.
I disagree. If you have 1000 dialup users at about 33.6K each, you'd
need a 10-20M international link and a pretty good backbone. This
moves the per-session bottleneck to the client side most of the time.
Persistent connections are only increasing, meaning potentially lots
of traffic over a single client socket. The actual servers can be
pretty numerous, so objects being small doesn't mean much any more.
> > list and start servicing with optimistic I/O. If we have no FDs
> > left for optimistic I/O, we can increase the poll timeout, thus
> > reducing the impact on CPU and leaving more time to "gather work
> > to do".
>
> That is the whole idea this discussion circles around.
>
> How to efficiently detect if poll is required: The previous operation
> returned partial data or EAGAIN/EWOULDBLOCK.
Sure. I'm just trying to look further. Sure, poll is required when a
socket gets EWOULDBLOCK. My guess is that we waste more CPU and gain
little performance if we poll all blocking sockets immediately: at
least one would probably be ready in under 1 msec, yet we ask the
kernel to check all of them. CPU is wasted checking all of them,
although only a small fraction become ready. We service the ready one
and poll again. In the end we are polling every 1 msec with, say, 1-3
sockets ready. But we could poll every 5 msec and get 10-15 sockets
ready instead. For that we can insert some sleep time before the
actual poll(). IMHO we can this way reduce CPU usage without really
impacting per-session performance.
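Something as simple as this would already do it (GATHER_USEC is an
invented knob, and this is not how comm_poll() works today, just a
sketch of the idea):

  #include <poll.h>
  #include <unistd.h>

  #define GATHER_USEC 5000      /* invented knob: ~5 msec of "gathering" time */

  /* Sleep a little before polling so that several descriptors have
   * time to become ready, then collect them all with one zero-timeout
   * poll() instead of one poll() per ready descriptor. */
  static int
  gathered_poll(struct pollfd *fds, nfds_t nfds)
  {
      usleep(GATHER_USEC);       /* trade a few msec of latency ...        */
      return poll(fds, nfds, 0); /* ... for more ready fds per poll() pass */
  }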
> > A somewhat similar effect should probably be seen as with disks -
> > you get better throughput using fewer ops with larger chunks
> > of data.
>
> True. However here it is quite likely more important to optimize the
> sizes of read() operations to keep a nice saw-tooth pattern in the TCP
> window sizes when congested.
Not sure what you mean. As I understand it, TCP is most efficient
when the receive buffers are empty and the transmit buffers are full.
I'd make the read size as large as possible. I don't think socket
buffers take up any considerable memory, so I'd increase them if
that helps.
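E.g. something like this on each data socket (256 KB is an arbitrary
illustration value, not a tested recommendation, and the kernel may
silently clamp it):

  #include <sys/types.h>
  #include <sys/socket.h>

  /* Enlarge the kernel socket buffers so large reads/writes have room
   * to work with. */
  static void
  grow_socket_buffers(int fd)
  {
      int size = 256 * 1024;
      (void) setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
      (void) setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
  }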
> > every slightest activity on the sockets. If we took it calmly, we
> > could possibly increase forwarding latency somewhat, but whether
> > this is critical is questionable.
>
> Partly agreed. Latency is an important issue. However, sending lots of
> small packets to a congested link won't help latency, nor will delaying
> transmission of small packets on an unused link.
I agree that large latency is an issue, but a small one? Look at it
as gathering: if we don't get additional traffic within a few msec,
we give up and send what's gathered so far. If we do get more traffic
in that time, we send more in one shot and feel efficient ;)
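A hypothetical "gathering" writer, just to illustrate (all the names
and numbers below are invented, nothing from the current code):

  #include <string.h>
  #include <sys/time.h>
  #include <unistd.h>

  #define GATHER_LIMIT 8192     /* flush when this much is buffered ...    */
  #define GATHER_MSEC  5        /* ... or when the oldest byte is this old */

  /* Invented per-connection write gatherer. */
  struct gatherer {
      char buf[GATHER_LIMIT];
      size_t len;
      struct timeval first;     /* when the oldest buffered byte arrived */
  };

  static void
  gather_write(int fd, struct gatherer *g, const char *data, size_t n)
  {
      struct timeval now;
      long age_ms;

      if (g->len == 0)
          gettimeofday(&g->first, NULL);
      if (n > sizeof(g->buf) - g->len)
          n = sizeof(g->buf) - g->len;      /* sketch: silently truncate */
      memcpy(g->buf + g->len, data, n);
      g->len += n;

      gettimeofday(&now, NULL);
      age_ms = (now.tv_sec - g->first.tv_sec) * 1000 +
               (now.tv_usec - g->first.tv_usec) / 1000;

      /* Send in one shot if the buffer is full or we've waited long
       * enough; otherwise keep gathering for a few more msec. */
      if (g->len == sizeof(g->buf) || age_ms >= GATHER_MSEC) {
          (void) write(fd, g->buf, g->len); /* sketch: ignores short writes */
          g->len = 0;
      }
  }

In real code something (the constant-rate loop discussed below, or a
timer) would of course also have to flush a partially filled buffer
once it gets too old.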
> > btw, if we poll and service sockets at a constant rate, we could
> > implement real-time traffic shaping and rate limiting.
>
> Not sure I quite follow you there. In what way can this not be done
> without constant-rate poll()?
Not sure either ;) Perhaps it's because I don't understand how it's
done right now. I'd be thankful if someone described it a bit.
With constant-rate poll it's just easier? If we limit a session to,
say, 32 Kbit/s, we have 4 KB per second, no more. I guess it's no
good for TCP/IP if we let through, say, 3 full-sized packets and then
defer transmission for 1+ seconds. It's also no good if we send
1.5 KB in one pass, then after 0.5 msec find that we can send 2 more
bytes, then after 30 msec find that we can send 120 bytes, then after
500 msec another 2000 bytes.
See, the way I understand it, without constant-rate poll (service) we
end up with either random-sized gaps between packets or random-sized
data packets, which I think are both bad for TCP/IP efficiency. We
want constant-sized packets at a constant rate. Of course, we can
arrange for that without constant-rate poll too, but it would be less
straightforward, imho.
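For illustration, with a constant service tick the shaping becomes a
simple token bucket; everything below (names, the 32 Kbit/s limit,
the chunk size) is made up for the example:

  #include <unistd.h>

  #define RATE_BPS   4096       /* 32 Kbit/s session limit = 4 KB per second */
  #define TICK_MSEC  5          /* constant service interval                 */
  #define CHUNK_SIZE 1460       /* aim for one full-sized packet per write   */

  /* Invented shaper state. */
  struct shaped {
      int fd;
      long credit;              /* bytes this session may currently send */
  };

  /* Called once per tick from the constant-rate poll loop. */
  static void
  shaper_tick(struct shaped *s, const char *buf, size_t avail)
  {
      s->credit += (long) RATE_BPS * TICK_MSEC / 1000;  /* ~20 bytes per tick */
      if (s->credit > 4 * CHUNK_SIZE)
          s->credit = 4 * CHUNK_SIZE;     /* cap stored credit: no big bursts */
      if (s->credit >= CHUNK_SIZE && avail >= CHUNK_SIZE) {
          if (write(s->fd, buf, CHUNK_SIZE) == CHUNK_SIZE)
              s->credit -= CHUNK_SIZE;
      }
  }

At roughly 4 KB/s that works out to one 1460-byte write about every
365 msec: constant-sized packets at a constant rate.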
> > Also, deferring the actual I/O would have much less impact, as
> > we'd need to re-evaluate the deferral state after a constant
> > interval in the future.
>
> Deferral based on buffer overflows does not need to be re-evaluated. It
> will be "immediately" known when the situation has ended.
Yes, but deferral based on rate limiting is tied to time, and that
will need to be re-evaluated.
> > Say we define that we poll all sockets every 5 msec. We start up
> > the loop, note the sub-second time, poll all sockets with a zero
> > timeout, service all that's ready, note the sub-second time again,
> > find the amount of time left before the next poll should be done,
> > and sleep for that time, then close the loop on the next
> > iteration. The timeout of the loop itself we measure separately.
>
> As the amount of I/O builds up the processing time will soon approach
> the poll interval.
This means that the delay-before-poll time reaches zero and Squid is
saturated. I'd stop accepting more requests in that case, for example.
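In rough code, the loop described above would look something like
this (service_ready() and the tick length are placeholders, not
existing functions or tuned values):

  #include <poll.h>
  #include <sys/time.h>
  #include <unistd.h>

  #define TICK_USEC 5000        /* poll all sockets every 5 msec */

  /* Placeholder for whatever services the descriptors poll() marked
   * as ready. */
  extern void service_ready(struct pollfd *fds, nfds_t nfds);

  /* Sketch of the constant-rate loop: zero-timeout poll, service what
   * is ready, then sleep away whatever is left of the tick. */
  static void
  constant_rate_loop(struct pollfd *fds, nfds_t nfds)
  {
      for (;;) {
          struct timeval start, end;
          long spent_usec;

          gettimeofday(&start, NULL);
          if (poll(fds, nfds, 0) > 0)
              service_ready(fds, nfds);
          gettimeofday(&end, NULL);

          spent_usec = (end.tv_sec - start.tv_sec) * 1000000L +
                       (end.tv_usec - start.tv_usec);
          if (spent_usec < TICK_USEC)
              usleep(TICK_USEC - spent_usec);
          /* else: servicing ate the whole tick, so Squid is saturated;
           * e.g. stop accept()ing new requests until there is slack
           * again. */
      }
  }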
------------------------------------
Andres Kroonmaa <andre@online.ee>
Network Development Manager
Delfi Online
Tel: 6501 731, Fax: 6501 708
Pärnu mnt. 158, Tallinn,
11317 Estonia