On 25/01/2013 3:06 p.m., Alex Rousskov wrote:
> Hello,
>
> The attached patch fixes several ConnOpener problems by relying on
> AsyncJob protections while maintaining a tighter grip on various I/O and
> sleep states. It is in PREVIEW state because I would like to do more
> testing, but it did pass basic tests, and I am not currently aware of
> serious problems with the patch code.
>
> I started with Rainer Weikusat's timeout polishing patch posted
> yesterday, but all bugs are mine.
>
>
> Here are some of the addressed problems:
>
> * Connection descriptor was not closed when attempting to reconnect
> after failures. We now properly close on failures, sleep with descriptor
> closed, and then reopen.
>
> * Timeout handler was not cleaned up properly in some cases, causing
> memory leaks (for the handler Pointer) and possibly timeouts that were
> fired (for then-active handler), after the connection was passed to the
> initiator.
>
> * Comm close handler was not cleaned up properly.
>
> * Connection timeout was enforced for each connection attempt instead of
> all attempts together.
>
> and possibly other problems. The full extent of all side-effects of
> mishandled race conditions and state conflicts is probably unknown.
>
>
> TODO: Needs more testing, especially around corner cases.
> Does somebody need more specific callback cancellation reasons?
> Consider calling comm_close instead of direct write_data cleanup.
> Make connect_timeout documentation in squid.conf less ambiguous.
> Move prevalent conn_ debugging to the status() method?
> Polish Comm timeout handling to always reset .timeout on callback?
> Consider revising eventDelete() to delete between-I/O sleep
> timeout.
>
> Feedback welcomed.
NP: This is way beyond the simple fixes Ranier was working on. The
changes here relys on code behaviour which will limit the patch to trunk
or 3.3. I was a bit borderline on the earlier on the size of Raniers
patches, but this is going over the change amount I'm comfortable
porting to the stable branch with a beta cycle coming to an end.
Auditing anyway:
* You are still making comments about what "Comm" should do (XXX: Comm
should!). ConnOpener *is* "Comm" at this point in the transaction. If
"Comm" needs to do anything then it is *this* objects responsibility
scope to see that it happens. If thre is a *simple* helper function
elsewhere in comm_*() or Comm:: or fd_*() which can help so be it, but
this object *is* Comm and needs to peform the "Comm should do X"
operations as related to state opening an FD.
* It was probably done beforehand, but it much clearer that it happens
now that the sleep()/DelayedRetry mechanism leaks Pointer() as well as
the InProgress mechanism.
+++ IMHO: leave it leaking, the use-case is a rarity and we can update
the event API separately and faster than we can fix all the callers to
workaround it.
* Looking at your comment in there about comm_close() I see why we are
at such loggerheads about it.
+++ comm_close() does *not* replace the write_data cleanup.
- write_data cleanup is explicitly and solely to remove the leaked
Pointer()'s *nothing else*. The extra two lines of code are to ensure
that hack does not corrupt anything. Until those fd_table pointers stop
being dynamic this code is non-optional.
- even if you called comm_close() you would need to perform the
write_data cleanup before calling it to prevent the leak.
* apart from the timeout unset the relevant parts of the comm_close()
sequence are all in comm_close_complete(). Perhapse you should schedule
one of those calls instead of clearing the handlers and calling
fd_table() synchronously..
++ but notice how that (and comm_close()) adds a second async delay on
the FD becoming available for re-use. This is a performance
optimization. You will need to setup a synchronous comm_close_complete()
to achieve the same speed.
++ the other operations in com_close() function itself are all NOP. No
other component has been given the FD or conn_ to set any state themselves.
* the naming of "ConnOpener::open()" is ambiguous. Since it is not the
active open() operation in this Job. It is the getSocket() operation and
should be named as such.
Amos
Received on Fri Jan 25 2013 - 04:36:04 MST
This archive was generated by hypermail 2.2.0 : Fri Jan 25 2013 - 12:00:09 MST