Nick Lewycky wrote:
> Hi. I've been working to add prefetching to squid3. It works by
> analyzing HTML and looking for various tags that a graphical browser can
> be expected to request.
>
> So far, it seems to just-barely work. What works is checking the
> content-type of the document, avoiding encoded (gzip'ed) documents,
> analyzing the HTML using libxml2 in "tag soup" mode, resolving the full
> URL from relative references, and fetching the files into the cache. (I
> would, of course, appreciate code reviews of the branch before I diverge
> too far!)
>
> However, I've run into a few problems.
>
> To prefetch a page, we call clientBeginRequest. I've already had to
> extend the richness of this interface a little. The main problem is that
> it will open up a new socket for each call. On a page with 100
> prefetchables, it will open 100 TCP connections to the remote server.
> That's not nice. I need a way to re-use a connection for multiple
> requests. How should I do this? I'd like clientBeginRequest to be smart
> enough to handle this behind the scenes.
>
> Occasionally I see duplicate prefetches. I think what's going on here is
> that the object is uncacheable. The only way I can think of solving this
> is by adding an "uncacheable" entry type to the store -- but that just
> seems wrong, conceptually. On a related note, maybe we could terminate a
> prefetch as soon as we receive the headers and notice that it's
> uncacheable. Currently, we download the whole thing and just discard it
> (after analyzing it for more prefetchables if it's HTML).
>
> Finally, does anyone have suggestions for how to test for performance
> improvement due to prefetching?
A good way to test how well your algorithms work is to take a nice, long,
real Squid workload (e.g., the sequence of URLs fetched) and compare how
long it takes to replay the whole thing with and without prefetching.
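A minimal sketch of such a replay harness, in Python. It assumes the workload has already been reduced to one URL per line; the names replay and read_urls are made up for illustration and are not part of Squid. The fetch function and the clock are passed in, so the same loop can be pointed at a proxy with prefetching on and then off.

```python
import time
from typing import Callable, Iterable, Iterator


def read_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield one URL per non-empty line (assumed format: one URL per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def replay(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Fetch every URL in order and return the total elapsed time in seconds."""
    start = clock()
    for url in urls:
        fetch(url)
    return clock() - start
```

To compare, run the same URL list twice from a cold cache, once with prefetching enabled and once without, with fetch being whatever function requests a URL through the proxy under test, and compare the two totals.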
Note that you generally have to prefetch a LOT of stuff to get much
improvement, because web cache fetch popularity follows Zipf's law and
decays slowly.
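To put rough numbers on that, here is a small Python sketch. It assumes an idealized Zipf distribution in which the object of popularity rank k is requested with probability proportional to 1/k**s, with s = 1 by default; real traces may have a different exponent. The function names are made up for illustration. It computes what fraction of all requests go to the top_m most popular objects out of n_objects.

```python
def harmonic(n: int, s: float = 1.0) -> float:
    """Generalized harmonic number: sum of 1/k**s for k = 1..n."""
    return sum(1.0 / k**s for k in range(1, n + 1))


def coverage(top_m: int, n_objects: int, s: float = 1.0) -> float:
    """Fraction of requests that hit the top_m most popular objects,
    assuming popularity of rank k is proportional to 1/k**s."""
    if not 0 <= top_m <= n_objects:
        raise ValueError("top_m must be between 0 and n_objects")
    return harmonic(top_m, s) / harmonic(n_objects, s)


if __name__ == "__main__":
    n = 100_000  # assumed number of distinct objects in the workload
    for m in (1_000, 2_000, 4_000, 8_000):
        print(f"top {m:>5} objects cover {coverage(m, n):.1%} of requests")
```

Under this model each doubling of the number of objects held adds only a roughly constant, small slice of the requests, which is why a prefetcher has to pull in a lot before the hit rate moves much.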
Good luck with your work.
Jon
Received on Sat May 14 2005 - 12:40:30 MDT