Squid currently has no facilities for processing content beyond the HTTP
headers, in plug-in form or otherwise. There have been a few hacks
along the way that do specialized forms of filtering (like stripping
the animation bit from GIFs, or stripping out JavaScript), but those projects
never really went anywhere and have long been unsupported.
Robert has done some promising work on generic content processing in
Squid, but ran into some roadblocks that he didn't have time to address.
You may want to start from there and tackle the issues he ran into,
if you have the time and inclination.
ICAP provides support for similar things in limited circumstances (it is
targeted at content providers who want to customize or aggregate
content or provide additional services nearer to the client). Geetha
(and Ralf? I think) has been doing lots of cool stuff in that area, but
I don't think it will address your needs at all in its existing form.
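(For reference, the general shape of an ICAP exchange: the client wraps
the HTTP request and/or response inside an ICAP message and ships it off
to an adaptation server, roughly like this - the server name and byte
offsets here are only illustrative:

RESPMOD icap://icap.example.org/respmod ICAP/1.0
Host: icap.example.org
Encapsulated: req-hdr=0, res-hdr=137, res-body=296

followed by the encapsulated HTTP request headers, response headers,
and body.)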
Dan's Guardian does content processing, and so might provide a good
starting point (note the request attached to its GPL license, however,
before embarking on any commercial work with it). It is a standalone
proxy these days, obviously much simpler in implementation than
Squid. I do not know how compliant it is with the HTTP
protocol, but I haven't heard anything particularly alarming about it,
and Dan seems to be a skilled programmer, so I'd suspect it is a good
choice for your project.
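If all you really need is to see each URL as it goes by, Squid's
redirector interface - the same hook Squirm (mentioned in the quoted
message below) plugs into - may be enough on its own. A minimal sketch,
assuming the stock redirect_program protocol (Squid writes one request
per line to the helper's stdin and reads a URL back per line on stdout);
the helper name and log path here are made up:

#!/usr/bin/env python3
# Minimal URL-logging redirector sketch for Squid's redirect_program
# interface.  Input lines look like: "URL client_ip/fqdn ident method".
import sys

LOGFILE = "/var/log/squid/seen-urls.log"   # illustrative path

def main():
    log = open(LOGFILE, "a")
    for line in sys.stdin:
        fields = line.split()
        url = fields[0] if fields else ""
        if url:
            log.write(url + "\n")
            log.flush()
        # Echo the URL back unchanged so Squid fetches it as usual.
        sys.stdout.write(url + "\n")
        sys.stdout.flush()   # replies must be unbuffered or Squid stalls

if __name__ == "__main__":
    main()

hooked in with something like:

redirect_program /usr/local/bin/log_urls.py
redirect_children 5

Bear in mind a redirector only ever sees URLs, never message bodies, so
to index page contents you would still have to fetch each page yourself,
much as Viralator does with wget.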
andrew cooke wrote:
> Hi,
>
> Is there a simple way to process files that are requested through Squid?
>
> I'd like to try constructing a database containing links, word counts etc,
> for pages that I view. The simplest way I can think of to do this is to
> point my browser at a proxy and process data there. Squid seems the obvious
> choice for a proxy (but see last point below).
>
> Looking for similar functionality in other code working with Squid, I found
> the Viralator which checks downloads for viruses
> (http://viralator.loddington.com/). It intercepts requests using Squirm,
> pulls the file using wget, and then resupplies it (after scanning) via Apache.
> This seems very complicated, and may only work correctly for downloads rather
> than page views - I'm not clear about the details yet (although I could drop
> Apache when working on the machine hosting Squid).
>
> Instead, I was wondering if Squid had support for plugin modules (that might
> be intended to support filters, for example), but I haven't been able to find
> anything.
>
> Another approach might be to scan the files cached by Squid (i.e. as files on
> the local disk, not streamed data). But this presumably won't work with
> dynamic pages and it might be difficult to associate URLs with files (also,
> it forces caching when, for single person use, proxy-only might be
> sufficient). And how would this be triggered for new files?
>
> Does anyone have any suggestions on the best way forwards? Perhaps there's a
> simpler proxy that I could use instead? There are certainly a lot of simple
> http proxies out there, but I'm not sure how closely they follow the spec.
>
> Any help appreciated,
> Thanks,
> Andrew
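As for the indexing step itself - the database of links and word counts
you describe - once you have a page body in hand, the processing is
straightforward. A rough sketch using only the Python standard library;
the table layout and file names are purely illustrative:

#!/usr/bin/env python3
# Sketch: extract links and word counts from one page, store in SQLite.
import re
import sqlite3
from collections import Counter
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collect href values and visible text from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
    def handle_data(self, data):
        self.text.append(data)

def index_page(db, url, html):
    parser = PageParser()
    parser.feed(html)
    # Crude tokenizer; note it does not skip script/style contents.
    words = Counter(re.findall(r"[a-z']+", " ".join(parser.text).lower()))
    db.execute("CREATE TABLE IF NOT EXISTS links (page TEXT, href TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS words "
               "(page TEXT, word TEXT, n INTEGER)")
    db.executemany("INSERT INTO links VALUES (?, ?)",
                   [(url, h) for h in parser.links])
    db.executemany("INSERT INTO words VALUES (?, ?, ?)",
                   [(url, w, n) for w, n in words.items()])
    db.commit()

if __name__ == "__main__":
    import urllib.request
    url = "http://example.com/"
    body = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    index_page(sqlite3.connect("pages.db"), url, body)

SQLite keeps the whole thing self-contained, but any store would do.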
--
Joe Cooper <joe@swelltech.com>
Web caching appliances and support.
http://www.swelltech.com