Re: How long is a domain or url can be? from Amos Jeffries on 2014-04-30 (squid-dev)

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Thu, 01 May 2014 06:51:31 +1200

On 1/05/2014 6:12 a.m., Eliezer Croitoru wrote:
> On 04/30/2014 11:52 AM, Henrik Nordström wrote:
>> Unless it has been fixed the UFS based stores also have an implicit
>> limit on cached entries somewhat less than 4KB (the whole meta header
>> needs to fit in the first 4KB). Entries failing this get cached but can
>> never be hit.
> Then StoreID helps a bit with that..
> Now it's understood why some urls with the "?" in them do not cache well
> sometimes :P
>
>>>> DNS defines X.Y.Z segments as being no longer than 255 bytes *each*.
>> For Internet host names the limits are 63 octets per label, and 255
>> octets in total including dot delimiters.
>
> This is indeed what I have been reading in the RFC and it makes the
> regex for domain simpler to define.
> From what I have seen, 2-3KB was the upper limit of the request sizes
> that have been used.
> I assume that this is what is happening now with the current data sizes
> over the network.
> Every once in a while the data size goes up, and the url should also,
> since it will be used with larger hash algorithms.
> It started small and then grew: crc16, crc32, mdX, md5, sha1, sha512, etc.
>
> So for now a url blacklist should allow for urls of at least 4KB in size,
> but I think that doubling 4KB to 8KB is not such a big jump.
> The main issue I was thinking was between using one field of the DB with
> X size or other one which has indexes.
>
> For now I have used mysql TEXT which doesn't have indexes but only the
> first query takes more than 0.00 ms.
>
> I have tried a couple of key-value DBs and other DBs but it seems like all
> of them have some kind of a step which is the slowest, and then they
> run fast.

At a guess that would probably be loading the table data into memory or
constructing some form of cache for the results. If you imagine it the
same way Squid operates: the first request is a MISS and has to actually
get to the origin and do all its processing, second and later requests
can be fast HITs.
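
A minimal sketch of that MISS-then-HIT pattern, assuming Python and its
standard functools.lru_cache. The names slow_lookup and is_blocked are
hypothetical stand-ins for the database query and the blacklist check;
they are not taken from MySQL, SquidGuard, or Squid.

```python
from functools import lru_cache

# Counts how often the slow path really runs.
calls = {"backend": 0}

def slow_lookup(url: str) -> bool:
    """Hypothetical stand-in for the first, slow database query."""
    calls["backend"] += 1
    return url.endswith(".example")

@lru_cache(maxsize=1024)
def is_blocked(url: str) -> bool:
    """Answer from the in-memory cache when possible, else ask the backend."""
    return slow_lookup(url)

is_blocked("http://bad.example")  # first call: MISS, reaches the backend
is_blocked("http://bad.example")  # second call: HIT, served from memory
info = is_blocked.cache_info()    # hit and miss counters kept by lru_cache
```

Only the first call pays for the backend; the second is answered from the
cache, which is roughly what a database warming its own buffers looks like
from the outside.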

>
> I have compared mysql to key-value and the main difference is the
> on-disk size of the DB which is important if there is a plan to filter
> many many specific urls and not based only on patterns.
>
> Amos:(or anyone else) since you patched squidguard, maybe you have a
> basic understanding of its lookup algorithm?
> I started reading the code of SquidGuard but then all of a sudden I lost
> my way in it and things got a bit complicated (for me) to understand how
> they do it.(hints..)

I only looked at it hard enough to find the reply line syntax and make
it return the URL alone with that patching. Will give it a look over and
see what I can find later today.

BTW: If you go for optimizing MySQL database be wary of SOUNDEX(). It is
great for textual comparisons and indexing, but only if you are storing
American English words. For any other input it is a brilliant way to
screw up without noticing.
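
To illustrate the kind of silent collision meant here, below is a rough
sketch of the classic four-character American Soundex in Python. It is a
simplification written for this example and is not MySQL's implementation
(MySQL's SOUNDEX() can return longer strings and handles characters
outside A-Z differently). This version simply ignores anything that is not
an ASCII letter, which is an assumption of the sketch.

```python
# Letter groups of classic American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(text: str) -> str:
    """Simplified Soundex: first letter plus three digits, zero padded."""
    # Assumption of this sketch: everything outside A-Z is dropped.
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate equal codes
            prev = code
    return (result + "000")[:4]

names = (soundex("Robert"), soundex("Rupert"))
urls = (soundex("example.com/ads"), soundex("example.com/news"))
```

The two English names share a code, which is the intended behaviour. The
two URLs also share a code even though they point at different resources,
so an index or comparison built on such codes would treat them as equal
without raising any error.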

Amos
Received on Wed Apr 30 2014 - 18:51:41 MDT
