I've sometimes wondered: what if we dropped LRU and the other
replacement policies, and implemented something like what is used in
CPU caching: direct-mapped and set-associative caches?
For Squid, the MD5 hash has been shown to give a very small collision
frequency (I can't find the reference, but if I recall right, Duane
ran a test over a few million URLs), even at quite low bit counts.
We have a fixed number of possible files for a given amount of disk
space, and the filemap bitmap is usually a power of 2 in size.
Now, what if we masked the MD5 key to drop all bits higher than what
we can handle within our file bitmap, and used the resulting number as
a direct pointer to a cachedir swapfile, or even better, a disk block?
Let's assume, for a moment, that we can handle 2M objects.
First, we could use a simple array[0..2M] of "something" (pointers?)
related to the corresponding StoreEntrys. Each array item can be used
to determine the existence of an object in the cache (a pointer to the
actual StoreEntry, for example, or a refcount, expiry time, or object size).
The StoreEntry database could also be a simple linear array, with no
need for (doubly) linked lists. As we implement direct mapping of MD5
to disk, there's no need to manipulate this DB in RAM.
Or we could even omit the StoreEntry DB altogether and keep only
pending requests in RAM.
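As a rough illustration (the field choices are completely made up;
anything cheap enough to answer "probably present / definitely absent"
would do), the per-slot record could be as small as this:

    #include <stdint.h>

    #define SWAP_SLOTS (2 * 1024 * 1024)     /* the 2M-object example */

    typedef struct {
        uint32_t expires;    /* 0 = slot free; otherwise object expiry time */
        uint32_t obj_size;   /* or a refcount, or a StoreEntry pointer */
    } slot_info_t;

    static slot_info_t slot_map[SWAP_SLOTS]; /* ~16MB of RAM for 2M slots */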
Cache hit/miss detection is very straightforward and fast.
We need not keep a separate MD5->filename translation.
We need not even keep full MD5 keys in RAM, as we accept occasional
collisions of the MD5 hash found in on-disk objects.
We need not rebuild swap state: we simply know how many files fit into
the cachedir and assume that all files are in use (or sequentially stat
all files to be sure). On a false hit we invalidate the array item, or
more probably simply overwrite it with the newly requested object.
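Continuing the sketch above, hit/miss detection and store-with-overwrite
might look roughly like this (again, names and details are mine):

    /* A slot is a hit candidate if it is marked in use; a false hit or
     * a colliding new object is handled by simply overwriting the slot. */
    static int
    slot_probably_cached(const unsigned char key[16])
    {
        slot_info_t *s = &slot_map[md5_to_slot(key, SWAP_SLOTS)];
        return s->expires != 0;          /* "believed on disk" - may be false */
    }

    static void
    slot_store(const unsigned char key[16], uint32_t expires, uint32_t size)
    {
        slot_info_t *s = &slot_map[md5_to_slot(key, SWAP_SLOTS)];
        s->expires = expires;            /* overwrites any colliding old object */
        s->obj_size = size;
    }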
If we accept a partial store DB in RAM, as has been talked about in
relation to reiserfs, then we can even skip loading timestamp data into
RAM, and as soon as we have verified the existence of the disk file, we
can assume a hit.
We could bypass the FS altogether and map the MD5 directly onto a raw
disk block or fragment.
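A sketch of what that could look like on a raw device, with the block
size and function names purely my assumptions:

    #include <stdint.h>
    #include <unistd.h>

    #define OBJ_BLOCK_SIZE 8192              /* one fixed-size block per object */

    /* The slot number becomes a byte offset on the raw device. */
    static ssize_t
    raw_read_object(int raw_fd, uint32_t slot, void *buf)
    {
        off_t off = (off_t) slot * OBJ_BLOCK_SIZE;
        return pread(raw_fd, buf, OBJ_BLOCK_SIZE, off);
    }

    static ssize_t
    raw_write_object(int raw_fd, uint32_t slot, const void *buf)
    {
        off_t off = (off_t) slot * OBJ_BLOCK_SIZE;
        return pwrite(raw_fd, buf, OBJ_BLOCK_SIZE, off);
    }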
I rely here on MD5's property of spreading keys evenly across the
bitmap (which is basically what makes collisions rare), so I hope that
disk space would eventually be utilised quite well.
I also assume that collisions will very rarely happen in fast
succession, i.e. before a collision happens, an object will have stayed
in the cache for some time, and thus perhaps served its purpose.
If we assume a 10% collision rate, and 100% of the disk space is wiped
through the cache in 24h, then it seems reasonable to assume that, on
average, any object will stay in the cache for a few hours before a
collision occurs on it.
In fact, we rely on collisions and need them, as they are our only
replacement policy: if roughly 25% of the cache is overwritten by
collisions each day, we expect the full cache to be replaced in about
4 days. So collisions are our friend.
Of course, a lot of restrictions are imposed. It is assumed that disk
storage is divided into fixed-size blocks, one per object.
No replacement policy can be implemented other than either overwriting
the old object or dropping the new one.
Lots of disk space may stay unused. Some sets of URLs may constantly
compete for the same disk store due to collisions.
It is very difficult to maintain cache digests. For ICP hit/miss we
could return a hit based on our bitmasked search result, but this may
lead to quite a high false-hit rate.
But we could maintain an enormous number of objects with very little
RAM usage. To handle variable-sized web objects, we could implement
cachedir selection based on object size, or increase the complexity of
the array by allowing consecutive disk blocks to be allocated to the
same object.
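For instance, a sketch of size-based cachedir selection (the size
thresholds are arbitrary examples, not measured values):

    #include <stddef.h>

    /* Pick a cache_dir whose block size roughly matches the object. */
    static int
    cachedir_for_size(size_t object_size)
    {
        if (object_size <= 512)       return 0;  /* tiny objects, 512-byte blocks */
        if (object_size <= 8 * 1024)  return 1;  /* small objects, 8K blocks */
        return 2;                                /* everything larger */
    }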
There are an awful lot of web objects under 512 bytes in size. We
could fit 70M of these on a 36G disk (36G / 512 bytes is roughly 70M).
There's no way we can handle that number of objects with current Squid
memory usage, unless we have pretty expensive hardware.
What seems attractive to me in this approach is that we don't need to
bother with replacement overhead, lookup overhead, FS overhead, memory
overhead, or startup overhead (even dirty startups, as most disk ops
are atomic). We can consider disks just a bunch of spindles, cheap and
massive compared to anything else in a system.
If all this sounds disgusting, beat me ;)
PS.
To quickly check what disk utilisation and collision rate I'd get, I
took an access.log from one of the recent days, which had 4070655 URLs,
of which 2012485 were found to be unique, and ran a few tests with
different array sizes and URL sets.
These are probably not very good samples, as MD5 is known to get fewer
collisions as the bitmask grows, and I have too few URLs to fill all
the arrays. But it seems that it might work OK starting from 2M objects
in the array.
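Roughly, the test program does something like this (a simplified
sketch, not the exact code; it uses OpenSSL's MD5(), reads one URL per
line from stdin, and prints how many array slots received 0, 1, 2, ...
URLs):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>
    #include <openssl/md5.h>

    int
    main(int argc, char **argv)
    {
        uint32_t size = (argc > 1) ? (uint32_t) atol(argv[1]) : 2097152;
        uint32_t mask = size - 1;            /* array size must be a power of two */
        unsigned *slots = calloc(size, sizeof(*slots));
        unsigned hist[32] = { 0 };
        unsigned char digest[MD5_DIGEST_LENGTH];
        char url[8192];
        uint32_t low, i;
        int h;

        while (fgets(url, sizeof(url), stdin)) {
            MD5((unsigned char *) url, strlen(url), digest);
            memcpy(&low, digest, sizeof(low));
            slots[low & mask]++;             /* count URLs landing in each slot */
        }
        for (i = 0; i < size; i++)           /* histogram of slot occupancy */
            hist[slots[i] < 31 ? slots[i] : 31]++;
        for (h = 0; h < 32; h++)
            if (hist[h])
                printf("%d - %u\n", h, hist[h]);
        free(slots);
        return 0;
    }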
MD5 test suite. Array size: 4194304
Done Reading URLS: 1000000
0 - 3304705 78.79% of array unused
URLs per slot - slot count - %
1 - 787675 78.77%
2 - 93937 9.39%
3 - 7521 0.75%
4 - 445 0.04%
5 - 18 0.00%
6 - 3 0.00%
MD5 test suite. Array size: 2097152
Done Reading URLS: 1000000
0 - 1301847 62.08% of array unused
URLs per slot - slot count - %
1 - 620829 62.08%
2 - 147766 14.78%
3 - 23526 2.35%
4 - 2887 0.29%
5 - 271 0.03%
6 - 24 0.00%
7 - 2 0.00%
MD5 test suite. Array size: 1048576
Done Reading URLS: 1000000
0 - 403796 38.51% of array unused
URLs per slot - slot count - %
1 - 385997 38.60%
2 - 183258 18.33%
3 - 58314 5.83%
4 - 14029 1.40%
5 - 2715 0.27%
6 - 419 0.04%
7 - 44 0.00%
8 - 4 0.00%
MD5 test suite. Array size: 4194304
Done Reading URLS: 2012129
0 - 2596049 61.89% of array unused
URLs per slot - slot count - %
1 - 1245622 61.91%
2 - 298400 14.83%
3 - 47851 2.38%
4 - 5801 0.29%
5 - 540 0.03%
6 - 39 0.00%
7 - 1 0.00%
9 - 1 0.00%
MD5 test suite. Array size: 2097152
Done Reading URLS: 2012129
0 - 802939 38.29% of array unused
URLs per slot - slot count - %
1 - 771855 38.36%
2 - 369207 18.35%
3 - 118271 5.88%
4 - 28500 1.42%
5 - 5379 0.27%
6 - 873 0.04%
7 - 113 0.01%
8 - 12 0.00%
9 - 3 0.00%
MD5 test suite. Array size: 1048576
Done Reading URLS: 2012129
0 - 153429 14.63% of array unused
URLs per slot - slot count - %
1 - 295569 14.69%
2 - 283591 14.09%
3 - 181341 9.01%
4 - 86995 4.32%
5 - 33383 1.66%
6 - 10475 0.52%
7 - 2931 0.15%
8 - 700 0.03%
9 - 136 0.01%
10 - 19 0.00%
11 - 5 0.00%
12 - 2 0.00%
------------------------------------
Andres Kroonmaa <andre@online.ee>
Delfi Online
Tel: 6501 731, Fax: 6501 708
Pärnu mnt. 158, Tallinn,
11317 Estonia