I think I've reached the point with chunked pools where I'm not able
to improve them much further. I think they are stable and useful. But
they are in some ways quite different from the current memPools, and
that needs to be discussed.
I'd like you to try the code, comment on it, suggest any changes and
improvements, and decide whether it is OK for HEAD.
------
Let me recall what it's all about and describe what I did.
The initial idea for chunking came from the fact that malloc adds some
memory overhead to every allocation, typically in the range of 16 bytes.
It has to, both to satisfy one-size-fits-all alignment and to be
efficient with variable-sized allocations. The original memPool added
its own overhead by keeping a list of pointers to idle (released)
objects, which costs one more pointer per object.
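As a rough illustration (assuming about 16 bytes of malloc bookkeeping
and 4-byte pointers): for a pool of 16-byte objects, malloc's overhead
alone roughly doubles the memory used, and the original memPool's list
of idle pointers adds another 4 bytes for every idle object on top of
that.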
Chunks are essentially arrays of objects handed out via memPool
handles, allocated in relatively large blocks and then maintained by
the pool itself. This avoids the per-allocation malloc() overhead. To
avoid memPool's own overhead, idle (free) items are not kept on a
separate list; instead, the free objects themselves become the nodes of
a linked list of free items. Thus the free-list housekeeping is stuffed
into the objects themselves. The only requirement is one pointer per
object, which sets the minimum allocatable size for a memPool.
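Roughly, a chunk and its embedded freelist look like the sketch below
(names and fields are illustrative, not the actual patch code):

  #include <stddef.h>

  typedef struct _MemChunk {
      void *freeList;              /* head of intrusive list of free objects */
      void *objCache;              /* the raw array of objects */
      int inuse_count;             /* objects currently handed out */
      struct _MemChunk *next;
  } MemChunk;

  /* Thread every object of a fresh chunk onto its local freelist.
   * The first word of each free object is used as the "next" pointer,
   * so no extra memory is needed for free-list bookkeeping; this is
   * also why an object must be at least sizeof(void *) bytes. */
  static void
  chunkInitFreeList(MemChunk *chunk, size_t obj_size, int capacity)
  {
      char *obj = (char *) chunk->objCache;
      void **prev = &chunk->freeList;
      int i;
      for (i = 0; i < capacity; i++, obj += obj_size) {
          *prev = obj;             /* link this object in */
          prev = (void **) obj;    /* its first word holds the next link */
      }
      *prev = NULL;                /* terminate the list */
  }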
The whole purpose of chunking is to reduce useless overhead and to
reduce Squid's memory footprint. Chunked pools are quite good at the
former; the latter is somewhat more questionable.
The original memPool was able to release an idle item at any time if
needed; this is typically used with the memory_pools_limit config tag
to limit the amount of RAM kept in idle pools. With chunking this is
not possible, as memory can be returned only in whole chunks. If a
chunk has even a single item in use, the whole chunk cannot be freed,
even if that item is never used again.
For chunked pools an idle memory limit cannot be strict; it can only be
a wish, which must be allowed to be violated. To be memory efficient,
chunks must be kept as full as possible, to increase the probability
that idle chunks become completely empty and thus freeable.
The efficiency of chunked pools depends on the memory usage pattern. I
can imagine a "chunk-busting" allocation sequence in which a lot of
chunks are allocated but then left nearly empty for a long time: e.g.
allocate a lot of short-lived objects interleaved with a few long-lived
ones, in such a way that each long-lived object lands in a separate
chunk. After most of the short-lived objects are released, all chunks
are almost empty but not releasable, giving pretty high overhead. This
is unavoidable, but in practice it very rarely happens.
Still, the point is that this *can* happen. With the original memPool
it can't.
The important thing with chunked pools therefore becomes the handling
of fragmentation in otherwise freeable chunks. An algorithm for
selecting chunks for the next allocation, and for limiting chunk sizes,
is needed to handle that well.
The efficiency of chunked pools is statistical: in most cases they have
less RAM overhead than the original memPools, but in unfortunate
scenarios they can add more overhead than they save.
Fortunately, chunk sizes can be picked large enough to force the malloc
library to allocate them from an mmapped area. This allows idle parts
of chunks to be paged out of the way by the OS VM.
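(For example, with a typical glibc malloc the default mmap threshold is
128 KB, so chunks of at least that size would normally be served
straight from mmap(). The exact threshold is allocator-specific, so
this is only a rule of thumb.)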
During a free with chunked memPools, the freed item has to be put onto
a freelist chain together with the other items of its chunk, so we must
search for and find the chunk that the released object belongs to.
Performance of chunking was initially relatively slow, because for each
released object up to all chunks in a pool were searched. This overhead
was largest for long-lived objects and nearly unnoticeable for very
short-lived ones, because the last-used chunk was moved to the top of
the search chain.
To further improve performance, several changes have been made.
Allocated chunks are kept in a splay tree, and a compare function is
used to compare a released object against a chunk, telling whether the
object belongs to that chunk. This increased chunk-search efficiency.
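The comparison only needs to decide whether a pointer falls inside a
chunk's object array; a rough sketch (with made-up names, not the
actual function from the patch):

  /* Compare a released object against a chunk for the splay lookup. */
  static int
  memCompObjChunk(const void *obj, const MemChunk *chunk, size_t chunk_bytes)
  {
      const char *base = (const char *) chunk->objCache;
      const char *p = (const char *) obj;

      if (p < base)
          return -1;        /* object lies below this chunk's array */
      if (p >= base + chunk_bytes)
          return 1;         /* object lies above this chunk's array */
      return 0;             /* object belongs to this chunk */
  }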
Then a cache of free objects was added.
Normally, when a new chunk is created on demand, its local freelist is
populated with all of its objects as a linked list. When objects are
allocated from a chunk, they are taken from the head of that list; when
an object is released, it is put back on top of the list. To avoid a
chunk search for every released object, an additional per-pool freelist
cache was added.
Now when an object is released, it is not put back into its owning
chunk, but simply pushed on top of the pool's freelist cache. The next
allocation is made from that cache; if the cache is empty, a suitable
chunk is searched for or created.
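The fast path then looks roughly like this (a sketch with my own naming
assumptions, not the patch verbatim):

  typedef struct {
      void *freeCache;             /* per-pool cache of released objects */
      /* ... chunks, splay tree, counters, etc. */
  } MemPool;

  /* slow path: pick or create a suitable chunk (not shown) */
  extern void *poolAllocFromChunk(MemPool *pool);

  static void *
  poolAlloc(MemPool *pool)
  {
      void *obj = pool->freeCache;
      if (obj) {
          pool->freeCache = *(void **) obj;   /* pop from the cache */
          return obj;
      }
      return poolAllocFromChunk(pool);        /* cache empty: use a chunk */
  }

  static void
  poolFree(MemPool *pool, void *obj)
  {
      /* no chunk lookup here: just push on top of the per-pool cache */
      *(void **) obj = pool->freeCache;
      pool->freeCache = obj;
  }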
So, a free object can be in one of two different states:
 1) back home in the chunk it belongs to (the chunk's local freelist);
 2) in the air, somewhere in the per-pool freelist cache.
Both kinds of freelists use the objects themselves as list nodes, so an
object can only ever be on one of the lists.
Without periodic cleanup, most or all objects of the pool would end up
in the freelist cache, making it impossible to know or select chunks
that are empty and releasable.
Therefore this per-pool freelist cache is periodically emptied by
returning every object to its home chunk: for each object in the free
cache a splay-tree search is made, and the object is moved onto that
chunk's freelist and removed from the cache.
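A sketch of that cleanup pass (again with made-up helper names):

  /* splay-tree lookup of the chunk an object belongs to (not shown) */
  extern MemChunk *poolFindChunk(MemPool *pool, void *obj);

  static void
  poolFlushFreeCache(MemPool *pool)
  {
      void *obj;
      while ((obj = pool->freeCache) != NULL) {
          MemChunk *chunk;
          pool->freeCache = *(void **) obj;   /* unlink from the cache */
          chunk = poolFindChunk(pool, obj);   /* find its home chunk */
          *(void **) obj = chunk->freeList;   /* push onto chunk's freelist */
          chunk->freeList = obj;
          chunk->inuse_count--;               /* one fewer object in use here */
      }
  }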
After that, all chunks are sorted by an algorithm that chooses the
order in which further allocations will use the chunks. Empty chunks
are considered for release: if the pool is over its idle limits, they
are freed immediately; if within the idle limits, only if they have
been unreferenced for quite some time.
Although this seems like a very time-consuming task, it is run from a
Squid event only every 15 seconds, and the overall impact is nearly
undetectable.
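The release decision might be sketched like this (the idle-limit
fields, the last_used timestamp and the age threshold are my
assumptions, not the patch's exact bookkeeping):

  #include <time.h>

  #define CHUNK_MAX_IDLE_AGE 900   /* assumed: seconds an empty chunk may linger */

  extern void poolReleaseChunk(MemPool *pool, MemChunk *chunk);  /* free() the chunk */

  static void
  poolCheckIdleChunks(MemPool *pool, time_t now)
  {
      MemChunk *chunk, *next;
      for (chunk = pool->chunks; chunk != NULL; chunk = next) {
          next = chunk->next;
          if (chunk->inuse_count != 0)
              continue;                        /* chunk still has live objects */
          if (pool->idle_bytes > pool->idle_limit)
              poolReleaseChunk(pool, chunk);   /* over the idle limit: free now */
          else if (now - chunk->last_used > CHUNK_MAX_IDLE_AGE)
              poolReleaseChunk(pool, chunk);   /* within limits, but long unused */
      }
  }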
As said above, the order in which chunks are selected for the next
allocation can influence how well the chunks are utilised. I've tried a
few algorithms.
Initially I always preferred chunks that reside lower in RAM (comparing
pointers). This gave a general tendency for Squid's memory usage to get
compacted into the lower heap, reducing the probability of holes in the
heap and thus helping to keep the process size smaller. Then I tried
preferring the least-utilised chunks, to exploit locality (a new chunk
gets the most hits), but this eventually caused all chunks to reach
equal utilisation and didn't allow any chunk to become empty. Currently
the code prefers the chunks that are most filled, to maximise chunk
utilisation.
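So the current ordering is essentially "fullest chunks first"; as a
sketch (the patch may well sort on different fields):

  /* Sort comparator: prefer the most-filled chunks for new allocations. */
  static int
  chunkCmpMostFilled(const void *a, const void *b)
  {
      const MemChunk *A = *(const MemChunk *const *) a;
      const MemChunk *B = *(const MemChunk *const *) b;
      return B->inuse_count - A->inuse_count;   /* more in-use sorts first */
  }

  /* e.g.: qsort(chunks, n_chunks, sizeof(MemChunk *), chunkCmpMostFilled); */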
I haven't tested chunked pools under high real-life loads, only with
Cidera feeding one proxy (about 5-20 reqs/sec, HTTP only). Under this
load, chunked pools behave very well: idle memory stays within 2-4 MB,
with the cache holding about 2.3M objects and a process size of about
192 MB.
More mods.
While chasing maximum performance, it occurred to me that we needlessly
updated memMeters upon every single alloc/free. Quite a few meters were
updated each time, adding to CPU load. I changed this so that only a
single per-pool counter is incremented on alloc and one on free. Then,
when the memMeters are needed, and once every thousand requests, the
counters are flushed into the memMeters. This avoids quite a lot of
needless calculations.
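A sketch of the lazy accounting (memMeterAdd/memMeterDel and the field
names here are my assumed helpers, not necessarily the real ones):

  static void
  poolFlushMeters(MemPool *pool)
  {
      int allocs = pool->alloc_calls;   /* bumped by one on every alloc */
      int frees  = pool->free_calls;    /* bumped by one on every free */
      pool->alloc_calls = 0;
      pool->free_calls = 0;

      /* apply the accumulated difference in one step; short spikes of
       * "high" usage between flushes are therefore averaged away */
      memMeterAdd(pool->meter.inuse, allocs);
      memMeterDel(pool->meter.inuse, frees);
  }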
The only drawback is that the precision of the "high" usage figure is
reduced from exact to "average". The averaging comes from the lazy
updating of the meters: I count frees and allocs over a long time, and
the difference is added to the meters, so sudden, short spikes of
"high" usage can be missed.
If that is acceptable, then this saves quite a lot of CPU.
Also, the memPools main module is now under lib/. Under src/ we have
only the cachemgr and event handlers, so we can now use memPools in
modules under lib/ as well.
------------------------------------
Andres Kroonmaa <andre@online.ee>
CTO, Delfi Online
Tel: 6501 731, Fax: 6501 708
Pärnu mnt. 158, Tallinn,
11317 Estonia