
A hash table by any other name

By Daroc Alden
July 15, 2024

On June 25, Matthew Wilcox posted a second version of a patch set introducing a new data structure called rosebush, which "is a resizing, scalable, cache-aware, RCU optimised hash table." The kernel already has generic hash tables, though, including rhashtable. Wilcox believes that the design of rhashtable is not the best choice for performance, and has written rosebush as an alternative for use in the directory-entry cache (dcache) — the filesystem cache used to speed up file-name lookup.

Rosebush is intended to present roughly the same API as rhashtable, with the main difference for users being the performance of the two hash tables — something which is critically important to the dcache, since it is referenced during almost all filesystem operations. All hash tables have the same basic structure, but the details can be quite important for performance. The key of the value being looked up (a file name, for example) is hashed, and the hash is used to select a bucket from a top-level array of buckets. Multiple keys can have the same hash, however, so the hash table must either store multiple keys per bucket (the approach taken by rosebush and rhashtable), or use a more complicated bucket-selection algorithm (such as linear probing or cuckoo hashing). Rosebush uses arrays as fixed-size buckets, whereas rhashtable instead uses linked lists. While there are other differences between the two data structures, it is the avoidance of linked lists that Wilcox thinks will make the biggest difference.

To look up a key in a rosebush, the key is hashed. In the normal case, the rosebush has a two-level structure; the hash is first used to select a bucket from an array of buckets, and the key is then looked up within that bucket using a linear scan. The array has a number of elements that is a power of two, so the last N bits of the hash can be used to efficiently select a bucket. In order to adjust to the number of items stored in the rosebush, the top-level array is resizeable. This does mean that the occasional insertion into the table will need to allocate a new top-level array and rehash the keys into new buckets, but the average insertion only has to do a constant amount of work. If few enough items have been added to the rosebush (the exact number depends on the architecture), the rosebush will just skip the top-level array and fall back to using a single bucket. This saves some memory for rosebushes that are not actually being used. Buckets are stored with the hashes first, followed by the associated values. This ensures that fewer cache lines need to be fetched to look up an element in a bucket. Finally, the value associated with the key is returned.
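
As a rough illustration of that layout and lookup path, here is a minimal sketch in C. The names, the bucket size, and the omission of all locking are choices made for the example; the real rosebush code is organized differently:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define BUCKET_ENTRIES 16   /* fixed bucket size, chosen arbitrarily here */

    struct bucket {
        uint32_t hashes[BUCKET_ENTRIES];   /* hashes stored first ... */
        void *values[BUCKET_ENTRIES];      /* ... then the associated pointers */
        unsigned int nr;                   /* entries currently in use */
    };

    struct table {
        struct bucket **buckets;   /* power-of-two-sized top-level array */
        unsigned int shift;        /* log2 of the array size */
    };

    /* The low bits of the hash pick a bucket; a linear scan of the stored
     * hashes then finds candidates whose values are worth comparing. */
    static void *table_lookup(struct table *t, uint32_t hash,
                              bool (*match)(void *value))
    {
        struct bucket *b = t->buckets[hash & ((1u << t->shift) - 1)];

        for (unsigned int i = 0; i < b->nr; i++)
            if (b->hashes[i] == hash && match(b->values[i]))
                return b->values[i];
        return NULL;
    }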

In order to efficiently support concurrent access, rosebush uses read-copy-update (RCU) to protect each bucket. There is a whole-hash-table spinlock that is taken during resizing operations, but the vast majority of accesses should only need to use RCU on the specific bucket being read from or updated. Since reading under RCU protection is approximately free, this means that the hash table has excellent concurrent read performance — a necessity for the dcache.
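
The read side of that scheme might look roughly like the sketch below, which reuses the illustrative structures above; scan_bucket() is a hypothetical stand-in for the linear scan, and the resize path (serialized by the spinlock) is not shown:

    /* Readers take no locks: an RCU read-side critical section keeps the
     * top-level array and the bucket valid even if a concurrent resize
     * publishes a replacement array. */
    void *table_lookup_rcu(struct table *t, uint32_t hash)
    {
        struct bucket **buckets;
        void *ret;

        rcu_read_lock();
        buckets = rcu_dereference(t->buckets);
        ret = scan_bucket(buckets[hash & ((1u << t->shift) - 1)], hash);
        rcu_read_unlock();
        return ret;
    }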

The lists rhashtable uses are intrusive, singly-linked lists. The kernel makes extensive use of embedded linked list structures. Intrusive pointers do have downsides, however. For one thing, a structure with an embedded linked list entry in it cannot be stored in multiple lists. It also uses extra memory when the structure is not being stored in a list. The fact that rosebush does not use intrusive pointers allows an item to be stored in multiple rosebushes, or the same rosebush under different keys; the benefit is a side effect of the performance-motivated design decision to avoid linked lists.
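
In miniature, the difference looks like this (both structures are invented for illustration):

    /* Intrusive linkage, as with rhashtable's embedded struct rhash_head:
     * the object carries its own chain pointer, so it can only sit in one
     * such table at a time and pays for the pointer even when unhashed. */
    struct item_intrusive {
        struct item_intrusive *next;   /* hash-chain link lives in the object */
        int payload;
    };

    /* Non-intrusive storage, as in rosebush: the buckets hold plain
     * pointers, so the same object can be stored in several tables, or in
     * one table under several different keys. */
    struct item_plain {
        int payload;
    };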

The main problem with linked lists in this context is pointer-chasing: linked lists require the CPU to access several potentially unrelated parts of memory in order to iterate through the bucket. Rosebush avoids this by using an array to hold bucket entries, reducing the number of cache misses incurred by iterating over a bucket. Wilcox explained the motivation in his cover letter:

Where I expect rosebush to shine is on dependent cache misses. I've assumed an average chain length of 10 for rhashtable in the above memory calculations. That means on average a lookup would take five cache misses that can't be speculated. Rosebush does a linear walk of 4-byte hashes looking for matches, so the CPU can usefully speculate the entire array of hash values (indeed, we tell it exactly how many we're going to look at) and then take a single cache miss fetching the correct pointer. Add that to the cache miss to fetch the bucket and that's just two cache misses rather than five.

In response to the first version of the patch set (from February), Herbert Xu disagreed, pointing out that an rhashtable is resized whenever the table reaches 75% capacity, so the average bucket contains only one item. Therefore uses of rhashtable will incur many fewer cache misses than Wilcox had predicted.

David Laight thought that was considering the wrong metric, however. Since the kernel never has any reason to look up an empty bucket, it is more interesting to consider the average length of non-empty buckets. Xu acknowledged the point, but didn't think it changed the ultimate analysis, since the average chain length should still be nowhere near 10 items.

Reviewers were, as ever, skeptical in the face of assertions about performance without corresponding measurements. Peng Zhang asked "how much performance improvement is expected if it is applied to dcache?" Wilcox didn't reply to Zhang's question. The other reviewers did seem fairly happy with the second version of the patch set, mainly suggesting small changes to the documentation and test suite.




Bucket expansion

Posted Jul 15, 2024 18:22 UTC (Mon) by Paf (subscriber, #91811) [Link] (35 responses)

How is bucket expansion handled? What happens when more things hash to the same bucket than you allocated slots for? That's not a problem for a linked list, but it is for this sort of structure, right?

Bucket expansion

Posted Jul 15, 2024 18:37 UTC (Mon) by daroc (editor, #160859) [Link] (34 responses)

The first version of the patch set had expandable buckets, handled in the normal way for growable arrays: allocate a new, larger array and copy things over. But the second version of the patch set uses fixed-size buckets, and just enlarges the top-level array that holds the buckets if too many things are hashing to the same bucket.
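
Using the illustrative struct table and struct bucket sketched in the article above, that growth step could look something like the following; error handling, freeing of the old array, and the locking and RCU publication are all omitted:

    #include <stdlib.h>   /* calloc() */

    static void table_grow(struct table *t)
    {
        unsigned int old_size = 1u << t->shift;
        struct bucket **grown = calloc(2 * old_size, sizeof(*grown));

        for (unsigned int i = 0; i < 2 * old_size; i++)
            grown[i] = calloc(1, sizeof(struct bucket));

        /* Redistribute: the newly significant hash bit decides whether an
         * entry stays in bucket i or moves to bucket i + old_size.  (If a
         * destination bucket were to fill up, real code would grow again.) */
        for (unsigned int i = 0; i < old_size; i++) {
            struct bucket *old = t->buckets[i];

            for (unsigned int j = 0; j < old->nr; j++) {
                struct bucket *dst =
                    grown[old->hashes[j] & (2 * old_size - 1)];

                dst->hashes[dst->nr] = old->hashes[j];
                dst->values[dst->nr] = old->values[j];
                dst->nr++;
            }
        }

        t->buckets = grown;   /* real code would publish this with RCU */
        t->shift++;
    }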

Bucket expansion

Posted Jul 15, 2024 20:07 UTC (Mon) by Paf (subscriber, #91811) [Link] (32 responses)

OK, thanks for the clarification! An interesting approach - it sort of ignores the theoretical worst case of a bad hash function causing ongoing pileup. But in practice I think that should work quite well.

Bucket expansion

Posted Jul 15, 2024 20:58 UTC (Mon) by Wol (subscriber, #4433) [Link] (31 responses)

https://en.wikipedia.org/wiki/Linear_hashing

This technique is over 40 years old. Rosebush doesn't seem to be linear hashing, but it's a very similar technique. Linear hashing is behind Pick/MultiValue's expanding tables.

The technique is pretty easy to explain. You have hash-buckets like classical hashing. But the hash algorithm is modified slightly. Instead of converting your hash value to a bucket using a normal modulo, you drop the high-order bits until you have a valid bucket.

So if you have 2^n+y buckets, you try mod 2^(n+1), and if that doesn't work you try mod 2^n. So if you want to increase the buckets, you know that everything that is going to hash into the new 2^n+y+1 bucket is currently in bucket 2^(n-1)+y+1 (if I've got my maths correct). So the cost of adding one bucket to the hash table is the cost of rehashing just one bucket.

Cheers,
Wol
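
For reference, Litwin-style linear hashing picks a bucket roughly as in the sketch below (an editor's illustration with invented names). When the table grows from 2^n + y to 2^n + y + 1 buckets, only the entries currently in bucket y need to be rehashed into the old and the new bucket, which is why each expansion step costs a constant amount of work:

    /* level = n, nbuckets = 2^n + y: the current number of buckets. */
    static unsigned int linear_hash_bucket(uint32_t hash,
                                           unsigned int level,
                                           unsigned int nbuckets)
    {
        unsigned int b = hash & ((1u << (level + 1)) - 1);   /* mod 2^(n+1) */

        if (b >= nbuckets)                 /* not a valid bucket yet ... */
            b &= (1u << level) - 1;        /* ... so drop a bit: mod 2^n */
        return b;
    }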

Bucket expansion

Posted Jul 17, 2024 12:01 UTC (Wed) by Wol (subscriber, #4433) [Link]

Hmmm ...

Just read up on Cuckoo Hashing ...

I've got a paper (same sort of date as Litwin's linear hash paper - 1980) that approximately describes cuckoo hashing!

It's actually a linear hashing paper, that has one item per bucket, and 100% fill. So every time you add a new item, it works out the linear hash location, and cuckoos an item out of the way. Something along the lines of: if the item already in the bucket doesn't belong there, it gets pushed out into the new end position and the new item takes its place, while if the other item does belong there, the new item itself takes the end place and creates a chain.

Cheers,
Wol

Bucket expansion

Posted Jul 18, 2024 1:09 UTC (Thu) by george_v_reilly (subscriber, #38242) [Link] (2 responses)

In LKRhash, https://nwcpp.org/june-2012.html, we used linear hashing, cache-friendly data structures, and careful locking to build a scalable hash table. The slides at https://speakerdeck.com/georgevreilly/lkrhash-the-design-... do a reasonable job of explaining our approach.

Bucket expansion

Posted Jul 18, 2024 11:20 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

Thanks. Downloaded. I think that will be very interesting ...

Cheers,
Wol

Bucket expansion

Posted Jul 18, 2024 13:12 UTC (Thu) by Wol (subscriber, #4433) [Link]

Just looked at it - fascinating!

Just a little historical point - I first met dynamically hashed files in 1985 - Pr1me INFORMATION.

And the original Pick (back then it was GIRLS) was a pure virtual memory machine. I don't get how it worked, but it just divided core and disk into frames, and everything was hashed. I guess to shut the machine down, the core frames would have been backed by disk, so it would just flush the core and stop.

Well interesting you had to modify Linear Hashing for a memory environment, when one of the earliest users of the technique was based on a "pure memory" model!

Cheers,
Wol

Bucket expansion

Posted Jul 18, 2024 14:31 UTC (Thu) by paulj (subscriber, #341) [Link] (26 responses)

Hmmm... this is pretty much tree rebalancing.

A lot of (all?) techniques that try to make lists and hash-tables scale are essentially tree operations bent and wedged around hash tables. Like skiplists, linear hashing, etc. But these methods tend to lack the depth of theoretical analysis that trees have.

Wouldn't it be better to just use a tree - a block based one for best locality?

Bucket expansion

Posted Jul 18, 2024 15:38 UTC (Thu) by Wol (subscriber, #4433) [Link] (24 responses)

> But these methods tend to lack the depth of theoretical analysis that trees have.

Do they? Or is it just that trees are sexy?

> Wouldn't it be better to just use a tree - a block based one for best locality?

How many decision points are there in scanning a tree? One per level? And how many levels does your typical tree have?

For retrieval, the 95% use case for linear hashing is "calculate hash function, calculate modulo (which is AND, possibly twice), retrieve data". That's simple, incontrovertible theoretical analysis. Is "lack of analysis" a snide way of saying "the analysis is so simple it was all done and finished years ago"? (That's not a dig at you - it's the University "Publish or Perish" mentality!)

I don't really know trees, but if you have 1000 entries and each node splits into two branches, isn't that ten comparisons to retrieve any item? MUCH more expensive. And if each comparison requires a disk access (certainly I'm thinking databases, so that's the norm), then costs go through the roof. Making the tree wider rather than deeper simply moves the costs. With linear hashing, given the key, I can be statistically confident I'm going to get what I'm looking for at the first disk access (or possibly a bit more if my record is too big to fit in just one bucket).

The problem with theoretical analysis is people forget the real world. Theoretical analysis rarely takes wall-clock into consideration.

Cheers,
Wol

Bucket expansion

Posted Jul 18, 2024 16:05 UTC (Thu) by daroc (editor, #160859) [Link] (23 responses)

Without really addressing your main point, I can provide some numbers about the branching factor used in modern trees, based on some research I've done on B-tree implementations.

Modern B-trees often have approximately 512 entries per level, since that fits well into a 4096-byte page. (Although good B-tree implementations often use pointer- or key- compression, so the exact number of values that will fit can vary) So in the case of having 1000 items, many of those can be stored inline in the top-level node, and the others should be at most one level down — although of course the exact details depend on your implementation. Generally, B-trees can hold hundreds of thousands of items with two levels, hundreds of millions with three levels, etc.
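
(With a branching factor of 512, two levels give roughly 512^2 ≈ 262,000 entries and three levels roughly 512^3 ≈ 134 million, which is where those figures come from.)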

But yes, tree operations are generally O(log n), and hash table operations are generally O(1), so the difference can be significant with large enough datasets.

Bucket expansion

Posted Jul 18, 2024 16:23 UTC (Thu) by paulj (subscriber, #341) [Link] (21 responses)

Yes, if you know the input domain is relatively fixed (i.e., far more reads than adds/deletes) the O(1) lookup of a hash-table is better than the O(logn) of a tree.

The question is on mixed work-loads really. I'd be curious how they compare. I guess best way would be implement and measure over different workloads.

Hash maps versus tree maps and cache effects

Posted Jul 18, 2024 16:43 UTC (Thu) by farnz (subscriber, #17727) [Link]

The other thing is to consider exactly which items you're reading; a tree will tend to cluster similar keys together (e.g. if the keys are strings, "linus" will be near "linux" in the tree structure), while hashes will tend to spread them out. This can, IME, result in a B-tree map being better for reads than a hash map, because the B-tree map ends up cache-friendly as a result of the workload, where the hash map does not.

As an aside, it's interesting to note that despite both radix trees and hash tables having been used in the past for page tables, both AArch64 and x86-64 use radix trees, not hash tables, as the core data structure for virtual address to physical address mapping.

Bucket expansion

Posted Jul 19, 2024 11:23 UTC (Fri) by Wol (subscriber, #4433) [Link] (19 responses)

> The question is on mixed work-loads really. I'd be curious how they compare. I guess best way would be implement and measure over different workloads.

Don't forget - with linear (and others) hashing insert and delete are also O(1).

Cheers,
Wol

Bucket expansion

Posted Jul 19, 2024 12:29 UTC (Fri) by paulj (subscriber, #341) [Link] (18 responses)

That's not the case. Insert for a normal hash table is O(n). Remember, O means the upper bound - the worst case. If you mean the non-split case is O(1) well then the equivalent non-rebalance case for a tree is also O(1). Indeed, the best cycle cost for a series of B-tree inserts could be much better than for a hash table, depending on memory details, as farnz pointed out.

Not sure about the split case, depends on the details of rebucketing I guess. I wouldn't be surprised if that ends up O(logn) as for trees. In which case it's O(logn). Maybe it's worse - analyses seem to be hard to find. And these analyses /do/ matter, if you want to have a chance at making your code resilient to DoS attacks. You need this kind of info to choose the right data-structures.

Sometimes the fastest common-case data-structure is a /bad/ choice, if it means attackers can bring your machine to its knees. Sometimes the slower average-case data-structure is better, if it has that reliable speed over /all/ conditions!

Bucket expansion

Posted Jul 19, 2024 14:08 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

I believe it's more normal to say hash table insert has O(1) amortized cost. Specifically that means a sequence of n inserts takes O(n) time (in the strict upper-bound sense of 'O'), which is counted as an average of O(n)/n = O(1) per insert. That's still the worst case performance (in contrast to e.g. a probabilistic average-case performance analysis), it's just using the worst case over a sequence rather than a single operation. And that's usually a better way to look at it because a single operation is usually so fast that nobody cares about its performance, they care about long sequences.
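
(Concretely, with a table that doubles when full, the total rehashing work across n inserts is at most 1 + 2 + 4 + ... + n < 2n, so the whole sequence costs O(n) and each insert is O(1) amortized.)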

(As far as I can tell from some searching, the equivalent amortized analysis of a B-tree says an insert still costs O(log n), unless you're starting from an empty tree in which case any sequence of n insert/delete operations takes O(n) time and therefore each operation has O(1) amortized cost. Hash table operations are O(1) amortized cost regardless of the initial state, so the O(1) is more widely applicable there.)

Bucket expansion

Posted Jul 19, 2024 15:53 UTC (Fri) by paulj (subscriber, #341) [Link]

O(f(n)) is by definition a bound over the worst-case input to a function.

Perhaps from an engineering perspective it is interesting to consider amortised costs (call that f(n) \in A(g(n)), say), but if you have a function whose behaviour can vary greatly over different inputs, then what does "amortised" even mean? One work load could produce A(1) behaviour, in terms of time complexity, another A(n^2), another A(n!) perhaps. If that work load can be influenced by others, then really you must consider only the worst-case, or those others can trigger a DoS potentially. So the more formal definition of O(g(n)) is actually the more useful one, for anything with security consequences.

Hash tables have terrible security properties, because they are O(n) - not O(1). You can take mitigations (cryptographically secure hash function [wait, you were worried about cycles?], oversized table for expected loads), but it's not that hard for people to miss those or get it wrong or... simply over time for typical work-loads to grow much larger than originally anticipated (more pages, more packets, more hosts, etc.).

Bucket expansion

Posted Jul 19, 2024 16:41 UTC (Fri) by Wol (subscriber, #4433) [Link] (15 responses)

> That's not the case. Insert for a normal hash table is O(n). Remember, O means the upper bound - the worst case. If you mean the non-split case is O(1) well then the equivalent non-rebalance case for a tree is also O(1). Indeed, the best cycle cost for a series of B-tree inserts could be much better than for a hash table, depending on memory details, as farnz pointed out.

Except I explicitly specified linear hashing. Although I was discussing the 95% case, not the worst case.

> Not sure about the split case, depends on the details of rebucketing I guess. I wouldn't be surprised if that ends up O(logn) as for trees. In which case it's O(logn). Maybe it's worse - analyses seem to be hard to find. And these analyses /do/ matter, if you want to have a chance at making your code resilient to DoS attacks. You need this kind of info to choose the right data-structures.

Okay, I'm thinking databases, which are unlikely targets for DoS attacks. Although some implementations do explicitly ask what sort of keys you are using, to make the system more resilient. No modern implementation I know of does that, though.

And again, I'm discussing the 95% case, not a pathological case, but again, "depending on the details of rebucketing", for linear hashing time is proportional to delta buckets - so if we exclude that 5% worst case it comes out as O(1)

> Sometimes the fastest common-case data-structure is a /bad/ choice, if it means attackers can bring your machine to its knees. Sometimes the slower average-case data-structure is better, if it has that reliable speed over /all/ conditions!

Sometimes. But it's a trade-off! Is rebooting your router in the middle of a DoS a good idea? It *might* be better than trying to stay up! I think we often make the wrong decision because we can :-) Here's a case in point - if O is defined as worst case, it unfairly paints hashing in a very bad light if you actually want the normal case.

Sometimes five nines isn't worth the cost. One and a half nines is still pretty damn good! Especially where that's reliably and demonstrably faster. And if that buys us three and a half nines of saved time? We can afford the reboot ...

Cheers,
Wol

Bucket expansion

Posted Jul 22, 2024 10:09 UTC (Mon) by paulj (subscriber, #341) [Link] (14 responses)

On the last bit, I strongly disagree. A system with predictable behaviour, that responds in a predictable, progressive manner to load (i.e. slows down linearly at worst, ideally logarithmically) will be the more reliable system; the less dangerous system; the system that doesn't randomly make you lose your weekends, or worse (take down important wide-scale systems, randomly).

Shaving a % or 2 off the common case, and forgetting that your worst case has now become so bad that you've now got something almost approaching a chaotic system, is what - too regularly - makes the tech world look like a clown show.

Bucket expansion

Posted Jul 22, 2024 10:16 UTC (Mon) by paulj (subscriber, #341) [Link] (6 responses)

Wish I could edit. Predictability of load response allows for proper planning. Planning for expected load, planning when you need to upgrade. That ability to plan, plus the resilience against unforeseen collapse in the face of unpredicted load, buys you time to do proper engineering to make sure you've got all the other good engineering practices ticked off.

In aviation parlance, it keeps you "ahead of the plane" (i.e. you've already anticipated what should happen and what can happen, have planned for it, and are completely in control either way).

Unfortunately in tech, we instead love to cut corners, to ignore good engineering practice, to throw things over the wall hoping it works and praying we can fix it quickly enough tomorrow if it does not. We wing it. We fly by the seat of our pants. And of course, we often end up flying "behind the aircraft" - reacting too late, without enough time. And we crash and burn.

Anyway, choose resilient data-structures, build robust systems. Stop the tech clown show.

(Just thinking of a recent event.... where a certain company apparently pushed some kind of file of control variables that gets parsed by a kernel driver, which was invalid and caused the kernel driver to crash or panic, and broke critical Windows machines all over the world causing wide-scale outages - and they pushed on a Friday!)

Bucket expansion

Posted Jul 22, 2024 11:51 UTC (Mon) by Wol (subscriber, #4433) [Link] (5 responses)

> Anyway, choose resilient data-structures, build robust systems. Stop the tech clown show.

Build rockets that can't take off because they're carrying so much extra weight they can't reach escape velocity ...

Instead build systems that are nimble, slick, and will simply outperform your robust systems into the dust.

I've never heard of a company that's had problems with my "5% worst case" - I guess because it's statistically improbable. I've heard plenty of stories of firms using your "resilient, robust" systems who simply couldn't adapt fast enough, and were taken down by the competition.

If you want fail-safe systems, sure, build fail-safe systems. My every-day work life consists of fighting systems that are too slow, are badly thrown together designs, are a nightmare to update, and use your allegedly robust systems to provide us with the wrong data. Yes I am very definitely in the end-user camp, and as far as I'm concerned, the clown show is all these "resilient, robust" systems, which are "heavy, lugubrious and slow", and are simply useless from the end-user's PoV.

The number of people I talk to who are totally UNsurprised at the number of FTSE 100 companies that seem to be run very much on Excel spreadsheets - talk about a totally unsuitable tool - except it just seems to be so much better at the job than most IT tools they want us to use.

My company is just *beginning* to go live with a system that has been "two years away" for maybe the last 5, 10 years? And yet there are systems out there - MultiValue anyone? - with a reputation for coming in EARLIER and CHEAPER than prediction.

Cheers,
Wol

Bucket expansion

Posted Jul 22, 2024 11:56 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

Let me give you a scenario. You're being chased by a lion. Do you

(a) Run. Doesn't matter where, doesn't matter what direction
or
(b) Spend five minutes working out the best way to escape and then carry out the plan.

You seem to be advocating (b) as the ONLY acceptable choice in ALL circumstances. Especially if choosing (b) divides your top sprint speed by four ...

In business, as in life, you'll end up as someone else's lunch in damn quick order :-)

Cheers,
Wol

Bucket expansion

Posted Jul 22, 2024 12:31 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (1 responses)

Or (c), Run now, but remember to consider "lions are near" next time you're in the area. Also known as "keeping track of tech debt".

Bucket expansion

Posted Jul 22, 2024 12:34 UTC (Mon) by Wol (subscriber, #4433) [Link]

:-)

Cheers,
Wol

Bucket expansion

Posted Jul 22, 2024 14:15 UTC (Mon) by paulj (subscriber, #341) [Link] (1 responses)

No, you build a rocket where you can reliably predict the behaviour of the control loop. E.g. (contrived), you know that even if the rate of sensor inputs increases by X, the control loop will only slow down by X/n, and hence you can confidently say - knowing there are only m*X sensors with a maximum output rate of z/s, that the control loop will still complete its iterations within the required bounds to keep the rocket under control.

And if it does not quite meet the bounds, _then_ you can start shaving cycles. And you can still be confident - assuming you don't change the fundamentals of the underlying data-structures - that you're not introducing some catastrophic worst-case.

Alternatively, ignore this stuff, and your control loop works fine 99% of the time in simulation, but then within the first few launches, *boom*. You shave some cycles, but because you still don't understand what the bounds are on your data-structure and - worse - when you /do/ do an analysis of the bounds of your data-structure you're constructing O(f(n)) analyses based on _amortised_ costs (wtf?), you're both misleading _yourself_ and any other engineer about the worst-case behaviour. And... another few launches... and *boom*.

And look, I recognise very much what you describe. A lot of IT is cello-taped together. (Or held together with linen patches and "dope", to go back to an [early] aviation analogy.) And we won't get rid of ad-hoc systems (ad-hoc from conception, or by degeneration). And many serve a purpose. That's fine.

What I'm asking for is that the software engineers who should _know better_ and who are meant to _design_ well-engineered systems (as opposed to hack stuff together) recognise the value in things like "Trees with decades of formal analyses have provably better described behaviour than array and/or list-based data-structures that rely on pseudo-tree like operations that have only 1 paper describing them and next to no independent formal analyses".

Bucket expansion

Posted Jul 22, 2024 14:42 UTC (Mon) by paulj (subscriber, #341) [Link]

"But Paul, most of us aren't working on rockets or nuclear power stations. We writing code for consumer stuff that doesn't matter!"

And people come home and find their fridge has burned out or defrosted: https://www.youtube.com/watch?v=2gv5MXXR4TE . Or no one at the airport can find any flight information (not the handlers at the check-in desks, not the information screens, etc.). And ok, those bugs usually aren't data-structure choices, but sometimes they are. Either way, it's all part of the same clown-show attitude in tech.

Large numbers and tiny probabilities

Posted Jul 22, 2024 11:38 UTC (Mon) by farnz (subscriber, #17727) [Link] (6 responses)

The other thing that plays into this is that people's untrained intuition for numbers is pretty poor outside the range 0 to 100 or so.

A random failure rate needing human intervention of 1 in 10^12 sounds pretty decent in isolation - 10^12 is a huge number. But if you deal with 1 billion RPS (not unreasonable scale for someone like Google), that same failure rate means you're averaging something on the order of one human intervention every 15 minutes - most people can't intuit that the high RPS rate means that a very low failure per request rate is still significant.
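
(At 10^9 requests per second, a 10^-12 per-request failure rate works out to about 10^-3 failures per second, i.e. one roughly every 17 minutes.)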

Large numbers and tiny probabilities

Posted Jul 22, 2024 12:43 UTC (Mon) by Wol (subscriber, #4433) [Link] (5 responses)

> The other thing that plays into this is that people's untrained intuition for numbers is pretty poor outside the range 0 to 100 or so.

And everybody wears blinkers, that they're often unaware of. I can see that in myself sometimes :-)

Thing is, in this particular argument, I see people arguing "but we can guarantee 100% certain that all our runners will complete the 100m in one minute or less".

SERIOUSLY, I'm pointing out that the statistical likelihood of any of my runners requiring more than 10 seconds is beyond the bounds of normality.

Both approaches have their place, but when kernel developers are arguing over a few percent, a speed difference of that magnitude is hard to ignore. And the cost of mitigating an attack - IF it happens - is a tradeoff.

imnsho a productivity hole of that magnitude is worth a few days overtime once a year :-) Others may disagree ... (ime that hole CAUSES a lot more than a few days overtime once a year ...)

Cheers,
Wol

Large numbers and tiny probabilities

Posted Jul 22, 2024 13:48 UTC (Mon) by farnz (subscriber, #17727) [Link] (4 responses)

But this is why it's an engineering problem, and there is no one right answer.

If the consequences of one of your runners taking more than a minute to complete a 100 metre sprint is "the universe ceases to exist", then I'd like a lot more confidence than "the statistical likelihood of any of my runners needing more than 1/5th of a minute is extremely low". If, on the other hand, it's "my coffee takes an extra minute to prepare", then your statistical claim is good enough.

Reality will be somewhere between those two extremes, and part of the problem is that intuition is very often not enough to determine where it falls between those extremes, except in massively simplified cases, in part because intuition breaks down when the numbers fall outside your normal experience.

And I'd note that, if we were talking about the 100m sprint, there's fewer than 200 people in the world who've done a 100m sprint in under 10 seconds without external aid (e.g. wind in their favour). Only 34 people have run that fast in 2024, for example; this says something about your intuition, perhaps :-)

Large numbers and tiny probabilities

Posted Jul 22, 2024 14:16 UTC (Mon) by Wol (subscriber, #4433) [Link]

> And I'd note that, if we were talking about the 100m sprint, there's fewer than 200 people in the world who've done a 100m sprint in under 10 seconds without external aid (e.g. wind in their favour). Only 34 people have run that fast in 2024, for example; this says something about your intuition, perhaps :-)

:-) :-)

Yup, I know the 10sec 100m is very fast. But as far as my database is concerned, the speed difference between the first and ninth decile is ZERO. The best case is also the normal case. (Not true of running, I know.)

Cheers,
Wol

Large numbers and tiny probabilities

Posted Jul 22, 2024 14:19 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

> But this is why it's an engineering problem, and there is no one right answer.

And if you read back over my posts, you can see I very much agree with that attitude. I just feel that the attitude "we need a bounded worst case" is a theoretical requirement beloved of academicians, that is a very poor fit to most real world cases - of course, a nuclear power station meltdown isn't one of them, but it's not a normal case, either :-) But that's the case most real world scenarios are bludgeoned into. Square pegs and round holes and all that ...

Cheers,
Wol

Bounded worst case and ripple effects

Posted Jul 22, 2024 15:44 UTC (Mon) by farnz (subscriber, #17727) [Link] (1 responses)

Unless you know the effect of failing to meet your worst case, though, you can't make that statement; for example, I've worked on a system where the effect of a request taking too long was for the load balancer to reroute the request to another server, on the assumption that the server it first chose is broken, and to mark the server as "failing". One accidental construction of a bad request later, and the load balancer merrily took out the entire server fleet because (a) requests were not independent - once one client had cause to construct a bad request, it was very likely that the rest would soon do the same, and (b) a slow request turned into "server down" and automatic mitigation, which escalated until the entire system was down.

And it's important to note that I've elided a lot of detail there - underlying it is that everything in the system was locally sensible on a "statistically unlikely" basis, right up until we hit some bad luck and a 1 in 10^77 random occurrence happened across a large number of clients all at the same time, because the assumptions underlying "that's statistically unlikely" failed to take into account the inter-dependencies of the system as a whole once you brought the users into the loop.

Bounded worst case and ripple effects

Posted Jul 22, 2024 17:35 UTC (Mon) by Wol (subscriber, #4433) [Link]

> Unless you know the effect of failing to meet your worst case, though, you can't make that statement; for example, I've worked on a system where the effect of a request taking too long was for the load balancer to reroute the request to another server, on the assumption that the server it first chose is broken, and to mark the server as "failing". One accidental construction of a bad request later, and the load balancer merrily took out the entire server fleet because (a) requests were not independent - once one client had cause to construct a bad request, it was very likely that the rest would soon do the same, and (b) a slow request turned into "server down" and automatic mitigation, which escalated until the entire system was down.

Which again, is (or rather should be) part of the engineering rationale for deploying such a system. Okay, I work on what you would probably call small-scale systems - a supermarket customer management system. And I'm only interested in what's going on in the next four weeks (unless it's Christmas, of course).

I have to use the same OTT safety conscious systems as the world thinks is "best practice". Personally I'd use real-world primary keys but I have to use guids ... but if I'm only interested in 28 days data, and I have to use some sort of non-real-world primary key, it's pretty obvious an incrementing sequential number that expires cannot be a pathological case for any sane hash system ...

If I can deploy my "System of choice", that has 80% less resource requirement, the hours a day I spend watching the hourglass as Snoracle tries to find the data I want will suddenly be available for useful work. If installed on your load-balancing systems, it would be far less likely to overload the system (okay, that's a reduction in risk, not a removal of risk).

Thing is, your "guaranteed response" has a real cost. I'd cost it as sticking a nought on the price tag. Which may be well worth paying! But paulj is afraid of coming home and finding his fridge defrosted. I'm more scared of coming home and finding my house underwater because the Antarctic has defrosted ... If I can reduce my energy costs by 80% by using appropriate technology ... it'd make a hell of a difference if everybody did it.

I've never heard of anybody using btrees for anything but indices in MV (because there's no choice). You'd have thought if the same economics applied to data storage, btree would be the choice. But it's unheard of, so I guess the cost is too great.

Guess what I should do (if only I could find the time, might use it to learn Rust ! :-) is to write a hashtree algorithm. That uses dynamic hashing by default, changes the seed every time it doubles the number of buckets (which most kernel algorithms do, I believe), and if it thinks an attack is under way (or the RT guys want it) switches to a constant-response tree.

Given that I think in key->data terms, and most of you seem to think in key->ptr terms, I'm not sure the speed increase here would be quite what I would hope. But I would still expect - in the normal case - for it to be noticeably faster. Just maybe not 80%

Cheers,
Wol

Bucket expansion

Posted Jul 19, 2024 10:27 UTC (Fri) by Wol (subscriber, #4433) [Link]

> Modern B-trees often have approximately 512 entries per level, since that fits well into a 4096-byte page. (Although good B-tree implementations often use pointer- or key- compression, so the exact number of values that will fit can vary) So in the case of having 1000 items, many of those can be stored inline in the top-level node, and the others should be at most one level down — although of course the exact details depend on your implementation. Generally, B-trees can hold hundreds of thousands of items with two levels, hundreds of millions with three levels, etc.

Ah - so each entry in your b-tree is key/pointer. That makes sense, you minimise page accesses, and you assume that data is "hot" so once you've got the pointer you go straight to it. Whereas each entry in my hash is key/value, so with a 2048-byte page it's not unusual to only have three or four entries max. (Page size is configurable per hash table.)

> But yes, tree operations are generally O(log n), and hash table operations are generally O(1), so the difference can be significant with large enough datasets.

And this is (a) where the maths gets complicated, and (b) people start talking past each other. My cold best (normal) case is twice as fast as yours - because I'm thinking key->value=data. But you can't see that, because you're thinking key->value=ptr.

Then of course your hot path closes that gap. At which point you now have a fixed worst case (which I don't). But my best case is also my normal case, so if I'm happy with the occasional pathology ...

(And it explains why Pick/MV implementations have switched to btree indices, which cheeses me off because for me they are usually foreign primary keys, and I can no longer use the index as a separate table of virtual data fields!) (And because I'm thinking databases, and disk i/o, the normal case is very important to me.)

Hmmm... tradeoffs, tradeoffs, ...

Cheers,
Wol

Bucket expansion

Posted Jul 18, 2024 17:05 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

Sounds like a topic for one of Knuth's Christmas Tree lectures (which may well have already been done).

Bucket expansion

Posted Jul 15, 2024 20:42 UTC (Mon) by edeloget (subscriber, #88392) [Link]

But then there should still exist degenerate cases where we get too many items per bucket, unless I'm mistaken. Does anyone have any clue how the worst case performs?

Rehashing

Posted Jul 15, 2024 18:52 UTC (Mon) by npws (subscriber, #168248) [Link]

While average access time is important, worst case is as well, especially considering the real-time efforts. Not sure in which contexts this is intended to be used (I assume purely user context), but even if it beats rhashtable on average performance, I'm sceptical that the worst-case performance will be acceptable as well.

Hash tables are everywhere

Posted Jul 15, 2024 20:42 UTC (Mon) by Sesse (subscriber, #53779) [Link] (8 responses)

I'm a bit surprised that these designs are still in use. In the application world, I don't think I know of a single design anymore that uses multiple-values-in-a-bucket (precisely because it needs linked lists and/or memory allocation). I see linear probing, quadratic probing, double hashing, cuckoo hashing, linear probing with Robin Hood hashing, 4-way associativity (in particular, for the case where you have fixed-size values and are allowed to drop some of them if you'd like), SwissTable's linear-probing-with-16-way-associativity-and-SIMD… does the kernel really have such different needs due to RCU and such, or is it just that they don't want to look at state of the art in userspace?

Or maybe someone will prove me wrong and show that the bucket-overflow thing is still really common and great. I think I saw it in glib back in the day, but nobody serious would use that as a high-performance hash table. :-)

Hash tables are everywhere

Posted Jul 15, 2024 21:21 UTC (Mon) by Wol (subscriber, #4433) [Link] (5 responses)

> I'm a bit surprised that these designs are still in use.

Maybe because some people aren't taken in by "ooh - new - shiny". Old fashioned simple hashing, yes that doesn't work that well. But bucket overflow really isn't a thing with linear hashing. Which isn't that much more complicated than simple hashing. Given a database with strings of random length (so your buckets can be all over the place) the stats show that linear hashing has a 95% hit rate.

To put a bit more flesh on that statement, an MV bucket is typically 2KB in size. Given a record's primary key, there is a 95% chance that the database will find either the record it's looking for, or a direct pointer saying exactly where it is, in the first bucket it retrieves.

With only 5% room for improvement, is the cost of all these fancy techniques worth it?

(and because MV can store a list of related keys in a record, it almost never needs to do a search of any sort ...)

Cheers,
Wol

Hash tables are everywhere

Posted Jul 16, 2024 4:21 UTC (Tue) by intelfx (subscriber, #130118) [Link] (2 responses)

> Maybe because some people aren't taken in by "ooh - new - shiny"

Blindly writing off new inventions and novel algorithms as "ooh new shiny" is a fallacy not any better than the one you refer to.

Hash tables are everywhere

Posted Jul 16, 2024 5:46 UTC (Tue) by LtWorf (subscriber, #124958) [Link] (1 responses)

I don't have this impression, from reading the entirety of the comment.

Hash tables are everywhere

Posted Jul 16, 2024 7:25 UTC (Tue) by Wol (subscriber, #4433) [Link]

THIS.

Sometimes "ooh new shiny" is worth it. But if the maths says "the cost of the optimisation is greater than 5%, and the technique you're currently using is 95% efficient", then new shiny is a net cost, not gain.

And you'll have seen me railing over tech churn. Okay, I have massive exposure to older people (many of them elderly not just old), and I'm repeatedly and forcibly exposed to problems caused by new tech, or even just tech.

Elderly people in particular just CANNOT adapt.

Cheers,
Wol

Hash tables are everywhere

Posted Jul 17, 2024 0:38 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

> With only 5% room for improvement, is the cost of all these fancy techniques worth it?

This is why we benchmark. Sometimes, your average hash table has three things (keys) in it, and (assuming a reasonable number of buckets) collisions basically don't exist. The problem is easy, and any non-braindead implementation will probably be fine. Sometimes, your average hash table has thousands of things in it, and you pretty much have to balance between too many collisions and too many buckets. In an ideal world, the hash table implementation is carefully tuned for your use case, so that it rehashes at an appropriate fullness to meet your needs (e.g. batch workloads can use a higher fullness if you're OK with them running a bit slower, but interactive workloads can't get away with that).

But in the real world, that's not always what happens. People use the stdlib hash table implementation, if the language provides one, because it is easy. It does not demand that they think about complexities such as rehashing and optimal fullness. In exchange for this convenience, it is probably at least a little suboptimal for most use cases. You probably have too many collisions, or too many buckets, at least some of the time.

When you think about it in those terms, you realize that every performance optimization probably helps someone, somewhere. If collisions are cheaper, then the people who need fast lookups are happy, the people who need low memory usage can get away with a higher fullness for the same time performance, and the people who are just using stdlib hash tables should benefit in at least one of those dimensions, depending on how the implementation decides to handle the improvement. Meanwhile, the people storing three things in their hash table don't know or care about it, because the API is exactly the same as before.

Hash tables are everywhere

Posted Jul 17, 2024 8:51 UTC (Wed) by Wol (subscriber, #4433) [Link]

> This is why we benchmark. Sometimes, your average hash table has three things (keys) in it, and (assuming a reasonable number of buckets) collisions basically don't exist.

Which is why, in Pick/MV with linear hashing, I create all my tables with a bucket-count of 1, and then leave it to MV and forget about it. It splits buckets as required, and given that the cost of splitting a bucket is order 1 (have I got that right - 1 is constant?) it can do it on the fly without me noticing a thing. So to use your words, in Pick/MV collisions basically don't exist. At any time (unless I've screwed up my choice of hash function). Oh - and space usage is about 75% fill.

The only time I wouldn't use the default of 1 is if I have a table of known average, or I'm prefilling a grow-only table, where I will specify a minimum size to be the average, or the initial filled size, because it makes loading it a bit faster.

(I remember before linear hashing came in, someone created a table with a bucket count of 5. It was only when the users came to me and said "the system is incredibly slow" that I took a look, and "OUCH!!!". Cue a couple of days checking and resizing every table :-)

Cheers,
Wol

Hash tables are everywhere

Posted Jul 16, 2024 13:43 UTC (Tue) by willy (subscriber, #9762) [Link] (1 responses)

I don't really have time to work on this. If you want to be "the hash table expert" in the Linux kernel, this would be a great time to jump in and take it over.

(this is a "plural you" -- not necessarily the person I'm replying to)

There are a lot of very poor data structures in use inside Linux. Having someone who really knows what they're doing with modern data structures would be very beneficial.

Hash tables are everywhere

Posted Jul 16, 2024 13:57 UTC (Tue) by Sesse (subscriber, #53779) [Link]

> There are a lot of very poor data structures in use inside Linux. Having someone who really knows what they're doing with modern data structures would be very beneficial.

I think this answers my question. Thanks!

(And yes, I know this is a lot of work. I've implemented hash tables in other large projects, and tuning the performance optimally for everyone is a pretty exhausting topic.)

No induced collision avoidance?

Posted Jul 15, 2024 21:42 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

I'm super worried that they are not doing anything to avoid induced collisions. This can become a vector for DoS attacks.

Perhaps a per-map random seed to rejiggle the buckets on array growth?

No induced collision avoidance?

Posted Jul 15, 2024 22:30 UTC (Mon) by Paf (subscriber, #91811) [Link] (6 responses)

Is there such a thing in the current rhashtable?

No induced collision avoidance?

Posted Jul 15, 2024 22:40 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Nope. But it looks like rhashtable is not widely used, it's mostly present in Ethernet drivers and in the network stack. For things that are somewhat harder to manipulate by an attacker.

The only other significant user, that is a good attack vector, seems to be bcachefs.

No induced collision avoidance?

Posted Jul 16, 2024 23:43 UTC (Tue) by neilbrown (subscriber, #359) [Link] (2 responses)

> Nope.

Wrong answer :-)

rhashtable generates a random seed each time the table is resized, and if a chain ever reaches 16, the table is "resized" though the size doesn't change.
So as long as the hash function is effective at using the seed to modify the distribution, any induced collision will be quickly disabled by the seed being changed.
Of course if client code requests a poor hash function, all bets are off.

No induced collision avoidance?

Posted Jul 16, 2024 23:45 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Ah, great! Is rosebush similar?

No induced collision avoidance?

Posted Jul 16, 2024 23:58 UTC (Tue) by neilbrown (subscriber, #359) [Link]

> Ah, great! Is rosebush similar?

I cannot find the word "seed" in the patch, and the documentation suggests that hashing is left entirely up to the caller.
So I don't think rosebush attempts to address this issue.

No induced collision avoidance?

Posted Jul 16, 2024 7:35 UTC (Tue) by Wol (subscriber, #4433) [Link]

Dunno about rhashtable in particular, but I think anything with a seed has it generated at the same time as the hashtable.

So while it remains constant as long as the hash table "just grows", as soon as anything triggers the hash table into rebuilding (this could be as simple as it getting too big and being rebuilt with a new size), a new seed is generated.

So any attack that needs to know the seed is aiming at a moving target ...

Cheers,
Wol

No induced collision avoidance?

Posted Jul 16, 2024 10:46 UTC (Tue) by npws (subscriber, #168248) [Link]

It's up to the caller to add a "salt" to the key, like net_random().

No induced collision avoidance?

Posted Jul 16, 2024 8:48 UTC (Tue) by vegard (subscriber, #52330) [Link] (1 responses)

The kernel currently uses the parent directory pointer as a salt for the dentry hash function. It does so in such a way that if two directory entries of the same parent directory were to collide, they would almost certainly not collide under a different parent directory (in other words, creating a collision in one directory does not imply the same names would collide in a different directory). Kernel pointers are not entirely random, but they are also not knowable to an attacker, so this is deemed good enough. Looking at the Rosebush patches for the dentry cache, they don't change the hashing mechanism at all, so it would still be using the same mechanism. Other Rosebush users would presumably have to supply their own hashing function with its own salting mechanism.
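
For the curious, this is visible in the VFS string-hashing helpers, which take a salt pointer as their first argument; the dcache passes the parent dentry. A simplified illustration (not a verbatim kernel excerpt):

    /* The salt (here the parent dentry pointer) is mixed into the hash of
     * the name, so identical names under different parents hash differently. */
    unsigned int hash = full_name_hash(parent, name, len);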

No induced collision avoidance?

Posted Jul 16, 2024 22:33 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> Kernel pointers are not entirely random, but they are also not knowable to an attacker

Hah, I wonder if this can be used in reverse, to obtain the kernel pointers by carefully timing the insertions...?

Embedded linkage

Posted Jul 16, 2024 23:53 UTC (Tue) by neilbrown (subscriber, #359) [Link] (2 responses)

> For one thing, a structure with an embedded linked list entry in it cannot be stored in multiple lists.

Not true. You simply need one embedded link for each list (or hash table) you want to store the structure in.

The avoidance of embedding is an interesting choice and certainly has some positives. Maybe it will grow on me....

Embedded linkage

Posted Jul 17, 2024 8:42 UTC (Wed) by Wol (subscriber, #4433) [Link]

And I should have replied earlier, but to take the Pick/MV example - I think a chain of buckets is a linked list, but each bucket is a disk block read into a buffer. So the only memory allocation needed is buffers for buckets.

Within the bucket everything is ?Fortran style strings, a block of memory with a byte-count followed by space for the string. Updating the data may not use optimal tactics, but if the majority of your buckets are not in overflow, finding what you're looking for is extremely fast.

Oh - and I've mentioned it before - the astronomical database that turned into a shootout between Cache and (sn)Oracle. Where the acceptance test specified 100K insertions per hour (or whatever it was). Oracle had to disable live indexing etc etc in order to hit the target. When Cache went into production, they breezed past 250K in weeks ...

(Cache/Mumps isn't Pick/MV, but someone's written an MV userspace for it, and it's based on similar ideas.)

Cheers,
Wol

Embedded linkage

Posted Jul 17, 2024 10:23 UTC (Wed) by andy_shev (subscriber, #75870) [Link]

The problem with this approach is that one list entry access can be made a no-op (by putting the node as the first member in the structure), but all the rest cannot, while having a list-free object makes them all equal from this point of view.

Analysis is hard.

Posted Jul 17, 2024 0:07 UTC (Wed) by neilbrown (subscriber, #359) [Link]

> Since the kernel never has any reason to look up an empty bucket, it is more interesting to consider the average length of non-empty buckets.

When looking for an entry which is not (yet) present, you may well look in an empty bucket. In fact with a good hash you are very likely to look in an empty bucket.

I think average bucket length is certainly an interesting metric, but given the complex effects of the CPU cache (which rosebush deliberately attempts to work with), I'd trust actual testing much more than any abstract analysis.

Looking forward to the followup article which pits these two excellent libraries head to head and declares a winner!

What does the spinlock protect?

Posted Jul 18, 2024 20:57 UTC (Thu) by DanilaBerezin (subscriber, #168271) [Link]

Maybe I'm reading it wrong or maybe the documentation is just wrong (although I doubt that), but it seems like each bucket is protected by the spin lock, and only the key pointers are RCU protected:

"Each bucket contains a spinlock (for modifications to the bucket), the number of entries in the bucket and a contention counter."

https://lwn.net/ml/linux-kernel/20240222203726.1101861-2-...


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds