Bucket expansion

Posted Jul 19, 2024 11:23 UTC (Fri) by Wol (subscriber, #4433)
In reply to: Bucket expansion by paulj
Parent article: A hash table by any other name

> The question is on mixed work-loads really. I'd be curious how they compare. I guess best way would be implement and measure over different workloads.

Don't forget - with linear (and other) hashing schemes, insert and delete are also O(1).
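
For concreteness, here's a minimal sketch of the linear-hashing insert path (Python; chained buckets and a load-factor trigger, with names that are illustrative rather than taken from any particular implementation). The point is that each insert touches one bucket and splits at most one more, so the split work is bounded instead of being a stop-the-world rehash:

class LinearHashTable:
    """Litwin-style linear hashing: buckets are split one at a time, in a
    fixed order, rather than rehashing the whole table at once."""

    def __init__(self, initial_buckets=4, max_load=0.75):
        self.n0 = initial_buckets        # buckets at level 0
        self.level = 0                   # completed doublings so far
        self.split = 0                   # next bucket due to be split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0
        self.max_load = max_load

    def _index(self, key):
        h = hash(key)
        i = h % (self.n0 * 2 ** self.level)
        if i < self.split:               # that bucket has already been split
            i = h % (self.n0 * 2 ** (self.level + 1))
        return i

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def insert(self, key, value):
        self.buckets[self._index(key)].append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()            # bounded work: one bucket only

    def _split_one(self):
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])          # the new image of bucket 'split'
        new_mod = self.n0 * 2 ** (self.level + 1)
        for k, v in old:                 # only this one bucket is rehashed
            self.buckets[hash(k) % new_mod].append((k, v))
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.level += 1              # a full doubling has completed
            self.split = 0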

Cheers,
Wol



Bucket expansion

Posted Jul 19, 2024 12:29 UTC (Fri) by paulj (subscriber, #341) [Link] (18 responses)

That's not the case. Insert for a normal hash table is O(n). Remember, O means the upper bound - the worst case. If you mean the non-split case is O(1) well then the equivalent non-rebalance case for a tree is also O(1). Indeed, the best cycle cost for a series of B-tree inserts could be much better than for a hash table, depending on memory details, as farnz pointed out.

Not sure about the split case; it depends on the details of rebucketing, I guess. I wouldn't be surprised if that ends up O(log n), as for trees. In which case it's O(log n). Maybe it's worse - analyses seem to be hard to find. And these analyses /do/ matter, if you want to have a chance at making your code resilient to DoS attacks. You need this kind of info to choose the right data-structures.

Sometimes the fastest common-case data-structure is a /bad/ choice, if it means attackers can bring your machine to its knees. Sometimes the slower average-case data-structure is better, if it has that reliable speed over /all/ conditions!

Bucket expansion

Posted Jul 19, 2024 14:08 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

I believe it's more normal to say hash table insert has O(1) amortized cost. Specifically that means a sequence of n inserts takes O(n) time (in the strict upper-bound sense of 'O'), which is counted as an average of O(n)/n = O(1) per insert. That's still the worst case performance (in contrast to e.g. a probabilistic average-case performance analysis), it's just using the worst case over a sequence rather than a single operation. And that's usually a better way to look at it because a single operation is usually so fast that nobody cares about its performance, they care about long sequences.
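
A quick way to see that accounting is to count the rehash work a doubling table does over a run of inserts. A toy Python sketch (illustrative sizes only, not any real implementation):

def total_rehash_work(n_inserts, initial_capacity=8):
    """Count how many elements get rehashed by a table that doubles
    every time it fills up."""
    capacity, size, rehashed = initial_capacity, 0, 0
    for _ in range(n_inserts):
        if size == capacity:        # table full: double it, rehash everything
            rehashed += size
            capacity *= 2
        size += 1
    return rehashed

for n in (10**3, 10**4, 10**5, 10**6):
    work = total_rehash_work(n)
    print(n, work, round(work / n, 3))   # the ratio stays below 2

The geometric series is doing the work there: however long the run, the total resize cost stays proportional to n, which is the amortized-O(1) claim.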

(As far as I can tell from some searching, the equivalent amortized analysis of a B-tree says an insert still costs O(log n), unless you're starting from an empty tree in which case any sequence of n insert/delete operations takes O(n) time and therefore each operation has O(1) amortized cost. Hash table operations are O(1) amortized cost regardless of the initial state, so the O(1) is more widely applicable there.)

Bucket expansion

Posted Jul 19, 2024 15:53 UTC (Fri) by paulj (subscriber, #341) [Link]

O(f(n)) is by definition a bound over the worst-case input to a function.

Perhaps from an engineering perspective it is interesting to consider amortised costs (call that f(n) \in A(g(n)), say), but if you have a function whose behaviour can vary greatly over different inputs, then what does "amortised" even mean? One work load could produce A(1) behaviour, in terms of time complexity, another A(n^2), another A(n!) perhaps. If that work load can be influenced by others, then really you must consider only the worst case, or those others can potentially trigger a DoS. So the more formal definition of O(g(n)) is actually the more useful one, for anything with security consequences.
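
For reference, the usual textbook forms of the two notions being contrasted here (T is the cost function, g the claimed bound; the amortised version is still a worst case, just taken over a whole sequence of operations rather than a single one):

    T(n) \in O(g(n)) \iff \exists\, c > 0,\ n_0 : T(n) \le c \cdot g(n) \text{ for all } n \ge n_0

    \text{amortised cost per operation} = \max_{\text{sequences of } m \text{ operations}} \frac{T(\text{sequence})}{m}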

Hash tables have terrible security properties, because they are O(n) - not O(1). You can take mitigations (cryptographically secure hash function [wait, you were worried about cycles?], oversized table for expected loads), but it's not that hard for people to miss those or get it wrong or... simply over time for typical work-loads to grow much larger than originally anticipated (more pages, more packets, more hosts, etc.).
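
The keyed-hash mitigation can be as simple as mixing a per-table random secret into the hash, so an attacker who doesn't know the secret can't precompute a set of keys that all land in the same bucket. A minimal Python sketch (the use of BLAKE2 here is just for illustration - it's the secret seed that matters, and yes, it costs cycles):

import os, hashlib

TABLE_SEED = os.urandom(16)            # drawn once per table, never exposed

def bucket_index(key: bytes, nbuckets: int) -> int:
    # Keyed hash: without TABLE_SEED an attacker cannot predict which keys
    # collide, so flooding a single bucket becomes impractical.
    digest = hashlib.blake2b(key, key=TABLE_SEED, digest_size=8).digest()
    return int.from_bytes(digest, "little") % nbuckets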

Bucket expansion

Posted Jul 19, 2024 16:41 UTC (Fri) by Wol (subscriber, #4433) [Link] (15 responses)

> That's not the case. Insert for a normal hash table is O(n). Remember, O means the upper bound - the worst case. If you mean the non-split case is O(1) well then the equivalent non-rebalance case for a tree is also O(1). Indeed, the best cycle cost for a series of B-tree inserts could be much better than for a hash table, depending on memory details, as farnz pointed out.

Except I explicitly specified linear hashing. Although I was discussing the 95% case, not the worst case.

> Not sure about the split case; it depends on the details of rebucketing, I guess. I wouldn't be surprised if that ends up O(log n), as for trees. In which case it's O(log n). Maybe it's worse - analyses seem to be hard to find. And these analyses /do/ matter, if you want to have a chance at making your code resilient to DoS attacks. You need this kind of info to choose the right data-structures.

Okay, I'm thinking databases, which are unlikely targets for DoS attacks. Although some implementations did explicitly ask what sort of keys you were using, to make the system more resilient. No modern implementation I know of does that, though.

And again, I'm discussing the 95% case, not a pathological case. But again, "depending on the details of rebucketing", for linear hashing the time is proportional to delta buckets - so if we exclude that 5% worst case it comes out as O(1).

> Sometimes the fastest common-case data-structure is a /bad/ choice, if it means attackers can bring your machine to its knees. Sometimes the slower average-case data-structure is better, if it has that reliable speed over /all/ conditions!

Sometimes. But it's a trade-off! Is rebooting your router in the middle of a DoS a good idea? It *might* be better than trying to stay up! I think we often make the wrong decision because we can :-) Here's a case in point - if O is defined as worst case, it unfairly paints hashing in a very bad light if you actually want the normal case.

Sometimes five nines isn't worth the cost. One and a half nines is still pretty damn good! Especially where that's reliably and demonstrably faster. And if that buys us three and a half nines of saved time? We can afford the reboot ...

Cheers,
Wol

Bucket expansion

Posted Jul 22, 2024 10:09 UTC (Mon) by paulj (subscriber, #341) [Link] (14 responses)

On the last bit, I strongly disagree. A system with predictable behaviour, one that responds to load in a predictable, progressive way (i.e. slows down linearly at worst, ideally logarithmically), will be the more reliable system; the less dangerous system; the system that doesn't randomly make you lose your weekends, or worse (take down important wide-scale systems, randomly).

Shaving a % or 2 off the common case, and forgetting that your worst case has now become so bad that you've now got something almost approaching a chaotic system, is what - too regularly - makes the tech world look like a clown show.

Bucket expansion

Posted Jul 22, 2024 10:16 UTC (Mon) by paulj (subscriber, #341) [Link] (6 responses)

Wish I could edit. Predictability of load response allows for proper planning. Planning for expected load, planning when you need to upgrade. That ability to plan, plus the resilience against unforeseen collapse in the face of unpredicted load, buys you time to do proper engineering to make sure you've got all the other good engineering practices ticked off.

In aviation parlance, it keeps you "ahead of the plane" (i.e. you've already anticipated what should happen and what can happen, have planned for it, and are completely in control either way).

Unfortunately in tech, we instead love to cut corners, to ignore good engineering practice, to throw things over the wall hoping it works and praying we can fix it quickly enough tomorrow if it does not. We wing it. We fly by the seat of our pants. And of course, we often end up flying "behind the aircraft" - reacting too late, without enough time. And we crash and burn.

Anyway, choose resilient data-structures, build robust systems. Stop the tech clown show.

(Just thinking of a recent event... where a certain company apparently pushed some kind of file of control variables that gets parsed by a kernel driver, which was invalid and caused the kernel driver to crash or panic, and broke critical Windows machines all over the world causing wide-scale outages - and they pushed on a Friday!)

Bucket expansion

Posted Jul 22, 2024 11:51 UTC (Mon) by Wol (subscriber, #4433) [Link] (5 responses)

> Anyway, choose resilient data-structures, build robust systems. Stop the tech clown show.

Build rockets that can't take off because they're carrying so much extra weight they can't reach escape velocity ...

Instead build systems that are nimble, slick, and will simply outperform your robust systems into the dust.

I've never heard of a company that's had problems with my "5% worst case" - I guess because it's statistically improbable. I've heard plenty of stories of firms using your "resilient, robust" systems who simply couldn't adapt fast enough, and were taken down by the competition.

If you want fail-safe systems, sure, build fail-safe systems. My every-day work life consists of fighting systems that are too slow, are badly thrown together designs, are a nightmare to update, and use your allegedly robust systems to provide us with the wrong data. Yes I am very definitely in the end-user camp, and as far as I'm concerned, the clown show is all these "resilient, robust" systems, which are "heavy, lugubrious and slow", and are simply useless from the end-user's PoV.

It's telling how many people I talk to are totally UNsurprised at the number of FTSE 100 companies that seem to be run very much on Excel spreadsheets - talk about a totally unsuitable tool - except it just seems to be so much better at the job than most IT tools they want us to use.

My company is just *beginning* to go live with a system that has been "two years away" for maybe the last 5, 10 years? And yet there are systems out there - MultiValue anyone? - with a reputation for coming in EARLIER and CHEAPER than predicted.

Cheers,
Wol

Bucket expansion

Posted Jul 22, 2024 11:56 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

Let me give you a scenario. You're being chased by a lion. Do you

(a) Run. Doesn't matter where, doesn't matter what direction
or
(b) Spend five minutes working out the best way to escape and then carry out the plan.

You seem to be advocating (b) as the ONLY acceptable choice in ALL circumstances. Especially if choosing (b) divides your top sprint speed by four ...

In business, as in life, you'll end up as someone else's lunch in damn quick order :-)

Cheers,
Wol

Bucket expansion

Posted Jul 22, 2024 12:31 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (1 responses)

Or (c), Run now, but remember to consider "lions are near" next time you're in the area. Also known as "keeping track of tech debt".

Bucket expansion

Posted Jul 22, 2024 12:34 UTC (Mon) by Wol (subscriber, #4433) [Link]

:-)

Cheers,
Wol

Bucket expansion

Posted Jul 22, 2024 14:15 UTC (Mon) by paulj (subscriber, #341) [Link] (1 responses)

No, you build a rocket where you can reliably predict the behaviour of the control loop. E.g. (contrived), you know that even if the rate of sensor inputs increases by X, the control loop will only slow down by X/n, and hence you can confidently say - knowing there are only m*X sensors with a maximum output rate of z/s - that the control loop will still complete its iterations within the required bounds to keep the rocket under control.

And if it does not quite meet the bounds, _then_ you can start shaving cycles. And you can still be confident - assuming you don't change the fundamentals of the underlying data-structures - that you're not introducing some catastrophic worst-case.

Alternatively, ignore this stuff, and your control loop works fine 99% of the time in simulation, but then within the first few launches, *boom*. You shave some cycles, but because you still don't understand what the bounds are on your data-structure and - worse - when you /do/ do an analysis of the bounds of your data-structure you're constructing O(f(n)) analyses based on _amortised_ costs (wtf?), and so you're both misleading _yourself_ and any other engineer about the worst-case behaviour. And... another few launches... and *boom*.

And look, I recognise very much what you describe. A lot of IT is sellotaped together. (Or held together with linen patches and "dope", to go back to an [early] aviation analogy.) And we won't get rid of ad-hoc systems (ad-hoc from conception, or by degeneration). And many serve a purpose. That's fine.

What I'm asking for is that the software engineers who should _know better_ and who are meant to _design_ well-engineered systems (as opposed to hack stuff together) recognise the value in things like "Trees with decades of formal analyses have provably better described behaviour than array and/or list-based data-structures that rely on pseudo-tree like operations that have only 1 paper describing them and next to no independent formal analyses".

Bucket expansion

Posted Jul 22, 2024 14:42 UTC (Mon) by paulj (subscriber, #341) [Link]

"But Paul, most of us aren't working on rockets or nuclear power stations. We writing code for consumer stuff that doesn't matter!"

And people come home and find their fridge has burned out or defrosted: https://www.youtube.com/watch?v=2gv5MXXR4TE . Or no one at the airport can find any flight information (not the handlers at the check-in desks, not the information screens, etc.). And ok, those bugs usually aren't data-structure choices, but sometimes they are. Either way, it's all part of the same clown-show attitude in tech.

Large numbers and tiny probabilities

Posted Jul 22, 2024 11:38 UTC (Mon) by farnz (subscriber, #17727) [Link] (6 responses)

The other thing that plays into this is that people's untrained intuition for numbers is pretty poor outside the range 0 to 100 or so.

A random failure rate needing human intervention of 1 in 10^12 sounds pretty decent in isolation - 10^12 is a huge number. But if you deal with 1 billion RPS (not unreasonable scale for someone like Google), that same failure rate means you're averaging something on the order of one human intervention every 15 minutes - most people can't intuit that the high RPS rate means that a very low failure per request rate is still significant.
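
The arithmetic, spelled out (same numbers, as a throwaway Python calculation):

rps = 1e9                                   # requests per second
p_fail = 1e-12                              # interventions needed, per request
failures_per_second = rps * p_fail          # = 1e-3
print(1 / failures_per_second / 60)         # ~16.7 minutes between interventions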

Large numbers and tiny probabilities

Posted Jul 22, 2024 12:43 UTC (Mon) by Wol (subscriber, #4433) [Link] (5 responses)

> The other thing that plays into this is that people's untrained intuition for numbers is pretty poor outside the range 0 to 100 or so.

And everybody wears blinkers, that they're often unaware of. I can see that in myself sometimes :-)

Thing is, in this particular argument, I see people arguing "but we can guarantee with 100% certainty that all our runners will complete the 100m in one minute or less".

SERIOUSLY, I'm pointing out that the statistical likelihood of any of my runners requiring more than 10 seconds is beyond the bounds of normality.

Both approaches have their place, but when kernel developers are arguing over a few percent, a speed difference of that magnitude is hard to ignore. And the cost of mitigating an attack - IF it happens - is a tradeoff.

imnsho a productivity hole of that magnitude is worth a few days overtime once a year :-) Others may disagree ... (ime that hole CAUSES a lot more than a few days overtime once a year ...)

Cheers,
Wol

Large numbers and tiny probabilities

Posted Jul 22, 2024 13:48 UTC (Mon) by farnz (subscriber, #17727) [Link] (4 responses)

But this is why it's an engineering problem, and there is no one right answer.

If the consequence of one of your runners taking more than a minute to complete a 100 metre sprint is "the universe ceases to exist", then I'd like a lot more confidence than "the statistical likelihood of any of my runners needing more than 1/5th of a minute is extremely low". If, on the other hand, it's "my coffee takes an extra minute to prepare", then your statistical claim is good enough.

Reality will be somewhere between those two extremes, and part of the problem is that intuition is very often not enough to determine where it falls between those extremes, except in massively simplified cases, in part because intuition breaks down when the numbers fall outside your normal experience.

And I'd note that, if we were talking about the 100m sprint, there's fewer than 200 people in the world who've done a 100m sprint in under 10 seconds without external aid (e.g. wind in their favour). Only 34 people have run that fast in 2024, for example; this says something about your intuition, perhaps :-)

Large numbers and tiny probabilities

Posted Jul 22, 2024 14:16 UTC (Mon) by Wol (subscriber, #4433) [Link]

> And I'd note that, if we were talking about the 100m sprint, there's fewer than 200 people in the world who've done a 100m sprint in under 10 seconds without external aid (e.g. wind in their favour). Only 34 people have run that fast in 2024, for example; this says something about your intuition, perhaps :-)

:-) :-)

Yup, I know the 10sec 100m is very fast. But as far as my database is concerned, the speed difference between the first and ninth decile is ZERO. The best case is also the normal case. (Not true of running, I know.)

Cheers,
Wol

Large numbers and tiny probabilities

Posted Jul 22, 2024 14:19 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

> But this is why it's an engineering problem, and there is no one right answer.

And if you read back over my posts, you can see I very much agree with that attitude. I just feel that the attitude "we need a bounded worst case" is a theoretical requirement beloved of academicians, that is a very poor fit to most real world cases - of course, a nuclear power station meltdown isn't one of them, but it's not a normal case, either :-) But that's the case most real world scenarios are bludgeoned into. Square pegs and round holes and all that ...

Cheers,
Wol

Bounded worst case and ripple effects

Posted Jul 22, 2024 15:44 UTC (Mon) by farnz (subscriber, #17727) [Link] (1 responses)

Unless you know the effect of failing to meet your worst case, though, you can't make that statement; for example, I've worked on a system where the effect of a request taking too long was for the load balancer to reroute the request to another server, on the assumption that the server it first chose is broken, and to mark the server as "failing". One accidental construction of a bad request later, and the load balancer merrily took out the entire server fleet because (a) requests were not independent - once one client had cause to construct a bad request, it was very likely that the rest would soon do the same, and (b) a slow request turned into "server down" and automatic mitigation, which escalated until the entire system was down.

And it's important to note that I've elided a lot of detail there - underlying it is that everything in the system was locally sensible on a "statistically unlikely" basis, right up until we hit some bad luck and a 1 in 10^77 random occurrence happened across a large number of clients all at the same time, because the assumptions underlying "that's statistically unlikely" failed to take into account the inter-dependencies of the system as a whole once you brought the users into the loop.

Bounded worst case and ripple effects

Posted Jul 22, 2024 17:35 UTC (Mon) by Wol (subscriber, #4433) [Link]

> Unless you know the effect of failing to meet your worst case, though, you can't make that statement; for example, I've worked on a system where the effect of a request taking too long was for the load balancer to reroute the request to another server, on the assumption that the server it first chose is broken, and to mark the server as "failing". One accidental construction of a bad request later, and the load balancer merrily took out the entire server fleet because (a) requests were not independent - once one client had cause to construct a bad request, it was very likely that the rest would soon do the same, and (b) a slow request turned into "server down" and automatic mitigation, which escalated until the entire system was down.

Which again, is (or rather should be) part of the engineering rationale for deploying such a system. Okay, I work on what you would probably call small-scale systems - a supermarket customer management system. And I'm only interested in what's going on in the next four weeks (unless it's Christmas, of course).

I have to use the same OTT safety-conscious systems the world thinks are "best practice". Personally I'd use real-world primary keys but I have to use GUIDs ... but if I'm only interested in 28 days' data, and I have to use some sort of non-real-world primary key, it's pretty obvious an incrementing sequential number that expires cannot be a pathological case for any sane hash system ...

If I can deploy my "system of choice", which has 80% less resource requirement, the hours a day I spend watching the hourglass as Snoracle tries to find the data I want will suddenly be available for useful work. If installed on your load-balancing systems, it would be far less likely to overload the system (okay, that's a reduction in risk, not a removal of risk).

Thing is, your "guaranteed response" has a real cost. I'd cost it as sticking a nought on the price tag. Which may be well worth paying! But paulj is afraid of coming home and finding his fridge defrosted. I'm more scared of coming home and finding my house underwater because the Antarctic has defrosted ... If I can reduce my energy costs by 80% by using appropriate technology ... it'd make a hell of a difference if everybody did it.

I've never heard of anybody using B-trees for anything but indices in MV (because there's no choice). You'd have thought if the same economics applied to data storage, a B-tree would be the choice. But it's unheard of, so I guess the cost is too great.

Guess what I should do (if only I could find the time - might use it to learn Rust! :-) is to write a hashtree algorithm: one that uses dynamic hashing by default, changes the seed every time it doubles the number of buckets (which most kernel algorithms do, I believe), and, if it thinks an attack is under way (or the RT guys want it), switches to a constant-response tree.

Given that I think in key->data terms, and most of you seem to think in key->ptr terms, I'm not sure the speed increase here would be quite what I would hope. But I would still expect it - in the normal case - to be noticeably faster. Just maybe not 80%.
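
For what it's worth, here is a purely hypothetical sketch of the reseed-on-doubling part of that idea (Python; the names and thresholds are invented, and the switch-to-a-tree fallback is left out): every time the bucket array doubles, draw a fresh seed and rehash, so any collision pattern an attacker has found stops working at the next resize.

import os

class ReseedingHashTable:
    """Hypothetical: chained hash buckets, with a fresh random seed drawn
    at every doubling so collision patterns don't survive a resize."""

    def __init__(self, nbuckets=8, max_load=0.75):
        self.seed = os.urandom(8)
        self.buckets = [[] for _ in range(nbuckets)]
        self.count = 0
        self.max_load = max_load

    def _index(self, key):
        # Mixing the seed in via a tuple hash is only illustrative; a real
        # implementation would want a proper keyed hash here.
        return hash((self.seed, key)) % len(self.buckets)

    def insert(self, key, value):
        self.buckets[self._index(key)].append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._double()

    def _double(self):
        old = [kv for bucket in self.buckets for kv in bucket]
        self.seed = os.urandom(8)          # new seed on every doubling
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for k, v in old:
            self.buckets[self._index(k)].append((k, v))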

Cheers,
Wol

