Monoid-augmented FIFOs, deamortised
26 points by pervognsen
This is so cool. Normally I like to think I have a decent intuition for optimal asymptotics of problems (even if I can’t construct an actual algorithm), but I would not have thought this was possible.
It may depend on the sort of statistics you want to access, but it’s pretty easy to do approximate order-statistics things independent of the window size if you recast “the sort of accuracy” you are after with just elementary/middle school math. Often statistics on these time series (like a rolling median, say) are not temporally stable to better than 1% anyway. So, all you have to do is build histograms of logs (with thanks to Napier and/or scientific notation) to have a few thousand bins do a lot of work for you. You can do a log
in <10 ns these days on machines where the 2015 paper Paul is covering was only doing 50 Mupdates/second (aka 20 ns/update). So, this is real-time competitive at realistic data scales.
E.g., if you can bound the range of your x[t]
to say, 16 orders of magnitude (which is really quite a lot - nanoseconds to 10 MegaSec, say, if x
were a duration), then you only need about 800 bins to get 1% accuracy because 16/.02 = 800, and that means you’ll only ever be off by at most 1 bin. Then you only need a representation of this 800-bin (probably 3200..6400 byte / L1-CPU-cache resident) histogram that lets you answer the queries you want quickly. You can do the naive array repr and a parallel prefix sum to convert to a CDF and answer all sorts of things, and this already kind of “solves” the big-O problem, or rather shifts it to big-O on histogram operations like increment & query CDF/inverse CDF. :-)
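A minimal Python sketch of the scheme described above, using the same 16-decade range and 800 bins (all names are my own, and a naive O(nbins) scan stands in for the prefix-sum machinery):

```python
import math

LO, HI, NBINS = 1e-9, 1e7, 800                       # 16 decades, as above
W = (math.log10(HI) - math.log10(LO)) / NBINS        # 0.02 decades per bin

def bin_of(x):
    """Map a positive value to its log-scale bin index."""
    return min(NBINS - 1, max(0, int((math.log10(x) - math.log10(LO)) / W)))

hist = [0] * NBINS

def add(x):    hist[bin_of(x)] += 1
def remove(x): hist[bin_of(x)] -= 1   # "expire" a value leaving the window

def quantile(q):
    """Invert the CDF by a running prefix sum; returns the bin's lower edge."""
    target = q * sum(hist)
    run = 0
    for i, c in enumerate(hist):
        run += c
        if run >= target:
            return 10 ** (math.log10(LO) + i * W)
    return HI
```

Note the window-size independence: `add`/`remove` touch one bin each, and `quantile` costs only the (fixed) bin count, never the window length.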
Depending on scales, you can actually do a little better than a flat array & parPfxSum with Wong & Easton 1980’s sum heaps or Fenwick trees and/or use GPUs to optimize further, I think getting within 1..3X of real-time of the paper Paul is analyzing. Note the discretization error here is also really easy to understand, even by end users. Nothing in this re-framing depends upon the size of the sliding/moving/rolling data window (except the state you maintain to properly “expire” influence, which can actually be trivially dropped if you do exponential time weighting of the bins and are ok with EWMA-like infinite memory of older epochs of data). Those are all bigger topics. Maybe you already knew it, though, and had other stats in mind.
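For the “do a little better than a flat array” point, a Fenwick (binary indexed) tree keeps both bin increments and prefix-sum (CDF) queries at O(log nbins) — a minimal sketch, with names of my own choosing:

```python
class Fenwick:
    """Binary indexed tree: point update + prefix sum, each O(log n)."""
    def __init__(self, n):
        self.n = n
        self.t = [0] * (n + 1)      # 1-based internal array

    def add(self, i, delta=1):      # hist[i] += delta
        i += 1
        while i <= self.n:
            self.t[i] += delta
            i += i & -i             # step to the next covering node

    def cdf(self, i):               # sum of hist[0..i]
        i += 1
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i             # strip the lowest set bit
        return s
```

Inverse-CDF (quantile) queries can also descend the same implicit tree in O(log n) rather than scanning, which is where this beats the flat array as bin counts grow.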
BTW, these things go by a name like “high dynamic range histograms” or “datadog sketches” or other stuff like that, but it’s really barely more than “log-scale paper” people used in the 1950s. So, I prefer “histogram of logs” or histogram of transformed data.
I am not at all surprised that approximations can be efficient. Personally I am interested in this data structure for reasons having nothing to do with statistics, on complex monoids.
Yeah, log-linear histograms are really nice. Easy to understand with high-school maths and efficient to implement. (I learned a nice trick from @pvk when I was experimenting with them a few years ago.)
There are also “quantile sketch” data structures which seem to target the same kind of applications, but they are much more tricky. They target a quantile or rank error, which is quite a different beast from the relative-value error that a log-linear histogram gives you. In particular a quantile sketch will quantize outliers, so it becomes hard to distinguish 99% vs 99.9% latency (for example) whereas a histogram of the same size can keep a vast range of latencies to 2 s.f. accuracy.
Yeah - I’ve always (or at least since the early noughties) just done what Paul mentions at the end of that 2015 article in his “P.S. If a chip out there has fast int->FP conversion”, though I learned it on my own and do it at a “higher level” - just defining a “destructuring struct” with C / Nim bit fields to extract the IEEE fields to get a power-of-two range and a couple of 6-term Taylor series for a 24-bit log (the famous ln(1+x) one and an arctanh one). On modern CPUs, natural logs can be “about one pipeline depth fast, amortized” (at which point all the shared usage of CPU micro-resources complicates descriptions).
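The bit-field idea can be illustrated even in Python via struct punning: the IEEE exponent field gives the power-of-two part of the log for free, and a short series on the mantissa supplies the fractional part. This is a rough sketch of the general technique, not his actual code — real 24-bit-accurate versions use better range reduction (e.g. the arctanh series he mentions) than the plain ln(1+x) expansion here:

```python
import math
import struct

def fast_log2(x):
    """Approximate log2 via IEEE 754 field extraction; assumes x > 0, finite."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]   # reinterpret the double
    e = ((bits >> 52) & 0x7FF) - 1023                     # unbiased exponent
    m = 1.0 + (bits & ((1 << 52) - 1)) / 2.0**52          # mantissa in [1, 2)
    # 6-term ln(1+t) Taylor series on t = m - 1; accuracy degrades as m -> 2
    t = m - 1.0
    ln_m = sum((-1) ** (k + 1) * t**k / k for k in range(1, 7))
    return e + ln_m / math.log(2.0)
```

In a compiled language the same split is a couple of integer ops plus a short polynomial, which is how the “about one pipeline depth” throughput claim becomes plausible.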
I actually looked through the whole 300-page Cormode 2011 thing Paul mentions in the article in this thread and I couldn’t even find this idea being alluded to. Maybe I just missed it. I don’t know why its propagation is so bad. Maybe AQP/DB/networking people are just allergic to floating-point-like ideas? Maybe we need a catchy novel-sounding name like HDR numbers? Maybe since it goes back to Napier and slide rules and pocket protectors it just doesn’t seem novel enough to attend to? I kind of just don’t get its neglect. It also tends to be very “low code”.
The amount of academic blood sport in, for lack of a better condensation, data compression without regard to speed (of compression, decompression, or both) is also sad. Its causes are surely multi-factorial, but a main driver is probably the ease of precisely & portably measuring space, with real-time being.. tricky. Most of the time I read those papers I feel like the speed analysis, if they even have one, is pretty weak in some way or another, but ostensibly that was The Point, originally. I like how you mention your 1-significant-figure version is only 2K and the 2-significant-figures/decimals one is 17K. Similar numbers should apply to all such techniques.
I don’t know if doing windowed/rolling/moving/sliding medians/quantiles/distributions matters for your BIND application, and I’m sure the Nim makes it look odd, but you might nevertheless find some of the histograms in https://github.com/c-blake/adix interesting.
I definitely have to come back to this later, trying to understand “monoid”, “clupeids”, and “amortised queue” before lunch seems impossible, haha.
A monoid is just a structure that can be concatenated and be empty. Like a list, for instance.
See https://argumatronic.com/posts/2019-06-21-algebra-cheatsheet.html.
You probably make use of many monoids every day without thinking about it. They’re just a type, some operation for combining things of that type, and a value which, when combined with any other value, leaves that value unchanged - a.k.a. an identity.
Examples with (type, combining function, identity value):
(integers, +, 0)
(integers, ×, 1)
(strings, concatenation, "")
(lists, append, [])
You learned about monoids in your teens; they just never taught you the word. You can define a monoid which represents calculating means:
-- field renamed from "sum" to avoid clashing with Prelude.sum
data Mean a = Mean { total :: a, count :: Int }

toMean :: a -> Mean a
toMean x = Mean { total = x, count = 1 }

meanZero :: Num a => Mean a
meanZero = Mean { total = 0, count = 0 }

meanAdd :: Num a => Mean a -> Mean a -> Mean a
meanAdd a b = Mean { total = total a + total b, count = count a + count b }
The main rule is that the combining function must be associative, so (a + b) + c = a + (b + c). Because of this, chunks can be computed separately, for example in parallel, and then combined. Need to know the mean of chunked data? Compute the Means of all the chunks in parallel and then combine the results.
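That chunk-then-combine pattern, as a minimal Python sketch mirroring the Mean shape above (names are my own; the per-chunk step is where a real version would fan out to workers):

```python
from dataclasses import dataclass

@dataclass
class Mean:
    total: float = 0.0   # Mean(0.0, 0) is the identity (meanZero)
    count: int = 0

def to_mean(x):
    return Mean(x, 1)

def mean_add(a, b):
    # Associative, so chunks may be folded in any grouping/order.
    return Mean(a.total + b.total, a.count + b.count)

def mean_of_chunks(chunks):
    acc = Mean()                              # start from the identity
    for chunk in chunks:
        part = Mean(sum(chunk), len(chunk))   # each chunk could run in parallel
        acc = mean_add(acc, part)
    return acc.total / acc.count
```

Note that you cannot just average the chunk averages — carrying (sum, count) pairs is exactly what makes the combine step lawful.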