Materialized views are obviously useful
62 points by carlana
If folks have questions about incrementally maintaining materialized views, ask away. (Materialize employee).
The main tl;dr is that yes they usually work great, though it does depend on your queries (no magic if you want to re-hash the world, or various SQL equivalents). The occasional surprise is that the “maintenance” requires ongoing work and state as the data change, which is a cost. One way to think about IVM is as shifting the costs from reads (as in most conventional databases) to writes. Many use cases want to scale their reads, ideally without paying much more or eating a performance tax, and maintained materialized views fit well in that setting.
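A toy sketch of that read-to-write cost shift, assuming a hypothetical grouped count view (names are illustrative, not any real system's API): instead of re-running the aggregation on every read, the grouped counts are maintained as each change arrives, and reads become lookups.

```python
# Toy incremental view maintenance (IVM) sketch: maintain
# "COUNT(*) GROUP BY key" as writes (deltas) arrive, so the
# ongoing work and state live on the write path, not the read path.

from collections import defaultdict

class CountByKeyView:
    """Maintains a grouped count; all the work happens on writes."""
    def __init__(self):
        self.counts = defaultdict(int)  # the maintained state (the ongoing cost)

    def apply_delta(self, key, diff):
        """diff = +1 for an insert, -1 for a delete."""
        self.counts[key] += diff
        if self.counts[key] == 0:
            del self.counts[key]  # drop fully-retracted keys

    def read(self, key):
        return self.counts.get(key, 0)  # reads are just a lookup

view = CountByKeyView()
for customer, diff in [("alice", +1), ("bob", +1), ("alice", +1), ("bob", -1)]:
    view.apply_delta(customer, diff)

print(view.read("alice"))  # 2
print(view.read("bob"))    # 0
```

The trade-off in the comment above is visible here: the view carries state and does work on every write, in exchange for constant-time reads.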
OK, more seriously, since no one else has taken you up on your gracious offer…
The sibling comments have mentioned most of the new IVM technologies that I’m aware of:
Which other players, if any, do you find interesting? And what are the key differences between the various players?
First off, I definitely agree with the article that the idea of doing work ahead of time, and maintaining the results as they change, is the big deal, and that is reflected in just about all of the tech, and will be indelible.
DD and DBSP are the most similar, with afaiu DBSP being a restriction of DD (at least, when I was involved with each). Noria had some interesting ideas around bidirectional dataflow that would allow the system to move the materialization “boundary” around, but at the time hadn’t entirely nailed the consistency story (the distributed version was nearer to eventual consistency than e.g. DD). Same feels as you about RisingWave: I haven’t found much concrete to read, but they have a pile of smarts so .. we’ll see.
Personal experience is that the engine tech is not as differentiating as the ecosystems. MZ for example gets less out of using DD than it gets out of using Virtual Time to align its various moving parts (CDC ingestion, multiple independent dataflows, query serving and streaming out). Feldera and RisingWave are more likely to stand out with their spill-to-object-storage backends than their join implementations. Still unclear if folks have anything like DD’s arrangements (MZ’s indexes). Clickhouse is doing great with a fundamentally broken matview implementation, because if you are doing (non-SQL) temporal joins, it matches what you want.
There are other players, who come and go, but I think a bit of the interesting action is going to be in the broader ecosystem. At the moment “sync engines” come up quite a lot (e.g. Zero, ElectricSQL), as no one actually needs just a database, but something to wire it to in order to translate its output to something meaningful for humans. Without wanting to be that guy, AI/LLMs are also going to dramatically change our demands for data, and figuring out how to hook them up will be impactful (like, actually; not just fluffy hype). Aligning customer wants/needs and expectations has been more interesting / challenging than the engines themselves.
Thank you for your thoughtful reply!
Without wanting to be that guy, AI/LLMs are also going to dramatically change our demands for data, and figuring out how to hook them up will be impactful (like, actually; not just fluffy hype). Aligning customer wants/needs and expectations has been more interesting / challenging than the engines themselves.
I did a deep dive on DD and Materialize back in ~2020 specifically for AI purposes when I was on risk at Cash App. Funnily, what attracted me were pains that (in the end) home-rolling Virtual Time mostly solved. So, yeah, I see where you’re coming from.
I also wonder about Rama … it’s a bit different, but I think it’s supposed to scale more, probably at the cost of transactional semantics.
https://redplanetlabs.com/programming-model
Rama’s programming model is different from how most applications are built. Most applications have an application layer and a database layer, where the database stores the “current state of the world”
…
In contrast, Rama’s programming model is based on event sourcing and materialized views.
Why always SQL?
Can’t we agree that SQL and normalized data are an ugly abstraction that we accept as a tradeoff for some performance?
Now if the performance is being provided by these optimized indexing and caching layers couldn’t we switch to some kind of document database or hierarchical database that is saner and easier to understand?
Personal takes:
All personal takes, ofc. It really depends what your goals are, e.g. building for oneself, building for others, doing fundamental research, building a sustainable team.
Ooo, and an optimistic spin on the AI stuff: odds are that within a short time, the folks who currently benefit most from SQL will be plaintively negotiating with their LLMs to write the thing they imagine is correct, and the LLM might be easier to retarget from SQL to some other intermediate language.
Citation hugely needed. What is better than SQL?
Which document database or hierarchical database works better than PostgreSQL? I love that my database can enforce that my data has the right structure, being able to do things like updating a ton of rows programmatically in a transaction, and so many others.
I am painfully aware of the limitations of SQL; PRQL and Logica solve some of its problems, but they are basically layers on top of SQL (they compile to SQL)… and so far I have not really needed to use them.
Normalization is not an ugly abstraction, IMHO. I understood it so much better when I realized a relation (table) is just a dict, from the key columns to the non-key columns. That’s such a basic way to represent data that, really, I can’t think of data as non-relational.
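The "relation is just a dict" observation, made concrete (the table and column names here are made up for illustration):

```python
# A relation keyed by its primary key is a mapping from key columns
# to non-key columns. Rows here are tuples; the schema is illustrative.

users = {
    # (user_id,) -> (name, email)
    (1,): ("Ada", "ada@example.com"),
    (2,): ("Grace", "grace@example.com"),
}

# "SELECT name FROM users WHERE user_id = 2" is a dict lookup:
name, _email = users[(2,)]
print(name)  # Grace
```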
My understanding is that not all views are materialized because a materialized view increases storage space, perhaps substantially, and also increases overhead when writing to the underlying table(s).
Can you give some high level intuitions as to relative costs if we’re just thinking about write overhead? Are there some types or combinations of types of computations we should be wary of when creating materialized views?
Also, is there work being done to automatically mark some views as materialized in a smart way (whatever that means)?
You are right that MZ at least doesn’t materialize all views, primarily because of the cost. This is the same reason that most OLTP databases don’t index every table by every field: it takes scarce random access storage, and imposes a write overhead.
Internally MZ converts SQL to an AST based around DD, whose internal operators align with the costs incurred by the computers. Most stateful DD operators (joins, reduces, etc) are split into “arrangement” operators, roughly an index build and maintenance operator, and less stateful chaser operators that do whatever they need to do. Join, for example, is largely stateless itself, just recording how far it has made it through each of its input arrangements. These arrangement operators are to a first order where things “cost”, in terms of compute and storage, and what we look for when doing performance diagnosis.
If you have nicely normalized data, not uncommon with SQL, you can build arrangements on the base data (or derived views), and re-use those arrangements. Queries that reuse arrangements can be very cheap to maintain. For an example, here’s MZ (circa 2020) maintaining the 22 TPC-H queries, and you can see how with indexes in place things can be very cheap for some queries, and not for others: https://github.com/frankmcsherry/blog/blob/master/posts/2020-11-18.md#conclusions.
Personal experience is that a lot of it comes down to how much data does a query need to maintain random access to, and how much of that is fundamentally new to the query vs not. Inner joins on existing arrangements by their primary/foreign keys can be very cheap, because the data are already laid out the way you need. On the other hand a window function ordered and partitioned by some bespoke expression will likely need a bunch of its own data built up and maintained. A lot of our optimization work is to minimize the amount of novel data that we need to arrange, without sacrificing the end-to-end latencies.
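A rough sketch of the intuition above (this is not MZ/DD internals, just an illustration): an "arrangement" is a maintained index from join key to rows, built once and shared. A join over already-arranged inputs mostly probes those shared indexes as deltas arrive, so it needs almost no state of its own.

```python
# Sketch: shared "arrangements" (maintained indexes) and a nearly
# stateless join that probes them. Names and structure are illustrative.

from collections import defaultdict

class Arrangement:
    """A maintained index: join key -> list of rows."""
    def __init__(self):
        self.index = defaultdict(list)

    def insert(self, key, row):
        self.index[key].append(row)

    def get(self, key):
        return self.index.get(key, [])

# Two base relations, each arranged once by the join key.
customers = Arrangement()  # key: customer_id, row: name
orders = Arrangement()     # key: customer_id, row: order_id

customers.insert(7, "alice")
orders.insert(7, "order-1")

# An incoming order delta is joined by probing the *shared* customers
# arrangement; the join builds no per-query index of its own.
def on_new_order(customer_id, order_id):
    orders.insert(customer_id, order_id)
    return [(order_id, name) for name in customers.get(customer_id)]

print(on_new_order(7, "order-2"))  # [('order-2', 'alice')]
```

By contrast, a join (or window function) keyed by some bespoke expression would have to build and maintain its own arrangement of the data first, which is where the cost shows up.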
I’m not aware of work to automatically materialize things, beyond things like “you seem to query this a lot; maybe you should materialize it?”. Materialization is about trading one elastic resource (compute, memory, .. money essentially) for time, which is relatively harder to get more of. It’s hard (for me) to say when that trade-off is worth it for someone else. The current state at MZ is that we mean to make it easy to spin up new performance-isolated materializations so that you can judge whether it would be useful without negatively impacting your other use cases. I could imagine noodling a bit on whether folks want automatic materialization, or clearer communication about costs that make the decision easier, or .. some other approach to configuring workloads (and .. generally the development lifecycle, which is rarely fire and forget; you often materialize during business logic development to cut the iteration time, and then think harder about what you want to pay for longer term).
Sorry for the wall of text. It’s a good question that I don’t have a great answer to!
Edit: I should add, the “intuition” may only be relevant for MZ/DD. If other systems don’t have shared arrangements, no reason joins would be efficient when “re-using” existing data.
Wow! Thanks for all that detail.
For “smart” materialization: I was wondering if there might be a setting/configuration such that if estimated expense in storage and/or compute is below a threshold, then views would be materialized. I assume this would end in disaster but I’m curious.
E.g. something like select count(1) from tasks where project_id = $1 seems like it would have low expense.
Yeah, seems totally reasonable! One gotcha that can show up is when the magic behind the scenes results in confused users. For example, MZ has count(*) and limit k optimizations that make reading fast in these cases, but when you introduce a filter to the first case, or an order by to the second, the performance gets much worse. These cases are cheap, but they’re often used mainly in development (anyone using limit without order by is a mystery to me, unless they just want to eyeball some data). It seems very reasonable (and relevant!) to want an “easy mode” development set-up where more views get automatically materialized, because while developing you value time more than one-off resource costs, and you don’t want to navigate the costs until you have the logic right.
Agreed. Ideally we’d want to trust our developers and give them the tools to know what’s going on and not hide things behind magic.
Materialized views have been a bane of databases for a long time. As the blog post notes they are almost tautologically useful but hard to implement.
There is a research paper on trying to implement them in a nice way, Noria: https://share.google/3ZVPHnN2j1zFXkOCS
IIRC there was a startup based on it
The Noria paper has a shout-out to lobste.rs:
When serving the Lobsters web application on a single Amazon EC2 VM, our prototype outperforms the default MySQL based backend by 5× while simultaneously simplifying the application.
Makes me wonder if the site is using something like it today. (I can’t tell what year the paper’s from.)
The citations date the publication to be no earlier than 09/06/2018, so my guess is that it was finalized some time during Fall 2018.
Edit: The paper was presented at the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018. It was held October 8–10, 2018: https://www.usenix.org/conference/osdi18/presentation/gjengset
IIRC there was a startup based on it
ReadySet
I was just reading this post from them yesterday: https://readyset.io/blog/optimizing-straddled-joins-in-readyset-from-hash-joins-to-index-condition-pushdown
This has been my thesis for a while:
I’m excited for efforts like Feldera (DBSP) to help provide
I think the main missing piece is that it’s hard to reconcile “persistent state” across many different environments/runtimes in a well-patterned way:
I think you’re spot on. Part of the problem, I think, is that we don’t have good low level patterns for designing incremental (and ideally reversible) stream processors. I would argue that our overall understanding of streams is low, and that translates in programming languages and standard libraries not being well equipped to solve that problem. Iterators and generators are a good step in that direction, but in my mind the basic toolkit is: deltas, immutable/persistable data structures and interruptible operations.
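One way to read the "basic toolkit" above, as a sketch: operators that consume and emit deltas, i.e. (value, diff) pairs, rather than whole collections. A linear operator like filter handles each delta independently, so it is trivially incremental (the function names here are made up for illustration).

```python
# Sketch of a delta-based stream operator using a generator:
# each change is a (value, diff) pair, where diff is +1 (insert)
# or -1 (retraction).

def delta_filter(pred, deltas):
    """Incremental filter: forward only deltas whose value matches."""
    for value, diff in deltas:
        if pred(value):
            yield (value, diff)

# Insert 3, insert 10, then retract 3: the retraction flows through
# the operator the same way the insertion did.
changes = [(3, +1), (10, +1), (3, -1)]
print(list(delta_filter(lambda v: v > 5, changes)))  # [(10, 1)]
```

Generators also hint at the "interruptible operations" point: the operator can be resumed per delta rather than run to completion over a whole collection.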
My take on DBSP is that it actually provides a solid foundation for the algorithms and our understanding of streams, with a relatively good complexity trade-off. (In contrast, differential/timely data flow seemed quite capable, but were also too complex to easily pick up as a developer.) It now seems more a question of integrating those patterns with existing codebases and data stores.
We’re working on that at Feldera, it’s just that there’s a lot of existing codebases and data stores and there are only a few of us :-)
It’s definitely a better impedance match to fit Feldera/DBSP up against a system that also works natively in terms of streams of deltas, but “change data capture”, which translates something like Postgres into a series of deltas, has its own little database sub-industry. So while I’d hope that more systems work in deltas as time goes on, it’s already something that is possible.
(I work at Feldera.)
Useful context for anyone who is not familiar: materialized views in Postgres are not smart in the way the author is talking about. They are just caches that are manually refreshed.
Feldera is one such implementation of IVM. I have high hopes for it. One of its operating principles is making changes a first-class concept (via Z-sets), which seems similar in spirit to CDC but grounded in first principles. Next to Materialize it is the only such stream computation system I am aware of that maintains “internal consistency”.
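For readers unfamiliar with Z-sets, a minimal sketch of the idea (this is an illustration, not Feldera's implementation): a Z-set is a collection where each element carries an integer weight, with positive weights as insertions and negative as deletions; combining two Z-sets merges changes by adding weights.

```python
# Minimal Z-set sketch: elements with integer weights. Adding a delta
# Z-set to a state Z-set applies its insertions (+) and deletions (-).

from collections import Counter

def zset_add(a, b):
    """Pointwise sum of weights; drop elements whose weight reaches zero."""
    out = Counter(a)
    for elem, w in b.items():
        out[elem] += w
        if out[elem] == 0:
            del out[elem]
    return dict(out)

state = {"row1": 1, "row2": 1}    # current view contents
delta = {"row2": -1, "row3": 1}   # a change: delete row2, insert row3
print(zset_add(state, delta))     # {'row1': 1, 'row3': 1}
```

This is what makes changes "first-class": a plain collection and a change to a collection have the same type, so the same operators can process both.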
A similar technology in this space is ClickHouse with its Incremental Materialized Views. It has you specify state aggregations more explicitly, whereas in Feldera you would write more standard, declarative SQL.
Triggers are the solution?
Integer Indexes are slow now? Wow
Depending on how many rows are found after filtering, the runtime can have some variability, and you don’t want variability.
Also, this is just an example; you can imagine a different, less trivial problem where indexes are not so obviously helpful, but “materialized views” continue to be a useful tool.
Finding a good example is difficult: https://www.hillelwayne.com/post/persuasive-examples/