Materialized views are obviously useful
62 points by carlana
If folks have questions about incrementally maintaining materialized views, ask away. (Materialize employee).
The main tl;dr is that yes they usually work great, though it does depend on your queries (no magic if you want to re-hash the world, or various SQL equivalents). The occasional surprise is that the “maintenance” requires ongoing work and state as the data change, which is a cost. One way to think about IVM is as shifting the costs from reads (as in most conventional databases) to writes. Many use cases want to scale their reads, ideally without paying much more or eating a performance tax, and maintained materialized views fit well in that setting.
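A toy sketch of that read-to-write cost shift, assuming a hypothetical grouped count view (names are illustrative, not any real system's API): instead of re-running the aggregation on every read, the grouped counts are maintained as each change arrives, and reads become lookups.

```python
# Toy incremental view maintenance (IVM) sketch: maintain
# "COUNT(*) GROUP BY key" as writes (deltas) arrive, so the
# ongoing work and state live on the write path, not the read path.

from collections import defaultdict

class CountByKeyView:
    """Maintains a grouped count; all the work happens on writes."""
    def __init__(self):
        self.counts = defaultdict(int)  # the maintained state (the ongoing cost)

    def apply_delta(self, key, diff):
        """diff = +1 for an insert, -1 for a delete."""
        self.counts[key] += diff
        if self.counts[key] == 0:
            del self.counts[key]  # drop fully-retracted keys

    def read(self, key):
        return self.counts.get(key, 0)  # reads are just a lookup

view = CountByKeyView()
for customer, diff in [("alice", +1), ("bob", +1), ("alice", +1), ("bob", -1)]:
    view.apply_delta(customer, diff)

print(view.read("alice"))  # 2
print(view.read("bob"))    # 0
```

The trade-off in the comment above is visible here: the view carries state and does work on every write, in exchange for constant-time reads.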
OK, more seriously, since no one else has taken you up on your gracious offer…
The sibling comments have mentioned most of the new IVM technologies that I’m aware of:
Which other players, if any, do you find interesting? And what are the key differences between the various players?
First off, I definitely agree with the article that the idea of doing work ahead of time, and maintaining the results as they change, is the big deal, and that is reflected in just about all of the tech, and will be indelible.
DD and DBSP are the most similar, with afaiu DBSP being a restriction of DD (at least, when I was involved with each). Noria had some interesting ideas around bidirectional dataflow that would allow the system to move the materialization “boundary” around, but at the time hadn’t entirely nailed the consistency story (the distributed version was nearer to eventual consistency than e.g. DD). Same feels as you about RisingWave: I haven’t found much concrete to read, but they have a pile of smarts so .. we’ll see.
Personal experience is that the engine tech is not as differentiating as the ecosystems. MZ for example gets less out of using DD than it gets out of using Virtual Time to align its various moving parts (CDC ingestion, multiple independent dataflows, query serving and streaming out). Feldera and RisingWave are more likely to stand out with their spill-to-object-storage backends than their join implementations. Still unclear if folks have anything like DD’s arrangements (MZ’s indexes). Clickhouse is doing great with a fundamentally broken matview implementation, because if you are doing (non-SQL) temporal joins, it matches what you want.
There are other players, who come and go, but I think a bit of the interesting action is going to be in the broader ecosystem. At the moment “sync engines” come up quite a lot (e.g. Zero, ElectricSQL), as no one actually needs just a database, but something to wire it to in order to translate its output to something meaningful for humans. Without wanting to be that guy, AI/LLMs are also going to dramatically change our demands for data, and figuring out how to hook them up will be impactful (like, actually; not just fluffy hype). Aligning customer wants/needs and expectations has been more interesting / challenging than the engines themselves.
Thank you for your thoughtful reply!
Without wanting to be that guy, AI/LLMs are also going to dramatically change our demands for data, and figuring out how to hook them up will be impactful (like, actually; not just fluffy hype). Aligning customer wants/needs and expectations has been more interesting / challenging than the engines themselves.
I did a deep dive on DD and Materialize back in ~2020 specifically for AI purposes when I was on risk at Cash App. Funnily, what attracted me were pains that (in the end) home-rolling Virtual Time mostly solved. So, yeah, I see where you’re coming from.
I also wonder about Rama … it’s a bit different, but I think it’s supposed to scale more, probably at the cost of transactional semantics.
https://redplanetlabs.com/programming-model
Rama’s programming model is different from how most applications are built. Most applications have an application layer and a database layer, where the database stores the “current state of the world”
…
In contrast, Rama’s programming model is based on event sourcing and materialized views.
Why always SQL?
Can’t we agree that SQL and normalized data are an ugly abstraction that we accept as a tradeoff for some performance?
Now if the performance is being provided by these optimized indexing and caching layers couldn’t we switch to some kind of document database or hierarchical database that is saner and easier to understand?
Personal takes:
All personal takes, ofc. It really depends what your goals are, e.g. building for oneself, building for others, doing fundamental research, building a sustainable team.
Ooo, and an optimistic spin on the AI stuff: odds are that within a short time, the folks who currently benefit most from SQL will be plaintively negotiating with their LLMs to write the thing they imagine is correct, and the LLM might be easier to retarget from SQL to some other intermediate language.
Citation hugely needed. What is better than SQL?
Which document database or hierarchical database works better than PostgreSQL? I love that my database can enforce that my data has the right structure, being able to do things like updating a ton of rows programmatically in a transaction, and so many others.
I am painfully aware of the limitations of SQL; PRQL and Logica solve some of its problems, but they are basically layers on top of SQL (they compile to SQL)… and so far I have not really needed to use them.
Normalization is not an ugly abstraction, IMHO. I understood it so much better when I realized a relation (table) is just a dict, from the key columns to the non-key columns. That’s such a basic way to represent data that, really, I can’t think of data as non-relational.
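The "relation is just a dict" observation, made concrete (the table and column names here are made up for illustration):

```python
# A relation keyed by its primary key is a mapping from key columns
# to non-key columns. Rows here are tuples; the schema is illustrative.

users = {
    # (user_id,) -> (name, email)
    (1,): ("Ada", "ada@example.com"),
    (2,): ("Grace", "grace@example.com"),
}

# "SELECT name FROM users WHERE user_id = 2" is a dict lookup:
name, _email = users[(2,)]
print(name)  # Grace
```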
My understanding is that not all views are materialized because a materialized view increases storage space, perhaps substantially, and also increases overhead when writing to the underlying table(s).
Can you give some high level intuitions as to relative costs if we’re just thinking about write overhead? Are there some types or combinations of types of computations we should be wary of when creating materialized views?
Also, is there work being done to automatically mark some views as materialized in a smart way (whatever that means)?
You are right that MZ at least doesn’t materialize all views, primarily because of the cost. This is the same reason that most OLTP databases don’t index every table by every field: it takes scarce random access storage, and imposes a write overhead.
Internally MZ converts SQL to an AST based around DD, whose internal operators align with the costs incurred by the computers. Most stateful DD operators (joins, reduces, etc) are split into “arrangement” operators, roughly an index build and maintenance operator, and less stateful chaser operators that do whatever they need to do. Join, for example, is largely stateless itself, just recording how far it has made it through each of its input arrangements. These arrangement operators are to a first order where things “cost”, in terms of compute and storage, and what we look for when doing performance diagnosis.
If you have nicely normalized data, not uncommon with SQL, you can build arrangements on the base data (or derived views), and re-use those arrangements. Queries that reuse arrangements can be very cheap to maintain. For an example, here’s MZ (circa 2020) maintaining the 22 TPC-H queries, and you can see how with indexes in place things can be very cheap for some queries, and not for others: https://github.com/frankmcsherry/blog/blob/master/posts/2020-11-18.md#conclusions.
Personal experience is that a lot of it comes down to how much data does a query need to maintain random access to, and how much of that is fundamentally new to the query vs not. Inner joins on existing arrangements by their primary/foreign keys can be very cheap, because the data are already laid out the way you need. On the other hand a window function ordered and partitioned by some bespoke expression will likely need a bunch of its own data built up and maintained. A lot of our optimization work is to minimize the amount of novel data that we need to arrange, without sacrificing the end-to-end latencies.
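A rough sketch of the intuition above (this is not MZ/DD internals, just an illustration): an "arrangement" is a maintained index from join key to rows, built once and shared. A join over already-arranged inputs mostly probes those shared indexes as deltas arrive, so it needs almost no state of its own.

```python
# Sketch: shared "arrangements" (maintained indexes) and a nearly
# stateless join that probes them. Names and structure are illustrative.

from collections import defaultdict

class Arrangement:
    """A maintained index: join key -> list of rows."""
    def __init__(self):
        self.index = defaultdict(list)

    def insert(self, key, row):
        self.index[key].append(row)

    def get(self, key):
        return self.index.get(key, [])

# Two base relations, each arranged once by the join key.
customers = Arrangement()  # key: customer_id, row: name
orders = Arrangement()     # key: customer_id, row: order_id

customers.insert(7, "alice")
orders.insert(7, "order-1")

# An incoming order delta is joined by probing the *shared* customers
# arrangement; the join builds no per-query index of its own.
def on_new_order(customer_id, order_id):
    orders.insert(customer_id, order_id)
    return [(order_id, name) for name in customers.get(customer_id)]

print(on_new_order(7, "order-2"))  # [('order-2', 'alice')]
```

By contrast, a join (or window function) keyed by some bespoke expression would have to build and maintain its own arrangement of the data first, which is where the cost shows up.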
I’m not aware of work to automatically materialize things, beyond things like “you seem to query this a lot; maybe you should materialize it?”. Materialization is about trading one elastic resource (compute, memory, .. money essentially) for time, which is relatively harder to get more of. It’s hard (for me) to say when that trade-off is worth it for someone else. The current state at MZ is that we mean to make it easy to spin up new performance-isolated materializations so that you can judge whether it would be useful without negatively impacting your other use cases. I could imagine noodling a bit on whether folks want automatic materialization, or clearer communication about costs that make the decision easier, or .. some other approach to configuring workloads (and .. generally the development lifecycle, which is rarely fire and forget; you often materialize during business logic development to cut the iteration time, and then think harder about what you want to pay for longer term).
Sorry for the wall of text. It’s a good question that I don’t have a great answer to!
Edit: I should add, the “intuition” may only be relevant for MZ/DD. If other systems don’t have shared arrangements, no reason joins would be efficient when “re-using” existing data.
Wow! Thanks for all that detail.
For “smart” materialization: I was wondering if there might be a setting/configuration such that if estimated expense in storage and/or compute is below a threshold, then views would be materialized. I assume this would end in disaster but I’m curious.
E.g. something like select count(1) from tasks where project_id = $1 seems like it would have low expense.
Yeah, seems totally reasonable! One gotcha that can show up is when the magic behind the scenes results in confused users. For example, MZ has count(*) and limit k optimizations that make reading fast in these cases, but when you introduce a filter to the first case, or an order by to the second, the performance gets much worse. These cases are cheap, but they’re often used mainly in development (anyone using limit without order by is a mystery to me, unless they just want to eyeball some data). It seems very reasonable (and relevant!) to want an “easy mode” development set-up where more views get automatically materialized, because while developing you value time more than one-off resource costs, and you don’t want to navigate the costs until you have the logic right.
Agreed. Ideally we’d want to trust our developers and give them the tools to know what’s going on and not hide things behind magic.
Materialized views have been a bane of databases for a long time. As the blog post notes they are almost tautologically useful but hard to implement.
There is a research paper on trying to implement them in a nice way, Noria: https://share.google/3ZVPHnN2j1zFXkOCS
IIRC there was a startup based on it
The Noria paper has a shout-out to lobste.rs:
When serving the Lobsters web application on a single Amazon EC2 VM, our prototype outperforms the default MySQL based backend by 5× while simultaneously simplifying the application.
Makes me wonder if the site is using something like it today. (I can’t tell what year the paper’s from.)
The citations date the publication to be no earlier than 09/06/2018, so my guess is that it was finalized some time during Fall 2018.
Edit: The paper was presented at the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018. It was held October 8–10, 2018: https://www.usenix.org/conference/osdi18/presentation/gjengset
IIRC there was a startup based on it
ReadySet
I was just reading this post from them yesterday: https://readyset.io/blog/optimizing-straddled-joins-in-readyset-from-hash-joins-to-index-condition-pushdown
This has been my thesis for a while:
I’m excited for efforts like Feldera (DBSP) to help provide
I think the main missing piece is that it’s hard to reconcile “persistent state” across many different environments/runtimes in a well-patterned way:
I think you’re spot on. Part of the problem, I think, is that we don’t have good low level patterns for designing incremental (and ideally reversible) stream processors. I would argue that our overall understanding of streams is low, and that translates in programming languages and standard libraries not being well equipped to solve that problem. Iterators and generators are a good step in that direction, but in my mind the basic toolkit is: deltas, immutable/persistable data structures and interruptible operations.
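One way to read the "basic toolkit" above, as a sketch: operators that consume and emit deltas, i.e. (value, diff) pairs, rather than whole collections. A linear operator like filter handles each delta independently, so it is trivially incremental (the function names here are made up for illustration).

```python
# Sketch of a delta-based stream operator using a generator:
# each change is a (value, diff) pair, where diff is +1 (insert)
# or -1 (retraction).

def delta_filter(pred, deltas):
    """Incremental filter: forward only deltas whose value matches."""
    for value, diff in deltas:
        if pred(value):
            yield (value, diff)

# Insert 3, insert 10, then retract 3: the retraction flows through
# the operator the same way the insertion did.
changes = [(3, +1), (10, +1), (3, -1)]
print(list(delta_filter(lambda v: v > 5, changes)))  # [(10, 1)]
```

Generators also hint at the "interruptible operations" point: the operator can be resumed per delta rather than run to completion over a whole collection.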
My take on DBSP is that it actually provides a solid foundation for the algorithms and our understanding of streams, with a relatively good complexity trade-off. (In contrast, differential/timely data flow seemed quite capable, but were also too complex to easily pick up as a developer.) It now seems more a question of integrating those patterns with existing codebases and data stores.
We’re working on that at Feldera, it’s just that there’s a lot of existing codebases and data stores and there are only a few of us :-)
It’s definitely a better impedance match to fit Feldera/DBSP up against a system that also works natively in terms of streams of deltas, but “change data capture”, which translates something like Postgres into a series of deltas, has its own little database sub-industry. So while I’d hope that more systems work in deltas as time goes on, it’s already something that is possible.
(I work at Feldera.)
Useful context for anyone who is not familiar: materialized views in Postgres are not smart in the way the author is talking about. They are just caches that are manually refreshed.
Feldera is one such implementation of IVM. I have high hopes for it. One of its operating principles is making changes a first-class concept (via Z-sets), which seems similar in spirit to CDC but grounded in first principles. Next to Materialize it is the only such stream computation system I am aware of that maintains “internal consistency”.
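For readers unfamiliar with Z-sets, a minimal sketch of the idea (this is an illustration, not Feldera's implementation): a Z-set is a collection where each element carries an integer weight, with positive weights as insertions and negative as deletions; combining two Z-sets merges changes by adding weights.

```python
# Minimal Z-set sketch: elements with integer weights. Adding a delta
# Z-set to a state Z-set applies its insertions (+) and deletions (-).

from collections import Counter

def zset_add(a, b):
    """Pointwise sum of weights; drop elements whose weight reaches zero."""
    out = Counter(a)
    for elem, w in b.items():
        out[elem] += w
        if out[elem] == 0:
            del out[elem]
    return dict(out)

state = {"row1": 1, "row2": 1}    # current view contents
delta = {"row2": -1, "row3": 1}   # a change: delete row2, insert row3
print(zset_add(state, delta))     # {'row1': 1, 'row3': 1}
```

This is what makes changes "first-class": a plain collection and a change to a collection have the same type, so the same operators can process both.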
A similar technology in this space is ClickHouse with its Incremental Materialized Views. It has you specify state aggregations more explicitly, whereas in Feldera you would write more standard, declarative SQL.
Triggers are the solution?
Integer Indexes are slow now? Wow
Depending on how many rows are found after filtering, the runtime can have some variability, and you don’t want variability.
Also, this is just an example; you can imagine a different, less trivial problem where indexes are not so obviously helpful, but “materialized views” continue to be a useful tool.
Finding a good example is difficult: https://www.hillelwayne.com/post/persuasive-examples/