Our Grafana and Loki installs have quietly become 'legacy software' here
31 points by runxiyu
I worked near the Angular team when they did their migration from Angular 1 to 2. It was a big-bang rewrite from scratch, with all the new hot things like TypeScript and custom compilation tooling.
From the perspective of the people writing it: they had to break compatibility to enable all these great new ideas they had. (I have no comment on how great the new ideas actually were – I’m not too familiar with this library. I am just describing here the perspective of the authors of the code.)
But from the perspective of their existing users: they were just told the tool they were using (Angular 1) was getting abandoned, and if they wanted to keep using Angular they would need to rewrite from scratch atop the new one. At that point, once you’re rewriting from scratch, you might as well reevaluate which library you’re using, and in particular once you’re vendor shopping you probably will reconsider picking the vendor that just ditched you.
The framework churn in Web/GUI is real. I don’t know why that’s the case. Same with GTK. Every single major version increment destroys the entire ecosystem and rebuilds it anew. GTK2 was neat and pervasive. Parts of the ecosystems still haven’t recovered since GTK2 was abandoned for GTK3 (and the same fate is now awaiting GTK3 with the introduction of GTK4). And for what? The improvements aren’t tangible.
I am not sure if that just means that we haven’t found the best abstraction layers and interface designs yet, or if the cause for these issues is far more irrational. Meanwhile, libc generally remains backwards compatible to its end-users forever, and so does the interface between Linux and its user-space. So we know it’s possible, it’s just that for some unknown reason, certain parts of the stack are completely trashed and rewritten every few years.
One hypothesis was that many open source projects are done on a voluntary basis, in developers’ free time, and those developers don’t want to do maintenance on existing projects but would rather build new things.
It’s not always bad. Off the top of my head: Go, Clojure, Java, Emacs, NixOS, Postgres, etc. are backward compatible and/or stable.
This is what happened with Python 3. No users were crying out for a wholesale replacement, but the devs didn’t find Python 2 fun any more.
Modifying an existing project in a backward-compatible way is exponentially more difficult than doing it in a non-backward-compatible way. And projects tend to improve the things they can improve, until they reach a point where they are left with a bunch of things that are near impossible to change.
It’s really the same as an organism producing offspring and then dying itself. An organism accumulates shortcomings, damage, deficiencies… entropy, while the environment around it changes.
At least with GTK the timelines between the major releases are quite significant.
GTK2: released 2002
GTK3: released 2011
GTK4: released 2020
Getting close to a decade of stability doesn’t seem too bad
For comparison, I can read the OpenStep specification from 1994 and use it to write an application that will run on the most recent macOS. I’ll get a bunch of deprecation warnings (Apple renamed a bunch of enum values to work better with autocomplete, for example, but the old ones are still there), but pretty much everything still works. It often isn’t the recommended way of doing things. OpenStep was designed in an era when 8 MiB of RAM was a lot, so has abstractions like NSCell to allow a single view instance to be used repeatedly in different places, whereas now you’d just burn a few MiBs of RAM for a much simpler programming model. Similarly, modern Cocoa makes pervasive use of CALayer to let views render to textures and then composite the result rather than redrawing, because that’s more power efficient with a vaguely modern GPU (but was much slower on 1994 workstations). These incremental changes have allowed OpenStep/Cocoa to evolve from something very efficient on early-’90s hardware to something that scales down to mobile phones and up to workstations on modern hardware.
That’s super interesting. I wasn’t aware it was possible to have such long-lasting macOS applications. I was under the impression Apple often makes breaking changes.
I’m curious if Apple maintains ABI compatibility (sans the PowerPC to x86 to ARM changes)? i.e. can an app compiled on x86 Snow Leopard run on the latest x86 macOS?
I was under the impression Apple often makes breaking changes.
They do, but they have made very few in the core of OpenStep. The thing that they do more often is deprecate and remove entire frameworks. Things like their sync services APIs, for example, which did arbitrary sync with client devices were removed shortly after the iPhone launched because non-Apple devices using iSync worked better with OS X than iPhones.
They usually do graceful deprecation, even outside of the ‘core’ frameworks, so you’ve got a few releases of the old things working.
I’m curious if Apple maintains ABI compatibility (sans the PowerPC to x86 to ARM changes)? i.e. can an app compiled on x86 Snow Leopard run on the latest x86 macOS?
Yes, unless you do unsupported things. If it’s a 64-bit x86 binary, it will probably work. I’ve had a few exceptions: OmniOutliner 3, for example, had a bug that happened to be benign on older versions but causes it to crash on startup on newer ones. Apple doesn’t have Microsoft’s policy of adding compatibility layers to make buggy code keep working and that can be annoying when you buy a product and the only way for it to work with a newer OS is to buy a new version. Especially when the new version is much worse.
If you dynamically link libSystem, things will mostly keep working. If you do raw system calls (undocumented), things may break. Apple famously changed the parameters for gettimeofday a few years ago, which broke every single Go binary (I suspect they did it deliberately to discourage Go from depending on things that they may want to change).
One of the big changes around Snow Leopard was the Objective-C non-fragile ABI. This meant that every ObjC ivar (field) was accessed via an indirection layer. When you subclassed a class from a system framework, it could add fields without breaking the layout of your class. This, in turn, meant that the headers didn’t need to include any private ivars (and most framework-provided classes had no private ivars). This made it quite easy to change the internal implementation of things without breaking user code. At the same time, reflection meant that you could poke at these implementation details if you wanted to, and if you did, then things might break from one minor release to the next.
For JS-related things specifically, I think it’s in part that JS doesn’t really lend itself very well to abstractions. So you end up with something that feels like it does only 1 of X things right, so you aim for a higher number of right things, but you still end up with only 1 of X things right; it’s just a different 1 now.
This has not been my experience with React. There have been a few major versions that required changes, probably the biggest being the deprecation of createClass and maybe also changes around context (which was provided with the caveat that it was experimental in the first place), but there have always been codemods provided that reduced the pain appreciably.
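For a rough idea of what that createClass migration looked like, here’s a sketch (not the actual codemod output; the component is made up, and the legacy factory now lives in the separate create-react-class package):

    import React from 'react';
    // The legacy factory was moved out of the core package in React 16.
    import createReactClass from 'create-react-class';

    // Before: the pre-16 createClass style that the codemods rewrote.
    const Counter = createReactClass({
      getInitialState() {
        return { count: 0 };
      },
      render() {
        return (
          <button onClick={() => this.setState({ count: this.state.count + 1 })}>
            {this.state.count}
          </button>
        );
      },
    });

    // After: the same component as a plain function component with hooks.
    function CounterModern() {
      const [count, setCount] = React.useState(0);
      return <button onClick={() => setCount(count + 1)}>{count}</button>;
    }

The shape of the component changes, but the mapping is mechanical enough that a codemod can do most of it, which is why the pain stayed manageable.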
we haven’t found the best abstraction layers and interface designs yet
I think this is probably the case, at least for UI stuff. For all the warts of web UI libraries my understanding is that they still represent some of the best abstractions we have for UI, which suggests to me that there is still a lot of room for improvement.
I feel a bit glib saying it like this but libc and GTK seem pretty different in what they’re trying to do, don’t they?
I don’t feel like I have the capability to do this, but perhaps someone can look at the different types of software, and how their design affects how they evolve. Libc isn’t the kernel ABI, isn’t GTK, isn’t a web-front-end, and what it costs to create backwards compatibility is going to be different for each of them.
But from the perspective of their existing users
From the perspective of their users, it was a rousing success with a very high rate of application conversion
From the perspective of the rest of their users, who happened to not work for Google, it was a disaster
Anyone have any good replacement for Grafana though? To me it has for a long time felt like this thing everyone uses not because it’s great, but because there isn’t really anything better (at least that’s not SaaS / massively enterprise)
It’s great… in comparison to Kibana :-)
I think that depends on which side of the user/admin side you are on.
We have just replaced Kibana & Graylog with Grafana Loki, which I’m told is a hella lot cheaper and also much less hassle to support for our use case.
Sadly as a user Grafana sure feels like a giant step backward. It’s a log search tool that can’t, like, search logs? I now have to search in 15 minute slices of time to avoid timeouts, whereas with Graylog I could search a week or more and get almost instant results.
Perhaps I’m being unkind. Graylog was prone to ingestion-lag, which was becoming more common. It’s early days in our Grafana journey, and perhaps our infra team can make it work better now they don’t have to wrestle with Graylog the whole time.
I created some visualisations with Kibana in the past and it was such a pain, and after an update they were all broken. If I remember correctly it was not even possible to export them. Creating visualisations with Grafana is so much easier.
I’m prepared to believe that. I haven’t attempted to build dashboards in either tool, I’ve just used them for ad-hoc search.
Literally anything else? I honestly don’t know how Grafana/Tempo/etc became the de facto leader in the ‘open source monitoring’ stack. The UI is incredibly clunky, slow, and painful to use. Tying together traces with logs or graphs is done using clunky split panes, and long traces often freeze the browser tab. Loki times out constantly, and I know $currjob keeps throwing hardware at it and has smart people managing it.
Persistence of dashboards is a non-standard feature that is crazy complicated to set up and get working alongside gitops correctly. I’ve searched for how it works and have seen gnashing of teeth from ops people trying to set it up so it sticks around through deploys. This is the ability for a user to save a custom graph or dashboard - that’s it…which is kind of a cornerstone of getting real engineering adoption of monitoring tools.
Datadog is so far ahead in this game it’s not funny…and yes, I know it’s super expensive, their sales tactics are shady, and costs can spiral. But they make a great product that I’ve had up and running and useful within a day or two of setting up their agents and adding a few key integrations. And you can save graphs!
Even New Relic and AppSignal were okay when I used them, though that was at least 8-10 years ago.
I really wish there was much more competition in the space, especially at the ‘freemium/OSS level’. It seems like you either pay 6 figures a year for DD or maybe a competitor, or you have a team of people supporting and managing grafana year-round for a tool that is mediocre at best.
on edit: apologies for the rant :), I clearly feel strongly about having good monitoring tools to bridge the world across ops, dev, even product and non-tech folks. And I used to use every tool available every day to keep tabs on what my team is shipping. That just feels too painful w/ the current entrenched state of grafana.
Persistence of dashboards is a non-standard feature
That really sounds like a “you’re holding it wrong” sort of problem.
It stores its dashboard, user, etc. data in SQLite3, MySQL, or Postgres. Figuring out how not to delete your database when you deploy doesn’t seem like the biggest challenge.
The latest release does have a git sync feature, which does seem like a nice addition if you don’t want a DB to worry about though.
Perhaps it is “easy” on a brand new Grafana setup, running the latest version, with a very simple setup.
The experience I’ve seen of a large, long-running, multi-environment Grafana setup, using k8s, gitops, all the ‘best practices’, is that it is a very real challenge. The ‘standard path’ for Grafana, at least as of 3-4 years ago, was that you had predefined dashboards in JSON, which you had to edit as JSON if you wanted to change them. If you made changes to ‘provisioned dashboards’ in the UI, there was no easy way to persist them or even save them as a new dashboard. The hybrid approach with a stateful DB is still pretty new and has lots of rough edges.
[Allowing saving of provisioned dashboards], i.e. a pretty fundamental feature, was merged in Oct 2019. It had many issues and fiddly bits that caused all sorts of pain and have needed workarounds right up to today. Here is just one open issue, and searching around for related issues turns up many other basic, user-focused things that don’t work or require specific config and setup to handle.
I agree that Grafana and Gitops don’t go well together for dashboards, and recommend against it.
Defining your dashboards in code or in JSON just isn’t any fun.
Treat the database as a persistent datastore, and don’t wipe and reset it every deploy. Back it up if you have to.
Anyways, this is changing now that Grafana’s got an official git-sync feature in beta.
I’d probably look at one of the newer observability platforms like openobserve/signoz/victoria-*.
Maybe SigNoz? (https://signoz.io)
I evaluated it a while back but never got into a full deployment. It’s supposed to be a one-stop shop and easier to setup. Their Open Source version has all of the functionality one would need to replace OSS Grafana I believe.
As an employee I’m biased, but I think Gravwell does a pretty good job. We have a good free offering at 14GB/day ingest (trivially bumped to 50GB/day if you’re a business user kicking the tires), you can self-host it, and we take great pains to avoid ever breaking existing dashboards or queries. Good support for querying from outside the web UI, too, which is a big thing for me.
VictoriaLogs author here. If you are interested in the technical details of why VictoriaLogs can replace a 27-node Elasticsearch cluster, then read this article.
Your linked Medium article doesn’t seem to touch on how compute-intensive the creation of those bloom filters is, especially since I would guess you have to recalculate them for every new log? Also, is that one filter per token over all log entries?
I would guess you have to recalculate them for every new log
VictoriaLogs splits all fields across all the ingested logs into tokens (words) on data ingestion, and then calculates bloom filters on the obtained tokens. This takes additional CPU time, but it works quite fast on average, so a single CPU core can process up to 50 MB of ingested logs per second (and this can be optimized further).
is that one filter per token over all log entries?
No, logs are stored in independent blocks. Every block has its own bloom filters. So identical tokens, which belong to distinct blocks, are stored to distinct bloom filters.
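To make the per-block scheme concrete, here is a toy sketch (illustration only, not VictoriaLogs code; all names, sizes, and hash choices here are made up):

    // Toy per-block bloom filter index.
    class BloomFilter {
      private bits: Uint8Array;
      constructor(private size = 8192, private hashes = 3) {
        this.bits = new Uint8Array(size);
      }
      private hash(token: string, seed: number): number {
        let h = 2166136261 ^ seed; // FNV-1a style hash, seeded per hash function
        for (let i = 0; i < token.length; i++) {
          h = Math.imul(h ^ token.charCodeAt(i), 16777619);
        }
        return (h >>> 0) % this.size;
      }
      add(token: string): void {
        for (let s = 0; s < this.hashes; s++) this.bits[this.hash(token, s)] = 1;
      }
      mayContain(token: string): boolean { // false positives possible, false negatives not
        for (let s = 0; s < this.hashes; s++) {
          if (this.bits[this.hash(token, s)] === 0) return false;
        }
        return true;
      }
    }

    interface Block { lines: string[]; filter: BloomFilter }

    const tokenize = (line: string): string[] =>
      line.toLowerCase().split(/\W+/).filter(Boolean);

    // Ingestion: each block of logs gets its own filter over its tokens.
    function buildBlock(lines: string[]): Block {
      const filter = new BloomFilter();
      for (const line of lines) for (const tok of tokenize(line)) filter.add(tok);
      return { lines, filter };
    }

    // Query: only scan blocks whose filter might contain the token.
    function search(blocks: Block[], token: string): string[] {
      const t = token.toLowerCase();
      return blocks
        .filter((b) => b.filter.mayContain(t))
        .flatMap((b) => b.lines.filter((l) => tokenize(l).includes(t)));
    }

    const blocks = [
      buildBlock(['GET /healthz 200', 'GET /metrics 200']),
      buildBlock(['POST /login 500 user=alice', 'POST /login 200 user=bob']),
    ];
    console.log(search(blocks, 'alice')); // only the second block is actually scanned

In this sketch a filter is built once per block at ingestion time and never recalculated afterwards, and a query can skip every block whose filter answers “definitely not here” without touching the log data itself.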
The fediverse thread on VictoriaLogs that @cks linked to includes his previous comments on Grafana Loki
Another factor was it became increasingly obvious that Loki was not intended for our simple setup and future versions of Loki might well work even worse in it than our current version does
This.
Every time I see a SaaS product packaged to be self-hosted, I know it’s almost certainly going to be increasingly complicated to operate on your own. And it’s not surprising… a hosted product that’s supposed to support a multi-tenanted production system will be designed differently from a single-tenant one.
Is VictoriaLogs open-core like victoriametrics?
VictoriaLogs is open source, published under the Apache2 license. It doesn’t have closed-source features.
Also, VictoriaMetrics is open source under Apache2. There is an enterprise version, which contains additional features that could be useful in an enterprise environment, but the open-source VictoriaMetrics has all the essential features needed by most users. See the list of enterprise features here.
I can confirm, most of the things unlocked by the enterprise features really aren’t useful in a normal setting. In my opinion it’s really well targeted. Their incremental backup tool is particularly nice in its simplicity.
I’m going to chime in here and give @valyala a shoutout (along with Artem and Derek!). I was part of the vendor relationship team at my last employer and we had an enterprise contract with VictoriaMetrics. They were by far our best vendor. What we got out of the contract was their expertise. They were one of the most responsive and sensible vendors and were incredibly helpful at making a product that’s already pretty easy to operate go as far as you want to take it.
I can’t recommend them enough either way: use their software for free, or give them your money. Both are absolutely worth it.
Also, VictoriaMetrics is open source under Apache2. There is an enterprise version, which contains additional features that could be useful in an enterprise environment, but the open-source VictoriaMetrics has all the essential features needed by most users. See the list of enterprise features here.
That’s still open-core.
The open-core model is a business model for the monetization of commercially produced open-source software. The open-core model primarily involves offering a “core” or feature-limited version of a software product as free and open-source software, while offering “commercial” versions or add-ons as proprietary software.[1][2] The term was coined by Andrew Lampitt in 2008.[3][4]
The concept of open-core software has proven to be controversial, as many developers do not consider the business model to be true open-source software. Despite this, open-core models are used by many open-source software companies.[5]
https://en.wikipedia.org/wiki/Open-core_model
Open-core is still better than nothing, but I always avoid it if I can.