A Decade on Datomic - Davis Shepherd & Jonathan Indig (Netflix)

9 points by adityaathalye


pragmatic

To save someone some ad views and the ear-splitting Muzak intro:

(Warning: AI-generated summary ahead, but I'm a reader, not a watcher, and I'm curious about Clojure and Datomic in the wild.)

This video, "A Decade on Datomic," features Davis Shepherd and Jonathan Indig from Netflix discussing their experience building and evolving an orchestration system (Dagaba) using Clojure and Datomic over the past decade (0:12).

Here's a breakdown of the key points:

• What is Dagaba? It's a system for building data pipelines, specifically for data scientists and researchers at Netflix to define ML training pipelines (1:21). It uses a Python API which is translated into EDN for execution (1:34).

• Key Feature: Data Provenance and Structural Sharing. Dagaba's distinguishing strength is its model for tracking data provenance. This allows structural sharing between data pipelines: common components are reused, which boosts productivity and saves computation time and cost because only the necessary deltas are recomputed (2:15-3:12).
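To make the "only recompute the deltas" idea concrete, here is a minimal Python sketch, not how Dagaba does it. It assumes each step's identity is a hash of its function name, parameters, and the identities of its inputs; `step_key`, `Runner`, and the toy steps are all hypothetical names.

```python
import hashlib
import json


def step_key(name, params, input_keys):
    """Provenance key: hash of the step definition plus its inputs' keys."""
    payload = json.dumps([name, params, list(input_keys)], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class Runner:
    def __init__(self):
        self.cache = {}     # provenance key -> computed result
        self.executed = []  # step names that were actually computed

    def run(self, pipeline, target):
        """pipeline maps step name -> (fn, params, [input step names])."""
        fn, params, inputs = pipeline[target]
        resolved = [self.run(pipeline, name) for name in inputs]
        key = step_key(fn.__name__, params, [k for k, _ in resolved])
        if key not in self.cache:  # same provenance means the result is shared
            self.cache[key] = fn(params, *[v for _, v in resolved])
            self.executed.append(target)
        return key, self.cache[key]


def load(params):
    return list(range(params["n"]))


def scale(params, xs):
    return [x * params["factor"] for x in xs]


def total(params, xs):
    return sum(xs)


pipeline_a = {
    "load": (load, {"n": 4}, []),
    "scale": (scale, {"factor": 2}, ["load"]),
    "total": (total, {}, ["scale"]),
}
# Same pipeline with one changed parameter: only the downstream steps differ.
pipeline_b = dict(pipeline_a, scale=(scale, {"factor": 3}, ["load"]))

runner = Runner()
_, result_a = runner.run(pipeline_a, "total")
_, result_b = runner.run(pipeline_b, "total")
```

After both runs, `runner.executed` shows `load` computed once and `scale`/`total` computed twice, since only the changed step and what depends on it get new keys.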

• Initial Architecture and Evolution:

• Initially, Dagaba was a monolith deployed in a highly available (HA) configuration, managing all aspects of the application (3:51).

• The state machine uses a forward inference rules engine, aggressively retracting inactive entities for performance (4:22-5:00).

• Over 10 years, it evolved from serving a single ML team to dozens of teams training thousands of models (5:21-5:32).

• The system was split into smaller deployable units to improve horizontal scaling and efficiency while sharing the same codebase (5:50-6:15).

• Scaling Challenges and Datomic Solutions:

• Reporter Component: The first split was the reporter, which runs long historical queries. Datomic's built-in object cache allowed read load to be decoupled from write load, with a very low cache miss rate, enabling independent scaling (6:39-7:58).

• API Component: Scaling read APIs (stateless queries) presented a challenge due to heterogeneous queries. Datomic's sync primitive and the basis-t value allowed for read-after-write consistency across separate instances without taking on complex distributed-systems problems (8:06-10:25).
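A toy Python model of the basis-t idea, for readers who have not used Datomic. This is not the Datomic API: `Log`, `Peer`, and `min_t` are hypothetical, and the real sync hands back a database value that includes everything up to t rather than replaying a list. The point is only that a writer returns a monotonically increasing t, and any other instance can refuse to answer until it has caught up to that t.

```python
class Log:
    """Stand-in for the transaction log; t is just the log length."""

    def __init__(self):
        self.txs = []

    def transact(self, key, value):
        self.txs.append((key, value))
        return len(self.txs)  # the basis t of the resulting database


class Peer:
    """A read instance with its own, possibly stale, view of the log."""

    def __init__(self, log):
        self.log = log
        self.basis_t = 0
        self.state = {}

    def sync(self, t):
        # A real system would wait for t to arrive; the sketch just rejects it.
        if t > len(self.log.txs):
            raise ValueError("t is ahead of the log")
        while self.basis_t < t:
            key, value = self.log.txs[self.basis_t]
            self.state[key] = value
            self.basis_t += 1

    def read(self, key, min_t=0):
        self.sync(min_t)  # catch up to at least the caller's basis t
        return self.state.get(key)


log = Log()
reader = Peer(log)

t = log.transact("job-1", "running")    # write goes through some other instance
stale = reader.read("job-1")            # no basis t supplied: may be stale
fresh = reader.read("job-1", min_t=t)   # read-after-write: sees the write
```

The client carries `t` from the write response to the next read request, so no coordination between the API instances themselves is needed.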

• Archiver Component and Large Transactions: Aggressively retracting inactive graphs led to large transactions that occupied the transactor for multiple seconds (10:31-11:48). The solution involved splitting the archiver onto its own stack and moving to tombstones (marking entities as inactive instead of retracting them in full), leveraging Datomic's flexible schema and database filter (11:58-12:55).
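The tombstone approach can be sketched in a few lines of Python. This is an illustration under assumptions, not the Netflix schema: datoms are reduced to (entity, attribute, value) tuples, and the `archived` attribute name is made up. Archiving adds one small marker datom, and readers see a filtered view of the database that hides tombstoned entities.

```python
from typing import NamedTuple


class Datom(NamedTuple):
    e: int      # entity id
    a: str      # attribute
    v: object   # value


def tombstone(datoms, entity):
    """Mark an entity inactive by adding one datom; nothing is retracted."""
    return datoms + [Datom(entity, "archived", True)]


def active_view(datoms):
    """Filtered view: hide every datom belonging to a tombstoned entity."""
    dead = {d.e for d in datoms if d.a == "archived" and d.v is True}
    return [d for d in datoms if d.e not in dead]


db = [Datom(1, "name", "train"), Datom(2, "name", "eval")]
db_after = tombstone(db, 1)
visible = active_view(db_after)
```

The write is constant-size regardless of how large the archived graph is, which is what keeps the transactor from being tied up; the cost moves to the read-side filter.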

• Topological Sort for Retraction: To break up large retraction transactions, they implemented a topological sort from leaf to root using Datomic's entity API and regular Clojure code, which proved to be very fast even for graphs with half a million nodes (12:59-13:49).
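For illustration, a leaf-to-root ordering with batching might look like the Python below. The talk describes doing this with Datomic's entity API and plain Clojure; this sketch swaps in a plain dict of child links, and `leaf_to_root`, `batches`, and the batch size are hypothetical. It assumes the graph is acyclic, and it is iterative so that a graph with hundreds of thousands of nodes does not hit a recursion limit.

```python
def leaf_to_root(children, root):
    """Post-order walk: every node appears after all of its descendants."""
    order = []
    seen = {root}
    stack = [(root, iter(children.get(root, ())))]
    while stack:
        node, remaining = stack[-1]
        for child in remaining:
            if child not in seen:
                seen.add(child)
                stack.append((child, iter(children.get(child, ()))))
                break  # descend into this child first
        else:
            order.append(node)  # all children handled, emit the node
            stack.pop()
    return order


def batches(items, size):
    """Split the ordering into small transactions of at most `size` nodes."""
    return [items[i:i + size] for i in range(0, len(items), size)]


graph = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
order = leaf_to_root(graph, "root")
tx_batches = batches(order, 2)
```

Because leaves go first, each batch can be retracted on its own without ever leaving a parent pointing at an already-removed child in a later batch.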

• Monitor Component and Security: The most recent evolution was driven by the business need to handle PII data, requiring the monitor component (which handles side effects like launching jobs) to be isolated due to its broad permissions and the presence of untrusted user code (13:53-15:10). Datomic's sync (15:57) and tx-report-queue (16:47) were crucial for ensuring synchronized state at startup and receiving change notifications (15:10-17:00).
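A rough Python sketch of the startup pattern described here: subscribe to change reports, catch up to the current state, then apply only reports newer than what was already loaded. `Transactor` and `Monitor` are hypothetical stand-ins built on the standard-library `queue` module, not Datomic's tx-report-queue itself.

```python
import queue


class Transactor:
    def __init__(self):
        self.log = []
        self.subscribers = []

    def subscribe(self):
        q = queue.Queue()
        self.subscribers.append(q)
        return q

    def transact(self, datom):
        self.log.append(datom)
        t = len(self.log)
        for q in self.subscribers:
            q.put((t, datom))  # notify every subscriber of the change
        return t


class Monitor:
    def __init__(self, transactor):
        # Subscribe before reading state so no report can fall in the gap.
        self.reports = transactor.subscribe()
        self.seen = list(transactor.log)  # startup sync to the current state
        self.basis_t = len(self.seen)

    def drain(self):
        """Apply all pending reports that are newer than our basis t."""
        while True:
            try:
                t, datom = self.reports.get_nowait()
            except queue.Empty:
                return
            if t > self.basis_t:  # skip anything the startup sync covered
                self.seen.append(datom)
                self.basis_t = t


transactor = Transactor()
transactor.transact("job-1 launched")
monitor = Monitor(transactor)          # starts after one transaction exists
transactor.transact("job-1 finished")
monitor.drain()
```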

• Benefits of Clojure and Datomic:

• The existing architecture's strong component and protocol boundaries, combined with Datomic treating the database as a value, made state updates a pure, testable function (17:06-17:29).
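What "state updates as a pure function of a database value" buys you can be shown with a small Python sketch. Everything here is assumed for illustration: the job/status/deps shape, the single readiness rule, and the names `ready_jobs` and `apply_tx`. The update logic takes a snapshot and returns proposed transaction data, so it can be tested against a literal with no database running.

```python
def ready_jobs(db):
    """Pure rule: a pending job whose dependencies are all done becomes ready."""
    return sorted(
        (job, "status", "ready")
        for job, attrs in db.items()
        if attrs["status"] == "pending"
        and all(db[dep]["status"] == "done" for dep in attrs["deps"])
    )


def apply_tx(db, tx):
    """Return a new snapshot; the old one is left untouched."""
    new_db = {job: dict(attrs) for job, attrs in db.items()}
    for job, attr, value in tx:
        new_db[job][attr] = value
    return new_db


db0 = {
    "extract": {"status": "done", "deps": []},
    "train": {"status": "pending", "deps": ["extract"]},
    "eval": {"status": "pending", "deps": ["train"]},
}

tx = ready_jobs(db0)     # decide what should change
db1 = apply_tx(db0, tx)  # produce the next database value
```

`db0` is still intact after the update, so the same snapshot can be replayed through a modified rule to compare outcomes.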

• The primitives provided by Datomic made solving distributed system problems straightforward (18:01-18:15).

• Datomic has proven to be extremely robust infrastructure despite heavy usage (18:17-18:28).

• The system handles approximately 100,000 daily executions for thousands of models, with the application running on about 40 peers across five stacks, and a 20 TB database with 15 billion datoms (18:31-18:48).

• Stability of Clojure Ecosystem: Jonathan Indig emphasizes the unusual stability of the Clojure ecosystem (19:10-19:21). He contrasts this with the significant disruption and migration efforts (taking months to years) caused by frequent breaking changes in other ecosystems like Hadoop, Spark, Python, and Java runtimes (20:17-22:41).

In contrast, Clojure upgrades often only require a version bump in the build file with no source code changes (22:42-24:03), and Datomic upgrades have also been similarly straightforward with minimal issues (24:06-25:44).

• Conclusion: The speakers express gratitude to the many engineers who contributed to Dagaba, to the Clojure open-source community for providing world-class and stable tools (27:00-28:09), and especially to the Clojure and Datomic teams for enabling them to focus on solving business problems without being encumbered by unpredictable or breaking tools (28:51-29:28).