Scaling Graph Databases

One machine stops being enough long before your graph stops growing. This day is about what breaks when you partition a graph across machines: why graph partitioning is NP-hard, how cross-partition edges quietly dominate traversal cost, what supernodes do to your latency, and how Neo4j Fabric, JanusGraph, TigerGraph, and Amazon Neptune each make a different bet on where the hard part should live.

Day 1 Progress0%

Why Graphs Are Hard to Partition

The Beginner and Intermediate courses treated the graph as a single thing living on a single machine. The Advanced course opens with the inconvenient truth: a graph is the worst-case data structure to split across machines. Relational rows shard cleanly by key. Vector indexes shard by hashing IDs. A graph fights you, because the entire value of a graph is in its edges — and edges are exactly what partitioning has to cut.

The Core Problem: Edge Cuts

When you split a graph's vertices across P partitions, every edge is either local (both endpoints in the same partition) or a cross-partition edge (endpoints in different partitions, sometimes called a "cut edge" or "ghost edge"). Local edges are cheap to traverse — a pointer chase in RAM. Cross-partition edges require a network hop to follow.

The whole game of graph partitioning is: minimize the number of cross-partition edges while keeping partitions roughly balanced in size. Formally this is the balanced minimum k-cut problem.

Min-Cut and Why It's NP-Hard

The unconstrained minimum cut between two vertices is solvable in polynomial time (max-flow / min-cut theorem). But the moment you add the balance constraint — each partition must hold roughly |V| / P vertices — the problem becomes NP-hard. You cannot, in general, find the optimal balanced partition of a large graph in reasonable time.

So in practice nobody computes the optimal cut. They use heuristics:

  • METIS / multilevel partitioning: coarsen the graph (collapse vertices), partition the small coarse graph, then refine (Kernighan–Lin swaps) as you uncoarsen. The de-facto standard for offline, static graphs.
  • Label propagation: each vertex iteratively adopts the partition label most common among its neighbors. Cheap, parallel, streaming-friendly, lower quality than METIS.
  • Streaming partitioners (LDG, Fennel): assign each vertex to a partition as it arrives, greedily placing it where most of its already-seen neighbors live, with a penalty for overfull partitions. Used when the graph is too big to hold in memory.

Edge-Cut vs Vertex-Cut

There are two dual ways to partition:

  • Edge-cut (vertex partitioning): vertices are assigned to partitions; edges that span partitions are "cut." Best when the graph is relatively uniform in degree.
  • Vertex-cut (edge partitioning): edges are assigned to partitions; high-degree vertices get replicated across the partitions that hold their edges. This is what PowerGraph/GraphX use, and it's dramatically better for power-law graphs (social networks, web graphs) where a few vertices have enormous degree. Cutting a million-edge vertex once (edge-cut) creates a brutal hotspot; spreading its edges (vertex-cut) balances the load.

Why Real-World Graphs Make It Worse

Most interesting graphs are scale-free / power-law: degree distribution follows a heavy tail. A handful of vertices (celebrities, popular products, the literal node for "United States") have millions of edges, while most vertices have a handful. This destroys naive partitioning — you can't cut around a vertex whose edges touch every partition no matter where you put it. That special pain has a name, and we devote Section 3 to it: supernodes.

The Takeaway Before You Write Any Code

Distribution is not free and the cost is recall and latency on traversals, not storage. A graph that is 5% cross-partition edges can still be 50%+ cross-partition traversal cost, because traversals concentrate on the well-connected core. Measuring cross-partition edge ratio is the first diagnostic — and you'll build exactly that in the Practice tab.

Key Takeaways
  • Partitioning a graph means cutting edges, and balanced minimum k-cut is NP-hard — production systems use heuristics (METIS, label propagation, streaming partitioners) not optimal solutions
  • Edge-cut assigns vertices and cuts spanning edges; vertex-cut assigns edges and replicates high-degree vertices — vertex-cut is far better for power-law graphs
  • Cross-partition edge ratio is the first diagnostic, but traversal cost is worse than the raw ratio because traversals concentrate on the densely-connected core

AI Learning Assistant

Powered by advanced LLM

Get personalized help with concepts, code examples, and explanations tailored to your learning pace.

Course Stats

Estimated Time
50 min
Lessons
5 sections