Name: Scaling Graph Databases
Availability: InStock

Why Graphs Are Hard to Partition

The Beginner and Intermediate courses treated the graph as a single thing living on a single machine. The Advanced course opens with the inconvenient truth: a graph is the worst-case data structure to split across machines. Relational rows shard cleanly by key. Vector indexes shard by hashing IDs. A graph fights you, because the entire value of a graph is in its edges — and edges are exactly what partitioning has to cut.

The Core Problem: Edge Cuts

When you split a graph's vertices across P partitions, every edge is either local (both endpoints in the same partition) or a cross-partition edge (endpoints in different partitions, sometimes called a "cut edge" or "ghost edge"). Local edges are cheap to traverse — a pointer chase in RAM. Cross-partition edges require a network hop to follow.

The whole game of graph partitioning is: minimize the number of cross-partition edges while keeping partitions roughly balanced in size. Formally this is the balanced minimum k-cut problem.

Min-Cut and Why It's NP-Hard

The unconstrained minimum cut between two vertices is solvable in polynomial time (max-flow / min-cut theorem). But the moment you add the balance constraint — each partition must hold roughly |V| / P vertices — the problem becomes NP-hard. You cannot, in general, find the optimal balanced partition of a large graph in reasonable time.

So in practice nobody computes the optimal cut. They use heuristics:

METIS / multilevel partitioning: coarsen the graph (collapse vertices), partition the small coarse graph, then refine (Kernighan–Lin swaps) as you uncoarsen. The de-facto standard for offline, static graphs.
Label propagation: each vertex iteratively adopts the partition label most common among its neighbors. Cheap, parallel, streaming-friendly, lower quality than METIS.
Streaming partitioners (LDG, Fennel): assign each vertex to a partition as it arrives, greedily placing it where most of its already-seen neighbors live, with a penalty for overfull partitions. Used when the graph is too big to hold in memory.

Edge-Cut vs Vertex-Cut

There are two dual ways to partition:

Edge-cut (vertex partitioning): vertices are assigned to partitions; edges that span partitions are "cut." Best when the graph is relatively uniform in degree.
Vertex-cut (edge partitioning): edges are assigned to partitions; high-degree vertices get replicated across the partitions that hold their edges. This is what PowerGraph/GraphX use, and it's dramatically better for power-law graphs (social networks, web graphs) where a few vertices have enormous degree. Cutting a million-edge vertex once (edge-cut) creates a brutal hotspot; spreading its edges (vertex-cut) balances the load.

Why Real-World Graphs Make It Worse

Most interesting graphs are scale-free / power-law: degree distribution follows a heavy tail. A handful of vertices (celebrities, popular products, the literal node for "United States") have millions of edges, while most vertices have a handful. This destroys naive partitioning — you can't cut around a vertex whose edges touch every partition no matter where you put it. That special pain has a name, and we devote Section 3 to it: supernodes.

The Takeaway Before You Write Any Code

Distribution is not free and the cost is recall and latency on traversals, not storage. A graph that is 5% cross-partition edges can still be 50%+ cross-partition traversal cost, because traversals concentrate on the well-connected core. Measuring cross-partition edge ratio is the first diagnostic — and you'll build exactly that in the Practice tab.

Key Takeaways

Partitioning a graph means cutting edges, and balanced minimum k-cut is NP-hard — production systems use heuristics (METIS, label propagation, streaming partitioners) not optimal solutions
Edge-cut assigns vertices and cuts spanning edges; vertex-cut assigns edges and replicates high-degree vertices — vertex-cut is far better for power-law graphs
Cross-partition edge ratio is the first diagnostic, but traversal cost is worse than the raw ratio because traversals concentrate on the densely-connected core

Scaling Graph Databases

Why Graphs Are Hard to Partition

Why Graphs Are Hard to Partition

The Core Problem: Edge Cuts

Min-Cut and Why It's NP-Hard

Edge-Cut vs Vertex-Cut

Why Real-World Graphs Make It Worse

The Takeaway Before You Write Any Code

Sharding Strategies for Graphs

Supernodes and How to Tame Them

Replication, Consistency, and Distributed Traversal Cost

How the Engines Approach Scale

AI Learning Assistant

Course Stats

Up Next