Elixir Syn: Partitioning Scopes with phash2 to Reduce Message Queue Buildup

· 3 min read
👋 I'm a dev at Supabase

I work on logging and analytics, and manage the underlying service that powers Supabase Logs and Logflare. The service handles over 7 billion requests each day, with traffic constantly growing, and these devlog posts talk a bit about my day-to-day open source dev work.

It serves as some insight into what one can expect when working on high availability software, with real code snippets and PRs too. Enjoy! 😊

When working with large distributed Elixir clusters at Logflare/Supabase, we encountered situations where the syn_gen_scope gen_server process would become overwhelmed with messages during cross-cluster synchronization. This happens particularly when thousands of processes register under a single scope, causing the message queue to grow significantly and impacting cluster discovery and synchronization performance.

Under the hood, each :syn scope runs as a single gen_server process (see syn_gen_scope.erl). This process handles node discovery, state synchronization when nodes join or rejoin, and broadcasting updates across the cluster. All of these operations funnel through one process per scope. When thousands of processes register under a single scope, this gen_server has to handle every discovery request, every sync acknowledgment, and every broadcast. Its message queue grows, and cluster synchronization slows down.
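To make the "message queue grows" symptom concrete, here is a minimal sketch of how a mailbox backlog can be observed with Process.info/2. It does not touch :syn at all: the spawned process is only a stand-in for a busy scope process, and the {:sync, n} messages are made up for illustration.

```elixir
# Stand-in for an overloaded scope process: it never reads its mailbox.
pid = spawn(fn -> Process.sleep(:infinity) end)

# Simulate a burst of synchronization traffic (message shape is hypothetical).
for n <- 1..1_000, do: send(pid, {:sync, n})

# :message_queue_len reports how many messages are waiting in the mailbox.
{:message_queue_len, queue_len} = Process.info(pid, :message_queue_len)
IO.puts("queued messages: #{queue_len}")

# Clean up the stand-in process.
Process.exit(pid, :kill)
```

Sampling the same value periodically on a real process is a cheap way to see whether it is keeping up with its inbound traffic.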

One way to deal with this is to partition the scope: create multiple scope processes and use phash2 to consistently hash a term (such as a resource identifier) to determine which partition scope to use. This splits the synchronization load across multiple syn_gen_scope processes, which can then be scheduled on different cores in parallel. Throughput goes up, the message queue on any single scope process stays shorter, and overall :syn stability improves.

The phash2 function provides a consistent hash that will always map the same term to the same partition, ensuring that registration and lookup operations use the same scope across all nodes in the cluster. This consistency is critical for distributed systems where processes on different nodes need to agree on which scope contains a particular registration.
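As a rough illustration of that mapping, the sketch below wraps :erlang.phash2/2 in a small helper. The module name PartitionSketch, the function scope_for/2, and the identifier "endpoint-123" are hypothetical and exist only for this example.

```elixir
defmodule PartitionSketch do
  # Hypothetical helper: maps any term to one of `count` partition scope names.
  def scope_for(term, count) when is_integer(count) and count > 0 do
    # :erlang.phash2/2 returns an integer in the range 0..count-1.
    :"my_scope_#{:erlang.phash2(term, count)}"
  end
end

# Hashing the same term twice with the same count yields the same scope.
first = PartitionSketch.scope_for("endpoint-123", 8)
second = PartitionSketch.scope_for("endpoint-123", 8)
IO.inspect({first, second, first == second})
```

Because the number of partitions is small and fixed, building the scope atom dynamically only ever produces a bounded set of atoms.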

Here's how we configure partitioned scopes based on the number of schedulers available on the system:

# runtime.exs
import Config

# Explicitly create the partition scope atoms during application startup.
# The range is inclusive, so this yields schedulers_online + 1 scopes.
syn_my_scope_partitions =
  for n <- 0..System.schedulers_online(), do: :"my_scope_#{n}"

config :syn,
  scopes: [:other_scopes] ++ syn_my_scope_partitions

This creates scopes :my_scope_0, :my_scope_1, and so on up to :my_scope_N, where N is the number of online schedulers. Because the range is inclusive, that is N + 1 scopes in total. Tying the partition count to the scheduler count helps spread the work across the available CPU cores and splits messages across multiple syn_gen_scope processes.
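One detail worth checking is that the range passed to phash2 equals the number of configured scopes, otherwise some identifiers would hash to a scope that does not exist. The following sketch is a quick sanity check of that invariant, not part of the actual configuration. It reuses the my_scope_ naming from above, and the integer identifiers are arbitrary test data.

```elixir
schedulers = System.schedulers_online()

# Same shape as the config: an inclusive range gives schedulers + 1 names.
scopes = for n <- 0..schedulers, do: :"my_scope_#{n}"
partition_count = length(scopes)

# Collect the distinct scopes that a batch of arbitrary identifiers hashes to.
hashed =
  for id <- 1..10_000, uniq: true do
    :"my_scope_#{:erlang.phash2(id, partition_count)}"
  end

IO.inspect(partition_count, label: "partitions")
IO.inspect(Enum.all?(hashed, &(&1 in scopes)), label: "all hashed scopes configured")
```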

To use these partitioned scopes, we pass a :via tuple as the name option of GenServer.start_link/3. The tuple {:via, :syn, {scope, name}}, or {:via, :syn, {scope, name, meta}} when metadata is attached, lets Syn handle registration automatically: the process registers on start and unregisters on termination.

defmodule Logflare.Endpoint do
  use GenServer

  def start_link(identifier) do
    GenServer.start_link(__MODULE__, identifier, name: via(identifier))
  end

  def get_info(identifier) do
    GenServer.call(via(identifier), :get_info)
  end

  # Read at runtime rather than as a compile-time module attribute, so the
  # count always matches the scopes that runtime.exs created on this machine.
  defp partition_count, do: System.schedulers_online() + 1

  # Hash the identifier to pick a partition scope, then build the :via tuple.
  defp via(identifier) do
    scope = :"my_scope_#{:erlang.phash2(identifier, partition_count())}"
    {:via, :syn, {scope, identifier}}
  end

  @impl true
  def init(identifier), do: {:ok, %{identifier: identifier}}

  @impl true
  def handle_call(:get_info, _from, state), do: {:reply, state, state}
end

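Since phash2 is deterministic, lookups from any node in the cluster resolve to the correct partition scope, provided every node uses the same partition count. In practice this means all nodes should run with the same number of online schedulers, or the partition count should be pinned to a fixed value.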
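The determinism only helps when every node hashes into the same range. The sketch below illustrates what happens when two nodes disagree on that range. The counts 5 and 9 are hypothetical, standing in for nodes with 4 and 8 schedulers, and the identifiers are generated test data.

```elixir
ids = for n <- 1..1_000, do: "endpoint-#{n}"

# Hypothetical partition counts on two differently sized nodes.
small_node_count = 5
large_node_count = 9

# Count identifiers that land in the same partition under both counts.
agreeing =
  Enum.count(ids, fn id ->
    :erlang.phash2(id, small_node_count) == :erlang.phash2(id, large_node_count)
  end)

IO.puts("#{agreeing} of #{length(ids)} identifiers map to the same partition on both nodes")
```

Any identifier that does not agree would be registered under one scope on the first node and looked up under a different scope on the second.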

The benefit of this approach is that registrations are spread across multiple scope processes, so each syn_gen_scope gen_server handles only a fraction of the total synchronization load. This reduces message queue buildup and improves the responsiveness of cluster synchronization, especially when the process registry holds a large number of processes.
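To get a feel for how evenly the load is split, one can hash a batch of synthetic identifiers and count how many land in each partition. This is only a sketch: the "endpoint-N" identifiers are made up, and real identifiers may distribute somewhat differently.

```elixir
partition_count = System.schedulers_online() + 1

# Count how many synthetic identifiers hash into each partition index.
loads =
  Enum.frequencies_by(1..100_000, fn n ->
    :erlang.phash2("endpoint-#{n}", partition_count)
  end)

{min_load, max_load} = loads |> Map.values() |> Enum.min_max()

IO.inspect(map_size(loads), label: "partitions in use")
IO.inspect({min_load, max_load}, label: "min/max registrations per partition")
```

A small gap between the minimum and maximum suggests that no single scope process carries a disproportionate share of the registrations.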

Conclusion

Partitioning Syn scopes with phash2 is a straightforward way to scale distributed process registration across large clusters. It reduces message queue buildup and helps keep cluster synchronization responsive as the number of registered processes grows into the thousands or millions.