QCon London 2012: High availability at Heroku

Mark McGranaghan
Track: Highly available systems


Lots of load balancing.

Embrace crashing and enable supervision (as in Erlang):

  • distributed supervision
  • crashes as code paths
  • crashes as hot code paths (exercised a lot)
  • keep the kernel smaller and smaller: this keeps it simple
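
A minimal sketch of the supervision idea, not Heroku's implementation: a supervisor treats a worker crash as a normal code path and restarts the worker, escalating only when a restart budget is exhausted. The names supervise, max_restarts and make_flaky_worker are hypothetical.

```python
def supervise(worker, max_restarts=3):
    """Run worker(); on a crash, restart it up to max_restarts times.

    Returns (result, crash_count). Re-raises the last error once the
    restart budget is exhausted, so a parent supervisor can handle it.
    """
    crashes = 0
    while True:
        try:
            return worker(), crashes
        except Exception:
            crashes += 1
            if crashes > max_restarts:
                raise


def make_flaky_worker(failures):
    """Build a worker that crashes `failures` times, then succeeds."""
    state = {"left": failures}

    def worker():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated crash")
        return "done"

    return worker
```

Re-raising on exhaustion is what lets supervisors be stacked: each level only knows how to restart its children or give up.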

Message passing (JSON format). Handle different versions of messages.
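
One way to handle several message versions is to normalize every incoming message to a single internal shape. This is only a sketch: the "version", "app" and "region" fields and the decode_event function are assumptions made for illustration, not the message format from the talk.

```python
import json


def decode_event(raw):
    """Normalize a JSON message of any known version to one shape."""
    msg = json.loads(raw)
    version = msg.get("version", 1)  # assumption: unversioned means v1
    if version == 1:
        # hypothetical v1 layout: only an "app" field
        return {"app": msg["app"], "region": "unknown"}
    if version == 2:
        # hypothetical v2 layout: adds a "region" field
        return {"app": msg["app"], "region": msg["region"]}
    raise ValueError(f"unsupported message version: {version}")
```

Consumers written this way keep working while producers are upgraded one node at a time.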

Continuously running.

It’s a distributed system with granular failure.

Brokered queueing (producer/consumer pattern):

  • RabbitMQ used for a while
  • a single broker node is a SPOF; the fix is publish to one, subscribe to all
  • several brokers: each publish is load-balanced to one broker, and every subscriber subscribes to all brokers
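
The "publish to one, subscribe to all" arrangement can be modelled in memory as below. This is a toy sketch, not RabbitMQ client code: Broker, subscribe_all and publish_one are hypothetical names, and a down broker is simulated with a flag.

```python
import random


class Broker:
    """Toy in-memory broker that fans a message out to its subscribers."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.subscribers = []

    def publish(self, message):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        for callback in self.subscribers:
            callback(message)


def subscribe_all(brokers, callback):
    """Register the subscriber on every broker."""
    for broker in brokers:
        broker.subscribers.append(callback)


def publish_one(brokers, message, rng=random):
    """Publish to a single broker, trying brokers in random order."""
    for broker in rng.sample(brokers, len(brokers)):
        try:
            broker.publish(message)
            return broker.name
        except ConnectionError:
            continue  # this broker is down, try the next one
    raise ConnectionError("no broker available")
```

Because every subscriber listens on every broker, losing one broker costs capacity but not delivery for messages published afterwards.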

Partial failure in the read call graph: graceful termination.

Write call-graph de-synchronization: when the target of a write is not available, write a ticket to a local database so the write operation can be delayed.
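
The ticket idea can be sketched with SQLite standing in for the local database. This is an assumption-laden illustration: the tickets table, the function names, and the use of ConnectionError to signal unavailability are all hypothetical.

```python
import json
import sqlite3


def open_tickets(path=":memory:"):
    """Open (or create) the local ticket store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tickets "
        "(id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return db


def write_or_ticket(db, remote_write, payload):
    """Try the remote write; if it is unavailable, record a ticket locally."""
    try:
        remote_write(payload)
        return "written"
    except ConnectionError:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO tickets (payload) VALUES (?)",
                (json.dumps(payload),),
            )
        return "ticketed"


def replay_tickets(db, remote_write):
    """Replay pending tickets in order; stop at the first failure."""
    replayed = 0
    rows = db.execute("SELECT id, payload FROM tickets ORDER BY id").fetchall()
    for ticket_id, payload in rows:
        try:
            remote_write(json.loads(payload))
        except ConnectionError:
            break  # still unavailable, keep the remaining tickets
        with db:
            db.execute("DELETE FROM tickets WHERE id = ?", (ticket_id,))
        replayed += 1
    return replayed
```

A ticket is deleted only after its write succeeds, so a crash between the two steps replays the write; the remote write should therefore be idempotent.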

Execution (everything outside architecture)

Evolving socio-technical ecosystems. Most of the problems are:

  • failed deploy
  • bad visibility
  • cascading feedback

Need for a very repeatable deploy:

  • incremental deploys: deploy to a few nodes, then increase the deploy perimeter as confidence grows
  • incremental rollouts (features): feature flags enabled for dev, then beta users, then all users
  • real time visibility: dashboards (60s visibility)
  • service-level assertions: asserts in code, at the global level for a service. Assert good things too
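
The first two items above can be sketched as follows. Everything here is hypothetical: the doubling wave size, the deploy and healthy callbacks, and the stage names are illustration choices, not details from the talk.

```python
def deploy_in_waves(nodes, deploy, healthy, first_wave=1):
    """Deploy to a small wave, doubling the wave size while nodes stay healthy.

    Returns the nodes deployed to; stops after the first unhealthy wave.
    """
    done = []
    size = first_wave
    remaining = list(nodes)
    while remaining:
        wave, remaining = remaining[:size], remaining[size:]
        for node in wave:
            deploy(node)
        done.extend(wave)
        if not all(healthy(node) for node in wave):
            break  # confidence lost: stop widening the perimeter
        size *= 2
    return done


# Rollout order for feature flags: dev first, then beta users, then everyone.
STAGES = {"dev": 0, "beta": 1, "all": 2}


def flag_enabled(flag_stage, user_group):
    """A flag at a given stage is on for that group and all earlier groups."""
    return STAGES[user_group] <= STAGES[flag_stage]
```

Both functions widen exposure in steps, so a bad change reaches a few nodes or a few users before it can reach all of them.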

Flow control and back pressure: some systems can’t absorb all the load, so:

  • divert traffic away from such a system
  • limit message passing (parameters stored on the file system, e.g. /etc/rate/publish)
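
A file-driven publish limit might look like the sketch below. The notes only mention a path such as /etc/rate/publish; the file format (a single integer, read here as messages per second), the fixed one-second window, and the names read_limit and PublishLimiter are assumptions.

```python
import time


def read_limit(path, default=100):
    """Read the max messages per second from a file; fall back to default."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


class PublishLimiter:
    """Fixed-window limiter whose limit is re-read from a file each window."""

    def __init__(self, path, clock=time.monotonic):
        self.path = path
        self.clock = clock
        self.window_start = clock()
        self.limit = read_limit(path)
        self.count = 0

    def allow(self):
        """Return True if one more message may be published right now."""
        now = self.clock()
        if now - self.window_start >= 1.0:
            # new window: pick up any change an operator made to the file
            self.window_start = now
            self.limit = read_limit(self.path)
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Keeping the parameter in a file means an operator can tighten or relax the limit on a running process without a deploy.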