Platform Resilience

This page provides guidance on how to achieve continuous operation and high availability.

Overview

The following diagram illustrates the main components. It shows example configurations for secondary hosts and which platform components can optionally be deployed.

Resilience Overview

Key Access

Key Access instances operate as a fault-tolerant cluster providing the following services:

  • Authorisation and authentication services.
  • Heartbeat based monitoring of applications.
  • Leader election services for clustered applications.
Resilience Key Access
  • All instances can accept registration requests.
  • The ingress load balancer should be configured to distribute requests to running instances.
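
Purely as an illustration of the heartbeat-based monitoring listed above (this is not the Key Access API; the names and timeout value are assumptions), a cluster node might track application liveness roughly as follows:

```python
import time

# Hypothetical sketch of heartbeat-based liveness monitoring.
# The real Key Access API and timeout values are not documented on this page.
HEARTBEAT_TIMEOUT_SECONDS = 5.0  # assumed value for the sketch

class LivenessTracker:
    """Tracks the last heartbeat seen for each application id."""

    def __init__(self, timeout: float = HEARTBEAT_TIMEOUT_SECONDS):
        self.timeout = timeout
        self.last_seen: dict[int, float] = {}

    def record_heartbeat(self, application_id: int) -> None:
        # Called whenever a heartbeat arrives for an application.
        self.last_seen[application_id] = time.monotonic()

    def is_online(self, application_id: int) -> bool:
        # An application is considered online while heartbeats keep arriving.
        last = self.last_seen.get(application_id)
        return last is not None and (time.monotonic() - last) < self.timeout

# Example usage
tracker = LivenessTracker()
tracker.record_heartbeat(application_id=42)
print(tracker.is_online(42))  # True immediately after a heartbeat
```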

Key Access has two cluster node discovery modes:

  • Multicast - enables automatic discovery of new cluster members, allowing seamless scaling and dynamic addition of nodes.
  • TCP - cluster members are configured as a static list of TCP endpoints.
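
The configuration format for discovery is not documented on this page; purely as an illustrative sketch, the two modes could be captured along these lines (all keys, addresses and ports are assumptions, not the real schema):

```python
# Hypothetical discovery settings for a Key Access node.

# Multicast: new members announce themselves and are discovered automatically.
multicast_discovery = {
    "mode": "multicast",
    "multicast_group": "224.0.0.1",  # assumed example group
    "multicast_port": 54327,         # assumed example port
}

# TCP: the member list is fixed up front, so adding nodes requires a config change.
tcp_discovery = {
    "mode": "tcp",
    "members": [
        "keyaccess-1.example.internal:5701",  # assumed example endpoints
        "keyaccess-2.example.internal:5701",
    ],
}
```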

Recommendation: Running at least two instances ensures that the service remains available if one instance fails.

Sequencer

The Sequencer is responsible for ordering messages within the platform. It can be deployed in an Active / Hot Standby configuration.

Active / Hot Standby

  • A single active instance of the sequencer will sequence messages.
  • Hot-standby instance(s) are ready to automatically take over if the active instance fails.

Having a single active sequencer instance rather than a cluster reduces the need to exchange information between instances, allowing for lower latency and higher throughput. The trade-off is that, in the event of sequencer failure, there is a short (configurable) period during which a standby sequencer instance takes over.

Resilience Sequencer

Upon failover, publishers are automatically notified, prompting them to re-send the most recent image for each topic to recover any messages that may have been lost in transit. This ensures that any message loss is limited to in-flight image updates for a given topic.

For instance, if a publisher is streaming prices for several instruments (topics) and a failover occurs, the system prompts the publisher to resend the latest price image for each instrument (topic). Even if several intermediate updates were lost in transit, subscribers will still receive the most up-to-date price image for every instrument (topic), ensuring data continuity and minimising any impact on downstream consumers.
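
To make this concrete, here is a minimal sketch of the publisher side, assuming a hypothetical failover notification callback and publish function (these are illustrative names, not the platform API):

```python
# Hypothetical sketch of a publisher reacting to a sequencer failover.
# `PricePublisher` and `on_sequencer_failover` are illustrative, not the real API.

class PricePublisher:
    def __init__(self, publish):
        self.publish = publish                    # function(topic, image) supplied by the platform
        self.latest_image: dict[str, dict] = {}   # last published image per topic

    def publish_price(self, instrument: str, price: float) -> None:
        image = {"instrument": instrument, "price": price}
        self.latest_image[instrument] = image     # remember the most recent image
        self.publish(instrument, image)

    def on_sequencer_failover(self) -> None:
        # When the standby sequencer takes over, re-send the latest image for
        # every topic so subscribers recover any in-flight updates that were lost.
        for topic, image in self.latest_image.items():
            self.publish(topic, image)

# Example usage with a stand-in publish function
sent = []
publisher = PricePublisher(publish=lambda topic, image: sent.append((topic, image)))
publisher.publish_price("EURUSD", 1.0842)
publisher.publish_price("GBPUSD", 1.2710)
publisher.on_sequencer_failover()  # re-publishes both latest images
```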

Recommendation: Deploying at least two sequencer instances (one active, one running as a hot standby) ensures rapid recovery from failures.

Clustered

note

This is on our roadmap; however, the active / hot standby approach is preferred for on-prem deployments (where the risk of server loss is low) due to its performance benefits.

Relays

  • A single Relay Live and Relay Cache instance should be deployed on each host.
  • The relays only serve applications that are co-located on the same host.
Resilience Relays

In the event of host failure, no action needs to be taken regarding the relays because they only serve processes that are co-located on the same host.
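
As a small illustration of why no relay failover logic is needed, applications always address the relays on their own host, so the relay endpoints never change (the port numbers below are assumptions):

```python
# Illustrative only: applications talk to the relays on their own host.
# The port numbers are assumptions, not documented values.
RELAY_LIVE_ENDPOINT = ("127.0.0.1", 7000)
RELAY_CACHE_ENDPOINT = ("127.0.0.1", 7001)

def relay_endpoints():
    # If the host fails, the applications on it fail together with their relays;
    # an application restarted on another host again finds its relays on 127.0.0.1.
    return RELAY_LIVE_ENDPOINT, RELAY_CACHE_ENDPOINT
```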

Recommendation: Run a pair of relays (one Relay Live, one Relay Cache) on each host.

Monitor

The Monitor component collects and stores platform metrics, providing insights into system health and performance.

  • Multiple instances of the Monitor can be run safely.
  • They capture metrics published on the platform and idempotently write these into a database. Idempotent database writes prevent duplicate metric entries, ensuring data accuracy even if the same metric is reported multiple times.
Resilience Monitor
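
As a minimal sketch of the idempotent write, assuming a uniqueness key over the application, metric name and timestamp (the real database and schema are not specified here; SQLite is used purely for illustration):

```python
import sqlite3

# Minimal sketch of idempotent metric writes. The uniqueness key
# (application_id, metric_name, ts) is an assumption for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS metrics (
        application_id INTEGER NOT NULL,
        metric_name    TEXT    NOT NULL,
        ts             INTEGER NOT NULL,
        value          REAL    NOT NULL,
        PRIMARY KEY (application_id, metric_name, ts)
    )
    """
)

def write_metric(application_id: int, metric_name: str, ts: int, value: float) -> None:
    # INSERT OR IGNORE makes the write idempotent: if another Monitor instance
    # has already stored this metric sample, the duplicate is silently dropped.
    conn.execute(
        "INSERT OR IGNORE INTO metrics VALUES (?, ?, ?, ?)",
        (application_id, metric_name, ts, value),
    )
    conn.commit()

# Two Monitor instances reporting the same sample result in a single row.
write_metric(42, "heap_used_bytes", 1700000000, 123456.0)
write_metric(42, "heap_used_bytes", 1700000000, 123456.0)
```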

Recommendation: Running at least two Monitor instances increases reliability and ensures continuous metric collection.

KeySquare Proxy

  • Multiple instances of the Proxy can be run.
  • Clients will be automatically load balanced to an active instance.
  • In the event of host failure an alternative proxy instance will be used.
Resilience Proxy

Recommendation: Running at least two instances ensures that clients are automatically redirected to healthy nodes in case of failures.
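
Conceptually, the load balancing amounts to routing each client to any healthy instance. A simplified sketch of that selection logic, with assumed instance addresses and an injected health check:

```python
import random

# Simplified illustration of routing clients to a healthy proxy instance.
# Instance addresses and the health-check mechanism are assumptions for the sketch.
PROXY_INSTANCES = ["proxy-1.example.internal:8443", "proxy-2.example.internal:8443"]

def pick_instance(is_healthy) -> str:
    # `is_healthy` is a callable(address) -> bool, e.g. backed by a health probe.
    healthy = [addr for addr in PROXY_INSTANCES if is_healthy(addr)]
    if not healthy:
        raise RuntimeError("no healthy proxy instances available")
    return random.choice(healthy)

# With proxy-1's host down, clients are transparently directed to proxy-2.
print(pick_instance(lambda addr: addr.startswith("proxy-2")))
```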

Workspace

  • Multiple instances of the Workspace can be run.
  • Clients will be automatically load balanced to an active instance.
  • In the event of host failure an alternative instance will be used.
Resilience Workspace

Recommendation: Running at least two instances ensures that clients are automatically redirected to healthy nodes in case of failures.

Web Ingress

The platform uses an ingress controller (HAProxy) for development purposes. In production environments, it is recommended to replace this with a highly available solution to ensure scalability and reliability.

Application Resilience

note

The Key Access cluster forms the resiliency backbone of the platform and provides both health and leader election services to platform applications. These services will shortly be made available to all applications and integrated into the API.

Application Status

The ApplicationStatus message provides details of the health of an individual application.

This has the following properties:

Property            Description
topic group         application group name
topic id            application name
applicationId       the application id that this information relates to.
applicationGroupId  the application group id that this information relates to.
sessionId           the current (or last) session id assigned to the application.
liveness            can be used to determine if an application is online / offline - this is automatically determined by the platform.
readiness           can be used to determine if the application is in a "ready" state - readiness can be signalled by the application using the API.
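
As a consumer-side sketch, the fields above map onto a small record; the types and the is_healthy helper below are assumptions, not part of the documented message:

```python
from dataclasses import dataclass

# Sketch of an ApplicationStatus record based on the properties listed above.
# Field types are assumptions; the wire format is not documented on this page.
@dataclass
class ApplicationStatus:
    application_id: int
    application_group_id: int
    session_id: int
    liveness: bool   # determined automatically by the platform
    readiness: bool  # signalled by the application via the API

    def is_healthy(self) -> bool:
        # Hypothetical convenience: online and explicitly ready.
        return self.liveness and self.readiness

status = ApplicationStatus(
    application_id=42, application_group_id=7, session_id=1001,
    liveness=True, readiness=False,
)
print(status.is_healthy())  # False: online but not yet ready
```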

Resiliency Status

The ResiliencyStatus message provides details of a cluster of applications and can be used to determine which application is the lead.

This has the following properties:

Property             Description
topic id             application group name.
leadApplicationId    the application id of the cluster leader (zero if none is available).
leadApplicationName  the application name of the cluster leader (empty string if none is available).
fencingToken         a fencing token for the cluster - see How to do distributed locking
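
A minimal sketch of how a clustered application might act on ResiliencyStatus: only perform leader work when this instance holds the lead, and pass the fencing token along with any side effects so writes from a stale leader can be rejected (the helper names are hypothetical):

```python
from dataclasses import dataclass

# Sketch of acting on ResiliencyStatus; field names follow the table above,
# but the surrounding helpers are hypothetical.
@dataclass
class ResiliencyStatus:
    lead_application_id: int    # zero if no leader is available
    lead_application_name: str  # empty string if no leader is available
    fencing_token: int

MY_APPLICATION_ID = 42  # assumed id of this instance

def on_resiliency_status(status: ResiliencyStatus, do_leader_work) -> None:
    if status.lead_application_id == 0:
        return  # no leader elected yet
    if status.lead_application_id != MY_APPLICATION_ID:
        return  # another instance leads; stay passive
    # Pass the fencing token with every side effect so a downstream store can
    # reject writes carrying a lower (stale) token.
    do_leader_work(fencing_token=status.fencing_token)

# Example: this instance is the leader, so the work runs with the current token.
on_resiliency_status(
    ResiliencyStatus(lead_application_id=42, lead_application_name="pricer-a", fencing_token=17),
    do_leader_work=lambda fencing_token: print(f"leading with token {fencing_token}"),
)
```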