# Planning

> [!WARNING]
>
> Replication is a newly developed feature. This means it requires manual configuration and careful
> monitoring. You should keep backups if you choose to proceed.

It is important that you plan your replication deployment before you proceed. You may need high
availability within a datacentre, geographic redundancy, or improved read scaling.

## Improvement of Read Throughput

Adding replicas can increase the number of read and authentication operations performed across the
topology as a whole. This is because read throughput is additive between nodes.

For example, if you have two servers that can each process 1000 authentications per second, then in
replication the topology can process 2000 authentications per second.
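
As a minimal sketch of this additivity (the throughput figures here are illustrative assumptions,
not measured Kanidm numbers):

```rust
// Read and authentication throughput is additive across replicas:
// the topology can serve the sum of what each node can serve alone.
fn read_capacity(per_node_auths_per_sec: &[u32]) -> u32 {
    per_node_auths_per_sec.iter().sum()
}

fn main() {
    // Two servers, each able to process 1000 authentications per second.
    let topology = [1000, 1000];
    assert_eq!(read_capacity(&topology), 2000);
}
```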

However, while you may gain read throughput, you must account for downtime - you should not assume
that every server is always online.

The optimal loading of any single server is approximately 50%. This leaves headroom to absorb load
if nearby nodes experience outages, and to absorb load spikes or other unexpected events.
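
To make the 50% guideline concrete, the sketch below (a hypothetical helper with illustrative
figures) checks whether the surviving nodes can absorb the full load after a single outage:

```rust
// Sketch: if each node runs at no more than ~50% load, the topology can
// lose one node and the survivors can still carry the full load.
// All figures are illustrative assumptions, not measured Kanidm numbers.
fn survives_one_outage(per_node_capacity: u32, node_count: u32, total_load: u32) -> bool {
    // Capacity that remains after the loss of one node.
    let remaining = per_node_capacity * node_count.saturating_sub(1);
    total_load <= remaining
}

fn main() {
    // Two 1000 auth/sec nodes loaded to ~50% (1000 auth/sec total): safe.
    assert!(survives_one_outage(1000, 2, 1000));
    // The same two nodes at full load (2000 auth/sec) cannot lose one.
    assert!(!survives_one_outage(1000, 2, 2000));
}
```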

It is important to note, however, that as you add replicas the _write_ throughput does not increase
in the same way as read throughput. This is because each write that occurs on one node must be
replicated to and written on every other node, so your write throughput is always bounded by the
_slowest_ server in your topology. In reality there is a slight improvement in writes due to
coalescing that occurs as part of replication, but you should assume that writes are not improved
by adding more nodes.
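
In the same illustrative model, write throughput is a minimum rather than a sum (again, these are
hypothetical figures, not Kanidm benchmarks):

```rust
// Sketch: every write must eventually be applied on every replica, so the
// topology's sustainable write rate is capped by its slowest node.
fn write_capacity(per_node_writes_per_sec: &[u32]) -> u32 {
    per_node_writes_per_sec.iter().copied().min().unwrap_or(0)
}

fn main() {
    // Adding a faster node does not help: the 500 write/sec node is the cap.
    assert_eq!(write_capacity(&[1000, 500, 2000]), 500);
}
```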

## Directing Clients to Live Servers

Operating replicas of Kanidm allows you to minimise outages when one or more servers experience
downtime. This can assist you with patching and other administrative tasks that you must perform.

However, there are some key limitations to this fault tolerance.

You require a method to fail over between servers. This generally involves a load balancer, which
must itself be fault tolerant. Load balancers can be made fault tolerant through protocols like
`CARP` or `VRRP`, or by configuring routers with anycast.
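
As one possible starting point, a TCP-mode (TLS passthrough) HAProxy fragment is sketched below.
The hostnames are hypothetical and 8443 is assumed as the server port (Kanidm's default); treat
this as a sketch to adapt, not a definitive configuration.

```
frontend kanidm_in
    mode tcp
    bind *:8443
    default_backend kanidm_nodes

backend kanidm_nodes
    mode tcp
    balance leastconn
    # Health checks remove a failed node from rotation so clients are
    # only directed to live servers.
    server idm1 idm1.example.com:8443 check
    server idm2 idm2.example.com:8443 check
```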

If you elect to use `CARP` or `VRRP` directly on your Kanidm servers, be aware that this configures
your systems as active-passive rather than active-active, so you will not benefit from improved
read throughput. In contrast, anycast always routes to the closest Kanidm server and fails over to
nearby servers, which may make it an attractive choice.

You should _NOT_ use DNS-based failover mechanisms, as clients can cache DNS records and remain
"stuck" to a node in a failed state.

## Maximum Downtime of a Server

Kanidm's replication protocol enforces limits on how long a server can be offline, due to how
tombstones are handled. By default the maximum is 7 days. If a server is offline for more than 7
days, a refresh will be required before that server can continue to participate in the topology.

It is important to avoid extended downtime of servers so that this condition does not occur.