68 20230831 design replication coordinator ()

This commit is contained in:
Firstyear 2023-09-05 16:39:16 +10:00 committed by GitHub
parent eee9b09338
commit 436a3f0307
4 changed files with 802 additions and 7 deletions


@ -54,11 +54,14 @@
- [Developer Guide](DEVELOPER_README.md)
- [FAQ](developers/faq.md)
- [Design Documents]()
- [Architecture](developers/designs/architecture.md)
- [Access Profiles 2022](developers/designs/access_profiles_rework_2022.md)
- [Access Profiles Original](developers/designs/access_profiles_and_security.md)
- [REST Interface](developers/designs/rest_interface.md)
- [Elevated Priv Mode](developers/designs/elevated_priv_mode.md)
- [Oauth2 Refresh Tokens](developers/designs/oauth2_refresh_tokens.md)
- [Replication Internals](developers/designs/replication.md)
- [Replication Coordinator](developers/designs/replication_coord.md)
- [Python Module](developers/python.md)
- [RADIUS Integration](developers/radius.md)


@ -1,15 +1,30 @@
# The identity verification API
The following diagram describes the API request/response of the identity verification feature (from
here on referred to as "idv"). The API takes an _IdentifyUserRequest_ instance as input, which in
the diagram is represented by a circle shape, and it returns an _IdentifyUserResponse_, which is
represented by a rectangle. The response rectangles are colored green or red, and although all
responses belong to the same enum, the colors are meant to provide additional information. A green
response means that the input was valid and therefore it contains the next step in the identity
verification flow, while a red response means the input was invalid and the flow terminates there.
Note that the protocol is completely stateless, so the following diagram is not to be read as a
state machine; for the idv state machine go
[here](#the-identity-verification-state-machine-idv).
![idv api diagram](diagrams/idv_api_diagram.drawio.svg)
Note that the endpoint path is _/v1/person/:id/\_identify_user_, therefore every request is made up
of the _IdentifyUserRequest_ and an Id. Furthermore, to use the API a user needs to be
authenticated, so we link their userid to all their idv requests. Since all requests contain this
additional information, there is a subset of responses that solely depend on it and can therefore
**always** be returned regardless of what _IdentifyUserRequest_ was provided. Below you can find
said subset along with an explanation for every response.
![generic api responses](diagrams/idv_generic_responses.drawio.svg)
Here are the _IdentifyUserRequest_ and _IdentifyUserResponse_ enums just described, as found inside
the
[source code](https://github.com/kanidm/kanidm/blob/05b35df413e017ca44cc4410cc255b63728ef373/proto/src/internal.rs#L32):
```rust
pub enum IdentifyUserRequest {
@ -28,12 +43,17 @@ pub enum IdentifyUserResponse {
CodeFailure,
InvalidUserId,
}
```
## The identity verification state machine
Here is the idv state machine, built on top of the idv endpoint request/response types previously
described. Since the protocol provided by kanidm is completely stateless and doesn't involve any
online communication, some extra work is needed on the UI side to make things work. Specifically, on
the diagram you will notice some black arrows: they represent the state transitions driven entirely
by the UI without requiring any API call. You'll also notice some empty rectangles with a red
border: they represent the scenario in which the other user tells us that the code provided doesn't
match. This makes the idv fail, and it's the only case in which the failure is entirely driven by
the UI.
![idv state machine](diagrams/idv_state_machine.drawio.svg)


@ -0,0 +1,465 @@
# Replication Design and Notes
Replication is a critical feature in an IDM system, especially when deployed at major sites and
businesses. It allows for horizontal scaling of system read and write capacity, improves fault
tolerance (hardware, power, network, environmental), and can improve client latency (by positioning
replicas near clients).
## Replication Background
Replication is a directed graph model, where each node (server) and directed edge (replication
agreement) form a graph (topology). Since the topology and the direction of each agreement are
known, the nodes of the graph can be classified based on their data transit properties.
NOTE: Historically many replication systems used the terms "master" and "slave". This has a number
of negative cultural connotations, and is not used by this project.
### Read-Write server
This is a server that is fully writable. It accepts external client writes, and these writes are
propagated to the topology. Many read-write servers can be in a topology and written to in parallel.
### Transport Hub
This is a server that is not writeable to clients, but can accept incoming replicated writes, and
then propagates these to other servers. All servers that are directly after this server in the
topology must not be read-write, as writes may not propagate back from the transport hub, i.e. the
following is invalid:

    RW 1 ---> HUB <--- RW 2

Note the replication direction here, and that changes into HUB will not propagate back to RW 1 or
RW 2.
### Read-Only server
Also called a read-only replica, or in AD an RODC. This is a server that only accepts incoming
replicated changes, and has no outbound replication agreements.
Replication systems are governed by the CAP theorem. This theorem states that of "consistency,
availability and partition tolerance" you may only have two of the three at any time.
### Consistency
This is the property that a write to a server is guaranteed to be consistent and acknowledged to all
servers in the replication topology. A change happens on all nodes or it does not happen at all, and
clients contacting any server will always see the latest data.
### Availability
This is the property that every request will receive a non-error response without the guarantee that
the data is "up to date".
### Partition Tolerance
This is the property that your topology will continue to provide functional services (generally
reads) in the face of a network partition.
Almost all systems expect partition tolerance, so the choice becomes between consistency and
availability. These create a series of tradeoffs. Choosing consistency normally comes at
significantly reduced write throughput due to the need for a majority of nodes to acknowledge
changes. However, it avoids a need for complex conflict resolution systems. It also means that
clients can be in a situation where they can't write. For IDM this would mean new sessions could not
be created or accounts locked for security reasons.
Kanidm has chosen availability, as the needs of IDM dictate that we always function even in the face
of partition tolerance, and when other failures occur. This comes at the cost of needing to manage
conflict resolution. This AP selection is often called "eventually consistent" as nodes will
converge to an identical state over time.
## Replication Phases
There are two phases of replication
1. Refresh
This is when the content of a node is completely removed, and has the content of another node
applied to replace it. No conflicts or issues can occur in this, as the refreshed node is now a
"perfect clone" of the source node.
2. Incremental
This is when differentials of changes are sent between nodes in the topology. By sending small
diffs, it saves bandwidth between nodes and allows changes on all nodes to be merged and combined
with other nodes. It is the handling of these incremental updates that can create conflicts in the
data of the server.
## Ordering of Writes - Change Identifiers
Rather than using an external coordinator to determine consistency, time is used to determine
ordering of events. This allows any server to create a total-ordering of all events as though every
write had occurred on a single server. This is how all nodes in replication will "arrive" at the
same conclusion about data state, without the need for communication.
In order for time to be used in this fashion, it is important that the clock in use is always
_advancing_ and never stepping backwards. If a clock was to go backwards, it would cause an event on
one node to be written in a different order than the way that other servers will apply the same
writes. This creates data corruption.
As an aside, there _are_ systems that do replication today and _do not_ use always advancing clocks
which can allow data corruption to seep in.
In addition it's also important that if an event happens at exactly the same moment on two nodes
(down to the nanosecond) that a way of breaking the tie exists. This is why each server has an
internal uuid, where the server uuid is used to order events if the timestamps are identical.
These points in time are represented by a change identifier (CID) that contains the time of the
event and the server uuid that performed the event. In addition, every write transaction of the
server records the current time of the transaction, and if a subsequent transaction starts with a
"time in the past", then the time is "dragged forward" to one nanosecond after the former
transaction. This means CIDs always advance - and never go backwards.
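
To make this concrete, here is a minimal sketch of such a change identifier. The type and field
names here are illustrative assumptions rather than the exact Kanidm implementation; the important
property is that deriving the ordering over (timestamp, server uuid) yields the total order
described above, with the server uuid breaking exact-time ties.

```rust
use std::time::Duration;
use uuid::Uuid;

/// Sketch of a change identifier (CID). Ordering is derived over the fields
/// in declaration order: timestamp first, then the originating server uuid
/// as the tie breaker.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Cid {
    /// Time of the event. Within a server this must always advance.
    pub ts: Duration,
    /// Uuid of the server that performed the event.
    pub s_uuid: Uuid,
}

impl Cid {
    /// Create a CID for a new write transaction, dragging the time forward by
    /// one nanosecond if the clock has not advanced past the previous
    /// transaction's time.
    pub fn new(now: Duration, previous: Option<&Cid>, s_uuid: Uuid) -> Self {
        let ts = match previous {
            Some(prev) if now <= prev.ts => prev.ts + Duration::from_nanos(1),
            _ => now,
        };
        Cid { ts, s_uuid }
    }
}
```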
## Conflict Resolution
Despite the ability to order writes by time, consistency is not a property that we can guarantee in
an AP system. We must be able to handle the possibility of inconsistent data, and have the correct
methods to bring all nodes into a consistent state with cross communication. These consistency
errors are called conflicts. There are multiple types of conflict that can occur in a system like
Kanidm.
### Entry Conflicts
This is when the UUID of an entry is duplicated on a separate node. For example, two entries with
UUID=A are created at the same time on two separate nodes. During replication one of these two nodes
will persist and the other will become conflicted.
### Attribute Conflicts
When entries are updated on two nodes at the same time, the changes between the entries need to be
merged. If the same attribute is updated on two nodes at the same time the differences need to be
reconciled. There are three common levels of resolution used for this. Let's consider an entry such
as:

    # Node A
    attr_a: 1
    attr_b: 2
    attr_c: 3

    # Node B
    attr_b: 1
    attr_c: 2
    attr_d: 3
- Object Level
In object level resolution, the entry that was last written "wins". The whole content of the last
written entry is used, and the earlier write is lost.
In our example, if node B performed the last write, the entry would resolve as:

    # OL Resolution
    attr_b: 1
    attr_c: 2
    attr_d: 3
- Attribute Level
In attribute level resolution, the time of update for each attribute is tracked. If an attribute was
written later, the content of that attribute wins over the other entries.
For example, if attr b was written last on node B, and attr c was written last on node A, then the
entry would resolve to:

    # AL Resolution
    attr_a: 1 <- from node A
    attr_b: 1 <- from node B
    attr_c: 3 <- from node A
    attr_d: 3 <- from node B
- Value Level
In value level resolution, the values of each attribute are tracked for changes. This allows values
to be merged, depending on the type of attribute. This is the most "consistent" way to create an AP
system, but it's also the most complex to implement, generally requiring a changelog of entry states
and differentials for sequential reapplication.

Using this, our entries would resolve to:

    # VL Resolution
    attr_a: 1
    attr_b: 1, 2
    attr_c: 2, 3
    attr_d: 3
Each of these strategies has pros and cons. In Kanidm we have used a modified attribute level
strategy where individual attributes can internally perform value level resolution if needed in
limited cases. This allows fast and simple replication, while still allowing the best properties of
value level resolution in limited cases.
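
As an illustration of the attribute level strategy, here is a hypothetical last-writer-wins merge
over a simple map of attribute name to (last-write CID, values), reusing the `Cid` sketch above.
This is a sketch of the idea only, not the Kanidm merge code, which additionally allows some
attributes to perform value level resolution internally.

```rust
use std::collections::BTreeMap;

/// Illustrative representation: attribute name -> (cid of last write, values).
type Attrs = BTreeMap<String, (Cid, Vec<String>)>;

/// Attribute level resolution: for each attribute, keep whichever side was
/// written with the later CID. Attributes present on only one side are kept.
fn merge_attribute_level(local: &Attrs, incoming: &Attrs) -> Attrs {
    let mut merged = local.clone();
    for (attr, (in_cid, in_vals)) in incoming {
        let take_incoming = match merged.get(attr) {
            // The incoming write is strictly newer than our last write.
            Some((local_cid, _)) => in_cid > local_cid,
            // We have never written this attribute.
            None => true,
        };
        if take_incoming {
            merged.insert(attr.clone(), (in_cid.clone(), in_vals.clone()));
        }
    }
    merged
}
```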
### Schema Conflicts
When an entry is updated on two nodes at once, it may be possible that the updates on each node
individually are valid, but when combined create an inconsistent entry that is not valid with
respect to the schema of the server.
### Plugin Conflicts
Kanidm has a number of "plugins" that can enforce logical rules in the database such as referential
integrity and attribute uniqueness. In cases that these rules are violated due to incremental
updates, the plugins in some cases can repair the data. However in cases where this can not occur,
entries may become conflicts.
## Tracking Writes - Change State
To track these writes, each entry contains a hidden internal structure called a change state. The
change state tracks when the entry was created, when any attribute was written to, and when the
entry was deleted.
The change state reflects the lifecycle of the entry. It can either be:
- Live
- Tombstoned
A live entry is capable of being modified and written to. It is the "normal" state of an entry in
the database. A live entry contains an "origin time" or "created at" timestamp. This allows unique
identification of the entry when combined with the uuid of the entry itself.
A tombstoned entry is a "one way street". This represents that the entry at this uuid is _deleted_.
The tombstone propagates between all nodes of the topology, and after a tombstone window has passed,
is reaped by all nodes internally.
A live entry also contains a map of change times. This contains the maximum CID of when each
attribute of the entry was last updated. Consider an entry like:

    attr_a: 1
    attr_b: 2
    attr_c: 3
    uuid: X

This entry's changestate would show:

    Live {
        at: { server_uuid: A, cid: 1 },
        attrs: {
            attr_a: cid = 1
            attr_b: cid = 1
            attr_c: cid = 2
        }
    }
This shows us that the entry was created on server A at cid time 1. At creation, attributes a and b
were created, since they have the same cid. Attr c was either updated or created after this - we
can't tell if it existed at cid 1, we can only know that a write of some kind occurred at cid 2.
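
A sketch of how this change state could be modelled, with illustrative names and reusing the `Cid`
sketch from earlier (not the exact Kanidm types):

```rust
use std::collections::BTreeMap;

/// Sketch of the per-entry change state.
enum EntryChangeState {
    /// A normal, writable entry.
    Live {
        /// The "at" / origin CID: when and where the entry was created.
        at: Cid,
        /// The maximum CID at which each attribute was last written.
        attrs: BTreeMap<String, Cid>,
    },
    /// A deleted entry, awaiting reaping after the tombstone window.
    Tombstone {
        /// The CID at which the entry was deleted.
        at: Cid,
    },
}
```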
## Resolving Conflicts
With knowledge of the change state structure we can now demonstrate how the lower level entry and
attribute conflicts are detected and managed in Kanidm.
### Entry
An entry conflict occurs when two servers create an entry with the same UUID at the same time. This
would be shown as:
    Server A                    Server B
    Time 0: create entry X
    Time 1:                     create entry X
    Time 2: <-- incremental --
    Time 3: -- incremental -->

We can add in our entry change state for liveness here.

    Time 0: create entry X cid { time: 0, server: A }
    Time 1: create entry X cid { time: 1, server: B }
    Time 2: <-- incremental --
    Time 3: -- incremental -->

When the incremental occurs at time point 2, server A would consider these on a timeline as:

    Time 0: create entry X cid { time: 0, server: A }
    Time 1: create entry X cid { time: 1, server: B }
When viewed like this, we can see that if the second create had been performed on the same server,
it would have been rejected as a duplicate entry. With replication enabled, this means that the
latter entry will be moved to the conflict state instead.
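
A sketch of the decision, reusing the `Cid` ordering from earlier (illustrative only): both servers
compare the "at" (creation) CIDs of the two colliding entries and keep the earlier one, so they
always reach the same verdict without communicating.

```rust
/// Which copy of a UUID-colliding entry remains live.
enum EntryConflictOutcome {
    KeepLocal,
    ConflictLocal,
}

/// Sketch: the entry with the earlier creation CID stays live; the later
/// creation is moved to the conflict state. CID ordering is total (time,
/// then server uuid), so every node agrees on the outcome.
fn resolve_entry_conflict(local_at: &Cid, incoming_at: &Cid) -> EntryConflictOutcome {
    if local_at <= incoming_at {
        EntryConflictOutcome::KeepLocal
    } else {
        EntryConflictOutcome::ConflictLocal
    }
}
```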
The same process occurs with the same results when the reverse incremental operation occurs to
server B, where it receives the entry with the earlier creation from A. It will order the events and
"conflict" its local copy of the entry.
### Attribute
An attribute conflict occurs when two servers modify the same attribute of the same entry before an
incremental replication occurs.
    Server A                    Server B
    Time 0: create entry X
    Time 1: -- incremental -->
    Time 2:                     modify entry X
    Time 3: modify entry X
    Time 4: <-- incremental --
    Time 5: -- incremental -->
During an incremental operation, a modification to a live entry is allowed to apply provided the
entry's UUID and AT match the server's metadata. This gives the server assurance that the entry is
not in a conflict state, and that the node applied the change to the same entry. Were the AT values
not the same, then the entry conflict process would be applied.
We can expand the metadata of the modifications to help understand the process here for the
attribute.
    Server A                    Server B
    Time 0: create entry X
    Time 1: -- incremental -->
    Time 2:                     modify entry X attr A cid { time: 2, server: B }
    Time 3: modify entry X attr A cid { time: 3, server: A }
    Time 4: <-- incremental --
    Time 5: -- incremental -->
When the incremental is sent in time 4 from B to A, since the modification of the attribute is
earlier than the content of A, the incoming attribute state is discarded. (A future version of
Kanidm may preserve the data instead).
At time 5, when the incremental returns from A to B, the higher cid causes the value of attr A to be
replaced with the content from server A.
This allows all servers to correctly order and merge changes between nodes.
### Schema
An unlikely but possible scenario is a set of modifications that create incompatible entry states
with regard to schema. For example:
    Server A                    Server B
    Time 0: create group X
    Time 1: -- incremental -->
    Time 2: modify group X into person X
    Time 3: modify group X attr member
    Time 4: <-- incremental --
    Time 5: -- incremental -->
It is rare (if it will ever happen at all) that an entry is morphed in place from a group to a
person, from one class to a fundamentally different class. But the possibility exists, so we must
account for it.
What would occur here is that the attribute 'member' would be applied to a person, which is invalid
for the kanidm schema. In this case, the entry would be moved into a conflict state since logically
it is not valid for directory operations (even if the attribute and entry level replication
requirements for consistency have been met).
### Plugin
Finally, plugins allow enforcement of rules above schema. An example is attribute uniqueness.
Consider the following operations.
    Server A                    Server B
    Time 0: create entry X      create entry Y
    Time 1: -- incremental -->
    Time 2: <-- incremental --
    Time 3: modify entry X attr name = A
    Time 4: modify entry Y attr name = A
    Time 5: <-- incremental --
    Time 6: -- incremental -->
Here the entries are valid per the entry, attribute and schema rules. However, name is a unique
attribute and can not have duplicates. This is the most likely scenario for conflicts to occur,
since users can rename themselves at any time.

In this scenario, during the incremental replication both entry X and Y would be moved to the
conflict state. This is because the name attribute may have been updated multiple times between
incremental operations, meaning that neither server can reliably determine whether X or Y is valid
at _this_ point in time, and with respect to future replications.
## Incremental Replication
To this point, we have described "refresh" as a full clone of data between servers. This is easy to
understand, and works as you may expect. The full set of all entries and their changestates are sent
from a supplier to a consumer, replacing all database content on the consumer.
Incremental replication however requires knowledge of the state of the consumer and supplier to
determine a difference of the entries between the pair.
To achieve this, each server tracks a replication update vector (RUV) that describes the _range_ of
changes it holds, organised per server that originated those changes. For example, the RUV on server
B may contain:

    |-----|----------|----------|
    |     | s_uuid A | s_uuid B |
    |-----|----------|----------|
    | min | T4       | T6       |
    |-----|----------|----------|
    | max | T8       | T16      |
    |-----|----------|----------|
This shows that server B contains the set of data ranging _from_ server A at time 4 and server B at
time 6 to the latest values of server A at time 8 and server B at time 16.
During incremental replication the consumer sends its RUV to the supplier. The supplier calculates
the _difference_ between the consumer RUV and the supplier RUV. For example:

    Server A RUV                      Server B RUV
    |-----|----------|----------|     |-----|----------|----------|
    |     | s_uuid A | s_uuid B |     |     | s_uuid A | s_uuid B |
    |-----|----------|----------|     |-----|----------|----------|
    | min | T4       | T6       |     | min | T4       | T6       |
    |-----|----------|----------|     |-----|----------|----------|
    | max | T10      | T16      |     | max | T8       | T20      |
    |-----|----------|----------|     |-----|----------|----------|
If A was the supplier and B the consumer, when comparing these RUVs server A would determine that B
requires the changes `A {T9, T10}`. Since B is ahead of A with respect to the server B changes,
server A would not supply these ranges. In the reverse, B would supply `B {T17 -> T20}`.
If there were multiple servers, this allows replicas to _proxy_ changes.
    Server A RUV                                 Server B RUV
    |-----|----------|----------|----------|     |-----|----------|----------|----------|
    |     | s_uuid A | s_uuid B | s_uuid C |     |     | s_uuid A | s_uuid B | s_uuid C |
    |-----|----------|----------|----------|     |-----|----------|----------|----------|
    | min | T4       | T6       | T5       |     | min | T4       | T6       | T4       |
    |-----|----------|----------|----------|     |-----|----------|----------|----------|
    | max | T10      | T16      | T13      |     | max | T8       | T20      | T8       |
    |-----|----------|----------|----------|     |-----|----------|----------|----------|
In this example, if A were supplying to B, then A would supply `A {T9, T10}` and `C {T9 -> T13}`.
This allows the replication to avoid full connection (where every node must contact every other
node).
In order to select the entries for supply, the database maintains an index of the entries that were
affected by each cid. This allows efficient range requests to select which entries were affected
within any cid range.
After an incremental replication is applied, the RUV is updated to reflect the application of these
differences.
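
As an illustrative sketch (the names are assumptions, not the Kanidm types), a RUV can be modelled
as a map from originating server uuid to the min/max change-time range held, and the supplier side
difference calculation is then a per-server comparison of the consumer's max against the supplier's
max:

```rust
use std::collections::BTreeMap;
use std::time::Duration;
use uuid::Uuid;

/// Sketch of a replication update vector: for each originating server, the
/// range (min, max) of change times that this node holds.
#[derive(Debug, Clone)]
struct Ruv {
    ranges: BTreeMap<Uuid, (Duration, Duration)>,
}

impl Ruv {
    /// Supplier side diff: for each originating server, work out the range of
    /// changes the consumer is missing. Returns s_uuid -> (exclusive from,
    /// inclusive to).
    fn changes_to_supply(&self, consumer: &Ruv) -> BTreeMap<Uuid, (Duration, Duration)> {
        let mut to_send = BTreeMap::new();
        for (s_uuid, (_sup_min, sup_max)) in &self.ranges {
            match consumer.ranges.get(s_uuid) {
                // Consumer is behind for this server: send everything newer
                // than the consumer's max, up to the supplier's max.
                Some((_con_min, con_max)) if con_max < sup_max => {
                    to_send.insert(*s_uuid, (*con_max, *sup_max));
                }
                // Consumer is equal or ahead for this server: send nothing.
                Some(_) => {}
                // Consumer has never seen this server: send the full range.
                None => {
                    to_send.insert(*s_uuid, (Duration::ZERO, *sup_max));
                }
            }
        }
        to_send
    }
}
```

With the two-server example above, server A as supplier sees that B's max for s_uuid A is T8 while
its own is T10, so it supplies the changes after T8 up to T10, and supplies nothing for s_uuid B
where the consumer is ahead.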
## Lagging / Advanced Consumers
Replication relies on each node periodically communicating for incremental updates. This is because
of _deletes_. A delete event occurs by a Live entry becoming a Tombstone. A tombstone is replicated
over the live entry. Tombstones are then _reaped_ by each node individually once the replication
delay window has passed.
This delay window is there to allow every node the chance to have the tombstone replicated to it, so
that all nodes will delete the tombstone at a similar time.
Once the delay window passes, the RUV is _trimmed_. This moves the RUV minimum.
We now need to consider the reason for this trimming process. Let's use these RUVs:

    Server A RUV                      Server B RUV
    |-----|----------|----------|     |-----|----------|----------|
    |     | s_uuid A | s_uuid B |     |     | s_uuid A | s_uuid B |
    |-----|----------|----------|     |-----|----------|----------|
    | min | T10      | T6       |     | min | T4       | T9       |
    |-----|----------|----------|     |-----|----------|----------|
    | max | T15      | T16      |     | max | T8       | T20      |
    |-----|----------|----------|     |-----|----------|----------|
The RUV for A on A does not overlap the range of the RUV for A on B (A min 10, B max 8).
This means that a tombstone _could_ have been created at T9 _and then_ reaped. This would mean that
B would not have perceived that delete and then the entry would become a zombie - back from the
dead, risen again, escaping the grave, breaking the tombstone. This could have security
consequences, especially if the entry was a group providing access or a user who needed to be
deleted.
To prevent this, we denote server B as _lagging_ since it is too old. We denote A as _advanced_
since it has newer data that can not be applied to B.
This will "freeze" B, where data will not be supplied to B, nor will data from B be accepted by
other nodes. This is to prevent the risk of data corruption / zombies.
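
A minimal sketch of that check, reusing the hypothetical `Ruv` type above: a consumer is treated as
lagging when, for any originating server, its max no longer overlaps the supplier's trimmed min,
since a tombstone could have been created and reaped inside that gap.

```rust
impl Ruv {
    /// Sketch: true if the consumer must be frozen. For any server the
    /// supplier has already trimmed past, a consumer whose max is older than
    /// the supplier's min may have missed a tombstone that was since reaped,
    /// so incremental replication to it is no longer safe.
    fn consumer_is_lagging(&self, consumer: &Ruv) -> bool {
        self.ranges.iter().any(|(s_uuid, (sup_min, _sup_max))| {
            match consumer.ranges.get(s_uuid) {
                Some((_con_min, con_max)) => con_max < sup_min,
                // Never seen this server at all: a refresh case rather than
                // a lagging case, handled separately.
                None => false,
            }
        })
    }
}
```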
There is some harm in extending the RUV trim / tombstone reaping window. The window could be
expanded even to values as long as years, however this increases the risk of conflicting changes,
where nodes that are segregated for extended periods have been accepting changes that may conflict
with the other side of the topology.


@ -0,0 +1,307 @@
# Replication Coordinator Design
Many other IDM systems configure replication on each node of the topology. This means that the
administrator is responsible for ensuring all nodes are connected properly, and that agreements are
bidirectional. This also requires manual work from administrators to configure each node
individually, as well as to monitor each node individually. This adds a significant barrier to
"stateless" configurations.
In Kanidm we want to avoid this - we want replication to be coordinated to make deployment of
replicas as easy as possible for new sites.
## Kanidm Replication Coordinator
The intent of the replication coordinator (KRC) is to allow nodes to subscribe to the KRC which
configures the state of replication across the topology.
```
1. Out of band - ┌────────────────┐
issue KRC ca + ────────────────┤ │
Client JWT. │ │
│ ┌──────────────▶│ │──────────────────────┐
│ │2. HTTPS │ Kanidm │ │
│ JWT in Bearer │ Replication │ 5. Issue repl config
│ Request repl config │ Coordinator │ with partner public
│ Send self signed ID │ │ key
│ │ cert │ │ │
│ │ ┌─────────│ │◀────────┐ │
│ │ │ │ │ 4. HTTPS │
│ │ │ └────────────────┘ JWT in Bearer │
│ │ 3. Issue Request repl config │
│ │ repl config Send self signed ID │
│ │ │ cert │
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │
▼ │ ▼ │ ▼
┌────────────────┐ ┌─┴──────────────┐
│ │ │ │
│ │ │ │
│ │ 5. mTLS with self │ │
│ │──────────signed cert──────────▶│ │
│ Kanidm Server │ Perform replication │ Kanidm Server │
│ (node) │ │ (node) │
│ │ │ │
│ │ │ │
│ │ │ │
│ │ │ │
└────────────────┘ └────────────────┘
```
The KRC issues configuration tokens. These are JWTs signed by the KRC.
A configuration token is _not_ unique to a node. It can be copied between many nodes. This allows
stateless deployments where nodes can be spun up and provided their replication config.
The node is provided with the KRC TLS CA, and a configuration token.
When configured, the node contacts the KRC with its configuration token as bearer authentication. The
KRC uses this to determine and issue a replication configuration. Because the configuration token is
signed by the KRC, a fraudulent configuration token can _not_ be used by an attacker to fraudulently
subscribe a kanidm node. Because the KRC is contacted over TLS this gives the node strong assurances
of the legitimacy of the KRC due to TLS certificate validation and pinning.
The KRC must be able to revoke replication configuration tokens in case of a token disclosure.
The node sends its KRC token, server UUID, and server repl public key to the KRC.
The configuration token defines the replication group identifier of that node. The KRC uses the
configuration token _and_ the server's UUID to assign replication metadata to the node. The KRC
issues a replication configuration to the node.
The replication configuration defines the nodes that the server should connect to, as well as
providing the public keys that are required for that node to perform replication. These are
elaborated on in node configuration.
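
For illustration only, the data exchanged in this flow might be shaped like the following sketch.
Every name here is an assumption made to ground the description; it is not the actual Kanidm wire
format or token schema.

```rust
use uuid::Uuid;

/// Sketch of the claims in a KRC-signed configuration token (JWT). The token
/// is not unique to a node, so it carries group membership rather than node
/// identity.
struct KrcConfigurationClaims {
    /// The replication group this token enrols a node into.
    replication_group: String,
    /// Token identifier, so the KRC can revoke a disclosed token.
    token_id: Uuid,
}

/// Sketch of what a node submits when it contacts the KRC.
struct NodeCheckin {
    /// The KRC-signed configuration token, sent as bearer authentication.
    configuration_token: String,
    /// This node's server uuid.
    server_uuid: Uuid,
    /// This node's replication public key / self-signed identity certificate (PEM).
    repl_public_key: String,
}

/// Sketch of the replication configuration the KRC issues in response.
struct IssuedReplConfig {
    /// The partner nodes this server should connect to.
    partners: Vec<ReplPartner>,
}

struct ReplPartner {
    /// The direct replication url of the partner node.
    url: String,
    /// The partner's public key / pinned certificate (PEM).
    public_cert: String,
}
```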
## Kanidm Node Configuration
There are some limited cases where an administrator may wish to _manually_ define replication
configuration for their deployments. In these cases the admin can manually configure replication
parameters in the Kanidm configuration.
A kanidm node for replication requires either:
- The URL to the KRC
- The KRC CA cert
- KRC issued configuration JWT
OR
- A replication configuration map
A replication configuration map contains a set of agreements and their direction of operation.
All replicas require:
- The direct url that other nodes can reach them on (this is NOT the origin of the server!)
### Pull mode
This is the standard and preferred mode. The map contains, for each node to pull from:

- The url of the node's replication endpoint.
- The self-signed node certificate to be pinned for the connection.
- Whether an automatic refresh should be carried out if a "refresh required" message is received.
### Push mode
This mode is only available in manual configurations, and should only be used as a last resort. The
map contains, for each node to push to:

- The url of the node's replication endpoint.
- The self-signed node certificate to be pinned for the connection.
- Whether the node should be force-refreshed on the next cycle if a "refresh required" message would
  be sent.
## Worked examples
### Manual configuration
There are two nodes, A and B.
The administrator configures the kanidm server with replication urls
```
[replication]
node_url = https://private.name.of.node
```
The administrator extracts their replication certificates with the kanidmd binary admin features.
This will reflect the `node_url` in the certificate.
    kanidmd replication get-certificate
For each node, a replication configuration is created in JSON. For A pulling from B:
```
[
  { "pull":
    {
      "url": "https://node-b.private-name",
      "publiccert": "pem certificate from B",
      "automatic_refresh": false
    }
  },
  { "allow-pull":
    {
      "clientcert": "pem certificate from B"
    }
  }
]
```
For B pulling from A:
```
[
  { "pull":
    {
      "url": "https://node-a.private-name",
      "publiccert": "pem certificate from A",
      "automatic_refresh": true
    }
  },
  { "allow-pull":
    {
      "clientcert": "pem certificate from A"
    }
  }
]
```
Notice that automatic refresh only goes from A -> B and not the other way around. This allows one
server to be "authoritative".
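
As a sketch of how such a configuration map could be represented in code (type and field names are
assumptions, using serde-style derives to mirror the JSON above), each entry in the list is one
agreement tagged by its mode:

```rust
use serde::Deserialize;

/// Sketch: one entry in a node's replication configuration map. The outer
/// tag ("pull", "allow-pull", "push") selects the agreement mode.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "kebab-case")]
enum ReplicationAgreement {
    /// This node pulls changes from a remote node.
    Pull {
        url: String,
        /// The remote node's self-signed certificate, pinned for the connection.
        publiccert: String,
        /// Whether a "refresh required" response may trigger an automatic refresh.
        automatic_refresh: bool,
    },
    /// This node allows the remote node holding this certificate to pull from it.
    AllowPull { clientcert: String },
    /// Manual configurations only: this node pushes changes to a remote node.
    Push {
        url: String,
        publiccert: String,
        /// Whether the remote should be force-refreshed on the next cycle if required.
        force_refresh: bool,
    },
}

/// The full replication configuration map is a list of agreements.
type ReplicationConfigMap = Vec<ReplicationAgreement>;
```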
### KRC Configuration
> Still not fully sure about the KRC config yet. More thinking needed!
The KRC is configured with its URL and certificates.
```
[krc_config]
origin = https://krc.example.com
tls_chain = /path/to/tls/chain
tls_key = /path/to/tls/key
```
The KRC is also configured with replication groups.
```
[origin_nodes]
# This group never auto refreshes - they are authoritative.
mesh = full
[replicas_syd]
# Every node has two links inside of this group.
mesh = 2
# at least 2 nodes in this group link externally.
linkcount = 2
linkto = [ "origin_nodes" ]
[replicas_bne]
# Every node has one link inside of this group.
mesh = 1
# at least 1 node in this group link externally.
linkcount = 1
linkto = [ "origin_nodes" ]
```
This would yield the following arrangement.
```
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
origin_nodes │
┌────────┐ ┌────────┐ │
│ │ │ │ │
│ O1 │◀───────▶│ O2 │ │
│ │ │ │ │
└────────┘◀───┬───▶└────────┘ │
│ ▲ │ ▲
│ │ │ │
│ │ │ │
▼ │ ▼ │
│ ┌────────┐◀───┴───▶┌────────┐
│ │ │ │ │
│ │ O3 │◀───────▶│ O4 │◀─────────────────────────────┐
│ │ │ │ │ │
│ └────────┘ └────────┘ │
▲ ▲ │ │
└ ─ ─ ─ ─│─ ─ ─ ─ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ │
│ │ │
│ │ │
│ │ │
┌──┘ │ │
│ │ │
│ │ │
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ │ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─
replicas_bne │ │ │ replicas_syd │ │
│ │ │ │ │
┌────────┐ ┌────────┐ │ │ ┌────────┐ ┌────────┐ │
│ │ │ │ │ │ │ │ │ │ │
│ B1 │◀───────▶│ B2 │ │ └──────────│ S1 │◀───────▶│ S2 │ │
│ │ │ │ │ │ │ │ │ │
└────────┘ └────────┘ │ └────────┘ └────────┘ │
│ ▲ │ ▲ ▲
│ │ │ │ │
│ │ │ │ │
▼ │ ▼ ▼ │
│ ┌────────┐ ┌────────┐ │ ┌────────┐ ┌────────┐
│ │ │ │ │ │ │ │ │ │
│ │ B3 │◀───────▶│ B4 │ │ │ S3 │◀───────▶│ S4 │
│ │ │ │ │ │ │ │ │ │
│ └────────┘ └────────┘ │ └────────┘ └────────┘
│ │
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
```
!!! TBD - How to remove / decommission nodes?
I think origin nodes are persistent and must be manually defined. Will this require configuration of
their server uuid in the config?
Auto-node groups need to check in with periodic elements, and missed checkins.
Checkins need to send ruv? This will allow the KRC to detect nodes that are stale.
If a node misses checkins after a certain period they should be removed from the KRC knowledge?
R/O nodes could be removed after x days of failed checkins, without much consequence.

For R/W nodes, on the other hand, it's a bit trickier to know if they should be automatically
removed.
Or is delete of nodes a manual cleanup / triggers clean-ruv?
Should replication maps have "priorities" to make it a tree so that if nodes are offline then it can
auto-re-route? Should they have multiple paths? Want to avoid excess links/loops/disconnections of
nodes.
I think some more thought is needed here. Possibly a node state machine.
I think for R/O nodes, we need to define how R/W will pass through. I can see a possibility like
```
No direct line
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ of sight─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
│ ▼
┌────────────┐ ┌─────────────┐────OOB Write────▶┌─────────────┐
│ │ │ Remote Kani │ │ │
│ Client │─────Write──────▶│ Server │ │ Main │
│ │ │ │ │ │
└────────────┘ └─────────────┘◀───Replication───└─────────────┘
```
This could potentially even have some filtering rules about what's allowed to proxy writes through.
Generally though I think that RO will need a lot more thought; for now I want to focus on just
simple cases like a pair / group of four replicas. 😅
### Requirements
- Cryptographic (key) only authentication
- Node to Node specific authentication
- Scheduling of replication
- Multiple nodes
- Direction of traffic?
- Use of self-signed issued certs for nodes.
- Nodes must reject if incoming clients have the same certs.