Replication Design and Notes ---------------------------- Replication is a critical feature to an IDM system, especially when deployed at major sites and businesses. It allows for horizontal scaling of system read and write capacity, improves fault tolerance (hardware fault, power fault, network fault, geographic disaster), can improve client latency (positioning replicas near clients) and more. Replication Background ====================== Replication is a directed graph model, where each node (server) and directed edge (replication agreement) form a graph (topology). As the topology and direction can be seen, nodes of the graph can be classified based on their data transit properties. NOTE: Historically many replication systems used the terms "master" and "slave". This has a number of negative cultural connotations, and is not used by this project. * Read-Write server This is a server that is fully writable. It accepts external client writes, and these writes are propogated to the topology. Many read-write servers can be in a topology and written to in parallel. * Transport Hub This is a server that is not writeable to clients, but can accept incoming replicated writes, and then propogates these to other servers. All servers that are directly after this server inthe topology must not be a read-write, as writes may not propogate back from the transport hub. IE the following is invalid :: RW 1 ---> HUB <--- RW 2 Note the replication direction in this, and that changes into HUB will not propogate back to RW 1 or RW 2. * Read-Only server Also called a read-only replica, or in AD an RODC. This is a server that only accepts incoming replicated changes, and has no outbound replication agreements. Replication systems are dictated by CAP theorem. This is a theory that states from "consistency, availability and paritition tolerance" you may only have two of the three at any time. * Consistency This is the property that a write to a server is guaranteed to be consistent and acknowledged to all servers in the replication topology. A change happens on all nodes or it does not happen at all, and clients contacting any server will always see the latest data. * Availability This is the property that every request will recieve a non-error response without the guarantee that the data is "up to date". * Partition Tolerance This is the property that your topology in the face of patition tolerance will continue to provide functional services (generally reads). Almost all systems expect partition tolerance, so the choice becomes between consistency and availability. Kanidm has chosen availability, as the needs of IDM dictate that we always function even in the face of partition tolerance, and when other failures occur. This AP selection is often called "eventually consistent". Conflict Resolution =================== As consistency is not a property that we can uphold in an AP system, we must handle the possibility of inconsistent data and the correct methods to handle it to bring a system into a consistent state. There are two levels that inconsistency can occur at in a system like Kanidm. * Object Level * Attribute Level Object Level inconsistency occurs when two read-write servers who are partitioned, both allocate the same entry UUID to an entry. Since the uuid is the "primary key" which anchors all other changes, and can not be duplicated, when the paritioning is resolved, the replication will occur, and one of the two items must be discarded as inconsistent. Attribute Level inconsistency occurs within a single entry, where the same attribute is altered by two servers who are partition. When replicated, this attribute's state must be resolved in a manner consistent to schema of the system. An additional complexity is that both servers must be able to resolve this conflict in isolation, without further communication. All servers must arrive at the same result, necesitating a set of conflict management rules that must be the same to all members of the replication topology. As a result, almost all facets of replication are designed around the possible handling of these conflict cases, and tracking of changes as they occur in a manner that allows forward and reverse application of changes. CID === A CID is a change identifier. This is a compound type consisting of the domain uuid, the server uuid, and a 64 bit timestamp. This allows selecting of changes by domain origin, a unique server, a timerange, or a selection of these properties. This may also be called a change serial number (CSN) in other application like 389 Directory Server This is so that any allocated CID is guaranteed to be unique as the timestamp generator within a server is a lamport clock that always advances with each new write transaction. Change ====== A single change is a CID associated to the UUID of entries that have had their state altered in an operation. These state records are atomic units that describe a set of changes that must occur together to be considered valid. There may be multiple state assertions in a change. The possible entry states are: * NonExistant * Live * Recycled * Tombstone The possible state transitions are: * Create * Modify * Recycle * Revive * Tombstoned * Purge (*not changelogged!*) A conceptual pseudocode change could be: :: CID: { d_uuid, s_uuid, ts } transitions: [ create: uuid, entry_state { ... } modify: uuid, modifylist recycle: uuid tombstoned: uuid ] The valid transitions are representing in a NFA, where any un-listed transition is considered invalid and must be discarded. Transitions are consider 'in-order' within a CID. :: create + NonExistant -> Live modify + Live -> Live recycle + Live -> Recycled revive + Recycled -> Live tombstoned + Recycled -> Tombstone purge + Tombstone -> NonExistant .. image:: diagrams/object-lifecycle-states.png :width: 800 Within a single CID, in a single server, it's consider that every transition applies, or none do. Entry Change Log ================ Within Kanidm id2entry is the primary store of active entry state representation. However the content of id2entry is a reflection of the series of modifications and changes that have applied to create that entitiy. As a result id2entry can be considered as an entry state cache. The true stable storage and representation for an entry will exist in a seperate Entry Change Log type. Each entry will have it's own internal changelog that represents the changes that have occured in the entries lifetime and it's relevant state at that time. The reason for making a per-entry change log is to allow fine grained testing of the conflict resolution state machine on a per-entry scale, and then to be able to test the higher level object behaviour above that. This will allow us to model and test all possible states. Changelog Index =============== The changelog stores a series of changes associated by their CID, allowing querying of changes based on CID properties. The changelog stores changes from multiple server uuid's or domain uuid's, acting as a single linear history of effects on the data of this system. If we assume we have a single read-write server, there is no possibility of conflict and the changelog becomes a perfect history of transitions within the database content. We can visualise the changelog index as a series of CID's with references to the associated entries that need to be considered. This is where we start to consider the true implementation structure of how we will code this within Kanidm. :: ┌─────────────────────────────────┐ ┌─────────────────────────┐ │ Changelog Index │ │ e1 - entry change log │ │┌───────────────────────────────┐│ │ ┌─────────────────────┐ │ ││CID 1 ││ │ │CID 2 │ │ │├───────────────────────────────┤│ │ │┌───────────────────┐│ │ ││ ││ │ ││create: { ││ │ ││CID 2 ││ ┌──────────┼─▶│ attrs ││ │ ││transitions { ││ │ │ ││} ││ │ ││ create: uuid - e1, ────────┼┼──┘ │ │├───────────────────┤│ │ ││ modify: uuid - e1, ────────┼┼─────────────┼─▶│ modify: attrs ││ │ ││ recycle: uuid - e2,────────┼┼──┐ │ │└───────────────────┘│ │ ││} ││ │ │ └─────────────────────┘ │ ││ ││ │ │ ... │ │├───────────────────────────────┤│ │ │ │ ││CID 3 ││ │ │ │ │├───────────────────────────────┤│ │ └─────────────────────────┘ ││CID 4 ││ │ │├───────────────────────────────┤│ │ ┌─────────────────────────┐ ││CID 5 ││ │ │ e2 - entry change log │ │├───────────────────────────────┤│ │ │ ┌─────────────────────┐ │ ││CID 6 ││ │ │ │CID 2 │ │ │├───────────────────────────────┤│ │ │ │┌───────────────────┐│ │ ││CID 7 ││ └──────────┼─▶│recycle ││ │ │├───────────────────────────────┤│ │ │└───────────────────┘│ │ ││CID 8 ││ │ └─────────────────────┘ │ │├───────────────────────────────┤│ │ ... │ ││CID 9 ││ │ │ │├───────────────────────────────┤│ └─────────────────────────┘ ││CID 10 ││ │├───────────────────────────────┤│ ││CID 11 ││ │├───────────────────────────────┤│ ││CID 12 ││ │└───────────────────────────────┘│ └─────────────────────────────────┘ This allows expression of both: * An ordered set of changes globally to be applied to the set of entries * Entries to internally maintain their set of ordered changes Entry Snapshots =============== Within an entry there may be many changes, and if we have an old change inserted, we need to be able to replay those events. For example: :: ┌─────────────────────────────────┐ ┌─────────────────────────┐ │ Changelog Index │ │ e1 - entry change log │ ┌───────────────────────────────┐ │ │ │ ┌─────────────────────┐ │ │CID 1 ─┼──┼─────────────▶ │ │ │CID 2 │ │ └───────────────────────────────┘ │┌───────────────────────────────┐│ │ │┌───────────────────┐│ │ ││ ││ │ ││create: { ││ │ ││CID 2 ││ ┌──────────┼─▶│ attrs ││ │ ││transitions { ││ │ │ ││} ││ │ ││ create: uuid - e1, ────────┼┼──┘ │ │├───────────────────┤│ │ ││ modify: uuid - e1, ────────┼┼─────────────┼─▶│ modify: attrs ││ │ ││ recycle: uuid - e2,────────┼┼──┐ │ │└───────────────────┘│ │ ││} ││ │ │ └─────────────────────┘ │ ││ ││ │ │ ... │ │├───────────────────────────────┤│ │ │ │ ││CID 3 ││ │ │ │ │├───────────────────────────────┤│ │ └─────────────────────────┘ ││CID 4 ││ │ │├───────────────────────────────┤│ │ ┌─────────────────────────┐ ││CID 5 ││ │ │ e2 - entry change log │ │├───────────────────────────────┤│ │ │ ┌─────────────────────┐ │ ││CID 6 ││ │ │ │CID 2 │ │ │├───────────────────────────────┤│ │ │ │┌───────────────────┐│ │ ││CID 7 ││ └──────────┼─▶│recycle ││ │ │├───────────────────────────────┤│ │ │└───────────────────┘│ │ ││CID 8 ││ │ └─────────────────────┘ │ │├───────────────────────────────┤│ │ ... │ ││CID 9 ││ │ │ │├───────────────────────────────┤│ └─────────────────────────┘ ││CID 10 ││ │├───────────────────────────────┤│ ││CID 11 ││ │├───────────────────────────────┤│ ││CID 12 ││ │└───────────────────────────────┘│ └─────────────────────────────────┘ Since CID 1 has been inserted previous to CID 2 we need to "undo" the changes of CID 2 in e1/e2 and then replay from CID 1 and all subsequent changes affecting the same UUID's to ensure the state is applied in order correctly. In order to improve the processing time of this operation, entry change logs need snapshots of their entry state. At the start of the entry change log is an anchor snapshot that describes the entry as the sum of previous changes. :: ┌─────────────────────────┐ │ e1 - entry change log │ │ ┌─────────────────────┐ │ │ │Anchor Snapshot │ │ │ │state: { │ │ │ │ ... │ │ │ │} │ │ │ │ │ │ │ ├─────────────────────┤ │ │ │CID 2 │ │ │ │┌───────────────────┐│ │ │ ││create: { ││ │ │ ││ attrs ││ │ │ ││} ││ │ │ │├───────────────────┤│ │ │ ││ modify: attrs ││ │ │ │└───────────────────┘│ │ │ ├─────────────────────┤ │ │ │Snapshot │ │ │ │state: { │ │ │ │ ... │ │ │ │} │ │ │ │ │ │ │ └─────────────────────┘ │ │ ... │ │ │ └─────────────────────────┘ In our example here we would find the snapshot preceeding our newely inserted CID (in this case our Anchor) and from that we would then replay all subsequent changes to ensure they apply correctly (or are rejected as conflicts). For example if our newly inserted CID was say CID 15 then we would use the second snapshot and we would not need to replay CID 2. These snapshots are a trade between space (disk/memory) and replay processing time. Snapshot frequency is not yet determined. It will require measurement and heuristic to determine an effective space/time saving. For example larger entries may want fewer snapshots due to the size of their snapshots, where smaller entries may want more snapshots to allow faster change replay. Replay Processing Details ========================= Given our CID 1 inserted prior to other CID's, we need to consider how to replay these effectively. If CID 1 changed uuid A and B, we would add these to the active replay set. These are based on the snapshots which are then replayed up to and include CID 1 (but no further). From there we now proceed through the changelog index, and only consider changes that contain A or B. Let's assume CID 3 operated on B and C. C was not considered before, and is now added to the replay set, and the same process begins to replay A, B, C to CID 3 now. This process continues such that the replay set is always expanding to the set of affected entries that require processing to ensure consistency of their changes. If a change is inconsistent or rejected, then it is rejected and marked as such in the changelog index. Remember a future replay may allow the rejected change to be applied correctly, this rejection is just metadata so we know what changes were not applied. Even if a change is rejected, we still continue to assume that the entries include in that set of changes should be consider for replay. In theory we could skip them if they were added in this change, but it's simpler and correct to continue to consider them. Changelog Comparison - Replication Update Vector (RUV) ====================================================== A changelog is a single servers knowledge of all changes that have occured in history of a topology. Of course, the point of replication is that multiple servers are exchanging their changes, and potentially that a server must proxy changes to other servers. For this to occur we need a method of comparing changelog states, and then allowing fractional transmission of the difference in the sets. To calculate this, we can use our changelog to construct a table called the replication update vector. The RUV is a single servers changelog state, categorised by the originating server of the change. A psudeo example of this is: :: |-----|--------------------|--------------------|--------------------| | | {d_uuid, s_uuid A} | {d_uuid, s_uuid B} | {d_uuid, s_uuid C} | |-----|--------------------|--------------------|--------------------| | min | T4 | T6 | T0 | |-----|--------------------|--------------------|--------------------| | max | T8 | T16 | T7 | |-----|--------------------|--------------------|--------------------| Summarised, this shows that on our server, our changelog has changes from A for time range T4 to T8, B T6 to T16, and C T0 to T7. Individiually, a RUV does not allow much, but now we can compare RUVs to another server. Lets assume a second server exists with the RUV of: :: |-----|--------------------|--------------------|--------------------| | | {d_uuid, s_uuid A} | {d_uuid, s_uuid B} | {d_uuid, s_uuid C} | |-----|--------------------|--------------------|--------------------| | min | T4 | T8 | T0 | |-----|--------------------|--------------------|--------------------| | max | T6 | T10 | T11 | |-----|--------------------|--------------------|--------------------| This shows the server has A T4 to T6, B TT8 to T10, and C T0 to T11. Let's assume that we are *sending* changes from our first server to this second server. We perform a diff of the RUV and find that for the changes of A, T7 to T8 are not present on the second server, and that changes T11 to T16 are not present. Since C has a more "advanced" state than us, we do not need to send anything (and later, this server should send changes to us!). So now we know that we must send A T7 to T8 and B T11 to T16 for this replica to be brought up to the state we hold. You may notice the "min" and "max". The purpose of this is to show how far we may rewind our changelog - we have changes from min to max. If a server provides it's ruv, and it's max is lower than our min, we must consider that server has been disconnected for "too long" and we are unable to supply changes until an administrator intervenes. As a more graphical representation, we could consider our ruv as follows: :: ┌─────────────────────┐ ┌─────────────────────────────────┐ ┌─────────────────────────┐ │RUV │ │ Changelog Index │ │ e1 - entry change log │ │┌───────────────────┐│ │┌───────────────────────────────┐│ │ ┌─────────────────────┐ │ ││{d_uuid, s_uuid}: ││ ─ ─ ─ ─ ─ ─ ─▶│CID 1 ││ │ │CID 2 │ │ ││ min: CID 2 ────┼┼────┼─┐ │├───────────────────────────────┤│ │ │┌───────────────────┐│ │ ││ max: CID 4 ────┼┼──────┤ ││ ││ │ ││create: { ││ │ │├───────────────────┤│ │ ├───────────▶│CID 2 ││ ┌──────────┼─▶│ attrs ││ │ ││{d_uuid, s_uuid}: ││ │ ││transitions { ││ │ │ ││} ││ │ ││ min: CID 1 ─ ─ ┼│─ ─ ┘ │ ││ create: uuid - e1, ────────┼┼──┘ │ │├───────────────────┤│ │ ││ max: CID 8 ─ ─ ┼│─ ─ ┐ │ ││ modify: uuid - e1, ────────┼┼─────────────┼─▶│ modify: attrs ││ │ │├───────────────────┤│ │ ││ recycle: uuid - e2,────────┼┼──┐ │ │└───────────────────┘│ │ ││{d_uuid, s_uuid}: ││ │ │ ││} ││ │ │ └─────────────────────┘ │ ││ min: CID 3 ────┼┼──┐ │ ││ ││ │ │ ... │ ││ max: CID 12────┼┼──┤ │ │ │├───────────────────────────────┤│ │ │ │ │└───────────────────┘│ ├───┼───────────▶│CID 3 ││ │ │ │ └─────────────────────┘ │ │ │ │├───────────────────────────────┤│ │ └─────────────────────────┘ │ └───────────▶│CID 4 ││ │ │ │ │├───────────────────────────────┤│ │ ┌─────────────────────────┐ │ ││CID 5 ││ │ │ e2 - entry change log │ │ │ │├───────────────────────────────┤│ │ │ ┌─────────────────────┐ │ │ ││CID 6 ││ │ │ │CID 2 │ │ │ │ │├───────────────────────────────┤│ │ │ │┌───────────────────┐│ │ │ ││CID 7 ││ └──────────┼─▶│recycle ││ │ │ │ │├───────────────────────────────┤│ │ │└───────────────────┘│ │ │ ─ ─ ─ ─ ─ ─ ─▶│CID 8 ││ │ └─────────────────────┘ │ │ │├───────────────────────────────┤│ │ ... │ │ ││CID 9 ││ │ │ │ │├───────────────────────────────┤│ └─────────────────────────┘ │ ││CID 10 ││ │ │├───────────────────────────────┤│ │ ││CID 11 ││ │ │├───────────────────────────────┤│ └───────────────▶│CID 12 ││ │└───────────────────────────────┘│ └─────────────────────────────────┘ It may be that we also add a RUV index that allows the association of exact set of CID's to a server's cl, or if during CL replay we just iterate through the CL index finding all values that are greater than the set of min CID's requested in this operation. Changelog Purging ================= In order to prevent infinite growth of the changelog, any change older than a fixed window X is trimmed from the changelog. When trimming occurs this moves the "min" CID in the RUV up to a higher point in time. This also trims the entry change log and recreates a new anchor snapshot. RUV cleaning ============ TODO: Conflict UUID Generation ======================== As multiple servers must arrive at the same UUID so that they are all in a deterministic state, the UUID of a conflicting entry should be generated in a deterministic manner. TODO: Conflict Class ============== TODO: Must origUUID, Object Level Conflict Handling =============================== With the constructs defined, we have enough in place to be able to handle various scenarioes. For the purposes of these discussions we will present two servers with a series of changes over time. Let's consider a good case, where no conflict occurs. :: Server A Server B T0: T1: Create E T2: -- repl --> T3: Modify E T4: <-- repl -- Another trivial example is the following. :: Server A Server B T0: T1: Create E1 T2: Create E2 T3: -- repl --> T4: <-- repl -- These situations are clear, and valid. However, as mentioned the fun is when we have scenarios that conflict. To resolve this, we combine the series of changes, ordered by time, and then re-apply these changes, discarding changes that would be invalid for those states. As a reminder: :: create + NonExistant -> Live modify + Live -> Live recycle + Live -> Recycled revive + Recycled -> Live tombstoned + Recycled -> Tombstone purge(*) + Tombstone -> NonExistant Lets now show a conflict case: :: Server A Server B T0: T1: Create E1 T2: Create E1 T3: -- repl --> T4: <-- repl -- Notice that both servers create E1. In order to resolve this conflict, we use the only synchronisation mechanism that we possess - time. On Server B at T3 when the changelog of Server A is recieved, the events are replayed, and linearised to: :: T0: NonExistant E1 # For illustration only T1: Create E1 (from A) T2: Create E1 (from B) As the event at T2 can not be valid, the change at T2 is *skipped* - E1 from B is turned into a conflict + recycled entry. See conflict UUID generation above. Infact, having this state machine means we can see exactly what can and can not be resolved correctly as combinations. Here is the complete list of valid combinations. :: T0: Create E1 (Live) T1: Modify E1 (Live) T0: Create E1 (Live) T1: Recycle E1 (Recycled) T0: Modify E1 (Live) T1: Modify E1 (Live) T0: Modify E1 (Live) T1: Recycle E1 (Recycled) T0: Recycle E1 (Recycled) T1: Revive E1 (Live) T0: Recycle E1 (Recycled) T1: Tombstone E1 (Tombstoned) T0: Revive E1 (Live) T1: Modify E1 (Live) T0: Revive E1 (Live) T1: Recycle E1 (Recycled) If two items in a changelog are not a pair of these valid orderings, then we discard the later operation. It's worth noting that if any state of a change conflicts, the entire change is discarded as we consider changes to be whole, atomic units of change. Attribute Level Conflict Handling ================================= TODO: