* added `thread_count` configuration for the server
* added `thread_count` to orca
---------
Co-authored-by: Sebastiano Tocci <sebastiano.tocci@proton.me>
Thanks to @Seba-T's work with Orca, we were able to identify a number of performance issues in certain high load conditions.
This commit contains fixes for the following issues:
* Unbounded Memory Growth - due to how ARCache works, it must retain copies of keys (not values) in a special tracking set to maintain temporal consistency. The Filter Resolve Cache was using unresolved filters as keys, which caused memory explosions when refint or memberof updated a group with a large number of members: they would emit a query with hundreds of filter terms that was used once and never again, so the ARCache haunted set grew without bound. To limit this, we no longer cache large/complex queries for resolution; in future we may add other mitigations, such as keying on a sha256/hmac of the queries. (See the admission sketch after this list.)
* When creating a new account, dyngroups would be engaged to add the account as a member due to the matching scope. However, the change to the dyngroup was triggering an update of the memberof attributes of all of the dyngroup's *members*. This meant that adding one account caused every other account to be loaded and updated.
* Memberof would iterate over leaf entries and update them one at a time. This meant a large number of small fragmented queries when many leaf entries were being updated. Now leaf entries are updated in a single stripe once groups are stabilised.
* Memberof would always trigger its members to update. Instead, we now only update members where a difference is observed, or all members if the group's own memberof has changed, since that must propagate to all leaf entries. This significantly reduces the number of writes and the work needed to examine the changed memberof set.
* Referential integrity would examine all reference uuids on entries for validity, rather than just the reference uuids that were altered within the transaction. With this change, only uuids that were *added* are validated during an operation.
* Async write-backs (delayed actions) were performed one at a time. When possible, these are now batched into a single transaction: the write transaction caches all writes in memory until commit, so batching reduces overall latency. (See the batching sketch after this list.)
* In the server there can only be one write transaction but many readers. These are guarded by tokio semaphores that act as fair queues - first in gets the lock next. Due to the design of the server, readers would block on the *database* semaphore, while writers would block on the write semaphore and THEN the database semaphore. This arrangement unfairly advantaged readers over writers: a write first had to become the head of its queue, and then compete with all readers for a db transaction. Instead, we now have a reader semaphore sized at threads minus 1, clamped at a minimum of 1. This means that provided there are two or more threads, a writer will *always* have a database handle available, and readers will pre-queue with each other before queueing on the db ticket. If there is only one thread, writes and reads alternate between each other fairly. (See the semaphore sketch after this list.)
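As a companion to the first item, here is a minimal sketch of the cache-admission idea: only small filters are admitted as cache keys, so one-shot giant queries never enter the tracking set. All names (the filter types, `FILTER_CACHE_MAX_TERMS`, the threshold value) are hypothetical stand-ins, not Kanidm's actual API.

```rust
use std::collections::HashMap;

/// Hypothetical stand-ins for the unresolved/resolved filter types.
#[derive(Hash, PartialEq, Eq)]
struct FilterUnresolved {
    terms: Vec<String>,
}
struct FilterResolved;

/// Assumed threshold: filters with more terms than this are resolved
/// but never cached.
const FILTER_CACHE_MAX_TERMS: usize = 8;

/// Only admit small filters to the resolve cache. A refint/memberof
/// update emitting hundreds of terms is used once and never again, so
/// caching it only grows the key-tracking (haunted) set without
/// improving hit rates.
fn maybe_cache(
    cache: &mut HashMap<FilterUnresolved, FilterResolved>,
    key: FilterUnresolved,
    value: FilterResolved,
) {
    if key.terms.len() <= FILTER_CACHE_MAX_TERMS {
        cache.insert(key, value);
    }
}
```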
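And a sketch of the delayed-action batching idea, again with hypothetical types: the point is one commit for many actions rather than one commit each.

```rust
use std::sync::mpsc::Receiver;

enum DelayedAction {} // stand-in for the real delayed action type

struct WriteTxn;
impl WriteTxn {
    fn apply(&mut self, _action: DelayedAction) {
        // Writes are cached in memory until commit.
    }
    fn commit(self) {
        // The expensive step - paid once per batch instead of per action.
    }
}

/// Drain all currently queued delayed actions into a single write
/// transaction, reducing overall latency.
fn flush_delayed(rx: &Receiver<DelayedAction>, mut txn: WriteTxn) {
    while let Ok(action) = rx.try_recv() {
        txn.apply(action);
    }
    txn.commit();
}
```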
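Finally, a minimal sketch of the semaphore arrangement from the last item, using tokio semaphores. The struct and methods are illustrative, not the server's actual types.

```rust
use tokio::sync::Semaphore;

/// Readers pre-queue on their own semaphore sized `threads - 1`
/// (clamped to at least 1), so with two or more threads a writer
/// always finds a database ticket free.
struct QueueServer {
    write_lock: Semaphore, // single writer at a time
    read_lock: Semaphore,  // readers pre-queue here
    db_tickets: Semaphore, // actual database handles
}

impl QueueServer {
    fn new(threads: usize) -> Self {
        let readers = threads.saturating_sub(1).max(1);
        Self {
            write_lock: Semaphore::new(1),
            read_lock: Semaphore::new(readers),
            db_tickets: Semaphore::new(threads),
        }
    }

    async fn begin_read(&self) {
        // Readers compete with each other first, then take a db ticket.
        let _r = self.read_lock.acquire().await.unwrap();
        let _db = self.db_tickets.acquire().await.unwrap();
        // ... perform the read; permits drop at end of scope ...
    }

    async fn begin_write(&self) {
        // With readers capped at threads - 1, one db ticket is always
        // reachable once the writer holds the write lock.
        let _w = self.write_lock.acquire().await.unwrap();
        let _db = self.db_tickets.acquire().await.unwrap();
        // ... perform the write; permits drop at end of scope ...
    }
}
```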
While basking under the shade of the coolabah tree, I was overcome by an intense desire to improve the performance and memory usage of Kanidm.
This PR reduces a major source of repeated small clones, lowers the default log level in testing, removes some trace fields that are both large and probably shouldn't be traced, and changes some lto settings for release builds.
* kanidm cli logs on debug level - Fixes #2745
* such clippy like wow
* It's important for a wordsmith to know when to get its fixes in.
* updootin' wasms
This completely reworks how we approach and handle cryptographic keys in Kanidm. This is needed as a foundation for replication coordination which will require handling and rotation of cryptographic keys in automated ways.
This change influences many other parts of the code base in its implementation.
The primary influences are:
* Modification of how domain user signing keys are revoked or rotated.
* Merging of all existing service-account token keys, as retired (retained) keys, into the domain to simplify token signing and validation
* Allowing multiple configurations of local command line tools to swap between instances using disparate signing keys.
* Modification of key retrieval to be key id based (KID), removing the need to embed the JWK into tokens
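As an illustration of the last point, a minimal sketch of KID-based retrieval (hypothetical types, not Kanidm's key provider API):

```rust
use std::collections::HashMap;

struct VerifyingKey; // stand-in for a real public key type

/// Tokens now carry only a key id (kid) in their header; verification
/// looks the key up by id instead of trusting an embedded JWK.
struct KeyStore {
    keys: HashMap<String, VerifyingKey>,
}

impl KeyStore {
    fn resolve(&self, kid: &str) -> Option<&VerifyingKey> {
        // An unknown or retired kid simply fails resolution, so the
        // token is rejected rather than verified against embedded
        // material supplied by the token itself.
        self.keys.get(kid)
    }
}
```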
A side effect of this change is that most user authentication sessions and oauth2 sessions will have to be re-established after upgrade. However, we feel that session renewal is an expected side effect of an upgrade.
This lays the groundwork to remove, in future, a large number of legacy key handling processes that have evolved, which will allow large parts of the code to be removed.
Improve the support in the resolver for MFA options with pam. This also enables async task spawning and cancellation via the resolver backend. (A sketch of the spawn-and-cancel pattern follows.)
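A minimal sketch of that pattern using tokio; the types and flow are illustrative assumptions, not the resolver's actual interface:

```rust
use tokio::task::JoinHandle;

/// An in-flight MFA prompt running as its own task; the handle lets
/// the backend cancel it if the pam conversation is torn down.
struct PendingPrompt {
    handle: JoinHandle<()>,
}

impl PendingPrompt {
    fn spawn() -> Self {
        let handle = tokio::spawn(async {
            // ... await user MFA input here ...
        });
        PendingPrompt { handle }
    }

    fn cancel(&self) {
        // Abort the in-flight task; its pending await is dropped.
        self.handle.abort();
    }
}
```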
Co-authored-by: David Mulder <dmulder@samba.org>
Fixes #2601, Fixes #393 - gid numbers can be part of the systemd nspawn range.
Previously we allocated gid numbers based on the fact that uid_t is a u32, so we allowed 65536 through u32::max. However, there are two major issues with this that I didn't realise. The first is that anything greater than i32::max (i.e. 2147483648 and above) can confuse the linux kernel.
The second is that systemd allocates 524288 through 1879048191 to itself for nspawn.
This leaves us with only a few usable ranges:
* 1000 through 60000
* 60578 through 61183
* 65520 through 65533
* 65536 through 524287
* 1879048192 through 2147483647
The last range, being the largest, is the natural and obvious area we should allocate from. It also happens to fall nicely into the pattern 0x7000_0000 through 0x7fff_ffff, which allows us to take the last 24 bits of the uuid and, by applying a bit mask, ensure that we end up in this range. (A sketch of this derivation follows.)
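Roughly, the derivation looks like this; it is illustrative only, not the exact Kanidm implementation:

```rust
/// Derive a gid in the 0x7000_0000..=0x7fff_ffff pattern from the low
/// 24 bits of a uuid. Illustrative sketch only.
fn gid_from_uuid(uuid_bytes: &[u8; 16]) -> u32 {
    // Low 24 bits of the uuid.
    let low = u32::from_be_bytes([0, uuid_bytes[13], uuid_bytes[14], uuid_bytes[15]]);
    // OR in the high bits: the result always stays below i32::max and
    // outside the nspawn range.
    0x7000_0000 | low
}
```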
There are now two major issues.
We have now changed our validation code to enforce a tighter range, but we may have already allocated users into these ranges.
External systems like FreeIPA allocated uid/gid numbers with reckless abandon directly into these ranges.
As a result we need to make two concessions.
We *secretly* still allow manual allocation of ids from 65536 through to 1879048191, which is the nspawn container range. This happens to be the range that FreeIPA allocates into. We will never generate an ID in this range, but we allow it to ease imports, since users of these ranges have already shown they 'don't care' about it. This also helps SCIM imports for longer term migrations.
The second concession is for ids that fall outside the valid ranges. In the extremely unlikely event this has occurred, a startup migration has been added that regenerates the id values of affected entries to prevent upgrade issues.
An accidental effect of this is freeing up the range 524288 to 1879048191 for other subuid uses.
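Putting the ranges and the concessions together, a hedged sketch of what validation now accepts (boundaries follow the prose above; the function is illustrative, not the real validator):

```rust
/// Accepts the documented usable ranges plus the import-only nspawn /
/// FreeIPA span. Generation itself only ever uses the 0x7000_0000 range.
fn gid_is_valid(gid: u32) -> bool {
    matches!(
        gid,
        1000..=60000
            | 60578..=61183
            | 65520..=65533
            // import-only concession: 65536..=1879048191 (nspawn/FreeIPA)
            | 65536..=1_879_048_191
            // the range new ids are generated into
            | 0x7000_0000..=0x7fff_ffff
    )
}
```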
* Refactor: move the object graph ui to admin web ui
* Add dynamic js loading support
Load viz.js dynamically
* Add some js docs
* chore: cleanup imports
* chore: remove unused clipboard feature
chore: remove unused mermaid.sh
* Messing with the profile.release settings and reverting the changes I tried has now made the build much smaller yay :D
* Refactor: user raw search requests
Assert service-accounts properly
* refactor: new v1 proto structure
* Add self to CONTRIBUTORS.md
---------
Co-authored-by: James Hodgkinson <james@terminaloutcomes.com>
Fixes two major issues with replication.
The first was related to server refreshes. When a server was refreshed it would retain its server unique id. If the server had lagged and been disconnected from replication, an administrator would naturally then refresh its database. This meant that on the server's next tombstone purge, its RUV would jump ahead, causing its refresh-supplier to believe it was lagging (which was not the case).
In the situation where a server is refreshed, we now reset the server's unique replication ID, which avoids the RUV having "jumps".
The second issue was related to RUV trimming. A server which had older RUV entries (say, from servers that had been trimmed) would "taint" and re-supply those server IDs back to nodes that wanted to trim them. This also meant that on a restart, a node which had correctly trimmed a server ID would see it re-added in memory.
This improves RUV trimming by limiting what we compare and check as a supplier to only CIDs that are within the valid changelog window. This itself presented a challenge: how do we determine whether a server should be removed from the RUV? To achieve this we now check for "overlap" of the RUVs. If overlap isn't occurring, it indicates split brain or node isolation, and replication is stopped in these cases. (A sketch of the overlap test follows.)
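A minimal sketch of that overlap test; the types are illustrative stand-ins for the RUV structures:

```rust
/// Stand-in for a change id (in reality a timestamp + server id pair).
#[derive(PartialEq, PartialOrd)]
struct Cid(u64);

/// The window of CIDs a node's RUV covers.
struct RuvWindow {
    min: Cid,
    max: Cid,
}

/// If the supplier's and consumer's windows share no overlap, the
/// nodes have diverged (split brain or isolation) and replication must
/// stop rather than guess.
fn ruv_overlaps(supplier: &RuvWindow, consumer: &RuvWindow) -> bool {
    // Standard interval-overlap test.
    supplier.min <= consumer.max && consumer.min <= supplier.max
}
```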
* otel can eprintln kthx
* started python integration tests, features
* more tests more things
* adding heaps more things
* updating docs
* fixing python test
* fixing errors, updating integration test
* Add models for OAuth2, Person, ServiceAccount and add missing endpoints
* Alias Group to GroupInfo to keep it backwards compatible
* Fixed issues from review
* adding oauth2rs_get_basic_secret
* Fixed mypy issues
* adding more error logs
* updating test scripts and configs
* fixing tests and validating things
* more errors
---------
Co-authored-by: Dogeek <simon.bordeyne@gmail.com>