kanidm/designs/resource_limits.rst

Resource Limits
---------------

As security sensitive software, kanidm must be "available" (as defined by
confidentiality, integrity, and availability). This means that as a service we must
be able to handle a large volume of potentially malicous traffic, and still able
to serve legitimate requests without fault or failure.

To achieve this, the resources of the server must be managed and distributed to allow
potentially thousands of operations per second, while preventing exhaustion of those
resources.

Kanidm is structured as a database, where each request requires a process
to resolve that query into an answer. This could be a request for authetication
which is a true/false response, or a request for an identity so that we can
determine their groups for authorisation, or even just a request to find
someone's email address in a corporate directory context.

Each operation requires resources to complete individually, but many operations
can be processed in parallel.

Resource exhaustion occurs when input from a client consumes more resources 
than the server can provide. This means the attack surface is any possible input
and how the server interprets and processes that input. In kanidm this could be
a search filter for example - the query may be small, but it could be expensive
to compute.

CPU time
========

Every operation requires time on a CPU. A CPU logically is many threads which
can have many operations in parallel, but also bound by a frequency and their
performance as to how many instructions can be performed per second.

This means that given a say 8 core CPU, each individual core may be able to
complete 1000 operations per second, and combined this becomes 8x1000.

If CPU time is exhausted, then operations are unable to be served. This means
that we must manage the amount of time an operation is allowed to take
on the system.

Some types of queries can take more time than others. For example:

* Unindexed searches which require full table scans.
* Indexed searches that have a large number of results.
* Write operations that affect many entries.

These necesitate the following limits:

* No unindexed searches allowed
* Prevent searching on terms that do not exist (done, filter schema validation)
* Limit on the number of entries that can be checked during a partial indexed search
* Limit on the number of entries that can be loaded from a fully indexed search
* Limit on the number of entries that can be written during a write operation
* Reduce the size of allowed queries to prevent complex index access patterns

Memory (RAM) Capacity
=====================

A system has finite amounts of RAM which must be shared between all CPUs of the system.
Effective use of that RAM improves the speed of all operations, meaning that as a shared
resource it can not be monopolised by a single operation.

Each operation can be assumed to have access to RAM/n CPUs of RAM. If an operation temporarily
exceeds this, the impact is minimal, but continued use of a single CPU using a high capacity
of RAM can prevent other operations functioning. Additionally, past operations can impact
the content of RAM though cache invalidation attacks.

To prevent overuse of RAM, the following must be considered

* Cache strategies to prevent invalidation attacks (done, ARC + ahash).
* Limit the amount of entries that can be dirtied in a write operation
* Limit the number of entries that can be filter-tested in partial index queries
* Limit the amount of entries that can be loaded for a read operation
* Limit the size of cache to allow room for operation result sets (done)
* Cache closer to the application rather than using db/vfs caches (done)
* Compress resources that consume bandwidth to operate on (done, index compression)

Disk IOPS
=========

Storage has a limit on how many input/output operations it can perform per second. Each operation
must be able to access the storage system for entries and indexes that may not be cached in memory.
The best way to reduce IOPS is more RAM to allow more application level caching, but this is
not always possible. This means that we must have operations minimise their IOPS requirements,
or to not issue or allow requests that would generate large amounts of IOPS.

* Cache strategies that keep cache hit ratios high (done, ARC)
* Compress elements to reduce IOPS for transfer to RAM (done indexes, not-done entries)
* Limit the number of large items that can be loaded in an operation IE partial index scans, full table scans
* Limit the size of write operations.

Disk Capacity
=============

If the storage capacity is exceeded this can cause the system to panic or stop, which would act
as a denial of service. It may also prevent security auditing or other data to be recorded or in
the worst case, data corruption. This means that during write operations, the amount of change
must be limited.

* Limit the number of entries that can be dirtied in an operation
* Limit the maximum number of multivalue attributes on an entry to prevent the entry growing over-size
* Entry maximum size limits
* Limit the amount of churn in an entry to prevent excessive changelog growth

Stack Limits
============

As some operations are recursive, the stack depth becomes a concern because when the stack depth is exceeded, the system
can crash or panic. The major recursive structure is filters and how we process queries.

To prevent stack exhaustion:

* Limit the depth of nesting in a search query (filter).

Network
=======

To prevent a client from exhausting memory, but also network bandwidth, if an operation is too
large it should be denied

* Limit size of incoming network requests

Limits Summary
==============

This summary is the set of limits that should be applied. Note some limits prevent multiple issues,
and the limits apply in different areas.

Search Limits:
* Prevent searching on terms that do not exist (done, filter schema validation)
* Deny unindexed searches
* Limit the amount of entries that can be loaded for a read operation
* Limit the number of entries that can be filter-tested in partial index queries
* Limit on the number of entries that can be loaded from a search
* Limit the depth of nesting in a search query (filter).
* Reduce the size of allowed queries to prevent complex index access patterns

Caching:
* Cache strategies to prevent invalidation attacks (done, ARC + ahash).
* Limit the size of cache to allow room for operation result sets (done)
* Cache closer to the application rather than using db/vfs caches (done)

Network:
* Limit size of incoming network requests

Db storage:
* Compress elements to reduce IOPS for transfer to RAM (done indexes, not-done entries)

Writes:
* Limit the number of entries that can be dirtied in an operation
* Limit the maximum number of multivalue attributes on an entry to prevent the entry growing over-size
* Entry maximum size limits
* Limit the amount of churn in an entry to prevent excessive changelog growth

These limits should be applied per-account to allow some accounts to override these, for example
an application which needs to bulk update accounts, or admins who need to perform audits.

The system maintains a default set of limits. Accounts can then have resource groups associated.
The "highest" value of the resource group or default is the value that is applied. These limits
could also be claim based or related, meaning they become per session rather than per account, so
they would be stored in the user authentication token.

The session limits would be:

* allow unindexed search
* maximum number of entries in search
* maximum number of entries in partial filter test
* number of filter elements
* maximum number of modified entries
* write rate limit (writes over time)
* network request size

The system limits that can not be account overridden are:

* maximum entry size
* maximum number of multi value attributes
doc 2020-08-08 04:37:27 +02:00			`Resource Limits`
			`---------------`

Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`As security sensitive software, kanidm must be "available" (as defined by`
			`confidentiality, integrity, and availability). This means that as a service we must`
doc 2020-08-08 04:37:27 +02:00			`be able to handle a large volume of potentially malicous traffic, and still able`
			`to serve legitimate requests without fault or failure.`

Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`To achieve this, the resources of the server must be managed and distributed to allow`
doc 2020-08-08 04:37:27 +02:00			`potentially thousands of operations per second, while preventing exhaustion of those`
			`resources.`

			`Kanidm is structured as a database, where each request requires a process`
			`to resolve that query into an answer. This could be a request for authetication`
			`which is a true/false response, or a request for an identity so that we can`
			`determine their groups for authorisation, or even just a request to find`
			`someone's email address in a corporate directory context.`

			`Each operation requires resources to complete individually, but many operations`
			`can be processed in parallel.`

Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`Resource exhaustion occurs when input from a client consumes more resources`
			`than the server can provide. This means the attack surface is any possible input`
			`and how the server interprets and processes that input. In kanidm this could be`
doc 2020-08-08 04:37:27 +02:00			`a search filter for example - the query may be small, but it could be expensive`
			`to compute.`

			`CPU time`
			`========`

			`Every operation requires time on a CPU. A CPU logically is many threads which`
			`can have many operations in parallel, but also bound by a frequency and their`
			`performance as to how many instructions can be performed per second.`

			`This means that given a say 8 core CPU, each individual core may be able to`
			`complete 1000 operations per second, and combined this becomes 8x1000.`

			`If CPU time is exhausted, then operations are unable to be served. This means`
			`that we must manage the amount of time an operation is allowed to take`
			`on the system.`

			`Some types of queries can take more time than others. For example:`

			`* Unindexed searches which require full table scans.`
			`* Indexed searches that have a large number of results.`
			`* Write operations that affect many entries.`

			`These necesitate the following limits:`

			`* No unindexed searches allowed`
			`* Prevent searching on terms that do not exist (done, filter schema validation)`
			`* Limit on the number of entries that can be checked during a partial indexed search`
			`* Limit on the number of entries that can be loaded from a fully indexed search`
			`* Limit on the number of entries that can be written during a write operation`
			`* Reduce the size of allowed queries to prevent complex index access patterns`

			`Memory (RAM) Capacity`
			`=====================`

Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`A system has finite amounts of RAM which must be shared between all CPUs of the system.`
doc 2020-08-08 04:37:27 +02:00			`Effective use of that RAM improves the speed of all operations, meaning that as a shared`
			`resource it can not be monopolised by a single operation.`

Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`Each operation can be assumed to have access to RAM/n CPUs of RAM. If an operation temporarily`
doc 2020-08-08 04:37:27 +02:00			`exceeds this, the impact is minimal, but continued use of a single CPU using a high capacity`
			`of RAM can prevent other operations functioning. Additionally, past operations can impact`
Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`the content of RAM though cache invalidation attacks.`
doc 2020-08-08 04:37:27 +02:00
			`To prevent overuse of RAM, the following must be considered`

			`* Cache strategies to prevent invalidation attacks (done, ARC + ahash).`
			`* Limit the amount of entries that can be dirtied in a write operation`
Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`* Limit the number of entries that can be filter-tested in partial index queries`
doc 2020-08-08 04:37:27 +02:00			`* Limit the amount of entries that can be loaded for a read operation`
			`* Limit the size of cache to allow room for operation result sets (done)`
			`* Cache closer to the application rather than using db/vfs caches (done)`
			`* Compress resources that consume bandwidth to operate on (done, index compression)`

			`Disk IOPS`
			`=========`

			`Storage has a limit on how many input/output operations it can perform per second. Each operation`
			`must be able to access the storage system for entries and indexes that may not be cached in memory.`
			`The best way to reduce IOPS is more RAM to allow more application level caching, but this is`
			`not always possible. This means that we must have operations minimise their IOPS requirements,`
			`or to not issue or allow requests that would generate large amounts of IOPS.`

			`* Cache strategies that keep cache hit ratios high (done, ARC)`
			`* Compress elements to reduce IOPS for transfer to RAM (done indexes, not-done entries)`
			`* Limit the number of large items that can be loaded in an operation IE partial index scans, full table scans`
			`* Limit the size of write operations.`

Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`Disk Capacity`
doc 2020-08-08 04:37:27 +02:00			`=============`

			`If the storage capacity is exceeded this can cause the system to panic or stop, which would act`
			`as a denial of service. It may also prevent security auditing or other data to be recorded or in`
			`the worst case, data corruption. This means that during write operations, the amount of change`
			`must be limited.`

			`* Limit the number of entries that can be dirtied in an operation`
Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`* Limit the maximum number of multivalue attributes on an entry to prevent the entry growing over-size`
doc 2020-08-08 04:37:27 +02:00			`* Entry maximum size limits`
Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`* Limit the amount of churn in an entry to prevent excessive changelog growth`
doc 2020-08-08 04:37:27 +02:00
			`Stack Limits`
			`============`

Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`As some operations are recursive, the stack depth becomes a concern because when the stack depth is exceeded, the system`
doc 2020-08-08 04:37:27 +02:00			`can crash or panic. The major recursive structure is filters and how we process queries.`

			`To prevent stack exhaustion:`

			`* Limit the depth of nesting in a search query (filter).`

			`Network`
			`=======`

			`To prevent a client from exhausting memory, but also network bandwidth, if an operation is too`
			`large it should be denied`

			`* Limit size of incoming network requests`

			`Limits Summary`
			`==============`

			`This summary is the set of limits that should be applied. Note some limits prevent multiple issues,`
			`and the limits apply in different areas.`

			`Search Limits:`
			`* Prevent searching on terms that do not exist (done, filter schema validation)`
			`* Deny unindexed searches`
			`* Limit the amount of entries that can be loaded for a read operation`
Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`* Limit the number of entries that can be filter-tested in partial index queries`
doc 2020-08-08 04:37:27 +02:00			`* Limit on the number of entries that can be loaded from a search`
			`* Limit the depth of nesting in a search query (filter).`
			`* Reduce the size of allowed queries to prevent complex index access patterns`

			`Caching:`
			`* Cache strategies to prevent invalidation attacks (done, ARC + ahash).`
			`* Limit the size of cache to allow room for operation result sets (done)`
			`* Cache closer to the application rather than using db/vfs caches (done)`

			`Network:`
			`* Limit size of incoming network requests`

			`Db storage:`
			`* Compress elements to reduce IOPS for transfer to RAM (done indexes, not-done entries)`

			`Writes:`
			`* Limit the number of entries that can be dirtied in an operation`
			`* Limit the maximum number of multivalue attributes on an entry to prevent the entry growing over-size`
			`* Entry maximum size limits`
Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`* Limit the amount of churn in an entry to prevent excessive changelog growth`
doc 2020-08-08 04:37:27 +02:00
Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`These limits should be applied per-account to allow some accounts to override these, for example`
doc 2020-08-08 04:37:27 +02:00			`an application which needs to bulk update accounts, or admins who need to perform audits.`

			`The system maintains a default set of limits. Accounts can then have resource groups associated.`
			`The "highest" value of the resource group or default is the value that is applied. These limits`
			`could also be claim based or related, meaning they become per session rather than per account, so`
			`they would be stored in the user authentication token.`

			`The session limits would be:`

			`* allow unindexed search`
			`* maximum number of entries in search`
			`* maximum number of entries in partial filter test`
			`* number of filter elements`
			`* maximum number of modified entries`
			`* write rate limit (writes over time)`
			`* network request size`

Apply suggestions from code review Co-authored-by: Charelle Collett <Charcol@users.noreply.github.com> 2020-08-08 05:23:16 +02:00			`The system limits that can not be account overridden are:`
doc 2020-08-08 04:37:27 +02:00
			`* maximum entry size`
			`* maximum number of multi value attributes`