Add design documents as drafts

William Brown 2019-02-14 12:49:45 +10:00
parent 651abe3762
commit 84ff865304
7 changed files with 582 additions and 50 deletions

@ -28,6 +28,109 @@ An example is that user Alice should only be able to search for objects where th
is person, and where they are a memberOf "visible" group. Alice should only be able to
see those users' displayNames (not their legalName, for example), and their public email.
Worded a bit differently. You need permission over the scope of entries, you need to be able
to read the attribute to filter on it, and you need to be able to read the attribute to receive
it in the result entry.
Threat: If we search for '(&(name=william)(secretdata=x))', we should not allow this to
proceed because you don't have the rights to read secret data, so you should not be allowed
to filter on it. How does this work with two overlapping ACPs? For example one that allows read
of name and description to class = group, and one that allows name to user. We don't want to
say '(&(name=x)(description=foo))' and have it allowed, because we don't know the target class
of the filter. Do we "unmatch" all users because they have no access to the filter components? (Could
be done by inverting and putting in an AndNot of the non-matchable overlaps). Or do we just
filter out description from the users returned (but that implies they DID match, which is a disclosure).
More concrete:
search {
    action: allow
    targetscope: Eq("class", "group")
    targetattr: name
    targetattr: description
}
search {
    action: allow
    targetscope: Eq("class", "user")
    targetattr: name
}
SearchRequest {
    ...
    filter: And: {
        Pres("name"),
        Pres("description"),
    }
}
A potential defense is:
    acp class group: Pres(name) and Pres(desc) both in target attr, allow
    acp class user: Pres(name) allow, Pres(desc) deny. Invert and Append
So the filter now is:
And: {
    AndNot: {
        Eq("class", "user")
    },
    And: {
        Pres("name"),
        Pres("description"),
    },
}
This would now only allow access to the name/desc of group.
If we extend this to a third ACP, this would still work. But consider a more complex example:
search {
    action: allow
    targetscope: Eq("class", "group")
    targetattr: name
    targetattr: description
}
search {
    action: allow
    targetscope: Eq("class", "user")
    targetattr: name
}
search {
    action: allow
    targetscope: And(Eq("class", "user"), Eq("name", "william"))
    targetattr: description
}
Now we have a single user where we can read desc. So the compiled filter above becomes:
And: {
    AndNot: {
        Eq("class", "user")
    },
    And: {
        Pres("name"),
        Pres("description"),
    },
}
This would now be invalid: first, the inverted class=user term would also exclude william,
even though the third ACP grants read of description there. We also may not even have "class=user"
in the second ACP, so we can't use subset filter matching to merge the two.
As a result, I think the only possible valid solution is to perform the initial filter, then determine
on the candidates if we *could* have valid access to filter on all required attributes. This means
that even with an index look up, we are still required to perform some filter application
on the candidates.
I think this will mean that, for each possible candidate, we have to apply all ACPs, then create a
union of the resulting targetattrs, and then compare that set against the set of attributes in the
filter. This will (potentially) be slow on large candidate sets, but could be sped up with
parallelism, caching or other techniques. However, in the same step, we can also extract only the
allowed read target attrs, so this is a valuable exercise. A sketch of this check is below.
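As a rough sketch in Rust (hypothetical types and names, not the server's real API), the
per-candidate check could look like:

use std::collections::BTreeSet;

// Hypothetical stand-ins for the server's entry and filter types.
struct Filter;
struct Entry;

impl Entry {
    fn matches(&self, _scope: &Filter) -> bool {
        // Placeholder: real code would evaluate the targetscope filter.
        true
    }
}

struct AccessControlSearch {
    targetscope: Filter,
    targetattrs: BTreeSet<String>,
}

// For one candidate entry: union the targetattrs of every ACP whose
// targetscope matches the entry, then require that every attribute
// named in the search filter is inside that union. The same union can
// be reused to strip unreadable attributes from the returned entry.
fn filter_attrs_allowed(
    entry: &Entry,
    filter_attrs: &BTreeSet<String>,
    acps: &[AccessControlSearch],
) -> bool {
    let mut allowed: BTreeSet<String> = BTreeSet::new();
    for acp in acps {
        if entry.matches(&acp.targetscope) {
            allowed.extend(acp.targetattrs.iter().cloned());
        }
    }
    filter_attrs.is_subset(&allowed)
}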
Delete Requirements
-------------------
@ -91,6 +194,8 @@ be best implemented as a compilation of self -> eq(uuid, self.uuid).
Implementation Details
----------------------
CHANGE: Receiver should be a group, and should be single value/multivalue? Can *only* be a group.
Example profiles:
search {

38 designs/entries.rst Normal file
@ -0,0 +1,38 @@
Entries
-------
Entries are the base unit of data in this server. This is one of the three foundational concepts,
along with filters and schema, that everything else builds upon.
What is an Entry?
-----------------
An entry is a collection of attribute-value sets, sometimes called attribute-value-assertions
(avas). The attribute is a "key", and it holds 1 to infinite associated values. An entry
can have many avas, which together create the entry as a whole. An example entry (minus schema):
Entry {
    "name": ["william"],
    "mail": ["william@email", "email@william"],
    "uuid": ["..."],
}
There are only a few rules that are true in entries.
* UUID
All entries *must* have a UUID attribute, and there must ONLY exist a single value. This UUID ava
MUST be unique within the database, regardless of entry state (live, recycled, tombstoned etc).
* Zero values
An attribute with zero values is removed from the entry.
* Unsorted
Values within an attribute are "not sorted" in any meaningful way for a client utility (in reality
they are sorted by an undefined internal order for fast lookup/insertion).
That's it.
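As a minimal sketch (hypothetical types, not the server's real implementation), these rules map
naturally onto a map of attribute names to value sets:

use std::collections::{BTreeMap, BTreeSet};

// Sketch: an entry is a map of attribute name -> set of values. The
// BTreeSet keeps values deduplicated and in an internal order that is
// fast for lookup, matching the "unsorted for clients" rule.
struct Entry {
    avas: BTreeMap<String, BTreeSet<String>>,
}

impl Entry {
    // Zero-values rule: removing the last value removes the attribute.
    fn remove_value(&mut self, attr: &str, value: &str) {
        if let Some(vals) = self.avas.get_mut(attr) {
            vals.remove(value);
            if vals.is_empty() {
                self.avas.remove(attr);
            }
        }
    }

    // UUID rule: exactly one uuid value must be present.
    fn check_uuid(&self) -> bool {
        self.avas.get("uuid").map_or(false, |v| v.len() == 1)
    }
}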

163 designs/filter.rst Normal file
@ -0,0 +1,163 @@
Filters
-------
Filters (along with entries and schema) are one of the foundational concepts in the
design of KaniDM. They are used in nearly every aspect of the server to provide
checking and searching over entry sets.
A filter is a set of requirements that the attribute-value pairs of an entry must
conform to for the filter to be considered a "match". This has two useful properties:
* We can apply a filter to a single entry to quickly determine whether assertions about that
entry hold true.
* We can apply a filter to a set of entries to reduce the set only to the matching entries.
Filter Construction
-------------------
Filters are rooted in relational algebra and set mathematics. I am not an expert on either
topic, and have learnt about their design from experience.
* Presence
The simplest filter is a "presence" test. It asserts that some attribute, regardless
of its value, exists on the entry. For example, given the entries below:
Entry {
    name: william
}
Entry {
    description: test
}
If we apply "Pres(name)", then we would only see the entry containing "name: william" as a matching
result.
* Equality
Equality checks that an attribute and value are present on an entry. For example
Entry {
    name: william
}
Entry {
    name: test
}
If we apply Eq(name, william), only the first entry would match. If the attribute is multivalued,
we only assert that one value in the set is there. For example:
Entry {
    name: william
}
Entry {
    name: test
    name: claire
}
In this case, applying Eq(name, claire) would match the second entry, as name=claire is present
in the multivalue set.
* Sub
Substring checks that the substring exists in an attribute of the entry. This is a specialisation
of equality, where the same value and multivalue handling holds true.
Entry {
    name: william
}
In this example, Sub(name, liam) would match, but Sub(name, air) would not.
* Or
Or contains multiple filters, and asserts that if *any* of them is true, this condition
will hold true. For example:
Entry {
    name: claire
}
Here, the filter Or(Eq(name, claire), Eq(name, william)) will be true, because Eq(name, claire)
is true, thus the Or condition is true. If nothing inside the Or is true, it returns false.
* And
And checks that all inner filter conditions are true in order to return true. If any are false,
it will yield false.
Entry {
    name: claire
    class: person
}
For this example, And(Eq(class, person), Eq(name, claire)) would be true, but And(Eq(class, group),
Eq(name, claire)) would be false.
* AndNot
AndNot is different to a logical not.
If we had Not(Eq(name, claire)), then the logical result is "all entries where name is not
claire". However, this is (today...) not very efficient. Instead, we have "AndNot", which asserts
that a condition of a candidate set is not true. So the operation AndNot(Eq(name, claire)) on its
own would yield an empty set. AndNot is important when you need to check that something is also
not true, but without retrieving all entries where that not holds. An example:
Entry {
    name: william
    class: person
}
Entry {
    name: claire
    class: person
}
In this case "And(Eq(class, person), AndNot(Eq(name, claire)))". This would find all persons
where their name is also not claire: IE william. However, the following would be empty result.
"AndNot(Eq(name, claire))". This is because there is no candidate set already existing, so there
is nothing to return.
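To summarise the terms above, here is a minimal Rust sketch (hypothetical types, not the server's
real implementation) of a filter being applied to a single candidate entry:

use std::collections::{BTreeMap, BTreeSet};

// Hypothetical filter terms matching the descriptions above.
enum Filter {
    Pres(String),
    Eq(String, String),
    Sub(String, String),
    Or(Vec<Filter>),
    And(Vec<Filter>),
    AndNot(Box<Filter>),
}

// An entry as a map of attribute -> value set (see the entries document).
type Entry = BTreeMap<String, BTreeSet<String>>;

fn entry_match(e: &Entry, f: &Filter) -> bool {
    match f {
        Filter::Pres(attr) => e.contains_key(attr),
        Filter::Eq(attr, v) => e.get(attr).map_or(false, |vs| vs.contains(v)),
        Filter::Sub(attr, sub) => e
            .get(attr)
            .map_or(false, |vs| vs.iter().any(|v| v.contains(sub.as_str()))),
        Filter::Or(fs) => fs.iter().any(|inner| entry_match(e, inner)),
        Filter::And(fs) => fs.iter().all(|inner| entry_match(e, inner)),
        // Against a single candidate, AndNot is plain negation; it only
        // differs from a true Not over *sets*, because it never
        // generates candidates of its own.
        Filter::AndNot(inner) => !entry_match(e, inner),
    }
}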
Filter Schema Considerations
----------------------------
In order to make filters work properly, the server normalises entries on input to allow simpler
comparisons and ordering in the actual search phases. This means that for a filter to operate,
it too must be normalised and valid.
If a filter requests an operation on an attribute we do not know of in schema, the operation
is rejected. This is to prevent a denial of service attack where Eq(NonExist, value) would cause
un-indexed full table scans to be performed consuming server resources.
In a filter request, the attribute name in use is normalised according to schema, as is
the search value. For example, Eq(nAmE, Claire) would normalise to Eq(name, claire), as both
attrname and name are UTF8_INSENSITIVE. However, displayName is case sensitive, so a search like
Eq(displayName, Claire) would become Eq(displayname, Claire). Note Claire remains cased.
This means that instead of having costly routines to normalise entries on each read and search,
we can normalise on entry modify and create, then we only need to ensure filters match and we
can do basic string comparisons as needed.
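A minimal sketch of this normalisation step, assuming a hypothetical syntax enum rather than the
real schema types:

// Sketch of attribute/value normalisation for an Eq term.
#[derive(Clone, Copy)]
enum Syntax {
    Utf8Insensitive,
    Utf8Sensitive,
}

// Attribute names always fold to lowercase; values only fold when the
// attribute's syntax is case insensitive.
fn normalise_eq(attr: &str, value: &str, syntax: Syntax) -> (String, String) {
    let attr = attr.to_lowercase();
    let value = match syntax {
        Syntax::Utf8Insensitive => value.to_lowercase(),
        Syntax::Utf8Sensitive => value.to_string(),
    };
    (attr, value)
}

// normalise_eq("nAmE", "Claire", Syntax::Utf8Insensitive) -> ("name", "claire")
// normalise_eq("displayName", "Claire", Syntax::Utf8Sensitive) -> ("displayname", "Claire")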
Discussion
----------
Is it worth adding a true "not" type, and using that instead? It would be extremely costly on
indexes or filter testing, but would logically be better than AndNot as a filter term.
Not could be implemented as Not(<filter>) -> And(Pres(class), AndNot(<filter>)) which would
yield the equivalent result, but it would consume a very large index component. In this case
though, filter optimising would promote Eq over Pres, so we should be able to skip to a candidate
test, or access the index and get the right result anyway rather than a full table scan.
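A sketch of that rewrite, reusing the hypothetical Filter enum from the earlier sketch:

// Pres(class) generates the candidate set of all entries (every entry
// has a class), and AndNot then subtracts the inner match from it.
fn rewrite_not(inner: Filter) -> Filter {
    Filter::And(vec![
        Filter::Pres("class".to_string()),
        Filter::AndNot(Box::new(inner)),
    ])
}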
Additionally, Not/AndNot could be security risks because they could be combined with And
queries that allow them to bypass the filter-attribute permission check. Is there an example
of using And(Eq, AndNot(Eq)) that could be used to provide information disclosure about
the status of an attribute given a result/non result where the AndNot is false/true?

152 designs/indexing.rst Normal file
@ -0,0 +1,152 @@
Indexing
--------
Indexing is deeply tied to the concept of filtering. Indexes exist to make the application of a
search term (filter) faster.
World without indexing
----------------------
Almost all databases are built on top of a key-value storage engine of some nature. In our
case we are using (Feb 2019) sqlite, and hopefully SLED in the future.
Our entries, which contain sets of avas, are serialised into a byte format (Feb 2019: json,
but soon cbor) and stored in a table of "id: entry". For example:
|----------------------------------------------------------------------------------------|
| ID | data |
|----------------------------------------------------------------------------------------|
| 01 | { 'Entry': { 'name': ['name'], 'class': ['person'], 'uuid': ['...'] } } |
| 02 | { 'Entry': { 'name': ['beth'], 'class': ['person'], 'uuid': ['...'] } } |
| 03 | { 'Entry': { 'name': ['alan'], 'class': ['person'], 'uuid': ['...'] } } |
| 04 | { 'Entry': { 'name': ['john'], 'class': ['person'], 'uuid': ['...'] } } |
| 05 | { 'Entry': { 'name': ['kris'], 'class': ['person'], 'uuid': ['...'] } } |
|----------------------------------------------------------------------------------------|
The ID column is *private* to the backend implementation and is never revealed to the higher
level components. However, the ID is very important to indexing :)
If we wanted to find Eq(name, john) here, what do we need to do? A full table scan is where we
perform:
data = sqlite.do(SELECT * from id2entry);
for row in data:
    entry = deserialise(row)
    entry.match_filter(...) // check Eq(name, john)
For a small database (maybe up to 20 objects), this is probably fine. But once you start to get
much larger this is really costly. We continually load, deserialise, check and free data that
is not relevant to the search. This is why full table scans of any database (sql, ldap, anything)
are so costly. It's really really scanning everything!
How does indexing work?
-----------------------
Indexing is a pre-computed lookup table of what you *might* search in a specific format. Let's say
in our example we have an equality index on "name" as an attribute. Now in our backend we define
an extra table called "index_eq_name". Its contents would look like:
|------------------------------------------|
| index | idl |
|------------------------------------------|
| alan | [03, ] |
| beth | [02, ] |
| john | [04, ] |
| kris | [05, ] |
| name | [01, ] |
|------------------------------------------|
So when we perform our search for Eq(name, john) again, we see name is indexed. We then perform:
SELECT * from index_eq_name where index=john;
This would give us the idl (ID list) of [04,]. This is the "IDs of every entry where name equals
john".
We can now take this back to our id2entry table and perform:
data = sqlite.do(SELECT * from id2entry where ID = 04)
The key-value engine only gives us the entry for john, and we have a match! If id2entry had 1 million
entries, a full table scan would be 1 million loads and compares - with the index, it was 2 loads and
one compare. That's hundreds of thousands of times faster (potentially ;) )!
To improve on this, if we had a query like Or(Eq(name, john), Eq(name, kris)) we can use our
indexes to speed this up.
We would query index_eq_name again, performing the search for both john and kris. Because
this is an Or, we then union the two idls, and we would have:
[04, 05,]
Now we just have to get entries 04,05 from id2entry, and we have our matching query. This means
filters are often applied as idl set operations.
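As a sketch, assuming idls are sorted, deduplicated lists of u64 ids (the real implementation is
the idlset crate linked below), the Or case is a merge-style union:

// Sketch: with sorted, deduplicated id lists, Or becomes a set union
// (And is the analogous intersection: keep only ids present in both).
fn idl_union(a: &[u64], b: &[u64]) -> Vec<u64> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i] < b[j] {
            out.push(a[i]);
            i += 1;
        } else if b[j] < a[i] {
            out.push(b[j]);
            j += 1;
        } else {
            // Present in both: push once.
            out.push(a[i]);
            i += 1;
            j += 1;
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

// idl_union(&[4], &[5]) == vec![4, 5] - the Or(Eq(name, john), Eq(name, kris)) case.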
Compressed ID lists
-------------------
In order to make idl loading and the set operations faster, there is an idl library
(developed by me, Firstyear), which will be used for this. To read more, see:
https://github.com/Firstyear/idlset
Filter Optimisation
-------------------
Filter optimisation begins to play an important role when we have indexes. If we indexed
something like "Pres(class)", then the idl for that search is the set of all database
entries. Similarly, if our database of 1 million entries has 250,000 class=person, then
Eq(class, person) will have an idl containing 250,000 ids. Even with idl compression, this
is still a lot of data!
There tend to be two types of searches against a directory like kanidm.
* Broad searches
* Targeted single entry searches
For broad searches, filter optimising does little - we just have to load those large idls, and
use them. (Yes, loading the large idl and using it is still better than full table scan though!)
However, for targeted searches, filter optimising really helps.
Imagine a query like:
And(Eq(class, person), Eq(name, claire))
In this case, with our database of 250,000 persons, our idls would be:
And( idl[250,000 ids], idl(1 id))
Which means the result will always be the *single* id in the idl or *no* value because it wasn't
present.
We add a single concept to the server called the "filter test threshold". This is the point at which
a partially-processed candidate set is shortcut: rather than continuing the index loading and
testing, we apply the filter to the partial candidate set in the manner of a full table scan,
because that will be faster.
When we have this test threshold, there exist two possibilities for this filter.
And( idl[250,000 ids], idl(1 id))
We load 250,000 idl and then perform the intersection with the idl of 1 value, and result in 1 or 0.
And( idl(1 id), idl[250,000 ids])
We load the single idl for name, and then, as we are below the test threshold, we shortcut out
and apply the filter directly to that single candidate entry - yielding a match or no match.
Notice how in the second case, by promoting the "smaller" idl, we were able to save the work of the
idl load and intersection, as our first equality on "name" was more targeted?
Filter optimisation is about re-arranging these filters in the server, using our insight into the
data, to provide faster searches and to avoid costly index loads unless they are needed.
In this case, we would *demote* any filter of Eq(class, ...) to the *end* of the And, because it
is highly likely to be less targeted than the other Eq types. Another example would be promotion
of Eq filters to the front of an And over a Sub term, where Sub indexes tend to be larger and have
longer idls.
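A sketch of that re-ordering, reusing the hypothetical Filter enum from the filter design document
and a hand-written cost heuristic (the real server would estimate from index data):

// A stable sort keeps the original order within each rank.
fn and_term_rank(f: &Filter) -> u8 {
    match f {
        // Broad class equality is demoted to the end of the And.
        Filter::Eq(attr, _) if attr == "class" => 3,
        // Other equalities are the most targeted, so run them first.
        Filter::Eq(_, _) => 0,
        // Substring idls tend to be longer than equality idls.
        Filter::Sub(_, _) => 1,
        _ => 2,
    }
}

fn optimise_and(terms: &mut Vec<Filter>) {
    terms.sort_by_key(and_term_rank);
}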

117 designs/schema.rst Normal file
@ -0,0 +1,117 @@
Schema
------
Schema is one of the three foundational concepts of the server, along with filters and entries.
Schema defines how attribute values *must* be represented, sorted, indexed and more. It also
defines what attributes could exist on an entry.
Why Schema?
-----------
The way that the server is designed, you could extract the backend parts and just have "Entries"
with no schema. That's totally valid if you want!
However, usually in the world all data maintains some form of structure, even if loose. We want to
have ways to say a database entry represents a person, and what a person requires.
Attributes
----------
In the entry document, I discuss that avas have a single attribute, and 1 to infinite values that
are utf8 case sensitive strings. With schema attribute types, we can constrain these avas on an
entry.
For example, while the entry may be capable of holding 1 to infinite "name" values, the schema
defines that only one name is valid on the entry. Addition of a second name would be a violation. Of
course, schema also defines "multi-value", our usual 1 to infinite value storage concept.
Schema can also define that values of the attribute must conform to a syntax. For example, name
is a case *insensitive* string. So despite the fact that avas store case-sensitive data, all inputs
to name will be normalised to a lowercase form for faster matching. There are a number of syntax
types built into the server, and we'll add more later.
Finally, an attribute can be defined as indexed, and in which ways it can be indexed. We often will
want to search for "mail" on a person, so we can define in the schema that mail is indexed by the
backend indexing system. We don't define *how* the index is built - only that some index should exist
for when a query is made.
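As a sketch, a schema attribute definition might carry fields like these (illustrative names only,
not the server's exact types):

// Illustrative sketch of a schema attribute definition.
struct SchemaAttribute {
    name: String,           // normalised attribute name, e.g. "mail"
    multivalue: bool,       // false => at most one value on an entry
    syntax: Syntax,         // controls value normalisation and matching
    index: Vec<IndexType>,  // which index tables the backend should keep
}

enum Syntax {
    Utf8Insensitive, // e.g. name: values folded to lowercase
    Utf8Sensitive,   // e.g. displayName: values kept as given
}

enum IndexType {
    Equality,
    Presence,
    Substring,
}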
Classes
-------
So while we have attributes that define "what is valid in the avas", classes define "which attributes
can exist on the entry itself".
A class defines requirements that are "may", "must", "systemmay", "systemmust". The system- variants
exist so that we can ship what we believe are good definitions. The may and must exist so you can
edit and extend our classes with your extra attribute fields (but it may be better just to add
your own class types :) )
An attribute in a class marked as "may" is optional on the entry. It can be present as an ava, or
it may not be.
An attribute in a class marked as "must" is required on the entry. An ava that is valid to the
attribute syntax is required on this entry.
An attribute that is not "may" or "must" can not be present on this entry.
Let's imagine we have a class (pseudo example) of "person". We'll make it:
Class {
    "name": "person",
    "systemmust": ["name"],
    "systemmay": ["mail"]
}
If we had an entry such as:
Entry {
    "class": ["person"],
    "uid": ["bob"],
    "mail": ["bob@email"]
}
This would be invalid: We are missing the "systemmust" name attribute. It's also invalid because uid
is not present in systemmust or systemmay.
Entry {
    "class": ["person"],
    "name": ["claire"],
    "mail": ["claire@email"]
}
This entry is now valid. We have met the must requirement of name, and we have the optional
mail ava populated. The following is also valid.
Entry {
    "class": ["person"],
    "name": ["claire"],
}
Classes are 'additive' - this means given two classes on an entry, the must/may are unioned, and the
strongest rule is applied to attribute presence.
Imagine we also have:
Class {
    "name": "person",
    "systemmust": ["name"],
    "systemmay": ["mail"]
}
Class {
    "name": "emailperson",
    "systemmust": ["mail"]
}
With our entry now, the emailperson class turns the "may" of mail from person into a "must".
On our entry claire, that means this entry below is now invalid:
Entry {
    "class": ["person", "emailperson"],
    "name": ["claire"],
}
Simply adding an ava of mail back to the entry would make it valid once again.
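A minimal sketch of this additive checking (hypothetical types; systemmust/systemmay folded into
must/may for brevity):

use std::collections::BTreeSet;

// Sketch: union the must/may of every class on the entry, then apply
// the strongest rule. A "must" in any class beats a "may" in another,
// which is exactly the person + emailperson case above.
struct Class {
    must: BTreeSet<String>, // must and systemmust combined
    may: BTreeSet<String>,  // may and systemmay combined
}

fn validate(entry_attrs: &BTreeSet<String>, classes: &[Class]) -> bool {
    let mut must = BTreeSet::new();
    let mut may = BTreeSet::new();
    for c in classes {
        must.extend(c.must.iter().cloned());
        may.extend(c.may.iter().cloned());
    }
    // Every must is present, and nothing outside must/may is present.
    must.is_subset(entry_attrs)
        && entry_attrs.iter().all(|a| must.contains(a) || may.contains(a))
}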

@ -158,56 +158,6 @@ impl Filter<FilterInvalid> {
}
_ => panic!(),
}
    /*
    match self {
        Filter::Eq(attr, value) => match schema_attributes.get(attr) {
            Some(schema_a) => schema_a.validate_value(value),
            None => Err(SchemaError::InvalidAttribute),
        },
        Filter::Sub(attr, value) => match schema_attributes.get(attr) {
            Some(schema_a) => schema_a.validate_value(value),
            None => Err(SchemaError::InvalidAttribute),
        },
        Filter::Pres(attr) => {
            // This could be better as a contains_key
            // because we never use the value
            match schema_attributes.get(attr) {
                Some(_) => Ok(()),
                None => Err(SchemaError::InvalidAttribute),
            }
        }
        Filter::Or(filters) => {
            // This should never happen because
            // optimising should remove them as invalid parts?
            if filters.len() == 0 {
                return Err(SchemaError::EmptyFilter);
            };
            filters.iter().fold(Ok(()), |acc, filt| {
                if acc.is_ok() {
                    self.validate(filt)
                } else {
                    acc
                }
            })
        }
        Filter::And(filters) => {
            // This should never happen because
            // optimising should remove them as invalid parts?
            if filters.len() == 0 {
                return Err(SchemaError::EmptyFilter);
            };
            filters.iter().fold(Ok(()), |acc, filt| {
                if acc.is_ok() {
                    self.validate(filt)
                } else {
                    acc
                }
            })
        }
        Filter::Not(filter) => self.validate(filter),
    }
    */
}
pub fn from(f: &ProtoFilter) -> Self {

@ -1040,6 +1040,13 @@ mod tests {
})
}
#[test]
fn test_modify_invalid_class() {
// Test modifying an entry by adding an extra class that would cause the entry
// to no longer conform to schema.
unimplemented!()
}
#[test]
fn test_qs_delete() {
run_test!(|_log, mut server: QueryServer, audit: &mut AuditScope| {