Add design documents as drafts

This commit is contained in:
parent 651abe3762
commit 84ff865304

@@ -28,6 +28,109 @@ An example is that user Alice should only be able to search for objects where th
is person, and where they are a memberOf "visible" group. Alice should only be able to
see those users' displayNames (not their legalName for example), and their public email.

Worded a bit differently. You need permission over the scope of entries, you need to be able
to read the attribute to filter on it, and you need to be able to read the attribute to receive
it in the result entry.

Threat: If we search for '(&(name=william)(secretdata=x))', we should not allow this to
proceed because you don't have the rights to read secret data, so you should not be allowed
to filter on it. How does this work with two overlapping ACPs? For example, one that allows read
of name and description to class = group, and one that allows name to user. We don't want to
say '(&(name=x)(description=foo))' and have it allowed, because we don't know the target class
of the filter. Do we "unmatch" all users because they have no access to the filter components? (Could
be done by inverting and putting in an AndNot of the non-matchable overlaps.) Or do we just
filter out description from the users returned (but that implies they DID match, which is a disclosure)?

More concrete:

    search {
        action: allow
        targetscope: Eq("class", "group")
        targetattr: name
        targetattr: description
    }

    search {
        action: allow
        targetscope: Eq("class", "user")
        targetattr: name
    }

    SearchRequest {
        ...
        filter: And: {
            Pres("name"),
            Pres("description"),
        }
    }

A potential defense is:

    acp class group: Pres(name) and Pres(desc) both in target attr, allow
    acp class user: Pres(name) allow, Pres(desc) deny. Invert and Append

So the filter now is:

    And: {
        AndNot: {
            Eq("class", "user")
        },
        And: {
            Pres("name"),
            Pres("description"),
        },
    }

This would now only allow access to the name/desc of group.

If we extend this to a third ACP, this would still work. But consider a more complex example:

    search {
        action: allow
        targetscope: Eq("class", "group")
        targetattr: name
        targetattr: description
    }

    search {
        action: allow
        targetscope: Eq("class", "user")
        targetattr: name
    }

    search {
        action: allow
        targetscope: And(Eq("class", "user"), Eq("name", "william"))
        targetattr: description
    }

Now we have a single user where we can read desc, but the filter still compiles as above:

    And: {
        AndNot: {
            Eq("class", "user")
        },
        And: {
            Pres("name"),
            Pres("description"),
        },
    }

This would now be invalid: because the compiled filter excludes class=user wholesale, william would
be excluded even though the third ACP grants description access to him. We also may not have a plain
"class=user" scope in that ACP, so we can't use subset filter matching to merge the two.

As a result, I think the only possible valid solution is to perform the initial filter, then determine
on the candidates whether we *could* have valid access to filter on all required attributes. IE
this means even with an index lookup, we are still required to perform some filter application
on the candidates.

I think this will mean that on a possible candidate, we have to apply all ACPs, then create a union of
the resulting targetattrs, and then compare that set against the set of attributes in the filter.

This will be slow on large candidate sets (potentially), but could be sped up with parallelism, caching
or other techniques. However, in the same step, we can also extract only the allowed
read target attrs, so this is a valuable exercise.
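
A sketch of that per-candidate check; the types and names below are hypothetical, invented for this
illustration, and are not the implemented kanidm API:

    use std::collections::{BTreeMap, BTreeSet};

    // Hypothetical shapes for this sketch only - not the implemented kanidm types.
    type Entry = BTreeMap<String, BTreeSet<String>>;

    struct AccessProfile {
        // Which entries this ACP applies to (a compiled targetscope check).
        applies_to: fn(&Entry) -> bool,
        // Attributes this ACP allows the requester to read.
        targetattr: BTreeSet<String>,
    }

    // For one candidate entry: union the targetattrs of every ACP whose scope matches,
    // then require that every attribute named in the filter is inside that union.
    fn filter_attrs_allowed(
        candidate: &Entry,
        acps: &[AccessProfile],
        filter_attrs: &BTreeSet<String>,
    ) -> bool {
        let allowed: BTreeSet<&String> = acps
            .iter()
            .filter(|acp| (acp.applies_to)(candidate))
            .flat_map(|acp| acp.targetattr.iter())
            .collect();
        filter_attrs.iter().all(|a| allowed.contains(a))
    }

The same pass can also yield the set of attributes that may be returned for the candidate, which is
the "extract only the allowed read target attrs" step mentioned above.
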
Delete Requirements
-------------------

@@ -91,6 +194,8 @@ be best implemented as a compilation of self -> eq(uuid, self.uuid).

Implementation Details
----------------------

CHANGE: Receiver should be a group, and should be single value/multivalue? Can *only* be a group.

Example profiles:

    search {

designs/entries.rst (new file, 38 lines)

@@ -0,0 +1,38 @@

Entries
-------

Entries are the base unit of data in this server. This is one of the three foundational concepts,
along with filters and schema, that everything else builds upon.

What is an Entry?
-----------------

An entry is a collection of attribute-values. These are sometimes called attribute-value-assertions,
or attr-value sets. The attribute is a "key", and it holds 1 to infinite associated values. An entry
can have many avas associated, which together make up the entry as a whole. An example entry (minus schema):

    Entry {
        "name": ["william"],
        "mail": ["william@email", "email@william"],
        "uuid": ["..."],
    }

There are only a few rules that hold true for entries.

* UUID

All entries *must* have a UUID attribute, and there must ONLY exist a single value. This UUID ava
MUST be unique within the database, regardless of entry state (live, recycled, tombstoned etc).

* Zero values

An attribute with zero values is removed from the entry.

* Unsorted

Values within an attribute are "not sorted" in any meaningful way for a client utility (in reality
they are sorted by an undefined internal order for fast lookup/insertion).

That's it.
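
As a rough sketch (the names and types here are illustrative, not the actual kanidm entry
implementation), an entry can be thought of as a map of attribute names to value sets, with the
zero-value rule enforced on removal:

    use std::collections::{BTreeMap, BTreeSet};

    // Illustrative only: the real server types differ.
    struct Entry {
        // attribute name -> set of values.
        avas: BTreeMap<String, BTreeSet<String>>,
    }

    impl Entry {
        fn add_value(&mut self, attr: &str, value: &str) {
            self.avas
                .entry(attr.to_string())
                .or_insert_with(BTreeSet::new)
                .insert(value.to_string());
        }

        fn remove_value(&mut self, attr: &str, value: &str) {
            if let Some(vs) = self.avas.get_mut(attr) {
                vs.remove(value);
                // Rule: an attribute with zero values is removed from the entry.
                if vs.is_empty() {
                    self.avas.remove(attr);
                }
            }
        }
    }
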
designs/filter.rst (new file, 163 lines)

@@ -0,0 +1,163 @@

Filters
-------

Filters (along with Entries and Schema) are one of the foundational concepts in the
design of KaniDM. They are used in nearly every aspect of the server to provide
checking and searching over entry sets.

A filter is a set of requirements to which the attribute-value pairs of an entry must
conform for the filter to be considered a "match". This has two useful properties:

* We can apply a filter to a single entry to quickly determine whether assertions about that
  entry hold true.
* We can apply a filter to a set of entries to reduce the set to only the matching entries.

Filter Construction
-------------------

Filters are rooted in relational algebra and set mathematics. I am not an expert on either
topic, and have learnt about their design from experience.

* Presence

The simplest filter is a "presence" test. It asserts that some attribute, regardless
of its value, exists on the entry. For example, given the entries below:

    Entry {
        name: william
    }

    Entry {
        description: test
    }

If we apply "Pres(name)", then we would only see the entry containing "name: william" as a matching
result.

* Equality

Equality checks that an attribute and value are present on an entry. For example:

    Entry {
        name: william
    }

    Entry {
        name: test
    }

If we apply Eq(name, william), only the first entry would match. If the attribute is multivalued,
we only assert that one value in the set is there. For example:

    Entry {
        name: william
    }

    Entry {
        name: test
        name: claire
    }

In this case, application of Eq(name, claire) would match the second entry, as name=claire is present
in the multivalue set.

* Sub

Substring checks that the substring exists in an attribute of the entry. This is a specialisation
of equality, where the same value and multivalue handling holds true.

    Entry {
        name: william
    }

In this example, Sub(name, liam) would match, but Sub(name, air) would not.

* Or

Or contains multiple filters and asserts that provided *any* of them are true, this condition
will hold true. For example:

    Entry {
        name: claire
    }

Here the filter Or(Eq(name, claire), Eq(name, william)) will be true, because Eq(name, claire)
is true, thus the Or condition is true. If nothing inside the Or is true, it returns false.

* And

And checks that all inner filter conditions are true in order to return true. If any are false, it will
yield false.

    Entry {
        name: claire
        class: person
    }

For this example, And(Eq(class, person), Eq(name, claire)) would be true, but And(Eq(class, group),
Eq(name, claire)) would be false.

* AndNot

AndNot is different to a logical not.

If we had Not(Eq(name, claire)), then the logical result is "all entries where name is not
claire". However, this is (today...) not very efficient. Instead, we have "AndNot", which asserts
that a condition is not true of an existing candidate set. So the operation AndNot(Eq(name, claire)) on
its own would yield an empty set. AndNot is important when you need to check that something is also not
true, but without retrieving all entries where that not holds. An example:

    Entry {
        name: william
        class: person
    }

    Entry {
        name: claire
        class: person
    }

In this case, "And(Eq(class, person), AndNot(Eq(name, claire)))" would find all persons
whose name is also not claire: IE william. However, "AndNot(Eq(name, claire))" on its own would be an
empty result, because there is no candidate set already existing, so there is nothing to return.
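
As a rough illustration of how these terms compose (the types below are invented for this sketch
and are not the server's real Filter or Entry types), the filter terms can be modelled as a
recursive enum with a match-based evaluator:

    use std::collections::{BTreeMap, BTreeSet};

    // Illustrative sketch only; the real kanidm Filter and Entry types differ.
    enum Filter {
        Pres(String),
        Eq(String, String),
        Sub(String, String),
        Or(Vec<Filter>),
        And(Vec<Filter>),
        AndNot(Box<Filter>),
    }

    type Entry = BTreeMap<String, BTreeSet<String>>;

    fn matches(f: &Filter, e: &Entry) -> bool {
        match f {
            Filter::Pres(attr) => e.contains_key(attr),
            Filter::Eq(attr, v) => e.get(attr).map_or(false, |vs| vs.contains(v)),
            Filter::Sub(attr, s) => e
                .get(attr)
                .map_or(false, |vs| vs.iter().any(|v| v.contains(s.as_str()))),
            Filter::Or(fs) => fs.iter().any(|f| matches(f, e)),
            Filter::And(fs) => fs.iter().all(|f| matches(f, e)),
            // AndNot only makes sense against a candidate already produced by the
            // surrounding And; on its own it contributes "not true of this entry".
            Filter::AndNot(inner) => !matches(inner, e),
        }
    }
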
Filter Schema Considerations
----------------------------

In order to make filters work properly, the server normalises entries on input to allow simpler
comparisons and ordering in the actual search phases. This means that for a filter to operate,
it too must be normalised and valid.

If a filter requests an operation on an attribute we do not know of in schema, the operation
is rejected. This is to prevent a denial of service attack where Eq(NonExist, value) would cause
un-indexed full table scans to be performed, consuming server resources.

In a filter request, the attribute name in use is normalised according to schema, as is the
search value. For example, Eq(nAmE, Claire) would normalise to Eq(name, claire), as both the
attribute name and the name value are UTF8_INSENSITIVE. However, displayName is case sensitive, so
a search like Eq(displayName, Claire) would become Eq(displayname, Claire). Note Claire remains cased.

This means that instead of having costly routines to normalise entries on each read and search,
we can normalise on entry modify and create; then we only need to ensure filters match and we
can do basic string comparisons as needed.
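
A minimal sketch of this normalisation step, assuming a syntax enum and a simple map lookup that
are illustrative only and not the server's real schema API:

    use std::collections::BTreeMap;

    // Sketch only: the syntax names and error handling are assumptions for illustration.
    enum Syntax {
        Utf8Insensitive,
        Utf8,
    }

    fn normalise_eq(
        schema: &BTreeMap<String, Syntax>,
        attr: &str,
        value: &str,
    ) -> Result<(String, String), &'static str> {
        // Attribute names are always folded to lowercase.
        let norm_attr = attr.to_lowercase();
        match schema.get(&norm_attr) {
            // Unknown attributes are rejected rather than risking a full table scan.
            None => Err("invalid attribute"),
            Some(Syntax::Utf8Insensitive) => Ok((norm_attr, value.to_lowercase())),
            Some(Syntax::Utf8) => Ok((norm_attr, value.to_string())),
        }
    }
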
Discussion
----------

Is it worth adding a true "not" type, and using that instead? It would be extremely costly on
indexes or filter testing, but would logically be better than AndNot as a filter term.

Not could be implemented as Not(<filter>) -> And(Pres(class), AndNot(<filter>)), which would
yield the equivalent result, but it would consume a very large index component. In this case
though, filter optimising would promote Eq > Pres, so we should be able to skip to a candidate
test, or we access the index and get the right result anyway instead of a full table scan.

Additionally, Not/AndNot could be security risks because they could be combined with And
queries that allow them to bypass the filter-attribute permission check. Is there an example
of using And(Eq, AndNot(Eq)) that could be used to provide information disclosure about
the status of an attribute, given a result/non-result where the AndNot is false/true?

designs/indexing.rst (new file, 152 lines)

@@ -0,0 +1,152 @@

Indexing
--------

Indexing is deeply tied to the concept of filtering. Indexes exist to make the application of a
search term (filter) faster.

World without indexing
----------------------

Almost all databases are built on top of a key-value storage engine of some nature. In our
case we are using (feb 2019) sqlite and hopefully SLED in the future.

So our entries, which contain sets of avas, are serialised into a byte format (feb 2019, json
but soon cbor) and stored in a table of "id: entry". For example:

    |----------------------------------------------------------------------------------------|
    | ID | data                                                                               |
    |----------------------------------------------------------------------------------------|
    | 01 | { 'Entry': { 'name': ['name'], 'class': ['person'], 'uuid': ['...'] } }            |
    | 02 | { 'Entry': { 'name': ['beth'], 'class': ['person'], 'uuid': ['...'] } }            |
    | 03 | { 'Entry': { 'name': ['alan'], 'class': ['person'], 'uuid': ['...'] } }            |
    | 04 | { 'Entry': { 'name': ['john'], 'class': ['person'], 'uuid': ['...'] } }            |
    | 05 | { 'Entry': { 'name': ['kris'], 'class': ['person'], 'uuid': ['...'] } }            |
    |----------------------------------------------------------------------------------------|

The ID column is *private* to the backend implementation and is never revealed to the higher
level components. However, the ID is very important to indexing :)

If we wanted to find Eq(name, john) here, what do we need to do? A full table scan is where we
perform:

    data = sqlite.do(SELECT * from id2entry);
    for row in data:
        entry = deserialise(row)
        entry.match_filter(...) // check Eq(name, john)

For a small database (maybe up to 20 objects), this is probably fine. But once you start to get
much larger, this is really costly. We continually load, deserialise, check and free data that
is not relevant to the search. This is why full table scans of any database (sql, ldap, anything)
are so costly. It's really, really scanning everything!

How does indexing work?
-----------------------

Indexing is a pre-computed lookup table of what you *might* search, in a specific format. Let's say
in our example we have an equality index on "name" as an attribute. Now in our backend we define
an extra table called "index_eq_name". Its contents would look like:

    |------------------------------------------|
    | index | idl                               |
    |------------------------------------------|
    | alan  | [03, ]                            |
    | beth  | [02, ]                            |
    | john  | [04, ]                            |
    | kris  | [05, ]                            |
    | name  | [01, ]                            |
    |------------------------------------------|

So when we perform our search for Eq(name, john) again, we see name is indexed. We then perform:

    SELECT * from index_eq_name where index=john;

This would give us the idl (ID list) of [04,]. This is "the IDs of every entry where name equals
john".

We can now take this back to our id2entry table and perform:

    data = sqlite.do(SELECT * from id2entry where ID = 04)

The key-value engine only gives us the entry for john, and we have a match! If id2entry had 1 million
entries, a full table scan would be 1 million loads and compares - with the index, it was 2 loads and
one compare. That's 30000x faster (potentially ;) )!
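
A toy model of these two tables and the Eq(name, john) lookup; the in-memory maps and values here
are purely illustrative, not how the backend actually stores data:

    use std::collections::BTreeMap;

    fn main() {
        // id2entry: private backend ID -> serialised entry (shortened here for brevity).
        let mut id2entry: BTreeMap<u64, &str> = BTreeMap::new();
        id2entry.insert(4, "{ 'Entry': { 'name': ['john'], ... } }");

        // index_eq_name: value -> idl (the IDs of every entry where name equals that value).
        let mut index_eq_name: BTreeMap<&str, Vec<u64>> = BTreeMap::new();
        index_eq_name.insert("john", vec![4]);

        // Eq(name, john): consult the index first, then fetch only the matching IDs.
        if let Some(idl) = index_eq_name.get("john") {
            for id in idl {
                if let Some(entry) = id2entry.get(id) {
                    println!("match id {}: {}", id, entry);
                }
            }
        }
    }
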
To improve on this, if we had a query like Or(Eq(name, john), Eq(name, kris)), we can use our
indexes to speed this up.

We would query index_eq_name again, and we would perform the search for both john and kris. Because
this is an Or, we then union the two idls, and we would have:

    [04, 05,]

Now we just have to get entries 04, 05 from id2entry, and we have our matching query. This means
filters are often applied as idl set operations.
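
A small sketch of such idl set operations on sorted, de-duplicated ID lists; this only illustrates
the idea, since the real implementation uses the idlset library mentioned below:

    // Or(...) -> union of idls; And(...) -> intersection of idls.
    fn idl_union(a: &[u64], b: &[u64]) -> Vec<u64> {
        let mut out: Vec<u64> = a.iter().chain(b.iter()).copied().collect();
        out.sort_unstable();
        out.dedup();
        out
    }

    fn idl_intersection(a: &[u64], b: &[u64]) -> Vec<u64> {
        // Assumes both inputs are sorted; walk them in lockstep.
        let (mut i, mut j) = (0, 0);
        let mut out = Vec::new();
        while i < a.len() && j < b.len() {
            if a[i] == b[j] {
                out.push(a[i]);
                i += 1;
                j += 1;
            } else if a[i] < b[j] {
                i += 1;
            } else {
                j += 1;
            }
        }
        out
    }

    fn main() {
        // Or(Eq(name, john), Eq(name, kris)) -> union of the two idls.
        assert_eq!(idl_union(&[4], &[5]), vec![4, 5]);
        // And(...) -> intersection.
        assert_eq!(idl_intersection(&[4, 5], &[5]), vec![5]);
    }
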
Compressed ID lists
-------------------

In order to make idl loading faster, and the set operations faster, there is an idl library
(developed by me, firstyear) which will be used for this. To read more see:

https://github.com/Firstyear/idlset

Filter Optimisation
-------------------

Filter optimisation begins to play an important role when we have indexes. If we indexed
something like "Pres(class)", then the idl for that search is the set of all database
entries. Similarly, if our database of 1 million entries has 250,000 class=person, then
Eq(class, person) will have an idl containing 250,000 ids. Even with idl compression, this
is still a lot of data!

There tend to be two types of searches against a directory like kanidm:

* Broad searches
* Targeted single entry searches

For broad searches, filter optimising does little - we just have to load those large idls and
use them. (Yes, loading the large idl and using it is still better than a full table scan though!)

However, for targeted searches, filter optimising really helps.

Imagine a query like:

    And(Eq(class, person), Eq(name, claire))

In this case, with our database of 250,000 persons, our idls would be:

    And( idl[250,000 ids], idl(1 id))

which means the result will always be the *single* id in the idl, or *no* value because it wasn't
present.

We add a single concept to the server called the "filter test threshold". This is the point at which
a partially-resolved candidate set is shortcut, and we then apply the filter in
the manner of a full table scan to the partial set, because that will be faster than the index
loading and testing.

With this test threshold, there exist two possibilities for this filter.

    And( idl[250,000 ids], idl(1 id))

We load the 250,000 entry idl and then perform the intersection with the idl of 1 value, and result in 1 or 0.

    And( idl(1 id), idl[250,000 ids])

We load the single idl value for name, and then as we are below the test threshold we shortcut out
and apply the filter to that one entry - yielding a match or no match.

Notice in the second, by promoting the "smaller" idl, we were able to save the work of the idl load
and intersection, as our first equality on "name" was more targeted?

Filter optimisation is about re-arranging these filters in the server, using our insight into the
data, to provide faster searches and to avoid costly index loads unless they are needed.

In this case, we would *demote* any filter term like Eq(class, ...) to the *end* of the And, because it
is highly likely to be less targeted than the other Eq types. Another example would be promotion
of Eq filters to the front of an And over a Sub term, where Sub indexes tend to be larger and have
longer IDLs.
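
A rough sketch of such a re-arrangement; the enum, cost numbers, and ordering heuristic below are
invented for illustration and are not kanidm's actual optimiser:

    // Illustrative sketch: order And terms so the cheapest (most targeted) idls are tested first.
    #[derive(Debug)]
    enum FilterComp {
        Eq(String, String),
        Sub(String, String),
        Pres(String),
    }

    fn estimated_cost(f: &FilterComp) -> u64 {
        match f {
            // Eq on "class" is usually broad, so push it towards the end.
            FilterComp::Eq(attr, _) if attr == "class" => 1_000_000,
            FilterComp::Eq(_, _) => 10,
            // Sub indexes tend to have longer idls than Eq.
            FilterComp::Sub(_, _) => 1_000,
            // Pres matches huge portions of the database.
            FilterComp::Pres(_) => 10_000_000,
        }
    }

    fn optimise_and(mut terms: Vec<FilterComp>) -> Vec<FilterComp> {
        terms.sort_by_key(estimated_cost);
        terms
    }

    fn main() {
        let f = vec![
            FilterComp::Eq("class".into(), "person".into()),
            FilterComp::Eq("name".into(), "claire".into()),
        ];
        // After optimisation, Eq(name, claire) comes first, Eq(class, person) last.
        println!("{:?}", optimise_and(f));
    }
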
designs/schema.rst (new file, 117 lines)

@@ -0,0 +1,117 @@

Schema
------

Schema is one of the three foundational concepts of the server, along with filters and entries.
Schema defines how attribute values *must* be represented, sorted, indexed and more. It also
defines what attributes could exist on an entry.

Why Schema?
-----------

The way that the server is designed, you could extract the backend parts and just have "Entries"
with no schema. That's totally valid if you want!

However, usually in the world all data maintains some form of structure, even if loose. We want to
have ways to say a database entry represents a person, and what a person requires.

Attributes
----------

In the entry document, I discuss that avas have a single attribute, and 1 to infinite values that
are utf8 case sensitive strings. With schema attribute types we can constrain these avas on an
entry.

For example, while the entry may be capable of holding 1 to infinite "name" values, the schema
defines that only one name is valid on the entry. Addition of a second name would be a violation. Of
course, schema also defines "multi-value", our usual 1 to infinite value storage concept.

Schema can also define that values of the attribute must conform to a syntax. For example, name
is a case *insensitive* string. So despite the fact that avas store case-sensitive data, all inputs
to name will be normalised to a lowercase form for faster matching. There are a number of syntax
types built into the server, and we'll add more later.

Finally, an attribute can be defined as indexed, and in which ways it can be indexed. We often will
want to search for "mail" on a person, so we can define in the schema that mail is indexed by the
backend indexing system. We don't define *how* the index is built - only that some index should exist
for when a query is made.
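
As a sketch of what such an attribute definition might carry (the field and type names are
assumptions for illustration, not the server's real schema types):

    // Illustrative sketch of an attribute definition; the real kanidm type differs in detail.
    enum Syntax {
        Utf8StringInsensitive,
        Utf8String,
    }

    enum IndexType {
        Equality,
        Substring,
        Presence,
    }

    struct SchemaAttribute {
        name: String,
        multivalue: bool,      // false: only a single value is allowed on the entry
        syntax: Syntax,        // how values are normalised and compared
        index: Vec<IndexType>, // which index types the backend should maintain
    }
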
Classes
-------

So while we have attributes that define "what is valid in the avas", classes define "which attributes
can exist on the entry itself".

A class defines requirements that are "may", "must", "systemmay", "systemmust". The system- variants
exist so that we can ship what we believe are good definitions. The may and must exist so you can
edit and extend our classes with your extra attribute fields (but it may be better just to add
your own class types :) ).

An attribute in a class marked as "may" is optional on the entry. It can be present as an ava, or
it may not be.

An attribute in a class marked as "must" is required on the entry. An ava that is valid to the
attribute syntax is required on this entry.

An attribute that is not "may" or "must" can not be present on this entry.

Let's imagine we have a class (pseudo example) of "person". We'll make it:

    Class {
        "name": "person",
        "systemmust": ["name"],
        "systemmay": ["mail"]
    }

If we had an entry such as:

    Entry {
        "class": ["person"],
        "uid": ["bob"],
        "mail": ["bob@email"]
    }

This would be invalid: we are missing the "systemmust" name attribute. It's also invalid because uid
is not present in systemmust or systemmay.

    Entry {
        "class": ["person"],
        "name": ["claire"],
        "mail": ["claire@email"]
    }

This entry is now valid. We have met the must requirement of name, and we have the optional
mail ava populated. The following is also valid:

    Entry {
        "class": ["person"],
        "name": ["claire"],
    }

Classes are 'additive' - this means given two classes on an entry, the must/may are unioned, and the
strongest rule is applied to attribute presence.

Imagine we also have:

    Class {
        "name": "person",
        "systemmust": ["name"],
        "systemmay": ["mail"]
    }

    Class {
        "name": "emailperson",
        "systemmust": ["mail"]
    }

With our entry now, this turns the "may" from person into a "must" because of the emailperson
class. On our entry Claire, that means this entry below is now invalid:

    Entry {
        "class": ["person", "emailperson"],
        "name": ["claire"],
    }

Simply adding an ava of mail back to the entry would make it valid once again.
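
As a rough sketch of the must/may validation described above (the types and names are invented for
illustration, not the server's actual schema code):

    use std::collections::{BTreeMap, BTreeSet};

    // Illustrative only: names and shapes are invented, not kanidm's real schema types.
    struct Class {
        systemmust: BTreeSet<String>,
        systemmay: BTreeSet<String>,
    }

    fn validate(entry: &BTreeMap<String, Vec<String>>, classes: &[&Class]) -> Result<(), String> {
        // Classes are additive: union the must/may sets of every class on the entry.
        let must: BTreeSet<&String> = classes.iter().flat_map(|c| c.systemmust.iter()).collect();
        let may: BTreeSet<&String> = classes.iter().flat_map(|c| c.systemmay.iter()).collect();

        // Every "must" attribute has to be present as an ava.
        for attr in &must {
            if !entry.contains_key(*attr) {
                return Err(format!("missing must attribute: {}", attr));
            }
        }
        // Any other attribute has to at least be allowed by "may" (or be a must).
        for attr in entry.keys() {
            if attr != "class" && !must.contains(attr) && !may.contains(attr) {
                return Err(format!("attribute not permitted by class: {}", attr));
            }
        }
        Ok(())
    }

With the person/emailperson classes above, the claire entry without mail would fail the must check
for mail, and adding the mail ava back would make it pass again.
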
@@ -158,56 +158,6 @@ impl Filter<FilterInvalid> {
            }
            _ => panic!(),
        }

        /*
        match self {
            Filter::Eq(attr, value) => match schema_attributes.get(attr) {
                Some(schema_a) => schema_a.validate_value(value),
                None => Err(SchemaError::InvalidAttribute),
            },
            Filter::Sub(attr, value) => match schema_attributes.get(attr) {
                Some(schema_a) => schema_a.validate_value(value),
                None => Err(SchemaError::InvalidAttribute),
            },
            Filter::Pres(attr) => {
                // This could be better as a contains_key
                // because we never use the value
                match schema_attributes.get(attr) {
                    Some(_) => Ok(()),
                    None => Err(SchemaError::InvalidAttribute),
                }
            }
            Filter::Or(filters) => {
                // This should never happen because
                // optimising should remove them as invalid parts?
                if filters.len() == 0 {
                    return Err(SchemaError::EmptyFilter);
                };
                filters.iter().fold(Ok(()), |acc, filt| {
                    if acc.is_ok() {
                        self.validate(filt)
                    } else {
                        acc
                    }
                })
            }
            Filter::And(filters) => {
                // This should never happen because
                // optimising should remove them as invalid parts?
                if filters.len() == 0 {
                    return Err(SchemaError::EmptyFilter);
                };
                filters.iter().fold(Ok(()), |acc, filt| {
                    if acc.is_ok() {
                        self.validate(filt)
                    } else {
                        acc
                    }
                })
            }
            Filter::Not(filter) => self.validate(filter),
        }
        */
    }

    pub fn from(f: &ProtoFilter) -> Self {

@@ -1040,6 +1040,13 @@ mod tests {
        })
    }

    #[test]
    fn test_modify_invalid_class() {
        // Test modifying an entry and adding an extra class, that would cause the entry
        // to no longer conform to schema.
        unimplemented!()
    }

    #[test]
    fn test_qs_delete() {
        run_test!(|_log, mut server: QueryServer, audit: &mut AuditScope| {