diff --git a/designs/access_profiles_and_security.rst b/designs/access_profiles_and_security.rst
index 8e4f85f9d..f5d7c5915 100644
--- a/designs/access_profiles_and_security.rst
+++ b/designs/access_profiles_and_security.rst
@@ -28,6 +28,109 @@ An example is that user Alice should only be able to search for objects where th
is person, and where they are a memberOf "visible" group. Alice should only be able to see those
users displayNames (not their legalName for example), and their public email.
+Worded a bit differently: you need permission over the scope of entries, you need to be able
+to read an attribute to filter on it, and you need to be able to read an attribute to receive
+it in the result entry.
+
+Threat: If we search for '(&(name=william)(secretdata=x))', we should not allow this to
+proceed, because you don't have the rights to read secret data, so you should not be allowed
+to filter on it. How does this work with two overlapping ACPs? For example, one that allows read
+of name and description for class = group, and one that allows name for class = user. We don't want
+'(&(name=x)(description=foo))' to be allowed, because we don't know the target class
+of the filter. Do we "unmatch" all users because they have no access to the filter components? (This
+could be done by inverting and appending an AndNot of the non-matchable overlaps.) Or do we just
+filter out description from the users returned? (But that implies they DID match, which is a disclosure.)
+
+More concrete:
+
+    search {
+        action: allow
+        targetscope: Eq("class", "group")
+        targetattr: name
+        targetattr: description
+    }
+
+    search {
+        action: allow
+        targetscope: Eq("class", "user")
+        targetattr: name
+    }
+
+    SearchRequest {
+        ...
+        filter: And: {
+            Pres("name"),
+            Pres("description"),
+        }
+    }
+
+A potential defense is:
+
+    acp class group: Pres(name) and Pres(desc) both in target attr, allow
+    acp class user: Pres(name) allow, Pres(desc) deny.
Invert and Append
+
+    So the filter now is:
+        And: {
+            AndNot: {
+                Eq("class", "user")
+            },
+            And: {
+                Pres("name"),
+                Pres("description"),
+            },
+        }
+
+    This would now only allow access to the name/desc of group.
+
+If we extend this to a third ACP, this would still work. But consider a more complex example:
+
+    search {
+        action: allow
+        targetscope: Eq("class", "group")
+        targetattr: name
+        targetattr: description
+    }
+
+    search {
+        action: allow
+        targetscope: Eq("class", "user")
+        targetattr: name
+    }
+
+    search {
+        action: allow
+        targetscope: And(Eq("class", "user"), Eq("name", "william"))
+        targetattr: description
+    }
+
+Now we have a single user where we can read desc. So the compiled filter above becomes:
+
+    And: {
+        AndNot: {
+            Eq("class", "user")
+        },
+        And: {
+            Pres("name"),
+            Pres("description"),
+        },
+    }
+
+This compiled filter would now be incorrect: the AndNot excludes everything of class=user, so
+william, whose description we *are* allowed to read, would be excluded as well. We also may not
+even have "class=user" in the second ACP, so we can't use subset filter matching to merge the two.
+
+As a result, I think the only possible valid solution is to perform the initial filter, then
+determine on the candidates if we *could* have valid access to filter on all required attributes.
+That is, even with an index lookup, we are still required to perform some filter application on
+the candidates.
+
+I think this will mean that for each possible candidate, we have to apply all ACPs, then create a
+union of the resulting targetattrs, and then compare that set against the set of attributes in
+the filter.
+
+This will (potentially) be slow on large candidate sets, but could be sped up with parallelism,
+caching or other techniques. However, in the same step, we can also extract only the allowed
+read target attrs, so this is a valuable exercise.
+
 Delete Requirements
 -------------------
@@ -91,6 +194,8 @@ be best implemented as a compilation of self -> eq(uuid, self.uuid).
Implementation Details
----------------------
+CHANGE: Receiver should be a group, and should it be single value/multivalue? It can *only* be a group.
+
 Example profiles:
     search {
diff --git a/designs/entries.rst b/designs/entries.rst
new file mode 100644
index 000000000..a5c865007
--- /dev/null
+++ b/designs/entries.rst
@@ -0,0 +1,38 @@
+
+Entries
+-------
+
+Entries are the base unit of data in this server. This is one of the three foundational concepts,
+along with filters and schema, that everything else builds upon.
+
+What is an Entry?
+-----------------
+
+An entry is a collection of attribute-values. These are sometimes called attribute-value
+assertions (avas), or attr-value sets. The attribute is a "key", and it holds 1 to infinite
+associated values. An entry can have many avas, which together make up the entry as a whole.
+An example entry (minus schema):
+
+    Entry {
+        "name": ["william"],
+        "mail": ["william@email", "email@william"],
+        "uuid": ["..."],
+    }
+
+There are only a few rules that hold true for entries.
+
+* UUID
+
+All entries *must* have a UUID attribute, and there must ONLY exist a single value. This UUID ava
+MUST be unique within the database, regardless of entry state (live, recycled, tombstoned etc.).
+
+* Zero values
+
+An attribute with zero values is removed from the entry.
+
+* Unsorted
+
+Values within an attribute are "not sorted" in any meaningful way for a client utility (in reality
+they are sorted by an undefined internal order for fast lookup/insertion).
+
+
+That's it.
diff --git a/designs/filter.rst b/designs/filter.rst
new file mode 100644
index 000000000..c403dac63
--- /dev/null
+++ b/designs/filter.rst
@@ -0,0 +1,163 @@
+
+Filters
+-------
+
+Filters (along with Entries and Schema) are one of the foundational concepts in the
+design of KaniDM. They are used in nearly every aspect of the server to provide
+checking and searching over entry sets.
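The filter types described in this document can be sketched as a small recursive enum with a
per-entry match routine. This is a simplified illustration only (hypothetical types, not the real
KaniDM implementation), assuming an entry is just a map of attribute names to value sets:

```rust
use std::collections::BTreeMap;

// Hypothetical simplified entry type: attribute name -> set of values.
type Entry = BTreeMap<String, Vec<String>>;

enum Filter {
    Pres(String),
    Eq(String, String),
    Sub(String, String),
    Or(Vec<Filter>),
    And(Vec<Filter>),
    AndNot(Box<Filter>),
}

impl Filter {
    // Apply this filter to a single entry: true if the entry matches.
    fn matches(&self, e: &Entry) -> bool {
        match self {
            Filter::Pres(attr) => e.contains_key(attr),
            Filter::Eq(attr, v) => e.get(attr).map_or(false, |vs| vs.iter().any(|x| x == v)),
            Filter::Sub(attr, v) => {
                e.get(attr).map_or(false, |vs| vs.iter().any(|x| x.contains(v.as_str())))
            }
            Filter::Or(fs) => fs.iter().any(|f| f.matches(e)),
            Filter::And(fs) => fs.iter().all(|f| f.matches(e)),
            // AndNot only makes sense against an existing candidate set;
            // per-entry it simply inverts the inner condition.
            Filter::AndNot(f) => !f.matches(e),
        }
    }
}

fn main() {
    let mut e: Entry = BTreeMap::new();
    e.insert("name".to_string(), vec!["william".to_string()]);
    e.insert("class".to_string(), vec!["person".to_string()]);

    // And(Eq(class, person), AndNot(Eq(name, claire))) matches william.
    let f = Filter::And(vec![
        Filter::Eq("class".to_string(), "person".to_string()),
        Filter::AndNot(Box::new(Filter::Eq("name".to_string(), "claire".to_string()))),
    ]);
    assert!(f.matches(&e));
    assert!(Filter::Sub("name".to_string(), "liam".to_string()).matches(&e));
    println!("ok");
}
```

The second useful property (reducing a set of entries to the matching subset) would then just be
filtering a collection with this same `matches` routine.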
+
+A filter is a set of requirements that the attribute-value pairs of an entry must
+conform to for the filter to be considered a "match". This has two useful properties:
+
+* We can apply a filter to a single entry to quickly determine whether assertions about that
+entry hold true.
+* We can apply a filter to a set of entries to reduce the set to only the matching entries.
+
+Filter Construction
+-------------------
+
+Filters are rooted in relational algebra and set mathematics. I am not an expert on either
+topic, and have learnt about their design from experience.
+
+* Presence
+
+The simplest filter is a "presence" test. It asserts that some attribute, regardless
+of its value, exists on the entry. For example, given the entries below:
+
+    Entry {
+        name: william
+    }
+
+    Entry {
+        description: test
+    }
+
+If we apply "Pres(name)", then we would only see the entry containing "name: william" as a matching
+result.
+
+* Equality
+
+Equality checks that an attribute and value are present on an entry. For example:
+
+    Entry {
+        name: william
+    }
+
+    Entry {
+        name: test
+    }
+
+If we apply Eq(name, william), only the first entry would match. If the attribute is multivalued,
+we only assert that one value in the set matches. For example:
+
+    Entry {
+        name: william
+    }
+
+    Entry {
+        name: test
+        name: claire
+    }
+
+In this case, applying Eq(name, claire) would match the second entry, as name=claire is present
+in the multivalue set.
+
+* Sub
+
+Substring checks that the substring exists in an attribute of the entry. This is a specialisation
+of equality, where the same value and multivalue handling holds true.
+
+    Entry {
+        name: william
+    }
+
+In this example, Sub(name, liam) would match, but Sub(name, air) would not.
+
+* Or
+
+Or contains multiple filters and asserts that if *any* of them is true, the condition as a whole
+holds true.
For example:
+
+    Entry {
+        name: claire
+    }
+
+Here the filter Or(Eq(name, claire), Eq(name, william)) is true, because Eq(name, claire)
+is true, thus the Or condition is true. If nothing inside the Or is true, it returns false.
+
+* And
+
+And returns true only if all inner filter conditions are true. If any is false, it
+yields false.
+
+    Entry {
+        name: claire
+        class: person
+    }
+
+For this example, And(Eq(class, person), Eq(name, claire)) would be true, but And(Eq(class, group),
+Eq(name, claire)) would be false.
+
+* AndNot
+
+AndNot is different from a logical not.
+
+If we had Not(Eq(name, claire)), then the logical result is "all entries where name is not
+claire". However, this is (today...) not very efficient. Instead, we have "AndNot", which asserts
+that a condition is not true of an existing candidate set. So the operation AndNot(Eq(name, claire))
+on its own would yield an empty set. AndNot is important when you need to check that something is
+also not true, but without retrieving all entries where that not holds. An example:
+
+    Entry {
+        name: william
+        class: person
+    }
+
+    Entry {
+        name: claire
+        class: person
+    }
+
+In this case, "And(Eq(class, person), AndNot(Eq(name, claire)))" would find all persons
+whose name is not claire: i.e. william. However, "AndNot(Eq(name, claire))" on its own would be
+an empty result. This is because no candidate set already exists, so there is nothing to return.
+
+
+Filter Schema Considerations
+----------------------------
+
+In order to make filters work properly, the server normalises entries on input to allow simpler
+comparisons and ordering in the actual search phases. This means that for a filter to operate,
+it too must be normalised and valid.
+
+If a filter requests an operation on an attribute that is not known to the schema, the operation
is rejected.
This is to prevent a denial of service attack where Eq(NonExist, value) would cause
+un-indexed full table scans to be performed, consuming server resources.
+
+In a filter request, the attribute name in use is normalised according to schema, as is
+the search value. For example, Eq(nAmE, Claire) would normalise to Eq(name, claire), as both
+the attribute name and the name syntax are UTF8_INSENSITIVE. However, displayName is case
+sensitive, so a search like Eq(displayName, Claire) would become Eq(displayname, Claire). Note
+that Claire remains cased.
+
+This means that instead of having costly routines to normalise entries on each read and search,
+we can normalise on entry modify and create; then we only need to ensure filters match, and we
+can do basic string comparisons as needed.
+
+
+Discussion
+----------
+
+Is it worth adding a true "not" type, and using that instead? It would be extremely costly on
+indexes or filter testing, but would logically be better than AndNot as a filter term.
+
+Not could be implemented as Not() -> And(Pres(class), AndNot()), which would
+yield the equivalent result, but it would consume a very large index component. In this case
+though, filter optimising would promote Eq > Pres, so we should be able to skip to a candidate
+test, or access the index and get the right result anyway rather than a full table scan.
+
+Additionally, Not/AndNot could be security risks, because they could be combined with And
+queries that allow them to bypass the filter-attribute permission check. Is there an example
+of using And(Eq, AndNot(Eq)) that could provide information disclosure about
+the status of an attribute, given a result/non-result when the AndNot is false/true?
+
diff --git a/designs/indexing.rst b/designs/indexing.rst
new file mode 100644
index 000000000..f93b2dd12
--- /dev/null
+++ b/designs/indexing.rst
@@ -0,0 +1,152 @@
+
+Indexing
+--------
+
+Indexing is deeply tied to the concept of filtering.
Indexes exist to make the application of a
+search term (filter) faster.
+
+World without indexing
+----------------------
+
+Almost all databases are built on top of a key-value storage engine of some nature. In our
+case we are using (Feb 2019) sqlite, and hopefully SLED in the future.
+
+Our entries, which contain sets of avas, are serialised into a byte format (Feb 2019: json,
+but soon cbor) and stored in a table of "id: entry". For example:
+
+    |----------------------------------------------------------------------------------------|
+    | ID | data                                                                               |
+    |----------------------------------------------------------------------------------------|
+    | 01 | { 'Entry': { 'name': ['name'], 'class': ['person'], 'uuid': ['...'] } }            |
+    | 02 | { 'Entry': { 'name': ['beth'], 'class': ['person'], 'uuid': ['...'] } }            |
+    | 03 | { 'Entry': { 'name': ['alan'], 'class': ['person'], 'uuid': ['...'] } }            |
+    | 04 | { 'Entry': { 'name': ['john'], 'class': ['person'], 'uuid': ['...'] } }            |
+    | 05 | { 'Entry': { 'name': ['kris'], 'class': ['person'], 'uuid': ['...'] } }            |
+    |----------------------------------------------------------------------------------------|
+
+The ID column is *private* to the backend implementation and is never revealed to the higher
+level components. However, the ID is very important to indexing :)
+
+If we wanted to find Eq(name, john) here, what would we need to do? A full table scan is where we
+perform:
+
+    data = sqlite.do(SELECT * from id2entry);
+    for row in data:
+        entry = deserialise(row)
+        entry.match_filter(...) // check Eq(name, john)
+
+For a small database (maybe up to 20 objects), this is probably fine. But once you get
+much larger, this is really costly. We continually load, deserialise, check and free data that
+is not relevant to the search. This is why full table scans of any database (sql, ldap, anything)
+are so costly. It's really, really scanning everything!
+
+How does indexing work?
+
-----------------------
+
+Indexing is a pre-computed lookup table of what you *might* search, in a specific format. Let's say
+in our example we have an equality index on "name" as an attribute. Now in our backend we define
+an extra table called "index_eq_name". Its contents would look like:
+
+    |------------------------------------------|
+    | index | idl    |
+    |------------------------------------------|
+    | alan  | [03, ] |
+    | beth  | [02, ] |
+    | john  | [04, ] |
+    | kris  | [05, ] |
+    | name  | [01, ] |
+    |------------------------------------------|
+
+So when we perform our search for Eq(name, john) again, we see name is indexed. We then perform:
+
+    SELECT * from index_eq_name where index=john;
+
+This gives us the idl (ID list) of [04,]: the IDs of every entry where name equals
+john.
+
+We can now take this back to our id2entry table and perform:
+
+    data = sqlite.do(SELECT * from id2entry where ID = 04)
+
+The key-value engine only gives us the entry for john, and we have a match! If id2entry had 1 million
+entries, a full table scan would be 1 million loads and compares - with the index, it was 2 loads and
+one compare. That's roughly 500,000x faster (potentially ;) )!
+
+To improve on this, if we had a query like Or(Eq(name, john), Eq(name, kris)), we can use our
+indexes to speed it up.
+
+We would query index_eq_name again, performing the search for both john and kris. Because
+this is an Or, we then union the two idls, and we would have:
+
+    [04, 05,]
+
+Now we just have to get entries 04 and 05 from id2entry, and we have our matching result. This means
+filters are often applied as idl set operations.
+
+Compressed ID lists
+-------------------
+
+In order to make idl loading and set operations faster, there is an idl library
+(developed by me, firstyear) which will be used for this.
To read more see:
+
+https://github.com/Firstyear/idlset
+
+Filter Optimisation
+-------------------
+
+Filter optimisation begins to play an important role when we have indexes. If we indexed
+something like Pres(class), then the idl for that search would be the set of all database
+entries. Similarly, if our database of 1 million entries has 250,000 class=person, then
+Eq(class, person) will have an idl containing 250,000 ids. Even with idl compression, this
+is still a lot of data!
+
+There tend to be two types of searches against a directory like kanidm:
+
+* Broad searches
+* Targeted single-entry searches
+
+For broad searches, filter optimising does little - we just have to load those large idls and
+use them. (Loading the large idl and using it is still better than a full table scan, though!)
+
+However, for targeted searches, filter optimising really helps.
+
+Imagine a query like:
+
+    And(Eq(class, person), Eq(name, claire))
+
+In this case, with our database of 250,000 persons, our idls would be:
+
+    And( idl[250,000 ids], idl(1 id) )
+
+This means the result will always be the *single* id in the second idl, or *no* value if it wasn't
+present.
+
+We add a single concept to the server called the "filter test threshold". This is the point at
+which a partially-resolved candidate set is shortcut out of index processing: we apply the filter
+to the partial set in the manner of a full table scan, because this will be faster than further
+index loading and testing.
+
+With this test threshold, there exist two possibilities for this filter.
+
+    And( idl[250,000 ids], idl(1 id) )
+
+We load the 250,000 id idl, then perform the intersection with the idl of 1 id, resulting in
+1 or 0 ids.
+
+    And( idl(1 id), idl[250,000 ids] )
+
+We load the single idl value for name, and then, as we are below the test threshold, we shortcut
+out and apply the filter to that one candidate entry - yielding a match or no match.
+
+Notice in the second case: by promoting the "smaller" idl, we were able to save the work of the
+large idl load and intersection, as our first equality on "name" was more targeted.
+
+Filter optimisation is about re-arranging these filters in the server, using our insight into the
+data, to provide faster searches and avoid costly index work unless it is needed.
+
+In this case, we would *demote* any Eq(class, ...) filter to the *end* of the And, because it
+is highly likely to be less targeted than the other Eq terms. Another example would be promotion
+of Eq filters to the front of an And over a Sub term, where Sub indexes tend to be larger and have
+longer idls.
+
+
+
diff --git a/designs/schema.rst b/designs/schema.rst
new file mode 100644
index 000000000..e1f2a2e79
--- /dev/null
+++ b/designs/schema.rst
@@ -0,0 +1,117 @@
+
+Schema
+------
+
+Schema is one of the three foundational concepts of the server, along with filters and entries.
+Schema defines how attribute values *must* be represented, sorted, indexed and more. It also
+defines what attributes may exist on an entry.
+
+Why Schema?
+-----------
+
+The way that the server is designed, you could extract the backend parts and just have "Entries"
+with no schema. That's totally valid if you want!
+
+However, most data in the world maintains some form of structure, even if loose. We want to
+have ways to say a database entry represents a person, and what a person requires.
+
+Attributes
+----------
+
+In the entry document, I discuss that avas have a single attribute and 1 to infinite values, which
+are utf8 case-sensitive strings. With schema attribute types, we can constrain these avas on an
+entry.
+
+For example, while the entry may be capable of holding 1 to infinite "name" values, the schema
+defines that only one name is valid on the entry. Addition of a second name would be a violation. Of
+course, schema also defines "multi-value", our usual 1 to infinite value storage concept.
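The single-value constraint above can be sketched in code. This is a hypothetical simplification
(not the real KaniDM schema types), where a schema attribute records whether it is multi-value,
and validation rejects an ava set holding more than one value for a single-value attribute:

```rust
#[derive(Debug, PartialEq)]
enum SchemaError {
    // Hypothetical error variant for this sketch.
    SingleValueConstraint,
}

struct SchemaAttribute {
    name: &'static str,
    multivalue: bool,
}

impl SchemaAttribute {
    // Reject an ava set with more than one value if the attribute is single-value.
    fn validate_ava(&self, values: &[&str]) -> Result<(), SchemaError> {
        if !self.multivalue && values.len() > 1 {
            return Err(SchemaError::SingleValueConstraint);
        }
        Ok(())
    }
}

fn main() {
    let name = SchemaAttribute { name: "name", multivalue: false };
    let mail = SchemaAttribute { name: "mail", multivalue: true };

    // A second "name" value violates the single-value rule.
    assert_eq!(
        name.validate_ava(&["william", "bill"]),
        Err(SchemaError::SingleValueConstraint)
    );
    // "mail" is multi-value: any number of values is fine.
    assert!(mail.validate_ava(&["claire@email", "c@email"]).is_ok());
    println!("checked: {} {}", name.name, mail.name);
}
```

A real implementation would also check the value syntax here, as the next paragraphs describe.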
+
+Schema can also define that values of the attribute must conform to a syntax. For example, name
+is a case *insensitive* string. So despite the fact that avas store case-sensitive data, all inputs
+to name will be normalised to a lowercase form for faster matching. There are a number of syntax
+types built into the server, and we'll add more later.
+
+Finally, an attribute can be defined as indexed, along with the ways in which it can be indexed. We
+often want to search for "mail" on a person, so we can define in the schema that mail is indexed by
+the backend indexing system. We don't define *how* the index is built - only that some index should
+exist for when a query is made.
+
+Classes
+-------
+
+So while attributes define "what is valid in the avas", classes define "which attributes
+can exist on the entry itself".
+
+A class defines requirements that are "may", "must", "systemmay", "systemmust". The system- variants
+exist so that we can ship what we believe are good definitions. The may and must exist so that you
+can edit and extend our classes with your extra attribute fields (but it may be better just to add
+your own class types :) ).
+
+An attribute in a class marked as "may" is optional on the entry. It can be present as an ava, or
+it may not be.
+
+An attribute in a class marked as "must" is required on the entry. An ava that is valid to the
+attribute syntax is required on this entry.
+
+An attribute that is not "may" or "must" cannot be present on this entry.
+
+Let's imagine we have a class (pseudo example) of "person". We'll make it:
+
+    Class {
+        "name": "person",
+        "systemmust": ["name"],
+        "systemmay": ["mail"]
+    }
+
+If we had an entry such as:
+
+    Entry {
+        "class": ["person"],
+        "uid": ["bob"],
+        "mail": ["bob@email"]
+    }
+
+This would be invalid: we are missing the "systemmust" name attribute. It's also invalid because uid
+is not present in systemmust or systemmay.
+
+    Entry {
+        "class": ["person"],
+        "name": ["claire"],
+        "mail": ["claire@email"]
+    }
+
+This entry is now valid. We have met the must requirement of name, and we have the optional
+mail ava populated. The following is also valid:
+
+    Entry {
+        "class": ["person"],
+        "name": ["claire"],
+    }
+
+Classes are "additive" - this means that given two classes on an entry, the must/may sets are
+unioned, and the strongest rule is applied to attribute presence.
+
+Imagine we also have:
+
+    Class {
+        "name": "person",
+        "systemmust": ["name"],
+        "systemmay": ["mail"]
+    }
+
+    Class {
+        "name": "emailperson",
+        "systemmust": ["mail"]
+    }
+
+On our entry, this turns the "may" of mail from person into a "must", because of the emailperson
+class. That means the entry below for claire is now invalid:
+
+    Entry {
+        "class": ["person", "emailperson"],
+        "name": ["claire"],
+    }
+
+Simply adding an ava of mail back to the entry would make it valid once again.
+
+
diff --git a/src/lib/filter.rs b/src/lib/filter.rs
index fb7813722..4b8081360 100644
--- a/src/lib/filter.rs
+++ b/src/lib/filter.rs
@@ -158,56 +158,6 @@ impl Filter {
         }
             _ => panic!(),
         }
-
-        /*
-        match self {
-            Filter::Eq(attr, value) => match schema_attributes.get(attr) {
-                Some(schema_a) => schema_a.validate_value(value),
-                None => Err(SchemaError::InvalidAttribute),
-            },
-            Filter::Sub(attr, value) => match schema_attributes.get(attr) {
-                Some(schema_a) => schema_a.validate_value(value),
-                None => Err(SchemaError::InvalidAttribute),
-            },
-            Filter::Pres(attr) => {
-                // This could be better as a contains_key
-                // because we never use the value
-                match schema_attributes.get(attr) {
-                    Some(_) => Ok(()),
-                    None => Err(SchemaError::InvalidAttribute),
-                }
-            }
-            Filter::Or(filters) => {
-                // This should never happen because
-                // optimising should remove them as invalid parts?
- if filters.len() == 0 { - return Err(SchemaError::EmptyFilter); - }; - filters.iter().fold(Ok(()), |acc, filt| { - if acc.is_ok() { - self.validate(filt) - } else { - acc - } - }) - } - Filter::And(filters) => { - // This should never happen because - // optimising should remove them as invalid parts? - if filters.len() == 0 { - return Err(SchemaError::EmptyFilter); - }; - filters.iter().fold(Ok(()), |acc, filt| { - if acc.is_ok() { - self.validate(filt) - } else { - acc - } - }) - } - Filter::Not(filter) => self.validate(filter), - } - */ } pub fn from(f: &ProtoFilter) -> Self { diff --git a/src/lib/server.rs b/src/lib/server.rs index 9c98d1b62..36ff70e7d 100644 --- a/src/lib/server.rs +++ b/src/lib/server.rs @@ -1040,6 +1040,13 @@ mod tests { }) } + #[test] + fn test_modify_invalid_class() { + // Test modifying an entry and adding an extra class, that would cause the entry + // to no longer conform to schema. + unimplemented!() + } + #[test] fn test_qs_delete() { run_test!(|_log, mut server: QueryServer, audit: &mut AuditScope| {