Add design documents as drafts

William Brown 2019-02-14 12:49:45 +10:00
parent 651abe3762
commit 84ff865304
7 changed files with 582 additions and 50 deletions

@ -28,6 +28,109 @@ An example is that user Alice should only be able to search for objects where th
is person, and where they are a memberOf "visible" group. Alice should only be able to
see those users' displayNames (not their legalName, for example), and their public email.
Worded a bit differently. You need permission over the scope of entries, you need to be able
to read the attribute to filter on it, and you need to be able to read the attribute to receive
it in the result entry.
Threat: If we search for '(&(name=william)(secretdata=x))', we should not allow this to
proceed because you don't have the rights to read secret data, so you should not be allowed
to filter on it. How does this work with two overlapping ACPs? For example one that allows read
of name and description to class = group, and one that allows name to user. We don't want to
say '(&(name=x)(description=foo))' and have it allowed, because we don't know the target class
of the filter. Do we "unmatch" all users because they have no access to the filter components? (Could
be done by inverting and putting in an AndNot of the non-matchable overlaps). Or do we just
filter out description from the users returned (but that implies they DID match, which is a disclosure).
More concrete:
search {
    action: allow
    targetscope: Eq("class", "group")
    targetattr: name
    targetattr: description
}
search {
    action: allow
    targetscope: Eq("class", "user")
    targetattr: name
}
SearchRequest {
    ...
    filter: And: {
        Pres("name"),
        Pres("description"),
    }
}
A potential defense is:
    acp class group: Pres(name) and Pres(desc) both in target attr, allow
    acp class user: Pres(name) allow, Pres(desc) deny. Invert and Append
So the filter now is:
And: {
    AndNot: {
        Eq("class", "user")
    },
    And: {
        Pres("name"),
        Pres("description"),
    },
}
This would now only allow access to the name/desc of group.
If we extend this to a third ACP, this would still work. But consider a more complex example:
search {
    action: allow
    targetscope: Eq("class", "group")
    targetattr: name
    targetattr: description
}
search {
    action: allow
    targetscope: Eq("class", "user")
    targetattr: name
}
search {
    action: allow
    targetscope: And(Eq("class", "user"), Eq("name", "william"))
    targetattr: description
}
Now we have a single user where we can read desc. So the compiled filter above becomes:
And: {
    AndNot: {
        Eq("class", "user")
    },
    And: {
        Pres("name"),
        Pres("description"),
    },
}
This would now be invalid: first, the inverted class=user term would also exclude william,
even though the third ACP grants read of description there. We also may not even have "class=user"
in the second ACP, so we can't use subset filter matching to merge the two.
As a result, I think the only possible valid solution is to perform the initial filter, then determine
on the candidates if we *could* have valid access to filter on all required attributes. This means
that even with an index look up, we are still required to perform some filter application
on the candidates.
I think this will mean that, for each possible candidate, we have to apply all ACPs, then create a
union of the resulting targetattrs, and then compare that set against the set of attributes in the
filter. This will (potentially) be slow on large candidate sets, but could be sped up with
parallelism, caching or other techniques. However, in the same step, we can also extract only the
allowed read target attrs, so this is a valuable exercise. A sketch of this check is below.
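As a rough sketch in Rust (hypothetical types and names, not the server's real API), the
per-candidate check could look like:

use std::collections::BTreeSet;

// Hypothetical stand-ins for the server's entry and filter types.
struct Filter;
struct Entry;

impl Entry {
    fn matches(&self, _scope: &Filter) -> bool {
        // Placeholder: real code would evaluate the targetscope filter.
        true
    }
}

struct AccessControlSearch {
    targetscope: Filter,
    targetattrs: BTreeSet<String>,
}

// For one candidate entry: union the targetattrs of every ACP whose
// targetscope matches the entry, then require that every attribute
// named in the search filter is inside that union. The same union can
// be reused to strip unreadable attributes from the returned entry.
fn filter_attrs_allowed(
    entry: &Entry,
    filter_attrs: &BTreeSet<String>,
    acps: &[AccessControlSearch],
) -> bool {
    let mut allowed: BTreeSet<String> = BTreeSet::new();
    for acp in acps {
        if entry.matches(&acp.targetscope) {
            allowed.extend(acp.targetattrs.iter().cloned());
        }
    }
    filter_attrs.is_subset(&allowed)
}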
Delete Requirements
-------------------
@ -91,6 +194,8 @@ be best implemented as a compilation of self -> eq(uuid, self.uuid).
Implementation Details
----------------------
CHANGE: Receiver should be a group, and should be single value/multivalue? Can *only* be a group.
Example profiles:
search {

38 designs/entries.rst Normal file
@ -0,0 +1,38 @@
Entries
-------
Entries are the base unit of data in this server. This is one of the three foundational concepts,
along with filters and schema, that everything else builds upon.
What is an Entry?
-----------------
An entry is a collection of attribute-value sets, sometimes called attribute-value-assertions
(avas). The attribute is a "key", and it holds 1 to infinite associated values. An entry
can have many avas, which together create the entry as a whole. An example entry (minus schema):
Entry {
    "name": ["william"],
    "mail": ["william@email", "email@william"],
    "uuid": ["..."],
}
There are only a few rules that are true in entries.
* UUID
All entries *must* have a UUID attribute, and there must ONLY exist a single value. This UUID ava
MUST be unique within the database, regardless of entry state (live, recycled, tombstoned etc).
* Zero values
An attribute with zero values is removed from the entry.
* Unsorted
Values within an attribute are "not sorted" in any meaningful way for a client utility (in reality
they are sorted by an undefined internal order for fast lookup/insertion).
That's it.
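As a minimal sketch (hypothetical types, not the server's real implementation), these rules map
naturally onto a map of attribute names to value sets:

use std::collections::{BTreeMap, BTreeSet};

// Sketch: an entry is a map of attribute name -> set of values. The
// BTreeSet keeps values deduplicated and in an internal order that is
// fast for lookup, matching the "unsorted for clients" rule.
struct Entry {
    avas: BTreeMap<String, BTreeSet<String>>,
}

impl Entry {
    // Zero-values rule: removing the last value removes the attribute.
    fn remove_value(&mut self, attr: &str, value: &str) {
        if let Some(vals) = self.avas.get_mut(attr) {
            vals.remove(value);
            if vals.is_empty() {
                self.avas.remove(attr);
            }
        }
    }

    // UUID rule: exactly one uuid value must be present.
    fn check_uuid(&self) -> bool {
        self.avas.get("uuid").map_or(false, |v| v.len() == 1)
    }
}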

163 designs/filter.rst Normal file
@ -0,0 +1,163 @@
Filters
-------
Filters (along with entries and schema) are one of the foundational concepts in the
design of KaniDM. They are used in nearly every aspect of the server to provide
checking and searching over entry sets.
A filter is a set of requirements that the attribute-value pairs of an entry must
conform to for the filter to be considered a "match". This has two useful properties:
* We can apply a filter to a single entry to quickly determine whether assertions about that
entry hold true.
* We can apply a filter to a set of entries to reduce the set only to the matching entries.
Filter Construction
-------------------
Filters are rooted in relational algebra and set mathematics. I am not an expert on either
topic, and have learnt about their design from experience.
* Presence
The simplest filter is a "presence" test. It asserts that some attribute, regardless
of its value, exists on the entry. For example, given the entries below:
Entry {
    name: william
}
Entry {
    description: test
}
If we apply "Pres(name)", then we would only see the entry containing "name: william" as a matching
result.
* Equality
Equality checks that an attribute and value are present on an entry. For example
Entry {
    name: william
}
Entry {
    name: test
}
If we apply Eq(name, william), only the first entry would match. If the attribute is multivalued,
we only assert that one value in the set is there. For example:
Entry {
    name: william
}
Entry {
    name: test
    name: claire
}
In this case, applying Eq(name, claire) would match the second entry, as name=claire is present
in the multivalue set.
* Sub
Substring checks that the substring exists in an attribute of the entry. This is a specialisation
of equality, where the same value and multivalue handling holds true.
Entry {
    name: william
}
In this example, Sub(name, liam) would match, but Sub(name, air) would not.
* Or
Or contains multiple filters, and asserts that if *any* of them is true, this condition
will hold true. For example:
Entry {
    name: claire
}
Here, the filter Or(Eq(name, claire), Eq(name, william)) will be true, because Eq(name, claire)
is true, thus the Or condition is true. If nothing inside the Or is true, it returns false.
* And
And checks that all inner filter conditions are true in order to return true. If any are false,
it will yield false.
Entry {
    name: claire
    class: person
}
For this example, And(Eq(class, person), Eq(name, claire)) would be true, but And(Eq(class, group),
Eq(name, claire)) would be false.
* AndNot
AndNot is different to a logical not.
If we had Not(Eq(name, claire)), then the logical result is "all entries where name is not
claire". However, this is (today...) not very efficient. Instead, we have "AndNot", which asserts
that a condition of a candidate set is not true. So the operation AndNot(Eq(name, claire)) on its
own would yield an empty set. AndNot is important when you need to check that something is also
not true, but without retrieving all entries where that not holds. An example:
Entry {
    name: william
    class: person
}
Entry {
    name: claire
    class: person
}
In this case "And(Eq(class, person), AndNot(Eq(name, claire)))". This would find all persons
where their name is also not claire: IE william. However, the following would be empty result.
"AndNot(Eq(name, claire))". This is because there is no candidate set already existing, so there
is nothing to return.
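To summarise the terms above, here is a minimal Rust sketch (hypothetical types, not the server's
real implementation) of a filter being applied to a single candidate entry:

use std::collections::{BTreeMap, BTreeSet};

// Hypothetical filter terms matching the descriptions above.
enum Filter {
    Pres(String),
    Eq(String, String),
    Sub(String, String),
    Or(Vec<Filter>),
    And(Vec<Filter>),
    AndNot(Box<Filter>),
}

// An entry as a map of attribute -> value set (see the entries document).
type Entry = BTreeMap<String, BTreeSet<String>>;

fn entry_match(e: &Entry, f: &Filter) -> bool {
    match f {
        Filter::Pres(attr) => e.contains_key(attr),
        Filter::Eq(attr, v) => e.get(attr).map_or(false, |vs| vs.contains(v)),
        Filter::Sub(attr, sub) => e
            .get(attr)
            .map_or(false, |vs| vs.iter().any(|v| v.contains(sub.as_str()))),
        Filter::Or(fs) => fs.iter().any(|inner| entry_match(e, inner)),
        Filter::And(fs) => fs.iter().all(|inner| entry_match(e, inner)),
        // Against a single candidate, AndNot is plain negation; it only
        // differs from a true Not over *sets*, because it never
        // generates candidates of its own.
        Filter::AndNot(inner) => !entry_match(e, inner),
    }
}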
Filter Schema Considerations
----------------------------
In order to make filters work properly, the server normalises entries on input to allow simpler
comparisons and ordering in the actual search phases. This means that for a filter to operate,
it too must be normalised and valid.
If a filter requests an operation on an attribute we do not know of in schema, the operation
is rejected. This is to prevent a denial of service attack where Eq(NonExist, value) would cause
un-indexed full table scans to be performed consuming server resources.
In a filter request, the attribute name in use is normalised according to schema, as is
the search value. For example, Eq(nAmE, Claire) would normalise to Eq(name, claire), as both
attrname and name are UTF8_INSENSITIVE. However, displayName is case sensitive, so a search like
Eq(displayName, Claire) would become Eq(displayname, Claire). Note Claire remains cased.
This means that instead of having costly routines to normalise entries on each read and search,
we can normalise on entry modify and create, then we only need to ensure filters match and we
can do basic string comparisons as needed.
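A minimal sketch of this normalisation step, assuming a hypothetical syntax enum rather than the
real schema types:

// Sketch of attribute/value normalisation for an Eq term.
#[derive(Clone, Copy)]
enum Syntax {
    Utf8Insensitive,
    Utf8Sensitive,
}

// Attribute names always fold to lowercase; values only fold when the
// attribute's syntax is case insensitive.
fn normalise_eq(attr: &str, value: &str, syntax: Syntax) -> (String, String) {
    let attr = attr.to_lowercase();
    let value = match syntax {
        Syntax::Utf8Insensitive => value.to_lowercase(),
        Syntax::Utf8Sensitive => value.to_string(),
    };
    (attr, value)
}

// normalise_eq("nAmE", "Claire", Syntax::Utf8Insensitive) -> ("name", "claire")
// normalise_eq("displayName", "Claire", Syntax::Utf8Sensitive) -> ("displayname", "Claire")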
Discussion
----------
Is it worth adding a true "not" type, and using that instead? It would be extremely costly on
indexes or filter testing, but would logically be better than AndNot as a filter term.
Not could be implemented as Not(<filter>) -> And(Pres(class), AndNot(<filter>)) which would
yield the equivalent result, but it would consume a very large index component. In this case
though, filter optimising would promote Eq over Pres, so we should be able to skip to a candidate
test, or access the index and get the right result anyway rather than a full table scan.
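A sketch of that rewrite, reusing the hypothetical Filter enum from the earlier sketch:

// Pres(class) generates the candidate set of all entries (every entry
// has a class), and AndNot then subtracts the inner match from it.
fn rewrite_not(inner: Filter) -> Filter {
    Filter::And(vec![
        Filter::Pres("class".to_string()),
        Filter::AndNot(Box::new(inner)),
    ])
}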
Additionally, Not/AndNot could be security risks because they could be combined with And
queries that allow them to bypass the filter-attribute permission check. Is there an example
of using And(Eq, AndNot(Eq)) that could be used to provide information disclosure about
the status of an attribute given a result/non result where the AndNot is false/true?

152 designs/indexing.rst Normal file
@ -0,0 +1,152 @@
Indexing
--------
Indexing is deeply tied to the concept of filtering. Indexes exist to make the application of a
search term (filter) faster.
World without indexing
----------------------
Almost all databases are built on top of a key-value storage engine of some nature. In our
case we are using (Feb 2019) sqlite, and hopefully SLED in the future.
Our entries, which contain sets of avas, are serialised into a byte format (Feb 2019: json,
but soon cbor) and stored in a table of "id: entry". For example:
|----------------------------------------------------------------------------------------|
| ID | data |
|----------------------------------------------------------------------------------------|
| 01 | { 'Entry': { 'name': ['name'], 'class': ['person'], 'uuid': ['...'] } } |
| 02 | { 'Entry': { 'name': ['beth'], 'class': ['person'], 'uuid': ['...'] } } |
| 03 | { 'Entry': { 'name': ['alan'], 'class': ['person'], 'uuid': ['...'] } } |
| 04 | { 'Entry': { 'name': ['john'], 'class': ['person'], 'uuid': ['...'] } } |
| 05 | { 'Entry': { 'name': ['kris'], 'class': ['person'], 'uuid': ['...'] } } |
|----------------------------------------------------------------------------------------|
The ID column is *private* to the backend implementation and is never revealed to the higher
level components. However, the ID is very important to indexing :)
If we wanted to find Eq(name, john) here, what do we need to do? A full table scan is where we
perform:
data = sqlite.do(SELECT * from id2entry);
for row in data:
    entry = deserialise(row)
    entry.match_filter(...) // check Eq(name, john)
For a small database (maybe up to 20 objects), this is probably fine. But once you start to get
much larger this is really costly. We continually load, deserialise, check and free data that
is not relevant to the search. This is why full table scans of any database (sql, ldap, anything)
are so costly. It's really really scanning everything!
How does indexing work?
-----------------------
Indexing is a pre-computed lookup table of what you *might* search in a specific format. Let's say
in our example we have an equality index on "name" as an attribute. Now in our backend we define
an extra table called "index_eq_name". Its contents would look like:
|------------------------------------------|
| index | idl |
|------------------------------------------|
| alan | [03, ] |
| beth | [02, ] |
| john | [04, ] |
| kris | [05, ] |
| name | [01, ] |
|------------------------------------------|
So when we perform our search for Eq(name, john) again, we see name is indexed. We then perform:
SELECT * from index_eq_name where index=john;
This would give us the idl (ID list) of [04,]. This is the "IDs of every entry where name equals
john".
We can now take this back to our id2entry table and perform:
data = sqlite.do(SELECT * from id2entry where ID = 04)
The key-value engine only gives us the entry for john, and we have a match! If id2entry had 1 million
entries, a full table scan would be 1 million loads and compares - with the index, it was 2 loads and
one compare. That's hundreds of thousands of times faster (potentially ;) )!
To improve on this, if we had a query like Or(Eq(name, john), Eq(name, kris)) we can use our
indexes to speed this up.
We would query index_eq_name again, performing the search for both john and kris. Because
this is an Or, we then union the two idls, and we would have:
[04, 05,]
Now we just have to get entries 04,05 from id2entry, and we have our matching query. This means
filters are often applied as idl set operations.
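As a sketch, assuming idls are sorted, deduplicated lists of u64 ids (the real implementation is
the idlset crate linked below), the Or case is a merge-style union:

// Sketch: with sorted, deduplicated id lists, Or becomes a set union
// (And is the analogous intersection: keep only ids present in both).
fn idl_union(a: &[u64], b: &[u64]) -> Vec<u64> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i] < b[j] {
            out.push(a[i]);
            i += 1;
        } else if b[j] < a[i] {
            out.push(b[j]);
            j += 1;
        } else {
            // Present in both: push once.
            out.push(a[i]);
            i += 1;
            j += 1;
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

// idl_union(&[4], &[5]) == vec![4, 5] - the Or(Eq(name, john), Eq(name, kris)) case.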
Compressed ID lists
-------------------
In order to make idl loading and the set operations faster, there is an idl library
(developed by me, Firstyear), which will be used for this. To read more, see:
https://github.com/Firstyear/idlset
Filter Optimisation
-------------------
Filter optimisation begins to play an important role when we have indexes. If we indexed
something like "Pres(class)", then the idl for that search is the set of all database
entries. Similarly, if our database of 1 million entries has 250,000 class=person, then
Eq(class, person) will have an idl containing 250,000 ids. Even with idl compression, this
is still a lot of data!
There tend to be two types of searches against a directory like kanidm.
* Broad searches
* Targeted single entry searches
For broad searches, filter optimising does little - we just have to load those large idls, and
use them. (Yes, loading the large idl and using it is still better than full table scan though!)
However, for targeted searches, filter optimising really helps.
Imagine a query like:
And(Eq(class, person), Eq(name, claire))
In this case, with our database of 250,000 persons, our idls would be:
And( idl[250,000 ids], idl(1 id))
Which means the result will always be the *single* id in the idl or *no* value because it wasn't
present.
We add a single concept to the server called the "filter test threshold". This is the point at which
a partially-processed candidate set is shortcut: rather than continuing the index loading and
testing, we apply the filter to the partial candidate set in the manner of a full table scan,
because that will be faster.
When we have this test threshold, there exist two possibilities for this filter.
And( idl[250,000 ids], idl(1 id))
We load 250,000 idl and then perform the intersection with the idl of 1 value, and result in 1 or 0.
And( idl(1 id), idl[250,000 ids])
We load the single idl for name, and then, as we are below the test threshold, we shortcut out
and apply the filter directly to that single candidate entry - yielding a match or no match.
Notice how in the second case, by promoting the "smaller" idl, we were able to save the work of the
idl load and intersection, as our first equality on "name" was more targeted?
Filter optimisation is about re-arranging these filters in the server, using our insight into the
data, to provide faster searches and to avoid costly index loads unless they are needed.
In this case, we would *demote* any filter of Eq(class, ...) to the *end* of the And, because it
is highly likely to be less targeted than the other Eq types. Another example would be promotion
of Eq filters to the front of an And over a Sub term, where Sub indexes tend to be larger and have
longer idls.
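A sketch of that re-ordering, reusing the hypothetical Filter enum from the filter design document
and a hand-written cost heuristic (the real server would estimate from index data):

// A stable sort keeps the original order within each rank.
fn and_term_rank(f: &Filter) -> u8 {
    match f {
        // Broad class equality is demoted to the end of the And.
        Filter::Eq(attr, _) if attr == "class" => 3,
        // Other equalities are the most targeted, so run them first.
        Filter::Eq(_, _) => 0,
        // Substring idls tend to be longer than equality idls.
        Filter::Sub(_, _) => 1,
        _ => 2,
    }
}

fn optimise_and(terms: &mut Vec<Filter>) {
    terms.sort_by_key(and_term_rank);
}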

117 designs/schema.rst Normal file
@ -0,0 +1,117 @@
Schema
------
Schema is one of the three foundational concepts of the server, along with filters and entries.
Schema defines how attribute values *must* be represented, sorted, indexed and more. It also
defines what attributes could exist on an entry.
Why Schema?
-----------
The way that the server is designed, you could extract the backend parts and just have "Entries"
with no schema. That's totally valid if you want!
However, usually in the world all data maintains some form of structure, even if loose. We want to
have ways to say a database entry represents a person, and what a person requires.
Attributes
----------
In the entry document, I discuss that avas have a single attribute, and 1 to infinite values that
are utf8 case sensitive strings. With schema attribute types, we can constrain these avas on an
entry.
For example, while the entry may be capable of holding 1 to infinite "name" values, the schema
defines that only one name is valid on the entry. Addition of a second name would be a violation. Of
course, schema also defines "multi-value", our usual 1 to infinite value storage concept.
Schema can also define that values of the attribute must conform to a syntax. For example, name
is a case *insensitive* string. So despite the fact that avas store case-sensitive data, all inputs
to name will be normalised to a lowercase form for faster matching. There are a number of syntax
types built into the server, and we'll add more later.
Finally, an attribute can be defined as indexed, and in which ways it can be indexed. We often will
want to search for "mail" on a person, so we can define in the schema that mail is indexed by the
backend indexing system. We don't define *how* the index is built - only that some index should exist
for when a query is made.
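As a sketch, a schema attribute definition might carry fields like these (illustrative names only,
not the server's exact types):

// Illustrative sketch of a schema attribute definition.
struct SchemaAttribute {
    name: String,           // normalised attribute name, e.g. "mail"
    multivalue: bool,       // false => at most one value on an entry
    syntax: Syntax,         // controls value normalisation and matching
    index: Vec<IndexType>,  // which index tables the backend should keep
}

enum Syntax {
    Utf8Insensitive, // e.g. name: values folded to lowercase
    Utf8Sensitive,   // e.g. displayName: values kept as given
}

enum IndexType {
    Equality,
    Presence,
    Substring,
}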
Classes
-------
So while we have attributes that define "what is valid in the avas", classes define "which attributes
can exist on the entry itself".
A class defines requirements that are "may", "must", "systemmay", "systemmust". The system- variants
exist so that we can ship what we believe are good definitions. The may and must exist so you can
edit and extend our classes with your extra attribute fields (but it may be better just to add
your own class types :) )
An attribute in a class marked as "may" is optional on the entry. It can be present as an ava, or
it may not be.
An attribute in a class marked as "must" is required on the entry. An ava that is valid to the
attribute syntax is required on this entry.
An attribute that is not "may" or "must" can not be present on this entry.
Let's imagine we have a class (pseudo example) of "person". We'll make it:
Class {
    "name": "person",
    "systemmust": ["name"],
    "systemmay": ["mail"]
}
If we had an entry such as:
Entry {
    "class": ["person"],
    "uid": ["bob"],
    "mail": ["bob@email"]
}
This would be invalid: We are missing the "systemmust" name attribute. It's also invalid because uid
is not present in systemmust or systemmay.
Entry {
    "class": ["person"],
    "name": ["claire"],
    "mail": ["claire@email"]
}
This entry is now valid. We have met the must requirement of name, and we have the optional
mail ava populated. The following is also valid.
Entry {
    "class": ["person"],
    "name": ["claire"],
}
Classes are 'additive' - this means given two classes on an entry, the must/may are unioned, and the
strongest rule is applied to attribute presence.
Imagine we also have:
Class {
    "name": "person",
    "systemmust": ["name"],
    "systemmay": ["mail"]
}
Class {
    "name": "emailperson",
    "systemmust": ["mail"]
}
With our entry now, the emailperson class turns the "may" of mail from person into a "must".
On our entry claire, that means this entry below is now invalid:
Entry {
    "class": ["person", "emailperson"],
    "name": ["claire"],
}
Simply adding an ava of mail back to the entry would make it valid once again.
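A minimal sketch of this additive checking (hypothetical types; systemmust/systemmay folded into
must/may for brevity):

use std::collections::BTreeSet;

// Sketch: union the must/may of every class on the entry, then apply
// the strongest rule. A "must" in any class beats a "may" in another,
// which is exactly the person + emailperson case above.
struct Class {
    must: BTreeSet<String>, // must and systemmust combined
    may: BTreeSet<String>,  // may and systemmay combined
}

fn validate(entry_attrs: &BTreeSet<String>, classes: &[Class]) -> bool {
    let mut must = BTreeSet::new();
    let mut may = BTreeSet::new();
    for c in classes {
        must.extend(c.must.iter().cloned());
        may.extend(c.may.iter().cloned());
    }
    // Every must is present, and nothing outside must/may is present.
    must.is_subset(entry_attrs)
        && entry_attrs.iter().all(|a| must.contains(a) || may.contains(a))
}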

@ -158,56 +158,6 @@ impl Filter<FilterInvalid> {
}
_ => panic!(),
}
    /*
    match self {
        Filter::Eq(attr, value) => match schema_attributes.get(attr) {
            Some(schema_a) => schema_a.validate_value(value),
            None => Err(SchemaError::InvalidAttribute),
        },
        Filter::Sub(attr, value) => match schema_attributes.get(attr) {
            Some(schema_a) => schema_a.validate_value(value),
            None => Err(SchemaError::InvalidAttribute),
        },
        Filter::Pres(attr) => {
            // This could be better as a contains_key
            // because we never use the value
            match schema_attributes.get(attr) {
                Some(_) => Ok(()),
                None => Err(SchemaError::InvalidAttribute),
            }
        }
        Filter::Or(filters) => {
            // This should never happen because
            // optimising should remove them as invalid parts?
            if filters.len() == 0 {
                return Err(SchemaError::EmptyFilter);
            };
            filters.iter().fold(Ok(()), |acc, filt| {
                if acc.is_ok() {
                    self.validate(filt)
                } else {
                    acc
                }
            })
        }
        Filter::And(filters) => {
            // This should never happen because
            // optimising should remove them as invalid parts?
            if filters.len() == 0 {
                return Err(SchemaError::EmptyFilter);
            };
            filters.iter().fold(Ok(()), |acc, filt| {
                if acc.is_ok() {
                    self.validate(filt)
                } else {
                    acc
                }
            })
        }
        Filter::Not(filter) => self.validate(filter),
    }
    */
}
pub fn from(f: &ProtoFilter) -> Self {

@ -1040,6 +1040,13 @@ mod tests {
})
}
#[test]
fn test_modify_invalid_class() {
// Test modifying an entry by adding an extra class that would cause the entry
// to no longer conform to schema.
unimplemented!()
}
#[test]
fn test_qs_delete() {
run_test!(|_log, mut server: QueryServer, audit: &mut AuditScope| {