mirror of
https://github.com/kanidm/kanidm.git
synced 2025-02-23 20:47:01 +01:00
* codespell run and spelling fixes * some clippying * minor fmt fix * making yamllint happy * adding codespell github action
171 lines
5.5 KiB
ReStructuredText
171 lines
5.5 KiB
ReStructuredText
Logging Design (Refactor)
|
|
-------------------------
|
|
|
|
Logging is how the server communicates to developers and administrators about the state
|
|
of the service, and how operations are performing and what they are doing. It's important
|
|
this is clear in how it communicates. Today (2020-05-12) the log has been written with
|
|
development in mind, and has a structure that has as a result, become hard to parse and
|
|
understand. This has motivated a rewrite of logging to improve how the servers state
|
|
and errors are communicated to users.
|
|
|
|
Use Cases
|
|
---------
|
|
|
|
* Developer Bug Reports
|
|
|
|
A developer should be able to see the internal state of the server, and why any decision
|
|
or logic path was taken, so that errors or logic can be analysed post-incident. The
|
|
information in the log and the source code should be enough to resolve any issues, as we
|
|
may not have LLDB access to any consumers site, or any effective reproducer.
|
|
|
|
* Security Audits
|
|
|
|
We must be able to see why any security decision was made, such as credential validation,
|
|
access control application, or group/claim issuing to a session. This should be connected
|
|
to the IP and other identifiers of the caller.
|
|
|
|
* Error Analysis
|
|
|
|
For an administrator, they must be able to determine why an operation is failing in detail
|
|
so they can advise on how a consumer or user could change their behaviour to improve the
|
|
situation (beyond the error messages we return).
|
|
|
|
* Performance
|
|
|
|
Administrators and Developers should be able to analyse fine grained information about the
|
|
performance of any operation, and make informed decisions about tuning (such as caches or
|
|
or threads), and developers should be able to identify code paths that are under pressure
|
|
and could be targets for improvement.
|
|
|
|
* Containers/Systemd
|
|
|
|
Logs should be emitted on stdout/stderr as this is the easiest interface for existing
|
|
log aggregation systems to collect data from.
|
|
|
|
Details
|
|
-------
|
|
|
|
As developers we should indicate what messages are relevant to what use case as part of the
|
|
message. Log levels are used in other services, but that allows messages to be missed. Instead
|
|
we log every "service" always, but filter them to different locations.
|
|
|
|
This leads to the following log categories:
|
|
|
|
* Analysis
|
|
* Display of all logic branches and why decision points or paths taken
|
|
* A unique event ID that associates related log messages
|
|
* Performance
|
|
* Cache and DB metrics available
|
|
* Performance frames of timing of key points
|
|
* Structure of the performance frames to understand the execution paths taken.
|
|
* Display of query optimisation
|
|
* Display of query planning and application
|
|
* Failure (server failure)
|
|
* Hard Errors
|
|
* Warning (admin should take action)
|
|
* Possible misconfiguration
|
|
* OperationError (user mistake, op mistake etc)
|
|
* All error reports and finalised result summaries logged
|
|
* The unique event ID is provided in any operation success or failure.
|
|
* Security (aka audit)
|
|
* Filtering of security sensitive attributes (via debug/display features)
|
|
* Display of sufficient information to establish a security picture of connected actions via the user's uuid/session id.
|
|
* Tracking of who-changed-what-when-why
|
|
* Replication
|
|
* TODO
|
|
|
|
It can be seen pretty quickly that multiple message types are useful across categories. For
|
|
example, the unique event id for all messages, how hard errors affect operation errors
|
|
or how an operation error can come from a security denial.
|
|
|
|
Logging must also remain a separate thread and async for performance.
|
|
|
|
This means that the best way to declare these logs is a unified log which can be filtered based
|
|
on the admins or consumers needs.
|
|
|
|
API
|
|
---
|
|
|
|
For all types, it's important that we can associate all related events correctly. When the
|
|
operation initiates we assign an event-id that is part of the audit trail.
|
|
|
|
Statistics
|
|
==========
|
|
|
|
Stats should be accumulated in a statistics variable so that we can determine possible
|
|
tuning and other events related. Useful stats would be:
|
|
|
|
* Cache Hits
|
|
* Cache Misses
|
|
* Cache Inclusions
|
|
|
|
* Number of Searches
|
|
* Number of Entries Modified
|
|
|
|
This would be then logged as a structured line such as:
|
|
|
|
{ 'entry_cache_miss': 8, 'idl_cache_miss': 8, 'entry_cache_hit': 16', .... }
|
|
|
|
This would also then be fed back to the global stats thread for averaging.
|
|
|
|
Performance
|
|
===========
|
|
|
|
The key metric for performance is time-in-function so it would be good to be able to
|
|
build a value like:
|
|
|
|
{
|
|
'name': 'do_search',
|
|
'time': x,
|
|
'pct': 100,
|
|
called: [
|
|
{
|
|
'name': 'filter2idl',
|
|
'time': x',
|
|
called: [],
|
|
},
|
|
{
|
|
...
|
|
}
|
|
]
|
|
}
|
|
|
|
This would allow a rich view of how much time went to any function at a high level, as then
|
|
further investigation can occur.
|
|
|
|
Query Analysis
|
|
==============
|
|
|
|
To analyse a query we need:
|
|
|
|
* The original query
|
|
* The optimised version, with index tagging/profiling choices.
|
|
* The idl's that were loaded and how the query was applied
|
|
* The idl of the final result set.
|
|
|
|
Security Events
|
|
===============
|
|
|
|
* What access controls were considered?
|
|
* Who authenticated and where from?
|
|
* Audit of who modified what when why.
|
|
|
|
Analysis
|
|
========
|
|
|
|
This is generally what is "debug" logging, which is just decision points and verbose
|
|
descriptions of what we went where.
|
|
|
|
Admin Notification
|
|
==================
|
|
|
|
This is warnings or errors that the admin should be aware of.
|
|
|
|
User Events
|
|
===========
|
|
|
|
This must associate what happened for a user
|
|
|
|
|
|
|