ADR-0014: Scan Metric Collection
Status: DRAFT
Date: 2022-09-01
Author(s): Max Maass (max.maass@iteratec.com)
Context
It can be difficult to notice that a scan setup has broken as long as the scan is still running and returning at least some results. For example, if a ZAP scan is configured with authentication but the credentials expire or the authentication system changes, the scan will not error out - it will simply run without authentication and thus provide far less data than it otherwise would. We currently do not have a good way of monitoring failure cases like this. We are therefore considering adding a metrics collection system to the secureCodeBox, so that we at least have the data on which alerting could be built.
The general consensus in the team is that the monitoring itself is not the job of the secureCodeBox - we want to provide the data on a best-effort basis, but we do not want to replace Grafana / Prometheus for the actual monitoring. This ADR thus deals with how metrics can be collected and sent on to a dedicated monitoring platform.
Metrics to Collect
In general, the metrics that could be collected fall into two types:
- Standard metrics for all scanners (runtime, number of findings, success or failure)
- Specific metrics for the individual scanners (sent requests and responses, number of responses with specific HTTP status codes, lines of code scanned, ...)
These two types of metrics must be handled very differently, so we will discuss them separately.
Standard Metrics
The standard set of metrics can be collected through the operator in combination with Kubernetes functionality. The following metrics should be collected:
- Running time of the scanner pod
- Number of findings identified by the parser
- ... TODO: Any more?
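As an illustration of how the operator-side collection might work, the following sketch derives the pod runtime from the start and completion timestamps that Kubernetes records on the scan job and combines it with the finding count reported back by the parser. The type names and the metrics shape are assumptions made for this example, not an existing secureCodeBox interface.

// Sketch only: field names and output shape are assumptions for illustration.
interface JobStatus {
  startTime: string;       // e.g. taken from the Kubernetes Job status
  completionTime: string;  // set once the scan container has finished
}

interface StandardMetrics {
  runtimeSeconds: number;
  findingCount: number;
}

function collectStandardMetrics(status: JobStatus, findingCount: number): StandardMetrics {
  const start = new Date(status.startTime).getTime();
  const end = new Date(status.completionTime).getTime();
  return {
    runtimeSeconds: Math.round((end - start) / 1000),
    findingCount,
  };
}

// Example: a scan that ran from 10:00:00 to 10:40:13 and produced 42 findings
console.log(collectStandardMetrics(
  { startTime: "2022-09-01T10:00:00Z", completionTime: "2022-09-01T10:40:13Z" },
  42,
));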
Scanner-specific Metrics
For the scanner-specific metrics, things get a bit more complicated. We need to differentiate two cases: either the data we are interested in is part of the machine-readable output format generated by the scanner, or it is only provided in the log output of the tool (but not in the results file). Metrics that are provided in neither of these two places and cannot be collected by Kubernetes in a standardized way are out of scope for this ADR.
The exact metrics we are interested in would vary depending on the scanner, and would have to be decided on a case-by-case basis.
Metrics from the Results File
Metrics that are contained in the file the lurker is pulling from the scanner pod can be extracted by the parser.
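As a minimal sketch of such an extraction (the finding shape and metric names are invented for illustration and do not match any particular scanner's format), a parser could derive metrics while walking the results it already processes:

// Sketch only: the finding shape and metric names are hypothetical.
interface ParsedFinding {
  severity: "INFORMATIONAL" | "LOW" | "MEDIUM" | "HIGH";
}

// Derive simple metrics from the findings the parser has already extracted
// from the results file pulled by the lurker.
function deriveMetrics(findings: ParsedFinding[]): Record<string, number> {
  const metrics: Record<string, number> = { findingCount: findings.length };
  for (const finding of findings) {
    const key = `findings_${finding.severity.toLowerCase()}`;
    metrics[key] = (metrics[key] ?? 0) + 1;
  }
  return metrics;
}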
Other Metrics
Some tools do not write the relevant metrics to their output file (e.g., ZAP with its number of requests and their HTTP response codes). In these cases, there are different options for where the metrics may come from. They could be part of the logs printed to the standard output, or exposed by the application on a REST interface while it is running (as is the case for ZAP). In these cases, the data can only be collected while the scanner job is running.
Assuming we are interested enough in these metrics that we are willing to invest some time, they can be collected using a sidecar container. This sidecar can collect metrics from the running application via its REST interface and write them to a file upon completion of the job. Similarly, it can wait for the main container to finish executing before pulling in a previously-specified logfile (which would be filled by adding | tee /path/to/file or similar to the invocation of the main executable). It can then pull the relevant information from the logfile and put it into the target metrics file in the file system. In both cases, the file will be collected by the lurker in the same way it already collects the scanner's output files, and provided to the parser.
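To make this more concrete, here is a rough sketch of such a sidecar. It assumes Node 18+ (for the built-in fetch), a hypothetical statistics endpoint exposed by the scanner on localhost, a hypothetical marker file written by the main container when it finishes, and a shared volume path from which the lurker picks up files - none of these specifics are existing secureCodeBox conventions.

// Sketch only: endpoint, marker file, and output path are assumptions.
import { existsSync, writeFileSync } from "fs";

const STATS_URL = "http://localhost:8080/stats";          // hypothetical scanner statistics endpoint
const DONE_MARKER = "/home/securecodebox/scan-finished";  // hypothetical marker written when the scan ends
const METRICS_FILE = "/home/securecodebox/metrics.json";  // picked up by the lurker alongside the results

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function main(): Promise<void> {
  let latestStats: unknown = {};
  // Poll the scanner's REST interface until the main container signals completion.
  while (!existsSync(DONE_MARKER)) {
    try {
      const response = await fetch(STATS_URL);
      if (response.ok) {
        latestStats = await response.json();
      }
    } catch {
      // Scanner not reachable (yet); keep the last statistics we saw.
    }
    await sleep(10_000);
  }
  // Persist the last collected statistics so the lurker can pick them up.
  writeFileSync(METRICS_FILE, JSON.stringify(latestStats, null, 2));
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});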
In cases where we control the main container (and do not use an official container for the tool), we may be able to dispense with the sidecar entirely and include the processing as part of the main container.
Challenges:
- Log format may change, leading to higher maintenance overhead - especially since the log format is usually not considered part of a tool's stable interface, so the stability guarantees of semantic versioning do not apply.
- Requires writing custom metrics collectors for all scan types (or at least for all we are interested in).
How to Record the Metrics
There are multiple options for how the collected metrics could then be recorded.
Included in Findings
The metrics could be written into the Findings by the parser. This would require making changes to the finding format. The format may look something like this:
Old Format:
[
  {"Finding1": "content 1"}
]
New Format:
{
  "findings": [
    {"Finding1": "content 1"}
  ],
  "metrics": {
    "runtime": 2413,
    // ... Extra standard metrics
    "scannerMetrics": {
      "200_response": 20,
      "403_response": 32,
      "404_response": 82,
      // etc.
    }
  }
}
Advantages
- Having everything in one place makes it simpler to work with the data
Drawbacks
- Requires changes to the findings format, which may cause breakage for third-party scripts developed against the findings format specification
- Requires rewriting several parts of the SCB platform to deal with the new format
Extra File
Alternatively, the metrics could be provided as an extra file that sits next to the (unchanged) findings and results files. It would be an extra JSON file that contains just the metrics, and may look something like this:
// Filename: metrics.json
{
  "runtime": 2413,
  // ... Extra standard metrics
  "scannerMetrics": {
    "200_response": 20,
    "403_response": 32,
    "404_response": 82,
    // etc.
  }
}
The extra file would be generated by the parser, and would be accessible to ScanCompletionHooks.
Advantages
- No changes to the findings format necessary, thus maintaining compatibility for existing consumers of the findings format
- An arbitrary format can be chosen for the metrics file
Disadvantages
- Requires providing additional presigned URLs for this purpose
- Requires changes to the parser and hook SDK, thereby potentially breaking third-party scanners maintained outside of the SCB
Transmitting Metrics to the Monitoring System
At this point, we have the metrics, but we still need to pass them on to the monitoring system. Regardless of which options were chosen in the previous parts of the design, this would likely be achieved through a ScanCompletionHook that reads the data and converts it into a format suitable for the relevant platform. In the case of push-based metrics systems like Grafana / Elastic, this can be as simple as pushing the data to the platform. For scraping-based systems like Prometheus, a push gateway may have to be used. The exact implementation would be up to the hook implementer and is not covered in this ADR. The hooks can then also specify a set of annotations that control how the data is labelled in the monitoring system, similar to how it is already done by the DefectDojo hook.
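As one possible shape for such a hook, targeting a Prometheus Pushgateway: the sketch below converts a flat metrics object (as it might appear in the metrics file described above) into the Prometheus text exposition format and POSTs it to the gateway's job endpoint. The gateway address, metric prefix, and label names are placeholders, and how the hook receives the metrics object is assumed rather than specified here.

// Sketch only: gateway address, metric prefix, and labels are placeholders,
// and the way the hook obtains the metrics object is assumed.
const PUSHGATEWAY_URL = "http://pushgateway.monitoring.svc:9091";

// Convert a flat metrics object into the Prometheus text exposition format.
function toExposition(metrics: Record<string, number>, prefix = "securecodebox_"): string {
  return Object.entries(metrics)
    .map(([name, value]) => `${prefix}${name} ${value}`)
    .join("\n") + "\n";
}

// Push the metrics for a single scan, grouped by scan name, to the Pushgateway.
async function pushMetrics(scanName: string, metrics: Record<string, number>): Promise<void> {
  const url = `${PUSHGATEWAY_URL}/metrics/job/securecodebox/scan/${scanName}`;
  const response = await fetch(url, { method: "POST", body: toExposition(metrics) });
  if (!response.ok) {
    throw new Error(`Pushgateway returned HTTP ${response.status}`);
  }
}

// Example usage with values as they might appear in the metrics file:
pushMetrics("zap-example-scan", { runtime: 2413, "200_response": 20 }).catch(console.error);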
Decision
For the question of which metrics to collect: We will support both a standard set of metrics and custom metrics provided within a special-purpose file. The exact details are left open for the moment, and will be decided based on feasibility and utility during the implementation. Impacted components:
- Operator: Provide the presigned S3 URLs for the metrics collected by sidecars. The operator may also require additional changes to collect system metrics, like pod runtime.
- Scanners: Include the extra sidecars for scan types that require them; this includes implementing those sidecars.
- Parser SDK: Provide the additional metrics files and runtime metadata information to the parsers, so they can incorporate them into the metrics.
- Parsers: Understand the extra metrics and pull in additional metrics from the result files
For the question of how the metrics will be communicated: In the interest of maintaining compatibility, we will leave the findings format unchanged and introduce a separate metrics file that contains the metrics. Impacted components:
- Operator: Provide presigned S3 URLs for the metrics file
- Parser SDK: Support returning the metrics and uploading them to S3
- Parsers: Generate the metrics data in the right format
- Hook SDK: Support pulling in the new metrics file, and support overwriting it for ReadWriteHooks
- Hooks: Existing hooks may need to be adapted to support sending the metrics information (e.g., Elastic hook)