Monitoring
We aim to connect services running in Kubernetes to our monitoring stack for metrics, logs, and traces, and we use several tools to do so. This page describes how data from Kubernetes is organized within the monitoring system and how the individual components can be used. For detailed information, please refer to the documentation of the monitoring system itself.
Organization of the Data
The collected metrics, logs, and traces are aggregated by the organizationalUnit specified in the request. Read and write access to the data of an organizationalUnit cannot be subdivided further.
Grafana
The data can be accessed through our Grafana instance, where a Grafana organization can be associated with the data of an organizationalUnit.
In Grafana, the corresponding data sources are automatically created within the organizations to access metrics, logs, and traces. These sources are named:
- Metrics: WWU Kube: Mimir
- Logs: WWU Kube: Loki
- Traces: WWU Kube: Tempo
The prefix WWU Kube can be customized, and we are currently transitioning to new default names.
Metrics
To gather metrics in Kubernetes, we utilize Prometheus in combination with the Prometheus Operator. As a long-term storage solution, we automatically forward the metrics to Mimir.
To gather metrics from services in Kubernetes, several Custom Resource Definitions (CRDs) are available for specifying which metrics should be collected. A more detailed description of these CRDs can be found in the documentation of the Prometheus Operator. The most common case is likely a ServiceMonitor, which obtains metrics from the Pods of a Service. Note that you will have to allow traffic between the metrics endpoint and our ingester inside the cluster, like in this example.
For instance, the following resource collects metrics from all Pods of the Service, as determined by the fields namespaceSelector, selector, and endpoints:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example
  namespace: example
spec:
  endpoints:
    - interval: 30s
      path: /metrics
      port: http-metrics
  namespaceSelector:
    matchNames:
      - example
  selector:
    matchLabels:
      app.kubernetes.io/component: example
      app.kubernetes.io/instance: example
      app.kubernetes.io/name: example
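To make the relationship between the ServiceMonitor and the scraped workload more concrete, the following is a minimal sketch of a matching Service and of a NetworkPolicy that admits scrape traffic. It is not taken from the platform's own examples: the port number 8080, the policy name allow-metrics-scrape, and the namespace name monitoring for the ingester are assumptions and must be replaced with the values that apply to your cluster.

```yaml
# Hypothetical Service that the ServiceMonitor above would select.
apiVersion: v1
kind: Service
metadata:
  name: example
  namespace: example
  labels:
    # These labels must match spec.selector.matchLabels of the ServiceMonitor.
    app.kubernetes.io/component: example
    app.kubernetes.io/instance: example
    app.kubernetes.io/name: example
spec:
  selector:
    app.kubernetes.io/name: example
  ports:
    # The ServiceMonitor refers to this port by name (endpoints[].port).
    - name: http-metrics
      port: 8080        # assumed port number
      targetPort: 8080  # assumed container port
---
# Sketch of a NetworkPolicy that allows the scraper to reach the metrics port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metrics-scrape  # hypothetical name
  namespace: example
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: example
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Assumption: the ingester runs in a namespace called "monitoring".
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8080  # assumed container port serving /metrics
```

The ServiceMonitor selects the Service by its labels and the endpoint by the Service port name, so both must stay in sync with the resource shown above.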
Certain metrics, such as whether a service is up or how much CPU and memory it consumes, are not emitted by our tenants' own services but are collected by cluster-level services. To make such metrics available to our tenants, we enrich the metrics of each organizationalUnit with a selection of cluster metrics. This is done only for metrics that belong to the namespaces of that organizationalUnit.
Currently, this consists of a subset of metrics from both the Kubelet and the kube-state-metrics service. If additional cluster metrics are required, we will explore the possibility of making them available.
Logs
Logs in Kubernetes are collected using Vector and then sent to the central Loki of our monitoring system.
The logs of all Pods in the namespaces of the Kubernetes clusters are automatically collected under the respective organizationalUnit. Technically, this is achieved through the owner annotations, which are automatically added to each Pod.
Retention Time
By default, Loki stores logs for 14 days. The log retention time can be increased to either 90 or 180 days by adding the cloud.uni-muenster.de/vector-retention annotation to your Kubernetes Pod configuration and setting it to 90d or 180d accordingly.
For a 90-day retention period, add the following annotation:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
  annotations:
    cloud.uni-muenster.de/vector-retention: "90d"
spec:
  containers:
    ...
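Pods are often created indirectly, for example by a Deployment. In that case the annotation has to be placed on the Pod template rather than on the Deployment itself, so that every created Pod carries it. The following is a minimal sketch of that, here with the 180-day value; the names my-app and the image reference are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Placed on the Pod template, so each Pod of the Deployment gets it.
        cloud.uni-muenster.de/vector-retention: "180d"
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0  # placeholder image
```

An annotation under the Deployment's top-level metadata would not be copied to the Pods.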
Traces
Traces from services running in Kubernetes can also be collected. For this purpose, we use the OpenTelemetry Collector to ingest data in various formats. The data is then sent to the central Tempo of our monitoring system.
Sending Traces
On each node, an OpenTelemetry Collector Agent is running, which accepts traces via various ports and protocols. This agent can be accessed from within the pods using the syntax host:
Currently, we support the following interfaces:
Protocol | Port |
---|---|
otlp | tcp/4317 |
Jaeger gRPC | tcp/14250 |
Jaeger Thrift | tcp/14268 |
Jaeger Thrift Binary | udp/6832 |
Jaeger Thrift Compact | udp/6831 |
zipkin | tcp/9411 |
We ourselves rely predominantly on otlp where possible and recommend it to our customers as well, since tracing appears to be gradually consolidating around this protocol. In the future, we plan to deactivate protocols that haven’t gained traction.
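As an illustration of how a Pod might address the collector agent on its own node via otlp, the sketch below injects the node IP through the Kubernetes downward API and passes it to the standard OpenTelemetry SDK environment variables. It assumes that the agent is reachable on the node's IP address at the otlp port from the table above; the Pod name, container name, and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: traced-app  # placeholder name
spec:
  containers:
    - name: app
      image: registry.example.com/traced-app:1.0.0  # placeholder image
      env:
        # Downward API: IP address of the node this Pod is scheduled on.
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        # Standard OpenTelemetry SDK variables; HOST_IP must be defined
        # earlier in this list for the $(HOST_IP) expansion to work.
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://$(HOST_IP):4317"
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: "grpc"
```

Whether an application honors these variables depends on the OpenTelemetry SDK it uses, so the endpoint may need to be set in the application's own configuration instead.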
To facilitate the onboarding process, there are examples available on how to send such traces.