- All Implemented Interfaces:
- org.apache.kafka.server.metrics.MetricsBuilderContext
- Enclosing class:
- TenantMetrics
public static class TenantMetrics.AggregateTenantMetricsContext
extends Object
implements org.apache.kafka.server.metrics.MetricsBuilderContext
See: CNK-1347
Currently, the request metrics reported by the metrics API (Telemetry reporter) and the request metrics that the
internal DataDog charts track are not equivalent. This makes throttling issues hard to troubleshoot, particularly
when the metrics API reports request usage below the throttle ceiling for a given CKU while the dashboard
shows much higher usage.
See: https://ccloud-production.datadoghq.com/notebook/278485/sum-of-swiggy-request-count
For example, ESCALATION-3817 captured this discrepancy: it was hard to explain why the customer needed to
add more CKUs while the metrics API indicated that the request rate was within the limit.
kafka.request.rate is currently the only internal dashboard that tracks the overall request rate across all
tenants/users on a cluster, but it also includes follower request rates, so it does not isolate tenant traffic.
For a more accurate metric, this class exports an aggregate of individual requests across all tenants/users.
Aggregation is necessary to reduce the cardinality of these metrics, particularly on a multi-tenant cluster, where the
worst-case cardinality grows as Users x Tenants x Requests and leads to slow metrics scraping.
As it stands, this is mainly useful for dedicated clusters, where the number of tenants is 1 and the aggregated
metrics correlate most closely with the metrics API.
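The cardinality reduction described above can be illustrated with a minimal sketch. It is not the implementation of this class: every name in it (AggregateRequestCountSketch, Sample, aggregateByRequestType, worstCaseSeries) is hypothetical, and it assumes that the aggregate keeps one series per request type while dropping the tenant and user dimensions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names come from TenantMetrics or the Kafka code base.
final class AggregateRequestCountSketch {

    // One sample per (tenant, user, request type), i.e. the high-cardinality form.
    static final class Sample {
        final String tenant;
        final String user;
        final String requestType;
        final long count;

        Sample(String tenant, String user, String requestType, long count) {
            this.tenant = tenant;
            this.user = user;
            this.requestType = requestType;
            this.count = count;
        }
    }

    // Collapse the tenant and user dimensions, keeping one total per request type.
    static Map<String, Long> aggregateByRequestType(List<Sample> samples) {
        Map<String, Long> totals = new TreeMap<>();
        for (Sample s : samples) {
            totals.merge(s.requestType, s.count, Long::sum);
        }
        return totals;
    }

    // Worst-case number of series when every dimension is kept as a label.
    static long worstCaseSeries(long users, long tenants, long requestTypes) {
        return Math.multiplyExact(Math.multiplyExact(users, tenants), requestTypes);
    }

    public static void main(String[] args) {
        List<Sample> samples = List.of(
            new Sample("tenant-a", "user-1", "Produce", 120),
            new Sample("tenant-a", "user-2", "Produce", 30),
            new Sample("tenant-a", "user-1", "Fetch", 75),
            new Sample("tenant-b", "user-3", "Fetch", 25));

        // Four per-user samples collapse into two per-request-type totals.
        System.out.println(aggregateByRequestType(samples)); // {Fetch=100, Produce=150}

        // Illustrative numbers only: 100 users x 50 tenants x 60 request types.
        System.out.println(worstCaseSeries(100, 50, 60)); // 300000
    }
}
```

With the tenant and user labels dropped, the number of exported series in this sketch is bounded by the number of request types rather than by the product of all three dimensions.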
TODO: Revisit this when re-designing multi-tenant metrics