Apache Kafka is a well-known open-source event store and stream processing platform. It has evolved into the de facto standard for data streaming: over 80% of Fortune 500 companies use it. All major cloud providers offer managed data streaming services to meet this growing demand.
One key advantage of choosing a managed Kafka service is that responsibility for broker and operational metrics is delegated to the provider, allowing users to focus solely on metrics specific to their applications. In this article, Product Manager Uche Nwankwo provides guidance on a set of producer and consumer metrics that customers should monitor for optimal performance.
With Kafka, monitoring typically involves various metrics related to topics, partitions, brokers and consumer groups. Standard Kafka metrics include information on throughput, latency, replication and disk usage. Refer to the Kafka documentation and relevant monitoring tools to understand the specific metrics available for your version of Kafka and how to interpret them effectively.
Why is it important to monitor Kafka clients?
Monitoring your IBM® Event Streams for IBM Cloud® instance is crucial to ensure optimal functionality and the overall health of your data pipeline. Monitoring your Kafka clients helps to identify early signs of application failure, such as high resource usage, lagging consumers and bottlenecks. Identifying these warning signs early enables a proactive response to potential issues that minimizes downtime and prevents any disruption to business operations.
Kafka clients (producers and consumers) have their own set of metrics to monitor their performance and health. In addition, the Event Streams service supports a rich set of metrics produced by the server. For more information, see Monitoring Event Streams metrics by using IBM Cloud Monitoring.
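As a concrete starting point, the Java Kafka client exposes all of its built-in metrics programmatically through the metrics() method (the same values are also registered as JMX MBeans, so any JMX scraper can collect them). The following is a minimal sketch that dumps every producer-level metric; the class name, bootstrap endpoint and serializer choices are placeholder assumptions, and a real Event Streams connection would also need credentials.

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class ClientMetricsDump {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder endpoint; a real Event Streams instance also needs credentials.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every Java client exposes its built-in metrics as a read-only map;
            // KafkaConsumer has an identical metrics() method.
            Map<MetricName, ? extends Metric> metrics = producer.metrics();
            metrics.forEach((name, metric) ->
                    System.out.printf("%s.%s = %s%n",
                            name.group(), name.name(), metric.metricValue()));
        }
    }
}
```

The metric names discussed in the rest of this article are the keys you would filter for in this map.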
Client metrics to monitor
Producer metrics
Metric | Description
record-error-rate | This metric measures the average per-second number of record sends that resulted in errors. A high (or an increase in) record-error-rate might indicate data loss or data not being processed as expected. Such effects might compromise the integrity of the data you are processing and storing in Kafka. Monitoring this metric helps to ensure that data being sent by producers is accurately and reliably recorded in your Kafka topics. |
request-latency-avg | This is the average latency for each produce request in ms. An increase in latency impacts performance and might signal an issue. Measuring the request-latency-avg metric can help to identify bottlenecks within your instance. For many applications, low latency is crucial to ensure a high-quality user experience, and a spike in request-latency-avg might indicate that you are reaching the limits of your provisioned instance. You can fix the issue by changing your producer settings, for example by batching (see the producer sketch after this table) or by scaling your plan to optimize performance. |
byte-rate | The average number of bytes sent per second for a topic is a measure of your throughput. If you stream data regularly, a drop in throughput can indicate an anomaly in your Kafka instance. The Event Streams Enterprise plan starts from 150MB-per-second split one-to-one between ingress and egress (that is, 75MB-per-second each way on the base plan), and it is important to know how much of that you are consuming for effective capacity planning. Don't go above two-thirds of the maximum throughput (roughly 50MB-per-second of ingress on the base plan), to account for the possible impact of operational actions, such as internal updates or failure modes (for example, the loss of an availability zone). |
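If request-latency-avg climbs because many small requests are in flight, batching is the usual first adjustment, as noted above. The sketch below shows a producer configured with linger.ms and batch.size and then spot-checks record-error-rate and request-latency-avg from its metrics map; the class name, endpoint, topic and tuning values are illustrative assumptions, not recommendations.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder endpoint; a real Event Streams instance also needs credentials.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Batching: wait up to 50 ms to fill batches of up to 64 KB. Fewer, larger
        // requests tend to reduce request-latency-avg under load, at the cost of a
        // small per-record delay. These values are illustrative starting points only.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value")); // placeholder topic
            producer.flush();

            // Spot-check the two producer metrics described above; printing the
            // group disambiguates the client-level from the per-topic variants.
            producer.metrics().forEach((name, metric) -> {
                if (name.name().equals("record-error-rate")
                        || name.name().equals("request-latency-avg")) {
                    System.out.printf("%s (%s) = %s%n",
                            name.name(), name.group(), metric.metricValue());
                }
            });
        }
    }
}
```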
Consumer metrics
Metric | Description
fetch-rate and fetch-size-avg | The number of fetch requests per second (fetch-rate) and the average number of bytes fetched per request (fetch-size-avg) are key indicators of how well your Kafka consumers are performing. A high fetch-rate might signal inefficiency, especially over a small number of messages, as it means insufficient (possibly no) data is being received each time. The fetch-rate and fetch-size-avg are affected by three settings: fetch.min.bytes, fetch.max.bytes and fetch.max.wait.ms (see the consumer sketch after this table). Tune these settings to achieve the desired overall latency, while minimizing the number of fetch requests and potentially the load on the broker CPU. Monitoring and optimizing both metrics ensures that you are processing data efficiently for current and future workloads. |
commit-latency-avg | This metric measures the average time between a committed record being sent and the commit response being received. Similar to request-latency-avg as a producer metric, a stable commit-latency-avg means that your offset commits happen in a timely manner. A high commit latency might indicate problems within the consumer that prevent it from committing offsets quickly, which directly impacts the reliability of data processing. It might lead to duplicate processing of messages if a consumer must restart and reprocess messages from a previously uncommitted offset. A high commit latency also means spending more time on administrative operations than on actual message processing. This issue might lead to backlogs of messages waiting to be processed, especially in high-volume environments. |
bytes-consumed-rate | This is a consumer-fetch metric that measures the average number of bytes consumed per second. Similar to byte-rate as a producer metric, this should be a stable and expected metric. A sudden change in the expected trend of the bytes-consumed-rate might signify an issue with your applications. A low rate might be a signal of inefficiency in data fetches or of over-provisioned resources. A higher rate might overwhelm the consumers' processing capability and thus require scaling, creating more consumers to balance out the load, or changing consumer configurations such as fetch sizes. |
rebalance-rate-per-hour | The number of group rebalances the consumer participated in per hour. Rebalancing occurs every time a new consumer joins or an existing consumer leaves the group, and it causes a delay in processing because partitions are reassigned; frequent rebalances therefore make Kafka consumers less efficient. A higher rebalance rate per hour can be caused by misconfigurations that lead to unstable consumer behavior. Rebalancing can increase latency and can result in applications crashing. Keep your consumer groups stable by watching for a low and steady rebalance-rate-per-hour. |
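The fetch tuning and commit behavior described above map directly onto consumer configuration and the consumer's own metrics() map. Below is a minimal sketch under those assumptions; the class name, endpoint, group id, topic and tuning values are placeholders, not recommendations.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TunedConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder connection details; a real Event Streams instance also needs credentials.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "metrics-demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Commit offsets manually so commit-latency-avg reflects our own commits.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        // Fetch tuning: ask the broker to hold each fetch until at least 16 KB is
        // available or 500 ms has passed. Raising fetch.min.bytes lowers fetch-rate
        // (fewer, larger requests) at the cost of added latency. Illustrative values.
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 16 * 1024);
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);

        Set<String> watched = Set.of("fetch-rate", "fetch-size-avg",
                "commit-latency-avg", "bytes-consumed-rate", "rebalance-rate-per-hour");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // placeholder topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            System.out.printf("polled %d records%n", records.count());
            consumer.commitSync(); // ... after processing the polled records ...

            // Spot-check the consumer metrics described above.
            consumer.metrics().forEach((name, metric) -> {
                if (watched.contains(name.name())) {
                    System.out.printf("%s (%s) = %s%n",
                            name.name(), name.group(), metric.metricValue());
                }
            });
        }
    }
}
```

In a long-running application you would poll these values periodically (or scrape them over JMX) rather than print them once; a steadily rising rebalance-rate-per-hour in that feed is the early warning sign described in the table above.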
These metrics should cover a wide variety of applications and use cases. Event Streams on IBM Cloud provides a rich set of metrics that are documented here and will provide further useful insights depending on the domain of your application. Take the next step: learn more about Event Streams for IBM Cloud.
What's next?
You now have the knowledge of the essential Kafka client metrics to monitor. You're invited to put these points into practice and try out the fully managed Kafka offering on IBM Cloud. For any challenges in setup, see the Getting Started Guide and FAQs.
Learn more about Kafka and its use cases
Provision an instance of Event Streams on IBM Cloud