Envoy Gateway v1.3.0 introduces an important enhancement to its rate limiting capabilities: Rate Limiting with Cost. This feature allows each request to consume a configurable “cost” from the rate limit budget, rather than counting every request as a single hit. In practice, this enables usage-based rate limiting, where different requests can deduct different amounts from the allowed quota. This overview will explain the feature’s details, verify them against official sources, and provide an English translation of the key points from the Japanese article, with accurate technical context. The target audience is assumed to be familiar with Envoy Gateway, Kubernetes, and Envoy’s rate limiting concepts.
Background: Envoy Gateway Rate Limiting
Envoy Gateway supports rate limiting to control traffic to services. In prior versions, rate limits were primarily count-based – each request counted as "1" towards a fixed limit (e.g. 100 requests per minute). Envoy Gateway implements global rate limiting (using an external rate limit service backed by Redis) as well as local (per-instance) rate limits. Global rate limiting enforces the limit across all Envoy replicas: with a limit of 10 req/sec globally, 5 req/sec through one proxy plus 5 req/sec through another together reach the limit. By default, when a rate limit is exceeded, Envoy Gateway returns an HTTP 429 (Too Many Requests) response.
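For context, a classic count-based global rate limit in Envoy Gateway looks roughly like the following BackendTrafficPolicy. This is a minimal sketch rather than an excerpt from the official docs: the route name is hypothetical and client selectors are omitted. Every matching request counts as exactly 1 against the 100-per-minute budget.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: basic-rate-limit
spec:
  targetRefs:                      # attach the policy to a route
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: example-route          # hypothetical route name
  rateLimit:
    type: Global
    global:
      rules:
        - limit:
            requests: 100          # each request counts as 1 hit
            unit: Minute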
Why “cost”-based rate limiting? In some scenarios, not all requests are equal. For example, an API might want to allocate more budget to expensive operations (like complex queries or large data transfers) or charge users based on usage (e.g. bandwidth or computational cost). A key use case is Generative AI APIs – one request might generate a response with thousands of tokens, consuming significant compute resources. Counting each request as 1 doesn’t reflect the actual load or cost imposed by that request. The community raised this in GitHub issues such as “Usage based Rate Limiting (Counting from response header values)” (Issue #4756) and “Generative AI support” (Issue #4748), prompting the need for a more flexible rate limiting mechanism.
Envoy Gateway 1.3.0 addresses this with Rate Limiting with Cost, which lets you assign a variable cost per request (and per response) to be decremented from the rate limit counters, instead of a fixed 1 per request.
New Feature Overview: Rate Limiting with Cost
In Envoy Gateway v1.3.0, the rate limit API in the BackendTrafficPolicy CRD (the successor to the earlier RateLimitFilter API) now supports a cost specifier for each rate limiting rule. This is implemented by adding a cost field to the rate limit configuration. The official release notes highlight “Rate Limiting with Cost: Added support for cost specifier in the rate limit BackendTrafficPolicy CRD.” In practice, this means you can configure how much each request counts against the limit, and even split that into a request-phase cost and a response-phase cost.
Cost Configuration in RateLimit Rules
Each rate limit rule can now include an optional cost setting, which has two sub-fields: request and response. If cost is omitted, the behavior is the same as in previous versions: each request decrements the remaining quota by 1 at request time, and the response has no effect on the quota. When cost is specified, you get fine-grained control:
- Request Cost (cost.request): This defines how much to deduct from the rate limit counter when a request is received, before it is forwarded to the backend. If you set it to a number greater than 1, each request consumes that many “credits”; if the remaining quota is less than this cost, the request is rate-limited and Envoy responds with 429 immediately. You can also set it to 0, meaning no quota is deducted at request time – effectively a check without consumption. A request cost of 0 is useful when you only want to enforce limits based on the response, as described below (a short configuration sketch of a fixed request cost follows this list).
- Response Cost (cost.response): This defines how much to deduct from the rate limit counter after the response has been sent back to the client (when the request/response stream completes). This is particularly useful for usage-based limits where the cost of a request can only be determined after processing it – for instance, after generating a response (e.g., counting AI tokens or data size). The crucial point is that the response cost is applied after the request has been processed, so it does not retroactively affect the current request’s admission; it only reduces the quota available to subsequent requests. In other words, even a response with a very high cost is allowed to complete once the request has started – the cost is accounted against future requests. If cost.response is not specified, no deduction occurs on response (so responses don’t affect the quota by default).
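As a quick illustration of a fixed request-side cost, the fragment below is a minimal sketch (the rule and numbers are illustrative, not taken from the official docs): every matching request is charged 5 units against a 100-unit-per-minute budget, so roughly 20 such requests fit into each window.

  rateLimit:
    type: Global
    global:
      rules:
        - limit:
            requests: 100        # total budget: 100 units per minute
            unit: Minute
          cost:
            request:
              from: Number
              number: 5          # each request consumes 5 units up front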
Both cost.request and cost.response are defined as Cost Specifiers, which means you can decide how the cost value is obtained:
- Fixed Number: You can specify a fixed integer. In the CRD, this is done by setting from: Number and providing a number value. For example, cost.request.from: Number with cost.request.number: 5 means each request consumes 5 units from the quota. A fixed number is straightforward for static costs. (As mentioned, setting the number to 0 makes Envoy perform only a limit check without consuming tokens – effectively letting you gate the request on the current budget without deducting anything at that moment.)
- Dynamic Metadata: More powerfully, the cost can be determined dynamically from the request’s metadata. In this mode, you set from: Metadata and specify a metadata source with a namespace and key. Envoy retrieves a numeric value from the per-request dynamic metadata under that namespace/key and uses it as the cost. This requires that some part of request processing (e.g., an External Processing filter or a Wasm extension) has injected the usage value into Envoy’s dynamic metadata. For instance, an external gRPC service attached via Envoy’s External Processing filter could calculate the number of tokens used by an AI model and return that in dynamic metadata; Envoy Gateway then picks up that value and deducts it from the rate limit budget (a rough sketch of such an attachment follows this list). This design was intended precisely for generative AI use cases, where the cost (number of tokens) is only known at runtime. The Envoy Gateway API reference confirms that valid sources for cost are “Number” or “Metadata”, and that the metadata source requires specifying which dynamic metadata namespace and key to read.
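For completeness, here is a rough sketch of how such an external processor could be attached with an EnvoyExtensionPolicy. Treat this as an assumption-laden outline rather than a verified recipe: the route name, service name, and port are hypothetical, and how the processor emits dynamic metadata (and under which namespace) depends entirely on your ext_proc implementation and its configuration.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyExtensionPolicy
metadata:
  name: token-counter-ext-proc
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: ai-api-route          # hypothetical route
  extProc:
    - backendRefs:
        - name: token-counter     # hypothetical gRPC ext_proc service
          port: 9002
      # The external processor is expected to compute the usage value
      # (e.g., tokens generated) and emit it as per-request dynamic
      # metadata under the namespace/key that the cost specifier
      # references (ext_proc / token_count in the example further below).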
Supported Scope: It’s important to note that as of v1.3.0, cost-based rate limiting is supported only for HTTP global rate limits. Global rate limiting is the variant that uses the external rate limit service (backed by Redis) to coordinate counts across Envoy instances, and the cost specifier currently works only in that context. If you configure a BackendTrafficPolicy with type: Local (per-proxy rate limiting), the cost fields are not applied in this release. Likewise, the cost mechanism is oriented toward HTTP traffic; the release notes and docs do not mention TCP or gRPC usage for cost-based limits in this version.
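Because cost-based limits ride on the global rate limit path, the shared rate limit service must be enabled in the Envoy Gateway configuration. The following is a minimal sketch of that configuration (normally supplied through the Envoy Gateway ConfigMap; the Redis endpoint shown here is hypothetical):

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyGateway
provider:
  type: Kubernetes
gateway:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
rateLimit:
  backend:
    type: Redis
    redis:
      url: redis.redis-system.svc.cluster.local:6379   # hypothetical Redis endpoint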
Example Configuration
Let’s illustrate how one would configure Rate Limiting with Cost in Envoy Gateway 1.3.0. Assume we want to limit each client to 1000 “tokens” per minute on a certain API route, where the token count of each request is determined by an external processing step (for example, the number of GPT-4 tokens generated in the response). We want to allow each request through initially (as long as some budget remains), and deduct the exact tokens used after the response is ready.
We could define a BackendTrafficPolicy with a rule like the following (the targetRefs entry attaches the policy to an example route):
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: ai-api-rate-limit
spec:
  targetRefs:                      # attach the policy to the route being limited
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: ai-api-route           # example route name
  rateLimit:
    type: Global
    global:
      rules:
        - limit:
            requests: 1000         # 1000 tokens
            unit: Minute           # per minute window
          cost:
            request:
              from: Number
              number: 0            # don't consume at request time, just check
            response:
              from: Metadata
              metadata:
                namespace: ext_proc   # the dynamic metadata namespace used by the ext processor
                key: token_count      # the key where the token count is stored
In this example, when a request matching this rule comes in, Envoy checks the global counter via the rate limit service (backed by Redis). Because the request cost is 0, nothing is deducted at this point; the check only verifies that the limit has not already been exhausted by previously accounted usage. (If you want every admitted request to consume at least something up front, you can set a minimal request cost of 1 instead – see the variant after the note below.) The request is then allowed to proceed to the AI service. The AI service – through an external processing filter or another mechanism – reports, for example, that token_count = 150 for this response. After sending the response, Envoy deducts 150 from the rate limit counter in Redis. The net effect is that the client has consumed 150 of its 1000 tokens for that minute. Once the accumulated token usage in the window exceeds 1000, subsequent requests are rejected with 429 until the window resets.
Note: The ability to use number: 0 for the request cost is a deliberate feature to support this pattern of “check then deduct later.” The Envoy Gateway docs state that using zero as the cost allows you to “only check the rate limit counters without reducing them”. This is ideal for scenarios where the exact cost isn’t known until later – you ensure there was budget when starting the request (or at least that the limit wasn’t already completely exhausted), and then finalize the accounting at the end.
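If you would rather have every admitted request reserve at least one unit up front (so that a burst of concurrent requests cannot all start against an almost-empty budget), a small variation – sketched below under the same assumptions as the example above – combines a fixed request cost of 1 with the metadata-based response cost. Note that both deductions apply, so each request is charged 1 unit plus the reported token count.

          cost:
            request:
              from: Number
              number: 1            # reserve 1 unit when the request is admitted
            response:
              from: Metadata
              metadata:
                namespace: ext_proc
                key: token_count   # deducted in addition to the 1 unit above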
Internal Implementation and Related Issues
This feature was implemented in the Envoy Gateway codebase via two key pull requests: PR #4957 (which defined the API changes adding the cost fields to the CRD) and PR #5035 (which implemented the translation logic that turns this API configuration into Envoy’s configuration). The maintainers discussed naming (initially calling it “hits_addend”, after the underlying Envoy field, before settling on the clearer cost terminology) and made sure the required Envoy Proxy support was in place (Envoy Proxy added support for adjusting rate limit counters via dynamic metadata in a relatively recent version, which Envoy Gateway now leverages). In fact, the translator implementation notes that the response cost requires Envoy Proxy version >= 1.33.0 to work properly.
The driving use cases for Rate Limiting with Cost are captured in the GitHub issues mentioned earlier. Issue #4756 outlined usage-based rate limiting, for example counting values from a response header. Issue #4748 relates to Generative AI support – the idea of integrating Envoy Gateway with AI workloads. In tandem, the Envoy AI Gateway effort (see the separate repository envoyproxy/ai-gateway) introduced an AIGatewayRoute resource that can calculate token usage, and a corresponding update in that project added a RequestCost field to the AI-specific route definition so that rate limiting based on token usage becomes possible. This demonstrates the synergy: Envoy Gateway’s core now supports the generic mechanism (cost-based limits), while the AI Gateway integration supplies the actual usage values (tokens) to the rate limiter. The commit message explicitly states that this allows limiting based on calculated token usage.
Conclusion
“Rate Limiting with Cost” in Envoy Gateway 1.3.0 is a powerful new feature that increases the flexibility of API rate limiting. It enables use cases like per-user consumption quotas, tiered API usage plans, and AI-inference usage limits that were difficult to enforce with simple request counting. Checking against official sources confirms the implementation details described above: the BackendTrafficPolicy CRD now accepts a cost specification with per-request and per-response cost values, which can be fixed numbers or fetched dynamically from metadata. The default behavior remains one request = one count, preserving backward compatibility. The cost-based mechanism currently applies to global HTTP rate limits, working in conjunction with Envoy’s global rate limit service (Redis).
For organizations and developers, this means more granular control over how clients consume their APIs. You can now impose limits not just on the number of requests, but on the “cost” of those requests – whether defined by data size, CPU time, tokens generated, or any custom metric you can feed into Envoy’s dynamic metadata. Envoy Gateway 1.3.0’s documentation and release notes corroborate this description of the feature, and the translated content above aims to reflect the original article’s intent with added clarity and verified technical detail.
Sources:
- Envoy Gateway v1.3.0 Release Notes
- Envoy Gateway API Reference – RateLimit cost specification
- Envoy Gateway Pull Request #4957 (API changes for the cost specifier)
- Envoy Gateway Pull Request #5035 (implementation of rate limit cost in translator)
- GitHub Issue #4756 – “Usage based Rate Limiting (Counting from response header values)” (Motivation for cost feature)
- GitHub Issue #4748 – “Generative AI support” (Related to dynamic cost usage for AI scenarios)