Version: 1.32

Rate limits

Rate limits are restrictions that our API imposes on the number of times a user or client can access our services within a specified period of time.

  • Whichever limit is reached first applies. For example, if your limits are 20 requests per minute and 20k prompt tokens per minute, sending 20 requests of only 100 prompt tokens each to the ChatCompletions endpoint exhausts the request limit, even though those 20 requests come nowhere near 20k tokens.
  • Rate limits are defined at the organization level, not the API key level.
  • Rate limits may vary by the API and model being used.
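
The "whichever limit is reached first" rule can be illustrated with a small sketch. The function name `exhausted_limits` and its parameters are hypothetical and not part of the API; the numbers mirror the example in the first bullet above.

```python
def exhausted_limits(requests_sent, prompt_tokens_sent,
                     requests_per_min, prompt_tokens_per_min):
    """Return the names of the per-minute limits that have been reached.

    Hypothetical helper for illustration only; the service tracks
    these counters on its side.
    """
    reached = []
    if requests_sent >= requests_per_min:
        reached.append("requests")
    if prompt_tokens_sent >= prompt_tokens_per_min:
        reached.append("prompt_tokens")
    return reached


# 20 requests of 100 prompt tokens each, against limits of
# 20 requests/min and 20k prompt tokens/min.
print(exhausted_limits(20, 20 * 100, 20, 20_000))
```

Here only the request limit is reached: the 2,000 tokens sent are well under the 20k token limit, yet further requests in that minute would still be rejected.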

Subscription Tiers

The free tier has a limited amount of prompt and completion tokens available per month.

Limit type | Value
Prompt tokens | 100,000/min and 1,000,000/month
Completion tokens | 10,000/min and 1,000,000/month
Requests | 20/min
Audio file size | 25 MB/min and 100 MB/month

Token Counting & Multipliers

Each model has a rate limit multiplier, which we apply to the token count of each request. The effective token usage counted towards your rate limits and monthly quotas is calculated as:

Effective Usage = Token Count × Model Multiplier

Note: Cached tokens don't count towards your rate limits.
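
As a rough sketch of the formula above, the calculation might look like the following. The function name `effective_usage` is hypothetical, and subtracting cached tokens before applying the multiplier is an assumption based on the note above, not a documented implementation detail.

```python
from decimal import Decimal


def effective_usage(token_count, multiplier, cached_tokens=0):
    """Effective Usage = Token Count x Model Multiplier.

    Assumption: cached tokens are excluded before the multiplier is
    applied, since they do not count towards rate limits.
    """
    counted_tokens = max(token_count - cached_tokens, 0)
    # Decimal avoids float rounding surprises with multipliers like 0.1.
    return counted_tokens * Decimal(str(multiplier))


# A 50,000-token request to a model with a 0.1 multiplier.
print(effective_usage(50_000, "0.1"))
# A 1,000-token request with a 1.0 multiplier, 400 tokens of it cached.
print(effective_usage(1_000, "1.0", cached_tokens=400))
```

Under these assumptions, the first request counts as 5,000 tokens and the second as 600 tokens towards the limits.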

Model Multipliers

Model | Multiplier | Effective Prompt Token Rate Limit (Free & Standard) | Effective Monthly Quota (Free)
Gemma 3 27B | 1.0 | 100,000 tokens/min | 1,000,000 tokens/month
gpt-oss-120b | 1.0 | 100,000 tokens/min | 1,000,000 tokens/month
Qwen2.5-Coder 14B | 1.0 | 100,000 tokens/min | 1,000,000 tokens/month
Qwen3-Coder 30B-A3B | 1.0 | 100,000 tokens/min | 1,000,000 tokens/month
Qwen3-Embedding 4B | 0.1 | 1,000,000 tokens/min | 10,000,000 tokens/month
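
The effective limits in this table are consistent with dividing the base limit by the model multiplier. That relationship is inferred from the table rows rather than stated explicitly, so treat the sketch below as an illustration; `effective_limit` and the constant names are hypothetical.

```python
from decimal import Decimal

# Base free-tier prompt token limits, taken from the Subscription Tiers table.
BASE_PROMPT_TOKENS_PER_MIN = 100_000
BASE_PROMPT_TOKENS_PER_MONTH = 1_000_000


def effective_limit(base_limit, multiplier):
    """Raw tokens a model can consume before the base limit is reached.

    Assumption: effective limit = base limit / multiplier, which
    matches every row of the table above.
    """
    return int(Decimal(base_limit) / Decimal(str(multiplier)))


# Qwen3-Embedding 4B has a multiplier of 0.1.
print(effective_limit(BASE_PROMPT_TOKENS_PER_MIN, "0.1"))
print(effective_limit(BASE_PROMPT_TOKENS_PER_MONTH, "0.1"))
```

With a multiplier of 0.1 this gives 1,000,000 tokens/min and 10,000,000 tokens/month, matching the Qwen3-Embedding 4B row; a multiplier of 1.0 leaves the base limits unchanged.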