Google Unveils ‘Implicit Caching’ to Reduce Costs for Accessing Its Newest AI Models
Google is launching a new feature in its Gemini API aimed at lowering expenses for third-party developers using its latest AI models.
This feature, named “implicit caching,” is said to enable savings of up to 75% on “repetitive context” sent to models via the Gemini API. It works with Google’s Gemini 2.5 Pro and 2.5 Flash models.
This announcement is expected to be positively received by developers, particularly as the costs of utilizing advanced models continue to rise.
Caching is a common strategy in the AI sector, allowing the reuse of frequently accessed or pre-computed data from models, which helps minimize both computational load and costs. For instance, caches can hold responses to commonly asked questions, eliminating the need for models to recompute these answers.
Previously, Google offered model prompt caching, but only through explicit means, requiring developers to identify their most frequently used prompts. While it promised cost savings, explicit prompt caching often demanded significant manual effort.
Some developers expressed frustration with Google’s explicit caching for Gemini 2.5 Pro, which they found could result in unexpectedly high API costs. Complaints grew last week, prompting the Gemini team to apologize and commit to improvements.
Unlike its explicit counterpart, implicit caching is automatic. It is enabled by default for Gemini 2.5 models and passes on cost savings whenever a request to the Gemini API hits a cache.
“[W]hen you send a request to one of the Gemini 2.5 models, if that request shares a common prefix with any previous requests, it qualifies for a cache hit,” Google stated in a blog post. “We will dynamically pass the cost savings back to you.”
The minimum token count for implicit caching is 1,024 for 2.5 Flash and 2,048 for 2.5 Pro, according to Google’s developer documentation. Those thresholds are low, so it should not take much to trigger the automatic savings. Tokens are the basic units of data that models work with; 1,000 tokens is equivalent to about 750 words.
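To make those thresholds concrete, here is a rough sketch that estimates a prompt's token count from its word count and checks it against the reported minimums. The function names are hypothetical, and the word-to-token ratio is only the rule of thumb cited above, not a real tokenizer, so treat the result as a ballpark figure.

```python
# Minimum prompt sizes for implicit caching, as reported in the article.
MIN_TOKENS = {"2.5 Flash": 1024, "2.5 Pro": 2048}


def estimate_tokens(text: str) -> int:
    """Estimate tokens with the rule of thumb: 1,000 tokens ~ 750 words.

    This is an approximation; a real tokenizer will give different counts.
    """
    words = len(text.split())
    return words * 1000 // 750


def may_qualify(text: str, model: str) -> bool:
    """Return True if the estimated token count meets the model's minimum."""
    return estimate_tokens(text) >= MIN_TOKENS[model]


if __name__ == "__main__":
    prompt = "word " * 768  # a 768-word prompt
    print(estimate_tokens(prompt))
    print(may_qualify(prompt, "2.5 Flash"), may_qualify(prompt, "2.5 Pro"))
```

By this estimate, a prompt needs roughly 770 words to clear the 2.5 Flash minimum and roughly 1,540 words to clear the 2.5 Pro minimum.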
Given that Google’s previous claims of cost savings from caching drew complaints, the new feature comes with some caveats. Google recommends that developers keep repetitive context at the beginning of requests to increase the chances of implicit cache hits. Context that changes from request to request should be appended at the end, the company says.
Google also has not offered any third-party verification that the new implicit caching system will deliver the promised automatic savings. Feedback from early adopters will therefore be the real test.
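The ordering advice can be illustrated with a small sketch. It assumes a hypothetical `build_prompt` helper that places the shared context first and the changing question last, then measures how much of two prompts overlaps at the start. The character-level comparison is only a stand-in: the article does not describe how Google matches prefixes, and real caching would operate on tokens.

```python
import os


def build_prompt(shared_context: str, question: str) -> str:
    """Put the stable context first and the per-request part last."""
    return f"{shared_context}\n\n{question}"


def shared_prefix_len(a: str, b: str) -> int:
    """Length (in characters) of the common prefix of two prompts."""
    return len(os.path.commonprefix([a, b]))


if __name__ == "__main__":
    context = "Product manual text. " * 50
    q1, q2 = "How do I reset the device?", "What does the warranty cover?"

    # Stable context first: the two prompts share a long prefix.
    good = shared_prefix_len(build_prompt(context, q1), build_prompt(context, q2))

    # Changing content first: the prompts diverge immediately.
    bad = shared_prefix_len(f"{q1}\n\n{context}", f"{q2}\n\n{context}")

    print(good, bad)
```

With the context first, the two prompts share everything up to the question; with the question first, they share nothing, so a prefix-based cache would have nothing to reuse.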


