
Reduce LLM Costs with Semantic Caching using Redis Vector Store and HuggingFace

Created by: Tihomir Mateev (tishun)

Last update: a day ago

Stop Paying for the Same Answer Twice

Your LLM is answering the same questions over and over. "What's the weather?" "How's the weather today?" "Tell me about the weather." Same answer, three API calls, triple the cost. This workflow fixes that.

What Does It Do?

Semantic caching with superpowers. When someone asks a question, it checks if you've answered something similar before. Not exact matches—semantic similarity. If it finds a match, boom, instant cached response. No LLM call, no cost, no waiting.

First time: "What's your refund policy?" → Calls LLM, caches answer
Next time: "How do refunds work?" → Instant cached response (it knows these are the same!)
Result: Faster responses + way lower API bills
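
Under the hood it's just embeddings: questions that mean the same thing land close together in vector space, and "close" is measured as a distance. A tiny sketch of the idea in Python (the model name is only an example, not necessarily what the workflow's Hugging Face node uses):

```python
# A minimal sketch: paraphrases land close together in embedding space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative model choice

cached_question = "What's your refund policy?"
new_question = "How do refunds work?"

emb_cached, emb_new = model.encode([cached_question, new_question])

# Cosine similarity near 1.0 means "same question"; distance = 1 - similarity.
similarity = util.cos_sim(emb_cached, emb_new).item()
distance = 1 - similarity
print(f"distance = {distance:.2f}")  # a small distance means: serve the cached answer
```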

The Flow

  1. Question comes in through the chat interface
  2. Vector search checks Redis for semantically similar past questions
  3. Smart decision: Cache hit? Return instantly. Cache miss? Ask the LLM.
  4. New answers get cached automatically for next time
  5. Conversation memory keeps context across the whole chat

It's like having a really smart memo pad that understands meaning, not just exact words.
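
If you like seeing the moving parts spelled out, here's a rough Python sketch of that loop. The index name (cache_idx), the field names, and the call_llm() stub are placeholders standing in for what the n8n Redis Vector Store, Hugging Face, and OpenAI nodes actually manage, so treat it as a map of the logic, not a drop-in replacement:

```python
# Rough rendering of the workflow's loop; names and helpers are placeholders.
import numpy as np
import redis
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

r = redis.Redis(host="localhost", port=6379)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # stand-in for the HF embeddings node

def call_llm(question: str) -> str:
    raise NotImplementedError("stand-in for the OpenAI chat model node")

def answer(question: str, distance_threshold: float = 0.3) -> str:
    vec = np.asarray(model.encode(question), dtype=np.float32)

    # 1) Vector search: find the single closest previously cached question.
    knn = (
        Query("*=>[KNN 1 @embedding $vec AS distance]")
        .return_fields("response", "distance")
        .sort_by("distance")
        .dialect(2)
    )
    res = r.ft("cache_idx").search(knn, query_params={"vec": vec.tobytes()})

    # 2) Cache hit: close enough to something we've already answered.
    if res.docs and float(res.docs[0].distance) <= distance_threshold:
        return res.docs[0].response

    # 3) Cache miss: ask the LLM, then cache the new Q/A pair for next time.
    response = call_llm(question)
    r.hset(f"cache:{abs(hash(question))}", mapping={
        "prompt": question,
        "response": response,
        "embedding": vec.tobytes(),
    })
    return response
```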

Quick Start

You'll need:

  • OpenAI API key (for the chat model)
  • Hugging Face API key (for embeddings)
  • Redis 8.x (for the vector magic; index sketch below)
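
Curious what the vector side of Redis looks like? Here's roughly how the cache_idx index from the earlier sketch could be created by hand. In the workflow itself the Redis Vector Store node handles this for you, so the names and dimension here are illustrative:

```python
# Roughly what a vector index for this kind of cache looks like in Redis.
# The n8n Redis Vector Store node creates and names its own index; this is
# only useful if you want to pre-create or inspect one yourself.
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis(host="localhost", port=6379)

r.ft("cache_idx").create_index(
    fields=[
        TextField("prompt"),
        TextField("response"),
        VectorField("embedding", "HNSW", {
            "TYPE": "FLOAT32",
            "DIM": 384,                    # must match your embedding model's dimension
            "DISTANCE_METRIC": "COSINE",   # the distanceThreshold is compared against this
        }),
    ],
    definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
)
```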

Get it running:

  1. Drop in your credentials
  2. Hit the chat interface
  3. Watch your API costs drop as the cache fills up

That's it. No complex setup, no configuration hell.

Tune It Your Way

The distanceThreshold in the "Analyze results from store" node is your control knob:

  • Lower (0.2): Strict matching, fewer false positives, more LLM calls
  • Higher (0.5): Loose matching, more cache hits, occasional weird matches
  • Default (0.3): Sweet spot for most use cases

Play with it. Find what works for your questions.
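
All the node really does with that number is compare the distance returned by the vector search against the threshold. With a made-up distance for the refund example from earlier:

```python
# Made-up distance, purely to illustrate how the threshold shifts the decision.
distance = 0.27  # e.g. "How do refunds work?" vs the cached "What's your refund policy?"

for threshold in (0.2, 0.3, 0.5):
    verdict = "cache hit" if distance <= threshold else "cache miss, call the LLM"
    print(f"threshold {threshold}: {verdict}")
# threshold 0.2: cache miss, call the LLM   (strict)
# threshold 0.3: cache hit                  (default)
# threshold 0.5: cache hit                  (loose)
```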

Hack It Up

Some ideas to get you started:

  • Add TTL: Make cached answers expire after a day/week/month (sketch below)
  • Category filters: Different caches for different topics
  • Confidence scores: Show users when they got a cached vs fresh answer
  • Analytics dashboard: Track cache hit rates and cost savings
  • Multi-language: Cache works across languages if your embedding model is multilingual
  • Custom embeddings: Swap the Hugging Face embeddings for local models or other providers
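
For the TTL idea, a plain EXPIRE on the cached keys is all it takes. The key name below is hypothetical, so check which keys the Redis Vector Store node actually writes before wiring this in:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Hypothetical key; inspect the keys the vector store actually writes first.
cache_key = "cache:12345"

r.expire(cache_key, 7 * 24 * 3600)  # this cached answer disappears after a week
```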

Real Talk 💡

When it shines:

  • Customer support (same questions, different words)
  • Documentation chatbots (limited knowledge base)
  • FAQ systems (obvious use case)
  • Internal tools (repetitive queries)

When to skip it:

  • Real-time data queries (stock prices, weather, etc.)
  • Highly personalized responses
  • Questions that need fresh context every time

Pro tip: Start with a higher threshold (0.4-0.5) and tighten it as you see what gets cached. Better to cache too much at first than miss obvious matches.

Built with n8n, Redis, Hugging Face, and OpenAI. Open source, self-hosted, completely under your control.