
    What is re-ranking in search?

    A two-step search pattern where a fast initial search narrows candidates down, then a more precise model reorders them, combining the speed of vector search with deeper relevance analysis.

    A two-step process: retrieve then refine

    Re-ranking separates the search problem into two steps. In the first step, a fast vector search retrieves a shortlist of the most promising candidates, typically the top 20 to 100 most similar items. In the second step, a more sophisticated model evaluates each of those candidates in detail and reorders the list based on a deeper understanding of relevance.

    The final results presented to the user come from this reordered list. Because the second step only evaluates the small shortlist (not the entire database), it can afford to be much more thorough without becoming slow.
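
    The retrieve-then-refine flow above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the function names (`retrieve_then_rerank`, `word_overlap`, `phrase_aware`) and the toy scoring rules are assumptions made up for this example, and a real system would replace the first step with a vector index lookup rather than scoring every document.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]  # (query, document) -> relevance score


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    fast_score: Scorer,
    precise_score: Scorer,
    shortlist_size: int = 50,
    top_k: int = 5,
) -> list[str]:
    # Step 1: a cheap score narrows the corpus down to a shortlist.
    # (Hypothetical stand-in for a vector index query.)
    shortlist = sorted(
        corpus, key=lambda doc: fast_score(query, doc), reverse=True
    )[:shortlist_size]
    # Step 2: the expensive score runs on the shortlist only.
    reranked = sorted(
        shortlist, key=lambda doc: precise_score(query, doc), reverse=True
    )
    return reranked[:top_k]


def word_overlap(query: str, doc: str) -> float:
    # Toy "fast" scorer: number of distinct words shared by query and doc.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_aware(query: str, doc: str) -> float:
    # Toy "precise" scorer: word overlap plus a bonus for an exact phrase match.
    bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return word_overlap(query, doc) + bonus


corpus = [
    "recipe for apple pie with cinnamon",
    "apple pie recipe for beginners",
    "pie chart tutorial",
    "how to grow an apple tree",
]
query = "apple pie recipe"

results = retrieve_then_rerank(
    query, corpus, word_overlap, phrase_aware, shortlist_size=2, top_k=1
)
print(results)  # ['apple pie recipe for beginners']
```

    Both recipe documents tie on the fast score, so the first step alone cannot separate them. The second scorer runs on just those two candidates and moves the exact phrase match to the top.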

    Why the second step is more accurate

    The first step uses a bi-encoder: the query and each document are encoded separately, and similarity is a quick calculation (for example, cosine similarity or a dot product) between the two resulting vectors. This is fast because document vectors are computed once during indexing and reused for every query; only the query has to be encoded at search time.
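
    To make the "encode once, compare cheaply" idea concrete, here is a small sketch. The `embed` function is a deliberately simple stand-in (a normalized bag-of-words vector), not a real embedding model, and every name in it is hypothetical. What matters is the shape of the workflow: documents are encoded at indexing time, and each query costs one encoding plus one dot product per document.

```python
import math
from collections import Counter


def embed(text: str) -> dict[str, float]:
    # Toy stand-in for a bi-encoder: a unit-length bag-of-words vector.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def dot(a: dict[str, float], b: dict[str, float]) -> float:
    # For unit-length vectors, the dot product equals cosine similarity.
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


documents = [
    "vector databases store embeddings",
    "re-ranking reorders a shortlist of candidates",
    "cross-encoders read the query and document together",
]

# Indexing time: every document is encoded exactly once.
index = [(doc, embed(doc)) for doc in documents]


def search(query: str, top_k: int = 2) -> list[str]:
    # Query time: only the query is encoded; document vectors are reused.
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: dot(query_vector, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


print(search("shortlist of candidates", top_k=1))
```

    Because the query and the documents never meet inside the encoder, the comparison cannot see how individual words interact. That limitation is what the second step addresses.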

    The second step uses a cross-encoder: the query and a candidate document are fed into the model together as a single input. This lets the model see how specific words in the query relate to specific words in the document, catching relevance signals that the first-step comparison of two independent vectors would miss. Popular re-ranking options include Cohere Rerank and open cross-encoder models available on Hugging Face. This deeper analysis typically adds 20 to 200 milliseconds, depending on the model and the number of candidates.
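
    The sketch below illustrates why scoring the pair jointly can help. It is not a real cross-encoder: `joint_score` is a hypothetical toy that looks at the query and document together and rewards shared adjacent word pairs, which a bag-of-words comparison cannot see. In practice this function would be replaced by a call to a trained re-ranking model.

```python
def joint_score(query: str, document: str) -> float:
    # Toy stand-in for a cross-encoder: it inspects both texts together.
    q = query.lower().split()
    d = document.lower().split()
    shared_words = len(set(q) & set(d))
    # Adjacent word pairs capture word order, an interaction signal that
    # independent bag-of-words vectors lose.
    shared_pairs = len(set(zip(q, q[1:])) & set(zip(d, d[1:])))
    return shared_words + 2.0 * shared_pairs


def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # Score every (query, candidate) pair, then sort by descending score.
    scored = [(joint_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


# Both candidates contain exactly the same query words, so a plain
# word-overlap comparison would score them identically.
candidates = ["man bites dog in park", "dog bites man in park"]
print(rerank("dog bites man", candidates))
# ['dog bites man in park', 'man bites dog in park']
```

    The scorer runs once per candidate, so cost grows with the size of the shortlist. That is why this step is applied to tens of candidates and not to the whole database.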

    When re-ranking is worth adding

    Re-ranking is most valuable when the order of results directly affects outcomes: legal research where the most relevant case needs to appear first, medical literature search where practitioners rely on the top result, and customer-facing search where ranking quality drives purchases.

    For simpler applications like internal question-answering chatbots, re-ranking is often unnecessary. If a well-tuned vector index retrieves the right 3 to 5 chunks for a question, an AI model can synthesize a good answer from them regardless of their order. Adding re-ranking in that scenario adds latency without meaningful benefit. The decision comes down to how important ranking quality is compared to response speed for your specific use case.

