Use Case
Multimodal Search with Endee
Search across images, text, and audio in a unified vector space. Query with any modality and retrieve the most relevant results across all content types.
Capabilities
Built for any-to-any retrieval
Cross-modal Retrieval
Query with text and retrieve images, or query with an image and find similar sounds. Any modality can be the query and any modality can be the result. CLIP maps text and images into the same embedding space, so a text query finds visually matching images with no special pipeline. Retrieval that involves audio needs a model that also covers audio, such as ImageBind.
Unified Embedding Space
Store all modalities in a single Endee index using multimodal embedding models like CLIP or ImageBind. No separate indexes per data type, no synchronization overhead. One search call retrieves the most relevant items across images, text, and audio simultaneously.
Multi-vector per Document
Store multiple embeddings per document when a single item has multiple modalities. A product can have both an image embedding and a text description embedding in the same record. Query against either vector and retrieve the full product record with metadata.
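A multi-vector record can be pictured with the plain-Python sketch below. The `ProductRecord` class, the vector names `image` and `text`, and the `search` helper are hypothetical illustrations rather than Endee's client API, and the toy 3-dimensional vectors stand in for real embeddings.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProductRecord:
    """Hypothetical record: one product carrying several named embeddings."""
    product_id: str
    vectors: Dict[str, List[float]]  # e.g. {"image": [...], "text": [...]}
    metadata: Dict[str, str] = field(default_factory=dict)


def dot(a: List[float], b: List[float]) -> float:
    # Plain dot product as a simple similarity score for this toy example.
    return sum(x * y for x, y in zip(a, b))


def search(records: List[ProductRecord], query: List[float],
           vector_name: str, top_k: int = 1) -> List[ProductRecord]:
    """Score the query against one named vector and return full records."""
    scored = [
        (dot(query, r.vectors[vector_name]), r)
        for r in records
        if vector_name in r.vectors
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:top_k]]


catalog = [
    ProductRecord("sku-1",
                  {"image": [0.9, 0.1, 0.0], "text": [0.8, 0.2, 0.1]},
                  {"title": "Red sneaker"}),
    ProductRecord("sku-2",
                  {"image": [0.0, 0.2, 0.9], "text": [0.1, 0.1, 0.9]},
                  {"title": "Blue backpack"}),
]

# Query the image vectors, then the text vectors, of the same records.
by_image = search(catalog, [1.0, 0.0, 0.0], "image")
by_text = search(catalog, [0.0, 0.0, 1.0], "text")
```

Whichever vector is searched, the hit is the whole record, so the caller gets the product's metadata without a second lookup.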
Metadata-filtered Cross-modal Search
Combine cross-modal vector search with structured metadata filters. Filter image results by upload date, rights status, or content category while searching by text query. Filters run during the ANN search, so the result set needs no separate post-filtering pass.
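The difference between filtering during the search and filtering afterwards can be illustrated with a brute-force scan. This is a simplified sketch, not Endee's ANN implementation. The `category` field, the toy vectors, and both helper functions are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Metadata = Dict[str, str]
Item = Tuple[str, List[float], Metadata]  # (id, vector, metadata)


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search_then_filter(items: List[Item], query: List[float],
                       keep: Callable[[Metadata], bool], k: int) -> List[str]:
    """Post-filtering: take the top k first, then drop non-matching items."""
    ranked = sorted(items, key=lambda it: dot(query, it[1]), reverse=True)
    return [it[0] for it in ranked[:k] if keep(it[2])]


def filter_during_search(items: List[Item], query: List[float],
                         keep: Callable[[Metadata], bool], k: int) -> List[str]:
    """Filter while scanning: only matching items compete for the top k.

    A real ANN index would evaluate the predicate while traversing its
    index structure instead of scanning every item as done here.
    """
    candidates = [it for it in items if keep(it[2])]
    ranked = sorted(candidates, key=lambda it: dot(query, it[1]), reverse=True)
    return [it[0] for it in ranked[:k]]


items: List[Item] = [
    ("img-1", [0.9, 0.1], {"category": "stock"}),
    ("img-2", [0.8, 0.2], {"category": "stock"}),
    ("img-3", [0.7, 0.3], {"category": "editorial"}),
    ("img-4", [0.6, 0.4], {"category": "editorial"}),
]
query = [1.0, 0.0]


def is_editorial(meta: Metadata) -> bool:
    return meta["category"] == "editorial"


post = search_then_filter(items, query, is_editorial, k=2)
during = filter_during_search(items, query, is_editorial, k=2)
```

In this toy data the two nearest items are both `stock`, so post-filtering returns nothing, while filtering during the scan still returns two `editorial` matches.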
Visual Search
Let users upload a photo and find visually similar products, scenes, or faces. CLIP and custom vision models produce embeddings that capture visual semantics such as color, style, and composition, so results are meaningfully similar, not just superficially matching.
Audio Similarity Search
Index audio clips as embeddings using audio encoders or ImageBind. Retrieve similar sounds, music tracks, or voice recordings by comparing embeddings. Use metadata filters to restrict by genre, duration, license type, or any custom attribute.
Process
How it works
Generate embeddings per modality
Use CLIP to encode both images and text into the same vector space. Use ImageBind for audio, depth, and thermal data alongside images and text. Run your embedding pipeline at ingest time so only the resulting vectors need to be stored.
Index all modalities in Endee
Insert all vectors into a single Endee index, each tagged with a modality field in its metadata. Store image URLs, transcripts, audio paths, or any other payload as metadata alongside the embedding. Use INT8 quantization to cut vector storage by about 75% compared with float32.
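The 75% figure follows from element size: a float32 component takes 4 bytes and an INT8 component takes 1 byte. The sketch below shows that arithmetic together with a simple symmetric scalar quantizer. The quantization scheme is an assumption for illustration, since the source does not say how Endee quantizes vectors, and 512 is used only as a typical CLIP embedding size.

```python
import struct
from typing import List, Tuple


def quantize_int8(vec: List[float]) -> Tuple[List[int], float]:
    """Symmetric scalar quantization: map [-max_abs, max_abs] to [-127, 127]."""
    max_abs = max(abs(x) for x in vec) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    return [round(x / scale) for x in vec], scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original float values."""
    return [c * scale for c in codes]


dim = 512
vec = [((i % 17) - 8) / 8.0 for i in range(dim)]  # toy values in [-1, 1]

codes, scale = quantize_int8(vec)

float32_bytes = len(struct.pack(f"{dim}f", *vec))  # 4 bytes per component
int8_bytes = len(struct.pack(f"{dim}b", *codes))   # 1 byte per component
saving = 1 - int8_bytes / float32_bytes            # fraction of bytes saved

# Worst-case rounding error introduced by quantization.
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
```

The saving here ignores the per-vector `scale` value, which adds a few bytes per vector, and it applies only to the vectors, not to the metadata payload stored alongside them.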
Query from any modality
At query time, embed the input in the same model space and search Endee. A text query returns the most semantically similar images, audio clips, and text documents ranked together. Apply modality filters to restrict results to a specific type when needed.
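The query step can be sketched as a top-k search over mixed-modality records with an optional modality filter. The in-memory list, the `modality` metadata key, and the `search` function are stand-ins for illustration, not Endee's client API. The query vector is assumed to come from the same embedding model used at ingest.

```python
import heapq
import math
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, List[float], Dict[str, str]]  # (id, vector, metadata)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: List[Record], query: List[float], k: int = 3,
           modality: Optional[str] = None) -> List[Tuple[str, str]]:
    """Return the k most similar records as (id, modality), best first."""
    candidates = (
        r for r in index
        if modality is None or r[2]["modality"] == modality
    )
    top = heapq.nlargest(k, candidates, key=lambda r: cosine(query, r[1]))
    return [(r[0], r[2]["modality"]) for r in top]


index: List[Record] = [
    ("beach.jpg", [0.9, 0.1, 0.1], {"modality": "image"}),
    ("waves.wav", [0.8, 0.3, 0.0], {"modality": "audio"}),
    ("surf-guide.txt", [0.7, 0.1, 0.4], {"modality": "text"}),
    ("city.jpg", [0.0, 0.9, 0.2], {"modality": "image"}),
]
query_vec = [1.0, 0.0, 0.0]  # stands in for an embedded text query

mixed = search(index, query_vec, k=3)                           # all modalities
images_only = search(index, query_vec, k=3, modality="image")   # filtered
```

Without a filter, the image, audio clip, and text document are ranked in one list. With `modality="image"`, only image records compete for the top k.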
In Practice
What teams build with multimodal search
Visual Product Search
Let customers upload a photo and find similar products in your catalog. No keywords required.
Media Asset Management
Search a photo or video library with natural language descriptions and retrieve matching frames or clips.
Audio Similarity Search
Find similar audio tracks, sound effects, or music by comparing audio embeddings in the shared vector space.
Cross-lingual Image Retrieval
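With a multilingual CLIP variant, a query in any supported language finds matching images because text in every language and images are projected into the same embedding space. The original CLIP models were trained mostly on English captions, so they are a weaker fit for this use case.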