What is multimodal search?

Multimodal search lets you query across different data types in a single operation. For example, a text query can return image results, an image query can return similar images or matching text, and an audio clip can retrieve similar sounds or transcripts. Endee stores all modalities as vectors in a shared index.

What embedding models work with Endee for multimodal search?

Endee works with any model that produces fixed-dimension dense vectors. For multimodal search, CLIP maps images and text to a shared 512- or 768-dimensional space, depending on the model variant. Meta's ImageBind maps six modalities (image, text, audio, depth, thermal, IMU) to a single embedding space.

Can I search images using a text query in Endee?

Yes. CLIP maps both images and text to the same vector space. You can embed a text query and search an index of image vectors to retrieve the most visually relevant images without any image preprocessing at query time.
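As a minimal sketch of the text-to-image flow described above, the following uses plain Python with small hand-written vectors standing in for CLIP embeddings. The `cosine` and `search` helpers, the file names, and the vector values are hypothetical and are not Endee's API; in practice the vectors would come from a CLIP text and image encoder and the search would run inside Endee.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, k=2):
    """Brute-force top-k over (item_id, vector) pairs, most similar first."""
    scored = [(cosine(query_vec, vec), item_id) for item_id, vec in index]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Toy 4-d vectors standing in for CLIP image embeddings (real ones are 512-d or 768-d).
image_index = [
    ("beach.jpg",  [0.9, 0.1, 0.0, 0.1]),
    ("forest.jpg", [0.1, 0.9, 0.1, 0.0]),
    ("city.jpg",   [0.0, 0.1, 0.9, 0.2]),
]

# Stand-in for the CLIP text embedding of a query such as "sunny coastline".
text_query_vec = [0.8, 0.2, 0.1, 0.0]

results = search(text_query_vec, image_index, k=2)
print(results)
```

Only the text query is embedded at query time; the image vectors were computed once at ingest, which is why no image preprocessing is needed when searching.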

Use Case

Multimodal Search with Endee

Search across images, text, and audio in a unified vector space. Query with any modality and retrieve the most relevant results across all content types.

CLIP Embeddings, ImageBind, Text-to-Image, Image-to-Image, Cross-lingual, Multi-vector per Item

Capabilities

Built for any-to-any retrieval

Cross-modal Retrieval

Query with text and retrieve images. Query with an image and find similar sounds. Any modality can be the query and any modality can be the result, provided the embedding model covers both. CLIP maps text and images to the same vector space, so a text query finds visually matching images with no special pipeline; pairing images with audio requires a model that also embeds audio, such as ImageBind.

Unified Embedding Space

Store all modalities in a single Endee index using multimodal embedding models like CLIP or ImageBind. No separate indexes per data type, no synchronization overhead. One search call retrieves the most relevant items across images, text, and audio simultaneously.

Multi-vector per Document

Store multiple embeddings per document when a single item has multiple modalities. A product can have both an image embedding and a text description embedding in the same record. Query against either vector and retrieve the full product record with metadata.
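One way to picture a multi-vector record is sketched below in plain Python. The `ProductRecord` class, the `search_by_vector` helper, and the vector names `"image"` and `"text"` are illustrative assumptions, not Endee's schema or API; the point is only that either named vector can drive the ranking while the full record comes back.

```python
import math
from dataclasses import dataclass, field

@dataclass
class ProductRecord:
    product_id: str
    vectors: dict                      # named embeddings, e.g. {"image": [...], "text": [...]}
    metadata: dict = field(default_factory=dict)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search_by_vector(records, query_vec, vector_name, k=1):
    """Rank records by one named vector and return the full records."""
    candidates = [r for r in records if vector_name in r.vectors]
    candidates.sort(key=lambda r: cosine(query_vec, r.vectors[vector_name]), reverse=True)
    return candidates[:k]

# Toy 3-d vectors standing in for real image and text embeddings.
catalog = [
    ProductRecord("sku-1", {"image": [0.9, 0.1, 0.0], "text": [0.8, 0.2, 0.1]}, {"title": "Red sneaker"}),
    ProductRecord("sku-2", {"image": [0.1, 0.9, 0.1], "text": [0.2, 0.9, 0.0]}, {"title": "Green jacket"}),
]

by_image = search_by_vector(catalog, [1.0, 0.0, 0.0], "image")
by_text = search_by_vector(catalog, [0.1, 1.0, 0.0], "text")
print(by_image[0].metadata["title"], "|", by_text[0].metadata["title"])
```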

Metadata-filtered Cross-modal Search

Combine cross-modal vector search with structured metadata filters. Filter image results by upload date, rights status, or content category while searching by text query. Filters run during ANN search so there is no post-processing overhead on the result set.
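The difference between filtering during the search and filtering afterwards can be sketched with a brute-force scan. This is an illustration under stated assumptions, not Endee's ANN implementation: the helper names, the item layout, and the `rights` values are hypothetical. It shows why applying the filter while candidates are scored returns a full set of `k` matches, whereas filtering a finished top-`k` list can return fewer.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_search(items, query_vec, predicate, k):
    """Apply the metadata predicate while scanning, then take the top k."""
    scored = [
        (cosine(query_vec, item["vector"]), item["id"])
        for item in items
        if predicate(item["meta"])
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]

def post_filtered_search(items, query_vec, predicate, k):
    """Take the top k first, then drop results that fail the predicate."""
    scored = sorted(((cosine(query_vec, it["vector"]), it["id"]) for it in items), reverse=True)
    meta_by_id = {it["id"]: it["meta"] for it in items}
    return [item_id for _, item_id in scored[:k] if predicate(meta_by_id[item_id])]

def is_licensed(meta):
    return meta["rights"] == "licensed"

items = [
    {"id": "img-1", "vector": [0.9, 0.1], "meta": {"rights": "restricted", "category": "travel"}},
    {"id": "img-2", "vector": [0.8, 0.3], "meta": {"rights": "licensed", "category": "travel"}},
    {"id": "img-3", "vector": [0.6, 0.6], "meta": {"rights": "licensed", "category": "travel"}},
    {"id": "img-4", "vector": [0.1, 0.9], "meta": {"rights": "licensed", "category": "food"}},
]
query_vec = [1.0, 0.0]  # stand-in for an embedded text query

during = filtered_search(items, query_vec, is_licensed, k=2)
after = post_filtered_search(items, query_vec, is_licensed, k=2)
print(during, after)
```

In this toy data the closest image is rights-restricted, so the post-filtered variant loses one of its two slots while the filtered scan still returns two licensed images.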

Visual Search

Let users upload a photo and find visually similar products, scenes, or faces. CLIP and custom vision models produce embeddings that capture visual semantics such as color, style, and composition, so results are meaningfully similar, not just superficially matching.

Audio Similarity Search

Index audio clips as embeddings using audio encoders or ImageBind. Retrieve similar sounds, music tracks, or voice recordings by comparing embeddings. Use metadata filters to restrict by genre, duration, license type, or any custom attribute.

Process

How it works

Generate embeddings per modality

Use CLIP to encode both images and text into the same vector space. Use ImageBind for audio, depth, and thermal data alongside images and text. Run your embedding pipeline at ingest time so only the resulting vectors need to be stored.

Index all modalities in Endee

Insert all vectors into a single Endee collection, tagging each with a modality field in metadata. Store image URLs, transcripts, audio paths, or any other payload as metadata alongside the embedding. Use INT8 quantization to reduce vector storage by about 75% compared with 32-bit floats.
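The 75% figure follows from the element size, assuming the unquantized vectors are 32-bit floats: each dimension shrinks from 4 bytes to 1 byte, and 1 - 1/4 = 0.75. The sketch below checks that arithmetic with a simple symmetric scalar quantizer written in plain Python. The `quantize_int8` and `dequantize` helpers are illustrative; how Endee quantizes internally is not specified here.

```python
from array import array

def quantize_int8(vec):
    """Symmetric scalar quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(x) for x in vec) or 1.0   # fall back to 1.0 for an all-zero vector
    scale = max_abs / 127.0
    codes = array("b", (round(x / scale) for x in vec))
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]

dim = 512
# Synthetic 32-bit float vector with values in [-1, 1]; a real one would come from the encoder.
original = array("f", ((i % 17 - 8) / 8.0 for i in range(dim)))
codes, scale = quantize_int8(original)

float_bytes = original.itemsize * len(original)   # 4 bytes per dimension
int8_bytes = codes.itemsize * len(codes)          # 1 byte per dimension
saving = 1 - int8_bytes / float_bytes
max_error = max(abs(a - b) for a, b in zip(original, dequantize(codes, scale)))
print(float_bytes, int8_bytes, saving, max_error)
```

A per-vector scale factor adds a few bytes on top of the codes, so the realized saving is marginally below 75%; the per-dimension rounding error is bounded by half the scale.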

Query from any modality

At query time, embed the input in the same model space and search Endee. A text query returns the most semantically similar images, audio clips, and text documents ranked together. Apply modality filters to restrict results to a specific type when needed.
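The query step can be sketched as a single ranked scan over a mixed collection with an optional modality restriction. The `query` function, the `modality` and `source` metadata keys, and the toy vectors below are assumptions for illustration and do not reflect Endee's actual client API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def query(collection, query_vec, k=3, modality=None):
    """Rank all items together; optionally keep only one modality."""
    hits = [
        (cosine(query_vec, item["vector"]), item)
        for item in collection
        if modality is None or item["metadata"]["modality"] == modality
    ]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [(item["metadata"]["modality"], item["metadata"]["source"]) for _, item in hits[:k]]

# Toy 3-d vectors standing in for embeddings from one shared model space.
collection = [
    {"vector": [0.9, 0.1, 0.1], "metadata": {"modality": "image", "source": "img/waves.jpg"}},
    {"vector": [0.8, 0.3, 0.0], "metadata": {"modality": "audio", "source": "audio/surf.wav"}},
    {"vector": [0.7, 0.1, 0.4], "metadata": {"modality": "text", "source": "docs/tide-guide.txt"}},
    {"vector": [0.0, 0.9, 0.2], "metadata": {"modality": "image", "source": "img/desk.jpg"}},
]
query_vec = [1.0, 0.1, 0.1]  # stand-in for an embedded text query

mixed = query(collection, query_vec, k=3)
images_only = query(collection, query_vec, k=3, modality="image")
print(mixed)
print(images_only)
```

Ranking different modalities in one list is only meaningful when every vector comes from the same model space, which is why the query must be embedded with the same model used at ingest.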

In Practice

What teams build with multimodal search

Visual Product Search

Let customers upload a photo and find similar products in your catalog. No keywords required.

Media Asset Management

Search a photo or video library with natural language descriptions and retrieve matching frames or clips.

Audio Similarity Search

Find similar audio tracks, sound effects, or music by comparing audio embeddings in the shared vector space.

Cross-lingual Image Retrieval

With a multilingual CLIP variant, a query in any supported language finds matching images because the model projects text and images into the same embedding space.

Related resources

Semantic Search (use case)

Recommendations (use case)

Edge AI (product)

Benchmarks (performance data)