Multimodal AI pipelines typically require separate models for text, images, video, and audio, each adding transcription overhead, latency, and cost before a single search query can run. Google’s ...