https://opensea.io/collection/science-14
0 reply
0 recast
0 reaction
qt
@qt
"We introduce \model, a multimodal, late-interaction retriever that jointly indexes four modalities: video frames, transcribed speech, on-screen text, and other metadata. \model jointly encodes all modalities within a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, to overcome the lack of training data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale synthetic training dataset built on MultiVENT 2.0 (a dataset of event-centric videos in various languages paired with English queries) with modality-targeted queries to teach modality selection. Next, we propose a modality-aware contrastive loss that jointly trains according to a standard contrastive objective alongside an objective for learning correct modality usage." https://arxiv.org/html/2506.06144v1
1 reply
0 recast
2 reactions
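The cast quotes only the abstract, so the paper's exact objective isn't shown here. As a rough illustration, below is a minimal PyTorch sketch of what a modality-aware contrastive loss of this shape could look like: a standard in-batch InfoNCE term plus an auxiliary term rewarding the query-targeted modality. All function names, parameters, and the pooling choice are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn.functional as F

def modality_aware_contrastive_loss(
    query_emb,        # (B, D): query embeddings (hypothetical shapes)
    video_emb,        # (B, M, D): per-modality video embeddings, M modalities
    modality_labels,  # (B,): index of the modality each query targets
    temperature=0.07,
    aux_weight=0.5,   # assumed trade-off weight, not from the paper
):
    """Sketch: standard contrastive retrieval loss plus an auxiliary
    modality-usage term, loosely following the abstract's description."""
    B, M, D = video_emb.shape
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)

    # Per-modality similarity between every query and every video: (B, B, M)
    sim = torch.einsum("qd,bmd->qbm", q, v)

    # Pool over modalities to score each query-video pair: (B, B).
    # Max-pooling here is a stand-in for the paper's late-interaction scoring.
    pooled = sim.max(dim=-1).values

    # Symmetric in-batch InfoNCE: the matched video is the positive
    targets = torch.arange(B, device=q.device)
    logits = pooled / temperature
    contrastive = 0.5 * (
        F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)
    )

    # Modality-usage objective: for the positive video, the query's
    # target modality should receive the highest similarity
    pos_modality_logits = sim[targets, targets] / temperature  # (B, M)
    modality = F.cross_entropy(pos_modality_logits, modality_labels)

    return contrastive + aux_weight * modality
```

Here max-pooling over modalities stands in for late-interaction scoring, and `aux_weight` trades off retrieval accuracy against modality selection; both choices are assumptions made for the sketch rather than CLaMR's actual formulation.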
Omar
@omarperacha.eth
A multimodal training set is huge. Human intelligence is multimodal, I'm sure all the best AIs will be too before long
0 reply
0 recast
0 reaction