Unified Visual Relationship Detection with Vision and Language Models

A vision-language model (VLM) for scene understanding via visual relationship detection (VRD). It combines a DETR-like object detector (with bounding box prediction) and a Perceiver Resampler for the relationship decoder.

My summary on HFPapers: https://huggingface.co/papers/2303.08998#64ff22002597506d5adf7966
arXiv: https://arxiv.org/abs/2303.08998

#arxiv #paper #FoundationModels
