Ferret: An End-to-End MLLM by Apple
Ferret: Refer and Ground Anything Anywhere at Any Granularity
An End-to-End MLLM that Accepts Any-Form Referring and Grounds Anything in Response. [Paper]
Haoxuan You *, Haotian Zhang *, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang
[*: equal contribution]
Introduction
Diagram of Ferret Model.
Key Contributions:
Ferret Model - Hybrid Region Representation + Spatial-aware Visual Sampler enable fine-grained and open-vocabulary referring and grounding in MLLMs.
GRIT Dataset (~1.1M) - A Large-scale, Hierarchical, Robust ground-and-refer instruction tuning dataset.
Ferret-Bench - A multimodal evaluation benchmark that jointly requires Referring/Grounding, Semantics, Knowledge, and Reasoning.
Release
[12/14] We la...