SAMTok: Representing Any Mask with Two Words
Yikang Zhou, Tao Zhang, Dengxian Gong, and 13 more
Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we pr...