Synthetic-Data-Driven MLLM for 3D Spatial Reasoning

Advisor: Jigang Wu    Creator: 姜荣臻

Vision-language models (VLMs) are advanced AI systems designed to process and understand information from both visual and textual data simultaneously. By integrating deep learning techniques, these models can interpret the content of images in a way that aligns with human understanding of language, making them crucial for tasks that require nuanced interpretation of multimedia content.


Spatial reasoning refers to the capability of vision-language models to interpret and comprehend the spatial relationships among objects in images. For instance, when presented with an image of a soccer field teeming with players during a match, a model equipped with spatial reasoning can provide responses such as “player 7 is closest to player 10” or “the distance from a player to the goal post is 8 meters.” This skill is vital for tasks requiring precise analysis of spatial information, such as robotics.
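
To make the target output concrete, the following minimal sketch shows how such a spatial question-answer exchange might be represented as structured data. The schema, field names, and example values are illustrative assumptions made for this write-up, not a fixed standard.

```python
from dataclasses import dataclass

@dataclass
class SpatialQAPair:
    """One spatial question-answer pair grounded in a single image (illustrative schema)."""
    image_path: str  # source image the question refers to
    question: str    # question about spatial relations among objects
    answer: str      # grounded answer: a qualitative relation or a metric estimate

# Hypothetical examples matching the soccer scene described above.
examples = [
    SpatialQAPair(
        image_path="match_frame_0421.jpg",
        question="Which player is closest to player 10?",
        answer="Player 7 is closest to player 10.",
    ),
    SpatialQAPair(
        image_path="match_frame_0421.jpg",
        question="How far is the nearest player from the goal post?",
        answer="About 8 meters.",
    ),
]

for ex in examples:
    print(f"Q: {ex.question}\nA: {ex.answer}")
```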


However, the spatial reasoning performance of current VLMs remains unsatisfactory. This limitation is due not to the architecture of the VLMs themselves but to the quality and quantity of the training data. The majority of 3D datasets rely heavily on human annotation, which is inefficient and labor-intensive. Additionally, the scale of these datasets remains limited, with over half containing fewer than 100,000 samples. These challenges significantly impede the spatial reasoning capabilities of vision-language models.


What is required is a much larger quantity of training data in the form of visual question-answer (VQA) pairs. Therefore, this project aims to construct a pipeline leveraging the recent framework ‘VQASynth’ to automatically generate a large-scale dataset of spatial VQA pairs, with the goal of enhancing the spatial reasoning capabilities of vision-language models.
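
As a rough illustration of what such a pipeline produces, the sketch below shows only the final templating step: once upstream perception stages (e.g. object detection, segmentation, and metric depth estimation) have attached 3D positions to labeled objects in an image, pairwise spatial relations can be turned into QA pairs automatically. The function names and data layout here are assumptions for illustration and do not reflect VQASynth's actual interface.

```python
import math

def generate_spatial_qa(objects_3d: list[dict]) -> list[dict]:
    """Turn pairwise metric 3D distances between labeled objects into templated QA pairs."""
    qa_pairs = []
    for i, a in enumerate(objects_3d):
        for b in objects_3d[i + 1:]:
            dist = math.dist(a["xyz"], b["xyz"])  # Euclidean distance, assumed in meters
            qa_pairs.append({
                "question": f"How far is the {a['label']} from the {b['label']}?",
                "answer": f"About {dist:.1f} meters.",
            })
    return qa_pairs

# Toy scene with hand-written coordinates, standing in for the output of the
# upstream perception stages (detection + depth estimation).
scene = [
    {"label": "player 7",  "xyz": (2.0, 0.0, 14.0)},
    {"label": "player 10", "xyz": (3.5, 0.0, 15.0)},
    {"label": "goal post", "xyz": (0.0, 0.0, 22.0)},
]

for pair in generate_spatial_qa(scene):
    print(pair["question"], "->", pair["answer"])
```

Applying this kind of templating over large numbers of automatically perceived scenes is what would remove the dependence on manual annotation described above.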