{"pk":49739,"title":"CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding","subtitle":null,"abstract":"How do vision-language (VL) transformer models ground verb phrases, and do they integrate contextual and world knowledge in this process? We introduce the CV-Probes dataset, containing image-caption pairs involving verb phrases that require both social knowledge and visual context to interpret (e.g., 'beg'), as well as pairs involving verb phrases that can be grounded based on information directly available in the image (e.g., 'sit'). We show that VL models struggle to ground verb phrases that are strongly context-dependent. Further analysis using explainable AI techniques shows that such models may not pay sufficient attention to the verb token in the captions. Our results suggest a need for improved methodologies in VL model training and evaluation. The code and dataset will be available at https://github.com/ivana-13/CV-Probes.","language":"eng","license":{"name":"","short_name":"","text":null,"url":""},"keywords":[{"word":"Artificial Intelligence; Language understanding; Natural Language Processing; Pattern recognition; Neural Networks"}],"section":"Papers with Poster Presentation","is_remote":true,"remote_url":"https://escholarship.org/uc/item/3h83566r","frozenauthors":[{"first_name":"Ivana","middle_name":"","last_name":"Benova","name_suffix":"","institution":"Brno University of Technology","department":""},{"first_name":"Michal","middle_name":"","last_name":"Gregor","name_suffix":"","institution":"Kempelen Institute of Intelligent Technologies","department":""},{"first_name":"Albert","middle_name":"","last_name":"Gatt","name_suffix":"","institution":"Utrecht University","department":""}],"date_submitted":null,"date_accepted":null,"date_published":"2025-01-01T18:00:00Z","render_galley":null,"galleys":[{"label":"PDF","type":"pdf","path":"https://journalpub.escholarship.org/cognitivesciencesociety/article/49739/galley/37701/download/"}]}