Why Existing Multimodal Crowd Counting Datasets Can Lead to Unfulfilled Expectations in Real-World Applications

Thissen, Martin; Hergenröther, Elke

Full metadata record

DC field	Value	Language
dc.contributor.author	Thissen, Martin
dc.contributor.author	Hergenröther, Elke
dc.contributor.editor	Skala, Václav
dc.date.accessioned	2023-10-15T16:58:02Z
dc.date.available	2023-10-15T16:58:02Z
dc.date.issued	2023
dc.identifier.citation	WSCG 2023: full papers proceedings: 31. International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, p. 28-35.	en
dc.identifier.isbn	978-80-86943-32-9
dc.identifier.issn	2464-4617 (print)
dc.identifier.issn	2464-4625 (CD/DVD)
dc.identifier.uri	http://hdl.handle.net/11025/54396
dc.format	8 s.	cs
dc.format.mimetype	application/pdf
dc.language.iso	en	en
dc.publisher	Václav Skala - UNION Agency	en
dc.rights	© Václav Skala - UNION Agency	en
dc.subject	počítání davů	cs
dc.subject	multimodální učení	cs
dc.subject	RGB-T	cs
dc.subject	transformátor	cs
dc.title	Why Existing Multimodal Crowd Counting Datasets Can Lead to Unfulfilled Expectations in Real-World Applications	en
dc.type	konferenční příspěvek	cs
dc.type	conferenceObject	en
dc.rights.access	openAccess	en
dc.type.version	publishedVersion	en
dc.description.abstract-translated	More information leads to better decisions and predictions, right? Confirming this hypothesis, several studies concluded that the simultaneous use of optical and thermal images leads to better predictions in crowd counting. However, the way multimodal models extract enriched features from both modalities is not yet fully understood. Since the use of multimodal data usually increases the complexity, inference time, and memory requirements of the models, it is relevant to examine the differences and advantages of multimodal compared to monomodal models. In this work, all available multimodal datasets for crowd counting are used to investigate the differences between monomodal and multimodal models. To do so, we designed a monomodal architecture that considers the current state of research on monomodal crowd counting. In addition, several multimodal architectures have been developed using different multimodal learning strategies. The key components of the monomodal architecture are also used in the multimodal architectures to be able to answer whether multimodal models perform better in crowd counting in general. Surprisingly, no general answer to this question can be derived from the existing datasets. We found that the existing datasets hold a bias toward thermal images. This was determined by analyzing the relationship between the brightness of optical images and crowd count as well as examining the annotations made for each dataset. Since answering this question is important for future real-world applications of crowd counting, this paper establishes criteria for a potential dataset suitable for answering whether multimodal models perform better in crowd counting in general.	en
dc.subject.translated	crowd counting	en
dc.subject.translated	multimodal learning	en
dc.subject.translated	RGB-T	en
dc.subject.translated	transformer	en
dc.identifier.doi	https://www.doi.org/10.24132/CSRN.3301.5
dc.type.status	Peer-reviewed	en
Appears in collections:	WSCG 2023: Full Papers Proceedings

Files in this item:

File	Description	Size	Format
D97-full.pdf	Full text	5.68 MB	Adobe PDF

Please use this identifier to cite or link to this item: http://hdl.handle.net/11025/54396

All items in DSpace are protected by copyright, with all rights reserved.
