In virtual environments, both spatial and semantic characteristics may influence optimal bimodal integration. We conducted an experiment to study how these two factors interact, assessing the role of spatial congruence between visual and auditory information in object recognition. We used a go/no-go paradigm with two meaningful objects, presented visually, auditorily, or both. One was the target (a phone) and the other a distractor (a train). These objects were displayed in two spatial conditions, with or without spatial disparity. The visual stimulus was embedded in an “ecological” virtual environment (VE) representing a room. We used a passive stereoscopic screen and Wave Field Synthesis (WFS) audio rendering, which allows accurate reproduction of spatial sound localization cues. The results demonstrate a bimodal facilitation effect in the semantically congruent case compared to both the unimodal auditory and unimodal visual conditions. Moreover, reaction times for semantically congruent cross-modal stimuli were faster than for semantically incongruent stimuli. Surprisingly, spatial congruence had no effect on subjects' responses. This experiment thus suggests a dominance of semantic congruence over spatial congruence, providing experimental evidence of the importance of semantic content in multisensory integration.