Movie shot categorization can be approached by using audio and visual features to infer high-level information about a movie shot. Low-level audio and visual features such as color and MFCCs, as well as mid-level features such as sky and speech detection, have been used in multimedia understanding research. However, integrating all of these features in a classifier remains an open problem. In this paper, we propose a multimedia SVM fusion model, presented in Figure 1, that integrates knowledge from low-level and semantic features extracted from the auditory and visual signals for scene classification of movie shots. We also compare our method with common feature-integration approaches based on Bayesian networks. Our computational results show that our model achieves significantly better and more stable performance than the other strategies.
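To make the fusion idea concrete, the following is a minimal sketch of late SVM fusion, not the paper's exact model: one SVM is trained per modality, and a second SVM classifies the stacked per-modality decision scores. The synthetic "audio" and "visual" features, the dimensionalities, and the kernels are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
# Two scene classes with synthetic per-modality features
# (stand-ins for MFCC-like audio and color-like visual descriptors).
y = rng.integers(0, 2, n)
audio = rng.normal(loc=y[:, None], scale=1.0, size=(n, 12))
visual = rng.normal(loc=y[:, None], scale=1.0, size=(n, 9))

train, test = slice(0, 150), slice(150, n)

# Stage 1: one SVM per modality, each producing a decision score.
svm_a = SVC(kernel="rbf").fit(audio[train], y[train])
svm_v = SVC(kernel="rbf").fit(visual[train], y[train])
scores = np.column_stack([svm_a.decision_function(audio),
                          svm_v.decision_function(visual)])

# Stage 2: a fusion SVM combines the stacked scores into one decision.
fusion = SVC(kernel="linear").fit(scores[train], y[train])
acc = fusion.score(scores[test], y[test])
print(f"fusion accuracy: {acc:.2f}")
```

A Bayesian-network baseline for comparison would instead model the joint distribution of (discretized) feature states and the class, rather than combining margin scores.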