Audio-visual scene understanding towards unified, explainable, and robust multisensory perception