Deep learning for video-grounded dialogue systems