Exploring the Limits of Multimodal Foundation Models for Visual Temporal Reasoning and Gesture Recognition Tasks