TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Soccer is a globally popular sport, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) show promising capabilities in temporal grounding and video understanding. However, generating soccer commentary requires both precise temporal localization and semantically rich descriptions over long-form videos. Existing soccer MLLMs often rely on temporal priors for caption generation, which limits their ability to process the entire video in an end-to-end manner. Traditional approaches, on the other hand, follow a complex two-step paradigm that fails to capture the global context, leading to suboptimal performance. To address these issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and we incorporate complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that TimeSoccer achieves state-of-the-art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance. For more information, please visit: https://vpx-ecnu.github.io/TimeSoccer-Website/.
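The abstract describes MoFA-Select only at a high level (training-free, motion-aware, coarse-to-fine frame selection). As a rough illustration of that general idea, and not the authors' implementation, the sketch below scores motion by mean absolute difference between consecutive frames, splits the video into contiguous segments (coarse step), and keeps the highest-motion frame in each segment (fine step). The function names `motion_scores` and `select_frames`, the `budget` parameter, and the frame-difference motion measure are all assumptions made for this example.

```python
from typing import List, Sequence


def motion_scores(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor.

    Each frame is a flat sequence of pixel values (hypothetical representation).
    """
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        scores.append(diff / max(len(cur), 1))
    return scores


def select_frames(frames: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Pick at most `budget` frame indices.

    Coarse step: split the video into `budget` contiguous segments.
    Fine step: keep the frame with the highest motion score in each segment.
    This is an illustrative stand-in, not the MoFA-Select algorithm itself.
    """
    n = len(frames)
    if budget <= 0 or n == 0:
        return []
    if budget >= n:
        return list(range(n))  # nothing to compress
    scores = motion_scores(frames)
    selected = []
    for k in range(budget):
        start = k * n // budget
        end = (k + 1) * n // budget  # non-empty because budget < n
        best = max(range(start, end), key=lambda i: scores[i])
        selected.append(best)
    return selected


if __name__ == "__main__":
    # Six tiny two-pixel "frames" with scene changes at index 2 and index 5.
    toy = [[0, 0], [0, 0], [5, 5], [5, 5], [5, 5], [9, 9]]
    print(select_frames(toy, budget=2))
```

Because the selection needs no learned parameters, a step of this kind can run before the model sees the video, which is the sense in which such a module is training-free. A real system would operate on decoded video tensors or visual tokens instead of Python lists.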

Original language: English
Title of host publication: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
Publisher: Association for Computing Machinery, Inc
Pages: 3418-3427
Number of pages: 10
ISBN (Electronic): 9798400720352
DOIs
State: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland
Duration: 27 Oct 2025 to 31 Oct 2025

Publication series

Name: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Conference

Conference: 33rd ACM International Conference on Multimedia, MM 2025
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 to 31/10/25

Keywords

  • multimodal model
  • temporal localization
  • video captioning
