Generating Textual Video Summaries using Modified Bi-Modal Transformer and Whisper Model | IEEE Conference Publication | IEEE Xplore