CASA-Net: Cross-attention and Self-attention for End-to-End Audio-visual Speaker Diarization | IEEE Conference Publication | IEEE Xplore