I. Introduction
Convolutional neural networks (CNNs) deployed on smartphones enable a variety of novel deep learning applications, including object-recognition-based second language learning [1]. With the rapid development of smartphones and broadband network technologies, multimedia information has become increasingly easy to acquire [2], [3], and intelligent education formats have diversified as a result. Among these formats, images are the most intuitive and easily accessible learning materials [4], [5]. Intelligent recognition of image content on resource-constrained devices can therefore be convenient for smart education systems. When learners use a smartphone to recognize objects in real-world scenes, and the recognition results are converted into text and corresponding audio, this stimulates greater interest in language learning and enhances learning efficiency compared with traditional methods.