1. Introduction
Sketch representation and interpretation remain an open challenge, particularly for complex and casually constructed drawings. Yet the ability to classify, search, and manipulate sketched content is increasingly attractive as gesture and touch interfaces reach ubiquity. Advances in recurrent network architectures for language processing have recently inspired sequence modeling approaches to sketch (e.g. SketchRNN [1]) that encode a sketch as a variable-length sequence of strokes, rather than in a rasterized or ‘pixel’ form. In particular, long short-term memory (LSTM) networks have shown significant promise in learning search embeddings [2], [3] due to their ability to model higher-level structure and temporal order, in contrast to convolutional neural networks (CNNs) operating on rasterized sketches [4, 5, 6, 7]. Yet the limited temporal extent of LSTMs restricts the structural complexity of sketches that can be accommodated in sequence embeddings. In the language modeling domain, this shortcoming has been addressed through the emergence of Transformer networks [8, 9, 10], in which slot masking enhances the ability to learn complex structures represented by longer sequences.
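For concreteness, the stroke-sequence encoding referenced above can be illustrated with the ‘stroke-5’ format introduced with SketchRNN [1], in which each step records a pen offset plus a one-hot pen state. The snippet below is a minimal sketch in Python, assuming that format; the specific coordinate values and the helper function are illustrative only and not drawn from this paper.

    import numpy as np

    # Illustrative example of the stroke-5 format [1]: each row is
    # (dx, dy, p1, p2, p3), where p1 = pen touching paper, p2 = pen lifted
    # after this point, p3 = end of sketch. Values here are made up.
    stroke5 = np.array([
        [ 5.0,  0.0, 1, 0, 0],   # draw to the right
        [ 0.0,  5.0, 1, 0, 0],   # draw downward
        [-5.0,  0.0, 0, 1, 0],   # draw left, then lift the pen
        [10.0, 10.0, 1, 0, 0],   # reposition and start a new stroke
        [ 0.0,  0.0, 0, 0, 1],   # end of sketch
    ], dtype=np.float32)

    def to_absolute(seq):
        # Recover absolute pen coordinates from the relative offsets.
        return np.cumsum(seq[:, :2], axis=0)

    print(to_absolute(stroke5))

Because the number of rows varies per drawing, such sequences are naturally consumed by recurrent or Transformer sequence models, unlike fixed-size raster inputs to a CNN.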