Abstract:
The conventional n-gram language model exploits dependencies between words and their fixed-length past. This letter presents a model that represents sentences as a concatenation of variable-length sequences of units and describes an algorithm for unsupervised estimation of the model parameters. The approach is illustrated for the segmentation of sequences of letters into subword-like units. It is evaluated as a language model on a corpus of transcribed spoken sentences. Multigrams can provide a significantly lower test set perplexity than n-gram models.
Published in: IEEE Signal Processing Letters (Volume 2, Issue 6, June 1995)
DOI: 10.1109/97.388911
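The core idea of the abstract — scoring a sentence as a concatenation of variable-length units and estimating the unit probabilities without supervision — can be illustrated with a short sketch. The fragment below uses hard (Viterbi) re-estimation in place of the full expectation step over all segmentations described in the letter; the function names, the `max_len` cap on unit length, and the uniform substring initialization are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict

def viterbi_segment(seq, probs, max_len):
    """Best-scoring segmentation of seq into variable-length units,
    assuming units are emitted independently (the multigram assumption).
    Assumes seq is segmentable with the units present in probs."""
    n = len(seq)
    best = [(-math.inf, None)] * (n + 1)  # (log-prob, backpointer)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for l in range(1, min(max_len, i) + 1):
            p = probs.get(seq[i - l:i])
            if p is None or best[i - l][0] == -math.inf:
                continue
            score = best[i - l][0] + math.log(p)
            if score > best[i][0]:
                best[i] = (score, i - l)
    # Recover the segmentation by following backpointers.
    units, i = [], n
    while i > 0:
        j = best[i][1]
        units.append(seq[j:i])
        i = j
    return units[::-1], best[n][0]

def estimate_multigrams(corpus, max_len=3, iterations=10):
    """Unsupervised re-estimation of unit probabilities.
    Hard-EM simplification: each iteration keeps only the single best
    segmentation of each sentence and renormalizes the unit counts."""
    # Initialize with counts of every substring up to max_len letters,
    # so every sentence is segmentable from the start.
    counts = defaultdict(float)
    for seq in corpus:
        for i in range(len(seq)):
            for l in range(1, min(max_len, len(seq) - i) + 1):
                counts[seq[i:i + l]] += 1.0
    total = sum(counts.values())
    probs = {u: c / total for u, c in counts.items()}
    for _ in range(iterations):
        counts = defaultdict(float)
        for seq in corpus:
            units, _ = viterbi_segment(seq, probs, max_len)
            for u in units:
                counts[u] += 1.0
        total = sum(counts.values())
        probs = {u: c / total for u, c in counts.items()}
    return probs

corpus = ["abracadabra", "barbarella", "cadabra"]
probs = estimate_multigrams(corpus)
units, logp = viterbi_segment("abracadabra", probs, 3)
print(units, logp)
```

A per-letter perplexity, the evaluation metric used in the letter, would follow from the per-symbol log-likelihood — here, math.exp(-logp / len(seq)) under the Viterbi approximation, whereas the letter's likelihood sums over all segmentations rather than keeping only the best one.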