Unknown Words Modelling in Training and Using Language Models for Russian LVCSR System



The paper considers some peculiarities of training and using N-gram language models with an open vocabulary. It is demonstrated that explicit modeling of the probability distribution of out-of-model (unknown) words is necessary in this case. Two known techniques for this modeling are considered, and a new technique with several advantages is proposed. We present experiments that demonstrate the consistency of the proposed approach.