Text Generation using Hidden Markov Models

Abstract

Natural language generation (NLG) is the task of producing text that feels natural to the reader. The goal of this diploma thesis is to study the extent to which natural language generation can be achieved using statistical models, specifically hidden Markov models. The thesis covers the probability theory and information theory needed to define hidden Markov models and describes how such models can be used to generate text. Available tools for working with hidden Markov models are reviewed, compared, and assessed for their suitability for generating text. A library for hidden Markov models is implemented in Elixir. Two of the reviewed tools and the implemented library are used to generate text from a corpus of written Slovenian. A criterion for comparing generated texts is chosen and used to compare the models with one another and to compare the generated texts with the corpus.

Related projects

http://github.com/mfilej/himamo
http://hexdocs.pm/himamo