Summary:
Word embeddings are a popular way of modelling relationships between words. Words are represented as low-dimensional vectors, such that the distances between the vectors reflect relationships between the words: words which are more similar to each other should be closer together in the embedding space.
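The "closer together" idea can be made concrete with a toy sketch. The vectors and words below are invented for illustration only (real embeddings are learned from data and typically have tens to hundreds of dimensions); the point is simply that a distance measure such as cosine similarity scores semantically related words higher.

```python
import numpy as np

# Toy 3-dimensional "embeddings", invented for illustration only.
embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.85, 0.75, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; higher means more similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(sim_cat_dog > sim_cat_car)  # the semantically closer pair scores higher
```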
This thesis explores several different aspects of word embeddings. First, we look at the problem of non-identifiability: word embeddings are generated by optimizing an objective function, but the optimal embedding set is not unique. This has consequences for how embeddings are evaluated, and for making comparisons between different word embedding methods. We explain why this is the case and propose some solutions for dealing with it.
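One standard way to see why the optimal embedding set cannot be unique is that many embedding objectives depend on the vectors only through pairwise distances or inner products, which are invariant under orthogonal transformations. The sketch below (a generic illustration, not the specific argument developed in the thesis) shows that rotating an embedding matrix leaves all pairwise distances unchanged, so the rotated set attains the same objective value.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 3))  # 5 "words", 3-dimensional embeddings

# A random orthogonal matrix Q, obtained via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
E_rot = E @ Q  # rotated embedding set

def pairwise_dists(X):
    """Matrix of Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# The rotation preserves every pairwise distance, so any objective
# built from these distances cannot distinguish E from E_rot.
print(np.allclose(pairwise_dists(E), pairwise_dists(E_rot)))  # True
```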
We then explore the potential for generating semi-supervised word embeddings, with the aim of capturing the relationships between words more accurately than standard unsupervised embedding methods do. We introduce three semi-supervised objective functions, derive algorithms for optimizing them, and apply them to simulated and real data.
Finally, we look at the generation of time-dependent word embeddings, in particular the development of statistical tests for assessing whether certain words have changed in meaning or usage over a given time period. We introduce a time-dependent word embedding model and use it to test for change over time. However, we find that we are unable to distinguish between the presence of time dependence and a misspecified embedding dimension.