Inducing Semantic Representations from Text with Little or No Supervision

Inducing meaning representations from text is one of the key objectives of natural language processing. Most existing statistical semantic analyzers rely on large human-annotated datasets, which are expensive to create and exist only for a very limited number of languages. Even then, they are not very robust, cover only the restricted range of semantic constructions attested in the labeled data, and are domain-dependent. We investigate Bayesian models and tensor factorization approaches that use no labeled data but instead induce semantic representations from unannotated texts. Unlike semantically annotated data, unannotated texts are plentiful and available for many languages and domains, which makes our approach particularly promising. We show that these models induce linguistically plausible semantic representations, significantly outperform the current state of the art, and yield competitive results in applications such as question answering in the biomedical domain. We also explore several extensions of the model; in particular, we consider multilingual induction of semantics and show that multilingual parallel texts (i.e., sentences paired with their translations) provide an additional valuable source of supervision.