Domain adaptation of statistical machine translation with domain-focused web crawling

In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and paral...

Bibliographic Details
Main Authors: Pecina, Pavel; Toral, Antonio; Papavassiliou, Vassilis; Prokopidis, Prokopis; Tamchyna, Aleš; Way, Andy; van Genabith, Josef
Format: Online
Language: English
Published: Springer Netherlands 2014
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479164/
id pubmed-4479164
recordtype oai_dc
spelling pubmed-4479164
2015-06-26
Domain adaptation of statistical machine translation with domain-focused web crawling
Pecina, Pavel
Toral, Antonio
Papavassiliou, Vassilis
Prokopidis, Prokopis
Tamchyna, Aleš
Way, Andy
van Genabith, Josef
Original Paper
In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English–French and English–Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU achieved is substantial at 15.30 points absolute.
Springer Netherlands
2014-12-03
2015
/pmc/articles/PMC4479164/
/pubmed/26120290
http://dx.doi.org/10.1007/s10579-014-9282-3
Text
en
© The Author(s) 2015
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
repository_type Open Access Journal
institution_category Foreign Institution
institution US National Center for Biotechnology Information
building NCBI PubMed
collection Online Access