
WMT 2013 - ACL 2013 EIGHTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION

Date: 2013-08-08 - 2013-08-09

Deadline: 2013-05-31

Venue: Sofia, Bulgaria

Website: https://www.statmt.org/wmt13/quality-est...

Topics/Call for Papers

This shared task will examine automatic methods for estimating machine translation output quality at run-time. Quality estimation aims at providing a quality indicator for unseen translated sentences without relying on reference translations. In this second edition of the shared task, we will consider both word-level and sentence-level estimation.
Some interesting uses of sentence-level quality estimation are the following:
Decide whether a given translation is good enough for publishing as is
Inform readers who only know the target language whether or not they can rely on a translation
Filter out sentences that are not good enough for post-editing by professional translators
Select the best translation among options from multiple MT and/or translation memory systems
Some interesting uses of word-level quality estimation are the following:
Highlight words that need editing in post-editing tasks
Inform readers of portions of the sentence that are not reliable
Select the best segments among options from multiple translation systems for MT system combination
Last year, a first shared task was organised as part of WMT12 on sentence-level estimation. This task provided a set of baseline features, datasets, evaluation metrics, and oracle results. The task attracted an impressive number of participants. Building on last year's experience, this year's shared task will reuse some of these resources, but provide additional training and test sets, use different annotation schemes and propose a few variants of the task for word- and sentence-level quality estimation.
Goals
The main goals of the shared quality estimation task are:
To push current work on sentence-level quality estimation towards robust models that can work across MT systems;
To test work on sentence-level quality estimation for the task of selecting the best translation amongst multiple systems;
To evaluate the applicability of quality estimation for post-editing tasks;
To provide a first common ground for development and comparison of quality estimation systems at word-level.
Task 1: Sentence-level QE
Task 1.1 Scoring and ranking for post-editing effort
This task is similar to the one in WMT12, but with one important difference in the scoring variant: based on feedback received last year, instead of using the [1-5] scores for post-editing effort, we will use HTER as our quality score, i.e. the minimum edit distance between the machine translation and its manually post-edited version, in [0,1]. Two variants of the results can be submitted:
Scoring: A quality score for each sentence translation in [0,1], to be interpreted as an HTER score; lower scores mean better translations.
Ranking: A ranking of sentence translations for a number of source sentences, produced by the same MT system, from best to worst. For this variant, it does not matter how the ranking is produced (from HTER predictions, Likert predictions, or even without machine learning). The reference ranking will be defined based on the true HTER scores.
For the training of models, we provide the WMT12 dataset: 2,254 English-Spanish news sentences produced by a phrase-based SMT system (Moses) trained on Europarl and News Commentaries corpora as provided by WMT, along with their source sentences, reference translations, post-edited translations, and HTER scores. We used TERp (default settings: tokenised, case insensitive, etc., but capped to 1) to compute the HTER scores. Likert scores are also provided, if participants prefer to use them for the ranking variant.
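For illustration only (not part of the official resources): HTER is the number of edit operations divided by the length of the post-edited reference, capped at 1. A minimal Python sketch of that convention, assuming the raw edit count and reference length come from an external TER/TERp run; the numbers below are hypothetical.

    def hter(num_edits, postedit_length):
        """HTER: edit operations divided by the post-edited reference length, capped at 1."""
        if postedit_length == 0:
            return 0.0
        return min(num_edits / postedit_length, 1.0)

    # Hypothetical counts: 7 edits against a 12-token post-edited sentence.
    print(hter(7, 12))   # ~0.58, a fairly heavily edited translation
    print(hter(20, 12))  # capped at 1.0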
NOTE: Participants are free to use as training data other post-edited material as well ("open" submission). However, for submitting to Task 1.1, we require at least one submission per participant using only the official 2,254 training set ("restricted" submission).
As test data, we provide a new set of translations produced by the same MT system as those used for training. Evaluation will be performed against the HTER and/or ranking of those translations using the same metrics as in WMT12: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Spearman's rank correlation, and DeltaAvg.
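For illustration only (not the official evaluation scripts), a minimal Python sketch of the scoring metrics MAE, RMSE and Spearman's rank correlation over predicted versus true HTER scores; DeltaAvg is omitted here, and the arrays are hypothetical.

    import numpy as np
    from scipy.stats import spearmanr

    def mae(pred, gold):
        """Mean absolute error between predicted and true HTER scores."""
        pred, gold = np.asarray(pred, float), np.asarray(gold, float)
        return float(np.abs(pred - gold).mean())

    def rmse(pred, gold):
        """Root mean squared error."""
        pred, gold = np.asarray(pred, float), np.asarray(gold, float)
        return float(np.sqrt(((pred - gold) ** 2).mean()))

    # Hypothetical predicted and true HTER scores in [0, 1].
    pred = [0.10, 0.35, 0.60, 0.20]
    gold = [0.15, 0.30, 0.80, 0.10]
    rho, _ = spearmanr(pred, gold)
    print("MAE:", mae(pred, gold), "RMSE:", rmse(pred, gold), "Spearman:", rho)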
Task 1.2 System selection (new)
Participants will be required to rank up to five alternative translations for the same source sentence produced by multiple MT systems. We will use essentially the same data provided to participants of WMT's evaluation metrics task, in which MT evaluation metrics are assessed according to how well they correlate with human rankings. However, reference translations will not be allowed in this task. We provide:
Training data: A large set of up to five alternative machine translations produced by different MT systems for each source sentence and ranked for quality by humans. This is the outcome of the manual evaluation of the translation task from WMT09-WMT12. It includes two language pairs: German-English and English-Spanish, with 7,098 and 3,117 source sentences and up to five ranked translations, respectively.
Test data: A new set of up to 5 alternative machine translations per source sentence. Notice that there will be some overlap between the MT systems used in the training data and test data, but not all systems will be the same.
Evaluation for each language pair will be performed against human ranking of pairs of alternative translations, using as metric the overall Kendall's tau correlation (i.e. weighted average).
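For illustration only, a minimal Python sketch of a pairwise Kendall's tau for one source sentence: every pair of alternative translations with a human preference counts as concordant or discordant depending on whether the submitted ranking agrees. The exact handling of ties and the weighting across sentences and language pairs follow the official evaluation, not this sketch; all rankings below are hypothetical.

    from itertools import combinations

    def pairwise_kendall_tau(pred_ranks, human_ranks):
        """Kendall's tau over pairs of alternative translations of one source sentence.
        Ranks map a translation id to its rank (lower = better). Pairs tied in the
        human ranking are skipped; ties in the predicted ranking count as discordant here."""
        concordant = discordant = 0
        for a, b in combinations(human_ranks, 2):
            if human_ranks[a] == human_ranks[b]:
                continue  # no human preference for this pair
            human_prefers_a = human_ranks[a] < human_ranks[b]
            pred_prefers_a = pred_ranks[a] < pred_ranks[b]
            if pred_ranks[a] == pred_ranks[b] or human_prefers_a != pred_prefers_a:
                discordant += 1
            else:
                concordant += 1
        total = concordant + discordant
        return (concordant - discordant) / total if total else 0.0

    # Hypothetical rankings of three alternative translations (1 = best).
    human = {"sysA": 1, "sysB": 2, "sysC": 3}
    pred = {"sysA": 1, "sysB": 3, "sysC": 2}
    print(pairwise_kendall_tau(pred, human))  # (2 - 1) / 3 = 0.33...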
Task 1.3 Predicting Post-Editing Time (new)
Participating systems will be required to produce for each sentence:
Expected post-editing time: a real valued estimate of the time (in seconds) it takes a translator to post-edit the MT output.
For training we provide a new dataset: 800 English-Spanish news sentences produced by a phrase-based SMT system (Moses), along with their source sentences, post-edited translations, and the time (in seconds) spent post-editing each segment. The data was collected using five translators (with few overlapping annotations). For each segment we provide an ID that specifies the translator who post-edited it (for those interested in training translator-specific models).
As test data, we provide additional source sentences and translations produced with the same SMT system, and IDs of the translators who will post-edit each of these translations (same post-editors as in the training data).
Submissions will be evaluated in terms of Mean Absolute Error (MAE) against the time spent by the same translators post-editing these sentences.
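Since both the training and test data carry translator IDs, it can be useful to break the error down by post-editor as well as overall. For illustration only, a minimal Python sketch of MAE in seconds, overall and per translator; all values below are hypothetical.

    from collections import defaultdict

    def mae_by_translator(pred_times, true_times, translator_ids):
        """Mean absolute error in seconds, overall and per post-editor."""
        per_editor = defaultdict(list)
        for p, t, tid in zip(pred_times, true_times, translator_ids):
            per_editor[tid].append(abs(p - t))
        overall = sum(abs(p - t) for p, t in zip(pred_times, true_times)) / len(pred_times)
        return overall, {tid: sum(errs) / len(errs) for tid, errs in per_editor.items()}

    # Hypothetical predicted and observed post-editing times for two translators.
    overall, per_editor = mae_by_translator(
        [40.0, 55.0, 90.0, 30.0], [35.0, 70.0, 80.0, 25.0], ["t1", "t1", "t2", "t2"])
    print(overall, per_editor)  # 8.75 {'t1': 10.0, 't2': 7.5}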
For all of Tasks 1.1-1.3, we also provide a system and resources to extract QE features (language model, Giza++ tables, etc.), when these are available. We also provide the machine learning algorithm that will be used as baseline: SVM regression with an RBF kernel, as well as the grid search algorithm for the optimisation of relevant parameters. The same 17 features used in WMT12 will be considered for the baseline systems.
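The official baseline comes with its own feature extraction and learning scripts. Purely as an illustration of the learning setup described above (epsilon-SVR with an RBF kernel plus a grid search over its hyperparameters), here is a minimal scikit-learn sketch; the 17 baseline features are assumed to be pre-extracted into a matrix, and the grid values are hypothetical rather than the official ones.

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    # Hypothetical data: one row of 17 baseline features per sentence,
    # one HTER score per sentence in [0, 1].
    rng = np.random.RandomState(0)
    X_train, y_train = rng.rand(200, 17), rng.rand(200)

    # Epsilon-SVR with an RBF kernel; grid-search C, gamma and epsilon.
    grid = GridSearchCV(
        SVR(kernel="rbf"),
        param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0], "epsilon": [0.05, 0.1, 0.2]},
        scoring="neg_mean_absolute_error",
        cv=5,
    )
    grid.fit(X_train, y_train)

    X_test = rng.rand(50, 17)
    predictions = np.clip(grid.predict(X_test), 0.0, 1.0)  # keep predicted HTER in [0, 1]
    print(grid.best_params_, predictions[:3])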
Task 2: Word-level QE (new)
The data for this task is based on the same resources and data as in Task 1.3, but with word-level labels. Participating systems will be required to produce for each token a label in one of the following settings:
Binary classification: a good/bad label, where bad indicates the need for editing the token.
Multi-class classification: a label specifying the edit action needed for the token (keep as is, delete, or substitute).
As training data, we provide tokenized MT-output with 20,362 tokens, where each token is annotated with multiclass (good/delete/substitute) labels. The annotation is derived automatically by computing TER (with some tweaks) between the original machine translation and its post-edited version. For the binary variant, labels will be grouped in two: good (keep) versus all others (delete or substitute).
As test data, we provide a tokenized version of the test data used in Task 1.3.
Submissions will be evaluated in terms of classification performance (precision, recall, F1) against the original labels in the two variants (binary and multi-class).
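For illustration only, a minimal Python sketch of how multi-class token labels can be collapsed to the binary variant and scored with precision, recall and F1; the label strings and token sequences below are hypothetical, not the official annotation format.

    from sklearn.metrics import precision_recall_fscore_support

    def to_binary(labels):
        """Collapse multi-class edit labels: 'keep' -> good, anything else -> bad."""
        return ["good" if lab == "keep" else "bad" for lab in labels]

    # Hypothetical gold and predicted token labels for the multi-class variant.
    gold = ["keep", "keep", "substitute", "delete", "keep", "substitute"]
    pred = ["keep", "substitute", "substitute", "keep", "keep", "delete"]

    # Multi-class scores (one value per label), then the binary variant.
    for true, guess, labels in [(gold, pred, ["keep", "delete", "substitute"]),
                                (to_binary(gold), to_binary(pred), ["good", "bad"])]:
        p, r, f1, _ = precision_recall_fscore_support(true, guess, labels=labels, zero_division=0)
        print(dict(zip(labels, zip(p, r, f1))))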
Download
Data, resources and baseline systems
Submission Format
TBA
Submission Requirements
We require that each participating team submits at most two entries for each of the six variants of the task. These should be sent via email to Lucia Specia lspecia-AT-gmail.com. Please use the following pattern to name your files:
INSTITUTION-NAME_TASK-NAME_METHOD-NAME, where:
INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF
TASK-NAME is one of the following 6: 1-1-scoring, 1-1-ranking, 1-2, 1-3, 2-binary, 2-multiclass.
METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2-multiclass_J48, 2-multiclass_SVM
For instance, a submission from team SHEF for task 2-multiclass using method "SVM" could be named SHEF_2-multiclass_SVM.
IMPORTANT DATES
Release of training sets + baseline systems: March 6, 2013
Release of test sets: May 15, 2013
Submission deadline for all QE subtasks: May 31, 2013
Paper submission deadline: June 7, 2013
ORGANIZERS
Christian Buck (University of Edinburgh)
Radu Soricut (Google)
Lucia Specia (University of Sheffield)
Other Requirements
You are invited to submit a short paper (4 to 6 pages) describing your QE method(s). You are not required to submit a paper; if you choose not to, we ask you to point us to an appropriate reference describing your method(s) that we can cite in the WMT overview paper.
We encourage individuals who are submitting research papers to also submit entries to the shared task using the training resources provided by this workshop (in addition to potential entries that may use other training resources), so that their experiments can be repeated by others using these publicly available resources.
CONTACT
For questions, comments, etc. please send email to Lucia Specia lspecia-AT-gmail.com.
