The method is as follows. First, draw up a list of all the
types of error you think the MT system might make. During the
evaluation, every error in the translated text is counted. Because
you consider some errors more serious than others, the count for each
type of error is multiplied by a weighting factor which you
assign to it. The score for an individual sentence, or for the whole
text, is then the sum of all the weighted errors. So, if we take the
raw translation we were using in the scenario in
Chapter as an example, error analysis might work
as follows.
For this example, three sorts of error are identified: errors involving the selection of a vs one as the translation of German ein, errors in number agreement (e.g. *a computers), and errors in the selection of prepositions. Using a short code for each error type, each occurrence of an error is marked up in the raw output. The resulting marked text is given below.
To reflect the seriousness of the errors, weights in the range 0 to 1 are assigned to the three error types. The weight for an error in preposition selection is higher than that for incorrect number because the person responsible considers incorrect number relatively less serious. This is summarized in the following table.
On the basis of this, the total error score can be calculated. There
are two errors in NUMber agreement, two involving PREPositions, and
one involving A/ONE selection, so the score is:
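As a minimal sketch of this kind of calculation in Python, the weighted sum can be computed from a list of marked error codes. The weights used here (0.2, 0.6 and 0.4) are purely hypothetical, chosen only so that PREP outweighs NUM as described above. The code names, the `error_score` function and the order of the marked codes are likewise assumptions made for illustration. Only the counts (two NUM, two PREP, one A/ONE) come from the example.

```python
from collections import Counter

# Hypothetical weights in the range 0 to 1 (illustrative values only).
# The one constraint taken from the text is that PREP outweighs NUM.
WEIGHTS = {"A_ONE": 0.4, "NUM": 0.2, "PREP": 0.6}


def error_score(marked_errors, weights):
    """Return the sum over error types of (occurrences * weight)."""
    counts = Counter(marked_errors)
    # Refuse to score an error code that has no weight assigned to it.
    unknown = set(counts) - set(weights)
    if unknown:
        raise ValueError(f"no weight assigned for: {sorted(unknown)}")
    return sum(weights[code] * n for code, n in counts.items())


# Codes as a scorer might mark them in the raw output (order is assumed);
# the counts match the example: two NUM, two PREP, one A/ONE.
marked = ["NUM", "PREP", "A_ONE", "NUM", "PREP"]
score = error_score(marked, WEIGHTS)
print(f"total weighted error score: {score:.1f}")
```

Keeping the weights in a separate table from the counting step means that a different weighting scheme can be tried on the same marked-up text without re-annotating it.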
Although this method gives more direct information on the usefulness of an MT system, there are immediate problems with using detailed error analysis. The first is practical: it will usually require considerable time and effort to train scorers to identify instances of particular errors, and they will also need to spend more time analysing each output sentence. Second, is there any good basis for choosing a particular weighting scheme? Not obviously. In some cases the weighting is related to the consequences an error has for post-editing: how much time it will take to correct that particular mistake. In other cases it merely reflects how badly an error affects the intelligibility of the sentence. Consequently, the result will indicate either the size of the post-editing task or the intelligibility of the text, and with it the text's relative usefulness. In both cases devising a weighting scheme will be a difficult task.
There is, however, a third problem, and perhaps this is the most serious one: for some MT systems, many output sentences are so corrupted in comparison with natural language that detailed analysis of errors is not meaningful. Moreover, error types are not independent of each other: for example, failure to supply any number inflection for a main verb will often mean that the subject and verb do not agree in number as required. It will be difficult to specify where one error starts and another ends, and thus there is the risk of ending up with a general error scale of the form one, two, ... lots. The assignment of a weighting to such complex errors is thus a tricky business.