As with intelligibility, some sort of scoring scheme for accuracy must
be devised. Whilst it might initially seem tempting to just have
simple `Accurate' and `Inaccurate' labels, this could be somewhat
unfair to an MT system which routinely produces translations which are
only slightly deviant in meaning. Such a system would be deemed
just as inaccurate as an automated `Monty Python' phrasebook which
turns the innocent request `Please line my pockets with chamois'
into the target language statement `My hovercraft is full of eels'.
Obviously enough, if
the output sentence is complete gobbledegook (deserving of the lowest
score for intelligibility) then it is impossible to assign a meaning,
and so the question of whether the translation means the same as the
original cannot really be answered. (Hence accuracy testing follows
intelligibility rating).
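To make this concrete, here is a minimal sketch, in Python, of what a graded scheme might look like. The seven-point scale, the example sentences and the scoring rule are all invented for illustration; no particular scale is prescribed here.

    # Sketch of a graded accuracy scheme (scale and examples invented).
    # Accuracy runs from 1 (wildly wrong) to 7 (same meaning as source);
    # None means the output was too unintelligible to be given a meaning.
    judgements = [
        # (output sentence, intelligibility 1-7, accuracy 1-7 or None)
        ("A slightly deviant rendering", 6, 6),
        ("My hovercraft is full of eels", 6, 1),
        ("Xgrbl fnord qua", 1, None),
    ]

    def binary_label(accuracy):
        """Collapse the graded score into the crude two-label scheme."""
        if accuracy is None:
            return "Unscorable"
        return "Accurate" if accuracy == 7 else "Inaccurate"

    for sentence, intelligibility, accuracy in judgements:
        # Under binary labels the slightly deviant system and the
        # `Monty Python' phrasebook both come out simply Inaccurate.
        print(f"{sentence!r}: graded={accuracy}, binary={binary_label(accuracy)}")

Note how the binary labels throw away exactly the distinction that matters: a score of 6 and a score of 1 both collapse into `Inaccurate'.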
The evaluation procedure is fairly similar to the one used for the scoring of intelligibility. However, the scorers obviously have to be able to refer to the source language text (or to a high quality translation of it, if they do not speak the source language), so that they can compare the meaning of input and output sentences.
As it happens, in the sort of evaluation considered here, accuracy scores are much less interesting than intelligibility scores. This is because accuracy scores are often closely related to intelligibility scores: high intelligibility normally means high accuracy. Most of the time, most systems do not exhibit surreal or `Monty Python' properties. For some purposes it might therefore be worth dispensing with accuracy scoring altogether and simply counting the cases where the output looks silly (leading one to suppose that something has gone wrong in translation).
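A sketch of that cheaper alternative, again in Python; the silliness test below is a stand-in for a human scorer's judgement, invented purely for illustration:

    def looks_silly(output_sentence):
        """Stand-in for a human scorer's judgement that the output
        looks silly, i.e. that something has presumably gone wrong."""
        return "hovercraft" in output_sentence  # invented test

    def silly_rate(output_sentences):
        """Fraction of output sentences flagged as silly."""
        flagged = sum(1 for s in output_sentences if looks_silly(s))
        return flagged / len(output_sentences)

    print(silly_rate(["The report covers wheat yields.",
                      "My hovercraft is full of eels."]))  # -> 0.5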
It should be apparent from the above that devising and assigning
quality scores for MT output --- what is sometimes called `Static' or
`Declarative Evaluation' --- is not
straightforward. Interpreting the resultant scores is also
problematic.
It is virtually impossible --- even for the evaluator --- to decide
what a set of intelligibility and accuracy
scores for a single MT
system might mean in terms of cost-effectiveness as a `gisting' device
or as a factor in producing high quality translation. To see this,
consider the sort of quality profile you might get as a result of
evaluation (see the figure below), which indicates that most
sentences received a score of 3 or 4, hence of middling
intelligibility. Does that mean that you can use the system to
successfully gist agricultural reports? One cannot say.
Figure: Typical Quality Profile for an MT System
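In computational terms a quality profile is just a histogram of per-sentence scores. The sketch below (Python) uses invented scores chosen to mimic the middling profile described above:

    from collections import Counter

    # Invented per-sentence intelligibility scores (1 = worst, 7 = best),
    # clustering around 3 and 4 as in the profile described in the text.
    scores = [3, 4, 3, 4, 4, 2, 3, 5, 4, 3, 4, 3, 5, 4, 3, 2, 4, 3, 4, 6]

    profile = Counter(scores)
    for score in range(1, 8):
        print(f"score {score}: {'*' * profile[score]}")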
Turning to the high-quality translation case, it is clear that substantial post-editing will be required. But it is not clear --- without further information about the relationship between measured quality and post-editing times --- what effect the system will have on overall translator productivity. Whilst it is presumably true that increasingly unintelligible sentences will tend to be increasingly difficult to post-edit, the relationship may not be linear. For example, it may be that sorting out minor problems (which don't affect intelligibility very much) is just as much of an editing problem as correcting mistranslations of words (which affect intelligibility a great deal). We could, for example, imagine two sentences in our sample text from Chapter 2 where the first is more intelligible than the second, yet takes more time to post-edit.
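A purely numerical sketch of that possible non-linearity (Python; every figure below is invented, and chosen so that fairly intelligible output with fiddly minor errors costs almost as much to fix as badly mistranslated output):

    # Hypothetical minutes of post-editing per sentence at each
    # intelligibility score (1 = worst, 7 = best). The values are
    # invented and deliberately non-linear: minor problems at score 5
    # cost nearly as much to sort out as mistranslations at score 2.
    POST_EDIT_MINUTES = {1: 12, 2: 10, 3: 9, 4: 8, 5: 7, 6: 2, 7: 0}

    def total_post_edit_time(sentence_scores):
        """Estimated total post-editing time for a batch of sentences."""
        return sum(POST_EDIT_MINUTES[s] for s in sentence_scores)

    print(total_post_edit_time([3, 4, 3, 5, 7, 2]))  # -> 43 minutes

Under such a table, a system whose output mostly scores 5 buys you very little over one whose output mostly scores 3, despite looking much better on an intelligibility profile.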
It is true that a comparative evaluation of a number of
different MT systems might demonstrate that one system is in all
respects better than the others. This information, however, does not tell
us whether buying the better MT system will improve the total
translation process --- the system could still be unprofitable. And
even if two particular systems have different performance profiles, it
may not always be clear whether one profile is likely to be better
matched to the task in hand than the other. For example, look at the
intelligibility ratings for systems A and B in
the figure below. For system A the majority of sentences are
neither very good nor very bad (rating 3 or 4). System B, by comparison,
tends to do either quite well (scores of 7 are common) or quite badly
(scores of 1 and 2 are frequent). Which system will be better in
practice? It is not possible to say.
Figure: Which Performance Curve is Better?
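The dilemma can be put numerically. The two score distributions below are invented to echo the figure: they have nearly the same mean, but system A clusters in the middle while system B is bimodal, and the mean alone cannot say which shape suits a given task:

    from statistics import mean

    # Invented distributions over a 1-7 intelligibility scale.
    system_a = [3] * 30 + [4] * 30 + [2] * 10 + [5] * 10  # middling throughout
    system_b = [7] * 25 + [1] * 30 + [2] * 15 + [5] * 10  # very good or very bad

    print(mean(system_a), mean(system_b))  # 3.5 vs 3.5625: nearly identical
    # Yet A never produces top-quality (score 7) output, while B produces
    # a great deal of it alongside a great deal of rubbish. Whether that
    # trade suits the task in hand is not something the scores decide.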