As with intelligibility, some sort of scoring scheme for accuracy must be devised. Whilst it might initially seem tempting to just have simple `Accurate' and `Inaccurate' labels, this could be somewhat unfair to an MT system which routinely produces translations which are only slightly deviant in meaning. Such a system would be deemed just as inaccurate as an automated `Monty Python' phrasebook which turns the innocent request `Please line my pockets with chamois' into the target language statement `My hovercraft is full of eels'. Obviously enough, if the output sentence is complete gobbledegook (deserving of the lowest score for intelligibility) then it is impossible to assign a meaning, and so the question of whether the translation means the same as the original cannot really be answered. (Hence accuracy testing follows intelligibility rating.)
The evaluation procedure is fairly similar to the one used for the scoring of intelligibility. However, the scorers obviously have to be able to refer to the source language text (or a high quality translation of it, in case they cannot speak the source language), so that they can compare the meaning of input and output sentences.
As it happens, in the sort of evaluation considered here, accuracy scores are much less interesting than intelligibility scores. This is because accuracy scores are often closely related to the intelligibility scores; high intelligibility normally means high accuracy. Most of the time most systems don't exhibit surreal or Monty Python properties. For some purposes it might be worth dispensing with accuracy scoring altogether and simply counting cases where the output looks silly (leading one to suppose something has gone wrong).
It should be apparent from the above that devising and assigning quality scores for MT output (what is sometimes called `Static' or `Declarative Evaluation') is not straightforward. Interpreting the resultant scores is also problematic.
It is virtually impossible, even for the evaluator, to decide what a set of intelligibility and accuracy scores for a single MT system might mean in terms of cost-effectiveness as a `gisting' device or as a factor in producing high quality translation. To see this, consider the sort of quality profile you might get as a result of evaluation (see the figure `Typical Quality Profile for an MT System'), which indicates that most sentences received a score of 3 or 4, hence of middling intelligibility. Does that mean that you can use the system to successfully gist agricultural reports? One cannot say.
Figure: Typical Quality Profile for an MT System
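A quality profile of this kind is simply the share of sentences that received each score. The following is a minimal sketch of how one might be tabulated, not a prescribed procedure: the function name `quality_profile`, the scale bounds of 1 to 7 and the ten sample scores are all assumptions made for illustration.

```python
from collections import Counter


def quality_profile(scores, lowest=1, highest=7):
    """Return {score: fraction of sentences} for every point on the scale.

    The default bounds (1 to 7) are an assumption; use whatever scale
    the scorers were actually given.
    """
    if not scores:
        raise ValueError("no scores to profile")
    counts = Counter(scores)
    total = len(scores)
    return {s: counts.get(s, 0) / total for s in range(lowest, highest + 1)}


# Invented intelligibility scores for ten sentences (illustration only).
sample_scores = [3, 4, 3, 4, 4, 2, 3, 5, 4, 3]
profile = quality_profile(sample_scores)

# Print a crude text histogram, one row per score.
for score, share in profile.items():
    print(f"{score}: {'#' * round(share * 20)} {share:.0%}")
```

The profile records how the scores are distributed and nothing more. It says nothing about whether sentences scoring 3 or 4 are good enough for a given gisting task, which is exactly the difficulty described above.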
Turning to the high-quality translation case, it is clear that substantial post-editing will be required. But it is not clear, without further information about the relationship between measured quality and post-editing times, what effect on overall translator productivity the system will have. Whilst it is presumably true that increasingly unintelligible sentences will tend to be increasingly difficult to post-edit, the relationship may not be linear. For example, it may be that sorting out minor problems (which don't affect intelligibility very much) is just as much of an editing problem as correcting mistranslations of words (which affect intelligibility a great deal). We could for example imagine the following two sentences to be part of our sample text in Chapter 2. The first one is more intelligible than the second, yet more time will be needed to fix the errors in it:
It is true that a comparative evaluation of a number of different MT systems might demonstrate that one system is in all respects better than the others. This information, however, does not tell us whether buying the better MT system will improve the total translation process: the system could still be unprofitable. And even if two particular systems have different performance profiles, it may not always be clear whether one profile is likely to be better matched to the task in hand than the other. For example, look at the intelligibility ratings for systems A and B in the figure `Which Performance Curve is Better?'. For system A the majority of sentences are neither very good nor bad (rating 3 or 4). System B, by comparison, tends to do either quite well (scores of 7 are common) or quite badly (scores of 1 and 2 are frequent). Which system will be better in practice? It is not possible to say.
Figure: Which Performance Curve is Better?
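The difficulty of choosing between two such curves can be illustrated with a small sketch. The ratings below are invented, and the helper `share_at_least` and the threshold of 6 for a `good' sentence are assumptions rather than part of any evaluation procedure described here. The two hypothetical systems have the same mean rating, yet they would behave very differently in use.

```python
from statistics import mean, pstdev

# Invented intelligibility ratings for ten sentences per system.
system_a = [3, 4, 3, 4, 4, 3, 4, 3, 4, 3]  # consistently middling
system_b = [1, 7, 2, 7, 1, 7, 2, 1, 6, 1]  # either very good or very bad


def share_at_least(scores, threshold):
    """Fraction of sentences rated at or above the given threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


for name, scores in (("A", system_a), ("B", system_b)):
    print(
        f"System {name}: mean={mean(scores):.1f}, "
        f"spread={pstdev(scores):.2f}, "
        f"rated 6 or more={share_at_least(scores, 6):.0%}"
    )
```

A single summary figure such as the mean hides the contrast entirely. Whether B's larger share of good sentences outweighs its larger share of very poor ones depends on the task and on post-editing costs, neither of which the scores themselves reveal.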