In the previous sections we have discussed various types of quality assessment. One major disadvantage of quality assessment for MT evaluation purposes, however, is the fact that the overall performance of an MT system has to be judged on more aspects than translation quality alone. The most complete and direct way to determine whether MT performs well in a given set of circumstances is to carry out an operational evaluation on site, comparing the combined MT and post-editing costs with those associated with pure human translation. The requirement here is that the vendor allows the potential buyer to test the MT system in her particular translation environment. Because of the enormous investment that buying a system often represents, vendors should allow a certain test period. During an operational evaluation a record is kept of all the user's costs, the translation times and other relevant aspects. This evaluation technique is ideal in the sense that it gives the user direct information on how MT would fit into and change the existing translation environment, and whether it would be profitable.
Before starting the MT evaluation the user should have a clear picture of the costs involved in the current set-up with human translation. Once this information on the cost of the current translation service is available, the MT experiment can begin.
In an operational evaluation of MT, time plays an important role. Translators need to be paid, and the more time they spend on post-editing MT output and updating the system's dictionaries, the less profitable MT will be. To get a realistic idea of the time needed for such translator tasks, the translators need to receive proper training prior to the experiment. Also, the MT system needs to be tuned to the texts it is supposed to deal with.
During an evaluation period lasting several months it should be possible to fully cost the use of MT, and at the end of the period, comparison with the costs of human translation should indicate whether, in the particular circumstances, MT would be profitable in financial terms or not.
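As a rough illustration, the comparison can be expressed in terms of per-word costs. The symbols below are ours, introduced purely for exposition, and the machine-related costs would in practice have to be broken down further (licence fees, hardware, maintenance):
\[
c_{HT} = \frac{t_{HT}\,r}{w}, \qquad
c_{MT} = \frac{C_{sys} + (t_{pe} + t_{dict})\,r}{w}
\]
where $w$ is the number of words translated during the evaluation period, $r$ the translator's hourly rate, $t_{HT}$ the human translation time, $t_{pe}$ the post-editing time, $t_{dict}$ the time spent updating the system's dictionaries, and $C_{sys}$ the machine-related costs attributable to the period. On this view MT is profitable in financial terms when $c_{MT} < c_{HT}$, other things (in particular quality) being equal.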
One problem is that, although one can compare costs in this way, one does not necessarily hold quality constant. For example, it is sometimes suspected that post-edited MT translations tend to be of inferior quality to pure human translations, because there is some temptation to post-edit only up to the point where a correct (rather than good) translation is achieved. This would mean that the cost benefits of MT might have to be set against a fall in translation quality.
There are several ways to deal with this. One could, for example, use the quality measurement scales described above (Section ). In this case we would need a fine-grained scale like the one in the ALPAC Report, since the differences between post-edited MT and HT will be small. But what does this quality measurement mean in practice? Do we have to worry about slight differences in quality if, after all, an `acceptable' translation is produced? Perhaps a better solution would be to ask the customer for an acceptability judgment. If the customer notices a quality decrease which worries him, then clearly post-editing quality needs to be improved. In most cases, however, the experienced translator/post-editor is more critical of translation quality than the customer is.
In general it seems that an operational evaluation conducted by a user will be extremely expensive, requiring 12 person-months or more of translator time. An attractive approach is to integrate the evaluation process into the normal production process, the only difference being that records are kept of the number of input words, the turnaround time, and the time spent on post-editing. The cost of such an integrated operational evaluation is obviously lower. After all, if the system is really good, translation costs will have been reduced and will compensate for some of the costs of the evaluation method. (On the other hand, if the system is not an improvement for the company, the money spent on its evaluation will of course be lost.)
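To make this concrete with purely hypothetical figures (ours, not taken from any reported evaluation): suppose that during one month of integrated evaluation the records show $w = 100{,}000$ input words, $t_{pe} = 150$ hours of post-editing, $t_{dict} = 20$ hours of dictionary updating, machine-related costs of $C_{sys} = 1{,}500$ currency units and a translator rate of $r = 40$ units per hour, while pure human translation of the same material would have taken $t_{HT} = 400$ hours. Then $c_{MT} = (1{,}500 + 170 \times 40)/100{,}000 = 0.083$ units per word against $c_{HT} = 400 \times 40/100{,}000 = 0.16$ units per word, so in this imaginary case MT would pay off on cost grounds, provided the post-edited quality remained acceptable to the customer.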