Useful discussion of evaluation methods can be found in [van Slype 1982] and [Lehrberger and Bourbeau 1987]. Practical discussion of many different aspects of MT evaluation can be found in [King and Falkedal 1990], [Guida and Mauri 1986], and [Balkan et al. 1991].
A special issue of the journal Machine Translation is dedicated to issues of evaluation of MT (and other NLP) systems. The introduction to the issue, [Arnold et al. in press b], gives an overview of the state of the issues involved, going into more detail about some issues glossed over here. Several of the articles which appear in this issue report practical experience of evaluation, and suggest techniques (for example, [Albisser in press, Flank et al. in press, Jordan in press, Neal et al. in press]).
The problems of focusing evaluation on the MT engine itself (i.e. apart from surrounding peripherals) are discussed in [Krauwer in press].
As things stand, evaluating an MT system (or other NLP system) involves a great deal of human activity, for example in checking output. A method for automating part of the evaluation process is described in [Shiwen in press].
Some of the issues involved in the construction of test suites are discussed in [Arnold et al. in press a] and [Nerbonne et al. in press].
In this chapter, we have generally taken the users' perspective. However, evaluation is also essential for system developers (who have to be able to gauge whether, and how much, their efforts are improving a system). How evaluation techniques can be applied so as to aid developers is discussed in [Minnis in press].
One of the best examples of MT evaluation in terms of rigour was that which formed the basis of the ALPAC report [Pierce and Carroll 1966], which we mentioned in an earlier chapter (it is normal to be rude about the conclusions of the ALPAC report, but this should not reflect on the evaluation on which the report was based: the evaluation itself was a model of care and rigour; it is the interpretation of the results for the potential of MT which was regrettable).
See [Nagao 1986, page 59] for more detailed scales and criteria for evaluating fidelity and ease of understanding.
As usual, Hutchins and Somers [Hutchins and Somers 1992] provide a useful discussion of evaluation issues (Chapter 9).