Although the use of linguistic-knowledge based techniques tends to promote higher Intelligibility (and Accuracy) output, it is possible that the linguistic knowledge embedded in the system is defective or incomplete. Sometimes a certain grammar rule is too strict or too general to apply correctly in all circumstances; sometimes the rules that handle one phenomenon (e.g. modal verbs like may in The printer may fail) and the rules that handle another phenomenon (e.g. negation) fail to work correctly together when the two phenomena co-occur or interact in a sentence. (For example, imagine the problems that will result if The printer can not be cleaned (i.e. can be left uncleaned) and The printer cannot be cleaned (i.e. must not be cleaned) are confused.)
Keeping track of these sorts of constructional errors and deficits has become rather a severe problem for developers of MT systems and other large NLP systems. For example, while running the system on a corpus of test texts will reveal many problems, many potential areas of difficulty are hidden because the statistics are such that even quite large corpora will lack even a single example of particular grammatical combinations of linguistic phenomena.
Rather than churning through increasingly large 'natural' text corpora, developers have recently turned their attention to the use of suites of specially constructed test sentences. Each sentence in the suite contains either one linguistic construction of interest or a combination thereof. Thus part of an English test suite might look as follows.
This fragment just churns through all combinations of modal verbs like can, may together with optional not. In practice, one would expect test suites to run to very many thousands of sentences, because of the many different combinations of grammatical phenomena that can occur. Suites may include grammatically unacceptable sentences (e.g. * John not run) which the parser should recognize as incorrect. In systems which use the same linguistic knowledge for both analysing and synthesising text, the fact that an ill-formed sentence is rejected in analysis suggests that it is unlikely to be constructed in synthesis either.
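A fragment of this kind can be generated mechanically rather than written by hand. The following is a minimal sketch, assuming a single subject and verb (John, run) and a hypothetical function name build_fragment; it is not taken from any existing test suite tool. It simply crosses the modal verbs with optional not and marks the one ill-formed combination.

```python
from itertools import product

MODALS = ["", "can", "may"]   # "" stands for "no modal verb"
NEGATION = ["", "not"]        # "" stands for "no negation"


def build_fragment(subject="John", verb="run"):
    """Return (sentence, is_grammatical) pairs for every modal/negation combination."""
    suite = []
    for modal, neg in product(MODALS, NEGATION):
        if modal:
            # Modal present: "John can run." / "John can not run."
            words = [subject, modal, neg, verb]
            grammatical = True
        elif neg:
            # Negation without a modal or auxiliary: "* John not run."
            words = [subject, neg, verb]
            grammatical = False
        else:
            # Plain present tense: "John runs."
            words = [subject, verb + "s"]
            grammatical = True
        sentence = " ".join(w for w in words if w) + "."
        suite.append((sentence, grammatical))
    return suite


if __name__ == "__main__":
    for sentence, ok in build_fragment():
        print(("  " if ok else "* ") + sentence)
```

Adding a further phenomenon (tense, say, or passive) means adding one more list to the product, which is why realistic suites grow so quickly.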
Nobody knows for sure how test suites should be constructed and used in MT. A bi-directional system (a system that translates not only from German to English but also from English to German) will certainly need test suites for both languages. Thus success in correctly translating all the sentences in a German test suite into English and all the sentences in an English test suite into German would definitely be encouraging. However, standard test suites are rather blunt instruments for probing translation performance in the sense that they tend to ignore typical differences between the languages involved in translation.
We can look at an example. In English the perfect tense is expressed with the auxiliary verb have, as in He has phoned. In German, however, there are two auxiliary verbs for the perfect tense: haben and sein. Which one is used depends on the main verb of the sentence: most verbs require the first, some require the second. So an English and a German test suite designed to check the handling of the perfect tense will look different.
The German test suite thus tests the perfect tense both for verbs that take sein and for verbs that take haben, and therefore has to contain twice the number of sentences to test the same phenomenon. However, if He has phoned is correctly translated into German as Er hat angerufen, we still cannot be sure that all perfect tenses are translated correctly. For testing the English grammar alone, there is no reason to include a sentence like He has gone in the English test suite, since the perfect tense has already been tested. For translation into German, however, it would be interesting to see whether the auxiliary verb sein is selected by the main verb gehen, giving the correct translation Er ist gegangen.
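The point about auxiliary selection can be made concrete with a small sketch. It assumes a hypothetical translate function (any callable from an English sentence to a German string) and hypothetical names PROBES, selects_auxiliary and check_probes; the two sentence pairs are the ones discussed above. The check looks only at whether the expected auxiliary appears in the output, which is a deliberately crude test.

```python
# Each probe pairs an English source sentence with the German auxiliary
# that the main verb should select (illustrative data from the text).
PROBES = [
    {"source": "He has phoned.", "aux": "hat", "reference": "Er hat angerufen."},
    {"source": "He has gone.", "aux": "ist", "reference": "Er ist gegangen."},
]


def selects_auxiliary(translation, aux):
    """True if the expected auxiliary form occurs as a word in the translation."""
    tokens = translation.rstrip(".").lower().split()
    return aux in tokens


def check_probes(translate, probes=PROBES):
    """Run every probe through `translate`; return (source, output) for each failure."""
    failures = []
    for probe in probes:
        output = translate(probe["source"])
        if not selects_auxiliary(output, probe["aux"]):
            failures.append((probe["source"], output))
    return failures
```

A system that always chooses haben would pass the first probe and fail the second, which is exactly the error an English-only test suite would never expose.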
Given this sort of problem, it is clear that monolingual test suites should be supplemented with further sentences in each language designed to probe specific language pair differences. They could probably be constructed by studying data which has traditionally been presented in books on comparative grammar.
In a bi-directional system, we need test suites for both languages involved and test suites probing known translational problems between the two languages. Constructing test suites is a very complicated task, since they need to be complete with regard to the phenomena occurring in the present and future input texts of the MT user. Thus one should first check whether there are any existing test suites for the languages that need to be tested. (There are several monolingual test suites around.) Such a suite can be modified by adding material and removing restrictions that are irrelevant in the texts for which the system is intended (e.g. the texts to be translated might not contain any questions). As far as we know there are no readily available test suites for translational problems between two languages; to test for this, the evaluator will have to adapt existing monolingual ones.
Once the test suites have been devised they are run through the system and an inventory of errors is compiled. Clearly the test suite is an important tool in MT system development. How useful will it be for a user of MT systems?
It is of course possible for the user to run an MT system on a test suite of her own devising and, in some cases, this may be perfectly appropriate. It is especially useful to measure improvements in a system when the MT vendor provides a system update. However, the test suite approach does entail some drawbacks when used to assess system performance in comparison with competing systems. The problem is familiar by now: how are the evaluation results to be interpreted? Suppose System A and System B both produce acceptable translations for 40% of the test sentences and that they actually fail on different, or only partially overlapping, subsets of sentences. Which one is better? If System B (but not System A) fails on test sentences which embody phenomena with very low frequencies in the user's type of text material, then clearly System B is the better choice. But users typically do not have reliable information on the relative frequencies of various types of constructions in their material, and it is a complex task to retrieve such information by going through texts manually (automated tools to do the job are not yet widely available).
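The interpretation problem can be illustrated with a small calculation. The sketch below assumes that the user does have frequency counts for each phenomenon, which, as noted above, is usually not the case; the phenomena, counts and results are invented purely for illustration. Both systems pass 40% of the test sentences, but weighting each result by how often the phenomenon occurs in the user's texts separates them sharply.

```python
# Hypothetical occurrences of each phenomenon per 100 sentences of user text.
FREQUENCY = {"modal": 45, "negation": 40, "perfect": 5, "question": 5, "passive": 5}

# Hypothetical test suite results: True means an acceptable translation.
SYSTEM_A = {"modal": False, "negation": False, "perfect": True,
            "question": True, "passive": False}
SYSTEM_B = {"modal": True, "negation": True, "perfect": False,
            "question": False, "passive": False}


def raw_pass_rate(results):
    """Fraction of test items passed, ignoring how common each phenomenon is."""
    return sum(results.values()) / len(results)


def weighted_pass_rate(results, frequency):
    """Fraction of expected occurrences handled correctly, weighted by frequency."""
    total = sum(frequency[p] for p in results)
    passed = sum(frequency[p] for p, ok in results.items() if ok)
    return passed / total


if __name__ == "__main__":
    for name, results in (("A", SYSTEM_A), ("B", SYSTEM_B)):
        print(name, raw_pass_rate(results), weighted_pass_rate(results, FREQUENCY))
```

With these invented figures System B fails only on rare phenomena and so comes out far ahead, although the raw scores are identical. The ranking depends entirely on the frequency table, so it is only as reliable as the user's knowledge of her own texts.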
The same problem of interpretability holds when MT systems are evaluated by an independent agency using some sort of standard set of test suites. Published test suite information certainly gives a much better insight into expected performance than the vague promissory notes offered with current systems; but it doesn't immediately translate into information about likely performance in practice, or about cost effectiveness.
On top of this there is the problem of how to design a test suite, and the cost of actually constructing it. Research is ongoing to determine what sort of sentences should go into a test suite: which grammatical phenomena should be tested, to what extent co-occurrences of grammatical phenomena should be included, whether a test suite should contain sentences that test semantic phenomena, and how translation problems should be tested. These and additional problems might be solved in the future, resulting in proper guidelines for test suite construction.