ESL/EFL Essay Tests
Scoring criteria & Rating scale formats
The criteria used to judge the essay examination operationally define which content features and text structure constitute a “good” or at least a “competent” response. To be credible, criteria should not reflect the preferences of only a few individuals, but should represent standards endorsed by a community of professionals knowledgeable about the subject matter.
Secondly, the criteria should refer to those features of content and written expression which are amenable to instructional intervention. We cannot test what we do not teach in the classroom. For example, dimensions of “depth”, “flavour” and “creativity” may enhance the quality of an essay, but a growing number of educators contend that it is neither logical nor fair to hold learners accountable for subject matter or writing expertise that the schools cannot demonstrate they can teach.
The criteria used to evaluate learners’ content and written expression vary along a number of dimensions. The variation may be as follows:
From qualitative value judgements to quantitative counts of information and text features;
From global reactions to analytical judgements;
From comprehensive attention to a range of concepts and text features to an isolated focus on a particular piece of information or text feature;
From vague guidelines to replicable precise definitions.
Generally, readers’ reactions to learners’ essays involve three levels of judgement.
1) Subjective, global impressions of overall quality
2) Analytic judgements about component text features
3) A holistic quality judgement combining subjective impressions with judgements about the quality of the combinations of text elements.
i. Global judgement
In general impression scoring, a rater reads an essay once and assigns it a quality score. General impression ratings are global, heavily qualitative and are based upon vague guidelines that may not refer to component text features or their differential weighting or importance.
ii. Analytic judgement
The most quantitative, detailed and replicable scales are analytic rating scales, where readers assign several scores for various features of the essay. Analytic scales vary considerably in the range of content, rhetorical, structural and syntactic elements referenced, in the relative weights of these elements, and hence in the importance they give to different features of the written assignment.
S Mohanraj (1981) discusses the analytical rating scales of Caroll (1961), Alan & Campbell (1965), Cooper (1972), Davies (1977) and Pilliner. He has prepared a model of his own which includes twelve features of writing, and has further simplified it to arrive at a model suited to our situation, where teachers cannot spend much time correcting compositions. This model is quite practicable and easy to use.
A similar model is suggested by Rita M. Deyoe (1980). Her model gives more importance to grammatical aspects, whereas Mohanraj's model concentrates on stylistic and discoursal features.
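To illustrate how an analytic scale produces a score, consider a purely hypothetical weighting; the features and weights below are illustrative only and are not taken from Mohanraj's or Deyoe's actual schemes. Suppose a scheme rates five features out of 5 and weights them as follows: content 30%, organisation 25%, vocabulary 20%, grammar 15%, mechanics 10%. An essay rated 4, 3, 4, 3 and 5 on these features would receive

Total = (0.30 × 4) + (0.25 × 3) + (0.20 × 4) + (0.15 × 3) + (0.10 × 5) = 1.20 + 0.75 + 0.80 + 0.45 + 0.50 = 3.70 out of 5.

Changing the weights changes the total, which is why analytic schemes that reference the same features can still rank the same set of essays differently.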
iii. Holistic judgement
Holistic scales, where readers assign a single score, often combine characteristics of both general impression and analytic approaches. Holistic schemes vary widely in the range of text elements contributing to each score point and the specificity with which score levels are defined (Ingenkamp 1977, Quellmalz 1980).
Since the focus, specificity and objectivity of the criteria informing impressionistic, holistic or analytic approaches vary considerably, an examination programme should weigh carefully the nature of the criteria selected and their underlying rationale. Otherwise the programme may find that the criteria do not match the aims of the assessment and instructional programmes and do not provide a useful status report or diagnostic feedback. The need for explicit criteria is also apparent in scoring subject matter essay examinations. Learners commonly complain about the ambiguous, subjective criteria used for subject matter essay examinations in classroom assessment. When the results of large-scale achievement examinations have serious consequences for learners, explicit, public and rational scoring keys are imperative.
d. Rating Procedures.
When a large number of papers must be scored by a pool of readers, an assessment programme must ensure that evaluation criteria are uniformly interpreted and applied. Such standardization involves both the formulation of explicit criteria and procedures for training raters. In the US rater training follows a fairly standard procedure. The following steps are employed to train raters.
- There is a brief introduction to the rating scale.
- The raters then practise applying the criteria to a set of papers representing the test sample.
- A trainer leads a discussion of the features of each paper that result in its classification at a particular grade.
Training time varies according to the number of separate scores recorded for each paper and according to the clarity of the criteria. The rigour of the procedures used to decide whether acceptable rater agreement levels have been attained at the end of training varies from a show of hands to pilot tests requiring independent scoring of essays.
In India, though essay examinations are widely used, there is no programme to train raters. Failure to conduct any structured training or to check on prior agreement levels may increase the risk of unreliable scoring.
The reliability of an examination programme depends on the degree to which it eliminates measurement error. Four potential sources of error or score fluctuation identified for examinations of writing ability (but applying as well to tests of subject matter skills) are as follows:
- The writer: within-subject individual differences.
- The assignment: variations in item or task content.
- Between-rater fluctuations.
- Within-rater instability.
Writer (within-subject) error can be reduced if learners are asked to write a series of essays instead of a single essay; the reliability of learners' performance can then be estimated by gathering data on a pool of homogeneous items or assignments. Since writing an essay requires at least twenty or thirty minutes, it is often difficult to have learners write many essays in an examination. Yet studies of the consistency of learners' performance across a series of essays often report low reliabilities for a single essay. According to Spencer (1979), analysis of the stability of learners' writing performance across several essays is also problematic because of the variability introduced by differences in topics.
Some ways of overcoming the problem of reliability are as follows:
- Essay tasks should be based on specific skills of writing. This would reduce error variance due to the assignment.
- Essays should be collected on at least two parallel assignments. This would reduce error associated with individual variability.
- Scores on several essays should be combined to increase the reliability of subject matter essay examinations, as illustrated below.
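The gain from combining several essays can be approximated with the standard Spearman-Brown prophecy formula, a routine psychometric result rather than one drawn from the sources discussed here:

r_k = (k × r) / (1 + (k − 1) × r)

where r is the reliability of a single essay and k is the number of parallel essays whose scores are combined. For example, if a single essay has a reliability of 0.45, combining four parallel essays would be expected to raise the reliability of the composite to (4 × 0.45) / (1 + 3 × 0.45) = 1.80 / 2.35 ≈ 0.77, which illustrates why several shorter tasks are usually preferable to a single long essay.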
Inter-rater agreement is the most prevalent issue concerning reliability in essay examinations. Statistical indices of agreement levels include coefficient alpha, generalisability coefficients, point-biserial correlations and simple percentages of agreement. The most effective method of reducing inter-rater variability is to provide training on clearly specified criteria. To reduce error due to within-rater score fluctuations over time (rater drift) caused by reader fatigue and/or carelessness, some form of interspersed check procedure seems helpful, according to Quellmalz (1980). Although some studies report that readers tend to become more lenient or harsher as rating progresses, few assessment programmes routinely monitor this problem.
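The simplest of these indices, the percentage of agreement, can be computed directly. As a purely hypothetical illustration: if two raters independently grade the same 50 scripts and assign the same grade to 38 of them, their exact agreement is 38 / 50 = 76 per cent. Agreement is sometimes relaxed to "within one grade band", which raises the figure; whichever definition is used should be reported alongside the result.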
Mike Hayhoe (1983), in his article 'A Historical Review of Essay Marking', discusses the problem of reliability in marking essays. According to him, this problem has persisted throughout the history of essay marking: if Rowntree was concerned about marker reliability in the 1880s, Raleigh (1980) is equally worried about the same problem. Hayhoe suggests that an error of twenty-five per cent in grading an essay may be a conservative estimate, and it has been suggested that the problem of unreliability in marking essays exists in internal as well as external assessment.
Reliability is inextricably linked with validity. The reliability of an essay examination depends on how valid the examination is and how valid the markers are in their assessment. A brief consideration of the problems faced by examiners in designing valid examinations is necessary if one wants to integrate testing and instruction.
The validity of an examination derives from evidence that the test accurately and dependably measures the specified skills. Evidence for the validity of an examination may take several forms.
i. One form focuses on the test content, that is, the test items or essay assignments, and gathers the judgements of subject matter experts regarding such things as:
- The objectives or skills defined to be important and representative of subject matter competencies, and
- The way these skills are elicited in the items, problems or writing assignments.
ii. Other forms of validity focus on test performance and examine the following:
- Concurrent validity: whether the scores are comparable to scores on other tests of the same skills;
- Predictive validity: whether the score levels predict future success; and
- Construct validity: whether the performance pattern appears to measure the underlying trait.
The most common methods of attempting to establish the validity of essay examinations have been comparisons of scores to 'related' measures. In the case of tests of writing ability, the 'other' measures chosen as criterion variables are often reading tests, multiple-choice tests or class grades.
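Such comparisons are usually reported as correlation coefficients. As a hypothetical illustration, with figures invented purely for explanation: if learners' essay scores correlate at r = 0.60 with their scores on a multiple-choice test of related skills, this would be read as moderate concurrent evidence; a correlation near zero would suggest that the two instruments measure rather different things, while a very high correlation would raise the question of whether the cheaper test could replace the essay altogether.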
The heart of the validity of a test is whether it measures the underlying skill construct, that is, whether it taps the hypothetical mental store of information and strategies. According to Raleigh (1980), the validity of an examination can be described in terms of the degree to which it 'measures well' what it is intended to measure. According to Mike Hayhoe (1983), it is also possible to speak of a marker's validity, that is, the degree to which the marker 'measures well' what the assessment system sets out to measure.
g. Factors affecting marking
The marks awarded to an essay depend on a number of things. For example, Thorndike (1986) discusses the problem of 'uniqueness'. Uniqueness raises the issue of divergence (the individuality of the work) and convergence (notions of correctness and orderliness); how far a marker is affected by divergence and convergence will decide the marks he gives to a particular assignment. Wiseman and Wrigley (1958) identified two schools of thought as far as assessors' value bases are concerned. One school values the 'imponderables' of validity, freshness and fluency; the second sees the writer as 'a craftsman able to show his skill whatever type of materials he works in'.
Britton (1963) found some evidence to suggest that teachers may well gravitate towards valuing one end or the other of the following pairs of poles:
Sophisticated, conventional written work versus work based on familiar speech;
Work based on imagination, including fantasy and the unreal, versus work based on observation of real life.
A number of studies conducted in America suggest that teachers tend to cluster in favouring certain criteria (ideas, form, flavour, mechanics, wording) and that the cluster of criteria adopted by a teacher can affect grading.
Deale (1975) feels that the 'adequacy' of the writing rather than the ideas affects the marks awarded. Soloff (1973) argues that a lack of consonance between the writer's values and those of the assessor on a topic may affect the grade awarded. The London Association for the Teaching of English shares this opinion: in its pamphlet Assessing Compositions (1965), it expresses concern about how an assessor may react to experiences and attitudes in an essay which are unfamiliar to him, and about the potential for under- or over-assessing the work.
Marshall (1960) suggests that assessment in terms of the features of pieces of work which 'float' to the examiner (his intuitions about the texts) is the proper activity of an alert and sensitive marker.
Markers can be affected by visual features at the expense of such aspects as organisation, fluency and appropriateness in terms of task, audience and so on. According to Mike Hayhoe (1983), this may be because the visual features are more immediately obvious, especially when they are flawed, and because there is a greater degree of consensus about them than there is about what 'coherence', 'clarity' or other more global criteria may be.
Marshall (1967) and Scannel (1966) found assessors particularly adversely affected by spelling errors, with errors of grammar and punctuation coming next. Handwriting also has a great impact on assessors, and many researchers, such as Chase (1968), Briggs (1970) and Soloff (1973), have demonstrated the power of this feature in affecting marking. In his more recent work, Briggs (1980) goes further, suggesting that there may be borderline areas in grading in which this visual aspect of a piece of writing may be the major factor in deciding what it is worth.
Yates and Pidgeon (1957) found that the setting of an essay affected the markers' response: if an 'average' piece of work followed several fine pieces, it was likely to be marked harshly; if it followed several poor ones, it was likely to be upgraded.
The analysis of the present situation in Gujarat also reveals that teachers are most concerned with spelling errors and punctuation; grammatical errors come next. Though all the teachers marked a number of features in the questionnaire (appropriacy, organisation, overall writing ability, etc.) as very important, all of them assign a single grade on the basis of their overall impression of the composition.
i) Drawbacks of Essay Examinations
Essay examinations are said to test learners’ ability to engage in disciplined thought and the ability to express it in a coherent, supported discourse. But a number of points need to be taken into account if essay examinations are used to measure writing ability.
Some of the problems involved in using essay type tests are as follows:
It is difficult for an average teacher to structure essay prompts that clearly specify the aim, topic, audience, writer's role and evaluation criteria. The problems of reliability and validity, and the factors that affect marking, discussed in this section show that it is very difficult to measure writing ability through essay examinations.
Teachers cannot spend much time checking essays with analytic or holistic rating scales, and the general impression score they usually assign instead is not a reliable method of scoring.
The training of raters is expensive and time-consuming, and is not practicable as far as schoolteachers are concerned.
Since it is not easy to structure, administer and score essay examinations, we need to consider other types of tests which are easy to construct and evaluate, and which give a reliable and valid indication of learners' ability to communicate through writing.