What Does Language Testing Have to Offer?
University of California, Los Angeles
Advances in language testing in the past decade have occurred in three areas: (a) the
development of a theoretical view that considers language ability to be multicomponential and
recognizes the influence of the test method and test taker characteristics on test performance,
(b) applications of more sophisticated measurement and statistical tools, and (c) the development
of “communicative” language tests that incorporate principles of “communicative” language
teaching. After reviewing these advances, this paper describes an interactional model of
language test performance that includes two components, language ability and test method.
Language ability consists of language knowledge and metacognitive strategies, whereas test method
includes characteristics of the environment, rubric, input, expected response, and relationship
between input and expected response. Two aspects of authenticity are derived from this model. The
situational authenticity of a given test task depends on the relationship between its test method
characteristics and the features of a specific language use situation, while its interactional
authenticity pertains to the degree to which it invokes the test taker’s language ability. The
application of this definition of authenticity to test development is discussed.
Since 1989, four papers reviewing the state of the art in the field of language testing have
appeared (Alderson, 1991; Bachman, 1990a; Skehan, 1988, 1989, 1991). All four have argued that
language testing has come of age as a discipline in its own right within applied linguistics and
have presented substantial evidence, I believe, in support of this assertion. A common theme in
all these articles is that the field of language testing has much to offer in terms of
theoretical, methodological, and practical accomplishments to its sister disciplines in applied
linguistics. Since these papers provide excellent critical surveys and discussions of the field
of language testing, I will simply summarize some of the common themes in these reviews in Part 1
of this paper in order to whet the appetite of readers who may be interested in knowing what are
the issues and problems of current interest to language testers. These articles are nontechnical
and accessible to those who are not themselves language testing specialists. Furthermore, Skehan
(1991) and Alderson (1991) appear in collections of papers from recent conferences that focus on
current issues in language testing. These collections include a wide variety of topics of current
interest within language testing, discussed from many perspectives, and thus constitute major
contributions to the literature on language testing.
The purpose of this paper is to address a question that is, I believe, implicit in all of the
review articles mentioned above: What does language testing have to offer to researchers and
practitioners in other areas of applied linguistics, particularly in language learning and
language teaching? These reviews discuss several specific areas in which valuable contributions
can be expected (e.g., program evaluation, second language acquisition, classroom learning,
research methodology). Part 2 of this paper focuses on two recent developments in language
testing, discussing their potential contributions to language learning and language teaching. I
argue first that a theoretical model of second language ability that has emerged on the basis of
research in language testing can be useful for both researchers and practitioners in language
learning and language teaching. Specifically, I believe it provides a basis for both
conceptualizing second language abilities whose acquisition is the object of considerable
research and instructional effort, and for designing language tests for use both in instructional
settings and for research in language learning and language teaching. Second, I will describe an
approach to characterizing the authenticity of a language task, which I believe can help us to
better understand the nature of the tasks we set, either for students in instructional programs
or for subjects in language learning research, and which can thus aid in the design and
development of tasks that are more useful for these purposes.
PART 1: LANGUAGE TESTING IN THE 1990s
    In echoing Alderson’s (1991) title, I acknowledge the commonalities among the review
articles mentioned above in the themes they discuss and the issues they raise. While each review
emphasizes specific areas, all approach the task with essentially the same rhetorical
organization: a review of the achievements in language testing, or lack thereof, over the past
decade; a discussion of areas of likely continued development; and suggestions of areas in need
of increased emphasis to assure developments in the future. Both Alderson and Skehan argue that
while language testing has made progress in some areas, on the whole “there has been relatively
little progress in language testing until recently” (Skehan, 1991, p. 3). Skehan discusses the
contextual factors (theory, practical considerations, and human considerations) that have
influenced language testing in terms of whether these factors act as “forces for conservatism”
or “forces for change” (p. 3). The former, he argues, “all have the consequence of retarding
change, reducing openness, and generally justifying inaction in testing” (p. 3), while the
latter are “pressures which are likely to bring about more beneficial outcomes” (p. 7). All of
the reviews present essentially optimistic views of where language testing is going and what it
has to offer other areas of applied linguistics. I will group the common themes of these reviews
into the general areas of (a) theoretical issues and their implications for practical
application, (b) methodological advances, and (c) language test development.
THEORETICAL ISSUES
One of the major preoccupations of language testers in the past decade has been investigating
the nature of language proficiency. In 1980 the “unitary competence hypothesis” (Oller, 1979),
which claimed that language proficiency consists of a single, global ability, was widely accepted.
By 1983 this view of language proficiency had been challenged by several empirical studies and
abandoned by its chief proponent (Oller, 1983). The unitary trait view has been replaced, through
both empirical research and theorizing, by the view that language proficiency is
multicomponential, consisting of a number of interrelated specific abilities as well as a general
ability or set of general strategies or procedures. Skehan and Alderson both suggest that the
model of language test performance proposed by Bachman (1990b) represents progress in this area,
since it includes both components of language ability and characteristics of test methods,
thereby making it possible “to make statements about actual performance as well as underlying
abilities” (Skehan, 1991, p. 9). At the same time, Skehan correctly points out that as research
progresses, this model will be modified and eventually superseded. Both Alderson and Skehan
indicate that an area where further progress is needed is in the application of theoretical
models of language proficiency to the design and development of language tests. Alderson, for
example, states that “we need to be concerned not only with . . . the nature of language
proficiency, but also with language learning and the design and researching of achievement tests;
not only with testers, and the problems of our professionalism, but also with testees, with
students, and their interests, perspectives and insights” (Alderson, 1991, p. 5).
A second area of research and progress is in our understanding of the effects of the method
of testing on test performance. A number of empirical studies conducted in the 1980s clearly
demonstrated that the kind of test tasks used can affect test performance as much as the
abilities we want to measure (e.g., Bachman & Palmer, 1981, 1982, 1988; Clifford, 1981; Shohamy,
1983, 1984). Other studies demonstrated that the topical content of test tasks can affect
performance (e.g., Alderson & Urquhart, 1985; Erickson & Molloy, 1983). Results of these studies
have stimulated a renewed interest in the investigation of test content. And here the results
have been mixed. Alderson and colleagues (Alderson, 1986, 1990; Alderson & Lukmani, 1986;
Alderson, Henning, & Lukmani, 1987) have been investigating (a) the extent to which “experts”
agree in their judgments about what specific skills EFL reading test items measure, and at what
levels, and (b) whether these expert judgments about ability levels are related to the difficulty
of items. Their results indicate, first, that these experts, who included test designers assessing
the content of their own tests, do not agree and, second, that there is virtually no relationship
between judgments of the levels of ability tested and empirical item difficulty. Bachman and
colleagues, on the other hand (Bachman, Davidson, Lynch, & Ryan, 1989; Bachman, Davidson, &
Milanovic, 1991; Bachman, Davidson, Ryan, & Choi, in press) have found that by using a
content-rating instrument based on a taxonomy of test method characteristics (Bachman, 1990b) and by
training raters, a high degree of agreement among raters can be obtained, and such content
ratings are related to item difficulty and item discrimination. In my view, these results are
not inconsistent. The research of Alderson and colleagues presents, I believe, a sobering picture
of actual practice in the design and development of language tests: Test designers and experts in
the field disagree about what language tests measure, and neither the designers nor the experts
have a clear sense of the levels of ability measured by their tests. This research uncovers a
potentially serious problem in the way language testers practice their trade. Bachman’s
research, on the other hand, presents what can be accomplished in a highly controlled situation,
and provides one approach to solving this problem. Thus, an important area for future research
will be the refinement of approaches to the analysis of test method characteristics, of which
content is a substantial component, and the investigation of how specific characteristics of
test method affect test performance. Progress will be realized in the area of language testing
practice when insights from this area of research inform the design and development of language
tests. The research on test content analysis that has been conducted by the University of
Cambridge Local Examinations Syndicate, and the incorporation of that research into the design
and development of EFL tests, is illustrative of this kind of integrated approach (Bachman et
al., 1991).
    The 1980s saw a wealth of research into the characteristics of test takers and how these
are related to test performance, generally under the rubric of investigations into potential
sources of test bias; I can do little more than list these here. A
number of studies have shown differences in test performance across different cultural,
linguistic or ethnic groups (e.g., Alderman & Holland, 1981; Chen & Henning, 1985; Politzer &
McGroarty, 1985; Swinton & Powers, 1980; Zeidner, 1986), while others have found differential
performance between sexes (e.g., Farhady, 1982; Zeidner, 1987). Other studies have found
relationships between field dependence and test performance (e.g., Chapelle, 1988; Chapelle &
Roberts, 1986; Hansen, 1984; Hansen & Stansfield, 1981; Stansfield & Hansen, 1983). Such studies
demonstrate the effects of various test taker characteristics on test performance, and suggest
that such characteristics need to be considered in both the design of language tests and in the
interpretation of test scores. To date, however, no clear direction has emerged to suggest how
such considerations translate into testing practice. Two issues that need to be resolved in this
regard are (a) whether and how we assess the specific characteristics of a given group of test
takers, and (b) whether and how we can incorporate such information into the way we design
language tests. Do we treat these characteristics as sources of test bias and seek ways to
somehow “correct” for this in the way we write and select test items, for example? Or, if many
of these characteristics are known to also influence language learning, do we reconsider our
definition of language ability? The investigation of test taker characteristics and their effects
on language test performance also has implications for research in second language acquisition
(SLA), and represents what Bachman (1989) has called an “interface” between SLA and language
testing research.
METHODOLOGICAL ADVANCES
Many of the developments mentioned above (changes in the way we view language ability, the
effects of test method and test taker characteristics) have been facilitated by advances in the
tools that are available for test analysis. These advances have been in three areas:
psychometrics, statistical analysis, and qualitative approaches to the description of test
performance. The 1980s saw the application of several modern psychometric tools to language
testing: item response theory (IRT), generalizability theory (G theory), criterion-referenced
(CR) measurement, and the Mantel-Haenszel procedure. As these tools are fairly technical, I will
simply refer readers to discussions of them: IRT (Henning, 1987), G theory (Bachman, 1990b;
Bolus, Hinofotis, & Bailey, 1982), CR measurement (Bachman, 1990b; Hudson & Lynch, 1984),
Mantel-Haenszel (Ryan & Bachman, in press). The application of IRT to language tests has brought
with it advances in computer-adaptive language testing, which promises to make language tests
more efficient and adaptable to individual test takers, and thus potentially more useful in the
types of information they provide (e.g., Tung, 1986), but which also presents a challenge not to
complacently continue using familiar testing techniques simply because they can be administered
easily via computer (Canale, 1986). Alderson (1988a) and the papers in Stansfield (1986) provide
extensive discussions of the applications of computers to language testing.
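
As a rough illustration of why IRT lends itself to computer-adaptive testing, the sketch below assumes the simplest IRT model, the one-parameter logistic (Rasch) model, in which the probability of a correct response depends only on the difference between the test taker's ability and the item's difficulty. The item bank, the item names, and the selection rule (administer the unused item that is most informative at the current ability estimate) are assumptions made for illustration; they are not drawn from the studies cited above, and operational adaptive tests involve further steps such as re-estimating ability after each response.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Fisher information of a Rasch item at a given ability: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


def next_item(ability_estimate, item_bank, administered):
    """Return the name of the unused item that is most informative
    at the current ability estimate."""
    candidates = {
        name: difficulty
        for name, difficulty in item_bank.items()
        if name not in administered
    }
    return max(
        candidates,
        key=lambda name: item_information(ability_estimate, candidates[name]),
    )


# Hypothetical item bank: item name -> difficulty on the logit scale.
item_bank = {"item_easy": -1.5, "item_medium": 0.0, "item_hard": 1.5}

# A test taker currently estimated at 1.2 logits is given the hard item,
# because its difficulty lies closest to the ability estimate.
first_choice = next_item(1.2, item_bank, administered=set())
```

Because information under this model peaks where ability equals difficulty, the selection rule amounts to choosing the item whose difficulty is nearest the current estimate, which is the sense in which an adaptive test can be tailored to the individual test taker.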
The major advance in the area of statistical analysis has been the application of structural
equation modeling to language testing research. (Relatively nontechnical discussions of
structural equation modeling can be found in Long, 1983a, 1983b.) The use of confirmatory factor
analysis was instrumental in demonstrating the untenability of the unitary trait hypothesis, and
this type of analysis, in conjunction with the multitrait/multimethod research design, continues
to be a productive approach to the process of construct validation. Structural equation modeling
has also facilitated the investigation of relationships between language test performance and
test taker characteristics (e.g., Fouly, 1985; Purcell, 1983) and different types of language
instruction (e.g., Sang, Schmitz, Vollmer, Baumert, & Roeder, 1986).
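
The logic of the multitrait/multimethod design can be sketched with a simple inspection of correlations. This is a simplified stand-in, not the confirmatory factor analysis used in the research described above. The two traits, the two methods, and the scores for six test takers are all invented for illustration. If the traits are distinct and are being measured validly, correlations between different methods measuring the same trait (convergent evidence) should exceed correlations between different traits measured by the same method.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented scores for six hypothetical test takers, keyed by (trait, method).
scores = {
    ("grammar", "multiple_choice"): [1, 2, 3, 4, 5, 6],
    ("grammar", "oral_interview"): [2, 1, 4, 3, 6, 5],
    ("pragmatic", "multiple_choice"): [3, 1, 2, 6, 4, 5],
    ("pragmatic", "oral_interview"): [3, 1, 2, 6, 5, 4],
}


def mtmm_correlations(scores):
    """Split pairwise correlations into same-trait/different-method
    (convergent) and different-trait/same-method groups."""
    convergent, same_method = {}, {}
    keys = list(scores)
    for i, (trait_a, method_a) in enumerate(keys):
        for trait_b, method_b in keys[i + 1:]:
            r = pearson(scores[(trait_a, method_a)], scores[(trait_b, method_b)])
            if trait_a == trait_b and method_a != method_b:
                convergent[trait_a] = r
            elif trait_a != trait_b and method_a == method_b:
                same_method[method_a] = r
    return convergent, same_method


convergent, same_method = mtmm_correlations(scores)
```

In these invented data every convergent correlation is larger than every same-method correlation, the pattern one would expect if the two traits were distinct; a single global ability would instead tend to produce uniformly high correlations across all pairs.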
    A third methodological advance has been in the use of introspection to investigate the
processes or strategies that test takers employ in attempting to complete test tasks. Studies
using this approach have demonstrated that test takers use a variety of strategies in solving
language test tasks (e.g., Alderson, 1988c; Cohen, 1984) and that these strategies are related to
test performance (e.g., Anderson, Cohen, Perkins, & Bachman, 1991; Nevo, 1989).
Perhaps the single most important theoretical development in language testing in the 1980s
was the realization that a language test score represents a complexity of multiple influences. As
both Alderson and Skehan point out, this advance has been spurred on, to a considerable extent,
by the application of the methodological tools discussed above. But, as Alderson (1991) notes,
“the use of more sophisticated techniques reveals how complex responses to test items can be and
therefore how complex a test score can be” (p. 12). Thus, one legacy of the 1980s is that we now
know that a language test score cannot be interpreted simplistically as an indicator of the
particular language ability we want to measure; it is also affected to some extent by the
characteristics and content of the test tasks, the characteristics of the test taker, and the
strategies the test taker employs in attempting to complete the test task. What makes the
interpretation of test scores particularly difficult is that these factors undoubtedly interact
with each other. The particular strategy adopted by a given test taker, for example, is likely to
be a function of both the characteristics of the test task and the test taker’s personal
characteristics. This realization clearly indicates that we need to consider very carefully the
interpretations and uses we make of language test scores and thus should sound a note of caution
to language testing practitioners. At the same time, our expanded knowledge of the complexity of
language test performance, along with the methodological tools now at our disposal, provides a
basis for designing and developing language tests that are potentially more suitable for specific
groups of test takers and more useful for their intended purposes.
ADVANCES IN LANGUAGE TEST DEVELOPMENT