Student Teaching Evaluations:
Psychometric, Methodological, and Interpretational Issues

Anthony J. Onwuegbuzie
Professor
Department of Educational Measurement and Research
College of Education
Sam Houston State University
E-mail: tonyonwuegbuzie@aol.com

Larry G. Daniel
Dean of the College of Education and Human Services
University of North Florida

Kathleen M. T. Collins
Assistant Professor
Department of Curriculum and Instruction
University of Arkansas at Fayetteville
E-mail: kcollinsknob@cs.edu

Student teaching evaluation (STE) instruments, which typically are completed by the students at or toward the end of the course, represent the most common way of assessing faculty teaching performance at institutions of higher education in the United States (Dommeyer, Baum, Chapman, & Hanna, 2002). STEs routinely are utilized by administrators to make decisions about faculty tenure, promotion, and merit pay increases. Although some STE forms contain one or more open-ended items that allow students to delineate their perceptions about their instructors’ teaching styles, these instruments almost exclusively or predominantly contain one or more rating scales containing Likert-format items. It is data from these scales that typically form the basis of decisions made by administrators and stakeholders. Unfortunately, because many administrators do not have adequate training in measurement and statistics (Onwuegbuzie & Daniel, 2003), they are not fully cognizant of how easy it is for them to misuse and abuse an STE form, culminating in “unwarranted and unjust termination for large numbers of junior faculty and a source of humiliation for many of their senior colleagues” (Gray & Bergmann, 2003, p. 44). Such misuses include “treat[ing] relative position [of a rating] as if it were an absolute measure of merit” (Gray & Bergmann, 2003, p. 45), and “not recognizing that even in departments with mostly effective instructors, 50% of teachers would be rated below the department median” (Onwuegbuzie, Daniel, & Collins, in press, p. 4). Further, insufficient grounding in measurement and statistics on the part of faculty renders them unable to defend themselves against their administrators’ invalid interpretations of their STE ratings.

Therefore, in order to reverse this trend of STE abuse and misuse, it is imperative that faculty and administrators be aware of the psychometric, methodological, and interpretational issues associated with STEs. In an attempt to address the first of these issues (i.e., to provide psychometric data), Onwuegbuzie et al. (in press) conducted an extensive review of the STE literature for the purpose of assessing the score reliability and score validity of STEs across studies. The findings of this review are summarized in the following two sections.

Reliability. After examining studies that reported data on the score reliability of STEs, Onwuegbuzie et al. (in press) surmised that although the majority of studies have yielded STE scores with large reliability coefficients, some researchers have reported STE scores that have yielded low reliability coefficients. Thus, Onwuegbuzie et al. concluded that the reliability evidence has been mixed.

Validity. Building on Messick’s (1989, 1995) theory of validity, Onwuegbuzie et al. (in press) developed what they referred to as a meta-validity model that subdivides content-, criterion-, and construct-related validity into several areas of evidence. This meta-validity model is presented in Table 1 at the end of this document. According to Onwuegbuzie et al., although validity is treated as a unitary concept, all of these areas of evidence are needed when assessing the score validity of STEs. Onwuegbuzie et al. utilized their meta-validity model to conduct a meta-validity analysis of STEs. In particular, they assessed the score validity of STEs based on findings from the extant literature. Based on their analysis, they noted that (a) strong evidence has been provided with regard to areas of criterion-related validity, but (b) in general, weak or inadequate evidence has been provided with regard to areas of both content-related and construct-related validity. Table 2 (at end of this document) provides a summary of their conclusions for each validity type. Onwuegbuzie et al. concluded that their analysis seriously calls into question both the score validity and utility of STEs.

Purpose of Article

Onwuegbuzie et al. (in press) provided useful information about psychometric issues pertaining to STEs. Therefore, the purpose of this article is to build on their work by bringing to the fore some important methodological and interpretational issues.

First, we outline the strengths and limitations of STEs, as well as faculty and student perceptions of STE instruments. Second, we discuss how STE instruments can be abused and misused. We subdivide this section into measurement errors and interpretational errors. Measurement errors include inserting items on the rating scale that are not directly related to teaching effectiveness, such as the extent to which students enjoyed reading the course textbook and the geographical dialect spoken by the instructor. Interpretational errors include not taking into account indices of statistical significance (e.g., confidence intervals) or practical significance (e.g., effect sizes) when comparing ratings among faculty or units, and not statistically adjusting for factors known to affect student ratings such as perceived course difficulty, required status of the course, course format and delivery method, lecture content, and level of the course. We also discuss the effect of missing data on overall student ratings. Findings from the literature base are used to support our typology of errors. Third, we examine real data stemming from STE ratings utilized at one particular university to illustrate several of the articulated problems with these instruments.

Finally, we provide guidelines for proper use and interpretation of STE ratings. In so doing, we contend that the construction of STEs must begin by setting out the parameters, definitions, and assumptions of the object of the evaluation. Also, we propose that STE instruments should provide students the opportunity for open-ended responses. Moreover, we argue that STE instruments should never be used in isolation to evaluate instructional effectiveness. Rather, they should be combined with other measures of teaching competence.

Strengths and Limitations of Student Teaching Evaluations

The strength of STE forms, as part of an information-gathering process, is their role in accumulating data for both formative and summative evaluation purposes. Ory (2000) pointed out that teaching evaluations can provide a comprehensive measure of faculty performance characteristics, and that faculty can use students’ recommendations to improve overall course design and implementation. Although the formative goal is to provide faculty with feedback and recommendations for improving teaching performance, the connection between evaluation data and faculty development is tenuous (Centra, 1979; Ory, 2000; Simpson, 1995; Theall & Franklin, 2001). This is unfortunate because the literature indicates that the combination of rating information and faculty development can promote improvements in teaching (Theall & Franklin, 2001).

Administrators can use students’ responses to global questions as one component of their overall summative assessments of faculty’s teaching effectiveness (Simpson, 1995). However, utilizing student ratings as a means of ranking faculty can diminish collegiality and promote an “unhealthy brand of competition” (Simpson, 1995, p. 5). Another limitation is that, in response to the evaluative process, some instructors may attempt to improve their student evaluations by adopting behaviors, such as lenient grading, that are not related to student learning.

Perceptions of STE Rating Forms

The current trend toward increased accountability in higher education has escalated the importance of student rating systems in personnel-related decisions (Johnson & Ryan, 2003). Administrators’ interpretations of STE data, in the context of formulating personnel decisions, are tempered by their knowledge of the student rating literature and their statistical training (Adams, 1997; McKeachie, 1997; Theall & Franklin, 2001). Theall and Franklin (2001) contended that faculty and administrators are not prepared to interpret student rating forms accurately. Inspection of several hundred survey reports revealed that faculty members’ and administrators’ limited knowledge of the student ratings literature, as well as their lack of competence in basic statistics, affected their ability to interpret student rating forms accurately. On a more encouraging note, in an earlier study Cohen (1980) reported that the quality of administrators’ and faculty members’ interpretations of STE data improved when they were provided with technical assistance.

The perceptions of faculty about the degree to which STE instruments measure teaching effectiveness are mixed. Kulik (2001) observed that some faculty believe numerical rating scales lend scientific precision to the evaluative process and give students opportunities to participate in their educational experience. However, other faculty members believe that STE instruments provide trivial data and instead may, at least in part, be a measure of teacher personality. In addition, although faculty generally believe that student feedback is important for improving teaching effectiveness, some faculty perceive that STE results are weighted more toward accountability and quality monitoring (Johnson & Ryan, 2003; Kulik, 2001; Penny, 2003).

Although research indicates that faculty and students agree on behaviors associated with effective teaching (Feldman, 1988), students generally do not have a sufficient background to evaluate a professor’s knowledge base within the subject area (Theall & Franklin, 2001). However, Theall and Franklin (2001) contended that it is reasonable to assume that students are qualified to assess other factors associated with teaching effectiveness because during the semester they have multiple opportunities to (a) observe teaching behaviors, such as presentational skill and teaching style; (b) engage in interactions with the professor in the context of the class; and (c) evaluate the professor’s accessibility and helpfulness outside of class.

Overall, students believe that they are qualified to assess their professors’ teaching; however, students also express reservations about the extent to which their responses to STE forms are valued by administrators and faculty (Abbott, Wulff, Nyquist, Ropp, & Hess, 1990; Friedlander, 1978; Marlin, 1987; Miron & Segal, 1986). Spencer and Schmelkin (2002) noted that students perceived that the two most important reasons for completing the STE forms were to provide feedback to the professor and to provide information to their student peers. Although students believed that completing the forms provided information to their peers, they expressed little interest in consulting the STE forms when making their course selections. Interestingly, similar to the studies cited above, students’ responses also indicated that they were not convinced that administrators and faculty attend to student opinion.

Student perceptions about effective instruction may be a biasing factor in evaluating faculty effectiveness (Penny, 2003). For example, the current pedagogical shift from teacher-centered instruction based on direct and explicit teaching to student-centered instruction consistent with inquiry and problem-based teaching may not be familiar to students and, therefore, the latter may not be recognized as effective teaching (Penny, 2003). Consequently, teachers adhering to an inquiry-based teaching approach may receive an unfair evaluation.

Abuse and Misuse of Student Teaching Evaluations

Measurement Errors

There are well-constructed STE rating forms, such as the Instructional Development and Effectiveness, Student Instructional Report, and Students’ Evaluation of Educational Quality forms (Centra, 1993). However, Penny (2003) concluded that the items on many rating surveys represent “ad hoc lists of items” (p. 401) that do not consistently reflect the dimensions defining teaching effectiveness and that include poorly constructed global questions.

According to Schmelkin, Spencer, and Gellman (1997), administrators aggregate students’ responses without taking into account the demographic profile of the students enrolled in the class. However, contemporary post-secondary student populations include nontraditional students, students representing various cultural and racial backgrounds, and international students, all of whom may have alternative perceptions of effective teaching that can differentially impact ratings of faculty effectiveness (Theall & Franklin, 2000). Often students’ responses to items on rating scales are quantified, and small variations may be given disproportionate importance (Centra, 1979). Evidence indicates that student ratings, generally, are not impacted by situational variables (Marsh, 1987; Theall & Franklin, 2001). However, Theall and Franklin (2001) noted that although teaching quality may be consistent across various contexts, effective student learning might be mitigated by certain conditions (e.g., large class size versus smaller class size).

As universities incorporate alternative instructional formats, such as online courses, into the higher education curriculum, interpretation errors may occur when STE items that are used to assess faculty effectiveness in traditional or “face to face” classrooms are utilized to evaluate online teaching and learning (Theall & Franklin, 2000). Recently, Hoffman (2003) conducted a study to assess the ways colleges and universities were using the Internet to assess teaching effectiveness in online courses versus face-to-face courses. Results indicated that universities and colleges were more likely to use the Internet to distribute evaluation results to faculty than to collect student data. Findings also indicated that nearly two-thirds of the sample (n = 500) reported using the Internet to evaluate online courses, whereas only 17% were using the Internet to evaluate face-to-face courses. Another way of generating measurement error is to obtain a response rate that does not adequately represent the student population. For example, Bothell and Henderson (2003) identified nonresponse bias as a systematic variation in response rate that occurs when individuals representing a specific group, in contrast to other groups within a population, are less likely to respond to an online survey. Indeed, research indicates that males are more likely to respond to an online survey than are females (Palmquist & Stueve, 1996). This is an important area for future research.

The most direct way of generating measurement error is to include items that students are unable to answer adequately such as the domain-specific knowledge of the professor. Measurement errors also can occur by including irrelevant items that do not provide useful formative data, such as asking students if they enjoyed or liked the textbook and asking students to evaluate the instructor’s spoken English. In the former example, textbooks used in courses that most students traditionally find difficult, such as statistics and research methodology, likely would be viewed negatively, thereby severely skewing the responses. The latter item, involving level of spoken English, likely would place at a disadvantage those teachers who were born and raised in geographical locations that are far removed from their institutions.

Interpretation Errors

Even if an STE scale can be developed that yields scores that are perfectly reliable and valid every time it is administered, it will still be very much open to misuse and abuse (Theall & Franklin, 2001). This is because neither optimal measurement nor optimal data analysis guarantees optimal interpretations and inferences. For many faculty, STEs represent a form of high stakes assessment because they often play an important role in decisions regarding tenure, promotion, and merit pay. Thus, the misinterpretation of STE scores can have dire consequences for the person(s) concerned. Unfortunately, it appears that many administrators, stakeholders, and even instructors incorrectly interpret STE data. This tendency to misinterpret such information likely stems from inadequate background in statistics and measurement among many faculty and administrators.

Examples of ways in which STE summary data are misused and/or abused include:

(a) not taking into account extraneous variables when comparing ratings across faculty or units;
(b) not taking into account indices of statistical significance (e.g., confidence intervals) or practical significance (e.g., effect sizes) when comparing ratings across faculty or units;
(c) ignoring the effect of missing data on overall student ratings;
(d) failing to report reliability indices for STE scores; and
(e) failing to take into account other indices of instructional effectiveness.

Each of these examples is discussed below.

Not Taking into Account Extraneous Variables When Comparing Ratings

Perhaps the major way of interpreting STE data is to compare item or scale means of a target instructor to data pertaining to one or more other faculty members, academic programs, departments, the college, and/or the university. However, this practice is extremely dangerous and represents a misuse/abuse of the STE form because it does not take into account the array of extraneous variables that prevail. For example, is it valid or ethical to compare the mean rating of a faculty member teaching an elective course to a faculty member teaching a required course? What about the validity and ethicalness of comparing an instructor of an undergraduate course to an instructor of a graduate course? A statistics instructor to a history teacher? A first-year assistant professor to a tenured full professor? Such comparisons could lead to erroneous inferences.

If such comparisons were needed by the administrator, then statistical adjustments would have to be made. That is, administrators should statistically adjust for factors known to affect student ratings such as perceived course difficulty, required status of the course, course format and delivery method, lecture content, and level of the course. Unfortunately, such an adjustment would necessitate sophisticated statistical techniques (e.g., multiple regression), which, although more appropriate, would be even more difficult for many administrators to interpret. An easier way would be to limit the comparison group; for example, a statistics teacher would be compared only to other statistics instructors within the same department. Regardless, it is likely that the statistics instructors being compared would differ in at least one important way that would invalidate the comparison unless a statistical adjustment were made. Even if the target instructor did not differ in any important way from the comparison group, comparing mean ratings on a one-shot basis still could be misleading. In fact, possibly the most informative and ethical comparison for an administrator to make would be a within-instructor comparison of teaching effectiveness, whereby an instructor’s mean rating for a particular class would be compared to the mean rating of the same instructor in another course taught at the same point in time, in the same course in a previous semester, or in any other course(s) taught previously by the instructor.
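To make the idea of statistical adjustment concrete, the following is a minimal sketch, not a recommended procedure. It assumes a single covariate (perceived course difficulty) and class-level mean ratings, and it uses simple least-squares regression as a deliberately simplified stand-in for the multiple regression mentioned above. The data and the function names are hypothetical and serve only as illustration.

```python
from statistics import mean

def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def adjusted_ratings(difficulty, rating):
    """Rating each class would be predicted to receive at the average difficulty."""
    slope, _ = ols_slope_intercept(difficulty, rating)
    mean_difficulty = mean(difficulty)
    return [r - slope * (d - mean_difficulty) for d, r in zip(difficulty, rating)]

# Hypothetical class-level data (assumed for illustration only):
# perceived difficulty (1-5) and mean STE rating (1-5) for six classes.
difficulty = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
rating = [4.6, 4.5, 4.3, 4.2, 4.1, 3.9]

adjusted = adjusted_ratings(difficulty, rating)
raw_spread = max(rating) - min(rating)
adjusted_spread = max(adjusted) - min(adjusted)
```

In this made-up example most of the apparent gap between the highest-rated and lowest-rated classes disappears once difficulty is held constant, which is the kind of erroneous inference the unadjusted comparison invites. A real adjustment would need several covariates at once and would require checking the regression assumptions.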

Not Taking into Account Indices of Statistical and Practical Significance When Comparing Ratings

Comparing STE ratings among faculty or units without taking into account indices of statistical significance (e.g., confidence intervals) or practical significance (e.g., effect sizes) represents another misuse/abuse of STE summary data. Yet, many administrators find it difficult to resist making such uninformed comparisons. Those who lack the necessary statistical sophistication may end up declaring, for example, that a mean score of 3.8 on a 4-point scale represents significantly more effective instruction than a mean score of 3.7. However, such a difference would be notable only if the standard errors associated with these means were extremely small. Thus, interpretations should be made via confidence intervals. Because confidence intervals are sufficiently complex statistics, those responsible for generating the initial summary data also should provide confidence intervals as part of the summary sheet, so that administrators would not have to calculate these intervals. Clearly, faculty and administrators would have to receive a quick “crash course” in interpreting confidence intervals. Even when confidence intervals are used, the interpreter should refrain from comparing classes with extremely different sample sizes because sample size can greatly affect mean ratings. For example, holding everything else equal, it is easier for a teacher of a class with a small number of students (say, n = 5) to attain a high, if not perfect, rating compared to a teacher of a large class. On the other hand, one dissenting student in a class containing 5 students likely would affect the mean rating to a much greater extent than would one dissenter in a class of 30 students.
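The comparison of a 3.8 mean with a 3.7 mean can be illustrated with a short sketch. The ratings below are hypothetical, the class size of 30 is an assumption, and the interval uses a large-sample normal approximation; a t-based interval, which would be wider, is preferable for small classes. Cohen’s d is used here as one common effect-size index; the choice of index is ours for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(ratings, level=0.95):
    """Large-sample (normal-approximation) confidence interval for a mean rating."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    center = mean(ratings)
    return center - half_width, center + half_width

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical ratings on a 4-point scale for two classes of 30 students each.
class_a = [4] * 24 + [3] * 6   # mean 3.8
class_b = [4] * 21 + [3] * 9   # mean 3.7

ci_a = mean_ci(class_a)
ci_b = mean_ci(class_b)
intervals_overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
effect_size = cohens_d(class_a, class_b)

# One dissenting rating of 1 among ratings of 4, in a small and a large class.
small_class_mean = mean([4] * 4 + [1])    # 5 students
large_class_mean = mean([4] * 29 + [1])   # 30 students
```

With these assumed data the two intervals overlap substantially and the standardized difference is small, so the 0.1 gap would not support a claim of more effective instruction. The last two lines show the dissenter effect: the same single low rating pulls the mean of the 5-student class down to 3.4 but that of the 30-student class only to 3.9.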

Ignoring the Effect of Missing Data on Overall Student Ratings

Another interpretational error might occur if a relatively large proportion of students in a class, for whatever reason, did not complete the STE form, resulting in missing data. If the students who failed to complete the STE form held very different views of their instructor’s effectiveness from those who did complete it, then any ensuing interpretation would not be adequately representative of the instructor. However, this problem likely can be reduced when administrators make clear to students the importance of participating in the evaluation process. Another effective strategy might be for the university to have an official week in which STEs are administered so that students could be notified well in advance. In any case, it is imperative that administrators pay careful attention to sample sizes when comparing STE mean ratings.
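One simple way to see how much missing data can matter is to ask how far the class mean could move if every nonrespondent had given the lowest or the highest possible rating. The sketch below computes these worst-case bounds; the function name, the 5-point scale, and the example class are assumptions made for illustration, and the bounds are deliberately conservative rather than a substitute for a proper missing-data analysis.

```python
def mean_bounds(responses, class_size, scale_min=1, scale_max=5):
    """Lowest and highest possible class mean given the ratings actually received."""
    n_missing = class_size - len(responses)
    if n_missing < 0:
        raise ValueError("more responses than enrolled students")
    total = sum(responses)
    lowest = (total + n_missing * scale_min) / class_size
    highest = (total + n_missing * scale_max) / class_size
    return lowest, highest

# Hypothetical class: 30 students enrolled, 12 completed the form.
responses = [5] * 6 + [4] * 6          # observed mean of 4.5
observed_mean = sum(responses) / len(responses)
lowest, highest = mean_bounds(responses, class_size=30)
```

In this assumed case the observed mean is 4.5, yet the mean for the full class could lie anywhere between 2.4 and 4.8. The width of that range, rather than the observed mean alone, indicates how cautiously a low-response-rate rating should be read.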

Failure to Report Reliability Indices for STE Scores

As noted by Onwuegbuzie and Daniel (2004), low score reliability affects not only the statistical power of hypothesis tests, but also affects all kinds of descriptive statistics. Low score reliability tends to make all summary statistics untrustworthy. The importance of university administrators assessing score reliability of STEs is evidenced by court rulings (e.g., Scott vs. University of Delaware, 601 F. 2d 76, 1979; Wagner vs. Long Island University, 419 F. SUPP 618, 1976) concerning university practices. Thus, administrators should require that score reliability coefficients accompany all summary data.
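The text above does not name a particular reliability coefficient. As one widely used example, the sketch below computes Cronbach’s alpha (an internal-consistency estimate) from item-level ratings. The choice of coefficient, the function name, and the data are assumptions for illustration only.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item ratings."""
    k = len(scores[0])                                   # number of items
    item_variances = [variance(item) for item in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical ratings: 5 students (rows) by 3 STE items (columns), 5-point scale.
ratings = [
    [5, 5, 4],
    [4, 4, 4],
    [3, 4, 3],
    [5, 4, 5],
    [2, 3, 2],
]

alpha = cronbach_alpha(ratings)
```

For these made-up ratings alpha is roughly .90. A summary sheet could report such a coefficient for each class or unit alongside the means, so that readers can judge how much trust the summary statistics deserve.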

Failure to Take into Account Other Indices of Instructional Effectiveness

The sole reliance upon a single measure of instructional effectiveness (i.e., STE) has received much criticism in the literature (Centra, 1993; Ramsden & Dodds, 1989; Ramsden & Martin, 1992; Seldin, 1993), with authors contending that use of a single instrument provides an incomplete, and perhaps invalid, index of an instructor’s effectiveness. These authors have recommended that quantitative information be combined with qualitative information, with the latter taking the form of students’ responses to one or more open-ended questions about the instructor’s effectiveness or the like. Further, some authors (e.g., Williams & Desley, 1997) have recommended that instructional effectiveness data be collected from sources other than STEs, including feedback from colleagues or educational experts, analysis of student work, teacher portfolios, and reflective self-evaluations.  For example, at the University of North Florida (UNF), administrators are required to use multiple lines of evidence when evaluating the effectiveness of instruction. Summative evaluations of teaching must be based on at least three different information sources, including:

(1) course materials (e.g., syllabi, exams, etc.);
(2) self-evaluation report;
(3) documented peer assessment;
(4) student opinions (e.g., written narratives, university-sanctioned survey instruments) (UNF, 2000, p. 5/2).

Analysis of Student Teaching Evaluation Data

A simple data example is provided here to illustrate selected points made earlier in this paper. All data are taken from public domain files found on the UNF website (http://www.unf.edu/dept/inst-research/ISQ-Fall03.htm) for the fall 2003 semester. Data featured show score summaries for a 23-item instrument known as the “Instructional Satisfaction Questionnaire” (ISQ). Each ISQ item is rated on a 5-point Likert-type scale ranging from either (a) “(1) Strongly Disagree” to “(5) Strongly Agree” or (b) “(1) Poor” to “(5) Excellent”.  In all cases, higher scores connote more favorable impressions. Per academic affairs policy, all UNF faculty are required to collect and report data for all classes they teach for at least one semester during the academic year, and untenured faculty are required to submit data for all courses taught in both the fall and spring semesters (UNF, 2000). Mean scores on a subset of eight “core” items must be included in the formal summative evaluation process. Data from the remaining 15 items are used as the faculty member or administrator chooses.

Table 3 (at end of this document) presents data in summary form for each of the 23 items for various faculty groups/departments within the College of Education and Human Services. University data also are presented for purpose of comparative analysis. Although it is a somewhat rough summary of large amounts of data across many instructors and courses, the data demonstrate certain trends that provide useful information. First, the fact that all variables have a value of 4 or higher for all cohorts indicates either a halo effect or a ceiling effect. It is obvious that relatively few students are endorsing responses of less than 4 on any of the items. Consequently, the response format represents more of a dichotomy than a continuous scale. Second, as indicated by item standard deviations across the various cohorts, there are mostly small differences across faculty groups for any given item, indicating that students view teachers rather uniformly. Third, several items are marked consistently lower than others. For example, the first two core items (“description of course objectives and assignments” and “communication of ideas and information”) are among the lowest scored items across all eight cohorts. Conversely, there are other items marked consistently higher than others (e.g., supplemental items #2 and #3).

In addition to looking at trends in the data, one might also want to critique the ISQ items themselves. For example, as previously noted, a student would have a difficult time judging the instructor’s knowledge of the content; nevertheless, this is precisely what is being addressed in the third supplemental item. It is interesting that students tended to give this item higher ratings than other items on the survey, indicating that students may be judging the faculty members’ knowledge of content against their own knowledge rather than against the knowledge of other faculty. Follow-up focus group interviews with various cohorts of students who have completed this or a similar instrument would provide additional information about how students make decisions about instructors’ knowledge base or mastery of content.

It also is interesting to compare scores on items that are designed to measure similar constructs. For example, supplemental item #1 and core item #2 both deal with the instructor’s skill as a communicator; however, students consistently across all cohorts gave the former a higher rating than the latter (approximately .2 higher).  Similarly, supplemental item #14 and core item #4 are rather consistent; however, with only one exception, students gave slightly higher endorsements to the former (approximately .1 higher). These results have implications for score validity and reliability and suggest that additional psychometric integrity studies on ISQ data may be warranted.

Guidelines for Use and Interpretation of Student Teaching Evaluations

In this section we provide guidelines for developing STEs that administrators should consider implementing. First and foremost, we contend that the construction of STEs must begin by setting out the parameters, definitions, and assumptions of the object of the evaluation, as well as a sound rationale or theory behind the choice of STE form. By outlining these factors, developers of STEs would put themselves in a better position to develop appropriate instruments.

When STE forms are being developed, they should not include any items that students are not in a position to answer appropriately. For example, students should not be asked to rate the instructor’s level of knowledge (Seldin, 1993). Nor should students be asked to evaluate whether course materials are sufficiently current (Seldin, 1993). These judgments require extensive professional and academic experience. Conversely, students should be asked to assess what they explicitly learned in the course. Students also should be given the opportunity to rate the instructor’s ability to communicate at the students’ level, to have rapport with students, to stimulate interest and motivation, and to promote ethical and professional behavior. In addition, the STE scale should include some items that tap specific teacher behaviors (e.g., “All assignments were graded and returned in a timely manner”).

If the STE data are to be used for formative purposes, then the evaluation forms should be administered before the midpoint of the semester, so that there is sufficient time for adjustments to be made by the instructor if the data suggest this. If the STE results are to be used for summative purposes, then the form should be administered toward the end of the course; however, if possible, it should not be administered on the same day as the final examination. Also, as long as STE scores are intended to be compared among instructors, the administration should be standardized, considering that lack of standardization affects score reliability (Fernald, 1990). To this end, a student volunteer should be given a script containing explicit instructions to read to the class. The instructor should not be in the classroom at any point during this administration process because presence of the instructor has been found to bias the data (Feldman, 1979; Marsh, 1984). Finally, all students should be informed how their ratings will be used.

Summary data that are provided to instructors should include not only the mean rating for each item and for the full scale, but also the corresponding standard deviations, and 95% confidence intervals. Reporting of confidence intervals is consistent with the recommendations of the American Psychological Association (APA) Task Force (Wilkinson & the Task Force on Statistical Inference, 1999). Such information would allow instructors, administrators, and other decision makers (e.g., tenure committees) to make more valid and meaningful comparisons among faculty. STE score reliability pertaining to the various academic units also should be reported so that all interpretations could be made after taking into account this information (Onwuegbuzie & Daniel, 2002, 2004). All faculty and administrators should then receive formal training in how to interpret the STE summary data.

Moreover, Theall and Franklin (2001, pp. 52-54) outlined the following guidelines for STEs:

1.  “Establish the purpose of the evaluation and the uses and users of ratings beforehand”
2.  “Include stakeholders in decisions about evaluation process by establishing policy process”
3.  “Publicly present clear information about the evaluation criteria, process and procedures”
4.  “Produce reports that can be understood easily and accurately”
5.  “Educate the users of ratings results to avoid misuse and misinterpretations”
6.  “Keep a balance between individual and institutional needs in mind”
7.  “Include resources for improvement and support of teaching and teachers”
8.  “Keep formative evaluations confidential and separate from summative decision making”
9.  “Adhere to rigorous psychometric and measurement principles and practices”
10.  “Regularly evaluate the evaluation system”
11.  “Establish a legally defensible process and a system for grievances”
12.  “Consider the appropriate combination of evaluation data with assessment and institutional research information”

In the context of improving online teaching evaluations, various researchers have made recommendations such as creating appropriate survey items, devising systems for monitoring student response rates and protecting the anonymity of respondents, and assessing administrator, faculty, and student readiness (e.g., computer skills) and access to online resources (Bullock, 2003; Sorenson & Reiner, 2003).

Finally, we recommend that STE instruments also give students opportunities to provide open-ended responses. Moreover, we argue that STE instruments should never be used in isolation to evaluate instructional effectiveness. Rather, they should be combined with other measures of teaching competence such as:

  • statements of the instructor’s teaching philosophies, responsibilities, and short- and long-term goals
  • copies of course syllabi and/or other teaching materials
  • a history of the instructor’s initiatives designed to improve teaching
  • a description of curricular revisions undertaken
  • research on teaching undertaken
  • evidence of advising and mentoring
  • self-evaluations undertaken by the instructor
  • evaluations by peers and administrators (both inside and outside the unit/institution)
  • unsolicited written comments made by students
  • evaluations from student advisees
  • evidence of participation in curriculum creation or teaching improvement efforts within the discipline
  • videotapes of a typical class
  • samples of students’ work
  • records of student achievement after leaving the course and/or institution
  • records of student achievement in more advanced courses
  • statements from alumni on the quality of the instruction
  • teaching portfolios (Arreola, 1995; Centra, 1994; Koon & Murray, 1995; Seldin, 1993).

Tables

Table 1: Areas of Validity Evidence in Meta-Validation Model

VALIDITY TYPE            DESCRIPTION

Criterion-Related:

Concurrent Validity      Assesses the extent to which scores on an instrument are related to scores on another, already-established instrument administered approximately simultaneously or to a measurement of some other criterion that is available at the same point in time as the scores on the instrument of interest

Predictive Validity      Assesses the extent to which scores on an instrument are related to scores on another, already-established instrument administered in the future or to a measurement of some other criterion that is available at a future point in time

Content-Related:

Face Validity            Assesses the extent to which the items appear relevant, important, and interesting to the respondent

Item Validity            Assesses the extent to which the specific items represent measurement in the intended content area

Sampling Validity        Assesses the extent to which the full set of items samples the total content area

Construct-Related:

Substantive Validity     Assesses evidence regarding the theoretical and empirical analysis of the knowledge, skills, and processes hypothesized to underlie respondents’ scores

Structural Validity      Assesses how well the scoring structure of the instrument corresponds to the construct domain

Convergent Validity      Assesses the extent to which scores yielded from the instrument of interest are highly correlated with scores from other instruments that measure the same construct

Discriminant Validity    Assesses the extent to which scores generated from the instrument of interest are slightly but not significantly related to scores from instruments that measure concepts theoretically and empirically related to but not the same as the construct of interest

Divergent Validity       Assesses the extent to which scores yielded from the instrument of interest are not correlated with measures of constructs antithetical to the construct of interest

Outcome Validity         Assesses the meaning of scores and the intended and unintended consequences of using the instrument

Generalizability         Assesses the extent to which the meaning and use associated with a set of scores can be generalized to other populations

Table 2: Interpretation of Quality of Evidence Arising from Onwuegbuzie et al.’s (in press) Meta-Validity Analysis of Student Teaching Evaluations

VALIDITY TYPE                  EVIDENCE

Criterion-Related:

Concurrent Validity                   Strong

Predictive Validity                     Strong

Content-Related:
 
Face Validity                            Adequate

Item Validity                             Inadequate

Sampling Validity                      Inadequate

Construct-Related:

Substantive Validity                  Inadequate

Structural Validity                     Inadequate

Convergent Validity                  Adequate

Discriminant Validity                 Inadequate

Divergent Validity                     Inadequate

Outcome Validity                      Weak
   
Generalizability                         Tentative
_______________________

Table 3
Fall 2003 Instructional Satisfaction Data for the University of North Florida’s College of Education by Department

                                                                       Curric./    Ed.    Counselor   Leadership   Leadership   Special   Stand.
SUPPL. ITEMS                                          Univ   College   Instruct.   Core   Ed.         (Masters)    (Doctoral)   Ed.       Dev.

1-Comm. effectively with all students                 4.3    4.5       4.5         4.4    4.5         4.7          4.6          4.6       0.12
2-Enthusiasm for course material                      4.5    4.7       4.8         4.6    4.7         4.8          4.9          4.8       0.13
3-Mastery of the course content                       4.6    4.7       4.7         4.6    4.9         4.8          4.9          4.9       0.14
4-Relates course material to current examples         4.4    4.6       4.6         4.6    4.7         4.7          4.8          4.7       0.12
5-Clearly explains complex concepts and ideas         4.2    4.4       4.4         4.3    4.5         4.6          4.6          4.6       0.16
6-Lectures organized/provide framework for learning   4.2    4.4       4.4         4.4    4.5         4.6          4.5          4.6       0.14
7-Course syllabus adequately described the course     4.4    4.5       4.5         4.5    4.6         4.7          4.7          4.6       0.11
8-Instruct. materials used effectively                4.3    4.5       4.5         4.4    4.5         4.6          4.6          4.7       0.12
9-Involves students in class activities               4.3    4.7       4.7         4.5    4.7         4.8          4.9          4.8       0.19
10-Uses class time well                               4.3    4.5       4.4         4.4    4.3         4.6          4.5          4.7       0.14
11-Fosters envir’t for critical thinking              4.3    4.5       4.5         4.4    4.7         4.7          4.8          4.7       0.18
12-Treats all students in consistent manner           4.4    4.6       4.5         4.5    4.6         4.8          4.8          4.6       0.14
13-Exams reflect the material covered                 4.3    4.6       4.5         4.5    4.6         4.6          4.7          4.7       0.13
14-Willingly assists students outside of class        4.3    4.3       4.6         4.5    4.6         4.7          4.7          4.6       0.13
15-I found this class to be challenging               4.2    4.3       4.3         4.2    4.6         4.4          4.5          4.5       0.15

CORE ITEMS

1-Description of course objectives and assignmts      4.1    4.3       4.2         4.2    4.3         4.5          4.5          4.4       0.15
2-Comm of ideas and information                       4.0    4.3       4.3         4.2    4.4         4.5          4.4          4.5       0.17
3-Expression of expect. for this class.               4.1    4.4       4.3         4.3    4.4         4.5          4.4          4.5       0.13
4-Availability to assist students in/out of class.    4.2    4.4       4.4         4.3    4.3         4.7          4.6          4.5       0.17
5-Respect and concern for students                    4.2    4.5       4.4         4.4    4.5         4.7          4.7          4.6       0.17
6-Stimulation of interest in course                   4.1    4.4       4.4         4.3    4.5         4.6          4.5          4.7       0.18
7-Facilitation of learning                            4.1    4.4       4.4         4.3    4.4         4.6          4.6          4.6       0.18
8-Overall rating of instructor                        4.2    4.4       4.4         4.3    4.3         4.7          4.5          4.5       0.14

MEANS                                                 4.3    4.5       4.5         4.4    4.5         4.6          4.6          4.6       0.15

References

Abbott, R. D., Wulff, D. H., Nyquist, J. D., Ropp, V., & Hess, C. W. (1990) Satisfaction with processes of collecting student opinions about instruction: The student perspective, Journal of Educational Psychology, 82, 201-206.

Adams, J. V. (1997) Student evaluations: The rating game, Inquiry, 1, 10-16.

Arreola, R. A. (1995) Developing a comprehensive faculty evaluation system (Bolton, MA: Anker).

Bothell, T. W., & Henderson, T. (2003, Winter) Do online ratings of instruction make $ense?, New Directions for Teaching and Learning, 96, 69-79.

Bullock, C. D. (2003) Online collection of midterm student feedback, New Directions for Teaching and Learning, 96, 95-102.

Centra, J. A. (1979) Determining faculty effectiveness (San Francisco, CA: Jossey-Bass).

Centra, J. A. (1993) Reflective faculty evaluation (San Francisco, CA: Jossey-Bass).

Centra, J. A. (1994) The use of the teaching portfolio and student evaluations for summative evaluation, Journal of Higher Education, 65, 555-570.

Cohen, P. A. (1980) Effectiveness of student-rating feedback for improving college instruction: A meta-analysis of findings, Research in Higher Education, 13, 321-341.

Dommeyer, C. J., Baum, P., Chapman, K. S., & Hanna, R. W. (2002) Attitudes of business faculty towards two methods of collecting teaching evaluations: Paper vs. online, Assessment & Evaluation in Higher Education, 27, 455-462.

Feldman, K. A. (1979) The significance of circumstances for college students’ ratings of their teachers and courses: A review and analysis, Research in Higher Education, 10, 149-172.

Feldman, K. A. (1988) Effective college teaching from the students’ and faculty’s view: Matched or mismatched priorities, Research in Higher Education, 28, 291-344.

Fernald, P. S. (1990) Students’ ratings of instruction: Standardized and customized, Teaching of Psychology, 17, 105-109.

Friedlander, J. (1978) Student perceptions on the effectiveness of midterm feedback to modify college instruction, Journal of Educational Research, 71, 140-143.

Gray, M. & Bergmann, B. R. (2003, September-October) Student teaching evaluations: Inaccurate, demeaning, misused, Academe, 89(5), 44-46.

Hoffman, K. M. (2003) Online course evaluation and reporting in higher education, New Directions for Teaching and Learning, 96, 25-29.

Johnson, T. D. & Ryan, K. E. (2003) A comprehensive approach to the evaluation of college teaching, New Directions in Higher Education, 83, 109-123.

Koon, J. & Murray, H. G. (1995) Using multiple outcomes to validate student ratings of overall teacher effectiveness, Journal of Higher Education, 66, 61-81.

Kulik, J. A. (2001) Student ratings: Validity, utility, and controversy, New Directions for Institutional Research, 109, 9-25.

Marlin, J. W. Jr. (1987) Student perception of end-of-course evaluations, Journal of Higher Education, 58, 704-716.

Marsh, H. W. (1984) Students’ evaluations of university teaching: Dimensionality, reliability, validity, potential biases, and utility, Journal of Educational Psychology, 76, 707-754.

Marsh, H. W. (1987) Students’ evaluations of university teaching: Research findings, methodological issues, and directions for future research, International Journal of Educational Research, 11, 253-388.

McKeachie, W. J. (1997) Student ratings: The validity of use, American Psychologist, 52, 1218-1225.

Messick, S. (1989) Validity, in: R. L. Linn (Ed) Educational measurement (3rd ed) (Old Tappan, N.J.: Macmillan), 13-103.

Messick, S. (1995) Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning, American Psychologist, 50, 741-749.

Miron, M. & Segal, E. (1986) Student opinion on the value of student evaluations, Higher Education, 14, 321-333.

Onwuegbuzie, A. J. & Daniel, L. G. (2002) A framework for reporting and interpreting internal consistency reliability estimates, Measurement and Evaluation in Counseling and Development, 35, 89-103.

Onwuegbuzie, A. J. & Daniel, L. G. (2003, February 12) Typology of analytical and interpretational errors in quantitative and qualitative educational research, Current Issues in Education [On-line], 6(2).  Available: http://cie.ed.asu.edu/volume6/number2/

Onwuegbuzie, A. J. & Daniel, L. G. (2004) Reliability generalization: The importance of considering sample specificity, confidence intervals, and subgroup differences, Research in the Schools, 11(1), 61-72.

Onwuegbuzie, A. J., Daniel, L. G., & Collins, K. M. T. (in press) A meta-validation model for assessing the score-validity of student teaching evaluations, Quality & Quantity: International Journal of Methodology.

Onwuegbuzie, A. J. & Teddlie, C. (2003) A framework for analyzing data in mixed methods research, in: A. Tashakkori & C. Teddlie (Eds) Handbook of mixed methods in social and behavioral research (Thousand Oaks, CA: Sage), 351-383.

Ory, J. C. (2000, Fall) Teaching evaluation: Past, present, and future, New Directions for Teaching and Learning, 83, 13-18.

Palmquist, J. & Stueve, A. (1996) Stay plugged into new opportunities, Marketing Research: A Magazine of Management and Applications, 8(1), 13-15.

Penny, A. R. (2003) Changing the agenda for research into students’ views about university teaching: Four shortcomings of SRT research, Teaching in Higher Education, 8(3), 399-411.

Ramsden, P. & Dodds, A. (1989) Improving teaching and courses: A guide to evaluation (Parkville, Melbourne: University of Melbourne, Centre for the Study of Higher Education).

Ramsden, P. & Martin, E. (1992) The Student Evaluation of Teaching Service (TEVAL) Report of a review commissioned by the Tertiary Education Institute. Unpublished manuscript.

Schmelkin, L. P., Spencer, K. J., & Gellman, E. S. (1997) Faculty perspectives on course and teacher evaluations, Research in Higher Education, 38(5), 575-592.

Seldin, P. (1993, July) The use and abuse of student ratings of professors, The Chronicle of Higher Education, 21, A40.

Simpson, R. D. (1995) Uses and misuses of student evaluations of teaching effectiveness, Innovative Higher Education, 20(1), 3-5.

Sorenson, D. L. & Reiner, C. (2003, Winter) Charting the uncharted seas of online student ratings of instruction, New Directions for Teaching and Learning, 96, 1-24.

Spencer, K. J., & Schmelkin, L. P. (2002) Student perspectives on teaching and its evaluation, Assessment in Higher Education, 27, 397-409.

Theall, M. & Franklin, J. (2000, Fall) Creating responsive student rating systems to improve evaluation practice, New Directions for Teaching and Learning, 83, 95-107.

Theall, M. & Franklin, J. (2001, Spring) Looking for bias in all the wrong places: A search for truth or a witch hunt in student ratings of instruction?, New Directions for Institutional Research, 109, 45-56.

University of North Florida. (2000) Faculty handbook (Jacksonville, FL: Author).

Wilkinson, L., & the Task Force on Statistical Inference. (1999) Statistical methods in psychology journals: Guidelines and explanations, American Psychologist, 54, 594-604. (Reprint available through the APA Home Page: http://www.apa.org/journals/amp/amp548594.html)

Williams, W. M. & Desley, A. (1997) Rethinking student evaluations and the improvement of teaching: Instruments for change at the University of Queensland, Studies in Higher Education, 22, 55-65.

Copyright © Academic Exchange - EXTRA