blog posts and news stories

Can We Measure the Measures of Teaching Effectiveness?

Teacher evaluation has become the hot topic in education. State and local agencies are quickly implementing new programs spurred by federal initiatives and evidence that teacher effectiveness is a major contributor to student growth. The Chicago teachers’ strike brought out the deep divisions over the issue of evaluations. There, the focus was on the use of student achievement gains, or value-added. But the other side of evaluation—systematic classroom observations by administrators—is also raising interest. Teaching is a very complex skill, and the development of frameworks for describing and measuring its interlocking elements is an area of active and pressing research. The movement toward using observations as part of teacher evaluation is not without controversy. A recent OpEd in Education Week by Mike Schmoker criticizes the rapid implementation of what he considers overly complex evaluation templates “without any solid evidence that it promotes better teaching.”

There are researchers engaged in the careful study of evaluation systems, including the combination of value-added and observations. The Bill and Melinda Gates Foundation has funded a large team of researchers through its Measures of Effective Teaching (MET) project, which has already produced an array of reports for both academic and practitioner audiences (with more to come). But research can be ponderous, especially when the question is whether such systems can impact teacher effectiveness. A year ago, the Institute of Education Sciences (IES) awarded an $18 million contract to AIR to conduct a randomized experiment to measure the impact of a teacher and leader evaluation system on student achievement, classroom practices, and teacher and principal mobility. The experiment is scheduled to start this school year and results will likely start appearing by 2015. However, at the current rate of implementation by education agencies, most programs will be in full swing by then.

Empirical Education is currently involved in teacher evaluation through Observation Engine: our web-based tool that helps administrators make more reliable observations. See our story about our work with Tulsa Public Schools. This tool, along with our R&D on protocol validation, was initiated as part of the MET project. In our view, the complexity and time-consuming aspects of many of the observation systems that Schmoker criticizes arise from their intended use as supports for professional development. The initial motivation for developing observation frameworks was to provide better feedback and professional development for teachers. Their complexity is driven by the goal of providing detailed, specific feedback. Such systems can become cumbersome when applied to the goal of providing a single score for every teacher representing teaching quality that can be used administratively, for example, for personnel decisions. We suspect that a more streamlined and less labor-intensive evaluation approach could be used to identify the teachers in need of coaching and professional development. That subset of teachers would then receive the more resource-intensive evaluation and training services such as complex, detailed scales, interviews, and coaching sessions.

The other question Schmoker raises is: do these evaluation systems promote better teaching? While waiting for the IES study to be reported, some things can be done. First, look at correlations of the components of the observation rubrics with other measures of teaching such as value-added to student achievement (VAM) scores or student surveys. The idea is to see whether the behaviors valued and promoted by the rubrics are associated with improved achievement. The videos and data collected by the MET project are the basis for tools to do this (see earlier story on our Validation Engine.) But school systems can conduct the same analysis using their own student and teacher data. Second, use quasi-experimental methods to look at the changes in achievement related to the system’s local implementation of evaluation systems. In both cases, many school systems are already collecting very detailed data that can be used to test the validity and effectiveness of their locally adopted approaches.

2012-10-31

Oklahoma Implements Empirical’s Observation Engine for Certification of Classroom Observers

Tulsa Public Schools, the Cooperative Council for Oklahoma Administration, and Empirical Education Inc. just announced the launch of Observation Engine to implement the Teacher and Leader Effectiveness program in the state of Oklahoma. Tulsa Public Schools has purchased Empirical Education’s Observation Engine, an online certification and calibration tool for measuring the reliability of administrators assigned to conduct classroom observations. Tulsa Public Schools developed the Tulsa Model for Observation and Evaluation, a framework for ensuring teaching effectiveness performance, as well as best practices for creating an environment for successful learning and student achievement. Nearly 500 school districts in the state are piloting the Tulsa Model evaluation system this year.

In order to support the dissemination of the Tulsa Model, the Cooperative Council for Oklahoma Administration (CCOSA) is training and administering calibration tests throughout the state to assess and certify the individuals who evaluate the state’s teachers. The Tulsa Model is embedded in Observation Engine to deliver an efficient online system for state-wide use by Oklahoma certified classroom observers. Observation Engine is allowing CCOSA to test approximately 2,000 observers over a span of two weeks.

Observation Engine was developed as part of The Bill and Melinda Gates Foundation’s Measures of Effective Teaching project in which Empirical Education has participated as a research partner conducting R&D on validity and reliability of observational measures. The web-based software was built by Empirical Education, which hosts and supports it for school systems nationwide.

For more details on these events, see the press announcement and our case study.

2012-10-10

Empirical Releases Final Report on HMH Fuse™ iPad App

Today Empirical and Houghton Mifflin Harcourt made the following announcement. You can download the report and research summary from the study using the links below.
Fuse research report
Fuse research summary

Study Shows HMH Fuse™ iPad® App Can Dramatically Improve Student Achievement

Strong implementation in Riverside Unified School District associated with nine-point increase in percentile standing

BOSTON – April 10, 2012 – A study of HMH Fuse: Algebra 1 app released today by research firm Empirical Education Inc. identifies implementation as a key factor in the success of mobile technology. The 2010–2011 study was a pilot of a new educational app from global education leader Houghton Mifflin Harcourt (HMH) that re-imagines the conventional textbook to fully deploy interactive features of the mobile device. The HMH Fuse platform encourages the use of personalized lesson plans by combining direct instruction, ongoing support, assessment and intervention in one easy-to-use suite of tools.

Empirical found that the iPad-using students in the four participating districts: Long Beach, Fresno, San Francisco and Riverside Unified School District (Riverside Unified), performed on average as well as their peers using the traditional textbook. However, after examining its own results, Riverside Unified found an increase in test scores among students taught with HMH Fuse compared to their peers. Empirical corroborated these results, finding a statistically significant impact equivalent to a nine-point percentile increase. The Riverside Unified teachers also reported substantially greater usage of the HMH Fuse app both in teaching and by the students in class.

“Education technology does not operate in a vacuum, and the research findings reinforce that with a supportive school culture and strategic implementation, technology can have a significant impact on student achievement,” said Linda Zecher, President and CEO of HMH. “We’re encouraged by the results of the study and the potential of mobile learning to accelerate student achievement and deepen understanding in difficult to teach subjects like algebra.”

Across all districts, the study found a positive effect on student attitudes toward math, and those students with positive attitudes toward math achieved higher scores on the California Standards Test.

The research design was a “gold standard” randomized control trial that examined the performance of eighth-grade students during the 2010-2011 school year. Each teacher’s classes were randomly assigned to either the treatment group that used the HMH Fuse app or the control group that used the conventional print format of the same content.

“The rapid pace of mobile technology’s introduction into K-12 education leaves many educators with important questions about its efficacy especially given their own resources and experience,” said Denis Newman, CEO of Empirical Education. “The results from Riverside highlight the importance of future research on mobile technologies that account for differences in teacher experience and implementation.”

To access the full research report, go to www.empiricaleducation.com. A white paper detailing the implementation and impact of HMH Fuse in Riverside is available on the HMH website.

2012-04-10

The Value of Looking at Local Results

The report we released today has an interesting history that shows the value of looking beyond the initial results of an experiment. Later this week, we are presenting a paper at AERA entitled “In School Settings, Are All RCTs Exploratory?” The findings we report from our experiment with an iPad application were part of the inspiration for this. If Riverside Unified had not looked at its own data, we would not, in the normal course of data analysis, have broken the results out by individual districts, and our conclusion would have been that there was no discernible impact of the app. We can cite many other cases where looking at subgroups leads us to conclusions different from the conclusion based on the result averaged across the whole sample. Our report on AMSTI is another case we will cite in our AERA paper.

We agree with the Institute of Education Sciences (IES) in taking a disciplined approach in requiring that researchers “call their shots” by naming the small number of outcomes considered most important in any experiment. All other questions are fine to look at but fall into the category of exploratory work. What we want to guard against, however, is the implication that answers to primary questions, which often are concerned with average impacts for the study sample as a whole, must apply to various subgroups within the sample, and therefore can be broadly generalized by practitioners, developers, and policy makers.

If we find an average impact but in exploratory analysis discover plausible, policy-relevant, and statistically strong differential effects for subgroups, then some doubt about completeness may be cast on the value of the confirmatory finding. We may not be certain of a moderator effect—for example—but once it comes to light, the value of the average impact can also be considered incomplete or misleading for practical purposes. If it is necessary to conduct an additional experiment to verify a differential subgroup impact, the same experiment may verify that the average impact is not what practitioners, developers, and policy makers should be concerned with.

In our paper at AERA, we are proposing that any result from a school-based experiment should be treated as provisional by practitioners, developers, and policy makers. The results of RCTs can be very useful, but the challenges of generalizability of the results from even the most stringently designed experiment mean that the results should be considered the basis for a hypothesis that the intervention may work under similar conditions. For a developer considering how to improve an intervention, the specific conditions under which it appeared to work or not work is the critical information to have. For a school system decision maker, the most useful pieces of information are insight into subpopulations that appear to benefit and conditions that are favorable for implementation. For those concerned with educational policy, it is often the case that conditions and interventions change and develop more rapidly than research studies can be conducted. Using available evidence may mean digging through studies that have confirmatory results in contexts similar or different from their own and examining exploratory analyses that provide useful hints as to the most productive steps to take next. The practitioner in this case is in a similar position to the researcher considering the design of the next experiment. The practitioner also has to come to a hypothesis about how things work as the basis for action.

2012-04-01

Study of Alabama STEM Initiative Finds Positive Impacts

On February 21, 2012 the U.S. Department of Education released the final report of an experiment that Empirical Education has been working on for the last six years. The report, titled Evaluation of the Effectiveness of the Alabama Math, Science, and Technology Initiative (AMSTI) is now available on the Institute of Education Sciences website. The Alabama State Department of Education held a press conference to announce the findings, attended by Superintendent of Education Bice, staff of AMSTI, along with educators, students, and co-principal investigator of the study, Denis Newman, CEO of Empirical Education.

AMSTI was developed by the state of Alabama and introduced in 2002 with the goal of improving mathematics and science achievement in the state’s K-12 schools. Empirical Education was primarily responsible for conducting the study—including the design, data collection, analysis, and reporting—under its subcontract with the Regional Education Lab, Southeast (the study was initiated through a research grant to Empirical). Researchers from Academy of Education Development, Abt Associates, and ANALYTICA made important contributions to design, analysis and data collection.

The findings show that after one year, students in the 41 AMSTI schools experienced an impact on mathematics achievement equivalent to 28 days of additional student progress over students receiving conventional mathematics instruction. The study found, after one year, no difference for science achievement. It also found that AMSTI had an impact on teachers’ active learning classroom practices in math and science that, according to the theory of action posited by AMSTI, should have an impact on achievement. Further exploratory analysis found effects for student achievement in both mathematics and science after two years. The study also explored reading achievement, where it found significant differences between the AMSTI and control groups after one year. Exploration of differential effect for student demographic categories found consistent results for gender, socio-economic status, and pretest achievement level for math and science. For reading, however, the breakdown by student ethnicity suggests a differential benefit.

Just about everybody at Empirical worked on this project at one point or another. Besides the three of us (Newman, Jaciw and Zacamy) who are listed among the authors, we want to acknowledge past and current employees whose efforts made the project possible: Jessica Cabalo, Ruthie Chang, Zach Chin, Huan Cung, Dan Ho, Akiko Lipton, Boya Ma, Robin Means, Gloria Miller, Bob Smith, Laurel Sterling, Qingfeng Zhao, Xiaohui Zheng, and Margit Zsolnay.

With solid cooperation of the state’s Department of Education and the AMSTI team, approximately 780 teachers and 30,000 upper-elementary and middle school students in 82 schools from five regions in Alabama participated in the study. The schools were randomized into one of two categories: 1) Those who received AMSTI starting the first year, or 2) Those who received “business as usual” the first year and began participation in AMSTI the second year. With only a one-year delay before the control group entered treatment, the two-year impact was estimated using statistical techniques developed by, and with the assistance of our colleagues at Abt Associates. Academy for Education Development assisted with data collection and analysis of training and program implementation.

Findings of the AMSTI study will also be presented at the Society for Research on Educational Effectiveness (SREE) Spring Conference taking place in Washington D.C. from March 8-10, 2012. Join Denis Newman, Andrew Jaciw, and Boya Ma on Friday March 9, 2012 from 3:00pm-4:30pm, when they will present findings of their study titled, “Locating Differential Effectiveness of a STEM Initiative through Exploration of Moderators.” A symposium on the study, including the major study collaborators, will be presented at the annual conference of the American Educational Research Association (AERA) on April 15, 2012 from 2:15pm-3:45pm at the Marriott Pinnacle ⁄ Pinnacle III in Vancouver, Canada. This session will be chaired by Ludy van Broekhuizen (director of REL-SE) and will include presentations by Steve Ricks (director of AMSTI); Jean Scott (SERVE Center at UNCG); Denis Newman, Andrew Jaciw, Boya Ma, and Jenna Zacamy (Empirical Education); Steve Bell (Abt Associates); and Laura Gould (formerly of AED). Sean Reardon (Stanford) will serve as the discussant. A synopsis of the study will also be included in the Common Guidelines for Education Research and Development.

2012-02-21

Empirical is participating in recently awarded five-year REL contracts

The Institute of Education Sciences (IES) at the U.S. Department of Education recently announced the recipients of five-year contracts for each of the 10 Regional Education Laboratories (RELs). We are excited to be part of four strong teams of practitioners and researchers that received the awards.

The original request for proposals in May 2011 called for the new RELs to work closely with alliances of state and local education agencies and other practitioner organizations to build local capacity for research. Considering the close ties between this agenda and Empirical’s core mission we joined the proposal efforts and are now part of winning teams in the West (led by WestEd), Northwest (led by Education Northwest), Midwest (led by the American Institutes for Research (AIR)), and Southwest (led by SEDL) The REL Southwest is currently under a stop work order while ED addresses a dispute concerning its review process. Empirical Education’s history in conducting Randomized Control Trials (RCTs) and in providing technical assistance to education agencies provides a strong foundation for the next five years.

2012-02-16

Exploration in the World of Experimental Evaluation

Our 300+ page report makes a good start. But IES, faced with limited time and resources to complete the many experiments being conducted within the Regional Education Lab system, put strict limits on the number of exploratory analyses researchers could conduct. We usually think of exploratory work as questions to follow up on puzzling or unanticipated results. However, in the case of the REL experiments, IES asked researchers to focus on a narrow set of “confirmatory” results and anything else was considered “exploratory,” even if the question was included in the original research design.

The strict IES criteria were based on the principle that when a researcher is using tests of statistical significance, the probability of erroneously concluding that there is an impact when there isn’t one increases with the frequency of the tests. In our evaluation of AMSTI, we limited ourselves to only four such “confirmatory” (i.e., not exploratory) tests of statistical significance. These were used to assess whether there was an effect on student outcomes for math problem-solving and for science, and the amount of time teachers spent on “active learning” practices in math and in science. (Technically, IES considered this two sets of two, since two were the primary student outcomes and two were the intermediate teacher outcomes.) The threshold for significance was made more stringent to keep the probability of falsely concluding that there was a difference for any of the outcomes at 5% (often expressed as p < .05).

While the logic for limiting the number of confirmatory outcomes is based on technical arguments about adjustments for multiple comparisons, the limit on the amount of exploratory work was based more on resource constraints. Researchers are notorious (and we don’t exempt ourselves) for finding more questions in any study than were originally asked. Curiosity-based exploration can indeed go on forever. In the case of our evaluation of AMSTI, however, there were a number of fundamental policy questions that were not answered either by the confirmatory or by the exploratory questions in our report. More research is needed.

Take the confirmatory finding that the program resulted in the equivalent of 28 days of additional math instruction (or technically an impact of 5% of a standard deviation). This is a testament to the hard work and ingenuity of the AMSTI team and the commitment of the school systems. From a state policy perspective, it gives a green light to continuing the initiative’s organic growth. But since all the schools in the experiment applied to join AMSTI, we don’t know what would happen if AMSTI were adopted as the state curriculum requiring schools with less interest to implement it. Our results do not generalize to that situation. Likewise, if another state with different levels of achievement or resources were to consider adopting it, we would say that our study gives good reason to try it but, to quote Lee Cronbach, a methodologist whose ideas increasingly resonate as we translate research into practice: “…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (Cronbach, 1975, p. 125).

The explorations we conducted as part of the AMSTI evaluation did not take the usual form of deeper examinations of interesting or unexpected findings uncovered during the planned evaluation. All the reported explorations were questions posed in the original study plan. They were defined as exploratory either because they were considered of secondary interest, such as the outcome for reading, or because they were not a direct causal result of the randomization, such as the results for subgroups of students defined by different demographic categories. Nevertheless, exploration of such differences is important for understanding how and for whom AMSTI works. The overall effect, averaging across subgroups, may mask differences that are of critical importance for policy

Readers interested in the issue of subgroup differences can refer to Table 6.11. Once differences are found in groups defined in terms of individual student characteristics, our real exploration is just beginning. For example, can the difference be accounted for by other characteristics or combinations of characteristics? Is there something that differentiates the classes or schools that different students attend? Such questions begin to probe additional factors that can potentially be addressed in the program or its implementation. In any case, the report just released is not the “final report.” There is still a lot of work necessary to understand how any program of this sort can continue to be improved.

2012-02-14
Archive