Blog Posts and News Stories

Getting Different Results from the Same Program in Different Contexts

The spring 2014 conference of the Society for Research on Educational Effectiveness (SREE) gave us much food for thought concerning the role of replication of experimental results in social science research. If two research teams get the same result from experiments on the same program, that gives us confidence that the original result was not a fluke or somehow biased.

But in his keynote, John Ioannidis of Stanford showed that even in medical research, where the context can be more tightly controlled, replication very often fails—researchers get different results. The original finding may have been biased, for example, through the tendency to suppress null findings where no positive effect was found and to over-report large but potentially spurious results. Replication of a result over the long run helps us get past these biases. Though not as glamorous as discovery, replication is fundamental to science, and educational science is no exception.

In the course of the conference, I was reminded that the challenge of conducting replication work is, in a sense, compounded in social science research. “Effect heterogeneity”—finding different results in different contexts—is common for many legitimate reasons. For instance, experimental controls seldom get placebos. They receive the program already in place, often referred to as “business as usual,” and this can vary across experiments on the same intervention and contribute to different results. Also, a program evaluated in different contexts is likely to be adapted to the demands or affordances of each situation, and flexible implementation may lead to different results. The challenge is to disentangle differences in effects that give insight into how programs are adapted in response to conditions from the kinds of bias in results that Ioannidis described. In other fields (e.g., the “hard sciences”), less context dependency and more-robust effects may make it easier to diagnose when variation in findings is illegitimate. In education, this is more challenging, and it reminds me why educational research is in many ways the ‘hardest science’ of all, as David Berliner has emphasized.

Once separated from distortions of bias and properly differentiated from the usual kind of “noise” or random error, differences in effects can actually be leveraged to better understand how and for whom programs work. Building systematic differences in conditions into our research designs can be revealing. Such efforts should, however, be considered with the role of replication in mind: an approach to research that purposively builds in heterogeneity, in a sense, seeks to find where impacts don’t replicate, but for good reason. Non-reproducibility in this case is not haphazard; it is purposive.

What are some approaches to leveraging and understanding effect heterogeneity? We envision randomized trials where heterogeneity is built into the design by comparing different versions of a program or implementing it in diverse settings across which program effects are hypothesized to vary. A planning phase of an RCT would allow discussions with experts and stakeholders about potential drivers of heterogeneity. Pertinent questions to address during this period include: What are the attributes of participants and settings across which we expect effects to vary, and why? Under which conditions, and how, do we expect program implementation to change? Hypothesizing which factors will moderate effects before the experiment is conducted would add credibility to the results if they corroborate the theory. A thoughtful approach of this sort can be contrasted with the usual approach, whereby differential effects of a program are explored as afterthoughts, with the results carrying little weight.

Building in conditions for understanding effect heterogeneity will have implications for experimental design. Increasing variation in outcomes affects statistical power and the sensitivity of designs to detect effects. We will need a better understanding of the parameters affecting precision of estimates. At Empirical, we have started using results from several of our experiments to explore parameters affecting sensitivity of tests for detecting differential impact. For example, we have been documenting the variation across schools in differences in performance depending on student characteristics such as individual SES, gender, and LEP status. This variation determines how precisely we are able to estimate the average difference between student subgroups in the impact of a program.
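
To make this concrete, here is a minimal Monte Carlo sketch of the kind of calculation involved. It is not code from any of our studies; the school counts, variance components, and effect sizes are illustrative assumptions. The point it demonstrates is that the more the subgroup gap varies from school to school, the less precisely a school-randomized design estimates the differential impact for that subgroup.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_interaction_se(gap_sd_between_schools, n_schools=40, n_per_school=60,
                            n_reps=2000, effect=0.20, differential_effect=0.10):
    """Monte Carlo standard error of the estimated treatment-by-subgroup
    interaction in a school-randomized design. All values are illustrative."""
    estimates = []
    for _ in range(n_reps):
        treat = rng.permutation(np.repeat([0, 1], n_schools // 2))       # school-level assignment
        school_effect = rng.normal(0.0, 0.25, n_schools)                  # random school intercepts
        school_gap = rng.normal(0.30, gap_sd_between_schools, n_schools)  # subgroup gap varies by school
        gap_by_school = np.empty(n_schools)
        for j in range(n_schools):
            sub = rng.integers(0, 2, n_per_school)                        # e.g., an LEP indicator
            y = (school_effect[j]
                 + school_gap[j] * sub
                 + treat[j] * (effect + differential_effect * sub)
                 + rng.normal(0.0, 1.0, n_per_school))                    # student-level noise, SD = 1
            gap_by_school[j] = y[sub == 1].mean() - y[sub == 0].mean()
        # difference between treatment and control schools in the subgroup gap
        est = gap_by_school[treat == 1].mean() - gap_by_school[treat == 0].mean()
        estimates.append(est)
    return np.std(estimates)

for gap_sd in (0.00, 0.15, 0.30):
    se = simulate_interaction_se(gap_sd)
    print(f"between-school SD of the subgroup gap = {gap_sd:.2f} -> "
          f"empirical SE of the differential impact = {se:.3f}")
```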

Some may feel that introducing heterogeneity to better understand conditions for observing program effects is going down a slippery slope. Their thinking is that it is better to focus on program impacts averaged across the study population and to replicate those effects across conditions; and that building sources of variation into the design may lead to loose interpretations and loss of rigor in design and analysis. We appreciate the cautionary element of this position. However, we believe that a systematic study of how a program interacts with conditions can be done in a disciplined way without giving up the usual strategies for ensuring the validity of results.

We are excited about the possibility that education research is entering a period of disciplined scientific inquiry to better understand how differences in students, contexts, and programs interact, with the hope that the resulting work will lead to greater opportunity and better fit of program solutions to individuals.

2014-05-21

The Value of Looking at Local Results

The report we released today has an interesting history that shows the value of looking beyond the initial results of an experiment. Later this week, we are presenting a paper at AERA entitled “In School Settings, Are All RCTs Exploratory?” The findings we report from our experiment with an iPad application were part of the inspiration for that paper. If Riverside Unified had not looked at its own data, we would not, in the normal course of data analysis, have broken the results out by individual district, and our conclusion would have been that there was no discernible impact of the app. We can cite many other cases where looking at subgroups leads us to conclusions different from the conclusion based on the result averaged across the whole sample. Our report on AMSTI is another case we will cite in our AERA paper.

We agree with the Institute of Education Sciences (IES) in taking a disciplined approach in requiring that researchers “call their shots” by naming the small number of outcomes considered most important in any experiment. All other questions are fine to look at but fall into the category of exploratory work. What we want to guard against, however, is the implication that answers to primary questions, which often are concerned with average impacts for the study sample as a whole, must apply to various subgroups within the sample, and therefore can be broadly generalized by practitioners, developers, and policy makers.

If we find an average impact but in exploratory analysis discover plausible, policy-relevant, and statistically strong differential effects for subgroups, then doubt is cast on the completeness of the confirmatory finding. We may not be certain of the moderator effect, but once it comes to light, the average impact can come to look incomplete or even misleading for practical purposes. If it is necessary to conduct an additional experiment to verify a differential subgroup impact, that same experiment may also verify that the average impact is not what practitioners, developers, and policy makers should be concerned with.

In our paper at AERA, we are proposing that any result from a school-based experiment should be treated as provisional by practitioners, developers, and policy makers. The results of RCTs can be very useful, but the challenges of generalizing the results from even the most stringently designed experiment mean that they should be considered the basis for a hypothesis that the intervention may work under similar conditions. For a developer considering how to improve an intervention, the specific conditions under which it appeared to work or not work are the critical information to have. For a school system decision maker, the most useful pieces of information are insight into the subpopulations that appear to benefit and the conditions that are favorable for implementation. For those concerned with educational policy, it is often the case that conditions and interventions change and develop more rapidly than research studies can be conducted. Using available evidence may mean digging through studies with confirmatory results obtained in contexts similar to or different from their own and examining exploratory analyses that provide useful hints as to the most productive steps to take next. The practitioner in this case is in a similar position to the researcher considering the design of the next experiment: both have to come to a hypothesis about how things work as the basis for action.

2012-04-01

Exploration in the World of Experimental Evaluation

Our 300+ page report makes a good start. But IES, faced with limited time and resources to complete the many experiments being conducted within the Regional Education Lab system, put strict limits on the number of exploratory analyses researchers could conduct. We usually think of exploratory work as following up on puzzling or unanticipated results. In the case of the REL experiments, however, IES asked researchers to focus on a narrow set of “confirmatory” results, and anything else was considered “exploratory,” even if the question was included in the original research design.

The strict IES criteria were based on the principle that when a researcher is using tests of statistical significance, the probability of erroneously concluding that there is an impact when there isn’t one increases with the number of tests. In our evaluation of AMSTI, we limited ourselves to only four such “confirmatory” (i.e., not exploratory) tests of statistical significance. These were used to assess whether there was an effect on student outcomes in math problem-solving and in science, and on the amount of time teachers spent on “active learning” practices in math and in science. (Technically, IES considered this two sets of two, since two were the primary student outcomes and two were the intermediate teacher outcomes.) The threshold for significance was made more stringent to keep the probability of falsely concluding that there was a difference for any of the outcomes at 5% (often expressed as p < .05).
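
As an illustration of the underlying logic (the report’s exact adjustment procedure is not reproduced here), a Bonferroni correction is one common way to hold the familywise error rate for a small set of confirmatory tests at 5%. The p-values below are made up for the example.

```python
# Hypothetical p-values for four confirmatory tests (illustrative numbers only);
# a Bonferroni correction is one common way to hold the familywise error rate at 5%.
# The actual adjustment used in the AMSTI report may differ.
ALPHA = 0.05
p_values = {
    "math problem-solving (students)": 0.012,
    "science (students)": 0.240,
    "active learning in math (teachers)": 0.030,
    "active learning in science (teachers)": 0.085,
}

bonferroni_threshold = ALPHA / len(p_values)   # 0.05 / 4 = 0.0125 per test
for outcome, p in p_values.items():
    flag = "significant" if p < bonferroni_threshold else "not significant"
    print(f"{outcome}: p = {p:.3f} -> {flag} at familywise alpha = {ALPHA}")
```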

While the logic for limiting the number of confirmatory outcomes is based on technical arguments about adjustments for multiple comparisons, the limit on the amount of exploratory work was based more on resource constraints. Researchers are notorious (and we don’t exempt ourselves) for finding more questions in any study than were originally asked. Curiosity-based exploration can indeed go on forever. In the case of our evaluation of AMSTI, however, there were a number of fundamental policy questions that were not answered either by the confirmatory or by the exploratory questions in our report. More research is needed.

Take the confirmatory finding that the program resulted in the equivalent of 28 days of additional math instruction (or technically an impact of 5% of a standard deviation). This is a testament to the hard work and ingenuity of the AMSTI team and the commitment of the school systems. From a state policy perspective, it gives a green light to continuing the initiative’s organic growth. But since all the schools in the experiment applied to join AMSTI, we don’t know what would happen if AMSTI were adopted as the state curriculum requiring schools with less interest to implement it. Our results do not generalize to that situation. Likewise, if another state with different levels of achievement or resources were to consider adopting it, we would say that our study gives good reason to try it but, to quote Lee Cronbach, a methodologist whose ideas increasingly resonate as we translate research into practice: “…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (Cronbach, 1975, p. 125).
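
For readers wondering how an effect size becomes “days of instruction,” here is a hedged illustration of one common conversion, not the report’s own calculation: divide the impact in standard deviation units by an assumed one-year growth rate in the same units, then multiply by the length of the school year. The growth rate and year length below are assumed benchmark values.

```python
# Assumed benchmark values, not figures taken from the AMSTI report,
# so the result is approximate.
effect_size = 0.05          # impact in student-level standard deviation units
annual_growth_sd = 0.32     # assumed typical one-year growth in SD units
school_year_days = 180      # assumed instructional days in a school year

extra_days = effect_size / annual_growth_sd * school_year_days
print(f"{effect_size} SD is roughly {extra_days:.0f} extra days of instruction "
      f"under these assumptions")   # ~28 days
```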

The explorations we conducted as part of the AMSTI evaluation did not take the usual form of deeper examinations of interesting or unexpected findings uncovered during the planned evaluation. All the reported explorations were questions posed in the original study plan. They were defined as exploratory either because they were considered of secondary interest, such as the outcome for reading, or because they were not a direct causal result of the randomization, such as the results for subgroups of students defined by different demographic categories. Nevertheless, exploration of such differences is important for understanding how and for whom AMSTI works. The overall effect, averaging across subgroups, may mask differences that are of critical importance for policy.

Readers interested in the issue of subgroup differences can refer to Table 6.11. Once differences are found in groups defined in terms of individual student characteristics, our real exploration is just beginning. For example, can the difference be accounted for by other characteristics or combinations of characteristics? Is there something that differentiates the classes or schools that different students attend? Such questions begin to probe additional factors that can potentially be addressed in the program or its implementation. In any case, the report just released is not the “final report.” There is still a lot of work necessary to understand how any program of this sort can continue to be improved.

2012-02-14

Empirical's Chief Scientist Co-Authored a Recently Released NCEE Reference Report

Together with researchers from Abt Associates, Andrew Jaciw, Chief Scientist of Empirical Education, co-authored a recently released report entitled “Estimating the Impacts of Educational Interventions Using State Tests or Study-Administered Tests.” The full report, released by the National Center for Education Evaluation and Regional Assistance (NCEE), can be found on the Institute of Education Sciences (IES) website. The NCEE Reference Report examines and identifies factors that could affect the precision of program evaluations when they are based on state assessments instead of study-administered tests. The authors found that using the same test for both the pre- and post-test yielded more precise impact estimates; that using two pre-test covariates, one from each type of test (state assessment and study-administered standardized test), further improved precision; and that using as the dependent variable the simple average of the post-test scores from the two types of tests yielded more precise impact estimates and smaller sample size requirements than using post-test scores from only one of them.
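
The following simulation is a simplified sketch of why the last two of those patterns can arise; it is not the study’s data or code, and the strength of the link between the latent achievement factor and each test is assumed for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 2000
ability = rng.normal(size=n)                         # latent achievement
treat = rng.integers(0, 2, n)                        # randomized indicator
pre_state = ability + rng.normal(0, 0.6, n)          # state pre-test
pre_study = ability + rng.normal(0, 0.6, n)          # study-administered pre-test
post_state = ability + 0.10 * treat + rng.normal(0, 0.6, n)
post_study = ability + 0.10 * treat + rng.normal(0, 0.6, n)
df = pd.DataFrame(dict(treat=treat, pre_state=pre_state, pre_study=pre_study,
                       post_state=post_state,
                       post_avg=(post_state + post_study) / 2))

specs = ["post_state ~ treat + pre_state",               # one pre-test covariate
         "post_state ~ treat + pre_state + pre_study",   # two pre-test covariates
         "post_avg ~ treat + pre_state + pre_study"]     # averaged post-tests
for formula in specs:
    fit = smf.ols(formula, data=df).fit()
    print(f"{formula:45s} SE of impact estimate = {fit.bse['treat']:.4f}")
```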

2011-11-02

New Reports Show Positive Results for Elementary Reading Program

Two studies of the Treasures reading program from McGraw-Hill are now posted on our reports page. Treasures is a basal reading program for students in grades K–6. The first study was a multi-site study, while the second was conducted in the Osceola school district; both found positive impacts on reading achievement in grades 3–5.

The primary data for the first study were MAP reading test scores supplied, with district permission, by the Northwest Evaluation Association. The study used a quasi-experimental comparison group design based on 35 Treasures and 48 comparison schools, primarily in the Midwest. The study found that Treasures had a positive impact on overall elementary student reading scores, with the strongest effect observed for grade 5.

The second study’s data were provided by the Osceola school district and consist of demographic information, FCAT test scores, and information on student transfers during the year (between schools within the district and from other districts). The dataset for this time series design covered five consecutive school years, from 2005–06 to 2009–10, including two years prior to the introduction of the intervention and three years after. The study included an exploration of moderators, which demonstrated a stronger positive effect for students with disabilities and English learners than for the rest of the student population. We also found a stronger positive impact on girls than on boys.
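
For readers curious about how a moderator analysis in a pre/post time-series design might be specified, here is a hedged sketch. It is not the model used in the report; the data are simulated, and the variable names and the assumed first post-introduction year are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 6000
df = pd.DataFrame({
    "year": rng.integers(2006, 2011, n),     # school years 2005-06 through 2009-10 (end year)
    "swd": rng.integers(0, 2, n),            # students with disabilities (hypothetical column)
    "ell": rng.integers(0, 2, n),            # English learners (hypothetical column)
    "female": rng.integers(0, 2, n),
    "school_id": rng.integers(0, 30, n),
})
df["post"] = (df["year"] >= 2008).astype(int)    # assumed first post-introduction year
df["trend"] = df["year"] - 2006                  # simple linear time trend
# simulated scores with larger post-period gains for SWD and ELL students and for girls
df["score"] = (0.02 * df["trend"] + 0.10 * df["post"]
               + 0.08 * df["post"] * df["swd"] + 0.08 * df["post"] * df["ell"]
               + 0.04 * df["post"] * df["female"] + rng.normal(0, 1, n))

model = smf.ols("score ~ trend + post * (swd + ell + female)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(model.params.filter(like="post:"))   # the moderated (differential) post-period effects
```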

Check back for results from follow-up studies, which are currently underway in other states and districts.

2011-09-21

Looking Back 35 Years to Learn about Local Experiments

With the growing interest among federal agencies in building local capacity for research, we took another look at an article by Lee Cronbach published in 1975. We found it has a lot to say about conducting local experiments and implications for generalizability. Cronbach worked for much of his career at Empirical’s neighbor, Stanford University, and his work has had a direct and indirect influence on our thinking. Some may interpret Cronbach’s work as stating that randomized trials of educational interventions have no value because of the complexity of interactions between subjects, contexts, and the experimental treatment. In any particular context, these interactions are infinitely complex, forming a “hall of mirrors” (as he famously put it, p. 119), making experimental results—which at most can address a small number of lower-order interactions—irrelevant. We don’t read it that way. Rather, we see powerful insights as well as cautions for conducting the kinds of field experiments that are beginning to show promise for providing educators with useful evidence.

We presented these ideas at the Society for Research on Educational Effectiveness conference in March, building the presentation around a set of memorable quotes from the 1975 article. Here we highlight some of the main ideas.

Quote #1: “When we give proper weight to local conditions, any generalization is a working hypothesis, not a conclusion…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (p. 125).

Practitioners are making decisions for their local jurisdiction. An experiment conducted elsewhere (including over many locales, with the results averaged) provides a useful starting point, but not “proof” that the program will or will not work in the same way locally. Experiments give us a working hypothesis concerning an effect, but it has to be tested against local conditions at the appropriate scale of implementation. This brings to mind California’s experience with class size reduction following the famous experiment in Tennessee, and how the working hypothesis corroborated through the experiment did not transfer to a different context. We also see the applicability of Cronbach’s ideas in the Investing in Innovation (i3) program, where initial evidence is being taken as a warrant to scale up interventions, but where the grants include funding for research under new conditions, in which implementation may head in unanticipated directions and lead to new effects.

Quote #2: “Instead of making generalization the ruling consideration in our research, I suggest that we reverse our priorities. An observer collecting data in one particular situation…will give attention to whatever variables were controlled, but he will give equally careful attention to uncontrolled conditions…. As results accumulate, a person who seeks understanding will do his best to trace how the uncontrolled factors could have caused local departures from the modal effect. That is, generalization comes late, and the exception is taken as seriously as the rule” (pp. 124-125).

Finding or even seeking out conditions that lead to variation in the treatment effect facilitates external validity, as we build an account of the variation. Such variation should not be seen as a threat to generalizability simply because the estimate of average impact is not robust across conditions. We should spend some time looking at the ways the intervention interacts with local characteristics, in order to determine which factors account for heterogeneity in the impact and which do not. Though this activity is exploratory and not necessarily anticipated in the design, it provides the basis for understanding how the treatment plays out, and why its effect may not be constant across settings. Over time, generalizations can emerge as we compile an account of the different ways in which the treatment is realized and the conditions that suppress or accentuate its effects.

Quote #3: “Generalizations decay” (p. 122).

In the social policy arena, and especially with the rapid development of technologies, we can’t expect interventions to stay constant. And we certainly can’t expect the contexts of implementation to be the same over many years. The call for quicker turn-around in our studies is therefore necessary, not just because decision-makers need to act, but because any finding may have a short shelf life.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116–127.

2011-03-21

Stimulating Innovation and Evidence

After a massive infusion of stimulus money into K-12 technology through the Title IID “Enhancing Education Through Technology” (EETT) grants, known also as “ed-tech” grants, the administration is planning to cut funding for the program in future budgets.

Well, they’re not exactly “cutting” funding for technology, but consolidating the dedicated technology funding stream into a larger enterprise, awkwardly named the “Effective Teaching and Learning for a Complete Education” program. For advocates of educational technology, here’s why this may not be so much a blow as a challenge and an opportunity.

Consider the approach stated in the White House “fact sheet”:

“The Department of Education funds dozens of programs that narrowly limit what states, districts, and schools can do with funds. Some of these programs have little evidence of success, while others are demonstrably failing to improve student achievement. The President’s Budget eliminates six discretionary programs and consolidates 38 K-12 programs into 11 new programs that emphasize using competition to allocate funds, giving communities more choices around activities, and using rigorous evidence to fund what works…Finally, the Budget dedicates funds for the rigorous evaluation of education programs so that we can scale up what works and eliminate what does not.”

From this, technology advocates might worry that policy is being guided by the findings of “no discernible impact” from a number of federally funded technology evaluations (including the evaluation mandated by the EETT legislation itself).

But this is not the case. The White House declares, “The President strongly believes that technology, when used creatively and effectively, can transform education and training in the same way that it has transformed the private sector.”

The administration is not moving away from the use of computers, electronic whiteboards, data systems, Internet connections, web resources, instructional software, and so on in education. Rather, the intention is that these tools are integrated, where appropriate and effective, into all of the other programs.

This does put technology funding on a very different footing. It is no longer in its own category. When school administrators consider funding from the “Effective Teaching and Learning for a Complete Education” program, they may place a technology option up against an approach to lowering class size, a professional development program, or other innovations that may integrate technologies as a small piece of an overall intervention. Districts would no longer write proposals to EETT to obtain financial support to invest in technology solutions. Technology vendors will increasingly be competing for the attention of school district decision-makers on the basis of the comparative effectiveness of their solutions—not just in comparison to other technologies but in comparison to other innovative solutions. The administration has clearly signaled that innovative and effective technologies will be looked upon favorably. It has also signaled that effectiveness is the key criterion.

As an Empirical Education team prepares for a visit to Washington DC for the conferences of the Consortium for School Networking and the Software and Information Industry Association’s EdTech Government Forum (we are active members of both organizations), we have to consider our message to education technology vendors and school system technology advocates. (Coincidentally, we will also be presenting research at the annual conference of the Society for Research on Educational Effectiveness, held in DC that same week.) As a research company, we are constrained from taking an advocacy role—in principle, we have to maintain that the effectiveness of any intervention is an empirical question. But we do see the infusion of short-term stimulus funding into educational technology through the EETT program as an opportunity for schools and publishers. Working jointly to gather evidence from the technologies put in place this year and next will put schools and publishers in a strong position to advocate for continued investment in the technologies that prove effective.

While it may have seemed so in 1993, when the U.S. Department of Education’s Office of Educational Technology was first established, technology can no longer be considered inherently innovative. The proposed federal budget is asking educators and developers to innovate to find effective technology applications. The stimulus package is giving the short-term impetus to get the evidence in place.

2010-02-14

Presentation at the Society for Research on Educational Effectiveness (SREE) Explores Methods for Studying Achievement Gaps

Frequently in Empirical Education’s experimental evaluations for school districts, the question of local concern is an achievement gap identified between two student groups. The analysis of these experiments also often finds significant differences between these subgroups in how effective the intervention was (that is, whether it increased or decreased the gap) while not finding a significant overall difference. In his 2005 book, Howard Bloom suggested why there may be more statistical power to detect subgroup differences than to detect the average effect. The exploration presented at SREE, held in Washington, DC, March 1–3, examined the statistical characteristics of eight experiments conducted over the last three years to find out whether a critical assumption of Bloom’s approach held: that the average performance gap does not vary across the units that are randomized. The work, led by Andrew P. Jaciw, Empirical Education’s Director of Experimental Design and Analysis, found that the assumption held. This finding is important because it suggests that local experiments focusing on achievement gaps may be less expensive than experiments addressing only the overall average effect of an intervention.
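
A back-of-the-envelope sketch of Bloom’s argument is given below. The parameter values are illustrative assumptions, not figures from the poster: if the subgroup gap is constant across the randomized schools, the school-level variance cancels out of the gap contrast, so the differential impact can be estimated more precisely than the average impact.

```python
# Illustrative parameter values only; a sketch of why a constant subgroup gap
# across randomized schools yields more precision for the gap difference than
# for the average effect.
from math import sqrt

J = 20            # schools randomized (half treatment, half control) -- assumed
n = 100           # students per school, split evenly between two subgroups -- assumed
tau2 = 0.20       # between-school variance of mean outcomes (ICC = 0.20)
sigma2 = 0.80     # student-level variance
omega2 = 0.0      # between-school variance of the gap; 0 under Bloom's assumption

# SE of the average treatment effect (school-randomized, equal split of schools)
se_avg = sqrt((tau2 + sigma2 / n) * 4 / J)

# SE of the treatment-by-subgroup interaction: school effects cancel within the
# school-level gap, so only gap variance and student-level noise remain
se_gap_diff = sqrt((omega2 + 4 * sigma2 / n) * 4 / J)

print(f"SE of average impact      = {se_avg:.3f}")
print(f"SE of differential impact = {se_gap_diff:.3f}  (smaller -> more power)")
```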

Bloom, H. S. (2005). Randomizing groups to evaluate place-based programs. In H. S. Bloom (Ed.), Learning More From Social Experiments. New York, NY: Russell Sage Foundation.

2009-03-01