blog posts and news stories

Getting Different Results from the Same Program in Different Contexts

The spring 2014 conference of the Society for Research on Educational Effectiveness (SREE) gave us much food for thought concerning the role of replication of experimental results in social science research. If two research teams get the same result from experiments on the same program, that gives us confidence that the original result was not a fluke or somehow biased.

But in his keynote, John Ioannidis of Stanford showed that even in medical research, where the context can be more tightly controlled, replication very often fails: researchers get different results. The original finding may have been biased, for example, through the tendency to suppress null findings where no positive effect was found and to over-report large but potentially spurious results. Replication of a result over the long run helps us to get past the biases. Though not as glamorous as discovery, replication is fundamental to science, and educational science is no exception.

In the course of the conference, I was reminded that the challenge to conducting replication work is, in a sense, compounded in social science research. “Effect heterogeneity”—finding different results in different contexts—is common for many legitimate reasons. For instance, experimental controls seldom get placebos. They receive the program already in place, often referred to as “business as usual,” and this can vary across experiments of the same intervention and contribute to different results. Also, experiments of the same program carried out in different contexts are likely to be adapted given demands or affordances of the situation, and flexible implementation may lead to different results. The challenge is to disentangle differences in effects that give insight into how programs are adapted in response to conditions, from bias in results that John Ioannidis considered. In other fields (e.g., the “hard sciences”), less context dependency and more-robust effects may make it easier to diagnose when variation in findings is illegitimate. In education, this may be more challenging and reminds me why educational research is in many ways the ‘hardest science’ of all, as David Berliner has emphasized in the past.

Once separated from distortions of bias and properly differentiated from the usual kind of “noise” or random error, differences in effects can actually be leveraged to better understand how and for whom programs work. Building systematic differences in conditions into our research designs can be revealing. Such efforts should, however, be considered with the role of replication in mind: an approach to research that purposively builds in heterogeneity seeks, in a sense, to find where impacts don’t replicate, but for good reason. Non-reproducibility in this case is not haphazard; it is purposive.

What are some approaches to leveraging and understanding effect heterogeneity? We envision randomized trials where heterogeneity is built into the design by comparing different versions of a program or implementing in diverse settings across which program effects are hypothesized to vary. A planning phase of an RCT would allow discussions with experts and stakeholders about potential drivers of heterogeneity. Pertinent questions to address during this period include: What are the attributes of participants and settings across which we expect effects to vary, and why? Under which conditions, and how, do we expect program implementation to change? Hypothesizing which factors will moderate effects before the experiment is conducted would add credibility to results if they corroborate the theory. A thoughtful approach of this sort can be contrasted with the usual approach whereby differential effects of a program are explored as afterthoughts, with the results carrying little weight.

Building in conditions for understanding effect heterogeneity will have implications for experimental design. Increasing variation in outcomes affects statistical power and the sensitivity of designs to detect effects. We will need a better understanding of the parameters affecting precision of estimates. At Empirical, we have started using results from several of our experiments to explore parameters affecting sensitivity of tests for detecting differential impact. For example, we have been documenting the variation across schools in differences in performance depending on student characteristics such as individual SES, gender, and LEP status. This variation determines how precisely we are able to estimate the average difference between student subgroups in the impact of a program.
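
To make the link between cross-school variation and precision concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the analysis used in our studies: schools are randomized in equal numbers to program and control, every school has the same number of students split evenly between two subgroups, and no covariates are used. The function name and all parameter values are hypothetical.

```python
import math


def se_differential_impact(n_schools, n_per_school, tau2_gap, sigma2):
    """Approximate standard error of the differential impact.

    The differential impact is the average within-school subgroup gap in
    program schools minus the same average in control schools.

    Assumptions (illustrative, not from any specific study):
      - half of the schools are randomized to each condition,
      - each school has n_per_school students, half in each subgroup,
      - tau2_gap is the variance across schools in the subgroup gap,
      - sigma2 is the student-level outcome variance within subgroup and school.
    """
    # Variance of one school's estimated gap: true between-school variation
    # plus sampling error of a difference between two subgroup means.
    var_school_gap = tau2_gap + 4.0 * sigma2 / n_per_school
    # Difference between two averages, each taken over n_schools / 2 schools.
    var_difference = 4.0 * var_school_gap / n_schools
    return math.sqrt(var_difference)


if __name__ == "__main__":
    # Hypothetical design: 40 schools, 100 students each, outcome SD of 1.
    for tau2_gap in (0.0, 0.05, 0.10):
        se = se_differential_impact(n_schools=40, n_per_school=100,
                                    tau2_gap=tau2_gap, sigma2=1.0)
        print(f"tau2_gap={tau2_gap:.2f}  SE={se:.3f}")
```

Under these assumptions, variation in the subgroup gap across schools sets a floor on the standard error that adding students within schools cannot remove; only adding schools lowers it.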

Some may feel that introducing heterogeneity to better understand conditions for observing program effects is going down a slippery slope. Their thinking is that it is better to focus on program impacts averaged across the study population and to replicate those effects across conditions; and that building sources of variation into the design may lead to loose interpretations and loss of rigor in design and analysis. We appreciate the cautionary element of this position. However, we believe that a systematic study of how a program interacts with conditions can be done in a disciplined way without giving up the usual strategies for ensuring the validity of results.

We are excited about the possibility that education research is entering a period of disciplined scientific inquiry to better understand how differences in students, contexts, and programs interact, with the hope that the resulting work will lead to greater opportunity and better fit of program solutions to individuals.

2014-05-21

Importance is Important for Rules of Evidence Proposed for ED Grant Programs

The U.S. Department of Education recently proposed new rules for including serious evaluations as part of its grant programs. The approach is modeled on how evaluations are used in the Investing in Innovation (i3) program where the proposal must show there’s some evidence that the proposed innovation has a chance of working and scaling and must include an evaluation that will add to a growing body of evidence about the innovation. We like this approach because it treats previous research as a hypothesis that the innovation may work in the new context. And each new grant is an opportunity to try the innovation in a new context, with improved approaches that warrant another check on effectiveness. But the proposed rules definitely had some weaknesses that were pointed out in the public comments available online. We hope ED heeds these suggestions.

Mark Schneiderman, representing the Software and Information Industry Association (SIIA), recommends that outcomes used in effectiveness studies should not be limited to achievement scores.

SIIA notes that grant program resources could appropriately address a range of purposes from instructional to administrative, from assessment to professional development, and from data warehousing to systems productivity. The measures could therefore include such outcomes as student test scores, teacher retention rates, changes in classroom practice or efficiency, availability and use of data or other student/teacher/school outcomes, and cost effectiveness and efficiency that can be observed and measured. Many of these outcome measures can also be viewed as intermediate outcomes—changes in practice that, as demonstrated by other research, are likely to affect other final outcomes.

He also points out that quality of implementation and the nature of the comparison group can be the deciding factors in whether or not a program is found to be effective.

SIIA notes that in education there is seldom a pure control condition such as can be achieved in a medical trial with a placebo or sugar pill. Evaluations of education products and services resemble comparative effectiveness trials in which a new medication is tested against a currently approved one to determine whether it is significantly better. The same product may therefore prove effective in one district that currently has a weak program but relatively less effective in another where a strong program is in place. As a result, significant effects can often be difficult to discern.

This point gets to the heart of the contextual issues in any experimental evaluation. Without understanding the local conditions of the experiment, the size of the impact for any other context cannot be anticipated. Some experimentalists would argue that a massive multi-site trial would allow averaging across many contextual variations. But such “on average” results won’t necessarily help the decision-maker working in specific local conditions. Thus, taking previous results as a rough indication that an innovation is worth trying is the first step before conducting the grant-funded evaluation of a new variation of the innovation under new conditions.

Jon Baron, writing for the Coalition for Evidence-Based Policy, expresses a fundamental concern about what counts as evidence. Jon, who is a former Chair of the National Board for Education Sciences and has been a prominent advocate for basing policy on rigorous research, suggests that

“the definition of ‘strong evidence of effectiveness’ in §77.1 incorporate the Investing in Innovation Fund’s (i3) requirement for effects that are ‘substantial and important’ and not just statistically significant.”

He cites examples where researchers have reported statistically significant results, which were based on trivial outcomes or had impacts so small as to have no practical value. Including “substantial and important” as additional criteria also captures the SIIA’s point that it is not sufficient to consider the internal validity of the study—policy makers must consider whether the measure used is an important one or whether the treatment-control contrast allows for detecting a substantial impact.
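
The gap between statistical significance and substantive importance is easy to see with a small calculation. The sketch below is illustrative only: the effect size, sample sizes, and the 0.05 threshold are assumed values, and the test is a simple two-sample z-test on a standardized outcome rather than any analysis cited in the public comments.

```python
import math


def two_sided_p_from_effect(effect_size_sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal groups.

    Assumes a standardized outcome (SD = 1 in each group) and a sample large
    enough for the normal approximation; both are simplifying assumptions.
    """
    standard_error = math.sqrt(2.0 / n_per_group)
    z = effect_size_sd / standard_error
    # P(|Z| > |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical study: a tiny effect measured on a very large sample.
    p_value = two_sided_p_from_effect(effect_size_sd=0.02, n_per_group=50_000)
    print(f"effect = 0.02 SD, p = {p_value:.4f}")
```

With these assumed numbers the result clears the conventional 0.05 threshold even though an effect of 0.02 standard deviations would rarely matter in practice; the same effect with 500 students per group would not be significant.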

Addressing the substance and importance of the results gets us appropriately into questions of external validity, and leads us to questions about subgroup impact, where, for example, an innovation has a positive impact “on average” and works well for high scoring students but provides no value for low scoring students. We would argue that a positive average impact is not the most important part of the picture if the end result is an increase in a policy-relevant achievement gap. Should ED be providing grants for innovations where there has been a substantial indication that a gap is worsened? Probably yes, but only if the proposed development is aimed at fixing the malfunctioning innovation and if the program evaluation can address this differential impact.
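
A positive average can hide a widening gap, as a little arithmetic shows. The numbers below are invented for illustration: two equally sized subgroups, with impacts expressed in standard deviation units. The function name is hypothetical.

```python
def summarize_impacts(impact_high, impact_low, share_high=0.5):
    """Return (average impact, change in the high-minus-low gap).

    impact_high and impact_low are hypothetical subgroup impacts;
    share_high is the assumed fraction of students in the high-scoring group.
    """
    average = share_high * impact_high + (1.0 - share_high) * impact_low
    gap_change = impact_high - impact_low
    return average, gap_change


if __name__ == "__main__":
    average, gap_change = summarize_impacts(impact_high=0.30, impact_low=0.0)
    print(f"average impact: {average:+.2f} SD")
    print(f"change in achievement gap: {gap_change:+.2f} SD")
```

In this made-up case the program looks beneficial on average while the gap between high and low scoring students grows by the full size of the high scorers' gain.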

2013-03-17

Join Empirical Education at ALAS, AEA, and NSDC

This year, the Association of Latino Administrators & Superintendents (ALAS) will be holding its 8th annual summit on Hispanic Education in San Francisco. Participants will have the opportunity to attend speaker sessions and roundtable discussions and to network with fellow attendees. Denis Newman, CEO of Empirical Education, together with John Sipe, Senior Vice President and National Sales Manager at Houghton Mifflin Harcourt, and Jeannetta Mitchell, eighth-grade teacher at Presidio Middle School and a participant in the pilot study, will take part in a 30-minute discussion reviewing the study design and experiences gathered around a one-year study of Algebra on the iPad. The session takes place on October 13th in Salon 8 of the Marriott Marquis in San Francisco from 10:30am to 12:00pm.

Also this year, the American Evaluation Association (AEA) will be hosting its 25th annual conference from November 2–5 in Anaheim, CA. Approximately 2,500 evaluation practitioners, academics, and students from around the globe are expected to gather at the conference. This year’s theme revolves around the challenges of values and valuing in evaluation.

We are excited to be part of AEA again this year and would like to invite you to join us at two presentations. First, Denis Newman will be hosting the roundtable session on Returning to the Causal Explanatory Tradition: Lessons for Increasing the External Validity of Results from Randomized Trials. We examine how the causal explanatory tradition—originating in the writing of Lee Cronbach—can inform the planning, conduct and analysis of randomized trials to increase external validity of findings. Find us in the Balboa A/B room on Friday, November 4th from 10:45am to 11:30am.

Second, Valeriy Lazarev and Denis Newman will present a paper entitled, “From Program Effect to Cost Savings: Valuing the Benefits of Educational Innovation Using Vertically Scaled Test Scores And Instructional Expenditure Data.”

Be sure to stop by on Saturday, November 5th from 9:50am to 11:20am in room Avila A.

Furthermore, Jenna Zacamy, Senior Research Manager at Empirical Education, will be presenting on two topics at the National Staff Development Council (NSDC) annual conference taking place in Anaheim, CA from December 3rd to 7th. Join her on Monday, December 5th from 2:30pm to 4:30pm, when she will talk about the impact of the Alabama Math, Science, and Technology Initiative on student achievement in grades 4 through 8, together with Pamela Finney and Jean Scott from SERVE Center at UNCG.

On Tuesday, December 6th from 10:00am to 12:00pm, Jenna will discuss prior and current research on the effectiveness of a large-scale high school literacy reform together with Cathleen Kral from WestEd and William Loyd from Washtenaw Intermediate School District.

2011-10-10

Conference Season has Arrived

Springtime marks the start of “conference season” and Empirical Education has been busy attending and preparing for the various meetings and events. We are participating in five conferences (CoSN, SIIA, SREE, NCES-MIS, and AERA) and we hope to see some familiar faces in our travels. If you will be attending any of the following meetings, please give us a call. We’d love to schedule a time to speak with you.

CoSN

The Empirical team headed to the 2010 Consortium for School Networking conference in Washington, DC at the Omni Shoreham Hotel from February 28 to March 3, 2010. We were joined by Eric Lehew, Executive Director of Learning Support Services at Poway Unified School District, who co-presented with us a poster titled “Turning Existing Data into Research” (Monday, March 1 from 1:00pm to 2:00pm). As an exhibitor, Empirical Education also hosted a 15-minute vendor demonstration entitled “Building Local Capacity: Using Your Own Data Systems to Easily Measure Program Effectiveness” to launch our MeasureResults tool.

SIIA

The Software & Information Industry Association held its 2010 Ed Tech Government Forum in Washington, DC on March 3–4. The focus this year was on Education Funding & Programs in a (Post) Stimulus World, and speakers included Secretary of Education Arne Duncan and West Virginia Superintendent of Schools Steven Paine.

SREE

Just as the SIIA Forum came to a close, the Society for Research on Educational Effectiveness held their annual conference—Research Into Practice—March 4-6 where our chief scientist, Andrew Jaciw, and research scientist, Xiaohui Zheng, presented their poster on estimating long-term program impacts when the control group joins treatment in the short-term. Dr. Jaciw was also named on a paper presentation with Rob Olsen of Abt Associates.

Thursday March 4, 2010
3:30pm–5:00pm: Session 2
2E. Research Methodology
Examining State Assessments
Forum
Chair: Jane Hannaway, The Urban Institute
Using State Or Study-Administered Achievement Tests in Impact Evaluations
Rob Olsen and Fatih Unlu, Abt Associates, and Andrew Jaciw, Empirical Education

Friday March 5, 2010
5:00pm–7:00pm: Poster Session
Poster Session: Research Methodology
Estimating Long-Term Program Impacts When the Control Group Joins Treatment in the Short-Term: A Theoretical and Empirical Study of the Tradeoffs Between Extra- and Quasi-Experimental Estimates
Andrew Jaciw, Boya Ma, and Qingfeng Zhao, Empirical Education

NCES-MIS

The 23rd Annual Management Information Systems (MIS) Conference was held in Phoenix, Arizona March 3-5. Co-sponsored by the Arizona Department of Education and the U.S. Department of Education’s National Center for Education Statistics (NCES), the MIS Conference brings together the people who work with information collection, management, transmittal, and reporting in school districts and state education agencies. The majority of the sessions focused on data use, data standards, statewide data systems, and data quality. For more information, refer to the program highlights.

AERA

We will have a strong showing at the American Educational Research Association annual conference in Denver, Colorado from Friday, April 30 through Tuesday, May 4. Please come talk to us at our poster and paper sessions. View our AERA presentation schedule to find out which of our presentations you would like to attend. And we hope to see you at our customary stylish reception Sunday evening, May 2 from 6 to 8:30—mark your calendars!

IES

We will be presenting at the IES Research Conference in National Harbor, MD from June 28-30. View our poster here.

2010-03-12

Final Report on “Local Experiments” Project

Empirical Education released the final report of a project that has developed a unique perspective on how school systems can use scientific evidence. Representing more than three years of research and development effort, our report describes the startup of six randomized experiments and traces how local agencies decided to undertake the studies and how the resulting information was used. The project was funded by a grant from the Institute of Education Sciences under their program on Education Policy, Finance, and Systems. It started with a straightforward conjecture:

The combination of readily available student data and the greater pressure on school systems to improve productivity through the use of scientific evidence of program effectiveness could lead to a reduction in the cost of rigorous program evaluations and to a rapid increase in the number of such studies conducted internally by school districts.

The prevailing view of scientifically based research is that educators are consumers of research conducted by professionals. There is also a belief that rigorous research is extraordinarily expensive. The supposition behind our proposal was that the cost could be made low enough to allow experiments to be conducted routinely to support district decisions, with local educators as the producers of evidence. The project contributed a number of methodological, analytic, and reporting approaches with potential to lower costs and make rigorous program evaluation more accessible to district researchers. An important result of the work was bringing to light the differences between conventional research design aimed at broadly generalized conclusions and design aimed at answering a local question, where sampling is restricted to the relevant “unit of decision making,” such as a school district with jurisdiction over decisions about instructional or professional development programs.

The final report concludes with an understanding of research use at the central office level, whether “data-driven” or “evidence-based” decision making, as a process of moving through stages. Looking for descriptive patterns in the data (i.e., data mining for questions of interest) will precede statistical analysis of differences between and associations among variables of interest using appropriate methods such as HLM. These in turn will precede the adoption of an experimental research design to isolate causal, moderator, and mediator effects. It is proposed that most districts are not yet prepared to produce and use experimental evidence but would be able to start with useful descriptive exploration of data leading to needs assessment as a first step in a more proactive use of evaluation to inform their decisions.

For a copy of the report, please choose the Toward School Districts Conducting Their Own Rigorous Program Evaluation paper from our reports and papers webpage.

2008-10-01