Empirical Education Inc.

Revisiting The Relationship Between Internal and External Validity

The relationship between internal and external validity has been debated over the last few decades.

At the core of the debate is the question of whether causal validity comes before generalizability. To oversimplify this a bit, it is a question of whether knowing “what works” is logically prior to the question of what works “for whom and under what conditions.”

Some may consider the issue settled. I don’t count myself among them.

I think it is extremely important to revisit this question in the contemporary context, in which discussions are centering on issues of diversity of people and places, and the situatedness of programs and their effects.

In this blog I provide a new perspective on the issue, one that I hope rekindles the debate, and leads to productive new directions for research. (It builds on presentations at APPAM and SREE.)

I have organized the content into three degrees of depth. 1. For those interested in a perusal, I have addressed the main issues through a friendly dialogue presented below. 2. For those who want a deeper dive, I provide a video of a PowerPoint in which I take you through the steps of the argument. 3. The associated paper, Hold the Bets! Do Quasi-and True Experimental Evaluations Yield Equally Valid Impact Results When Effect Generalization is the Goal?, is currently posted as a preprint on SAGE Advance, and is under review by a journal.

Lastly, I would really value your comments to any of these works, to keep the conversation, and the progress in and beneficence from research going. Enjoy (and I hope to hear from you!),

Andrew Jaciw

The Great Place In-Between for Researchers and Evaluators

The impact evaluator is at an interesting crossroads between research and evaluation. There is an accompanying tension, but one that provides fodder for new ideas.

The perception of doing research, especially generalizable scientific research, is that it contributes information about the order of things, and about the relations among parts of systems in nature and society, that leads to cumulative and lasting knowledge.

Program evaluation is not quite the same. It addresses immediate needs, seldom has the luxury of time, and is meant to provide direction for critical stakeholders. It is governed by Program Evaluation Standards, of which Accuracy (including internal and statistical conclusions validity) is just one of many standards, with equal concern for Propriety and Stakeholder Representation.

The activities of the researcher and the evaluator may be seen as complementary, and the results of each can serve evaluative and scientific purposes.

The “impact evaluator” finds herself in a good place where the interests of the researcher-evaluator and evaluator-researcher overlap. This zone is a place where productive paradoxes emerge.

Here is an example from this zone. It takes the form of a friendly dialogue between an Evaluator-Researcher (ER) and a Researcher-Evaluator (RE).

ER: Being quizzical about the problem of external validity, I have proposed a novel method for answering the question of “what works”, or, more correctly of “what may work” in my context. It assumes a program has not yet been tried at my site of interest (the inference sample), and it involves comparing the performance across one or more sites where the program has been used, to performance at my site. The goal is to infer the impact for my site.

RE: Hold-on. So that’s kind of like a comparison group design but in reverse. You’re starting with an untreated group and comparing it to a treated group to draw an inference about potential impact for the untreated group. Right?

ER: Yes.

RE: But that does not make sense. That’s not the usual starting point. In research we start with the treated group and look for a valid control, not the other way around. I am confused.

ER: I understand, but when I was teaching, such comparisons were natural. For example, we compared the performance of a school just like ours, but that used Success For All (SFA), to performance at our school, which did not use SFA, to infer how we might have performed had we used the program. That is, to generalize the potential effect of the program for our site.

RE: You mean to predict impact for your site.

ER: Call it what you will. I prefer generalize because I am using information about performance under assignment to treatment from somewhere else.

RE: Hmmm. Odd, but OK (for now). However, why would you do that? Why not use an experimental result from somewhere else, maybe with some adjustment for differences in student composition and other things? You know, using reweighting methods, to produce a reasonable inference about potential impact for your site.

ER: I could, but that information would be coming from somewhere else where there are a lot of unknown variables about how that site operates, and I am not sure the local decision-makers would buy it. Coming from elsewhere it would be considered less-relevant.

RE: But your comparison also uses information from somewhere else. You’re using performance outcomes from somewhere else (where the treatment was implemented) to infer how your local site would have performed had the treatment been used there.

ER: Yes, but I am also preserving the true outcome in the absence of treatment (the ‘business as usual’ control outcome) for my site. I have half the true solution for my site. You’re asking me to get all my information from somewhere else.

RE: Yes, but I know the experimental result is unbiased from selection into conditions at the other “comparison” site, because of the randomized and uncompromised design. I‘ll take that over your “flipped” comparison group design any day!

ER: But your result may be biased from selection into sites, reflecting imbalance on known and possibly unknown moderators of impact. You’re talking about an experiment over there, and I have half the true solution over here, where I need it.

RE: I’ll take internal validity over there, first, and then worry about external validity to your site. Remember, internal validity is the “sine qua non”. Without it, you don’t have anything. Your approach seems deficient on two counts: first from lack of internal validity (you’re not using an experiment), and second from a lack of external validity (you’re drawing a comparison with somewhere else).

ER: OK, now you’re getting to the meat of things. Here is my bold conjecture: yes, internal and external validity bias both may be at play, but sometimes they may cancel each other out.

RE: What!? Like a chancy fluky kind of thing?

ER: No, systematically, and in principle.

RE: I don’t believe it. Two wrongs (biases) don’t make a right.

ER: But the product of two negatives makes a positive.

RE: I need something concrete to show what you mean.

ER: OK, here is an instance… The left vertical bar is the average impact for my site (site N). The right vertical bar is the average impact for the remote site (site M). The short horizontal bars show the values of Y (the outcome) for each site. (The black ones show values we can observe, the white-filled one shows an unobserved value [i.e., I don’t observe performance at my site (N) when treatment is provided, so the bar is empty.]) Bias1 is the difference between the other site and my site in the average impact (the difference in length of the vertical bars). Bias2 results from a comparison between sites in their average performance in the absence of treatment.

A figure showing the difference between performance in the presence of treatment at one location, and performance in the absence of treatment at the other location, which is the inference site.

The point that matters here is that using the impact from the other site M (the length of the vertical line at M) to infer impact for my site N, leads to a result that is biased by an amount equal to the difference between the length of the vertical bars (Bias 1). But if I use the main approach that I am talking about, and compare performance under treatment at the remote site “M” (black bar at the top of Site M site) to the performance at my site without treatment (black bar at the bottom of Site N) the total bias is (Bias1 – Bias2), and the magnitude of this “net bias” is less than Bias1 by itself.

RE: Well, you have not figured-in the sampling error.

ER: Correct. We can do that, but for now let’s consider that we’re working with true values.

RE: OK, let’s say for the moment I accept what you’re saying. What does it do to the order and logic that internal validity precedes external validity?

ER: That is the question. What does it do? It seems that when generalizability is a concern, internal and external validity should be considered concurrently. Internal validity is the sole concern only when external validity is not at issue. You might say internal validity wins the race, but only when it’s the only runner.

RE: You’re going down a philosophical wormhole. That can be dangerous.

ER: Alright, then let’s stop here (for now).

RE and ER walk happily down the conference hall to the bar where RE has a double Jack, neat, and ER tries the house red.

BTW, here is the full argument and mathematical demonstration of the idea. Please share on social and tag us (our social handles are in the footer below). We’d love to know your thoughts. A.J.

2023-09-20

Posted by: Andrew Jaciw

Tags: evaluation, evaluator, generalizability, research, researcher and validity

Getting Different Results from the Same Program in Different Contexts

The spring 2014 conference of the Society for Research in Educational Effectiveness (SREE) gave us much food for thought concerning the role of replication of experimental results in social science research. If two research teams get the same result from experiments on the same program, that gives us confidence that the original result was not a fluke or somehow biased.

But in his keynote, John Ioannidis of Stanford showed that even in medical research, where the context can be more tightly controlled, replication very often fails—researchers get different results. The original finding may have been biased, for example, through the tendency to suppress null findings where no positive effect was found and over-report large, but potentially spurious results. Replication of a result over the long run helps us to get past the biases. Though not as glamorous as discovery, replication is fundamental to science, and educational science is no exception.

In the course of the conference, I was reminded that the challenge to conducting replication work is, in a sense, compounded in social science research. “Effect heterogeneity”—finding different results in different contexts—is common for many legitimate reasons. For instance, experimental controls seldom get placebos. They receive the program already in place, often referred to as “business as usual,” and this can vary across experiments of the same intervention and contribute to different results. Also, experiments of the same program carried out in different contexts are likely to be adapted given demands or affordances of the situation, and flexible implementation may lead to different results. The challenge is to disentangle differences in effects that give insight into how programs are adapted in response to conditions, from bias in results that John Ioannidis considered. In other fields (e.g., the “hard sciences”), less context dependency and more-robust effects may make it easier to diagnose when variation in findings is illegitimate. In education, this may be more challenging and reminds me why educational research is in many ways the ‘hardest science’ of all, as David Berliner has emphasized in the past.

Once separated from distortions of bias and properly differentiated from the usual kind of “noise” or random error, differences in effects can actually be leveraged to better understand how and for whom programs work. Building systematic differences in conditions into our research designs can be revealing. Such efforts should, however, be considered with the role of replication in mind—an approach to research that purposively builds in heterogeneity, in a sense, seeks to find where impacts don’t replicate, but for good reason. Non-reproducibility in this case is not haphazard, it is purposive.

What are some approaches to leveraging and understanding effect heterogeneity? We envision randomized trials where heterogeneity is built into the design by comparing different versions of a program or implementing in diverse settings across which program effects are hypothesized to vary. A planning phase of an RCT would allow discussions with experts and stakeholders about potential drivers of heterogeneity. Pertinent questions to address during this period include: what are the attributes of participants and settings across which we expect effects to vary and why? Under which conditions and how do we expect program implementation to change? Hypothesizing which factors will moderate effects before the experiment is conducted would add credibility to results if they corroborate the theory. A thoughtful approach of this sort can be contrasted with the usual approach whereby differential effects of program are explored as afterthoughts, with the results carrying little weight.

Building in conditions for understanding effect heterogeneity will have implications for experimental design. Increasing variation in outcomes affects statistical power and the sensitivity of designs to detect effects. We will need a better understanding of the parameters affecting precision of estimates. At Empirical, we have started using results from several of our experiments to explore parameters affecting sensitivity of tests for detecting differential impact. For example, we have been documenting the variation across schools in differences in performance depending on student characteristics such as individual SES, gender, and LEP status. This variation determines how precisely we are able to estimate the average difference between student subgroups in the impact of a program.

Some may feel that introducing heterogeneity to better understand conditions for observing program effects is going down a slippery slope. Their thinking is that it is better to focus on program impacts averaged across the study population and to replicate those effects across conditions; and that building sources of variation into the design may lead to loose interpretations and loss of rigor in design and analysis. We appreciate the cautionary element of this position. However, we believe that a systematic study of how a program interacts with conditions can be done in a disciplined way without giving up the usual strategies for ensuring the validity of results.

We are excited about the possibility that education research is entering a period of disciplined scientific inquiry to better understand how differences in students, contexts, and programs interact, with the hope that the resulting work will lead to greater opportunity and better fit of program solutions to individuals.

2014-05-21

Posted by: Andrew Jaciw

Tags: conference, differential impact, experiment, heterogeneity, intervention, program effectiveness, RCT, replication, research, rigor, SREE and validity

Can We Measure the Measures of Teaching Effectiveness?

Teacher evaluation has become the hot topic in education. State and local agencies are quickly implementing new programs spurred by federal initiatives and evidence that teacher effectiveness is a major contributor to student growth. The Chicago teachers’ strike brought out the deep divisions over the issue of evaluations. There, the focus was on the use of student achievement gains, or value-added. But the other side of evaluation—systematic classroom observations by administrators—is also raising interest. Teaching is a very complex skill, and the development of frameworks for describing and measuring its interlocking elements is an area of active and pressing research. The movement toward using observations as part of teacher evaluation is not without controversy. A recent OpEd in Education Week by Mike Schmoker criticizes the rapid implementation of what he considers overly complex evaluation templates “without any solid evidence that it promotes better teaching.”

There are researchers engaged in the careful study of evaluation systems, including the combination of value-added and observations. The Bill and Melinda Gates Foundation has funded a large team of researchers through its Measures of Effective Teaching (MET) project, which has already produced an array of reports for both academic and practitioner audiences (with more to come). But research can be ponderous, especially when the question is whether such systems can impact teacher effectiveness. A year ago, the Institute of Education Sciences (IES) awarded an $18 million contract to AIR to conduct a randomized experiment to measure the impact of a teacher and leader evaluation system on student achievement, classroom practices, and teacher and principal mobility. The experiment is scheduled to start this school year and results will likely start appearing by 2015. However, at the current rate of implementation by education agencies, most programs will be in full swing by then.

Empirical Education is currently involved in teacher evaluation through Observation Engine: our web-based tool that helps administrators make more reliable observations. See our story about our work with Tulsa Public Schools. This tool, along with our R&D on protocol validation, was initiated as part of the MET project. In our view, the complexity and time-consuming aspects of many of the observation systems that Schmoker criticizes arise from their intended use as supports for professional development. The initial motivation for developing observation frameworks was to provide better feedback and professional development for teachers. Their complexity is driven by the goal of providing detailed, specific feedback. Such systems can become cumbersome when applied to the goal of providing a single score for every teacher representing teaching quality that can be used administratively, for example, for personnel decisions. We suspect that a more streamlined and less labor-intensive evaluation approach could be used to identify the teachers in need of coaching and professional development. That subset of teachers would then receive the more resource-intensive evaluation and training services such as complex, detailed scales, interviews, and coaching sessions.

The other question Schmoker raises is: do these evaluation systems promote better teaching? While waiting for the IES study to be reported, some things can be done. First, look at correlations of the components of the observation rubrics with other measures of teaching such as value-added to student achievement (VAM) scores or student surveys. The idea is to see whether the behaviors valued and promoted by the rubrics are associated with improved achievement. The videos and data collected by the MET project are the basis for tools to do this (see earlier story on our Validation Engine.) But school systems can conduct the same analysis using their own student and teacher data. Second, use quasi-experimental methods to look at the changes in achievement related to the system’s local implementation of evaluation systems. In both cases, many school systems are already collecting very detailed data that can be used to test the validity and effectiveness of their locally adopted approaches.

2012-10-31

Posted by: Denis Newman

Tags: achievement, education, evaluation systems, evidence, framework, LEA, leader, observation, observation engine, PD, randomization, research, SEA, student, teacher, teacher effectiveness, teacher evaluation, validity and value-added

Need for Product Evaluations Continues to Grow

There is a growing need for evidence of the effectiveness of products and services being sold to schools. A new release of SIIA’s product evaluation guidelines is now available at the Selling to Schools website (with continued free access to SIIA members), to help guide publishers in measuring the effectiveness of the tools they are selling to schools.

It’s been almost a decade since NCLB made its call for “scientifically-based research,” but the calls for research haven’t faded away. This is because resources available to schools have diminished over that time, heightening the importance of cost benefit trade-offs in spending.

NCLB has focused attention on test score achievement, and this metric is becoming more pervasive; e.g., through a tie to teacher evaluation and through linkages to dropout risk. While NCLB fostered a compliance mentality—product specs had to have a check mark next to SBR—the need to assure that funds are not wasted is now leading to a greater interest in research results. Decision-makers are now very interested in whether specific products will be effective, or how well they have been working, in their districts.

Fortunately, the data available for evaluations of all kinds is getting better and easier to access. The U.S. Department of Education has poured hundreds of millions of dollars into state data systems. These investments make data available to states and drive the cleaning and standardizing of data from districts. At the same time, districts continue to invest in data systems and warehouses. While still not a trivial task, the ability of school district researchers to get the data needed to determine if an investment paid off—in terms of increased student achievement or attendance—has become much easier over the last decade.

The reauthorization of ESEA (i.e., NCLB) is maintaining the pressure to evaluate education products. We are still a long way from the draft reauthorization introduced in Congress becoming a law, but the initial indications are quite favorable to the continued production of product effectiveness evidence. The language has changed somewhat. Look for the phrase “evidence based”. Along with the term “scientifically-valid”, this new language is actually more sophisticated and potentially more effective than the old SBR neologism. Bob Slavin, one of the reviewers of the SIIA guidelines, says in his Ed Week blog that “This is not the squishy ‘based on scientifically-based evidence’ of NCLB. This is the real McCoy.” It is notable that the definition of “evidence-based” goes beyond just setting rules for the design of research, such as the SBR focus on the single dimension of “internal validity” for which randomization gets the top rating. It now asks how generalizable the research is or its “external validity”; i.e., does it have any relevance for decision-makers?

One of the important goals of the SIIA guidelines for product effectiveness research is to improve the credibility of publisher-sponsored research. It is important that educators see it as more than just “market research” producing biased results. In this era of reduced budgets, schools need to have tangible evidence of the value of products they buy. By following the SIIA’s guidelines, publishers will find it easier to achieve that credibility.

2011-11-12

Posted by: Denis Newman

Tags: credibility, data systems, effectiveness, ESEA, evaluation, evidence, evidence based, guidelines, measuring, NCLB, research, research, researcher, scientifically-based, scientifically-valid, SIIA, teacher, teacher evaluation and validity

Join Empirical Education at ALAS, AEA, and NSDC

This year, the Association of Latino Administrators & Superintendents (ALAS) will be holding its 8th annual summit on Hispanic Education in San Francisco. Participants will have the opportunity to attend speaker sessions, roundtable discussions, and network with fellow attendees. Denis Newman, CEO of Empirical Education, together with John Sipe, Senior Vice President and National Sales Manager at Houghton Mifflin Harcourt and Jeannetta Mitchell, eight-grade teacher at Presidio Middle school and a participant in the pilot study, will take part in a 30-minute discussion reviewing the study design and experiences gathered around a one-year study of Algebra on the iPad. The session takes place on October 13th at the Salon 8 of the Marriott Marquis in San Francisco from 10:30am to 12:00pm.

Also this year, the American Evaluation Association (AEA) will be hosting its 25th annual conference from November 2–5 in Anaheim, CA. Approximately 2,500 evaluation practitioners, academics, and students from around the globe are expected to gather at the conference. This year’s theme revolves around the challenges of values and valuing in evaluation.

We are excited to be part of AEA again this year and would like to invite you to join us at two presentations. First, Denis Newman will be hosting the roundtable session on Returning to the Causal Explanatory Tradition: Lessons for Increasing the External Validity of Results from Randomized Trials. We examine how the causal explanatory tradition—originating in the writing of Lee Cronbach—can inform the planning, conduct and analysis of randomized trials to increase external validity of findings. Find us in the Balboa A/B room on Friday, November 4th from 10:45am to 11:30am.

Second, Valeriy Lazarev and Denis Newman will present a paper entitled, “From Program Effect to Cost Savings: Valuing the Benefits of Educational Innovation Using Vertically Scaled Test Scores And Instructional Expenditure Data.”

Be sure to stop by on Saturday, November 5th from 9:50am to 11:20am in room Avila A.

Furthermore, Jenna Zacamy, Senior Research Manager at Empirical Education, will be presenting on two topics at the National Staff Development Council (NSDC) annual conference taking place in Anaheim, CA from December 3rd to 7th. Join her on Monday, December 5th at 2:30pm to 4:30pm when she will talk about the impact on student achievement for grades 4 through 8 of the Alabama Math, Science, and Technology Initiative, together with Pamela Finney and Jean Scott from SERVE Center at UNCG.

On Tuesday, December 6th at 10:00am to 12:00pm Jenna will discuss prior and current research on the effectiveness of a large-scale high school literacy reform together with Cathleen Kral from WestEd and William Loyd from Washtenaw Intermediate School District.

2011-10-10

Posted by: Robin Means

Tags: AEA, ALAS, algebra, AMSTI, conference, cost-benefit, education, evaluation, iPad, NSDC, pilot, presentation, program effectiveness, randomization, RCT, research and validity

Looking Back 35 Years to Learn about Local Experiments

With the growing interest among federal agencies in building local capacity for research, we took another look at an article by Lee Cronbach published in 1975. We found it has a lot to say about conducting local experiments and implications for generalizability. Cronbach worked for much of his career at Empirical’s neighbor, Stanford University, and his work has had a direct and indirect influence on our thinking. Some may interpret Cronbach’s work as stating that randomized trials of educational interventions have no value because of the complexity of interactions between subjects, contexts, and the experimental treatment. In any particular context, these interactions are infinitely complex, forming a “hall of mirrors” (as he famously put it, p. 119), making experimental results—which at most can address a small number of lower-order interactions—irrelevant. We don’t read it that way. Rather, we see powerful insights as well as cautions for conducting the kinds of field experiments that are beginning to show promise for providing educators with useful evidence.

We presented these ideas at the Society for Research in Educational Effectiveness conference in March, building the presentation around a set of memorable quotes from the 1975 article. Here we highlight some of the main ideas.

Quote #1: “When we give proper weight to local conditions, any generalization is a working hypothesis, not a conclusion…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (p. 125).

Practitioners are making decisions for their local jurisdiction. An experiment conducted elsewhere (including over many locales, where the results are averaged) provides a useful starting point, but not “proof” that it will or will not work in the same way locally. Experiments give us a working hypothesis concerning an effect, but it has to be tested against local conditions at the appropriate scale of implementation. This brings to mind California’s experience with class size reduction following the famous experiment in Tennessee, and how the working hypothesis corroborated through the experiment did not transfer to a different context. We also see applicability of Cronbach’s ideas in the Investing in Innovation (i3) program, where initial evidence is being taken as a warrant to scale-up intervention, but where the grants included funding for research under new conditions where implementation may head in unanticipated directions, leading to new effects.

Quote #2: “Instead of making generalization the ruling consideration in our research, I suggest that we reverse our priorities. An observer collecting data in one particular situation…will give attention to whatever variables were controlled, but he will give equally careful attention to uncontrolled conditions…. As results accumulate, a person who seeks understanding will do his best to trace how the uncontrolled factors could have caused local departures from the modal effect. That is, generalization comes late, and the exception is taken as seriously as the rule” (pp. 124-125).

Finding or even seeking out conditions that lead to variation in the treatment effect facilitates external validity, as we build an account of the variation. This should not be seen as a threat to generalizability because an estimate of average impact is not robust across conditions. We should spend some time looking at the ways that the intervention interacts differently with local characteristics, in order to determine which factors account for heterogeneity in the impact and which ones do not. Though this activity is exploratory and not necessarily anticipated in the design, it provides the basis for understanding how the treatment plays out, and why its effect may not be constant across settings. Over time, generalizations can emerge, as we compile an account of the different ways in which the treatment is realized and the conditions that suppress or accentuate its effects.

Quote #3: “Generalizations decay” (p. 122).

In the social policy arena, and especially with the rapid development of technologies, we can’t expect interventions to stay constant. And we certainly can’t expect the contexts of implementation to be the same over many years. The call for quicker turn-around in our studies is therefore necessary, not just because decision-makers need to act, but because any finding may have a short shelf life.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 116-127.

2011-03-21

Posted by: Andrew Jaciw

Tags: conference, Cronbach, education, evidence, experiment, experiment, exploratory, generalizability, heterogeneity, implementation, intervention, modal effect, policy, RCT, SREE, Stanford and validity

blog posts and news stories

Revisiting The Relationship Between Internal and External Validity

The Great Place In-Between for Researchers and Evaluators

2023-09-20

Getting Different Results from the Same Program in Different Contexts

2014-05-21

Can We Measure the Measures of Teaching Effectiveness?

2012-10-31

Need for Product Evaluations Continues to Grow

2011-11-12

Join Empirical Education at ALAS, AEA, and NSDC

2011-10-10

Looking Back 35 Years to Learn about Local Experiments

2011-03-21

Archive