Empirical Education Inc.

Navigating the Tensions: How Could Equity-Relevant Research Also Be Agile, Open, and Scalable?

Our SEERNet partnership with Digital Promise is working to connect platform developers, researchers, and educators to find ways to conduct equity-relevant research using well-used digital learning platforms, and to simultaneously conduct research that is more agile, more open, and more directly applicable at scale. To do this, researchers may have to rethink how they plan and undertake their research. We wrote a paper identifying five approaches that could better support this work.

Researchers could reframe research designs to form smaller, agile cycles that test small changes each time.
Researchers could shift from designing new educational resources to determining how well-used resources could be elaborated and refined to address equity issues.
Researchers could utilize variables that capture student experiences to investigate equity when they cannot obtain student demographic/identity variables.
Researchers could work in partnership with educators on equity problems that educators prioritize and want help in solving.
Researchers could acknowledge that achieving equity is not only a technological or resource-design problem, but requires working at the classroom and systems levels too.

We hope that this paper (Navigating the Tensions: How Could Equity-Relevant Research Also Be Agile, Open, and Scalable?) will provide insights and ideas for researchers in the SEERNet community.

Read the paper here.

2022-11-09

Posted by: Robin Means and Jenna Zacamy

Tags: Digital Promise, educational leadership, educational research methods, equity and online learning platforms

Evidentally is a finalist in the XPRIZE Digital Learning Challenge

The XPRIZE Digital Learning Challenge encourages applicants to develop innovative approaches to “modernize, accelerate, and improve effective learning tools, processes and outcomes” for all learners. The overarching goal of this type of research is to increase equity by identifying education products that work with different subgroups of students. Seeing the Institute of Education Sciences (IES) move in this direction provides hope for the future of education research and our students.

For those of you who have known us for the last 5-10 years, you may be aware that we’ve been working towards this future of low-cost, quick turnaround studies for quite some time. To be completely transparent, I had never even heard of XPRIZE before IES funded one of their competitions.

Given our excitement about this IES-funded competition, we knew we had to throw our Evidentally hat into the ring. Evidentally is the part of Empirical Education—formerly known as Evidence as a Service (EaaS)—that has been producing low-cost, quick turnaround research reports for edtech clients for the past 5 years.

Of the 33 teams who entered the XPRIZE competition, we are excited to announce that we are one of the 10 finalists. We look forward to seeing how this competition helps to pave the road to scalable education research.

2022-08-09

Posted by: Robin Means

Tags: equity, Evidentally and XPRIZE

McGraw Hill Education ALEKS Study Published

We worked with McGraw Hill in 2021 to evaluate the effect of ALEKS, an adaptive program for math and science, in California and Arizona. These School Impact reports, like all of our reports, were designed to meet the Every Student Succeeds Act (ESSA) evidence standards.

During this process of working with McGraw Hill, we found evidence that the implementation of ALEKS in Arizona school districts during the 2018-2019 school year had a positive effect on the AzMERIT End of Course Algebra I and Algebra II assessments, especially for students from historically disadvantaged populations. This School Impact report—meeting ESSA evidence tier 3: Promising Evidence—identifies the school-level effects of active ALEKS usage on achievement compared to similar AZ schools not using ALEKS.

Please visit our McGraw Hill webpage to read the ALEKS Arizona School Impact report.

What is ESSA?

For more information on ESSA in education and how Empirical Education incorporates the ESSA standards into our work, check out our ESSA page.

2022-05-11

Posted by: Robin Means

Tags: ALEKS, ESSA, every student succeeds act, McGraw Hill, student impact and what is ESSA

Presenting CREATE at AERA in April 2022

Attending AERA 2022

We’re finally returning to in-person conferences after the COVID-related hiatus, and we will be presenting at the annual meeting of the American Educational Research Association (AERA). This year, the AERA meeting will be held in San Diego, our CEO Denis Newman’s new home turf since relocating to Encinitas from Palo Alto, CA.

Sze-Shun Lau and Jenna Zacamy will be attending AERA in person, joined by Andrew Jaciw virtually, to present impacts of Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) on the continuous retention of teachers through their second year.

When: Thursday, April 21, from 2:30 to 4:00pm PDT
Where: San Diego Convention Center, Exhibit Hall B
AERA Roundtable session: Retaining Teachers for Diverse Contexts
AERA Presentation: Impacts of “Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness” on the Continuous Retention of Teachers Through Their Second Year

Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE)

The work that is the basis of our AERA presentation examines the impact of CREATE—a teacher induction program—on graduation and subsequent retention of teachers through their first two years. The matched comparison group design involved 121 teachers across two cohorts. Positive impacts on retention rates were observed among Black educators only.

Retention rates after two years of teaching were 71% for non-Black educators in both CREATE and comparison groups. For Black educators the rates were 96% and 63% in CREATE and comparison, respectively. Positive impacts on mediators among Black educators, including stress-management and self-efficacy in teaching, provide a preliminary explanation of effects.

We have been exploring potential mechanisms for these impacts by posing open-ended survey questions to teachers about teacher retention. Based on their own conversations, experiences, and observations, early career teachers have cited rigid teaching standards, heavy and mentally taxing workloads, a lack of support from administration, and low pay as common reasons teachers in their first three years of teaching leave the profession.

Factors that these teachers see as effective in retaining early-career teachers include recognition of the importance of representation in the classroom and motivation to work towards building less oppressive systems for their students. For early career teachers participating in CREATE, access to professional learning around communication skills, changing one’s mindset, and addressing inequities is credited as a potential driver of higher retention rates.

We look forward to presenting these and other themes that have emerged from the responses these teachers provided.

We would be delighted to see you in San Diego if you’re planning to attend AERA. Let us know if we can schedule a time to meet up.

^{Photo by Lucas Davies}

2022-04-04

Posted by: Robin Means and Sze-Shun Lau

Tags: aera and CREATE

Towards Greater (Local) Relevance of Causal Generalizations

To cite the paper we discuss in this blog post, use the reference below.

Jaciw, A. P., Unlu, F., & Nguyen, T. (2021). A within-study approach to evaluating the role of moderators of impact in limiting generalizations from “large to small”. American Journal of Evaluation. https://journals.sagepub.com/doi/10.1177/10982140211030552

Generalizability of Causal Inferences

The field of education has made much progress over the past 20 years in the use of rigorous methods, such as randomized experiments, for evaluating causal impacts of programs. This includes a growing number of studies on the generalizability of causal inferences stemming from the recognition of the prevalence of impact heterogeneity and its sources (Steiner et al., 2019). Most recent work on generalizability of causal inferences has focused on inferences from “small to large”. Studies typically include 30–70 schools while generalizations are made to inference populations at least ten times larger (Tipton et al., 2017). Such studies are typically used in informing decision makers concerned with impacts on broad scales, for example at the state level. However, as we are periodically reminded by the likes of Cronbach (1975, 1982) and Shadish et al. (2002), generalizations are of many types and support decisions on different levels. Causal inferences may be generalized not only to populations outside the study sample or to larger populations, but also to subgroups within the study sample and to smaller groups – even down to the individual! In practice, district and school officials who need local interpretations of the evidence might ask: “If a school reform effort demonstrates positive impact on some large scale, should I, as a principal, expect that the reform will have positive impact on the students in my school?” Our work introduces a new approach (or a new application of an old approach) to address questions of this type. We empirically evaluate how well causal inferences that are drawn on the large scale generalize to smaller scales.

The Research Method

We adapt a method from studies traditionally used (first in economics and then in education) to empirically measure the accuracy of program impact estimates from non-experiments. A central question is whether specific strategies result in better alignment between non-experimental impact findings and experimental benchmarks. Those studies, sometimes referred to as “Within-Study Comparison” studies (pioneered by Lalonde, 1986, and Fraker & Maynard, 1987), typically start with an estimate of a program’s impact from an uncompromised experiment. This result serves as the benchmark experimental impact finding. Then, to generate a non-experimental result, outcomes from the experimental control are replaced with those from a different comparison group. The difference in impact that results from this substitution measures the bias (inaccuracy) in the result that employs the non-experimental comparison. Researchers typically summarize this bias, and then try to remediate it using various design and analysis-based strategies. (The Within-Study Comparison literature is vast and includes many studies that we cite in the article.)
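As a rough illustration of the substitution logic described above, the following Python sketch computes the bias for a single Within-Study Comparison. The function name and the numbers are hypothetical and chosen only for illustration; they do not come from any of the studies cited.

```python
def within_study_bias(treatment_mean, experimental_control_mean, comparison_group_mean):
    """Return (benchmark impact, non-experimental impact, bias) for one comparison."""
    # Benchmark: impact estimated with the randomized control group.
    benchmark_impact = treatment_mean - experimental_control_mean
    # Non-experimental result: same treatment group, different comparison group.
    nonexperimental_impact = treatment_mean - comparison_group_mean
    # Bias: how far the non-experimental result falls from the benchmark.
    bias = nonexperimental_impact - benchmark_impact
    return benchmark_impact, nonexperimental_impact, bias


# Hypothetical outcome means, in standard deviation units.
benchmark, nonexperimental, bias = within_study_bias(0.50, 0.20, 0.10)
print(f"benchmark={benchmark:.2f}, non-experimental={nonexperimental:.2f}, bias={bias:.2f}")
```

Because the treatment group is the same in both estimates, the bias reduces to the difference between the experimental control mean and the comparison group mean.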

Our Approach Follows a Within-Study Comparison Rationale and Method, but with a Focus on Generalizability.

We use data from the multisite Tennessee Student-Teacher Achievement Ratio (STAR) class size reduction experiment (described in Finn & Achilles, 1990; Mosteller, 1995; Nye et al., 2000) to illustrate the application of our method. (We used 73 of the original 79 sites.) In the original study, students and teachers were randomized to small or regular-sized classes in grades K-3. Results showed a positive average impact of small classes. In our study, we ask whether a decision maker at a given site should accept this finding of an overall average positive impact as generalizable to his or her individual site.

We use the Within-Study Comparison Method as a Foundation.

First, we adopt the idea of using experimental benchmark impacts as the starting point. In the case of the STAR trial, each of the 73 individual sites yields its own benchmark value for impact. Second, consistent with Within-Study Comparisons, we select an alternative to compare against the benchmark. Specifically, we choose the average of impacts (the grand mean) across all sites as the generalized value. Third, we establish how closely this generalized value approximates impacts at individual sites (i.e., how well it generalizes “to the small”). With STAR, we can do this 73 times, once for each site. Fourth, we summarize the discrepancies. Standard Within-Study Comparison methods typically average over the absolute values of individual biases. We adapt this, but instead use the average of 73 squared differences between the generalized impact and site-benchmark impacts. This allows us to capture the average discrepancy as a variance, specifically as the variation in impact across sites. We estimated this variation several ways, using alternative hierarchical linear models. Finally, we examine whether adjusting for imbalance between sites in site-level characteristics that potentially interact with treatment leads to closer alignment between the grand mean (generalized) and site-specific impacts.

(Sometimes people wonder why, with Within-Study Comparison studies, if site-specific benchmark impacts are available, one would use less-optimal comparison group-based alternatives. With Within-Study Comparisons, the whole point is to see how closely we can replicate the benchmark quantity, in order to inform how well methods of causal inference (of generalization, in this case) potentially perform in situations where we do not have an experimental benchmark.)
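The fourth step above can be sketched in a few lines of Python. This is a minimal illustration on synthetic numbers, not the STAR data and not the hierarchical linear models used in the paper; the function name, the random seed, and the simulated site impacts are all assumptions made for the example.

```python
import random
import statistics


def generalization_discrepancy(site_impacts):
    """Summarize how far the grand mean sits from each site's benchmark impact."""
    # Generalized value: the grand mean of the site-level impacts.
    grand_mean = statistics.fmean(site_impacts)
    # One squared discrepancy per site (generalized value minus site benchmark).
    squared_diffs = [(grand_mean - impact) ** 2 for impact in site_impacts]
    # Averaging the squared discrepancies gives the variance of impact across sites.
    mean_squared_diff = statistics.fmean(squared_diffs)
    # The square root puts the summary back on the scale of the outcome (SD units).
    return grand_mean, mean_squared_diff, mean_squared_diff ** 0.5


# Synthetic site impacts for 73 hypothetical sites, in SD units.
rng = random.Random(0)
site_impacts = [rng.gauss(0.25, 0.30) for _ in range(73)]

grand_mean, mean_squared_diff, root_mean_squared_diff = generalization_discrepancy(site_impacts)
print(f"grand mean: {grand_mean:.3f}")
print(f"variance of impact across sites: {mean_squared_diff:.3f}")
print(f"typical discrepancy (SD units): {root_mean_squared_diff:.3f}")
```

Note that this naive calculation treats each site's estimated impact as if it were measured without error, so it mixes true variation in impact with sampling noise. Separating the two is what the hierarchical models are for.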

Our application is intentionally based on Within-Study Comparison methods. This is set out clearly in Jaciw (2010, 2016). Early applications with a similar approach can be found in Hotz et al. (2005) and Hotz et al. (2006). A new contribution of ours is that we summarize the discrepancy not as an average of absolute value of bias (a common metric in Within-Study Comparison studies), but as noted above, as a variance. This may sound like a nuanced technical detail, but we think it leads to an important interpretation: variation in impact is not just related to the problem of generalizability; rather, it directly indexes the accuracy (quantifies the degree of validity) of generalizations from “large to small”. We acknowledge Bloom et al. (2005) for the impetus for this idea, specifically, their insight that bias in Within-Study Comparison studies can be thought of as a type of “mismatch error”. Finally, we think it is important to acknowledge the ideas in G Theory from education (Cronbach et al., 1963; Shavelson & Webb, 2009). In that tradition, parsing variability in outcomes, accounting for its sources, and assessing the role of interactions among study factors are central to the problem of generalizability.

Research Findings

First main result

The grand mean impact, on average, does not generalize reliably to the 73 sites. Before covariate adjustments, the average of the differences between the grand mean and the impacts at individual sites ranges between 0.41 and 0.25 standard deviations (SDs) of the outcome distribution, depending on the model used. After covariate adjustments, the average of the differences ranges between 0.41 and 0.17 SDs. (The average impact was about 0.25 SD.)

Second main result

Modeling effects of site-level covariates, and their interactions with treatment, only minimally reduced the between-site differences in impact.

Third main result

Whether impact heterogeneity achieves statistical significance depends on sampling error and correctly accounting for its sources. If we are going to provide accurate policy advice, we must make sure that we are not confusing random sampling error within sites (differences we would expect in results even if the program was not used) for variation in impact across sites. One source of random sampling error that is important but could be overlooked comes from classes. Given that teachers provide different value-added to students’ learning, we can expect differences in outcomes across classes. In STAR, with only a handful of teachers per school, the between-class differences easily add noise to the between-school outcomes and impacts. After adjusting for class random effects, the discrepancies in impact described above decreased by approximately 40%.
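One simple way to see why sampling error matters is a method-of-moments decomposition: the variance of the estimated site impacts is roughly the true variance in impact plus the average sampling variance of the site estimates. The sketch below is a swapped-in simplification, not the hierarchical linear models with class random effects used in the paper. The function name and the input values are hypothetical.

```python
import statistics


def adjusted_impact_variance(site_estimates, site_standard_errors):
    """Return (observed variance, average sampling variance, adjusted variance)."""
    # Variance of the estimated impacts across sites (sample variance).
    observed_variance = statistics.variance(site_estimates)
    # Average sampling variance: the mean of the squared standard errors.
    mean_sampling_variance = statistics.fmean([se ** 2 for se in site_standard_errors])
    # Subtract the noise; truncate at zero because a variance cannot be negative.
    adjusted_variance = max(0.0, observed_variance - mean_sampling_variance)
    return observed_variance, mean_sampling_variance, adjusted_variance


# Hypothetical site impact estimates and their standard errors, in SD units.
estimates = [0.00, 0.20, 0.40]
standard_errors = [0.10, 0.10, 0.10]

observed, noise, adjusted = adjusted_impact_variance(estimates, standard_errors)
print(f"observed={observed:.3f}, sampling noise={noise:.3f}, adjusted={adjusted:.3f}")
```

If the standard errors ignore differences between classes, they are too small, too little noise is subtracted, and the remaining heterogeneity in impact is overstated.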

Research Conclusions

For the STAR experiment, the grand mean impact failed to generalize to individual sites. Adjusting for effects of moderators did not help much. Adjusting for class-level sampling error significantly reduced the level of measured heterogeneity. Even though the discrepancies decreased significantly after the class effects were included, the size of the discrepancies remained large enough to be substantively important, and therefore, we cannot conclude that the average impact generalized to individual sites.

In sum, based on this study, a policymaker at the site (school) level should apply caution in assessing whether the average result applies to his or her unique context.

The results remind us of an observation from Lee Cronbach (1982) about how a school board might best draw inferences about their local context serving a large Hispanic student body when program effects vary:

The school board might therefore do better to look at…small cities, cities with a large Hispanic minority, cities with well-trained teachers, and so on. Several interpretations-by-analogy can then be made….If these several conclusions are not too discordant, the board can have some confidence in the decision that it makes about its small city with well-trained teachers and a Hispanic clientele. When results in the various slices of data are dissimilar, it is better to try to understand the variation than to take the well-determined – but only remotely relevant – national average as the best available information. The school board cannot regard that average as superior information unless it believes that district characteristics do not matter (p. 167).

Some Possible Extensions of The Work

We’re looking forward to doing more work to continue to understand how to produce useful generalizations that support decision-making on smaller scales. Traditional Within-Study Comparison studies give us much food for thought, including about other designs and analysis strategies for inferring impacts to individual sites, and how to best communicate the discrepancies we observe and whether they are substantively large enough to matter for informing policy decisions and outcomes. One area of main interest concerns the quality of the moderators themselves; that is, how well they account for or explain impact heterogeneity. Here our approach diverges from traditional Within-Study Comparison studies. When applied to problems of internal validity, confounders can be seen as nuisances that make our impact estimates inaccurate. With regard to external validity, factors that interact with the treatment, and thereby produce variation in impact that affects generalizability, are not a nuisance; rather, they are an important source of information that may help us to understand the mechanisms through which the variation in impact occurs. Therefore, understanding the mechanisms relating the person, the program, context, and the outcome is key.

Lee Cronbach described the bounty of and interrelations among interactions in the social sciences as a “hall of mirrors”. We’re looking forward to continuing the careful journey along that hall to incrementally make sense of a complex world!

References

Bloom, H. S., Michalopoulos, C., & Hill, C. J. (2005). Using experiments to assess nonexperimental comparison-group methods for measuring program effects. In H. S. Bloom (Ed.), Learning more from social experiments (pp. 173–235). Russell Sage Foundation.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30(2), 116–127.

Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberation of reliability theory. The British Journal of Statistical Psychology, 16, 137–163.

Cronbach, L. J. (1982). Designing Evaluations of Educational and Social Programs. Jossey-Bass.

Finn, J. D., & Achilles, C. M. (1990). Answers and questions about class size: A statewide experiment. American Educational Research Journal, 27, 557–577.

Fraker, T., & Maynard, R. (1987). The adequacy of comparison group designs for evaluations of employment-related programs. The Journal of Human Resources, 22, 194–227.

Jaciw, A. P. (2010). Challenges to drawing generalized causal inferences in educational research: Methodological and philosophical considerations. [Doctoral dissertation, Stanford University.]

Jaciw, A. P. (2016). Assessing the accuracy of generalized inferences from comparison group studies using a within-study comparison approach: The methodology. Evaluation Review, 40, 199-240. https://journals.sagepub.com/doi/abs/10.1177/0193841x16664456

Hotz, V. J., Imbens, G. W., & Klerman, J. A. (2006). Evaluating the differential effects of alternative welfare-to-work training components: A reanalysis of the California GAIN Program. Journal of Labor Economics, 24, 521–566.

Hotz, V. J., Imbens, G. W., & Mortimer, J. H. (2005). Predicting the efficacy of future training programs using past experiences at other locations. Journal of Econometrics, 125, 241–270.

Lalonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76, 604–620.

Mosteller, F. (1995). The Tennessee study of class size in the early school grades. The Future of Children, 5, 113–127.

Nye, B., Hedges, L. V., & Konstantopoulos, S. (2000). The effects of small classes on academic achievement: The results of the Tennessee class size experiment. American Educational Research Journal, 37, 123–151.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-experimental Designs for Generalized Causal Inference. Houghton Mifflin.

Shavelson, R. J., & Webb, N. M. (2009). Generalizability theory and its contributions to the discussion of the generalizability of research findings. In K. Ercikan & W. M. Roth (Eds.), Generalizing from educational research (pp. 13–32). Routledge.

Steiner, P. M., Wong, V. C., & Anglin, K. (2019). A causal replication framework for designing and assessing replication efforts. Zeitschrift für Psychologie, 227, 280–292.

Tipton, E., Hallberg, K., Hedges, L. V., & Chan, W. (2017). Implications of small samples for generalization: Adjustments and rules of thumb. Evaluation Review, 41(5), 472–505.

^{Photo by drmakete lab}

2022-03-15

Posted by: Andrew P. Jaciw and Thanh Nguyen

Tags: generalizability and within-study comparison

Introducing SEERNet with the Goal of Replication Research

In 2021, we partnered with Digital Promise on a research proposal for the IES research network: Digital Learning Platforms to Enable Efficient Education Research Network. The project, SEER Research Network for Digital Learning Platforms (SEERNet), was funded through an IES education research grant in fall 2021, and we took off running. Digital Promise launched this SEERNet website to keep the community up to date on our progress. We’ve been meeting with five platform hosts, selected by IES, to develop ideas for replication research, generalizability in research, and rapid research.

The goal of SEERNet is to integrate rigorous education research into existing digital learning platforms (DLPs) in an effort to modernize research. The digital learning platforms have the potential to support education researchers as they study new ideas and seek to replicate those ideas quickly, across many sites, with a wide range of student populations and with a variety of education research topics. Each of the five platforms (listed below) will eventually have over 100,000 users, allowing us to explore ways to increase the efficiency of a replication study.

Kinetic by OpenStax
UpGrade/MATHia by Carnegie Learning
Learning at Scale by Arizona State University
E-Trials by ASSISTments
Terracotta by Canvas

As the network leads, Empirical Education and Digital Promise will work to share best practices among the DLPs and build a community of researchers and practitioners interested in the opportunities afforded by these innovative platforms for impactful research. Stay tuned for more updates on how you can get involved!

^{This project is supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305N210034 to Digital Promise. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.}

2022-01-20

Posted by: Robin Means

Tags: education research, education research grant, education research topics, generalizability, generalizability in research, rapid research, replication, what is a limitation that affects the generalizability of research results, what is a limitation that affects the generalizability of research results? and what is replication of a study

Happy New Year from Empirical Education

We’ve had an interesting year, to say the least. We are grateful for your colleagueship and friendship through it all. To ring in the new year, we want to share this one-hour playlist with you. It comprises songs from each person on our team. We hope you like it. Cheers to a healthy and prosperous 2022!

Happy New Year playlist

2021-12-16

Posted by: Robin Means

Tags: education research, empirical education, ESSA and evaluation

Introducing Our Newest Researchers

The Empirical Research Team is pleased to announce the addition of two new team members. We welcome Zahava Heydel and Chelsey Nardi on board as our newest researchers!

Zahava Heydel, Research Assistant

Zahava has taken on assisting Sze-Shun Lau with the CREATE project, a teacher residency program in Atlanta Public Schools invested in expanding equity in education by developing critically conscious, compassionate, and skilled educators. Zahava’s experience as a research assistant at the University of Colorado Anschutz Medical Campus Department of Psychiatry, Colorado Center for Women’s Behavioral Health and Wellness is an asset to the Empirical Education team as we move toward evaluating SEL programs and individual student needs.

Chelsey Nardi, Research Manager

Chelsey is taking on the role of co-project manager for our evaluation of the CREATE project, working with Sze-Shun and Zahava. Chelsey is currently working toward her PhD exploring the application of antiracist theories in science education, which may support the evaluation of CREATE’s mission to develop critically conscious educators. Additionally, her research experience at McREL International and REL Pacific as a Research and Evaluation Associate has prepared her for managing some of our REL Southwest applied research projects. These experiences, coupled with her experience in project management, make her an ideal fit for our team.

2021-10-04

Posted by: Robin Means

Tags: Chelsey Nardi, create, education research, rel-sw, research and Zahava Heydel

Empirical Education Wraps Up Two Major i3 Research Studies

Empirical Education is excited to share that we recently completed two Investing in Innovation (i3) (now EIR) evaluations, one for the Making Sense of SCIENCE program and one for the Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) program. We thank the staff on both programs for their fantastic partnership. We also acknowledge Anne Wolf, our i3 technical assistance liaison from Abt Associates, as well as our Technical Working Group members on the Making Sense of SCIENCE project (Anne Chamberlain, Angela DeBarger, Heather Hill, Ellen Kisker, James Pellegrino, Rich Shavelson, Guillermo Solano-Flores, Steve Schneider, Jessaca Spybrook, and Fatih Unlu) for their invaluable contributions. Conducting these two large-scale, complex, multi-year evaluations over the last five years has not only given us the opportunity to learn much about both programs, but has also challenged our thinking, allowing us to grow as evaluators and researchers. We now reflect on some of the key lessons we learned, lessons that we hope will contribute to the field’s efforts in moving large-scale evaluations forward.

Background on Both Programs and Study Summaries

Making Sense of SCIENCE (developed by WestEd) is a teacher professional learning model aimed at increasing student achievement through improving instruction and supporting districts, schools, and teachers in their implementation of the Next Generation Science Standards (NGSS). The key components of the model include building leadership capacity and providing teacher professional learning. The program’s theory of action is based on the premise that professional learning that is situated in an environment of collaborative inquiry and supported by school and district leadership produces a cascade of effects on teachers’ content and pedagogical content knowledge, teachers’ attitudes and beliefs, the school climate, and students’ opportunities to learn. These effects, in turn, yield improvements in student achievement and other non-academic outcomes (e.g., enjoyment of science, self-efficacy, and agency in science learning). NGSS had just been introduced two years prior to the study, a study which ran from 2015 through 2018. The infancy of NGSS and the resulting shifting landscape of science education posed a significant challenge to our study, which we discuss below.

Our impact study of Making Sense of SCIENCE was a cluster-randomized, two-year evaluation involving more than 300 teachers and 8,000 students. Confirmatory impact analyses found a positive and statistically significant impact on teacher content knowledge. While impact results on student achievement were mostly positive, none reached statistical significance. Exploratory analyses found positive impacts on teacher self-reports of time spent on science instruction, shifts in instructional practices, and amount of peer collaboration. Read our final report here.

CREATE is a three-year teacher residency program for students of Georgia State University College of Education and Human Development (GSU CEHD) that begins in their last year at GSU and continues through their first two years of teaching. The program seeks to raise student achievement by increasing teacher effectiveness and retention of both new and veteran educators by developing critically-conscious, compassionate, and skilled educators who are committed to teaching practices that prioritize racial justice and interrupt inequities.

Our impact study of CREATE used a quasi-experimental design to evaluate program effects for two staggered cohorts of study participants (CREATE and comparison early career teachers) from their final year at GSU CEHD through their second year of teaching, starting with the first cohort in 2015–16. Confirmatory impact analyses found no impact on teacher performance or on student achievement. However, exploratory analyses revealed a positive and statistically significant impact on continuous retention over a three-year time period (spanning graduation from GSU CEHD, entering teaching, and retention into the second year of teaching) for the CREATE group, compared to the comparison group. We also observed that higher continuous retention among Black educators in CREATE, relative to those in the comparison group, is the main driver of the favorable impact. The fact that the differential impacts on Black educators were positive and statistically significant for measures of executive functioning (resilience) and self-efficacy—and marginally statistically significant for stress management related to teaching—hints at potential mediators of impact on retention and guides future research.

After the i3 program funded this research, Empirical Education, GSU CEHD, and CREATE received two additional grants from the U.S. Department of Education’s Supporting Educator Effectiveness Development (SEED) program for further study of CREATE. We are currently studying our sixth cohort of CREATE residents and will have studied eight cohorts of CREATE residents, five cohorts of experienced educators, and two cohorts of cooperating teachers by the end of the second SEED grant. We are excited to continue our work with the GSU and CREATE teams and to explore the impact of CREATE, especially for retention of Black educators. Read our final report for the i3 evaluation of CREATE here.

Lessons Learned

While there were many lessons learned over the past five years, we’ll highlight two that were particularly challenging and possibly most pertinent to other evaluators.

The first key challenge that both studies faced was the availability of valid and reliable instruments to measure impact. For Making Sense of SCIENCE, a measure of student science achievement that was aligned with NGSS was difficult to identify because of the relative newness of the standards, which emphasized three-dimensional learning (disciplinary core ideas, science and engineering practices, and cross-cutting concepts). This multi-dimensional learning stood in stark contrast to the existing view of science education at the time, which primarily focused on science content. In 2014, one year prior to the start of our study, the National Research Council pointed out that “the assessments that are now in wide use were not designed to meet this vision of science proficiency and cannot readily be retrofitted to do so” (NRC, 2014, p. 12). While state science assessments that existed at the time were valid and reliable, they focused on science content and did not measure the type of three-dimensional learning targeted by NGSS. The NRC also noted that developing new assessments would “present[s] complex conceptual, technical, and practical challenges, including cost and efficiency, obtaining reliable results from new assessment types, and developing complex tasks that are equitable for students across a wide range of demographic characteristics” (NRC, 2014, p. 16).

Given this context, despite the research team’s extensive search for assessments from a variety of sources (including reaching out to state departments of education, university-affiliated assessment centers, and test developers), we could not find an appropriate instrument. Using state assessments was not an option. The states in our study were still in the process of either piloting or field testing assessments that were aligned to NGSS or to state standards based on NGSS. This void of assessments left the evaluation team with no choice but to develop one, independently of the program developer, using established items from multiple sources to address general specifications of NGSS, and relying on the deep content expertise of some members of the research team. Of course, there were some risks associated with this, especially given the lack of opportunity to comprehensively pilot or field test the items in the context of the study. When used operationally, the researcher-developed assessment turned out to be difficult and was not highly discriminating of ability at the low end of the achievement scale, which may have influenced the small effect size we observed. The circumstances around the assessment and the need to improvise a measure lead us to interpret findings related to science achievement of the Making Sense of SCIENCE program with caution.

The CREATE evaluation also faced a measurement challenge. One of the two confirmatory outcomes in the study was teacher performance, as measured by school administrators’ ratings of teachers on two of the state’s Teacher Assessment on Performance Standards (TAPS), which is a component of the state’s evaluation system (Georgia Department of Education, 2021). We could not detect impact on this measure because the variance observed in the ordinal ratings was remarkably low, with ratings overwhelmingly centered on the median value. This was not a complete surprise. The literature documents this lack of variability in teaching performance ratings. A seminal report, The Widget Effect by The New Teacher Project (Weisberg et al., 2009), called attention to this “national crisis”: the inability of schools to effectively differentiate among low- and high-performing teachers. The report showed that in districts that use binary evaluation ratings, as well as those that use a broader range of rating options, less than 1% of teachers received a rating of unsatisfactory. In a study of teacher performance ratings by Kraft and Gilmour (2017), principals explained that they were more reluctant to give new teachers a rating below proficient because they acknowledged that new teachers were still working to improve their teaching, and that “giving a low rating to a potentially good teacher could be counterproductive to a teacher’s development.” These reasons are particularly relevant to the CREATE study given that the teachers in our study are very early in their teaching careers (first-year teachers), and given the high turnover rate of all teachers in Georgia.

We bring up this point about instruments as a way to share with the evaluation community what we see as a not uncommon challenge. In 2018 (the final year of outcomes data collection for Making Sense of SCIENCE), when we presented about the difficulties of finding a valid and reliable NGSS-aligned instrument at AERA, a handful of researchers approached us to commiserate; they too were experiencing similar challenges with finding an established NGSS-aligned instrument. As we write this, perhaps states and testing centers are further along in their development of NGSS-aligned assessments. However, the challenge of finding valid and reliable instruments, generally speaking, will persist as long as educational standards continue to evolve. (And they will.) Our response to this challenge was to be as transparent as possible about the instruments and the conclusions we can draw from using them. In reporting on Making Sense of SCIENCE, we provided detailed descriptions of our process for developing the instruments and reported item- and form-level statistics, as well as contextual information and rationale for critical decisions. In reporting on CREATE, we provided the distribution of ratings on the relevant dimensions of teacher performance for both the baseline and outcome measures. In being transparent, we allow the readers to draw their own conclusions from the data available, facilitate the review of the quality of the evidence against various sets of research standards, support replication of the study, and provide further context for future study.

A second challenge was maintaining a consistent sample over the course of the implementation, particularly in multi-year studies. For Making Sense of SCIENCE, which was conducted over two years, there was substantial teacher mobility into and out of the study. Given the reality of schools, even with study incentives, nearly half of teachers moved out of study schools or study-eligible grades within schools over the two-year period of the study. This obviously presented a challenge to program implementation. WestEd delivered professional learning as intended, and leadership professional learning activities all met fidelity thresholds for attendance, with strong uptake of Making Sense of SCIENCE within each year (over 90% of teachers met fidelity thresholds). Yet only slightly more than half of study teachers met the fidelity threshold for both years. The percentage of teachers leaving the school was consistent with what we observed at the national level: only 84% of teachers stay at the same school year-over-year (McFarland et al., 2019). For assessing impacts, the effects of teacher mobility can be addressed to some extent at the analysis stage; however, the more important goal is to figure out ways to achieve fidelity of implementation and exposure for the full program duration. One option is to increase incentivization and try to get more buy-in, including among administration, to allow more teachers to reach the two-year participation targets by retaining teachers in subjects and grades to preserve their eligibility status in the study. This solution may go part way because teacher mobility is a reality. Another option is to adapt the program to make it shorter and more intensive. However, this option may work against the core model of the program’s implementation, which may require time for teachers to assimilate their learning. Yet another option is to make the program more adaptable; for example, by letting teachers who leave eligible grades or schools continue to participate remotely, allowing impacts to be assessed over more of the initially randomized sample.
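As a rough illustration of how year-over-year mobility compounds in a multi-year study, the sketch below applies the 84% national same-school rate to two consecutive years. It assumes, purely for illustration, that the same rate applies independently in each year; the function name is hypothetical, and the calculation ignores moves between grades within a school, which also affected eligibility in the study.

```python
def cumulative_retention(annual_rate: float, years: int) -> float:
    """Share of teachers expected to remain at the same school after `years`.

    Assumes the same annual retention rate applies independently each year,
    which is a simplification and not a finding of the study.
    """
    if not 0.0 <= annual_rate <= 1.0:
        raise ValueError("annual_rate must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    return annual_rate ** years


# National year-over-year same-school rate cited above (McFarland et al., 2019).
two_year_share = cumulative_retention(0.84, 2)
print(f"{two_year_share:.1%}")  # prints "70.6%"
```

Under these assumptions, roughly 3 in 10 teachers would be expected to change schools over two years from school-level mobility alone, before accounting for changes in grade or subject.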

For CREATE, sample size was also a challenge, but for slightly different reasons. During study design and recruitment, we had anticipated and factored the estimated level of attrition into the power analysis, and we successfully recruited the targeted number of teachers. However, several unexpected limitations arose during the study that ultimately resulted in small analytic samples. These limitations included challenges in obtaining research permission from districts and schools (which would have allowed participants to remain active in the study), as well as a loss of study participants due to life changes (e.g., obtaining teaching positions in other states, leaving the teaching profession completely, or feeling like they no longer had the time to complete data collection activities). Also, while Georgia administers the Milestones state assessment in grades 4–8, many participating teachers in both conditions taught lower elementary school grades or non-tested subjects. For the analysis phase, many factors resulted in small student samples: reduced teacher samples, the technical requirement of matching students across conditions within each cohort in order to meet WWC evidence standards, and the need to match students within grades, given the lack of vertically scaled scores. While we did achieve baseline equivalence between the CREATE and comparison groups for the analytic samples, the small number of cases greatly reduced the scope and external validity of the conclusions related to student achievement. The most robust samples were for retention outcomes. We have the most confidence in those results.
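Factoring attrition into recruitment targets can be sketched as follows. This is a generic textbook adjustment, not the calculation used in the CREATE power analysis; the function name and the example numbers are hypothetical.

```python
import math


def recruitment_target(analytic_n: int, expected_attrition: float) -> int:
    """Number of participants to recruit so that about `analytic_n` remain.

    `expected_attrition` is the anticipated share of recruits lost before
    analysis (hypothetical input; must be at least 0 and less than 1).
    """
    if analytic_n < 0:
        raise ValueError("analytic_n must be non-negative")
    if not 0.0 <= expected_attrition < 1.0:
        raise ValueError("expected_attrition must be in [0, 1)")
    # Inflate the required analytic sample by the expected retention share.
    return math.ceil(analytic_n / (1.0 - expected_attrition))


# Hypothetical example: 100 teachers needed for analysis, 30% expected attrition.
print(recruitment_target(100, 0.30))  # prints 143
```

An adjustment like this only protects against the attrition that was anticipated; unexpected losses, such as those described above, still shrink the analytic sample.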

As a last point of reflection, we greatly enjoyed and benefited from the close collaboration with our partners on these projects. The research and program teams worked together in lockstep at many stages of the study. We also want to acknowledge the role that the i3 grant played in promoting the collaboration. For example, the grant’s requirements around the development and refinement of the logic model were a major driver of many collaborative efforts. Evaluators reminded the team periodically about the “accountability” requirements, such as ensuring consistency in the definition and use of the program components and mediators in the logic model. The program team, on the other hand, contributed contextual knowledge gained through decades of being intimately involved in the program. In the spirit of participatory evaluation, the two teams benefited from the type of organizational learning that “occurs when cognitive systems and memories are developed and shared by members of the organizations” (Cousins & Earl, 1992). This type of organic and fluid relationship encouraged the researchers and program teams to embrace uncertainty during the study. While we “pre-registered” confirmatory research questions for both studies by submitting the study plans to NEi3 prior to the start of the studies, we allowed exploratory questions to be guided by conversations with the program developers. In doing so, we were able to address questions that were most useful to the program developers and the districts and schools implementing the programs.

We are thankful that we had the opportunity to conduct these two rigorous evaluations alongside such humble, thoughtful, and intentional (among other things!) program teams over the last five years, and we look forward to future collaborations. These two evaluations have both broadened and deepened our experience with large-scale evaluations, and we hope that our reflections here not only serve as lessons for us, but that they may also be useful to the education evaluation community at large, as we continue our work in the complex and dynamic education landscape.

References

Cousins, J. B., & Earl, L. M. (1992). The case for participatory evaluation. Educational Evaluation and Policy Analysis, 14(4), 397-418.

Georgia Department of Education (2021). Teacher Keys Effectiveness System. https://www.gadoe.org/School-Improvement/Teacher-and-Leader-Effectiveness/Pages/Teacher-Keys-Effectiveness-System.aspx

Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the widget effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234-249.

McFarland, J., Hussar, B., Zhang, J., Wang, X., Wang, K., Hein, S., Diliberti, M., Forrest Cataldi, E., Bullock Mann, F., & Barmer, A. (2019). The Condition of Education 2019 (NCES 2019-144). U.S. Department of Education. National Center for Education Statistics. https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2019144

National Research Council (NRC). (2014). Developing Assessments for the Next Generation Science Standards. Committee on Developing Assessments of Science Proficiency in K-12. Board on Testing and Assessment and Board on Science Education, J.W. Pellegrino, M.R. Wilson, J.A. Koenig, and A.S. Beatty, Editors. Division of Behavioral and Social Sciences and Education. The National Academies Press.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. The New Teacher Project. https://tntp.org/wp-content/uploads/2023/02/TheWidgetEffect_2nd_ed.pdf

2021-06-23

Posted by: Thanh Nguyen, Jenna Zacamy, & Andrew Jaciw

Tags: CREATE, education research, empirical education, evaluation, Georgia State University, i3, Making Sense of SCIENCE, QE, RCT, research, teacher and WestEd

Instructional Coaching: Positive Impacts on Edtech Use and Student Learning

In 2019, Digital Promise contracted with Empirical Education to evaluate the impact of the Dynamic Learning Project (DLP) on teacher and student edtech usage and on student achievement. DLP provided school-based instructional technology coaches with mentoring and professional development, with the goal of increasing educational equity and impactful use of technology. You may have seen the blog post we published in summer 2020 announcing the release of our design memo for the study. The importance of this project was magnified during the pandemic-induced shift to an increased use of online tools.

The results of the study are summarized in this research brief published last month. We found evidence of positive impacts on edtech use and student learning across three districts involved in DLP.

These findings contribute to the evidence base for how to drive meaningful technology use in schools. This should continue to be an area of investigation for future studies; districts focused on equity and inclusion must ensure that edtech is adopted broadly across teacher and student populations.

2021-04-28

Posted by: Robin Means

Tags: digital promise, edtech, empirical education, evaluation, evidence, impact and research
