blog posts and news stories

Empirical Education Wraps Up Two Major i3 Research Studies

Empirical Education is excited to share that we recently completed two Investing In Innovation (i3) (now EIR) evaluations for the Making Sense of SCIENCE program and the Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) programs. We thank the staff on both programs for their fantastic partnership. We also acknowledge Anne Wolf, our i3 technical assistance liaison from Abt Associates, as well as our Technical Working Group members on the Making Sense of SCIENCE project (Anne Chamberlain, Angela DeBarger, Heather Hill, Ellen Kisker, James Pellegrino, Rich Shavelson, Guillermo Solano-Flores, Steve Schneider, Jessaca Spybrook, and Fatih Unlu) for their invaluable contributions. Conducting these two large-scale, complex, multi-year evaluations over the last five years has not only given us the opportunity to learn much about both programs, but has also challenged our thinking—allowing us to grow as evaluators and researchers. We now reflect on some of the key lessons we learned, lessons that we hope will contribute to the field’s efforts in moving large-scale evaluations forward.

Background on Both Programs and Study Summaries

Making Sense of SCIENCE (developed by WestEd) is a teacher professional learning model aimed at increasing student achievement through improving instruction and supporting districts, schools, and teachers in their implementation of the Next Generation Science Standards (NGSS). The key components of the model include building leadership capacity and providing teacher professional learning. The program’s theory of action is based on the premise that professional learning that is situated in an environment of collaborative inquiry and supported by school and district leadership produces a cascade of effects on teachers’ content and pedagogical content knowledge, teachers’ attitudes and beliefs, the school climate, and students’ opportunities to learn. These effects, in turn, yield improvements in student achievement and other non-academic outcomes (e.g., enjoyment of science, self-efficacy, and agency in science learning). NGSS had just been introduced two years prior to the study, a study which ran from 2015 through 2018. The infancy of NGSS and the resulting shifting landscape of science education posed a significant challenge to our study, which we discuss below.

Our impact study of Making Sense of SCIENCE was a cluster-randomized, two-year evaluation involving more than 300 teachers and 8,000 students. Confirmatory impact analyses found a positive and statistically significant impact on teacher content knowledge. While impact results on student achievement were mostly all positive, none reached statistical significance. Exploratory analyses found positive impacts on teacher self-reports of time spent on science instruction, shifts in instructional practices, and amount of peer collaboration. Read our final report here.

CREATE is a three-year teacher residency program for students of Georgia State University College of Education and Human Development (GSU CEHD) that begins in their last year at GSU and continues through their first two years of teaching. The program seeks to raise student achievement by increasing teacher effectiveness and retention of both new and veteran educators by developing critically-conscious, compassionate, and skilled educators who are committed to teaching practices that prioritize racial justice and interrupt inequities.

Our impact study of CREATE used a quasi-experimental design to evaluate program effects for two staggered cohorts of study participants (CREATE and comparison early career teachers) from their final year at GSU CEHD through their second year of teaching, starting with the first cohort in 2015–16. Confirmatory impact analyses found no impact on teacher performance or on student achievement. However, exploratory analyses revealed a positive and statistically significant impact on continuous retention over a three-year time period (spanning graduation from GSU CEHD, entering teaching, and retention into the second year of teaching) for the CREATE group, compared to the comparison group. We also observed that higher continuous retention among Black educators in CREATE, relative to those in the comparison group, is the main driver of the favorable impact. The fact that the differential impacts on Black educators were positive and statistically significant for measures of executive functioning (resilience) and self-efficacy—and marginally statistically significant for stress management related to teaching—hints at potential mediators of impact on retention and guides future research.

After the i3 program funded this research, Empirical Education, GSU CEHD, and CREATE received two additional grants from the U.S. Department of Education’s Supporting Educator Effectiveness Development (SEED) program for further study of CREATE. We are currently studying our sixth cohort of CREATE residents and will have studied eight cohorts of CREATE residents, five cohorts of experienced educators, and two cohorts of cooperating teachers by the end of the second SEED grant. We are excited to continue our work with the GSU and CREATE teams and to explore the impact of CREATE, especially for retention of Black educators. Read our final report for the i3 evaluation of CREATE here.

Lessons Learned

While there were many lessons learned over the past five years, we’ll highlight two that were particularly challenging and possibly most pertinent to other evaluators.

The first key challenge that both studies faced was the availability of valid and reliable instruments to measure impact. For Making Sense of SCIENCE, a measure of student science achievement that was aligned with NGSS was difficult to identify because of the relative newness of the standards, which emphasized three-dimensional learning (disciplinary core ideas, science and engineering practices, and cross-cutting concepts). This multi-dimensional learning stood in stark contrast to the existing view of science education at the time, which primarily focused on science content. In 2014, one year prior to the start of our study, the National Research Council pointed out that “the assessments that are now in wide use were not designed to meet this vision of science proficiency and cannot readily be retrofitted to do so” (NRC, 2014, page 12). While state science assessments that existed at the time were valid and reliable, they focused on science content and did not measure the type of three-dimensional learning targeted by NGSS. The NRC also noted that developing new assessments would “present[s] complex conceptual, technical, and practical challenges, including cost and efficiency, obtaining reliable results from new assessment types, and developing complex tasks that are equitable for students across a wide range of demographic characteristics” (NRC, 2014, p.16).

Given this context, despite the research team’s extensive search for assessments from a variety of sources—including reaching out to state departments of education, university-affiliated assessment centers, and test developers—we could not find an appropriate instrument. Using state assessments was not an option. The states in our study were still in the process of either piloting or field testing assessments that were aligned to NGSS or to state standards based on NGSS. This void of assessments left the evaluation team with no choice but to develop one, independently of the program developer, using established items from multiple sources to address general specifications of NGSS, and relying on the deep content expertise of some members of the research team. Of course there were some risks associated with this, especially given the lack of opportunity to comprehensively pilot or field test the items in the context of the study. When used operationally, the researcher-developed assessment turned out to be difficult and was not highly discriminating of ability at the low end of the achievement scale, which may have influenced the small effect size we observed. The circumstances around the assessment and the need to improvise a measure leads us to interpret findings related to science achievement of the Making Sense of SCIENCE program with caution.

The CREATE evaluation also faced a measurement challenge. One of the two confirmatory outcomes in the study was teacher performance, as measured by ratings of teachers by school administrators on two of the state’s Teacher Assessment on Performance Standards (TAPS), which is a component of the state’s evaluation system (Georgia Department of Education, 2021). We could not detect impact on this measure because the variance observed in the ordinal ratings was remarkably low, with ratings overwhelmingly centered on the median value. This was not a complete surprise. The literature documents this lack of variability in teaching performance ratings. A seminal report, The Widget Effect by The New Teacher Project (Weisberg et al., 2009), called attention to this “national crisis”—the inability of schools to effectively differentiate among low- and high-performing teachers. The report showed that in districts that use binary evaluation ratings, as well as those that use a broader range of rating options, less than 1% of teachers received a rating of unsatisfactory. In the CREATE study, the median value was chosen overwhelmingly. In a study examining teacher performance ratings by Kraft and Gilmour (2017), principals in that study explained that they were more reluctant to give new teachers a rating below proficient because they acknowledge that new teachers were still working to improve their teaching, and that “giving a low rating to a potentially good teacher could be counterproductive to a teacher’s development.” These reasons are particularly relevant to the CREATE study given that the teachers in our study are very early in their teaching career (first year teachers), and given the high turnover rate of all teachers in Georgia.

We bring up this point about instruments as a way to share with the evaluation community what we see as a not uncommon challenge. In 2018 (the final year of outcomes data collection for Making Sense of SCIENCE), when we presented about the difficulties of finding a valid and reliable NGSS-aligned instrument at AERA, a handful of researchers approached us to commiserate; they too were experiencing similar challenges with finding an established NGSS-aligned instrument. As we write this, perhaps states and testing centers are further along in their development of NGSS-aligned assessments. However, the challenge of finding valid and reliable instruments, generally speaking, will persist as long as educational standards continue to evolve. (And they will.) Our response to this challenge was to be as transparent as possible about the instruments and the conclusions we can draw from using them. In reporting on Making Sense of SCIENCE, we provided detailed descriptions of our process for developing the instruments and reported item- and form-level statistics, as well as contextual information and rationale for critical decisions. In reporting on CREATE, we provided the distribution of ratings on the relevant dimensions of teacher performance for both the baseline and outcome measures. In being transparent, we allow the readers to draw their own conclusions from the data available, facilitate the review of the quality of the evidence against various sets of research standards, support replication of the study, and provide further context for future study.

A second challenge was maintaining a consistent sample over the course of the implementation, particularly in multi-year studies. For Making Sense of SCIENCE, which was conducted over two years, there was substantial teacher mobility into and out of the study. Given the reality of schools, even with study incentives, nearly half of teachers moved out of study schools or study-eligible grades within schools over the two year period of the study. This obviously presented a challenge to program implementation. WestEd delivered professional learning as intended, and leadership professional learning activities all met fidelity thresholds for attendance, with strong uptake of Making Sense of SCIENCE within each year (over 90% of teachers met fidelity thresholds). Yet, only slightly more than half of study teachers met the fidelity threshold for both years. The percentage of teachers leaving the school was congruous with what we observed at the national level: only 84% of teachers stay as a teacher at the same school year-over-year (McFarland et al., 2019). For assessing impacts, the effects of teacher mobility can be addressed to some extent at the analysis stage; however, the more important goal is to figure out ways to achieve fidelity of implementation and exposure for the full program duration. One option is to increase incentivization and try to get more buy-in, including among administration, to allow more teachers to reach the two-year participation targets by retaining teachers in subjects and grades to preserve their eligibility status in the study. This solution may go part way because teacher mobility is a reality. Another option is to adapt the program to make it shorter and more intensive. However, this option may work against the core model of the program’s implementation, which may require time for teachers to assimilate their learning. Yet another option is to make the program more adaptable; for example, by letting teachers who leave eligible grades and school to continue to participate remotely, allowing impacts to be assessed over more of the initially randomized sample.

For CREATE, sample size was also a challenge, but for slightly different reasons. During study design and recruitment, we had anticipated and factored the estimated level of attrition into the power analysis, and we successfully recruited the targeted number of teachers. However, several unexpected limitations arose during the study that ultimately resulted in small analytic samples. These limitations included challenges in obtaining research permission from districts and schools (which would have allowed participants to remain active in the study), as well as a loss of study participants due to life changes (e.g., obtaining teaching positions in other states, leaving the teaching profession completely, or feeling like they no longer had the time to complete data collection activities). Also, while Georgia administers the Milestones state assessment in grades 4–8, many participating teachers in both conditions taught lower elementary school grades or non-tested subjects. For the analysis phase, many factors resulted in small student samples: reduced teacher samples, the technical requirement of matching students across conditions within each cohort in order to meet WWC evidence standards, and the need to match students within grades, given the lack of vertically scaled scores. While we did achieve baseline equivalence between the CREATE and comparison groups for the analytic samples, the small number of cases greatly reduced the scope and external validity of the conclusions related to student achievement. The most robust samples were for retention outcomes. We have the most confidence in those results.

As a last point of reflection, we greatly enjoyed and benefited from the close collaboration with our partners on these projects. The research and program teams worked together in lockstep at many stages of the study. We also want to acknowledge the role that the i3 grant played in promoting the collaboration. For example, the grant’s requirements around the development and refinement of the logic model was a major driver of many collaborative efforts. Evaluators reminded the team periodically about the “accountability” requirements, such as ensuring consistency in the definition and use of the program components and mediators in the logic model. The program team, on the other hand, contributed contextual knowledge gained through decades of being intimately involved in the program. In the spirit of participatory evaluation, the two teams benefited from the type of organization learning that “occurs when cognitive systems and memories are developed and shared by members of the organizations” (Cousins & Earl, 1992). This type of organic and fluid relationship encouraged the researchers and program teams to embrace uncertainty during the study. While we “pre-registered” confirmatory research questions for both studies by submitting the study plans to NEi3 prior to the start of the studies, we allowed exploratory questions to be guided by conversations with the program developers. In doing so, we were able to address questions that were most useful to the program developers and the districts and schools implementing the programs.

We are thankful that we had the opportunity to conduct these two rigorous evaluations alongside such humble, thoughtful, and intentional (among other things!) program teams over the last five years, and we look forward to future collaborations. These two evaluations have both broadened and deepened our experience with large-scale evaluations, and we hope that our reflections here not only serve as lessons for us, but that they may also be useful to the education evaluation community at large, as we continue our work in the complex and dynamic education landscape.

References

Cousins, J. B., & Earl, L. M. (1992). The case for participatory evaluation. Educational Evaluation and Policy Analysis, 14(4), 397-418.

Georgia Department of Education (2021). Teacher Keys Effectiveness System. https://www.gadoe.org/School-Improvement/Teacher-and-Leader-Effectiveness/Pages/Teacher-Keys-Effectiveness-System.aspx

Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the widget effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234-249.

McFarland, J., Hussar, B., Zhang, J., Wang, X., Wang, K., Hein, S., Diliberti, M., Forrest Cataldi, E., Bullock Mann, F., and Barmer, A. (2019). The Condition of Education 2019 (NCES 2019-144). U.S. Department of Education. National Center for Education Statistics. https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2019144

National Research Council (NRC). (2014). Developing Assessments for the Next Generation Science Standards. Committee on Developing Assessments of Science Proficiency in K-12. Board on Testing and Assessment and Board on Science Education, J.W. Pellegrino, M.R. Wilson, J.A. Koenig, and A.S. Beatty, Editors. Division of Behavioral and Social Sciences and Education. The National Academies Press.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. The New Teacher Project. https://tntp.org/wp-content/uploads/2023/02/TheWidgetEffect_2nd_ed.pdf

2021-06-23

The Evaluation of CREATE Continues

Empirical Education began conducting the evaluation of Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) in 2015 under a subcontract with Atlanta Neighborhood Charter Schools (ANCS) as part of their Investing in Innovation (i3) Development grant. Since our last CREATE update, we’ve extended this work through the Supporting Effective Educator Development (SEED) Grant Program. The SEED grant provides continued funding for three more cohorts of participants and expands the research to include experienced educators (those not in the CREATE residency program) in CREATE schools. The grant was awarded to Georgia State University and includes partnerships with ANCS, Empirical Education (as the external evaluator), and local schools and districts.

Similar to the i3 work, we’re following a treatment and comparison group over the course of the three-year CREATE residency program and looking at impacts on teacher effectiveness, teacher retention, and student achievement. With the SEED project, we will also be able to follow Cohort 3 and 4 for an additional 1-2 years following residency. Surveys will measure perceived levels of social capital, school climate and community, collaboration, resilience, and mindfulness, in addition to other topics. Recruitment for Cohort 4 began this past spring and continued through the summer, resulting in approximately 70 new participants.

One of the goals of the expanded CREATE programming is to support the effectiveness and social capital of experienced educators in CREATE schools. Any experienced educator in a CREATE school who attends CREATE professional learning activities will be invited to participate in the research study. Surveys will measure similar topics to those measured in the quasi-experiment and we conduct individual interviews with a sample of participants to gain an in-depth understanding of the participant experience.

We have completed our first year of experienced educator research and continue to recruit participants, on an ongoing basis, into the second year of the study. We currently have 88 participants and counting.

2018-10-03

IES Published Our REL Southwest Study on Trends in Teacher Mobility

The U.S. Department of Education’s Institute of Education Sciences published a report of a study we conducted for REL Southwest! We are thankful for the support and engagement we received from the Educator Effectiveness Research Alliance throughout the study.

The study was published in December 2017 and provides updated information regarding teacher mobility for Texas public schools during the 2011-12 through 2015-16 school years. Teacher mobility is defined as teachers changing schools or leaving the public school system.

In the report, descriptive information on mobility rates is presented at the regional and state levels for each school year. Mobility rates are disaggregated further into destination proportions to describe the proportion of teacher mobility due to within-district movement, between-district movement, and leaving Texas public schools. This study leverages data collected by the Texas Education Agency during the pilot of the Texas Teacher Evaluation and Support System (T-TESS) in 57 school districts in 2014-15. Analyses examine how components of the T-TESS observation rubric are related to school-level teacher mobility rates.

During the 2011-12 school year, 18.7% of Texas teachers moved schools within a district, moved between districts, or left the Texas Public School system. By 2015-16, this mobility rate had increased to 22%. Moving between districts was the primary driver of the increase in mobility rates. Results indicate significant links between mobility and teacher, student, and school demographic characteristics. Teachers with special education certifications left Texas public schools at nearly twice the rate of teachers with other teaching certifications. School-level mobility rates showed significant positive correlations with the proportion of special education, economically disadvantaged, low-performing, and minority students. School-level mobility rates were negatively correlated with the proportion of English learner students. Schools with higher overall observation ratings on the T-TESS rubric tended to have lower mobility rates.

Findings from this study will provide state and district policymakers in Texas with updated information about trends and correlates of mobility in the teaching workforce, and offer a systematic baseline for monitoring and planning for future changes. Informed by these findings, policymakers can formulate a more strategic and targeted approach for recruiting and retaining teachers. For instance, instead of using generic approaches to enhance the overall supply of teachers or improve recruitment, more targeted efforts to attract and retain teachers in specific subject areas (for example, special education), in certain stages of their career (for example, novice teachers), and in certain geographic areas are likely to be more productive. Moreover, this analysis may enrich the existing knowledge base about schools’ teacher retention and mobility in relation to the quality of their teaching force, or may inform policy discussions about the importance of a stable teaching force for teaching effectiveness.

2018-02-01

IES Publishes our Recent REL Southwest Teacher Studies

The U.S. Department of Education’s Institute of Education Sciences published two reports of studies we conducted for REL Southwest! We are thankful for the support and engagement we received from the Educator Effectiveness Research Alliance and the Oklahoma Rural Schools Research Alliance throughout the studies. The collaboration with the research alliances and educators aligns well with what we set out to do in our core mission: to support K-12 systems and empower educators in making evidence-based decisions.

The first study was published earlier this month and identified factors associated with successful recruitment and retention of teachers in Oklahoma rural school districts, in order to highlight potential strategies to address Oklahoma’s teaching shortage. This correlational study covered a 10-year period (the 2005-06 to 2014-15 school years) and used data from the Oklahoma State Department of Education, the Oklahoma Office of Educational Quality and Accountability, federal non-education sources, and publicly available geographic information systems from Google Maps. The study found that teachers who are male, those who have higher postsecondary degrees, and those who have more teaching experience are harder than others to recruit and retain in Oklahoma schools. In addition, for teachers in rural districts, higher total compensation and increased responsibilities in job assignment are positively associated with successful recruitment and retention. In order to provide context, the study also examined patterns of teacher job mobility between rural and non-rural school districts. The rate of teachers in Oklahoma rural schools reaching tenure is slightly lower than the rates for teachers in non-rural areas. Also, rural school districts in Oklahoma had consistently lower rates of success in recruiting teachers than non-rural school districts from 2006-07 to 2011-12.

This most recent study, published last week, examined data from the 2014-15 pilot implementation of the Texas Teacher Evaluation and Support System (T-TESS). In 2014-15 the Texas Education Agency piloted the T-TESS in 57 school districts. During the pilot year teacher overall ratings were based solely on rubric ratings on 16 dimensions across four domains.

The study examined the statistical properties of the T-TESS rubric to explore the extent to which it differentiates teachers on teaching quality and to investigate its internal consistency and efficiency. It also explored whether certain types of schools have teachers with higher or lower ratings. Using data from the pilot for more than 8,000 teachers, the study found that the rubric differentiates teacher effectiveness at the overall, domain, and dimension levels; domain and dimension ratings on the observation rubric are internally consistent; and the observation rubric is efficient, with each dimension making a unique contribution to a teacher’s overall rating. In addition, findings indicated that T-TESS rubric ratings varied slightly in relation to some school characteristics that were examined, such as socioeconomic status and percentage of English Language Learners. However, there is little indication that these characteristics introduced bias in the evaluators’ ratings.

2017-10-30

Sure, the edtech product is proven to work, but will it work in my district?

It’s a scenario not uncommon in your district administrators’ office. They’ve received sales pitches and demos of a slew of new education technology (edtech) products, each one accompanied with “evidence” of its general benefits for teachers and students. But underlying the administrator’s decision is a question often left unanswered: Will this work in our district?

In the conventional approach to research advocated, for example, by the U.S. Department of Education and the Every Student Succeeds Act (ESSA), the finding that is reported and used in the review of products is the overall average impact for any and all subgroups of students, teachers, or schools in the study sample. In our own research, we have repeatedly seen that who it works for and under what conditions can be more important than the average impact. There are products that are effective on average but don’t work for an important subgroup of students, or vice versa, work for some students but not all. Some examples:

  • A math product, while found to be effective overall, was effective for white students but ineffective for minority students. This effect would be relevant to any district wanting to close (rather than further widen) an achievement gap.
  • A product that did well on average performed very well in elementary grades but poorly in middle school. This has obvious relevance for a district, as well as for the provider who may modify its marketing target.
  • A teacher PD product greatly benefitted uncertified teachers but didn’t help the veteran teachers do any better than their peers using the conventional textbook. This product may be useful for new teachers but a poor choice for others.

As a research organization, we have been looking at ways to efficiently answer these kinds of questions for products. Especially now, with the evidence requirements built into ESSA, school leaders can ask the edtech salesperson: “Does your product have evidence that ESSA calls for?” They may well hear an affirmative answer supported by an executive summary of a recent study. But, there’s a fundamental problem with what ESSA is asking for. ESSA doesn’t ask for evidence that the product is likely to work in your specific district. This is not the fault of ESSA’s drafters. The problem is built into the conventional design of research on “what works”. The U.S. Department of Education’s What Works Clearinghouse (WWC) bases its evidence rating only on an average; if there are different results for different subgroups of students, that difference is not part of the rating. Since ESSA adopts the WWC approach, that’s the law of the land. Hence, your district’s most pressing question is left unanswered: will this work for a district like mine?

Recently, the Software & Information Industry Association, the primary trade association of the software industry, released a set of guidelines for research explaining to its member companies the importance of working with districts to conduct research that will meet the ESSA standards. As the lead author of this report, I can say it was our goal to foster an improved dialog between the schools and the providers about the evidence that should be available to support buying these products. As an addendum to the guidelines aimed at arming educators with ways to look at the evidence and questions to ask the edtech salesperson, here are three suggestions:

  1. It is better to have some information than no information. The fact that there’s research that found the product worked somewhere gives you a working hypothesis that it could be a better than average bet to try out in your district. In this respect, you can consider the WWC and newer sites such as Evidence for ESSA rating of the study as a screening tool—they will point you to valid studies about the product you’re interested in. But you should treat previous research as a working hypothesis rather than proof.
  2. Look at where the research evidence was collected. You’ll want to know whether the research sites and populations in the study resemble your local conditions. WWC has gone to considerable effort to code the research by the population in the study and provides a search tool so you can find studies conducted in districts like yours. And if you download and read the original report, it may tell you whether it will help reduce or increase an achievement gap of concern.
  3. Make a deal with the salesperson. In exchange for your help in organizing a pilot and allowing them to analyze your data, you get the product for a year at a steep discount and a good ongoing price if you decide to implement the product on a full scale. While you’re unlikely to get results from a pilot (e.g., based on spring testing) in time to support a decision, you can at least lower your cost for the materials, and you’ll help provide a neighboring district (with similar populations and conditions) with useful evidence to support a strong working hypothesis as to whether it is likely to work for them as well.
2017-10-15

Determining the Impact of CREATE on Math and ELA Achievement

Empirical Education is conducting the evaluation of Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) under an Investing in Innovation (i3) development grant awarded in 2014. The CREATE evaluation takes place in schools throughout the state of Georgia.

Approximately 40 residents from the Georgia State University (GSU) College of Education (COE) are participating in the CREATE teacher residency program. Using a quasi-experimental design, outcomes for these teachers and their students will be compared to those from a matched comparison group of close to 100 teachers who simultaneously enrolled in GSU COE but did not participate in CREATE. Implementation for cohort 1 started in 2015, and cohort 2 started in 2016. Confirmatory outcomes will be assessed in years 2 and 3 of both cohorts (2017 - 2019).

Confirmatory research questions we will be answering include:

What is the impact of one-year of exposure of students to a novice teacher in their second year of teacher residency in the CREATE program, compared to the Business as Usual GSU teacher credential program, on mathematics and ELA achievement of students in grades 4-8, as measured by the Georgia Milestones Assessment System?

What is the impact of CREATE on the quality of instructional strategies used by teachers, as measured by the Teacher Assessment of Performance Standards (TAPS) scores, at the end of the third year of residency, relative to the business as usual condition?

What is the impact of CREATE on the quality of the learning environment created by teachers, as measured by Teacher Assessment of Performance Standards (TAPS) scores, at the end of the third year of residency, relative to the business as usual condition?

Exploratory research questions will address additional teacher-level outcomes including retention, effectiveness, satisfaction, collaboration, and levels of stress in relationships with students and colleagues.

We plan to publish the results of this study in fall of 2019. Please visit the CREATE webpage to read the research report.

2017-06-06

IES Releases New Empirical Education Report on Educator Effectiveness

Just released by IES, our report examines the statistical properties of Arizona’s new multiple-measure teacher evaluation model. The study used data from the pilot in 2012-13 to explore the relationships among the system’s component measures (teacher observations, student academic progress, and stakeholder surveys). It also investigated how well the model differentiated between higher and lower performing teachers. Findings suggest that the model’s observation measure may be improved through further evaluator training and calibration, and that a single aggregated composite score may not adequately represent independent aspects of teacher performance.

The study was carried out in partnership with the Arizona Department of Education as part of our work with the Regional Education Laboratory (REL) West’s Educator Effectiveness Alliance, which includes Arizona, Utah, and Nevada Department of Education officials, as well as teacher union representatives, district administrators, and policymakers. While the analysis is specific to Arizona’s model, the study findings and methodology may be of interest to other state education agencies that are developing of implementing new multiple-measure evaluation systems. We have continued this work with additional analyses for alliance members and plan to provide additional capacity building during 2015.

2014-12-16

U.S. Department of Education Could Expand its Concept of Student Growth

The continuing debate about the use of student test scores as a part of teacher evaluation misses an essential point. A teacher’s influence on a student’s achievement does not end in spring when the student takes the state test (or is evaluated using any of the Student Learning Objectives methods). An inspiring teacher, or one that makes a student feel recognized, or one that digs a bit deeper into the subject matter, may be part of the reason that the student later graduates high school, gets into college, or pursues a STEM career. These are “student achievements,” but they are ones that show up years after a teacher had the student in her class. As a teacher is getting students to grapple with a new concept, the students may not demonstrate improvements on standardized tests that year. But the “value-added” by the teacher may show up in later years.

States and districts implementing educator evaluations as part of their NCLB waivers are very aware of the requirement that they must “use multiple valid measures in determining performance levels, including as a significant factor data on student growth …” Student growth is defined as change between points in time in achievement on assessments. Student growth defined in this way obscures a teacher’s contribution to a student’s later school career.

As a practical matter, it may seem obvious that for this year’s evaluation, we can’t use something that happens next year. But recent analyses of longitudinal data, reviewed in an excellent piece by Raudenbush show that it is possible to identify predictors of later student achievement associated with individual teacher practices and effectiveness. The widespread implementation of multiple-measure teacher evaluations is starting to accumulate just the longitudinal datasets needed to do these predictive analyses. On the basis of these analyses we may be able to validate many of the facets of teaching that we have found, in analyses of the MET data, to be unrelated to student growth as defined in the waiver requirements.

Insofar as we can identify, through classroom observations and surveys, practices and dispositions that are predictive of later student achievement such as college going, then we have validated those practices. Ultimately, we may be able to substitute classroom observations and surveys of students, peers, and parents for value-added modeling based on state tests and other ad hoc measures of student growth. We are not yet at that point, but the first step will be to recognize that a teacher’s influence on a student’s growth extends beyond the year she has the student in the class.

2014-08-30

Does 1 teacher = 1 number? Some Questions About the Research on Composite Measures of Teacher Effectiveness

We are all familiar with approaches to combining student growth metrics and other measures to generate a single measure that can be used to rate teachers for the purpose of personnel decisions. For example, as an alternative to using seniority as the basis for reducing the workforce, a school system may want to base such decisions—at least in part—on a ranking based on a number of measures of teacher effectiveness. One of the reports released January 8 by the Measures of Effective Teaching (MET) addressed approaches to creating a composite (i.e., a single number that averages various aspects of teacher performance) from multiple measures such as value-added modeling (VAM) scores, student surveys, and classroom observations. Working with the thousands of data points in the MET longitudinal database, the researchers were able to try out multiple statistical approaches to combining measures. The important recommendation from this research for practitioners is that, while there is no single best way to weight the various measures that are combined in the composite, balancing the weights more evenly tends to increase reliability.

While acknowledging the value of these analyses, we want to take a step back in this commentary. Here we ask whether agencies may sometimes be jumping to the conclusion that a composite is necessary when the individual measures (and even the components of these measures) may have greater utility than the composite for many purposes.

The basic premise behind creating a composite measure is the idea that there is an underlying characteristic that the composite can more or less accurately reflect. The criterion for a good composite is the extent to which the result accurately identifies a stable characteristic of the teacher’s effectiveness.

A problem with this basic premise is that in focusing on the common factor, the aspects of each measure that are unrelated to the common factor get left out—treated as noise in the statistical equation. But, what if observations and student surveys measure things that are unrelated to what the teacher’s students are able to achieve in a single year under her tutelage (the basis for a VAM score)? What if there are distinct domains of teacher expertise that have little relation to VAM scores? By definition, the multifaceted nature of teaching gets reduced to a single value in the composite.

This single value does have a use in decisions that require an unequivocal ranking of teachers, such as some personnel decisions. For most purposes, however, a multifaceted set of measures would be more useful. The single measure has little value for directing professional development, whereas the detailed output of the observation protocols are designed for just that. Consider a principal deciding which teachers to assign as mentors, or a district administrator deciding which teachers to move toward a principalship. Might it be useful, in such cases, to have several characteristics to represent different dimensions of abilities relevant to success in the particular roles?

Instead of collapsing the multitude of data points from achievement, surveys, and observations, consider an approach that makes maximum use of the data points to identify several distinct characteristics. In the usual method for constructing a composite (and in the MET research), the results for each measure (e.g., the survey or observation protocol) are first collapsed into a single number, and then these values are combined into the composite. This approach already obscures a large amount of information. The Tripod student survey provides scores on the seven Cs; an observation framework may have a dozen characteristics; and even VAM scores, usually thought of as a summary number, can be broken down (with some statistical limitations) into success with low-scoring vs. with high-scoring students (or any other demographic category of interest). Analyzing dozens of these data points for each teacher can potentially identify several distinct facets of a teacher’s overall ability. Not all facets will be strongly correlated with VAM scores but may be related to the teacher’s ability to inspire students in subsequent years to take more challenging courses, stay in school, and engage parents in ways that show up years later.

Creating a single composite measure of teaching has value for a range of administrative decisions. However, the mass of teacher data now being collected are only beginning to be tapped for improving teaching and developing schools as learning organizations.

2013-02-14

Can We Measure the Measures of Teaching Effectiveness?

Teacher evaluation has become the hot topic in education. State and local agencies are quickly implementing new programs spurred by federal initiatives and evidence that teacher effectiveness is a major contributor to student growth. The Chicago teachers’ strike brought out the deep divisions over the issue of evaluations. There, the focus was on the use of student achievement gains, or value-added. But the other side of evaluation—systematic classroom observations by administrators—is also raising interest. Teaching is a very complex skill, and the development of frameworks for describing and measuring its interlocking elements is an area of active and pressing research. The movement toward using observations as part of teacher evaluation is not without controversy. A recent OpEd in Education Week by Mike Schmoker criticizes the rapid implementation of what he considers overly complex evaluation templates “without any solid evidence that it promotes better teaching.”

There are researchers engaged in the careful study of evaluation systems, including the combination of value-added and observations. The Bill and Melinda Gates Foundation has funded a large team of researchers through its Measures of Effective Teaching (MET) project, which has already produced an array of reports for both academic and practitioner audiences (with more to come). But research can be ponderous, especially when the question is whether such systems can impact teacher effectiveness. A year ago, the Institute of Education Sciences (IES) awarded an $18 million contract to AIR to conduct a randomized experiment to measure the impact of a teacher and leader evaluation system on student achievement, classroom practices, and teacher and principal mobility. The experiment is scheduled to start this school year and results will likely start appearing by 2015. However, at the current rate of implementation by education agencies, most programs will be in full swing by then.

Empirical Education is currently involved in teacher evaluation through Observation Engine: our web-based tool that helps administrators make more reliable observations. See our story about our work with Tulsa Public Schools. This tool, along with our R&D on protocol validation, was initiated as part of the MET project. In our view, the complexity and time-consuming aspects of many of the observation systems that Schmoker criticizes arise from their intended use as supports for professional development. The initial motivation for developing observation frameworks was to provide better feedback and professional development for teachers. Their complexity is driven by the goal of providing detailed, specific feedback. Such systems can become cumbersome when applied to the goal of providing a single score for every teacher representing teaching quality that can be used administratively, for example, for personnel decisions. We suspect that a more streamlined and less labor-intensive evaluation approach could be used to identify the teachers in need of coaching and professional development. That subset of teachers would then receive the more resource-intensive evaluation and training services such as complex, detailed scales, interviews, and coaching sessions.

The other question Schmoker raises is: do these evaluation systems promote better teaching? While waiting for the IES study to be reported, some things can be done. First, look at correlations of the components of the observation rubrics with other measures of teaching such as value-added to student achievement (VAM) scores or student surveys. The idea is to see whether the behaviors valued and promoted by the rubrics are associated with improved achievement. The videos and data collected by the MET project are the basis for tools to do this (see earlier story on our Validation Engine.) But school systems can conduct the same analysis using their own student and teacher data. Second, use quasi-experimental methods to look at the changes in achievement related to the system’s local implementation of evaluation systems. In both cases, many school systems are already collecting very detailed data that can be used to test the validity and effectiveness of their locally adopted approaches.

2012-10-31
Archive