Empirical Education Inc.

Report of the Evaluation of iRAISE Released

Empirical Education Inc. has completed its evaluation of an online professional development program for Reading Apprenticeship. WestEd’s Strategic Literacy Initiative (SLI) was awarded a development grant under the Investing in Innovation (i3) program in 2012. iRAISE (internet-based Reading Apprenticeship Improving Science Education) is an online professional development program for high school science teachers. iRAISE trained more than 100 teachers in Michigan and Pennsylvania over the three years of the grant. Empirical’s randomized control trial measured the impact of the program on students, with special attention to differences in their incoming reading achievement levels.

The goal of iRAISE was to improve student achievement by training teachers in the use of Reading Apprenticeship, an instructional framework that describes the classroom in four interacting dimensions of learning: social, personal, cognitive, and knowledge-building. The inquiry-based professional development (PD) model included a week-long Foundations training in the summer; monthly synchronous group sessions and smaller personal learning communities; and asynchronous discussion groups designed to change teachers’ understanding of their role in adolescent literacy development and to build capacity for literacy instruction in the academic disciplines. iRAISE adapted a face-to-face version of Reading Apprenticeship professional development, which had been studied under an earlier i3 grant, Reading Apprenticeship Improving Secondary Education (RAISE), into a completely online course, creating a flexible, accessible platform.

To evaluate iRAISE, Empirical Education conducted an experiment in which 82 teachers across 27 schools were randomly assigned either to receive the iRAISE professional development during the 2014-15 school year or to continue with business as usual and receive the program one year later. Data collection included monthly teacher surveys that measured their use of several classroom instructional practices and a spring administration of an online literacy assessment, developed by Educational Testing Service, to measure student achievement in literacy. We found significant positive impacts of iRAISE on several of the classroom practice outcomes, including teachers’ explicit instruction in comprehension strategies, their use of metacognitive inquiry strategies, and their levels of confidence in literacy instruction. These results are consistent with the prior RAISE study and are an important replication of its findings, as they substantiate the success of SLI’s effort to develop a more accessible online version of its teacher PD. After one year of implementation, we did not find an overall effect of iRAISE on student literacy achievement. However, we did find that incoming reading achievement moderates the impact of iRAISE on general reading literacy: lower-scoring students benefit more. The success of iRAISE in adapting immersive, high-quality professional development to an online platform is promising for the field.
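To make the idea of moderation concrete, here is a minimal Python sketch. The post does not describe the report’s statistical model, so this assumes the simplest version: a regression of posttest on pretest fit separately for the treatment and control groups, which is equivalent to a single model with a treatment-by-pretest interaction. The function names and data are hypothetical, and a real analysis would also account for students being clustered within teachers and schools. A negative interaction means the estimated impact shrinks as incoming achievement rises, that is, lower-scoring students benefit more.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def impact_by_pretest(treatment, control):
    """Estimate how a program's impact varies with pretest score.

    `treatment` and `control` are lists of (pretest, posttest) pairs.
    Fitting one line per group is equivalent to the fully interacted model
    post = b0 + b1*T + b2*pre + b3*T*pre, so the slope difference is b3.
    Returns (interaction, impact_at), where impact_at(pre) is the estimated
    treatment-minus-control difference for a student with that pretest.
    """
    a_t, b_t = fit_line(*zip(*treatment))
    a_c, b_c = fit_line(*zip(*control))
    interaction = b_t - b_c

    def impact_at(pre):
        return (a_t - a_c) + interaction * pre

    return interaction, impact_at


# Hypothetical (pretest, posttest) scores, not study data.
control = [(10, 10), (20, 20), (30, 30)]
treatment = [(10, 14), (20, 23), (30, 32)]
interaction, impact_at = impact_by_pretest(treatment, control)
print(round(interaction, 2), round(impact_at(10), 2), round(impact_at(30), 2))
```

In this toy data the estimated impact is larger for a student entering with a pretest of 10 than for one entering with 30, which is the pattern the post describes.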

You can access the report and research summary from the study using the links below.
iRAISE research report
iRAISE research summary

2016-07-01

Posted by: Adam Schellinger

Tags: effectiveness, empirical education, evaluation, evidence, i3, iRAISE, Reading Apprenticeship, report, research and science

Empirical’s Impact as a Service Providing Insight to EdTech Companies

Education innovators and entrepreneurs have been receiving a boost of support from private equity investors. Currently, ASU GSV is holding its 2016 Summit to support new businesses whose goal is to make a difference in education. Reach Newschools Capital (Reach) is one such organization, providing early-stage funding as well as business acumen to entrepreneurs who are trying to solve the most challenging issues in K-12 education, often with the most challenged populations. Through Empirical Education, Reach is providing research services that examine the demographics of the constituents these education innovators hope to serve. Using company data from 20 of Reach’s portfolio companies, Empirical provides reports and easy-to-read graphs comparing customer demographic information to national average estimates.

The reports have been well received for providing the kind of information companies need to stay on mission: economically, through goods and services, and in terms of social impact.

“The EdTech industry is trying to change the perception that the latest and greatest technologies are only reaching the wealthiest students with the most resources. These reports are disproving this claim, showing that there are a large number of low-income, minority students utilizing these products,” said Aly Sharp, Product Manager for Empirical Education.

2016-04-19

Posted by: Marilyn Quinsaat

Tags: Demographic Information, Education, Empirical Education, Impact as a Service, Insight, K-12, Reach Newschools Capital, Research and Social Impact

Math in Focus Paper Published in JREE

Chief Scientist Andrew Jaciw’s paper, “Assessing Impacts of Math in Focus, a ‘Singapore Math’ Program,” was accepted by the Journal of Research on Educational Effectiveness. The paper reports the results of an RCT conducted in Clark County (Las Vegas, NV) by a team that included Whitney Hegseth, Li Lin, Megan Toby, Denis Newman, Boya Ma, and Jenna Zacamy. From the abstract, which is available online:

Twenty-two grade-level teams across twelve schools were randomized to the program or business as usual. Measures included indicators of fidelity to treatment, and student mathematics learning. Impacts on mathematics achievement ranged between .11 and .15 standard deviation units, with no differential impact based on level of pretest [or] minority status.

2016-03-30

Posted by: Robin Means

Tags: Andrew Jaciw, Clark County, journal, Journal of Research on Educational Effectiveness, JREE, Math in Focus, MIF and Singapore Math

Five-year evaluation of Reading Apprenticeship i3 implementation reported at SREE

Empirical Education has released two research reports on the scale-up and impact of Reading Apprenticeship, as implemented under one of the first cohorts of Investing in Innovation (i3) grants. The Reading Apprenticeship Improving Secondary Education (RAISE) project reached approximately 2,800 teachers in five states with a program providing teacher professional development in content literacy in three disciplines: science, history, and English language arts. RAISE supported Empirical Education and our partner, IMPAQ International, in evaluating the innovation through both a randomized control trial encompassing 42 schools and a systematic study of the program’s scale-up in 239 schools. The RCT found a significant impact on student achievement in science classes, consistent with prior studies. The mean impact across subjects, while positive, did not reach the .05 level of significance. The scale-up study found evidence that the strategy of building cross-disciplinary teacher teams within the school is associated with the growth and sustainability of the program. Both parts of the evaluation were presented at the annual conference of the Society for Research on Educational Effectiveness in March 2016 in Washington, DC. Cheri Fancsali (formerly of IMPAQ, now at the Research Alliance for NYC Schools) presented results of the RCT. Denis Newman (Empirical) presented a comparison of RAISE as instantiated in the RCT and scale-up contexts.

You can access the reports and research summaries from the studies using the links below.
RAISE RCT research report
RAISE RCT research summary
RAISE Scale-up research report
RAISE Scale-up research summary

2016-03-09

Posted by: Denis Newman

Tags: effectiveness, empirical education, evaluation, evidence, i3, RAISE, randomized control trial, RCT, Reading Apprenticeship, report, research, Scale-up, science and SREE

Evaluation Concludes Aspire’s PD Tools Show Promise to Impact Classroom Practice

Empirical Education Inc. has completed an independent evaluation of a set of tools and professional development opportunities developed and implemented by Aspire Public Schools under an Investing in Innovation (i3) grant. Aspire was awarded the development grant in the 2011 funding cycle and put the system, Transforming Teacher Talent (t3), into operation in 2013 in its 35 California schools. The goal of t3 was to improve teacher practice as measured by the Aspire Instructional Rubric (AIR) and thereby improve student outcomes on the California Standards Test (CST), the state assessment. Some of the t3 components connected the AIR scores from classroom observations to individualized professional development materials, building on tools from BloomBoard, Inc.

To evaluate t3, Empirical principal investigator Andrew Jaciw and his team designed the strongest feasible evaluation. Since it was not possible to split the schools into two groups by running two versions of Aspire’s technology infrastructure supporting t3, neither a randomized experiment nor another comparison group design was feasible. Working with the National Evaluation of i3 (NEi3) team, Empirical developed a correlational design comparing teacher AIR scores and student CST scores across two years: the 2012-13 school year and the first year of implementation, 2013-14. Because the state was transitioning to new Common Core tests, the evaluation was unable to collect student outcomes systematically. The AIR scores, however, provided evidence of substantial overall improvement, with an effect size of 0.581 standard deviations (p < .001). The evidence meets the standard for “evidence-based” as defined in the recently enacted Every Student Succeeds Act (ESSA), which requires, at the least, that the intervention “demonstrates a statistically significant effect on improving…relevant outcomes based on…promising evidence from at least 1 well designed and well-implemented correlational study with statistical controls for selection bias.” A demonstration of promise can help in obtaining federal and other funding.
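The 0.581 figure is a standardized effect size: a difference in mean scores expressed in standard deviation units. The post does not say exactly how the report computed it or which statistical controls were applied, so the following Python sketch shows only the basic idea, assuming a simple pooled-standard-deviation version (Cohen’s d) of the year-over-year change in rubric scores. The function name and the scores are hypothetical.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(before, after):
    """Cohen's d: (mean(after) - mean(before)) / pooled standard deviation.

    Uses sample variances (n - 1 denominators) pooled across both years.
    This sketch omits the statistical controls a real correlational
    design would include.
    """
    n1, n2 = len(before), len(after)
    pooled_var = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / (n1 + n2 - 2)
    return (mean(after) - mean(before)) / math.sqrt(pooled_var)


# Hypothetical rubric scores for five teachers in two years (not study data).
scores_2012_13 = [2.0, 2.5, 3.0, 2.5, 2.0]
scores_2013_14 = [2.5, 3.0, 3.5, 3.0, 2.5]
d = standardized_mean_difference(scores_2012_13, scores_2013_14)
print(round(d, 3))
```

Because the same teachers are observed in both years, an actual analysis might instead standardize by the baseline-year standard deviation or model the paired structure directly; the pooled version is used here only because it is the most familiar form.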

2016-03-07

Posted by: Robin Means

Tags: andrew jaciw, Aspire, bloomboard, correlational design, empirical education, evaluation, i3, pd, report, t3 and Transforming Teacher Talent

SREE Spring 2016 Conference Presentations

We are excited to be presenting on two topics at the annual Spring Conference of the Society for Research on Educational Effectiveness (SREE) next week. Our first presentation addresses the problem of using multiple pieces of evidence to support decisions. Our second presentation compares a program implemented under the constraints of an RCT with schools implementing the same program without those constraints. If you’re at SREE, we hope to run into you, either at one of these presentations (details below) or at one of yours.

Friday, March 4, 2016 from 3:30 - 5PM
Roosevelt (“TR”) - Ritz-Carlton Hotel, Ballroom Level

Session 6E: Evaluating Educational Policies and Programs
Evidence-Based Decision-Making and Continuous Improvement

Chair: Robin Wisniewski, RTI International

Does “What Works”, Work for Me?: Translating Causal Impact Findings from Multiple RCTs of a Program to Support Decision-Making
Andrew P. Jaciw, Denis Newman, Val Lazarev, & Boya Ma, Empirical Education

Saturday, March 5, 2016 from 10AM - 12PM
Culpeper - Fairmont Hotel, Ballroom Level

Session 8F: Evaluating Educational Policies and Programs & International Perspectives on Educational Effectiveness
The Challenge of Scale: Evidence from Charters, Vouchers, and i3

Chair: Ash Vasudeva, Bill & Melinda Gates Foundation

Comparing a Program Implemented under the Constraints of an RCT and in the Wild
Denis Newman, Valeriy Lazarev, & Jenna Zacamy, Empirical Education

2016-02-26

Posted by: Robin Means

Tags: andrew jaciw, conference, denis newman, education policy, empirical education, evidence, evidence based, i3, RCT, research, scale-up and sree

Learning Forward Presentation Highlights Fort Wayne Partnership

This past December, Teacher Evaluation Specialist K.C. MacQueen presented at the annual Learning Forward conference alongside Fort Wayne Community Schools’ (FWCS) Todd Cummings and Laura Cain and Learning Forward’s Kay Psencik. The presentation, titled “Implementing Inter-Rater Reliability in a Learning System,” highlighted how FWCS has used Calibration & Certification Engine (CCE), School Improvement Network’s branded version of Observation Engine™, to ensure equitable evaluation of teacher effectiveness. FWCS detailed the process it used to engage instructional leaders in developing a common rubric vocabulary around its existing teacher observation rubric. Although this step is uncommon and definitely added to the implementation timeline, FWCS prioritized the collaboration and found that it increased both inter-rater reliability and buy-in to the process. The ultimate goal is to help teachers improve classroom instruction and, in turn, produce greater student growth.
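The post does not specify how Observation Engine or CCE quantifies agreement between evaluators. As an illustration only, one widely used statistic for two raters scoring the same lessons on a rubric is Cohen’s kappa, which discounts the agreement expected by chance. The Python sketch below assumes two raters and categorical rubric levels; the ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical rubric levels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of
    agreement and p_e is the agreement expected if each rater assigned levels
    independently at their own observed rates.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[level] * counts_b[level] for level in counts_a) / n**2
    if p_e == 1:
        # Both raters used one identical level throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores from two evaluators rating six lessons on a 1-4 rubric.
evaluator_1 = [1, 2, 3, 3, 4, 2]
evaluator_2 = [1, 2, 3, 4, 4, 2]
print(round(cohens_kappa(evaluator_1, evaluator_2), 3))  # 0.778
```

Raw agreement in this example is 5 of 6 lessons (about 0.83); kappa is lower because some of that agreement would be expected by chance. Ordinal rubrics are sometimes analyzed with a weighted kappa instead, which gives partial credit for near misses.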

2016-01-19

Posted by: K.C. MacQueen

Tags: Calibration & Certification Engine, CCE, conference, Empirical Education, Fort Wayne Community Schools, FWCS, Inter-Rater Reliability, Learning Forward and Observation Engine

Feds Moving Toward a More Rational and Flexible Approach to Teacher Support and Evaluation

Congress is finally making progress on a bill to replace NCLB. Here’s an excerpt from a summary of the draft law.

TITLE II
Helps states support teachers: The bill provides resources to states and school districts to implement various activities to support teachers, principals, and other educators, including allowable uses of funds for high quality induction programs for new teachers, ongoing professional development opportunities for teachers, and programs to recruit new educators to the profession.
Ends federal mandates on evaluations, allows states to innovate: The bill allows, but does not require, states to develop and implement teacher evaluation systems. This bill eliminates the definition of a highly qualified teacher, which has proven onerous to states and school districts, and provides states with the opportunity to define this term.

This is very positive. It means teacher evaluation is no longer an Obama-imposed requirement, while still allowing states that want to do it (and there are quite a few of those) to use federal funds to support it. It removes the irrational requirement that “student growth” be a major component of these systems. This will lower the reflexive resistance from unions, because the purpose of evaluation can be more clearly associated with teacher support (for more on that argument, see the Real Clear Education piece). It will also encourage the use of observation and feedback from administrators and mentors. Removing the outmoded definition of “highly qualified teacher” opens up the possibility of wider use of research-based analyses of what is important to measure in effective teaching.

A summary is also provided by EdWeek. On a separate note, it says: “That new research and innovation program that some folks were describing as sort of a next generation ‘Investing in Innovation’ program made it into the bill. (Sens. Orrin Hatch, R-Utah, and Michael Bennet, D-Colo., are big fans, as is the administration.)”

2015-11-24

Posted by: Denis Newman

Tags: effective teaching, Investing in Innovation, NCLB, No Child Left Behind and teacher evaluation

Upcoming REL-SW Workshop Event

On November 19th, Erica Plut and Jenna Zacamy will join REL Southwest Alliance Liaison Haidee Williams in facilitating a workshop, “Identifying Practices to Engage Native American Indian Families in Students’ Academic and Career Aspirations.” The workshop is being offered to members of the Oklahoma Rural School Research Alliance and their colleagues and will take place in Norman, Oklahoma. The goals of the workshop are:

To increase alliance members’ knowledge and understanding of the research literature addressing promising practices to engage Native American Indian families in students’ academic and career aspirations
To provide an opportunity to use the research literature to inform the refinement or development of family and community engagement programs or initiatives that are focused on students’ academic and career aspirations

You can find more information about this event on the IES website.

2015-11-11

Posted by: Robin Means

Tags: community engagement, Oklahoma, Oklahoma Rural Schools Research Alliance, REL Southwest, research literature and workshop

Unintended Consequences of Using Student Test Scores to Evaluate Teachers

There has been a powerful misconception driving policy in education. It’s a case where theory was inappropriately applied to practice. The misconception has had unintended consequences. It is helping to lead large numbers of parents to opt out of testing and could very well weaken the case in Congress for accountability as ESEA is reauthorized.

The idea that we can use student test scores as one of the measures in evaluating teachers came into vogue with Race to the Top. As a result of that and related federal policies, 38 states now include measures of student growth in teacher evaluations.

This was a conceptual advance over the NCLB definition of teacher quality in terms of preparation and experience. The focus on test scores was also a brilliant political move. The simple qualification for Race to the Top funding, a linkage between teacher and student data, moved state legislatures to adopt policies calling for more rigorous teacher evaluations, even though it did not fund states to implement those policies. The simplicity of pointing to student achievement as the benchmark for evaluating teachers seemed incontrovertible.

It also had a scientific pedigree. Solid work had been accomplished by economists developing value-added modeling (VAM) to estimate a teacher’s contribution to student achievement. Hanushek et al.’s analysis is often cited as the basis for the now widely accepted view that teachers make the single largest contribution to student growth. The Bill and Melinda Gates Foundation invested heavily in its Measures of Effective Teaching (MET) project, which put the econometric calculation of teachers’ contribution to student achievement at the center of multiple measures.
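To show what a value-added estimate is in its simplest form, here is a minimal Python sketch. It is not the specification used in MET or in any study mentioned here; operational VAMs add student and classroom covariates, multiple years of data, and shrinkage. The sketch assumes only one prior-year score per student: it predicts each student’s current score from that prior score and then averages the residuals (actual minus predicted) across each teacher’s students. The function names and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def simple_value_added(records):
    """records: list of (teacher_id, prior_score, current_score) tuples.

    Returns a dict mapping each teacher to the mean residual of their
    students, i.e., how far above or below prediction they scored on average.
    """
    a, b = fit_line([r[1] for r in records], [r[2] for r in records])
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        residuals[teacher].append(current - (a + b * prior))
    return {teacher: mean(values) for teacher, values in residuals.items()}


# Hypothetical scores, not real data.
records = [("A", 10, 12), ("A", 20, 22), ("B", 10, 10), ("B", 20, 20)]
print(simple_value_added(records))  # {'A': 1.0, 'B': -1.0}
```

The output says only that teacher A’s students scored one point above prediction and teacher B’s one point below; nothing in the calculation indicates which classroom practices produced the difference.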

Academic debates about VAM remain intense, concerning both the most productive statistical specification and the evidence for causal inference. Perhaps the most exciting area of research is the analysis of longitudinal datasets showing that students who have teachers with high VAM scores continue to benefit even into adulthood and career, not so much in their test scores as in their higher earnings, lower likelihood of having children as teenagers, and other outcomes. With so much solid scientific work going on, what is the problem with applying theory to practice? While work on VAMs has produced important findings and productive research techniques, there are four important problems in applying these scientifically based techniques to teacher evaluation.

First, and this is the thing that should have been obvious from the start, most teachers teach in grades or subjects where no standardized tests are given. If you’re conducting research, there is a wealth of data for math and reading in grades three through eight. However, if you’re a middle-school principal and there are standardized tests for only 20% of your teachers, you will have a problem using test scores for evaluation.

Nevertheless, to receive a waiver from some of the requirements of NCLB, states were required by federal policy to institute teacher evaluation systems that use student growth as a major factor. To fill the gap in test scores, a few districts purchased or developed tests for every subject taught. A more widespread practice is the use of Student Learning Objectives (SLOs). Unfortunately, while SLOs may provide an excellent process for reflection and goal setting between the principal and teacher, they lack the psychometric properties of VAMs, which allow administrators to rank a teacher objectively in relation to other teachers in the district. As the Mathematica team observed, “SLOs are designed to vary not only by grade and subject but also across teachers within a grade and subject.” By contrast, academic research on VAM gave educators and policy makers the impression that a single measure of student growth could be used for teacher evaluation across grades and subjects. It was a misconception unfortunately promoted by many VAM researchers who may have been unaware that the technique could be applied to only a small portion of teachers.

There are three additional reasons that test scores are not useful for teacher evaluation.

The second reason is that VAMs and other measures of student growth give no indication of how a teacher can improve. If the purpose of teacher evaluation is to inform personnel decisions such as terminations, salary increases, or bonuses, then, at least for reading and math teachers, VAM scores would be useful. But we are seeing a widespread orientation toward using evaluations to inform professional development. Other kinds of measures, most obviously classroom observations conducted by a mentor or administrator and combined with feedback and guidance, provide a more direct mapping to where the teacher needs to improve. The observer-teacher interactions within an established framework also allow appropriate managerial discretion in translating the evaluation into personnel decisions. Observation frameworks not only break the observation into specific aspects of practice but also provide a rubric for scoring each aspect at four or five defined levels. A teacher can view the training materials used to calibrate evaluators to see what the next level looks like. VAM scores, in contrast, are opaque.

Third, test scores are associated with a narrow range of classroom practice. My colleague, Val Lazarev, and I found an interesting result from a factor analysis of the data collected in the MET project. MET collected classroom videos from thousands of teachers, which were then coded using a number of frameworks. The students were tested in reading and/or math using an assessment more focused on problem solving and constructed-response items than the usual state test. Our analysis showed that a teacher’s VAM score is more closely associated with the framework elements related to classroom and behavior management (i.e., keeping order in the classroom) than with the more refined aspects of dialog with students. Keeping the classroom under control is a fundamental ability associated with good teaching, but it does not completely encompass what evaluators are looking for. Test scores, as the benchmark measure for effective teaching, may not be capturing many important elements.

Fourth, achievement test scores (and associated VAMs) reflect only what teachers can accomplish in raising test scores between the time students arrive in their classes in the fall and the time they take the standardized test in the spring. If you ask people about their most influential teacher, they talk about being inspired to take up a particular career or being kept in school. These are results that are revealed in following years or even decades. A teacher who gets a student to start seeing math in a new way may not get immediate results on the spring test but may get the student to enroll in a more challenging course the next year. A teacher who makes a student feel at home in class may be an important part of the student not dropping out two years later. Whether or not teachers can cause these results is speculative. But the characteristics of warm, engaging, and inspiring teaching can be observed. We now have analytic tools and longitudinal datasets that can begin to reveal the association between being in a teacher’s class and the probability of a student graduating, getting into college, and pursuing a productive career. With records of systematic classroom observations, we may be able, in the future, to associate teaching practices with benchmarks that are more meaningful than the spring test score.

The policy-makers’ dream of an algorithm for translating test scores into teacher salary levels is a fallacy. Even weaker provisions, such as the vague requirement that student growth be an important element among multiple measures in teacher evaluations, have led to a profusion of methods of questionable utility for setting individual goals for teachers. But the insistence on using annual student achievement as the benchmark has led to more serious, perhaps unintended, consequences.

Teacher unions have had good reason to object to using test scores for evaluations. Teacher opposition to this misuse of test scores has reinforced a negative perception of tests as something that teachers oppose in general. The introduction of the new Common Core tests might have been welcomed by the teaching profession as a stronger alignment of the test with the widely shared belief about what is important for students to learn. But the change was opposed by the profession largely because it would be unfair to evaluate teachers on the basis of a test they had no experience preparing students for. Reducing the teaching profession’s opposition to testing may help reduce the clamor of the opt-out movement and keep the schools on the path of continuous improvement of student assessment.

We can return to recognizing that testing has value for teachers as formative assessment. For the larger community, it has value as assurance that schools and districts are maintaining standards and, most important as Congress considers reauthorizing NCLB, that they are not failing to educate the subgroups of students who have the most need.

A final note. For purposes of program and policy evaluation, for understanding the elements of effective teaching, and for longitudinal tracking of the effect on students of school experiences, standardized testing is essential. Research on value-added modeling must continue and expand beyond tests to measure the effect of teachers on preparing students for “college and career”. Removing individual teacher evaluation from the equation will be a positive step toward having the data needed for evidence-based decisions.

An abbreviated version of this blog post can be found on Real Clear Education.

2015-09-10

Posted by: Denis Newman

Tags: analytic tools, classroom observations, Common Core, education policy, effective teaching, ESEA, evidence-based, longitudinal datasets, MET project, NCLB, observation framework, policy evaluation, program evaluation, Student Learning Objectives, teacher evaluation, value-added modeling and VAM
