blog posts and news stories

Looking Back 35 Years to Learn about Local Experiments

With the growing interest among federal agencies in building local capacity for research, we took another look at an article by Lee Cronbach published in 1975. We found it has a lot to say about conducting local experiments and implications for generalizability. Cronbach worked for much of his career at Empirical’s neighbor, Stanford University, and his work has had a direct and indirect influence on our thinking. Some may interpret Cronbach’s work as stating that randomized trials of educational interventions have no value because of the complexity of interactions between subjects, contexts, and the experimental treatment. In any particular context, these interactions are infinitely complex, forming a “hall of mirrors” (as he famously put it, p. 119), making experimental results—which at most can address a small number of lower-order interactions—irrelevant. We don’t read it that way. Rather, we see powerful insights as well as cautions for conducting the kinds of field experiments that are beginning to show promise for providing educators with useful evidence.

We presented these ideas at the Society for Research on Educational Effectiveness conference in March, building the presentation around a set of memorable quotes from the 1975 article. Here we highlight some of the main ideas.

Quote #1: “When we give proper weight to local conditions, any generalization is a working hypothesis, not a conclusion…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (p. 125).

Practitioners are making decisions for their local jurisdiction. An experiment conducted elsewhere (including one conducted over many locales, where the results are averaged) provides a useful starting point, but not “proof” that the intervention will or will not work in the same way locally. Experiments give us a working hypothesis concerning an effect, but it has to be tested against local conditions at the appropriate scale of implementation. This brings to mind California’s experience with class size reduction following the famous experiment in Tennessee, where the working hypothesis corroborated by the experiment did not transfer to a different context. We also see the applicability of Cronbach’s ideas in the Investing in Innovation (i3) program, where initial evidence is being taken as a warrant to scale up an intervention, but where the grants included funding for research under new conditions in which implementation may head in unanticipated directions, leading to new effects.

Quote #2: “Instead of making generalization the ruling consideration in our research, I suggest that we reverse our priorities. An observer collecting data in one particular situation…will give attention to whatever variables were controlled, but he will give equally careful attention to uncontrolled conditions…. As results accumulate, a person who seeks understanding will do his best to trace how the uncontrolled factors could have caused local departures from the modal effect. That is, generalization comes late, and the exception is taken as seriously as the rule” (pp. 124-125).

Finding or even seeking out conditions that lead to variation in the treatment effect facilitates external validity, as we build an account of the variation. The fact that an estimate of average impact is not robust across conditions should not be seen as a threat to generalizability. We should spend time looking at the ways the intervention interacts with local characteristics, in order to determine which factors account for heterogeneity in the impact and which do not. Though this activity is exploratory and not necessarily anticipated in the design, it provides the basis for understanding how the treatment plays out, and why its effect may not be constant across settings. Over time, generalizations can emerge as we compile an account of the different ways in which the treatment is realized and the conditions that suppress or accentuate its effects.
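To make the idea concrete, here is a minimal sketch of a treatment-by-moderator (interaction) analysis. It is an illustration only, not code from any of our studies: the data file and column names (posttest, pretest, treatment, poverty_rate) are hypothetical stand-ins for a student-level dataset and a local condition of interest.

    # Minimal sketch of a moderator (interaction) analysis.
    # Assumes a hypothetical file with columns: posttest, pretest,
    # treatment (0/1), and a local condition such as school poverty rate.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("study_data.csv")  # hypothetical data file

    # The treatment-by-moderator term asks whether the impact differs
    # across local conditions instead of assuming one constant average effect.
    model = smf.ols("posttest ~ pretest + treatment * poverty_rate", data=df).fit()
    print(model.summary())  # inspect the treatment:poverty_rate coefficient

A sizable interaction coefficient would signal that the effect is accentuated or suppressed under particular local conditions, which is exactly the kind of account Cronbach urges us to build.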

Quote #3: “Generalizations decay” (p. 122).

In the social policy arena, and especially with the rapid development of technologies, we can’t expect interventions to stay constant. And we certainly can’t expect the contexts of implementation to be the same over many years. Quicker turnaround in our studies is therefore necessary, not just because decision-makers need to act, but because any finding may have a short shelf life.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116-127.

2011-03-21

Conference Season 2011

Empirical researchers will again be on the road this conference season, and we’ve included a few new conference stops. Come meet our researchers as we discuss our work at the following events. If you will be present at any of these, please get in touch so we can schedule a time to speak with you, or come by to see us at our presentations.

NCES-MIS

This year, the NCES-MIS “Deep in the Heart of Data” Conference will offer more than 80 presentations, demonstrations, and workshops conducted by information system practitioners from federal, state, and local K-12 agencies.

Come by and say hello to one of our research managers, Joseph Townsend, who will be running Empirical Education’s table display at the Hilton Hotel in Austin, Texas, February 23-25. Joe will be presenting interactive demonstrations of MeasureResults, which allows school district staff to conduct complete program evaluations online.

SREE

Attendees of this spring’s Society for Research on Educational Effectiveness (SREE) Conference, held in Washington, DC March 3-5, will have the opportunity to discuss questions of generalizability with Empirical Education’s Chief Scientist, Andrew Jaciw, and President, Denis Newman, at two poster sessions. The first poster, entitled External Validity in the Context of RCTs: Lessons from the Causal Explanatory Tradition, applies insights from Lee Cronbach to current RCT practices. In the second poster, The Use of Moderator Effects for Drawing Generalized Causal Inferences, Jaciw addresses issues in multi-site experiments. They look forward to discussing these posters both online at the conference website and in person.

AEFP

We are pleased to announce that we will have our first showing this year at the Association for Education Finance and Policy (AEFP) Annual Conference. Join us in the afternoon on Friday, March 25th at the Grand Hyatt in Seattle, WA as Empirical’s research scientist, Valeriy Lazarev, presents a poster on Cost-benefit analysis of educational innovation using growth measures of student achievement.

AERA

We will again have a strong showing at the 2011 American Educational Research Association (AERA) Conference. Join us in festive New Orleans, April 8-12 for the final results on the efficacy of the PCI Reading Program, our qualitative findings from the first year of formative research on our MeasureResults online program evaluation tool, and more.

View our AERA presentation schedule for more details and a complete list of our participants.

SIIA

This year’s SIIA Ed Tech Industry Summit will take place in gorgeous San Francisco, just 45 minutes north of Empirical Education’s headquarters in the Silicon Valley. We invite you to schedule a meeting with us at the Palace Hotel from May 22-24.

2011-02-18

Empirical Education Partners with Carnegie Learning on New Student Performance Guarantee

Schools looking to improve student Algebra and Geometry achievement have signed up for a guarantee from Carnegie Learning® that states that students using the company’s Cognitive Tutor programs will pass their math courses. Empirical Education is tasked with monitoring student performance in participating schools. Starting this school year, Carnegie Learning guarantees that students who take three complete and consecutive years of Carnegie Learning’s math courses will pass their math class in the third year. The guarantee applies to middle and high school students taking the Carnegie Learning Bridge to Algebra, Algebra, Algebra II, and Geometry courses.

In the coming weeks and months, Empirical will collect roster data, course grades, and assessment scores from schools, as well as usage data from Carnegie Learning’s math teaching software. These data will be combined to generate biannual reports that will provide schools with evidence they can use to improve implementation of the courses and raise student achievement.

Carnegie Learning’s guarantee is part of their School Improvement Grant support efforts. “Partnering with Empirical Education will allow us to get mid- and end-of-year research reports into the hands of our school partners,” says Steve Ritter, Co-Founder and Chief Scientist at Carnegie Learning. “It’s part of our continuous improvement cycle; we’re excited to see the progress districts committed to the turnaround and transformation process can make with these new, powerful tools.”

2010-11-30

Recognizing Success

When the Obama-Duncan administration approaches teacher evaluation, the emphasis is on recognizing success. We heard that clearly in Arne Duncan’s comments on the release of teacher value-added modeling (VAM) data for LA Unified by the LA Times. He’s quoted as saying, “What’s there to hide? In education, we’ve been scared to talk about success.” Since VAM is often thought of as a method for weeding out low performing teachers, Duncan’s statement referencing success casts the use of VAM in a more positive light. Therefore we want to raise the issue here: how do you know when you’ve found success? The general belief is that you’ll recognize it when you see it. But sorting through a multitude of variables is not a straightforward process, and that’s where research methods and statistical techniques can be useful. Below we illustrate how this plays out in teacher and in program evaluation.

As we report in our news story, Empirical is participating in the Gates Foundation project called Measures of Effective Teaching (MET). This project is known for its focus on value-added modeling (VAM) of teacher effectiveness. It is also known for having collected over 10,000 videos from over 2,500 teachers’ classrooms—an astounding accomplishment. Research partners from many top institutions hope to be able to identify the observable correlates for teachers whose students perform at high levels as well as for teachers whose students do not. (The MET project tested all the students with an “alternative assessment” in addition to using the conventional state achievement tests.) With this massive sample that includes both data about the students and videos of teachers, researchers can identify classroom practices that are consistently associated with student success. Empirical’s role in MET is to build a web-based tool that enables school system decision-makers to make use of the data to improve their own teacher evaluation processes. Thus they will be able to build on what’s been learned when conducting their own mini-studies aimed at improving their local observational evaluation methods.

When the MET project recently had its “leads” meeting in Washington DC, the assembled group of researchers, developers, school administrators, and union leaders were treated to an after-dinner speech and Q&A by Joanne Weiss. Joanne is now Arne Duncan’s chief of staff, after having directed the Race to the Top program (and before that was involved in many Silicon Valley educational innovations). The approach of the current administration to teacher evaluation—emphasizing that it is about recognizing success—carries over into program evaluation. This attitude was clear in Joanne’s presentation, in which she declared an intention to “shine a light on what is working.” The approach is part of their thinking about the reauthorization of ESEA, where more flexibility is given to local decision-makers to develop solutions, while the federal legislation is more about establishing achievement goals such as being the leader in college graduation.

Hand in hand with providing flexibility to find solutions, Joanne also spoke of the need to build “local capacity to identify and scale up effective programs.” We welcome the idea that school districts will be free to try out good ideas and identify those that work. This kind of cycle of continuous improvement is very different from the idea, incorporated in NCLB, that researchers will determine what works and disseminate these facts to the practitioners. Joanne spoke about continuous improvement, in the context of teachers and principals, where on a small scale it may be possible to recognize successful teachers and programs without research methodologies. While a teacher’s perception of student progress in the classroom may be aided by regular assessments, the determination of success seldom calls for research design. We advocate for a broader scope, and maintain that a cycle of continuous improvement is just as much needed at the district and state levels. At those levels, we are talking about identifying successful schools or successful programs where research and statistical techniques are needed to direct the light onto what is working. Building research capacity at the district and state level will be a necessary accompaniment to any plan to highlight successes. And, of course, research can’t be motivated purely by the desire to document the success of a program. We have to be equally willing to recognize failure. The administration will have to take seriously the local capacity building to achieve the hoped-for identification and scaling up of successful programs.

2010-11-18

Empirical Education Develops Web-Based Tool to Improve Teacher Evaluation

For school districts looking for ways to improve teacher observation methods, Empirical Education has begun development of a web-delivered tool that will provide a convenient way to validate their observational protocols and rubrics against measures of the teacher’s contribution to student academic growth.

Empirical Education is charged with developing a “validation engine” as part of the Measures of Effective Teaching (MET) project, funded by the Bill and Melinda Gates Foundation. As described on the project’s website, the tool will allow users to “view classroom observation videos, rate those videos and then receive a report that evaluates the predictive validity and rater consistency for the protocol.” The MET project has collected thousands of hours of video of classrooms as well as records of the characteristics and academic performance associated with the students in the class.

By watching and coding videos of a range of teachers, users will be able to verify whether or not their current teacher rating systems are identifying teaching behavior associated with higher achievement. The tool will allow users to review their own rating systems against a variety of MET project measures, and will give real-time feedback through an automated report generator.
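As a rough illustration of the two statistics such a report summarizes, the sketch below computes a predictive-validity correlation (do higher observation ratings go with higher value-added measures?) and a simple rater-consistency check. The file layout and column names are hypothetical, and this is not the engine’s actual code.

    import pandas as pd

    # Hypothetical ratings file: one row per rater per video, with the
    # teacher's value-added measure attached to each video.
    df = pd.read_csv("observation_ratings.csv")

    # Predictive validity: correlate the mean rating per video with value-added.
    by_video = df.groupby("video_id").agg(mean_rating=("rating", "mean"),
                                          value_added=("value_added", "first"))
    predictive_validity = by_video["mean_rating"].corr(by_video["value_added"])

    # Rater consistency: pairwise agreement between raters on the same videos.
    wide = df.pivot_table(index="video_id", columns="rater_id", values="rating")
    rater_consistency = wide.corr()

    print(predictive_validity)
    print(rater_consistency)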

Development of the validation engine builds on two years of MET project research, which included data from six school districts across the country and over 3,000 teachers. Researchers will now use the data to identify which aspects of teaching practice are leading indicators of student achievement. The engine is expected to undergo beta testing over the next few months, beginning with the National Math and Science Initiative.

Announcement of the new tool comes as alternative ways to measure the effectiveness of teachers are becoming a major issue in education and as federal, state, and local officials and teacher organizations look for research-based ways to identify effective teachers and improve student outcomes.

“At a time when schools are experiencing budget cuts, it is vital that school districts have ready access to research tools, so that they can make the most informed decisions,” says Denis Newman, President of Empirical Education. The validation engine will be part of a suite of web-based technology tools developed by the company, including MeasureResults, an online tool that allows districts to evaluate the effectiveness of the products and programs they use.

2010-11-17

Empirical Education at AERA 2011

Empirical is excited to announce that we will again have a strong showing at the 2011 American Educational Research Association (AERA) Conference. Join us in festive New Orleans, LA, April 8-12 for the final results on the efficacy of the PCI Reading Program, our findings from the first year of formative research on our MeasureResults program evaluation tool, and more. Visit our website in the coming months to view our AERA presentation schedule and details about our annual reception—we hope to see you there!

2010-11-15

2010-2011: The Year of the VAM

If you haven’t heard about Value-Added Modeling (VAM) in relation to the controversial teacher ratings in Los Angeles and subsequent brouhaha in the world of education, chances are that you’ll hear about it in the coming year.

VAM is a family of statistical techniques for estimating the contribution of a teacher or of a school to the academic growth of students. Recently, the LA Times obtained the longitudinal test score records for all the elementary school teachers and students in LA Unified and had a RAND economist (working as an independent consultant) run the calculations. The result was a “score” for all LAUSD elementary school teachers.
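For readers unfamiliar with the mechanics, the fragment below gives a deliberately simplified picture of what such a calculation can look like: regress this year’s scores on last year’s scores plus teacher indicators, and read each teacher’s estimated coefficient as a rough value-added score. The column names are hypothetical, and this is not the analysis the LA Times consultant ran, which was considerably more sophisticated.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical longitudinal file: one row per student with this year's
    # score, last year's score, and the student's teacher.
    df = pd.read_csv("student_scores.csv")

    # In this simplified version, teacher fixed effects, net of prior
    # achievement, stand in for each teacher's contribution to growth.
    model = smf.ols("score_this_year ~ score_last_year + C(teacher_id)",
                    data=df).fit()

    vam_scores = {name: coef for name, coef in model.params.items()
                  if name.startswith("C(teacher_id)")}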

Reactions to the idea that a teacher could be evaluated using a set of test scores—in this case from the California Standards Test—were swift and divisive. The concept was denounced by the teachers’ union, with the local leader calling for a boycott. Meanwhile, the US Secretary of Education, Arne Duncan, made headlines by commenting favorably on the idea. The LA Times quotes him as saying “What’s there to hide? In education, we’ve been scared to talk about success.”

There is a tangle of issues here, along with exaggerations, misunderstandings, and confusion between research techniques and policy decisions. This column will address some of the issues over the coming year. We also plan to announce some of our own contributions to the VAM field in the form of project news.

The major hot-button issues include appropriate usage (e.g., for part or all of the input to merit pay decisions) and technical failings (e.g., biases in the calculations). Of course, these two issues are often linked; for example, many argue that biases may make VAM unfair for individual merit pay. The recent brief from the Economic Policy Institute, authored by an impressive team of researchers (several of them our friends and mentors from neighboring Stanford), makes a well-reasoned case for not using VAM as the only input to high-stakes decisions. While their arguments are persuasive with respect to VAM as the lone criterion for awarding merit pay or firing individual teachers, we still see a broad range of uses for the technique, along with considerable challenges.

For today, let’s look at one issue that we find particularly interesting: How to handle teacher collaboration in a VAM framework. In a recent Education Week commentary, Kim Marshall argues that any use of test scores for merit pay is a losing proposition. One of the many reasons he cites is its potentially negative impact on collaboration.

A problem with an exercise like that conducted by the LA Times is that there are organizational arrangements that do not come into the calculations. For example, we find that team teaching within a grade at a school is very common. A teacher with an aptitude for teaching math may take another teacher’s students for a math period, while sending her own kids to the other teacher for reading. These informal arrangements are not part of the official school district roster. They can be recorded (with some effort) during the current year but are lost for prior years. Mentoring is a similar situation, wherein the value provided to the kids is distributed among members of their team of teachers. We don’t know how much difference collaborative or mentoring arrangements make to individual VAM scores, but one fear in using VAM in setting teacher salaries is that it will militate against productive collaborations and reduce overall achievement.

Some argue that, because VAM calculations do not properly measure or include important elements, VAM should be disqualified from playing any role in evaluation. We would argue that, although they are imperfect, VAM calculations can still be used as a component of an evaluation process. Moreover, continued improvements can be made in testing, in professional development, and in the VAM calculations themselves. In the case of collaboration, what is needed are ways that a principal can record and evaluate the collaborations and mentoring so that the information can be worked into the overall evaluation and even into the VAM calculation. In such an instance, it would be the principal at the school, not an administrator at the district central office, who can make the most productive use of the VAM calculations. With knowledge of the local conditions and potential for bias, the building leader may be in the best position to make personnel decisions.

VAM can also be an important research tool—using consistently high and/or low scores as a guide for observing classroom practices that are likely to be worth promoting through professional development or program implementations. We’ve seen VAM used this way, for example, by the research team at Wake County Public Schools in North Carolina in identifying strong and weak practices in several content areas. This is clearly a rich area for continued research.

The LA Times has helped to catapult the issue of VAM onto the national radar. It has also sparked a discussion of how school data can be used to support local decisions, which can’t be a bad thing.

2010-09-18

New Education Pilot Brings Apple’s iPad Into the Classroom

Above: Empirical Education President Denis Newman converses with Secretary Bonnie Reiss and author Dr. Edward Burger

They’re not contest winners, but today, dozens of lucky 8th grade Algebra 1 students enthusiastically received new iPad devices, as part of a pilot of the new technology.

California Secretary of Education Bonnie Reiss joined local officials, publishers, and researchers at Washington Middle School in Long Beach for the kick-off. Built around this pilot is a scientific study designed to test the effectiveness of a new iPad-delivered Algebra textbook. Over the course of the new school year, Empirical Education researchers will compare the effect of the interactive iPad-delivered textbook to that of its conventional paper counterpart.

The new Algebra I iPad Application is published by Houghton Mifflin Harcourt and features interactive lessons, videos, quizzes, problem solving, and more. While students have to flip pages in a traditional textbook to reveal answers and explanations, students using the iPad version will be able to view interactive explanations and study guides instantly by tapping on the screen. Researchers will be able to study data collected from usage logs to enhance their understanding of usage patterns.

Empirical Education is charged with conducting the study, which will incorporate the performance of over twelve hundred students from four school districts throughout California, including Long Beach, San Francisco, Riverside, and Fresno. Researchers will combine measures of math achievement and program implementation to estimate the new program’s advantage while accounting for the effects of teacher differences and other influences on implementation and student achievement. Each participating teacher has one randomly selected class using the iPads while the other classes continue with the text version of the same material.

Though the researchers haven’t come up with a way of dealing with jealousy from students who will not receive an iPad, they did come up with a fair way to choose the groups that would use the new high-tech program. Classes receiving iPads were determined by a random number generator.
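The procedure amounts to something like the sketch below, which picks one class at random for each teacher from a hypothetical roster; the study team’s actual routine may differ in its details.

    import random

    # Hypothetical roster mapping each teacher to their Algebra I class periods.
    roster = {
        "teacher_01": ["period_1", "period_3", "period_5"],
        "teacher_02": ["period_2", "period_4"],
    }

    random.seed(2010)  # fixing the seed makes the assignment reproducible
    assignments = {teacher: random.choice(periods)
                   for teacher, periods in roster.items()}
    # The selected class uses the iPad edition; the teacher's other classes
    # continue with the print version of the same material.
    print(assignments)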

2010-09-08

Empirical Education is Part of Winning i3 Team

Of the almost 1700 grant applications submitted to the federal Investing in Innovation (i3) fund, the U.S. Department of Education chose only 49 proposals for this round of funding. A proposal submitted by our colleagues at WestEd was the third highest rated. Empirical Education assisted in developing the evaluation plan for the project. The project (officially named “Scaling Up Content-Area Academic Literacy in High School English Language Arts, Science and History Classes for High Needs Students”) is based on the Reading Apprenticeship model of academic literacy instruction. The grant will span five years and total $22.6 million, including 20 percent in matching funds from the private sector. This collaborative effort is expected to include 2,800 teachers and more than 400,000 students in 300 schools across four states. The evaluation component, on which we will collaborate with researchers from the Academy for Educational Development, will combine a large-scale randomized controlled trial with extensive formative research for continuous improvement of the innovation as it scales up.

2010-08-16

REL West Releases Report of RCT on Problem-Based Economics Conducted with Empirical Ed Help

Three years ago, Empirical Education began assisting the Regional Educational Laboratory West (REL West) housed at WestEd in conducting a large-scale randomized experiment on the effectiveness of the Problem-Based Economics (PBE) curriculum.

Today, the Institute of Education Sciences released the final report indicating a significant impact of the program for students in 12th grade as measured by the Test of Economic Literacy. In addition to the primary focus on student achievement outcomes, the study examined changes in teachers’ content knowledge in economics, their pedagogical practices, and their satisfaction with the curriculum. The report, Effects of Problem Based Economics on High School Economics Instruction, can be found on the IES website.

Eighty Arizona and California school districts participated in the study, which encompassed 84 teachers and over 8,000 students. Empirical Education was responsible for major aspects of research operations, which involved collecting, tracking, scoring, and warehousing all data, including rosters and student records from the districts, as well as distributing the PBE curricular materials, assessments, and student and teacher surveys. To handle the high volume and multiple administrations of surveys and assessments, we created a detail-oriented operation, including schedules for following up on survey responses, and achieved response rates of over 95% for both teacher and student surveys. The experienced team of research managers, research assistants, and data warehouse engineers maintained a rigorous 3-day turnaround for gathering end-of-unit exams and sending score reports to each teacher. The complete, documented dataset was delivered to the researchers at WestEd as our contribution to this REL West achievement.

2010-07-30