blog posts and news stories

The Evaluation of CREATE Continues

Empirical Education began conducting the evaluation of Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) in 2015 under a subcontract with Atlanta Neighborhood Charter Schools (ANCS) as part of their Investing in Innovation (i3) Development grant. Since our last CREATE update, we’ve extended this work through the Supporting Effective Educator Development (SEED) Grant Program. The SEED grant provides continued funding for three more cohorts of participants and expands the research to include experienced educators (those not in the CREATE residency program) in CREATE schools. The grant was awarded to Georgia State University and includes partnerships with ANCS, Empirical Education (as the external evaluator), and local schools and districts.

Similar to the i3 work, we’re following a treatment and comparison group over the course of the three-year CREATE residency program and looking at impacts on teacher effectiveness, teacher retention, and student achievement. With the SEED project, we will also be able to follow Cohorts 3 and 4 for an additional 1-2 years following residency. Surveys will measure perceived levels of social capital, school climate and community, collaboration, resilience, and mindfulness, in addition to other topics. Recruitment for Cohort 4 began this past spring and continued through the summer, resulting in approximately 70 new participants.

One of the goals of the expanded CREATE programming is to support the effectiveness and social capital of experienced educators in CREATE schools. Any experienced educator in a CREATE school who attends CREATE professional learning activities will be invited to participate in the research study. Surveys will measure topics similar to those measured in the quasi-experiment, and we will conduct individual interviews with a sample of participants to gain an in-depth understanding of the participant experience.

We have completed our first year of experienced educator research and continue to recruit participants into the second year of the study. We currently have 88 participants and counting.

2018-10-03

The Rebel Alliance is Growing

The rebellion against the old NCLB way of doing efficacy research is gaining force. A growing community among edtech developers, funders, researchers, and school users has been meeting in an attempt to reach a consensus on an alternative built on ESSA.

This effort has been helped by IES’s openness about the directions it is currently pursuing. In fact, we are moving into a new phase marked by two-way communication with the regime. While the rebellion hasn’t yet handed over its lightsabers, it is encouraged by the level of interest from prominent researchers.

From these ongoing discussions, there have been some radical suggestions inching toward consensus. A basic idea now being questioned is this:

The difference between the average of the treatment group and the average of the control group is a valid measure of effectiveness.

There are two problems with this:

  1. In schools, there is no “placebo,” that is, an activity that looks like a useful program but is known to have zero effectiveness. Whatever is going on in the control schools or classes, or with the teachers and students in the control condition, has some usefulness of its own. The activities in the control classes or schools may be more useful than the activities being evaluated in the study, or less useful. The study may find that the “effectiveness” of the activities being studied is positive, negative, or too small to be discerned statistically. In any case, the size and direction of the effect are determined as much by what is being done in the control group as by what is being done in the treatment group.
  2. Few educational activities are equally useful for all teachers and students, and looking only at the average obscures the differences. For example, we ran a very large study for the U.S. Department of Education of a STEM program and found that, on average, the program was effective. What the department didn’t report was that it worked only for the white students, not the black students; the program widened, rather than narrowed, the existing achievement gap. If you are considering adopting this STEM program, the impact on different subgroups is relevant: a high-minority school district may want to avoid it. And to make the program better, the developers need to know where it works and where it doesn’t. Again, the average impact is not just uninformative; it can be misleading.

A solution to the overuse of the average difference from a single study is to conduct a lot more studies. The price ED paid for our large study could have paid for 30 studies of the kind we are now conducting of the same program in the same state, each completed in 10% of the time of the original study. If we had 10 different studies of each program, conducted in different school districts with different populations and levels of resources, the “average” across those studies would start to make sense. Importantly, the average across these 10 studies for each of the subgroups would give a valid picture of where, how, and with which students and teachers the program tends to work best. This kind of averaging is called meta-analysis; it combines the many small differences found across studies, building on the statistical power of each study to generate reliable findings.
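To make the arithmetic concrete, here is a minimal sketch of how a fixed-effect meta-analysis pools results across several small studies using inverse-variance weighting. The effect sizes and standard errors below are hypothetical, purely for illustration; the same calculation can be repeated within each subgroup to see where a program works best.

```python
# Minimal sketch: fixed-effect meta-analysis of one program's effect sizes
# across several small studies (hypothetical numbers, for illustration only).

effect_sizes = [0.12, 0.05, -0.02, 0.20, 0.08]   # standardized effects from 5 studies
std_errors   = [0.06, 0.08, 0.07, 0.10, 0.05]    # their standard errors

weights = [1 / se**2 for se in std_errors]        # inverse-variance weights
pooled = sum(w * es for w, es in zip(weights, effect_sizes)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(f"Pooled effect: {pooled:.3f} (SE {pooled_se:.3f})")
```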

If developers or publishers of the products being used in schools took advantage of their hundreds of implementations to gather data, and if schools were prepared to share student data for this research, we could have research findings that both help schools decide what will likely work for them and help developers improve their products.

2018-09-21

Which Came First: The Journal or the Conference?

You may have heard of APPAM, but do you really know what they do? They organize an annual conference? They publish a journal? Yes, they do all that and more!

APPAM stands for the Association for Public Policy Analysis and Management. APPAM is dedicated to improving public policy and management by fostering excellence in research, analysis, and education. The first APPAM Fall Research Conference occurred in 1979 in Chicago. The first issue of the Journal of Policy Analysis and Management appeared in 1981.

Why are we talking about APPAM now? While we’ve attended the APPAM conference multiple years in the past, the upcoming conference presents a unique opportunity for us. This year, our chief scientist, Andrew Jaciw, is acting as guest editor of a special issue of Evaluation Review on multi-armed randomized experiments. As part of this effort, and to encourage discussion of the topic, he proposed three panels that were accepted at APPAM.

Andrew will chair the first panel titled Information Benefits and Statistical Challenges of Complex Multi-Armed Trials: Innovative Designs for Nuanced Questions.

In the second panel, Andrew will be presenting a paper that he co-wrote with Senior Research Manager Thanh Nguyen titled Using Multi-Armed Experiments to Test “Improvement Versions” of Programs: When Beneficence Matters. This presentation will take place on Friday, November 9, 2018 at 9:30am (in Marriott Wardman Park, Marriott Balcony B - Mezz Level).

The third panel, which he also organized, brings together Larry Orr, Joe Newhouse, and Judith Gueron (with Becca Maynard as discussant) for an important retrospective. As pioneers of social science experiments, the contributors will share experiences and important lessons learned.

Some of these panelists will also be submitting their papers to the special issue of Evaluation Review. We will update this blog with a link to that journal issue once it has been published.

2018-08-21

New Multi-State RCT with Imagine Learning

Empirical Education is excited to announce a new study on the effectiveness of Imagine Math, an online supplemental math program that helps students build conceptual understanding, problem-solving skills, and a resilient attitude toward math. The program provides adaptive instruction so that students can work at their own pace and offers live support from certified math teachers as students work through the content. Imagine Math also includes diagnostic benchmarks that allow educators to track progress at the student, class, school, and district level.

The research questions to be answered by this study are:

  1. What is the impact of Imagine Math on student achievement in mathematics in grades 6–8?
  2. Is the impact of Imagine Math different for students with diverse characteristics, such as those starting with weak or strong content-area skills?
  3. Are differences in the extent of use of Imagine Math, such as the number of lessons completed, associated with differences in student outcomes?

The new study will use a randomized control trial (RCT) or randomized experiment in which two equivalent groups of students are formed through random assignment. The experiment will specifically use a within-teacher RCT design, with randomization taking place at the classroom level for eligible math classes in grades 6–8.

Eligible classes will be randomly assigned to either use or not use Imagine Math during the school year, with academic achievement compared at the end of the year, in order to determine the impact of the program on grade 6-8 mathematics achievement. In addition, Empirical Education will make use of Imagine Math’s usage data for potential analysis of the program’s impact on different subgroups of users.
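As a rough illustration of the within-teacher design (not the study’s actual assignment procedure), the sketch below randomly splits each teacher’s eligible classes between the two conditions. The teacher labels and class IDs are hypothetical.

```python
# Minimal sketch of within-teacher random assignment: for each teacher,
# eligible classes are split at random between the program and business-as-usual.
# The class roster below is hypothetical; a real study would pull it from district records.

import random

classes = [
    {"teacher": "T1", "class_id": "T1-P1"},
    {"teacher": "T1", "class_id": "T1-P2"},
    {"teacher": "T2", "class_id": "T2-P1"},
    {"teacher": "T2", "class_id": "T2-P2"},
    {"teacher": "T2", "class_id": "T2-P3"},
]

random.seed(2018)  # fixed seed so the assignment is reproducible

# Group classes by teacher
by_teacher = {}
for c in classes:
    by_teacher.setdefault(c["teacher"], []).append(c)

# Within each teacher, randomly assign roughly half of the classes to treatment
for teacher, cls in by_teacher.items():
    random.shuffle(cls)
    half = len(cls) // 2
    for i, c in enumerate(cls):
        c["condition"] = "treatment" if i < half else "control"

for c in classes:
    print(c["class_id"], c["condition"])
```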

This is Empirical Education’s first project with Imagine Learning, highlighting our extensive experience conducting large-scale, rigorous, experimental impact studies. The study is commissioned by Imagine Learning and will take place in multiple school districts and states across the country, including Hawaii, Alabama, Alaska, and Delaware.

2018-08-03

For Quasi-experiments on the Efficacy of Edtech Products, it is a Good Idea to Use Usage Data to Identify Who the Users Are

With edtech products, the usage data allows for precise measures of exposure and whether critical elements of the product were implemented. Providers often specify an amount of exposure or the kind of usage that is required to make a difference. Furthermore, educators often want to know whether the program has an effect when implemented as intended. Researchers can readily use data generated by the product (usage metrics) to identify compliant users, or to measure the kind and amount of implementation.

Since researchers generally track product implementation and statistical methods allow for adjustments for implementation differences, it is possible to estimate the impact on successful implementers, or technically, on a subset of study participants who were compliant with treatment. It is, however, very important that the criteria researchers use in setting a threshold be grounded in a model of how the program works. This will, for example, point to critical components that can be referred to in specifying compliance. Without a clear rationale for the threshold set in advance, the researcher may appear to be “fishing” for the amount of usage that produces an effect.
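For instance, a minimal sketch of this step might look like the following, where the compliance threshold is fixed in advance from the program’s logic model. The field names and the 30-lesson cutoff are hypothetical.

```python
# Minimal sketch: flag classes that met a pre-specified usage threshold.
# The threshold must be set in advance from the program's logic model,
# not chosen after looking at outcomes. Names and numbers are hypothetical.

MIN_LESSONS_COMPLETED = 30   # pre-registered compliance threshold

usage_records = [
    {"class_id": "A101", "lessons_completed": 42},
    {"class_id": "A102", "lessons_completed": 12},
    {"class_id": "B201", "lessons_completed": 31},
]

compliant = [r["class_id"] for r in usage_records
             if r["lessons_completed"] >= MIN_LESSONS_COMPLETED]
print("Classes meeting the implementation threshold:", compliant)
```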

Some researchers reject comparison studies in which identification of the treatment group occurs after product implementation has begun. This is based in part on the concern that the subset of users who comply with the suggested amount of usage will get more exposure to the program, and more exposure will result in a larger effect. That assumes, of course, that the product is effective; otherwise, the students and teachers will have been wasting their time and will likely perform worse than the comparison group.

There is also the concern that the “compliers” may differ from the non-compliers (and non-users) in some characteristic that isn’t measured, and that even after controlling for measurable variables (prior achievement, ethnicity, English proficiency, etc.), some unmeasured personal characteristic could result in an otherwise ineffective program becoming effective for that group. We reject this concern and take the position that a product’s effectiveness can be strengthened or weakened by many factors. A researcher conducting any matched comparison study can never be certain that there isn’t an unmeasured variable biasing the result. (That’s why the What Works Clearinghouse only accepts quasi-experiments “with reservations.”) However, we believe that as long as the QE controls for the major factors that are known to affect outcomes, the study can meet the Every Student Succeeds Act requirement that the researcher “controls for selection bias.”

With those caveats, we believe that a QE that identifies users by their compliance with a pre-specified level of usage is a good design. Studies that look at the measurable variables that modify the effectiveness of a product are not only useful for schools in answering their question, “Is the product likely to work in my school?” but also point the developer and product marketer to ways the product can be improved.

2018-07-27

New Project with ALSDE to Study AMSTI

Empirical Education is excited to announce a new study, commissioned by the Alabama legislature, of the Alabama Math, Science, and Technology Initiative (AMSTI). AMSTI is the Alabama State Department of Education’s initiative to improve math and science teaching statewide. The program, which started over 20 years ago, operates in over 900 schools across the state and has been validated by many external evaluators.

Researchers here at Empirical Education, directed by Chief Scientist Andrew Jaciw, published a study of AMSTI in 2012. The cluster randomized controlled trial (CRCT) involved 82 schools and roughly 700 teachers; it assessed the efficacy of AMSTI over a three-year period and showed an overall positive effect (Newman et al., 2012).

The new study that we are embarking on will use a quasi-experimental matched comparison group design, taking advantage of existing data available from the Alabama State Department of Education and the AMSTI program. By comparing schools using AMSTI to matched schools not using AMSTI, we can determine the impact of the program on math and science achievement for students in grades 3 through 8. Our report will also include differential impacts of the program on important student subgroups. Using Improvement Science principles, we will examine the school climate factors associated with greater or reduced program impact.
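As a rough sketch of how matched comparison schools might be selected (not the study’s actual matching procedure), the example below pairs each AMSTI school with its nearest non-AMSTI school on two covariates. The school IDs and values are hypothetical placeholders; a real analysis would typically use propensity scores and a richer set of covariates.

```python
# Minimal sketch of one-to-one nearest-neighbor matching of AMSTI schools to
# comparison schools on prior achievement and percent free/reduced lunch.
# School records and covariate values are hypothetical.

amsti = [{"id": "S1", "prior": 0.40, "frl": 0.62},
         {"id": "S2", "prior": 0.55, "frl": 0.35}]
non_amsti = [{"id": "C1", "prior": 0.42, "frl": 0.60},
             {"id": "C2", "prior": 0.50, "frl": 0.40},
             {"id": "C3", "prior": 0.70, "frl": 0.20}]

def distance(a, b):
    # Euclidean distance on the two covariates; a real analysis would use
    # propensity scores or Mahalanobis distance across many covariates.
    return ((a["prior"] - b["prior"]) ** 2 + (a["frl"] - b["frl"]) ** 2) ** 0.5

available = list(non_amsti)
for school in amsti:
    match = min(available, key=lambda c: distance(school, c))
    available.remove(match)          # match without replacement
    print(f"{school['id']} matched to {match['id']}")
```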

At the conclusion of the study, we will distribute the report to select committees of the Alabama state legislature, the Governor and the Alabama State Board of Education, and the Alabama State Department of Education. Empirical Education researchers will travel to Montgomery, AL to present the study findings and recommendations for improvement to the Alabama legislature.

2018-07-13

How Are Edtech Companies Thinking About Data and Research?

Forces of the rebellion were actively at work at SIIA’s Annual Conference last week in San Francisco. Snippets of conversation revealed a common theme of harnessing and leveraging data in order to better understand and serve the needs of schools and districts.

This theme was explored in depth during one panel session, “Efficacy and Research: Why It Matters So Much in the Education Market”, where edtech executives discussed the phases and roles of research as it relates to product improvement and marketing. Moderated by Pearson’s Gary Mainor, the session’s panelists included Andrew Coulson of the MIND Research Institute, Kelli Hill of Khan Academy, and Shawn Mahoney of McGraw-Hill Education.

Coulson, who was one of the contributing reviewers of our Research Guidelines, stated that all signs point to an “exponential increase” in the number of school district customers asking for usage data. He advised fellow edtech entrepreneurs to start paying attention to fine-grained usage data, as it is becoming necessary to provide this for customers. Panelist Kelli Hill agreed about the importance of making data visible, adding that Khan Academy proactively provides users with monthly usage reports.

In addition to providing helpful advice for edtech sales and marketing teams, the session also addressed a pervasive misconception that all it takes is “one good study” to validate and prove the effectiveness of a program. A company could commission one rigorous randomized trial that reports positive results and earns an endorsement from the What Works Clearinghouse, but that study might be outdated and, more importantly, not relevant to what schools and districts are looking for. Panelist Shawn Mahoney, Chief Academic Officer of McGraw-Hill Education, affirmed that school districts are interested in “super contextualized research” and look for recent and multiple studies when evaluating a product. Q&A discussions with the panelists revealed that school decision makers are quick to claim “what works for someone else might not work for us,” supporting the notion that conducting multiple research studies, reporting effects for various subgroups and populations of students, is much more useful and reflective of district needs.

SIIA’s gathering proved to be a fruitful event, allowing us to reconnect with old colleagues and meet new ones, and leaving us with a number of useful insights and optimistic possibilities for new directions in research.

2018-06-22

A Rebellion Against the Current Research Regime

Finally! There is a movement to make education research more relevant to educators and edtech providers alike.

At various conferences, we’ve been hearing about a rebellion against the “business as usual” of research, which fails to answer the question, “Will this product work in this particular school or community?” For educators, the motive is to find edtech products that best serve their students’ unique needs. For edtech vendors, it’s a question of whether research can be cost-effective while still identifying a product’s impact and helping to maximize product/market fit.

The “business as usual” approach against which folks are rebelling is that of the U.S. Education Department (ED). We’ll call it the regime. As established by the Education Sciences Reform Act of 2002 and the Institute of Education Sciences (IES), the regime anointed the randomized control trial (or RCT) as the gold standard for demonstrating that a product, program, or policy caused an outcome.

Let us illustrate two ways in which the regime fails edtech stakeholders.

First, the regime is concerned with the purity of the research design, but not whether a product is a good fit for a school given its population, resources, etc. For example, in an 80-school RCT that the Empirical team conducted under an IES contract on a statewide STEM program, we were required to report the average effect, which showed a small but significant improvement in math scores (Newman et al., 2012). The table on page 104 of the report shows that while the program improved math scores on average across all students, it didn’t improve math scores for minority students. The graph that we provide here illustrates the numbers from the table and was presented later at a research conference.

bar graph representing math, science, and reading scores for minority vs non-minority students

IES had reasons couched in experimental design for downplaying anything but the primary, average finding; however, this ignores the needs of educators with large minority student populations, as well as of edtech vendors that wish to better serve minority communities.

Our RCT was also expensive and took many years, which illustrates the second failing of the regime: conventional research is too slow for the fast-moving innovative edtech development cycles, as well as too expensive to conduct enough research to address the thousands of products out there.

These issues of irrelevance and impracticality were highlighted last year in an “academic symposium” of 275 researchers, edtech innovators, funders, and others convened by the organization now called Jefferson Education Exchange (JEX). A popular rallying cry coming out of the symposium is to eschew the regime’s brand of research and begin collecting product reviews from front-line educators. This would become a Consumer Reports for edtech. Factors associated with differences in implementation are cited as a major target for data collection. Bart Epstein, JEX’s CEO, points out: “Variability among and between school cultures, priorities, preferences, professional development, and technical factors tend to affect the outcomes associated with education technology. A district leader once put it to me this way: ‘a bad intervention implemented well can produce far better outcomes than a good intervention implemented poorly’.”

Here’s why the Consumer Reports idea won’t work. Good implementation of a program can translate into gains on outcomes of interest, such as improved achievement, reduction in discipline referrals, and retention of staff, but only if the program is effective. Evidence that the product caused a gain on the outcome of interest is needed; otherwise, all you are measuring is ease of implementation and student engagement. You wouldn’t know whether the teachers and students were wasting their time with a product that doesn’t work.

We at Empirical Education are joining the rebellion. The guidelines for research on edtech products that we recently prepared for the industry and made available here are a step toward showing an alternative to the regime while adopting important advances in the Every Student Succeeds Act (ESSA).

We share the basic concern that established ways of conducting research do not answer the basic question that educators and edtech providers have: “Is this product likely to work in this school?” But we have a different way of understanding the problem. From years of working on federal contracts (often as a small business subcontractor), we understand that ED cannot afford to oversee a large number of small contracts. When there is a policy or program to evaluate, they find it necessary to put out multi-million-dollar, multi-year contracts. These large contracts suit university researchers, who are not in a rush, and large research companies that have adjusted their overhead rates and staffing to perform on these contracts. As a consequence, the regime becomes focused on the perfection in the design, conduct, and reporting of the single study that is intended to give the product, program, or policy a thumbs-up or thumbs-down.

photo of students in a classroom on computers

There’s still a need for a causal research design that can link conditions such as resources, demographics, or teacher effectiveness with educational outcomes of interest. In research terminology, these conditions are called “moderators,” and in most causal study designs, their impact can be measured.

The rebellion should be driving an increase in the number of studies by lowering their cost and turn-around time. Given our recent experience with studies of edtech products, this reduction can reach a factor of 100: instead of one study that costs $3 million and takes 5 years, think in terms of a hundred studies that cost $30,000 each and are completed in less than a month. If, for each product, 5 to 10 studies are combined, they would provide enough variation, and enough students and schools, to detect differences among kinds of schools, kinds of students, and patterns of implementation, and so find where the product works best. As each new study is added, our understanding of how the product works and with whom improves.

It won’t be enough to have reviews of product implementation. We need an independent measure of whether—when implemented well—the intervention is capable of a positive outcome. We need to know that it can make (i.e., cause) a difference AND under what conditions. We don’t want to throw out research designs that can detect and measure effect sizes, but we should stop paying for studies that are slow and expensive.

Our guidelines for edtech research detail multiple ways that edtech providers can adapt research to better work for them, especially in the era of ESSA. Many of the key recommendations are consistent with the goals of the rebellion:

  • The usage data collected by edtech products from students and teachers gives researchers very precise information on how well the program was implemented in each school and class. It identifies the schools and classes where implementation met the threshold for which the product was designed. This is a key to lowering cost and turn-around time.
  • ESSA offers four levels of evidence that form a developmental sequence. The base level rests on existing learning science and provides a rationale for why a school should try the product. The next level looks for a correlation between an important element in the rationale (measured through usage of that part of the product) and a relevant outcome. This is accepted by ESSA as evidence of promise, informs the developers about how the product works, and helps product marketing teams get the right fit to the market.
  • The ESSA level that provides moderate evidence that the product caused the observed impact requires a comparison group matched to the students or schools identified as users. The regime requires researchers to report only the average difference between the user and comparison groups. Our guidelines insist that researchers also estimate the extent to which an intervention is differentially effective for different demographic categories or implementation conditions, as in the sketch below.
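A minimal sketch of that last recommendation: compute the treatment-control difference within each subgroup rather than only overall. The records and subgroup labels below are hypothetical, and a real analysis would use a regression with an interaction term and appropriate controls.

```python
# Minimal sketch: beyond the overall average difference, estimate the
# treatment-control difference separately for each subgroup.
# Records and labels are hypothetical, for illustration only.

records = [
    {"group": "treatment", "subgroup": "minority",     "score": 71},
    {"group": "treatment", "subgroup": "non-minority", "score": 82},
    {"group": "control",   "subgroup": "minority",     "score": 73},
    {"group": "control",   "subgroup": "non-minority", "score": 76},
    # ... many more student records in a real study
]

def mean_score(rows):
    return sum(r["score"] for r in rows) / len(rows)

for sub in ("minority", "non-minority"):
    t = [r for r in records if r["group"] == "treatment" and r["subgroup"] == sub]
    c = [r for r in records if r["group"] == "control" and r["subgroup"] == sub]
    print(sub, "impact estimate:", mean_score(t) - mean_score(c))
```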

From the point of view of the regime, nothing in these guidelines actually breaks the rules and regulations of ESSA’s evidence standards. Educators, developers, and researchers should feel empowered to collect data on implementation, calculate subgroup impacts, and use their own data to generate evidence sufficient for their own decisions.

A version of this article was published in the Edmarket Essentials magazine.

2018-05-09

AERA 2018 Recap: The Possibilities and Necessity of a Rigorous Education Research Community

This year’s AERA annual meeting, themed “The Dreams, Possibilities, and Necessity of Public Education,” was fittingly held in the city with the largest number of public school students in the country: New York. Against this radically diverse backdrop, presenters were encouraged to diversify both the format and topics of presentations in order to inspire thinking and “confront the struggles for public education.”

AERA’s sheer size may risk overwhelming its attendees, but in other ways, it came as a relief. At a time when educators and education remain under-resourced, it was heartening to be reminded that a large, vibrant community of dedicated and intelligent people exists to improve educational opportunities for all students.

One theme that particularly stood out is that researchers are finding increasingly creative ways to use existing usage data from education technology products to measure impact and implementation. This is a good thing when it comes to reducing the cost of research and making it more accessible to smaller businesses and nonprofits. For example, in a presentation on a software-based knowledge competition for nursing students, researchers used usage data to identify components of player styles and determine whether these styles had a significant effect on student performance. In our Edtech Research Guidelines, Empirical similarly recommends that edtech companies take advantage of their existing usage data to run impact and implementation analyses, without using more expensive data collection methods. This can help significantly reduce the cost of research studies—rather than one study that costs $3 million, companies can consider multiple lower-cost studies that leverage usage data and give the company a picture of how the product performs in a greater diversity of contexts.

Empirical staff themselves presented on a variety of topics, including quasi-experiments on edtech products; teacher recruitment, evaluation, and retention; and long-term impact evaluations. In all cases, Empirical reinforced its commitment to innovative, low-cost, and rigorous research. You can read more about the research projects we presented in our previous AERA post.

photo of Denis Newman presenting at AERA 2018

Finally, Empirical was delighted to co-host the Division H AERA Reception at the Supernova bar at Novotel Hotel. If you ever wondered whether Empirical knows how to throw a party, wonder no more! A few pictures from the event are below. View all of the pictures from our event on Facebook!


We had a great time and look forward to seeing everyone at the next AERA annual meeting!

2018-05-03

Updated Research Guidelines Will Improve Education Technology Products and Provide More Value to Schools

Recommendations include 16 best practices for the design, implementation, and reporting of Usable Evidence for Educators

Palo Alto, CA (April 25, 2018) – Empirical Education Inc. and the Education Technology Industry Network (ETIN) of SIIA released an important update to the “Guidelines for Conducting and Reporting Edtech Impact Research in U.S. K-12 Schools” today.

Authored by Empirical Education researchers, Drs. Denis Newman, Andrew Jaciw, and Valeriy Lazarev, the Guidelines detail 16 best practices for the design, implementation, and reporting of efficacy research of education technology. Recommendations range from completing the product’s logic model before fielding it to disseminating a study’s results in accessible and non-technical language.

The Guidelines were first introduced in July 2017 at ETIN’s Edtech Impact Symposium to address the changing demand for research. They served to address new challenges driven by the accelerated pace of edtech development and product releases, the movement of new software to the cloud, and the passage of the Every Student Succeeds Act (ESSA). The authors committed to making regular updates to keep pace with technical advances in edtech and research methods.

“Our collaboration with ETIN brought the right mix of practical expertise to this important document,” said Denis Newman, CEO of Empirical Education and lead author of the Guidelines. “ETIN provided valuable expertise in edtech marketing, policy, and development. With over a decade of experience evaluating policies, programs, and products for the U.S. Department of Education, major research organizations, and publishers, Empirical Education brought a deep understanding of how studies are traditionally performed and how they can be improved in the future. Our experience with our Evidence as a Service™ offering to investors and developers of edtech products also informed the guidelines.”

The current edition advocates for analysis of the usage patterns in the data collected routinely by edtech applications. These patterns help to identify classrooms and schools with adequate implementation, and they lead to lower-cost, faster-turnaround research. Rather than investing hundreds of thousands of dollars in a single large-scale study, developers should consider multiple small-scale studies. The authors point to the advantages of subgroup analysis for better understanding how and for whom the product works best, thus more directly answering common educator questions. Issues with quality of implementation are addressed in greater depth, and the visual design of the Guidelines has been refined for improved readability.

“These guidelines may spark a rebellion against the research business as usual, which doesn’t help educators know whether an edtech product will work for their specific populations. They also provide a basis for schools and developers to partner to make products better,” said Mitch Weisburgh, Managing Partner of Academic Business Advisors, LLC and President of ETIN, who has moderated panels and webinars on edtech research.

Empirical Education, in partnership with a variety of organizations, is conducting webinars to help explain the updates to the Guidelines, as well as to discuss the importance of these best practices in the age of ESSA. The updated Guidelines are available here: https://www.empiricaleducation.com/research-guidelines/.

2018-04-25