The TRU Scoring Rubric

The TRU Scoring Rubric was developed as a research tool for the explicit purpose of testing, refining, and validating the TRU Framework. Once validated, the rubric could serve two additional functions: assessing research interventions and supporting professional development.

**Rubric Development**

The five dimensions of the TRU Framework emerged from the TRU research team’s distillation of the research literature. The team mined the literature for papers that identified factors that influenced student learning, listing hundreds of such factors. It then clustered them into families of related factors, which we called dimensions. The questions then were:

- Is this clustering coherent?
- Can it be operationalized? That is, can each dimension be characterized so that scorers can identify instances of it and score them reliably – could we construct a scoring rubric that met the standard tests of reliability?
- Will scores assigned for each dimension (and thus average scores across dimensions) correlate with student learning, as measured by robust tests of thinking and problem solving?

Over a period of years we refined the scoring rubric and its capsule descriptions until the three questions above could be answered in the affirmative. When that was achieved, we could state with confidence that the TRU Framework captured what mattered, and that increased scores in each dimension of the rubric corresponded to increased student learning.

The printed rubric assigns scores of 1, 2, and 3 (best) for each dimension, in a range of circumstances. There are sub-rubrics for Whole Class Activities (Launch, Teacher Exposition, and Whole Class Discussion), Small Group Work, Student Presentations, and Individual Work, as well as a summary rubric. The original work on TRU took place in the context of the Algebra Teaching Study, so the rubric elaborated on “what counts” in high school algebra, specifically with regard to meaningful “contextual algebraic tasks” – tasks that support the application of algebra in meaningful contexts, such as modeling or graphing “real world” phenomena.

**Rubric Use for Research**

Scoring, part 1: a 5-point scale. In practice, the TRU research team discovered that scores of 1, 2, and 3 were not enough to differentiate between instances of teaching: 1 seemed too harsh a score to assign, and 3 seemed “too good,” so the vast majority of scores assigned were 2’s. (There were numerous comments like “It didn’t really seem like a 2, but a 1 was too harsh, so I gave it a 2,” or “It was better than a 2 but it didn’t reach the level of a 3, so I scored it 2.”) As a result, we introduced scores of 1.5 and 2.5, with the obvious meanings.

Scoring, part 2: assigning scores to “episodes” of 10 minutes or less. A 50-minute period may contain something like 5 minutes of introduction and review, 15 minutes of student presentations of homework at the board, 20 minutes of whole class discussion, and 10 minutes of seatwork. Non-mathematical activities are not scored.

For purposes of detailed scoring, we parse each lesson into episodes of coherent activities, with each episode being up to 10 minutes in length. The 50-minute period described above would contain the following episodes:

- 5 minutes of whole class intro/review
- The first 10 minutes of student presentations
- The remaining 5 minutes of student presentations
- The first 10 minutes of whole class discussion
- The remaining 10 minutes of whole class discussion
- 10 minutes of seatwork.

Each would be scored with the relevant rubric, and a weighted average (weighted by time) would produce final scores.
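The time-weighted averaging described above can be sketched in a few lines. This is a minimal illustration, not the team's actual tooling; the episode breakdown follows the 50-minute lesson above, but the per-episode scores are invented for the example.

```python
# Hypothetical sketch of TRU episode scoring: a time-weighted average of
# per-episode scores on one dimension. Episode scores here are invented.

def weighted_lesson_score(episodes):
    """episodes: list of (minutes, score) pairs for one dimension.
    Returns the time-weighted average score for the lesson."""
    total_minutes = sum(m for m, _ in episodes)
    return sum(m * s for m, s in episodes) / total_minutes

# The 50-minute lesson parsed above, with illustrative scores:
episodes = [
    (5, 2.0),   # whole class intro/review
    (10, 2.5),  # student presentations, part 1
    (5, 2.5),   # student presentations, part 2
    (10, 3.0),  # whole class discussion, part 1
    (10, 3.0),  # whole class discussion, part 2
    (10, 1.5),  # seatwork
]
print(round(weighted_lesson_score(episodes), 2))  # → 2.45
```

In practice this computation would be repeated once per dimension, yielding a five-score profile for the lesson.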

Scoring, part 3: Achieving reliability. We used the following procedure for achieving reliability within the group. The team chose a new tape, and a team member broke the tape into episodes as suggested above. Individual team members scored the tape independently, and then the group convened to discuss scores. With a number of new members (say 4 out of 10 when a new semester began) there was significant variation in scoring, with many scores differing by 1 (e.g., some scores of 1.5 and some of 2.5) and some even differing by 1.5. The group discussed the reasons for assigning their scores, with senior members explaining the nuances that determined their scoring. Group members then independently re-scored the lesson. This time all scores were typically within 0.5 of each other.

This process was repeated with two new tapes. On the first there were still non-trivial differences – some scores differed by 1, though fewer than before. On the second tape, scores were quite close on the first pass. At that point we could have individuals score lessons (except for close analyses in published papers, where we always used at least two independent scorers).
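The convergence criterion described above can be made concrete. This is a hypothetical sketch, assuming "converged" means all raters' scores for an episode fall within 0.5 of one another; the sample scores are invented.

```python
# Hypothetical convergence check for group scoring: raters agree on an
# episode when every pair of scores differs by at most 0.5.

def converged(scores, tolerance=0.5):
    """True if all raters' scores fall within `tolerance` of each other."""
    return max(scores) - min(scores) <= tolerance

first_pass = [1.5, 2.0, 2.5, 3.0]   # new members: spread of 1.5
second_pass = [2.0, 2.0, 2.5, 2.5]  # after discussion: spread of 0.5

print(converged(first_pass))   # → False
print(converged(second_pass))  # → True
```

A scoring group would repeat the score–discuss–re-score cycle until checks like this pass across episodes and dimensions.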

Details of rubric development and scoring can be found in many of the team’s papers.

**Other groups’ use of the TRU Rubric for research.**

The TRU rubric has been used extensively by research groups around the globe. Two points are worth noting.

Reliability of scoring is easy to achieve. The procedures described above – independent scoring, discussion, and re-scoring of sample tapes until the group converges – consistently result in high levels of “local” reliability. That is, the scores assigned by an external group may not precisely match the TRU team’s scores, but they will be internally consistent, running either consistently lower or consistently higher.

Not surprisingly, additional detail requires additional measures. Many intervention studies are interested not only in general improvement but in the impact of focused interventions – e.g., the impact on students’ language use of an intervention focused on the role of language in instruction, or on students’ perceptions of feeling safe in the classroom, or… In these cases, TRU scoring provides critical background information about the classroom environment and student learning as a whole. More detailed information about the impact of targeted interventions is entirely consistent with TRU but must be obtained through additional measures. One exemplary study indicating such synergy is:

Prediger, S., & Neugebauer, P. (2021). Capturing teaching practices in language‑responsive mathematics classrooms: Extending the TRU framework “teaching for robust understanding” to L‑TRU. ZDM, 53, 289–304. https://doi.org/10.1007/s11858-020-01187-1

**Rubric Use for Professional Development**

Like any yardstick, a scoring rubric can be used to rap people on the knuckles or to measure growth. Our intention is to support growth! Any other use, e.g., for individual teacher evaluation, is contrary to the spirit of our work. Moreover, scoring one or two isolated instances of instruction runs the danger of being seriously unfair – every class day is different, with different opportunities in each dimension (including the randomness of students’ moods on any given day!). When we assessed classrooms for research purposes, we scored each classroom a minimum of five times over the course of a year.

The rubric is an appropriate diagnostic tool for assessing collective professional growth. For such use it would be used as though it were a research tool, sampling participants’ instruction before and after the professional development. The safest use is by independent evaluators, with teachers identified by code numbers rather than by name.

The best use of the rubric in coaching individual teachers is as a yardstick, in concert with tools that directly support improvement, such as the TRU Conversation Guide.

The TRU Scoring Rubric can be found here.

Next Page: Tools to Come