If an octave error occurs, we can be sure of two things:

- the computed pitch is wrong
- the reference pitch is half or double the computed pitch

Bagshaw assumes that we do not know whether the first computed F0 value is an octave error. Consequently, the algorithm does not rely on the initial F0 value; instead, it sorts all F0 values into sets. Whenever an octave error occurs, it switches to another set and proceeds. Once every F0 value has been assigned to a set, the largest set is identified. Bagshaw assumes that the majority cannot be wrong, so the F0 values in this set are taken to represent the true fundamental pitch. All F0 values in the other sets are octave errors and need correction. The correction factor can be derived directly from the index of the set. Here is an outline of the algorithm:

    set_index = 0
    true_pitch_set_index = 0
    octave_error_threshold = 0.75   /* 75 % */

    Put the first F0 value into Set(set_index)

    FOR all other F0 values of a voiced region
        /* change set if octave error */
        IF (current_F0_value > (preceding_F0_value * (1 + octave_error_threshold))) THEN
            /* current_value is too high, change set */
            set_index = set_index + 1
        ENDIF
        IF (current_F0_value < (preceding_F0_value * octave_error_threshold)) THEN
            /* current_value is too low, change set */
            set_index = set_index - 1
        ENDIF
        put current_F0_value into Set(set_index)
    ENDFOR

    Compute the set with the most items and assume that these values represent the true F0 values
    true_pitch_set_index = the index of the set with most items

    FOR all Sets
        set_index = index of this set
        correct the F0 values in each set by multiplying with 2^(true_pitch_set_index - set_index)
    ENDFOR

You can see that the de-step filter allows changes in the F0 contour, but only as long as each change from one value to the next stays below the octave error threshold. The algorithm prohibits octave jumps.
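
The outline can be turned into a short runnable sketch. The Python below is only one possible reading of the pseudocode, not Bagshaw's original implementation: the function name `destep`, its parameter names, and the sample contours are assumptions made for illustration. It expects the positive F0 values of a single voiced region; if two sets are equally large, the one encountered first wins, which is a choice the outline does not specify.

```python
from collections import Counter


def destep(f0_values, octave_error_threshold=0.75):
    """Correct octave errors in the F0 values of one voiced region.

    Hypothetical helper following the outline: each value gets a set
    index, the most common index is assumed to mark the true pitch,
    and all other values are scaled by a power of two.
    """
    if not f0_values:
        return []

    set_index = 0
    set_indices = [set_index]  # the first value goes into set 0

    # Compare every value with its predecessor and change set on a jump.
    for preceding, current in zip(f0_values, f0_values[1:]):
        if current > preceding * (1 + octave_error_threshold):
            set_index += 1  # current value is too high
        elif current < preceding * octave_error_threshold:
            set_index -= 1  # current value is too low
        set_indices.append(set_index)

    # The largest set is assumed to contain the true F0 values.
    true_pitch_set_index = Counter(set_indices).most_common(1)[0][0]

    # Scale each value by 2^(true_pitch_set_index - set_index).
    return [
        f0 * 2.0 ** (true_pitch_set_index - index)
        for f0, index in zip(f0_values, set_indices)
    ]


# A doubled run in the middle of the region is halved again.
print(destep([100, 102, 204, 208, 104, 106, 108]))
# A wrong first value is corrected too, because the majority decides.
print(destep([50, 100, 102, 104]))
```

In the second call the first value lands in set 0 and the remaining three in set 1, so set 1 is taken as the true pitch and the first value is doubled. This illustrates why the algorithm does not need to trust the initial F0 value.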

Compared with a median filter, the de-step filter differs in three ways:

- Median filters have a fixed size, whereas the de-step filter examines a whole voiced region. A median filter therefore works locally, whereas the de-step filter works globally.
- Median filters always smooth; the de-step filter does nothing if no octave errors occur. If octave errors do occur, the de-step filter corrects only the erroneous values and leaves neighbouring values untouched.
- The de-step filter does not permit jumps greater than the octave error threshold, whereas a median filter lets them pass if the region containing the jump is large enough. Whether this is a drawback of the median filter or of the de-step filter depends on your application. In (spontaneously) spoken language, an octave jump within a single voiced region is unlikely, so a pitch tracking algorithm benefits from this property of the de-step filter.
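
The median-filter side of this comparison can be illustrated with a small sketch. The helper `median_filter`, the window size of 3, the truncated windows at the edges, and the sample contour are all assumptions chosen for the example, not part of the original description.

```python
from statistics import median


def median_filter(values, size=3):
    """Sliding median over an odd-sized window (hypothetical helper).

    At the edges the window is simply truncated, which is one of
    several possible conventions.
    """
    half = size // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


# Three doubled values (204, 208, 212) sit inside a contour near 100 Hz.
contour = [100, 102, 204, 208, 212, 104, 106]
smoothed = median_filter(contour, size=3)
print(smoothed)
```

With a window of three, the doubled run is as long as the window, so the values around 204 to 208 remain in the output: the jump is large enough to pass. At the same time the correct value 104 after the run is replaced by 106, showing that the median filter also alters neighbouring values that were not octave errors.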

Matthias Nutt