This result is a bit unexpected for me. I assumed that the following instruction sequence would take fewer than 32 CPU cycles:
a: mov r3,(r0)+
sob r1,a
However, it may take about 40. So I made some changes to FILLV and have attached the diff. It will be interesting to see whether this variant shows...