The inline assembly implementation was badly in need of improvement.
- irq_disable() took 2 CPU cycles more than needed
- The current interrupt state was stored in a temporary register and
afterwards copied to the target register, rather than storing it in the
target register right away
- The lower bits of the state were cleared (as they have no meaning for the
interrupt status), but the API purposely never required such things from
implementations.
- irq_restore() took 5 CPU cycles. This was reduced to 3 CPU cycles (or 2 CPU
cycles in the best case)