When converting a string to lower case, the compiler is able to
autovectorize nicely, so a nice simple implementation is also very
fast, comparable to memcpy().
Comparisons are more difficult for the compiler, so we convert eight
bytes at a time using "SIMD within a register" tricks. Experiments
indicate it's best to stick to simple loops for shorter strings and
the remainder of long strings.