When converting a string to lower case, the compiler is able to autovectorize the loop, so a simple byte-at-a-time implementation is also very fast, comparable to memcpy(). Case-insensitive comparisons are more difficult for the compiler to vectorize, so we convert eight bytes at a time using "SIMD within a register" (SWAR) tricks and compare the converted words. Experiments indicate it's best to stick to simple loops for shorter strings and for the remainder of long strings.
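
As an illustration of the approach described above, here is a minimal sketch in C. It is not the actual implementation: the names lower1(), tolower8(), ascii_lowercopy() and ascii_equal_nocase() are hypothetical, and the sketch assumes ASCII-only case folding in which bytes with the top bit set are left unchanged.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: locale-independent ASCII lower-casing of one byte. */
static inline uint8_t
lower1(uint8_t c) {
	return (uint8_t)(c + ((c >= 'A' && c <= 'Z') ? 0x20 : 0));
}

/* Simple loop that the compiler can usually autovectorize on its own. */
static inline void
ascii_lowercopy(uint8_t *dst, const uint8_t *src, size_t len) {
	for (size_t i = 0; i < len; i++) {
		dst[i] = lower1(src[i]);
	}
}

/*
 * SWAR lower-casing of eight bytes packed in one 64-bit word.
 * Each byte is handled independently, so byte order does not matter.
 */
static inline uint64_t
tolower8(uint64_t octets) {
	const uint64_t all_bytes = UINT64_C(0x0101010101010101);
	/* Clear the top bit of every byte so the additions cannot carry
	 * from one byte into the next. */
	uint64_t heptets = octets & (0x7F * all_bytes);
	/* Top bit of each byte is set when that byte is > 'Z'. */
	uint64_t is_gt_Z = heptets + (0x7F - 'Z') * all_bytes;
	/* Top bit of each byte is set when that byte is >= 'A'. */
	uint64_t is_ge_A = heptets + (0x80 - 'A') * all_bytes;
	/* Top bit of each byte is set when the original byte was ASCII. */
	uint64_t is_ascii = ~octets & (0x80 * all_bytes);
	/* 0x80 in every byte that holds an upper-case ASCII letter. */
	uint64_t is_upper = is_ascii & (is_ge_A ^ is_gt_Z);
	/* Move 0x80 down to 0x20 and set that bit to lower-case the letter. */
	return octets | (is_upper >> 2);
}

/*
 * Case-insensitive equality of two byte strings of the same length:
 * eight bytes at a time while possible, then a simple loop for the
 * remainder (and for strings shorter than eight bytes).
 */
static inline bool
ascii_equal_nocase(const uint8_t *a, const uint8_t *b, size_t len) {
	while (len >= 8) {
		uint64_t wa, wb;
		/* memcpy() avoids unaligned-access and aliasing problems;
		 * compilers turn it into a plain load. */
		memcpy(&wa, a, sizeof(wa));
		memcpy(&wb, b, sizeof(wb));
		if (tolower8(wa) != tolower8(wb)) {
			return false;
		}
		a += 8;
		b += 8;
		len -= 8;
	}
	while (len-- > 0) {
		if (lower1(*a++) != lower1(*b++)) {
			return false;
		}
	}
	return true;
}
```

In this sketch a string shorter than eight bytes never enters the word loop, which is one way to keep short strings on the simple path. Where the best cut-over point actually lies is a matter for measurement, as the text says.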