The combining characters (that are zero-width) start at U+0300.
After that it's pretty much chaos, width-wise.
The mbwidth() function is not called for control characters (whose
representation takes up two columns), as they are handled separately.
The calls of mbwidth() that *can* happen with a control character as
argument are only to determine whether the character is zero-width,
and then it doesn't matter whether the exact width is 1 or 2.
The first byte of a multi-byte UTF-8 sequence must be in the range
0xC2...0xFF. Any other byte cannot be a starter byte and can thus
immediately be treated as a single byte.
This makes the cursor move smoothly left and right -- instead of
"stuttering" when passing over a zero-width character.
Pressing <Delete> on a normal (spacing) character also deletes
any zero-width characters after it. But pressing <Backspace>
while the cursor is placed after a zero-width character, just
deletes that zero-width character. The latter behavior allows
deleting and retyping just the combining diacritic of a character
instead of the whole character.
This addresses https://savannah.gnu.org/bugs/?50773.
Requested-by: Mike Frysinger <vapier@gentoo.org>
Make case-insensitive searching in a UTF-8 locale eight times faster
when the actual characters involved are plain ASCII.
This makes us faster than 'less', and as fast as Vim and Emacs.
The disadvantage of this change is that searching for a string that
begins with a multibyte character is nearly ten times slower than
searching for one that begins with an ASCII character. This may be
unsettling when searching a huge file first for a simple ASCII string
and later for a UTF-8 one. Doing this second search, the user might
get impatient: "Why is it taking so long?"
(This patch fell through the cracks four years ago, when I worked on
the searching code. It sat in a branch on top of other changes that
I never applied because I made different improvements. The speedup
at the time, on that machine, was only around sixty percent, though.
But measuring it now again on the same machine, it clocks in at an
82 percent reduction with -O0 and an 87 percent reduction with -O2.)
Those casts are redundant, and sometimes ugly. And as the types of
variables are extremely unlikely to change any more at this point,
the protection they offer against miscompilations is moot.
Signed-off-by: Hussam al-Homsi <sawuare@gmail.com>
It should give the same result as 'wc -w' as long as the content
of 'wordchars' does not affect the counting.
This fixes https://savannah.gnu.org/bugs/?58123.
Bug existed since version 2.6.2, since the --wordchars option was
introduced in commit 6f12992c.
Now all remaining calls of measured_copy() have a "+ 1" in their
second argument, and can thus be simplified. And each of those
calls is followed by terminating the string with a NUL byte, so
thát can be pulled into the function.
Using straightforward comparisons is clearer and faster and shorter.
Again, note that this does not filter out 0x7F (DEL). But that is
okay, as that code will never be returned from get_kbinput().
When dealing with a plain, seven-bit ASCII character, don't bother
calling is_cntrl_mbchar() but determine directly whether it is a
control character. Also reshuffle things so that we don't compare
charlen == 1 when we already know it is 1.
After the previous change, all remaining calls of parse_mbchar() have
NULL as their third parameter. So, drop that parameter and remove the
chunk of code that handles it. Also rename the function, as there are
already too many functions that start with "parse".
In a non-UTF8 build, mbwidth() returns always 1, so it is pointless
to call that function and compare its result to zero then.
Also, don't bother special-casing the function for a non-UTF8 locale.
In addition, the function was used just once, had a weird return value,
and now some more code can be excluded from a non-UTF8 build.
Make use of the fact that any single-byte character always occupies
just one column, and call the costly mbtowc() and wcwidth() only for
characters that actually are multibyte.
Instead of calling in twenty places parse_mbchar(pointer, NULL, NULL),
use a simpler and faster char_length(pointer). This saves pushing two
unneeded parameters onto the stack, avoids two needless ifs, and elides
an intermediate variable.
Its main purpose will follow in a later commit: to speed up searching.