This makes the handling of plain ASCII a tiny bit slower, but it
affects only the users of --constantshow without --minibar, so...
All other uses of mbstrlen() and collect_char() are not in speed-
critical code paths.
Since the previous commit, mbwidth() is used only to determine whether
a character is either double width or zero width. There is no need to
return the actual width of the character; a simple yes or no is enough.
Transforming mbwidth() into is_doublewidth() also allows streamlining
it and is_zerowidth() a bit, so that they become slightly faster.
The number of bytes in the character was determined twice: first in
mbwidth() and then in char_length(). Do it just once, in mbtowide().
Also, avoid calling is_cntrl_char(), because it does unneeded checks
when we already know that the high bit is set.
This duplicates some code, but advance_over() is called a lot, so it
is important that it is as fast as possible.
This shouldn't slow down plain ASCII, as the extra checks (use_utf8
and *string < 0xA0) are done only for non-ASCII (apart from DEL).
Calling wctomb() with NULL as the first parameter returns zero in a
UTF-8 locale, meaning that there is no state, so there is no point
in resetting it either.
This saves a function call, and the passing and checking of the
MAXCHARLEN parameter, and the checking whether wc is maybe NULL
(which for nano is never the case), and who knows what other
overheads mbtowc() has, and our workaround for glibc.
Code was written after looking at gnulib/lib/mbrtowc-impl-utf8.h.
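To illustrate the idea, here is a minimal sketch of decoding one UTF-8
sequence by hand instead of going through mbtowc(). The name
decode_utf8() and its signature are made up for this example and are
not nano's actual code. It assumes that the string is NUL-terminated,
so that a truncated sequence fails the continuation-byte test.

```c
#include <stdint.h>

/* Hypothetical helper: decode the UTF-8 sequence at s into *code.
 * Return the number of bytes used, or -1 when the bytes do not form
 * a valid sequence (*code is then meaningless).  Assumes that s is
 * NUL-terminated. */
int decode_utf8(const char *s, uint32_t *code)
{
	const unsigned char *p = (const unsigned char *)s;

	/* Plain ASCII needs no decoding. */
	if (p[0] < 0x80) {
		*code = p[0];
		return 1;
	}

	/* 0x80...0xC1 cannot start a sequence, and 0xF5 and higher
	 * would encode codes beyond U+10FFFF. */
	if (p[0] < 0xC2 || p[0] > 0xF4)
		return -1;

	/* Every continuation byte must look like 10xxxxxx. */
	if ((p[1] ^ 0x80) > 0x3F)
		return -1;

	if (p[0] < 0xE0) {
		*code = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] ^ 0x80);
		return 2;
	}

	if ((p[2] ^ 0x80) > 0x3F)
		return -1;

	if (p[0] < 0xF0) {
		*code = ((uint32_t)(p[0] & 0x0F) << 12) |
				((uint32_t)(p[1] ^ 0x80) << 6) | (uint32_t)(p[2] ^ 0x80);
		/* Reject overlong forms and the surrogates U+D800...U+DFFF. */
		if (*code < 0x800 || (*code >= 0xD800 && *code <= 0xDFFF))
			return -1;
		return 3;
	}

	if ((p[3] ^ 0x80) > 0x3F)
		return -1;

	*code = ((uint32_t)(p[0] & 0x07) << 18) |
			((uint32_t)(p[1] ^ 0x80) << 12) |
			((uint32_t)(p[2] ^ 0x80) << 6) | (uint32_t)(p[3] ^ 0x80);
	/* Reject overlong forms and anything beyond U+10FFFF. */
	if (*code < 0x10000 || *code > 0x10FFFF)
		return -1;
	return 4;
}
```

There is no locale lookup, no length parameter, and no NULL check in
this sketch, which is where the saving over mbtowc() would come from.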
Most implementations of mblen() do a call to mbtowc(), which is
a waste of time when all we want to know is the number of bytes
(and when we already know that we're using UTF-8 and that the
first byte is at least 0xC2).
(This also avoids burdening correct implementations with the
workaround that was needed only for glibc.)
Code was written after looking at gnulib/lib/mbrtowc-impl-utf8.h.
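As a rough sketch of how the byte count can be found without decoding
the code point at all: the function below looks only at the lead byte
and at the ranges of the following bytes. The name utf8_length() is
hypothetical, and the body is an illustration, not nano's code. It
assumes a NUL-terminated string and a locale already known to be UTF-8.

```c
/* Hypothetical helper: return the length in bytes of the UTF-8
 * sequence at s, or 1 when the bytes do not form a valid sequence
 * (so that an invalid byte gets handled on its own). */
int utf8_length(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;

	/* ASCII, stray continuation bytes, and the overlong starters
	 * 0xC0 and 0xC1 are all treated as single bytes. */
	if (p[0] < 0xC2)
		return 1;

	if ((p[1] ^ 0x80) > 0x3F)
		return 1;
	if (p[0] < 0xE0)
		return 2;

	if ((p[2] ^ 0x80) > 0x3F)
		return 1;
	if (p[0] < 0xF0) {
		/* Exclude overlong forms (E0 80...9F) and
		 * surrogates (ED A0...BF). */
		if ((p[0] == 0xE0 && p[1] < 0xA0) || (p[0] == 0xED && p[1] > 0x9F))
			return 1;
		return 3;
	}

	if ((p[3] ^ 0x80) > 0x3F)
		return 1;
	/* Exclude lead bytes above 0xF4, overlong forms (F0 80...8F),
	 * and codes beyond U+10FFFF (F4 90...BF). */
	if (p[0] > 0xF4 || (p[0] == 0xF0 && p[1] < 0x90) ||
						(p[0] == 0xF4 && p[1] > 0x8F))
		return 1;
	return 4;
}
```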
The mblen() and mbtowc() functions will happily return 4 or 5 or 6
for byte sequences that start with 0xF4 0x90 or higher. But those
sequences encode U+110000 or higher, which are not valid Unicode
code points. The libc implementations of FreeBSD, OpenBSD, and Alpine
correctly return -1 for such sequences. Make nano behave correctly
also when linked against glibc, so that invalid sequences are always
presented as a series of invalid bytes and never as a single invalid
code.
This fixes https://savannah.gnu.org/bugs/?60262.
Bug existed since before version 2.0.0.
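The arithmetic behind the 0xF4 0x90 boundary can be checked with a
small sketch. The helper raw_four_byte_code() is invented for this
illustration; it merely combines the payload bits of a four-byte
sequence and does no validation.

```c
#include <stdint.h>

/* Hypothetical helper: combine the payload bits of a four-byte UTF-8
 * sequence, without any validation: three bits from the lead byte
 * and six bits from each of the continuation bytes. */
uint32_t raw_four_byte_code(unsigned char b0, unsigned char b1,
							unsigned char b2, unsigned char b3)
{
	return ((uint32_t)(b0 & 0x07) << 18) | ((uint32_t)(b1 & 0x3F) << 12) |
			((uint32_t)(b2 & 0x3F) << 6) | (uint32_t)(b3 & 0x3F);
}
```

The highest valid sequence, F4 8F BF BF, yields U+10FFFF. The next
one, F4 90 80 80, yields U+110000, which is why a lead byte of 0xF4
followed by 0x90 or higher has to be refused.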
The call of this function in make_mbchar() does not add anything,
because wctomb() already returns -1 for codes U+D800 to U+DFFF,
and parse_verbatim_kbinput() already rejects anything that starts
with U+11.... or higher, so make_mbchar() is never called for codes
beyond U+10FFFF.
And the call in display_string() just needs to check for wc <= 0x10FFFF
because mbtowc() already returns -1 for codes U+D800 to U+DFFF.
That is, accept U+FDD0 to U+FDEF, and accept U+xxFFFE and U+xxFFFF
for xx from 00 to 10 hex, being the 66 reserved "non-characters".
It may not be wise of the user to input these "things" (by typing
their code after M-V), but the codes are valid Unicode code points
and should not be rejected.
See https://www.unicode.org/faq/private_use.html#nonchar8 et al.
This fixes https://savannah.gnu.org/bugs/?60263.
Bug existed since before version 2.0.0.
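The count of 66 can be verified with a short sketch. The function
names are made up for this example; the ranges are the ones given
above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: tell whether code is one of the reserved
 * "non-characters": U+FDD0...U+FDEF, or the last two codes of any
 * of the seventeen planes. */
bool is_noncharacter(uint32_t code)
{
	if (code > 0x10FFFF)
		return false;

	return (code >= 0xFDD0 && code <= 0xFDEF) || (code & 0xFFFE) == 0xFFFE;
}

/* Walk the whole code space and count the non-characters. */
int count_noncharacters(void)
{
	int count = 0;

	for (uint32_t code = 0; code <= 0x10FFFF; code++)
		if (is_noncharacter(code))
			count++;

	return count;
}
```

That is 32 codes in the U+FDD0 block plus two for each of the 17
planes, making 66.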
The combining characters (that are zero-width) start at U+0300.
After that it's pretty much chaos, width-wise.
The mbwidth() function is not called for control characters (whose
representation takes up two columns), as they are handled separately.
The calls of mbwidth() that *can* happen with a control character as
argument are only to determine whether the character is zero-width,
and then it doesn't matter whether the exact width is 1 or 2.
The first byte of a multi-byte UTF-8 sequence must be in the range
0xC2...0xFF. Any other byte cannot be a starter byte and can thus
immediately be treated as a single byte.
This makes the cursor move smoothly left and right -- instead of
"stuttering" when passing over a zero-width character.
Pressing <Delete> on a normal (spacing) character also deletes
any zero-width characters after it. But pressing <Backspace>
while the cursor is placed after a zero-width character deletes
just that zero-width character. The latter behavior allows
deleting and retyping just the combining diacritic of a character
instead of the whole character.
This addresses https://savannah.gnu.org/bugs/?50773.
Requested-by: Mike Frysinger <vapier@gentoo.org>
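The cursor movement described above could be sketched as follows.
This is a simplified illustration, not nano's code: it works on an
array of already decoded code points instead of on bytes, the names
step_right() and step_left() are invented, and is_zero_width() here
assumes that only U+0300...U+036F are zero-width.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: only the Combining Diacritical Marks
 * block (U+0300...U+036F) counts as zero-width. */
static bool is_zero_width(uint32_t code)
{
	return (code >= 0x0300 && code <= 0x036F);
}

/* Move right: step over one character, plus any zero-width
 * characters that follow it. */
size_t step_right(const uint32_t *text, size_t length, size_t pos)
{
	if (pos < length)
		pos++;
	while (pos < length && is_zero_width(text[pos]))
		pos++;

	return pos;
}

/* Move left: step back over any zero-width characters, onto the
 * spacing character that carries them.  Expects pos <= length. */
size_t step_left(const uint32_t *text, size_t pos)
{
	if (pos > 0)
		pos--;
	while (pos > 0 && is_zero_width(text[pos]))
		pos--;

	return pos;
}
```

With the text 'e', U+0301, 'x', one step right from position 0 lands
on position 2, and one step left from position 2 lands on position 0,
so the cursor never stops on the combining accent.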
Make case-insensitive searching in a UTF-8 locale eight times faster
when the actual characters involved are plain ASCII.
This makes us faster than 'less', and as fast as Vim and Emacs.
The disadvantage of this change is that searching for a string that
begins with a multibyte character is nearly ten times slower than
searching for one that begins with an ASCII character. This may be
unsettling when searching a huge file first for a simple ASCII string
and later for a UTF-8 one. Doing this second search, the user might
get impatient: "Why is it taking so long?"
(This patch fell through the cracks four years ago, when I worked on
the searching code. It sat in a branch on top of other changes that
I never applied because I made different improvements. The speedup
at the time, on that machine, was only around sixty percent, though.
But measuring it now again on the same machine, it clocks in at an
82 percent reduction with -O0 and an 87 percent reduction with -O2.)
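One way to get such a speedup is to compare single bytes, and skip
any multibyte decoding, for as long as the characters are ASCII. The
sketch below shows only that ASCII path. It is a guess at the general
idea and not the actual patch; the name find_caseless_ascii() is
hypothetical, and both strings are assumed to be plain ASCII.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: find needle in haystack, ignoring case, for
 * strings that consist of plain ASCII only.  Return a pointer to the
 * first match, or NULL when there is none. */
const char *find_caseless_ascii(const char *haystack, const char *needle)
{
	if (*needle == '\0')
		return haystack;

	/* Fold the first byte of the needle once, outside the loop. */
	int first = tolower((unsigned char)*needle);

	for (; *haystack != '\0'; haystack++) {
		/* Cheap test: skip positions whose first byte cannot match. */
		if (tolower((unsigned char)*haystack) != first)
			continue;

		/* Only now compare the remaining bytes.  The NUL at the end
		 * of haystack never equals a needle byte, so this stops
		 * before running off the end. */
		size_t i = 1;

		while (needle[i] != '\0' &&
				tolower((unsigned char)haystack[i]) ==
				tolower((unsigned char)needle[i]))
			i++;

		if (needle[i] == '\0')
			return haystack;
	}

	return NULL;
}
```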
Those casts are redundant, and sometimes ugly. And as the types of
variables are extremely unlikely to change any more at this point,
the protection they offer against miscompilations is moot.
Signed-off-by: Hussam al-Homsi <sawuare@gmail.com>
It should give the same result as 'wc -w' as long as the content
of 'wordchars' does not affect the counting.
This fixes https://savannah.gnu.org/bugs/?58123.
Bug existed since version 2.6.2, when the --wordchars option was
introduced in commit 6f12992c.
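For comparison, counting words the way 'wc -w' does it comes down to
counting runs of non-whitespace. A minimal sketch, with a made-up
function name, and ignoring multibyte whitespace and 'wordchars'
entirely:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: count the words in text, where a word is any
 * run of bytes that are not whitespace. */
size_t count_words(const char *text)
{
	size_t words = 0;
	bool in_word = false;

	for (; *text != '\0'; text++) {
		if (isspace((unsigned char)*text))
			in_word = false;
		else if (!in_word) {
			/* The first byte of a new run starts a new word. */
			in_word = true;
			words++;
		}
	}

	return words;
}
```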
Now all remaining calls of measured_copy() have a "+ 1" in their
second argument, and can thus be simplified. And each of those
calls is followed by terminating the string with a NUL byte, so
*that* can be pulled into the function.
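A sketch of what the simplified function could look like after this
change. The body is an assumption made for illustration: malloc()
stands in for whatever allocator nano uses, and the caller is
expected to pass a string that holds at least 'count' bytes.

```c
#include <stdlib.h>
#include <string.h>

/* Return an allocated, NUL-terminated copy of the first count bytes
 * of string, or NULL when out of memory.  The caller frees it. */
char *measured_copy(const char *string, size_t count)
{
	/* The "+ 1" for the terminator now lives here, not in callers. */
	char *thecopy = malloc(count + 1);

	if (thecopy == NULL)
		return NULL;

	memcpy(thecopy, string, count);
	thecopy[count] = '\0';

	return thecopy;
}
```

A caller then passes just the length, for example
measured_copy(word, length), and no longer has to terminate the
result itself.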
Using straightforward comparisons is clearer and faster and shorter.
Again, note that this does not filter out 0x7F (DEL). But that is
okay, as that code will never be returned from get_kbinput().
When dealing with a plain, seven-bit ASCII character, don't bother
calling is_cntrl_mbchar() but determine directly whether it is a
control character. Also reshuffle things so that we don't compare
charlen == 1 when we already know it is 1.
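For a byte that is known to be seven-bit ASCII, the direct test is
just two comparisons. A minimal sketch, with a hypothetical function
name:

```c
#include <stdbool.h>

/* Hypothetical helper: for a byte known to be seven-bit ASCII, tell
 * whether it is a control character (0x00...0x1F, or DEL). */
bool is_ascii_control(unsigned char c)
{
	return (c < 0x20 || c == 0x7F);
}
```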
After the previous change, all remaining calls of parse_mbchar() have
NULL as their third parameter. So, drop that parameter and remove the
chunk of code that handles it. Also rename the function, as there are
already too many functions that start with "parse".