315 Commits (5c8de3e39f77cca37e75eea3b09550e3bad687bb)

Author SHA1 Message Date
Benno Schulenberg 5512c63bdd copyright: update to the current year for significantly changed files 2021-09-24 11:01:41 +02:00
Benno Schulenberg 3c35538e8b tweaks: add Schiermonnikoog to the list of friendly islands
(The commit message is a joke, of course.  Instead, this commit just
removes some unneeded comments and corrects one bit of whitespace.)
2021-06-16 11:19:23 +02:00
Benno Schulenberg 30bafc70cc tweaks: prevent two more size_t subtractions from going negative
This fully fixes https://savannah.gnu.org/bugs/?60658.

Found by compiling with -fsanitize=undefined.
2021-05-24 11:00:29 +02:00
Benno Schulenberg ceaae49b2d tweaks: avoid the subtraction of two size_t variables becoming negative
This fixes https://savannah.gnu.org/bugs/?60658.

Found by compiling with -fsanitize=undefined.
2021-05-23 11:46:37 +02:00
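The hazard this commit names is easy to reproduce: size_t is unsigned, so a subtraction that "should" be negative wraps around to a very large value instead. The sketch below only illustrates that general pitfall and one common guard. The function names are hypothetical, and the log does not say which expressions in nano were changed or what the sanitizer reported.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from nano: return how far `end` lies
 * beyond `start`, or 0 when it does not.  Comparing before subtracting
 * keeps the unsigned result from wrapping around. */
size_t distance_or_zero(size_t end, size_t start)
{
    return (end > start) ? end - start : 0;
}

/* The unguarded form, for contrast: when end < start the result wraps
 * modulo SIZE_MAX + 1 instead of becoming negative. */
size_t naive_distance(size_t end, size_t start)
{
    return end - start;
}
```

With end = 3 and start = 5, the guarded version yields 0 while the naive one yields SIZE_MAX - 1.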
Benno Schulenberg bb81932422 chars: work around the wrong private-use-character widths on OpenBSD
This fixes https://savannah.gnu.org/bugs/?60393.
2021-04-20 11:13:08 +02:00
Benno Schulenberg 48fa14acc0 tweaks: simplify two fragments of code
This makes the handling of plain ASCII a tiny bit slower, but it
affects only the users of --constantshow without --minibar, so...

All other uses of mbstrlen() and collect_char() are not in speed-
critical code paths.
2021-04-13 11:19:32 +02:00
Benno Schulenberg b4a5aedc6c tweaks: remove a misplaced (and nested) #ifdef
It was accidentally introduced two weeks ago by commit 1c010d8e.
2021-04-09 16:55:07 +02:00
Benno Schulenberg d6ed174d09 tweaks: morph a function into what it is actually used for
Since the previous commit, mbwidth() is used only to determine whether
a character is either double width or zero width.  There is no need to
return the actual width of the character; a simple yes or no is enough.

Transforming mbwidth() into is_doublewidth() also allows streamlining
it and is_zerowidth() a bit, so that they become slightly faster.
2021-04-09 16:38:23 +02:00
Benno Schulenberg 78f92e044a tweaks: avoid parsing a multibyte character twice
The number of bytes in the character was determined twice: first in
mbwidth() and then in char_length().  Do it just once, in mbtowide().

Also, avoid calling is_cntrl_char(), because it does unneeded checks
when we already know that the high bit is set.

This duplicates some code, but advance_over() is called a lot, so it
is important that it is as fast as possible.

This shouldn't slow down plain ASCII, as the extra checks (use_utf8
and *string < 0xA0) are done only for non-ASCII (apart from DEL).
2021-04-09 11:32:15 +02:00
Benno Schulenberg c75a3839da tweaks: elide a small function that is used just once 2021-04-07 17:08:05 +02:00
Benno Schulenberg b6a32fbd5f tweaks: elide an unneeded resetting NULL call to wctomb()
Calling wctomb() with NULL as the first parameter returns zero in a
UTF-8 locale, meaning that there is no state, so there is no point
in resetting it either.
2021-04-07 16:11:40 +02:00
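The behavior this commit relies on is part of standard C: called with a null pointer as its first argument, wctomb() reports whether the current multibyte encoding is state-dependent. A minimal sketch, with a hypothetical wrapper name:

```c
#include <stdlib.h>

/* Hypothetical wrapper, not from nano.  With NULL as the first argument,
 * wctomb() returns nonzero when the current multibyte encoding has
 * shift states and zero when it does not. */
int encoding_has_shift_state(void)
{
    return wctomb(NULL, L'\0') != 0;
}
```

In the default "C" locale this is expected to return 0. To query the user's locale instead, call setlocale(LC_ALL, "") before it.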
Benno Schulenberg 0dcac9188f tweaks: simplify two fragments of code, eliding useless character copying 2021-03-29 20:06:05 +02:00
Benno Schulenberg 1c010d8ec9 chars: implement mbtowc() ourselves, for more efficiency
This saves a function call, and the passing and checking of the
MAXCHARLEN parameter, and the checking whether wc is maybe NULL
(which for nano is never the case), and who knows what other
overheads mbtowc() has, and our workaround for glibc.

Code was written after looking at gnulib/lib/mbrtowc-impl-utf8.h.
2021-03-29 12:36:10 +02:00
Benno Schulenberg b020937475 chars: implement mblen() ourselves, for efficiency
Most implementations of mblen() do a call to mbtowc(), which is
a waste of time when all we want to know is the number of bytes
(and when we already know that we're using UTF-8 and that the
first byte is at least 0xC2).

(This also avoids burdening correct implementations with the
workaround that was needed only for glibc.)

Code was written after looking at gnulib/lib/mbrtowc-impl-utf8.h.
2021-03-27 14:38:28 +01:00
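To show what "determining the number of bytes" involves when the encoding is known to be UTF-8, here is a sketch of a length function that inspects only the bytes themselves. It is not nano's code and was not derived from the gnulib file mentioned above. The name utf8_length() and the choice to return 1 for every ASCII byte (including NUL) are assumptions made for the example.

```c
/* Hypothetical sketch: return the length in bytes of the UTF-8 sequence
 * at the start of `string`, or -1 when the bytes are not well-formed.
 * Assumes `string` is NUL-terminated; a NUL is never a continuation
 * byte, so no byte beyond the terminator is ever read. */
int utf8_length(const char *string)
{
    const unsigned char *s = (const unsigned char *)string;
    unsigned char c1 = s[0];

    if (c1 < 0x80)
        return 1;                       /* plain ASCII */
    if (c1 < 0xC2 || c1 > 0xF4)
        return -1;                      /* cannot start a sequence */
    if ((s[1] ^ 0x80) > 0x3F)
        return -1;                      /* second byte not 0x80..0xBF */
    if (c1 < 0xE0)
        return 2;
    if ((s[2] ^ 0x80) > 0x3F)
        return -1;                      /* third byte not 0x80..0xBF */
    if (c1 < 0xF0) {
        if (c1 == 0xE0 && s[1] < 0xA0)
            return -1;                  /* overlong encoding */
        if (c1 == 0xED && s[1] > 0x9F)
            return -1;                  /* surrogate U+D800..U+DFFF */
        return 3;
    }
    if ((s[3] ^ 0x80) > 0x3F)
        return -1;                      /* fourth byte not 0x80..0xBF */
    if (c1 == 0xF0 && s[1] < 0x90)
        return -1;                      /* overlong encoding */
    if (c1 == 0xF4 && s[1] > 0x8F)
        return -1;                      /* beyond U+10FFFF */
    return 4;
}
```

No wide character is ever assembled here, which is the saving the commit message describes: the length follows from range checks on at most four bytes.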
Benno Schulenberg df7fe1280d tweaks: drop unneeded braces and adjust indentation after previous change 2021-03-26 12:17:44 +01:00
Benno Schulenberg 929770191e chars: work around a UTF-8 bug in glibc, to display invalid codes right
The mblen() and mbtowc() functions will happily return 4 or 5 or 6
for byte sequences that start with 0xF4 0x90 or higher.  But those
sequences encode for U+110000 or higher, which are not valid Unicode
code points.  The libc of FreeBSD and OpenBSD and Alpine correctly
return -1 for such sequences.  Make nano behave correctly also when
linked against glibc, so that invalid sequences are always presented
as a series of invalid bytes and never as a single invalid code.

This fixes https://savannah.gnu.org/bugs/?60262.

Bug existed since before version 2.0.0.
2021-03-26 11:07:05 +01:00
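The boundary named in this message can be checked by hand: combining the payload bits of 0xF4 0x90 0x80 0x80 gives exactly U+110000, one past the last valid code point. The sketch below does that arithmetic. Both function names are hypothetical, and the code deliberately performs no validation, to show which value a lenient decoder would arrive at.

```c
#include <stdbool.h>

/* Hypothetical sketch: combine the payload bits of a four-byte UTF-8
 * pattern (3 + 6 + 6 + 6 bits) without checking validity.  The caller
 * must pass at least four bytes. */
unsigned long raw_four_byte_value(const char *bytes)
{
    const unsigned char *b = (const unsigned char *)bytes;

    return ((unsigned long)(b[0] & 0x07) << 18) |
           ((unsigned long)(b[1] & 0x3F) << 12) |
           ((unsigned long)(b[2] & 0x3F) << 6) |
            (unsigned long)(b[3] & 0x3F);
}

/* True when the four bytes would encode something above U+10FFFF. */
bool beyond_unicode(const char *bytes)
{
    return raw_four_byte_value(bytes) > 0x10FFFF;
}
```

The sequence 0xF4 0x8F 0xBF 0xBF yields U+10FFFF and passes, while 0xF4 0x90 0x80 0x80 yields U+110000 and should be treated as invalid bytes.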
Benno Schulenberg 66d9d6c6d2 tweaks: elide the pointless is_valid_unicode() function
The call of this function in make_mbchar() does not add anything,
because wctomb() already returns -1 for codes U+D800 to U+DFFF,
and parse_verbatim_kbinput() already rejects anything that starts
with U+11.... or higher, so make_mbchar() is never called for codes
beyond U+10FFFF.

And the call in display_string() just needs to check for wc <= 0x10FFFF
because mbtowc() already returns -1 for codes U+D800 to U+DFFF.
2021-03-25 11:24:41 +01:00
Benno Schulenberg de816840cb input: accept Unicode codes for non-characters as valid, since they are
That is, accept U+FDD0 to U+FDEF, and accept U+xxFFFE and U+xxFFFF
for xx from 00 to 10 hex, being the 66 reserved "non-characters".

It may not be wise of the user to input these "things" (by typing
their code after M-V), but the codes are valid Unicode code points
and should not be rejected.

See https://www.unicode.org/faq/private_use.html#nonchar8 et al.

This fixes https://savannah.gnu.org/bugs/?60263.

Bug existed since before version 2.0.0.
2021-03-24 17:11:05 +01:00
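The count of 66 follows from the two ranges given in the message: 32 codes from U+FDD0 to U+FDEF, plus the last two codes of each of the 17 planes. A small sketch that identifies them and confirms the total (function names are hypothetical, and nano itself no longer needs such a test, since it accepts these codes):

```c
#include <stdbool.h>

/* Hypothetical sketch: is `code` one of Unicode's noncharacters? */
bool is_noncharacter(unsigned long code)
{
    if (code > 0x10FFFF)
        return false;                   /* not a code point at all */
    if (code >= 0xFDD0 && code <= 0xFDEF)
        return true;                    /* the contiguous block of 32 */
    return (code & 0xFFFE) == 0xFFFE;   /* U+xxFFFE and U+xxFFFF */
}

/* Count the noncharacters by scanning all code points. */
int count_noncharacters(void)
{
    int count = 0;

    for (unsigned long code = 0; code <= 0x10FFFF; code++)
        if (is_noncharacter(code))
            count++;

    return count;
}
```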
Benno Schulenberg 6360e4170a copyright: update the years for the FSF 2021-01-11 14:22:51 +01:00
Benno Schulenberg 24e5f956d0 build: fix compilation when configured with --disable-utf8
This fixes https://savannah.gnu.org/bugs/?59842.
Reported-by: Ruben van Wyk <admin@knwip.com>

Bug existed since commit 5129e718 from two days ago.
2021-01-08 12:05:55 +01:00
Benno Schulenberg 10b99d8ac0 chars: short-circuit determining the width of characters under U+0300
The combining characters (that are zero-width) start at U+0300.
After that it's pretty much chaos, width-wise.

The mbwidth() function is not called for control characters (whose
representation takes up two columns), as they are handled separately.

The calls of mbwidth() that *can* happen with a control character as
argument are only to determine whether the character is zero-width,
and then it doesn't matter whether the exact width is 1 or 2.
2021-01-06 20:15:14 +01:00
Benno Schulenberg 5129e718d7 chars: speed up the handling of invalid UTF-8 starter bytes
The first byte of a multi-byte UTF-8 sequence must be in the range
0xC2...0xFF.  Any other byte cannot be a starter byte and can thus
immediately be treated as a single byte.
2021-01-06 12:41:49 +01:00
Benno Schulenberg a4675acdba copyright: update to the current year for significantly changed files 2020-11-30 12:01:47 +01:00
Benno Schulenberg 687efd210c moving: skip combining characters and other zero-width characters
This makes the cursor move smoothly left and right -- instead of
"stuttering" when passing over a zero-width character.

Pressing <Delete> on a normal (spacing) character also deletes
any zero-width characters after it.  But pressing <Backspace>
while the cursor is placed after a zero-width character, just
deletes that zero-width character.  The latter behavior allows
deleting and retyping just the combining diacritic of a character
instead of the whole character.

This addresses https://savannah.gnu.org/bugs/?50773.
Requested-by: Mike Frysinger <vapier@gentoo.org>
2020-11-17 10:21:50 +01:00
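The cursor movement described here can be sketched on an array of code points. This is an illustration under simplifying assumptions, not nano's implementation: nano works on UTF-8 bytes and has its own is_zerowidth(), whereas the hypothetical is_combining() below recognizes only the Combining Diacritical Marks block (U+0300 to U+036F).

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplification for the example: only U+0300..U+036F count as
 * zero-width.  Real text has many more zero-width characters. */
static bool is_combining(unsigned long code)
{
    return code >= 0x0300 && code <= 0x036F;
}

/* Step right over one character plus any zero-width ones after it. */
size_t step_right(const unsigned long *codes, size_t count, size_t pos)
{
    if (pos < count)
        pos++;
    while (pos < count && is_combining(codes[pos]))
        pos++;
    return pos;
}

/* Step left, continuing until the cursor rests on a spacing character. */
size_t step_left(const unsigned long *codes, size_t pos)
{
    if (pos > 0)
        pos--;
    while (pos > 0 && is_combining(codes[pos]))
        pos--;
    return pos;
}

/* Sample data: "e", combining acute accent, "x". */
const unsigned long sample_text[] = { 0x65, 0x0301, 0x78 };
```

On the sample, stepping right from index 0 lands on index 2 (the "x"), and stepping left from index 2 lands back on index 0, so the cursor never stops on the accent.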
Benno Schulenberg 5a635db262 chars: reduce searching time by roughly 85 percent for plain ASCII
Make case-insensitive searching in a UTF-8 locale eight times faster
when the actual characters involved are plain ASCII.

This makes us faster than 'less', and as fast as Vim and Emacs.

The disadvantage of this change is that searching for a string that
begins with a multibyte character is nearly ten times slower than
searching for one that begins with an ASCII character.  This may be
unsettling when searching a huge file first for a simple ASCII string
and later for a UTF-8 one.  Doing this second search, the user might
get impatient: "Why is it taking so long?"

(This patch fell through the cracks four years ago, when I worked on
the searching code.  It sat in a branch on top of other changes that
I never applied because I made different improvements.  The speedup
at the time, on that machine, was only around sixty percent, though.
But measuring it now again on the same machine, it clocks in at an
82 percent reduction with -O0 and an 87 percent reduction with -O2.)
2020-09-01 19:35:34 +02:00
Hussam al-Homsi c87bc1d55f tweaks: stop casting the return of malloc() and friends
Those casts are redundant, and sometimes ugly.  And as the types of
variables are extremely unlikely to change any more at this point,
the protection they offer against miscompilations is moot.

Signed-off-by: Hussam al-Homsi <sawuare@gmail.com>
2020-08-31 12:17:27 +02:00
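In C, the void pointer returned by malloc() converts implicitly to any object pointer type, which is why such casts are redundant. A minimal sketch of the cast-free style, using a hypothetical function that is not from nano:

```c
#include <stdlib.h>

/* Hypothetical example: allocate an array of `count` offsets and fill
 * it with 0, 1, 2, ...  Returns NULL when the allocation fails. */
size_t *make_offsets(size_t count)
{
    /* No cast: void * converts implicitly to size_t * in C. */
    size_t *offsets = malloc(count * sizeof *offsets);

    if (offsets == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++)
        offsets[i] = i;

    return offsets;
}
```

Writing sizeof *offsets instead of naming the type keeps the allocation correct even if the variable's type changes later.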
Benno Schulenberg 8249f3560f tweaks: normalize the indentation after the previous change 2020-07-20 19:46:27 +02:00
Benno Schulenberg dd1b16cd54 tweaks: trim an ASCII case, as the function is called only for UTF-8 2020-07-20 19:37:40 +02:00
Benno Schulenberg 90f6342fd1 tweaks: rename two header files, to be distinct and not an abbreviation 2020-06-20 12:09:31 +02:00
Benno Schulenberg 547de4a7bb counting: count words correctly also when --wordchars is used
It should give the same result as 'wc -w' as long as the content
of 'wordchars' does not affect the counting.

This fixes https://savannah.gnu.org/bugs/?58123.

Bug existed since version 2.6.2, since the --wordchars option was
introduced in commit 6f12992c.
2020-04-06 11:17:43 +02:00
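For reference, the 'wc -w' notion of a word is a maximal run of non-whitespace characters. The sketch below counts words that way. It is an illustration only, with a hypothetical function name, and it ignores the --wordchars option that this commit is about.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: count whitespace-delimited words in `text`. */
size_t count_words(const char *text)
{
    size_t words = 0;
    bool inside = false;    /* whether we are currently inside a word */

    for (; *text != '\0'; text++) {
        if (isspace((unsigned char)*text))
            inside = false;
        else if (!inside) {
            inside = true;  /* a new word starts at this character */
            words++;
        }
    }

    return words;
}
```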
Benno Schulenberg f528ced22b tweaks: use a symbol instead of a number, and drop two unneeded casts 2020-03-22 14:29:10 +01:00
Benno Schulenberg 4ce2e146ea tweaks: elide three unneeded #defines
Backspace and Tab and Carriage Return have standard backslash escapes.
2020-03-19 14:40:51 +01:00
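The three escapes in question are '\b', '\t' and '\r', so symbolic names for those codes add nothing. A small sketch that uses the escapes directly as case labels (the function is hypothetical; the names of the removed #defines are not given in the log):

```c
/* Hypothetical example: name a control code using C's standard
 * backslash escapes instead of custom #defines. */
const char *name_of_control(int code)
{
    switch (code) {
        case '\b':
            return "Backspace";
        case '\t':
            return "Tab";
        case '\r':
            return "Carriage Return";
        default:
            return "other";
    }
}
```

On an ASCII-based system, these escapes correspond to the codes 0x08, 0x09 and 0x0D.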
Benno Schulenberg 9917a05f04 tweaks: exclude a function when compiled without spell-checking support 2020-03-13 11:59:08 +01:00
Benno Schulenberg fcda76f684 build: restore non-UTF8 fallbacks, to allow compiling with --disable-utf8
Commits b2c63c3d and 004af03e from yesterday mistakenly removed those
calls.
2020-03-13 11:43:31 +01:00
Benno Schulenberg 21ed79938e tweaks: normalize the indentation after the previous two changes 2020-03-12 15:54:19 +01:00
Benno Schulenberg 004af03ea5 tweaks: remove non-UTF-8 code from three more functions 2020-03-12 15:54:19 +01:00
Benno Schulenberg b2c63c3d3c chars: optimize a function for the most common blanks: space and tab
Also, do not bother to provide separate code for the non-UTF-8 case.
Instead, optimize for plain ASCII characters.
2020-03-12 15:54:19 +01:00
Benno Schulenberg ae139021eb tweaks: rename four more functions, to get rid of an abbreviation
Also, improve their comments.
2020-03-12 15:54:19 +01:00
Benno Schulenberg f6dedf3598 tweaks: rename another function, to remove the obscuring abbreviation 2020-03-12 15:54:19 +01:00
Benno Schulenberg 8003842e5c tweaks: rename a function, to remove an obscuring abbreviation
The "mb" made the name harder to read.  Also, the function is
not only for multibyte characters but for any character.
2020-03-12 15:53:49 +01:00
Benno Schulenberg 1d4411a474 tweaks: elide a function call, by copying a byte directly
Now all remaining calls of measured_copy() have a "+ 1" in their
second argument, and can thus be simplified.  And each of those
calls is followed by terminating the string with a NUL byte, so
thát can be pulled into the function.
2020-02-20 16:38:14 +01:00
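The simplification described here amounts to moving the "+ 1" and the NUL termination from every call site into the copying function. The sketch below shows that shape with a hypothetical stand-in. It is not nano's measured_copy(), whose exact signature and allocation routine are not shown in this log.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: copy the first `count` bytes of `string` into
 * freshly allocated memory and terminate the copy with a NUL byte.
 * The caller must ensure `string` holds at least `count` bytes, and
 * must free the result.  Returns NULL when the allocation fails. */
char *copy_of_prefix(const char *string, size_t count)
{
    char *copy = malloc(count + 1);     /* the "+ 1" now lives here */

    if (copy == NULL)
        return NULL;

    memcpy(copy, string, count);
    copy[count] = '\0';                 /* ...and so does the terminator */

    return copy;
}
```

A call such as copy_of_prefix("nano editor", 4) then returns the string "nano" with no further work at the call site.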
Benno Schulenberg a9f7277b1b tweaks: remove a now-unused helper function 2020-02-16 12:33:29 +01:00
Benno Schulenberg 0a31a9aa38 tweaks: make two conditions more direct, and thus elide two functions
Using straightforward comparisons is clearer and faster and shorter.

Again, note that this does not filter out 0x7F (DEL).  But that is
okay, as that code will never be returned from get_kbinput().
2020-02-12 11:38:33 +01:00
Benno Schulenberg 2148e857e5 copyright: update the years for significantly changed files 2020-01-15 12:11:56 +01:00
Benno Schulenberg afa4c6b9fc copyright: update the years for the FSF 2020-01-15 11:42:38 +01:00
Benno Schulenberg 3c695664ec tweaks: elide a function call for the plain ASCII case
When dealing with a plain, seven-bit ASCII character, don't bother
calling is_cntrl_mbchar() but determine directly whether it is a
control character.  Also reshuffle things so that we don't compare
charlen == 1 when we already know it is 1.
2019-10-21 18:52:44 +02:00
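For a seven-bit ASCII byte, the direct test is a pair of comparisons: the control characters are 0x00 to 0x1F plus DEL (0x7F). A minimal sketch with a hypothetical function name, assuming the caller has already established that the byte is below 0x80:

```c
#include <stdbool.h>

/* Hypothetical sketch: is this seven-bit ASCII byte a control
 * character?  Bytes of 0x80 and above are simply reported as false. */
bool is_ascii_control(unsigned char byte)
{
    return byte < 0x20 || byte == 0x7F;
}
```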
Benno Schulenberg 8a7634f070 tweaks: rename two parameters plus a variable, to match others
Also improve a comment and normalize an indentation.
2019-10-21 13:02:17 +02:00
Benno Schulenberg fa88fcc8f2 tweaks: rename a function, and elide a parameter that is always NULL
After the previous change, all remaining calls of parse_mbchar() have
NULL as their third parameter.  So, drop that parameter and remove the
chunk of code that handles it.  Also rename the function, as there are
already too many functions that start with "parse".
2019-10-21 12:35:14 +02:00
Benno Schulenberg c2d8641f01 chars: add a faster version of the character-parsing function
It elides a parameter that is always NULL, and elides two ifs
that always take the same path.
2019-10-21 12:24:23 +02:00
Benno Schulenberg 17c16a4bf5 tweaks: rename a function and elide its first parameter 2019-10-20 09:45:58 +02:00