In a non-UTF8 build, mbwidth() always returns 1, so it is pointless
to call that function and compare its result to zero there.
Also, don't bother special-casing the function for a non-UTF8 locale.
In addition, the function was used just once, had a weird return value,
and now some more code can be excluded from a non-UTF8 build.
Make use of the fact that any single-byte character always occupies
just one column, and call the costly mbtowc() and wcwidth() only for
characters that actually are multibyte.
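
A minimal sketch of this fast path, assuming that a UTF-8 locale has
been selected with setlocale(). The name char_width() is hypothetical
and not taken from the actual source, and control codes are assumed to
be handled elsewhere:

```c
#define _XOPEN_SOURCE 700   /* Needed for the declaration of wcwidth(). */
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: return the number of columns that the character
 * at 'c' occupies.  Control codes are assumed to be handled elsewhere. */
int char_width(const char *c)
{
    wchar_t wc;
    int width;

    assert(c != NULL);

    /* A byte with a clear high bit is a complete single-byte character,
     * which occupies one column, so no conversion is needed. */
    if ((unsigned char)*c < 0x80)
        return 1;

    /* Only now pay for the multibyte conversion. */
    if (mbtowc(&wc, c, MB_CUR_MAX) < 0)
        return 1;   /* Invalid sequence: assume one column. */

    width = wcwidth(wc);

    /* wcwidth() returns -1 for unprintable or unassigned codes. */
    return (width < 0) ? 1 : width;
}
```

For text that is mostly ASCII, nearly every call returns from the first
test, without touching mbtowc() or wcwidth() at all.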
Instead of calling parse_mbchar(pointer, NULL, NULL) in twenty places,
use a simpler and faster char_length(pointer). This saves pushing two
unneeded parameters onto the stack, avoids two needless ifs, and elides
an intermediate variable.
Its main purpose will follow in a later commit: to speed up searching.
This function is used in get_totsize(), so speed is important.
There is no reason why the length of the string must be limited to a
certain size -- that is just a leftover from the function merge in
commit ba2e6f43 from a year ago.
Also, improve a comment and shorten another, change a 'for' to a 'while'
(as the end point is not known), and rename a parameter from a single
letter to a word.
And in the bargain get rid of some duplicate code.
This makes a binary without UTF-8 support slightly slower, but that's
not important -- it is more than fast enough anyway. What matters is
that the most used and longest code path, the UTF-8 case, becomes faster.
Note that 'is_cntrl_mbchar()' will fall back to 'is_cntrl_char()' for
a non-UTF-8 build, so the deleted piece of code really was equivalent
to the remaining piece for that case.
Again, if the most significant bit of a UTF-8 byte is zero, it means
the character is a single byte and we can skip the call of mblen(),
*and* if the character is one byte it also occupies just one column,
because all ASCII characters are single-column characters -- apart
from control codes.
This partially addresses https://savannah.gnu.org/bugs/?51491.
For UTF-8, if the most significant bit of a byte is zero, it means the
character is just a single byte and we can skip the call of mblen().
For files consisting of pure ASCII bytes (between 0x00 and 0x7F), this
change reduces the counting time of mbstrlen() by ninety-six percent.
This partially addresses https://savannah.gnu.org/bugs/?50406.
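
The shortcut can be sketched roughly as follows. This is not the actual
mbstrlen(); count_chars() is a hypothetical name, a UTF-8 locale is
assumed for the multibyte branch, and an invalid sequence is simply
counted as one byte:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: count the characters (not bytes) in 'string'. */
size_t count_chars(const char *string)
{
    size_t count = 0;

    assert(string != NULL);

    while (*string != '\0') {
        if ((unsigned char)*string < 0x80) {
            /* High bit clear: a single byte, so mblen() is skipped. */
            string++;
        } else {
            int length = mblen(string, MB_CUR_MAX);

            /* Treat an invalid sequence as one byte. */
            string += (length > 0) ? length : 1;
        }
        count++;
    }

    return count;
}
```

For a pure ASCII string the loop never reaches the mblen() branch.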
When mbtowc() is never called with anything less than MAXCHARLEN as
the length parameter, it will apparently not get confused and will
not need to be reset.
Each leading tab is converted to two tabs, and any four leading spaces
are converted to one tab. The intended tab size (for keeping most lines
within 80 columns) is now four.
Instead of always stepping back four bytes and then tentatively
moving forward again (which is wasteful when most codes are just
one or two bytes long), inspect the preceding bytes one by one
and begin the move forward at the first valid starter byte.
This reduces the backwards searching time by close to 40 percent.
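
A sketch of the idea, under some loud assumptions: the buffer is
NUL-terminated UTF-8, 'pos' lies on a character boundary, and both
utf8_length() and this step_left() are simplified stand-ins written for
illustration (utf8_length() looks only at the bit patterns instead of
calling mblen()):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the length of the UTF-8 sequence starting at 'p',
 * or 1 when the bytes do not form a well-shaped sequence. */
static size_t utf8_length(const char *p)
{
    unsigned char lead = (unsigned char)*p;
    size_t length, i;

    if (lead < 0xC2 || lead > 0xF4)
        return 1;   /* ASCII, a continuation byte, or an invalid lead. */

    length = (lead < 0xE0) ? 2 : (lead < 0xF0) ? 3 : 4;

    for (i = 1; i < length; i++)
        if (((unsigned char)p[i] & 0xC0) != 0x80)
            return 1;   /* Truncated sequence: count just one byte. */

    return length;
}

/* Return the index of the character that precedes the one at 'pos'. */
size_t step_left(const char *buf, size_t pos)
{
    size_t before = pos, charlen = 0;
    int probes = 0, found = 0;

    assert(buf != NULL);

    if (pos == 0)
        return 0;

    /* Inspect the preceding bytes one by one (at most four), and stop
     * at the first byte that is not a continuation byte (10xxxxxx). */
    while (before > 0 && probes < 4) {
        before--;
        probes++;
        if (((unsigned char)buf[before] & 0xC0) != 0x80) {
            found = 1;
            break;
        }
    }

    /* No starter byte in reach: step back over a single byte. */
    if (!found)
        before = pos - 1;

    /* Move forward again, to learn the length of the last character. */
    while (before < pos) {
        charlen = utf8_length(buf + before);
        if (charlen > pos - before)
            charlen = pos - before;
        before += charlen;
    }

    return pos - charlen;
}
```

When the preceding character is ASCII, the first probe already finds
the starter byte, so the forward loop runs exactly once.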
If the length of the haystack is smaller than the length of the needle,
then the length of the tail will be smaller too -- because the pointer
is always at or beyond the start of the haystack -- so the pointer gets
readjusted to be a needle length before the end of the haystack, which
means that it ends up /before/ the haystack: thus the while loop will
never run.
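
The reasoning can be followed in this sketch of a backwards search.
The name reverse_find() is hypothetical, the comparison is byte-wise,
and a signed offset stands in for the pointer so that the "before the
haystack" case simply becomes a negative number:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: search backwards for 'needle' in 'haystack',
 * starting at 'pointer', which must point into the haystack. */
const char *reverse_find(const char *haystack, const char *needle,
                         const char *pointer)
{
    size_t needle_len = strlen(needle);
    size_t tail_len = strlen(pointer);
    ptrdiff_t offset = pointer - haystack;

    assert(offset >= 0);

    /* When the tail is too short to hold the needle, move back to
     * one needle length before the end of the haystack. */
    if (tail_len < needle_len)
        offset -= (ptrdiff_t)(needle_len - tail_len);

    /* A haystack shorter than the needle makes 'offset' negative here,
     * so the loop never runs: no separate length check is needed. */
    while (offset >= 0) {
        if (strncmp(haystack + offset, needle, needle_len) == 0)
            return haystack + offset;
        offset--;
    }

    return NULL;
}
```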
On average, this saves some 200 nanoseconds per line.
The interval 2013-2017 for the Free Software Foundation is valid
because in those years there were releases with changes by either
Chris or David, and the GNU maintainers guide advises mentioning
a new year in all files of a package, not just in the ones that
actually changed, and be done with it for the rest of the year.
The platform's default char type might be signed, which could cause
problems in 8-bit locales.
This addresses https://savannah.gnu.org/bugs/?50289.
Reported-by: Hans-Bernhard Broeker <HBBroeker@T-Online.de>
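
The message does not say what the actual fix was. As an illustration
of the hazard only, with hypothetical helper names: where char is
signed, a byte such as 0xE9 is stored as a negative value, so a plain
comparison against 0x80 fails and passing the value to a <ctype.h>
function is undefined. Casting to unsigned char first avoids both:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: does the first byte of 'text' have its high
 * bit set?  Without the cast, a signed char would never be >= 0x80. */
bool has_high_bit(const char *text)
{
    assert(text != NULL);

    return (unsigned char)*text >= 0x80;
}

/* Hypothetical helper: is the first byte of 'text' alphabetic in the
 * current locale?  The <ctype.h> functions need a value that fits in
 * an unsigned char (or EOF), so the cast is required here too. */
bool is_alpha_at(const char *text)
{
    assert(text != NULL);

    return isalpha((unsigned char)*text) != 0;
}
```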
In path names and file names, 0x0A means an embedded newline and
should be shown as ^J, but in anything related to the file's data,
0x0A is an encoded NUL and should be displayed as ^@.
So... switch mode at the two main entry points into the "file system"
(reading in a file, and writing out a file), and also when drawing the
titlebar. Switch back to the default mode in the main loop.
This fixes https://savannah.gnu.org/bugs/?49893.
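
A simplified sketch of such a mode switch. The flag newline_is_nul and
this cut-down control_rep() are assumptions made for illustration, and
only 7-bit control codes are covered:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mode flag: true while handling file data, where 0x0A
 * stands for an encoded NUL; false while handling file names. */
bool newline_is_nul = true;

/* Return the character to show after the caret for control code 'c'. */
char control_rep(char c)
{
    assert((unsigned char)c < 0x20 || c == 0x7F);

    if (c == '\n' && newline_is_nul)
        return '@';             /* An encoded NUL: ^@ */
    else if (c == 0x7F)
        return '?';             /* DEL: ^? */
    else
        return (char)(c + 64);  /* 0x0A becomes 'J', 0x09 becomes 'I' */
}
```

A caller would clear the flag before drawing a file name and set it
again before returning to the editing loop.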
The byte 0x0A means 0x00 *only* when it is found in nano's internal
representation of a file's data, not when it occurs in a file name.
This fixes the second part of https://savannah.gnu.org/bugs/?49867.
That is: elide a second test from the most travelled path: a valid
character. This adds a second call of mblen() when parse_mbchar()
is called on a terminating zero, but that should never happen.
It is quicker to do a handful of superfluous compares at the end of
each line than it is to compute and keep track of and compare the
remaining line length the whole time.
The typical line is some sixty characters long, the typical search
string ten characters -- with a shorter search string the speedup is
even higher: some fifteen percent. Only when the string is longer
than half the average line length does searching become slower with
this new method.
All this for a UTF-8 locale. For a C locale it makes no difference.
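
The trade-off can be seen in a byte-wise, ASCII-only sketch (the real
routines work on multibyte characters; the names below are hypothetical).
The search loop keeps no count of the remaining line length. Near the
end of the line the comparison just runs into the terminating zero and
fails after a few compares:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: does 'text' begin with 'needle', ignoring case?
 * When 'text' is shorter than 'needle', its terminating zero differs
 * from the next needle byte, so the comparison stops there. */
static int starts_with_nocase(const char *text, const char *needle)
{
    while (*needle != '\0') {
        if (tolower((unsigned char)*text) != tolower((unsigned char)*needle))
            return 0;
        text++;
        needle++;
    }

    return 1;
}

/* Find the first case-insensitive occurrence of 'needle', without
 * tracking how much of the haystack remains. */
const char *find_nocase(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);

    while (*haystack != '\0') {
        if (starts_with_nocase(haystack, needle))
            return haystack;
        haystack++;
    }

    return NULL;
}
```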
Now that mbstrncasecmp() does the right thing, there is no longer any
need to verify that only a valid multibyte sequence was matched.
(See https://savannah.gnu.org/bugs/?45579 for a test case.)
Also, this will make it possible to search for invalid sequences.
(Currently it isn't possible to enter a search string with invalid
characters, but... a user might edit the search history file. And
if pasting at the prompt is implemented, it will be trivial to enter
invalid sequences if you have a file that contains them.)
Persisting might lead to count 'n' reaching zero, which would mean that
the needle has matched, which is wrong when one of the strings contains
an invalid or incomplete multibyte sequence.
That is: don't run towlower() on the two differing bytes when having
reached the end of one of the strings.
This fixes https://savannah.gnu.org/bugs/?48700.
In the bargain, don't do the conversion to lowercase twice.
Furthermore, persist when encountering invalid byte sequences --
until finding bytes that differ.
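
One way the comparison described in these messages might look. This is
a sketch under assumptions, not the actual mbstrncasecmp(): the name
compare_nocase() is hypothetical, a UTF-8 locale is assumed for
non-ASCII input, and the handling of the case where exactly one side is
invalid is a guess at the intent:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: compare at most 'n' characters, ignoring case. */
int compare_nocase(const char *s1, const char *s2, size_t n)
{
    wchar_t wc1, wc2;

    assert(s1 != NULL && s2 != NULL);

    while (*s1 != '\0' && *s2 != '\0' && n > 0) {
        int len1 = mbtowc(&wc1, s1, MB_CUR_MAX);
        int len2 = mbtowc(&wc2, s2, MB_CUR_MAX);

        if (len1 < 0 || len2 < 0) {
            /* Invalid sequence: compare the raw bytes instead. */
            if (*s1 != *s2)
                return (unsigned char)*s1 - (unsigned char)*s2;
            /* Same byte, but only one side is invalid: no match. */
            if ((len1 < 0) != (len2 < 0))
                return (len1 < 0) ? 1 : -1;
            /* Both invalid with the same byte: persist, byte by byte. */
            len1 = len2 = 1;
        } else {
            /* Convert each character to lowercase just once. */
            wint_t low1 = towlower((wint_t)wc1);
            wint_t low2 = towlower((wint_t)wc2);

            if (low1 != low2)
                return (low1 < low2) ? -1 : 1;
        }

        s1 += len1;
        s2 += len2;
        n--;
    }

    /* At the end of a string, compare the bytes without towlower(). */
    return (n > 0) ? (unsigned char)*s1 - (unsigned char)*s2 : 0;
}
```

The final line is what keeps a needle that is longer than the rest of
the haystack from being reported as a match.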
The needle is never part of the hay -- it is always a separate string.
(And even if needle and haystack were identical, the routine works fine,
the case does not need special treatment.)
This allows the user to specify which other characters, besides the
default alphanumeric ones, should be considered as part of a word, so
that word operations like Ctrl+Left and Ctrl+Right will pass them by.
Using this option overrides the option --wordbounds.
This fulfills https://savannah.gnu.org/bugs/?47283.
Invalid multibyte sequences get depicted with the Replacement Character,
and unassigned codepoints are shown as if they were a space. Both have
a width of one.