Minor country and help fixes #3186

FeralChild64 · 2023-12-07T14:52:59Z

Description

Fix two comments in the country data definitions (the data is OK; only the comments are wrong), noticed by @rderooy
Fix a problem I discovered while updating the Polish translation: to determine UTF-8 string length in characters we need something much more sophisticated than strlen or C++ counterparts (they give us the length in bytes, and a single Unicode character might need several bytes to be encoded); it is much easier to rely on the translator to put the proper number of minus signs (as a title underline).
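To illustrate the byte count versus character count mismatch described above, here is a minimal sketch, not the code from this PR. The helper names `count_utf8_code_points` and `make_underline` are hypothetical, and the sketch assumes valid UTF-8 input. It counts code points only, so it would still miscount text that uses combining marks.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes valid UTF-8; this counts code points, not graphemes.
std::size_t count_utf8_code_points(const std::string& text)
{
	std::size_t count = 0;
	for (const unsigned char byte : text) {
		if ((byte & 0xC0) != 0x80) {
			++count;
		}
	}
	return count;
}

// Hypothetical helper: builds a title underline of minus signs whose
// length matches the code point count rather than the byte count.
std::string make_underline(const std::string& title)
{
	return std::string(count_utf8_code_points(title), '-');
}
```

For the Polish word "Zażółć", strlen reports 10 bytes while the word has 6 code points, so an underline sized with strlen would be four characters too long.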

Manual testing

Execute dosbox --list-countries and dosbox --list-glshaders commands

Checklist

I have:

johnnovak · 2023-12-07T21:57:44Z

to determine UTF-8 string length in characters we need something much more sophisticated than strlen or C++ counterparts

Oh man... don't tell me this is still an unsolved problem in the C++ std lib... Like every lowly scripting language has had this for 20 years now...😞

FeralChild64 · 2023-12-07T22:28:24Z

Oh man... don't tell me this is still an unsolved problem in the C++ std lib...

Unfortunately, it is.

There is std::wstring, but it is not portable; on Windows it is 16-bit only. AFAIK not many standard library functions can deal with it.
We have std::u32string, which can be used to deal with UTF-32 encoding, but it is not really useful for us. Moreover, size() will tell us the number of code points, while we need the number of graphemes here (and a grapheme can be made, for example, from one code point for a base Latin letter plus two others for combining marks; IIRC Baltic languages are particularly nasty here).
C++20 has std::u8string to deal with UTF-8, but the standard library offers nothing to handle it. It might be useful to us once we migrate to C++20, but only for type checking.
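As a small illustration of the code point versus grapheme point made about std::u32string, the sketch below compares a precomposed and a decomposed spelling of the same visible character. The function name `code_point_count` is hypothetical; it only wraps size().

```cpp
#include <cstddef>
#include <string>

// std::u32string::size() returns the number of UTF-32 code units,
// which equals the number of code points, not the number of graphemes.
std::size_t code_point_count(const std::u32string& text)
{
	return text.size();
}

// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
const std::u32string precomposed = U"\u00E9";

// Decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
// It renders as the same single character on screen.
const std::u32string decomposed = U"e\u0301";
```

Both strings display as one character, yet size() reports 1 for the first and 2 for the second, so an underline based on size() could still be too long.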

I could implement a UTF-8 compliant strlen by asking my Unicode engine to convert UTF-8 to ASCII and then running a regular string size check, but this is too risky: I might improve my Unicode engine in the future to cooperate with emulated DOS more closely, and at this point we don't have DOSBox fully initialized. I'm not sure I'll remember to test this particular use case 2 or 3 years from now.

IMHO it is better to ask the translator to do the job.

johnnovak · 2023-12-08T00:38:01Z

Moreover, size() will tell us the number of code points, while we need the number of graphemes here (and a grapheme can be made, for example, from one code point for a base Latin letter plus two others for combining marks; IIRC Baltic languages are particularly nasty here).

@FeralChild64 Actually, I take my statement back that "every other stdlib" can do this 😅 I'm pretty sure most other languages offer the same basic support only, i.e., they only count "code points", or "runes" (they're called runes in Nim, for instance). But not that wacky situation you described.

Also some libs just shit themselves when dealing with 4-byte UTF-8 emojis which are super popular among Asian audiences (talking from real world experience here; 4-byte UTF-8 emojis caused so many headaches for me in the past...)
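For context on the 4-byte case mentioned above: the lead byte of a UTF-8 sequence announces how many bytes the sequence occupies, and code points outside the Basic Multilingual Plane (most emoji among them) need four. The sketch below is illustrative only; `utf8_sequence_length` is a hypothetical name, and the function does not reject overlong or out-of-range encodings.

```cpp
#include <cstddef>

// Returns the length in bytes of a UTF-8 sequence given its lead byte,
// or 0 if the byte is a continuation byte or cannot start a sequence.
// Simplification: does not validate overlong or out-of-range encodings.
constexpr std::size_t utf8_sequence_length(const unsigned char lead)
{
	if (lead < 0x80) {
		return 1; // 0xxxxxxx: ASCII
	}
	if ((lead & 0xE0) == 0xC0) {
		return 2; // 110xxxxx
	}
	if ((lead & 0xF0) == 0xE0) {
		return 3; // 1110xxxx
	}
	if ((lead & 0xF8) == 0xF0) {
		return 4; // 11110xxx: code points above U+FFFF, e.g. most emoji
	}
	return 0;
}
```

Code that assumes at most three bytes per character, or at most one UTF-16 code unit per character, is the kind that tends to break on such input.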

kcgen

Thanks for the thorough discussion and explanation, @FeralChild64!

Makes sense.

kcgen · 2023-12-08T07:53:34Z

Reading about it, ICU has a break iterator, but people mention how its API is very old.

Boost's locale.hpp has a boundary iterator that can be used to count graphemes.

Glib has a UTF-8 ustring, however it seems to only count code points without deducting those that are used for joining.

So Glib::ustring s = "नमस्ते";, size() returns six, but two of the code points are for joining and we're interested in the four remaining graphemes.

ICU and Boost would be heavy dependencies, so having your engine handle this in the future would be very nice, @FeralChild64 !
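The arithmetic in the Devanagari example can be reproduced with a rough sketch. It uses std::u32string instead of Glib::ustring to avoid the dependency, and the mark table is a hypothetical, hand-picked list covering only the two marks in this one word. It is not a substitute for real grapheme segmentation as specified by Unicode (UAX #29) and implemented by ICU or Boost.Locale, whose results may differ.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, deliberately tiny mark table: only the two Devanagari
// code points that occur in the example word. Not a general solution.
constexpr bool is_example_joining_mark(const char32_t code_point)
{
	return code_point == U'\u094D' || // DEVANAGARI SIGN VIRAMA
	       code_point == U'\u0947';   // DEVANAGARI VOWEL SIGN E
}

// Counts code points, leaving out those listed in the table above.
std::size_t count_without_marks(const std::u32string& text)
{
	std::size_t count = 0;
	for (const char32_t code_point : text) {
		if (!is_example_joining_mark(code_point)) {
			++count;
		}
	}
	return count;
}

// The example word spelled out as code points:
// NA, MA, SA, VIRAMA, TA, VOWEL SIGN E.
const std::u32string namaste = U"\u0928\u092E\u0938\u094D\u0924\u0947";
```

Here namaste.size() is six and count_without_marks(namaste) is four, matching the numbers quoted above. A production version would need the full Unicode character database, which is why the heavier libraries exist.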

johnnovak · 2023-12-08T08:19:23Z

Reading about it, the ICU has a break iterator, but people mention how its API is very old.

Boost's locale.hpp has a boundary iterator that can be used to count graphemes.

Glib has a UTF-8 ustring, however it seems to only count code points without deducting those that are used for joining.

So Glib::ustring s = "नमस्ते";, size() returns six, but two of the code points are for joining and we're interested in the four remaining graphemes.

ICU and Boost would be heavy dependencies, so having your engine handle this in the future would be very nice, @FeralChild64 !

Yeah, I don't even want to go near Boost. Like ever, unless I'm extremely well compensated 😄

FeralChild64 mentioned this pull request Dec 7, 2023

Add countries and historic locales from OS/2 Warp 4.52 #3182

Merged

10 tasks

FeralChild64 requested review from kcgen, johnnovak and rderooy December 7, 2023 19:48

FeralChild64 added the bug Something isn't working label Dec 7, 2023

FeralChild64 marked this pull request as ready for review December 7, 2023 20:05

FeralChild64 self-assigned this Dec 7, 2023

FeralChild64 added 2 commits December 7, 2023 23:34

Fix mistakes in comments

5e7ab16

Fix wrong underscore length in some translated help messages

ca429d4

FeralChild64 force-pushed the fc/help-fix-1 branch from 3718244 to ca429d4 Compare December 7, 2023 22:36

kcgen approved these changes Dec 8, 2023

FeralChild64 merged commit a8cd386 into main Dec 8, 2023

johnnovak added the localisation Issues related to localisation and internationalisation label Dec 11, 2023

FeralChild64 deleted the fc/help-fix-1 branch December 11, 2023 22:15
