Skip to content

Minor country and help fixes #3186

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 8, 2023
Merged

Minor country and help fixes #3186

merged 2 commits into from
Dec 8, 2023

Conversation

FeralChild64
Copy link
Collaborator

@FeralChild64 FeralChild64 commented Dec 7, 2023

Description

  1. Fix two comments in country data definitions (data is OK; it is the comments which are wrong) - noticed by @rderooy
  2. Fix problem I have discovered while updating Polish translation - to determine UTF-8 string length in characters we need something much more sophisticated than strlen or C++ counterparts (they give us length in bytes, and a single Unicode character might need many bytes to be encoded); it is much easier to rely on the translator to put a proper number of minuses (as a title underscore).

Screenshot-Bug

Manual testing

Execute dosbox --list-countries and dosbox --list-glshaders commands

Checklist

I have:

  • followed the project's contributing guidelines and code of conduct.
  • performed a self-review of my code.
  • commented on the particularly hard-to-understand areas of my code.
  • split my work into well-defined, bisectable commits, and I named my commits well.
  • applied the appropriate labels (bug, enhancement, refactoring, documentation, etc.)
  • checked that all my commits can be built.
  • confirmed that my code does not cause performance regressions (e.g., by running the Quake benchmark).
  • added unit tests where applicable to prove the correctness of my code and to avoid future regressions.
  • made corresponding changes to the documentation or the website according to the documentation guidelines.
  • locally verified my website or documentation changes.

@FeralChild64 FeralChild64 added the bug Something isn't working label Dec 7, 2023
@FeralChild64 FeralChild64 marked this pull request as ready for review December 7, 2023 20:05
@FeralChild64 FeralChild64 self-assigned this Dec 7, 2023
@johnnovak
Copy link
Member

to determine UTF-8 string length in characters we need something much more sophisticated than strlen or C++ counterparts

Oh man... don't tell me this is still an unsolved problem in the C++ std lib... Like every lowly scripting language had this for 20 years now...😞

@FeralChild64
Copy link
Collaborator Author

Oh man... don't tell me this is still an unsolved problem in the C++ std lib...

Unfortunately, it is.

  • There is std::wstring, but it is not portable, on Windows it is 16-bit only. AFAIK not much standard library functions can deal with it.
  • We have std::u32string, which can be used to deal with UTF-32 encoding, not really useful for us. Moreover, size() will tell us the number of code points, while we need the number of graphemes here (and a grapheme can be made, for example, from one code point telling about a base Latin letter + two other telling about combining marks; IIRC Baltic languages are particularly nasty here).
  • C++20 has std::u8string to deal with UTF-8. But the standard library offers nothing to handle it. Might be useful to us once we migrate to C++20, but only for type checking.

I could implement UTF-8 compliant strlen, by asking my Unicode engine to convert UTF-8 to ASCII and then run a regular string size check - but this is too risky, I might improve my Unicode engine in the future to cooperate with emulated DOS more closely, and at this point we don't have DOSBox fully initialized - I'm not sure if I'll remember to test this particular use case like 2 or 3 years in the future.

IMHO better ask the translator to do the job.

@johnnovak
Copy link
Member

johnnovak commented Dec 8, 2023

Moreover, size() will tell us the number of code points, while we need the number of graphemes here (and a grapheme can be made, for example, from one code point telling about a base Latin letter + two other telling about combining marks; IIRC Baltic languages are particularly nasty here).

@FeralChild64 Actually, I take my statement back that "every other stdlib" can do this 😅 I'm pretty sure most other languages offer the same basic support only, i.e., they only count "code points", or "runes" (they're called runes in Nim, for instance). But not that wacky situation you described.

Also some libs just shit themselves when dealing with 4-byte UTF-8 emojis which are super popular among Asian audiences (talking from real world experience here; 4-byte UTF-8 emojis caused so many headaches for me in the past...)

Copy link
Member

@kcgen kcgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tha is for the thorough discussion and explanation, @FeralChild64 !

Makes sense.

@kcgen
Copy link
Member

kcgen commented Dec 8, 2023

Reading about it, ICU has a break iterator, but people mention how its API is very old.

Boost's locale.hpp has a boundary iterator that can be used to count graphemes.

Glib has a UTF-8 ustring, however it seems to only count code points without deducting those that are used for joining.

So Glib::ustring s = L"नमस्ते";, size() returns six, but two of the code points are for joining and we're interested in the four remaining graphemes.

ICU and Boost would be heavy dependencies, so having your engine handle this in the future would be very nice, @FeralChild64 !

@FeralChild64 FeralChild64 merged commit a8cd386 into main Dec 8, 2023
@johnnovak
Copy link
Member

Reading about it, the ICU has a break iterator, but people mention how it's API is very old.

Boost's locale.hpp has a boundary iterator that can be used to count graphemes.

Glib has a UTF-8 ustring, however it seems to only count code points without deducting those that are used for joining.

So Glib::ustring s = L"नमस्ते";, size() returns six, but two of the code points are for joining and we're interested in the four remaining graphemes.

ICU and Boost would be heavy dependencies, so having your engine handle this in the future would be very nice, @FeralChild64 !

Yeah, I don't even want to go near Boost. Like ever, unless I'm extremely well compensated 😄

@johnnovak johnnovak added the localisation Issues related to localisation and internationalisation label Dec 11, 2023
@FeralChild64 FeralChild64 deleted the fc/help-fix-1 branch December 11, 2023 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working localisation Issues related to localisation and internationalisation
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants