Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature request] Strings as byte arrays #938

Open
Rangi42 opened this issue Nov 12, 2021 · 13 comments
Open

[Feature request] Strings as byte arrays #938

Rangi42 opened this issue Nov 12, 2021 · 13 comments
Labels
enhancement Typically new features; lesser priority than bugs rgbasm This affects RGBASM

Comments

@Rangi42
Copy link
Contributor

Rangi42 commented Nov 12, 2021

In most contexts, strings are already just byte sequences. String literals can contain any bytes (except for \0, which currently terminates the string, but using C++ std::string could avoid this). String functions like STRCAT and STRRPL operate on the bytes and do not care about encoding. Even print and println just send the bytes to stdout; things print as UTF-8 iff that is set as the console's locale.

The only functions that warn about strings which aren't valid UTF-8 are STRLEN and STRSUB. I think this is actually a mistake, and we should have STRLENUTF8 and STRSUBUTF8 if that behavior is desired.

You would expect db "{s}" to declare STRLEN("{s}") many bytes, but actually STRLEN undercounts since there are multi-byte UTF-8 characters. STRLEN("héllo") == 5, but db "héllo" declares 6 bytes, 68 c3 a9 6c 6c 6f.

If strings acted as byte arrays, and #885 allowed \0 bytes in strings, then #933 could implement a single READFILE function for both text and binary files. We would not need to implement numeric arrays (#67) just for that one use case (and given all the open questions about how arrays should behave, and the lack of string arrays anyway, I'd rather not have them.)

Changing the behavior of STRLEN and STRSUB would be a potentially breaking change, but I think it would be better than adding "STRBYTELEN" and "STRBYTESUB" functions, since UTF-8 encoding is the unusual special case. Note that rgbds-struct's uses of STRLEN and STRSUB would all be valid even if the definitions were changed; and hypothetical cases that would break should probably be using CHARLEN and CHARSUB anyway.)

(One other useful function would be STRBYTE(str, idx), to get the raw byte value at an index, without going through the charmap. That is, STRSUB("ABCD", 2, 1) and CHARSUB("ABCD", 2) return the string "B" which coerces to the number $42 if you haven't charmapped it; but STRBYTE("ABCD", 2) would return $42 directly.)

(Another nice addition along with this would be to allow \0 as a way to put $00 bytes in strings. It can be inconvenient to have literal null bytes in a file, but all the others are fine.)

We would probably also want to get rid of the "Input string is not valid UTF-8!" warning in charmap.c, which I think is the only other place where UTF-8 encoding matters.

@Rangi42 Rangi42 added enhancement Typically new features; lesser priority than bugs rgbasm This affects RGBASM labels Nov 12, 2021
@Rangi42 Rangi42 added this to the v0.5.3 milestone Nov 12, 2021
@ISSOtm
Copy link
Member

ISSOtm commented Nov 12, 2021

You would expect db "{s}" to declare STRLEN("{s}") many bytes, but actually STRLEN undercounts since there are multi-byte UTF-8 characters. STRLEN("héllo") == 5, but db "héllo" declares 6 bytes, 68 c3 a9 6c 6c 6f.

No, you wouldn't, because charmaps.

@Rangi42
Copy link
Contributor Author

Rangi42 commented Nov 12, 2021

No, you wouldn't, because charmaps.

That's assuming there are no charmaps involved besides the default one, so STRLEN("{s}") == CHARLEN("{s}").

@Rangi42
Copy link
Contributor Author

Rangi42 commented Nov 13, 2021

Basically these are what I see as the four ways forward:

  1. The status quo: we have strings for which STRLEN and STRSUB expect UTF-8 encoding. We're going to switch from leaky char *s to ref-counted struct Strings or RAII-with-smart-pointers std::strings to allow unlimited string lengths. We could also add a READFILE function to read the contents of a UTF-8 text file as a string, but can't handle binary files. One of the motivating use cases for even adding file-reading functions was to add an offset to the bytes of a tilemap, which would have to be binary, so I think we should try to allow that.
  2. Add arrays/lists and a READBIN function to return an array for binary files, plus READFILE for strings for UTF-8 text files. Given the uncertainties and tradeoffs we ran into when considering how arrays would be implemented, and how major a feature it would be mostly just for the sake of enabling READBIN, I'd rather not do that.
  3. Let READFILE return a string for binary files too. Define new functions STRBYTELEN, STRBYTESUB, and STRBYTE to get the length, substrings, and individual bytes from a string, without expecting any particular text encoding. This would still require us to allow $00 bytes in strings, but that's not a problem; it's feasible as long as we don't rely on string.h functions for algorithms (which neither the struct Strings nor std::strings need to do).
  4. Let READFILE return a string for binary files too. Change STRLEN and STRSUB to not expect any particular text encoding, and add STRBYTE to get individual bytes from a string. Optionally add STRLENUTF8 and STRSUBUTF8 to allow the current behavior (which I do think is worthwhile, even though in most cases where you care, you should probably be using CHARLEN and CHARSUB).

I could certainly be missing an even better fifth way of allowing users to access binary file contents, so here or #933 is fine for discussing that (or #67 if arrays are the preferred solution).

@aaaaaa123456789
Copy link
Member

I'd say #3 is the best option by far.

@Rangi42
Copy link
Contributor Author

Rangi42 commented Nov 14, 2021

Hm, I would somewhat prefer 4 since I expect (a) non-UTF-8-specific would be the more common use case, and (b) hopefully few/no users are depending on UTF-8 STRLEN and STRSUB so far; but either would be fine with me.

@aaaaaa123456789
Copy link
Member

STRLEN and STRSUB have to expect some encoding; there's no meaningful concept of "string length" without one. The encoding where every byte encodes itself is an encoding (ISO-8859-1).

@Rangi42
Copy link
Contributor Author

Rangi42 commented Nov 14, 2021

Option 3 would make STRLEN behave like C's strlen (except without needing the $00 terminator after we finish PR #885, i.e. STRLEN would return the struct String's size value). STRSUB would likewise act like taking a segment of a char[] array. Neither of those cares about the encoding; the string is just an array of bytes.

True, ISO-8859-1 is an encoding that has single-byte characters, but it's not the only one. And the rgbasm language would not be taking a position on which Unicode code points go with which byte values in strings. So I don't think of option 3 as "switch from UTF-8 to ISO-8859-1", but "switch from UTF-8 to arbitrary unsigned byte values". Even charmaps don't really care about Unicode; the character set only becomes relevant when you print things, and that's up to your console. (Also ISO-8859-1 does not define characters for 00-1F or 7F-9F.)

@ISSOtm
Copy link
Member

ISSOtm commented Mar 28, 2022

Given https://hsivonen.fi/string-length, I'm for option 4 as well.

@Rangi42 Rangi42 removed this from the v0.9.0 milestone Aug 6, 2024
@Rangi42
Copy link
Contributor Author

Rangi42 commented Jan 17, 2025

https://discord.com/channels/303217943234215948/661193788802203688/852865070539079701 :

RGBDS strings should be encoding-agnostic like the rest of the language
I.e. only assume ASCII compatibility and nothing else
(We should change STR{UPR,LWR} to respect that, and only operate on ASCII values, if that's not already the case)

Keeping this in mind for a potential rethink of this issue.

@ISSOtm
Copy link
Member

ISSOtm commented Jan 17, 2025

FWIW, the Rust port will enforce the entire input document to be UTF-8. (So that point will be moot.) But since non-ASCII capitalisation is locale-dependent anyway, I think we should stick to ASCII.

@Rangi42
Copy link
Contributor Author

Rangi42 commented Jan 17, 2025

I expect the Rust port will retain the STRLEN and STRSUB behavior of expecting+parsing UTF-8 codepoints, so for example STRLEN("Pokémon") is 7 and STRSUB("Pokémon", 4, 1) is "é".

If that's the case, then I think this would be a forward-compatible solution:

  1. Add STRBYTELEN(), STRBYTESUB(), and STRBYTE() functions which treat strings as raw bytes. (So STRBYTELEN("Pokémon") is 8, STRBYTESUB("Pokémon", 4, 2) is "é", STRBYTE("Pokémon", 4) is $C3, and STRBYTE("Pokémon", 5) is $A9.)
  2. Add READFILE() function to get the contents of a file as a string.

Then if you want to deal with a UTF-8 text file you can process the returned string with STRLEN/STRSUB, and if you want to deal with it as binary data you can use STRBYTELEN/STRBYTE.

(I don't expect us to add a whole new kind of entity, "arrays/lists", just for numeric data; although if we do, there's #67 for it.)

@ISSOtm
Copy link
Member

ISSOtm commented Jan 17, 2025

I think I'd like it called bstr, for "binary string". Or "bytes string".

Or, yes, more realistically, I want #67.

@Rangi42
Copy link
Contributor Author

Rangi42 commented Jan 17, 2025

BSTRLEN, BSTRSUB, STRBYTE, then. Those will go well with #854: strings can thus be treated as sequences of bytes, of UTF-8 characters, of charmap characters, or of charmapped values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Typically new features; lesser priority than bugs rgbasm This affects RGBASM
Projects
None yet
Development

No branches or pull requests

3 participants