[Feature request] Strings as byte arrays #938
Comments
No, you wouldn't, because charmaps.
That's assuming there are no charmaps involved besides the default one, so
Basically these are what I see as the four ways forward:
I could certainly be missing an even better fifth way of allowing users to access binary file contents, so here or #933 is fine for discussing that (or #67 if arrays are the preferred solution).
I'd say #3 is the best option by far.
Hm, I would somewhat prefer 4 since I expect (a) non-UTF-8-specific would be the more common use case, and (b) hopefully few/no users are depending on UTF-8
Option 3 would make True, ISO-8859-1 is an encoding that has single-byte characters, but it's not the only one. And the rgbasm language would not be taking a position on which Unicode code points go with which byte values in strings. So I don't think of option 3 as "switch from UTF-8 to ISO-8859-1", but "switch from UTF-8 to arbitrary unsigned byte values". Even charmaps don't really care about Unicode; the character set only becomes relevant when you print things, and that's up to your console. (Also ISO-8859-1 does not define characters for 00-1F or 7F-9F.)
Given https://hsivonen.fi/string-length, I'm for option 4 as well.
https://discord.com/channels/303217943234215948/661193788802203688/852865070539079701 :
Keeping this in mind for a potential rethink of this issue.
FWIW, the Rust port will enforce the entire input document to be UTF-8. (So that point will be moot.) But since non-ASCII capitalisation is locale-dependent anyway, I think we should stick to ASCII.
I expect the Rust port will retain the If that's the case, then I think this would be a forward-compatible solution:
Then if you want to deal with a UTF-8 text file you can process the returned string with (I don't expect us to add a whole new kind of entity, "arrays/lists", just for numeric data; although if we do, there's #67 for it.)
I think I'd like it called Or, yes, more realistically, I want #67.
In most contexts, strings are already just byte sequences. String literals can contain any bytes (except for `\0`, which currently terminates the string, but using C++ `std::string` could avoid this). String functions like `STRCAT` and `STRRPL` operate on the bytes and do not care about encoding. Even `print` and `println` just send the bytes to stdout; things print as UTF-8 iff that is set as the console's locale.

The only functions that warn about strings which aren't valid UTF-8 are `STRLEN` and `STRSUB`. I think this is actually a mistake, and we should have `STRLENUTF8` and `STRSUBUTF8` if that behavior is desired.

You would expect `db "{s}"` to declare `STRLEN("{s}")` many bytes, but actually `STRLEN` undercounts since there are multi-byte UTF-8 characters. `STRLEN("héllo") == 5`, but `db "héllo"` declares 6 bytes, `68 c3 a9 6c 6c 6f`.

If strings acted as byte arrays, and #885 allowed `\0` bytes in strings, then #933 could implement a single `READFILE` function for both text and binary files. We would not need to implement numeric arrays (#67) just for that one use case (and given all the open questions about how arrays should behave, and the lack of string arrays anyway, I'd rather not have them).

Changing the behavior of `STRLEN` and `STRSUB` would be a potentially breaking change, but I think it would be better than adding "`STRBYTELEN`" and "`STRBYTESUB`" functions, since UTF-8 encoding is the unusual special case. (Note that rgbds-struct's uses of `STRLEN` and `STRSUB` would all be valid even if the definitions were changed; and hypothetical cases that would break should probably be using `CHARLEN` and `CHARSUB` anyway.)

(One other useful function would be `STRBYTE(str, idx)`, to get the raw byte value at an index, without going through the charmap. That is, `STRSUB("ABCD", 2, 1)` and `CHARSUB("ABCD", 2)` return the string `"B"` which coerces to the number $42 if you haven't charmapped it; but `STRBYTE("ABCD", 2)` would return $42 directly.)

(Another nice addition along with this would be to allow `\0` as a way to put $00 bytes in strings. It can be inconvenient to have literal null bytes in a file, but all the others are fine.)

We would probably also want to get rid of the "Input string is not valid UTF-8!" warning in charmap.c, which I think is the only other place where UTF-8 encoding matters.
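
For anyone who wants to check the byte-versus-character mismatch outside of rgbasm, here is a rough Python sketch. It is not rgbasm code and not how rgbasm is implemented. The helper names `char_len`, `byte_len` and `str_byte` are made up for illustration, and `str_byte` only models the proposed (not existing) `STRBYTE`, assuming UTF-8 source text and the same 1-based indexing as `STRSUB`.

```python
def char_len(s: str) -> int:
    """Count Unicode code points, analogous to a UTF-8-aware length."""
    return len(s)


def byte_len(s: str) -> int:
    """Count the bytes that would be emitted, assuming UTF-8 encoded source."""
    return len(s.encode("utf-8"))


def str_byte(s: str, idx: int) -> int:
    """Hypothetical STRBYTE: raw byte at a 1-based index, no charmap involved."""
    data = s.encode("utf-8")
    if not 1 <= idx <= len(data):
        raise IndexError(f"index {idx} is outside 1..{len(data)}")
    return data[idx - 1]


text = "h\u00e9llo"  # "héllo" with a precomposed e-acute (U+00E9)
print(char_len(text))                 # 5
print(byte_len(text))                 # 6
print(text.encode("utf-8").hex(" "))  # 68 c3 a9 6c 6c 6f
print(hex(str_byte("ABCD", 2)))       # 0x42
```

The one-byte gap between the two lengths comes entirely from the two-byte encoding of `é`, which is the mismatch described above between the reported length and what `db` emits.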