This repository has been archived by the owner on Aug 10, 2024. It is now read-only.

Parse feeds encoded with GB2312 #43

Open
brentsimmons opened this issue Jan 10, 2020 · 6 comments

@brentsimmons
Collaborator

See the tests — there’s a commented-out test.

Related: Ranchero-Software/NetNewsWire#1477

@Wevah
Member

Wevah commented Jan 10, 2020

Looks like there’s a way to register custom encoding<->UTF-8 handlers with libxml2, which it will use when it finds a registered encoding name in the XML preamble (or HTML meta tag): http://xmlsoft.org/encoding.html

@brentsimmons
Collaborator Author

brentsimmons commented Jan 10, 2020

Another possible option is just to convert the text to UTF-8 before passing it to the parser. This would mean first detecting the encoding and recognizing that it needs conversion.
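A minimal sketch of the convert-first idea, written in Python purely for illustration (the project itself is not Python). It assumes the source encoding is already known; `transcode_to_utf8` is a hypothetical helper name, and the regular expression only handles a declaration at the very start of the document.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(r'^(\s*<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2')


def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode `data` from `source_encoding` and return UTF-8 bytes.

    Hypothetical helper: the caller is assumed to know the source encoding.
    """
    # Strict decoding raises UnicodeDecodeError if the bytes are not valid
    # in the given encoding.
    text = data.decode(source_encoding)
    # Rewrite the declaration so it no longer names the old encoding.
    text = _DECL_ENCODING.sub(r'\g<1>\g<2>UTF-8\g<2>', text, count=1)
    return text.encode("utf-8")
```

The declaration is rewritten as well as the bytes, because a parser that trusts the declaration would otherwise still see the old encoding name.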

@danielpunkass
Contributor

Seems like you would have to detect the encoding up front for either of those scenarios. You might want to consider running it once through the XML parser, hoping for the best, and then doing the analysis and reparsing only if that fails.
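The parse-first, fall-back-on-failure flow could look roughly like this. It is a sketch in Python using the standard library's `xml.etree.ElementTree`, not the parser this project uses; `parse_with_fallback` and its `fallback_encoding` parameter are hypothetical, and the fallback encoding is simply assumed rather than detected.

```python
import re
import xml.etree.ElementTree as ET


def parse_with_fallback(data: bytes, fallback_encoding: str = "gb2312") -> ET.Element:
    """Try a normal parse first; transcode and reparse only if that fails."""
    try:
        # Optimistic path: let the parser handle the declared encoding.
        return ET.fromstring(data)
    except (ET.ParseError, ValueError, LookupError):
        # The parser rejected the document or its encoding; fall through.
        pass

    # Slow path: decode with the assumed encoding, drop the old XML
    # declaration, and hand the parser plain UTF-8 bytes.
    text = data.decode(fallback_encoding)
    text = re.sub(r'^\s*<\?xml[^>]*\?>', '', text, count=1)
    return ET.fromstring(text.encode("utf-8"))
```

The cost of the extra decoding work is paid only by feeds that fail the first attempt, so well-formed UTF-8 feeds are unaffected.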

@Wevah
Member

Wevah commented Jan 11, 2020

I like the libxml version just because it takes care of the detection from the declaration in the document. Detecting encodings can be tough; a lot of byte strings are valid in several different East Asian encodings, for example, at least in my limited testing. The detection method on NSString seems to do a slightly better job than my attempt at a Swift version, though, so it might be using some other tricks.
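To illustrate why validity alone is a weak signal, here is a small Python sketch that lists every candidate encoding under which a byte string decodes without error. The `CANDIDATES` list and the `plausible_encodings` name are assumptions made for this example; they are not taken from NSString or any other detector.

```python
# Hypothetical candidate list; all are codec names in Python's standard library.
CANDIDATES = ("gb2312", "big5", "euc_kr", "euc_jp", "shift_jis")


def plausible_encodings(data: bytes, candidates=CANDIDATES):
    """Return the candidate encodings under which `data` decodes strictly."""
    matches = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # Not valid in this encoding; rule it out.
            continue
        matches.append(name)
    return matches


if __name__ == "__main__":
    sample = "\u4e2d\u6587".encode("gb2312")
    # More than one name in the output means validity alone cannot decide.
    print(plausible_encodings(sample))
```

When more than one name comes back, a detector needs extra evidence, such as character frequency or the declared encoding, to choose between them.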

It might be simple enough to register a handler that just uses iconv.

@brentsimmons
Collaborator Author

We could also detect the encoding based on the declaration in the document.
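Reading the encoding from the declaration could be sketched as follows, again in Python for illustration only. `declared_encoding` is a hypothetical helper, and it assumes an ASCII-compatible encoding (which GB2312 is), so it would not find a declaration in a UTF-16 document.

```python
import re
from typing import Optional

# Captures the encoding name from a leading XML declaration in raw bytes.
_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the lowercased encoding name from the XML declaration, if any."""
    # The declaration must come first, so inspecting a short prefix is enough.
    match = _DECL.match(data[:256])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

A caller could compare the result against the encodings the parser already supports and transcode only when the name is one it does not handle.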

@brentsimmons
Collaborator Author

Maybe.

@brentsimmons brentsimmons self-assigned this Jan 11, 2020