Parse feeds encoded with GB2312 #43
Comments
Looks like there’s a way to register custom encoding<->UTF-8 handlers with libxml2, which it will use when it finds a registered encoding name in the XML preamble (or HTML meta tag): http://xmlsoft.org/encoding.html
Another possible option is just to convert the text to UTF-8 before passing it to the parser. This would mean first detecting the encoding and recognizing that it needs conversion.
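The pre-conversion idea above could look roughly like the following. This is a minimal Python sketch for illustration only, not the project's code: `to_utf8` is a hypothetical helper, it assumes the source encoding is already known, and it assumes an ASCII-compatible encoding such as GB2312. It also rewrites the encoding named in the XML declaration so the parser is not told to expect GB2312 after the bytes have been converted.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute of a leading XML declaration.
_DECL_ENCODING = re.compile(
    rb"""^(\s*<\?xml[^>]*?encoding\s*=\s*)(["'])([A-Za-z][A-Za-z0-9._-]*)\2"""
)

def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data as UTF-8 and make the XML declaration agree (hypothetical helper)."""
    utf8 = data.decode(source_encoding).encode("utf-8")
    # The declaration is plain ASCII, so it can be patched on the byte level.
    return _DECL_ENCODING.sub(rb"\g<1>\g<2>UTF-8\g<2>", utf8, count=1)

feed = (
    '<?xml version="1.0" encoding="GB2312"?>'
    "<rss><channel><title>新闻</title></channel></rss>"
).encode("gb2312")

root = ET.fromstring(to_utf8(feed, "gb2312"))
title = root.find("channel/title").text
```

The cost of this approach is an extra copy of the whole document before parsing begins.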
Seems like we would have to detect the encoding up front for either of those scenarios. You might want to consider running it once through the XML parser hoping for the best, and then doing the analysis/reparsing only if it fails.
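A rough sketch of that "parse first, analyze only on failure" flow, again in Python and purely illustrative: `parse_with_fallback` and the `FALLBACK_ENCODINGS` list are assumptions, not anything from the project. The common case pays nothing extra, and the retry path decodes the bytes itself and drops the XML declaration so the parser does not see a stale encoding name.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical candidate list; a real reader would choose its own.
FALLBACK_ENCODINGS = ("gb2312", "gb18030")

# A leading XML declaration, removed once the text has been decoded by hand.
_XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>")

def parse_with_fallback(data: bytes, encodings=FALLBACK_ENCODINGS) -> ET.Element:
    """Try the parser as-is, then retry with each candidate encoding."""
    try:
        return ET.fromstring(data)
    except (ET.ParseError, ValueError, LookupError):
        pass  # The parser could not handle the bytes directly.
    for encoding in encodings:
        try:
            text = data.decode(encoding)
            return ET.fromstring(_XML_DECL.sub("", text, count=1))
        except (UnicodeDecodeError, ET.ParseError):
            continue  # Wrong guess or still malformed; try the next one.
    raise ValueError("could not parse feed with any fallback encoding")
```

Note that a wrong guess can still decode without error and produce garbled text, which is the detection problem raised elsewhere in this thread.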
I like the libxml version just because it takes care of the detection from the declaration in the document. Detecting encodings can be tough; a lot of byte strings are valid in a few different East Asian encodings, for example, at least in my limited testing. The detection method on It might be simple enough to register a handler that just uses iconv.
We could also detect the encoding based on the declaration in the document.
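Reading the declared encoding is cheap if the document starts with an XML declaration. A minimal Python sketch, assuming an ASCII-compatible encoding (so not UTF-16) and looking only at the XML declaration, not at an HTML meta tag; `declared_encoding` is a hypothetical name:

```python
import re
from typing import Optional

# The encoding pseudo-attribute of a leading XML declaration.
_XML_DECL = re.compile(
    rb"""^\s*<\?xml[^>]*?encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1"""
)

def declared_encoding(data: bytes) -> Optional[str]:
    """Return the lowercased encoding named in the XML declaration, if any."""
    match = _XML_DECL.match(data[:1024])  # The declaration must come first.
    if match is None:
        return None
    return match.group(2).decode("ascii").lower()
```

This trusts the document to describe itself correctly, so it avoids guessing between similar encodings but does nothing for feeds that declare the wrong encoding or none at all.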
Maybe.
See the tests — there’s a commented-out test.
Related: Ranchero-Software/NetNewsWire#1477