Patch margin index splits Unicode surrogate pairs unexpectedly #149
Comments
this is a good catch, @kkshinkai - I think I've seen this reported elsewhere too but I can't find it now. there should be a relatively easy operation here, which is to back up and include the previous code unit if we're on a split surrogate.
this oversimplifies it because we could start a document on a low surrogate, and then we'd need to cut it out, but I think all that has to happen is include up to one additional character. it's not changing the contract of making a margin either because that additional code unit is still part of a character we've already expanded into. happy to review any patch you want. I haven't looked at

padding += this.get_margin_before(path2.start, this.Patch_Margin);
pattern = text.substring(patch.start2 - padding,
                         patch.start2 + patch.length1 + padding);

in other words, a
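The "back up and include one more code unit" idea can be sketched roughly as below. This is only an illustration under assumptions: `safeStart`, `safeEnd`, and `contextWithMargin` are hypothetical names, not functions in diff-match-patch, and the sketch leaves a lone low surrogate at the very start of a document untouched rather than cutting it out.

```javascript
// Hypothetical helpers, not part of diff-match-patch.
function isHighSurrogate(codeUnit) {
  return codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
}

function isLowSurrogate(codeUnit) {
  return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;
}

// If a start index lands on the low half of a pair, step back one code unit
// so the whole character is included.
function safeStart(text, index) {
  if (index > 0 && index < text.length &&
      isLowSurrogate(text.charCodeAt(index)) &&
      isHighSurrogate(text.charCodeAt(index - 1))) {
    return index - 1;
  }
  return index;
}

// If an exclusive end index would cut between the two halves of a pair,
// step forward one code unit so the whole character is included.
function safeEnd(text, index) {
  if (index > 0 && index < text.length &&
      isHighSurrogate(text.charCodeAt(index - 1)) &&
      isLowSurrogate(text.charCodeAt(index))) {
    return index + 1;
  }
  return index;
}

// Take `margin` code units of context on each side of [start, start + length),
// widening by at most one code unit per side to avoid splitting a pair.
function contextWithMargin(text, start, length, margin) {
  const from = safeStart(text, Math.max(0, start - margin));
  const to = safeEnd(text, Math.min(text.length, start + length + margin));
  return text.substring(from, to);
}
```

Each side grows by at most one code unit, so the margin is never smaller than requested and the extra unit always belongs to a character the margin had already reached into.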
Does it mean that the library is not yet safe to use with Emoji?
I think I've just run into this issue. Not sure yet, but seems to make sense. Will report back if I learn anything on closer inspection.
Yes and no. It works for many inputs, but we need to merge in the fixes to eliminate splitting surrogate pairs when generating these outputs.
It's not only Emoji that are affected, of course, but Emoji are the most notable code points. Everything outside the Basic Multilingual Plane will experience this problem. There's another sneaky problem for deleted spans of text when using versions of In #80 I've fixed the delta format we use in Simplenote, but that's not merged into the main code. I also keep saying I'm going to fork this repo, but without any official announcement in the README that this library is unmaintained it seems silly to maintain a second copy that nobody will find 🤷‍♂️. You are free to use that PR's branch and if you propose a new PR to fix the
Thanks for your work dmsnell, you went deep on this! I'm not sure I've wrapped my head around all that's being discussed, but I used your version here and it seems to fix my problem. So thank you. I'll keep an eye out for further updates.
When the function diff_match_patch.prototype.patch_addContext_ adds context to a patch, it increases or decreases the index by a constant, Patch_Margin = 4. However, since JavaScript's substring function operates with UTF-16 code unit indexing, there's a chance that Patch_Margin may split a Unicode surrogate pair.

Consider the following example:

The output is "\uddee **" (🧮 corresponds to "\ud83e\uddee").

If you attempt to use diff_match_patch.patch_obj.prototype.toString on this patch, it leads to a crash. encodeURI will throw a URIError if the URI contains a lone surrogate.

A straightforward solution might involve adding a verification step after applying Patch_Margin to ensure the indices remain valid. I can start a PR, but I've noticed that Patch_Margin is used in many places, and I'm unsure about the best way to make changes.
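For readers who want to see the failure mode without the library, here is a minimal standalone sketch. It is not the original reproduction from this report. The string and indices are chosen for illustration, and `hasLoneSurrogate` and `tryEncode` are hypothetical helper names. It only shows that substring cuts by code unit and that encodeURI rejects the result.

```javascript
// "🧮 ** " written with explicit escapes: the emoji occupies two UTF-16 code units.
const text = "\ud83e\uddee ** ";

// Cutting at code-unit index 1 drops the high surrogate and keeps the low one.
const fragment = text.substring(1, 5);

// Hypothetical check: true if the string contains an unpaired surrogate.
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // A high surrogate must be followed by a low surrogate.
      const next = s.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a well-formed pair
        continue;
      }
      return true;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

// Hypothetical wrapper: returns null instead of throwing on malformed input.
function tryEncode(s) {
  try {
    return encodeURI(s);
  } catch (e) {
    if (e && e.name === "URIError") return null;
    throw e;
  }
}

console.log(hasLoneSurrogate(text));     // the full string is well formed
console.log(hasLoneSurrogate(fragment)); // the cut fragment is not
console.log(tryEncode(fragment));        // encodeURI throws, so the wrapper yields null
```

A check like `hasLoneSurrogate` is one possible shape for the verification step mentioned above. Where it would need to be called inside the library is not something this sketch settles.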