Spec and WPT inconsistencies #239

anonrig · 2024-12-18T20:46:57Z

What is the issue with the URL Pattern Standard?

There is a web-platform test that is even implemented by Chrome that is not covered with the URLPattern spec.

new URLPattern({ "protocol": "http", "port": "80 " })

This fails on Chromium due to this function validation: https://chromium.googlesource.com/chromium/src/+/main/extensions/common/url_pattern.cc#101

But other than canonicalizePort there is no place that actually validates the validity of the port, and canonicalizePort calls url parser setter which removes leading and trailing spaces which makes "80 " input valid.

Relevant WPT: https://github.com/web-platform-tests/wpt/blob/0c1d19546fd4873bb9f4147f0bbf868e7b4f91b7/urlpattern/resources/urlpatterntestdata.json#L1146

The text was updated successfully, but these errors were encountered:

anonrig · 2024-12-19T15:45:36Z

Another issue @annevk : new URLPattern({ hostname: 'bad#hostname' }); should not throw but there is a WPT that validates that it throws.

{
    "pattern": [{ "hostname": "bad#hostname" }],
    "expected_obj": "error"
  },

This is wrongly implemented on Deno's URLPattern, Cloudflare's workerd and Chromium.

annevk · 2024-12-19T15:59:49Z

Hmm, but # is a https://url.spec.whatwg.org/#forbidden-host-code-point so that does seem weird?

anonrig · 2024-12-19T16:01:24Z

If I understand it correctly: This doesn't fail, hence it shouldn't fail on URLPattern as well:

const url = new URL('fake://dummy.test')
u.hostname = 'bad#hostname';

anonrig · 2024-12-19T16:03:21Z

Host parser (specifically domain to ASCII with domain and false) strip all trailing values whenever it sees # https://url.spec.whatwg.org/#concept-host-parser

anonrig · 2024-12-24T00:48:02Z

Another test case is invalid: ada-url/ada@d17f000

If you run the following on Google Chrome, you'll get the following error:

> new URLPattern("https://{sub.}?example{.com/}foo")

VM970:1 Uncaught TypeError: Failed to construct 'URLPattern': Invalid hostname pattern '{sub.}?example{.com/}foo'. Invalid hostname pattern 'example.com/foo'.
    at <anonymous>:1:1

But, canonicalize_hostname() method spec doesn't mention anything about certain edge cases: https://urlpattern.spec.whatwg.org/#canonicalize-a-hostname

Particularly, the following should work and works according to URL spec:

Welcome to Node.js v23.4.0.
Type ".help" for more information.
> let u = new URL('fake://dummy.test')
undefined
> u
URL {
  href: 'fake://dummy.test',
  origin: 'null',
  protocol: 'fake:',
  username: '',
  password: '',
  host: 'dummy.test',
  hostname: 'dummy.test',
  port: '',
  pathname: '',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}
> u.hostname = 'example.com/foo'
'example.com/foo'
> u
URL {
  href: 'fake://example.com',
  origin: 'null',
  protocol: 'fake:',
  username: '',
  password: '',
  host: 'example.com',
  hostname: 'example.com',
  port: '',
  pathname: '',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}

Therefore, this test case shouldn't fail.

jeremyroman · 2024-12-27T17:35:35Z

Going to see how many of these I can get through today.

For this example:

new URLPattern({ "protocol": "http", "port": "80 " })

that's not the relevant Chromium code, but instead that we use url::CanonicalizePort here and here, but that function does not strip whitespace, which the URL port setter does as part of the basic URL parser, when it does this:

Remove all ASCII tab or newline from input.

I think this change is probably minor enough (especially since it only makes previously invalid patterns valid) that changing the implementation(s) to match the spec is okay.

anonrig · 2024-12-27T18:32:06Z

@jeremyroman I think there are more invalid cases like this. I've removed and updated the following test cases on urlpatterntestdata.json. Ref: https://github.com/ada-url/ada/pull/785/files#diff-b8289182a638cd2b53ad33fee3a3aeaeb6fcd4d73e2cd9ebb0974a0fc8c2236f

Appreciate if you could take a look. They are mostly around hostname canonicalize method due to Chromium not using the URL parser.

jeremyroman · 2024-12-27T20:54:19Z

Yeah, I'm in the process of looking into what you've said.

For the port example, on further inspection the change I mention does address some whitespace (newlines and tabs) but not spaces. In the Chromium implementation, it also winds through ParsePortFromStringPosition which simply ignores any leading zeroes and any junk after the ASCII digit sequence, whose spec counterpart is here.

jeremyroman · 2024-12-27T23:52:05Z

For the hostname ones, the cases of /?# seem quite parallel to the port ones.

The case of \ is quite weird -- during pattern parsing we can't tell for sure if the URL is special so treat it as not, but for interpreting the init dictionary to test/exec etc we really could do so. The spec currently doesn't really try to do anything about that, though. Thoughts?

For the other bad characters, we have comments in Chromium linking to https://issues.chromium.org/u/0/issues/40124263. If it's just a matter of a (long-standing?) Chromium-specific bug I suppose we should probably test the standard behavior, though that kinda suggests it might be tough for us to fix which might motivate the spec changing. Not familiar enough with that bug yet to comment off the top of my head.

jeremyroman · 2024-12-30T22:03:10Z

I'm beginning to think that port and possibly also hostname canonicalization should be revisited in the spec, rather than in the implementations and tests. I've looked more at port, but there are other quirks of the current specified behavior. For instance, the pattern 8{000}? looks like it should match 8 and 8000 but in fact matches 8 and 80 because 000 is canonicalized to 0, and for hostnames {4.}?google.com becomes {0.0.0.4}?google.com (which matches 0.0.0.4google.com which isn't even the same domain).

A version of that for ports might be:

The algorithm used when compiling a component for port should simply strip TAB/LF/CR (because URL does that here and elsewhere, though I would also be okay with not doing that at all) and error if any non-ASCII-digit character is present after that.
The algorithm used in process port for init (in the non-pattern case) should behave more like the URL port setter, but not silently no-op like it does (it leaves it set to whatever it was before; here there it doesn't seem reasonable to interpret that as ""). That implies that leading junk characters (excl TAB/LF/CR + digits) throw an exception, but everything starting at a trailing one is ignored.

Optionally, we could also be stricter than do the same as the previous plus the range and default port checks (which are a little stranger on a fixed part). I think that's more predictable ("80invalid" being "80" or "" depending on protocol, but "invalid80" failing is consistent with the URL port setter but weird).

Curious for your opinions as well as those of @sisidovski. But basically I think the spec might be the thing that ought to give, possibly with changes to the implementations and tests if what it changes to doesn't match them.

ljharb · 2024-12-30T22:36:14Z

{000} normalized to {0} seems very weird to me indeed.

anonrig · 2024-12-31T01:26:24Z

@jeremyroman All canonicalize methods are basically encountering the same exact issues like hostname because Chromium implementation does not call URL. Since URL specification is a living specification, I don't see how URLPattern can be spec compliant while using URL. There are more test cases that are failing for pathname as well.

I recommend having an "example" implementation to ensure that the behavior is same in all implementations. With the current state, it's not possible to complete implementation and can be spec compliant where the only implementation does not follow the spec...

On top of that, the existing web-platform tests are not explanatory and indeed very cryptic. For example, I can't seem to understand how URLPattern spec uses a hacky solution in a spec in canonicalize_pathname. And I haven't found an answer to #240

I think we should have a call to understand what our options are, and proceed accordingly. I'm more than happy to help!

FYI: I'm working on adding URLPattern to Ada (which is used by Node.js and CF Workers for URL) which will power Node.js and Cloudflare workers URLPattern implementations.

anonrig · 2024-12-31T16:09:39Z

Another inconsistency. There is a test in WPT that takes a pathname that starts with . and pathname property returns /foo/bar

let u = new URLPattern({ pathname: './foo/bar', baseURL: 'https://example.com' })
console.log(u.pathname) // returns /foo/bar

But it should return ./foo/bar.

The inconsistency comes from Chromium not following the spec. It should check for has_opaque_host, but it checks for opaque post with a flag...

if (url::IsUsingStandardCompliantNonSpecialSchemeURLParsing()) {
    return IsStandard() || (IsValid() && !HasOpaquePath());
  }

Ref: https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/core/url_pattern/url_pattern.cc;l=225;drc=296a0fe8d8bfa678632bf9b6b4a8af38c998a5e6;bpv=1;bpt=1

This was referenced Dec 19, 2024

fix invalid urlpattern test(s) web-platform-tests/wpt#49782

Open

canonicalizePort does not follow spec. denoland/rust-urlpattern#53

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spec and WPT inconsistencies #239

Spec and WPT inconsistencies #239

anonrig commented Dec 18, 2024

anonrig commented Dec 19, 2024

annevk commented Dec 19, 2024

anonrig commented Dec 19, 2024

anonrig commented Dec 19, 2024

anonrig commented Dec 24, 2024

jeremyroman commented Dec 27, 2024

anonrig commented Dec 27, 2024

jeremyroman commented Dec 27, 2024

jeremyroman commented Dec 27, 2024

jeremyroman commented Dec 30, 2024

ljharb commented Dec 30, 2024

anonrig commented Dec 31, 2024

anonrig commented Dec 31, 2024

Spec and WPT inconsistencies #239

Spec and WPT inconsistencies #239

Comments

anonrig commented Dec 18, 2024

What is the issue with the URL Pattern Standard?

anonrig commented Dec 19, 2024

annevk commented Dec 19, 2024

anonrig commented Dec 19, 2024

anonrig commented Dec 19, 2024

anonrig commented Dec 24, 2024

jeremyroman commented Dec 27, 2024

anonrig commented Dec 27, 2024

jeremyroman commented Dec 27, 2024

jeremyroman commented Dec 27, 2024

jeremyroman commented Dec 30, 2024

ljharb commented Dec 30, 2024

anonrig commented Dec 31, 2024

anonrig commented Dec 31, 2024