infer incorrectly identifies mkv as webm #96

Open
a99984b1799 opened this issue Jul 7, 2024 · 2 comments

Comments

@a99984b1799

Demo:

$ ffprobe example.mkv
...
Input #0, matroska,webm, from example.mkv:
  Metadata:
    ENCODER         : Lavf60.16.100
  Duration: 00:02:46.10, start: 0.000000, bitrate: 42034 kb/s
  Stream #0:0: Video: hevc (Main), yuv420p(tv, bt709), 1920x1080 [SAR 1:1 DAR 16:9], 50 fps, 50 tbr, 1k tbn
...

$ cargo run -- example.mkv
Inferred: Ok(Some(Type { matcher_type: Video, mime_type: "video/webm", extension: "webm" }))

Code:

fn main() {
    let inf = std::env::args().nth(1).unwrap();
    println!("Inferred: {:?}", infer::get_from_path(&inf));
}

The current code detects two byte patterns. This file doesn't contain the first one:

$ xxd -d example.mkv | head -n1
00000000: 1a45 dfa3 a342 8681 0142 f781 0142 f281  .E...B...B...B..
$ #                 ^ diverges here

And does contain the second one, but at a different offset (24-31 instead of 31-38):

$ xxd -d example.mkv | grep "6d61 7472 6f73 6b61"
00000016: 0442 f381 0842 8288 6d61 7472 6f73 6b61  .B...B..matroska
@a99984b1799

Here's an example: example.zip. It's H.264 + Opus and exhibits similar behaviour. I recorded it with OBS.

@a99984b1799

It appears that the file utility works by matching 1a45 dfa3 at the start of the file, then searching for the pattern \x42\x82.matroska (in regex syntax) anywhere in the first 4K.

4K seems very large. My understanding of the spec is that Matroska files must start with an EBML document, which must start with a header, which must contain their docType (matroska). The header can only be so big, so I think searching within the first ~256 bytes is fair.
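
For illustration, here is a rough sketch of what that check could look like. This is not infer's current code and not necessarily what the PR will contain; `is_matroska`, `EBML_MAGIC` and `SEARCH_WINDOW` are names made up for this example, and the 256-byte window is just the estimate from above.

```rust
/// Magic bytes that open every EBML document (Matroska and WebM alike).
const EBML_MAGIC: [u8; 4] = [0x1A, 0x45, 0xDF, 0xA3];

/// Assumed search window, taken from the ~256 byte estimate above.
const SEARCH_WINDOW: usize = 256;

/// Hypothetical matcher: true if `buf` starts with the EBML magic and the
/// DocType element (ID 0x42 0x82, one size byte, then "matroska") appears
/// anywhere within the first `SEARCH_WINDOW` bytes.
fn is_matroska(buf: &[u8]) -> bool {
    if !buf.starts_with(&EBML_MAGIC) {
        return false;
    }

    let doc_type = b"matroska";
    let window = &buf[..buf.len().min(SEARCH_WINDOW)];

    // Slide an 11-byte window: 2 ID bytes + 1 size byte (any value) + 8 name bytes.
    window
        .windows(3 + doc_type.len())
        .any(|w| w[0] == 0x42 && w[1] == 0x82 && &w[3..] == &doc_type[..])
}
```

Because the DocType is searched for rather than expected at a fixed offset, this would accept the header from the xxd output above (where "matroska" sits at 24-31) as well as files where it sits at 31-38. The size byte is treated as a wildcard here, mirroring the `.` in the regex.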

I'll make a PR for this soon 👍
