Handle CoNLL-U comments #27

Open
DavidNemeskey opened this issue Aug 19, 2021 · 3 comments
@DavidNemeskey
Collaborator

emtsv does not handle CoNLL-U comments very well. If the input is a TSV file, one of two things happens:

  1. If the file only has the form column, each comment (a line starting with "# ") is treated as a token and analyzed as a single "word".
  2. If the file has other columns (e.g. form anas lemma xpostag, to which I want to add upostag feats), only the new header is returned.

Expected behavior: comments should be kept in the text and returned as-is, and they should not prevent emtsv from analyzing the text (as they do in the second case).
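To make the expected behavior concrete, here is a minimal sketch in plain Python. It is not how emtsv or xtsv actually implement this; `process_keeping_comments`, `new_columns` and `analyze` are hypothetical names used only for illustration. It assumes the first line is the header, that comments start with "# ", and that an empty line marks a sentence boundary.

```python
from typing import Callable, Iterable, Iterator, List


def process_keeping_comments(
    lines: Iterable[str],
    new_columns: List[str],
    analyze: Callable[[List[str]], List[str]],
) -> Iterator[str]:
    """Pass comments through unchanged while analyzing token rows.

    `analyze` is a hypothetical stand-in for a module: it takes the
    fields of one token row and returns the fields with new ones added.
    """
    header_seen = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not header_seen:
            # Assumption: the first line is the TSV header; extend it
            # with the names of the columns the module adds.
            header_seen = True
            yield "\t".join(line.split("\t") + list(new_columns))
        elif line.startswith("# ") or line == "":
            # Comments and sentence boundaries are returned as-is.
            yield line
        else:
            # Ordinary token row: hand the fields to the analyzer.
            yield "\t".join(analyze(line.split("\t")))


if __name__ == "__main__":
    def dummy_analyze(fields: List[str]) -> List[str]:
        # Placeholder analysis: appends a constant tag to every token.
        return fields + ["X"]

    sample = ["form\n", "# text = A kutya\n", "A\n", "kutya\n", "\n"]
    for out_line in process_keeping_comments(sample, ["upostag"], dummy_analyze):
        print(out_line)
```

With this approach the comment line comes back untouched between the extended header and the analyzed tokens, which covers both cases described above.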

@dlazesz
Collaborator

dlazesz commented Aug 19, 2021

CoNLL-U comments need to be explicitly enabled with the conllu-comments parameter.
We may flip the default behaviour to enabled in some future release.

I agree that the documentation is very sparse on this.

@DavidNemeskey
Collaborator Author

Yes, I think it would make sense if that was the default. Should I do it in a PR (+ add a sentence about it to the docs)?

@dlazesz
Collaborator

dlazesz commented Aug 23, 2021

Specifying this in the docs is OK, but changing the default requires a new major version, at least in xtsv. Breaking changes like this should be committed in batches to minimise disruption. (We have others in mind.)

@mittelholcz What do you think?
