-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug: Synonym Sync: Duplicate rows #720
Open
joeflack4
wants to merge
8
commits into
develop
Choose a base branch
from
bugfix-syn-sync-dupe-rows-684
base: develop
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
- Bug fix: Fixed case in which sources can have multiple synonyms which only vary by capitalization, causing duplicate rows to appear in the results. We don't consider source capitalization as authoritative, so these variations are only useful for analysis and should not show up as multiple rows to be processed. Thus, we now aggregate capitalization variations into the single column synonym_case_diff_source.
- Minor codestyle update: Removed accidentally added \ on a line from last commit.
- Bug fix: Address issues related to mondo capitalization variation. - Update: Ensure the following columns are now in the output and are together: synonym_case_mondo, synonym_case_diff_mondo, synonym_case_mondo_is_many, synonym_case_source, synonym_case_diff_source, synonym_case_source_is_many. Note that previously, we had removed synonym_case_mondo & synonym_case_source, opting instead for the 'diff' columns, because it was previously only valuable to show the original capitalizations if there was a difference between the two. But now that we can have multiple variations in capitalization on the same syonym, it is useful to see the original case by itself, as wel as all the variations. - Update: For synonym_case_source_is_many, ensure that all variations show up in synonym_case_source and synonym_case_diff_source columns. Note that when there are multiple capitalization variations at the source, we only need 1 row. - Update: For all synonym_case_mondo_is_many, ensure that all variations show up in the synonym_case_diff_mondo column. But leave synonym_case_mondo as it is. We need to preserve the original case for that row, since unlike the source, we will retain multiple rows in the case that Mondo has multiple capitalization variations for a single synonym.
joeflack4
force-pushed
the
bugfix-syn-sync-dupe-rows-684
branch
from
December 16, 2024 04:39
0bb9418
to
382b83d
Compare
This was referenced Dec 16, 2024
- Delete: Some temporary, analytical code.
joeflack4
commented
Dec 16, 2024
- Bug fix: Sometimes the multi-value source synonym field would be erroneously set as 'synonym'.
twhetzel
requested changes
Dec 18, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sync-synonyms
goal does not run successfully from this feature branch.
This was referenced Dec 18, 2024
- Bug fix on last bug fix: Fixed a KeyError that occurs in the -added template.
This was referenced Dec 24, 2024
- Update: minor: remove redundant line break
Bug fix: Synonym Sync: Recapitalization of acronyms - Fixed bug where this operation was being done for -updated and -confirmed, but it should only be done for -added. - Bug fix: I don't think we should have ever had this in the first place. I don't see that it actually fixes anything; only should cause bugs.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Resolves #684
Overview
Fixed case in which sources can have multiple synonyms which only vary by capitalization, causing duplicate rows to appear in the results.
We don't consider source capitalization as authoritative, so these variations are only useful for analysis and should not show up as multiple rows to be processed. Thus, we now aggregate capitalization variations into the single column
synonym_case_diff_source
.Pre-merge checklist
Documentation
Was the documentation added/updated under
docs/
?QC
Was the full pipeline run before submitting this PR using
sh run.sh make build-mondo-ingest
on this branch (afterdocker pull obolibrary/odkfull:dev
), and no errors occurred?Builds:
New Packages
Were any new Python packages added?
Were any other non-Python packages added?
PR Review and Conversations Resolved
Has the PR been sufficiently reviewed by at least 1 team member of the Mondo Technical team and all threads resolved?
Results: Google Sheet