Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QC check for: Set of Mondo terms in mondo.owl & mondo.sssom.tsv differ #384

Open
joeflack4 opened this issue Dec 1, 2023 · 9 comments · May be fixed by #385
Open

QC check for: Set of Mondo terms in mondo.owl & mondo.sssom.tsv differ #384

joeflack4 opened this issue Dec 1, 2023 · 9 comments · May be fixed by #385
Assignees

Comments

@joeflack4
Copy link
Contributor

joeflack4 commented Dec 1, 2023

Overview

I was working on #363 (comment) today. I had just gotten my new computer, so I was finally able to build mondo.db, so I ran the goal for that as well as mondo.sssom.tsv to make sure that both were up-to-date. I expected then that the list of terms would be the same in each (I also accounted for obsolete terms). But that is not the case.

I found 971 terms that were in mondo.db (from mondo.owl) that weren't in mondo.sssom.tsv.
I found 2,024 terms that were in mondo.sssom.tsv but not in mondo.db.

out_of_sync_mondo_ids.tsv.zip (Google Sheet)

Update:
A. Non-problematic: in ontology only
B. Problematic: in SSSOM only

@joeflack4 joeflack4 added the bug Something isn't working label Dec 1, 2023
@matentzn
Copy link
Member

matentzn commented Dec 1, 2023

See my comment here: #363 (comment)

@joeflack4
Copy link
Contributor Author

joeflack4 commented Dec 1, 2023

Copy/pasting our discussion from aforementioned comment.

Joe wrote:

As written in the issue:

I found 971 terms that were in mondo.db (from mondo.owl) that weren't in mondo.sssom.tsv.
I found 2,024 terms that were in mondo.sssom.tsv but not in mondo.db.

The implementation for this changed. Right now I'm just doing a static check of all of the Mondo IDs in mondo.db and mondo.sssom.tsv.

The implementation I had before gave more information based on which terms were missing by source, but I didn't think that information was important, so I changed it.

Before, I was printing a warning for sources where the number of out of sync terms were greater than 200. Now I'm saving a report and running the same check every time (source agnostic), but if it's greater than 200 I'm still warning. But now I'm also always saving this report: out_of_sync_mondo_ids.tsv.zip

Questions

  1. @matentzn If this is a static QC check that does not depend on any source nor on this synchronization work, should I make a separate PR for this? Should I add this to some kind of QC command or a test (maybe eventually QC checks will be run as Python tests?)?
  2. @matentzn Do you want me to change this back to how I had it before? Some kind of breakdown by source, rather than simply checking mondo.db and mondo.sssom.tsv against each other? Or should I do both--both checking those against each other, and some kind of breakdown by source?

Nico wrote:

For reports that require discussion, its better to have them on Google sheets so I can click immediately;

I found 971 terms that were in mondo.db (from mondo.owl) that weren't in mondo.sssom.tsv.

For now, its best to look at once example and then decide if it is an issue. I would have expected around 1000K terms in Mondo that do not have any associated mappings, and therefore do not show up in the mondo.sssom.tsv. So this one seems ok to me.

I found 2,024 terms that were in mondo.sssom.tsv but not in mondo.db.

Again it would be good to look at some examples. Out-of-syncness is only 1 hypothesis; another one is that the sssom file contains 2000 obsolete classes that are not present in mondo.db (which is interesting to know, as I would not agree with the oak design in this case).

I fully agree with a simple QC check that ensures a minimum of X classes are being processed, ideally on a per resource basis; but for now I suspect the cases you found have good explanations.

@joeflack4 joeflack4 changed the title mondo.owl and mondo.sssom.tsv out of sync Set of Mondo terms in mondo.owl & mondo.sssom.tsv differ Dec 1, 2023
@joeflack4
Copy link
Contributor Author

@matentzn If you want we can discuss at meeting instead of here.

Decided to move conversation here, as (@matentzn correct me if I'm wrong) it sounds like your answers to questions (1) and (2) are 'yes' and 'yes'. Basically, there I can add a QC check and do some breakdown by source, and that should be done in a separate PR.

Some more follow-up questions:
3. QC: Is there a standard way you have in mind of doing it in mondo-ingest? Maybe this should be a separate issue?
4. I was thinking of adding this to #319, but I don't know of any failure conditions for this if I turned it into a test. So I suppose right now it's better as a report; part of some separate QC command. So I make an independent PR for this?
5. Do you want me to keep the warning in the synchronization pipeline? Where the source has more than 200 direct SCRs in Mondo (change it to terms? I had been warning about edges) where at least 1 of the terms in the SCRs is not in mondo.sssom.tsv? It feels redundant to me to keep the warning in the synchronization pipeline if there's to be a standalone QC check.


its better to have them on Google sheets so I can click immediately;

I attached out_of_sync_mondo_ids.tsv.zip yesterday. IDK if you didn't see it, or are just saying you prefer a Google Sheets, or it is possible you noticed it was erroneous! I had it correct yesterday, then tweaked the logic and got bad results. I just fixed that now and re-uploaded. Here's also the Google Sheets version for you.

I see you are saying this difference between the files may not necessarily be problematic. So I just updated the title of the issue. FYI, OAK by default, when using the .entities() method to get list of terms, filters out obsoletes. But for this check, I included the obsoletes. I did see that there are some obsoletes in the mondo.owl/mondo.db and mondo.sssom.tsv, but I didn't inspect beyond that.

@matentzn
Copy link
Member

matentzn commented Dec 4, 2023

I attached out_of_sync_mondo_ids.tsv.zip yesterday. IDK if you didn't see it, or are just saying you prefer a Google Sheets, or it is possible you noticed it was erroneous!

No I do a lot of my work on the phone and would like to be able to see things immediately rather than having to unzip etc to safe time.

I see you are saying this difference between the files may not necessarily be problematic. So I just updated the title of the issue. FYI, OAK by default, when using the .entities() method to get list of terms, filters out obsoleted.

There seems to be a real issue: I checked for example "MONDO:0029465" and it absolutely should be in both! No idea what could cause it to not be in one of the other. Is the mondo.owl that is at the bottom of mondo.db the latest release version?

  1. QC: Is there a standard way you have in mind of doing it in mondo-ingest? Maybe this should be a separate issue?

Nothing in mind yet!

  1. I was thinking of adding this to Slurp tests & testing setup #319, but I don't know of any failure conditions for this if I turned it into a test. So I suppose right now it's better as a report; part of some separate QC command. So I make an independent PR for this?

I would first make sure we have an issue that clearly defines the QC check. Then a separate PR.

  1. Do you want me to keep the warning in the synchronization pipeline? Where the source has more than 200 direct SCRs in Mondo (change it to terms? I had been warning about edges) where at least 1 of the terms in the SCRs is not in mondo.sssom.tsv? It feels redundant to me to keep the warning in the synchronization pipeline if there's to be a standalone QC check.

TMD. I would like to ensure that the pipeline breaks if major issues occur, but I neither know what I mean by "major issues" exactly, nor how to recognise these; at least not without thinking. Lets develop some ideas and discuss at a call.

@joeflack4
Copy link
Contributor Author

joeflack4 commented Dec 4, 2023

Understood on Google Sheets now!

Ok, we'll discuss how to implement these checks at the meeting.

As for MONDO:0029465 I just discovered that my mondo.owl did not get updated as I had suspected! Sorry about that. I'm re-running and will update the out_of_sync_mondo_ids Google Sheet with a new report shortly.

Edit: Good news! MONDO:0029465 no longer appears in the report. Also great news--now there are 0 entries that are "in SSSOM but not in mondo.owl". There are now just 3,499 entires that are "in mondo.owl but not in SSSOM".

@matentzn
Copy link
Member

matentzn commented Dec 5, 2023

Awesome, I randomly checked 10, can you remove "obsolete classes" from this list as well? all 10 were obsolete.

@joeflack4
Copy link
Contributor Author

No problem! I intentionally included the obsoletes to be most thorough. But especially now since the problem is no longer bidirectional, I suppose there is no longer any need.

I just updated: out_of_sync_mondo_ids Google Sheet

I also added a labels column.

If you want the obsoletes back, now that I have added the labels, let me know. I could also add an is_obsolete column and add sorting for that.

@matentzn
Copy link
Member

matentzn commented Dec 6, 2023

Fantastic. In this case, I checked 10 randomly, and they all have no mappings associated with them, so everything seems correct! THANKS for drilling in!

@joeflack4 joeflack4 changed the title Set of Mondo terms in mondo.owl & mondo.sssom.tsv differ QC check for: Set of Mondo terms in mondo.owl & mondo.sssom.tsv differ Dec 6, 2023
@joeflack4 joeflack4 added qc / test and removed bug Something isn't working labels Dec 6, 2023
@joeflack4 joeflack4 self-assigned this Dec 6, 2023
@joeflack4
Copy link
Contributor Author

Very good! I updated the issue / project tracker so that this issue is for a QC check and not a bug report.

@joeflack4 joeflack4 linked a pull request Dec 9, 2023 that will close this issue
@joeflack4 joeflack4 linked a pull request Dec 9, 2023 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants