Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lexmatch revamp #397

Merged
merged 7 commits into from
Jan 29, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 19 additions & 4 deletions src/ontology/config/prefixes.csv
Original file line number Diff line number Diff line change
Expand Up @@ -227,12 +227,27 @@ ICD10CM,http://purl.bioontology.org/ontology/ICD10CM/
ICD10CM2,https://icd.codes/icd10cm/
ICD10WHO,https://icd.who.int/browse10/2019/en#/
ICD10WHO2010,http://apps.who.int/classifications/icd10/browse/2010/en#/
OMIMPS,https://www.omim.org/phenotypicSeries/PS
OMIMPS,https://omim.org/phenotypicSeries/PS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefixes.csv updates

That will be supplanted eventually by dynamic creation of prefixes.csv but these changes seem necessary for now.

Harshad and I should have thought to check the status of this sooner.

OMIM,https://omim.org/entry/
Orphanet,http://www.orpha.net/ORDO/Orphanet_
GARD,http://purl.obolibrary.org/obo/GARD_
MEDRA,http://identifiers.org/meddra/
MESH,http://identifiers.org/mesh/
SCTID,http://identifiers.org/snomedct/
SCTID,http://snomed.info/id/
UMLS,http://purl.obolibrary.org/obo/UMLS_
UMLS2,http://linkedlifedata.com/resource/umls/id/
UMLS,http://linkedlifedata.com/resource/umls/id/
HGNC,http://identifiers.org/hgnc/
MEDDRA,http://identifiers.org/meddra/
MEDGEN,http://identifiers.org/medgen/
SCTID,http://identifiers.org/snomedct/
UMLS,http://linkedlifedata.com/resource/umls/id/
orcid,https://orcid.org/
swrl,http://www.w3.org/2003/11/swrl#
semapv,https://w3id.org/semapv/vocab/
HGNC_SYMBOL,https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/
HGNC,https://identifiers.org/hgnc/
ncbi.gene,https://www.ncbi.nlm.nih.gov/gene/
OMIMPS,https://www.omim.org/phenotypicSeries/PS
STY,http://purl.bioontology.org/ontology/STY/
sssom,https://w3id.org/sssom/
biolink,https://w3id.org/biolink/vocab/
doap,http://usefulinc.com/ns/doap#
2 changes: 1 addition & 1 deletion src/scripts/lexmatch-sssom-compare.py
Original file line number Diff line number Diff line change
Expand Up @@ -204,7 +204,7 @@ def extract_unmapped_matches(input: str, matches: TextIO, output_dir: str, summa
combined_df = pd.concat(ont_df_list)

combined_msdf = MappingSetDataFrame(
df=combined_df, prefix_map=msdf_lex.prefix_map, metadata=msdf_lex.metadata
df=combined_df, converter=msdf_lex.converter, metadata=msdf_lex.metadata
matentzn marked this conversation as resolved.
Show resolved Hide resolved
)
df_dict = split_dataframe(combined_msdf)
summary.write("## mondo_XXXXmatch_ontology")
Expand Down
3 changes: 2 additions & 1 deletion src/scripts/match-mondo-sources-all-lexical.py
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,8 @@ def main(verbose: int, quiet: bool):
def run(input: str, config: str, rules: str, rejects: str, output: str):
# Implemented `meta` param in `lexical_index_to_sssom`

meta = get_metadata_and_prefix_map(config)
#meta = get_metadata_and_prefix_map(config)
meta = None
Copy link
Contributor

@joeflack4 joeflack4 Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lexical_index_to_sssom(meta=None)

Nico wrote:

skips meta in src/scripts/match-mondo-sources-all-lexical.py which is not urgently needed and was causing all the problems @joeflack4 is fixing

You are more familiar with all of the lexmatch outputs and you say that running it locally was fine, so it seems there are no significant negative side effects from this.

You say it is not urgently needed, so I take it that this means that there is still a desire to eventually do what I did in #394 (pass a Converter) or similar, perhaps after addressing INCATools/ontology-access-kit#698. So perhaps I should make like a medium priority ticket for this, or keep my current PR open and rebase from / conform to what you have here.

Not important but to clarify, this wasn't causing all the problems I am fixing. Rather it is a solution for the last, only, conditional, remaining unfixed problem in #394.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trish wrote:

@joeflack4 have you also run the entire pipeline locally? Did you find that Nico's solution to "ditch meta" worked as intended?

I haven't run the whole pipeline locally. I'll give it a go but I'm terrified of it; I've heard it takes over 24 hours and sometimes this can interfere w/ other projects I'm working on. My computer wasn't actually capable of doing this until last month (without some skipping of things like NCIT). When I've asked Nico if he wants me to do this before, he's said I could hold off. But I'll at least give it a go.

Regarding checking myself the results of meta=None, I'll give that a go. I was trusting Nico and it sound like he got the results he expected, but good to have second eyes especially if you think so. Another difference I just noticed as well is that I've also been passing no meta (same as passing None), but I've been passing prefix_map instead. I'm not sure it matters in this case but I'd better remove that as well and run in a more closely to what Nico is running. I'd check out his PR as well but I think better to run on my PR because Nico made some other changes to prefixes.csv, but I want to more closely compare the outputs I got before/after with simply the meta=None/prefix_map=None change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#394 redundancy

The main difference is that #397 does passes meta=None to lexical_index_to_sssom(), whereas #394 passes a Converter. Supposedly #397 approach is fine.

Everything else in that PR is now cosmetic, so i can merge just those changes afterwards or ignore and close.

My instinct is to feel a little irked that I spent a lot of work and found a way to get the pipeline working without throwing the oio err (although #394 currently has oio removed and loses outputs), and that my work had I think the same overall effect but was supplanted by Nico and Charlie's PRs. But maybe I shouldn't feel irked; I laid the groundwork and this is just the nature of collaboration, and Nico/Charlie are more senior here and have a better idea of how things should exactly be.

Also, I see that this PR does add more than #394, such as missing prefixes.

with open(config, "r") as f:
yml = yaml.safe_load(f)

Expand Down
2 changes: 1 addition & 1 deletion src/sparql/fix_xref_prefixes.ru
Original file line number Diff line number Diff line change
Expand Up @@ -30,5 +30,5 @@ WHERE
?p ?v .
}
FILTER (isIRI(?entity))
BIND(REPLACE(REPLACE(REPLACE(?value, "ICD-10:", "ICD10:", "i"), "MeSH:", "MESH:", "i"), "OMIM:PS", "OMIMPS:", "i") AS ?value_fixed)
BIND(REPLACE(REPLACE(REPLACE(REPLACE(?value, "ICD-11:", "ICD11:", "i"), "ICD-10:", "ICD10:", "i"), "MeSH:", "MESH:", "i"), "OMIM:PS", "OMIMPS:", "i") AS ?value_fixed)
matentzn marked this conversation as resolved.
Show resolved Hide resolved
}
2 changes: 1 addition & 1 deletion src/sparql/rm_xref_by_prefix.ru
Original file line number Diff line number Diff line change
Expand Up @@ -16,5 +16,5 @@ WHERE {
oboInOwl:hasDbXref ?xref ;
?p1 ?o2 .
}
FILTER( STRSTARTS(str(?xref), "UMLS_ICD9CM_2005_AUI:") || STRSTARTS(str(?xref), "ICD11:") || STRSTARTS(str(?xref), "SNOMEDCT_US_") ||STRSTARTS(str(?xref), "IMDRF:"))
FILTER( STRSTARTS(str(?xref), "UMLS_ICD9CM_2005_AUI:") || STRSTARTS(str(?xref), "ICD11:") || STRSTARTS(str(?xref), "SNOMEDCT_US_") || STRSTARTS(str(?xref), "IMDRF:") || STRSTARTS(str(?xref), "url:") )
matentzn marked this conversation as resolved.
Show resolved Hide resolved
}