decide on release artifacts #78

Open
wdduncan opened this issue Aug 26, 2021 · 3 comments

Comments

@wdduncan
Collaborator

We need to nail down what we want our release artifacts to be. Currently, I've been focused on producing:

  • harmonized-table.tsv: the large pivot table of the biosample_set.xml file
  • harmonized-table.parquet.gz: a parquet version of harmonized-table.tsv
  • harmonized-attribute-value.ttl.gz: a turtle version of harmonized-table.tsv
  • harmonized_table.db.gz: a sqlite version of the harmonized-table.tsv
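
As a rough illustration of how the sqlite artifact could be derived from the tsv, here is a minimal sketch using only the Python standard library. It is not the project's actual build step: the function names, the default table name `biosample`, and the choice to store every column as TEXT are all assumptions made for the example.

```python
import csv
import sqlite3


def quote_ident(name):
    """Quote a SQLite identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def normalized_rows(reader, width):
    """Skip blank lines and pad or trim each row to the header width."""
    for row in reader:
        if not row:
            continue
        yield (row + [""] * (width - len(row)))[:width]


def tsv_to_sqlite(tsv_path, db_path, table="biosample"):
    """Load a TSV into a single all-TEXT SQLite table and return the row count.

    The table name and the all-TEXT schema are assumptions for this sketch.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        columns = ", ".join(f"{quote_ident(col)} TEXT" for col in header)
        placeholders = ", ".join("?" for _ in header)
        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(f"DROP TABLE IF EXISTS {quote_ident(table)}")
                conn.execute(f"CREATE TABLE {quote_ident(table)} ({columns})")
                conn.executemany(
                    f"INSERT INTO {quote_ident(table)} VALUES ({placeholders})",
                    normalized_rows(reader, len(header)),
                )
            count = conn.execute(
                f"SELECT COUNT(*) FROM {quote_ident(table)}"
            ).fetchone()[0]
        finally:
            conn.close()
    return count
```

Calling `tsv_to_sqlite("harmonized-table.tsv", "harmonized_table.db")` would produce an uncompressed database; gzipping it would be a separate step.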

General question: Do you want to change the names of the artifacts from "harmonized X" to something else? It would make sense to do this because we are using the term 'harmonized' in a way that differs from common usage.

Other questions:

  1. @turbomam has been normalizing the harmonized_table.db.gz data. We need to add this to the outputs produced. Do we want this to be a new sqlite database, a separate table within the existing database, or additional columns that co-exist within the biosample table in the database?
  2. After the new normalized database has been produced, do we want to dump out updated tsv and parquet files? (I think yes)
  3. Do we want to keep the original non-normalized/raw tsv as a product? (I think yes)

As part of this, we need to add clean and release targets to the Makefile.

cc @cmungall @hrshdhgd @realmarcin

@turbomam
Collaborator

turbomam commented Sep 2, 2021

How about "biosample_harmatts X" to indicate files containing the harmonizable attributes about biosamples?

I see a building company in the UK called Harmatt https://harmatt.co.uk/ and three Turkish people named Harmat in Wikipedia, but no other widespread usage.

@turbomam
Collaborator

turbomam commented Sep 2, 2021

By the way, I would call this my most recent, most thorough mapping of INSDC annotations to OBO foundry terms. @cmungall and others have found some possible improvements, possibly to be implemented by choosing different target ontologies.

https://raw.githubusercontent.com/turbomam/scoped-mapping/main/notebooks/onto_slots_by_env_pack.tsv

@wdduncan
Collaborator Author

wdduncan commented Sep 2, 2021

Note: The public directory on NERSC for the release artifacts is /global/cfs/cdirs/m3513/endurable/biosample/biosample-analysis.

For path:
/global/cfs/cdirs/m3513/www/biosample
the URL would be:
https://portal.nersc.gov/project/m3513/biosample/
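
The path-to-URL correspondence above can be sketched as a small helper. This is only an illustration: the constants `WWW_ROOT` and `PORTAL_BASE` and the function `portal_url` are hypothetical names, and the assumption that everything under the `www` directory maps one-to-one onto the portal URL is inferred from the single example given here.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

# Assumed mapping, inferred from the one path/URL pair in this thread.
WWW_ROOT = PurePosixPath("/global/cfs/cdirs/m3513/www")
PORTAL_BASE = "https://portal.nersc.gov/project/m3513"


def portal_url(path):
    """Map a path under WWW_ROOT to its assumed portal URL.

    Raises ValueError if the path is not under WWW_ROOT.
    """
    relative = PurePosixPath(path).relative_to(WWW_ROOT)
    return f"{PORTAL_BASE}/{quote(relative.as_posix())}"
```

For instance, if an artifact named harmonized-table.tsv were copied into the biosample directory, `portal_url("/global/cfs/cdirs/m3513/www/biosample/harmonized-table.tsv")` would give the link to share.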
