Add SQLite/SemSQL export format #1145

Closed
gouttegd wants to merge 4 commits from add-semsql-export-format

gouttegd
Contributor

This PR adds db as a new export format for release artefacts.

A .db file is a SQLite3 file containing a SemSQL representation of the release product.
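As a rough illustration of what consumers could do with such a file, the sketch below builds a tiny in-memory stand-in for a SemSQL database and looks up a label with plain SQL. It assumes the core `statements` table with `subject`, `predicate`, `object`, and `value` columns; the sample rows and the `label_of` and `parents_of` helpers are made up for the example and are not part of this PR.

```python
import sqlite3

# In-memory stand-in for a release .db file. Assumption: a real SemSQL file
# exposes a `statements` table with at least these columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statements (subject TEXT, predicate TEXT, object TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO statements VALUES (?, ?, ?, ?)",
    [
        # Literal values go in `value`, references to other terms in `object`.
        ("EX:0000001", "rdfs:label", None, "example term"),
        ("EX:0000002", "rdfs:label", None, "example subterm"),
        ("EX:0000002", "rdfs:subClassOf", "EX:0000001", None),
    ],
)


def label_of(conn, curie):
    """Return the rdfs:label of a term, or None if it has no label."""
    row = conn.execute(
        "SELECT value FROM statements "
        "WHERE subject = ? AND predicate = 'rdfs:label'",
        (curie,),
    ).fetchone()
    return row[0] if row else None


def parents_of(conn, curie):
    """Return the asserted superclasses of a term, sorted by identifier."""
    rows = conn.execute(
        "SELECT object FROM statements "
        "WHERE subject = ? AND predicate = 'rdfs:subClassOf' ORDER BY object",
        (curie,),
    ).fetchall()
    return [r[0] for r in rows]
```

Against an actual release product, the connection would be opened on the file itself (for example `sqlite3.connect("myont.db")`, where `myont.db` is a placeholder name) instead of `:memory:`.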

I would have preferred to use a different name (like semsql) for the export name, but currently, the code in the ODK assumes that the name of a format is necessarily the same thing as its extension, so it is not possible to declare a format named semsql that is supposed to produce files with an extension of .db. This is something that we may want to change in the future, but arguably being able to produce SQLite/SemSQL files is more important than being able to name the format the way we would like.
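To make the name/extension distinction concrete, here is a minimal sketch of what a decoupled format declaration could look like. This is not how the ODK is implemented today; `ExportFormat`, `FORMATS`, and the `semsql` entry are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportFormat:
    """Hypothetical format descriptor that keeps name and extension apart."""

    name: str       # what a project would list in its export formats
    extension: str  # suffix of the file that gets produced

    def filename(self, release: str) -> str:
        """Return the artefact file name for a given release product."""
        return f"{release}.{self.extension}"


# Hypothetical registry: most formats use their name as the extension,
# but SemSQL would be declared as `semsql` while producing `.db` files.
FORMATS = {
    fmt.name: fmt
    for fmt in (
        ExportFormat("owl", "owl"),
        ExportFormat("obo", "obo"),
        ExportFormat("semsql", "db"),
    )
}
```

With such a lookup, `FORMATS["semsql"].filename("myont")` would give `myont.db`, while the format would still be referred to as `semsql` in a project's configuration.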

All tools needed to build SQLite3/SemSQL files (semsql, rdftab, sqlite3, and relation-graph) are moved from ODKFull to ODKLite, so that the standard, ODK-generated workflows still only require ODKLite. This increases the size of ODKLite from ~1.34GB to ~1.5GB (compared to ~3.06GB for ODKFull).

closes #1142

Support the production of release artefacts in SQLite/SemSQL format
(identified as 'db', which is not ideal but that is the expected
extension of such artefacts, and the existing code does not allow
distinguishing between a format *name* and its *extension*, assuming the
two are always the same).

Make sure that

* sqlite3 (Debian package),
* semsql (Python package),
* rdftab (custom-built from source),
* and Relation-Graph (downloaded as is)

are all available in ODKLite, rather than only in ODKFull.

This is so that a standard release pipeline still only requires
ODKLite, even if the project is configured to produce SQLite/SemSQL
files.

Add 'db' to the list of the default export formats.
@gouttegd gouttegd self-assigned this Nov 28, 2024
@gouttegd gouttegd requested a review from matentzn November 28, 2024 16:04
Contributor

@matentzn matentzn left a comment

AWESOME!

@@ -1056,6 +1060,10 @@ $(TRANSLATIONSDIR)/%.babelon.json: $(TRANSLATIONSDIR)/%.babelon.tsv
convert --check false -f json -o [email protected] &&\
mv [email protected] $@
{% endif -%}
{% if 'db' in project.export_formats -%}
{{ release }}.db: {{ release }}.owl
semsql make {{ release }}.db && rm -f {{ release }}-relation-graph.tsv.gz
Contributor

Here is an example of a fully battle-hardened semsql pipeline with years of use:

https://github.com/monarch-initiative/mondo-ingest/blob/52541e9e0e4737168c36a17a921963675e821adc/src/ontology/mondo-ingest.Makefile#L31

%.db: %.owl
	@rm -f $*.db
	@rm -f .template.db
	@rm -f .template.db.tmp
	@rm -f $*-relation-graph.tsv.gz
	RUST_BACKTRACE=full semsql make $*.db -P config/prefixes.csv
	@rm -f .template.db
	@rm -f .template.db.tmp
	@rm -f $*-relation-graph.tsv.gz
	@test -f $*.db || (echo "Error: File not found!" && exit 1)
  1. all the rm we added were from experience when the db was partially created. Because the pipeline uses make internally, a failed run could have quite weird effects on caching.
  2. -P config/prefixes.csv without this, the semsql was virtually unusable, as there were always missing prefixes. However, getting config/prefixes.csv right is not easy.
  3. RUST_BACKTRACE=full I don't remember exactly, but somehow this was useful for debugging?

Pick and choose; I am fine with what you did, I just wanted to let you know what I usually do in case any of it is relevant. (For real due diligence, you could add all the temporary files to .gitignore as well.)
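For readers who find the recipe easier to follow outside of make, the same clean-build-verify sequence can be sketched in Python. This is only an illustration under assumptions: `build_db` is a made-up helper, the file names are the ones from the rule above, and, unlike the Makefile rule, the cleanup here also runs when `semsql` fails.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_db(db: Path, prefixes: Optional[Path] = None, run=subprocess.run) -> None:
    """Remove stale files, run `semsql make`, clean up, then verify the output."""
    stem = db.with_suffix("")  # myont.db -> myont, like $* in the rule
    leftovers = [
        Path(".template.db"),
        Path(".template.db.tmp"),
        Path(f"{stem}-relation-graph.tsv.gz"),
    ]

    # Start from a clean slate so a partially created db is never reused.
    for path in [db, *leftovers]:
        path.unlink(missing_ok=True)

    cmd = ["semsql", "make", str(db)]
    if prefixes is not None:
        cmd += ["-P", str(prefixes)]

    try:
        run(cmd, check=True)
    finally:
        # Assumption: cleaning up after a failure is also desirable.
        for path in leftovers:
            path.unlink(missing_ok=True)

    # Same final check as `test -f $*.db` in the rule.
    if not db.is_file():
        raise FileNotFoundError(f"{db} was not produced")
```

The `run` parameter exists only so the command can be swapped out when `semsql` is not installed; a normal call would be `build_db(Path("myont.db"), Path("config/prefixes.csv"))`, with `myont.db` as a placeholder.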

Contributor Author

all the rm we added were from experience when the db was partially created.

😱 Makes me wonder if semsql is robust enough to be enabled by default… Maybe it should only be enabled by projects that can rely on “ontology pipeline engineers” being available to take care of it.

Contributor Author

In any case, I like the idea of having a single pattern rule to handle all productions of .db files, instead of having a similar rule at two or three different places.

Contributor Author

-P config/prefixes.csv without this, the semsql was virtually unusable, as there were always missing prefixes. However, getting config/prefixes.csv right is not easy.

Sounds like another reason for maybe leaving the decision to enable SemSQL export to each project. Not sure how the ODK could automatically make sure the prefix file is always “right”.
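One partial check a project could run on its own is whether every IRI it uses matches a base declared in the prefix file. The sketch below assumes a CSV with a `prefix,base` header (an assumption about the expected layout of `config/prefixes.csv`); `load_prefixes`, `uncovered_iris`, and the sample data are hypothetical. It can tell that a prefix is missing, not that the declared ones are correct.

```python
import csv
import io


def load_prefixes(csv_text):
    """Parse a prefix table with a `prefix,base` header into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["prefix"]: row["base"] for row in reader}


def uncovered_iris(iris, prefixes):
    """Return the IRIs that match no declared base and so cannot be shortened."""
    bases = tuple(prefixes.values())
    return [iri for iri in iris if not iri.startswith(bases)]


# Made-up sample data for illustration only.
sample_csv = (
    "prefix,base\n"
    "EX,http://example.org/EX_\n"
    "rdfs,http://www.w3.org/2000/01/rdf-schema#\n"
)
sample_iris = [
    "http://example.org/EX_0000001",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://example.org/OTHER_0000001",
]
missing = uncovered_iris(sample_iris, load_prefixes(sample_csv))
```

Here `missing` contains only the `OTHER_` IRI, which is the kind of gap that would need a new line in the prefix file.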

Contributor Author

I mean, what good would it do to have all ontologies producing their own SQLite files, if said files end up being unusable and people conclude that they would be better off continuing to use the files from the “SemSQL collection”?

Contributor

Yeah, maybe let's not make it mandatory just yet. We can activate it for all "our" ontologies. And by the time that is done and useful tools have been implemented, I hope we will have a more robust semsql generation pipeline.

Contributor Author

I’ll leave it by default for now, as this has the nice side effect that SemSQL export is tested with all the test ontologies from the test suite, instead of only the one in which I explicitly added the db export format.

But I will remove it as a default export format once the PR is finalised.

The SemSQL export format is named `db` (as the file extension), not
`semsql`.
@gouttegd
Contributor Author

Oh, apache.org, why are you doing that to me? 😭

#14 [stage-0  8/13] RUN wget -nv http://archive.apache.org/dist/jena/binaries/apache-jena-4.9.0.tar.gz -O- | tar xzC /tools &&     mv /tools/apache-jena-4.9.0 /tools/apache-jena
#14 133.5 failed: Connection timed out.
#14 133.5 failed: Network is unreachable.
#14 133.5 
#14 133.5 gzip: stdin: unexpected end of file
#14 133.5 tar: Child returned status 1
#14 133.5 tar: Error is not recoverable: exiting now
#14 ERROR: process "/bin/sh -c wget -nv http://archive.apache.org/dist/jena/binaries/apache-jena-$JENA_VERSION.tar.gz -O- | tar xzC /tools &&     mv /tools/apache-jena-$JENA_VERSION /tools/apache-jena" did not complete successfully: exit code: 2

@gouttegd
Contributor Author

Superseded by #1147

@gouttegd gouttegd closed this Nov 28, 2024
@gouttegd gouttegd deleted the add-semsql-export-format branch November 28, 2024 18:40