Add SQLite/SemSQL export format #1145

gouttegd · 2024-11-28T15:59:02Z

This PR adds db as a new export format for release artefacts.

A .db file is a SQLite3 file containing a SemSQL representation of the release product.
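For readers unfamiliar with the format, here is a minimal Python sketch of how such a file might be inspected with the standard sqlite3 module. It assumes the SemSQL schema exposes a `statements` table with `subject`, `predicate`, `object` and `value` columns. The demo builds a tiny in-memory stand-in rather than opening a real release file, and both function names are made up for illustration.

```python
import sqlite3


def count_statements_by_predicate(conn):
    """Return (predicate, count) pairs, most frequent predicate first."""
    cursor = conn.execute(
        "SELECT predicate, COUNT(*) AS n FROM statements "
        "GROUP BY predicate ORDER BY n DESC, predicate"
    )
    return cursor.fetchall()


def make_demo_db():
    """Build an in-memory stand-in for a .db file (assumed, simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE statements "
        "(subject TEXT, predicate TEXT, object TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO statements VALUES (?, ?, ?, ?)",
        [
            ("EX:1", "rdfs:label", None, "thing one"),
            ("EX:2", "rdfs:label", None, "thing two"),
            ("EX:2", "rdfs:subClassOf", "EX:1", None),
        ],
    )
    return conn


if __name__ == "__main__":
    for predicate, n in count_statements_by_predicate(make_demo_db()):
        print(predicate, n)
```

With a real release product, `sqlite3.connect("myont.db")` would replace the in-memory connection (`myont.db` being a placeholder file name).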

I would have preferred to use a different name (like semsql) for the export name, but currently, the code in the ODK assumes that the name of a format is necessarily the same thing as its extension, so it is not possible to declare a format named semsql that is supposed to produce files with an extension of .db. This is something that we may want to change in the future, but arguably being able to produce SQLite/SemSQL files is more important than being able to name the format the way we would like.
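To illustrate the limitation, here is a sketch of what decoupling a format name from its file extension could look like. This is purely hypothetical: `ExportFormat` and its fields are illustrative names and do not correspond to existing ODK code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExportFormat:
    """Hypothetical format declaration with an optional, separate extension."""

    name: str
    extension: Optional[str] = None  # when omitted, the name doubles as the extension

    def filename(self, release: str) -> str:
        """Return the file name of the release product in this format."""
        return f"{release}.{self.extension or self.name}"


if __name__ == "__main__":
    # A format named "semsql" that produces .db files
    print(ExportFormat("semsql", "db").filename("myont"))
    # Current behaviour: the name is also the extension
    print(ExportFormat("owl").filename("myont"))
```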

All tools needed to build SQLite3/SemSQL files (semsql, rdftab, sqlite3, and relation-graph) are moved from ODKFull to ODKLite, so that the standard, ODK-generated workflows still only require ODKLite. This increases the size of ODKLite from ~1.34GB to ~1.5GB (compared to ~3.06GB for ODKFull).

closes #1142

Support the production of release artefacts in SQLite/SemSQL format (identified as 'db', which is not ideal but that is the expected extension of such artefacts, and the existing code does not allow distinguishing between a format *name* and its *extension*, assuming the two are always the same).

Make sure that

* sqlite3 (Debian package),
* semsql (Python package),
* rdftab (custom-built from source),
* and Relation-Graph (downloaded as is)

are all available in ODKLite, rather than only in ODKFull. This is so that a standard release pipeline still only requires ODKLite, even if the project is configured to produce SQLite/SemSQL files.

Add 'db' to the list of the default export formats.

matentzn

AWESOME!

matentzn · 2024-11-28T16:25:59Z

template/src/ontology/Makefile.jinja2

@@ -1056,6 +1060,10 @@ $(TRANSLATIONSDIR)/%.babelon.json: $(TRANSLATIONSDIR)/%.babelon.tsv
 		convert --check false -f json -o $@.tmp &&\
 		mv $@.tmp $@
 {% endif -%}
+{% if 'db' in project.export_formats -%}
+{{ release }}.db: {{ release }}.owl
+	semsql make {{ release }}.db && rm -f {{ release }}-relation-graph.tsv.gz


Here is an example of a fully battle-hardened semsql pipeline with years of use:

https://github.com/monarch-initiative/mondo-ingest/blob/52541e9e0e4737168c36a17a921963675e821adc/src/ontology/mondo-ingest.Makefile#L31

%.db: %.owl
	@rm -f $*.db
	@rm -f .template.db
	@rm -f .template.db.tmp
	@rm -f $*-relation-graph.tsv.gz
	RUST_BACKTRACE=full semsql make $*.db -P config/prefixes.csv
	@rm -f .template.db
	@rm -f .template.db.tmp
	@rm -f $*-relation-graph.tsv.gz
	@test -f $*.db || (echo "Error: File not found!" && exit 1)

All the rm commands were added from experience with the db being only partially created. Since the pipeline uses make in its inner workings, failures could have quite weird effects on caching.
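The same clean-before, clean-after, then verify logic can be sketched in Python. This is a rough illustration, not what the ODK or mondo-ingest actually runs: `build_db` and its `run` parameter are hypothetical names, the sketch assumes the temporary files land next to the input file, and unlike the Makefile rule it also cleans up when the command fails. Only the `semsql make <file>.db` invocation is taken from the rule above.

```python
import os
import subprocess
from pathlib import Path


def _remove(paths):
    """Delete each path if it exists, like `rm -f`."""
    for path in paths:
        path.unlink(missing_ok=True)


def build_db(owl_path, run=subprocess.run):
    """Build <name>.db from <name>.owl, guarding against partial outputs."""
    owl = Path(owl_path).resolve()
    workdir = owl.parent
    db = owl.with_suffix(".db")
    # Temporary files named in the Makefile rule (location is an assumption)
    leftovers = [
        workdir / ".template.db",
        workdir / ".template.db.tmp",
        workdir / f"{owl.stem}-relation-graph.tsv.gz",
    ]
    # Remove stale or partial files from a previous (possibly failed) run
    _remove([db, *leftovers])
    env = {**os.environ, "RUST_BACKTRACE": "full"}
    try:
        run(["semsql", "make", db.name], cwd=workdir, env=env, check=True)
    finally:
        # Clean temporary files whether or not the command succeeded
        _remove(leftovers)
    if not db.is_file():
        raise FileNotFoundError(f"{db} was not produced")
    return db
```

Making `run` injectable keeps the sketch testable without semsql installed; in real use the default `subprocess.run` would be left in place.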

-P config/prefixes.csv: without this, the semsql output was virtually unusable, as there were always missing prefixes. However, getting config/prefixes.csv right is not easy.
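One way to catch missing prefixes early is to check a sample of IRIs against the prefix file before building. The sketch below assumes the file is a CSV with a `prefix,base` header row (an assumption about the layout of config/prefixes.csv, not confirmed here). All function names are illustrative.

```python
import csv
import io


def load_prefix_map(csv_text):
    """Parse prefix declarations from CSV text with an assumed 'prefix,base' header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["prefix"]: row["base"] for row in reader}


def contract(iri, prefix_map):
    """Return a CURIE for the IRI using the longest matching base, or None."""
    best = None
    for prefix, base in prefix_map.items():
        if base and iri.startswith(base):
            if best is None or len(base) > len(best[1]):
                best = (prefix, base)
    if best is None:
        return None
    return f"{best[0]}:{iri[len(best[1]):]}"


def find_uncovered(iris, prefix_map):
    """List the IRIs for which no declared prefix applies."""
    return [iri for iri in iris if contract(iri, prefix_map) is None]


if __name__ == "__main__":
    prefixes = load_prefix_map(
        "prefix,base\n"
        "MONDO,http://purl.obolibrary.org/obo/MONDO_\n"
        "obo,http://purl.obolibrary.org/obo/\n"
    )
    sample = [
        "http://purl.obolibrary.org/obo/MONDO_0000001",
        "http://example.org/vocab/term1",
    ]
    print(find_uncovered(sample, prefixes))
```

Any IRI this reports would need a prefix declaration added before the resulting file can refer to it by CURIE.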

RUST_BACKTRACE=full: I don't remember exactly why, but somehow this was useful for debugging?

Pick and choose; I am fine with what you did, I just wanted to let you know what I usually do in case any of it is relevant. (For real due diligence, you could also add all the temporary files to .gitignore.)

all the rm we added were from experience when the db was partially created.

😱 Makes me wonder if semsql is robust enough to be enabled by default… Maybe it should only be enabled by projects that can rely on “ontology pipeline engineers” being available to take care of it.

In any case, I like the idea of having a single pattern rule to handle all productions of .db files, instead of having a similar rule in two or three different places.

-P config/prefixes.csv without this, the semsql was virtually unusable, as there were always missing prefixes. However, getting config/prefixes.csv right is not easy.

Sounds like another reason for maybe leaving the decision to enable SemSQL export to each project. Not sure how the ODK could automatically make sure the prefix file is always “right”.

I mean, what good would it do to have all ontologies producing their own SQLite files, if said files end up being unusable and people conclude that they would be better off continuing to use the files from the “SemSQL collection”?

Yeah, maybe let's not make it mandatory just yet. We can activate it for all "our" ontologies. By the time that is done and useful tools have been implemented, I hope we will have a more robust semsql generation pipeline.

I’ll leave it enabled by default for now, as this has the nice side effect that SemSQL export is tested with all the test ontologies from the test suite, instead of only the one in which I explicitly added the db export format.

But I will remove it as a default export format once the PR is finalised.

gouttegd · 2024-11-28T17:15:26Z

Oh, apache.org, why are you doing that to me? 😭

#14 [stage-0  8/13] RUN wget -nv http://archive.apache.org/dist/jena/binaries/apache-jena-4.9.0.tar.gz -O- | tar xzC /tools &&     mv /tools/apache-jena-4.9.0 /tools/apache-jena
#14 133.5 failed: Connection timed out.
#14 133.5 failed: Network is unreachable.
#14 133.5 
#14 133.5 gzip: stdin: unexpected end of file
#14 133.5 tar: Child returned status 1
#14 133.5 tar: Error is not recoverable: exiting now
#14 ERROR: process "/bin/sh -c wget -nv http://archive.apache.org/dist/jena/binaries/apache-jena-$JENA_VERSION.tar.gz -O- | tar xzC /tools &&     mv /tools/apache-jena-$JENA_VERSION /tools/apache-jena" did not complete successfully: exit code: 2

gouttegd · 2024-11-28T18:40:53Z

Superseded by #1147

gouttegd added 3 commits November 28, 2024 15:19

Produce SQLite/SemSQL files by default.

0da1870

Add 'db' to the list of the default export formats.

gouttegd self-assigned this Nov 28, 2024

gouttegd requested a review from matentzn November 28, 2024 16:04

matentzn reviewed Nov 28, 2024

Fix name of export format in tests.

b06a850

The SemSQL export format is named `db` (as the file extension), not `semsql`.

gouttegd closed this Nov 28, 2024

gouttegd deleted the add-semsql-export-format branch November 28, 2024 18:40
