Add SQLite/SemSQL export format #1145
Support the production of release artefacts in SQLite/SemSQL format (identified as 'db', which is not ideal but that is the expected extension of such artefacts, and the existing code does not allow distinguishing between a format *name* and its *extension*, assuming the two are always the same).
Make sure that
* sqlite3 (Debian package),
* semsql (Python package),
* rdftab (custom-built from source),
* and Relation-Graph (downloaded as is)
are all available in ODKLite, rather than only in ODKFull. This is so that a standard release pipeline still only requires ODKLite, even if the project is configured to produce SQLite/SemSQL files.
Add 'db' to the list of the default export formats.
AWESOME!
@@ -1056,6 +1060,10 @@ $(TRANSLATIONSDIR)/%.babelon.json: $(TRANSLATIONSDIR)/%.babelon.tsv
	convert --check false -f json -o [email protected] &&\
	mv [email protected] $@
{% endif -%}
{% if 'db' in project.export_formats -%}
{{ release }}.db: {{ release }}.owl
	semsql make {{ release }}.db && rm -f {{ release }}-relation-graph.tsv.gz
Here is an example of a battle-hardened semsql pipeline with years of use:
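%.db: %.owl
# Remove any stale or partially built output before starting.
	@rm -f $*.db
	@rm -f .template.db
	@rm -f .template.db.tmp
	@rm -f $*-relation-graph.tsv.gz
# Build the database, passing the project-specific prefix file.
	RUST_BACKTRACE=full semsql make $*.db -P config/prefixes.csv
# Clean up the intermediate files left behind by the build.
	@rm -f .template.db
	@rm -f .template.db.tmp
	@rm -f $*-relation-graph.tsv.gz
# Fail loudly if the database was not produced.
	@test -f $*.db || (echo "Error: File not found!" && exit 1)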
- All the `rm` we added were from experience when the db was partially created. As the pipeline in its inner workings uses `make`, failures could have quite weird effects on caching.
- `-P config/prefixes.csv`: without this, the semsql was virtually unusable, as there were always missing prefixes. However, getting `config/prefixes.csv` right is not easy.
- `RUST_BACKTRACE=full`: I don't remember, but somehow this was useful for debugging?
Pick and choose, I am fine with what you did, just wanted to let you know what I usually do in case any of it is of any relevance. (For real due diligence you could add all the temporary files to .gitignore as well.)
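The rule above only tests that the `.db` file exists, which would not catch a partially created database. As a minimal sketch of a stricter post-build check, assuming only Python's standard `sqlite3` module, the following opens the file read-only, runs SQLite's own integrity check, and requires at least one table. The helper name `check_sqlite_file` is hypothetical and not part of the ODK or semsql, and the check knows nothing about the SemSQL schema itself.

```python
import sqlite3
from pathlib import Path


def check_sqlite_file(path):
    """Return the sorted table names if `path` is a readable, non-empty SQLite file.

    Hypothetical helper (not part of the ODK or semsql). Raises ValueError
    when the file is missing, is not a SQLite database, fails SQLite's
    integrity check, or contains no tables.
    """
    p = Path(path)
    if not p.is_file():
        raise ValueError(f"{path}: file not found")
    # Open read-only so that the check can never create or modify the file.
    conn = sqlite3.connect(p.resolve().as_uri() + "?mode=ro", uri=True)
    try:
        status = conn.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            raise ValueError(f"{path}: integrity check failed: {status}")
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
    except sqlite3.DatabaseError as exc:
        raise ValueError(f"{path}: not a valid SQLite database: {exc}") from exc
    finally:
        conn.close()
    if not tables:
        raise ValueError(f"{path}: database contains no tables")
    return tables
```

A check like this could run as the last step of the recipe in place of the bare `test -f`, so that a truncated or empty file makes the build fail instead of being published.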
> all the rm we added were from experience when the db was partially created.
😱 Makes me wonder if semsql is robust enough to be enabled by default… Maybe it should only be enabled by projects that can rely on “ontology pipeline engineers” being available to take care of it.
In any case I like the idea of having a single pattern rule to handle all productions of .db files, instead of having a similar rule at two or three different places.
> -P config/prefixes.csv without this, the semsql was virtually unusable, as there were always missing prefixes. However, getting config/prefixes.csv right is not easy.
Sounds like another reason for maybe leaving the decision to enable SemSQL export to each project. Not sure how the ODK could automatically make sure the prefix file is always “right”.
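To make the "missing prefixes" concern concrete, here is a minimal sketch of a coverage check: given the contents of a prefix file and a list of IRIs, it reports the IRIs that no declared prefix can abbreviate. It assumes the file is a CSV with a header row and two columns named `prefix` and `base`; that layout is an assumption, as are the function names, and nothing here is part of the ODK or semsql.

```python
import csv
import io


def load_prefix_bases(csv_text):
    """Parse CSV text with `prefix,base` columns into a {prefix: base} dict.

    The two-column layout with a header row is an assumption of this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["prefix"]: row["base"] for row in reader}


def uncovered_iris(iris, prefix_bases):
    """Return the IRIs that do not start with any declared base, in input order."""
    bases = tuple(prefix_bases.values())
    return [iri for iri in iris if not iri.startswith(bases)]


# Illustrative data only: one declared prefix, one covered and one uncovered IRI.
example_csv = "prefix,base\nGO,http://purl.obolibrary.org/obo/GO_\n"
example_iris = [
    "http://purl.obolibrary.org/obo/GO_0008150",
    "http://example.org/other/XYZ_1",
]
missing = uncovered_iris(example_iris, load_prefix_bases(example_csv))
```

Here `missing` contains only the `http://example.org/other/XYZ_1` IRI. A report like this would show a project which prefixes its file lacks, but deciding the correct base for each one would still be manual work.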
I mean, what good would it do to have all ontologies producing their own SQLite files, if said files end up being unusable and people conclude that they would be better off continuing to use the files from the “SemSQL collection”?
Yeah, maybe let's not make it mandatory just yet. We can activate it for all "our" ontologies. And by the time that is done and useful tools have been implemented, I hope we will have a more robust semsql generation pipeline.
I’ll leave it by default for now, as this has the nice side effect that SemSQL export is tested with all the test ontologies from the test suite, instead of only the one in which I explicitly added the db export format.
But I will remove it as a default export format once the PR is finalised.
The SemSQL export format is named `db` (as the file extension), not `semsql`.
Oh, apache.org, why are you doing that to me? 😭
Superseded by #1147
This PR adds `db` as a new export format for release artefacts.
A `.db` file is a SQLite3 file containing a SemSQL representation of the release product.
I would have preferred to use a different name (like `semsql`) for the export name, but currently, the code in the ODK assumes that the name of a format is necessarily the same thing as its extension, so it is not possible to declare a format named `semsql` that is supposed to produce files with an extension of `.db`. This is something that we may want to change in the future, but arguably being able to produce SQLite/SemSQL files is more important than being able to name the format the way we would like.
All tools needed to build SQLite3/SemSQL files (`semsql`, `rdftab`, `sqlite3`, and `relation-graph`) are moved from ODKFull to ODKLite, so that the standard, ODK-generated workflows still only require ODKLite. This increases the size of ODKLite from ~1.34GB to ~1.5GB (compared to ~3.06GB for ODKFull).
closes #1142