Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the SqliteDosStorage storage backend #6148

Merged
merged 3 commits into from
Nov 16, 2023

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Oct 13, 2023

The implementation subclasses the PsqlDosBackend and replaces the
PostgreSQL database with an sqlite database. By doing so, the
initialization of the storage only requires a directory on the local
file system where it will create the sqlite file for the database and a
container for the disk-objectstore.

The advantage of this sqlite_dos storage over the default psql_dos
is that it doesn't require a system service like PostgreSQL. As a
result, creating the storage is very straightforward and can be done
with almost no setup. The advantage over the existing sqlite_zip is
that the sqlite_dos is not read-only but can be used to write data
as well.

Combined with the verdi profile setup command, a working profile can
be created with a single command:

verdi profile setup core.sqlite_dos -n --profile sqlite-dos

This makes this storage backend very useful for tutorials and demos
that don't rely on performance.

@danielhollas
Copy link
Collaborator

@sphuber this is awesome 👍.

It would also be perfect if it was possible to use this for the test suite. Perhaps there could be some pytest switch that would allow changing the storage backend? It is also currently a bit annoying that the backend is loaded unconditionally, even if you run just a single test file that does not require it. Again, a pytest switch that disables the session fixture that loads the backend would be awesome (and I suspect would speed up iteration during development).

Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, great! I left a small comment for a documentation string (maybe you can check if there are other similar cases in the code).

I fully support Daniel's comment - maybe open an issue and we look into it once this is merged?

I also opened aiidateam/aiida-tutorials#468 for the future, when we'll prepare the next tutorial

Finally, one important note. In the previous coding week, we've been discussing having a documentation page for the various backends, and have it properly linked in the docs so people see it when they setup AiiDA. And this was combined with a simpler verdi profile setup mechanism. I think @superstar54 , @unkcpz and/or @eimrek might have been working on this? Has this ever been merged? I recommend adding the documentation for these new backend as well. Either we open an issue now and plan to fix it soon, or if that documentation is already in the docs, we could fix it directly in this PR

daemon__recursion_limit: int = Field(3000, description='Maximum recursion depth for the daemon workers.')
db__batch_size: int = Field(
100000,
description='Batch size for bulk CREATE operations in the database. Avoids hitting MaxAllocSize of PostgreSQL '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall this still be speaking explicitly of PostgreSQL? Maybe OK to keep but add "E.g., to avoid hitting..."

@sphuber
Copy link
Contributor Author

sphuber commented Oct 17, 2023

Thanks for the review @giovannipizzi

I fully support Daniel's comment - maybe open an issue and we look into it once this is merged?

Regarding switching off the automatic loading of the backend in the session fixture: we had discussed this in the past, but decided against it since it is used by plugins and so changing it would be breaking, as they would have to start adding it manually. I agree that having it off by default would still be better, so maybe we can revisit whether this is an acceptable breaking change for plugins.

As for changing the backend plugin: I already have the required scaffolding in place in the aiida-s3 plugin. We can repurpose this and add it to aiida-core.

I also opened aiidateam/aiida-tutorials#468 for the future, when we'll prepare the next tutorial
Finally, one important note. In the previous coding week, we've been discussing having a documentation page for the various backends, and have it properly linked in the docs so people see it when they setup AiiDA. And this was combined with a simpler verdi profile setup mechanism. I think @superstar54 , @unkcpz and/or @eimrek might have been working on this? Has this ever been merged? I recommend adding the documentation for these new backend as well. Either we open an issue now and plan to fix it soon, or if that documentation is already in the docs, we could fix it directly in this PR

I was not aware of this effort but agree that this would be very useful. This ties neatly into why we cannot merge this PR just yet. There is still some functionality that is not supported by this backend, mostly related to the querybuilder:

  • use of date_trunc in aiida.storage.psql_dos.orm.querybuilder.QueryBuilder.get_creation_statistics
  • queries that use JSONB syntax
  • Group.count()
    If I cannot fix these, then we should clearly document this at the bare minimum, and we should stress that this backend should really only be used for testing, demoing and other non-production type activities.

Since this PR was anyway dependent on a bunch of other PRs being merged first, I had put these final details on hold. But will try to find time soon to work on them.

Update:
The following tests of tests/orm/test_querybuilder.py fail with the sqlite backend:

FAILED tests/orm/test_querybuilder.py::TestRepresentations::test_as_sql - AssertionError: FILES DIFFER:
FAILED tests/orm/test_querybuilder.py::TestRepresentations::test_as_sql_inline - AssertionError: FILES DIFFER:
FAILED tests/orm/test_querybuilder.py::TestRepresentations::test_as_sql_literal_quote - ValueError: SQLite does not support JSON query: aliased(DbNode).extras.['elements'] contains ['Si']
FAILED tests/orm/test_querybuilder.py::test_analyze_query - ValueError: SQLite does not support JSON query: aliased(DbNode).extras.['key'] contains ['x', 1]
FAILED tests/orm/test_querybuilder.py::TestAttributes::test_attribute_existence - AssertionError: assert {'38c68cf9-3c...2c8b5eb7bf06'} == set()
FAILED tests/orm/test_querybuilder.py::TestAttributes::test_attribute_type - AssertionError: assert {'c940c2c7-00...32e7360d1638'} == {'df904228-8d...32e7360d1638'}
FAILED tests/orm/test_querybuilder.py::TestManager::test_statistics - sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such function: date_trunc
FAILED tests/orm/test_querybuilder.py::TestManager::test_statistics_default_class - sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such function: date_trunc

It seems the part of JSON query functionality that is not supported includes at least the contains operator. There also seem to be problems with the JSONB queries respecting type. The !has_key notation also doesn't seem to work as expected.

@sphuber sphuber force-pushed the feature/sqlite-dos-storage branch 2 times, most recently from 2155e14 to aaadf59 Compare November 9, 2023 15:34
@unkcpz
Copy link
Member

unkcpz commented Nov 9, 2023

I think @superstar54 , @unkcpz and/or @eimrek might have been working on this? Has this ever been merged?

@giovannipizzi I missed this comment, sorry.
I think you mentioned the #6070 draft done by @superstar54 during coding week if I remember correctly. Not sure there are overlap work but it was mentioned the Sqlite backend is in aiida-core already.

@sphuber
Copy link
Contributor Author

sphuber commented Nov 10, 2023

Not sure there are overlap work but it was mentioned the Sqlite backend is in aiida-core already.

The PR of @superstar54 uses the SqliteTempBackend which already exists in aiida-core. Here I am adding another backend based on sqlite though, the SqliteDosStorage. The difference is that the latter is persistent, whereas the SqliteTempBackend gets destroyed as soon as the Python interpreter shuts down.

@sphuber
Copy link
Contributor Author

sphuber commented Nov 10, 2023

This is blocked by #6173 Currently it is necessary to specify --first-name, --last-name and --institution with verdi profile setup -n making it way less easy to create a new profile. This should be fixed first.

@sphuber sphuber force-pushed the feature/sqlite-dos-storage branch 3 times, most recently from 026a253 to 758e00d Compare November 16, 2023 14:26
The implementation subclasses the `PsqlDosBackend` and replaces the
PostgreSQL database with an sqlite database. By doing so, the
initialization of the storage only requires a directory on the local
file system where it will create the sqlite file for the database and a
container for the disk-objectstore.

The advantage of this `sqlite_dos` storage over the default `psql_dos`
is that it doesn't require a system service like PostgreSQL. As a
result, creating the storage is very straightforward and can be done
with almost no setup. The advantage over the existing `sqlite_zip` is
that the `sqlite_dos` is not read-only but can be used to write data
as well.

Combined with the `verdi profile setup` command, a working profile can
be created with a single command:

    verdi profile setup core.sqlite_dos -n --profile name --email e@mail

This makes this storage backend very useful for tutorials and demos
that don't rely on performance.
The `store` method of the `SqliteEntityOverride` class, used by the
`SqliteZipBackend` storage backend (and with that all other backends to
subclass this), did not return `self`. This is in conflict with the
signature of the base class that it is overriding.

Since the `SqliteZipBackend` is read-only and so `store` would never be
called, this problem went unnoticed. However, with the addition of the
`SqliteDosStorage` backend which is *not* read-only, this bug would
surface when trying to store a node since certain methods rely on this
method returning the node instance itself.
The storage backends that use sqlite instead of PostgreSQL, i.e.,
`core.sqlite_dos`, `core.sqlite_temp` and `core.sqlite_zip`, piggy back
of the ORM models defined by the `core.psql_dos` backend by dynamically
converting to the sqlite equivalent database models.

The current implementation of `SqlaGroup.count` would except when used
with an sqlite backend since certain columns would be ambiguously
defined:

    sqlite3.OperationalError: ambiguous column name: db_dbgroup.id

This is fixed by explicitly wrapping the classes that are joined in
`sqlalchemy.orm.aliased` which will force sqlalchemy to properly alias
each class removing the ambiguity.
@sphuber sphuber force-pushed the feature/sqlite-dos-storage branch from 758e00d to 8087995 Compare November 16, 2023 14:27
@sphuber sphuber merged commit 5dc1555 into aiidateam:main Nov 16, 2023
36 checks passed
@sphuber sphuber deleted the feature/sqlite-dos-storage branch November 16, 2023 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants