-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define the relationship between nanoarrow and C++ #599
Comments
This is a great question. I've documented using the C++ layer myself in some youtube videos on nanoarrow. I've personally been hoping to see it grow a little bit more to make it easier to iterate some of the nanoarrow structures (for example, I think it would be nice to have an iterator for StructArrays that helps you work with dataframe data), but I also don't have a clear point of view on where exactly the line should be drawn |
Definitely a great point that this is not well defined! A few initial thoughts:
Things like IPC, ADBC, and integration testing are disproportionately affected here, since they are essentially nanoarrow "users" more so than the nanoarrow C library itself. These are places that there is significant verbosity unrelated to the function being tested (e.g., creating an array to roundtrip through IPC), and where I wish I had things like
I am aware of
There could definitely be C++ bindings to the C library (or standalone header-only C++ like sparrow)...I don't think that until now there has been anybody with interest in designing and maintaining the interface. C++17 is definitely the way to go here (but there would need to be discussion on the mailing list of whether this is a good idea since it's possibly stepping on Arrow C++'s/sparrow's toes and there would need to be significant interest in contributing/reviewing).
We definitely should have examples (the fact that we don't have them is just an issue of time) and the intention of the unit testing has never been to replace that. We are a C library, though, and the C library tests will need to have some C library calls in them and should vaguely follow our own advice (e.g., we advise C++ users to use the |
Just chiming in here, while it's not the goal for here, this might be useful (as janky as it is) https://github.com/apache/arrow-adbc/blob/main/docs/source/ext/adbc_cookbook.py Demo here: https://arrow.apache.org/adbc/current/cpp/quickstart.html |
Thumbs-up for building on top of The fact that we have a minimal C interface here is so valuable for more complex build setups as (just about) everybody can use a C-level foreign-function interface. And placing elegant and usable C++ header-only layers on top is then pure magic. I have been told this is called the hour-glass principle. Last time this came up (as well as above) @paleolimbot pointed at sparrow so what would be a differentiator to that (alpha ?) project? |
IIUC sparrow is designed for consumption exclusively as a C++20 library. Although this makes their API seriously expressive, it makes the library less accessible as an ingredient in other APIs. As I understand it,
In light of this, I think strictly framing
I'm hesitant to agree with this since it feels that the implication is once again that the unit tests should contain mostly recognizable usages of the API. Instead I'd say test writers should have a free hand to use and extend C++ helpers as needed, so that the only C library call which appears in a given unit test is the function exercised by the test. For example, no unit test in ipc/ should need to call
I love this. I'll definitely consider borrowing it for writing nanoarrow examples |
I have been looking for a reasonable way to write tutorials/examples for a long time...thank you for writing it/passing this on!
We are on the same page here! I do think that leaning too hard on advanced C++ and/or advanced gtest/gmock in tests limits the ability of those not familiar to contribute, but given that most of the existing tests use pretty much no features of C++/gmock/gtest, I think we can get substantial improvement without going there.
Definitely on board if there is interest (and it sounds like there is)! |
Do we want wrappers like: class Buffer {
public:
void SetAllocator();
void Append(int32_t x) { ArrowBufferAppend(obj_.get(), &x, sizeof(x)); }
uint8_t* data() { return obj_.data; }
int64_t size() { return obj_.size_bytes(); }
private:
UniqueBuffer obj_;
} ...in the C++ bindings? If we do, what do we use for error handling? |
Are we definitely locked into C++17? If not, I wonder if it wouldn't be better to return a std::span instead of a data pointer. As far as error handling is concerned, maybe we could use C++ exceptions? Having used this downstream, I've always found the |
+1 for returning span. FTR, that class can be written in C++11 (arrow-cpp has a polyfill) |
I quite like exceptions although I'm vaguely aware that there's a large and vocal contingent who feel differently. There are some pretty serious performance issues that can result from using them inappropriately (e.g., one time I used them for something like StopIteration and it made the framework I'd written unusably slow), but I don't know that any of the types of errors that usually occur from a non-zero
Anything that we want to use in our own tests still has to be C++11 (we still have a rather important user, DuckDB, who supports C++11, so I think it's reasonable that our tests still run), but we can do things with higher standards if they are fenced with an opt-out define. I don't think we have any users interested in helpful C++ wrappers that are still using C++11 but we do have users that are specifically C++17 that might be (cudf, ADBC), and so I think anything higher than that is maybe not all that useful (but maybe possible with an opt-in define).
+1! It looks like that's our |
After #668 merges I'll put together a draft PR. Maybe a better place to start would be the |
This PR splits `nanoarrow.hpp`, which was getting large, into multiple files (bundling them, of course, when creating the bundled distribution). This is intended to make space for new helpers that are being proposed in #599 although I think it's also just a good idea in general since the utilities to create streams and buffers using templating are in a pretty different category from the things that keep you from leaking memory and many users will only need part of the API. Because it's header-only on top of the C runtime, I think this is probably a good pattern to enable. This doesn't quite solve `nanoarrow_device.hpp` and `nanoarrow_ipc.hpp`...they currently still just include all of `nanoarrow.hpp`. I think we can probably solve that in the bundler but I might punt on that since it's not actually hurting anything at the moment.
Currently it's not explicit what level of C++ is recommended for use in the nanoarrow project. The library itself is of course a minimal API in pure C, but a C++ helper library is packaged alongside and the unit tests are all C++.
This is part of general tension in the project between ergonomics and minimalism of the API. New helper functions are suspect for fear of bloating the API, but this leads to extremely verbose unit tests and code examples.
WRT using unit tests as documentation: although I think it's useful to have complete and self-contained examples of API usage, I think the priority for unit tests should be coverage rather than didactics.
I'd propose:
The text was updated successfully, but these errors were encountered: