API (string dtype): comparisons between different string classes #60639

Open · rhshadrach opened this issue Jan 1, 2025 · 16 comments

Labels: API - Consistency, Needs Discussion, Numeric Operations, Strings

Comments

@rhshadrach (Member)

Some comparisons between different classes of string dtype (e.g. string[pyarrow] and str) raise. Resolving this is straightforward except for deciding which class should be returned. I would expect it to always be determined by the left operand, e.g. string[pyarrow] == str should return string[pyarrow] whereas str == string[pyarrow] should return str. Is this the consensus?

We currently run into issues with how Python handles subclasses with comparison dunders.

import numpy as np
import pandas as pd

lhs = pd.array(["x", pd.NA, "y"], dtype="string[pyarrow]")
rhs = pd.array(["x", pd.NA, "y"], dtype=pd.StringDtype("pyarrow", np.nan))

print(lhs.__eq__(rhs))
# <ArrowExtensionArray>
# [True, <NA>, True]
# Length: 3, dtype: bool[pyarrow]

print(lhs == rhs)
# [ True False  True]

The two results above differ because ArrowStringArrayNumpySemantics is a proper subclass of ArrowStringArray and therefore Python first calls rhs.__eq__(lhs).
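A minimal standalone sketch of the dispatch rule at play here (plain Python classes, no pandas): when the right operand's class is a proper subclass of the left operand's class and overrides the comparison dunder, Python tries the right operand's method first.

class Base:
    def __eq__(self, other):
        return "Base.__eq__"

class Sub(Base):
    def __eq__(self, other):
        return "Sub.__eq__"

# The subclass on the right wins, even though it is not the left operand.
print(Base() == Sub())  # Sub.__eq__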

We can avoid this by special-casing this particular combination in ArrowStringArrayNumpySemantics (a sketch of the idea is below), but I wanted to open an issue for discussion before proceeding.
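One hedged sketch of that special-casing, using stand-in classes since ArrowStringArray and ArrowStringArrayNumpySemantics are pandas internals: the subclass returns NotImplemented when it sees its parent class, so Python falls back to the left operand's method.

class StringArrayNA:                   # stand-in for ArrowStringArray
    def __eq__(self, other):
        return "NA-semantics result"

class StringArrayNaN(StringArrayNA):   # stand-in for ArrowStringArrayNumpySemantics
    def __eq__(self, other):
        if type(other) is StringArrayNA:
            return NotImplemented      # defer, so lhs.__eq__(rhs) runs instead
        return "NaN-semantics result"

print(StringArrayNA() == StringArrayNaN())  # NA-semantics result -- the LHS wins now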

cc @WillAyd @jorisvandenbossche

@rhshadrach added the Numeric Operations, Strings, Needs Discussion, and API - Consistency labels on Jan 1, 2025
@WillAyd (Member) commented Jan 1, 2025

I would expect it should always be the left obj, e.g. string[pyarrow] == str should return string[pyarrow] whereas str == string[pyarrow] should return str

You mean the return types should be bool, right? I'm assuming so, but let me know if I misunderstand.

This is another motivating case for PDEP-13 #58455

I think without that, they should probably just return the bool extension type to at least preserve the possibility of null values

@rhshadrach (Member, Author) commented Jan 1, 2025

Ah, indeed, thanks! I meant the Boolean dtype determined by the left side. So string[pyarrow] == str would return bool[pyarrow]. Is this in conflict with your last sentence above? I'm not sure what the bool extension type means.

@WillAyd (Member) commented Jan 1, 2025

pd.BooleanDtype is the bool extension type
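For concreteness, a small example of that dtype (standard pandas API): the nullable boolean extension dtype keeps pd.NA rather than coercing missing values away.

import pandas as pd

arr = pd.array([True, None, False], dtype="boolean")  # pd.BooleanDtype()
print(arr)
# <BooleanArray>
# [True, <NA>, False]
# Length: 3, dtype: boolean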

@jorisvandenbossche (Member)

Just to be explicit, the two possible return values we currently have for the example above (in the case of consistent dtypes for the left and right operands):

>>> lhs == rhs.astype(lhs.dtype)
<ArrowExtensionArray>
[True, <NA>, True]
Length: 3, dtype: bool[pyarrow]

>>> lhs.astype(rhs.dtype) == rhs
array([ True, False,  True])

are bool[pyarrow] and np.dtype('bool').
I personally agree that the first one should be pd.BooleanDtype instead of ArrowDtype (that was changed a while ago; I thought we had an issue about it, but can't directly find it). But let's focus here on which of the current dtypes should be returned in the case of mixed operands, i.e. essentially the question of whether to always prioritize the left operand or whether there should be a kind of hierarchy.

While letting the left operand take priority generally sounds fine and is what typically happens in Python (Python also automatically calls __eq__ of the lhs before trying the rhs), in the context of array objects and data types it might make more sense to have a form of hierarchy / priority between groups of dtypes.

For example also for the nullable numeric dtypes vs numpy dtypes, we always give preference to the nullable dtype in that case:

>>> ser1 = pd.Series([1, 2, 3], dtype="int64")   # numpy
>>> ser2 = pd.Series([1, 2, 3], dtype="Int64")   # nullable

# numpy gives numpy bool
>>> (ser1 == ser1).dtype
dtype('bool')

# but once left OR right is nullable, result is nullable
>>> (ser1 == ser2).dtype
BooleanDtype
>>> (ser2 == ser1).dtype
BooleanDtype

@jorisvandenbossche (Member)

Here is an overview of the result of == for all possible string dtype combinations (row labels are the LHS dtype, column labels the RHS dtype):

| left \ right     | object        | str (pyarrow) | str (python) | string (pyarrow) | string (python) | string[pyarrow] |
|------------------|---------------|---------------|--------------|------------------|-----------------|-----------------|
| object           | bool          | bool          | bool         | bool[pyarrow]    | boolean         | bool[pyarrow]   |
| str (pyarrow)    | bool          | bool          | <error>      | bool             | <error>         | bool            |
| str (python)     | bool          | bool          | bool         | bool             | bool            | bool            |
| string (pyarrow) | bool[pyarrow] | bool          | <error>      | bool[pyarrow]    | <error>         | bool[pyarrow]   |
| string (python)  | boolean       | boolean       | bool         | boolean          | boolean         | boolean         |
| string[pyarrow]  | bool[pyarrow] | bool          | <error>      | bool[pyarrow]    | <error>         | bool[pyarrow]   |
import itertools

import numpy as np
import pandas as pd
import pyarrow as pa

dtypes = [
    np.dtype(object),
    pd.StringDtype("pyarrow", na_value=np.nan),
    pd.StringDtype("python", na_value=np.nan),
    pd.StringDtype("pyarrow", na_value=pd.NA),
    pd.StringDtype("python", na_value=pd.NA),
    pd.ArrowDtype(pa.string()),
]

results = []
for dt1, dt2 in itertools.product(dtypes, dtypes):
    ser1 = pd.Series(["a", None, "b"], dtype=dt1)
    ser2 = pd.Series(["a", None, "b"], dtype=dt2)
    try:
        res = ser1 == ser2
        results.append((dt1, dt2, res.dtype))
    except Exception:
        results.append((dt1, dt2, "<error>"))

df_results = pd.DataFrame(results, columns=["left", "right", "result"])

print(df_results.pivot(index="left", columns="right", values="result").to_markdown())

Some quick observations:

  • There are a few cases that currently error, especially involving the python vs pyarrow storage of the future default str dtype, so that is something we should fix.
  • For object dtype "strings", all other actual string dtypes take precedence (i.e. they return the "native" bool dtype for their family of strings), even when they are the RHS operand (this is the first row of the table; see the sketch below).
    • Should we do the same for the new default str dtype? I.e. should essentially the second and third rows be the same as the first row?
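A quick illustration of that precedence, using the NA-variant string dtype with python storage (per the corresponding cells of the table above): the extension dtype determines the result dtype no matter which side the object-dtype operand is on.

import pandas as pd

obj = pd.Series(["a", None, "b"], dtype=object)
string = pd.Series(["a", None, "b"], dtype=pd.StringDtype("python"))  # NA-variant

print((obj == string).dtype)  # boolean -- the string dtype wins on the right
print((string == obj).dtype)  # boolean -- and on the left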

@WillAyd (Member) commented Jan 3, 2025

That's quite the table there... thanks for putting it together. I still think that equality ops on all of the extension types should return the boolean extension type. The "string(python)" type has done this for years, so I think the new iterations (backed by pyarrow, and the "str" types) should continue along with that API.

In general I find this a really non-value-added activity for end users of the library, so long term I still push for the logical type system. But in the interim I think we can avoid rocking the boat too much and just stick with what was already in place for the string extension type.

@jorisvandenbossche (Member)

(about returning the boolean extension dtype) ... The "string(python)" type has done this for years, so I think the new iterations (backed by pyarrow and the "str" types) should continue along with that API.

For "string(pyarrow)", fully agreed (and as mentioned above, this dtype also did that in the past, this was only changed in pandas 2.0).
But to be clear, for the "str" variants, that is IMO out of the question on the short term. As that was for me the whole point to write a PDEP about it and the reason we introduced the "str" / pd.StringDtype(na_value=np.nan) variants in addition to the existing StringDtype using pd.NA, i.e. so it could have different beahviour regarding default dtypes and missing value (and so not use the opt-in "nullable" dtypes, for now).

(of course we could also have added a NaN variant of the boolean extension dtype, and then it could have used that. But we didn't do that, and I think that is now too late for 3.x?)

And I also fully agree we should improve that messy table with logical dtypes.


But that said, we still need to make a short-term decision for 2.3 / 3.x about which resulting dtype to use in the case of mixed dtype operands:

  1. Give priority to the LHS
    • ser[str] == ser[string] -> numpy bool dtype
    • ser[string] == ser[str] -> boolean extension dtype
  2. Define some hierarchy (e.g. object < NaN-variant "str" < NA-variant "string" < ArrowDtype):
    • ser[str] == ser[string] -> boolean extension dtype
    • ser[string] == ser[str] -> boolean extension dtype

(and for the hierarchy option, also have "python < pyarrow" for the storage within the NaN or NA variant. This is less relevant for == equality, but for example for + resulting in new strings, the exact string dtype and its storage is also relevant)

I personally lean to the second option, I think.

(side note: also when having logical dtypes, we will have to answer this question if we have multiple variants of the same logical dtype)
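A hypothetical sketch of what option 2 could look like (the keys and the exact ordering here are illustrative only, not an agreed pandas API): the result dtype is resolved from a fixed priority rather than from operand order.

# Illustrative hierarchy, per option 2 above (not actual pandas code):
# object < NaN-variant "str" < NA-variant "string" < ArrowDtype,
# with python storage below pyarrow storage within each variant.
PRIORITY = {
    "object": 0,
    "str[python]": 1,
    "str[pyarrow]": 2,
    "string[python]": 3,
    "string[pyarrow]": 4,
    "arrow[string]": 5,
}

def winning_dtype(left: str, right: str) -> str:
    """Pick the operand higher in the hierarchy, ignoring which side it is on."""
    return max(left, right, key=PRIORITY.__getitem__)

assert winning_dtype("str[python]", "string[python]") == "string[python]"
assert winning_dtype("string[python]", "str[python]") == "string[python]"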

@WillAyd (Member) commented Jan 3, 2025

Out of those two options I definitely prefer the hierarchy; giving the lhs more importance than the rhs is really hard to keep track of, especially in a potentially long pipeline of operations. I think your proposed solution works well too, although something does feel weird about mixing NA markers across the str / string types... but that may be the least of all evils.

(side note: also when having logical dtypes, we will have to answer this question if we have multiple variants of the same logical dtype)

This may be worth further discussing in the PDEP, but I think it is just an implementation detail (?) and not something that should worry the end user too much

@rhshadrach (Member, Author) commented Jan 3, 2025

My main opposition to "always return the boolean extension dtype", in addition to what @jorisvandenbossche mentioned, is that if I have two pyarrow-backed Series with different NA-semantics, I would be very surprised to get back a NumPy-backed Series. Likewise, if I have two NaN Series (pd.StringDtype("python", na_value=np.nan) and pd.StringDtype("pyarrow", na_value=np.nan)), I would also be surprised to get back a Series with pd.NA-semantics.

I think this puts me in the hierarchy camp. My proposed hierarchy is:

object < (pyarrow, NaN) < (python, NaN) <  (pyarrow, NA) < (python, NA)

My reasoning is that if you have NA-backed data, you've done something to opt into it, as it isn't the default. Likewise, if you have PyArrow installed and you have python-backed strings, you've done something to get that, as it isn't the default. So since you've opted in, we should give it preference.

Of course, hopefully all of this is a pathological edge-case that users don't encounter.

@WillAyd (Member) commented Jan 4, 2025

My reasoning is that if you have NA-backed data, you've done something to opt into it as it isn't the default. Likewise, if you have PyArrow installed and you have python-backed strings, you've done something to get that as it isn't the default. So since you've opted into, we should give it preference.

Of course, hopefully all of this is a pathological edge-case that users don't encounter.

This makes sense but then shouldn't the pyarrow implementations get a higher priority than the python ones in your hierarchy?

Of course, hopefully all of this is a pathological edge-case that users don't encounter.

Unfortunately I think this is going to be a very common occurrence. Especially when considering I/O formats that preserve metadata (namely parquet), it will be pretty easy to mix all these up.

My main opposition to "always return Boolean extension dtype", in addition to what @jorisvandenbossche mentioned, is that if I have two pyarrow-backed Series with different NA-semantics, I would be very surprised to get back a NumPy-backed Series.

This is definitely valid with the current implementation of the extension types, although keep in mind that with the proposed logical types the storage becomes an implementation detail. We could set up whatever rules we want to manage this, although in the least pathological cases I would expect pyarrow to be the main storage.

@WillAyd (Member) commented Jan 4, 2025

object < (pyarrow, NaN) < (python, NaN) < (pyarrow, NA) < (python, NA)

Thinking through this one some more, I'm also not sure that the python string implementations should ever take priority over pyarrow. That's a huge performance degradation

@rhshadrach (Member, Author) commented Jan 4, 2025

So since you've opted into, we should give it preference.
This makes sense but then shouldn't the pyarrow implementations get a higher priority than the python ones in your hierarchy?

If you have PyArrow installed, then pd.Series(list("xyz")) gives you PyArrow-backed strings. So to end up with Python-backed strings, you need to have opted into them.
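For example (assuming the pandas 3.x default string inference, which the future.infer_string option opts into on 2.x):

import pandas as pd

pd.set_option("future.infer_string", True)  # opt in to the pandas 3.x default

ser = pd.Series(list("xyz"))
print(ser.dtype)          # the NaN-variant str dtype
print(ser.dtype.storage)  # "pyarrow" if pyarrow is installed, else "python"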

Unfortunately I think this is going to be a very common occurrence. Especially when considering I/O formats that preserve metadata (namely parquet) it will very pretty easy to mix all these up

I expect the common occurrence will be object dtype against one of the other 4, but not e.g. NA-pyarrow against NaN-pyarrow.

@WillAyd (Member) commented Jan 4, 2025

That's probably true for data that you create in your own process, but when you bring I/O into the mix things can easily get mixed up. For instance, if you load a parquet file that someone saved with dtype="string" but you use the default type system, you are going to mix up NA/NaN sentinels, even with PyArrow installed.

I don't know if our parquet I/O keeps the storage backend as part of the metadata, but if it does, that would also make it easy to mix up types (e.g. a user without PyArrow installed saves a file that gets loaded by someone with PyArrow).

@rhshadrach (Member, Author) commented Jan 4, 2025

I don't know if our parquet I/O keeps the storage backend as part of the metadata

Current parquet behavior is to always infer str.

import pandas as pd

pd.set_option("infer_string", True)

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="object")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="string")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="str")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str

@WillAyd (Member) commented Jan 4, 2025

Ah, that looks like a bug. dtype="string" should round-trip (it does without infer_string).

@WillAyd (Member) commented Jan 4, 2025

I vaguely recall some work being done upstream in Arrow to better differentiate these. Not sure if the version matters, but Joris would know best.
