API (string dtype): comparisons between different string classes #60639

Open · rhshadrach opened this issue Jan 1, 2025 · 16 comments

Labels: API - Consistency, Needs Discussion, Numeric Operations, Strings

Comments

@rhshadrach (Member)

Some comparisons between different classes of string dtype (e.g. string[pyarrow] and str) raise. Resolving this is straightforward except for deciding which class should be returned. I would expect it to always be determined by the left operand, e.g. string[pyarrow] == str should return string[pyarrow] whereas str == string[pyarrow] should return str. Is this the consensus?

We currently run into issues with how Python handles subclasses with comparison dunders.

import numpy as np
import pandas as pd

lhs = pd.array(["x", pd.NA, "y"], dtype="string[pyarrow]")
rhs = pd.array(["x", pd.NA, "y"], dtype=pd.StringDtype("pyarrow", np.nan))

print(lhs.__eq__(rhs))
# <ArrowExtensionArray>
# [True, <NA>, True]
# Length: 3, dtype: bool[pyarrow]

print(lhs == rhs)
# [ True False  True]

The two results above differ because ArrowStringArrayNumpySemantics is a proper subclass of ArrowStringArray and therefore Python first calls rhs.__eq__(lhs).
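A minimal standalone sketch of the dispatch rule at play here (plain Python classes, no pandas): when the right operand's class is a proper subclass of the left operand's class and overrides the comparison dunder, Python tries the right operand's method first.

class Base:
    def __eq__(self, other):
        return "Base.__eq__"

class Sub(Base):
    def __eq__(self, other):
        return "Sub.__eq__"

# The subclass on the right wins, even though it is not the left operand.
print(Base() == Sub())  # Sub.__eq__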

We can avoid this by special-casing this particular combination in ArrowStringArrayNumpySemantics (a sketch of the idea is below), but I wanted to open an issue for discussion before proceeding.
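One hedged sketch of that special-casing, using stand-in classes since ArrowStringArray and ArrowStringArrayNumpySemantics are pandas internals: the subclass returns NotImplemented when it sees its parent class, so Python falls back to the left operand's method.

class StringArrayNA:                   # stand-in for ArrowStringArray
    def __eq__(self, other):
        return "NA-semantics result"

class StringArrayNaN(StringArrayNA):   # stand-in for ArrowStringArrayNumpySemantics
    def __eq__(self, other):
        if type(other) is StringArrayNA:
            return NotImplemented      # defer, so lhs.__eq__(rhs) runs instead
        return "NaN-semantics result"

print(StringArrayNA() == StringArrayNaN())  # NA-semantics result -- the LHS wins now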

cc @WillAyd @jorisvandenbossche

@rhshadrach added the Numeric Operations, Strings, Needs Discussion, and API - Consistency labels on Jan 1, 2025
@WillAyd (Member) commented Jan 1, 2025

I would expect it should always be the left obj, e.g. string[pyarrow] == str should return string[pyarrow] whereas str == string[pyarrow] should return str

You mean the return types should be bool, right? I'm assuming so, but let me know if I misunderstand.

This is another motivating case for PDEP-13 #58455

I think without that, they should probably just return the bool extension type to at least preserve the possibility of null values

@rhshadrach (Member, Author) commented Jan 1, 2025

Ah, indeed, thanks! I meant the Boolean dtype determined by the left side. So string[pyarrow] == str would return bool[pyarrow]. Is this in conflict with your last sentence above? I'm not sure what the bool extension type means.

@WillAyd (Member) commented Jan 1, 2025

pd.BooleanDtype is the bool extension type
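For concreteness, a small example of that dtype (standard pandas API): the nullable boolean extension dtype keeps pd.NA rather than coercing missing values away.

import pandas as pd

arr = pd.array([True, None, False], dtype="boolean")  # pd.BooleanDtype()
print(arr)
# <BooleanArray>
# [True, <NA>, False]
# Length: 3, dtype: boolean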

@jorisvandenbossche (Member)

Just to be explicit, the two possible return values we currently have for the example above (in the case of consistent dtypes for the left and right operands):

>>> lhs == rhs.astype(lhs.dtype)
<ArrowExtensionArray>
[True, <NA>, True]
Length: 3, dtype: bool[pyarrow]

>>> lhs.astype(rhs.dtype) == rhs
array([ True, False,  True])

are bool[pyarrow] and np.dtype('bool').
I personally agree that the first one should be pd.BooleanDtype instead of ArrowDtype (that was changed a while ago; I thought we had an issue about it, but can't directly find it). But let's focus here on which of the current dtypes should be returned in the case of mixed operands, i.e. essentially the question of whether to always prioritize the left operand or whether there should be a kind of hierarchy.

While letting the left operand take priority generally sounds fine and is what typically happens in Python (Python also automatically calls __eq__ of the lhs before trying the rhs), in the context of array objects and data types it might make more sense to have a form of hierarchy / priority between groups of dtypes.

For example also for the nullable numeric dtypes vs numpy dtypes, we always give preference to the nullable dtype in that case:

>>> ser1 = pd.Series([1, 2, 3], dtype="int64")   # numpy
>>> ser2 = pd.Series([1, 2, 3], dtype="Int64")   # nullable

# numpy gives numpy bool
>>> (ser1 == ser1).dtype
dtype('bool')

# but once left OR right is nullable, result is nullable
>>> (ser1 == ser2).dtype
BooleanDtype
>>> (ser2 == ser1).dtype
BooleanDtype

@jorisvandenbossche (Member)

Here is an overview of the result of == for all possible string dtype combinations (row labels are the LHS dtype, column labels the RHS dtype):

| left \ right     | object        | str (pyarrow) | str (python) | string (pyarrow) | string (python) | string[pyarrow] |
|------------------|---------------|---------------|--------------|------------------|-----------------|-----------------|
| object           | bool          | bool          | bool         | bool[pyarrow]    | boolean         | bool[pyarrow]   |
| str (pyarrow)    | bool          | bool          | <error>      | bool             | <error>         | bool            |
| str (python)     | bool          | bool          | bool         | bool             | bool            | bool            |
| string (pyarrow) | bool[pyarrow] | bool          | <error>      | bool[pyarrow]    | <error>         | bool[pyarrow]   |
| string (python)  | boolean       | boolean       | bool         | boolean          | boolean         | boolean         |
| string[pyarrow]  | bool[pyarrow] | bool          | <error>      | bool[pyarrow]    | <error>         | bool[pyarrow]   |
import itertools

import numpy as np
import pandas as pd
import pyarrow as pa

dtypes = [
    np.dtype(object),
    pd.StringDtype("pyarrow", na_value=np.nan),
    pd.StringDtype("python", na_value=np.nan),
    pd.StringDtype("pyarrow", na_value=pd.NA),
    pd.StringDtype("python", na_value=pd.NA),
    pd.ArrowDtype(pa.string()),
]

results = []
for dt1, dt2 in itertools.product(dtypes, dtypes):
    ser1 = pd.Series(["a", None, "b"], dtype=dt1)
    ser2 = pd.Series(["a", None, "b"], dtype=dt2)
    try:
        res = ser1 == ser2
        results.append((dt1, dt2, res.dtype))
    except Exception:
        results.append((dt1, dt2, "<error>"))

df_results = pd.DataFrame(results, columns=["left", "right", "result"])

print(df_results.pivot(index="left", columns="right", values="result").to_markdown())

Some quick observations:

  • There are a few cases that currently error, especially involving the python vs pyarrow storage of the future default str dtype, so that is something we should fix.
  • For object dtype "strings", all other actual string dtypes take precedence (i.e. they return the "native" bool dtype for their family of strings), even when they are the RHS operand (this is the first row of the table; see the sketch below).
    • Should we do the same for the new default str dtype? I.e. should essentially the second and third rows be the same as the first row?
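A quick illustration of that precedence, using the NA-variant string dtype with python storage (per the corresponding cells of the table above): the extension dtype determines the result dtype no matter which side the object-dtype operand is on.

import pandas as pd

obj = pd.Series(["a", None, "b"], dtype=object)
string = pd.Series(["a", None, "b"], dtype=pd.StringDtype("python"))  # NA-variant

print((obj == string).dtype)  # boolean -- the string dtype wins on the right
print((string == obj).dtype)  # boolean -- and on the left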

@WillAyd (Member) commented Jan 3, 2025

That's quite the table there... thanks for putting it together. I still think that equality ops on all of the extension types should return the boolean extension type. The "string(python)" type has done this for years, so I think the new iterations (backed by pyarrow, and the "str" types) should continue along with that API.

In general I find this a really non-value-added activity for end users of the library, so long term I still push for the logical type system. But in the interim I think we can avoid rocking the boat too much and just stick with what was already in place for the string extension type.

@jorisvandenbossche (Member)

(about returning the boolean extension dtype) ... The "string(python)" type has done this for years, so I think the new iterations (backed by pyarrow and the "str" types) should continue along with that API.

For "string(pyarrow)", fully agreed (and as mentioned above, this dtype also did that in the past, this was only changed in pandas 2.0).
But to be clear, for the "str" variants, that is IMO out of the question on the short term. As that was for me the whole point to write a PDEP about it and the reason we introduced the "str" / pd.StringDtype(na_value=np.nan) variants in addition to the existing StringDtype using pd.NA, i.e. so it could have different beahviour regarding default dtypes and missing value (and so not use the opt-in "nullable" dtypes, for now).

(of course we could also have added a NaN variant of the boolean extension dtype, and then it could have used that. But we didn't do that, and I think that is now too late for 3.x?)

And I also fully agree we should improve that messy table with logical dtypes.


But that said, we still need to make a short-term decision for 2.3 / 3.x about which resulting dtype to use in the case of mixed dtype operands:

  1. Give priority to the LHS
    • ser[str] == ser[string] -> numpy bool dtype
    • ser[string] == ser[str] -> boolean extension dtype
  2. Define some hierarchy (e.g. object < NaN-variant "str" < NA-variant "string" < ArrowDtype):
    • ser[str] == ser[string] -> boolean extension dtype
    • ser[string] == ser[str] -> boolean extension dtype

(and for the hierarchy option, also have "python < pyarrow" for the storage within the NaN or NA variant. This is less relevant for == equality, but for example for + resulting in new strings, the exact string dtype and its storage is also relevant)

I personally lean to the second option, I think.

(side note: also when having logical dtypes, we will have to answer this question if we have multiple variants of the same logical dtype)
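A hypothetical sketch of what option 2 could look like (the keys and the exact ordering here are illustrative only, not an agreed pandas API): the result dtype is resolved from a fixed priority rather than from operand order.

# Illustrative hierarchy, per option 2 above (not actual pandas code):
# object < NaN-variant "str" < NA-variant "string" < ArrowDtype,
# with python storage below pyarrow storage within each variant.
PRIORITY = {
    "object": 0,
    "str[python]": 1,
    "str[pyarrow]": 2,
    "string[python]": 3,
    "string[pyarrow]": 4,
    "arrow[string]": 5,
}

def winning_dtype(left: str, right: str) -> str:
    """Pick the operand higher in the hierarchy, ignoring which side it is on."""
    return max(left, right, key=PRIORITY.__getitem__)

assert winning_dtype("str[python]", "string[python]") == "string[python]"
assert winning_dtype("string[python]", "str[python]") == "string[python]"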

@WillAyd (Member) commented Jan 3, 2025

Out of those two options I definitely prefer the hierarchy; giving the lhs more importance than the rhs is really hard to keep track of, especially in a potentially long pipeline of operations. I think your proposed solution works well too, although something does feel weird about mixing NA markers across the str / string types... but that may be the least of all evils.

(side note: also when having logical dtypes, we will have to answer this question if we have multiple variants of the same logical dtype)

This may be worth further discussing in the PDEP, but I think it is just an implementation detail (?) and not something that should worry the end user too much

@rhshadrach (Member, Author) commented Jan 3, 2025

My main opposition to "always return the boolean extension dtype", in addition to what @jorisvandenbossche mentioned, is that if I have two pyarrow-backed Series with different NA-semantics, I would be very surprised to get back a NumPy-backed Series. Likewise, if I have two NaN Series (pd.StringDtype("python", na_value=np.nan) and pd.StringDtype("pyarrow", na_value=np.nan)), I would also be surprised to get back a Series with pd.NA-semantics.

I think this puts me in the hierarchy camp. My proposed hierarchy is:

object < (pyarrow, NaN) < (python, NaN) <  (pyarrow, NA) < (python, NA)

My reasoning is that if you have NA-backed data, you've done something to opt into it, as it isn't the default. Likewise, if you have PyArrow installed and you have python-backed strings, you've done something to get that, as it isn't the default. So since you've opted in, we should give it preference.

Of course, hopefully all of this is a pathological edge-case that users don't encounter.

@WillAyd (Member) commented Jan 4, 2025

My reasoning is that if you have NA-backed data, you've done something to opt into it as it isn't the default. Likewise, if you have PyArrow installed and you have python-backed strings, you've done something to get that as it isn't the default. So since you've opted into, we should give it preference.

Of course, hopefully all of this is a pathological edge-case that users don't encounter.

This makes sense but then shouldn't the pyarrow implementations get a higher priority than the python ones in your hierarchy?

Of course, hopefully all of this is a pathological edge-case that users don't encounter.

Unfortunately I think this is going to be a very common occurrence. Especially when considering I/O formats that preserve metadata (namely parquet), it will be pretty easy to mix all these up.

My main opposition to "always return Boolean extension dtype", in addition to what @jorisvandenbossche mentioned, is that if I have two pyarrow-backed Series with different NA-semantics, I would be very surprised to get back a NumPy-backed Series.

This is definitely valid with the current implementation of the extension types, although keep in mind that with the proposed logical types the storage becomes an implementation detail. We could set up whatever rules we want to manage this, although in the least pathological cases I would expect pyarrow to be the main storage.

@WillAyd (Member) commented Jan 4, 2025

object < (pyarrow, NaN) < (python, NaN) < (pyarrow, NA) < (python, NA)

Thinking through this one some more, I'm also not sure that the python string implementations should ever take priority over pyarrow. That's a huge performance degradation

@rhshadrach (Member, Author) commented Jan 4, 2025

So since you've opted into, we should give it preference.
This makes sense but then shouldn't the pyarrow implementations get a higher priority than the python ones in your hierarchy?

If you have PyArrow installed, then pd.Series(list("xyz")) gives you PyArrow-backed strings. So to end up with Python-backed strings, you need to have opted into them.
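For example (assuming the pandas 3.x default string inference, which the future.infer_string option opts into on 2.x):

import pandas as pd

pd.set_option("future.infer_string", True)  # opt in to the pandas 3.x default

ser = pd.Series(list("xyz"))
print(ser.dtype)          # the NaN-variant str dtype
print(ser.dtype.storage)  # "pyarrow" if pyarrow is installed, else "python"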

Unfortunately I think this is going to be a very common occurrence. Especially when considering I/O formats that preserve metadata (namely parquet) it will very pretty easy to mix all these up

I expect the common occurrence will be object dtype against one of the other 4, but not e.g. NA-pyarrow against NaN-pyarrow.

@WillAyd (Member) commented Jan 4, 2025

That's probably true for data that you create in your own process, but when you bring I/O into the mix things can easily get mixed up. For instance, if you load a parquet file that someone saved with dtype="string" but you use the default type system, you are going to mix up NA/NaN sentinels, even with PyArrow installed.

I don't know if our parquet I/O keeps the storage backend as part of the metadata, but if it does, that would also make it easy to mix up types (e.g. a user without PyArrow installed saves a file that gets loaded by someone with PyArrow).

@rhshadrach (Member, Author) commented Jan 4, 2025

I don't know if our parquet I/O keeps the storage backend as part of the metadata

Current parquet behavior is to always infer str.

import pandas as pd

pd.set_option("infer_string", True)

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="object")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="string")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="str")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str

@WillAyd (Member) commented Jan 4, 2025

Ah, that looks like a bug. dtype="string" should round-trip (it does without infer_string).

@WillAyd (Member) commented Jan 4, 2025

I vaguely recall some work being done upstream in Arrow to better differentiate these. Not sure if the version matters, but Joris would know best.
