BUG: MultiIndex union/difference not commutative #60642

Open
2 of 3 tasks
ssche opened this issue Jan 2, 2025 · 1 comment
Labels
Bug Index Related to the Index class or subclasses Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Needs Info Clarification about behavior needed to assess issue Needs Triage Issue that has not been reviewed by a pandas team member setops union, intersection, difference, symmetric_difference

Comments

@ssche
Contributor

ssche commented Jan 2, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

I wasn't able to extract the data for this example to set up the test case more programmatically, but I managed to reduce the data significantly without compromising the behaviour. I hope the below is somewhat portable (please let me know if it isn't). I used numpy==2.2.1 and pandas==2.2.3

>>> import pickle

>>> ix1_pickled = [redacted]
>>> ix2_pickled = [redacted]
>>> ix1 = pickle.loads(ix1_pickled)
>>> ix2 = pickle.loads(ix2_pickled)

>>> ix1
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])

>>> ix2
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>>
>>> # expected - both indices are the same - should yield same result
>>> ix2.union(ix1)
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> 
>>> # it seems each row is considered different
>>> ix1.union(ix2)
MultiIndex([(nan, '2018-06-01'),
            (nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> 
>>> # expected
>>> ix1.difference(ix2)
MultiIndex([], names=['dim1', 'dim2'])
>>> 
>>> # not expected
>>> ix2.difference(ix1)
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])

>>> # some diagnostics to show the values (the dates other than `2018-06-01` could come from my attempt to minimise the example as those values were previously contained)
>>> ix1.levels
FrozenList([[nan], [2018-06-01 00:00:00, 2018-07-01 00:00:00, 2018-08-01 00:00:00, 2018-09-01 00:00:00, 2018-10-01 00:00:00, 2018-11-01 00:00:00, 2018-12-01 00:00:00]])
>>> ix2.levels
FrozenList([[nan], [2018-06-01 00:00:00, 2018-07-01 00:00:00, 2018-08-01 00:00:00, 2018-09-01 00:00:00, 2018-10-01 00:00:00, 2018-11-01 00:00:00, 2018-12-01 00:00:00]])
>>> ix1.levels[1] == ix2.levels[1]
array([ True,  True,  True,  True,  True,  True,  True])

Issue Description

Creating the union of two indices with a nan level causes the result to depend on the order of the call (index1.union(index2) vs. index2.union(index1)). In other words, one of the calls yields the wrong result because it treats every row as distinct. I'm fairly certain that this is due to the nan value in dim1, but if I recreate the example programmatically, the behaviour is as expected.

>>> import numpy as np
>>> import pandas as pd
>>> ix3 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]])
>>> ix3
MultiIndex([(nan, '2018-06-01')],
           )
>>> ix4 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]])
>>> 
>>> ix4.dtypes
level_0           float64
level_1    datetime64[ns]
dtype: object
>>> ix3.dtypes
level_0           float64
level_1    datetime64[ns]
dtype: object
>>> ix3.union(ix4)
MultiIndex([(nan, '2018-06-01')],
           )
>>> ix4.union(ix3)
MultiIndex([(nan, '2018-06-01')],
           )

However, in test cases for a rather large application, I arrive at the state from the pickle example. I'm not sure what differs from the working example.
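One difference that may be worth checking, offered as an untested hypothesis rather than a confirmed cause: `from_product` factorizes its inputs, so a missing value is normally encoded with code `-1` and left out of the levels, whereas the pickled `ix1.levels` above lists `nan` as an actual level entry. The sketch below builds both encodings side by side so their levels and codes can be compared. `ix_explicit` and its codes are assumptions chosen for illustration, not the contents of the pickled index.

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("2018-06-01")

# Route 1: factorized construction, as in the ix3/ix4 example.
ix_factorized = pd.MultiIndex.from_product(
    [[np.nan], [ts]], names=["dim1", "dim2"]
)

# Route 2 (assumption): nan stored as a real level entry referenced by code 0.
# verify_integrity=False skips the integrity checks on levels and codes.
ix_explicit = pd.MultiIndex(
    levels=[[np.nan], [ts]],
    codes=[[0], [0]],
    names=["dim1", "dim2"],
    verify_integrity=False,
)

# Print the internal representation of both indices for comparison.
for label, ix in [("factorized", ix_factorized), ("explicit", ix_explicit)]:
    print(label, "levels:", [list(level) for level in ix.levels])
    print(label, "codes: ", [list(codes) for codes in ix.codes])
```

Comparing `ix1.codes` and `ix2.codes` from the pickled indices against these two forms might show which encoding each one uses.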

Expected Behavior

I would expect the difference of the two indices from the pickled example to be empty and the union to be the same as the two indices.
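The expectation above can be written down as a small check. This is only a sketch: `setop_symmetry_report` is a hypothetical helper name, and the indices used here are rebuilt with `from_product` like ix3/ix4, so they stand in for the pickled ones rather than reproduce them.

```python
import numpy as np
import pandas as pd


def setop_symmetry_report(a, b):
    """Summarise whether union/difference behave symmetrically for a and b.

    Intended for two indices that are expected to hold the same rows.
    """
    return {
        "union_lengths_match": len(a.union(b)) == len(b.union(a)),
        "a_minus_b_empty": len(a.difference(b)) == 0,
        "b_minus_a_empty": len(b.difference(a)) == 0,
    }


# Stand-in indices built the same way as ix3/ix4 in the report.
ts = pd.Timestamp("2018-06-01")
left = pd.MultiIndex.from_product([[np.nan], [ts]], names=["dim1", "dim2"])
right = pd.MultiIndex.from_product([[np.nan], [ts]], names=["dim1", "dim2"])

print(setop_symmetry_report(left, right))
```

Running the same helper on the pickled `ix1` and `ix2` would be expected to flag the asymmetry described in this report.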

I am also at a loss as to why I can't reproduce the wrong behaviour programmatically.

Installed Versions


INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.13.1
python-bits           : 64
OS                    : Linux
OS-release            : 6.12.6-200.fc41.x86_64
Version               : #1 SMP PREEMPT_DYNAMIC Thu Dec 19 21:06:34 UTC 2024
machine               : x86_64
processor             : 
byteorder             : little
LC_ALL                : None
LANG                  : en_AU.UTF-8
LOCALE                : en_AU.UTF-8

pandas                : 2.2.3
numpy                 : 2.2.1
pytz                  : 2020.4
dateutil              : 2.9.0.post0
pip                   : 24.3.1
Cython                : 3.0.11
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : 1.4.2
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : 2.10.2
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : None
psycopg2              : 2.9.10
pymysql               : None
pyarrow               : 18.1.0
pyreadstat            : None
pytest                : 8.3.4
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.14.1
sqlalchemy            : None
tables                : 3.10.1
tabulate              : None
xarray                : None
xlrd                  : 2.0.1
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None
@ssche ssche added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Index Related to the Index class or subclasses Needs Triage Issue that has not been reviewed by a pandas team member setops union, intersection, difference, symmetric_difference labels Jan 2, 2025
@rhshadrach
Member

rhshadrach commented Jan 4, 2025

pickle can perform arbitrary code execution and thus presents security issues. Can you post your example without using pickle? For instance:

ix1 = pd.MultiIndex(levels=ix1.levels, codes=ix1.codes, names=ix1.names, verify_integrity=False)
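To make that constructor call postable as text, the levels and codes can first be dumped as plain Python lists. The following is a minimal sketch under that assumption; `multiindex_to_plain` and `multiindex_from_plain` are hypothetical helper names, not pandas API.

```python
import pandas as pd


def multiindex_to_plain(ix):
    """Return levels, codes and names of a MultiIndex as plain Python lists."""
    return {
        "levels": [level.tolist() for level in ix.levels],
        "codes": [[int(code) for code in codes] for codes in ix.codes],
        "names": list(ix.names),
    }


def multiindex_from_plain(parts):
    """Rebuild a MultiIndex from the dict produced by multiindex_to_plain."""
    return pd.MultiIndex(
        levels=parts["levels"],
        codes=parts["codes"],
        names=parts["names"],
        verify_integrity=False,  # keep levels/codes exactly as given
    )


# Usage on a small, well-formed index (illustrative data only).
original = pd.MultiIndex.from_tuples([(1, "a"), (2, "b")], names=["x", "y"])
parts = multiindex_to_plain(original)
rebuilt = multiindex_from_plain(parts)
print(parts)
```

Printing `parts` for the problematic indices gives text that can be pasted into the issue in place of the pickled bytes. Note that level dtypes are re-inferred from the lists on reconstruction, so they may need to be stated separately if they matter.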

@rhshadrach rhshadrach added the Needs Info Clarification about behavior needed to assess issue label Jan 4, 2025