Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Problem converting pandas dataframe to memtable when column contains both float and string values #10633

Closed
1 task done
rhstanton opened this issue Dec 30, 2024 · 7 comments
Labels
question Questions about the library

Comments

@rhstanton
Copy link

What happened?

I've been running into problems reading Excel spreadsheets into ibis via pandas, and have tracked things down to the issue in the title. Here's sample code to show the problem:

import pandas as pd
import ibis

ibis.options.interactive = True

df = pd.DataFrame({"Amount" : [1.1, 2.2, '?']})
df_ibis = ibis.memtable(df)
df_ibis.head()

Instead of creating and displaying the ibis table, as expected, I get an error (more detail below):

ArrowTypeError: ("Expected bytes, got a 'float' object", 'Conversion failed for column Amount with type object')

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

Default: DuckDB

Relevant log output

---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
File ~/projects/Tactus/.env/lib/python3.13/site-packages/IPython/core/formatters.py:770, in PlainTextFormatter.__call__(self, obj)
    763 stream = StringIO()
    764 printer = pretty.RepresentationPrinter(stream, self.verbose,
    765     self.max_width, self.newline,
    766     max_seq_length=self.max_seq_length,
    767     singleton_pprinters=self.singleton_printers,
    768     type_pprinters=self.type_printers,
    769     deferred_pprinters=self.deferred_printers)
--> 770 printer.pretty(obj)
    771 printer.flush()
    772 return stream.getvalue()

File ~/projects/Tactus/.env/lib/python3.13/site-packages/IPython/lib/pretty.py:419, in RepresentationPrinter.pretty(self, obj)
    408                         return meth(obj, self, cycle)
    409                 if (
    410                     cls is not object
    411                     # check if cls defines __repr__
   (...)
    417                     and callable(_safe_getattr(cls, "__repr__", None))
    418                 ):
--> 419                     return _repr_pprint(obj, self, cycle)
    421     return _default_pprint(obj, self, cycle)
    422 finally:

File ~/projects/Tactus/.env/lib/python3.13/site-packages/IPython/lib/pretty.py:794, in _repr_pprint(obj, p, cycle)
    792 """A pprint that just redirects to the normal repr function."""
    793 # Find newlines and replace them with p.break_()
--> 794 output = repr(obj)
    795 lines = output.splitlines()
    796 with p.group():

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/expr/types/core.py:83, in Expr.__repr__(self)
     81 def __repr__(self) -> str:
     82     if ibis.options.interactive:
---> 83         return _capture_rich_renderable(self)
     84     else:
     85         return self._noninteractive_repr()

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/expr/types/core.py:63, in _capture_rich_renderable(renderable)
     61 console = Console(force_terminal=False)
     62 with console.capture() as capture:
---> 63     console.print(renderable)
     64 return capture.get().rstrip()

File ~/projects/Tactus/.env/lib/python3.13/site-packages/rich/console.py:1705, in Console.print(self, sep, end, style, justify, overflow, no_wrap, emoji, markup, highlight, width, height, crop, soft_wrap, new_line_start, *objects)
   1703 if style is None:
   1704     for renderable in renderables:
-> 1705         extend(render(renderable, render_options))
   1706 else:
   1707     for renderable in renderables:

File ~/projects/Tactus/.env/lib/python3.13/site-packages/rich/console.py:1306, in Console.render(self, renderable, options)
   1304 renderable = rich_cast(renderable)
   1305 if hasattr(renderable, "__rich_console__") and not isclass(renderable):
-> 1306     render_iterable = renderable.__rich_console__(self, _options)
   1307 elif isinstance(renderable, str):
   1308     text_renderable = self.render_str(
   1309         renderable, highlight=_options.highlight, markup=_options.markup
   1310     )

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/expr/types/core.py:106, in Expr.__rich_console__(self, console, options)
    103 if opts.interactive:
    104     from ibis.expr.types.pretty import to_rich
--> 106     rich_object = to_rich(self, console_width=console_width)
    107 else:
    108     rich_object = Text(self._noninteractive_repr())

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/expr/types/pretty.py:274, in to_rich(expr, max_rows, max_columns, max_length, max_string, max_depth, console_width)
    270     return _to_rich_scalar(
    271         expr, max_length=max_length, max_string=max_string, max_depth=max_depth
    272     )
    273 else:
--> 274     return _to_rich_table(
    275         expr,
    276         max_rows=max_rows,
    277         max_columns=max_columns,
    278         max_length=max_length,
    279         max_string=max_string,
    280         max_depth=max_depth,
    281         console_width=console_width,
    282     )

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/expr/types/pretty.py:347, in _to_rich_table(tablish, max_rows, max_columns, max_length, max_string, max_depth, console_width)
    344     if orig_ncols > len(computed_cols):
    345         table = table.select(*computed_cols)
--> 347 result = table.limit(max_rows + 1).to_pyarrow()
    348 # Now format the columns in order, stopping if the console width would
    349 # be exceeded.
    350 col_info = []

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/expr/types/core.py:489, in Expr.to_pyarrow(self, params, limit, **kwargs)
    461 @experimental
    462 def to_pyarrow(
    463     self,
   (...)
    467     **kwargs: Any,
    468 ) -> pa.Table:
    469     """Execute expression and return results in as a pyarrow table.
    470 
    471     This method is eager and will execute the associated expression
   (...)
    487         A pyarrow table holding the results of the executed expression.
    488     """
--> 489     return self._find_backend(use_default=True).to_pyarrow(
    490         self, params=params, limit=limit, **kwargs
    491     )

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/backends/duckdb/__init__.py:1430, in Backend.to_pyarrow(self, expr, params, limit, **_)
   1422 def to_pyarrow(
   1423     self,
   1424     expr: ir.Expr,
   (...)
   1428     **_: Any,
   1429 ) -> pa.Table:
-> 1430     table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
   1431     return expr.__pyarrow_result__(table)

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/backends/duckdb/__init__.py:1372, in Backend._to_duckdb_relation(self, expr, params, limit)
   1356 def _to_duckdb_relation(
   1357     self,
   1358     expr: ir.Expr,
   (...)
   1361     limit: int | str | None = None,
   1362 ):
   1363     """Preprocess the expr, and return a `duckdb.DuckDBPyRelation` object.
   1364 
   1365     When retrieving in-memory results, it's faster to use `duckdb_con.sql`
   (...)
   1370     `duckdb_con.execute` everywhere else.
   1371     """
-> 1372     self._run_pre_execute_hooks(expr)
   1373     table_expr = expr.as_table()
   1374     sql = self.compile(table_expr, limit=limit, params=params)

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/backends/duckdb/__init__.py:1354, in Backend._run_pre_execute_hooks(self, expr)
   1351 if expr.op().find((ops.GeoSpatialUnOp, ops.GeoSpatialBinOp)):
   1352     self.load_extension("spatial")
-> 1354 super()._run_pre_execute_hooks(expr)

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/backends/__init__.py:1144, in BaseBackend._run_pre_execute_hooks(self, expr)
   1142 """Backend-specific hooks to run before an expression is executed."""
   1143 self._register_udfs(expr)
-> 1144 self._register_in_memory_tables(expr)

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/backends/__init__.py:1119, in BaseBackend._register_in_memory_tables(self, expr)
   1117 for memtable in expr.op().find(ops.InMemoryTable):
   1118     if not self._in_memory_table_exists(memtable.name):
-> 1119         self._register_in_memory_table(memtable)
   1120         weakref.finalize(
   1121             memtable, self._finalize_in_memory_table, memtable.name
   1122         )

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/backends/duckdb/__init__.py:1619, in Backend._register_in_memory_table(self, op)
   1618 def _register_in_memory_table(self, op: ops.InMemoryTable) -> None:
-> 1619     self.con.register(op.name, op.data.to_pyarrow(op.schema))

File ~/projects/Tactus/.env/lib/python3.13/site-packages/ibis/formats/pandas.py:402, in PandasDataFrameProxy.to_pyarrow(self, schema)
    399 import pyarrow_hotfix  # noqa: F401
    401 pyarrow_schema = PyArrowSchema.from_ibis(schema)
--> 402 return pa.Table.from_pandas(self.obj, schema=pyarrow_schema)

File ~/projects/Tactus/.env/lib/python3.13/site-packages/pyarrow/table.pxi:4751, in pyarrow.lib.Table.from_pandas()

File ~/projects/Tactus/.env/lib/python3.13/site-packages/pyarrow/pandas_compat.py:625, in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    620     return (isinstance(arr, np.ndarray) and
    621             arr.flags.contiguous and
    622             issubclass(arr.dtype.type, np.integer))
    624 if nthreads == 1:
--> 625     arrays = [convert_column(c, f)
    626               for c, f in zip(columns_to_convert, convert_fields)]
    627 else:
    628     arrays = []

File ~/projects/Tactus/.env/lib/python3.13/site-packages/pyarrow/pandas_compat.py:612, in dataframe_to_arrays.<locals>.convert_column(col, field)
    607 except (pa.ArrowInvalid,
    608         pa.ArrowNotImplementedError,
    609         pa.ArrowTypeError) as e:
    610     e.args += ("Conversion failed for column {!s} with type {!s}"
    611                .format(col.name, col.dtype),)
--> 612     raise e
    613 if not field_nullable and result.null_count > 0:
    614     raise ValueError("Field {} was non-nullable but pandas column "
    615                      "had {} null values".format(str(field),
    616                                                  result.null_count))

File ~/projects/Tactus/.env/lib/python3.13/site-packages/pyarrow/pandas_compat.py:606, in dataframe_to_arrays.<locals>.convert_column(col, field)
    603     type_ = field.type
    605 try:
--> 606     result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    607 except (pa.ArrowInvalid,
    608         pa.ArrowNotImplementedError,
    609         pa.ArrowTypeError) as e:
    610     e.args += ("Conversion failed for column {!s} with type {!s}"
    611                .format(col.name, col.dtype),)

File ~/projects/Tactus/.env/lib/python3.13/site-packages/pyarrow/array.pxi:360, in pyarrow.lib.array()

File ~/projects/Tactus/.env/lib/python3.13/site-packages/pyarrow/array.pxi:87, in pyarrow.lib._ndarray_to_array()

File ~/projects/Tactus/.env/lib/python3.13/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowTypeError: ("Expected bytes, got a 'float' object", 'Conversion failed for column Amount with type object')

Code of Conduct

  • I agree to follow this project's Code of Conduct
@rhstanton rhstanton added the bug Incorrect behavior inside of ibis label Dec 30, 2024
@cpcloud
Copy link
Member

cpcloud commented Jan 1, 2025

Thanks for the issue.

What behavior are you expecting here with respect to the column type of Amount?

@cpcloud cpcloud added question Questions about the library and removed bug Incorrect behavior inside of ibis labels Jan 1, 2025
@rhstanton
Copy link
Author

I'm not sure I really mind. It would just be nice not to have to go through a pandas clean-up step first (after diagnosing the problem) before running ibis.memtable(). After all, this is a very common type of data inconsistency in Excel spreadsheets that have been created by hand. I'm in the process of switching from pandas to ibis for everything and it's been a 99% positive experience so far (thank you!), but this was one time where using pandas would definitely have been quicker.

Is there a better way to read an xlsx file into ibis than going via pandas first? If you're looking for suggested enhancements, a built-in read_excel function would be nice (though I do understand that DuckDB doesn't have one of those).

@cpcloud
Copy link
Member

cpcloud commented Jan 1, 2025

In fact there is one sneakily embedded in the spatial extension!

D COPY (select 1 as a) TO 'output.xlsx' WITH (FORMAT GDAL, DRIVER 'xlsx');
D select * from st_read('output.xlsx');
┌───────┐
│   a   │
│ int32 │
├───────┤
│     1 │
└───────┘

One workaround, which I suppose could be officially promoted as "the way" to read excel files is to use con.read_geo:

In [1]: from ibis.interactive import *

In [2]: con = ibis.duckdb.connect()

In [3]: t = con.read_geo('output.xlsx')

In [4]: t
Out[4]:
┏━━━━━━━┓
┃ a     ┃
┡━━━━━━━┩
│ int32 │
├───────┤
│     1 │
└───────┘

@rhstanton
Copy link
Author

rhstanton commented Jan 1, 2025

Very nice! It reads my original spreadsheet without complaint (as type str).

Yes, it still requires some cleaning up, but a. I can at least see the data, making this easier than an import error; and b. I can do this all in ibis, which was the point in the first place.

Thanks!

@cpcloud cpcloud closed this as completed Jan 1, 2025
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jan 1, 2025
@rhstanton
Copy link
Author

A quick follow-up: The original spreadsheet, from which my example was extracted, had two blank lines at the top. Is there a way to skip 2 rows with read_geo? I've tried looking at the recommended DuckDB Web page to see what **kwargs includes, but it doesn't seem to be too helpful... Thanks!

@cpcloud
Copy link
Member

cpcloud commented Jan 1, 2025

It looks like there's no option for that in GDAL itself: https://gdal.org/en/stable/drivers/vector/xlsx.html

@rhstanton
Copy link
Author

Bother...! I guess I'll have to rename everything manually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Questions about the library
Projects
Status: done
Development

No branches or pull requests

2 participants