[python-package] Unable to Train Model with PyArrow Table #6769

Open
crs1910 opened this issue Dec 30, 2024 · 0 comments
crs1910 commented Dec 30, 2024

Description

Training a model from a PyArrow table fails with an unexpected error: TypeError: Cannot initialize Dataset from Table
I have a Polars DataFrame with 53M rows that I converted to a PyArrow table. The error details and the steps to reproduce it are below.

Reproducible example

import lightgbm as lgb

# df_polar is the Polars DataFrame; features and hyperparams are not shown here
arrow_table_train = df_polar.to_arrow()
train_dataset = lgb.Dataset(
    arrow_table_train.select(features),
    label=arrow_table_train["label"],
    weight=arrow_table_train["weight"],
    free_raw_data=True,
)
model = lgb.train(
    hyperparams,
    train_dataset,
    valid_sets=[train_dataset],
    valid_names=["train"],
)
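
A smaller variant with synthetic data may help narrow this down. This is only a sketch: the column names f0 and f1, the row count, and the helper name try_arrow_dataset are hypothetical, and it has not been confirmed to trigger the same error. It reports the outcome as a string instead of raising, and skips cleanly if the packages are not installed.

```python
import importlib.util


def try_arrow_dataset(n_rows: int = 1_000) -> str:
    """Build a small lgb.Dataset from a pyarrow Table and report the outcome."""
    # Skip cleanly when the packages needed for the check are not installed.
    needed = ("lightgbm", "pyarrow", "numpy")
    missing = [m for m in needed if importlib.util.find_spec(m) is None]
    if missing:
        return "skipped: missing " + ", ".join(missing)

    try:
        # Imports live inside the try so that a broken install is reported too.
        import lightgbm as lgb
        import numpy as np
        import pyarrow as pa

        rng = np.random.default_rng(0)
        features = ["f0", "f1"]  # hypothetical feature names
        table = pa.table(
            {
                "f0": rng.normal(size=n_rows),
                "f1": rng.normal(size=n_rows),
                "label": rng.integers(0, 2, size=n_rows),
                "weight": np.ones(n_rows),
            }
        )
        # Same call shape as in the report: Arrow table for data,
        # Arrow columns for label and weight.
        lgb.Dataset(
            table.select(features),
            label=table["label"],
            weight=table["weight"],
            free_raw_data=True,
        ).construct()
    except Exception as err:
        return f"failed: {type(err).__name__}: {err}"
    return "ok"


print(try_arrow_dataset())
```

If this prints "ok" on a small table but the full 53M-row table still fails, the difference is more likely in the data (for example column types) than in the call itself.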

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /my_env/lib/python3.10/site-packages/lightgbm/basic.py:2185, in Dataset._lazy_init(self, data, label, reference, weight, group, init_score, predictor, feature_name, categorical_feature, params, position)
   2184 try:
-> 2185     csr = scipy.sparse.csr_matrix(data)
   2186     self.__init_from_csr(csr, params_str, ref_dataset)

File /my_env/lib/python3.10/site-packages/scipy/sparse/_compressed.py:93, in _cs_matrix.__init__(self, arg1, shape, dtype, copy)
     92 coo = self._coo_container(arg1, dtype=dtype)
---> 93 arrays = coo._coo_to_compressed(self._swap)
     94 self.indptr, self.indices, self.data, self._shape = arrays

File /my_env/lib/python3.10/site-packages/scipy/sparse/_coo.py:374, in _coo_base._coo_to_compressed(self, swap, copy)
    372 data = np.empty_like(self.data, dtype=self.dtype)
--> 374 coo_tocsr(M, N, nnz, major, minor, self.data, indptr, indices, data)
    375 return indptr, indices, data, self.shape

**ValueError: unsupported data types in input**

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[28], line 5
      3 train_dataset = lgb.Dataset(
      4             arrow_table_train.select(features), label=arrow_table_train["label"], weight=arrow_table_train["weight"], free_raw_data=True,
----> 5         )
      6  model = lgb.train(
      7      hyperparams,
      8      train_dataset,
      9      valid_sets=[train_dataset],
     10      valid_names=["train"],
     11  )

File /my_env/lib/python3.10/site-packages/lightgbm/basic.py:2576, in Dataset.construct(self)
   2571             self._set_init_score_by_predictor(
   2572                 predictor=self._predictor, data=self.data, used_indices=used_indices
   2573             )
   2574 else:
   2575     # create train
-> 2576     self._lazy_init(
   2577         data=self.data,
   2578         label=self.label,
   2579         reference=None,
   2580         weight=self.weight,
   2581         group=self.group,
   2582         init_score=self.init_score,
   2583         predictor=self._predictor,
   2584         feature_name=self.feature_name,
   2585         categorical_feature=self.categorical_feature,
   2586         params=self.params,
   2587         position=self.position,
   2588     )
   2589 if self.free_raw_data:
   2590     self.data = None

File /my_env/lib/python3.10/site-packages/lightgbm/basic.py:2188, in Dataset._lazy_init(self, data, label, reference, weight, group, init_score, predictor, feature_name, categorical_feature, params, position)
   2186         self.__init_from_csr(csr, params_str, ref_dataset)
   2187     except BaseException as err:
-> 2188         raise TypeError(f"Cannot initialize Dataset from {type(data).__name__}") from err
   2189 if label is not None:
   2190     self.set_label(label)

**TypeError: Cannot initialize Dataset from Table**

Environment info

LightGBM version: 4.5.0
Python Version: 3.10
Command used to install LightGBM:

pip install lightgbm==4.5.0
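
The traceback shows the Table reaching the scipy.sparse fallback, which suggests LightGBM did not treat it as Arrow input. One possible cause, stated here as an assumption and not confirmed in this thread, is that LightGBM's Arrow support needs both pyarrow and cffi to be importable. A minimal sketch for recording which of these are installed (the helper name arrow_support_report is made up for illustration):

```python
import importlib.util
from importlib import metadata


def arrow_support_report() -> dict:
    """Map each package name to its installed version, or None if not importable."""
    report = {}
    for name in ("lightgbm", "pyarrow", "cffi"):
        if importlib.util.find_spec(name) is None:
            report[name] = None  # package cannot be imported in this environment
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "unknown"  # importable, but no distribution metadata
    return report


print(arrow_support_report())
```

A None entry for pyarrow or cffi would be worth including in the environment info above.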
crs1910 changed the title from "Title: Unable to Train Model with PyArrow Table" to "Unable to Train Model with PyArrow Table" on Dec 30, 2024
jameslamb changed the title from "Unable to Train Model with PyArrow Table" to "[python-package] Unable to Train Model with PyArrow Table" on Dec 30, 2024