Initial support of user-provided datasets (#164)
Alexsandruss authored Nov 12, 2024
1 parent d8ad679 commit c0765de
Showing 4 changed files with 67 additions and 15 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -100,6 +100,6 @@ flowchart TB
- [Benchmarks Runner](sklbench/runner/README.md)
- [Report Generator](sklbench/report/README.md)
- [Benchmarks](sklbench/benchmarks/README.md)
- - [Data Processing](sklbench/datasets/README.md)
+ - [Data Processing and Storage](sklbench/datasets/README.md)
- [Emulators](sklbench/emulators/README.md)
- [Developer Guide](docs/README.md)
51 changes: 41 additions & 10 deletions sklbench/datasets/README.md
@@ -1,4 +1,4 @@
- # Data Handling in Benchmarks
+ # Data Processing and Storage in Benchmarks

Data handling steps:
1. Load data:
@@ -7,6 +7,14 @@ Data handling steps:
2. Split data into subsets if requested
3. Convert to requested form (data type, format, order, etc.)

Existing data sources:
- Synthetic data from sklearn
- OpenML datasets
- Custom loaders for named datasets
- User-provided datasets in a compatible format

## Data Caching

There are two levels of caching, with corresponding directories: `raw cache` for files downloaded from external sources, and `cache` for files in a format suitable for fast loading in benchmarks.

Each dataset has a few associated files in the usual `cache`: data component files (`x`, `y`, `weights`, etc.) and a JSON file with dataset properties (number of classes, clusters, default split arguments).
@@ -21,16 +29,39 @@ data_cache/
```

Cached file formats:
- | Format | File extension | Associated Python types |
- | --- | --- | --- |
- | [Parquet](https://parquet.apache.org) | `.parq` | pandas.DataFrame |
- | Numpy uncompressed binary dense data | `.npz` | numpy.ndarray, pandas.Series |
- | Numpy uncompressed binary CSR data | `.csr.npz` | scipy.sparse.csr_matrix |
+ | Format | File extension | Associated Python types | Comment |
+ | --- | --- | --- | --- |
+ | [Parquet](https://parquet.apache.org) | `.parq` | pandas.DataFrame | |
+ | Numpy uncompressed binary dense data | `.npz` | numpy.ndarray, pandas.Series | Data is stored under the `arr_0` name |
+ | Numpy uncompressed binary CSR data | `.csr.npz` | scipy.sparse.csr_matrix | Data is stored under the `data`, `indices`, and `indptr` names |
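
For illustration, here is a minimal sketch (not from the repository) of writing data components in each of these formats with the key names the cache loader expects; the `my_dataset` file prefix is hypothetical:

```python
# Hypothetical example of producing cache files in the formats above;
# file names follow the `{dataset name}_{component}` convention.
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Parquet for pandas DataFrames (requires pyarrow or fastparquet)
x_df = pd.DataFrame({"f0": [0.1, 0.2], "f1": [1.0, 2.0]})
x_df.to_parquet("my_dataset_x.parq")

# Dense arrays: a positional argument to np.savez is stored under `arr_0`
y = np.array([0, 1])
np.savez("my_dataset_y.npz", y)

# Sparse CSR data: scipy's save_npz stores the `data`, `indices` and `indptr` arrays
x_csr = sp.csr_matrix(np.eye(2))
sp.save_npz("my_dataset_x.csr.npz", x_csr)
```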

- Existing data sources:
- - Synthetic data from sklearn
- - OpenML datasets
- - Custom loaders for named datasets
## How to Modify a Dataset for Compatibility with Scikit-learn_bench

To reuse an existing dataset in scikit-learn_bench, you need to convert its file(s) into a format compatible with the dataset cache loader.

A cached dataset consists of a few files:
- a `{dataset name}.json` file which stores required and optional dataset information
- `{dataset name}_{data component name}.{data component extension}` files which store the dataset components (data, labels, etc.)

Example of `{dataset name}.json`:
```json
{"n_classes": 2, "default_split": {"test_size": 0.2, "random_state": 11}}
```

The `n_classes` property in a dataset info file is *required* for classification datasets.

Currently, `x` (data) and `y` (labels) are the only supported and *required* data components.

A scikit-learn_bench-compatible dataset should be stored in `data:cache_directory` (`${PWD}/data_cache` or `{repository root}/data_cache` by default).
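
As a sketch, assuming the layout described above and a hypothetical classification dataset named `my_dataset` with dense data, preparing a compatible cache entry could look like this:

```python
# Hypothetical preparation of a scikit-learn_bench-compatible cache entry;
# the dataset name, paths, and values are illustrative.
import json
import os

import numpy as np

cache_dir = "data_cache"
os.makedirs(cache_dir, exist_ok=True)

x = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Required data components: `x` (data) and `y` (labels), stored under `arr_0`
np.savez(os.path.join(cache_dir, "my_dataset_x.npz"), x)
np.savez(os.path.join(cache_dir, "my_dataset_y.npz"), y)

# Dataset info file; `n_classes` is required for classification datasets
info = {"n_classes": 2, "default_split": {"test_size": 0.2, "random_state": 11}}
with open(os.path.join(cache_dir, "my_dataset.json"), "w") as f:
    json.dump(info, f)
```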

You can specify the created dataset in config files by its name, the same way as datasets explicitly registered in scikit-learn_bench:
```json
{
    "data": {
        "dataset": "{dataset name}"
    }
}
```
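
For example, a config fragment combining the cache directory with the hypothetical `my_dataset` name might look as follows (a sketch assuming the default layout described above):

```json
{
    "data": {
        "cache_directory": "data_cache",
        "dataset": "my_dataset"
    }
}
```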

---
[Documentation tree](../../README.md#-documentation)
15 changes: 12 additions & 3 deletions sklbench/datasets/__init__.py
@@ -22,6 +22,7 @@
from ..utils.custom_types import BenchCase
from .loaders import (
    dataset_loading_functions,
    load_custom_data,
    load_openml_data,
    load_sklearn_synthetic_data,
)
@@ -47,9 +48,17 @@ def load_data(bench_case: BenchCase) -> Tuple[Dict, Dict]:
    dataset = get_bench_case_value(bench_case, "data:dataset")
    if dataset is not None:
        dataset_params = get_bench_case_value(bench_case, "data:dataset_kwargs", dict())
-        return dataset_loading_functions[dataset](
-            **common_kwargs, preproc_kwargs=preproc_kwargs, dataset_params=dataset_params
-        )
+        if dataset in dataset_loading_functions:
+            # registered dataset loading branch
+            return dataset_loading_functions[dataset](
+                **common_kwargs,
+                preproc_kwargs=preproc_kwargs,
+                dataset_params=dataset_params,
+            )
+        else:
+            # user-provided dataset loading branch
+            return load_custom_data(**common_kwargs, preproc_kwargs=preproc_kwargs)

    # load by source
    source = get_bench_case_value(bench_case, "data:source")
    if source is not None:
14 changes: 13 additions & 1 deletion sklbench/datasets/loaders.py
@@ -29,7 +29,7 @@
    make_regression,
)

- from .common import cache, preprocess
+ from .common import cache, load_data_description, load_data_from_cache, preprocess
from .downloaders import (
    download_and_read_csv,
    download_kaggle_files,
@@ -84,6 +84,18 @@ def load_sklearn_synthetic_data(
    return {"x": x, "y": y}, data_desc


@preprocess
def load_custom_data(
    data_name: str,
    data_cache: str,
    raw_data_cache: str,
):
    """Load data specified by the user and stored in a format compatible with the scikit-learn_bench cache"""
    return load_data_from_cache(data_cache, data_name), load_data_description(
        data_cache, data_name
    )


"""
Classification datasets
"""