TypeError: Cannot recognize type <class 'NoneType'> error when calling Datachain.from_json() for a file produced by from_storage().to_json() chain #724

Closed
AlamHasabie opened this issue Dec 20, 2024 · 4 comments
Assignees
Labels
bug Something isn't working priority-p1

AlamHasabie commented Dec 20, 2024

Description

Background

I'm following the tutorial to produce the JSON representation of a dataset. I can generate the JSON file, but I cannot load it back as a dataset using the from_json method.

Setup

  1. Create a folder containing some random files
  2. Use the from_storage method and export the dataset using to_json:
from datachain import DataChain
dataset_chain = DataChain.from_storage(
    LOCAL_DATASET_URI,
    update=True
)
dataset_chain.to_json(JSON_PATH)
  3. Now load the JSON file using from_json, which triggers the error:
dataset = DataChain.from_json(JSON_PATH, type="text")

Expected Behavior

I expect the JSON file to be deserialized correctly into a DataChain instance, so that I can use chain functions on it, e.g. export_files() or filter().

Error

Traceback (most recent call last):
  File "~/Documents/datachain/test_load_json.py", line 5, in <module>
    dataset = DataChain.from_json(json_path, type="text")
  File "~/miniconda3/envs/datachain/lib/python3.10/site-packages/datachain/lib/dc.py", line 610, in from_json
    return chain.gen(**signal_dict)  # type: ignore[misc, arg-type]
  File "~/miniconda3/envs/datachain/lib/python3.10/site-packages/datachain/lib/dc.py", line 960, in gen
    udf_obj.to_udf_wrapper(),
  File "~/miniconda3/envs/datachain/lib/python3.10/site-packages/datachain/lib/udf.py", line 218, in to_udf_wrapper
    self.output.to_udf_spec(),
  File "~/miniconda3/envs/datachain/lib/python3.10/site-packages/datachain/lib/signal_schema.py", line 335, in to_udf_spec
    res[db_name] = python_to_sql(type_)
  File "~/miniconda3/envs/datachain/lib/python3.10/site-packages/datachain/lib/convert/python_to_sql.py", line 82, in python_to_sql
    raise TypeError(f"Cannot recognize type {typ}")
TypeError: Cannot recognize type <class 'NoneType'>

Here's the JSON output (the path is partially redacted):

[
{"file":{"source":"file:///~/Documents/datachain/sample_dataset","path":"d.txt","size":0,"version":"","etag":"0x1.9d8febc69af0fp+30","is_latest":1,"last_modified":"2024-12-19T10:52:33.651310+00:00","location":null}},
{"file":{"source":"file:///~/Documents/datachain/sample_dataset","path":"a.json","size":382,"version":"","etag":"0x1.9d8feaf0f9101p+30","is_latest":1,"last_modified":"2024-12-19T10:51:40.243225+00:00","location":null}},
{"file":{"source":"file:///~/Documents/datachain/sample_dataset","path":"b.txt","size":461,"version":"","etag":"0x1.9d8feb918637fp+30","is_latest":1,"last_modified":"2024-12-19T10:52:20.381073+00:00","location":null}},
{"file":{"source":"file:///~/Documents/datachain/sample_dataset","path":"c.txt","size":12,"version":"","etag":"0x1.9d8febb8a8f28p+30","is_latest":1,"last_modified":"2024-12-19T10:52:30.164988+00:00","location":null}}
]

Potential Causes

The location:null value in the JSON might be causing the schema inference to fail, but I'm not familiar with the codebase, so I cannot confirm.
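
To illustrate the suspected cause: a schema inference that looks only at observed values has nothing to go on when every value of a field is null. The sketch below is a toy, standard-library illustration of that idea. `infer_field_types` is a hypothetical helper and not DataChain's actual inference code, and the sample records are trimmed versions of the "file" objects above.

```python
import json


def infer_field_types(records):
    """Map each field name to the set of Python types observed for it."""
    observed = {}
    for record in records:
        for name, value in record.items():
            observed.setdefault(name, set()).add(type(value))
    return observed


# Two records shaped like the "file" objects above (fields trimmed for brevity).
sample = json.loads(
    '[{"path": "d.txt", "size": 0, "location": null},'
    ' {"path": "a.json", "size": 382, "location": null}]'
)

types = infer_field_types(sample)
# "location" was only ever seen as None, so NoneType is the only candidate,
# and NoneType has no obvious database column type.
print(types["location"])
```

Under this assumption, a type-to-SQL conversion step would receive NoneType for `location`, which matches the type named in the error message.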

Environment and Versions

python==3.10.15
datachain==0.7.11
OS : Ubuntu 24.04 LTS

Version Info

0.7.11
Python 3.10.15
AlamHasabie added the bug label Dec 20, 2024

AlamHasabie (Author) commented

I tried filling in the values of file.location with a random string, and the JSON then loads correctly. However, calling export_files produces sqlite3.OperationalError: no such column: file__path

shcheklein (Member) commented Dec 21, 2024

Issues found so far; some of them will become independent tickets:

  • When an exception occurs, we remove successfully cached listings. We should keep them; there is no reason to remove them on cleanup. See: fix(session): keep cached successful listings #728
  • It is not convenient to set up output in parse_tabular, and it is not easy to give a hint / model for a specific part of the JSON.

shcheklein (Member) commented Dec 21, 2024

Workaround for a simple list of files:

from datachain import DataChain, File

dataset_chain = DataChain.from_storage(".github", update=True)
dataset_chain.to_json("output.json", include_outer_list=False)

dataset = DataChain.from_storage("output.json").parse_tabular(format="json", output={"file": File})
dataset.show(3)

or two alternatives with from_json:

from datachain import DataChain, File

dataset_chain = DataChain.from_storage(".github", update=True)
dataset_chain.to_json("output.json", include_outer_list=False)

dataset = DataChain.from_json("output.json", format="jsonl", jmespath="file", spec=File, object_name="file")
dataset.show(3)

from datachain import DataChain, File

dataset_chain = DataChain.from_storage(".github", update=True)
dataset_chain.to_json("output.json")

dataset = DataChain.from_json("output.json",  jmespath="[].file", spec=File, object_name="file")
dataset.show(3)

No matter which approach we use, we have to specify File as the model, so that it can "understand" the location type as Optional[something]. The type can't be inferred automatically when all the values are null, and this is expected.
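
The point about Optional can be sketched with the standard library alone. In the example below, `FileLike` is a hypothetical, trimmed stand-in for File (it is not the real datachain class, and the field types are assumptions). Because the model declares `location` as Optional, its type is known even when every record carries null.

```python
from dataclasses import dataclass
from typing import Optional, get_type_hints


@dataclass
class FileLike:
    # Hypothetical, trimmed stand-in for datachain's File model.
    path: str
    size: int
    location: Optional[dict] = None  # assumed type, for illustration only


rows = [
    {"path": "d.txt", "size": 0, "location": None},
    {"path": "a.json", "size": 382, "location": None},
]

# The declared model supplies the type; the null values only supply the data.
files = [FileLike(**row) for row in rows]
hints = get_type_hints(FileLike)
print(hints["location"])
```

Here the type information comes from the declaration rather than from the data, which is why passing an explicit model sidesteps the all-null case.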

shcheklein (Member) commented

Closing for now, since there is a workaround. We need to get back to the item left when we review from_json / from_tabular unification.
