
[Go] Schema inference on RecordFromJSON and TableFromJSON functions #30

Open
agchang opened this issue Jul 17, 2024 · 2 comments
Labels
Type: enhancement New feature or request

Comments

@agchang
Contributor

agchang commented Jul 17, 2024

Describe the enhancement requested

I am interested in support for schema inference in the RecordFromJSON and TableFromJSON functions, as these currently require an arrow.Schema up front. I can try to contribute this if people think it makes sense. I noticed that for CSV there is NewInferringReader, which infers column types from the first row.

Component(s)

Go

@agchang agchang added the Type: enhancement New feature or request label Jul 17, 2024
@agchang agchang changed the title Schema inference on FromJSON functions [Go] Schema inference on FromJSON functions Jul 17, 2024
@agchang agchang changed the title [Go] Schema inference on FromJSON functions [Go] Schema inference on RecordFromJSON and TableFromJSON functions Jul 17, 2024
@joellubi
Member

Hi @agchang, thanks for opening this issue. I think this feature very well may make sense to implement, and we would welcome your contribution if you decide to do so!

I'll write down a few of my thoughts, since something like this will generally involve some tradeoffs:

Unlike in CSV where changing the number of columns between rows is invalid, JSON allows changes to the "schema" element-by-element. This can mean adding/removing a field between rows or even having entirely disjoint sets of fields.

  • A simple approach may be to set the schema using the fields from the first row. Fields that are missing in subsequent rows can be set to null; fields that appear later can be ignored.
  • A more robust but more complicated approach would be to grow the schema row-by-row, as new fields are encountered. The resulting schema would be the union of all fields encountered across the rows.
    • This may be impractical in a single pass over the JSON, as it would require instantiating and backfilling arrays every time a new field is encountered, and it wouldn't work at all when writing batches.
    • Alternatively, two passes can be taken. The first will build up a list of all fields present across all rows with their inferred types. The second pass can use this to set a fixed schema and simply reuse the existing *FromJSON() functions.

If we want to go with the latter approach, my recommendation would be to focus on a dedicated implementation of the "first-pass" which infers an Arrow schema from JSON. We can then just use the output of this function as input to the existing ones:

package main

import (
	"io"
	"log"
	"strings"

	"github.com/apache/arrow/go/v17/arrow"
	"github.com/apache/arrow/go/v17/arrow/array"
	"github.com/apache/arrow/go/v17/arrow/memory"
)

// InferSchemaFromJSON is the proposed first pass. This needs to be implemented.
func InferSchemaFromJSON(r io.Reader) (*arrow.Schema, error) { ... }

func main() {
	jsonBlob := `{ ... }`

	schema, err := InferSchemaFromJSON(strings.NewReader(jsonBlob))
	if err != nil {
		log.Fatal(err)
	}

	table, err := array.TableFromJSON(memory.DefaultAllocator, schema, []string{jsonBlob})
	if err != nil {
		log.Fatal(err)
	}
	defer table.Release()

	// do table stuff
}

@assignUser assignUser transferred this issue from apache/arrow Aug 30, 2024
@loicalleyne
Contributor

@agchang I made bodkin to address the schema generation issue.

3 participants