
[Go] Schema inference on RecordFromJSON and TableFromJSON functions #30

Open
agchang opened this issue Jul 17, 2024 · 2 comments
Labels
Type: enhancement New feature or request

Comments

@agchang
Contributor

agchang commented Jul 17, 2024

Describe the enhancement requested

I am interested in support for schema inference in the RecordFromJSON and TableFromJSON functions, as these currently require an arrow.Schema up front. I can try to contribute this if people think it makes sense. I noticed that for CSV there is NewInferringReader, which infers column types from the first row.

Component(s)

Go

@agchang agchang added the Type: enhancement New feature or request label Jul 17, 2024
@agchang agchang changed the title Schema inference on FromJSON functions [Go] Schema inference on FromJSON functions Jul 17, 2024
@agchang agchang changed the title [Go] Schema inference on FromJSON functions [Go] Schema inference on RecordFromJSON and TableFromJSON functions Jul 17, 2024
@joellubi
Member

Hi @agchang, thanks for opening this issue. I think this feature very well may make sense to implement, and we would welcome your contribution if you decide to do so!

I'll write down a few of my thoughts, since something like this will generally involve some tradeoffs:

Unlike in CSV where changing the number of columns between rows is invalid, JSON allows changes to the "schema" element-by-element. This can mean adding/removing a field between rows or even having entirely disjoint sets of fields.

  • A simple approach may be to set the schema using the fields from the first row. Fields that are missing in subsequent rows can be set to null; fields that appear later can be ignored.
  • A more robust but more complicated approach would be to grow the schema row-by-row, as new fields are encountered. The resulting schema would be the union of all fields encountered across the rows.
    • This may be impractical in a single pass over the JSON, as it would require instantiating and backfilling arrays every time a new field is encountered, and it wouldn't work at all when writing batches.
    • Alternatively, two passes can be taken. The first will build up a list of all fields present across all rows with their inferred types. The second pass can use this to set a fixed schema and simply reuse the existing *FromJSON() functions.

If we want to go with the latter approach, my recommendation would be to focus on a dedicated implementation of the "first-pass" which infers an Arrow schema from JSON. We can then just use the output of this function as input to the existing ones:

package main

import (
	"io"
	"log"
	"strings"

	"github.com/apache/arrow/go/v17/arrow"
	"github.com/apache/arrow/go/v17/arrow/array"
	"github.com/apache/arrow/go/v17/arrow/memory"
)

// InferSchemaFromJSON is the proposed first pass. This needs to be implemented.
func InferSchemaFromJSON(r io.Reader) (*arrow.Schema, error) { ... }

func main() {
	jsonBlob := `{ ... }`

	schema, err := InferSchemaFromJSON(strings.NewReader(jsonBlob))
	if err != nil {
		log.Fatal(err)
	}

	table, err := array.TableFromJSON(memory.DefaultAllocator, schema, []string{jsonBlob})
	if err != nil {
		log.Fatal(err)
	}
	defer table.Release()

	// do table stuff
}

@assignUser assignUser transferred this issue from apache/arrow Aug 30, 2024
@loicalleyne
Contributor

@agchang I made bodkin to address the schema generation issue.

3 participants