We are currently migrating from Spark 2.4 to Spark 3.5 (and Dataproc 1 to 2), and our workflows are failing with the following error:

Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Error while reading data, error message: Schema mismatch: referenced variable 'USER_DETAILS.array.USER_URL' has array levels of 1, while the corresponding field path to Parquet column has 0 repeated fields

The workflow writes to an existing BigQuery table, and everything was working fine before the upgrade. The schema of the failing column is:

STRUCT<bag ARRAY<STRUCT<`array` STRUCT<USER_KEY STRING, USER_NAME STRING, USER_URL STRING>>>>

We have also set spark.sql.parquet.writeLegacyFormat to true so that the write uses the legacy Parquet layout (the column above was created with legacy Spark, which is why it contains the bag and array wrappers; for more details, you can check this piece of code).

For writing, we use the indirect write method with Parquet as the intermediate format, and we are using the library spark-bigquery-with-dependencies_2.12-0.42.4.jar (previously spark-bigquery-with-dependencies_2.11-0.27.1.jar).
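For reference, the write setup described above can be sketched as follows. This is a hedged reconstruction, not the poster's actual job: the helper function, bucket name, and table name are placeholders I introduced for illustration; only the option keys (writeMethod, intermediateFormat, temporaryGcsBucket) come from the spark-bigquery-connector.

```python
# Hedged sketch of the indirect Parquet write path described above.
# The helper name, bucket, and table are placeholders, not from the post.

def bq_indirect_write_options(staging_bucket: str) -> dict:
    """Assemble spark-bigquery-connector options for an indirect write
    that stages Parquet files in GCS and runs a BigQuery load job."""
    return {
        "writeMethod": "indirect",           # stage in GCS, then load job
        "intermediateFormat": "parquet",     # Parquet staging files
        "temporaryGcsBucket": staging_bucket,
    }

# Usage (requires a live Spark session and the connector jar):
#   spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
#   (df.write.format("bigquery")
#      .options(**bq_indirect_write_options("my-staging-bucket"))
#      .mode("append")
#      .save("my_project.my_dataset.my_table"))
```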

I have been trying to find a solution but haven't found anything. Can someone please help?

1 Answer

Can you try adding the parameter below when writing the DataFrame to BigQuery?

.option("allowFieldAddition", "true")

allowFieldAddition — Adds the ALLOW_FIELD_ADDITION SchemaUpdateOption to the BigQuery LoadJob. Allowed values are true and false. (Optional. Defaults to false.) Supported only by the `INDIRECT` write method.
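As a sketch of how this option would slot into the indirect write, here is a small option-building helper. The function name, bucket, and table are hypothetical placeholders; only the option keys and the true/false values come from the connector's documented options.

```python
# Hedged sketch: build the indirect-write options, optionally adding
# allowFieldAddition as suggested above. Names are placeholders.

def bq_write_options(staging_bucket: str, allow_field_addition: bool = False) -> dict:
    """Connector options for an indirect Parquet write to BigQuery."""
    opts = {
        "writeMethod": "indirect",
        "intermediateFormat": "parquet",
        "temporaryGcsBucket": staging_bucket,
    }
    if allow_field_addition:
        # Adds ALLOW_FIELD_ADDITION to the load job's schema update
        # options; supported only by the indirect write method.
        opts["allowFieldAddition"] = "true"
    return opts

# Usage (requires a live Spark session and the connector jar):
#   (df.write.format("bigquery")
#      .options(**bq_write_options("my-staging-bucket", allow_field_addition=True))
#      .mode("append")
#      .save("my_project.my_dataset.my_table"))
```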

Is it possible to share the BigQuery table schema or a small working example that demonstrates the issue?
