We are migrating from Spark 2.4 to Spark 3.5 (and from Dataproc 1 to 2), and our workflows are failing with the following error:
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Error while reading data, error message: Schema mismatch: referenced variable 'USER_DETAILS.array.USER_URL' has array levels of 1, while the corresponding field path to Parquet column has 0 repeated fields
The workflow writes to an existing BigQuery table, and everything worked fine before the upgrade. The schema for the column is:
STRUCT<bag ARRAY<STRUCT<`array` STRUCT<USER_KEY STRING, USER_NAME STRING, USER_URL STRING>>>>
We have also set spark.sql.parquet.writeLegacyFormat to true so that the write uses the legacy Parquet layout. The column above was created with the legacy Spark version, which is why it contains the bag and array levels (for more details, you can check this piece of code).
For the write, we use the indirect write method with Parquet as the intermediate format. The connector library is spark-bigquery-with-dependencies_2.12-0.42.4.jar (previously spark-bigquery-with-dependencies_2.11-0.27.1.jar).
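For reference, the write setup described above looks roughly like the sketch below. This is a simplified illustration rather than our exact job: the helper names, the table name, the bucket name, and the "append" save mode are placeholders and assumptions. Only the legacy Parquet flag, the indirect write method, and the Parquet intermediate format come from our actual configuration.

```python
# Minimal sketch of the write path described above (PySpark).
# Function names, the table, the bucket, and the "append" mode are
# hypothetical placeholders, not the real job's values.

LEGACY_PARQUET_CONF = "spark.sql.parquet.writeLegacyFormat"


def bigquery_indirect_options(temporary_gcs_bucket):
    """Connector options for an indirect write staged as Parquet files."""
    return {
        "writeMethod": "indirect",
        "intermediateFormat": "parquet",
        "temporaryGcsBucket": temporary_gcs_bucket,
    }


def write_to_bigquery(spark, df, table, temporary_gcs_bucket):
    """Write df to an existing BigQuery table via intermediate Parquet files."""
    # Ask Spark to emit the legacy Parquet layout for nested types.
    spark.conf.set(LEGACY_PARQUET_CONF, "true")

    writer = df.write.format("bigquery").mode("append")  # mode is an assumption
    for key, value in bigquery_indirect_options(temporary_gcs_bucket).items():
        writer = writer.option(key, value)
    writer.save(table)


# Example call (placeholder names):
# write_to_bigquery(spark, df, "my_dataset.my_table", "my-temp-bucket")
```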
I have been looking for a solution but have not found anything. Can someone please help?