Skip to main content
  1. Blog
  2. Article

Rob Gibbon
on 28 May 2026


The purpose of this guide is to highlight the key differences between Apache Spark 3 and Spark 4, and provide advice on how to plan a migration. Let’s get started.

The biggest changes

Let’s talk about the biggest changes between Apache Spark 3.x and Spark 4.

Scala 2.12 no more

First up, there’s no support for Scala 2.12 in Spark 4, so any jobs you have that are using that version will need to be recompiled against Scala 2.13. This is generally straightforward, but note that all the dependencies of your job must also be recompiled against Scala 2.13. Depending on how many dependencies your job has, this might be a large project. 

Scala 2.13 also overhauled the `collections` library with a major rewrite of its API. Given the collections library is a foundational library, there’s a good chance you might also need to make code changes. Fortunately, there are free and open source tools that can help you to migrate your codebases, like Scalafix. You can add Scalafix as a dependency of scala build tool (SBT) and enable the “Collection213Upgrade” rule:

sbt> scalafixEnable
sbt> scalafixAll dependency:[email protected]:scala-collection-migrations:2.13.0

Java 17 and 21

Spark 4 discontinues support for old versions of the Java runtime below version 17. Java 17 is now the intended default, although Java 21 is also supported. If you are running Spark jobs or plugins that require an older version of Java, this might break things for you as there have been some changes – for example the Java XML and servlet libraries – that might require code to be adapted.

Promising developments

Some of the changes are quite promising. There are some great new capabilities – but also some changes to default behaviour. Let’s tackle the defaults first.

New defaults

ANSI SQL mode by default: in Spark 3, invalid operations often returned NULL, but in Spark 4, they’ll throw runtime exceptions (e.g., division by zero or invalid type casts) to ensure data quality. This could be a breaking change to your queries and pipelines, so you’ll need to test carefully.

Structured JSON logging: in Spark 4, logs are written as structured JSON by default, which makes them easier to process using logging infrastructure like Loki. But note that this change could break existing monitoring and alerting, so again test carefully.

RocksDB: the backend for shuffle and state management in structured streaming now defaults to RocksDB, which can significantly reduce JVM heap pressure and result in fewer garbage collection stalls.

New capabilities

New “VARIANT” data type: designed for semi-structured data like JSON, the new data type delivers significantly faster queries than traditional string-based JSON parsing.

SQL pipe syntax: it’s now possible to chain SQL operations using the |> operator.

Procedural SQL and scripting: Spark 4 adds support for multi-statement SQL scripts with local variables and control flow (IF/WHILE logic), which means you can likely write an entire ETL pipeline in pure SQL.

Improved genAI capabilities: full support for vector data types and optimized batch inference for LLM-driven workloads.

Migration strategy

So now we understand the big changes, let’s give some thought to how to plan and execute a migration from Spark 3 to Spark 4.

Foundational changes

Before you start to change your Spark code, it’s best to make sure that your underlying infrastructure can support the new requirements.

  • You’ll need to migrate your runtime environment to Java 17 or Java 21. If you’re using Canonical’s solution for Apache Spark on Kubernetes, we’ve already taken care of this for you.
  • If you’re using Scala, you’ll need to migrate all code and internal libraries to Scala 2.13.
  • Make sure all third-party connectors and plugins (e.g. Spark-RAPIDS, Kafka) have published versions for Spark 4/Scala 2.13.
  • Replace any javax.* imports in your Java code with jakarta.* for servlet and XML compatibility.

Runtime validation

In some ways, the biggest breaking change in Spark 4 is likely going to be that ANSI SQL mode is enabled by default.

  • Set spark.sql.ansi.enabled=true in your current Spark 3 environment. This will surface any division-by-zero or numeric overflow errors that would have previously been hidden by silent NULL values.
  • Replace risky operations with “safe” alternatives. For example, you could use try_cast instead of cast to maintain the old behavior of returning NULL, in order to preserve the old behaviour.

Enabler modernization

The next stage will be to validate that your workloads continue to run smoothly with the new enablers in Spark 4.

  • If using Structured Streaming, test your jobs with RocksDB, which is now the default state store.
  • Check your CREATE TABLE statements. Without an explicit USING clause, Spark 4 will default to the provider in spark.sql.sources.default instead of Hive.

Rollout plan

Our suggestion is to start with migrating your batch workloads first, followed by streaming workloads, and move analytical/decision support workloads last. Move each workload type incrementally. Carefully validate results, inspect logs, and check for discrepancies.

Conclusion

I hope this guide helps you with your migration. As with many of these things, preparation is key. If you’d like to learn more about Canonical’s Charmed solution for Apache Spark, feel free to get in touch via the contact form.

Further reading

Related posts


Rob Gibbon
15 October 2024

Apache Spark 4.0 beta release – try it now

Data Platform Article

Apache Spark is a popular framework for developing distributed, parallel data processing applications. Our solution for Apache Spark on Kubernetes has made significant progress in the past year since we launched, adding support for Apache Iceberg, a new GPU accelerated image using the NVIDIA Spark-RAPIDS plugin, and support for the Volcan ...


Giulia Lanzafame
4 September 2025

Implement an enterprise-ready data lakehouse architecture with Spark and Kyuubi 

Data Platform Article

Here at Canonical we are excited to announce that we have shipped the first release of our solution for enterprise-ready data lakehouses, built on the combination of Apache Spark and Apache Kyuubi. Using our Charmed Apache Kyuubi in integration with Spark, you can deliver a robust, production-level, and open source data lakehouse . Our Ap ...


Giulia Lanzafame
26 June 2025

Accelerating data science with Apache Spark and GPUs

Data Platform Article

Apache Spark has always been very well known for distributing computation among multiple nodes using the assistance of partitions, and CPU cores have always performed processing within a single partition.  What’s less widely known is that it is possible to accelerate Spark with GPUs. Harnessing this power in the right situation brings imm ...