Spark connector for SQL databases

The Spark connector for SQL databases is a high-performance library that lets you read from and write to SQL Server, Azure SQL databases, and Fabric SQL databases. The connector offers the following capabilities:

  • Use Spark to run large write and read operations on Azure SQL Database, Azure SQL Managed Instance, SQL Server on Azure VM, and Fabric SQL databases.
  • When you use a table or a view, the connector supports security models set at the SQL engine level. These models include object-level security (OLS), row-level security (RLS), and column-level security (CLS).

The connector is preinstalled in the Fabric runtime, so you don't need to install it separately.

Authentication

Microsoft Entra authentication is integrated with Microsoft Fabric.

  • When you sign in to the Fabric workspace, your credentials are automatically passed to the SQL engine for authentication and authorization.
  • Requires Microsoft Entra ID to be enabled and configured on your SQL database engine.
  • No extra configuration is needed in your Spark code if Microsoft Entra ID is set up. The credentials are automatically mapped.

You can also use the SQL authentication method (by specifying a SQL username and password) or a service principal (by providing an Azure access token for app-based authentication).

Permissions

To use the Spark connector, your identity—whether it's a user or an app—must have the necessary database permissions for the target SQL engine. These permissions are required to read from or write to tables and views.

For Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Azure VM:

  • The identity running the operation typically needs the db_datareader and db_datawriter database roles, and optionally db_owner for full control.

For Fabric SQL databases:

  • The identity typically needs the db_datareader and db_datawriter database roles, and optionally db_owner.
  • The identity also needs at least read permission on the Fabric SQL database at the item level.

Note

If you use a service principal, it can run as an app (no user context) or as a user if user impersonation is enabled. The service principal must have the required database permissions for the operations you want to perform.

Usage and code examples

This section provides code examples that show how to use the Spark connector for SQL databases. The examples cover common scenarios, including reading from and writing to SQL tables and configuring connector options.

Supported options

The only required option is url, in the form "jdbc:sqlserver://<server>:<port>;database=<database>;". Alternatively, you can set the Spark configuration spark.mssql.connector.default.url.

  • When the url option is provided:

    • The connector always uses the url value first.
    • If spark.mssql.connector.default.url isn't set, the connector sets it to this value and reuses it for later operations.
  • When the url option isn't provided:

    • If spark.mssql.connector.default.url is set, the connector uses the value from the Spark configuration, as shown in the sketch after this list.
    • If spark.mssql.connector.default.url isn't set, an error is thrown because the required connection details aren't available.
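
The following sketch sets spark.mssql.connector.default.url once in the Spark configuration so that later reads and writes can omit the url option. The server and database values are placeholders, and spark refers to the Spark session that's already available in a Fabric notebook.

# Set the default connection url once; later reads and writes can omit .option("url", ...).
spark.conf.set(
    "spark.mssql.connector.default.url",
    "jdbc:sqlserver://<server>:<port>;database=<database>;"
)

# With the default url in place, the table name is all the connector needs.
spark.read.mssql("dbo.publicExample").show()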

This connector supports the options defined here: SQL DataSource JDBC Options

The connector also supports the following options:

Option | Default value | Description
reliabilityLevel | "BEST_EFFORT" | Controls the reliability of insert operations. Possible values: BEST_EFFORT (default, fastest, might result in duplicate rows if an executor restarts), NO_DUPLICATES (slower, ensures no duplicate rows are inserted even if an executor restarts). Choose based on your tolerance for duplicates and performance needs.
isolationLevel | "READ_COMMITTED" | Sets the transaction isolation level for SQL operations. Possible values: READ_COMMITTED (default, prevents reading uncommitted data), READ_UNCOMMITTED, REPEATABLE_READ, SNAPSHOT, SERIALIZABLE. Higher isolation levels can reduce concurrency but improve data consistency.
tableLock | "false" | Controls whether the SQL Server TABLOCK table-level lock hint is used during insert operations. Possible values: true (enables TABLOCK, which can improve bulk write performance), false (default, doesn't use TABLOCK). Setting to true might increase throughput for large inserts but can reduce concurrency for other operations on the table.
schemaCheckEnabled | "true" | Controls whether strict schema validation is enforced between your Spark DataFrame and the SQL table. Possible values: true (default, enforces strict schema matching), false (allows more flexibility and might skip some schema checks). Setting to false can help with schema mismatches but might lead to unexpected results if the structures differ significantly.

Other Bulk API options can be set as options on the DataFrame and are passed to bulk copy APIs on write.
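
The following sketch combines several of these options in a single write. The url and table name are placeholders, the option values are illustrative choices rather than recommendations, and spark is the Spark session available in a Fabric notebook.

# Create a small DataFrame for illustration, then write it with several connector options set.
df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["Name", "Age"])

(
    df.write
        .mode("overwrite")
        .option("url", "jdbc:sqlserver://<server>:<port>;database=<database>;")
        .option("reliabilityLevel", "NO_DUPLICATES")   # slower, but no duplicate rows if an executor restarts
        .option("tableLock", "true")                   # TABLOCK hint can speed up bulk inserts
        .option("schemaCheckEnabled", "false")         # relax strict schema validation between DataFrame and table
        .mssql("dbo.publicExample")
)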

Write and read example

The following code shows how to write and read data by using the mssql("<schema>.<table>") method with automatic Microsoft Entra ID authentication.

Tip

Data is created inline for demonstration purposes. In a production scenario, you would typically read data from an existing source or create a more complex DataFrame.

# The connector is preinstalled in the Fabric runtime, so no extra import is needed.
url = "jdbc:sqlserver://<server>:<port>;database=<database>;"

# Create a small DataFrame inline for demonstration purposes.
row_data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
column_header = ["Name", "Age"]
df = spark.createDataFrame(row_data, column_header)

# Write to and read from dbo.publicExample; Microsoft Entra ID authentication is applied automatically.
df.write.mode("overwrite").option("url", url).mssql("dbo.publicExample")
spark.read.option("url", url).mssql("dbo.publicExample").show()

url = "jdbc:sqlserver://<server>:<port>;database=<database2>;"  # different database
df.write.mode("overwrite").option("url", url).mssql("dbo.tableInDatabase2")  # the default url is updated to database2
spark.read.mssql("dbo.tableInDatabase2").show()  # no url option specified, so the default url (database2) is used

You can also select columns, apply filters, and use other options when you read data from the SQL database engine.
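
For example, the following sketch reads only two columns and applies a filter on the table written above; Spark can push the column selection and filter down to the SQL engine. The url value is a placeholder.

# Read selected columns with a filter; the projection and predicate can be pushed down to SQL.
url = "jdbc:sqlserver://<server>:<port>;database=<database>;"
(
    spark.read
        .option("url", url)
        .mssql("dbo.publicExample")
        .select("Name", "Age")
        .filter("Age > 1")
        .show()
)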

Authentication examples

The following examples show how to use authentication methods other than Microsoft Entra ID, such as service principal (access token) and SQL authentication.

Note

As mentioned earlier, Microsoft Entra ID authentication is handled automatically when you sign in to the Fabric workspace, so you only need to use these methods if your scenario requires them.

# The connector is preinstalled in the Fabric runtime, so no extra import is needed.
url = "jdbc:sqlserver://<server>:<port>;database=<database>;"
row_data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
column_header = ["Name", "Age"]
df = spark.createDataFrame(row_data, column_header)

# Acquire a Microsoft Entra access token for a service principal (app-based authentication).
from azure.identity import ClientSecretCredential
credential = ClientSecretCredential(tenant_id="", client_id="", client_secret="")  # service principal app
scope = "https://database.windows.net/.default"
token = credential.get_token(scope).token

# Pass the access token to the connector for both write and read operations.
df.write.mode("overwrite").option("url", url).option("accesstoken", token).mssql("dbo.publicExample")
spark.read.option("accesstoken", token).mssql("dbo.publicExample").show()

Supported DataFrame save modes

When you write data from Spark to SQL databases, you can choose from several save modes. Save modes control how data is written when the destination table already exists, and can affect schema, data, and indexing. Understanding these modes helps you avoid unexpected data loss or changes.

This connector supports the options defined here: Spark Save functions

  • ErrorIfExists (default save mode): If the destination table exists, the write is aborted and an exception is thrown. Otherwise, a new table is created with data.

  • Ignore: If the destination table exists, the write ignores the request and doesn't return an error. Otherwise, a new table is created with data.

  • Overwrite: If the destination table exists, the table is dropped, recreated, and new data is appended.

    Note

    When you use overwrite, the original table schema (especially MSSQL-exclusive data types) and table indices are lost and replaced by the schema inferred from your Spark DataFrame. To avoid losing the schema and indices, use .option("truncate", "true") instead of overwrite, as shown in the sketch after this list.

  • Append: If the destination table exists, new data is appended to it. Otherwise, a new table is created with data.
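
For example, the following sketch appends to an existing table, and then overwrites its data while preserving the table definition by truncating instead of dropping, as described in the note above. It reuses the df and url variables from the earlier write and read example.

# Append new rows to an existing table.
df.write.mode("append").option("url", url).mssql("dbo.publicExample")

# Overwrite the data but keep the existing table schema and indices:
# truncate the table instead of dropping and recreating it.
df.write.mode("overwrite") \
    .option("url", url) \
    .option("truncate", "true") \
    .mssql("dbo.publicExample")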

Troubleshoot

When the process finishes, the output of your Spark read operation appears in the cell's output area. Errors from com.microsoft.sqlserver.jdbc.SQLServerException come directly from SQL Server. You can find detailed error information in the Spark application logs.
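
If you need to inspect a failure programmatically, the following sketch catches the error from a read and prints its message. The exact Python exception type can vary by Spark version, so it catches the generic Exception; dbo.missingTable is a hypothetical table name used only to trigger an error, and url is reused from the earlier examples.

# Print the underlying SQL Server error text when a read fails.
try:
    spark.read.option("url", url).mssql("dbo.missingTable").show()
except Exception as e:
    # JVM-side errors, including com.microsoft.sqlserver.jdbc.SQLServerException,
    # surface through the Python exception message.
    print(str(e))
    raise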