Advanced usage of Databricks Connect

Note

This article covers Databricks Connect for Databricks Runtime 14.0 and above.

This article describes topics that go beyond the basic setup of Databricks Connect.

Configure the Spark Connect connection string

In addition to connecting to your cluster using the options outlined in Configure a connection to a cluster, a more advanced option is to connect using the Spark Connect connection string. You can pass the string to the remote function or set the SPARK_REMOTE environment variable.

Note

Only Databricks personal access token authentication is supported when connecting with the Spark Connect connection string.

Python

To set the connection string using the remote function:

from databricks.connect import DatabricksSession

workspace_instance_name = retrieve_workspace_instance_name()
token                   = retrieve_token()
cluster_id              = retrieve_cluster_id()

spark = DatabricksSession.builder.remote(
   f"sc://{workspace_instance_name}:443/;token={token};x-databricks-cluster-id={cluster_id}"
).getOrCreate()

Alternatively, set the SPARK_REMOTE environment variable:

sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>

Then initialize the DatabricksSession class:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

Scala

Set the SPARK_REMOTE environment variable:

sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>

Then initialize the DatabricksSession class:

import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder.getOrCreate()
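
Whichever language and method you use, a quick smoke test confirms that the session reaches the cluster. The following is a minimal sketch in Python, assuming the spark session created as shown above; any small query works:

# Runs a trivial query on the cluster and prints the rows locally.
spark.range(5).show()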

Use Spark Connect server with Databricks Connect

You can optionally run Databricks Connect against an open source Spark Connect server.

Important

Some features available in Databricks Runtime and Databricks Connect are exclusive to Databricks or not yet released in open source Apache Spark. If your code relies on these features, the following steps may fail with errors.

  1. Start a local Spark Connect server. See How to use Spark Connect.

  2. Configure Databricks Connect. Set the environment variable SPARK_REMOTE to point to your local Spark Connect server. See Connecting to Spark Connect using Clients.

    export SPARK_REMOTE="sc://localhost"
    
  3. Initialize the Databricks session:

    Python

    from databricks.connect import DatabricksSession
    
    spark = DatabricksSession.builder.getOrCreate()
    

    Scala

    import com.databricks.connect.DatabricksSession
    
    val spark = DatabricksSession.builder.getOrCreate()
    
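Because the local endpoint is an open source Spark Connect server, only standard Spark functionality is available there (see the Important note above). As a rough sketch, a pure DataFrame operation such as the following should behave the same against the local server as against a Databricks cluster:

# Assumes the Python spark session created in step 3.
df = spark.createDataFrame([("sales", 10), ("sales", 20), ("eng", 5)], ["dept", "amount"])
df.groupBy("dept").sum("amount").show()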

Additional HTTP headers

Databricks Connect communicates with Databricks clusters using gRPC over HTTP/2.

For better control over requests coming from clients, advanced users can install a proxy service between the client and the Azure Databricks cluster. In some cases, these proxies require custom headers in the HTTP requests.

Use the header() method to add custom headers to HTTP requests:

Python

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.header('x-custom-header', 'value').getOrCreate()

Scala

import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder.header("x-custom-header", "value").getOrCreate()
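
If a proxy requires more than one custom header, the header() method returns the builder, so the calls can be chained. The following Python sketch uses placeholder header names; substitute whatever your proxy expects:

from databricks.connect import DatabricksSession

# Placeholder header names and values for illustration only.
spark = (DatabricksSession.builder
  .header('x-custom-header', 'value')
  .header('x-another-header', 'another-value')
  .getOrCreate())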

Certificates

If your cluster relies on a custom SSL/TLS certificate to resolve an Azure Databricks workspace fully qualified domain name (FQDN), you must set the environment variable GRPC_DEFAULT_SSL_ROOTS_FILE_PATH on your local development machine. This environment variable must be set to the full path of the certificate installed on the cluster.

Python

The following example sets this environment variable:

import os

os.environ["GRPC_DEFAULT_SSL_ROOTS_FILE_PATH"] = "/etc/ssl/certs/ca-bundle.crt"

For other ways to set environment variables, see your operating system's documentation.

Scala

Java and Scala do not offer ways to configure environment variables programmatically. Refer to your operating system or IDE documentation for information on how to configure them as part of your application.

Logging and debug logs

Python

Databricks Connect for Python produces logs using standard Python logging.

Logs are emitted to the standard error stream (stderr) and are turned off by default. Setting the environment variable SPARK_CONNECT_LOG_LEVEL=debug changes this default and prints all log messages at the DEBUG level and higher.
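
For example, the following sketch enables debug logging from Python code. It assumes the variable is not already set in your shell and sets it before databricks.connect is imported so that the client logger picks it up:

import os

# Assumption: set the level before importing databricks.connect so the client logger sees it.
os.environ["SPARK_CONNECT_LOG_LEVEL"] = "debug"

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()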

Scala

Databricks Connect for Scala uses SLF4J logging, and does not ship with any SLF4J providers.

Applications that use Databricks Connect are expected to include an SLF4J provider and, where needed, configure it to print log messages.

  • The simplest option is to include the slf4j-simple provider which prints log messages at the INFO level and higher to the standard error stream (stderr).
  • A more configurable alternative is to use the slf4j-reload4j provider which picks up configuration from a log4j.properties file in the classpath.

The following example shows a simple log4j.properties file.

log4j.rootLogger=INFO,stderr

log4j.appender.stderr=org.apache.log4j.ConsoleAppender
log4j.appender.stderr.Target=System.err
log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
log4j.appender.stderr.layout.ConversionPattern=%p\t%d{ISO8601}\t%r\t%c\t[%t]\t%m%n

In the preceding example, debug logs are printed if the root logger (or a specific logger) is configured at the DEBUG level:

log4j.rootLogger=DEBUG,stderr