Important
This documentation has been retired and might not be updated.
Databricks recommends that you use Databricks Asset Bundles instead of dbx by Databricks Labs. See What are Databricks Asset Bundles? and Migrate from dbx to bundles.
To use Azure Databricks with Visual Studio Code, see the article Databricks extension for Visual Studio Code.
This article describes a Python-based code sample that you can work with in any Python-compatible IDE. Specifically, this article describes how to work with this code sample in Visual Studio Code, which provides the following developer productivity features:
- Code completion
- Linting
- Testing
- Debugging code objects that do not require a real-time connection to remote Azure Databricks resources.
This article uses dbx by Databricks Labs along with Visual Studio Code to submit the code sample to a remote Azure Databricks workspace. dbx instructs Azure Databricks to use Lakeflow Jobs to run the submitted code on an Azure Databricks jobs cluster in that workspace.
You can use popular third-party Git providers for version control and continuous integration and continuous delivery or continuous deployment (CI/CD) of your code. For version control, these Git providers include the following:
- GitHub
- Bitbucket
- GitLab
- Azure DevOps (not available in Azure China regions)
- AWS CodeCommit
- GitHub AE
For CI/CD, dbx supports several CI/CD platforms, including GitHub Actions.
To demonstrate how version control and CI/CD can work, this article describes how to use Visual Studio Code, dbx, and this code sample, along with GitHub and GitHub Actions.
Code sample requirements
To use this code sample, you must have the following:
- An Azure Databricks workspace in your Azure Databricks account.
- A GitHub account. Create a GitHub account, if you do not already have one.
Additionally, on your local development machine, you must have the following:
- Python version 3.8 or above.
  You should use a version of Python that matches the one that is installed on your target clusters. To get the version of Python that is installed on an existing cluster, you can use the cluster's web terminal to run the python --version command. See also the “System environment” section in the Databricks Runtime release notes versions and compatibility for the Databricks Runtime version for your target clusters. In any case, the version of Python must be 3.8 or above.
  To get the version of Python that is currently referenced on your local machine, run python --version from your local terminal. (Depending on how you set up Python on your local machine, you may need to run python3 instead of python throughout this article.) See also Select a Python interpreter.
- pip.
  pip is automatically installed with newer versions of Python. To check whether pip is already installed, run pip --version from your local terminal. (Depending on how you set up Python or pip on your local machine, you may need to run pip3 instead of pip throughout this article.)
- dbx version 0.8.0 or above. You can install the dbx package from the Python Package Index (PyPI) by running pip install dbx.
  Note
  You do not need to install dbx now. You can install it later in the code sample setup section.
- A method to create Python virtual environments to ensure you are using the correct versions of Python and package dependencies in your dbx projects. This article covers pipenv.
- The Databricks CLI version 0.18 or below, set up with authentication.
  Note
  You do not need to install the legacy Databricks CLI (Databricks CLI version 0.17) now. You can install it later in the code sample setup section. If you want to install it later, you must remember to set up authentication at that time instead.
- The Python extension for Visual Studio Code.
- The GitHub Pull Requests and Issues extension for Visual Studio Code.
- Git.
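The local requirements above can be checked with a short script. The following is a minimal sketch and is not part of the code sample: the names check_python and find_tools are hypothetical, and the tool list simply mirrors the command-line tools named in this section.

```python
import shutil
import sys

# Minimum Python version stated in this article's requirements.
MIN_PYTHON = (3, 8)

# Command-line tools named in the requirements list.
TOOLS = ["pip", "pipenv", "dbx", "databricks", "git"]


def check_python(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def find_tools(names=TOOLS):
    """Map each tool name to its location on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    print("Python OK:", check_python())
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found'}")
```

A tool reported as not found is not necessarily a problem at this point, because dbx and the legacy Databricks CLI can be installed later in the setup steps.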
About the code sample
The Python code sample for this article, available in the databricks/ide-best-practices repo in GitHub, does the following:
- Gets data from the owid/covid-19-data repo in GitHub.
- Filters the data for a specific ISO country code.
- Creates a pivot table from the data.
- Performs data cleansing on the data.
- Modularizes the code logic into reusable functions.
- Unit tests the functions.
- Provides dbx project configurations and settings to enable the code to write the data to a Delta table in a remote Azure Databricks workspace.
Set up the code sample
After you have the requirements in place for this code sample, complete the following steps to begin using the code sample.
Note
These steps do not include setting up this code sample for CI/CD. You do not need to set up CI/CD to run this code sample. If you want to set up CI/CD later, see Run with GitHub Actions.
Step 1: Create a Python virtual environment
- From your terminal, create a blank folder to contain a virtual environment for this code sample. These instructions use a parent folder named ide-demo. You can give this folder any name you want. If you use a different name, replace the name throughout this article. After you create the folder, switch to it, and then start Visual Studio Code from that folder. Be sure to include the dot (.) after the code command.
  For Linux and macOS:
      mkdir ide-demo
      cd ide-demo
      code .
  Tip
  If you get the error command not found: code, see Launching from the command line on the Microsoft website.
  For Windows:
      md ide-demo
      cd ide-demo
      code .
- In Visual Studio Code, on the menu bar, click View > Terminal.
- From the root of the ide-demo folder, run the pipenv command with the following option, where <version> is the target version of Python that you already have installed locally (and, ideally, a version that matches your target clusters' version of Python), for example 3.8.14.
      pipenv --python <version>
  Make a note of the Virtualenv location value in the output of the pipenv command, as you will need it in the next step.
- Select the target Python interpreter, and then activate the Python virtual environment:
  - On the menu bar, click View > Command Palette, type Python: Select, and then click Python: Select Interpreter.
  - Select the Python interpreter within the path to the Python virtual environment that you just created. (This path is listed as the Virtualenv location value in the output of the pipenv command.)
  - On the menu bar, click View > Command Palette, type Terminal: Create, and then click Terminal: Create New Terminal.
  - Make sure that the command prompt indicates that you are in the pipenv shell. To confirm, you should see something like (<your-username>) before your command prompt. If you do not see it, run the following command:
        pipenv shell
    To exit the pipenv shell, run the command exit, and the parentheses disappear.
For more information, see Using Python environments in VS Code in the Visual Studio Code documentation.
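To double-check that the selected interpreter really belongs to the virtual environment, you can compare two interpreter paths from Python itself. This is a minimal sketch that assumes the environment was created by venv or a recent version of virtualenv (which pipenv uses); the function name in_virtual_environment is hypothetical.

```python
import sys


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a virtual environment.

    Inside a virtual environment, sys.prefix points at the environment,
    while sys.base_prefix points at the Python installation it was created from.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("In a virtual environment:", in_virtual_environment())
```

When run from the pipenv shell, the interpreter path that is printed should sit under the Virtualenv location value that you noted earlier.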
Step 2: Clone the code sample from GitHub
- In Visual Studio Code, open the ide-demo folder (File > Open Folder), if it is not already open.
- Click View > Command Palette, type Git: Clone, and then click Git: Clone.
- For Provide repository URL or pick a repository source, enter https://github.com/databricks/ide-best-practices
- Browse to your ide-demo folder, and click Select Repository Location.
Step 3: Install the code sample's dependencies
- Install a version of dbx and Databricks CLI version 0.18 or below that is compatible with your version of Python. To do this, in Visual Studio Code from your terminal, from your ide-demo folder with a pipenv shell activated (pipenv shell), run the following command:
      pip install dbx
- Confirm that dbx is installed. To do this, run the following command:
      dbx --version
  If the version number is returned, dbx is installed.
  If the version number is below 0.8.0, upgrade dbx by running the following command, and then check the version number again:
      pip install dbx --upgrade
      dbx --version

      # Or ...
      python -m pip install dbx --upgrade
      dbx --version
- When you install dbx, the legacy Databricks CLI (Databricks CLI version 0.17) is also automatically installed. To confirm that the legacy Databricks CLI (Databricks CLI version 0.17) is installed, run the following command:
      databricks --version
  If Databricks CLI version 0.17 is returned, the legacy Databricks CLI is installed.
- If you have not set up the legacy Databricks CLI (Databricks CLI version 0.17) with authentication, you must do it now. To confirm that authentication is set up, run the following basic command to get some summary information about your Azure Databricks workspace. Be sure to include the forward slash (/) after the ls subcommand:
      databricks workspace ls /
  If a list of root-level folder names for your workspace is returned, authentication is set up.
- Install the Python packages that this code sample depends on. To do this, run the following command from the ide-demo/ide-best-practices folder:
      pip install -r unit-requirements.txt
- Confirm that the code sample's dependent packages are installed. To do this, run the following command:
      pip list
  If the packages that are listed in the requirements.txt and unit-requirements.txt files are somewhere in this list, the dependent packages are installed.
  Note
  The packages listed in requirements.txt are pinned to specific versions. For better compatibility, you can cross-reference these versions with the cluster node type that you want your Azure Databricks workspace to use for running deployments on later. See the “System environment” section for your cluster's Databricks Runtime version in Databricks Runtime release notes versions and compatibility.
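Instead of scanning the pip list output by eye, the comparison against pinned versions can be scripted. The sketch below assumes that the requirements file uses simple name==version lines; the helper names parse_pins, installed_version, and find_mismatches are hypothetical and are not part of the code sample.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse 'name==version' lines into a dict, skipping comments and blanks."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (pinned, installed)} for packages that do not match."""
    mismatches = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches
```

For example, from the ide-demo/ide-best-practices folder you could open requirements.txt, pass the file object to parse_pins, and print the result of find_mismatches. An empty result suggests that the installed versions match the pinned ones.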
Step 4: Customize the code sample for your Azure Databricks workspace
- Customize the repo's dbx project settings. To do this, in the .dbx/project.json file, change the value of the profile object from DEFAULT to the name of the profile that matches the one that you set up for authentication with the legacy Databricks CLI (Databricks CLI version 0.17). If you did not set up any non-default profile, leave DEFAULT as is. For example:
      {
        "environments": {
          "default": {
            "profile": "DEFAULT",
            "storage_type": "mlflow",
            "properties": {
              "workspace_directory": "/Workspace/Shared/dbx/covid_analysis",
              "artifact_location": "dbfs:/Shared/dbx/projects/covid_analysis"
            }
          }
        },
        "inplace_jinja_support": false
      }
- Customize the dbx project's deployment settings. To do this, in the conf/deployment.yml file, change the value of the spark_version and node_type_id objects from 10.4.x-scala2.12 and m6gd.large to the Azure Databricks runtime version string and cluster node type that you want your Azure Databricks workspace to use for running deployments on.
  For example, to specify Databricks Runtime 10.4 LTS and a Standard_DS3_v2 node type:
      environments:
        default:
          workflows:
            - name: 'covid_analysis_etl_integ'
              new_cluster:
                spark_version: '10.4.x-scala2.12'
                num_workers: 1
                node_type_id: 'Standard_DS3_v2'
              spark_python_task:
                python_file: 'file://jobs/covid_trends_job.py'
            - name: 'covid_analysis_etl_prod'
              new_cluster:
                spark_version: '10.4.x-scala2.12'
                num_workers: 1
                node_type_id: 'Standard_DS3_v2'
              spark_python_task:
                python_file: 'file://jobs/covid_trends_job.py'
                parameters: ['--prod']
            - name: 'covid_analysis_etl_raw'
              new_cluster:
                spark_version: '10.4.x-scala2.12'
                num_workers: 1
                node_type_id: 'Standard_DS3_v2'
              spark_python_task:
                python_file: 'file://jobs/covid_trends_job_raw.py'
Tip
In this example, each of these three job definitions has the same spark_version and node_type_id value. You can use different values for different job definitions. You can also create shared values and reuse them across job definitions, to reduce typing errors and code maintenance. See the YAML example in the dbx documentation.
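The profile change in the .dbx/project.json file described in this step can also be scripted, which may help if you switch between profiles often. This is a minimal sketch that assumes the JSON layout shown in this step; the function names set_profile and update_project_file, and the profile name MY_PROFILE used below, are hypothetical.

```python
import json


def set_profile(project, profile, environment="default"):
    """Return a copy of the project settings with `profile` set for one environment."""
    updated = json.loads(json.dumps(project))  # simple deep copy of JSON data
    updated["environments"][environment]["profile"] = profile
    return updated


def update_project_file(path, profile, environment="default"):
    """Rewrite the JSON settings file at `path` with the new profile name."""
    with open(path) as handle:
        project = json.load(handle)
    with open(path, "w") as handle:
        json.dump(set_profile(project, profile, environment), handle, indent=2)
```

For example, calling update_project_file(".dbx/project.json", "MY_PROFILE") from the root of the dbx project would rewrite the file in place, so commit or back up the file first.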
Explore the code sample
After you set up the code sample, use the following information to learn about how the various files in the ide-demo/ide-best-practices folder work.
Code modularization
Unmodularized code
The jobs/covid_trends_job_raw.py file is an unmodularized version of the code logic. You can run this file by itself.
Modularized code
The jobs/covid_trends_job.py file is a modularized version of the code logic. This file relies on the shared code in the covid_analysis/transforms.py file. The covid_analysis/__init__.py file treats the covid_analysis folder as a containing package.
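To illustrate what modularization buys you, the following sketch splits a filter-and-pivot job into small reusable functions. It is not the code from the repo: it uses only the Python standard library, and the function names and the field names iso_code, date, indicator, and value are assumptions made for this illustration.

```python
from collections import defaultdict


def filter_country(rows, iso_code):
    """Keep only the rows whose 'iso_code' field matches the given code."""
    return [row for row in rows if row["iso_code"] == iso_code]


def pivot(rows, fill=0.0):
    """Pivot long-format rows into one dict per date, with one key per indicator."""
    by_date = defaultdict(dict)
    indicators = set()
    for row in rows:
        by_date[row["date"]][row["indicator"]] = float(row["value"])
        indicators.add(row["indicator"])
    # Fill indicators that are missing on a given date (a simple cleansing step).
    return {
        date: {name: values.get(name, fill) for name in sorted(indicators)}
        for date, values in sorted(by_date.items())
    }


def run_job(rows, iso_code):
    """A job body that only composes the reusable functions above."""
    return pivot(filter_country(rows, iso_code))
```

Because each step is a plain function, it can be imported and tested on its own, which is the property that the unit tests in the next section rely on.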
Testing
Unit tests
The tests/testdata.csv file contains a small portion of the data in the covid-hospitalizations.csv file for testing purposes. The tests/transforms_test.py file contains the unit tests for the covid_analysis/transforms.py file.
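A unit test in this style loads a small, fixed data set and asserts on the output of one function. The sketch below is a hypothetical illustration rather than the contents of tests/transforms_test.py: the sample rows, field names, and the function under test are all assumptions. pytest collects functions whose names start with test_, so the two test functions would run unchanged under pytest.

```python
import csv
import io

# A few rows of sample data, standing in for a small test data file.
TEST_CSV = """iso_code,date,indicator,value
USA,2021-01-01,icu,10
FRA,2021-01-01,icu,7
"""


def load_rows(text):
    """Read CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_country(rows, iso_code):
    """Function under test: keep rows for one ISO country code."""
    return [row for row in rows if row["iso_code"] == iso_code]


def test_filter_country_keeps_only_matching_rows():
    rows = filter_country(load_rows(TEST_CSV), "USA")
    assert [row["iso_code"] for row in rows] == ["USA"]


def test_filter_country_returns_empty_for_unknown_code():
    assert filter_country(load_rows(TEST_CSV), "XXX") == []
```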
Unit test runner
The pytest.ini file contains configuration options for running tests with pytest. See pytest.ini and Configuration Options in the pytest documentation.
The .coveragerc file contains configuration options for Python code coverage measurements with coverage.py. See Configuration reference in the coverage.py documentation.
The requirements.txt file, which is a subset of the unit-requirements.txt file that you installed from earlier with pip, contains a list of packages that the unit tests also depend on.
Packaging
The setup.py file provides commands to be run at the console (console scripts), such as the pip command, for packaging Python projects with setuptools. See Entry Points in the setuptools documentation.
Other files
There are other files in this code sample that have not been previously described:
- The .github/workflows folder contains three files, databricks_pull_request_tests.yml, onpush.yml, and onrelease.yml, that represent the GitHub Actions, which are covered later in the GitHub Actions section.
- The .gitignore file contains a list of local folders and files that Git ignores for your repo.
Run the code sample
You can use dbx on your local machine to instruct Azure Databricks to run the code sample in your remote workspace on-demand, as described in the next subsection. Or you can use GitHub Actions to have GitHub run the code sample every time you push code changes to your GitHub repo.
Run with dbx
- Install the contents of the covid_analysis folder as a package in Python setuptools development mode by running the following command from the root of your dbx project (for example, the ide-demo/ide-best-practices folder). Be sure to include the dot (.) at the end of this command:
      pip install -e .
  This command creates a covid_analysis.egg-info folder, which contains information about the compiled version of the covid_analysis/__init__.py and covid_analysis/transforms.py files.
- Run the tests by running the following command:
      pytest tests/
  The tests' results are displayed in the terminal. All four tests should show as passing.
  Tip
  For additional approaches to testing, including testing for R and Scala notebooks, see Unit testing for notebooks.
- Optionally, get test coverage metrics for your tests by running the following command:
      coverage run -m pytest tests/
  Note
  If a message displays that coverage cannot be found, run pip install coverage, and try again.
  To view test coverage results, run the following command:
      coverage report -m
- If all four tests pass, send the dbx project's contents to your Azure Databricks workspace by running the following command:
      dbx deploy --environment=default
  Information about the project and its runs is sent to the location specified in the workspace_directory object in the .dbx/project.json file.
  The project's contents are sent to the location specified in the artifact_location object in the .dbx/project.json file.
- Run the pre-production version of the code in your workspace by running the following command:
      dbx launch covid_analysis_etl_integ
  A link to the run's results is displayed in the terminal. It should look something like this:
      https://<your-workspace-instance-id>/?o=1234567890123456#job/123456789012345/run/12345
  Follow this link in your web browser to see the run's results in your workspace.
- Run the production version of the code in your workspace by running the following command:
      dbx launch covid_analysis_etl_prod
  A link to the run's results is displayed in the terminal. It should look something like this:
      https://<your-workspace-instance-id>/?o=1234567890123456#job/123456789012345/run/23456
  Follow this link in your web browser to see the run's results in your workspace.
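If you repeat the test, deploy, and launch sequence often, you can wrap the same commands in a small script. This is a minimal sketch that only reuses the commands shown in this section; the function names build_commands and run_all are hypothetical, and the script assumes that it is run from the root of the dbx project with the pipenv shell activated.

```python
import subprocess


def build_commands(workflow, environment="default"):
    """Return the test, deploy, and launch commands used in this section, in order."""
    return [
        ["pytest", "tests/"],
        ["dbx", "deploy", f"--environment={environment}"],
        ["dbx", "launch", workflow],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in turn, stopping at the first one that fails."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code.
        runner(command, check=True)
```

For example, run_all(build_commands("covid_analysis_etl_integ")) runs the unit tests first and only deploys and launches the pre-production job if they pass.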
Run with GitHub Actions
In the project's .github/workflows folder, the onpush.yml and onrelease.yml GitHub Actions files do the following:
- On each push to a tag that begins with v, uses dbx to deploy the covid_analysis_etl_prod job.
- On each push that is not to a tag that begins with v:
  - Uses pytest to run the unit tests.
  - Uses dbx to deploy the file specified in the covid_analysis_etl_integ job to the remote workspace.
  - Uses dbx to launch the already-deployed file specified in the covid_analysis_etl_integ job on the remote workspace, tracing this run until it finishes.
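The branching described above can be summarized as a small decision function. This sketch only models the logic in Python and is not taken from the workflow files. It assumes the pushed ref arrives in the full Git form that GitHub reports (for example refs/tags/v1.0.0 or refs/heads/main), and the function name plan_for_ref and the step descriptions are hypothetical.

```python
def plan_for_ref(ref):
    """Return the ordered steps for a pushed Git ref, following the rules above."""
    if ref.startswith("refs/tags/v"):
        # Pushes to tags that begin with v deploy the production job.
        return ["deploy covid_analysis_etl_prod"]
    # Every other push tests, deploys, and launches the integration job.
    return [
        "run unit tests with pytest",
        "deploy covid_analysis_etl_integ",
        "launch covid_analysis_etl_integ",
    ]
```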
Note
An additional GitHub Actions file, databricks_pull_request_tests.yml, is provided for you as a template to experiment with, without impacting the onpush.yml and onrelease.yml GitHub Actions files. You can run this code sample without the databricks_pull_request_tests.yml GitHub Actions file. Its usage is not covered in this article.
The following subsections describe how to set up and run the onpush.yml and onrelease.yml GitHub Actions files.
Set up to use GitHub Actions
Set up your Azure Databricks workspace by following the instructions in Service principals for CI/CD. This includes the following actions:
- Create a service principal.
- Create a Microsoft Entra ID token for the service principal.
As a security best practice, Databricks recommends that you use a Microsoft Entra ID token for a service principal, instead of the Databricks personal access token for your workspace user, for enabling GitHub to authenticate with your Azure Databricks workspace.
After you create the service principal and its Microsoft Entra ID token, stop and make a note of the Microsoft Entra ID token value, which you will use in the next section.
Run GitHub Actions
Step 1: Publish your cloned repo
- In Visual Studio Code, in the sidebar, click the GitHub icon. If the icon is not visible, enable the GitHub Pull Requests and Issues extension through the Extensions view (View > Extensions) first.
- If the Sign In button is visible, click it, and follow the on-screen instructions to sign in to your GitHub account.
- On the menu bar, click View > Command Palette, type Publish to GitHub, and then click Publish to GitHub.
- Select an option to publish your cloned repo to your GitHub account.
Step 2: Add encrypted secrets to your repo
In the GitHub website for your published repo, follow the instructions in Creating encrypted secrets for a repository, for the following encrypted secrets:
- Create an encrypted secret named DATABRICKS_HOST, set to the value of your per-workspace URL, for example https://adb-1234567890123456.7.azuredatabricks.net.
- Create an encrypted secret named DATABRICKS_TOKEN, set to the value of the Microsoft Entra ID token for the service principal.
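A script that runs in the workflow can fail early if either value is missing or malformed. The sketch below assumes that the workflow exposes the two secrets to its steps as environment variables with the same names; the function name load_databricks_settings is hypothetical.

```python
import os


def load_databricks_settings(environ=None):
    """Read the two settings from the environment and do basic validation."""
    environ = os.environ if environ is None else environ
    names = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    host = environ["DATABRICKS_HOST"].rstrip("/")
    if not host.startswith("https://"):
        raise ValueError("DATABRICKS_HOST should be a per-workspace URL starting with https://")
    return {"host": host, "token": environ["DATABRICKS_TOKEN"]}
```

Avoid printing the returned token value in logs, because workflow logs can be visible to other collaborators on the repo.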
Step 3: Create and publish a branch to your repo
- In Visual Studio Code, in Source Control view (View > Source Control), click the … (Views and More Actions) icon.
- Click Branch > Create Branch From.
- Enter a name for the branch, for example my-branch.
- Select the branch to create the branch from, for example main.
- Make a minor change to one of the files in your local repo, and then save the file. For example, make a minor change to a code comment in the tests/transforms_test.py file.
- In Source Control view, click the … (Views and More Actions) icon again.
- Click Changes > Stage All Changes.
- Click the … (Views and More Actions) icon again.
- Click Commit > Commit Staged.
- Enter a message for the commit.
- Click the … (Views and More Actions) icon again.
- Click Branch > Publish Branch.
Step 4: Create a pull request and merge
- Go to the GitHub website for your published repo, https://github.com/<your-GitHub-username>/ide-best-practices.
- On the Pull requests tab, next to my-branch had recent pushes, click Compare & pull request.
- Click Create pull request.
- On the pull request page, wait for the icon next to CI pipeline / ci-pipeline (push) to display a green check mark. (It may take several minutes for the icon to appear.) If there is a red X instead of a green check mark, click Details to find out why. If the icon or Details are no longer showing, click Show all checks.
- If the green check mark appears, merge the pull request into the main branch by clicking Merge pull request.