Run notebooks on the Microsoft Sentinel data lake

Jupyter notebooks provide an interactive environment for exploring, analyzing, and visualizing data in the Microsoft Sentinel data lake. With notebooks, you can write and execute code, document your workflow, and view results—all in one place. This makes it easy to perform data exploration, build advanced analytics solutions, and share insights with others. By leveraging Python and Apache Spark within Visual Studio Code, notebooks help you transform raw security data into actionable intelligence.

This article shows you how to explore and interact with data lake data using Jupyter notebooks in Visual Studio Code.

Prerequisites

Onboard to the Microsoft Sentinel data lake

To use notebooks in the Microsoft Sentinel data lake, you must first onboard to the data lake. If you haven't onboarded, see Onboarding to Microsoft Sentinel data lake. If you onboarded recently, it may take some time for a sufficient volume of data to be ingested before you can create meaningful analyses in notebooks.

Permissions

Microsoft Entra ID roles provide broad access across all workspaces in the data lake. Alternatively, you can grant access to individual workspaces using Azure RBAC roles. Users with Azure RBAC permissions to Microsoft Sentinel workspaces can run notebooks against those workspaces in the data lake tier. For more information, see Roles and permissions in Microsoft Sentinel.

To create new custom tables in the analytics tier, the data lake managed identity must be assigned the Log Analytics Contributor role in the Log Analytics workspace.

To assign the role, follow the steps below:

  1. In the Azure portal, navigate to the Log Analytics workspace that you want to assign the role to.
  2. Select Access control (IAM) in the left navigation pane.
  3. Select Add role assignment.
  4. In the Role table, select Log Analytics Contributor, then select Next.
  5. Select Managed identity, then select Select members.
  6. Your data lake managed identity is a system-assigned managed identity named msg-resources-<guid>. Select the managed identity, then select Select.
  7. Select Review and assign.

For more information on assigning roles to managed identities, see Assign Azure roles using the Azure portal.
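
If you prefer to script the role assignment, the following is a minimal sketch using the azure-mgmt-authorization Python SDK. The subscription ID, workspace resource ID, and the managed identity's object ID are placeholders you must supply, and you should verify the Log Analytics Contributor role definition ID for your cloud.

    import uuid

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.authorization import AuthorizationManagementClient

    # Placeholders: supply your own values.
    subscription_id = "<subscription-id>"
    workspace_resource_id = "<log-analytics-workspace-resource-id>"
    identity_object_id = "<msg-resources-identity-object-id>"

    # Built-in Log Analytics Contributor role; verify this ID for your cloud.
    role_definition_id = (
        f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization/"
        "roleDefinitions/92aaf0da-9dab-42b6-94a3-d43ce8d16293"
    )

    client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

    # Assign the role to the data lake managed identity at workspace scope.
    client.role_assignments.create(
        scope=workspace_resource_id,
        role_assignment_name=str(uuid.uuid4()),
        parameters={
            "properties": {
                "roleDefinitionId": role_definition_id,
                "principalId": identity_object_id,
                "principalType": "ServicePrincipal",
            }
        },
    )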

Install Visual Studio Code and the Microsoft Sentinel extension

If you don't already have Visual Studio Code, download and install Visual Studio Code for Mac, Linux, or Windows.

The Microsoft Sentinel extension for Visual Studio Code (VS Code) is installed from the extensions marketplace. To install the extension, follow these steps:

  1. Select the Extensions Marketplace in the left toolbar.
  2. Search for Sentinel.
  3. Select the Microsoft Sentinel extension and select Install.
  4. After the extension is installed, the Microsoft Sentinel shield icon appears in the left toolbar.

A screenshot showing the extensions marketplace.

Install the GitHub Copilot extension for Visual Studio Code to enable code completion and suggestions in notebooks.

  1. Search for GitHub Copilot in the Extensions Marketplace and install it.
  2. After installation, sign in to GitHub Copilot using your GitHub account.

Explore data lake tier tables

After installing the Microsoft Sentinel extension, you can start exploring data lake tier tables and creating Jupyter notebooks to analyze the data.

Sign in to the Microsoft Sentinel extension

  1. Select the Microsoft Sentinel shield icon in the left toolbar.

  2. A dialog appears with the following text: The extension "Microsoft Sentinel" wants to sign in using Microsoft. Select Allow.

    A screenshot showing the sign in dialog.

  3. Select your account name to complete the sign-in.

    A screenshot showing the account selection list at the top of the page.

View data lake tables and jobs

Once you sign in, the Microsoft Sentinel extension displays a list of Lake tables and Jobs in the left pane. The tables are grouped by database and category. Select a table to see its column definitions.

For information on jobs, see Jobs and scheduling.

A screenshot showing the list of tables, jobs, and the selected table's metadata.

Create a new notebook

To create a new notebook, use one of the following methods:

  • Enter > in the search box or press Ctrl+Shift+P, then enter Create New Jupyter Notebook. A screenshot showing how to create a new notebook from the search bar.

  • Select File > New File, then select Jupyter Notebook from the dropdown.
    A screenshot showing how to create a new notebook from the file menu.

In the new notebook, paste the following code into the first cell.

    # Create a data provider from the active Spark session.
    from sentinel_lake.providers import MicrosoftSentinelProvider
    data_provider = MicrosoftSentinelProvider(spark)

    # Read the EntraGroups table into a Spark DataFrame and display
    # selected columns.
    table_name = "EntraGroups"
    df = data_provider.read_table(table_name)
    df.select("displayName", "groupTypes", "mail", "mailNickname", "description", "tenantId").show(100, truncate=False)

The editor provides IntelliSense code completion for both the MicrosoftSentinelProvider class and the table names in the data lake.
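
Because read_table returns an Apache Spark DataFrame, you can also chain standard PySpark transformations before displaying results. The following is a minimal sketch that builds on the previous cell; the column names are the ones selected above.

    from pyspark.sql import functions as F

    # Filter to mail-enabled groups, count them, and display a sample.
    # Builds on the df DataFrame from the previous cell.
    mail_enabled = df.filter(F.col("mail").isNotNull())
    print(f"Mail-enabled groups: {mail_enabled.count()}")
    mail_enabled.select("displayName", "mail").show(20, truncate=False)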

  1. Select the Run triangle to execute the code in the notebook. The results are displayed in the output pane below the code cell.
    A screenshot showing how to run a notebook cell.

  2. Select Microsoft Sentinel from the kernel list to display the available runtime pools. A screenshot showing the runtime picker.

  3. Select Medium to run the notebook in the medium-sized runtime pool. For more information on the different runtimes, see Select the appropriate runtime pool. A screenshot showing the run pool size picker.

Note

Selecting the kernel starts the Spark session and runs the code in the notebook. After selecting the pool, it can take 3-5 minutes for the session to start. Subsequent runs are faster because the session is already active.

When the session starts, the code in the notebook runs and the results are displayed in the output pane below the code cell. A screenshot showing the results from running a notebook cell.

For sample notebooks that demonstrate how to interact with the Microsoft Sentinel data lake, see Sample notebooks for Microsoft Sentinel data lake.

Status bar

The status bar at the bottom of the notebook provides information about the current state of the notebook and the Spark session. The status bar includes the following information:

  • The vCore utilization percentage for the selected Spark pool. Hover over the percentage to see the number of vCores used and the total number of vCores available in the pool. The percentages represent the current usage across interactive and job workloads for the logged-in account.

  • The connection status of the Spark session, for example, Connecting, Connected, or Not Connected.

A screenshot showing the status bar at the bottom of the notebook.

Set session timeouts

You can set the session timeout and timeout warnings for interactive notebooks. To change the timeout, select the connection status in the status bar at the bottom of the notebook. Choose from the following options:

  • Set session timeout period: Sets the time in minutes before the session times out. The default is 30 minutes.

  • Reset session timeout period: Resets the session timeout to the default value of 30 minutes.

  • Set session timeout warning period: Sets the time in minutes before the timeout that a warning is displayed that the session is about to time out. The default is 5 minutes.

  • Reset session timeout warning period: Resets the session timeout warning to the default value of 5 minutes.

    A screenshot showing the session timeout setting.

Use GitHub Copilot in notebooks

Use GitHub Copilot to help you write code in notebooks. GitHub Copilot provides code suggestions and autocompletion based on the context of your code. To use GitHub Copilot, ensure that you have the GitHub Copilot extension installed in Visual Studio Code.

Copy code from the Sample notebooks for Microsoft Sentinel data lake and save it in your notebooks folder to provide context for GitHub Copilot. GitHub Copilot will then be able to suggest code completions based on the context of your notebook.

The following example shows GitHub Copilot generating a code review.

A screenshot showing GitHub Copilot generating a code review.

Microsoft Sentinel Provider class

To connect to the Microsoft Sentinel data lake, use the MicrosoftSentinelProvider class. This class is part of the sentinel_lake.providers module and provides methods to interact with the data lake. To use this class, import it and create an instance using a Spark session.

from sentinel_lake.providers import MicrosoftSentinelProvider
data_provider = MicrosoftSentinelProvider(spark)

For more information on the available methods, see Microsoft Sentinel Provider class reference.
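
As a quick sketch of how the provider methods compose: read_table is shown earlier in this article, while the discovery method below (list_databases) is an assumption for illustration; confirm the exact names and signatures in the class reference.

    from sentinel_lake.providers import MicrosoftSentinelProvider

    data_provider = MicrosoftSentinelProvider(spark)

    # list_databases is an illustrative assumption; check the Microsoft
    # Sentinel Provider class reference for the exact API.
    for database in data_provider.list_databases():
        print(database)

    # read_table is shown earlier in this article.
    df = data_provider.read_table("EntraGroups")
    df.printSchema()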

Select the appropriate runtime pool

There are three runtime pools available to run your Jupyter notebooks in the Microsoft Sentinel extension. Each pool is designed for different workloads and performance requirements. The choice of runtime pool affects the performance, cost, and execution time of your Spark jobs.

Runtime pool | Recommended use cases | Characteristics
--- | --- | ---
Small | Development, testing, and lightweight exploratory analysis. Small workloads with simple transformations. Cost efficiency prioritized. | Suitable for small workloads and simple transformations. Lower cost, longer execution time.
Medium | ETL jobs with joins, aggregations, and ML model training. Moderate workloads with complex transformations. | Improved performance over Small. Handles parallelism and moderate memory-intensive operations.
Large | Deep learning and ML workloads. Extensive data shuffling, large joins, or real-time processing. Critical execution time. | High memory and compute power. Minimal delays. Best for large, complex, or time-sensitive workloads.

Note

When first accessed, kernel options may take about 30 seconds to load.
After selecting a runtime pool, it can take 3–5 minutes for the session to start.

View messages, logs, and errors

Messages, logs, and errors are displayed in three areas in Visual Studio Code:

  1. The Output pane.

    1. In the Output pane, select Microsoft Sentinel from the drop-down.
    2. Select Debug to include detailed log entries.

    A screenshot showing the output pane.

  2. In-line messages in the notebook provide feedback and information about the execution of code cells. These messages include execution status updates, progress indicators, and error notifications related to the code in the preceding cell.

  3. A notification pop-up in the bottom right corner of Visual Studio Code, also known as a toast message, provides real-time alerts and updates about the status of operations within the notebook and the Spark session. These notifications include messages, warnings, and error alerts, such as a successful connection to a Spark session or a timeout warning.

    A screenshot showing a toast message and an in-line error message.

Jobs and scheduling

You can schedule jobs to run at specific times or intervals using the Microsoft Sentinel extension for Visual Studio Code. Jobs allow you to automate data processing tasks to summarize, transform, or analyze data in the Microsoft Sentinel data lake. Jobs are also used to process data and write results to custom tables in the data lake tier or analytics tier. For more information on creating and managing jobs, see Create and manage Jupyter notebook jobs.
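
For example, a scheduled job cell might aggregate data and persist the result as a custom table. The following is a minimal sketch; the save_as_table method name and its parameters are assumptions here, so verify them in the Microsoft Sentinel Provider class reference. Note the _SPRK suffix required for custom data lake tables, described in the limits section below.

    # Minimal job sketch: summarize a table and write the result to a
    # custom data lake table. save_as_table is an assumed method name;
    # confirm it in the provider class reference. Custom data lake table
    # names must end with _SPRK.
    groups_df = data_provider.read_table("EntraGroups")
    summary_df = groups_df.groupBy("tenantId").count()
    data_provider.save_as_table(summary_df, "GroupSummary_SPRK")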

Service parameters and limits for VS Code Notebooks

The following section lists the service parameters and limits for Microsoft Sentinel data lake when using VS Code Notebooks.

Category | Parameter/limit
--- | ---
Custom tables in the analytics tier | Custom tables in the analytics tier can't be deleted from a notebook; use Log Analytics to delete these tables. For more information, see Add or delete tables and columns in Azure Monitor Logs.
Gateway web socket timeout | 2 hours
Interactive query timeout | 2 hours
Interactive session inactivity timeout | 20 minutes
Language | Python
Notebook job timeout | 8 hours
Max concurrent notebook jobs | 3; subsequent jobs are queued
Max concurrent users on interactive querying | 8-10 on the Large pool
Session start-up time | A Spark compute session takes about 5-6 minutes to start. You can view the status of the session at the bottom of your VS Code notebook.
Supported libraries | Only Azure Synapse 3.4 libraries and the Microsoft Sentinel Provider library for abstracted functions are supported for querying the data lake. Pip installs and custom libraries aren't supported.
VS Code UX limit to display records | 100,000 rows
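
Given the 100,000-row display limit, limit or aggregate results before displaying them. A minimal sketch using standard PySpark:

    # Keep displayed output small: cap the number of rows before show(),
    # or display an aggregate instead of raw rows.
    df.limit(1000).show(truncate=False)
    print(f"Total rows: {df.count()}")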

Troubleshooting

The following table lists common errors you may encounter when working with notebooks, their root causes, and suggested actions to resolve them.

Error category | Error name | Error code | Error message | Suggested action
--- | --- | --- | --- | ---
DatabaseError | DatabaseNotFound | 2001 | Database {DatabaseName} not found. | Verify that the database exists. If the database is new, wait for a metadata refresh.
DatabaseError | AmbiguousDatabaseName | 2002 | Several databases (IDs: {DatabaseID1}, {DatabaseID2}, ...) share the name {DatabaseName}. Provide a specific database ID. | Specify a database ID when multiple databases have the same name.
DatabaseError | DatabaseIdMismatch | 2003 | Database ({DatabaseName}, ID {DatabaseID}) not found. Check both the database name and ID. | To obtain database IDs, list all the databases.
DatabaseError | ListDatabasesFailure | 2004 | Can't fetch databases. Restart the session and try again. | Restart the session and retry the operation after a few minutes.
TableError | TableDoesNotExist | 2100 | Table {TableName} not found in the database {DatabaseName}. | Verify that the table exists in the database. If the table or database is new, wait a few minutes and try again.
TableError | ProvisioningIncomplete | 2101 | Table {TableName} is not ready. Wait a few minutes before trying again. | The table is being provisioned. Wait a few minutes before trying again.
TableError | DeltaTableMissing | 2102 | Table {TableName} is empty. New tables can take up to a few hours to be ready. | It can take a few hours to fully synchronize an analytics table into the data lake. For tables that are only in the data lake, check if the data needs to be loaded or restored.
TableError | TableDoesNotExistForDelete | 2103 | Can't delete table. Table {TableName} not found. | Verify that the table exists in the database. If the table or database is new, wait a few minutes and try again.
AuthorizationFailure | MissingSASToken | 2201 | Can't access table. Restart the session and try again. | Authorization failed while trying to fetch the access token for the table. Restart the session and try again.
AuthorizationFailure | InvalidSASToken | 2202 | Can't access table. Restart the session and try again. | Authorization failed while trying to fetch the access token for the table. Restart the session and try again.
AuthorizationFailure | TokenExpired | 2203 | Can't access table. Restart the session and try again. | Authorization failed while trying to fetch the access token for the table. Restart the session and try again.
AuthorizationFailure | TableInsufficientPermissions | 2204 | Access needed for the table {TableName} in the database {DatabaseName}. | Contact an administrator to request access to the table or the database (workspace).
AuthorizationFailure | InternalTableAccessDenied | 2205 | Access to the table {TableName} is restricted. | Only system or user-defined tables can be accessed from a notebook.
AuthorizationFailure | TableAuthFailure | 2206 | Can't save data to the table. Restart the session and try again. | Authorization failed while trying to save data to the table. Restart the session and try again.
ConfigurationError | HadoopConfigFailure | 2301 | Can't update session configuration. Restart the session and try again. | This problem is transient and can be resolved by restarting the session and trying again. If this problem persists, contact support.
DataError | JsonParsingFailure | 2302 | Table metadata has been corrupted. Contact support for assistance. | Contact support for assistance. Provide your tenant ID, the table name, and the database name.
TableSchemaError | TableSchemaMismatch | 2401 | Column not found in the destination table. Align the DataFrame schema and the destination table or use overwrite mode. | Update the DataFrame schema to match the table in your target database. You can also replace the table entirely in overwrite mode.
TableSchemaError | MissingRequiredColumns | 2402 | Column {ColumnName} is missing from the DataFrame. Check the DataFrame schema and align it with the destination table. | Update the DataFrame schema to match the table in your target database. You can also replace the table entirely in overwrite mode.
TableSchemaError | ColumnTypeChangeNotAllowed | 2403 | Can't change the data type of the column {ColumnName}. | A data type change isn't allowed for the column. Check existing columns in the destination table and align all data types in the DataFrame.
TableSchemaError | ColumnNullabilityChangeNotAllowed | 2404 | Can't change nullability of the column {ColumnName}. | Can't update nullability settings of the column. Check the destination table and align the settings with the DataFrame.
IngestionError | FolderCreationFailure | 2501 | Can't create storage for the table {TableName}. | This problem is transient and can be resolved by restarting the session and trying again. If this problem persists, contact support.
IngestionError | SubJobRequestFailure | 2502 | Can't create ingestion job for the table {TableName}. | This problem is transient and can be resolved by restarting the session and trying again. If this problem persists, contact support.
IngestionError | SubJobCreationFailure | 2503 | Can't create ingestion job for the table {TableName}. | This problem is transient and can be resolved by restarting the session and trying again. If this problem persists, contact support.
InputError | InvalidWriteMode | 2601 | Invalid write mode. Use append or overwrite. | Specify a valid write mode (append or overwrite) before saving the DataFrame.
InputError | PartitioningNotAllowed | 2602 | Can't partition analytics tables. | Remove any partitioning for all columns in analytics tables.
InputError | MissingTableSuffixLake | 2603 | Invalid custom table name. All names of custom tables in the data lake must end with _SPRK. | Add _SPRK as a suffix to the table name before writing it to the data lake.
InputError | MissingTableSuffixLA | 2604 | Invalid custom table name. All names of custom analytics tables must end with _SPRK_CL. | Add _SPRK_CL as a suffix to the table name before writing it to analytics storage.
UnknownError | InternalServerError | 2901 | Something went wrong. Restart the session and try again. | This problem is transient and can be resolved by restarting the session and trying again. If this problem persists, contact support.
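
When handling these errors in notebook code, the error code usually appears in the exception message. A minimal sketch, assuming the codes above surface in the raised exception's text (the provider's concrete exception types aren't documented here):

    # Catch broadly and inspect the message for the error codes listed
    # above; the provider's concrete exception types are an assumption.
    try:
        df = data_provider.read_table("EntraGroups")
    except Exception as e:
        message = str(e)
        if "2001" in message:  # DatabaseNotFound
            print("Database not found. If it's new, wait for a metadata refresh.")
        elif "2101" in message:  # ProvisioningIncomplete
            print("Table is still provisioning. Wait a few minutes and retry.")
        else:
            raise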