Tutorial: Analyze functional dependencies in a semantic model

In this tutorial, you build on prior work by a Power BI analyst, stored as semantic models (Power BI datasets). By using SemPy (preview) in the Synapse Data Science experience in Microsoft Fabric, you analyze functional dependencies in DataFrame columns. This analysis helps you discover subtle data quality issues and gain more accurate insights.

In this tutorial, you learn how to:

  • Apply domain knowledge to formulate hypotheses about functional dependencies in a semantic model.
  • Get familiar with components of Semantic Link's Python library (SemPy) that integrate with Power BI and help automate data quality analysis. These components include:
    • FabricDataFrame—pandas-like structure enhanced with additional semantic information
    • Functions that pull semantic models from a Fabric workspace into your notebook
    • Functions that evaluate functional dependency hypotheses and identify relationship violations in your semantic models

Prerequisites

  1. Select Workspaces from the navigation pane to find and select your workspace. This workspace becomes your current workspace.

  2. Download the Customer Profitability Sample.pbix file from the fabric-samples GitHub repository.

  3. In your workspace, select Import > Report or Paginated Report > From this computer to upload the Customer Profitability Sample.pbix file to your workspace.

Follow along in the notebook

The powerbi_dependencies_tutorial.ipynb notebook accompanies this tutorial.

Set up the notebook

Set up a notebook environment with the modules and data you need.

  1. Use %pip to install SemPy from PyPI in the notebook.

    %pip install semantic-link
    
  2. Import the modules you need.

    import sempy.fabric as fabric
    from sempy.dependencies import plot_dependency_metadata
    

Load and preprocess the data

This tutorial uses the Customer Profitability Sample.pbix standard sample semantic model. For a description of the semantic model, see Customer Profitability sample for Power BI.

  1. Load Power BI data into a FabricDataFrame by using the fabric.read_table function.

    dataset = "Customer Profitability Sample"
    customer = fabric.read_table(dataset, "Customer")
    customer.head()
    
  2. Load the State table into a FabricDataFrame.

    state = fabric.read_table(dataset, "State")
    state.head()
    

    Although the output looks like a pandas DataFrame, this code initializes a data structure called a FabricDataFrame that adds operations on top of pandas.

  3. Check the data type of customer.

    type(customer)
    

    The output shows that customer is sempy.fabric._dataframe._fabric_dataframe.FabricDataFrame.

  4. Join the customer and state DataFrame objects.

customer_state_df = customer.merge(state, left_on="State", right_on="StateCode", how="left")
    customer_state_df.head()
    

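A left join like the one in step 4 silently fills NaN values for any customer whose State code has no match in the State table. As a quick sanity check, here's a minimal sketch with hypothetical stand-in data that uses pandas' merge indicator (plain pandas, not a SemPy feature):

```python
import pandas as pd

# Hypothetical stand-ins for the Customer and State tables.
customer = pd.DataFrame({"Name": ["Contoso", "Fabrikam"], "State": ["WA", "ZZ"]})
state = pd.DataFrame({"StateCode": ["WA"], "StateName": ["Washington"]})

# indicator=True adds a _merge column that flags rows with no match in state.
merged = customer.merge(
    state, left_on="State", right_on="StateCode", how="left", indicator=True
)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["Name", "State"]])  # rows whose state code found no match
```

Unmatched rows like these are one source of the empty values that surface later in the dependency analysis.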
Identify functional dependencies

A functional dependency is a one-to-many relationship between values in two or more columns in a DataFrame. Use these relationships to automatically detect data quality problems.
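In plain pandas terms, a functional dependency X → Y holds when every value of X maps to exactly one value of Y. A minimal sketch with hypothetical data illustrates the idea (SemPy's find_dependencies automates this kind of check across all column pairs, with tolerance for noise):

```python
import pandas as pd

# Hypothetical data: each Customer maps to exactly one City,
# so Customer -> City holds, but City -> Customer does not.
df = pd.DataFrame({
    "Customer": ["A", "A", "B", "C"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle"],
})

def dependency_holds(df, determinant, dependent):
    # X -> Y holds exactly when every X value maps to a single Y value.
    return bool((df.groupby(determinant)[dependent].nunique() <= 1).all())

print(dependency_holds(df, "Customer", "City"))   # True
print(dependency_holds(df, "City", "Customer"))   # False: Seattle -> A and C
```

Real data rarely satisfies a dependency exactly, which is why SemPy measures how close the columns come to one rather than applying a strict yes/no test.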

  1. Run SemPy's find_dependencies function on the merged DataFrame to identify functional dependencies between column values.

    dependencies = customer_state_df.find_dependencies()
    dependencies
    
  2. Visualize the dependencies by using SemPy's plot_dependency_metadata function.

    plot_dependency_metadata(dependencies)
    

    Screenshot of the dependency metadata plot.

    The functional dependencies graph shows that the Customer column determines columns like City, Postal Code, and Name.

    The graph doesn't show a functional dependency between City and Postal Code, likely because there are many violations in the relationship between the columns. Use SemPy's plot_dependency_violations function to visualize dependency violations between specific columns.

Explore the data for quality issues

  1. Draw a graph with SemPy's plot_dependency_violations visualization function.

    customer_state_df.plot_dependency_violations('Postal Code', 'City')
    

    Screenshot of a plot that shows dependency violations.

    The plot of dependency violations shows values of Postal Code on the left side and values of City on the right side. An edge connects a Postal Code on the left-hand side with a City on the right-hand side if at least one row contains both values. Each edge is annotated with the count of such rows. For example, there are two rows with postal code 20004: one with city "North Tower" and the other with city "Washington".

    The plot also shows a few violations and many empty values.

  2. Confirm the number of empty values for Postal Code:

    customer_state_df['Postal Code'].isna().sum()
    

    50 rows have NA for Postal Code.

  3. Drop rows with empty values, and then run the find_dependencies function again. The extra parameter verbose=1 offers a glimpse into the internal workings of SemPy:

    customer_state_df2 = customer_state_df.dropna()
    customer_state_df2.find_dependencies(verbose=1)
    

    The conditional entropy for Postal Code and City is 0.049. This nonzero value indicates that there are functional dependency violations. Before you fix the violations, raise the threshold on conditional entropy from the default value of 0.01 to 0.05, just to see the dependencies. Lower thresholds result in fewer detected dependencies (that is, higher selectivity).

  4. Raise the threshold on conditional entropy from the default value of 0.01 to 0.05:

    plot_dependency_metadata(customer_state_df2.find_dependencies(threshold=0.05))
    

    Screenshot of the dependency metadata plot with a higher entropy threshold.

    Applying domain knowledge about which entity determines the values of other entities, this dependency graph seems accurate.

  5. Explore more data quality issues that were detected. For example, a dashed arrow joins City and Region, which indicates that the dependency is only approximate. This approximate relationship could imply that there's a partial functional dependency.

    customer_state_df.list_dependency_violations('City', 'Region')
    
  6. Take a closer look at each of the cases where a nonempty Region value causes a violation:

    customer_state_df[customer_state_df.City=='Downers Grove']
    

    The result shows the city of Downers Grove in Illinois and Nebraska. However, Downers Grove is a city in Illinois, not Nebraska.

  7. Take a look at the city of Fremont:

    customer_state_df[customer_state_df.City=='Fremont']
    

    There's a city called Fremont in California. For Texas, however, a search returns Premont, not Fremont, which suggests a misspelling in the data.

  8. It's also suspicious to see violations of the dependency between Name and Country/Region, as signified by the dotted line in the original graph of dependency violations (before dropping the rows with empty values).

    customer_state_df.list_dependency_violations('Name', 'Country/Region')
    

    One customer, SDI Design, appears in two regions—United States and Canada. This case might not be a semantic violation, just uncommon. Still, it's worth a close look:

  9. Take a closer look at the customer SDI Design:

    customer_state_df[customer_state_df.Name=='SDI Design']
    

    Further inspection shows two different customers from different industries with the same name.
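The conditional entropy values that drive find_dependencies can be illustrated with a small sketch. This is a simplified illustration with hypothetical data (values echo the earlier postal-code example); SemPy's internal computation and normalization may differ:

```python
import numpy as np
import pandas as pd

def conditional_entropy(df, x, y):
    # H(Y|X) = sum over x of p(x) * H(Y | X=x), in bits.
    # A value of 0 means y is fully determined by x.
    total = len(df)
    h = 0.0
    for _, group in df.groupby(x):
        p_x = len(group) / total
        p_y = group[y].value_counts(normalize=True)
        h += p_x * float(-(p_y * np.log2(p_y)).sum())
    return h

# Hypothetical data: postal code 20004 maps to two cities (a violation).
df = pd.DataFrame({
    "Postal Code": ["20004", "20004", "98052", "98052"],
    "City": ["Washington", "North Tower", "Redmond", "Redmond"],
})
print(conditional_entropy(df, "Postal Code", "City"))          # 0.5
print(conditional_entropy(df, "Postal Code", "Postal Code"))   # 0.0
```

A nonzero result like this is what a dashed (approximate) edge in the dependency graph reflects, and the threshold parameter controls how much of it to tolerate.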

Exploratory data analysis and data cleaning are iterative. What you find depends on your questions and perspective. Semantic Link gives you new tools to get more from your data.

Check out other tutorials for Semantic Link and SemPy: