Calculate similarity with the ai.similarity function

The ai.similarity function uses generative AI to compare two string expressions and calculate a semantic similarity score, all in a single line of code. You can compare the text values in one DataFrame column against a single common text value, or pairwise against the text values in another column.

AI functions improve data engineering by using the power of large language models in Microsoft Fabric. To learn more, see this overview article.

Important

This feature is in preview, for use in Fabric Runtime 1.3 and later.

  • Review the prerequisites in this overview article, including the library installations that are temporarily required to use AI functions.
  • By default, the gpt-4o-mini model currently powers AI functions. Learn more about billing and consumption rates.
  • Although the underlying model can handle several languages, most of the AI functions are optimized for use on English-language texts.
  • During the initial rollout of AI functions, users are temporarily limited to 1,000 requests per minute with the built-in AI endpoint in Fabric.

Use ai.similarity with pandas

The ai.similarity function extends the pandas Series class.

To calculate the semantic similarity of each input row against a single common text value, call the function on a text column of a pandas DataFrame. The function can also calculate pairwise similarity between each input row and the corresponding value in another column that has the same dimensions as the input column.

The function returns a pandas Series that contains similarity scores, which can be stored in a new DataFrame column.

Syntax

df["similarity"] = df["col1"].ai.similarity("value")

Parameters

| Name | Description |
|---|---|
| other (Required) | Either a string that contains a single common text value, which is used to compute similarity scores for each input row, or another pandas Series with the same dimensions as the input, which contains text values used to compute pairwise similarity scores for each input row. |

Returns

The function returns a pandas Series that contains similarity scores for each input text row. The output similarity scores are relative, and they're best used for ranking. Score values can range from -1 (opposites) to 1 (identical). A score value of 0 indicates that the values are unrelated in meaning.
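Because the scores are relative, one reasonable way to consume them is to sort rows rather than apply a fixed cutoff. The following is a minimal sketch in plain pandas. The score values are placeholders invented for illustration, not real ai.similarity output, and the rank column is a hypothetical addition.

```python
import pandas as pd

# Placeholder scores for illustration only; these are NOT real ai.similarity output.
df = pd.DataFrame(
    {
        "name": ["Bill Gates", "Satya Nadella", "Joan of Arc"],
        "similarity": [0.41, 0.47, 0.02],
    }
)

# Order rows from most to least similar instead of applying a fixed threshold.
ranked = df.sort_values("similarity", ascending=False).reset_index(drop=True)

# Hypothetical helper column: 1 is the closest match.
ranked["rank"] = ranked.index + 1
print(ranked)
```

Sorting keeps the comparison relative to the other rows in the same result, which matches how the scores are meant to be read.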

Example

# This code uses AI. Always review output for mistakes. 
# Read terms: https://azure.microsoft.com/support/legal/preview-supplemental-terms/.

import pandas as pd

df = pd.DataFrame([
        "Bill Gates",
        "Satya Nadella",
        "Joan of Arc"
    ], columns=["name"])

# Score each name against the single common text value "Microsoft".
df["similarity"] = df["name"].ai.similarity("Microsoft")
display(df)
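The parameter description says that other can also be a pandas Series, which gives pairwise scores. The following sketch shows what that call might look like. It assumes a Fabric notebook where the AI functions prerequisites are met, so that the ai accessor exists on Series. The topic column, the sample values, and the add_pairwise_similarity helper are hypothetical names introduced here.

```python
import pandas as pd

# Hypothetical sample data: each row pairs a name with its own comparison text.
df = pd.DataFrame(
    {
        "name": ["Bill Gates", "Satya Nadella", "Joan of Arc"],
        "topic": ["Microsoft", "Microsoft", "France"],
    }
)


def add_pairwise_similarity(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with a pairwise name-versus-topic score column.

    Assumes the Fabric AI functions are set up so that Series.ai is available.
    """
    result = frame.copy()
    # Passing a Series instead of a single string requests row-by-row comparison.
    result["similarity"] = result["name"].ai.similarity(result["topic"])
    return result


# In a Fabric notebook: display(add_pairwise_similarity(df))
```

Working on a copy leaves the input DataFrame unchanged, which makes it easier to rerun the comparison with a different column.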

Use ai.similarity with PySpark

The ai.similarity function is also available for Spark DataFrames. You must specify the name of an existing input column as a parameter. You must also specify a single common text value for comparisons, or the name of another column for pairwise comparisons.

The function returns a new DataFrame with an output column that contains a similarity score for each row of input text.

Syntax

df.ai.similarity(input_col="col1", other="value", output_col="similarity")

Parameters

| Name | Description |
|---|---|
| input_col (Required) | A string that contains the name of an existing column with input text values to use for computing similarity scores. |
| other or other_col (Required) | Only one of these parameters is required. The other parameter is a string that contains a single common text value used to compute similarity scores for each row of input. The other_col parameter is a string that designates the name of a second existing column, with text values used to compute pairwise similarity scores. |
| output_col (Optional) | A string that contains the name of a new column to store calculated similarity scores for each input text row. If you don't set this parameter, a default name is generated for the output column. |
| error_col (Optional) | A string that contains the name of a new column that stores any OpenAI errors that result from processing each input text row. If you don't set this parameter, a default name is generated for the error column. If an input row has no errors, this column has a null value. |

Returns

The function returns a Spark DataFrame that includes a new column that contains generated similarity scores for each input text row. The output similarity scores are relative, and they're best used for ranking. Score values can range from -1 (opposites) to 1 (identical). A score of 0 indicates that the values are unrelated in meaning.
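Since the error column is null for rows that processed cleanly, one way to review failures is to split the returned DataFrame on that column. The following is a sketch only. The split_by_error helper and the similarity_error column name are hypothetical, and it assumes you passed that name as error_col when calling ai.similarity.

```python
def split_by_error(scored_df, error_col="similarity_error"):
    """Split a scored Spark DataFrame into (succeeded, failed) rows.

    Assumes error_col matches the name passed as error_col to ai.similarity.
    """
    # Rows with a non-null error value failed during processing.
    failed = scored_df.filter(scored_df[error_col].isNotNull())
    # Rows with a null error value were scored without errors.
    succeeded = scored_df.filter(scored_df[error_col].isNull())
    return succeeded, failed


# In a Fabric notebook, with `scored` returned by df.ai.similarity(...):
# succeeded, failed = split_by_error(scored)
# display(failed)
```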

Example

# This code uses AI. Always review output for mistakes. 
# Read terms: https://azure.microsoft.com/support/legal/preview-supplemental-terms/.

df = spark.createDataFrame([
        ("Bill Gates",), 
        ("Satya Nadella",),
        ("Joan of Arc",) 
    ], ["names"])

similarity = df.ai.similarity(input_col="names", other="Microsoft", output_col="similarity")
display(similarity)
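For pairwise comparisons, the parameters table describes other_col as the name of a second existing column. The following sketch wraps that call in a small helper. It assumes a Fabric notebook where the ai accessor is available on Spark DataFrames. The helper name, the topic column, and the similarity_error column name are hypothetical.

```python
def add_pairwise_similarity_spark(df, input_col, other_col):
    """Return df with pairwise similarity scores between two text columns.

    Assumes the Fabric AI functions are set up so that df.ai is available.
    """
    return df.ai.similarity(
        input_col=input_col,
        other_col=other_col,  # name of the second column, not a literal text value
        output_col="similarity",
        error_col="similarity_error",  # hypothetical name for the error column
    )


# In a Fabric notebook with an active `spark` session:
# pairs = spark.createDataFrame([
#         ("Bill Gates", "Microsoft"),
#         ("Satya Nadella", "Microsoft"),
#         ("Joan of Arc", "France"),
#     ], ["names", "topic"])
# display(add_pairwise_similarity_spark(pairs, "names", "topic"))
```

Naming the output and error columns explicitly avoids relying on generated default names when later steps need to reference them.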