Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Applies to:
Databricks SQL
Databricks Runtime 10.4 LTS and above
Returns the top k most frequently occurring item values in an expr along with their approximate counts.
Syntax
approx_top_k(expr[, k[, maxItemsTracked]]) [FILTER ( WHERE cond ) ]
This function can also be invoked as a window function using the OVER clause.
Arguments
expr: An expression of STRING, BOOLEAN, DATE, TIMESTAMP, or numeric type.k: An optional INTEGER literal greater than 0. Ifkis not specified, it defaults to5.maxItemsTracked: An optional INTEGER literal greater than or equal tok. IfmaxItemsTrackedis not specified, it defaults to10000.cond: An optional boolean expression filtering the rows used for aggregation.
Returns
Results are returned as an ARRAY of type STRUCT, where each STRUCT contains an item field for the value (with its
original input type) and a count field (of type LONG) with the approximate number of occurrences. The array is sorted
by count descending.
The aggregate function returns the top k most frequently occurring item values in an expression expr along with
their approximate counts. The error in each count may be up to 2.0 * numRows / maxItemsTracked where numRows is the
total number of rows. Higher values of maxItemsTracked provide better accuracy at the cost of increased memory usage.
Expressions that have fewer than maxItemsTracked distinct items will yield exact item counts. Results include NULL
values as their own item in the results.
Examples
> SELECT approx_top_k(expr) FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) AS tab(expr);
[{'item':4,'count':2},{'item':1,'count':2},{'item':0,'count':2},{'item':3,'count':1},{'item':2,'count':1}]
> SELECT approx_top_k(expr, 2) FROM VALUES 'a', 'b', 'c', 'c', 'c', 'c', 'd', 'd' AS tab(expr);
[{'item':'c','count',4},{'item':'d','count':2}]
> SELECT approx_top_k(expr, 10, 100) FROM VALUES (0), (1), (1), (2), (2), (2) AS tab(expr);
[{'item':2,'count':3},{'item':1,'count':2},{'item':0,'count':1}]