Aggregate Functions
Aggregate functions refer to statistical calculations performed on a set of values, returning a single numerical result. This article first prepares a table, and the following function examples will use this table for calculation results.
f1 | f2 | f3 |
---|---|---|
10 | 20 | 30 |
20 | 30 | 40 |
30 | null | 50 |
40 | 50 | 60 |
50 | 60 | 70 |
60 | 70 | null |
avg
Syntax | avg(N) |
---|---|
Parameter | Numeric type parameter, optional type: number |
Return Type | Number |
Description | Returns the average value of the specified numeric column |
Example | avg({f1}), returns 35 |
corr
Syntax | corr(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | Returns the correlation coefficient between two numeric columns |
Example | (arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(1000,1001)], corr({arg1}, {arg2}), returns 1 |
count
Syntax | count(ARG) |
---|---|
Parameter | The parameter to be calculated; optional types: any type |
Return Type | Number |
Description | Returns the number of values in the specified column |
Example | count({f2}), returns: 5 (null values are not counted) |
covar_pop
Syntax | covar_pop(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | Returns the population covariance of two numeric columns |
Example | (arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(1000,1001)], covar_pop({arg1}, {arg2}) returns 138058.47222222222 |
covar_samp
Syntax | covar_samp(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | Returns the sample covariance of two numeric columns |
Example | (arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(1000,1001)], covar_samp(arg1, arg2) returns 165670.16666666666 |
distinct_count
Syntax | distinct_count(ARG) |
---|---|
Parameter | The parameter to be calculated; supported types: any type |
Return Type | Number |
Description | Returns the number of unique (non-duplicate) results in the specified column |
Example | distinct_count({f2}), returns: 5 |
Unsupported Data Sources | None |
first
Syntax | first(ARG) |
---|---|
Parameter | The parameter to be calculated, optional type: ANY |
Return Type | ANY |
Description | Retrieves the value of the first record in the specified column; the value is not fixed |
Example | first({f1}), returns: 10 |
last
Syntax | last(ARG) |
---|---|
Parameter | The parameter to be calculated; optional types: any type |
Return Type | Any type |
Description | Retrieves the value of the last record in the specified column. The value is not fixed. |
Example | last({f1}), returns: 60 |
list_collect
Syntax | list_collect(ARG) |
---|---|
Parameter | The parameter to be calculated; optional types: any |
Return Type | Array |
Description | Aggregate function that collects the values of arg into an array. (The return format is an array type. If you see a null value, try converting it to a string to view it.) |
Example | list_collect({f1}), returns: an array containing all values of the f1 field |
list_collect_flatten
Syntax | list_collect_flatten(ARR) |
---|---|
Parameter | Array type parameter, optional type: array |
Return Type | Array |
Description | Aggregate function that collects and flattens the parameters of arr into an array. (The return format is an array. If you see a null value, try converting it to a string to view it.) |
Example | list_collect_flatten({f1}), returns: extracts all elements from the f1 array and combines them into a new array. |
max
Syntax | max(ARG) |
---|---|
Parameter | Parameter to be calculated, optional types: any type |
Return Type | Any type |
Description | Returns the maximum value of the specified numeric column |
Example | max({f1}), returns: 60 |
Unsupported Data Sources | None |
max_by
Syntax | max_by(ARG, COMPARE) |
---|---|
Parameter Description | 1. Parameter to be calculated, optional types: any type; 2. Comparison value, optional types: any type |
Return Type | Any type |
Description | Returns the value of the arg column corresponding to the maximum value in the compare column. If there are multiple maximum values, one is selected at random. |
Example | max_by({f1}, if(isnull({f2}), -1, {f2})), returns: 60 |
median
Syntax | median(ARG) |
---|---|
Parameter | Numeric parameter, optional type: number |
Return Type | Number |
Description | Returns the median of the specified numeric column |
Example | median({f1}), returns: 35 |
min
Syntax | min(ARG) |
---|---|
Parameter | The parameter to be calculated; optional types: any type |
Return Type | Any type |
Description | Returns the minimum value of the specified numeric column |
Example | min({f1}), returns: 10 |
min_by
Syntax | min_by(ARG, COMPARE) |
---|---|
Parameter Description | 1. Parameter to be calculated, optional type: any type; 2. Comparison value, optional type: any type |
Return Type | Any type |
Description | Returns the value of the arg column corresponding to the row with the minimum value in the compare column. If there are multiple minimum values, one is selected at random. |
Example | min_by({f1},{f2}), returns: 10 |
mode
Syntax | mode() within group(ORDER BY expr2 [ DESC ] ) |
---|---|
Parameter Description | expr2: Expression used for sorting, optional types: ANY; DESC: Optional parameter, specifies descending order, default is ascending order |
Return Type | ANY |
Description | Returns the value that appears most frequently; if there is a tie, returns the first one |
Example | For the data [1, 2, 2, 3, 3], mode() within group(ORDER BY value) returns: 2 (2 and 3 have the same frequency, returns the first 2) |
nth
Syntax | nth(ARG, I) |
---|---|
Parameter Description | 1. Parameter to be calculated, optional type: any type; 2. Integer parameter, optional type: number |
Return Type | Any type |
Description | Retrieves the i-th value |
Example | nth({f1}, 2), returns: 30 (index starts from 1) |
percentile
Syntax | percentile(ARG, P) |
---|---|
Parameter Description | 1. Numeric type parameter, allowed type: number; 2. Decimal between 0 and 1, allowed type: number |
Return Type | Number |
Description | Calculates the percentile |
Example | percentile({f1}, 50), returns: 30 |
regr_avgx
Syntax | regr_avgx(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | Returns the average value of the independent variable (sum(ARG2)/N) |
Example | regr_avgx({f1},{f2}), returns: 46 |
regr_avgy
Syntax | regr_avgy(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | Returns the average value of the dependent variable (sum(ARG1)/N) |
Example | regr_avgy({f1},{f2}), returns: 36 |
regr_count
Syntax | regr_count(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | The number of input rows where both expressions are not null |
Example | regr_count({f1},{f2}), returns: 5 |
regr_intercept
Syntax | regr_intercept(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | The intercept of the linear equation of the least squares fit determined by the (arg2, arg1) pairs for arg1 |
Example | (arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(1000,1)], regr_intercept(arg1, arg2) returns 665.666666666666 |
regr_r2
Syntax | regr_r2(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | Square of the correlation coefficient |
Example | (arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(10,1)], regr_r2({arg1}, {arg2}) = 0.06323185011709602 |
regr_slope
Syntax | regr_slope(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | The slope of the least-squares fit linear equation determined by the (arg2, arg1) pairs |
Example | (arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(10,1)], regr_slope(arg1, arg2) = -0.42857142857142855 |
regr_sxx
Syntax | regr_sxx(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | sum(arg2^2) - sum(arg2)^2/N (the "sum of squares" of the independent variable) |
Example | (arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(10,1)], regr_sxx(arg1, arg2) = 17.5 |
regr_sxy
Syntax | regr_sxy(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | sum(arg2 * arg1) - sum(arg2) * sum(arg1)/N (the "sum of products" of the independent variable and the dependent variable) |
Example | (arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(10,1)], regr_sxy(arg1, arg2) = -7.5 |
regr_syy
Syntax | regr_syy(ARG1, ARG2) |
---|---|
Parameter Description | 1. ARG1, optional type: number; 2. ARG2, optional type: number |
Return Type | Number |
Description | sum(arg1^2) - sum(arg1)^2/N (the "sum of squares" of the dependent variable) |
Example | (arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(10,1)], regr_syy(arg1, arg2) = 50.833333333333336 |
set_collect
Syntax | set_collect(ARG) |
---|---|
Parameter | The parameter to be calculated; optional types: any type |
Return Type | Array |
Description | Aggregate function that collects the values of arg into a deduplicated array. (The return format is an array. If you see null displayed, try converting it to a string to view the result.) |
Example | set_collect([1, 2, 2, 3]), returns: [1, 2, 3] |
set_collect_flatten
Syntax | set_collect_flatten(ARR) |
---|---|
Parameter | Array type parameter, optional type: array |
Return Type | Array |
Description | Aggregate function that collects and flattens the parameters of arr into a deduplicated array. (The return format is an array. If you see a null value, try converting it to a string to view it.) |
Example | set_collect_flatten([[1, 2], [2, 3]]) returns: [1, 2, 3] |
stddev_pop
Syntax | stddev_pop(N) |
---|---|
Parameter | Numeric parameter, allowed type: number |
Return Type | Number |
Description | Returns the population standard deviation of a numeric column |
Example | n = [(1000), (1001), (1003), (1004), (1005), (1000)], stddev_pop(n) = 1.9507833184532709 |
stddev_samp
Syntax | stddev_samp(N) |
---|---|
Parameter | Numeric parameter, optional type: number |
Return Type | Number |
Description | Returns the sample standard deviation of a numeric column |
Example | n = [(1000),(1001),(1003),(1004),(1005),(1000)], stddev_samp(n) = 2.1369760566432809 |
string_agg
Syntax | string_agg(S, DELIMITER) |
---|---|
Parameter Description | 1. S: The field to be concatenated, optional types: any type; 2. DELIMITER: The delimiter, optional type: string |
Return Type | String |
Description | Concatenates the field S into a string, separated by DELIMITER. For Oracle, the within group clause is required. |
Example | string_agg(['a', 'b', 'c'], ','), returns: 'a,b,c' |
sum
Syntax | sum(N) |
---|---|
Parameter | Numeric parameter, allowed type: number |
Return Type | Number |
Description | Returns the sum of the specified numeric column |
Example | sum({f1}), returns: 210 |
var_pop
Syntax | var_pop(N) |
---|---|
Parameter | Numeric parameter, accepted type: number |
Return Type | Number |
Description | Returns the population variance of a numeric column |
Example | n = [(1000), (1001), (1003), (1004), (1005), (1000)], var_pop(n) = 3.8055555555555554 (Note: The original example description was incorrect; this is the actual population variance) |
var_samp
Syntax | var_samp(N) |
---|---|
Parameter | Numeric parameter, optional type: number |
Return Type | Number |
Description | Returns the sample variance of a numeric column (the square of the sample standard deviation) |
Example | n = [(1000), (1001), (1003), (1004), (1005), (1000)], var_samp(n) = 4.5666666666666667 |
Adding Filter Conditions
It is supported to use the filter clause after an aggregate calculation expression to achieve filtering effects. The syntax is as follows, where filter_clause is the filter condition:
aggregate_name(expression ...) FILTER (WHERE filter_clause )
- Example 1:
sum({f1}) FILTER (WHERE {f1} > 10)
This means summing the data where the f1 field is greater than 10.
- Example 2:
avg({f1}) FILTER (WHERE {f1} > 10 AND {f2} = 'abc')
This means averaging the data where the f1 field is greater than 10 and the f2 field equals 'abc'.
Adding Sorting Clauses
Adding sorting clauses allows you to sort the aggregation results. The functions that support sorting clauses are: first, last, mode, percentile, string_agg. The syntax is as follows, where order_clause is the sorting condition.
aggregate_name(expression ...) within group(ORDER BY order_clause)
- Example 1:
first({f1}) within group(ORDER BY {f2})
This means sorting the f1 field by the f2 field and taking the first value of the f1 field in each group.
- Example 2:
last({f1}) within group(ORDER BY {f2})
This means sorting the f1 field by the f2 field and taking the last value of the f1 field in each group.
- Example 3:
percentile({f1}, 0.5) within group(ORDER BY {f2})
This means sorting the f1 field by the f2 field and taking the value of the f1 field at the 50% position in each group.