Skip to content

Aggregate Functions

Aggregate functions refer to statistical calculations performed on a set of values, returning a single numerical result. This article first prepares a table, and the following function examples will use this table for calculation results.

f1f2f3
102030
203040
30null50
405060
506070
6070null

avg

Syntaxavg(N)
ParameterNumeric type parameter, optional type: number
Return TypeNumber
DescriptionReturns the average value of the specified numeric column
Exampleavg({f1}), returns 35

corr

Syntaxcorr(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
DescriptionReturns the correlation coefficient between two numeric columns
Example(arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(1000,1001)], corr({arg1}, {arg2}), returns 1

count

Syntaxcount(ARG)
ParameterThe parameter to be calculated; optional types: any type
Return TypeNumber
DescriptionReturns the number of values in the specified column
Examplecount({f2}), returns: 5 (null values are not counted)

covar_pop

Syntaxcovar_pop(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
DescriptionReturns the population covariance of two numeric columns
Example(arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(1000,1001)], covar_pop({arg1}, {arg2}) returns 138058.47222222222

covar_samp

Syntaxcovar_samp(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
DescriptionReturns the sample covariance of two numeric columns
Example(arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(1000,1001)], covar_samp(arg1, arg2) returns 165670.16666666666

distinct_count

Syntaxdistinct_count(ARG)
ParameterThe parameter to be calculated; supported types: any type
Return TypeNumber
DescriptionReturns the number of unique (non-duplicate) results in the specified column
Exampledistinct_count({f2}), returns: 5
Unsupported Data SourcesNone

first

Syntaxfirst(ARG)
ParameterThe parameter to be calculated, optional type: ANY
Return TypeANY
DescriptionRetrieves the value of the first record in the specified column; the value is not fixed
Examplefirst({f1}), returns: 10

last

Syntaxlast(ARG)
ParameterThe parameter to be calculated; optional types: any type
Return TypeAny type
DescriptionRetrieves the value of the last record in the specified column. The value is not fixed.
Examplelast({f1}), returns: 60

list_collect

Syntaxlist_collect(ARG)
ParameterThe parameter to be calculated; optional types: any
Return TypeArray
DescriptionAggregate function that collects the values of arg into an array. (The return format is an array type. If you see a null value, try converting it to a string to view it.)
Examplelist_collect({f1}), returns: an array containing all values of the f1 field

list_collect_flatten

Syntaxlist_collect_flatten(ARR)
ParameterArray type parameter, optional type: array
Return TypeArray
DescriptionAggregate function that collects and flattens the parameters of arr into an array. (The return format is an array. If you see a null value, try converting it to a string to view it.)
Examplelist_collect_flatten({f1}), returns: extracts all elements from the f1 array and combines them into a new array.

max

Syntaxmax(ARG)
ParameterParameter to be calculated, optional types: any type
Return TypeAny type
DescriptionReturns the maximum value of the specified numeric column
Examplemax({f1}), returns: 60
Unsupported Data SourcesNone

max_by

Syntaxmax_by(ARG, COMPARE)
Parameter Description1. Parameter to be calculated, optional types: any type;
2. Comparison value, optional types: any type
Return TypeAny type
DescriptionReturns the value of the arg column corresponding to the maximum value in the compare column. If there are multiple maximum values, one is selected at random.
Examplemax_by({f1}, if(isnull({f2}), -1, {f2})), returns: 60

median

Syntaxmedian(ARG)
ParameterNumeric parameter, optional type: number
Return TypeNumber
DescriptionReturns the median of the specified numeric column
Examplemedian({f1}), returns: 35

min

Syntaxmin(ARG)
ParameterThe parameter to be calculated; optional types: any type
Return TypeAny type
DescriptionReturns the minimum value of the specified numeric column
Examplemin({f1}), returns: 10

min_by

Syntaxmin_by(ARG, COMPARE)
Parameter Description1. Parameter to be calculated, optional type: any type;
2. Comparison value, optional type: any type
Return TypeAny type
DescriptionReturns the value of the arg column corresponding to the row with the minimum value in the compare column. If there are multiple minimum values, one is selected at random.
Examplemin_by({f1},{f2}), returns: 10

mode

Syntaxmode() within group(ORDER BY expr2 [ DESC ] )
Parameter Descriptionexpr2: Expression used for sorting, optional types: ANY;
DESC: Optional parameter, specifies descending order, default is ascending order
Return TypeANY
DescriptionReturns the value that appears most frequently; if there is a tie, returns the first one
ExampleFor the data [1, 2, 2, 3, 3], mode() within group(ORDER BY value) returns: 2 (2 and 3 have the same frequency, returns the first 2)

nth

Syntaxnth(ARG, I)
Parameter Description1. Parameter to be calculated, optional type: any type;
2. Integer parameter, optional type: number
Return TypeAny type
DescriptionRetrieves the i-th value
Examplenth({f1}, 2), returns: 30 (index starts from 1)

percentile

Syntaxpercentile(ARG, P)
Parameter Description1. Numeric type parameter, allowed type: number;
2. Decimal between 0 and 1, allowed type: number
Return TypeNumber
DescriptionCalculates the percentile
Examplepercentile({f1}, 50), returns: 30

regr_avgx

Syntaxregr_avgx(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
DescriptionReturns the average value of the independent variable (sum(ARG2)/N)
Exampleregr_avgx({f1},{f2}), returns: 46

regr_avgy

Syntaxregr_avgy(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
DescriptionReturns the average value of the dependent variable (sum(ARG1)/N)
Exampleregr_avgy({f1},{f2}), returns: 36

regr_count

Syntaxregr_count(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
DescriptionThe number of input rows where both expressions are not null
Exampleregr_count({f1},{f2}), returns: 5

regr_intercept

Syntaxregr_intercept(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
DescriptionThe intercept of the linear equation of the least squares fit determined by the (arg2, arg1) pairs for arg1
Example(arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(1000,1)], regr_intercept(arg1, arg2) returns 665.666666666666

regr_r2

Syntaxregr_r2(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
DescriptionSquare of the correlation coefficient
Example(arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(10,1)], regr_r2({arg1}, {arg2}) = 0.06323185011709602

regr_slope

Syntaxregr_slope(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
DescriptionThe slope of the least-squares fit linear equation determined by the (arg2, arg1) pairs
Example(arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(10,1)], regr_slope(arg1, arg2) = -0.42857142857142855

regr_sxx

Syntaxregr_sxx(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
Descriptionsum(arg2^2) - sum(arg2)^2/N (the "sum of squares" of the independent variable)
Example(arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(10,1)], regr_sxx(arg1, arg2) = 17.5

regr_sxy

Syntaxregr_sxy(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
Descriptionsum(arg2 * arg1) - sum(arg2) * sum(arg1)/N (the "sum of products" of the independent variable and the dependent variable)
Example(arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(10,1)], regr_sxy(arg1, arg2) = -7.5

regr_syy

Syntaxregr_syy(ARG1, ARG2)
Parameter Description1. ARG1, optional type: number;
2. ARG2, optional type: number
Return TypeNumber
Descriptionsum(arg1^2) - sum(arg1)^2/N (the "sum of squares" of the dependent variable)
Example(arg1, arg2) = [(1,2),(2,3),(3,4),(4,5),(5,6),(10,1)], regr_syy(arg1, arg2) = 50.833333333333336

set_collect

Syntaxset_collect(ARG)
ParameterThe parameter to be calculated; optional types: any type
Return TypeArray
DescriptionAggregate function that collects the values of arg into a deduplicated array. (The return format is an array. If you see null displayed, try converting it to a string to view the result.)
Exampleset_collect([1, 2, 2, 3]), returns: [1, 2, 3]

set_collect_flatten

Syntaxset_collect_flatten(ARR)
ParameterArray type parameter, optional type: array
Return TypeArray
DescriptionAggregate function that collects and flattens the parameters of arr into a deduplicated array. (The return format is an array. If you see a null value, try converting it to a string to view it.)
Exampleset_collect_flatten([[1, 2], [2, 3]]) returns: [1, 2, 3]

stddev_pop

Syntaxstddev_pop(N)
ParameterNumeric parameter, allowed type: number
Return TypeNumber
DescriptionReturns the population standard deviation of a numeric column
Examplen = [(1000), (1001), (1003), (1004), (1005), (1000)], stddev_pop(n) = 1.9507833184532709

stddev_samp

Syntaxstddev_samp(N)
ParameterNumeric parameter, optional type: number
Return TypeNumber
DescriptionReturns the sample standard deviation of a numeric column
Examplen = [(1000),(1001),(1003),(1004),(1005),(1000)], stddev_samp(n) = 2.1369760566432809

string_agg

Syntaxstring_agg(S, DELIMITER)
Parameter Description1. S: The field to be concatenated, optional types: any type;
2. DELIMITER: The delimiter, optional type: string
Return TypeString
DescriptionConcatenates the field S into a string, separated by DELIMITER. For Oracle, the within group clause is required.
Examplestring_agg(['a', 'b', 'c'], ','), returns: 'a,b,c'

sum

Syntaxsum(N)
ParameterNumeric parameter, allowed type: number
Return TypeNumber
DescriptionReturns the sum of the specified numeric column
Examplesum({f1}), returns: 210

var_pop

Syntaxvar_pop(N)
ParameterNumeric parameter, accepted type: number
Return TypeNumber
DescriptionReturns the population variance of a numeric column
Examplen = [(1000), (1001), (1003), (1004), (1005), (1000)], var_pop(n) = 3.8055555555555554 (Note: The original example description was incorrect; this is the actual population variance)

var_samp

Syntaxvar_samp(N)
ParameterNumeric parameter, optional type: number
Return TypeNumber
DescriptionReturns the sample variance of a numeric column (the square of the sample standard deviation)
Examplen = [(1000), (1001), (1003), (1004), (1005), (1000)], var_samp(n) = 4.5666666666666667

Adding Filter Conditions

It is supported to use the filter clause after an aggregate calculation expression to achieve filtering effects. The syntax is as follows, where filter_clause is the filter condition:

aggregate_name(expression ...) FILTER (WHERE filter_clause )
  • Example 1:
sum({f1}) FILTER (WHERE {f1} > 10)

This means summing the data where the f1 field is greater than 10.

  • Example 2:
avg({f1}) FILTER (WHERE {f1} > 10 AND {f2} = 'abc')

This means averaging the data where the f1 field is greater than 10 and the f2 field equals 'abc'.

Adding Sorting Clauses

Adding sorting clauses allows you to sort the aggregation results. The functions that support sorting clauses are: first, last, mode, percentile, string_agg. The syntax is as follows, where order_clause is the sorting condition.

aggregate_name(expression ...) within group(ORDER BY order_clause)
  • Example 1:
first({f1}) within group(ORDER BY {f2})

This means sorting the f1 field by the f2 field and taking the first value of the f1 field in each group.

  • Example 2:
last({f1}) within group(ORDER BY {f2})

This means sorting the f1 field by the f2 field and taking the last value of the f1 field in each group.

  • Example 3:
percentile({f1}, 0.5) within group(ORDER BY {f2})

This means sorting the f1 field by the f2 field and taking the value of the f1 field at the 50% position in each group.

User Manual for Hengshi Analysis Platform