Apache Spark

SELECT Statements

A SELECT statement can consist of the following basic clauses.

  • SELECT

  • INTO

  • FROM

  • JOIN

  • WHERE

  • GROUP BY

  • HAVING

  • UNION

  • ORDER BY

  • LIMIT

SELECT Syntax

The following syntax diagram outlines the syntax supported by the SQL engine of the provider:

SELECT {

[ TOP <numeric_literal> | DISTINCT ]

{

*

| {

<expression> [ [ AS ] <column_reference> ]

| { <table_name> | <correlation_name> } .*

} [ , ... ]

}

[ INTO csv:// [ filename= ] <file_path> [ ;delimiter=tab ] ]

{

FROM <table_reference> [ [ AS ] <identifier> ]

} [ , ... ]

[ [

INNER | { { LEFT | RIGHT | FULL } [ OUTER ] }

] JOIN <table_reference> [ ON <search_condition> ] [ [ AS ] <identifier> ]

] [ ... ]

[ WHERE <search_condition> ]

[ GROUP BY <column_reference> [ , ... ]

[ HAVING <search_condition> ]

[ UNION [ ALL ] <select_statement> ]

[

ORDER BY

<column_reference> [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ]

]

[

LIMIT <expression>

[

{ OFFSET | , }

<expression>

]

]

} | SCOPE_IDENTITY()

<expression> ::=

| <column_reference>

| @ <parameter>

| ?

| COUNT( * | { [ DISTINCT ] <expression> } )

| { AVG | MAX | MIN | SUM | COUNT } ( <expression> )

| NULLIF ( <expression> , <expression> )

| COALESCE ( <expression> , ... )

| CASE <expression>

WHEN { <expression> | <search_condition> } THEN { <expression> | NULL } [ ... ]

[ ELSE { <expression> | NULL } ]

END

| <literal>

| <sql_function>

<search_condition> ::=

{

<expression> { = | > | < | >= | <= | <> | != | LIKE | NOT LIKE | IN | NOT IN | IS NULL | IS NOT NULL | AND | OR | CONTAINS | BETWEEN } [ <expression> ]

} [ { AND | OR } ... ]

Examples

  1. Return all columns:

    SELECT * FROM Customers

  2. Rename a column:

    SELECT [CompanyName] AS MY_CompanyName FROM Customers

  3. Cast a column's data as a different data type:

    SELECT CAST(Balance AS VARCHAR) AS Str_Balance FROM Customers

  4. Search data:

    SELECT * FROM Customers WHERE Country = 'US'

  5. Return the number of items matching the query criteria:

    SELECT COUNT(*) AS MyCount FROM Customers

  6. Return the number of unique items matching the query criteria:

    SELECT COUNT(DISTINCT CompanyName) FROM Customers

  7. Return the unique items matching the query criteria:

    SELECT DISTINCT CompanyName FROM Customers

  8. Summarize data:

    SELECT CompanyName, MAX(Balance) FROM Customers GROUP BY CompanyName

    See Aggregate Functions below for details.

  9. Retrieve data from multiple tables.

    SELECT Customers.ContactName, Orders.OrderDate FROM Customers, Orders WHERE Customers.CustomerId=Orders.CustomerId

    See JOIN Queries below for details.

  10. Sort a result set in ascending order:

    SELECT City, CompanyName FROM Customers ORDER BY CompanyName ASC

  11. Restrict a result set to the specified number of rows:

    SELECT City, CompanyName FROM Customers LIMIT 10

  12. Parameterize a query to pass in inputs at execution time. This enables you to create prepared statements and mitigate SQL injection attacks.

    SELECT * FROM Customers WHERE Country = @param

Aggregate Functions

COUNT

Returns the number of rows matching the query criteria.

SELECT COUNT(*) FROM Customers WHERE Country = 'US'

COUNT(DISTINCT)

Returns the number of distinct, non-null field values matching the query criteria.

SELECT COUNT(DISTINCT City) AS DistinctValues FROM Customers WHERE Country = 'US'

AVG

Returns the average of the column values.

SELECT CompanyName, AVG(Balance) FROM Customers WHERE Country = 'US' GROUP BY CompanyName

MIN

Returns the minimum column value.

SELECT MIN(Balance), CompanyName FROM Customers WHERE Country = 'US' GROUP BY CompanyName

MAX

Returns the maximum column value.

SELECT CompanyName, MAX(Balance) FROM Customers WHERE Country = 'US' GROUP BY CompanyName

SUM

Returns the total sum of the column values.

SELECT SUM(Balance) FROM Customers WHERE Country = 'US'

JOIN Queries

The Provider for Spark SQL supports standard SQL joins like the following examples.

Inner Join

An inner join selects only rows from both tables that match the join condition:

SELECT Customers.ContactName, Orders.OrderDate FROM Customers, Orders WHERE Customers.CustomerId=Orders.CustomerId

Left Join

A left join selects all rows in the FROM table and only matching rows in the JOIN table:

SELECT Customers.ContactName, Orders.OrderDate FROM Customers LEFT OUTER JOIN Orders ON Customers.CustomerId=Orders.CustomerId

Date Literal Functions

The following date literal functions can be used to filter date fields using relative intervals. Note that while the <, >, and = operators are supported for these functions, <= and >= are not.

L_TODAY()

The current day.

SELECT * FROM MyTable WHERE MyDateField = L_TODAY()

L_YESTERDAY()

The previous day.

SELECT * FROM MyTable WHERE MyDateField = L_YESTERDAY()

L_TOMORROW()

The following day.

SELECT * FROM MyTable WHERE MyDateField = L_TOMORROW()

L_LAST_WEEK()

Every day in the preceding week.

SELECT * FROM MyTable WHERE MyDateField = L_LAST_WEEK()

L_THIS_WEEK()

Every day in the current week.

SELECT * FROM MyTable WHERE MyDateField = L_THIS_WEEK()

L_NEXT_WEEK()

Every day in the following week.

SELECT * FROM MyTable WHERE MyDateField = L_NEXT_WEEK()

Also available:

  • L_LAST/L_THIS/L_NEXT MONTH

  • L_LAST/L_THIS/L_NEXT QUARTER

  • L_LAST/L_THIS/L_NEXT YEAR

L_LAST_N_DAYS(n)

The previous n days, excluding the current day.

SELECT * FROM MyTable WHERE MyDateField = L_LAST_N_DAYS(3)

L_NEXT_N_DAYS(n)

The following n days, including the current day.

SELECT * FROM MyTable WHERE MyDateField = L_NEXT_N_DAYS(3)

Also available:

  • L_LAST/L_NEXT_90_DAYS

L_LAST_N_WEEKS(n)

Every day in every week, starting n weeks before current week, and ending in the previous week.

SELECT * FROM MyTable WHERE MyDateField = L_LAST_N_WEEKS(3)

L_NEXT_N_WEEKS(n)

Every day in every week, starting the following week, and ending n weeks in the future.

SELECT * FROM MyTable WHERE MyDateField = L_NEXT_N_WEEKS(3)

Also available:

  • L_LAST/L_NEXT_N_MONTHS(n)

  • L_LAST/L_NEXT_N_QUARTERS(n)

  • L_LAST/L_NEXT_N_YEARS(n)

Projection Functions

ROUND(expr [, d])

Returns expr rounded to d decimal places using HALF_UP rounding mode.

  • expr: Any numeric expression.

  • d: The number of decimal places.

BROUND(expr [, d])

Returns expr rounded to d decimal places using HALF_EVEN rounding mode.

  • expr: Any numeric expression.

  • d: The number of decimal places.

FLOOR(expr)

Returns the largest integer not greater than expr.

  • expr: Any numeric expression.

CEIL(expr)

Returns the smallest integer not smaller than expr.

  • expr: Any numeric expression.

RAND([seed])

Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).

  • seed: The seed to use to generate the random value.

EXP(expr)

Returns e to the power of expr.

  • expr: Any numeric expression.

LN(expr)

Returns the natural logarithm (base e) of expr.

  • expr: Any numeric expression.

LOG10(expr)

Returns the logarithm of expr with base 10.

  • expr: Any numeric expression.

LOG2(expr)

Returns the logarithm of expr with base 2.

  • expr: Any numeric expression.

LOG(base, expr)

Returns the logarithm of expr with base.

  • base: A numeric expression to use as the base.

  • expr: Any numeric expression.

POW(expr1, expr2)

Raises expr1 to the power of expr2.

  • expr1: Any numeric expression.

  • expr2: Any numeric expression.

SQRT(expr)

Returns the square root of expr.

  • expr: Any numeric expression.

BIN(expr)

Returns the string representation of the long value expr represented in binary.

  • expr: A long expression.

HEX(expr)

Converts expr to hexadecimal.

  • expr: The expression to convert to hex.

UNHEX(expr)

Converts hexadecimal expr to binary.

  • expr: The hexadecimal value to convert to binary.

CONV(num, from_base, to_base)

Convert num from from_base to to_base.

  • num: The number to convert.

  • from_base: The original base of num.

  • to_base: The base to convert num to.

ABS(expr)

Returns the absolute value of the numeric value.

  • expr: Any valid numeric expression.

PMOD(expr1, expr2)

Returns the positive value of expr1 mod expr2.

  • expr1: Any valid numeric expression.

  • expr2: Any valid numeric expression.

SIN(expr)

Returns the sine of expr, as if computed by java.lang.Math.sin.

  • expr: Any valid numeric expression.

ASIN(expr)

Returns the inverse sine (a.k.a. arc sine) the arc sin of expr, as if computed by java.lang.Math.asin.

  • expr: Any valid numeric expression.

COS(expr)

Returns the cosine of expr, as if computed by java.lang.Math.cos.

  • expr: Any valid numeric expression.

ACOS(expr)

Returns the inverse cosine (a.k.a. arc cosine) of expr, as if computed by java.lang.Math.acos.

  • expr: Any valid numeric expression.

TAN(expr)

Returns the tangent of expr, as if computed by java.lang.Math.tan.

  • expr: Any valid numeric expression.

ATAN(expr)

Returns the inverse tangent (a.k.a. arc tangent) of expr, as if computed by java.lang.Math.atan

  • expr: Any valid numeric expression.

DEGREES(expr)

Converts radians to degrees.

  • expr: Any valid numeric expression.

RADIANS(expr)

Converts degrees to radians.

  • expr: Any valid numeric expression.

POSITIVE(expr)

Returns the postive value of expr.

  • expr: Any valid numeric expression.

NEGATIVE(expr)

Returns the negated value of expr.

  • expr: Any valid numeric expression.

SIGN(expr)

Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive.

  • expr: Any valid numeric expression.

E()

Returns Euler's number, e.

PI()

Returns pi.

FACTORIAL(expr)

Returns the factorial of expr. expr is [0..20]. Otherwise, null.

  • expr: A numeric expression.

CBRT(expr)

Returns the cube root of expr.

  • expr: Any valid numeric expression.

SHIFTELFT(base, shift)

Bitwise left shift.

  • base: The base number to shift.

  • shift: The number of bits to shift.

SHIFTRIGHT(base, shift)

Bitwise right shift.

  • base: The base number to shift.

  • shift: The number of bits to shift.

SHIFTRIGHTUNSIGNED(base, shift)

Bitwise unsigned right shift.

  • base: The base number to shift.

  • shift: The number of bits to shift.

GREATEST(expr1, expr2 [, expr3] [, ...])

Returns the greatest value of all parameters, skipping null values.

  • expr1: Any valid expression.

  • expr2: Any valid expression.

  • expr3: Any valid expression.

LEAST(expr1, expr2 [, expr3] [, ...])

Returns the least value of all parameters, skipping null values.

  • expr1: Any valid expression.

  • expr2: Any valid expression.

  • expr3: Any valid expression.

WIDTH_BUCKET(expr, min_value, max_value, num_buckets)

Returns an integer between 0 and num_buckets+1 by mapping expr into the ith equally sized bucket. Buckets are made by dividing [min_value, max_value] into equally sized regions. If expr < min_value, return 1, if expr > max_value return num_buckets+1.

  • expr: A valid numeric expression.

  • min_value: The minimum value.

  • max_value: The maximum value.

  • num_buckets: The number of buckets.

SIZE(expr)

Returns the size of an array or a map. Returns -1 if null.

  • expr: Any valid expression.

MAP_KEYS(map)

Returns an unordered array containing the keys of the map.

  • map: A valid map expression.

MAP_VALUES(map)

Returns an unordered array containing the values of the map.

  • map: A valid map expression.

ARRAY_CONTAINS(array, expr)

Returns true if the array contains the value.

  • array: The array to search.

  • expr: The expression to search for.

SORT_ARRAY(array [, ascendingOrder])

Sorts the input array in ascending or descending order according to the natural ordering of the array elements.

  • array: The array to sort.

  • order: Identifies whether to sort in ascending order.

BINARY(expr)

Casts the value expr to the target data type binary.

  • expr: The expression to cast.

CAST(expr AS type)

Casts the value expr to the target data type type.

  • expr: Any valid expression.

  • type: The type to cast expr to.

FROM_UNIXTIME(unixtime [, format])

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00".

  • unixtime: Unix time.

  • format: The format to convert unixtime to.

UNIX_TIMESTAMP([expr [, pattern]])

Returns the UNIX timestamp of the given time.

  • expr: The time string to convert.

  • format: The format of expr.

TO_DATE(date_str [, fmt])

Parses the date_str expression with the fmt expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the fmt is omitted.

  • date_str: The date string expression.

  • fmt: The format of date_str.

YEAR(date)

Returns the year component of the date/timestamp.

  • date: The date to extract the year from.

QUARTER(date)

Returns the quarter of the year for date, in the range 1 to 4.

  • date: The date to extract the quarter from.

MONTH(date)

Returns the month component of the date/timestamp.

  • date: The date to extract the month from.

DAY(date)

Returns the day of month of the date/timestamp.

  • date: The date to extract the day from.

HOUR(timestamp)

Returns the hour component of the string/timestamp.

  • timestamp: The timestamp to extract the hours from.

MINUTE(timestamp)

Returns the minute component of the string/timestamp.

  • timestamp: The timestamp to extract the minutes from.

SECOND(timestamp)

Returns the second component of the string/timestamp.

  • timestamp: The timestamp to extract the seconds from.

WEEKOFYEAR(date)

Returns the week of the year of the given date. A week is considered to start on a Monday and week 1 is the first week with >3 days.

  • date: The date to extract the week of the year from.

DATEDIFF(endDate, startDate)

Returns the number of days from startDate to endDate.

  • endDate: The end date.

  • startDate: The start date.

DATE_ADD(start_date, num_days)

Returns the date that is num_days after start_date.

  • start_date: The start date.

  • num_days: The number of days to add to start_date.

DATE_SUB(start_date, num_days)

Returns the date that is num_days before start_date.

  • start_date: The start date.

  • num_days: The number of days to subtract from start_date.

FROM_UTC_TIMESTAMP(timestamp, timezone)

Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.

  • timestamp: The UTC timestamp.

  • timezone: The timezone to convert to.

TO_UTC_TIMESTAMP(timestamp, timezone)

Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'.

  • timestamp: The timestamp to convert to UTC.

  • timezone: The timezone of timestamp.

CURRENT_DATE()

Returns the current date at the start of query evaluation.

CURRENT_TIMESTAMP()

Returns the current timestamp at the start of query evaluation.

ADD_MONTHS(start_date, num_months [, fmt])

Returns the date that is num_months after start_date.

  • start_date: The starting date.

  • num_months: The number of months to add.

  • fmt: The output format.

LAST_DAY(date)

Returns the last day of the month which the date belongs to.

  • date: A valid date expression.

NEXT_DAY(start_date, day_of_week)

Returns the first date which is later than start_date and named as indicated.

  • start_date: The start date.

  • day_of_week: The day of week.

TRUNC(date, time_unit)

Returns date with the time portion of the day truncated to the unit specified by the format model fmt. fmt should be one of ["year", "yyyy", "yy", "mon", "month", "mm"]

  • date: A valid date expression.

  • time_unit: The time unit.

MONTHS_BETWEEN(timestamp1, timestamp2)

Returns number of months between timestamp1 and timestamp2.

  • timestamp1: A valid timestamp expression.

  • timestamp2: A valid timestamp expression.

DATE_FORMAT(timestamp, fmt)

Converts timestamp to a value of string in the format specified by the date format fmt.

  • timestamp: A valid timestamp expression.

  • fmt: A valid date format.

IF(expr1, expr2, expr3)

If expr1 evaluates to true, then returns expr2; otherwise returns expr3.

  • expr1: An expression that should evaluate to a boolean value.

  • expr2: A valid expression.

  • expr3: A valid expression.

ISNULL(expr)

Returns true if expr is null, or false otherwise.

  • expr: A valid expression.

ISNOTNULL(expr)

Returns true if expr is not null, or false otherwise.

  • expr: A valid expression.

NVL(expr1, expr2)

Returns expr1 if it's not NaN, or expr2 otherwise.

  • expr1: A valid expression.

  • expr2: A valid expression.

COALESCE(expr1, expr2 [, expr3] [, ...])

Returns the first non-null argument if exists. Otherwise, null.

  • expr1: A valid expression.

  • expr2: A valid expression.

  • expr3: A valid expression.

NULLIF(expr1, expr2)

Returns null if expr1 equals to expr2, or expr1 otherwise.

  • expr1: A valid expression.

  • expr2: A valid expression.

ASSERT_TRUE(expr)

Throws an exception if expr is not true.

  • expr: A valid expression that evaluates to a boolean.

ASCII(str)

Returns the numeric value of the first character of str.

  • str: A string expression.

BASE64(bin)

Converts the argument from a binary bin to a base 64 string.

  • bin: A binary expression.

CHAR_LENGTH(str)

Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

  • str: A string expression.

CHR(expr)

Returns the ASCII character having the binary equivalent to expr. If n is larger than 256 the result is equivalent to chr(n % 256)

  • expr: A integer expression.

CONCAT(str1, str2 [, str3] [, ...])

Returns the string or bytes resulting from concatenating the strings or bytes passed in as parameters in order. For example, concat('foo', 'bar') results in 'foobar'. Note that this function can take any number of input strings.

  • str1: A valid string expression.

  • str2: A valid string expression.

  • str3: A valid string expression.

CONCAT_WS(sep [, exp1] [, ...])

Returns the concatenation of the strings separated by sep.

  • set: A string separator.

  • exp1: A valid expression.

DECODE(bin, charset)

Decodes the first argument using the second argument character set.

  • bin: The binary expression to decode.

  • charset: The charset to use to decode bin.

ELT(n, input1 [, input2] [, ...])

Returns the n-th input, e.g., returns input2 when n is 2.

  • n: A valid integer index.

  • input1: A valid string expression.

  • input3: A valid string expression.

ENCODE(str, charset)

Encodes the first argument using the second argument character set.

  • str: A string expression to encode.

  • charset: The charset to use to encode str.

FIELD(val1, val2 [, val3] [, ...])

Returns the index of val in the val1,val2,val3,... list or 0 if not found. For example field('world','say','hello','world') returns 3. All primitive types are supported, arguments are compared using str.equals(x). If val is NULL, the return value is 0.

  • val1: A valid expression.

  • val2: A valid expression.

  • val3: A valid expression.

FIND_IN_SET(str, str_array)

Returns the index (1-based) of the given string (str) in the comma-delimited list (str_array). Returns 0, if the string was not found or if the given string (str) contains a comma.

  • str: The string expression to search for.

  • str_array: A comma-delimited list of values.

FORMAT_NUMBER(expr1, expr2)

Formats the number expr1 like '#,###,###.##', rounded to expr2 decimal places. If expr2 is 0, the result has no decimal point or fractional part. This is supposed to function like MySQL's FORMAT.

  • expr1: A numeric expression to format.

  • expr2: The number of deciml places.

GET_JSON_OBJECT(json_txt, path)

Extracts a json object from path.

  • json_txt: JSON data.

  • path: The path to extract.

IN_FILE(str, filename)

Returns true if the string str appears as an entire line in filename.

  • str: The string to search for.

  • filename: The name of the file to search.

INSTR(str, substr)

Returns the (1-based) index of the first occurrence of substr in str.

  • str: A string expression.

  • substr: The string expression to search for.

LENGTH(expr)

Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

  • expr: A string expression.

LOCATE(substr, str [, pos])

Returns the position of the first occurrence of substr in str after position pos. The given pos and return value are 1-based.

  • substr: The string expression to search for.

  • str: The string expression to search in.

  • pos: The starting index.

LOWER(expr)

Returns str with all characters changed to lowercase.

  • expr: A string expression.

LPAD(str, len, pad_str)

Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters.

  • str: A string expression.

  • len: The length to pad.

  • pad_str: The pad string.

LTRIM(str)

Removes the leading space characters from str.

  • str: A string expression.

OCTET_LENGTH(expr)

Returns the byte length of expr or number of bytes in binary data.

  • expr: Any string expression.

PARSE_URL(url, partToExtract [, key])

Returns the specified part from the URL. For example, parse_url('http://facebook.com/path1/p.php?k1=v1#Ref1', 'HOST') returns 'facebook.com'. Also a value of a particular key in QUERY can be extracted by providing the key as the third argument, for example, parse_url('http://facebook.com/path1/p.php?k1=v1#Ref1', 'QUERY', 'k1') returns 'v1'.

  • url: A valid URL expression.

  • partToExtract: The URL part to extract. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.

  • key: The key.

PRINTF(strfmt [, obj1] [, ...])

Returns a formatted string from printf-style format strings.

  • strfmt: The string format.

  • obj1: The object to include in the formatted string.

REGEXP_EXTRACT(str, regexp [, idx])

Extracts a group that matches regexp.

  • str: A string expression.

  • regexp: A regular expression to search for.

  • idx: The starting index.

REGEXP_REPLACE(str, regexp, rep)

Replaces all substring of str that match regexp with rep.

  • str: A string expression.

  • regexp: A regular expression to search for.

  • rep: The string to replace.

REPEAT(str, n)

Returns the string which repeats the given string value n times.

  • str: The string expression to repeat.

  • n: The number of times to repeat str.

REPLACE(str, search [, replace])

Replaces all occurrences of search with replace. If search is not found in str, str is returned unchanged. If replace is not specified or is an empty string, nothing replaces the string that is removed from str.

  • str: A string expression.

  • search: The search string.

  • replace: A string expression to replace search values.

REVERSE(str)

Returns the reversed given string.

  • str: A string expression.

RPAD(str, len, pad_str)

Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters.

  • str: A string expression.

  • len: The length to pad.

  • pad_str: The pad string.

RTRIM(str)

Removes the trailing space characters from str.

  • str: A string expression.

SENTENCES(str [, lang, country])

Splits str into an array of array of words.

  • str: A string expression.

  • lang: The language of str.

  • country: The country of the specified language.

SPACE(n)

Returns a string consisting of n spaces.

  • n: The number of spaces.

SPLIT(str, regex)

Splits str around occurrences that match regex.

  • str: A string expression.

  • regex: The regular expression to match.

STR_TO_MAP(text [, pairDelim [, keyValueDelim]])

Creates a map after splitting the text into key/value pairs using delimiters. Default delimiters are ',' for pairDelim and ':' for keyValueDelim.

  • text: A string expression.

  • pairDelim: The pair delimiter.

  • keyValueDelim: The value delimiter.

SUBSTR(str, pos [, len])

Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

  • str: A string expression.

  • pos: The starting position.

  • len: The length of the string.

SUBSTRING_INDEX(str, delim, count)

Returns the substring from str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. The function substring_index performs a case-sensitive match when searching for delim.

  • str: A string expression.

  • delim: The delimiter.

  • count: Total number of occurrences.

TRANSLATE(input, from, to)

Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string.

  • input: A string expression.

  • from: A string expression.

  • to: A string expression.

TRIM(str)

Removes the leading and trailing space characters from str.

  • str: A string expression.

UNBASE64(str)

Converts the argument from a base 64 string str to a binary.

  • str: A string expression.

UPPER(str)

Returns str with all characters changed to uppercase.

  • str: A string expression.

INITCAP(str)

Returns str with the first letter of each word in uppercase. All other letters are in lowercase. Words are delimited by white space.

  • str: A string expression.

LEVENSHTEIN(str1, str2)

Returns the Levenshtein distance between the two given strings.

  • str1: A string expression.

  • str2: A string expression.

SOUNDEX(str)

Returns Soundex code of the string.

  • str: A string expression.

MASK(str [, upper [, lower [, number]]])

Returns a masked version of str. By default, upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example mask("abcd-EFGH-8765-4321") results in xxxx-XXXX-nnnn-nnnn. You can override the characters used in the mask by supplying additional arguments: the second argument controls the mask character for upper case letters, the third argument for lower case letters and the fourth argument for numbers. For example, mask("abcd-EFGH-8765-4321", "U", "l", "#") results in llll-UUUU-####-####.

  • str: The string to mask.

  • upper: The character to mask for uppercase letters.

  • lower: The character to mask for lowercase letters.

  • number: The character to mask for numbers.

MASK_FIRST_N(str [, n])

Returns a masked version of str with the first n values masked. Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example, mask_first_n("1234-5678-8765-4321", 4) results in nnnn-5678-8765-4321.

  • str: The string to mask.

  • n: The number of values to mask.

MASK_LAST_N(str [, n])

Returns a masked version of str with the last n values masked. Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example, mask_last_n("1234-5678-8765-4321", 4) results in 1234-5678-8765-nnnn.

  • str: The string to mask.

  • n: The number of values to mask.

MASK_SHOW_FIRST_N(str [, n])

Returns a masked version of str, showing the first n characters unmasked. Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example, mask_show_first_n("1234-5678-8765-4321", 4) results in 1234-nnnn-nnnn-nnnn.

  • str: The string to mask.

  • n: The number of values to mask.

MASK_SHOW_LAST_N(str [, n])

Returns a masked version of str, showing the last n characters unmasked. Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example, mask_show_last_n("1234-5678-8765-4321", 4) results in nnnn-nnnn-nnnn-4321.

  • str: The string to mask.

  • n: The number of values to mask.

MASK_HASH(str)

Returns a hashed value based on str. The hash is consistent and can be used to join masked values together across tables. This function returns null for non-string types.

  • str: The string to mask.

JAVA_METHOD(class, method [, arg1] [, ...])

Calls a method with reflection.

  • class: The class to call.

  • method: The method to call.

  • arg1: The argument to pass in.

REFLECT(class, method [, arg1] [, ...])

Calls a method with reflection.

  • class: The class to call.

  • method: The method to call.

  • arg1: The argument to pass in.

HASH(expr1 [, expr2] [, ...])

Returns a hash value of the arguments.

  • expr1: A valid expression.

  • expr2: A valid expression.

CURRENT_USER()

Returns current user name from the configured authenticator manager. Could be the same as the user provided when connecting, but with some authentication managers (for example HadoopDefaultAuthenticator) it could be different.

LOGGED_IN_USER()

Returns current user name from the session state. This is the username provided when connecting to Hive.

CURRENT_DATABASE()

Returns current database name.

SHA1(expr)

Returns a sha1 hash value as a hex string of the expr.

  • expr: A valid expression.

CRC32(expr)

Returns a cyclic redundancy check value of the expr as a bigint.

  • expr: A valid expression.

SHA2(expr, bitlength)

Returns a checksum of SHA-2 family as a hex string of expr. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256.

  • expr: A valid expression.

  • bitlength: The bit length.

AES_ENCRYPT(input, key)

Encrypt input using AES. Key lengths of 128, 192 or 256 bits can be used. 192 and 256 bits keys can be used if Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files are installed. If either argument is NULL or the key length is not one of the permitted values, the return value is NULL. Example: base64(aes_encrypt('ABC', '1234567890123456')) = 'y6Ss+zCYObpCbgfWfyNWTw=='.

  • input: The input value to encrypt.

  • key: The key to use when encrypting.

VERSION()

Returns the Hive version. The string contains 2 fields, the first being a build number and the second being a build hash. Example: "select version();" might return "2.1.0.2.5.0.0-1245 r027527b9c5ce1a3d7d0b6d2e6de2378fb0c39232". Actual results will depend on your build.

COUNT(DISTINCT expr1 [, expr2] [, ...])

Returns the number of rows for which the supplied expression(s) are unique and non-null.

  • expr1: A valid expression.

  • expr2: A valid expression.

SUM(expr)

Returns the sum calculated from values of a group.

  • expr: A valid expression.

SUM(DISTINCT expr)

Returns the sum calculated from distinct values of a group.

  • expr: A valid expression.

AVG(expr)

Returns the mean calculated from values of a group.

  • expr: A valid expression.

AVG(DISTINCT expr)

Returns the mean calculated from distinct values of a group.

  • expr: A valid expression.

MIN(expr)

Returns the minimum value of expr.

  • expr: A valid expression.

MAX(expr)

Returns the maximum value of expr.

  • expr: A valid expression.

VARIANCE(expr)

Returns the sample variance calculated from values of a group.

  • expr: A valid expression.

STDDEV_POP(expr)

Returns the population standard deviation calculated from values of a group.

  • expr: A valid expression.

STDDEV_SAMP(expr)

Returns the sample standard deviation calculated from values of a group.

  • expr: A valid expression.

COVAR_POP(expr1, expr2)

Returns the population covariance of a set of number pairs.

  • expr1: A valid expression.

  • expr2: A valid expression.

COVAR_SAMP(expr1, expr2)

Returns the sample covariance of a set of number pairs.

  • expr1: A valid expression.

  • expr2: A valid expression.

CORR(expr1, expr2)

Returns Pearson coefficient of correlation between a set of number pairs.

  • expr1: A valid expression.

  • expr2: A valid expression.

PERCENTILE(col, percentage [, accuracy])

Returns the exact percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be positive integral

  • col: A numeric expression.

  • percentage: The percentage.

  • accuracy: The accuracy to control approximation.

PERCENTILE_APPROX(col, percentage [, accuracy])

Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation. When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.

  • col: A numeric expression.

  • percentage: The percentage.

  • accuracy: The accuracy to control approximation.

COLLECT_SET(expr)

Collects and returns a set of unique elements.

  • expr: A valid expression.

COLLECT_LIST(expr)

Collects and returns a set of unique elements.

  • expr: A valid expression.

NTILE(n)

Divides the rows for each window partition into n buckets ranging from 1 to at most n.

  • n: The number of buckets.

EXPLODE(expr)

Separates the elements of array expr into multiple rows, or the elements of map expr into multiple rows and columns.

  • expr: A valid expression.

POSEXPLODE(expr)

Separates the elements of array expr into multiple rows with positions, or the elements of map expr into multiple rows and columns with positions.

  • expr: A valid expression.

INLINE(expr)

Explodes an array of structs into a table.

  • expr: A valid expression.

STACK(n, expr1 [, expr2 ] [, ...])

Separates expr1, ..., exprk into n rows.

  • n: The number of rows.

  • expr1: A valid expression.

  • expr2: A valid expression.

PARSE_URL_TUPLE(urlStr, p1 [, p2 ] [, ...])

Takes URL string and a set of n URL parts, and returns a tuple of n values. This is similar to the parse_url() UDF but can extract multiple parts at once out of a URL. Valid part names are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:[KEY].

  • urlStr: A valid URL string.

  • p1: A valid part name.

  • p2: A valid part name.

Predicate Functions

ROUND(expr [, d])

Returns expr rounded to d decimal places using HALF_UP rounding mode.

  • expr: Any numeric expression.

  • d: The number of decimal places.

BROUND(expr [, d])

Returns expr rounded to d decimal places using HALF_EVEN rounding mode.

  • expr: Any numeric expression.

  • d: The number of decimal places.

FLOOR(expr)

Returns the largest integer not greater than expr.

  • expr: Any numeric expression.

CEIL(expr)

Returns the smallest integer not smaller than expr.

  • expr: Any numeric expression.

RAND([seed])

Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).

  • seed: The seed to use to generate the random value.

EXP(expr)

Returns e to the power of expr.

  • expr: Any numeric expression.

LN(expr)

Returns the natural logarithm (base e) of expr.

  • expr: Any numeric expression.

LOG10(expr)

Returns the logarithm of expr with base 10.

  • expr: Any numeric expression.

LOG2(expr)

Returns the logarithm of expr with base 2.

  • expr: Any numeric expression.

LOG(base, expr)

Returns the logarithm of expr with base.

  • base: A numeric expression to use as the base.

  • expr: Any numeric expression.

POW(expr1, expr2)

Raises expr1 to the power of expr2.

  • expr1: Any numeric expression.

  • expr2: Any numeric expression.

SQRT(expr)

Returns the square root of expr.

  • expr: Any numeric expression.

BIN(expr)

Returns the string representation of the long value expr represented in binary.

  • expr: A long expression.

HEX(expr)

Converts expr to hexadecimal.

  • expr: The expression to convert to hex.

UNHEX(expr)

Converts hexadecimal expr to binary.

  • expr: The hexadecimal value to convert to binary.

CONV(num, from_base, to_base)

Convert num from from_base to to_base.

  • num: The number to convert.

  • from_base: The original base of num.

  • to_base: The base to convert num to.

ABS(expr)

Returns the absolute value of the numeric value.

  • expr: Any valid numeric expression.

PMOD(expr1, expr2)

Returns the positive value of expr1 mod expr2.

  • expr1: Any valid numeric expression.

  • expr2: Any valid numeric expression.

SIN(expr)

Returns the sine of expr, as if computed by java.lang.Math.sin.

  • expr: Any valid numeric expression.

ASIN(expr)

Returns the inverse sine (a.k.a. arc sine) the arc sin of expr, as if computed by java.lang.Math.asin.

  • expr: Any valid numeric expression.

COS(expr)

Returns the cosine of expr, as if computed by java.lang.Math.cos.

  • expr: Any valid numeric expression.

ACOS(expr)

Returns the inverse cosine (a.k.a. arc cosine) of expr, as if computed by java.lang.Math.acos.

  • expr: Any valid numeric expression.

TAN(expr)

Returns the tangent of expr, as if computed by java.lang.Math.tan.

  • expr: Any valid numeric expression.

ATAN(expr)

Returns the inverse tangent (a.k.a. arc tangent) of expr, as if computed by java.lang.Math.atan

  • expr: Any valid numeric expression.

DEGREES(expr)

Converts radians to degrees.

  • expr: Any valid numeric expression.

RADIANS(expr)

Converts degrees to radians.

  • expr: Any valid numeric expression.

POSITIVE(expr)

Returns the postive value of expr.

  • expr: Any valid numeric expression.

NEGATIVE(expr)

Returns the negated value of expr.

  • expr: Any valid numeric expression.

SIGN(expr)

Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive.

  • expr: Any valid numeric expression.

E()