Hive 0.13.0 から Hive 2.1.1 までの間に利用可能になったUDF

https://github.com/myui/hive-udf-backports

上記によりHive v0.13でも下記の関数が利用できるようになります。
なので、それぞれのサンプルを簡単に紹介しておこうと思います。
関数の説明については、公式ドキュメントからの引用です。https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Hive v1.1.0より
Return Type
Name
Description
Example
T
greatest(T v1, T v2, ...)
Returns the greatest value of the list of values. If null is contained, the result is always null.
SELECT greatest(1, 2, 3, 3)
=> 3
T
least(T v1, T v2, ...)
Returns the least value of the list of values. If null is contained, the result is always null.
SELECT least(1, 2, 3, 3)
=> 1
string
add_months(string start_date, int num_months)
Returns the date that is num_months after start_date
SELECT
add_months('2017-06-03',6) as add1,
add_months('2017-06-03 00:00:00',7),
add_months('2017-06-03 00:00:00',-1),
==>
2017-12-03
2018-01-03
2017-05-03
string
last_day(string date)
Returns the last day of the month which the date belongs to. date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'.
SELECT
last_day('2017-06-03'),
last_day('2017-06-03 00:00:00')
==>
2017-06-30
2017-06-30
string
initcap(string A)
Returns string, with the first letter of each word in uppercase, all other letters in lowercase. Words are delimited by whitespace
SELECT
initcap('abc'),
initcap('ABC'),
initcap('abc'),
initcap('ABC')
==>
Abc
Abc
Abc
Abc


Hive v1.2.0より
Return Type
Name
Description
Example
BIGINT
factorial(INT a)
Returns the factorial of a. Valid a is [0..20].

DOUBLE
cbrt(DOUBLE a)
Returns the cube root of a double value.

INT

BIGINT
shiftleft(TINYINT|SMALLINT|INT a, INT b)

shiftleft(BIGINT a, INT b)
Bitwise left shift. Shifts a b positions to the left.

Returns int for tinyint, smallint and int a. Returns bigint for bigint a.

INT

BIGINT

shiftright(TINYINT|SMALLINT|INT a, INT b)

shiftright(BIGINT a, INT b)

Bitwise right shift. Shifts a b positions to the right.

Returns int for tinyint, smallint and int a. Returns bigint for bigint a.


INT

BIGINT
shiftrightunsigned(TINYINT|SMALLINT|INT a, INT b),

shiftrightunsigned(BIGINT a, INT b)

Bitwise unsigned right shift. Shifts a b positions to the right.

Returns int for tinyint, smallint and int a. Returns bigint for bigint a.

date
current_date
Returns the current date at the start of query evaluation. All calls of current_date within the same query return the same value.
(TD_SCHEDULED_TIME is recommended in TreasureData.)

timestamp
current_timestamp
Returns the current timestamp at the start of query evaluation. All calls of current_timestamp within the same query return the same value.
(TD_SCHEDULED_TIME is recommended in TreasureData.)
SELECT current_timestamp()
==>
2017-06-04 01:52:17.292
string
next_day(string start_date, string day_of_week)
Returns the first date which is later than start_date and named as day_of_week. start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored.
SELECT
next_day('2017-06-10', 'MON'),
next_day('2017-06-10', 'TUE'),
next_day('2017-06-10', 'WED'),
next_day('2017-06-10', 'THU'),
next_day('2017-06-10', 'FRI'),
next_day('2017-06-10', 'SAT'),
next_day('2017-06-10', 'SUN')
==>
2017-06-12
2017-06-13
2017-06-14
2017-06-15
2017-06-16
2017-06-17
2017-06-11
string
trunc(string date, string format)
Returns date truncated to the unit specified by the format. Supported formats: MONTH/MON/MM, YEAR/YYYY/YY. If NOT supported format is used, the result is NULL.
SELECT
trunc('2017-08-05', 'MONTH'),
trunc('2017-08-05 03:00:00', 'YEAR'),
trunc('2016-03-01 03:00:00', 'DAY')
==>
2017-08-01
2017-01-01
NULL
double
months_between(date1, date2)
Returns number of months between dates date1 and date2. If date1 is later than date2, then the result is positive. If date1 is earlier than date2, then the result is negative. If date1 and date2 are either the same days of the month or both last days of months, then the result is always an integer. Otherwise the UDF calculates the fractional portion of the result based on a 31-day month and considers the difference in time components date1 and date2. date1 and date2 type can be date, timestamp or string in the format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss'. The result is rounded to 8 decimal places.
SELECT
months_between('2017-08-01', '2017-01-01'),
months_between('2017-01-01', '2017-08-05'),
months_between('2017-08-05', '2017-01-01'),
months_between('2017-01-01 00:00:00', '2016-01-01 03:00:00')
==>
7.0
-7.12903226
7.12903226
12.0
string
date_format(date/timestamp/string ts, string fmt)
Converts a date/timestamp/string to a value of string in the format specified by the date format fmt. Supported formats are Java SimpleDateFormat formats – https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The second argument fmt should be constant.
SELECT
date_format('2015-04-08', 'y'),
date_format('2015-04-08', 'd'),
date_format('2015-04-08', 's')
==>
2015
8
0
int
levenshtein(string A, string B)
Returns the Levenshtein distance between two strings.
SELECT levenshtein('xxx', 'xyx')
==>
1
string
soundex(string A)
Returns soundex code of the string.
SELECT
soundex('TreasureData')
==>
T626

Hive v1.3.0 より
Return Type
Name
Description
Example
int
quarter(date/timestamp/string)
Returns the quarter of the year for a date, timestamp, or string in the range 1 to 4.
SELECT
quarter('2017-06-01')
==>
2
string
chr(bigint|double A)
Returns the ASCII character having the binary equivalent to A. If A is larger than 256 the result is equivalent to chr(A % 256).
SELECT chr(48)
==>
0
string
replace(string A, string OLD, string NEW)
Returns the string A with all non-overlapping occurrences of OLD replaced with NEW.
SELECT
replace("ababab", "abab", "Z")
==>
Zab
string
substring_index(string A, string delim, int count)
Returns the substring from string A before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. Substring_index performs a case-sensitive match when searching for delim.
SELECT
substring_index('www.apache.org', '.', 2)
==>
www.apache
string
md5(string/binary)
Calculates an MD5 128-bit checksum for the string or binary (as of Hive 1.3.0). The value is returned as a string of 32 hex digits, or NULL if the argument was NULL.

Tips: What is difference between TD_MD5 and MD5?
There is no difference.
SELECT
MD5('abc'),
TD_MD5('abc')
==>
900150983cd24fb0d6963f7d28e17f72
900150983cd24fb0d6963f7d28e17f72

string
sha1(string/binary)
Calculates the SHA-1 digest for string or binary and returns the value as a hex string.

bigint
crc32(string/binary)
Computes a cyclic redundancy check value for string or binary argument and returns bigint value.


string
sha2(string/binary, int)
Calculates the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). The first argument is the string or binary to be hashed. The second argument indicates the desired bit length of the result, which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 256). SHA-224 is supported starting from Java 8. If either argument is NULL or the hash length is not one of the permitted values, the return value is NULL.
sha2('ABC', 256) = 'b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78'.
binary
aes_encrypt(input string/binary, key string/binary)
Encrypt input using AES. Key lengths of 128, 192 or 256 bits can be used. 192 and 256 bits keys can be used if Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files are installed. If either argument is NULL or the key length is not one of the permitted values, the return value is NULL.
SELECT
base64(aes_encrypt('ABC','1234567890123456'))
==>
y6Ss+zCYObpCbgfWfyNWTw==
binary
aes_decrypt(input binary, key string/binary)
Decrypt input using AES. Key lengths of 128, 192 or 256 bits can be used. 192 and 256 bits keys can be used if Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files are installed. If either argument is NULL or the key length is not one of the permitted values, the return value is NULL.
SELECT
decode(aes_decrypt(unbase64('y6Ss+zCYObpCbgfWfyNWTw=='), '1234567890123456'), 'UTF-8')
==>
ABC


Hive v2.1.0より
Return Type
Name
Description
Example
string

mask(string str[, string upper[, string lower[, string number]]])
Returns a masked version of str. By default, upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example mask("abcd-EFGH-8765-4321") results in xxxx-XXXX-nnnn-nnnn. You can override the characters used in the mask by supplying additional arguments: the second argument controls the mask character for upper case letters, the third argument for lower case letters and the fourth argument for numbers.
Limitation: Only alphabet and number are masked.
SELECT
mask('Testing000@gmail.com'),
mask('Testing000@gmail.com','x'),
mask('Testing000@gmail.com','x','y'),
mask('Testing000@gmail.com','x','y','z')
==>
SELECT
mask('Testing000@gmail.com'),
mask('Testing000@gmail.com','x'),
mask('Testing000@gmail.com','x','y'),
mask('Testing000@gmail.com','x','y','z')
==>
Xxxxxxxnnn@xxxxx.xxx xxxxxxxnnn@xxxxx.xxx xyyyyyynnn@yyyyy.yyy xyyyyyyzzz@yyyyy.yyy
string
mask_first_n(string str[, int n])
Returns a masked version of str with the first n values masked. Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". Default "n" is 4.
SELECT
mask_first_n('Testing000@gmail.com'),
mask_first_n('Testing000@gmail.com',7)
==>
Xxxxing000@gmail.com
Xxxxxxx000@gmail.com

mask_last_n(string str[, int n])
Returns a masked version of str with the last n values masked. Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". Default "n" is 4.
SELECT
mask_last_n('Testing000@gmail.com'),
mask_last_n('Testing000@gmail.com',7)
==>
Testing000@gmail.xxx Testing000@gmxxx.xxx
string
mask_show_first_n(string str[, int n])
Returns a masked version of str, showing the first n characters unmasked). Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". Default "n" is 4.
SELECT
mask_show_first_n('Testing000@gmail.com'),
mask_show_first_n('Testing000@gmail.com',7)
==>
Testxxxnnn@xxxxx.xxx Testingnnn@xxxxx.xxx
string
mask_show_last_n(string str[, int n])
Returns a masked version of str, showing the last n characters unmasked. Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". Default "n" is 4.
SELECT
mask_show_last_n('Testing000@gmail.com'),
mask_show_last_n('Testing000@gmail.com',7)
==>
Xxxxxxxnnn@xxxxx.com Xxxxxxxnnn@xxail.com
string
mask_hash(string|char|varchar str)
Returns a hashed value based on str. The hash is consistent and can be used to join masked values together across tables. This function returns null for non-string types.
SELECT
mask_hash('Testing000@gmail.com')
==>
e979e1ca866a125460db6fb26995df2b

Hive v2.2.0より
Return Type
Name
Description
Example
T
nullif( a, b )
Returns NULL if a=b; otherwise returns a.
Shorthand for: CASE WHEN a = b then NULL else a
SELECT
nullif( 0, 0),
nullif( 1, 0)
==>
NULL
1
int
character_length(string str)
Returns the number of UTF-8 characters contained in str. The function char_length is shorthand for this function.
SELECT
character_length('abc'),
character_length('あいう')
==>
3
3
double
regr_avgx(independent, dependent)
Equivalent to avg(dependent).

double
regr_avgy(independent, dependent)
Equivalent to avg(independent).

double
regr_count(independent, dependent)
Returns the number of non-null pairs used to fit the linear regression line.

double
regr_intercept(independent, dependent)


Returns the y-intercept of the linear regression line, i.e. the value of b in the equation dependent = a * independent + b.

double
regr_r2(independent, dependent)
Returns the coefficient of determination for the regression.

double
regr_slope(independent, dependent)
Returns the slope of the linear regression line, i.e. the value of a in the equation dependent = a * independent + b.

double
regr_sxx(independent, dependent)
Equivalent to regr_count(independent, dependent) * var_pop(dependent).

double
regr_sxy(independent, dependent)
Equivalent to regr_count(independent, dependent) * covar_pop(independent, dependent).

double
regr_syy(independent, dependent)
Equivalent to regr_count(independent, dependent) * var_pop(independent).