Functions (Apache Spark 2.x)

This article lists the built-in functions in Apache Spark SQL.

%

expr1 % expr2 - Returns the remainder after expr1/expr2.

Examples:

> SELECT 2 % 1.8;
 0.2
> SELECT MOD(2, 1.8);
 0.2

&

expr1 & expr2 - Returns the result of bitwise AND of expr1 and expr2.

Examples:
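Added illustration (not in the original listing):
> SELECT 3 & 5;
 1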

*

expr1 * expr2 - Returns expr1*expr2.

Examples:
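Added illustration:
> SELECT 2 * 3;
 6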

+

expr1 + expr2 - Returns expr1+expr2.

Examples:
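Added illustration:
> SELECT 1 + 2;
 3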

-

expr1 - expr2 - Returns expr1-expr2.

Examples:
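Added illustration:
> SELECT 2 - 1;
 1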

/

expr1 / expr2 - Returns expr1/expr2. It always performs floating point division.

Examples:

> SELECT 3 / 2;
 1.5
> SELECT 2L / 2L;
 1.0

<

expr1 < expr2 - Returns true if expr1 is less than expr2.

Arguments:

  • expr1, expr2 - the two expressions must be the same type, or castable to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such as array/struct, the data types of fields must be orderable.

Examples:

> SELECT 1 < 2;
 true
> SELECT 1.1 < '1';
 false
> SELECT to_date('2009-07-30 04:17:52') < to_date('2009-07-30 04:17:52');
 false
> SELECT to_date('2009-07-30 04:17:52') < to_date('2009-08-01 04:17:52');
 true
> SELECT 1 < NULL;
 NULL

<=

expr1 <= expr2 - Returns true if expr1 is less than or equal to expr2.

Arguments:

  • expr1, expr2 - the two expressions must be the same type, or castable to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such as array/struct, the data types of fields must be orderable.

Examples:

> SELECT 2 <= 2;
 true
> SELECT 1.0 <= '1';
 true
> SELECT to_date('2009-07-30 04:17:52') <= to_date('2009-07-30 04:17:52');
 true
> SELECT to_date('2009-07-30 04:17:52') <= to_date('2009-08-01 04:17:52');
 true
> SELECT 1 <= NULL;
 NULL

<=>

expr1 <=> expr2 - Returns the same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, and false if one of them is null.

Arguments:

  • expr1, expr2 - the two expressions must be the same type, or castable to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such as array/struct, the data types of fields must be orderable.

Examples:

> SELECT 2 <=> 2;
 true
> SELECT 1 <=> '1';
 true
> SELECT true <=> NULL;
 false
> SELECT NULL <=> NULL;
 true

=

expr1 = expr2 - Returns true if expr1 equals expr2, or false otherwise.

Arguments:

  • expr1, expr2 - the two expressions must be the same type, or castable to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such as array/struct, the data types of fields must be orderable.

Examples:

> SELECT 2 = 2;
 true
> SELECT 1 = '1';
 true
> SELECT true = NULL;
 NULL
> SELECT NULL = NULL;
 NULL

==

expr1 == expr2 - Returns true if expr1 equals expr2, or false otherwise.

Arguments:

  • expr1, expr2 - the two expressions must be the same type, or castable to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such as array/struct, the data types of fields must be orderable.

Examples:

> SELECT 2 == 2;
 true
> SELECT 1 == '1';
 true
> SELECT true == NULL;
 NULL
> SELECT NULL == NULL;
 NULL

>

expr1 > expr2 - Returns true if expr1 is greater than expr2.

Arguments:

  • expr1, expr2 - the two expressions must be the same type, or castable to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such as array/struct, the data types of fields must be orderable.

Examples:

> SELECT 2 > 1;
 true
> SELECT 2 > '1.1';
 true
> SELECT to_date('2009-07-30 04:17:52') > to_date('2009-07-30 04:17:52');
 false
> SELECT to_date('2009-07-30 04:17:52') > to_date('2009-08-01 04:17:52');
 false
> SELECT 1 > NULL;
 NULL

>=

expr1 >= expr2 - Returns true if expr1 is greater than or equal to expr2.

Arguments:

  • expr1, expr2 - the two expressions must be the same type, or castable to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such as array/struct, the data types of fields must be orderable.

Examples:

> SELECT 2 >= 1;
 true
> SELECT 2.0 >= '2.1';
 false
> SELECT to_date('2009-07-30 04:17:52') >= to_date('2009-07-30 04:17:52');
 true
> SELECT to_date('2009-07-30 04:17:52') >= to_date('2009-08-01 04:17:52');
 false
> SELECT 1 >= NULL;
 NULL

^

expr1 ^ expr2 - Returns the result of bitwise exclusive OR of expr1 and expr2.

Examples:
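Added illustration:
> SELECT 3 ^ 5;
 6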

abs

abs(expr) - Returns the absolute value of the numeric value.

Examples:
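Added illustration:
> SELECT abs(-1);
 1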

acos

acos(expr) - Returns the inverse cosine (arc cosine) of expr, as if computed by java.lang.Math.acos.

Examples:

> SELECT acos(1);
 0.0
> SELECT acos(2);
 NaN

add_months

add_months(start_date, num_months) - Returns the date that is num_months after start_date.

Examples:

> SELECT add_months('2016-08-31', 1);
 2016-09-30

Since: 1.5.0

aggregate

aggregate(expr, start, merge, finish) - Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.

Examples:

> SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x);
 6
> SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x, acc -> acc * 10);
 60

Since: 2.4.0

and

expr1 and expr2 - Logical AND.
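Example (added for illustration):

> SELECT true and false;
 false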

approx_count_distinct

approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++. relativeSD defines the maximum estimation error allowed.

approx_percentile

approx_percentile(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation. When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.

Examples:

> SELECT approx_percentile(10.0, array(0.5, 0.4, 0.1), 100);
 [10.0,10.0,10.0]
> SELECT approx_percentile(10.0, 0.5, 100);
 10.0

array

array(expr, …) - Returns an array with the given elements.

Examples:

> SELECT array(1, 2, 3);
 [1,2,3]

array_contains

array_contains(array, value) - Returns true if the array contains the value.

Examples:

> SELECT array_contains(array(1, 2, 3), 2);
 true

array_distinct

array_distinct(array) - Removes duplicate values from the array.

Examples:

> SELECT array_distinct(array(1, 2, 3, null, 3));
 [1,2,3,null]

Since: 2.4.0

array_except

array_except(array1, array2) - Returns an array of the elements in array1 but not in array2, without duplicates.

Examples:

> SELECT array_except(array(1, 2, 3), array(1, 3, 5));
 [2]

Since: 2.4.0

array_intersect

array_intersect(array1, array2) - Returns an array of the elements in the intersection of array1 and array2, without duplicates.

Examples:

> SELECT array_intersect(array(1, 2, 3), array(1, 3, 5));
 [1,3]

Since: 2.4.0

array_join

array_join(array, delimiter[, nullReplacement]) - Concatenates the elements of the given array using the delimiter and an optional string to replace nulls. If no value is set for nullReplacement, any null value is filtered.

Examples:

> SELECT array_join(array('hello', 'world'), ' ');
 hello world
> SELECT array_join(array('hello', null, 'world'), ' ');
 hello world
> SELECT array_join(array('hello', null, 'world'), ' ', ',');
 hello , world

Since: 2.4.0

array_max

array_max(array) - Returns the maximum value in the array. NULL elements are skipped.

Examples:

> SELECT array_max(array(1, 20, null, 3));
 20

Since: 2.4.0

array_min

array_min(array) - Returns the minimum value in the array. NULL elements are skipped.

Examples:

> SELECT array_min(array(1, 20, null, 3));
 1

Since: 2.4.0

array_position

array_position(array, element) - Returns the (1-based) index of the first occurrence of element in the array, as a long.

Examples:

> SELECT array_position(array(3, 2, 1), 1);
 3

Since: 2.4.0

array_remove

array_remove(array, element) - Removes all elements that are equal to element from array.

Examples:

> SELECT array_remove(array(1, 2, 3, null, 3), 3);
 [1,2,null]

Since: 2.4.0

array_repeat

array_repeat(element, count) - Returns the array containing element count times.

Examples:

> SELECT array_repeat('123', 2);
 ["123","123"]

Since: 2.4.0

array_sort

array_sort(array) - Sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.

Examples:

> SELECT array_sort(array('b', 'd', null, 'c', 'a'));
 ["a","b","c","d",null]

Since: 2.4.0

array_union

array_union(array1, array2) - Returns an array of the elements in the union of array1 and array2, without duplicates.

Examples:

> SELECT array_union(array(1, 2, 3), array(1, 3, 5));
 [1,2,3,5]

Since: 2.4.0

arrays_overlap

arrays_overlap(a1, a2) - Returns true if a1 contains at least one non-null element that is also present in a2. If the arrays have no common element, both are non-empty, and either of them contains a null element, null is returned; otherwise, false is returned.

Examples:

> SELECT arrays_overlap(array(1, 2, 3), array(3, 4, 5));
 true

Since: 2.4.0

arrays_zip

arrays_zip(a1, a2, …) - Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.

Examples:

> SELECT arrays_zip(array(1, 2, 3), array(2, 3, 4));
 [{"0":1,"1":2},{"0":2,"1":3},{"0":3,"1":4}]
> SELECT arrays_zip(array(1, 2), array(2, 3), array(3, 4));
 [{"0":1,"1":2,"2":3},{"0":2,"1":3,"2":4}]

Since: 2.4.0

ascii

ascii(str) - Returns the numeric value of the first character of str.

Examples:

> SELECT ascii('222');
 50
> SELECT ascii(2);
 50

asin

asin(expr) - Returns the inverse sine (arc sine) of expr, as if computed by java.lang.Math.asin.

Examples:

> SELECT asin(0);
 0.0
> SELECT asin(2);
 NaN

assert_true

assert_true(expr) - Throws an exception if expr is not true.

Examples:

> SELECT assert_true(0 < 1);
 NULL

atan

atan(expr) - Returns the inverse tangent (arc tangent) of expr, as if computed by java.lang.Math.atan.

Examples:
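Added illustration:
> SELECT atan(0);
 0.0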

atan2

atan2(exprY, exprX) - Returns the angle in radians between the positive x-axis of a plane and the point given by the coordinates (exprX, exprY), as if computed by java.lang.Math.atan2.

Arguments:

  • exprY - coordinate on y-axis
  • exprX - coordinate on x-axis

Examples:

> SELECT atan2(0, 0);
 0.0

avg

avg(expr) - Returns the mean calculated from values of a group.

base64

base64(bin) - Converts the argument from a binary to a base 64 string.

Examples:

> SELECT base64('Spark SQL');
 U3BhcmsgU1FM

bigint

bigint(expr) - Casts the value expr to the target data type bigint.
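Example (added for illustration):

> SELECT bigint('10');
 10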

bin

bin(expr) - Returns the string representation of the long value represented in binary.

Examples:

> SELECT bin(13);
 1101
> SELECT bin(-13);
 1111111111111111111111111111111111111111111111111111111111110011
> SELECT bin(13.3);
 1101

binary

binary(expr) - Casts the value expr to the target data type binary.

bit_length

bit_length(expr) - Returns the bit length of string data or number of bits of binary data.

Examples:

> SELECT bit_length('Spark SQL');
 72

boolean

boolean(expr) - Casts the value expr to the target data type boolean.

bround

bround(expr, d) - Returns expr rounded to d decimal places using HALF_EVEN rounding mode.

Examples:

> SELECT bround(2.5, 0);
 2.0

cardinality

cardinality(expr) - Returns the size of an array or a map. The function returns -1 if its input is null and spark.sql.legacy.sizeOfNull is set to true. If spark.sql.legacy.sizeOfNull is set to false, the function returns null for null input. By default, the spark.sql.legacy.sizeOfNull parameter is set to true.

Examples:

> SELECT cardinality(array('b', 'd', 'c', 'a'));
 4
> SELECT cardinality(map('a', 1, 'b', 2));
 2
> SELECT cardinality(NULL);
 -1

cast

cast(expr AS type) - Casts the value expr to the target data type type.

Examples:

> SELECT cast('10' as int);
 10

cbrt

cbrt(expr) - Returns the cube root of expr.

Examples:

> SELECT cbrt(27.0);
 3.0

ceil

ceil(expr) - Returns the smallest integer not smaller than expr.

Examples:

> SELECT ceil(-0.1);
 0
> SELECT ceil(5);
 5

ceiling

ceiling(expr) - Returns the smallest integer not smaller than expr.

Examples:

> SELECT ceiling(-0.1);
 0
> SELECT ceiling(5);
 5

char

char(expr) - Returns the ASCII character having the binary equivalent to expr. If n is larger than 256 the result is equivalent to chr(n % 256).

Examples:
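Added illustration (ASCII 65 is 'A'):
> SELECT char(65);
 A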

char_length

char_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Examples:

> SELECT char_length('Spark SQL ');
 10
> SELECT CHAR_LENGTH('Spark SQL ');
 10
> SELECT CHARACTER_LENGTH('Spark SQL ');
 10

character_length

character_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Examples:

> SELECT character_length('Spark SQL ');
 10
> SELECT CHAR_LENGTH('Spark SQL ');
 10
> SELECT CHARACTER_LENGTH('Spark SQL ');
 10

chr

chr(expr) - Returns the ASCII character having the binary equivalent to expr. If n is larger than 256 the result is equivalent to chr(n % 256).

Examples:
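Added illustration:
> SELECT chr(65);
 A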

coalesce

coalesce(expr1, expr2, …) - Returns the first non-null argument if exists. Otherwise, null.

Examples:

> SELECT coalesce(NULL, 1, NULL);
 1

collect_list

collect_list(expr) - Collects and returns a list of non-unique elements.

collect_set

collect_set(expr) - Collects and returns a set of unique elements.

concat

concat(col1, col2, …, colN) - Returns the concatenation of col1, col2, …, colN.

Examples:

> SELECT concat('Spark', 'SQL');
 SparkSQL
> SELECT concat(array(1, 2, 3), array(4, 5), array(6));
 [1,2,3,4,5,6]

Note: concat logic for arrays is available since 2.4.0.

concat_ws

concat_ws(sep, [str | array(str)]+) - Returns the concatenation of the strings separated by sep.

Examples:

> SELECT concat_ws(' ', 'Spark', 'SQL');
 Spark SQL

conv

conv(num, from_base, to_base) - Converts num from from_base to to_base.

Examples:

> SELECT conv('100', 2, 10);
 4
> SELECT conv(-10, 16, -10);
 -16

corr

corr(expr1, expr2) - Returns Pearson coefficient of correlation between a set of number pairs.

cos

cos(expr) - Returns the cosine of expr, as if computed by java.lang.Math.cos.

Arguments:

Examples:
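Added illustration:
> SELECT cos(0);
 1.0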

cosh

cosh(expr) - Returns the hyperbolic cosine of expr, as if computed by java.lang.Math.cosh.

Arguments:

Examples:
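Added illustration:
> SELECT cosh(0);
 1.0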

cot

cot(expr) - Returns the cotangent of expr, as if computed by 1/java.lang.Math.tan.

Arguments:

Examples:

> SELECT cot(1);
 0.6420926159343306

count

count(*) - Returns the total number of retrieved rows, including rows containing null.

count(expr[, expr…]) - Returns the number of rows for which the supplied expression(s) are all non-null.

count(DISTINCT expr[, expr…]) - Returns the number of rows for which the supplied expression(s) are unique and non-null.

count_min_sketch

count_min_sketch(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.

covar_pop

covar_pop(expr1, expr2) - Returns the population covariance of a set of number pairs.

covar_samp

covar_samp(expr1, expr2) - Returns the sample covariance of a set of number pairs.

crc32

crc32(expr) - Returns a cyclic redundancy check value of expr as a bigint.

Examples:

> SELECT crc32('Spark');
 1557323817

cube

cume_dist

cume_dist() - Computes the position of a value relative to all values in the partition.

current_database

current_database() - Returns the current database.

Examples:

> SELECT current_database();
 default

current_date

current_date() - Returns the current date at the start of query evaluation.

Since: 1.5.0

current_timestamp

current_timestamp() - Returns the current timestamp at the start of query evaluation.

Since: 1.5.0

date

date(expr) - Casts the value expr to the target data type date.

date_add

date_add(start_date, num_days) - Returns the date that is num_days after start_date.

Examples:

> SELECT date_add('2016-07-30', 1);
 2016-07-31

Since: 1.5.0

date_format

date_format(timestamp, fmt) - Converts timestamp to a value of string in the format specified by the date format fmt.

Examples:

> SELECT date_format('2016-04-08', 'y');
 2016

Since: 1.5.0

date_sub

date_sub(start_date, num_days) - Returns the date that is num_days before start_date.

Examples:

> SELECT date_sub('2016-07-30', 1);
 2016-07-29

Since: 1.5.0

date_trunc

date_trunc(fmt, ts) - Returns timestamp ts truncated to the unit specified by the format model fmt. fmt should be one of ["YEAR", "YYYY", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", "WEEK", "QUARTER"].

Examples:

> SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359');
 2015-01-01 00:00:00
> SELECT date_trunc('MM', '2015-03-05T09:32:05.359');
 2015-03-01 00:00:00
> SELECT date_trunc('DD', '2015-03-05T09:32:05.359');
 2015-03-05 00:00:00
> SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359');
 2015-03-05 09:00:00

Since: 2.3.0

datediff

datediff(endDate, startDate) - Returns the number of days from startDate to endDate.

Examples:

> SELECT datediff('2009-07-31', '2009-07-30');
 1
> SELECT datediff('2009-07-30', '2009-07-31');
 -1

Since: 1.5.0

day

day(date) - Returns the day of month of the date/timestamp.

Examples:

> SELECT day('2009-07-30');
 30

Since: 1.5.0

dayofmonth

dayofmonth(date) - Returns the day of month of the date/timestamp.

Examples:

> SELECT dayofmonth('2009-07-30');
 30

Since: 1.5.0

dayofweek

dayofweek(date) - Returns the day of the week for date/timestamp (1 = Sunday, 2 = Monday, …, 7 = Saturday).

Examples:

> SELECT dayofweek('2009-07-30');
 5

Since: 2.3.0

dayofyear

dayofyear(date) - Returns the day of year of the date/timestamp.

Examples:

> SELECT dayofyear('2016-04-09');
 100

Since: 1.5.0

decimal

decimal(expr) - Casts the value expr to the target data type decimal.

decode

decode(bin, charset) - Decodes the first argument using the second argument character set.

Examples:

> SELECT decode(encode('abc', 'utf-8'), 'utf-8');
 abc

degrees

degrees(expr) - Converts radians to degrees.

Arguments:

Examples:

> SELECT degrees(3.141592653589793);
 180.0

dense_rank

dense_rank() - Computes the rank of a value in a group of values. The result is one plus the previously assigned rank value. Unlike the function rank, dense_rank will not produce gaps in the ranking sequence.

double

double(expr) - Casts the value expr to the target data type double.

e

e() - Returns Euler’s number, e.

Examples:

> SELECT e();
 2.718281828459045

element_at

element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.

element_at(map, key) - Returns value for given key, or NULL if the key is not contained in the map

Examples:

> SELECT element_at(array(1, 2, 3), 2);
 2
> SELECT element_at(map(1, 'a', 2, 'b'), 2);
 b

Since: 2.4.0

elt

elt(n, input1, input2, …) - Returns the n-th input, e.g., returns input2 when n is 2.

Examples:

> SELECT elt(1, 'scala', 'java');
 scala

encode

encode(str, charset) - Encodes the first argument using the second argument character set.

Examples:

> SELECT encode('abc', 'utf-8');
 abc

exists

exists(expr, pred) - Tests whether a predicate holds for one or more elements in the array.

Examples:

> SELECT exists(array(1, 2, 3), x -> x % 2 == 0);
 true

Since: 2.4.0

exp

exp(expr) - Returns e to the power of expr.

Examples:
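Added illustration:
> SELECT exp(0);
 1.0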

explode

explode(expr) - Separates the elements of array into multiple rows, or the elements of map into multiple rows and columns.

Examples:

> SELECT explode(array(10, 20));
 10
 20

explode_outer

explode_outer(expr) - Separates the elements of array into multiple rows, or the elements of map into multiple rows and columns.

Examples:

> SELECT explode_outer(array(10, 20));
 10
 20

expm1

expm1(expr) - Returns exp(expr) - 1.

Examples:

> SELECT expm1(0);
 0.0

factorial

factorial(expr) - Returns the factorial of expr. expr is [0..20]. Otherwise, null.

Examples:

> SELECT factorial(5);
 120

filter

filter(expr, func) - Filters the input array using the given predicate.

Examples:

> SELECT filter(array(1, 2, 3), x -> x % 2 == 1);
 [1,3]

Since: 2.4.0

find_in_set

find_in_set(str, str_array) - Returns the index (1-based) of the given string (str) in the comma-delimited list (str_array). Returns 0 if the string was not found or if the given string (str) contains a comma.

Examples:

> SELECT find_in_set('ab', 'abc,b,ab,c,def');
 3

first

first(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

first_value

first_value(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

flatten

flatten(arrayOfArrays) - Transforms an array of arrays into a single array.

Examples:

> SELECT flatten(array(array(1, 2), array(3, 4)));
 [1,2,3,4]

Since: 2.4.0

float

float(expr) - Casts the value expr to the target data type float.

floor

floor(expr) - Returns the largest integer not greater than expr.

Examples:

> SELECT floor(-0.1);
 -1
> SELECT floor(5);
 5

format_number

format_number(expr1, expr2) - Formats the number expr1 like '#,###,###.##', rounded to expr2 decimal places. If expr2 is 0, the result has no decimal point or fractional part. expr2 also accepts a user-specified format. This is supposed to function like MySQL's FORMAT.

Examples:

> SELECT format_number(12332.123456, 4);
 12,332.1235
> SELECT format_number(12332.123456, '##################.###');
 12332.123

format_string

format_string(strfmt, obj, …) - Returns a formatted string from printf-style format strings.

Examples:

> SELECT format_string("Hello World %d %s", 100, "days");
 Hello World 100 days

from_json

from_json(jsonStr, schema[, options]) - Returns a struct value with the given jsonStr and schema.

Examples:

> SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE');
 {"a":1,"b":0.8}
> SELECT from_json('{"time":"26/08/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
 {"time":"2015-08-26 00:00:00.0"}

Since: 2.2.0

from_unixtime

from_unixtime(unix_time, format) - Returns unix_time in the specified format.

Examples:

> SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');
 1970-01-01 00:00:00

Since: 1.5.0

from_utc_timestamp

from_utc_timestamp(timestamp, timezone) - Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, ‘GMT+1’ would yield ‘2017-07-14 03:40:00.0’.

Examples:

> SELECT from_utc_timestamp('2016-08-31', 'Asia/Seoul');
 2016-08-31 09:00:00

Since: 1.5.0

get_json_object

get_json_object(json_txt, path) - Extracts a json object from json_txt based on the given path.

Examples:

> SELECT get_json_object('{"a":"b"}', '$.a');
 b

greatest

greatest(expr, …) - Returns the greatest value of all parameters, skipping null values.

Examples:

>SELECTgreatest(10,9,2,4,3);10

grouping

grouping_id

hash

hash(expr1, expr2, …) - Returns a hash value of the arguments.

Examples:

>SELECThash('Spark',array(123),2);-1321691492

hex

hex(expr) - Converts expr to hexadecimal.

Examples:

> SELECT hex(17);
 11
> SELECT hex('Spark SQL');
 537061726B2053514C

hour

hour(timestamp) - Returns the hour component of the string/timestamp.

Examples:

>SELECThour('2009-07-30 12:58:59');12

Since: 1.5.0

hypot

hypot(expr1, expr2) - Returns sqrt(expr1^2 + expr2^2).

Examples:

> SELECT hypot(3, 4);
 5.0

if

if(expr1, expr2, expr3) - If expr1 evaluates to true, then returns expr2; otherwise returns expr3.

Examples:

> SELECT if(1 < 2, 'a', 'b');
 a

ifnull

ifnull(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.

Examples:

> SELECT ifnull(NULL, array('2'));
 ["2"]

in

expr1 in(expr2, expr3, …) - Returns true if expr1 equals any of the values expr2, expr3, ….

Arguments:

  • expr1, expr2, expr3, … - the arguments must be the same type.

Examples:

> SELECT 1 in (1, 2, 3);
 true
> SELECT 1 in (2, 3, 4);
 false
> SELECT named_struct('a', 1, 'b', 2) in (named_struct('a', 1, 'b', 1), named_struct('a', 1, 'b', 3));
 false
> SELECT named_struct('a', 1, 'b', 2) in (named_struct('a', 1, 'b', 2), named_struct('a', 1, 'b', 3));
 true

initcap

initcap(str) - Returns str with the first letter of each word in uppercase. All other letters are in lowercase. Words are delimited by white space.

Examples:

> SELECT initcap('sPark sql');
 Spark Sql

inline

inline(expr) - Explodes an array of structs into a table.

Examples:

>SELECTinline(array(struct(1,'a'),struct(2,'b')));1a2b

inline_outer

inline_outer(expr) - Explodes an array of structs into a table.

Examples:

>SELECTinline_outer(array(struct(1,'a'),struct(2,'b')));1a2b

input_file_block_length

input_file_block_length() - Returns the length of the block being read, or -1 if not available.

input_file_block_start

input_file_block_start() - Returns the start offset of the block being read, or -1 if not available.

input_file_name

input_file_name() - Returns the name of the file being read, or empty string if not available.

instr

instr(str, substr) - Returns the (1-based) index of the first occurrence of substr in str.

Examples:

> SELECT instr('SparkSQL', 'SQL');
 6

int

int(expr) - Casts the value expr to the target data type int.

isnan

isnan(expr) - Returns true if expr is NaN, or false otherwise.

Examples:

> SELECT isnan(cast('NaN' as double));
 true

isnotnull

isnotnull(expr) - Returns true if expr is not null, or false otherwise.

Examples:

> SELECT isnotnull(1);
 true

isnull

isnull(expr) - Returns true if expr is null, or false otherwise.

Examples:

> SELECT isnull(1);
 false

java_method

java_method(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.

Examples:

>SELECTjava_method('java.util.UUID','randomUUID');c33fb387-8500-4bfa-81d2-6e0e3e930df2>SELECTjava_method('java.util.UUID','fromString','a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');a5cf6c42-0c85-418f-af6c-3e4e5b1328f2

json_tuple

json_tuple(jsonStr, p1, p2, …, pn) - Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.

Examples:

>SELECTjson_tuple('{"a":1, "b":2}','a','b');12

kurtosis

kurtosis(expr) - Returns the kurtosis value calculated from values of a group.

lag

lag(input[, offset[, default]]) - Returns the value of input at the offset-th row before the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offset-th row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the first row of the window does not have any previous row), default is returned.

last

last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

last_day

last_day(date) - Returns the last day of the month which the date belongs to.

Examples:

>SELECTlast_day('2009-01-12');2009-01-31

Since: 1.5.0

last_value

last_value(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

lcase

lcase(str) - Returns str with all characters changed to lowercase.

Examples:

> SELECT lcase('SparkSql');
 sparksql

lead

lead(input[, offset[, default]]) - Returns the value of input at the offset-th row after the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offset-th row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the last row of the window does not have any subsequent row), default is returned.

least

least(expr, …) - Returns the least value of all parameters, skipping null values.

Examples:

>SELECTleast(10,9,2,4,3);2

left

left(str, len) - Returns the leftmost len (len can be string type) characters from the string str; if len is less than or equal to 0 the result is an empty string.

Examples:

> SELECT left('Spark SQL', 3);
 Spa

length

length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Examples:

>SELECTlength('Spark SQL ');10>SELECTCHAR_LENGTH('Spark SQL ');10>SELECTCHARACTER_LENGTH('Spark SQL ');10

levenshtein

levenshtein(str1, str2) - Returns the Levenshtein distance between the two given strings.

Examples:

>SELECTlevenshtein('kitten','sitting');3

like

str like pattern - Returns true if str matches pattern, null if any arguments are null, false otherwise.

Arguments:

  • str - a string expression

  • pattern - a string expression. The pattern is a string which is matched literally, with exception to the following special symbols:

    _ matches any one character in the input (similar to . in posix regular expressions)

    % matches zero or more characters in the input (similar to .* in posix regular expressions)

    The escape character is '\'. If an escape character precedes a special symbol or another escape character, the following character is matched literally. It is invalid to escape any other character.

    Since Spark 2.0, string literals are unescaped in our SQL parser. For example, in order to match "\abc", the pattern should be "\\abc".

    When SQL config 'spark.sql.parser.escapedStringLiterals' is enabled, it falls back to Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the pattern to match "\abc" should be "\abc".

Examples:

> SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\\Users%';
 true

Note:

Use RLIKE to match with standard regular expressions.

ln

ln(expr) - Returns the natural logarithm (base e) of expr.

Examples:
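Added illustration:
> SELECT ln(1);
 0.0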

locate

locate(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos. The given pos and return value are 1-based.

Examples:

> SELECT locate('bar', 'foobarbar');
 4
> SELECT locate('bar', 'foobarbar', 5);
 7
> SELECT POSITION('bar' IN 'foobarbar');
 4

log

log(base, expr) - Returns the logarithm of expr with base base.

Examples:

> SELECT log(10, 100);
 2.0

log10

log10(expr) - Returns the logarithm of expr with base 10.

Examples:

> SELECT log10(10);
 1.0

log1p

log1p(expr) - Returns log(1 + expr).

Examples:

> SELECT log1p(0);
 0.0

log2

log2(expr) - Returns the logarithm of expr with base 2.

Examples:
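Added illustration:
> SELECT log2(2);
 1.0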

lower

lower(str) - Returns str with all characters changed to lowercase.

Examples:

> SELECT lower('SparkSql');
 sparksql

lpad

lpad(str, len, pad) - Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters.

Examples:

> SELECT lpad('hi', 5, '??');
 ???hi
> SELECT lpad('hi', 1, '??');
 h

ltrim

ltrim(str) - Removes the leading space characters from .

ltrim(trimStr, str) - Removes from the beginning of str all characters that are contained in the trim string trimStr.

Arguments:

  • str - a string expression
  • trimStr - the trim string characters to trim, the default value is a single space

Examples:

>SELECTltrim(' SparkSQL ');SparkSQL>SELECTltrim('Sp','SSparkSQLS');arkSQLS

map

map(key0, value0, key1, value1, …) - Creates a map with the given key/value pairs.

Examples:

>SELECTmap(1.0,'2',3.0,'4');
{1.0:"2",3.0:"4"}

map_concat

map_concat(map, …) - Returns the union of all the given maps

Examples:

>SELECTmap_concat(map(1,'a',2,'b'),map(2,'c',3,'d'));
{1:"a",2:"c",3:"d"}

Since: 2.4.0

map_from_arrays

map_from_arrays(keys, values) - Creates a map with a pair of the given key/value arrays. All elements in keys should not be null

Examples:

>SELECTmap_from_arrays(array(1.0,3.0),array('2','4'));
{1.0:"2",3.0:"4"}

Since: 2.4.0

map_from_entries

map_from_entries(arrayOfEntries) - Returns a map created from the given array of entries.

Examples:

>SELECTmap_from_entries(array(struct(1,'a'),struct(2,'b')));
{1:"a",2:"b"}

Since: 2.4.0

map_keys

map_keys(map) - Returns an unordered array containing the keys of the map.

Examples:

>SELECTmap_keys(map(1,'a',2,'b'));[1,2]

map_values

map_values(map) - Returns an unordered array containing the values of the map.

Examples:

>SELECTmap_values(map(1,'a',2,'b'));["a","b"]

max

max(expr) - Returns the maximum value of expr.

md5

md5(expr) - Returns an MD5 128-bit checksum as a hex string of expr.

Examples:

> SELECT md5('Spark');
 8cde774d6f7333752ed72cacddb05126

mean

mean(expr) - Returns the mean calculated from values of a group.

min

min(expr) - Returns the minimum value of expr.

minute

minute(timestamp) - Returns the minute component of the string/timestamp.

Examples:

>SELECTminute('2009-07-30 12:58:59');58

Since: 1.5.0

mod

expr1 mod expr2 - Returns the remainder after expr1/expr2.

Examples:

> SELECT 2 mod 1.8;
 0.2
> SELECT MOD(2, 1.8);
 0.2

monotonically_increasing_id

monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records. The function is non-deterministic because its result depends on partition IDs.

month

month(date) - Returns the month component of the date/timestamp.

Examples:

>SELECTmonth('2016-07-30');7

Since: 1.5.0

months_between

months_between(timestamp1, timestamp2[, roundOff]) - If timestamp1 is later than timestamp2, then the result is positive. If timestamp1 and timestamp2 are on the same day of the month, or both are the last day of the month, time of day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits unless roundOff=false.

Examples:

>SELECTmonths_between('1997-02-28 10:30:00','1996-10-30');3.94959677>SELECTmonths_between('1997-02-28 10:30:00','1996-10-30',false);3.9495967741935485

Since: 1.5.0

named_struct

named_struct(name1, val1, name2, val2, …) - Creates a struct with the given field names and values.

Examples:

>SELECTnamed_struct("a",1,"b",2,"c",3);
{"a":1,"b":2,"c":3}

nanvl

nanvl(expr1, expr2) - Returns expr1 if it's not NaN, or expr2 otherwise.

Examples:

> SELECT nanvl(cast('NaN' as double), 123);
 123.0

negative

negative(expr) - Returns the negated value of expr.

Examples:

> SELECT negative(1);
 -1

next_day

next_day(start_date, day_of_week) - Returns the first date which is later than start_date and named as indicated by day_of_week.

Examples:

> SELECT next_day('2015-01-14', 'TU');
 2015-01-20

Since: 1.5.0

not

not expr - Logical not.
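Example (added for illustration):

> SELECT not true;
 false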

now

now() - Returns the current timestamp at the start of query evaluation.

Since: 1.5.0

ntile

ntile(n) - Divides the rows for each window partition into n buckets ranging from 1 to at most n.

nullif

nullif(expr1, expr2) - Returns null if expr1 equals expr2, or expr1 otherwise.

Examples:

> SELECT nullif(2, 2);
 NULL

nvl

nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.

Examples:

> SELECT nvl(NULL, array('2'));
 ["2"]

nvl2

nvl2(expr1, expr2, expr3) - Returns expr2 if expr1 is not null, or expr3 otherwise.

Examples:

> SELECT nvl2(NULL, 2, 1);
 1

octet_length

octet_length(expr) - Returns the byte length of string data or number of bytes of binary data.

Examples:

>SELECToctet_length('Spark SQL');9

or

expr1 or expr2 - Logical OR.
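Example (added for illustration):

> SELECT true or false;
 true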

parse_url

parse_url(url, partToExtract[, key]) - Extracts a part from a URL.

Examples:

> SELECT parse_url('https://spark.apache.org/path?query=1', 'HOST');
 spark.apache.org
> SELECT parse_url('https://spark.apache.org/path?query=1', 'QUERY');
 query=1
> SELECT parse_url('https://spark.apache.org/path?query=1', 'QUERY', 'query');
 1

percent_rank

percent_rank() - Computes the percentage ranking of a value in a group of values.

percentile

percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be a positive integer.

percentile(col, array(percentage1 [, percentage2]…) [, frequency]) - Returns the exact percentile value array of numeric column col at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0. The value of frequency should be a positive integer.

percentile_approx

percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation. When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.

Examples:

> SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
 [10.0,10.0,10.0]
> SELECT percentile_approx(10.0, 0.5, 100);
 10.0

pi

pi() - Returns pi.

Examples:

>SELECTpi();3.141592653589793

pmod

pmod(expr1, expr2) - Returns the positive value of expr1 mod expr2.

Examples:

> SELECT pmod(10, 3);
 1
> SELECT pmod(-10, 3);
 2

posexplode

posexplode(expr) - Separates the elements of array into multiple rows with positions, or the elements of map into multiple rows and columns with positions.

Examples:

>SELECTposexplode(array(10,20));010120

posexplode_outer

posexplode_outer(expr) - Separates the elements of array into multiple rows with positions, or the elements of map into multiple rows and columns with positions.

Examples:

>SELECTposexplode_outer(array(10,20));010120

position

position(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos. The given pos and return value are 1-based.

Examples:

> SELECT position('bar', 'foobarbar');
 4
> SELECT position('bar', 'foobarbar', 5);
 7
> SELECT POSITION('bar' IN 'foobarbar');
 4

positive

positive(expr) - Returns the value of expr.

pow

pow(expr1, expr2) - Raises expr1 to the power of expr2.

Examples:

> SELECT pow(2, 3);
 8.0

power

power(expr1, expr2) - Raises expr1 to the power of expr2.

Examples:

> SELECT power(2, 3);
 8.0

printf

printf(strfmt, obj, …) - Returns a formatted string from printf-style format strings.

Examples:

>SELECTprintf("Hello World %d %s",100,"days");HelloWorld100days

quarter

quarter(date) - Returns the quarter of the year for date, in the range 1 to 4.

Examples:

>SELECTquarter('2016-08-31');3

Since: 1.5.0

radians

radians(expr) - Converts degrees to radians.

Arguments:

Examples:

>SELECTradians(180);3.141592653589793

rand

rand([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).

Examples:

> SELECT rand();
 0.9629742951434543
> SELECT rand(0);
 0.8446490682263027
> SELECT rand(null);
 0.8446490682263027

Note: the function is non-deterministic in the general case.

randn

randn([seed]) - Returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution.

Examples:

> SELECT randn();
 -0.3254147983080288
> SELECT randn(0);
 1.1164209726833079
> SELECT randn(null);
 1.1164209726833079

Note: the function is non-deterministic in the general case.

rank

rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row in the ordering of the partition. The values will produce gaps in the sequence.

reflect

reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.

Examples:

>SELECTreflect('java.util.UUID','randomUUID');c33fb387-8500-4bfa-81d2-6e0e3e930df2>SELECTreflect('java.util.UUID','fromString','a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');a5cf6c42-0c85-418f-af6c-3e4e5b1328f2

regexp_replace

regexp_replace(str, regexp, rep) - Replaces all substrings of str that match regexp with rep.

Examples:

> SELECT regexp_replace('100-200', '(\\d+)', 'num');
 num-num

repeat

repeat(str, n) - Returns the string which repeats the given string value n times.

Examples:

>SELECTrepeat('123',2);123123

replace

replace(str, search[, replace]) - Replaces all occurrences of search with replace.

Arguments:

  • str - a string expression
  • search - a string expression. If search is not found in str, str is returned unchanged.
  • replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that is removed from str.

Examples:

>SELECTreplace('ABCabc','abc','DEF');ABCDEF

reverse

reverse(array) - Returns a reversed string or an array with reverse order of elements.

Examples:

> SELECT reverse('Spark SQL');
 LQS krapS
> SELECT reverse(array(2, 1, 4, 3));
 [3,4,1,2]

Note: reverse logic for arrays is available since 2.4.0.

Since: 1.5.0

right

right(str, len) - Returns the rightmost len (len can be string type) characters from the string str; if len is less than or equal to 0 the result is an empty string.

Examples:

> SELECT right('Spark SQL', 3);
 SQL

rint

rint(expr) - Returns the double value that is closest in value to the argument and is equal to a mathematical integer.

Examples:

>SELECTrint(12.3456);12.0

rlike

str rlike regexp - Returns true if str matches regexp, or false otherwise.

Arguments:

  • str - a string expression

  • regexp - a string expression. The pattern string should be a Java regular expression.

    Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser. For example, to match "\abc", a regular expression for regexp can be "^\\abc$".

    There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fall back to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

Examples:

When spark.sql.parser.escapedStringLiterals is disabled (default):

> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*';
 true

When spark.sql.parser.escapedStringLiterals is enabled:

> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*';
 true

Note:

Use LIKE to match with simple string pattern.

rollup

round

round(expr, d) - Returns expr rounded to d decimal places using HALF_UP rounding mode.

Examples:

> SELECT round(2.5, 0);
 3.0

row_number

row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.

rpad

rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters.

Examples:

> SELECT rpad('hi', 5, '??');
 hi???
Source: https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html

String functions (Spark Scala Column API):

  • ascii(e: Column): Column - Computes the numeric value of the first character of the string column, and returns the result as an int column.
  • base64(e: Column): Column - Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64.
  • concat_ws(sep: String, exprs: Column*): Column - Concatenates multiple input string columns together into a single string column, using the given separator.
  • decode(value: Column, charset: String): Column - Computes the first argument into a string from a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').
  • encode(value: Column, charset: String): Column - Computes the first argument into a binary from a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').
  • format_number(x: Column, d: Int): Column - Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column.
  • format_string(format: String, arguments: Column*): Column - Formats the arguments in printf-style and returns the result as a string column.
  • initcap(e: Column): Column - Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace. For example, "hello world" will become "Hello World".
  • instr(str: Column, substring: String): Column - Locates the position of the first occurrence of the substring in the given string. Returns null if either of the arguments is null.
  • length(e: Column): Column - Computes the character length of a given string or number of bytes of a binary string. The length of character strings includes the trailing spaces. The length of binary strings includes binary zeros.
  • lower(e: Column): Column - Converts a string column to lower case.
  • levenshtein(l: Column, r: Column): Column - Computes the Levenshtein distance of the two given string columns.
  • locate(substr: String, str: Column): Column - Locates the position of the first occurrence of substr.
  • locate(substr: String, str: Column, pos: Int): Column - Locates the position of the first occurrence of substr in a string column, after position pos.
  • lpad(str: Column, len: Int, pad: String): Column - Left-pads the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
  • ltrim(e: Column): Column - Trims the spaces from the left end of the specified string value.
  • regexp_extract(e: Column, exp: String, groupIdx: Int): Column - Extracts a specific group matched by a Java regex from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.
  • regexp_replace(e: Column, pattern: String, replacement: String): Column - Replaces all substrings of the specified string value that match the pattern with the replacement.
  • regexp_replace(e: Column, pattern: Column, replacement: Column): Column - Replaces all substrings of the specified string value that match the pattern with the replacement.
  • unbase64(e: Column): Column - Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64.
  • rpad(str: Column, len: Int, pad: String): Column - Right-pads the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
  • repeat(str: Column, n: Int): Column - Repeats a string column n times, and returns it as a new string column.
  • rtrim(e: Column): Column - Trims the spaces from the right end of the specified string value.
  • rtrim(e: Column, trimString: String): Column - Trims the specified character string from the right end of the specified string column.
  • soundex(e: Column): Column - Returns the soundex code for the specified expression.
  • split(str: Column, regex: String): Column - Splits str around matches of the given regex.
  • split(str: Column, regex: String, limit: Int): Column - Splits str around matches of the given regex.
  • substring(str: Column, pos: Int, len: Int): Column - Substring starts at pos and is of length len when str is String type, or returns the slice of the byte array that starts at pos in bytes and is of length len when str is Binary type.
  • substring_index(str: Column, delim: String, count: Int): Column - Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.
  • overlay(src: Column, replaceString: String, pos: Int, len: Int): Column - Overlays the specified portion of src with replaceString, starting from byte position pos and proceeding for len bytes.
  • overlay(src: Column, replaceString: String, pos: Int): Column - Overlays the specified portion of src with replaceString, starting from byte position pos.
  • translate(src: Column, matchingString: String, replaceString: String): Column - Translates any character in src that matches a character in matchingString into the corresponding character in replaceString.
  • trim(e: Column): Column - Trims the spaces from both ends of the specified string column.
  • trim(e: Column, trimString: String): Column - Trims the specified character from both ends of the specified string column.
  • upper(e: Column): Column - Converts a string column to upper case.
Source: https://sparkbyexamples.com/spark/usage-of-spark-sql-string-functions/

SQL Functions

Ascend uses Spark SQL syntax. This page offers a list of functions supported by the Ascend platform.


Description

any(expr) - Returns true if at least one value of expr is true.

Argument typeReturn type
( Bool )Bool

More

ANY on Apache Spark Documentation

Description

some(expr) - Returns true if at least one value of expr is true.

Argument typeReturn type
( Bool )Bool

More

SOME on Apache Spark Documentation

Description

bool_or(expr) - Returns true if at least one value of expr is true.

Argument typeReturn type
( Bool )Bool

More

BOOL_OR on Apache Spark Documentation

Description

bool_and(expr) - Returns true if all values of expr are true.

Argument typeReturn type
( Bool )Bool

More

BOOL_AND on Apache Spark Documentation

Description

every(expr) - Returns true if all values of expr are true.

Argument typeReturn type
( Bool )Bool

More

EVERY on Apache Spark Documentation

Description

avg(expr) - Returns the mean calculated from values of a group.

Argument typeReturn type
( Integer or Float )Float

More

AVG on Apache Spark Documentation

Description

bit_or(expr) - Returns the bitwise OR of all non-null input values, or null if none.

Argument typeReturn type
( Integer )Integer

More

BIT_OR on Apache Spark Documentation

Description

bit_and(expr) - Returns the bitwise AND of all non-null input values, or null if none.

Argument typeReturn type
( Integer )Integer

More

BIT_AND on Apache Spark Documentation

Description

bit_xor(expr) - Returns the bitwise XOR of all non-null input values, or null if none.

Argument typeReturn type
( Integer )Integer

More

BIT_XOR on Apache Spark Documentation

Description

mean(expr) - Returns the mean calculated from values of a group.

Argument typeReturn type
( Integer or Float )Float

More

MEAN on Apache Spark Documentation

Description

count(*) - Returns the total number of retrieved rows, including rows containing null. count(expr) - Returns the number of rows for which the supplied expression is non-null. count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-null.

Argument typeReturn type
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map or )Integer

More

COUNT on Apache Spark Documentation

Description

count_if(expr) - Returns the number of TRUE values for the expression.

Argument typeReturn type
( Bool )Integer

More

COUNT_IF on Apache Spark Documentation

Description

max(expr) - Returns the maximum value of expr.

Argument typeReturn type
( Integer )Integer
( Float )Float
( Bool )Bool
( String )String
( Binary )Binary
( Date )Date
( Timestamp )Timestamp

More

MAX on Apache Spark Documentation

Description

max_by(x, y) - Returns the value of x associated with the maximum value of y.

Argument typeReturn type
( Integer, Integer or Float or Bool or String or Binary or Date or Timestamp )Integer
( Float, Integer or Float or Bool or String or Binary or Date or Timestamp )Float
( Bool, Integer or Float or Bool or String or Binary or Date or Timestamp )Bool
( String, Integer or Float or Bool or String or Binary or Date or Timestamp )String
( Binary, Integer or Float or Bool or String or Binary or Date or Timestamp )Binary
( Date, Integer or Float or Bool or String or Binary or Date or Timestamp )Date
( Timestamp, Integer or Float or Bool or String or Binary or Date or Timestamp )Timestamp
( Array, Integer or Float or Bool or String or Binary or Date or Timestamp )Array
( Struct, Integer or Float or Bool or String or Binary or Date or Timestamp )Struct
( Map, Integer or Float or Bool or String or Binary or Date or Timestamp )Map

More

MAX_BY on Apache Spark Documentation

Description

min_by(x, y) - Returns the value of x associated with the minimum value of y.

Argument typeReturn type
( Integer, Integer or Float or Bool or String or Binary or Date or Timestamp )Integer
( Float, Integer or Float or Bool or String or Binary or Date or Timestamp )Float
( Bool, Integer or Float or Bool or String or Binary or Date or Timestamp )Bool
( String, Integer or Float or Bool or String or Binary or Date or Timestamp )String
( Binary, Integer or Float or Bool or String or Binary or Date or Timestamp )Binary
( Date, Integer or Float or Bool or String or Binary or Date or Timestamp )Date
( Timestamp, Integer or Float or Bool or String or Binary or Date or Timestamp )Timestamp
( Array, Integer or Float or Bool or String or Binary or Date or Timestamp )Array
( Struct, Integer or Float or Bool or String or Binary or Date or Timestamp )Struct
( Map, Integer or Float or Bool or String or Binary or Date or Timestamp )Map

More

MIN_BY on Apache Spark Documentation

Description

min(expr) - Returns the minimum value of expr.

Argument typeReturn type
( Integer )Integer
( Float )Float
( Bool )Bool
( String )String
( Binary )Binary
( Date )Date
( Timestamp )Timestamp

More

MIN on Apache Spark Documentation

Description

sum(expr) - Returns the sum calculated from values of a group.

Argument typeReturn type
( Integer )Integer
( Float )Float

More

SUM on Apache Spark Documentation

Description

collect_list(expr) - Collects and returns a list of non-unique elements.

Argument typeReturn type
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map )Array

More

COLLECT_LIST on Apache Spark Documentation

Description

collect_set(expr) - Collects and returns a set of unique elements.

Argument typeReturn type
( Integer or Bool or String or Binary or Date or Timestamp )Array

More

COLLECT_SET on Apache Spark Documentation

Description

first(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

Argument typeReturn type
( Integer )Integer
( Integer, Bool )Integer
( Float )Float
( Float, Bool )Float
( Bool )Bool
( Bool, Bool )Bool
( String )String
( String, Bool )String
( Binary )Binary
( Binary, Bool )Binary
( Date )Date
( Date, Bool )Date
( Timestamp )Timestamp
( Timestamp, Bool )Timestamp
( Array )Array
( Array, Bool )Array
( Struct )Struct
( Struct, Bool )Struct
( Map )Map
( Map, Bool )Map

More

FIRST on Apache Spark Documentation

Description

first_value(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

Argument typeReturn type
( Integer )Integer
( Integer, Bool )Integer
( Float )Float
( Float, Bool )Float
( Bool )Bool
( Bool, Bool )Bool
( String )String
( String, Bool )String
( Binary )Binary
( Binary, Bool )Binary
( Date )Date
( Date, Bool )Date
( Timestamp )Timestamp
( Timestamp, Bool )Timestamp
( Array )Array
( Array, Bool )Array
( Struct )Struct
( Struct, Bool )Struct
( Map )Map
( Map, Bool )Map

More

FIRST_VALUE on Apache Spark Documentation

Description

last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

Argument typeReturn type
( Integer )Integer
( Integer, Bool )Integer
( Float )Float
( Float, Bool )Float
( Bool )Bool
( Bool, Bool )Bool
( String )String
( String, Bool )String
( Binary )Binary
( Binary, Bool )Binary
( Date )Date
( Date, Bool )Date
( Timestamp )Timestamp
( Timestamp, Bool )Timestamp
( Array )Array
( Array, Bool )Array
( Struct )Struct
( Struct, Bool )Struct
( Map )Map
( Map, Bool )Map

More

LAST on Apache Spark Documentation

Description

last_value(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

Argument typeReturn type
( Integer )Integer
( Integer, Bool )Integer
( Float )Float
( Float, Bool )Float
( Bool )Bool
( Bool, Bool )Bool
( String )String
( String, Bool )String
( Binary )Binary
( Binary, Bool )Binary
( Date )Date
( Date, Bool )Date
( Timestamp )Timestamp
( Timestamp, Bool )Timestamp
( Array )Array
( Array, Bool )Array
( Struct )Struct
( Struct, Bool )Struct
( Map )Map
( Map, Bool )Map

More

LAST_VALUE on Apache Spark Documentation

Description

corr(expr1, expr2) - Returns Pearson coefficient of correlation between a set of number pairs.

Argument typeReturn type
( Float, Float )Float

More

CORR on Apache Spark Documentation

Description

covar_pop(expr1, expr2) - Returns the population covariance of a set of number pairs.

Argument typeReturn type
( Float, Float )Float

More

COVAR_POP on Apache Spark Documentation

Description

covar_samp(expr1, expr2) - Returns the sample covariance of a set of number pairs.

Argument typeReturn type
( Float, Float )Float

More

COVAR_SAMP on Apache Spark Documentation

Description

kurtosis(expr) - Returns the kurtosis value calculated from values of a group.

Argument typeReturn type
( Float )Float

More

KURTOSIS on Apache Spark Documentation

Description

skewness(expr) - Returns the skewness value calculated from values of a group.

Argument typeReturn type
( Float )Float

More

SKEWNESS on Apache Spark Documentation

Description

stddev_pop(expr) - Returns the population standard deviation calculated from values of a group.

Argument typeReturn type
( Float )Float

More

STDDEV_POP on Apache Spark Documentation

Description

stddev_samp(expr) - Returns the sample standard deviation calculated from values of a group.

Argument typeReturn type
( Float )Float

More

STDDEV_SAMP on Apache Spark Documentation

Description

std(expr) - Returns the sample standard deviation calculated from values of a group.

Argument typeReturn type
( Float )Float

More

STD on Apache Spark Documentation

Description

stddev(expr) - Returns the sample standard deviation calculated from values of a group.

Argument typeReturn type
( Float )Float

More

STDDEV on Apache Spark Documentation

Description

var_pop(expr) - Returns the population variance calculated from values of a group.

Argument typeReturn type
( Float )Float

More

VAR_POP on Apache Spark Documentation

Description

var_samp(expr) - Returns the sample variance calculated from values of a group.

Argument typeReturn type
( Float )Float

More

VAR_SAMP on Apache Spark Documentation

Description

variance(expr) - Returns the sample variance calculated from values of a group.

Argument typeReturn type
( Float )Float

More

VARIANCE on Apache Spark Documentation

Description

percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency must be a positive integer.

Argument typeReturn type
( Float, Float )Float
( Float, Float, Integer )Float

More

PERCENTILE on Apache Spark Documentation

Description

percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation.

Argument typeReturn type
( Float, Float )Float
( Float, Float, Integer )Float

More

PERCENTILE_APPROX on Apache Spark Documentation

Description

approx_percentile(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation.

Argument typeReturn type
( Float, Float )Float
( Float, Float, Integer )Float

More

APPROX_PERCENTILE on Apache Spark Documentation

Description

approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++. relativeSD defines the maximum estimation error allowed.

Argument typeReturn type
( Integer or Bool or String or Binary or Date or Timestamp )Integer
( Integer or Bool or String or Binary or Date or Timestamp, Float )Integer

More

APPROX_COUNT_DISTINCT on Apache Spark Documentation
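
As a minimal illustrative sketch (the inline VALUES data and the alias tab are invented for demonstration, not taken from the official examples), a Spark SQL shell would be expected to report the three distinct values here:

> SELECT approx_count_distinct(col) FROM VALUES (1), (1), (2), (2), (3) AS tab(col);
3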

Description

count_min_sketch(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.

Argument typeReturn type
( Integer or Bool or String or Binary or Date or Timestamp, Float, Float, Integer )Binary

More

COUNT_MIN_SKETCH on Apache Spark Documentation

Description

isnull(expr) - Returns true if expr is null, or false otherwise.

Argument typeReturn type
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map )Bool

More

ISNULL on Apache Spark Documentation

Description

isnotnull(expr) - Returns true if expr is not null, or false otherwise.

Argument typeReturn type
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map )Bool

More

ISNOTNULL on Apache Spark Documentation

Description

nullif(expr1, expr2) - Returns null if expr1 equals to expr2, or expr1 otherwise.

Argument typeReturn type
( Integer, Integer )Integer
( Float, Float )Float
( Bool, Bool )Bool
( String, String )String
( Binary, Binary )Binary
( Date, Date )Date
( Timestamp, Timestamp )Timestamp
( Array, Array )Array
( Struct, Struct )Struct
( Map, Map )Map

More

NULLIF on Apache Spark Documentation
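
For illustration (arbitrary literals, assuming a Spark SQL shell), nullif returns null only when both arguments are equal:

> SELECT nullif(2, 2);
NULL
> SELECT nullif(2, 3);
2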

Description

not(expr) - Logical not.

Argument typeReturn type
( Bool )Bool

More

NOT on Apache Spark Documentation

Description

if(expr1, expr2, expr3) - If expr1 evaluates to true, then returns expr2; otherwise returns expr3.

Argument typeReturn type
( Bool, Integer, Integer )Integer
( Bool, Float, Float )Float
( Bool, Bool, Bool )Bool
( Bool, String, String )String
( Bool, Binary, Binary )Binary
( Bool, Date, Date )Date
( Bool, Timestamp, Timestamp )Timestamp
( Bool, Array, Array )Array
( Bool, Struct, Struct )Struct
( Bool, Map, Map )Map

More

IF on Apache Spark Documentation
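
A minimal hedged example with arbitrarily chosen literals: since 1 < 2 evaluates to true, the second argument is returned.

> SELECT if(1 < 2, 'a', 'b');
a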

Description

ifnull(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.

Argument typeReturn type
( Integer, Integer )Integer
( Float, Float )Float
( Bool, Bool )Bool
( String, String )String
( Binary, Binary )Binary
( Date, Date )Date
( Timestamp, Timestamp )Timestamp
( Array, Array )Array
( Struct, Struct )Struct
( Map, Map )Map

More

IFNULL on Apache Spark Documentation

Description

nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.

Argument typeReturn type
( Integer, Integer )Integer
( Float, Float )Float
( Bool, Bool )Bool
( String, String )String
( Binary, Binary )Binary
( Date, Date )Date
( Timestamp, Timestamp )Timestamp
( Array, Array )Array
( Struct, Struct )Struct
( Map, Map )Map

More

NVL on Apache Spark Documentation
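
Illustrative sketch only (not an official example): nvl falls back to the second argument when the first is null, and otherwise passes the first through, just like ifnull.

> SELECT nvl(NULL, 2);
2
> SELECT nvl(1, 2);
1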

Description

nvl2(expr1, expr2, expr3) - Returns expr2 if expr1 is not null, or expr3 otherwise.

Argument typeReturn type
( Integer, Integer, Integer )Integer
( Float, Float, Float )Float
( Bool, Bool, Bool )Bool
( String, String, String )String
( Binary, Binary, Binary )Binary
( Date, Date, Date )Date
( Timestamp, Timestamp, Timestamp )Timestamp
( Array, Array, Array )Array
( Struct, Struct, Struct )Struct
( Map, Map, Map )Map

More

NVL2 on Apache Spark Documentation

Description

coalesce(expr1, expr2, ...) - Returns the first non-null argument if one exists. Otherwise, null.

Argument typeReturn type
( Integer )Integer
( Float )Float
( Bool )Bool
( String )String
( Binary )Binary
( Date )Date
( Timestamp )Timestamp
( Array )Array
( Struct )Struct
( Map )Map

More

COALESCE on Apache Spark Documentation
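
A small sketch with invented literals: the first non-null argument wins.

> SELECT coalesce(NULL, 1, NULL);
1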

Description

raise_error(msg) - Raises an error with the specified string message. The function always returns a value of string data type.

Argument typeReturn type
( String )String

More

RAISE_ERROR on Apache Spark Documentation

Description

assert_true(expr) - Throws an exception if expr is not true.

Argument typeReturn type
( Bool )String

More

ASSERT_TRUE on Apache Spark Documentation

Description

typeof(expr) - Returns the DDL-formatted type string for the data type of the input.

Argument typeReturn type
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map )String

More

TYPEOF on Apache Spark Documentation

Description

bigint(expr) - Casts the value expr to the target data type bigint.

Argument typeReturn type
( Integer or Float or Bool or String )Integer

More

BIGINT on Apache Spark Documentation

Description

date(expr) - Casts the value expr to the target data type date.

Argument typeReturn type
( String or Date or Timestamp )Date

More

DATE on Apache Spark Documentation

Description

double(expr) - Casts the value expr to the target data type double.

Argument typeReturn type
( Integer or Float or Bool or String )Float

More

DOUBLE on Apache Spark Documentation

Description

int(expr) - Casts the value expr to the target data type int.

Argument typeReturn type
( Integer or Float or Bool or String )Integer

More

INT on Apache Spark Documentation

Description

timestamp(expr) - Casts the value expr to the target data type timestamp.

Argument typeReturn type
( String )Timestamp

More

TIMESTAMP on Apache Spark Documentation

Description

binary(expr) - Casts the value expr to the target data type binary.

Argument typeReturn type
( Integer or String or Binary )Binary

More

BINARY on Apache Spark Documentation

Description

boolean(expr) - Casts the value expr to the target data type boolean.

Argument typeReturn type
( Integer or Float or Bool or String or Date or Timestamp )Bool

More

BOOLEAN on Apache Spark Documentation

Description

decimal(expr) - Casts the value expr to the target data type decimal.

Argument typeReturn type
( Integer or Float or Bool or String )Float

More

DECIMAL on Apache Spark Documentation

Description

float(expr) - Casts the value expr to the target data type float.

Argument typeReturn type
( Integer or Float or Bool or String )Float

More

FLOAT on Apache Spark Documentation

Description

smallint(expr) - Casts the value expr to the target data type smallint.

Argument typeReturn type
( Integer or Float or Bool or String )Integer

More

SMALLINT on Apache Spark Documentation

Description

string(expr) - Casts the value expr to the target data type string.

Argument typeReturn type
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map )String

More

STRING on Apache Spark Documentation

Description

rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row in the ordering of the partition. Tied values will produce gaps in the sequence.

Argument typeReturn type
No argumentsInteger

More

RANK on Apache Spark Documentation

Description

dense_rank() - Computes the rank of a value in a group of values. The result is one plus the previously assigned rank value. Unlike the function rank, dense_rank will not produce gaps in the ranking sequence.

Argument typeReturn type
No argumentsInteger

More

DENSE_RANK on Apache Spark Documentation
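
The following hedged sketch (the inline table tab(a) and the column aliases are made up for demonstration) contrasts rank and dense_rank on a tie: rank leaves a gap after the tied rows, dense_rank does not.

> SELECT a, rank() OVER (ORDER BY a) AS r, dense_rank() OVER (ORDER BY a) AS dr FROM VALUES (1), (1), (2) AS tab(a);
1 1 1
1 1 1
2 3 2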

Description

percent_rank() - Computes the percentage ranking of a value in a group of values.

Argument typeReturn type
No argumentsFloat

More

PERCENT_RANK on Apache Spark Documentation

Description

cume_dist() - Computes the position of a value relative to all values in the partition.

Argument typeReturn type
No argumentsFloat

More

CUME_DIST on Apache Spark Documentation

Description

ntile(n) - Divides the rows for each window partition into n buckets ranging from 1 to at most n.

Argument typeReturn type
( Integer )Integer

More

NTILE on Apache Spark Documentation

Description

row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.

Argument typeReturn type
No argumentsInteger

More

ROW_NUMBER on Apache Spark Documentation

Description

bin(expr) - Returns the string representation of the long value expr represented in binary.

Argument typeReturn type
( Integer )String

More

BIN on Apache Spark Documentation

Description

conv(num, from_base, to_base) - Converts num from from_base to to_base.

Argument typeReturn type
( String, Integer, Integer )String

More

CONV on Apache Spark Documentation
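
For illustration (an arbitrary input, assuming a Spark SQL shell): '100' in base 2 is 4 in base 10.

> SELECT conv('100', 2, 10);
4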

Description

monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

Argument typeReturn type
No argumentsInteger

More

MONOTONICALLY_INCREASING_ID on Apache Spark Documentation

Description

lead(input[, offset[, default]]) - Returns the value of input at the offset-th row after the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offset-th row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the last row of the window does not have any subsequent row), default is returned.

Argument typeReturn type
( Integer )Integer
( Integer, Integer )Integer
( Integer, Integer, Integer )Integer
( Float )Float
( Float, Integer )Float
( Float, Integer, Float )Float
( Bool )Bool
( Bool, Integer )Bool
( Bool, Integer, Bool )Bool
( String )String
( String, Integer )String
( String, Integer, String )String
( Binary )Binary
( Binary, Integer )Binary
( Binary, Integer, Binary )Binary
( Date )Date
( Date, Integer )Date
( Date, Integer, Date )Date
( Timestamp )Timestamp
( Timestamp, Integer )Timestamp
( Timestamp, Integer, Timestamp )Timestamp
( Array )Array
( Array, Integer )Array
( Array, Integer, Array )Array
( Struct )Struct
( Struct, Integer )Struct
( Struct, Integer, Struct )Struct
( Map )Map
( Map, Integer )Map
( Map, Integer, Map )Map

More

LEAD on Apache Spark Documentation

Description

lag(input[, offset[, default]]) - Returns the value of input at the offset-th row before the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offset-th row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the first row of the window does not have any previous row), default is returned.

Argument typeReturn type
( Integer )Integer
( Integer, Integer )Integer
( Integer, Integer, Integer )Integer
( Float )Float
( Float, Integer )Float
( Float, Integer, Float )Float
( Bool )Bool
( Bool, Integer )Bool
( Bool, Integer, Bool )Bool
( String )String
( String, Integer )String
( String, Integer, String )String
( Binary )Binary
( Binary, Integer )Binary
( Binary, Integer, Binary )Binary
( Date )Date
( Date, Integer )Date
( Date, Integer, Date )Date
( Timestamp )Timestamp
( Timestamp, Integer )Timestamp
( Timestamp, Integer, Timestamp )Timestamp
( Array )Array
( Array, Integer )Array
( Array, Integer, Array )Array
( Struct )Struct
( Struct, Integer )Struct
( Struct, Integer, Struct )Struct
( Map )Map
( Map, Integer )Map
( Map, Integer, Map )Map

More

LAG on Apache Spark Documentation
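
A combined lead/lag sketch (the inline table tab(a) and the aliases prev_a/next_a are invented for demonstration): each row sees its previous and next value in the ordering, with null at the edges.

> SELECT a, lag(a, 1) OVER (ORDER BY a) AS prev_a, lead(a, 1) OVER (ORDER BY a) AS next_a FROM VALUES (1), (2), (3) AS tab(a);
1 NULL 2
2 1 3
3 2 NULL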

Description

bit_count(expr) - Returns the number of bits that are set in the argument expr as an unsigned 64-bit integer, or NULL if the argument is NULL.

Argument typeReturn type
( Integer or Bool )Integer

More

BIT_COUNT on Apache Spark Documentation

Description

shiftleft(base, expr) - Bitwise left shift.

Argument typeReturn type
( Integer, Integer )Integer

More

SHIFTLEFT on Apache Spark Documentation

Description

shiftright(base, expr) - Bitwise (signed) right shift.

Argument typeReturn type
( Integer, Integer )Integer

More

SHIFTRIGHT on Apache Spark Documentation

Description

shiftrightunsigned(base, expr) - Bitwise unsigned right shift.

Argument typeReturn type
( Integer, Integer )Integer

More

SHIFTRIGHTUNSIGNED on Apache Spark Documentation
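
A hedged sketch of the shift operators with arbitrary literals: shifting a positive integer left by one bit doubles it, shifting right by one bit halves it.

> SELECT shiftleft(2, 1);
4
> SELECT shiftright(4, 1);
2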

Description

e() - Returns Euler's number, e.

Argument typeReturn type
No argumentsFloat

More

E on Apache Spark Documentation

Description

pi() - Returns pi.

Argument typeReturn type
No argumentsFloat

More

PI on Apache Spark Documentation

Description

abs(expr) - Returns the absolute value of the numeric value.

Argument typeReturn type
( Integer )Integer
( Float )Float

More

ABS on Apache Spark Documentation

Description

negative(expr) - Returns the negated value of expr.

Argument typeReturn type
( Integer )Integer
( Float )Float

More

NEGATIVE on Apache Spark Documentation

Description

positive(expr) - Returns the value of expr.

Argument typeReturn type
( Integer )Integer
( Float )Float

More

POSITIVE on Apache Spark Documentation

Description

sign(expr) - Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive.

Argument typeReturn type
( Integer or Float )Float

More

SIGN on Apache Spark Documentation

Description

signum(expr) - Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive.

Argument typeReturn type
( Integer or Float )Float

More

SIGNUM on Apache Spark Documentation

Description

isnan(expr) - Returns true if expr is NaN, or false otherwise.

Argument typeReturn type
( Float )Bool

More

ISNAN on Apache Spark Documentation

Description

rand([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).

Argument typeReturn type
No argumentsFloat
( Integer )Float

More

RAND on Apache Spark Documentation

Description

randn([seed]) - Returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution.

Argument typeReturn type
No argumentsFloat
( Integer )Float

More

RANDN on Apache Spark Documentation

Description

uuid() - Returns a universally unique identifier (UUID) string. The value is returned as a canonical UUID 36-character string.

Argument typeReturn type
No argumentsString

More

UUID on Apache Spark Documentation

Description

sqrt(expr) - Returns the square root of expr.

Argument typeReturn type
( Float )Float

More

SQRT on Apache Spark Documentation

Description

cbrt(expr) - Returns the cube root of expr.

Argument typeReturn type
( Float )Float

More

CBRT on Apache Spark Documentation

Description

hypot(expr1, expr2) - Returns sqrt(expr1² + expr2²).

Argument typeReturn type
( Float, Float )Float

More

HYPOT on Apache Spark Documentation

Description

pow(expr1, expr2) - Raises expr1 to the power of expr2.

Argument typeReturn type
( Float, Float )Float

More

POW on Apache Spark Documentation

Description

exp(expr) - Returns e to the power of expr.

Argument typeReturn type
( Float )Float

More

EXP on Apache Spark Documentation

Description

expm1(expr) - Returns exp(expr) - 1.

Argument typeReturn type
( Float )Float

More

EXPM1 on Apache Spark Documentation

Description

ln(expr) - Returns the natural logarithm (base e) of expr.

Argument typeReturn type
( Float )Float

More

LN on Apache Spark Documentation

Description

log(base, expr) - Returns the logarithm of expr with base.

Argument typeReturn type
( Float )Float
( Float, Float )Float

More

LOG on Apache Spark Documentation

Description

log10(expr) - Returns the logarithm of expr with base 10.

Argument typeReturn type
( Float )Float

More

LOG10 on Apache Spark Documentation

Description

log1p(expr) - Returns log(1 + expr).

Argument typeReturn type
( Float )Float

More

LOG1P on Apache Spark Documentation

Description

log2(expr) - Returns the logarithm of expr with base 2.

Argument typeReturn type
( Float )Float

More

LOG2 on Apache Spark Documentation

Description

greatest(expr, ...) - Returns the greatest value of all parameters, skipping null values.

Argument typeReturn type
( Integer, Integer )Integer
( Float, Float )Float
( Bool, Bool )Bool
( String, String )String
( Binary, Binary )Binary
( Date, Date )Date
( Timestamp, Timestamp )Timestamp

More

GREATEST on Apache Spark Documentation

Description

least(expr, ...) - Returns the least value of all parameters, skipping null values.

Argument typeReturn type
( Integer, Integer )Integer
( Float, Float )Float
( Bool, Bool )Bool
( String, String )String
( Binary, Binary )Binary
( Date, Date )Date
( Timestamp, Timestamp )Timestamp

More

LEAST on Apache Spark Documentation
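
Illustrative only (the literal arguments are invented): greatest and least pick from the opposite ends of the same argument list.

> SELECT greatest(10, 9, 2, 4, 3);
10
> SELECT least(10, 9, 2, 4, 3);
2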

Description

mod(expr1, expr2) - Returns the remainder after expr1/expr2.

Argument typeReturn type
( Integer, Integer )Integer
( Float, Float )Float

More

MOD on Apache Spark Documentation

Description

pmod(expr1, expr2) - Returns the positive value of expr1 mod expr2.

Argument typeReturn type
( Integer, Integer )Integer
( Float, Float )Float

More

PMOD on Apache Spark Documentation
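
A short sketch contrasting pmod with the % operator on a negative dividend (values chosen arbitrarily): % keeps the sign of the dividend, pmod is always non-negative.

> SELECT -7 % 3;
-1
> SELECT pmod(-7, 3);
2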

Description

factorial(expr) - Returns the factorial of expr. expr must be in the range [0..20]; otherwise, null is returned.

Argument typeReturn type
( Integer )Integer

More

FACTORIAL on Apache Spark Documentation
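
For illustration (assuming a Spark SQL shell): inside the supported range the factorial is returned, outside it the result is null.

> SELECT factorial(5);
120
> SELECT factorial(21);
NULL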

Description

nanvl(expr1, expr2) - Returns expr1 if it's not NaN, or expr2 otherwise.

Argument typeReturn type
( Float, Float )Float

More

NANVL on Apache Spark Documentation

Description

div(expr1, expr2) - Divide expr1 by expr2. It returns NULL if an operand is NULL or expr2 is 0. The result is cast to long.

Argument typeReturn type
( Integer, Integer )Integer

More

DIV on Apache Spark Documentation

Description

power(expr1, expr2) - Raises expr1 to the power of expr2.

Argument typeReturn type
( Float, Float )Float

More

POWER on Apache Spark Documentation

Description

random([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).

Argument typeReturn type
No argumentsFloat
( Integer )Float

More

RANDOM on Apache Spark Documentation

Description

round(expr, d) - Returns expr rounded to d decimal places using HALF_UP rounding mode.

Argument typeReturn type
( Float )Float
( Float, Integer )Float

More

ROUND on Apache Spark Documentation

Description

bround(expr, d) - Returns expr rounded to d decimal places using HALF_EVEN rounding mode.

Argument typeReturn type
( Float )Float
( Float, Integer )Float

More

BROUND on Apache Spark Documentation
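
A minimal sketch of the two rounding modes on the same arbitrary input: HALF_UP rounds 2.5 away from zero, HALF_EVEN rounds it to the nearest even digit.

> SELECT round(2.5, 0);
3
> SELECT bround(2.5, 0);
2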

Description

ceil(expr) - Returns the smallest integer not smaller than expr.

Argument typeReturn type
( Float )Float

More

CEIL on Apache Spark Documentation

Description

ceiling(expr) - Returns the smallest integer not smaller than expr.

Argument typeReturn type
( Float )Float

More

CEILING on Apache Spark Documentation

Description

floor(expr) - Returns the largest integer not greater than expr.

Argument typeReturn type
( Float )Float

More

FLOOR on Apache Spark Documentation

Description

rint(expr) - Returns the double value that is closest in value to the argument and is equal to a mathematical integer.

Argument typeReturn type
( Float )Float

More

RINT on Apache Spark Documentation

Description

degrees(expr) - Converts radians to degrees.

Argument typeReturn type
( Float )Float

More

DEGREES on Apache Spark Documentation

Description

radians(expr) - Converts degrees to radians.

Argument typeReturn type
( Float )Float

More

RADIANS on Apache Spark Documentation

Description

cos(expr) - Returns the cosine of expr.

Argument typeReturn type
( Float )Float

More

COS on Apache Spark Documentation

Description

cosh(expr) - Returns the hyperbolic cosine of expr.

Argument typeReturn type
( Float )Float

More

COSH on Apache Spark Documentation

Description

acos(expr) - Returns the inverse cosine (a.k.a. arccosine) of expr if -1<=expr<=1 or NaN otherwise.

Argument typeReturn type
( Float )Float

More

ACOS on Apache Spark Documentation

Description

acosh(expr) - Returns inverse hyperbolic cosine of expr.

Argument typeReturn type
( Float )Float

More

ACOSH on Apache Spark Documentation

Description

sin(expr) - Returns the sine of expr.

Argument typeReturn type
( Float )Float

More

SIN on Apache Spark Documentation

Description

sinh(expr) - Returns the hyperbolic sine of expr.

Argument typeReturn type
( Float )Float

More

SINH on Apache Spark Documentation

Description

asin(expr) - Returns the inverse sine (a.k.a. arcsine) of expr if -1<=expr<=1, or NaN otherwise.

Argument typeReturn type
( Float )Float

More

ASIN on Apache Spark Documentation

Description

asinh(expr) - Returns inverse hyperbolic sine of expr.

Argument typeReturn type
( Float )Float

More

ASINH on Apache Spark Documentation

Description

tan(expr) - Returns the tangent of expr.

Argument typeReturn type
( Float )Float

More

TAN on Apache Spark Documentation

Description

tanh(expr) - Returns the hyperbolic tangent of expr.

Argument typeReturn type
( Float )Float

More

TANH on Apache Spark Documentation

Description

atanh(expr) - Returns inverse hyperbolic tangent of expr.

Argument typeReturn type
( Float )Float

More

ATANH on Apache Spark Documentation

Description

cot(expr) - Returns the cotangent of expr.

Argument typeReturn type
( Float )Float

More

COT on Apache Spark Documentation

Description

atan(expr) - Returns the inverse tangent (a.k.a. arctangent).

Argument typeReturn type
( Float )Float

More

ATAN on Apache Spark Documentation

Description

atan2(expr1, expr2) - Returns the angle in radians between the positive x-axis of a plane and the point given by the coordinates (expr1, expr2).

Argument typeReturn type
( Float, Float )Float

More

Source: https://developer.ascend.io/docs/sql-functions


!

! expr - Logical not.

%

expr1 % expr2 - Returns the remainder after /.

Examples:

&

expr1 & expr2 - Returns the result of bitwise AND of and .

Examples:

*

expr1 * expr2 - Returns *.

Examples:

+

expr1 + expr2 - Returns +.

Examples:

-

expr1 - expr2 - Returns -.

Examples:

/

expr1 / expr2 - Returns /. It always performs floating point division.

Examples:

<

expr1 < expr2 - Returns true if is less than .

Arguments:

  • expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such array/struct, the data types of fields must be orderable.

Examples:

<=

expr1 <= expr2 - Returns true if is less than or equal to .

Arguments:

  • expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such array/struct, the data types of fields must be orderable.

Examples:

<=>

expr1 <=> expr2 - Returns same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of the them is null.

Arguments:

  • expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such array/struct, the data types of fields must be orderable.

Examples:

=

expr1 = expr2 - Returns true if equals , or false otherwise.

Arguments:

  • expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such array/struct, the data types of fields must be orderable.

Examples:

==

expr1 == expr2 - Returns true if equals , or false otherwise.

Arguments:

  • expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such array/struct, the data types of fields must be orderable.

Examples:

>

expr1 > expr2 - Returns true if is greater than .

Arguments:

  • expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such array/struct, the data types of fields must be orderable.

Examples:

>=

expr1 >= expr2 - Returns true if is greater than or equal to .

Arguments:

  • expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such array/struct, the data types of fields must be orderable.

Examples:

^

expr1 ^ expr2 - Returns the result of bitwise exclusive OR of and .

Examples:

abs

abs(expr) - Returns the absolute value of the numeric value.

Examples:

acos

acos(expr) - Returns the inverse cosine (a.k.a. arccosine) of expr if -1<=expr<=1, or NaN otherwise.

Examples:

add_months

add_months(start_date, num_months) - Returns the date that is num_months after start_date.

Examples:

Since: 1.5.0

and

expr1 and expr2 - Logical AND.

approx_count_distinct

approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++. relativeSD defines the maximum estimation error allowed.

approx_percentile

approx_percentile(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation. When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.

Examples:

array

array(expr, ...) - Returns an array with the given elements.

Examples:

array_contains

array_contains(array, value) - Returns true if the array contains the value.

Examples:

ascii

ascii(str) - Returns the numeric value of the first character of .

Examples:

asin

asin(expr) - Returns the inverse sine (a.k.a. arcsine) of expr if -1<=expr<=1, or NaN otherwise.

Examples:

assert_true

assert_true(expr) - Throws an exception if expr is not true.

Examples:

atan

atan(expr) - Returns the inverse tangent (a.k.a. arctangent).

Examples:

atan2

atan2(expr1, expr2) - Returns the angle in radians between the positive x-axis of a plane and the point given by the coordinates (, ).

Examples:

avg

avg(expr) - Returns the mean calculated from values of a group.

base64

base64(bin) - Converts the argument from a binary to a base 64 string.

Examples:

bigint

bigint(expr) - Casts the value expr to the target data type bigint.

bin

bin(expr) - Returns the string representation of the long value represented in binary.

Examples:

binary

binary(expr) - Casts the value to the target data type .

bit_length

bit_length(expr) - Returns the bit length of string data or number of bits of binary data.

Examples:

boolean

boolean(expr) - Casts the value to the target data type .

bround

bround(expr, d) - Returns expr rounded to d decimal places using HALF_EVEN rounding mode.

Examples:

cast

cast(expr AS type) - Casts the value expr to the target data type type.

Examples:

cbrt

cbrt(expr) - Returns the cube root of .

Examples:

ceil

ceil(expr) - Returns the smallest integer not smaller than .

Examples:

ceiling

ceiling(expr) - Returns the smallest integer not smaller than .

Examples:

char

char(expr) - Returns the ASCII character having the binary equivalent to . If n is larger than 256 the result is equivalent to chr(n % 256)

Examples:

char_length

char_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Examples:

character_length

character_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Examples:

chr

chr(expr) - Returns the ASCII character having the binary equivalent to . If n is larger than 256 the result is equivalent to chr(n % 256)

Examples:

coalesce

coalesce(expr1, expr2, ...) - Returns the first non-null argument if one exists. Otherwise, null.

Examples:

collect_list

collect_list(expr) - Collects and returns a list of non-unique elements.

collect_set

collect_set(expr) - Collects and returns a set of unique elements.

concat

concat(str1, str2, ..., strN) - Returns the concatenation of str1, str2, ..., strN.

Examples:

concat_ws

concat_ws(sep, [str | array(str)]+) - Returns the concatenation of the strings separated by .

Examples:

conv

conv(num, from_base, to_base) - Converts num from from_base to to_base.

Examples:

corr

corr(expr1, expr2) - Returns Pearson coefficient of correlation between a set of number pairs.

cos

cos(expr) - Returns the cosine of .

Examples:

cosh

cosh(expr) - Returns the hyperbolic cosine of .

Examples:

cot

cot(expr) - Returns the cotangent of .

Examples:

count

count(*) - Returns the total number of retrieved rows, including rows containing null.

count(expr) - Returns the number of rows for which the supplied expression is non-null.

count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-null.

count_min_sketch

count_min_sketch(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.

covar_pop

covar_pop(expr1, expr2) - Returns the population covariance of a set of number pairs.

covar_samp

covar_samp(expr1, expr2) - Returns the sample covariance of a set of number pairs.

crc32

crc32(expr) - Returns a cyclic redundancy check value of the as a bigint.

Examples:

cube

cume_dist

cume_dist() - Computes the position of a value relative to all values in the partition.

current_database

current_database() - Returns the current database.

Examples:

current_date

current_date() - Returns the current date at the start of query evaluation.

Since: 1.5.0

current_timestamp

current_timestamp() - Returns the current timestamp at the start of query evaluation.

Since: 1.5.0

date

date(expr) - Casts the value to the target data type .

date_add

date_add(start_date, num_days) - Returns the date that is num_days after start_date.

Examples:

Since: 1.5.0

date_format

date_format(timestamp, fmt) - Converts to a value of string in the format specified by the date format .

Examples:

Since: 1.5.0

date_sub

date_sub(start_date, num_days) - Returns the date that is num_days before start_date.

Examples:

Since: 1.5.0

date_trunc

date_trunc(fmt, ts) - Returns timestamp ts truncated to the unit specified by the format model fmt. fmt should be one of ["YEAR", "YYYY", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", "WEEK", "QUARTER"]

Examples:

Since: 2.3.0

datediff

datediff(endDate, startDate) - Returns the number of days from startDate to endDate.

Examples:

Since: 1.5.0

day

day(date) - Returns the day of month of the date/timestamp.

Examples:

Since: 1.5.0

dayofmonth

dayofmonth(date) - Returns the day of month of the date/timestamp.

Examples:

Since: 1.5.0

dayofweek

dayofweek(date) - Returns the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday).

Examples:

Since: 2.3.0

dayofyear

dayofyear(date) - Returns the day of year of the date/timestamp.

Examples:

Since: 1.5.0

decimal

decimal(expr) - Casts the value to the target data type .

decode

decode(bin, charset) - Decodes the first argument using the second argument character set.

Examples:

degrees

degrees(expr) - Converts radians to degrees.

Examples:

dense_rank

dense_rank() - Computes the rank of a value in a group of values. The result is one plus the previously assigned rank value. Unlike the function rank, dense_rank will not produce gaps in the ranking sequence.

double

double(expr) - Casts the value to the target data type .

e

e() - Returns Euler's number, e.

Examples:

elt

elt(n, input1, input2, ...) - Returns the n-th input, e.g., returns input2 when n is 2.

Examples:

encode

encode(str, charset) - Encodes the first argument using the second argument character set.

Examples:

exp

exp(expr) - Returns e to the power of .

Examples:

explode

explode(expr) - Separates the elements of array into multiple rows, or the elements of map into multiple rows and columns.

Examples:

explode_outer

explode_outer(expr) - Separates the elements of array into multiple rows, or the elements of map into multiple rows and columns.

Examples:

expm1

expm1(expr) - Returns exp() - 1.

Examples:

factorial

factorial(expr) - Returns the factorial of expr. expr must be in the range [0..20]; otherwise, null is returned.

Examples:

find_in_set

find_in_set(str, str_array) - Returns the index (1-based) of the given string (str) in the comma-delimited list (str_array). Returns 0 if the string was not found or if the given string (str) contains a comma.

Examples:

first

first(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

first_value

first_value(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

float

float(expr) - Casts the value to the target data type .

floor

floor(expr) - Returns the largest integer not greater than .

Examples:

format_number

format_number(expr1, expr2) - Formats the number like '#,###,###.##', rounded to decimal places. If is 0, the result has no decimal point or fractional part. This is supposed to function like MySQL's FORMAT.

Examples:

format_string

format_string(strfmt, obj, ...) - Returns a formatted string from printf-style format strings.

Examples:

from_json

from_json(jsonStr, schema[, options]) - Returns a struct value with the given and .

Examples:

Since: 2.2.0

from_unixtime

from_unixtime(unix_time, format) - Returns unix_time in the specified format.

Examples:

Since: 1.5.0

from_utc_timestamp

from_utc_timestamp(timestamp, timezone) - Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.

Examples:

Since: 1.5.0

get_json_object

get_json_object(json_txt, path) - Extracts a json object from .

Examples:

greatest

greatest(expr, ...) - Returns the greatest value of all parameters, skipping null values.

Examples:

grouping

grouping_id

hash

hash(expr1, expr2, ...) - Returns a hash value of the arguments.

Examples:

hex

hex(expr) - Converts to hexadecimal.

Examples:

hour

hour(timestamp) - Returns the hour component of the string/timestamp.

Examples:

Since: 1.5.0

hypot

hypot(expr1, expr2) - Returns sqrt(expr1² + expr2²).

Examples:

if

if(expr1, expr2, expr3) - If expr1 evaluates to true, then returns expr2; otherwise returns expr3.

Examples:

ifnull

ifnull(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.

Examples:

in

expr1 in(expr2, expr3, ...) - Returns true if equals to any valN.

Arguments:

  • expr1, expr2, expr3, ... - the arguments must be same type.

Examples:

initcap

initcap(str) - Returns with the first letter of each word in uppercase. All other letters are in lowercase. Words are delimited by white space.

Examples:

inline

inline(expr) - Explodes an array of structs into a table.

Examples:

inline_outer

inline_outer(expr) - Explodes an array of structs into a table.

Examples:

input_file_block_length

input_file_block_length() - Returns the length of the block being read, or -1 if not available.

input_file_block_start

input_file_block_start() - Returns the start offset of the block being read, or -1 if not available.

input_file_name

input_file_name() - Returns the name of the file being read, or empty string if not available.

instr

instr(str, substr) - Returns the (1-based) index of the first occurrence of substr in str.

Examples:

int

int(expr) - Casts the value to the target data type .

isnan

isnan(expr) - Returns true if is NaN, or false otherwise.

Examples:

isnotnull

isnotnull(expr) - Returns true if is not null, or false otherwise.

Examples:

isnull

isnull(expr) - Returns true if is null, or false otherwise.

Examples:

java_method

java_method(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.

Examples:

json_tuple

json_tuple(jsonStr, p1, p2, ..., pn) - Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.

Examples:

kurtosis

kurtosis(expr) - Returns the kurtosis value calculated from values of a group.

lag

lag(input[, offset[, default]]) - Returns the value of input at the offset-th row before the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offset-th row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the first row of the window does not have any previous row), default is returned.

last

last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

last_day

last_day(date) - Returns the last day of the month which the date belongs to.

Examples:

Since: 1.5.0

last_value

last_value(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

lcase

lcase(str) - Returns with all characters changed to lowercase.

Examples:

lead

lead(input[, offset[, default]]) - Returns the value of input at the offset-th row after the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offset-th row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the last row of the window does not have any subsequent row), default is returned.

least

least(expr, ...) - Returns the least value of all parameters, skipping null values.

Examples:

left

left(str, len) - Returns the leftmost len (len can be string type) characters from the string str. If len is less than or equal to 0, the result is an empty string.

Examples:

length

length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Examples:

levenshtein

levenshtein(str1, str2) - Returns the Levenshtein distance between the two given strings.

Examples:

like

str like pattern - Returns true if str matches pattern, null if any arguments are null, false otherwise.

Arguments:

  • str - a string expression
  • pattern - a string expression. The pattern is a string which is matched literally, with exception to the following special symbols:

    _ matches any one character in the input (similar to . in posix regular expressions)

    % matches zero or more characters in the input (similar to .* in posix regular expressions)

    The escape character is '\'. If an escape character precedes a special symbol or another escape character, the following character is matched literally. It is invalid to escape any other character.

    Since Spark 2.0, string literals are unescaped in our SQL parser. For example, in order to match "\abc", the pattern should be "\\abc".

    When SQL config 'spark.sql.parser.escapedStringLiterals' is enabled, it fallbacks to Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the pattern to match "\abc" should be "\abc".

Examples:

Note:

Use RLIKE to match with standard regular expressions.
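
As an illustrative sketch (the string literals are invented, not official examples):

> SELECT 'Spark' like 'S%';
true
> SELECT 'Spark' like '_park';
true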

ln

ln(expr) - Returns the natural logarithm (base e) of .

Examples:

locate

locate(substr, str[, pos]) - Returns the position of the first occurrence of in after position . The given and return value are 1-based.

Examples:

log

log(base, expr) - Returns the logarithm of with .

Examples:

log10

log10(expr) - Returns the logarithm of with base 10.

Examples:

log1p

log1p(expr) - Returns log(1 + ).

Examples:

log2

log2(expr) - Returns the logarithm of with base 2.

Examples:

lower

lower(str) - Returns with all characters changed to lowercase.

Examples:

lpad

lpad(str, len, pad) - Returns , left-padded with to a length of . If is longer than , the return value is shortened to characters.

Examples:

ltrim

ltrim(str) - Removes the leading space characters from .

ltrim(trimStr, str) - Removes the leading string contains the characters from the trim string

Arguments:

  • str - a string expression
  • trimStr - the trim string characters to trim, the default value is a single space

Examples:

map

map(key0, value0, key1, value1, ...) - Creates a map with the given key/value pairs.

Examples:

map_keys

map_keys(map) - Returns an unordered array containing the keys of the map.

Examples:

map_values

map_values(map) - Returns an unordered array containing the values of the map.

Examples:

max

max(expr) - Returns the maximum value of .

md5

md5(expr) - Returns an MD5 128-bit checksum as a hex string of .

Examples:

mean

mean(expr) - Returns the mean calculated from values of a group.

min

min(expr) - Returns the minimum value of .

minute

minute(timestamp) - Returns the minute component of the string/timestamp.

Examples:

Since: 1.5.0

mod

expr1 mod expr2 - Returns the remainder after expr1/expr2.

Examples:

monotonically_increasing_id

monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

month

month(date) - Returns the month component of the date/timestamp.

Examples:

Since: 1.5.0

months_between

months_between(timestamp1, timestamp2) - Returns number of months between and .

Examples:

Since: 1.5.0

named_struct

named_struct(name1, val1, name2, val2, ...) - Creates a struct with the given field names and values.

Examples:

nanvl

nanvl(expr1, expr2) - Returns if it's not NaN, or otherwise.

Examples:

negative

negative(expr) - Returns the negated value of .

Examples:

next_day

next_day(start_date, day_of_week) - Returns the first date which is later than and named as indicated.

Examples:

Since: 1.5.0

not

not expr - Logical not.

now

now() - Returns the current timestamp at the start of query evaluation.

Since: 1.5.0

ntile

ntile(n) - Divides the rows for each window partition into buckets ranging from 1 to at most .

nullif

nullif(expr1, expr2) - Returns null if expr1 equals to expr2, or expr1 otherwise.

Examples:

nvl

nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.

Examples:

nvl2

nvl2(expr1, expr2, expr3) - Returns expr2 if expr1 is not null, or expr3 otherwise.

Examples:

octet_length

octet_length(expr) - Returns the byte length of string data or number of bytes of binary data.

Examples:

or

expr1 or expr2 - Logical OR.

parse_url

parse_url(url, partToExtract[, key]) - Extracts a part from a URL.

Examples:

percent_rank

percent_rank() - Computes the percentage ranking of a value in a group of values.

percentile

percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency must be a positive integer.

percentile(col, array(percentage1 [, percentage2]...) [, frequency]) - Returns the exact percentile value array of numeric column col at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0. The value of frequency must be a positive integer.

percentile_approx

percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation. When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.

Examples:

pi

pi() - Returns pi.

Examples:

pmod

pmod(expr1, expr2) - Returns the positive value of expr1 mod expr2.

Examples:

posexplode

posexplode(expr) - Separates the elements of array into multiple rows with positions, or the elements of map into multiple rows and columns with positions.

Examples:

posexplode_outer

posexplode_outer(expr) - Separates the elements of array into multiple rows with positions, or the elements of map into multiple rows and columns with positions.

Examples:

position

position(substr, str[, pos]) - Returns the position of the first occurrence of in after position . The given and return value are 1-based.

Examples:

positive

positive(expr) - Returns the value of .

pow

pow(expr1, expr2) - Raises to the power of .

Examples:

power

power(expr1, expr2) - Raises to the power of .

Examples:

printf

printf(strfmt, obj, ...) - Returns a formatted string from printf-style format strings.

Examples:

quarter

quarter(date) - Returns the quarter of the year for date, in the range 1 to 4.

Examples:

Since: 1.5.0

radians

radians(expr) - Converts degrees to radians.

Examples:

rand

rand([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).

Examples:

randn

randn([seed]) - Returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution.

Examples:

rank

rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row in the ordering of the partition. Tied values will produce gaps in the sequence.

reflect

reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.

Examples:

regexp_extract

regexp_extract(str, regexp[, idx]) - Extracts a group that matches regexp.

Examples:

regexp_replace

regexp_replace(str, regexp, rep) - Replaces all substrings of str that match regexp with rep.

Examples:

repeat

repeat(str, n) - Returns the string which repeats the given string value n times.

Examples:

replace

replace(str, search[, replace]) - Replaces all occurrences of search with replace.

Arguments:

  • str - a string expression
  • search - a string expression. If search is not found in str, str is returned unchanged.
  • replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that is removed from str.

Examples:

reverse

reverse(str) - Returns the reversed given string.

Examples:

right

right(str, len) - Returns the rightmost ( can be string type) characters from the string ,if is less or equal than 0 the result is an empty string.

Examples:

rint

rint(expr) - Returns the double value that is closest in value to the argument and is equal to a mathematical integer.

Examples:

rlike

str rlike regexp - Returns true if str matches regexp, or false otherwise.

Arguments:

  • str - a string expression
  • regexp - a string expression. The pattern string should be a Java regular expression.

    Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser. For example, to match "\abc", a regular expression for regexp can be "^\\abc$".

    There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fall back to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

Examples:

Note:

Use LIKE to match with simple string pattern.

rollup

round

round(expr, d) - Returns expr rounded to d decimal places using HALF_UP rounding mode.

Examples:

row_number

row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.

rpad

rpad(str, len, pad) - Returns , right-padded with to a length of . If is longer than , the return value is shortened to characters.

Examples:

rtrim

rtrim(str) - Removes the trailing space characters from .

rtrim(trimStr, str) - Removes the trailing string which contains the characters from the trim string from the

Arguments:

  • str - a string expression
  • trimStr - the trim string characters to trim, the default value is a single space

Examples:

second

second(timestamp) - Returns the second component of the string/timestamp.

Examples:

Since: 1.5.0

sentences

sentences(str[, lang, country]) - Splits into an array of array of words.

Examples:

sha

sha(expr) - Returns a sha1 hash value as a hex string of the .

Examples:

sha1

sha1(expr) - Returns a sha1 hash value as a hex string of the .

Examples:

sha2

sha2(expr, bitLength) - Returns a checksum of SHA-2 family as a hex string of . SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256.

Examples:

shiftleft

shiftleft(base, expr) - Bitwise left shift.

Examples:

shiftright

shiftright(base, expr) - Bitwise (signed) right shift.

Examples:

shiftrightunsigned

shiftrightunsigned(base, expr) - Bitwise unsigned right shift.

Examples:

sign

sign(expr) - Returns -1.0, 0.0 or 1.0 as is negative, 0 or positive.

Examples:

signum

signum(expr) - Returns -1.0, 0.0 or 1.0 as is negative, 0 or positive.

Examples:

sin

sin(expr) - Returns the sine of .

Examples:

sinh

sinh(expr) - Returns the hyperbolic sine of .

Examples:

size

size(expr) - Returns the size of an array or a map. Returns -1 if null.

Examples:

skewness

skewness(expr) - Returns the skewness value calculated from values of a group.

smallint

smallint(expr) - Casts the value to the target data type .

sort_array

sort_array(array[, ascendingOrder]) - Sorts the input array in ascending or descending order according to the natural ordering of the array elements.

Examples:

soundex

soundex(str) - Returns Soundex code of the string.

Examples:

space

space(n) - Returns a string consisting of spaces.

Examples:

spark_partition_id

spark_partition_id() - Returns the current partition id.

split

split(str, regex) - Splits around occurrences that match .

Examples:

sqrt

sqrt(expr) - Returns the square root of .

Examples:

stack

stack(n, expr1, ..., exprk) - Separates expr1, ..., exprk into n rows.

Examples:

std

std(expr) - Returns the sample standard deviation calculated from values of a group.

stddev

stddev(expr) - Returns the sample standard deviation calculated from values of a group.

stddev_pop

stddev_pop(expr) - Returns the population standard deviation calculated from values of a group.

stddev_samp

stddev_samp(expr) - Returns the sample standard deviation calculated from values of a group.

str_to_map

str_to_map(text[, pairDelim[, keyValueDelim]]) - Creates a map after splitting the text into key/value pairs using delimiters. Default delimiters are ',' for pairDelim and ':' for keyValueDelim.

Examples:
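
A hedged illustration with an arbitrary input string, passing the pair and key/value delimiters explicitly:

> SELECT str_to_map('a:1,b:2,c:3', ',', ':');
{"a":"1","b":"2","c":"3"}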

string

string(expr) - Casts the value to the target data type .

struct

struct(col1, col2, col3, ...) - Creates a struct with the given field values.

substr

substr(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

Examples:

substring

substring(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

Examples:

substring_index

substring_index(str, delim, count) - Returns the substring from before occurrences of the delimiter . If is positive, everything to the left of the final delimiter (counting from the left) is returned. If is negative, everything to the right of the final delimiter (counting from the right) is returned. The function substring_index performs a case-sensitive match when searching for .

Examples:

sum

sum(expr) - Returns the sum calculated from values of a group.

tan

tan(expr) - Returns the tangent of expr.

Examples:

tanh

tanh(expr) - Returns the hyperbolic tangent of expr.

Examples:

timestamp

timestamp(expr) - Casts the value expr to the target data type timestamp.

tinyint

tinyint(expr) - Casts the value expr to the target data type tinyint.

to_date

to_date(date_str[, fmt]) - Parses the date_str expression with the fmt expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if fmt is omitted.

Examples:

Since: 1.5.0

to_json

to_json(expr[, options]) - Returns a JSON string with a given struct value.

Examples:

Since: 2.2.0

to_timestamp

to_timestamp(timestamp[, fmt]) - Parses the timestamp expression with the fmt expression to a timestamp. Returns null with invalid input. By default, it follows casting rules to a timestamp if fmt is omitted.

Examples:

Since: 2.2.0

to_unix_timestamp

to_unix_timestamp(expr[, pattern]) - Returns the UNIX timestamp of the given time.

Examples:

Since: 1.6.0

to_utc_timestamp

to_utc_timestamp(timestamp, timezone) - Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'.

Examples:

Since: 1.5.0

translate

translate(input, from, to) - Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string.

Examples:

trim

trim(str) - Removes the leading and trailing space characters from str.

trim(BOTH trimStr FROM str) - Removes the leading and trailing trimStr characters from str.

trim(LEADING trimStr FROM str) - Removes the leading trimStr characters from str.

trim(TRAILING trimStr FROM str) - Removes the trailing trimStr characters from str.

Arguments:

  • str - a string expression
  • trimStr - the trim string characters to trim, the default value is a single space
  • BOTH, FROM - these are keywords to specify trimming string characters from both ends of the string
  • LEADING, FROM - these are keywords to specify trimming string characters from the left end of the string
  • TRAILING, FROM - these are keywords to specify trimming string characters from the right end of the string

Examples:

trunc

trunc(date, fmt) - Returns date with the time portion of the day truncated to the unit specified by the format model fmt. fmt should be one of ["year", "yyyy", "yy", "mon", "month", "mm"].

Examples:

Since: 1.5.0

ucase

ucase(str) - Returns str with all characters changed to uppercase.

Examples:

unbase64

unbase64(str) - Converts the argument from a base 64 string to a binary.

Examples:

unhex

unhex(expr) - Converts hexadecimal expr to binary.

Examples:

unix_timestamp

unix_timestamp([expr[, pattern]]) - Returns the UNIX timestamp of current or specified time.

Examples:

Since: 1.5.0

upper

upper(str) - Returns str with all characters changed to uppercase.

Examples:

uuid

uuid() - Returns a universally unique identifier (UUID) string. The value is returned as a canonical UUID 36-character string.

Examples:

var_pop

var_pop(expr) - Returns the population variance calculated from values of a group.

var_samp

var_samp(expr) - Returns the sample variance calculated from values of a group.

variance

variance(expr) - Returns the sample variance calculated from values of a group.

weekofyear

weekofyear(date) - Returns the week of the year of the given date. A week is considered to start on a Monday and week 1 is the first week with >3 days.

Examples:

Since: 1.5.0

when

CASE WHEN expr1 THEN expr2 [WHEN expr3 THEN expr4]* [ELSE expr5] END - When expr1 = true, returns expr2; else when expr3 = true, returns expr4; else returns expr5.

Arguments:

  • expr1, expr3 - the branch condition expressions should all be boolean type.
  • expr2, expr4, expr5 - the branch value expressions and else value expression should all be same type or coercible to a common type.

Examples:

window

xpath

xpath(xml, xpath) - Returns a string array of values within the nodes of xml that match the XPath expression.

Examples:

xpath_boolean

xpath_boolean(xml, xpath) - Returns true if the XPath expression evaluates to true, or if a matching node is found.

Examples:

xpath_double

xpath_double(xml, xpath) - Returns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.

Examples:

xpath_float

xpath_float(xml, xpath) - Returns a float value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.

Examples:

xpath_int

xpath_int(xml, xpath) - Returns an integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.

Examples:

xpath_long

xpath_long(xml, xpath) - Returns a long integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.

Examples:

xpath_number

xpath_number(xml, xpath) - Returns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.

Examples:

xpath_short

xpath_short(xml, xpath) - Returns a short integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.

Examples:

xpath_string

xpath_string(xml, xpath) - Returns the text contents of the first xml node that matches the XPath expression.

Examples:

year

year(date) - Returns the year component of the date/timestamp.

Examples:

Since: 1.5.0

|

expr1 | expr2 - Returns the result of bitwise OR of expr1 and expr2.

Examples:

~

~ expr - Returns the result of bitwise NOT of expr.

Examples:

Source: https://spark.apache.org/docs/2.3.0/api/sql/index.html

Like the SQL "case when" statement and the switch statement from popular programming languages, the Spark SQL DataFrame also supports similar syntax using "when otherwise", or we can use the "case when" expression. So let's see an example of how to check for multiple conditions and replicate the SQL CASE statement.

First, let's do the imports that are needed and create the Spark context and a DataFrame.

1. Using "when otherwise" on Spark DataFrame

"when" is a Spark function, so to use it we first need to import it from the Spark SQL functions package. The snippet below replaces the value of gender with a derived value; when a value does not match any condition, we assign "Unknown".

"when otherwise" can also be used in a Spark SQL select statement.
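The original code listings from the post did not survive extraction; the following PySpark sketch (column names and values are illustrative, not the post's exact data) shows the "when otherwise" pattern:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.appName("when-otherwise-demo").getOrCreate()

# A small DataFrame with a gender code column
df = spark.createDataFrame(
    [("James", "M"), ("Anna", "F"), ("Robert", "")],
    ["name", "gender"])

# Derive a readable gender value; anything unmatched falls through to "Unknown"
df2 = df.withColumn("new_gender",
    when(df.gender == "M", "Male")
    .when(df.gender == "F", "Female")
    .otherwise("Unknown"))
df2.show()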

2. Using "case when" on Spark DataFrame

Similar to SQL syntax, we could use "case when" with the expr() function.

"case when" can also be used within a select, as sketched below.
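A minimal sketch of the same derivation using expr() with SQL CASE WHEN syntax, reusing the df from the previous sketch:

from pyspark.sql.functions import col, expr

# CASE WHEN inside expr(), added as a new column
df3 = df.withColumn("new_gender",
    expr("CASE WHEN gender = 'M' THEN 'Male' "
         "WHEN gender = 'F' THEN 'Female' ELSE 'Unknown' END"))

# The same expression used within a select
df4 = df.select(col("*"),
    expr("CASE WHEN gender = 'M' THEN 'Male' "
         "WHEN gender = 'F' THEN 'Female' ELSE 'Unknown' END").alias("new_gender"))
df4.show()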

3. Using && and || operator

We can also use and (&&) and or (||) operators within the when function. To explain this, I will use a small set of data to keep it simple; see the sketch below.
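The post demonstrates this in Scala with && and ||; in PySpark the equivalent column operators are & and |. A small sketch with made-up data:

from pyspark.sql.functions import when, col

people = spark.createDataFrame(
    [(66, "a"), (67, "a"), (70, "b"), (71, "d")],
    ["id", "code"])

# Combine conditions with & (and) and | (or); parentheses are required in PySpark
labeled = people.withColumn("category",
    when((col("code") == "a") | (col("code") == "d"), "A or D")
    .when((col("code") == "b") & (col("id") > 68), "B over 68")
    .otherwise("other"))
labeled.show()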

Output:

Conclusion:

In this article, we have learned how to use the Spark "case when" syntax with the expr() function and the "when otherwise" functions on a DataFrame, and how to combine these functions with the && and || logical operators. I hope you like this article.

Happy Learning !!

Source: https://sparkbyexamples.com/spark/spark-case-when-otherwise-example/


Chapter 4. Spark SQL and DataFrames: Introduction to Built-in Data Sources

In the previous chapter, we explained the evolution of and justification for structure in Spark. In particular, we discussed how the Spark SQL engine provides a unified foundation for the high-level DataFrame and Dataset APIs. Now, we’ll continue our discussion of the DataFrame and explore its interoperability with Spark SQL.

This chapter and the next also explore how Spark SQL interfaces with some of the external components shown in Figure 4-1.

In particular, Spark SQL:

  • Provides the engine upon which the high-level Structured APIs we explored in Chapter 3 are built.

  • Can read and write data in a variety of structured formats (e.g., JSON, Hive tables, Parquet, Avro, ORC, CSV).

  • Lets you query data using JDBC/ODBC connectors from external business intelligence (BI) data sources such as Tableau, Power BI, Talend, or from RDBMSs such as MySQL and PostgreSQL.

  • Provides a programmatic interface to interact with structured data stored as tables or views in a database from a Spark application

  • Offers an interactive shell to issue SQL queries on your structured data.

  • Supports ANSI SQL:2003-compliant commands and HiveQL.

Spark SQL connectors and data sources
Figure 4-1. Spark SQL connectors and data sources

Let’s begin with how you can use Spark SQL in a Spark application.

The SparkSession, introduced in Spark 2.0, provides a unified entry point for programming Spark with the Structured APIs. You can use a SparkSession to access Spark functionality: just import the class and create an instance in your code.

To issue any SQL query, use the sql() method on the SparkSession instance, spark, such as spark.sql("SELECT * FROM myTableName"). All queries executed in this manner return a DataFrame on which you may perform further Spark operations if you desire, the kind we explored in Chapter 3 and the ones you will learn about in this chapter and the next.
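For illustration, a minimal PySpark sketch (the query here is a trivial placeholder, not from the book):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession, the unified entry point for the Structured APIs
spark = (SparkSession
    .builder
    .appName("SparkSQLExampleApp")
    .getOrCreate())

# Any SQL issued through spark.sql() returns a DataFrame
df = spark.sql("SELECT 1 AS id, 'hello' AS greeting")
df.show()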

Basic Query Examples

In this section we’ll walk through a few examples of queries on the Airline On-Time Performance and Causes of Flight Delays data set, which contains data on US flights including date, delay, distance, origin, and destination. It’s available as a CSV file with over a million records. Using a schema, we’ll read the data into a DataFrame and register the DataFrame as a temporary view (more on temporary views shortly) so we can query it with SQL.

Query examples are provided in code snippets, and Python and Scala notebooks containing all of the code presented here are available in the book’s GitHub repo. These examples will offer you a taste of how to use SQL in your Spark applications via the programmatic interface. Similar to the DataFrame API in its declarative flavor, this interface allows you to query structured data in your Spark applications.

Normally, in a standalone Spark application, you will create a SparkSession instance manually, as shown in the following example. However, in a Spark shell (or Databricks notebook), the SparkSession is created for you and accessible via the appropriately named variable spark.

Let’s get started by reading the data set into a temporary view:
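The book's listing was not preserved here; a sketch of what it could look like in PySpark, assuming the departure delays CSV path referenced later in this chapter:

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName("SparkSQLExampleApp")
    .getOrCreate())

# Path to the data set (as used in the unmanaged-table example below)
csv_file = "/databricks-datasets/learning-spark-v2/flights/departuredelays.csv"

# Read the CSV and create a temporary view we can query with SQL
df = (spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(csv_file))
df.createOrReplaceTempView("us_delay_flights_tbl")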

Note

If you want to specify a schema, you can use a DDL-formatted string. For example:
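A possible DDL string matching the five columns described below (an illustrative sketch, not necessarily the book's exact listing):

# DDL-formatted schema string; pass it to schema() before load()
schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"

df = (spark.read.format("csv")
    .schema(schema)
    .option("header", "true")
    .load(csv_file))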

Now that we have a temporary view, we can issue SQL queries using Spark SQL. These queries are no different from those you might issue against a SQL table in, say, a MySQL or PostgreSQL database. The point here is to show that Spark SQL offers an ANSI:2003–compliant SQL interface, and to demonstrate the interoperability between SQL and DataFrames.

The US flight delays data set has five columns:

  • The date column contains a string like 02190925. When converted, this maps to 02-19 09:25 am.

  • The delay column gives the delay in minutes between the scheduled and actual departure times. Early departures show negative numbers.

  • The distance column gives the distance in miles from the origin airport to the destination airport.

  • The origin column contains the origin IATA airport code.

  • The destination column contains the destination IATA airport code.

With that in mind, let’s try some example queries against this data set.

First, we’ll find all flights whose distance is greater than 1,000 miles:

spark.sql("""SELECT distance, origin, destination FROM us_delay_flights_tbl WHERE distance > 1000 ORDER BY distance DESC""").show(10) +--------+------+-----------+ |distance|origin|destination| +--------+------+-----------+ |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | +--------+------+-----------+ only showing top 10 rows

As the results show, all of the longest flights were between Honolulu (HNL) and New York (JFK). Next, we’ll find all flights between San Francisco (SFO) and Chicago (ORD) with at least a two-hour delay:

spark.sql("""SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE delay > 120 AND ORIGIN = 'SFO' AND DESTINATION = 'ORD' ORDER by delay DESC""").show(10) +--------+-----+------+-----------+ |date |delay|origin|destination| +--------+-----+------+-----------+ |02190925|1638 |SFO |ORD | |01031755|396 |SFO |ORD | |01022330|326 |SFO |ORD | |01051205|320 |SFO |ORD | |01190925|297 |SFO |ORD | |02171115|296 |SFO |ORD | |01071040|279 |SFO |ORD | |01051550|274 |SFO |ORD | |03120730|266 |SFO |ORD | |01261104|258 |SFO |ORD | +--------+-----+------+-----------+ only showing top 10 rows

It seems there were many significantly delayed flights between these two cities, on different dates. (As an exercise, convert the date column into a readable format and find the days or months when these delays were most common. Were the delays related to winter months or holidays?)

Let's try a more complicated query where we use the CASE clause in SQL. In the following example, we want to label all US flights, regardless of origin and destination, with an indication of the delays they experienced: Very Long Delays (> 6 hours), Long Delays (2–6 hours), etc. We'll add these human-readable labels in a new column called Flight_Delays:

spark.sql("""SELECT delay, origin, destination, CASE WHEN delay > 360 THEN 'Very Long Delays' WHEN delay >= 120 AND delay <= 360 THEN 'Long Delays' WHEN delay >= 60 AND delay < 120 THEN 'Short Delays' WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays' WHEN delay = 0 THEN 'No Delays' ELSE 'Early' END AS Flight_Delays FROM us_delay_flights_tbl ORDER BY origin, delay DESC""").show(10) +-----+------+-----------+-------------+ |delay|origin|destination|Flight_Delays| +-----+------+-----------+-------------+ |333 |ABE |ATL |Long Delays | |305 |ABE |ATL |Long Delays | |275 |ABE |ATL |Long Delays | |257 |ABE |ATL |Long Delays | |247 |ABE |DTW |Long Delays | |247 |ABE |ATL |Long Delays | |219 |ABE |ORD |Long Delays | |211 |ABE |ATL |Long Delays | |197 |ABE |DTW |Long Delays | |192 |ABE |ORD |Long Delays | +-----+------+-----------+-------------+ only showing top 10 rows

As with the DataFrame and Dataset APIs, with the spark.sql interface you can conduct common data analysis operations like those we explored in the previous chapter. The computations undergo an identical journey in the Spark SQL engine (see "The Catalyst Optimizer" in Chapter 3 for details), giving you the same results.

All three of the preceding SQL queries can be expressed with an equivalent DataFrame API query. For example, the first query can be expressed in the Python DataFrame API as:
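A sketch of that DataFrame API equivalent, assuming the df loaded earlier in this section:

from pyspark.sql.functions import col, desc

(df.select("distance", "origin", "destination")
   .where(col("distance") > 1000)
   .orderBy(desc("distance"))
   .show(10, truncate=False))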

This produces the same results as the SQL query:

+--------+------+-----------+ |distance|origin|destination| +--------+------+-----------+ |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | |4330 |HNL |JFK | +--------+------+-----------+ only showing top 10 rows

As an exercise, try converting the other two SQL queries to use the DataFrame API.

As these examples show, using the Spark SQL interface to query data is similar to writing a regular SQL query to a relational database table. Although the queries are in SQL, you can feel the similarity in readability and semantics to DataFrame API operations, which you encountered in Chapter 3 and will explore further in the next chapter.

To enable you to query structured data as shown in the preceding examples, Spark manages all the complexities of creating and managing views and tables, both in memory and on disk. That leads us to our next topic: how tables and views are created and managed.

Tables hold data. Associated with each table in Spark is its relevant metadata, which is information about the table and its data: the schema, description, table name, database name, column names, partitions, physical location where the actual data resides, etc. All of this is stored in a central metastore.

Instead of having a separate metastore for Spark tables, Spark by default uses the Apache Hive metastore, located at /user/hive/warehouse, to persist all the metadata about your tables. However, you may change the default location by setting the Spark config variable spark.sql.warehouse.dir to another location, which can be set to a local or external distributed storage.

Managed Versus Unmanaged Tables

Spark allows you to create two types of tables: managed and unmanaged. For a managed table, Spark manages both the metadata and the data in the file store. This could be a local filesystem, HDFS, or an object store such as Amazon S3 or Azure Blob. For an unmanaged table, Spark only manages the metadata, while you manage the data yourself in an external data source such as Cassandra.

With a managed table, because Spark manages everything, a SQL command such as DROP TABLE table_name deletes both the metadata and the data. With an unmanaged table, the same command will delete only the metadata, not the actual data. We will look at some examples of how to create managed and unmanaged tables in the next section.

Creating SQL Databases and Tables

Tables reside within a database. By default, Spark creates tables under the default database. To create your own database name, you can issue a SQL command from your Spark application or notebook. Using the US flight delays data set, let's create both a managed and an unmanaged table. To begin, we'll create a database called learn_spark_db and tell Spark we want to use that database:
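A sketch of those two SQL statements issued through spark.sql():

spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")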

From this point, any commands we issue in our application to create tables will result in the tables being created in this database and residing under the database name learn_spark_db.

Creating a managed table

To create a managed table within the database learn_spark_db, you can issue a SQL query like the following:
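For instance (a sketch; the table name is an assumption kept consistent with the rest of this section):

spark.sql("""CREATE TABLE managed_us_delay_flights_tbl
    (date STRING, delay INT, distance INT, origin STRING, destination STRING)""")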

You can do the same thing using the DataFrame API like this:
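A possible DataFrame API version (again a sketch; the CSV path and table name are assumptions):

csv_file = "/databricks-datasets/learning-spark-v2/flights/departuredelays.csv"
schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"

flights_df = spark.read.csv(csv_file, schema=schema)
flights_df.write.saveAsTable("managed_us_delay_flights_tbl")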

Both of these statements will create the managed table managed_us_delay_flights_tbl in the learn_spark_db database.

Creating an unmanaged table

By contrast, you can create unmanaged tables from your own data sources—say, Parquet, CSV, or JSON files stored in a file store accessible to your Spark application.

To create an unmanaged table from a data source such as a CSV file, in SQL use:

spark.sql("""CREATE TABLE us_delay_flights_tbl(date STRING, delay INT, distance INT, origin STRING, destination STRING) USING csv OPTIONS (PATH '/databricks-datasets/learning-spark-v2/flights/departuredelays.csv')""")

And within the DataFrame API use:

(flights_df .write .option("path", "/tmp/data/us_flights_delay") .saveAsTable("us_delay_flights_tbl"))
Note

To enable you to explore these examples, we have created Python and Scala example notebooks that you can find in the book’s GitHub repo.

Creating Views

In addition to creating tables, Spark can create views on top of existing tables. Views can be global (visible across all SparkSessions on a given cluster) or session-scoped (visible only to a single SparkSession), and they are temporary: they disappear after your Spark application terminates.

Creating views has a similar syntax to creating tables within a database. Once you create a view, you can query it as you would a table. The difference between a view and a table is that views don’t actually hold the data; tables persist after your Spark application terminates, but views disappear.

You can create a view from an existing table using SQL. For example, if you wish to work on only the subset of the US flight delays data set with origin airports of New York (JFK) and San Francisco (SFO), the following queries will create global temporary and temporary views consisting of just that slice of the table:
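A sketch of the two view-creation statements (the view names are assumptions kept consistent with the rest of this section):

spark.sql("""CREATE OR REPLACE GLOBAL TEMP VIEW us_origin_airport_SFO_global_tmp_view AS
    SELECT date, delay, origin, destination
    FROM us_delay_flights_tbl
    WHERE origin = 'SFO'""")

spark.sql("""CREATE OR REPLACE TEMP VIEW us_origin_airport_JFK_tmp_view AS
    SELECT date, delay, origin, destination
    FROM us_delay_flights_tbl
    WHERE origin = 'JFK'""")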

You can accomplish the same thing with the DataFrame API as follows:
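And a possible DataFrame API version using the same assumed view names:

df_sfo = spark.sql(
    "SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE origin = 'SFO'")
df_jfk = spark.sql(
    "SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE origin = 'JFK'")

# Register the slices as a global temporary view and a session-scoped temporary view
df_sfo.createOrReplaceGlobalTempView("us_origin_airport_SFO_global_tmp_view")
df_jfk.createOrReplaceTempView("us_origin_airport_JFK_tmp_view")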

Once you've created these views, you can issue queries against them just as you would against a table. Keep in mind that when accessing a global temporary view you must use the prefix global_temp.<view_name>, because Spark creates global temporary views in a global temporary database called global_temp. For example, SELECT * FROM global_temp.us_origin_airport_SFO_global_tmp_view.

By contrast, you can access the normal temporary view without the global_temp prefix, e.g., SELECT * FROM us_origin_airport_JFK_tmp_view.

You can also drop a view just like you would a table, using DROP VIEW IF EXISTS in SQL, or spark.catalog.dropTempView() and spark.catalog.dropGlobalTempView() from the DataFrame API.

Temporary views versus global temporary views

The difference between temporary and global temporary views is subtle, and it can be a source of mild confusion among developers new to Spark. A temporary view is tied to a single SparkSession within a Spark application. In contrast, a global temporary view is visible across multiple SparkSessions within a Spark application. Yes, you can create multiple SparkSessions within a single Spark application; this can be handy, for example, in cases where you want to access (and combine) data from two different SparkSessions that don't share the same Hive metastore configurations.

Caching SQL Tables

Although we will discuss table caching strategies in the next chapter, it's worth mentioning here that, like DataFrames, you can cache and uncache SQL tables and views. In Spark 3.0, in addition to other options, you can specify a table as LAZY, meaning that it should only be cached when it is first used instead of immediately:
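For example (a sketch of the SQL caching commands, using the table name from earlier in this chapter):

spark.sql("CACHE LAZY TABLE us_delay_flights_tbl")   # cached on first use
spark.sql("UNCACHE TABLE us_delay_flights_tbl")      # release the cache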

Reading Tables into DataFrames

Often, data engineers build data pipelines as part of their regular data ingestion and ETL processes. They populate Spark SQL databases and tables with cleansed data for consumption by applications downstream.

Let's assume you have an existing database, learn_spark_db, and table, us_delay_flights_tbl, ready for use. Instead of reading from an external JSON file, you can simply use SQL to query the table and assign the returned result to a DataFrame:
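A minimal sketch of both approaches:

us_flights_df = spark.sql("SELECT * FROM us_delay_flights_tbl")

# Or, equivalently, via the table() method on SparkSession
us_flights_df2 = spark.table("us_delay_flights_tbl")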

Now you have a cleansed DataFrame read from an existing Spark SQL table. You can also read data in other formats using Spark’s built-in data sources, giving you the flexibility to interact with various common file formats.

As shown in Figure 4-1, Spark SQL provides an interface to a variety of data sources. It also provides a set of common methods for reading and writing data to and from these data sources using the Data Sources API.

In this section we will cover some of the built-in data sources, available file formats, and ways to load and write data, along with specific options pertaining to these data sources. But first, let's take a closer look at two high-level Data Source API constructs that dictate the manner in which you interact with different data sources: DataFrameReader and DataFrameWriter.

DataFrameReader

DataFrameReader is the core construct for reading data from a data source into a DataFrame. It has a defined format and a recommended pattern for usage:

DataFrameReader.format(args).option("key", "value").schema(args).load()

This pattern of stringing methods together is common in Spark, and easy to read. We saw it in Chapter 3 when exploring common data analysis patterns.

Note that you can only access a DataFrameReader through a SparkSession instance. That is, you cannot create an instance of DataFrameReader on your own. To get an instance handle to it, use:

SparkSession.read // or SparkSession.readStream

While read returns a handle to a DataFrameReader to read into a DataFrame from a static data source, readStream returns an instance to read from a streaming source. (We will cover Structured Streaming later in the book.)

Arguments to each of the public methods to DataFrameReader take different values. Table 4-1 enumerates these, with a subset of the supported arguments.

Table 4-1. DataFrameReader methods, arguments, and options

  • format() takes arguments such as "parquet", "csv", "txt", "json", "jdbc", "orc", "avro", etc. If you don't specify this method, then the default is Parquet or whatever is set in spark.sql.sources.default.

  • option() takes a series of key/value pairs and options, such as ("mode", {PERMISSIVE | FAILFAST | DROPMALFORMED}), ("inferSchema", {true | false}), or ("path", "path_to_file_data_source"). The Spark documentation shows some examples and explains the different modes and their actions. The default mode is PERMISSIVE. The inferSchema and mode options are specific to the JSON and CSV file formats.

  • schema() takes a DDL string or a StructType, e.g., 'A INT, B STRING' or StructType(...). For JSON or CSV format, you can instead specify to infer the schema in the option() method. Generally, providing a schema for any format makes loading faster and ensures your data conforms to the expected schema.

  • load() takes the path to the data source. This can be empty if the path is specified in option("path", ...).

While we won’t comprehensively enumerate all the different combinations of arguments and options, the documentation for Python, Scala, R, and Java offers suggestions and guidance. It’s worthwhile to show a couple of examples, though:
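A couple of illustrative sketches (the paths are placeholders):

# Parquet: the default format, so format() could be omitted
df1 = spark.read.format("parquet").load("/tmp/data/parquet/")

# CSV with a schema and a few options
df2 = (spark.read.format("csv")
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .schema("date STRING, delay INT, distance INT, origin STRING, destination STRING")
    .load("/tmp/data/csv/*"))

# JSON
df3 = spark.read.format("json").load("/tmp/data/json/")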

Note

In general, no schema is needed when reading from a static Parquet data source—the Parquet metadata usually contains the schema, so it’s inferred. However, for streaming data sources you will have to provide a schema. (We will cover reading from streaming data sources in Chapter 8.)

Parquet is the default and preferred data source for Spark because it’s efficient, uses columnar storage, and employs a fast compression algorithm. You will see additional benefits later (such as columnar pushdown), when we cover the Catalyst optimizer in greater depth.

DataFrameWriter

DataFrameWriter does the reverse of its counterpart: it saves or writes data to a specified built-in data source. Unlike with DataFrameReader, you access its instance not from a SparkSession but from the DataFrame you wish to save. It has a few recommended usage patterns:

DataFrameWriter.format(args) .option(args) .bucketBy(args) .partitionBy(args) .save(path) DataFrameWriter.format(args).option(args).sortBy(args).saveAsTable(table)

To get an instance handle, use:

DataFrame.write // or DataFrame.writeStream

Arguments to each of the methods to DataFrameWriter also take different values. We list these in Table 4-2, with a subset of the supported arguments.

Table 4-2. DataFrameWriter methods, arguments, and options

  • format() takes arguments such as "parquet", "csv", "txt", "json", "jdbc", "orc", "avro", etc. If you don't specify this method, then the default is Parquet or whatever is set in spark.sql.sources.default.

  • option() takes a series of key/value pairs and options, such as ("mode", {append | overwrite | ignore | error or errorifexists}) or ("path", "path_to_write_to"). The Spark documentation shows some examples. This is an overloaded method. The default mode options are error or errorifexists; they throw an exception at runtime if the data already exists.

  • bucketBy() takes the number of buckets and names of columns to bucket by. Uses Hive's bucketing scheme on a filesystem.

  • save() takes the path to save to. This can be empty if the path is specified in option("path", ...).

  • saveAsTable() takes the table to save to.

Here’s a short example snippet to illustrate the use of methods and arguments:
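A sketch of such a snippet (the DataFrame df and the output path are placeholders):

# Write the DataFrame as JSON, replacing any existing data at the path
(df.write.format("json")
   .mode("overwrite")
   .save("/tmp/data/json/df_json"))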

Parquet

We’ll start our exploration of data sources with Parquet, because it’s the default data source in Spark. Supported and widely used by many big data processing frameworks and platforms, Parquet is an open source columnar file format that offers many I/O optimizations (such as compression, which saves storage space and allows for quick access to data columns).

Because of its efficiency and these optimizations, we recommend that after you have transformed and cleansed your data, you save your DataFrames in the Parquet format for downstream consumption. (Parquet is also the default table open format for Delta Lake, which we will cover in Chapter 9.)

Reading Parquet files into a DataFrame

Parquet files are stored in a directory structure that contains the data files, metadata, a number of compressed files, and some status files. Metadata in the footer contains the version of the file format, the schema, and column data such as the path, etc.

For example, a directory in a Parquet file might contain a set of files like this:

_SUCCESS _committed_1799640464332036264 _started_1799640464332036264 part-00000-tid-1799640464332036264-91273258-d7ef-4dc7-<...>-c000.snappy.parquet

There may be a number of part-XXXX compressed files in a directory (the names shown here have been shortened to fit on the page).

To read Parquet files into a DataFrame, you simply specify the format and path:
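A sketch (the path is a placeholder):

file = "/tmp/data/parquet/2010-summary.parquet"
df = spark.read.format("parquet").load(file)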

Unless you are reading from a streaming data source there’s no need to supply the schema, because Parquet saves it as part of its metadata.

Reading Parquet files into a Spark SQL table

As well as reading Parquet files into a Spark DataFrame, you can also create a Spark SQL unmanaged table or view directly using SQL:

Once you’ve created the table or view, you can read data into a DataFrame using SQL, as we saw in some earlier examples:

Both of these operations return the same results:

+-----------------+-------------------+-----+ |DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count| +-----------------+-------------------+-----+ |United States |Romania |1 | |United States |Ireland |264 | |United States |India |69 | |Egypt |United States |24 | |Equatorial Guinea|United States |1 | |United States |Singapore |25 | |United States |Grenada |54 | |Costa Rica |United States |477 | |Senegal |United States |29 | |United States |Marshall Islands |44 | +-----------------+-------------------+-----+ only showing top 10 rows

Writing DataFrames to Parquet files

Writing or saving a DataFrame as a table or file is a common operation in Spark. To write a DataFrame you simply use the methods and arguments to the DataFrameWriter outlined earlier in this chapter, supplying the location to save the Parquet files to. For example:
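For example (a sketch; the output path is a placeholder):

(df.write.format("parquet")
   .mode("overwrite")
   .option("compression", "snappy")
   .save("/tmp/data/parquet/df_parquet"))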

Note

Recall that Parquet is the default file format. If you don't include the format() method, the DataFrame will still be saved as a Parquet file.

This will create a set of compact and compressed Parquet files at the specified path. Since we used snappy as our compression choice here, we’ll have snappy compressed files. For brevity, this example generated only one file; normally, there may be a dozen or so files created:

-rw-r--r-- 1 jules wheel 0 May 19 10:58 _SUCCESS -rw-r--r-- 1 jules wheel 966 May 19 10:58 part-00000-<...>-c000.snappy.parquet

Writing DataFrames to Spark SQL tables

Writing a DataFrame to a SQL table is as easy as writing to a file—just use saveAsTable() instead of save(). This will create a managed table called us_delay_flights_tbl:

To sum up, Parquet is the preferred and default built-in data source file format in Spark, and it has been adopted by many other frameworks. We recommend that you use this format in your ETL and data ingestion processes.

JSON

JavaScript Object Notation (JSON) is also a popular data format. It came to prominence as an easy-to-read and easy-to-parse format compared to XML. It has two representational formats: single-line mode and multiline mode. Both modes are supported in Spark.

In single-line mode each line denotes a single JSON object, whereas in multiline mode the entire multiline object constitutes a single JSON object. To read in this mode, set multiLine to true in the option() method.

Reading a JSON file into a DataFrame

You can read a JSON file into a DataFrame the same way you did with Parquet—just specify "json" in the format() method:
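A sketch (the path is a placeholder):

file = "/tmp/data/json/flight-summary.json"
df = spark.read.format("json").load(file)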

Reading a JSON file into a Spark SQL table

You can also create a SQL table from a JSON file just like you did with Parquet:

Once the table is created, you can read data into a DataFrame using SQL:

Writing DataFrames to JSON files

Saving a DataFrame as a JSON file is simple. Specify the appropriate methods and arguments, and supply the location to save the JSON files to:

This creates a directory at the specified path populated with a set of compact JSON files:

-rw-r--r-- 1 jules wheel 0 May 16 14:44 _SUCCESS -rw-r--r-- 1 jules wheel 71 May 16 14:44 part-00000-<...>-c000.json

JSON data source options

Table 4-3 describes common JSON options for DataFrameReader and DataFrameWriter. For a comprehensive list, we refer you to the documentation.

Table 4-3. JSON options for DataFrameReader and DataFrameWriter

  • compression (write): none, uncompressed, bzip2, deflate, gzip, lz4, or snappy. Use this compression codec for writing. Note that read will only detect the compression or codec from the file extension.

  • dateFormat (read/write): yyyy-MM-dd or a DateTimeFormatter pattern. Use this format or any format from Java's DateTimeFormatter.

  • multiLine (read): true or false. Use multiline mode. Default is false (single-line mode).

  • allowUnquotedFieldNames (read): true or false. Allow unquoted JSON field names. Default is false.

CSV

As widely used as plain text files, this common text file format captures each datum or field delimited by a comma; each line with comma-separated fields represents a record. Even though a comma is the default separator, you may use other delimiters to separate fields in cases where commas are part of your data. Popular spreadsheets can generate CSV files, so it’s a popular format among data and business analysts.

Reading a CSV file into a DataFrame

As with the other built-in data sources, you can use the DataFrameReader methods and arguments to read a CSV file into a DataFrame:
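A sketch, using the column names seen in the earlier flight-summary output (the path and options are illustrative):

schema = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"
df = (spark.read.format("csv")
    .schema(schema)
    .option("header", "true")
    .option("mode", "FAILFAST")    # fail immediately on malformed records
    .option("nullValue", "")       # treat empty strings as null
    .load("/tmp/data/csv/flight-summary.csv"))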

Reading a CSV file into a Spark SQL table

Creating a SQL table from a CSV data source is no different from using Parquet or JSON:

Once you’ve created the table, you can read data into a DataFrame using SQL as before:

Writing DataFrames to CSV files

Saving a DataFrame as a CSV file is simple. Specify the appropriate methods and arguments, and supply the location to save the CSV files to:

This generates a folder at the specified location, populated with a bunch of compressed and compact files:

-rw-r--r-- 1 jules wheel 0 May 16 12:17 _SUCCESS -rw-r--r-- 1 jules wheel 36 May 16 12:17 part-00000-251690eb-<...>-c000.csv

CSV data source options

Table 4-4 describes some of the common CSV options for DataFrameReader and DataFrameWriter. Because CSV files can be complex, many options are available; for a comprehensive list we refer you to the documentation.

Table 4-4. CSV options for DataFrameReader and DataFrameWriter

  • compression (write): none, bzip2, deflate, gzip, lz4, or snappy. Use this compression codec for writing.

  • dateFormat (read/write): yyyy-MM-dd or a DateTimeFormatter pattern. Use this format or any format from Java's DateTimeFormatter.

  • multiLine (read): true or false. Use multiline mode. Default is false (single-line mode).

  • inferSchema (read): true or false. If true, Spark will determine the column data types. Default is false.

  • sep (read/write): any character. Use this character to separate column values in a row. Default delimiter is a comma (,).

  • escape (read/write): any character. Use this character to escape quotes. Default is \.

  • header (read/write): true or false. Indicates whether the first line is a header denoting each column name. Default is false.

Avro

Introduced in Spark 2.4 as a built-in data source, the Avro format is used, for example, by Apache Kafka for message serializing and deserializing. It offers many benefits, including direct mapping to JSON, speed and efficiency, and bindings available for many programming languages.

Reading an Avro file into a DataFrame

Reading an Avro file into a DataFrame using DataFrameReader is consistent in usage with the other data sources we have discussed in this section:

Reading an Avro file into a Spark SQL table

Again, creating SQL tables using an Avro data source is no different from using Parquet, JSON, or CSV:

Once you’ve created a table, you can read data into a DataFrame using SQL:

Writing DataFrames to Avro files

Writing a DataFrame as an Avro file is simple. As usual, specify the appropriate methods and arguments, and supply the location to save the Avro files to:

This generates a folder at the specified location, populated with a bunch of compressed and compact files:

-rw-r--r-- 1 jules wheel 0 May 17 11:54 _SUCCESS -rw-r--r-- 1 jules wheel 526 May 17 11:54 part-00000-ffdf70f4-<...>-c000.avro

Avro data source options

Table 4-5 describes common Avro options for DataFrameReader and DataFrameWriter. A comprehensive list of options is in the documentation.

Table 4-5. Avro options for DataFrameReader and DataFrameWriter

  • avroSchema (read/write), default none: optional Avro schema provided by a user in JSON format. The data type and naming of record fields should match the input Avro data or Catalyst data (Spark internal data type), otherwise the read/write action will fail.

  • recordName (write), default topLevelRecord: top-level record name in write result, which is required in the Avro spec.

  • recordNamespace (write), default "": record namespace in write result.

  • ignoreExtension (read), default true: if this option is enabled, all files (with and without the .avro extension) are loaded. Otherwise, files without the .avro extension are ignored.

  • compression (write), default snappy: allows you to specify the compression codec to use in writing. Currently supported codecs are uncompressed, snappy, deflate, bzip2, and xz. If this option is not set, the value in spark.sql.avro.compression.codec is taken into account.

ORC

As an additional optimized columnar file format, Spark 2.x supports a vectorized ORC reader. Two Spark configurations dictate which ORC implementation to use. When spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true, Spark uses the vectorized ORC reader. A vectorized reader reads blocks of rows (often 1,024 per block) instead of one row at a time, streamlining operations and reducing CPU usage for intensive operations like scans, filters, aggregations, and joins.

For Hive ORC SerDe (serialization and deserialization) tables created with the SQL command USING HIVE OPTIONS (fileFormat 'ORC'), the vectorized reader is used when the Spark configuration parameter spark.sql.hive.convertMetastoreOrc is set to true.

Reading an ORC file into a DataFrame

To read in a DataFrame using the ORC vectorized reader, you can just use the normal DataFrameReader methods and options:

Reading an ORC file into a Spark SQL table

There is no difference from Parquet, JSON, CSV, or Avro when creating a SQL view using an ORC data source:

Once a table is created, you can read data into a DataFrame using SQL as usual:

Writing DataFrames to ORC files

Writing back a transformed DataFrame after reading is equally simple using the methods:

The result will be a folder at the specified location containing some compressed ORC files:

-rw-r--r-- 1 jules wheel 0 May 16 17:23 _SUCCESS -rw-r--r-- 1 jules wheel 547 May 16 17:23 part-00000-<...>-c000.snappy.orc

Images

In Spark 2.4 the community introduced a new data source, image files, to support deep learning and machine learning frameworks such as TensorFlow and PyTorch. For computer vision–based machine learning applications, loading and processing image data sets is important.

Reading an image file into a DataFrame

As with all of the previous file formats, you can use the DataFrameReader methods and options to read in an image file as shown here:

Binary Files

Spark 3.0 adds support for binary files as a data source. The binary file data source converts each binary file into a single DataFrame row (record) that contains the raw content and metadata of the file. The binary file data source produces a DataFrame with the following columns:

  • path: StringType

  • modificationTime: TimestampType

  • length: LongType

  • content: BinaryType

Reading a binary file into a DataFrame

To read binary files, specify the data source format as binaryFile. You can load files with paths matching a given glob pattern while preserving the behavior of partition discovery with the data source option pathGlobFilter. For example, the following code reads all JPG files from the input directory, including any partitioned directories:

To ignore partition data discovery in a directory, you can set recursiveFileLookup to "true":
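A sketch of both reads (the directory is a placeholder):

path = "cctvVideos/train_images/"

# Read all JPG files, keeping partition discovery (e.g., a label= directory column)
binary_files_df = (spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.jpg")
    .load(path))

# Read recursively and ignore partition discovery; no partition column is produced
binary_files_df2 = (spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.jpg")
    .option("recursiveFileLookup", "true")
    .load(path))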

Note that the partition column is absent when the recursiveFileLookup option is set to "true".

Currently, the binary file data source does not support writing a DataFrame back to the original file format.

In this section, you got a tour of how to read data into a DataFrame from a range of supported file formats. We also showed you how to create temporary views and tables from the existing built-in data sources. Whether you’re using the DataFrame API or SQL, the queries produce identical outcomes. You can examine some of these queries in the notebook available in the GitHub repo for this book.

To recap, this chapter explored the interoperability between the DataFrame API and Spark SQL. In particular, you got a flavor of how to use Spark SQL to:

  • Create managed and unmanaged tables using Spark SQL and the DataFrame API.

  • Read from and write to various built-in data sources and file formats.

  • Employ the programmatic interface to issue SQL queries on structured data stored as Spark SQL tables or views.

  • Peruse the Spark Catalog to inspect metadata associated with tables and views.

  • Use the DataFrameReader and DataFrameWriter APIs.

Through the code snippets in the chapter and the notebooks available in the book’s GitHub repo, you got a feel for how to use DataFrames and Spark SQL. Continuing in this vein, the next chapter further explores how Spark interacts with the external data sources shown in Figure 4-1. You’ll see some more in-depth examples of transformations and the interoperability between the DataFrame API and Spark SQL.

Source: https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch04.html

