
Functions (Apache Spark 2.x)
This article lists the built-in functions in Apache Spark SQL.
%
expr1 % expr2 - Returns the remainder after expr1/expr2.
Examples:
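For instance (an illustrative query):
> SELECT 2 % 1.8;
 0.2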
&
expr1 & expr2 - Returns the result of bitwise AND of expr1 and expr2.
Examples:
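An illustrative query:
> SELECT 3 & 5;
 1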
*
expr1 * expr2 - Returns expr1*expr2.
Examples:
+
expr1 + expr2 - Returns expr1+expr2.
Examples:
-
expr1 - expr2 - Returns expr1-expr2.
Examples:
/
expr1 / expr2 - Returns expr1/expr2. It always performs floating point division.
Examples:
<
expr1 < expr2 - Returns true if expr1 is less than expr2.
Arguments:
- expr1, expr2 - the two expressions must be of the same type, or be castable to a common type, and the type must be orderable. For example, map type is not orderable, so it is not supported. For complex types such as array/struct, the data types of fields must be orderable.
Examples:
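For instance (illustrative):
> SELECT 1 < 2;
 true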
<=
expr1 <= expr2 - Returns true if expr1 is less than or equal to expr2.
Arguments:
- expr1, expr2 - the two expressions must be of the same type, or be castable to a common type, and the type must be orderable. For example, map type is not orderable, so it is not supported. For complex types such as array/struct, the data types of fields must be orderable.
Examples:
<=>
expr1 <=> expr2 - Returns the same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of them is null.
Arguments:
- expr1, expr2 - the two expressions must be of the same type, or be castable to a common type, and the type must be usable in equality comparison. Map type is not supported. For complex types such as array/struct, the data types of fields must be orderable.
Examples:
=
expr1 = expr2 - Returns true if expr1 equals expr2, or false otherwise.
Arguments:
- expr1, expr2 - the two expressions must be of the same type, or be castable to a common type, and the type must be usable in equality comparison. Map type is not supported. For complex types such as array/struct, the data types of fields must be orderable.
Examples:
==
expr1 == expr2 - Returns true if expr1 equals expr2, or false otherwise.
Arguments:
- expr1, expr2 - the two expressions must be of the same type, or be castable to a common type, and the type must be usable in equality comparison. Map type is not supported. For complex types such as array/struct, the data types of fields must be orderable.
Examples:
>
expr1 > expr2 - Returns true if expr1 is greater than expr2.
Arguments:
- expr1, expr2 - the two expressions must be of the same type, or be castable to a common type, and the type must be orderable. For example, map type is not orderable, so it is not supported. For complex types such as array/struct, the data types of fields must be orderable.
Examples:
>=
expr1 >= expr2 - Returns true if expr1 is greater than or equal to expr2.
Arguments:
- expr1, expr2 - the two expressions must be of the same type, or be castable to a common type, and the type must be orderable. For example, map type is not orderable, so it is not supported. For complex types such as array/struct, the data types of fields must be orderable.
Examples:
^
expr1 ^ expr2 - Returns the result of bitwise exclusive OR of expr1 and expr2.
Examples:
abs
abs(expr) - Returns the absolute value of the numeric value.
Examples:
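For example (illustrative):
> SELECT abs(-1);
 1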
acos
acos(expr) - Returns the inverse cosine (arc cosine) of expr, as if computed by java.lang.Math.acos.
Examples:
add_months
add_months(start_date, num_months) - Returns the date that is num_months after start_date.
Examples:
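An illustrative query:
> SELECT add_months('2016-08-31', 1);
 2016-09-30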
Since: 1.5.0
aggregate
aggregate(expr, start, merge, finish) - Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.
Examples:
Since: 2.4.0
and
expr1 and expr2 - Logical AND.
approx_count_distinct
approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++. relativeSD defines the maximum estimation error allowed.
approx_percentile
approx_percentile(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation. When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.
Examples:
array
array(expr, …) - Returns an array with the given elements.
Examples:
array_contains
array_contains(array, value) - Returns true if the array contains the value.
Examples:
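For instance (illustrative):
> SELECT array_contains(array(1, 2, 3), 2);
 true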
array_distinct
array_distinct(array) - Removes duplicate values from the array.
Examples:
Since: 2.4.0
array_except
array_except(array1, array2) - Returns an array of the elements in array1 but not in array2, without duplicates.
Examples:
Since: 2.4.0
array_intersect
array_intersect(array1, array2) - Returns an array of the elements in the intersection of array1 and array2, without duplicates.
Examples:
Since: 2.4.0
array_join
array_join(array, delimiter[, nullReplacement]) - Concatenates the elements of the given array using the delimiter and an optional string to replace nulls. If no value is set for nullReplacement, any null value is filtered.
Examples:
Since: 2.4.0
array_max
array_max(array) - Returns the maximum value in the array. NULL elements are skipped.
Examples:
Since: 2.4.0
array_min
array_min(array) - Returns the minimum value in the array. NULL elements are skipped.
Examples:
Since: 2.4.0
array_position
array_position(array, element) - Returns the (1-based) index of the first occurrence of element in the array, as a long.
Examples:
Since: 2.4.0
array_remove
array_remove(array, element) - Removes all elements equal to element from array.
Examples:
Since: 2.4.0
array_repeat
array_repeat(element, count) - Returns the array containing element count times.
Examples:
Since: 2.4.0
array_sort
array_sort(array) - Sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.
Examples:
Since: 2.4.0
array_union
array_union(array1, array2) - Returns an array of the elements in the union of array1 and array2, without duplicates.
Examples:
Since: 2.4.0
arrays_overlap
arrays_overlap(a1, a2) - Returns true if a1 contains at least one non-null element that is also present in a2. If the arrays have no common element, both are non-empty, and either of them contains a null element, null is returned; otherwise, false.
Examples:
Since: 2.4.0
arrays_zip
arrays_zip(a1, a2, …) - Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
Examples:
Since: 2.4.0
ascii
ascii(str) - Returns the numeric value of the first character of str.
Examples:
asin
asin(expr) - Returns the inverse sine (arc sine) of expr, as if computed by java.lang.Math.asin.
Examples:
assert_true
assert_true(expr) - Throws an exception if expr is not true.
Examples:
atan
atan(expr) - Returns the inverse tangent (arc tangent) of expr, as if computed by java.lang.Math.atan.
Examples:
atan2
atan2(exprY, exprX) - Returns the angle in radians between the positive x-axis of a plane and the point given by the coordinates (exprX, exprY), as if computed by java.lang.Math.atan2.
Arguments:
- exprY - coordinate on y-axis
- exprX - coordinate on x-axis
Examples:
avg
avg(expr) - Returns the mean calculated from values of a group.
base64
base64(bin) - Converts the argument from a binary to a base 64 string.
Examples:
bigint
bigint(expr) - Casts the value expr to the target data type bigint.
bin
bin(expr) - Returns the string representation of the long value expr represented in binary.
Examples:
binary
binary(expr) - Casts the value expr to the target data type binary.
bit_length
bit_length(expr) - Returns the bit length of string data or number of bits of binary data.
Examples:
boolean
boolean(expr) - Casts the value expr to the target data type boolean.
bround
bround(expr, d) - Returns expr rounded to d decimal places using HALF_EVEN rounding mode.
Examples:
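A sketch of the HALF_EVEN behavior (illustrative):
> SELECT bround(2.5, 0);
 2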
cardinality
cardinality(expr) - Returns the size of an array or a map. The function returns -1 if its input is null and spark.sql.legacy.sizeOfNull is set to true. If spark.sql.legacy.sizeOfNull is set to false, the function returns null for null input. By default, the spark.sql.legacy.sizeOfNull parameter is set to true.
Examples:
cast
cast(expr AS type) - Casts the value expr to the target data type type.
Examples:
cbrt
cbrt(expr) - Returns the cube root of expr.
Examples:
ceil
ceil(expr) - Returns the smallest integer not smaller than expr.
Examples:
ceiling
ceiling(expr) - Returns the smallest integer not smaller than expr.
Examples:
char
char(expr) - Returns the ASCII character having the binary equivalent to expr. If n is larger than 256, the result is equivalent to chr(n % 256).
Examples:
char_length
char_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
character_length
character_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
chr
chr(expr) - Returns the ASCII character having the binary equivalent to expr. If n is larger than 256, the result is equivalent to chr(n % 256).
Examples:
coalesce
coalesce(expr1, expr2, …) - Returns the first non-null argument if exists. Otherwise, null.
Examples:
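For instance (illustrative):
> SELECT coalesce(NULL, 1, NULL);
 1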
collect_list
collect_list(expr) - Collects and returns a list of non-unique elements.
collect_set
collect_set(expr) - Collects and returns a set of unique elements.
concat
concat(col1, col2, …, colN) - Returns the concatenation of col1, col2, …, colN.
Examples:
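An illustrative query:
> SELECT concat('Spark', 'SQL');
 SparkSQL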
concat_ws
concat_ws(sep, [str | array(str)]+) - Returns the concatenation of the strings separated by sep.
Examples:
conv
conv(num, from_base, to_base) - Converts num from from_base to to_base.
Examples:
corr
corr(expr1, expr2) - Returns Pearson coefficient of correlation between a set of number pairs.
cos
cos(expr) - Returns the cosine of expr, as if computed by java.lang.Math.cos.
Arguments:
Examples:
cosh
cosh(expr) - Returns the hyperbolic cosine of expr, as if computed by java.lang.Math.cosh.
Arguments:
Examples:
cot
cot(expr) - Returns the cotangent of expr, as if computed by 1/java.lang.Math.tan.
Arguments:
Examples:
count
count(*) - Returns the total number of retrieved rows, including rows containing null.
count(expr[, expr…]) - Returns the number of rows for which the supplied expression(s) are all non-null.
count(DISTINCT expr[, expr…]) - Returns the number of rows for which the supplied expression(s) are unique and non-null.
count_min_sketch
count_min_sketch(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
covar_pop
covar_pop(expr1, expr2) - Returns the population covariance of a set of number pairs.
covar_samp
covar_samp(expr1, expr2) - Returns the sample covariance of a set of number pairs.
crc32
crc32(expr) - Returns a cyclic redundancy check value of expr as a bigint.
Examples:
cube
cume_dist
cume_dist() - Computes the position of a value relative to all values in the partition.
current_database
current_database() - Returns the current database.
Examples:
current_date
current_date() - Returns the current date at the start of query evaluation.
Since: 1.5.0
current_timestamp
current_timestamp() - Returns the current timestamp at the start of query evaluation.
Since: 1.5.0
date
date(expr) - Casts the value expr to the target data type date.
date_add
date_add(start_date, num_days) - Returns the date that is num_days after start_date.
Examples:
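For instance (illustrative):
> SELECT date_add('2016-07-30', 1);
 2016-07-31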
Since: 1.5.0
date_format
date_format(timestamp, fmt) - Converts timestamp to a value of string in the format specified by the date format fmt.
Examples:
Since: 1.5.0
date_sub
date_sub(start_date, num_days) - Returns the date that is num_days before start_date.
Examples:
Since: 1.5.0
date_trunc
date_trunc(fmt, ts) - Returns timestamp ts truncated to the unit specified by the format model fmt. fmt should be one of [“YEAR”, “YYYY”, “YY”, “MON”, “MONTH”, “MM”, “DAY”, “DD”, “HOUR”, “MINUTE”, “SECOND”, “WEEK”, “QUARTER”]
Examples:
Since: 2.3.0
datediff
datediff(endDate, startDate) - Returns the number of days from startDate to endDate.
Examples:
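An illustrative query:
> SELECT datediff('2009-07-31', '2009-07-30');
 1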
Since: 1.5.0
day
day(date) - Returns the day of month of the date/timestamp.
Examples:
Since: 1.5.0
dayofmonth
dayofmonth(date) - Returns the day of month of the date/timestamp.
Examples:
Since: 1.5.0
dayofweek
dayofweek(date) - Returns the day of the week for date/timestamp (1 = Sunday, 2 = Monday, …, 7 = Saturday).
Examples:
Since: 2.3.0
dayofyear
dayofyear(date) - Returns the day of year of the date/timestamp.
Examples:
Since: 1.5.0
decimal
decimal(expr) - Casts the value expr to the target data type decimal.
decode
decode(bin, charset) - Decodes the first argument using the second argument character set.
Examples:
degrees
degrees(expr) - Converts radians to degrees.
Arguments:
Examples:
dense_rank
dense_rank() - Computes the rank of a value in a group of values. The result is one plus the previously assigned rank value. Unlike the function rank, dense_rank will not produce gaps in the ranking sequence.
double
double(expr) - Casts the value expr to the target data type double.
e
e() - Returns Euler’s number, e.
Examples:
element_at
element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.
element_at(map, key) - Returns value for given key, or NULL if the key is not contained in the map
Examples:
Since: 2.4.0
elt
elt(n, input1, input2, …) - Returns the n-th input, e.g., returns input2 when n is 2.
Examples:
encode
encode(str, charset) - Encodes the first argument using the second argument character set.
Examples:
exists
exists(expr, pred) - Tests whether a predicate holds for one or more elements in the array.
Examples:
Since: 2.4.0
exp
exp(expr) - Returns e to the power of expr.
Examples:
explode
explode(expr) - Separates the elements of array into multiple rows, or the elements of map into multiple rows and columns.
Examples:
explode_outer
explode_outer(expr) - Separates the elements of array into multiple rows, or the elements of map into multiple rows and columns.
Examples:
expm1
expm1(expr) - Returns exp(expr) - 1.
Examples:
factorial
factorial(expr) - Returns the factorial of expr. expr is [0..20]. Otherwise, null.
Examples:
filter
filter(expr, func) - Filters the input array using the given predicate.
Examples:
Since: 2.4.0
find_in_set
find_in_set(str, str_array) - Returns the index (1-based) of the given string (str) in the comma-delimited list (str_array). Returns 0, if the string was not found or if the given string (str) contains a comma.
Examples:
first
first(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.
first_value
first_value(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.
flatten
flatten(arrayOfArrays) - Transforms an array of arrays into a single array.
Examples:
Since: 2.4.0
float
float(expr) - Casts the value expr to the target data type float.
floor
floor(expr) - Returns the largest integer not greater than expr.
Examples:
format_number
format_number(expr1, expr2) - Formats the number expr1 like ‘#,###,###.##’, rounded to expr2 decimal places. If expr2 is 0, the result has no decimal point or fractional part. expr2 also accepts a user-specified format. This is supposed to function like MySQL’s FORMAT.
Examples:
format_string
format_string(strfmt, obj, …) - Returns a formatted string from printf-style format strings.
Examples:
from_json
from_json(jsonStr, schema[, options]) - Returns a struct value with the given jsonStr and schema.
Examples:
Since: 2.2.0
from_unixtime
from_unixtime(unix_time, format) - Returns unix_time in the specified format.
Examples:
Since: 1.5.0
from_utc_timestamp
from_utc_timestamp(timestamp, timezone) - Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, ‘GMT+1’ would yield ‘2017-07-14 03:40:00.0’.
Examples:
Since: 1.5.0
get_json_object
get_json_object(json_txt, path) - Extracts a json object from path.
Examples:
greatest
greatest(expr, …) - Returns the greatest value of all parameters, skipping null values.
Examples:
grouping
grouping_id
hash
hash(expr1, expr2, …) - Returns a hash value of the arguments.
Examples:
hex
hex(expr) - Converts expr to hexadecimal.
Examples:
hour
hour(timestamp) - Returns the hour component of the string/timestamp.
Examples:
Since: 1.5.0
hypot
hypot(expr1, expr2) - Returns sqrt(expr1² + expr2²).
Examples:
if
if(expr1, expr2, expr3) - If expr1 evaluates to true, then returns expr2; otherwise returns expr3.
Examples:
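For example (illustrative):
> SELECT if(1 < 2, 'a', 'b');
 a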
ifnull
ifnull(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.
Examples:
in
expr1 in(expr2, expr3, …) - Returns true if expr1 equals any of expr2, expr3, ….
Arguments:
- expr1, expr2, expr3, … - the arguments must be of the same type.
Examples:
initcap
initcap(str) - Returns str with the first letter of each word in uppercase. All other letters are in lowercase. Words are delimited by white space.
Examples:
inline
inline(expr) - Explodes an array of structs into a table.
Examples:
inline_outer
inline_outer(expr) - Explodes an array of structs into a table.
Examples:
input_file_block_length
input_file_block_length() - Returns the length of the block being read, or -1 if not available.
input_file_block_start
input_file_block_start() - Returns the start offset of the block being read, or -1 if not available.
input_file_name
input_file_name() - Returns the name of the file being read, or empty string if not available.
instr
instr(str, substr) - Returns the (1-based) index of the first occurrence of substr in str.
Examples:
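An illustrative query:
> SELECT instr('SparkSQL', 'SQL');
 6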
int
int(expr) - Casts the value expr to the target data type int.
isnan
isnan(expr) - Returns true if expr is NaN, or false otherwise.
Examples:
isnotnull
isnotnull(expr) - Returns true if expr is not null, or false otherwise.
Examples:
isnull
isnull(expr) - Returns true if expr is null, or false otherwise.
Examples:
java_method
java_method(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Examples:
json_tuple
json_tuple(jsonStr, p1, p2, …, pn) - Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.
Examples:
kurtosis
kurtosis(expr) - Returns the kurtosis value calculated from values of a group.
lag
lag(input[, offset[, default]]) - Returns the value of input at the offsetth row before the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offsetth row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the first row of the window does not have any previous row), default is returned.
last
last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.
last_day
last_day(date) - Returns the last day of the month which the date belongs to.
Examples:
Since: 1.5.0
last_value
last_value(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.
lcase
lcase(str) - Returns str with all characters changed to lowercase.
Examples:
lead
lead(input[, offset[, default]]) - Returns the value of input at the offsetth row after the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offsetth row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the last row of the window does not have any subsequent row), default is returned.
least
least(expr, …) - Returns the least value of all parameters, skipping null values.
Examples:
left
left(str, len) - Returns the leftmost len (len can be string type) characters from the string str. If len is less than or equal to 0, the result is an empty string.
Examples:
length
length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
levenshtein
levenshtein(str1, str2) - Returns the Levenshtein distance between the two given strings.
Examples:
like
str like pattern - Returns true if str matches pattern, null if any arguments are null, false otherwise.
Arguments:
str - a string expression
pattern - a string expression. The pattern is a string which is matched literally, with exception to the following special symbols:
_ matches any one character in the input (similar to . in posix regular expressions)
% matches zero or more characters in the input (similar to .* in posix regular expressions)
The escape character is ‘\’. If an escape character precedes a special symbol or another escape character, the following character is matched literally. It is invalid to escape any other character.
Since Spark 2.0, string literals are unescaped in our SQL parser. For example, in order to match “\abc”, the pattern should be “\\abc”.
When SQL config ‘spark.sql.parser.escapedStringLiterals’ is enabled, it falls back to Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the pattern to match “\abc” should be “\abc”.
Examples:
Note:
Use RLIKE to match with standard regular expressions.
ln
ln(expr) - Returns the natural logarithm (base e) of expr.
Examples:
locate
locate(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos. The given pos and return value are 1-based.
Examples:
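For instance (illustrative):
> SELECT locate('bar', 'foobarbar');
 4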
log
log(base, expr) - Returns the logarithm of expr with base.
Examples:
log10
log10(expr) - Returns the logarithm of expr with base 10.
Examples:
log1p
log1p(expr) - Returns log(1 + expr).
Examples:
log2
log2(expr) - Returns the logarithm of expr with base 2.
Examples:
lower
lower(str) - Returns str with all characters changed to lowercase.
Examples:
lpad
lpad(str, len, pad) - Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters.
Examples:
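An illustrative query:
> SELECT lpad('hi', 5, '??');
 ???hi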
ltrim
ltrim(str) - Removes the leading space characters from str.
ltrim(trimStr, str) - Removes leading characters contained in the trim string trimStr from str.
Arguments:
- str - a string expression
- trimStr - the trim string characters to trim, the default value is a single space
Examples:
map
map(key0, value0, key1, value1, …) - Creates a map with the given key/value pairs.
Examples:
map_concat
map_concat(map, …) - Returns the union of all the given maps
Examples:
Since: 2.4.0
map_from_arrays
map_from_arrays(keys, values) - Creates a map with a pair of the given key/value arrays. All elements in keys should not be null
Examples:
Since: 2.4.0
map_from_entries
map_from_entries(arrayOfEntries) - Returns a map created from the given array of entries.
Examples:
Since: 2.4.0
map_keys
map_keys(map) - Returns an unordered array containing the keys of the map.
Examples:
map_values
map_values(map) - Returns an unordered array containing the values of the map.
Examples:
max
max(expr) - Returns the maximum value of expr.
md5
md5(expr) - Returns an MD5 128-bit checksum as a hex string of expr.
Examples:
mean
mean(expr) - Returns the mean calculated from values of a group.
min
min(expr) - Returns the minimum value of expr.
minute
minute(timestamp) - Returns the minute component of the string/timestamp.
Examples:
Since: 1.5.0
mod
expr1 mod expr2 - Returns the remainder after expr1/expr2.
Examples:
monotonically_increasing_id
monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records. The function is non-deterministic because its result depends on partition IDs.
month
month(date) - Returns the month component of the date/timestamp.
Examples:
Since: 1.5.0
months_between
months_between(timestamp1, timestamp2[, roundOff]) - If timestamp1 is later than timestamp2, then the result is positive. If timestamp1 and timestamp2 are on the same day of month, or both are the last day of month, time of day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits unless roundOff=false.
Examples:
Since: 1.5.0
named_struct
named_struct(name1, val1, name2, val2, …) - Creates a struct with the given field names and values.
Examples:
nanvl
nanvl(expr1, expr2) - Returns expr1 if it’s not NaN, or expr2 otherwise.
Examples:
negative
negative(expr) - Returns the negated value of expr.
Examples:
next_day
next_day(start_date, day_of_week) - Returns the first date which is later than start_date and named as indicated.
Examples:
Since: 1.5.0
not
not expr - Logical not.
now
now() - Returns the current timestamp at the start of query evaluation.
Since: 1.5.0
ntile
ntile(n) - Divides the rows for each window partition into n buckets ranging from 1 to at most n.
nullif
nullif(expr1, expr2) - Returns null if expr1 equals expr2, or expr1 otherwise.
Examples:
nvl
nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.
Examples:
nvl2
nvl2(expr1, expr2, expr3) - Returns expr2 if expr1 is not null, or expr3 otherwise.
Examples:
octet_length
octet_length(expr) - Returns the byte length of string data or number of bytes of binary data.
Examples:
or
expr1 or expr2 - Logical OR.
parse_url
parse_url(url, partToExtract[, key]) - Extracts a part from a URL.
Examples:
percent_rank
percent_rank() - Computes the percentage ranking of a value in a group of values.
percentile
percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be a positive integer.
percentile(col, array(percentage1 [, percentage2]…) [, frequency]) - Returns the exact percentile value array of numeric column col at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0. The value of frequency should be a positive integer.
percentile_approx
percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation. When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.
Examples:
pi
pi() - Returns pi.
Examples:
pmod
pmod(expr1, expr2) - Returns the positive value of expr1 mod expr2.
Examples:
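For instance (illustrative):
> SELECT pmod(10, 3);
 1
> SELECT pmod(-7, 3);
 2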
posexplode
posexplode(expr) - Separates the elements of array into multiple rows with positions, or the elements of map into multiple rows and columns with positions.
Examples:
posexplode_outer
posexplode_outer(expr) - Separates the elements of array into multiple rows with positions, or the elements of map into multiple rows and columns with positions.
Examples:
position
position(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos. The given pos and return value are 1-based.
Examples:
positive
positive(expr) - Returns the value of expr.
pow
pow(expr1, expr2) - Raises expr1 to the power of expr2.
Examples:
power
power(expr1, expr2) - Raises expr1 to the power of expr2.
Examples:
printf
printf(strfmt, obj, …) - Returns a formatted string from printf-style format strings.
Examples:
quarter
quarter(date) - Returns the quarter of the year for date, in the range 1 to 4.
Examples:
Since: 1.5.0
radians
radians(expr) - Converts degrees to radians.
Arguments:
Examples:
rand
rand([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).
Examples:
randn
randn([seed]) - Returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution.
Examples:
rank
rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row in the ordering of the partition. The values will produce gaps in the sequence.
reflect
reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Examples:
regexp_replace
regexp_replace(str, regexp, rep) - Replaces all substrings of str that match regexp with rep.
Examples:
repeat
repeat(str, n) - Returns the string which repeats the given string value n times.
Examples:
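An illustrative query:
> SELECT repeat('123', 2);
 123123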
replace
replace(str, search[, replace]) - Replaces all occurrences of search with replace.
Arguments:
- str - a string expression
- search - a string expression. If search is not found in str, str is returned unchanged.
- replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that is removed from str.
Examples:
reverse
reverse(array) - Returns a reversed string or an array with reverse order of elements.
Examples:
Reverse logic for arrays is available since 2.4.0. Since: 1.5.0
right
right(str, len) - Returns the rightmost len (len can be string type) characters from the string str. If len is less than or equal to 0, the result is an empty string.
Examples:
rint
rint(expr) - Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
Examples:
rlike
str rlike regexp - Returns true if str matches regexp, or false otherwise.
Arguments:
str - a string expression
regexp - a string expression. The pattern string should be a Java regular expression.
Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser. For example, to match “\abc”, a regular expression for regexp can be “^\\abc$”.
There is a SQL config ‘spark.sql.parser.escapedStringLiterals’ that can be used to fall back to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match “\abc” is “^\abc$”.
Examples:
Note:
Use LIKE to match with simple string pattern.
rollup
round
round(expr, d) - Returns expr rounded to d decimal places using HALF_UP rounding mode.
Examples:
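A sketch of the HALF_UP behavior, contrasted with bround above (illustrative):
> SELECT round(2.5, 0);
 3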
row_number
row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.
rpad
rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters.
Examples:
Spark SQL functions
substring_index(str, delim, count) - If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.
For binary input, the substring is sliced starting from byte position pos of inputString and proceeding for len bytes, or from byte position pos of inputString to the end when no length is given.
translate - The characters in replaceString correspond to the characters in matchingString. The translation happens whenever any character in the input string matches a character in matchingString.
SQL Functions
Ascend uses Spark SQL syntax. This page offers a list of functions supported by the Ascend platform.
Looking for that special function?
Have a SQL function you love that isn't yet supported in Spark SQL? Let us know in our Slack Community, and we'll do our best to add it for you!
Description
any(expr) - Returns true if at least one value of expr is true.
Argument type | Return type |
---|---|
( Bool ) | Bool |
More
ANY on Apache Spark Documentation
Description
some(expr) - Returns true if at least one value of expr is true.
Argument type | Return type |
---|---|
( Bool ) | Bool |
More
SOME on Apache Spark Documentation
Description
bool_or(expr) - Returns true if at least one value of expr is true.
Argument type | Return type |
---|---|
( Bool ) | Bool |
More
BOOL_OR on Apache Spark Documentation
Description
bool_and(expr) - Returns true if all values of expr are true.
Argument type | Return type |
---|---|
( Bool ) | Bool |
More
BOOL_AND on Apache Spark Documentation
Description
every(expr) - Returns true if all values of expr are true.
Argument type | Return type |
---|---|
( Bool ) | Bool |
More
EVERY on Apache Spark Documentation
Description
avg(expr) - Returns the mean calculated from values of a group.
Argument type | Return type |
---|---|
( Integer or Float ) | Float |
More
AVG on Apache Spark Documentation
Description
bit_or(expr) - Returns the bitwise OR of all non-null input values, or null if none.
Argument type | Return type |
---|---|
( Integer ) | Integer |
More
BIT_OR on Apache Spark Documentation
Description
bit_and(expr) - Returns the bitwise AND of all non-null input values, or null if none.
Argument type | Return type |
---|---|
( Integer ) | Integer |
More
BIT_AND on Apache Spark Documentation
Description
bit_xor(expr) - Returns the bitwise XOR of all non-null input values, or null if none.
Argument type | Return type |
---|---|
( Integer ) | Integer |
More
BIT_XOR on Apache Spark Documentation
Description
mean(expr) - Returns the mean calculated from values of a group.
Argument type | Return type |
---|---|
( Integer or Float ) | Float |
More
MEAN on Apache Spark Documentation
Description
count(*) - Returns the total number of retrieved rows, including rows containing null. count(expr) - Returns the number of rows for which the supplied expression is non-null. count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-null.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map ) | Integer |
More
COUNT on Apache Spark Documentation
Description
count_if(expr) - Returns the number of TRUE values for the expression.
Argument type | Return type |
---|---|
( Bool ) | Integer |
More
COUNT_IF on Apache Spark Documentation
Description
max(expr) - Returns the maximum value of expr.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Float ) | Float |
( Bool ) | Bool |
( String ) | String |
( Binary ) | Binary |
( Date ) | Date |
( Timestamp ) | Timestamp |
More
MAX on Apache Spark Documentation
Description
max_by(x, y) - Returns the value of x associated with the maximum value of y.
Argument type | Return type |
---|---|
( Integer, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Integer |
( Float, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Float |
( Bool, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Bool |
( String, Integer or Float or Bool or String or Binary or Date or Timestamp ) | String |
( Binary, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Binary |
( Date, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Date |
( Timestamp, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Timestamp |
( Array, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Array |
( Struct, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Struct |
( Map, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Map |
More
MAX_BY on Apache Spark Documentation
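An illustrative query (the inline table tab and its values are hypothetical):
> SELECT max_by(x, y) FROM VALUES ('a', 10), ('b', 50), ('c', 20) AS tab(x, y);
 b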
Description
min_by(x, y) - Returns the value of x associated with the minimum value of y.
Argument type | Return type |
---|---|
( Integer, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Integer |
( Float, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Float |
( Bool, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Bool |
( String, Integer or Float or Bool or String or Binary or Date or Timestamp ) | String |
( Binary, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Binary |
( Date, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Date |
( Timestamp, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Timestamp |
( Array, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Array |
( Struct, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Struct |
( Map, Integer or Float or Bool or String or Binary or Date or Timestamp ) | Map |
More
MIN_BY on Apache Spark Documentation
Description
min(expr) - Returns the minimum value of expr.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Float ) | Float |
( Bool ) | Bool |
( String ) | String |
( Binary ) | Binary |
( Date ) | Date |
( Timestamp ) | Timestamp |
More
MIN on Apache Spark Documentation
Description
sum(expr) - Returns the sum calculated from values of a group.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Float ) | Float |
More
SUM on Apache Spark Documentation
Description
collect_list(expr) - Collects and returns a list of non-unique elements.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map ) | Array |
More
COLLECT_LIST on Apache Spark Documentation
Description
collect_set(expr) - Collects and returns a set of unique elements.
Argument type | Return type |
---|---|
( Integer or Bool or String or Binary or Date or Timestamp ) | Array |
More
COLLECT_SET on Apache Spark Documentation
Description
first(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Integer, Bool ) | Integer |
( Float ) | Float |
( Float, Bool ) | Float |
( Bool ) | Bool |
( Bool, Bool ) | Bool |
( String ) | String |
( String, Bool ) | String |
( Binary ) | Binary |
( Binary, Bool ) | Binary |
( Date ) | Date |
( Date, Bool ) | Date |
( Timestamp ) | Timestamp |
( Timestamp, Bool ) | Timestamp |
( Array ) | Array |
( Array, Bool ) | Array |
( Struct ) | Struct |
( Struct, Bool ) | Struct |
( Map ) | Map |
( Map, Bool ) | Map |
More
FIRST on Apache Spark Documentation
Description
first_value(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Integer, Bool ) | Integer |
( Float ) | Float |
( Float, Bool ) | Float |
( Bool ) | Bool |
( Bool, Bool ) | Bool |
( String ) | String |
( String, Bool ) | String |
( Binary ) | Binary |
( Binary, Bool ) | Binary |
( Date ) | Date |
( Date, Bool ) | Date |
( Timestamp ) | Timestamp |
( Timestamp, Bool ) | Timestamp |
( Array ) | Array |
( Array, Bool ) | Array |
( Struct ) | Struct |
( Struct, Bool ) | Struct |
( Map ) | Map |
( Map, Bool ) | Map |
More
FIRST_VALUE on Apache Spark Documentation
Description
last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Integer, Bool ) | Integer |
( Float ) | Float |
( Float, Bool ) | Float |
( Bool ) | Bool |
( Bool, Bool ) | Bool |
( String ) | String |
( String, Bool ) | String |
( Binary ) | Binary |
( Binary, Bool ) | Binary |
( Date ) | Date |
( Date, Bool ) | Date |
( Timestamp ) | Timestamp |
( Timestamp, Bool ) | Timestamp |
( Array ) | Array |
( Array, Bool ) | Array |
( Struct ) | Struct |
( Struct, Bool ) | Struct |
( Map ) | Map |
( Map, Bool ) | Map |
More
LAST on Apache Spark Documentation
Description
last_value(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Integer, Bool ) | Integer |
( Float ) | Float |
( Float, Bool ) | Float |
( Bool ) | Bool |
( Bool, Bool ) | Bool |
( String ) | String |
( String, Bool ) | String |
( Binary ) | Binary |
( Binary, Bool ) | Binary |
( Date ) | Date |
( Date, Bool ) | Date |
( Timestamp ) | Timestamp |
( Timestamp, Bool ) | Timestamp |
( Array ) | Array |
( Array, Bool ) | Array |
( Struct ) | Struct |
( Struct, Bool ) | Struct |
( Map ) | Map |
( Map, Bool ) | Map |
More
LAST_VALUE on Apache Spark Documentation
Description
corr(expr1, expr2) - Returns Pearson coefficient of correlation between a set of number pairs.
Argument type | Return type |
---|---|
( Float, Float ) | Float |
More
CORR on Apache Spark Documentation
Description
covar_pop(expr1, expr2) - Returns the population covariance of a set of number pairs.
Argument type | Return type |
---|---|
( Float, Float ) | Float |
More
COVAR_POP on Apache Spark Documentation
Description
covar_samp(expr1, expr2) - Returns the sample covariance of a set of number pairs.
Argument type | Return type |
---|---|
( Float, Float ) | Float |
More
COVAR_SAMP on Apache Spark Documentation
Description
kurtosis(expr) - Returns the kurtosis value calculated from values of a group.
Argument type | Return type |
---|---|
( Float ) | Float |
More
KURTOSIS on Apache Spark Documentation
Description
skewness(expr) - Returns the skewness value calculated from values of a group.
Argument type | Return type |
---|---|
( Float ) | Float |
More
SKEWNESS on Apache Spark Documentation
Description
stddev_pop(expr) - Returns the population standard deviation calculated from values of a group.
Argument type | Return type |
---|---|
( Float ) | Float |
More
STDDEV_POP on Apache Spark Documentation
Description
stddev_samp(expr) - Returns the sample standard deviation calculated from values of a group.
Argument type | Return type |
---|---|
( Float ) | Float |
More
STDDEV_SAMP on Apache Spark Documentation
Description
std(expr) - Returns the sample standard deviation calculated from values of a group.
Argument type | Return type |
---|---|
( Float ) | Float |
More
STD on Apache Spark Documentation
Description
stddev(expr) - Returns the sample standard deviation calculated from values of a group.
Argument type | Return type |
---|---|
( Float ) | Float |
More
STDDEV on Apache Spark Documentation
Description
var_pop(expr) - Returns the population variance calculated from values of a group.
Argument type | Return type |
---|---|
( Float ) | Float |
More
VAR_POP on Apache Spark Documentation
Description
var_samp(expr) - Returns the sample variance calculated from values of a group.
Argument type | Return type |
---|---|
( Float ) | Float |
More
VAR_SAMP on Apache Spark Documentation
Description
variance(expr) - Returns the sample variance calculated from values of a group.
Argument type | Return type |
---|---|
( Float ) | Float |
More
VARIANCE on Apache Spark Documentation
Description
percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be a positive integer.
Argument type | Return type |
---|---|
( Float, Float ) | Float |
( Float, Float, Integer ) | Float |
More
PERCENTILE on Apache Spark Documentation
Description
percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation.
Argument type | Return type |
---|---|
( Float, Float ) | Float |
( Float, Float, Integer ) | Float |
More
PERCENTILE_APPROX on Apache Spark Documentation
Description
approx_percentile(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation.
Argument type | Return type |
---|---|
( Float, Float ) | Float |
( Float, Float, Integer ) | Float |
More
APPROX_PERCENTILE on Apache Spark Documentation
Description
approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++. relativeSD defines the maximum estimation error allowed.
Argument type | Return type |
---|---|
( Integer or Bool or String or Binary or Date or Timestamp ) | Integer |
( Integer or Bool or String or Binary or Date or Timestamp, Float ) | Integer |
More
APPROX_COUNT_DISTINCT on Apache Spark Documentation
Description
count_min_sketch(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
Argument type | Return type |
---|---|
( Integer or Bool or String or Binary or Date or Timestamp, Float, Float, Integer ) | Binary |
More
COUNT_MIN_SKETCH on Apache Spark Documentation
Description
isnull(expr) - Returns true if expr is null, or false otherwise.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map ) | Bool |
More
ISNULL on Apache Spark Documentation
Description
isnotnull(expr) - Returns true if expr is not null, or false otherwise.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map ) | Bool |
More
ISNOTNULL on Apache Spark Documentation
Description
nullif(expr1, expr2) - Returns null if expr1 equals to expr2, or expr1 otherwise.
Argument type | Return type |
---|---|
( Integer, Integer ) | Integer |
( Float, Float ) | Float |
( Bool, Bool ) | Bool |
( String, String ) | String |
( Binary, Binary ) | Binary |
( Date, Date ) | Date |
( Timestamp, Timestamp ) | Timestamp |
( Array, Array ) | Array |
( Struct, Struct ) | Struct |
( Map, Map ) | Map |
More
NULLIF on Apache Spark Documentation
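For instance (illustrative):
> SELECT nullif(2, 2);
 NULL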
Description
not(expr) - Logical not.
Argument type | Return type |
---|---|
( Bool ) | Bool |
More
NOT on Apache Spark Documentation
Description
if(expr1, expr2, expr3) - If expr1 evaluates to true, then returns expr2; otherwise returns expr3.
Argument type | Return type |
---|---|
( Bool, Integer, Integer ) | Integer |
( Bool, Float, Float ) | Float |
( Bool, Bool, Bool ) | Bool |
( Bool, String, String ) | String |
( Bool, Binary, Binary ) | Binary |
( Bool, Date, Date ) | Date |
( Bool, Timestamp, Timestamp ) | Timestamp |
( Bool, Array, Array ) | Array |
( Bool, Struct, Struct ) | Struct |
( Bool, Map, Map ) | Map |
More
IF on Apache Spark Documentation
Description
ifnull(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.
Argument type | Return type |
---|---|
( Integer, Integer ) | Integer |
( Float, Float ) | Float |
( Bool, Bool ) | Bool |
( String, String ) | String |
( Binary, Binary ) | Binary |
( Date, Date ) | Date |
( Timestamp, Timestamp ) | Timestamp |
( Array, Array ) | Array |
( Struct, Struct ) | Struct |
( Map, Map ) | Map |
More
IFNULL on Apache Spark Documentation
Description
nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.
Argument type | Return type |
---|---|
( Integer, Integer ) | Integer |
( Float, Float ) | Float |
( Bool, Bool ) | Bool |
( String, String ) | String |
( Binary, Binary ) | Binary |
( Date, Date ) | Date |
( Timestamp, Timestamp ) | Timestamp |
( Array, Array ) | Array |
( Struct, Struct ) | Struct |
( Map, Map ) | Map |
More
NVL on Apache Spark Documentation
Description
nvl2(expr1, expr2, expr3) - Returns expr2 if expr1 is not null, or expr3 otherwise.
Argument type | Return type |
---|---|
( Integer, Integer, Integer ) | Integer |
( Float, Float, Float ) | Float |
( Bool, Bool, Bool ) | Bool |
( String, String, String ) | String |
( Binary, Binary, Binary ) | Binary |
( Date, Date, Date ) | Date |
( Timestamp, Timestamp, Timestamp ) | Timestamp |
( Array, Array, Array ) | Array |
( Struct, Struct, Struct ) | Struct |
( Map, Map, Map ) | Map |
More
NVL2 on Apache Spark Documentation
Description
coalesce(expr1, expr2, ...) - Returns the first non-null argument if exists. Otherwise, null.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Float ) | Float |
( Bool ) | Bool |
( String ) | String |
( Binary ) | Binary |
( Date ) | Date |
( Timestamp ) | Timestamp |
( Array ) | Array |
( Struct ) | Struct |
( Map ) | Map |
More
COALESCE on Apache Spark Documentation
Description
raise_error(msg) - Raises an error with the specified string message. The function always returns a value of string data type.
Argument type | Return type |
---|---|
( String ) | String |
More
RAISE_ERROR on Apache Spark Documentation
Description
assert_true(expr) - Throws an exception if expr is not true.
Argument type | Return type |
---|---|
( Bool ) | String |
More
ASSERT_TRUE on Apache Spark Documentation
Description
typeof(expr) - Return DDL-formatted type string for the data type of the input.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map ) | String |
More
TYPEOF on Apache Spark Documentation
Description
bigint(expr) - Casts the value expr to the target data type bigint.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String ) | Integer |
More
BIGINT on Apache Spark Documentation
Description
date(expr) - Casts the value expr to the target data type date.
Argument type | Return type |
---|---|
( String or Date or Timestamp ) | Date |
More
DATE on Apache Spark Documentation
Description
double(expr) - Casts the value expr to the target data type double.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String ) | Float |
More
DOUBLE on Apache Spark Documentation
Description
int(expr) - Casts the value expr to the target data type int.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String ) | Integer |
More
INT on Apache Spark Documentation
Description
timestamp(expr) - Casts the value expr to the target data type timestamp.
Argument type | Return type |
---|---|
( String ) | Timestamp |
More
TIMESTAMP on Apache Spark Documentation
Description
binary(expr) - Casts the value expr to the target data type binary.
Argument type | Return type |
---|---|
( Integer or String or Binary ) | Binary |
More
BINARY on Apache Spark Documentation
Description
boolean(expr) - Casts the value expr to the target data type boolean.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String or Date or Timestamp ) | Bool |
More
BOOLEAN on Apache Spark Documentation
Description
decimal(expr) - Casts the value expr to the target data type decimal.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String ) | Float |
More
DECIMAL on Apache Spark Documentation
Description
float(expr) - Casts the value expr to the target data type float.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String ) | Float |
More
FLOAT on Apache Spark Documentation
Description
smallint(expr) - Casts the value expr to the target data type smallint.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String ) | Integer |
More
SMALLINT on Apache Spark Documentation
Description
string(expr) - Casts the value expr to the target data type string.
Argument type | Return type |
---|---|
( Integer or Float or Bool or String or Binary or Date or Timestamp or Array or Struct or Map ) | String |
More
STRING on Apache Spark Documentation
Description
rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row in the ordering of the partition. The values will produce gaps in the sequence.
Argument type | Return type |
---|---|
No arguments | Integer |
More
RANK on Apache Spark Documentation
Description
dense_rank() - Computes the rank of a value in a group of values. The result is one plus the previously assigned rank value. Unlike the function rank, dense_rank will not produce gaps in the ranking sequence.
Argument type | Return type |
---|---|
No arguments | Integer |
More
DENSE_RANK on Apache Spark Documentation
Description
percent_rank() - Computes the percentage ranking of a value in a group of values.
Argument type | Return type |
---|---|
No arguments | Float |
More
PERCENT_RANK on Apache Spark Documentation
Description
cume_dist() - Computes the position of a value relative to all values in the partition.
Argument type | Return type |
---|---|
No arguments | Float |
More
CUME_DIST on Apache Spark Documentation
Description
ntile(n) - Divides the rows for each window partition into n buckets ranging from 1 to at most n.
Argument type | Return type |
---|---|
( Integer ) | Integer |
More
NTILE on Apache Spark Documentation
Description
row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.
Argument type | Return type |
---|---|
No arguments | Integer |
More
ROW_NUMBER on Apache Spark Documentation
Description
bin(expr) - Returns the string representation of the long value expr represented in binary.
Argument type | Return type |
---|---|
( Integer ) | String |
More
BIN on Apache Spark Documentation
Description
conv(num, from_base, to_base) - Convert num from from_base to to_base.
Argument type | Return type |
---|---|
( String, Integer, Integer ) | String |
More
CONV on Apache Spark Documentation
Description
monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
Argument type | Return type |
---|---|
No arguments | Integer |
More
MONOTONICALLY_INCREASING_ID on Apache Spark Documentation
Description
lead(input[, offset[, default]]) - Returns the value of input at the offsetth row after the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offsetth row is null, null is returned. If there is no such an offset row (e.g., when the offset is 1, the last row of the window does not have any subsequent row), default is returned.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Integer, Integer ) | Integer |
( Integer, Integer, Integer ) | Integer |
( Float ) | Float |
( Float, Integer ) | Float |
( Float, Integer, Float ) | Float |
( Bool ) | Bool |
( Bool, Integer ) | Bool |
( Bool, Integer, Bool ) | Bool |
( String ) | String |
( String, Integer ) | String |
( String, Integer, String ) | String |
( Binary ) | Binary |
( Binary, Integer ) | Binary |
( Binary, Integer, Binary ) | Binary |
( Date ) | Date |
( Date, Integer ) | Date |
( Date, Integer, Date ) | Date |
( Timestamp ) | Timestamp |
( Timestamp, Integer ) | Timestamp |
( Timestamp, Integer, Timestamp ) | Timestamp |
( Array ) | Array |
( Array, Integer ) | Array |
( Array, Integer, Array ) | Array |
( Struct ) | Struct |
( Struct, Integer ) | Struct |
( Struct, Integer, Struct ) | Struct |
( Map ) | Map |
( Map, Integer ) | Map |
( Map, Integer, Map ) | Map |
More
LEAD on Apache Spark Documentation
Description
lag(input[, offset[, default]]) - Returns the value of input at the offsetth row before the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offsetth row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the first row of the window does not have any previous row), default is returned.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Integer, Integer ) | Integer |
( Integer, Integer, Integer ) | Integer |
( Float ) | Float |
( Float, Integer ) | Float |
( Float, Integer, Float ) | Float |
( Bool ) | Bool |
( Bool, Integer ) | Bool |
( Bool, Integer, Bool ) | Bool |
( String ) | String |
( String, Integer ) | String |
( String, Integer, String ) | String |
( Binary ) | Binary |
( Binary, Integer ) | Binary |
( Binary, Integer, Binary ) | Binary |
( Date ) | Date |
( Date, Integer ) | Date |
( Date, Integer, Date ) | Date |
( Timestamp ) | Timestamp |
( Timestamp, Integer ) | Timestamp |
( Timestamp, Integer, Timestamp ) | Timestamp |
( Array ) | Array |
( Array, Integer ) | Array |
( Array, Integer, Array ) | Array |
( Struct ) | Struct |
( Struct, Integer ) | Struct |
( Struct, Integer, Struct ) | Struct |
( Map ) | Map |
( Map, Integer ) | Map |
( Map, Integer, Map ) | Map |
More
LAG on Apache Spark Documentation
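A short sketch showing lead and lag as window functions. The sales view, its columns, and the default value 0 are illustrative assumptions; an active SparkSession bound to the variable spark is assumed:
spark.createDataFrame(
    [("2017-01", 10), ("2017-02", 15), ("2017-03", 12)],
    ["month", "amount"],
).createOrReplaceTempView("sales")

# lag looks backward and lead looks forward within the window ordering;
# the third argument is the default returned at the window edges.
spark.sql("""
    SELECT month, amount,
           lag(amount, 1, 0)  OVER (ORDER BY month) AS prev_amount,
           lead(amount, 1, 0) OVER (ORDER BY month) AS next_amount
    FROM sales
""").show()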
Description
bit_count(expr) - Returns the number of bits that are set in the argument expr as an unsigned 64-bit integer, or NULL if the argument is NULL.
Argument type | Return type |
---|---|
( Integer or Bool ) | Integer |
More
BIT_COUNT on Apache Spark Documentation
Description
shiftleft(base, expr) - Bitwise left shift.
Argument type | Return type |
---|---|
( Integer, Integer ) | Integer |
More
SHIFTLEFT on Apache Spark Documentation
Description
shiftright(base, expr) - Bitwise (signed) right shift.
Argument type | Return type |
---|---|
( Integer, Integer ) | Integer |
More
SHIFTRIGHT on Apache Spark Documentation
Description
shiftrightunsigned(base, expr) - Bitwise unsigned right shift.
Argument type | Return type |
---|---|
( Integer, Integer ) | Integer |
More
SHIFTRIGHTUNSIGNED on Apache Spark Documentation
Description
e() - Returns Euler's number, e.
Argument type | Return type |
---|---|
No arguments | Float |
More
E on Apache Spark Documentation
Description
pi() - Returns pi.
Argument type | Return type |
---|---|
No arguments | Float |
More
PI on Apache Spark Documentation
Description
abs(expr) - Returns the absolute value of the numeric value.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Float ) | Float |
More
ABS on Apache Spark Documentation
Description
negative(expr) - Returns the negated value of expr.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Float ) | Float |
( ) |
More
NEGATIVE on Apache Spark Documentation
Description
positive(expr) - Returns the value of expr.
Argument type | Return type |
---|---|
( Integer ) | Integer |
( Float ) | Float |
( ) |
More
POSITIVE on Apache Spark Documentation
Description
sign(expr) - Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive.
Argument type | Return type |
---|---|
( Integer or Float ) | Float |
More
SIGN on Apache Spark Documentation
Description
signum(expr) - Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive.
Argument type | Return type |
---|---|
( Integer or Float ) | Float |
More
SIGNUM on Apache Spark Documentation
Description
isnan(expr) - Returns true if expr is NaN, or false otherwise.
Argument type | Return type |
---|---|
( Float ) | Bool |
More
ISNAN on Apache Spark Documentation
Description
rand([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).
Argument type | Return type |
---|---|
No arguments | Float |
( Integer ) | Float |
More
RAND on Apache Spark Documentation
Description
randn([seed]) - Returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution.
Argument type | Return type |
---|---|
No arguments | Float |
( Integer ) | Float |
More
RANDN on Apache Spark Documentation
Description
uuid() - Returns a universally unique identifier (UUID) string. The value is returned as a canonical UUID 36-character string.
Argument type | Return type |
---|---|
No arguments | String |
More
UUID on Apache Spark Documentation
Description
sqrt(expr) - Returns the square root of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
SQRT on Apache Spark Documentation
Description
cbrt(expr) - Returns the cube root of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
CBRT on Apache Spark Documentation
Description
hypot(expr1, expr2) - Returns sqrt(expr1**2 + expr2**2).
Argument type | Return type |
---|---|
( Float, Float ) | Float |
More
HYPOT on Apache Spark Documentation
Description
pow(expr1, expr2) - Raises expr1 to the power of expr2.
Argument type | Return type |
---|---|
( Float, Float ) | Float |
More
POW on Apache Spark Documentation
Description
exp(expr) - Returns e to the power of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
EXP on Apache Spark Documentation
Description
expm1(expr) - Returns exp(expr) - 1.
Argument type | Return type |
---|---|
( Float ) | Float |
More
EXPM1 on Apache Spark Documentation
Description
ln(expr) - Returns the natural logarithm (base e) of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
LN on Apache Spark Documentation
Description
log(base, expr) - Returns the logarithm of expr with base.
Argument type | Return type |
---|---|
( Float ) | Float |
( Float, Float ) | Float |
More
LOG on Apache Spark Documentation
Description
log10(expr) - Returns the logarithm of expr with base 10.
Argument type | Return type |
---|---|
( Float ) | Float |
More
LOG10 on Apache Spark Documentation
Description
log1p(expr) - Returns log(1 + expr).
Argument type | Return type |
---|---|
( Float ) | Float |
More
LOG1P on Apache Spark Documentation
Description
log2(expr) - Returns the logarithm of expr with base 2.
Argument type | Return type |
---|---|
( Float ) | Float |
More
LOG2 on Apache Spark Documentation
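A quick sketch of the logarithm family, assuming an active SparkSession bound to the variable spark:
# ln(e()) = 1.0, log(2, 8) = 3.0, log10(100) = 2.0, log2(8) = 3.0
spark.sql("SELECT ln(e()), log(2, 8), log10(100), log2(8)").show()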
Description
greatest(expr, ...) - Returns the greatest value of all parameters, skipping null values.
Argument type | Return type |
---|---|
( Integer, Integer ) | Integer |
( Float, Float ) | Float |
( Bool, Bool ) | Bool |
( String, String ) | String |
( Binary, Binary ) | Binary |
( Date, Date ) | Date |
( Timestamp, Timestamp ) | Timestamp |
More
GREATEST on Apache Spark Documentation
Description
least(expr, ...) - Returns the least value of all parameters, skipping null values.
Argument type | Return type |
---|---|
( Integer, Integer ) | Integer |
( Float, Float ) | Float |
( Bool, Bool ) | Bool |
( String, String ) | String |
( Binary, Binary ) | Binary |
( Date, Date ) | Date |
( Timestamp, Timestamp ) | Timestamp |
More
LEAST on Apache Spark Documentation
Description
mod(expr1, expr2) - Returns the remainder after expr1/expr2.
Argument type | Return type |
---|---|
( Integer, Integer ) | Integer |
( Float, Float ) | Float |
More
MOD on Apache Spark Documentation
Description
pmod(expr1, expr2) - Returns the positive value of expr1 mod expr2.
Argument type | Return type |
---|---|
( Integer, Integer ) | Integer |
( Float, Float ) | Float |
More
PMOD on Apache Spark Documentation
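The difference between mod and pmod only shows up with negative operands. A minimal sketch, assuming an active SparkSession bound to the variable spark:
# mod keeps the sign of the dividend; pmod always returns a non-negative value.
spark.sql("SELECT mod(-7, 3) AS m, pmod(-7, 3) AS p").show()  # m = -1, p = 2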
Description
factorial(expr) - Returns the factorial of expr. expr must be in the range [0..20]; otherwise, null is returned.
Argument type | Return type |
---|---|
( Integer ) | Integer |
More
FACTORIAL on Apache Spark Documentation
Description
nanvl(expr1, expr2) - Returns expr1 if it's not NaN, or expr2 otherwise.
Argument type | Return type |
---|---|
( Float, Float ) | Float |
More
NANVL on Apache Spark Documentation
Description
div(expr1, expr2) - Divides expr1 by expr2. It returns NULL if an operand is NULL or expr2 is 0. The result is cast to long.
Argument type | Return type |
---|---|
( Integer, Integer ) | Integer |
More
DIV on Apache Spark Documentation
Description
power(expr1, expr2) - Raises expr1 to the power of expr2.
Argument type | Return type |
---|---|
( Float, Float ) | Float |
More
POWER on Apache Spark Documentation
Description
random([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).
Argument type | Return type |
---|---|
No arguments | Float |
( Integer ) | Float |
More
RANDOM on Apache Spark Documentation
Description
round(expr, d) - Returns expr rounded to d decimal places using HALF_UP rounding mode.
Argument type | Return type |
---|---|
( Float ) | Float |
( Float, Integer ) | Float |
More
ROUND on Apache Spark Documentation
Description
bround(expr, d) - Returns expr rounded to d decimal places using HALF_EVEN rounding mode.
Argument type | Return type |
---|---|
( Float ) | Float |
( Float, Integer ) | Float |
More
BROUND on Apache Spark Documentation
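A small sketch contrasting HALF_UP (round) with HALF_EVEN (bround) on ties, assuming an active SparkSession bound to the variable spark:
# round uses HALF_UP, bround uses HALF_EVEN (banker's rounding).
spark.sql("SELECT round(2.5, 0), bround(2.5, 0), round(3.5, 0), bround(3.5, 0)").show()
# round(2.5) = 3, bround(2.5) = 2, round(3.5) = 4, bround(3.5) = 4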
Description
ceil(expr) - Returns the smallest integer not smaller than expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
CEIL on Apache Spark Documentation
Description
ceiling(expr) - Returns the smallest integer not smaller than expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
CEILING on Apache Spark Documentation
Description
floor(expr) - Returns the largest integer not greater than expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
FLOOR on Apache Spark Documentation
Description
rint(expr) - Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
Argument type | Return type |
---|---|
( Float ) | Float |
More
RINT on Apache Spark Documentation
Description
degrees(expr) - Converts radians to degrees.
Argument type | Return type |
---|---|
( Float ) | Float |
More
DEGREES on Apache Spark Documentation
Description
radians(expr) - Converts degrees to radians.
Argument type | Return type |
---|---|
( Float ) | Float |
More
RADIANS on Apache Spark Documentation
Description
cos(expr) - Returns the cosine of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
COS on Apache Spark Documentation
Description
cosh(expr) - Returns the hyperbolic cosine of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
COSH on Apache Spark Documentation
Description
acos(expr) - Returns the inverse cosine (a.k.a. arccosine) of expr if -1<=expr<=1 or NaN otherwise.
Argument type | Return type |
---|---|
( Float ) | Float |
More
ACOS on Apache Spark Documentation
Description
acosh(expr) - Returns inverse hyperbolic cosine of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
ACOSH on Apache Spark Documentation
Description
sin(expr) - Returns the sine of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
SIN on Apache Spark Documentation
Description
sinh(expr) - Returns the hyperbolic sine of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
SINH on Apache Spark Documentation
Description
asin(expr) - Returns the inverse sine (a.k.a. arcsine) of expr if -1<=expr<=1, or NaN otherwise.
Argument type | Return type |
---|---|
( Float ) | Float |
More
ASIN on Apache Spark Documentation
Description
asinh(expr) - Returns inverse hyperbolic sine of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
ASINH on Apache Spark Documentation
Description
tan(expr) - Returns the tangent of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
TAN on Apache Spark Documentation
Description
tanh(expr) - Returns the hyperbolic tangent of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
TANH on Apache Spark Documentation
Description
atanh(expr) - Returns inverse hyperbolic tangent of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
ATANH on Apache Spark Documentation
Description
cot(expr) - Returns the cotangent of expr.
Argument type | Return type |
---|---|
( Float ) | Float |
More
COT on Apache Spark Documentation
Description
atan(expr) - Returns the inverse tangent (a.k.a. arctangent).
Argument type | Return type |
---|---|
( Float ) | Float |
More
ATAN on Apache Spark Documentation
Description
atan2(expr1, expr2) - Returns the angle in radians between the positive x-axis of a plane and the point given by the coordinates (expr1, expr2).
Argument type | Return type |
---|---|
( Float, Float ) | Float |
More
Spark SQL built-in functions (alphabetical reference)
!
! expr - Logical not.
%
expr1 % expr2 - Returns the remainder after /.
Examples:
&
expr1 & expr2 - Returns the result of bitwise AND of and .
Examples:
*
expr1 * expr2 - Returns *.
Examples:
+
expr1 + expr2 - Returns +.
Examples:
-
expr1 - expr2 - Returns -.
Examples:
/
expr1 / expr2 - Returns /. It always performs floating point division.
Examples:
<
expr1 < expr2 - Returns true if is less than .
Arguments:
- expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such array/struct, the data types of fields must be orderable.
Examples:
<=
expr1 <= expr2 - Returns true if is less than or equal to .
Arguments:
- expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such array/struct, the data types of fields must be orderable.
Examples:
<=>
expr1 <=> expr2 - Returns same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of the them is null.
Arguments:
- expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such array/struct, the data types of fields must be orderable.
Examples:
=
expr1 = expr2 - Returns true if equals , or false otherwise.
Arguments:
- expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such array/struct, the data types of fields must be orderable.
Examples:
==
expr1 == expr2 - Returns true if equals , or false otherwise.
Arguments:
- expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such array/struct, the data types of fields must be orderable.
Examples:
>
expr1 > expr2 - Returns true if is greater than .
Arguments:
- expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such array/struct, the data types of fields must be orderable.
Examples:
>=
expr1 >= expr2 - Returns true if is greater than or equal to .
Arguments:
- expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types such array/struct, the data types of fields must be orderable.
Examples:
^
expr1 ^ expr2 - Returns the result of bitwise exclusive OR of and .
Examples:
abs
abs(expr) - Returns the absolute value of the numeric value.
Examples:
acos
acos(expr) - Returns the inverse cosine (a.k.a. arccosine) of if -1<=<=1 or NaN otherwise.
Examples:
add_months
add_months(start_date, num_months) - Returns the date that is after .
Examples:
Since: 1.5.0
and
expr1 and expr2 - Logical AND.
approx_count_distinct
approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++. relativeSD defines the maximum relative standard deviation allowed.
approx_percentile
approx_percentile(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation. When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.
Examples:
array
array(expr, ...) - Returns an array with the given elements.
Examples:
array_contains
array_contains(array, value) - Returns true if the array contains the value.
Examples:
ascii
ascii(str) - Returns the numeric value of the first character of .
Examples:
asin
asin(expr) - Returns the inverse sine (a.k.a. arcsine) of expr if -1<=expr<=1, or NaN otherwise.
Examples:
assert_true
assert_true(expr) - Throws an exception if expr is not true.
Examples:
atan
atan(expr) - Returns the inverse tangent (a.k.a. arctangent).
Examples:
atan2
atan2(expr1, expr2) - Returns the angle in radians between the positive x-axis of a plane and the point given by the coordinates (, ).
Examples:
avg
avg(expr) - Returns the mean calculated from values of a group.
base64
base64(bin) - Converts the argument from a binary to a base 64 string.
Examples:
bigint
bigint(expr) - Casts the value to the target data type .
bin
bin(expr) - Returns the string representation of the long value represented in binary.
Examples:
binary
binary(expr) - Casts the value to the target data type .
bit_length
bit_length(expr) - Returns the bit length of string data or number of bits of binary data.
Examples:
boolean
boolean(expr) - Casts the value to the target data type .
bround
bround(expr, d) - Returns rounded to decimal places using HALF_EVEN rounding mode.
Examples:
cast
cast(expr AS type) - Casts the value to the target data type .
Examples:
cbrt
cbrt(expr) - Returns the cube root of .
Examples:
ceil
ceil(expr) - Returns the smallest integer not smaller than .
Examples:
ceiling
ceiling(expr) - Returns the smallest integer not smaller than .
Examples:
char
char(expr) - Returns the ASCII character having the binary equivalent to expr. If n is larger than 256 the result is equivalent to chr(n % 256).
Examples:
char_length
char_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
character_length
character_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
chr
chr(expr) - Returns the ASCII character having the binary equivalent to expr. If n is larger than 256 the result is equivalent to chr(n % 256).
Examples:
coalesce
coalesce(expr1, expr2, ...) - Returns the first non-null argument if exists. Otherwise, null.
Examples:
collect_list
collect_list(expr) - Collects and returns a list of non-unique elements.
collect_set
collect_set(expr) - Collects and returns a set of unique elements.
concat
concat(str1, str2, ..., strN) - Returns the concatenation of str1, str2, ..., strN.
Examples:
concat_ws
concat_ws(sep, [str | array(str)]+) - Returns the concatenation of the strings separated by .
Examples:
conv
conv(num, from_base, to_base) - Convert num from from_base to to_base.
Examples:
corr
corr(expr1, expr2) - Returns Pearson coefficient of correlation between a set of number pairs.
cos
cos(expr) - Returns the cosine of .
Examples:
cosh
cosh(expr) - Returns the hyperbolic cosine of .
Examples:
cot
cot(expr) - Returns the cotangent of .
Examples:
count
count(*) - Returns the total number of retrieved rows, including rows containing null.
count(expr) - Returns the number of rows for which the supplied expression is non-null.
count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-null.
count_min_sketch
count_min_sketch(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
covar_pop
covar_pop(expr1, expr2) - Returns the population covariance of a set of number pairs.
covar_samp
covar_samp(expr1, expr2) - Returns the sample covariance of a set of number pairs.
crc32
crc32(expr) - Returns a cyclic redundancy check value of the as a bigint.
Examples:
cube
cume_dist
cume_dist() - Computes the position of a value relative to all values in the partition.
current_database
current_database() - Returns the current database.
Examples:
current_date
current_date() - Returns the current date at the start of query evaluation.
Since: 1.5.0
current_timestamp
current_timestamp() - Returns the current timestamp at the start of query evaluation.
Since: 1.5.0
date
date(expr) - Casts the value to the target data type .
date_add
date_add(start_date, num_days) - Returns the date that is num_days after start_date.
Examples:
Since: 1.5.0
date_format
date_format(timestamp, fmt) - Converts to a value of string in the format specified by the date format .
Examples:
Since: 1.5.0
date_sub
date_sub(start_date, num_days) - Returns the date that is num_days before start_date.
Examples:
Since: 1.5.0
date_trunc
date_trunc(fmt, ts) - Returns timestamp ts truncated to the unit specified by the format model fmt. fmt should be one of ["YEAR", "YYYY", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", "WEEK", "QUARTER"].
Examples:
Since: 2.3.0
datediff
datediff(endDate, startDate) - Returns the number of days from startDate to endDate.
Examples:
Since: 1.5.0
day
day(date) - Returns the day of month of the date/timestamp.
Examples:
Since: 1.5.0
dayofmonth
dayofmonth(date) - Returns the day of month of the date/timestamp.
Examples:
Since: 1.5.0
dayofweek
dayofweek(date) - Returns the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday).
Examples:
Since: 2.3.0
dayofyear
dayofyear(date) - Returns the day of year of the date/timestamp.
Examples:
Since: 1.5.0
decimal
decimal(expr) - Casts the value to the target data type .
decode
decode(bin, charset) - Decodes the first argument using the second argument character set.
Examples:
degrees
degrees(expr) - Converts radians to degrees.
Examples:
dense_rank
dense_rank() - Computes the rank of a value in a group of values. The result is one plus the previously assigned rank value. Unlike the function rank, dense_rank will not produce gaps in the ranking sequence.
double
double(expr) - Casts the value to the target data type .
e
e() - Returns Euler's number, e.
Examples:
elt
elt(n, input1, input2, ...) - Returns the n-th input, e.g., returns input2 when n is 2.
Examples:
encode
encode(str, charset) - Encodes the first argument using the second argument character set.
Examples:
exp
exp(expr) - Returns e to the power of .
Examples:
explode
explode(expr) - Separates the elements of array into multiple rows, or the elements of map into multiple rows and columns.
Examples:
explode_outer
explode_outer(expr) - Separates the elements of array into multiple rows, or the elements of map into multiple rows and columns.
Examples:
expm1
expm1(expr) - Returns exp() - 1.
Examples:
factorial
factorial(expr) - Returns the factorial of . is [0..20]. Otherwise, null.
Examples:
find_in_set
find_in_set(str, str_array) - Returns the index (1-based) of the given string (str) in the comma-delimited list (str_array). Returns 0 if the string was not found or if the given string (str) contains a comma.
Examples:
first
first(expr[, isIgnoreNull]) - Returns the first value of for a group of rows. If is true, returns only non-null values.
first_value
first_value(expr[, isIgnoreNull]) - Returns the first value of for a group of rows. If is true, returns only non-null values.
float
float(expr) - Casts the value to the target data type .
floor
floor(expr) - Returns the largest integer not greater than .
Examples:
format_number
format_number(expr1, expr2) - Formats the number expr1 like '#,###,###.##', rounded to expr2 decimal places. If expr2 is 0, the result has no decimal point or fractional part. This is supposed to function like MySQL's FORMAT.
Examples:
format_string
format_string(strfmt, obj, ...) - Returns a formatted string from printf-style format strings.
Examples:
from_json
from_json(jsonStr, schema[, options]) - Returns a struct value with the given and .
Examples:
Since: 2.2.0
from_unixtime
from_unixtime(unix_time, format) - Returns in the specified .
Examples:
Since: 1.5.0
from_utc_timestamp
from_utc_timestamp(timestamp, timezone) - Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.
Examples:
Since: 1.5.0
get_json_object
get_json_object(json_txt, path) - Extracts a json object from .
Examples:
greatest
greatest(expr, ...) - Returns the greatest value of all parameters, skipping null values.
Examples:
grouping
grouping_id
hash
hash(expr1, expr2, ...) - Returns a hash value of the arguments.
Examples:
hex
hex(expr) - Converts to hexadecimal.
Examples:
hour
hour(timestamp) - Returns the hour component of the string/timestamp.
Examples:
Since: 1.5.0
hypot
hypot(expr1, expr2) - Returns sqrt(expr1**2 + expr2**2).
Examples:
if
if(expr1, expr2, expr3) - If expr1 evaluates to true, then returns expr2; otherwise returns expr3.
Examples:
ifnull
ifnull(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.
Examples:
in
expr1 in(expr2, expr3, ...) - Returns true if equals to any valN.
Arguments:
- expr1, expr2, expr3, ... - the arguments must be same type.
Examples:
initcap
initcap(str) - Returns with the first letter of each word in uppercase. All other letters are in lowercase. Words are delimited by white space.
Examples:
inline
inline(expr) - Explodes an array of structs into a table.
Examples:
inline_outer
inline_outer(expr) - Explodes an array of structs into a table.
Examples:
input_file_block_length
input_file_block_length() - Returns the length of the block being read, or -1 if not available.
input_file_block_start
input_file_block_start() - Returns the start offset of the block being read, or -1 if not available.
input_file_name
input_file_name() - Returns the name of the file being read, or empty string if not available.
instr
instr(str, substr) - Returns the (1-based) index of the first occurrence of in .
Examples:
int
int(expr) - Casts the value to the target data type .
isnan
isnan(expr) - Returns true if is NaN, or false otherwise.
Examples:
isnotnull
isnotnull(expr) - Returns true if is not null, or false otherwise.
Examples:
isnull
isnull(expr) - Returns true if is null, or false otherwise.
Examples:
java_method
java_method(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Examples:
json_tuple
json_tuple(jsonStr, p1, p2, ..., pn) - Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.
Examples:
kurtosis
kurtosis(expr) - Returns the kurtosis value calculated from values of a group.
lag
lag(input[, offset[, default]]) - Returns the value of input at the offsetth row before the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offsetth row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the first row of the window does not have any previous row), default is returned.
last
last(expr[, isIgnoreNull]) - Returns the last value of for a group of rows. If is true, returns only non-null values.
last_day
last_day(date) - Returns the last day of the month which the date belongs to.
Examples:
Since: 1.5.0
last_value
last_value(expr[, isIgnoreNull]) - Returns the last value of for a group of rows. If is true, returns only non-null values.
lcase
lcase(str) - Returns with all characters changed to lowercase.
Examples:
lead
lead(input[, offset[, default]]) - Returns the value of input at the offsetth row after the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offsetth row is null, null is returned. If there is no such an offset row (e.g., when the offset is 1, the last row of the window does not have any subsequent row), default is returned.
least
least(expr, ...) - Returns the least value of all parameters, skipping null values.
Examples:
left
left(str, len) - Returns the leftmost len (len can be string type) characters from the string str. If len is less than or equal to 0, the result is an empty string.
Examples:
length
length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
levenshtein
levenshtein(str1, str2) - Returns the Levenshtein distance between the two given strings.
Examples:
like
str like pattern - Returns true if str matches pattern, null if any arguments are null, false otherwise.
Arguments:
- str - a string expression
pattern - a string expression. The pattern is a string which is matched literally, with exception to the following special symbols:
_ matches any one character in the input (similar to . in posix regular expressions)
% matches zero or more characters in the input (similar to .* in posix regular expressions)
The escape character is '\'. If an escape character precedes a special symbol or another escape character, the following character is matched literally. It is invalid to escape any other character.
Since Spark 2.0, string literals are unescaped in our SQL parser. For example, in order to match "\abc", the pattern should be "\abc".
When SQL config 'spark.sql.parser.escapedStringLiterals' is enabled, it fallbacks to Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the pattern to match "\abc" should be "\abc".
Examples:
Note:
Use RLIKE to match with standard regular expressions.
ln
ln(expr) - Returns the natural logarithm (base e) of .
Examples:
locate
locate(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos. The given pos and return value are 1-based.
Examples:
log
log(base, expr) - Returns the logarithm of with .
Examples:
log10
log10(expr) - Returns the logarithm of with base 10.
Examples:
log1p
log1p(expr) - Returns log(1 + ).
Examples:
log2
log2(expr) - Returns the logarithm of with base 2.
Examples:
lower
lower(str) - Returns with all characters changed to lowercase.
Examples:
lpad
lpad(str, len, pad) - Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters.
Examples:
ltrim
ltrim(str) - Removes the leading space characters from .
ltrim(trimStr, str) - Removes the leading string contains the characters from the trim string
Arguments:
- str - a string expression
- trimStr - the trim string characters to trim, the default value is a single space
Examples:
map
map(key0, value0, key1, value1, ...) - Creates a map with the given key/value pairs.
Examples:
map_keys
map_keys(map) - Returns an unordered array containing the keys of the map.
Examples:
map_values
map_values(map) - Returns an unordered array containing the values of the map.
Examples:
max
max(expr) - Returns the maximum value of .
md5
md5(expr) - Returns an MD5 128-bit checksum as a hex string of .
Examples:
mean
mean(expr) - Returns the mean calculated from values of a group.
min
min(expr) - Returns the minimum value of .
minute
minute(timestamp) - Returns the minute component of the string/timestamp.
Examples:
Since: 1.5.0
mod
expr1 mod expr2 - Returns the remainder after /.
Examples:
monotonically_increasing_id
monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
month
month(date) - Returns the month component of the date/timestamp.
Examples:
Since: 1.5.0
months_between
months_between(timestamp1, timestamp2) - Returns number of months between and .
Examples:
Since: 1.5.0
named_struct
named_struct(name1, val1, name2, val2, ...) - Creates a struct with the given field names and values.
Examples:
nanvl
nanvl(expr1, expr2) - Returns if it's not NaN, or otherwise.
Examples:
negative
negative(expr) - Returns the negated value of .
Examples:
next_day
next_day(start_date, day_of_week) - Returns the first date which is later than and named as indicated.
Examples:
Since: 1.5.0
not
not expr - Logical not.
now
now() - Returns the current timestamp at the start of query evaluation.
Since: 1.5.0
ntile
ntile(n) - Divides the rows for each window partition into buckets ranging from 1 to at most .
nullif
nullif(expr1, expr2) - Returns null if expr1 equals to expr2, or expr1 otherwise.
Examples:
nvl
nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.
Examples:
nvl2
nvl2(expr1, expr2, expr3) - Returns expr2 if expr1 is not null, or expr3 otherwise.
Examples:
octet_length
octet_length(expr) - Returns the byte length of string data or number of bytes of binary data.
Examples:
or
expr1 or expr2 - Logical OR.
parse_url
parse_url(url, partToExtract[, key]) - Extracts a part from a URL.
Examples:
percent_rank
percent_rank() - Computes the percentage ranking of a value in a group of values.
percentile
percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be positive integral
percentile(col, array(percentage1 [, percentage2]...) [, frequency]) - Returns the exact percentile value array of numeric column at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0. The value of frequency should be positive integral
percentile_approx
percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation. When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.
Examples:
pi
pi() - Returns pi.
Examples:
pmod
pmod(expr1, expr2) - Returns the positive value of mod .
Examples:
posexplode
posexplode(expr) - Separates the elements of array into multiple rows with positions, or the elements of map into multiple rows and columns with positions.
Examples:
posexplode_outer
posexplode_outer(expr) - Separates the elements of array into multiple rows with positions, or the elements of map into multiple rows and columns with positions.
Examples:
position
position(substr, str[, pos]) - Returns the position of the first occurrence of in after position . The given and return value are 1-based.
Examples:
positive
positive(expr) - Returns the value of .
pow
pow(expr1, expr2) - Raises to the power of .
Examples:
power
power(expr1, expr2) - Raises to the power of .
Examples:
printf
printf(strfmt, obj, ...) - Returns a formatted string from printf-style format strings.
Examples:
quarter
quarter(date) - Returns the quarter of the year for date, in the range 1 to 4.
Examples:
Since: 1.5.0
radians
radians(expr) - Converts degrees to radians.
Examples:
rand
rand([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).
Examples:
randn
randn([seed]) - Returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution.
Examples:
rank
rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row in the ordering of the partition. The values will produce gaps in the sequence.
reflect
reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Examples:
regexp_extract
regexp_extract(str, regexp[, idx]) - Extracts a group that matches regexp.
Examples:
regexp_replace
regexp_replace(str, regexp, rep) - Replaces all substrings of str that match regexp with rep.
Examples:
repeat
repeat(str, n) - Returns the string which repeats the given string value n times.
Examples:
replace
replace(str, search[, replace]) - Replaces all occurrences of search with replace.
Arguments:
- str - a string expression
- search - a string expression. If search is not found in str, str is returned unchanged.
- replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that is removed from str.
Examples:
reverse
reverse(str) - Returns the reversed given string.
Examples:
right
right(str, len) - Returns the rightmost len (len can be string type) characters from the string str. If len is less than or equal to 0, the result is an empty string.
Examples:
rint
rint(expr) - Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
Examples:
rlike
str rlike regexp - Returns true if matches , or false otherwise.
Arguments:
- str - a string expression
regexp - a string expression. The pattern string should be a Java regular expression.
Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser. For example, to match "\abc", a regular expression for can be "^\abc$".
There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the that can match "\abc" is "^\abc$".
Examples:
Note:
Use LIKE to match with simple string pattern.
rollup
round
round(expr, d) - Returns rounded to decimal places using HALF_UP rounding mode.
Examples:
row_number
row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.
rpad
rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters.
Examples:
rtrim
rtrim(str) - Removes the trailing space characters from .
rtrim(trimStr, str) - Removes the trailing string which contains the characters from the trim string from the
Arguments:
- str - a string expression
- trimStr - the trim string characters to trim, the default value is a single space
Examples:
second
second(timestamp) - Returns the second component of the string/timestamp.
Examples:
Since: 1.5.0
sentences
sentences(str[, lang, country]) - Splits into an array of array of words.
Examples:
sha
sha(expr) - Returns a sha1 hash value as a hex string of the .
Examples:
sha1
sha1(expr) - Returns a sha1 hash value as a hex string of the .
Examples:
sha2
sha2(expr, bitLength) - Returns a checksum of SHA-2 family as a hex string of . SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256.
Examples:
shiftleft
shiftleft(base, expr) - Bitwise left shift.
Examples:
shiftright
shiftright(base, expr) - Bitwise (signed) right shift.
Examples:
shiftrightunsigned
shiftrightunsigned(base, expr) - Bitwise unsigned right shift.
Examples:
sign
sign(expr) - Returns -1.0, 0.0 or 1.0 as is negative, 0 or positive.
Examples:
signum
signum(expr) - Returns -1.0, 0.0 or 1.0 as is negative, 0 or positive.
Examples:
sin
sin(expr) - Returns the sine of .
Examples:
sinh
sinh(expr) - Returns the hyperbolic sine of .
Examples:
size
size(expr) - Returns the size of an array or a map. Returns -1 if null.
Examples:
skewness
skewness(expr) - Returns the skewness value calculated from values of a group.
smallint
smallint(expr) - Casts the value to the target data type .
sort_array
sort_array(array[, ascendingOrder]) - Sorts the input array in ascending or descending order according to the natural ordering of the array elements.
Examples:
soundex
soundex(str) - Returns Soundex code of the string.
Examples:
space
space(n) - Returns a string consisting of spaces.
Examples:
spark_partition_id
spark_partition_id() - Returns the current partition id.
split
split(str, regex) - Splits around occurrences that match .
Examples:
sqrt
sqrt(expr) - Returns the square root of .
Examples:
stack
stack(n, expr1, ..., exprk) - Separates expr1, ..., exprk into n rows.
Examples:
std
std(expr) - Returns the sample standard deviation calculated from values of a group.
stddev
stddev(expr) - Returns the sample standard deviation calculated from values of a group.
stddev_pop
stddev_pop(expr) - Returns the population standard deviation calculated from values of a group.
stddev_samp
stddev_samp(expr) - Returns the sample standard deviation calculated from values of a group.
str_to_map
str_to_map(text[, pairDelim[, keyValueDelim]]) - Creates a map after splitting the text into key/value pairs using delimiters. Default delimiters are ',' for pairDelim and ':' for keyValueDelim.
Examples:
string
string(expr) - Casts the value to the target data type .
struct
struct(col1, col2, col3, ...) - Creates a struct with the given field values.
substr
substr(str, pos[, len]) - Returns the substring of that starts at and is of length , or the slice of byte array that starts at and is of length .
Examples:
substring
substring(str, pos[, len]) - Returns the substring of that starts at and is of length , or the slice of byte array that starts at and is of length .
Examples:
substring_index
substring_index(str, delim, count) - Returns the substring from str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. The function substring_index performs a case-sensitive match when searching for delim.
Examples:
sum
sum(expr) - Returns the sum calculated from values of a group.
tan
tan(expr) - Returns the tangent of .
Examples:
tanh
tanh(expr) - Returns the hyperbolic tangent of .
Examples:
timestamp
timestamp(expr) - Casts the value to the target data type .
tinyint
tinyint(expr) - Casts the value to the target data type .
to_date
to_date(date_str[, fmt]) - Parses the date_str expression with the fmt expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the fmt is omitted.
Examples:
Since: 1.5.0
to_json
to_json(expr[, options]) - Returns a json string with a given struct value
Examples:
Since: 2.2.0
to_timestamp
to_timestamp(timestamp[, fmt]) - Parses the timestamp expression with the fmt expression to a timestamp. Returns null with invalid input. By default, it follows casting rules to a timestamp if the fmt is omitted.
Examples:
Since: 2.2.0
to_unix_timestamp
to_unix_timestamp(expr[, pattern]) - Returns the UNIX timestamp of the given time.
Examples:
Since: 1.6.0
to_utc_timestamp
to_utc_timestamp(timestamp, timezone) - Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'.
Examples:
Since: 1.5.0
translate
translate(input, from, to) - Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string.
Examples:
trim
trim(str) - Removes the leading and trailing space characters from str.
trim(BOTH trimStr FROM str) - Removes the leading and trailing trimStr characters from str.
trim(LEADING trimStr FROM str) - Removes the leading trimStr characters from str.
trim(TRAILING trimStr FROM str) - Removes the trailing trimStr characters from str.
Arguments:
- str - a string expression
- trimStr - the trim string characters to trim, the default value is a single space
- BOTH, FROM - these are keywords to specify trimming string characters from both ends of the string
- LEADING, FROM - these are keywords to specify trimming string characters from the left end of the string
- TRAILING, FROM - these are keywords to specify trimming string characters from the right end of the string
Examples:
trunc
trunc(date, fmt) - Returns date with the time portion of the day truncated to the unit specified by the format model fmt. fmt should be one of ["year", "yyyy", "yy", "mon", "month", "mm"].
Examples:
Since: 1.5.0
ucase
ucase(str) - Returns with all characters changed to uppercase.
Examples:
unbase64
unbase64(str) - Converts the argument from a base 64 string to a binary.
Examples:
unhex
unhex(expr) - Converts hexadecimal to binary.
Examples:
unix_timestamp
unix_timestamp([expr[, pattern]]) - Returns the UNIX timestamp of current or specified time.
Examples:
Since: 1.5.0
upper
upper(str) - Returns with all characters changed to uppercase.
Examples:
uuid
uuid() - Returns a universally unique identifier (UUID) string. The value is returned as a canonical UUID 36-character string.
Examples:
var_pop
var_pop(expr) - Returns the population variance calculated from values of a group.
var_samp
var_samp(expr) - Returns the sample variance calculated from values of a group.
variance
variance(expr) - Returns the sample variance calculated from values of a group.
weekofyear
weekofyear(date) - Returns the week of the year of the given date. A week is considered to start on a Monday and week 1 is the first week with >3 days.
Examples:
Since: 1.5.0
when
CASE WHEN expr1 THEN expr2 [WHEN expr3 THEN expr4]* [ELSE expr5] END - When expr1 = true, returns expr2; else when expr3 = true, returns expr4; else returns expr5.
Arguments:
- expr1, expr3 - the branch condition expressions should all be boolean type.
- expr2, expr4, expr5 - the branch value expressions and else value expression should all be same type or coercible to a common type.
Examples:
window
xpath
xpath(xml, xpath) - Returns a string array of values within the nodes of xml that match the XPath expression.
Examples:
xpath_boolean
xpath_boolean(xml, xpath) - Returns true if the XPath expression evaluates to true, or if a matching node is found.
Examples:
xpath_double
xpath_double(xml, xpath) - Returns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
Examples:
xpath_float
xpath_float(xml, xpath) - Returns a float value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
Examples:
xpath_int
xpath_int(xml, xpath) - Returns an integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
Examples:
xpath_long
xpath_long(xml, xpath) - Returns a long integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
Examples:
xpath_number
xpath_number(xml, xpath) - Returns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric.
Examples:
xpath_short
xpath_short(xml, xpath) - Returns a short integer value, or the value zero if no match is found, or a match is found but the value is non-numeric.
Examples:
xpath_string
xpath_string(xml, xpath) - Returns the text contents of the first xml node that matches the XPath expression.
Examples:
year
year(date) - Returns the year component of the date/timestamp.
Examples:
Since: 1.5.0
|
expr1 | expr2 - Returns the result of bitwise OR of and .
Examples:
~
~ expr - Returns the result of bitwise NOT of .
Examples:
Like the SQL "case when" statement and the switch statement from popular programming languages, the Spark SQL DataFrame also supports similar syntax using "when otherwise", or we can also use the "case when" statement. So let's see an example of how to check for multiple conditions and replicate the SQL CASE statement.
First, let's do the imports that are needed and create a Spark session and a DataFrame.
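The original snippet is not reproduced here; a minimal PySpark sketch, where the column names and sample rows are illustrative assumptions:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, expr, col

spark = SparkSession.builder.appName("when_otherwise_example").getOrCreate()

# Illustrative data for the examples below
data = [("James", "M"), ("Anna", "F"), ("Robert", "")]
df = spark.createDataFrame(data, ["name", "gender"])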
1. Using "when otherwise" on Spark DataFrame
when is a Spark function, so to use it we first have to import it from the SQL functions package. The snippet below replaces the value of gender with a derived value; when the value does not qualify for any condition, we assign "Unknown" as the value.
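A sketch of that derivation, using the illustrative df created earlier:
df2 = df.withColumn("new_gender",
    when(col("gender") == "M", "Male")
    .when(col("gender") == "F", "Female")
    .otherwise("Unknown"))   # value assigned when no condition matches
df2.show()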
when can also be used in a Spark SQL select statement.
2. Using "case when" on Spark DataFrame
Similar to SQL syntax, we could use "case when" with the expr() expression function, as shown below.
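A sketch of the same derivation written as a "case when" expression string passed to expr():
df3 = df.withColumn("new_gender",
    expr("CASE WHEN gender = 'M' THEN 'Male' "
         "WHEN gender = 'F' THEN 'Female' "
         "ELSE 'Unknown' END"))
df3.show()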
We can also use "case when" within a SQL select statement.
3. Using && and || operator
We can also use the and (&&) and or (||) operators within the when function (in PySpark these are written as & and |). To explain this, I will use a small set of data to keep it simple; a sketch follows below.
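A minimal sketch of combining conditions inside when(); the column and label names are assumptions for this example:
df4 = df.withColumn("known_gender",
    when((col("gender") == "M") | (col("gender") == "F"), "yes")
    .otherwise("no"))
df4.show()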
Output:
Conclusion:
In this article, we have learned how to use the Spark "when otherwise" function and the "case when" expression on a DataFrame, and we've also learned how to use these functions with the && and || logical operators. I hope you like this article.
Happy Learning !!
Chapter 4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
In the previous chapter, we explained the evolution of and justification for structure in Spark. In particular, we discussed how the Spark SQL engine provides a unified foundation for the high-level DataFrame and Dataset APIs. Now, we’ll continue our discussion of the DataFrame and explore its interoperability with Spark SQL.
This chapter and the next also explore how Spark SQL interfaces with some of the external components shown in Figure 4-1.
In particular, Spark SQL:
Provides the engine upon which the high-level Structured APIs we explored in Chapter 3 are built.
Can read and write data in a variety of structured formats (e.g., JSON, Hive tables, Parquet, Avro, ORC, CSV).
Lets you query data using JDBC/ODBC connectors from external business intelligence (BI) data sources such as Tableau, Power BI, Talend, or from RDBMSs such as MySQL and PostgreSQL.
Provides a programmatic interface to interact with structured data stored as tables or views in a database from a Spark application
Offers an interactive shell to issue SQL queries on your structured data.
Supports ANSI SQL:2003-compliant commands and HiveQL.

Figure 4-1. Spark SQL connectors and data sources
Let’s begin with how you can use Spark SQL in a Spark application.
The SparkSession, introduced in Spark 2.0, provides a unified entry point for programming Spark with the Structured APIs. You can use a SparkSession to access Spark functionality: just import the class and create an instance in your code.
To issue any SQL query, use the sql() method on the SparkSession instance, spark, such as spark.sql(sqlQuery). All queries executed in this manner return a DataFrame on which you may perform further Spark operations if you desire—the kind we explored in Chapter 3 and the ones you will learn about in this chapter and the next.
Basic Query Examples
In this section we’ll walk through a few examples of queries on the Airline On-Time Performance and Causes of Flight Delays data set, which contains data on US flights including date, delay, distance, origin, and destination. It’s available as a CSV file with over a million records. Using a schema, we’ll read the data into a DataFrame and register the DataFrame as a temporary view (more on temporary views shortly) so we can query it with SQL.
Query examples are provided in code snippets, and Python and Scala notebooks containing all of the code presented here are available in the book’s GitHub repo. These examples will offer you a taste of how to use SQL in your Spark applications via the programmatic interface. Similar to the DataFrame API in its declarative flavor, this interface allows you to query structured data in your Spark applications.
Normally, in a standalone Spark application, you will create a SparkSession instance manually, as shown in the following example. However, in a Spark shell (or Databricks notebook), the SparkSession is created for you and accessible via the appropriately named variable spark.
Let’s get started by reading the data set into a temporary view:
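The code listing that accompanied this step is not included here; the following PySpark sketch (the CSV path matches the databricks-datasets location referenced later in this chapter) shows the general shape of it:
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName("SparkSQLExampleApp")
    .getOrCreate())

# Path to the US flight delays data set
csv_file = "/databricks-datasets/learning-spark-v2/flights/departuredelays.csv"

# Read the file into a DataFrame, inferring the schema for brevity
df = (spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(csv_file))

# Register the DataFrame as a temporary view so we can query it with SQL
df.createOrReplaceTempView("us_delay_flights_tbl")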
Note
If you want to specify a schema, you can use a DDL-formatted string. For example:
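A sketch of such a DDL-formatted schema string for this data set, with column names taken from the description that follows:
schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"
df = (spark.read.format("csv")
    .schema(schema)
    .option("header", "true")
    .load(csv_file))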
Now that we have a temporary view, we can issue SQL queries using Spark SQL. These queries are no different from those you might issue against a SQL table in, say, a MySQL or PostgreSQL database. The point here is to show that Spark SQL offers an ANSI:2003–compliant SQL interface, and to demonstrate the interoperability between SQL and DataFrames.
The US flight delays data set has five columns:
The date column contains a string like 02190925. When converted, this maps to 02-19 09:25 am.
The delay column gives the delay in minutes between the scheduled and actual departure times. Early departures show negative numbers.
The distance column gives the distance in miles from the origin airport to the destination airport.
The origin column contains the origin IATA airport code.
The destination column contains the destination IATA airport code.
With that in mind, let’s try some example queries against this data set.
First, we’ll find all flights whose distance is greater than 1,000 miles:
spark.sql("""SELECT distance, origin, destination
FROM us_delay_flights_tbl
WHERE distance > 1000
ORDER BY distance DESC""").show(10)

+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
+--------+------+-----------+
only showing top 10 rows

As the results show, all of the longest flights were between Honolulu (HNL) and New York (JFK). Next, we'll find all flights between San Francisco (SFO) and Chicago (ORD) with at least a two-hour delay:
spark.sql("""SELECT date, delay, origin, destination
FROM us_delay_flights_tbl
WHERE delay > 120 AND ORIGIN = 'SFO' AND DESTINATION = 'ORD'
ORDER by delay DESC""").show(10)

+--------+-----+------+-----------+
|date    |delay|origin|destination|
+--------+-----+------+-----------+
|02190925|1638 |SFO   |ORD        |
|01031755|396  |SFO   |ORD        |
|01022330|326  |SFO   |ORD        |
|01051205|320  |SFO   |ORD        |
|01190925|297  |SFO   |ORD        |
|02171115|296  |SFO   |ORD        |
|01071040|279  |SFO   |ORD        |
|01051550|274  |SFO   |ORD        |
|03120730|266  |SFO   |ORD        |
|01261104|258  |SFO   |ORD        |
+--------+-----+------+-----------+
only showing top 10 rows

It seems there were many significantly delayed flights between these two cities, on different dates. (As an exercise, convert the date column into a readable format and find the days or months when these delays were most common. Were the delays related to winter months or holidays?)
Let's try a more complicated query where we use the CASE clause in SQL. In the following example, we want to label all US flights, regardless of origin and destination, with an indication of the delays they experienced: Very Long Delays (> 6 hours), Long Delays (2–6 hours), etc. We'll add these human-readable labels in a new column called Flight_Delays:
spark.sql("""SELECT delay, origin, destination,
CASE
WHEN delay > 360 THEN 'Very Long Delays'
WHEN delay >= 120 AND delay <= 360 THEN 'Long Delays'
WHEN delay >= 60 AND delay < 120 THEN 'Short Delays'
WHEN delay > 0 AND delay < 60 THEN 'Tolerable Delays'
WHEN delay = 0 THEN 'No Delays'
ELSE 'Early'
END AS Flight_Delays
FROM us_delay_flights_tbl
ORDER BY origin, delay DESC""").show(10)

+-----+------+-----------+-------------+
|delay|origin|destination|Flight_Delays|
+-----+------+-----------+-------------+
|333  |ABE   |ATL        |Long Delays  |
|305  |ABE   |ATL        |Long Delays  |
|275  |ABE   |ATL        |Long Delays  |
|257  |ABE   |ATL        |Long Delays  |
|247  |ABE   |DTW        |Long Delays  |
|247  |ABE   |ATL        |Long Delays  |
|219  |ABE   |ORD        |Long Delays  |
|211  |ABE   |ATL        |Long Delays  |
|197  |ABE   |DTW        |Long Delays  |
|192  |ABE   |ORD        |Long Delays  |
+-----+------+-----------+-------------+
only showing top 10 rows

As with the DataFrame and Dataset APIs, with the spark.sql interface you can conduct common data analysis operations like those we explored in the previous chapter. The computations undergo an identical journey in the Spark SQL engine (see "The Catalyst Optimizer" in Chapter 3 for details), giving you the same results.
All three of the preceding SQL queries can be expressed with an equivalent DataFrame API query. For example, the first query can be expressed in the Python DataFrame API as:
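The original listing is not reproduced here; a sketch of the equivalent DataFrame API query, reusing the df read earlier:
from pyspark.sql.functions import col, desc

(df.select("distance", "origin", "destination")
    .where(col("distance") > 1000)
    .orderBy(desc("distance"))).show(10)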
This produces the same results as the SQL query:
+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
|4330    |HNL   |JFK        |
+--------+------+-----------+
only showing top 10 rows

As an exercise, try converting the other two SQL queries to use the DataFrame API.
As these examples show, using the Spark SQL interface to query data is similar to writing a regular SQL query to a relational database table. Although the queries are in SQL, you can feel the similarity in readability and semantics to DataFrame API operations, which you encountered in Chapter 3 and will explore further in the next chapter.
To enable you to query structured data as shown in the preceding examples, Spark manages all the complexities of creating and managing views and tables, both in memory and on disk. That leads us to our next topic: how tables and views are created and managed.
Tables hold data. Associated with each table in Spark is its relevant metadata, which is information about the table and its data: the schema, description, table name, database name, column names, partitions, physical location where the actual data resides, etc. All of this is stored in a central metastore.
Instead of having a separate metastore for Spark tables, Spark by default uses the Apache Hive metastore, located at /user/hive/warehouse, to persist all the metadata about your tables. However, you may change the default location by setting the Spark config variable spark.sql.warehouse.dir to another location, which can be set to a local or external distributed storage.
Managed Versus Unmanaged Tables
Spark allows you to create two types of tables: managed and unmanaged. For a managed table, Spark manages both the metadata and the data in the file store. This could be a local filesystem, HDFS, or an object store such as Amazon S3 or Azure Blob. For an unmanaged table, Spark only manages the metadata, while you manage the data yourself in an external data source such as Cassandra.
With a managed table, because Spark manages everything, a SQL command such as DROP TABLE table_name deletes both the metadata and the data. With an unmanaged table, the same command will delete only the metadata, not the actual data. We will look at some examples of how to create managed and unmanaged tables in the next section.
Creating SQL Databases and Tables
Tables reside within a database. By default, Spark creates tables under the default database. To create your own database, you can issue a SQL command from your Spark application or notebook. Using the US flight delays data set, let’s create both a managed and an unmanaged table. To begin, we’ll create a database called learn_spark_db and tell Spark we want to use that database:
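A minimal sketch (the database name learn_spark_db is the one used for the examples in this chapter):

spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")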
From this point, any commands we issue in our application to create tables will result in the tables being created in this database and residing under the database name learn_spark_db.
Creating a managed table
To create a managed table within the database learn_spark_db, you can issue a SQL query like the following:
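For example (the table name and schema here mirror the flight delays data set; adjust them to your data):

spark.sql("""CREATE TABLE managed_us_delay_flights_tbl
  (date STRING, delay INT, distance INT, origin STRING, destination STRING)""")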
You can do the same thing using the DataFrame API like this:
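Roughly along these lines, assuming the departure delays CSV file path used in this chapter’s examples:

# Path to the US flight delays CSV file (illustrative)
csv_file = "/databricks-datasets/learning-spark-v2/flights/departuredelays.csv"
# Schema as defined in the preceding SQL example
schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"
flights_df = spark.read.csv(csv_file, schema=schema)
flights_df.write.saveAsTable("managed_us_delay_flights_tbl")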
Both of these statements will create the managed table managed_us_delay_flights_tbl in the learn_spark_db database.
Creating an unmanaged table
By contrast, you can create unmanaged tables from your own data sources—say, Parquet, CSV, or JSON files stored in a file store accessible to your Spark application.
To create an unmanaged table from a data source such as a CSV file, in SQL use:
spark.sql("""CREATE TABLE us_delay_flights_tbl(date STRING, delay INT, distance INT, origin STRING, destination STRING) USING csv OPTIONS (PATH '/databricks-datasets/learning-spark-v2/flights/departuredelays.csv')""")And within the DataFrame API use:
(flights_df
  .write
  .option("path", "/tmp/data/us_flights_delay")
  .saveAsTable("us_delay_flights_tbl"))

Note
To enable you to explore these examples, we have created Python and Scala example notebooks that you can find in the book’s GitHub repo.
Creating Views
In addition to creating tables, Spark can create views on top of existing tables. Views can be global (visible across all SparkSessions on a given cluster) or session-scoped (visible only to a single SparkSession), and they are temporary: they disappear after your Spark application terminates.
Creating views has a similar syntax to creating tables within a database. Once you create a view, you can query it as you would a table. The difference between a view and a table is that views don’t actually hold the data; tables persist after your Spark application terminates, but views disappear.
You can create a view from an existing table using SQL. For example, if you wish to work on only the subset of the US flight delays data set with origin airports of New York (JFK) and San Francisco (SFO), the following queries will create global temporary and temporary views consisting of just that slice of the table:
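A sketch of what those statements might look like (the view names here are illustrative):

spark.sql("""CREATE OR REPLACE GLOBAL TEMP VIEW us_origin_airport_SFO_global_tmp_view AS
  SELECT date, delay, origin, destination FROM us_delay_flights_tbl
  WHERE origin = 'SFO'""")

spark.sql("""CREATE OR REPLACE TEMP VIEW us_origin_airport_JFK_tmp_view AS
  SELECT date, delay, origin, destination FROM us_delay_flights_tbl
  WHERE origin = 'JFK'""")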
You can accomplish the same thing with the DataFrame API as follows:
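For example, using the same illustrative view names:

df_sfo = spark.sql("SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE origin = 'SFO'")
df_jfk = spark.sql("SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE origin = 'JFK'")

# Create a global temporary and a session-scoped temporary view
df_sfo.createOrReplaceGlobalTempView("us_origin_airport_SFO_global_tmp_view")
df_jfk.createOrReplaceTempView("us_origin_airport_JFK_tmp_view")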
Once you’ve created these views, you can issue queries against them just as you would against a table. Keep in mind that when accessing a global temporary view you must use the prefix global_temp.<view_name>, because Spark creates global temporary views in a global temporary database called global_temp. For example:
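Something like:

spark.sql("SELECT * FROM global_temp.us_origin_airport_SFO_global_tmp_view").show()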
By contrast, you can access the normal temporary view without the global_temp prefix:
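For instance:

spark.sql("SELECT * FROM us_origin_airport_JFK_tmp_view").show()
# Or, equivalently, with the DataFrameReader
spark.read.table("us_origin_airport_JFK_tmp_view").show()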
You can also drop a view just like you would a table:
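For example, with SQL or the Catalog API (view names as in the sketch above):

spark.sql("DROP VIEW IF EXISTS us_origin_airport_JFK_tmp_view")
# Or with the Catalog API
spark.catalog.dropGlobalTempView("us_origin_airport_SFO_global_tmp_view")
spark.catalog.dropTempView("us_origin_airport_JFK_tmp_view")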
Temporary views versus global temporary views
The difference between temporary and global temporary views is subtle, and it can be a source of mild confusion among developers new to Spark. A temporary view is tied to a single SparkSession within a Spark application. In contrast, a global temporary view is visible across multiple SparkSessions within a Spark application. Yes, you can create multiple SparkSessions within a single Spark application—this can be handy, for example, in cases where you want to access (and combine) data from two different SparkSessions that don’t share the same Hive metastore configurations.
Caching SQL Tables
Although we will discuss table caching strategies in the next chapter, it’s worth mentioning here that, like DataFrames, you can cache and uncache SQL tables and views. In Spark 3.0, in addition to other options, you can specify a table as LAZY, meaning that it should only be cached when it is first used instead of immediately:
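A sketch of the SQL (the table name is illustrative):

spark.sql("CACHE LAZY TABLE us_delay_flights_tbl")   # cached only when first used
spark.sql("UNCACHE TABLE us_delay_flights_tbl")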
Reading Tables into DataFrames
Often, data engineers build data pipelines as part of their regular data ingestion and ETL processes. They populate Spark SQL databases and tables with cleansed data for consumption by applications downstream.
Let’s assume you have an existing database, learn_spark_db, and table, us_delay_flights_tbl, ready for use. Instead of reading from an external JSON file, you can simply use SQL to query the table and assign the returned result to a DataFrame:
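Something along these lines:

us_flights_df = spark.sql("SELECT * FROM us_delay_flights_tbl")
# Or, equivalently
us_flights_df2 = spark.table("us_delay_flights_tbl")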
Now you have a cleansed DataFrame read from an existing Spark SQL table. You can also read data in other formats using Spark’s built-in data sources, giving you the flexibility to interact with various common file formats.
As shown in Figure 4-1, Spark SQL provides an interface to a variety of data sources. It also provides a set of common methods for reading and writing data to and from these data sources using the Data Sources API.
In this section we will cover some of the built-in data sources, available file formats, and ways to load and write data, along with specific options pertaining to these data sources. But first, let’s take a closer look at two high-level Data Source API constructs that dictate the manner in which you interact with different data sources: DataFrameReader and DataFrameWriter.
DataFrameReader
DataFrameReader is the core construct for reading data from a data source into a DataFrame. It has a defined format and a recommended pattern for usage:
DataFrameReader.format(args).option("key", "value").schema(args).load()

This pattern of stringing methods together is common in Spark, and easy to read. We saw it in Chapter 3 when exploring common data analysis patterns.
Note that you can only access a DataFrameReader through a SparkSession instance. That is, you cannot create an instance of DataFrameReader yourself. To get an instance handle to it, use:
SparkSession.read
// or
SparkSession.readStream

While read returns a handle to DataFrameReader to read into a DataFrame from a static data source, readStream returns an instance to read from a streaming source. (We will cover Structured Streaming later in the book.)
Arguments to each of the public methods of DataFrameReader take different values. Table 4-1 enumerates these, with a subset of the supported arguments.
Method | Arguments | Description |
---|---|---|
format() | "parquet", "csv", "txt", "json", "jdbc", "orc", "avro", etc. | If you don’t specify this method, then the default is Parquet or whatever is set in spark.sql.sources.default. |
option() | ("mode", {PERMISSIVE, FAILFAST, or DROPMALFORMED}), ("inferSchema", {true or false}), ("path", "path_file_data_source") | A series of key/value pairs and options. The Spark documentation shows some examples and explains the different modes and their actions. The default mode is PERMISSIVE. The "inferSchema" and "mode" options are specific to the JSON and CSV file formats. |
schema() | DDL String or StructType, e.g., 'A INT, B STRING' or StructType(...) | For JSON or CSV format, you can specify to infer the schema in the option() method. Generally, providing a schema for any format makes loading faster and ensures your data conforms to the expected schema. |
load() | "/path/to/data/source" | The path to the data source. This can be empty if specified in option("path", ...). |
While we won’t comprehensively enumerate all the different combinations of arguments and options, the documentation for Python, Scala, R, and Java offers suggestions and guidance. It’s worthwhile to show a couple of examples, though:
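A few sketches in PySpark (the dataset paths are illustrative and assume the example data layout used throughout this chapter):

# Parquet
file = "/databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet"
df = spark.read.format("parquet").load(file)

# Parquet again; you can omit format("parquet"), since it's the default
df2 = spark.read.load(file)

# CSV
df3 = (spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .load("/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*"))

# JSON
df4 = (spark.read.format("json")
  .load("/databricks-datasets/learning-spark-v2/flights/summary-data/json/*"))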
Note
In general, no schema is needed when reading from a static Parquet data source—the Parquet metadata usually contains the schema, so it’s inferred. However, for streaming data sources you will have to provide a schema. (We will cover reading from streaming data sources in Chapter 8.)
Parquet is the default and preferred data source for Spark because it’s efficient, uses columnar storage, and employs a fast compression algorithm. You will see additional benefits later (such as columnar pushdown), when we cover the Catalyst optimizer in greater depth.
DataFrameWriter
DataFrameWriter does the reverse of its counterpart: it saves or writes data to a specified built-in data source. Unlike with DataFrameReader, you access its instance not from a SparkSession but from the DataFrame you wish to save. It has a few recommended usage patterns:
DataFrameWriter.format(args)
  .option(args)
  .bucketBy(args)
  .partitionBy(args)
  .save(path)

DataFrameWriter.format(args)
  .option(args)
  .sortBy(args)
  .saveAsTable(table)

To get an instance handle, use:
DataFrame.write
// or
DataFrame.writeStream

Arguments to each of the DataFrameWriter methods also take different values. We list these in Table 4-2, with a subset of the supported arguments.
Method | Arguments | Description |
---|---|---|
format() | "parquet", "csv", "txt", "json", "jdbc", "orc", "avro", etc. | If you don’t specify this method, then the default is Parquet or whatever is set in spark.sql.sources.default. |
option() | ("mode", {append, overwrite, ignore, error, or errorifexists}), ("mode", {SaveMode.Overwrite, SaveMode.Append, SaveMode.Ignore, or SaveMode.ErrorIfExists}), ("path", "path_to_write_to") | A series of key/value pairs and options. The Spark documentation shows some examples. This is an overloaded method. The default mode options are errorifexists and SaveMode.ErrorIfExists; they throw an exception at runtime if the data already exists. |
bucketBy() | (numBuckets, col, col..., coln) | The number of buckets and names of columns to bucket by. Uses Hive’s bucketing scheme on a filesystem. |
save() | "/path/to/data/source" | The path to save to. This can be empty if specified in option("path", ...). |
saveAsTable() | "table_name" | The table to save to. |
Here’s a short example snippet to illustrate the use of methods and arguments:
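A minimal sketch, assuming df is a DataFrame you’ve already prepared; the output location is illustrative:

location = "/tmp/data/json/df_json"   # illustrative output path
(df.write.format("json")
  .mode("overwrite")
  .save(location))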
Parquet
We’ll start our exploration of data sources with Parquet, because it’s the default data source in Spark. Supported and widely used by many big data processing frameworks and platforms, Parquet is an open source columnar file format that offers many I/O optimizations (such as compression, which saves storage space and allows for quick access to data columns).
Because of its efficiency and these optimizations, we recommend that after you have transformed and cleansed your data, you save your DataFrames in the Parquet format for downstream consumption. (Parquet is also the default table open format for Delta Lake, which we will cover in Chapter 9.)
Reading Parquet files into a DataFrame
Parquet files are stored in a directory structure that contains the data files, metadata, a number of compressed files, and some status files. Metadata in the footer contains the version of the file format, the schema, and column data such as the path, etc.
For example, a directory in a Parquet file might contain a set of files like this:
_SUCCESS
_committed_1799640464332036264
_started_1799640464332036264
part-00000-tid-1799640464332036264-91273258-d7ef-4dc7-<...>-c000.snappy.parquet

There may be a number of part-XXXX compressed files in a directory (the names shown here have been shortened to fit on the page).
To read Parquet files into a DataFrame, you simply specify the format and path:
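For example (the path is illustrative):

file = "/databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet/"
df = spark.read.format("parquet").load(file)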
Unless you are reading from a streaming data source there’s no need to supply the schema, because Parquet saves it as part of its metadata.
Reading Parquet files into a Spark SQL table
As well as reading Parquet files into a Spark DataFrame, you can also create a Spark SQL unmanaged table or view directly using SQL:
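A sketch of the SQL (the path is illustrative):

spark.sql("""CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
  USING parquet
  OPTIONS (
    path "/databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet/"
  )""")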
Once you’ve created the table or view, you can read data into a DataFrame using SQL, as we saw in some earlier examples:
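For example:

spark.sql("SELECT * FROM us_delay_flights_tbl").show()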
Both of these operations return the same results:
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows

Writing DataFrames to Parquet files
Writing or saving a DataFrame as a table or file is a common operation in Spark. To write a DataFrame you simply use the DataFrameWriter methods and arguments outlined earlier in this chapter, supplying the location to save the Parquet files to. For example:
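A sketch (the output path and compression choice are illustrative):

(df.write.format("parquet")
  .mode("overwrite")
  .option("compression", "snappy")
  .save("/tmp/data/parquet/df_parquet"))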
Note
Recall that Parquet is the default file format. If you don’t include the format() method, the DataFrame will still be saved as a Parquet file.
This will create a set of compact and compressed Parquet files at the specified path. Since we used snappy as our compression choice here, we’ll have snappy compressed files. For brevity, this example generated only one file; normally, there may be a dozen or so files created:
-rw-r--r--  1 jules  wheel    0 May 19 10:58 _SUCCESS
-rw-r--r--  1 jules  wheel  966 May 19 10:58 part-00000-<...>-c000.snappy.parquet

Writing DataFrames to Spark SQL tables
Writing a DataFrame to a SQL table is as easy as writing to a file—just use saveAsTable() instead of save(). This will create a managed table called us_delay_flights_tbl:
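For example:

(df.write
  .mode("overwrite")
  .saveAsTable("us_delay_flights_tbl"))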
To sum up, Parquet is the preferred and default built-in data source file format in Spark, and it has been adopted by many other frameworks. We recommend that you use this format in your ETL and data ingestion processes.
JSON
JavaScript Object Notation (JSON) is also a popular data format. It came to prominence as an easy-to-read and easy-to-parse format compared to XML. It has two representational formats: single-line mode and multiline mode. Both modes are supported in Spark.
In single-line mode each line denotes a single JSON object, whereas in multiline mode the entire multiline object constitutes a single JSON object. To read JSON in multiline mode, set the multiLine option to true in the option() method.
Reading a JSON file into a DataFrame
You can read a JSON file into a DataFrame the same way you did with Parquet—just specify "json" in the format() method:
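For example (the path is illustrative):

file = "/databricks-datasets/learning-spark-v2/flights/summary-data/json/*"
df = spark.read.format("json").load(file)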
Reading a JSON file into a Spark SQL table
You can also create a SQL table from a JSON file just like you did with Parquet:
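A sketch of the SQL (path illustrative):

spark.sql("""CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
  USING json
  OPTIONS (
    path "/databricks-datasets/learning-spark-v2/flights/summary-data/json/*"
  )""")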
Once the table is created, you can read data into a DataFrame using SQL:
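For example:

spark.sql("SELECT * FROM us_delay_flights_tbl").show()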
Writing DataFrames to JSON files
Saving a DataFrame as a JSON file is simple. Specify the appropriate methods and arguments, and supply the location to save the JSON files to:
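A sketch (output path and compression choice are illustrative):

(df.write.format("json")
  .mode("overwrite")
  .option("compression", "snappy")
  .save("/tmp/data/json/df_json"))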
This creates a directory at the specified path populated with a set of compact JSON files:
-rw-r--r--  1 jules  wheel   0 May 16 14:44 _SUCCESS
-rw-r--r--  1 jules  wheel  71 May 16 14:44 part-00000-<...>-c000.json

JSON data source options
Table 4-3 describes common JSON options for DataFrameReader and DataFrameWriter. For a comprehensive list, we refer you to the documentation.
Property name | Values | Meaning | Scope |
---|---|---|---|
compression | none, uncompressed, bzip2, deflate, gzip, lz4, or snappy | Use this compression codec for writing. Note that read will only detect the compression or codec from the file extension. | Write |
dateFormat | yyyy-MM-dd or DateTimeFormatter | Use this format or any format from Java’s DateTimeFormatter. | Read/write |
multiLine | true, false | Use multiline mode. Default is false (single-line mode). | Read |
allowUnquotedFieldNames | true, false | Allow unquoted JSON field names. Default is false. | Read |
CSV
As widely used as plain text files, this common text file format captures each datum or field delimited by a comma; each line with comma-separated fields represents a record. Even though a comma is the default separator, you may use other delimiters to separate fields in cases where commas are part of your data. Popular spreadsheets can generate CSV files, so it’s a popular format among data and business analysts.
Reading a CSV file into a DataFrame
As with the other built-in data sources, you can use the DataFrameReader methods and arguments to read a CSV file into a DataFrame:
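A sketch (the path and schema follow this chapter’s example data):

schema = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"
df = (spark.read.format("csv")
  .option("header", "true")
  .schema(schema)
  .option("mode", "FAILFAST")     # Exit if any errors
  .option("nullValue", "")        # Replace any null data field with quotes
  .load("/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*"))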
Reading a CSV file into a Spark SQL table
Creating a SQL table from a CSV data source is no different from using Parquet or JSON:
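A sketch of the SQL (path illustrative):

spark.sql("""CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
  USING csv
  OPTIONS (
    path "/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*",
    header "true",
    inferSchema "true",
    mode "FAILFAST"
  )""")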
Once you’ve created the table, you can read data into a DataFrame using SQL as before:
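For example:

spark.sql("SELECT * FROM us_delay_flights_tbl").show(10)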
Writing DataFrames to CSV files
Saving a DataFrame as a CSV file is simple. Specify the appropriate methods and arguments, and supply the location to save the CSV files to:
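A minimal sketch (output path illustrative):

df.write.format("csv").mode("overwrite").save("/tmp/data/csv/df_csv")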
This generates a folder at the specified location, populated with a bunch of compressed and compact files:
-rw-r--r--  1 jules  wheel   0 May 16 12:17 _SUCCESS
-rw-r--r--  1 jules  wheel  36 May 16 12:17 part-00000-251690eb-<...>-c000.csv

CSV data source options
Table 4-4 describes some of the common CSV options for DataFrameReader and DataFrameWriter. Because CSV files can be complex, many options are available; for a comprehensive list we refer you to the documentation.
Property name | Values | Meaning | Scope |
---|---|---|---|
compression | none, bzip2, deflate, gzip, lz4, or snappy | Use this compression codec for writing. | Write |
dateFormat | yyyy-MM-dd or DateTimeFormatter | Use this format or any format from Java’s DateTimeFormatter. | Read/write |
multiLine | true, false | Use multiline mode. Default is false (single-line mode). | Read |
inferSchema | true, false | If true, Spark will determine the column data types. Default is false. | Read |
sep | Any character | Use this character to separate column values in a row. Default delimiter is a comma (,). | Read/write |
escape | Any character | Use this character to escape quotes. Default is \. | Read/write |
header | true, false | Indicates whether the first line is a header denoting each column name. Default is false. | Read/write |
Avro
Introduced in Spark 2.4 as a built-in data source, the Avro format is used, for example, by Apache Kafka for message serializing and deserializing. It offers many benefits, including direct mapping to JSON, speed and efficiency, and bindings available for many programming languages.
Reading an Avro file into a DataFrame
Reading an Avro file into a DataFrame using DataFrameReader is consistent in usage with the other data sources we have discussed in this section:
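For example (path illustrative):

df = (spark.read.format("avro")
  .load("/databricks-datasets/learning-spark-v2/flights/summary-data/avro/*"))
df.show(truncate=False)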
Reading an Avro file into a Spark SQL table
Again, creating SQL tables using an Avro data source is no different from using Parquet, JSON, or CSV:
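A sketch of the SQL (the view name episode_tbl and the path are illustrative):

spark.sql("""CREATE OR REPLACE TEMPORARY VIEW episode_tbl
  USING avro
  OPTIONS (
    path "/databricks-datasets/learning-spark-v2/flights/summary-data/avro/*"
  )""")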
Once you’ve created a table, you can read data into a DataFrame using SQL:
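For example, against the illustrative view created above:

spark.sql("SELECT * FROM episode_tbl").show(truncate=False)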
Writing DataFrames to Avro files
Writing a DataFrame as an Avro file is simple. As usual, specify the appropriate methods and arguments, and supply the location to save the Avro files to:
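A sketch (output path illustrative):

(df.write
  .format("avro")
  .mode("overwrite")
  .save("/tmp/data/avro/df_avro"))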
This generates a folder at the specified location, populated with a bunch of compressed and compact files:
-rw-r--r--  1 jules  wheel    0 May 17 11:54 _SUCCESS
-rw-r--r--  1 jules  wheel  526 May 17 11:54 part-00000-ffdf70f4-<...>-c000.avro

Avro data source options
Table 4-5 describes common options for DataFrameReader and DataFrameWriter. A comprehensive list of options is in the documentation.
Property name | Default value | Meaning | Scope |
---|---|---|---|
avroSchema | None | Optional Avro schema provided by a user in JSON format. The data type and naming of record fields should match the input Avro data or Catalyst data (Spark internal data type), otherwise the read/write action will fail. | Read/write |
recordName | topLevelRecord | Top-level record name in write result, which is required in the Avro spec. | Write |
recordNamespace | "" | Record namespace in write result. | Write |
ignoreExtension | true | If this option is enabled, all files (with and without the .avro extension) are loaded. Otherwise, files without the .avro extension are ignored. | Read |
compression | snappy | Allows you to specify the compression codec to use in writing. Currently supported codecs are uncompressed, snappy, deflate, bzip2, and xz. If this option is not set, the value in spark.sql.avro.compression.codec is taken into account. | Write |
ORC
As an additional optimized columnar file format, Spark 2.x supports a vectorized ORC reader. Two Spark configurations dictate which ORC implementation to use. When spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true, Spark uses the vectorized ORC reader. A vectorized reader reads blocks of rows (often 1,024 per block) instead of one row at a time, streamlining operations and reducing CPU usage for intensive operations like scans, filters, aggregations, and joins.
For Hive ORC SerDe (serialization and deserialization) tables created with the SQL command USING HIVE OPTIONS (fileFormat 'ORC'), the vectorized reader is used when the Spark configuration parameter spark.sql.hive.convertMetastoreOrc is set to true.
Reading an ORC file into a DataFrame
To read in a DataFrame using the ORC vectorized reader, you can just use the normal methods and options:
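For example (path illustrative):

file = "/databricks-datasets/learning-spark-v2/flights/summary-data/orc/*"
df = spark.read.format("orc").load(file)
df.show(10, truncate=False)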
Reading an ORC file into a Spark SQL table
There is no difference from Parquet, JSON, CSV, or Avro when creating a SQL view using an ORC data source:
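A sketch of the SQL (path illustrative):

spark.sql("""CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
  USING orc
  OPTIONS (
    path "/databricks-datasets/learning-spark-v2/flights/summary-data/orc/*"
  )""")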
Once a table is created, you can read data into a DataFrame using SQL as usual:
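For example:

spark.sql("SELECT * FROM us_delay_flights_tbl").show()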
Writing DataFrames to ORC files
Writing back a transformed DataFrame after reading is equally simple using the DataFrameWriter methods:
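A sketch (output path and compression choice are illustrative):

(df.write.format("orc")
  .mode("overwrite")
  .option("compression", "snappy")
  .save("/tmp/data/orc/df_orc"))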
The result will be a folder at the specified location containing some compressed ORC files:
-rw-r--r--  1 jules  wheel    0 May 16 17:23 _SUCCESS
-rw-r--r--  1 jules  wheel  547 May 16 17:23 part-00000-<...>-c000.snappy.orc

Images
In Spark 2.4 the community introduced a new data source, image files, to support deep learning and machine learning frameworks such as TensorFlow and PyTorch. For computer vision–based machine learning applications, loading and processing image data sets is important.
Reading an image file into a DataFrame
As with all of the previous file formats, you can use the DataFrameReader methods and options to read in an image file as shown here:
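A sketch in PySpark (the image directory path is illustrative):

image_dir = "/databricks-datasets/learning-spark-v2/cctvVideos/train_images/"  # illustrative path
images_df = spark.read.format("image").load(image_dir)
images_df.printSchema()

# The image column is a struct with fields such as height, width, nChannels, and mode
images_df.select("image.height", "image.width", "image.nChannels", "image.mode").show(5, truncate=False)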
Binary Files
Spark 3.0 adds support for binary files as a data source. The binaryFile data source converts each binary file into a single DataFrame row (record) that contains the raw content and metadata of the file. The binary file data source produces a DataFrame with the following columns:
path: StringType
modificationTime: TimestampType
length: LongType
content: BinaryType
Reading a binary file into a DataFrame
To read binary files, specify the data source format as binaryFile. You can load files with paths matching a given glob pattern while preserving the behavior of partition discovery with the data source option pathGlobFilter. For example, the following code reads all JPG files from the input directory, along with any partitioned directories:
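A sketch (the path is illustrative):

path = "/databricks-datasets/learning-spark-v2/cctvVideos/train_images/"   # illustrative path
binary_files_df = (spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.jpg")
  .load(path))
binary_files_df.show(5)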
To ignore partition data discovery in a directory, you can set recursiveFileLookup to "true":
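For example:

binary_files_df = (spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.jpg")
  .option("recursiveFileLookup", "true")
  .load(path))
binary_files_df.show(5)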
Note that partition columns (such as a label column derived from the directory structure) are absent when the recursiveFileLookup option is set to "true".
Currently, the binary file data source does not support writing a DataFrame back to the original file format.
In this section, you got a tour of how to read data into a DataFrame from a range of supported file formats. We also showed you how to create temporary views and tables from the existing built-in data sources. Whether you’re using the DataFrame API or SQL, the queries produce identical outcomes. You can examine some of these queries in the notebook available in the GitHub repo for this book.
To recap, this chapter explored the interoperability between the DataFrame API and Spark SQL. In particular, you got a flavor of how to use Spark SQL to:
Create managed and unmanaged tables using Spark SQL and the DataFrame API.
Read from and write to various built-in data sources and file formats.
Employ the spark.sql programmatic interface to issue SQL queries on structured data stored as Spark SQL tables or views.
Peruse the Spark Catalog to inspect metadata associated with tables and views.
Use the DataFrameReader and DataFrameWriter APIs.
Through the code snippets in the chapter and the notebooks available in the book’s GitHub repo, you got a feel for how to use DataFrames and Spark SQL. Continuing in this vein, the next chapter further explores how Spark interacts with the external data sources shown in Figure 4-1. You’ll see some more in-depth examples of transformations and the interoperability between the DataFrame API and Spark SQL.