Title: | R Interface for Apache Impala |
---|---|
Description: | 'SQL' back-end to 'dplyr' for Apache Impala, the massively parallel processing query engine for Apache 'Hadoop'. Impala enables low-latency 'SQL' queries on data stored in the 'Hadoop' Distributed File System '(HDFS)', Apache 'HBase', Apache 'Kudu', Amazon Simple Storage Service '(S3)', Microsoft Azure Data Lake Store '(ADLS)', and Dell 'EMC' 'Isilon'. See <https://impala.apache.org> for more information about Impala. |
Authors: | Ian Cook [aut, cre], Cloudera [cph] |
Maintainer: | Ian Cook <[email protected]> |
License: | Apache License 2.0 | file LICENSE |
Version: | 0.5.0.9000 |
Built: | 2024-11-02 03:57:01 UTC |
Source: | https://github.com/ianmcook/implyr |
compute() executes the query and stores the result in a new Impala table.
collect() executes the query and returns the result to R as a data frame tbl.
collapse() generates the query for later execution.
## S3 method for class 'tbl_impala'
compute(
  x,
  name,
  temporary = TRUE,
  unique_indexes = NULL,
  indexes = NULL,
  analyze = FALSE,
  external = FALSE,
  overwrite = FALSE,
  force = FALSE,
  field_terminator = NULL,
  line_terminator = NULL,
  file_format = NULL,
  ...
)

## S3 method for class 'tbl_impala'
collect(x, ..., n = Inf, warn_incomplete = TRUE)

## S3 method for class 'tbl_impala'
collapse(x, vars = NULL, ...)
x | an object with class tbl_impala |
name | the name for the new Impala table |
temporary | must be set to FALSE |
unique_indexes | not used |
indexes | not used |
analyze | whether to run COMPUTE STATS after adding data to the new table |
external | whether the new table will be externally managed |
overwrite | whether to overwrite existing table data (currently ignored) |
force | whether to silently fail if the table already exists |
field_terminator | the delimiter to use between fields in text file data. Defaults to the ASCII control-A (hex 01) character |
line_terminator | the line terminator. Defaults to \n |
file_format | the storage format to use. Options are |
... | other arguments passed on to methods |
n | the number of rows to return |
warn_incomplete | whether to issue a warning if not all rows are retrieved |
vars | not used |
Impala does not support temporary tables. When using compute() to store results in an Impala table, you must set temporary = FALSE.
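As a rough sketch of how the three verbs differ in practice, the helpers below build a lazy query and then finish it in one of the three ways described above. The names `delays_by_carrier` and `finish_query`, and the columns `carrier` and `dep_delay`, are assumptions made for illustration; they are not part of implyr.

```r
library(dplyr)

# Build a lazy query. With a tbl_impala, nothing is executed at this point.
# The column names carrier and dep_delay are assumed for illustration.
delays_by_carrier <- function(flights_tbl) {
  flights_tbl %>%
    group_by(carrier) %>%
    summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE))
}

# Hypothetical helper: finish a lazy query in one of three ways.
finish_query <- function(query,
                         how = c("collect", "compute", "collapse"),
                         name = NULL) {
  how <- match.arg(how)
  switch(
    how,
    # Execute the query and return the result to R
    collect = collect(query),
    # Execute the query and store the result in a new table;
    # temporary = FALSE because Impala has no temporary tables
    compute = {
      stopifnot(is.character(name), length(name) == 1)
      compute(query, name = name, temporary = FALSE)
    },
    # Only generate the query for later execution
    collapse = collapse(query)
  )
}
```

Assuming a connection object `impala` created by src_impala() and a table named flights, a call such as `finish_query(delays_by_carrier(tbl(impala, "flights")), "compute", name = "delays_by_carrier")` would store the summary in a new Impala table. The same helpers also accept a local data frame, which is convenient for trying them out without a cluster.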
copy_to inserts the contents of a local data frame into a new Impala table. copy_to is intended to be used only with very small data frames. It uses the SQL INSERT ... VALUES() technique, which is not suitable for loading large amounts of data. By default, this function will throw an error if you attempt to copy a data frame with more than 1000 row/column positions. You can increase this limit at your own risk by setting the option implyr.copy_to_size_limit to a higher number.
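To make the size limit concrete, here is a minimal sketch of a pre-flight check you could run before calling copy_to. It assumes that "row/column positions" means the number of rows multiplied by the number of columns; the helper names `n_positions` and `within_copy_limit` are hypothetical and not part of implyr.

```r
# Number of row/column positions (cells) in a data frame,
# assuming this means rows multiplied by columns.
n_positions <- function(df) {
  nrow(df) * ncol(df)
}

# TRUE when the data frame does not exceed the limit. The option name is
# taken from the text above; 1000 is the documented default.
within_copy_limit <- function(df,
                              limit = getOption("implyr.copy_to_size_limit", 1000)) {
  n_positions(df) <= limit
}

small <- data.frame(
  carrier = c("AA", "UA"),
  name = c("American Airlines", "United Air Lines")
)
within_copy_limit(small)

# Raising the limit, at your own risk:
# options(implyr.copy_to_size_limit = 5000)
```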
This package does not provide tools for loading larger amounts of local data into Impala tables. This is because Impala can query data stored in several different filesystems and storage systems (HDFS, Apache Kudu, Apache HBase, Amazon S3, Microsoft ADLS, and Dell EMC Isilon) and Impala does not include built-in capability for loading local data into these systems.
## S3 method for class 'src_impala'
copy_to(
  dest,
  df,
  name = deparse(substitute(df)),
  overwrite = FALSE,
  types = NULL,
  temporary = TRUE,
  unique_indexes = NULL,
  indexes = NULL,
  analyze = FALSE,
  external = FALSE,
  force = FALSE,
  field_terminator = NULL,
  line_terminator = NULL,
  file_format = NULL,
  ...
)
dest | an object with class src_impala |
df | a (very small) local data frame |
name | name for the new Impala table |
overwrite | whether to overwrite existing table data (currently ignored) |
types | a character vector giving variable types to use for the columns |
temporary | must be set to FALSE |
unique_indexes | not used |
indexes | not used |
analyze | whether to run COMPUTE STATS after adding data to the new table |
external | whether the new table will be externally managed |
force | whether to silently continue if the table already exists |
field_terminator | the delimiter to use between fields in text file data. Defaults to the ASCII control-A (hex 01) character |
line_terminator | the line terminator. Defaults to \n |
file_format | the storage format to use. Options are |
... | other arguments passed on to methods |
An object with class tbl_impala, tbl_sql, tbl_lazy, tbl
Impala does not support temporary tables. When using copy_to() to insert local data into an Impala table, you must set temporary = FALSE.
library(nycflights13)
dim(airlines) # airlines data frame is very small
# [1] 16 2

## Not run:
copy_to(impala, airlines, temporary = FALSE)

## End(Not run)
Describe the Impala data source
## S3 method for class 'impala_connection'
db_desc(x)
x | an object with class impala_connection |
A string containing information about the connection to Impala
Closes (disconnects) the connection to Impala.
## S4 method for signature 'src_impala'
dbDisconnect(conn, ...)
conn | an object with class src_impala |
... | other arguments passed on to methods |
Returns TRUE, invisibly
## Not run:
dbDisconnect(impala)

## End(Not run)
Executes an Impala statement that returns no result.
## S4 method for signature 'src_impala,character'
dbExecute(conn, statement, ...)
conn | an object with class src_impala |
statement | a character string containing SQL |
... | other arguments passed on to methods |
Depending on the package used to connect to Impala, either a scalar numeric that specifies the number of rows affected by the statement, or NULL
This method is for statements that return no result, such as data definition or data manipulation statements. Use dbGetQuery() for SELECT queries.
## Not run:
dbExecute(impala, "INVALIDATE METADATA")

## End(Not run)
Returns the result of an Impala SQL query as a data frame.
## S4 method for signature 'src_impala,character'
dbGetQuery(conn, statement, ...)
conn | an object with class src_impala |
statement | a character string containing SQL |
... | other arguments passed on to methods |
A data.frame with as many rows as records were fetched and as many columns as fields in the result set, even if the result is a single value or has one or zero rows
This method is for SELECT queries only. Use dbExecute() for data definition or data manipulation statements.
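The split between dbGetQuery() and dbExecute() can be illustrated with a deliberately simple sketch that picks a method from the first keyword of a statement. The helpers `is_select` and `choose_method` are hypothetical and not part of implyr or DBI, and the check only looks for a leading SELECT, so it makes no attempt to handle other kinds of statements that may return rows.

```r
# TRUE when the statement starts with SELECT (ignoring case and
# leading whitespace). This is a simplistic check for illustration only.
is_select <- function(statement) {
  grepl("^\\s*select\\b", statement, ignore.case = TRUE, perl = TRUE)
}

# Name of the method that the text above recommends for the statement.
choose_method <- function(statement) {
  if (is_select(statement)) "dbGetQuery" else "dbExecute"
}

choose_method("SELECT carrier, COUNT(*) FROM flights GROUP BY carrier")
choose_method("INVALIDATE METADATA")
```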
## Not run:
flights_by_carrier_df <- dbGetQuery(
  impala,
  "SELECT carrier, COUNT(*) FROM flights GROUP BY carrier"
)

## End(Not run)
impala_unnest() unnests a column of type ARRAY, MAP, or STRUCT in a tbl_impala. These column types are referred to as complex or nested types.
impala_unnest(data, col, ...)
data | an object with class tbl_impala |
col | the unquoted name of an ARRAY, MAP, or STRUCT column |
... | ignored (included for compatibility) |
impala_unnest() currently can unnest only one column, can only be applied once to a tbl_impala, and must be applied to a tbl_impala representing an Impala table or view before applying any other operations.
An object with class tbl_impala with the complex column unnested into two or more separate columns
Returns a character vector containing the names of all the available databases, in alphabetical order, including the _impala_builtins database.
src_databases(src, ...)
src_schemas(src, ...)
src | an object with class src_impala |
... | Optional arguments; currently unused. |
src_schemas() is an alias for src_databases()
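Because the returned vector includes the _impala_builtins database, a caller interested only in user databases may want to drop it. The sketch below shows one way to do that in base R; the helper `user_databases` is hypothetical, and `example_dbs` is a made-up stand-in for the character vector that src_databases() would return.

```r
# Drop Impala's built-in database from a vector of database names.
user_databases <- function(databases) {
  setdiff(databases, "_impala_builtins")
}

# Stand-in for the result of src_databases(impala); the names are invented.
example_dbs <- c("_impala_builtins", "default", "nycflights13")
user_databases(example_dbs)
```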
src_impala creates a SQL backend to dplyr for Apache Impala, the massively parallel processing query engine for Apache Hadoop.
src_impala can work with any DBI-compatible interface that provides connectivity to Impala. Currently, two packages that can provide this connectivity are odbc and RJDBC.
src_impala(drv, ..., auto_disconnect = TRUE)
drv | an object that inherits from DBIDriver, for example an object returned by odbc::odbc() or RJDBC::JDBC() |
... | arguments passed to the underlying Impala database connection method |
auto_disconnect | Should the connection to Impala be automatically closed when the object returned by this function is deleted? Pass |
An object with class src_impala, src_sql, src
Impala ODBC driver, Impala JDBC driver
# Using ODBC connectivity:

## Not run:
library(odbc)
drv <- odbc::odbc()
impala <- src_impala(
  drv = drv,
  driver = "Cloudera ODBC Driver for Impala",
  host = "host",
  port = 21050,
  database = "default",
  uid = "username",
  pwd = "password"
)

## End(Not run)

# Using JDBC connectivity:

## Not run:
library(RJDBC)
Sys.setenv(JAVA_HOME = "/path/to/java/home/")
impala_classpath <- list.files(
  path = "/path/to/jdbc/driver",
  pattern = "\\.jar$",
  full.names = TRUE
)
.jinit(classpath = impala_classpath)
drv <- JDBC(
  driverClass = "com.cloudera.impala.jdbc41.Driver",
  classPath = impala_classpath,
  identifier.quote = "`"
)
impala <- src_impala(
  drv,
  "jdbc:impala://host:21050",
  "username",
  "password"
)

## End(Not run)
Create a lazy tbl from an Impala table
## S3 method for class 'src_impala'
tbl(src, from, ...)
src | an object with class src_impala |
from | a table name or identifier |
... | not used |
An object with class tbl_impala, tbl_sql, tbl_lazy, tbl
## Not run:
flights_tbl <- tbl(impala, "flights")
flights_tbl <- tbl(impala, in_schema("nycflights13", "flights"))

## End(Not run)