class documentation

class Table:


Represents an instance of an HBase table.
Class Batch The batch interface provided for a hbspark.table.Table.
Method __init__ Instantiates a new table object with a given table name.
Method batch Retrieve the batch processor of the table which allows for bulk data modification.
Method cell Retrieve the cell value (and its history) from the HBase table.
Method delete Delete a row from the HBase table.
Method families Gets all of the column families associated with the HBase table.
Method put Insert a new row into the HBase table.
Method regions Provides all of the regions associated with a table (between the keys).
Method row Get a row from the HBase table.
Method scan Retrieve all of the rows inside the HBase table.
Instance Variable _table_ref Undocumented
def __init__(self, name):
Instantiates a new table object with a given table name.
Parameters
name (string): The name for the table to be created.
Returns
hbspark.table.Table: A new instance of the HBase table.
def batch(self, timestamp=None, batch_size=None, transaction=False, wal=True):
Retrieve the batch processor of the table which allows for bulk data modification.
Parameters
timestamp (int): The timestamp all batch commands should use.
batch_size (int): The queue length at which the batch sends its queued commands automatically.
transaction (bool): Whether or not the batch should behave like a transaction (for the purposes of a context manager).
wal (bool): Whether to write to the WAL.
Returns
hbspark.table.Table.Batch: The batch processor for the current table.
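The interplay of batch_size and transaction can be hard to picture from the parameter list alone. What follows is a minimal in-memory sketch, not the hbspark implementation: SketchBatch, its queue and its sent list are hypothetical names. The transaction behaviour shown (queued commands are discarded when the with block raises) is an assumption modelled on the HappyBase client, whose batch signature this method resembles.

```python
class SketchBatch:
    """Hypothetical stand-in that models batch queueing; it never talks to HBase."""

    def __init__(self, batch_size=None, transaction=False):
        self.batch_size = batch_size
        self.transaction = transaction
        self.queue = []  # commands waiting to be sent
        self.sent = []   # each entry is one flushed group of commands

    def put(self, rowkey, data):
        self.queue.append(("put", rowkey, data))
        self._maybe_send()

    def delete(self, rowkey, columns=None):
        self.queue.append(("delete", rowkey, columns))
        self._maybe_send()

    def _maybe_send(self):
        # Assumed behaviour: flush automatically once the queue reaches batch_size.
        if self.batch_size is not None and len(self.queue) >= self.batch_size:
            self.send()

    def send(self):
        # Move everything queued so far into one flushed group.
        if self.queue:
            self.sent.append(list(self.queue))
            self.queue.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None and self.transaction:
            # Assumed transaction semantics: drop queued commands on error.
            self.queue.clear()
        else:
            self.send()
        return False  # never swallow the exception


with SketchBatch(batch_size=2) as batch:
    batch.put("row-1", {"cf_x:col_x": "a"})
    batch.put("row-2", {"cf_x:col_x": "b"})  # queue reaches 2 and is flushed
    batch.put("row-3", {"cf_x:col_x": "c"})  # flushed when the block exits
```

In this model the first two puts are sent together as soon as the queue reaches batch_size, and the third is sent on leaving the with block. With transaction=True, an exception inside the block would leave nothing sent from the pending queue.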
def cell(self, rowkey, column, versions=None, timestamp=None, include_timestamp=False):
Retrieve the cell value (and its history) from the HBase table.
Parameters
rowkey (string): The rowkey for the target cell.
column (string): The column name for the target cell.
versions (int): The maximum number of cell versions to be retrieved.
timestamp (int): The timestamp for the retrieval. (VF)
include_timestamp (bool): Whether or not to include the timestamp in the retrieval. (VF)
Returns
list of pyspark.sql.Row: List of each retrieved row from the table.
def delete(self, rowkey, columns=None, timestamp=None, wal=True):

Delete a row from the HBase table.

The columns payload should have the following structure:

    columns = ["cf_x:col_x", ...]
Parameters
rowkey (string): The rowkey targeting the row to be deleted.
columns (list): The list of column names to be deleted, each of the form cf:col.
timestamp (int): The timestamp for the deletion operation.
wal (bool): Whether or not to write to the WAL of HBase.
Returns
None: The method does not return a value.
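As a small illustration of the columns payload described above, the list of cf:col names can be built from a column family and its qualifiers. The helper name qualify_columns is hypothetical and not part of hbspark; the delete call appears only as a comment because it needs a live table.

```python
def qualify_columns(family, qualifiers):
    """Build 'cf:col' strings for one column family (hypothetical helper)."""
    return ["{}:{}".format(family, qualifier) for qualifier in qualifiers]


columns = qualify_columns("cf_x", ["col_x", "col_y"])
# A call would then look like: table.delete("row-1", columns=columns)
```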
def families(self):
Gets all of the column families associated with the HBase table.
Returns
list: A list of dictionaries, each representing a column family in the table and its configuration.
def put(self, rowkey, data, timestamp=None, wal=True):

Insert a new row into the HBase table.

The data payload should have the following structure:

    data = {
        "cf_x:col_x" : "value",
        ...
    }
Parameters
rowkey (string): The rowkey for the newly inserted row.
data (dict): The dictionary mapping cf:col to the values to be stored.
timestamp (int): The timestamp used for the put operation (VF).
wal (bool): Whether or not to write to the WAL of HBase.
Returns
None: The method does not return a value.
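Application data is often grouped by column family, while the data payload above is a flat mapping keyed by cf:col. A minimal sketch of the conversion follows. The helper name flatten_payload and the nested input shape are assumptions made for illustration; the put call appears only as a comment because it needs a live table.

```python
def flatten_payload(nested):
    """Turn {family: {qualifier: value}} into {"family:qualifier": value}."""
    flat = {}
    for family, cells in nested.items():
        for qualifier, value in cells.items():
            flat["{}:{}".format(family, qualifier)] = value
    return flat


data = flatten_payload({
    "cf_x": {"col_x": "value", "col_y": "other"},
    "cf_y": {"col_z": "third"},
})
# A call would then look like: table.put("row-1", data)
```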
def regions(self):
Provides all of the regions associated with a table (between the keys).
Returns
list: A list of dictionaries, each representing a region and its configuration.
def row(self, rowkey, columns=None, timestamp=None, include_timestamp=False):
Get a row from the HBase table.
Parameters
rowkey (string): The rowkey for the requested row.
columns (list of string): The column names which should be retrieved from the row.
timestamp (int): The timestamp for the retrieval. (VF)
include_timestamp (bool): Whether or not to include the timestamp in the retrieval. (VF)
Returns
pyspark.sql.Row: The row as a Spark-manageable data structure.
def scan(self, schema=None, row_start=None, row_stop=None, row_prefix=None, columns=None, filter=None, timestamp=None, include_timestamp=False, batch_size=1000, scan_batching=None, limit=None, sorted_columns=False, reverse=False):
Retrieve all of the rows inside the HBase table.
Parameters
schema (StructType): A list of StructField entries of the form ("cf:name", Type(), True).
row_start (string): Beginning rowkey of the scan (inclusive).
row_stop (string): Ending rowkey of the scan (exclusive).
row_prefix (string): A prefix that rowkeys must match.
columns (list or tuple): The columns that should be returned for each row.
filter (string): A string to filter out results (VF).
timestamp (int): The timestamp for the scan.
include_timestamp (bool): Whether row timestamps are returned.
batch_size (int): The maximum size of a single batch of retrieved results.
scan_batching (bool): Whether or not the server will return results in batches.
limit (int): Maximum number of total returned rows.
sorted_columns (bool): Whether to return the sorted columns or not.
reverse (bool): Whether to perform scans in reverse of natural order.
Returns
pyspark.sql.DataFrame: A dataframe that consists of all the rows in the HBase table.
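The inclusive row_start, exclusive row_stop and row_prefix parameters can be modelled on a plain sorted list of rowkeys. The sketch below is an in-memory illustration only and does not call hbspark; select_rowkeys is a hypothetical name. Rejecting row_prefix combined with row_start or row_stop is a choice made for this sketch rather than documented behaviour, and reverse scans are not modelled.

```python
import bisect


def select_rowkeys(rowkeys, row_start=None, row_stop=None, row_prefix=None, limit=None):
    """Model which rowkeys a scan would visit (hypothetical helper)."""
    if row_prefix is not None and (row_start is not None or row_stop is not None):
        # Sketch-level choice: refuse the combination rather than guess at it.
        raise ValueError("use either row_prefix or row_start/row_stop")

    keys = sorted(rowkeys)  # HBase stores rows in sorted rowkey order
    if row_prefix is not None:
        selected = [key for key in keys if key.startswith(row_prefix)]
    else:
        # bisect_left on both ends gives an inclusive start and an exclusive stop.
        lo = 0 if row_start is None else bisect.bisect_left(keys, row_start)
        hi = len(keys) if row_stop is None else bisect.bisect_left(keys, row_stop)
        selected = keys[lo:hi]

    return selected if limit is None else selected[:limit]


keys = ["a1", "a2", "b1", "b2", "c1"]
in_range = select_rowkeys(keys, row_start="a2", row_stop="b2")
with_prefix = select_rowkeys(keys, row_prefix="b")
first_two = select_rowkeys(keys, limit=2)
```

Here in_range contains "a2" but not "b2", which mirrors the inclusive and exclusive bounds described for row_start and row_stop.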
_table_ref =

Undocumented