7周7数据库之Hbase

Hbase

目的：Hbase的使用场景是什么？有哪些关键特性？优缺点？

Hbase为大型系统而设计，数据至少要有几十GB以上。是大型在线分析处理系统的基石，擅长大数据集扫描，大查询。大公司使用它来支持重型日志记录和搜索系统
特点：
1. 开箱即用的版本控制、压缩、垃圾收集（过期数据）、内存表
2. 强一致性，可扩展性
3. 行级别保证原子性
4. 基于google big table
5. 可容错的：使用write-ahead logging and distributed configuration

Day 1：CRUD and Table Administration

Get Started 指南 https://hbase.apache.org/book.html#quickstart

HBase supports three running modes:

• Standalone mode is a single machine acting alone.

• Pseudo-distributed mode is a single node pretending to be a cluster.

• Fully distributed mode is a cluster of nodes working together.

在HBase表中，键是任意字符串，每个字符串都映射到一行数据。行本身就是一个映射，其中键称为列，值存储为未解释的字节数组。列分为多个列族，因此列的全名包括两部分：列族名和列限定符。通常，它们使用冒号连接在一起（例如，family:qualifier）。

使用python 字典看起来是这样：

hbase_table = { # Table 
  'row1': { # Row key
    'cf1:col1': 'value1', # Column family, column, and value 
    'cf1:col2': 'value2',
    'cf2:col1': 'value3'
  }, 'row2': {
    # More row data
  } 
}

可视化：

crud

创建一个wiki table，长下面这样

hbase> create ‘wiki’, ‘text’
hbase> put ‘wiki’, ‘Home’, ‘text:’, ‘Welcome to the wiki!’
hbase> get ‘wiki’, ‘Home’, ‘text:’
hbase> scan ‘wiki’

当将新值写入相同的单元格时，旧值会徘徊，并通过其时间戳进行索引. 这是hbase特有的功能，其他数据库会要求你自己处理历史数据，但是hbase可以自动使用版本控制功能。建议将HBase的行本身视为一个小型数据库。数据库中的每个单元可以具有许多与其相关联的值（例如小型时间序列数据库）。当您在HBase中获取一行时，并没有获取一组值。您正在获取一个小世界。

Let’s expand our requirements to include the following:

• In our wiki, a page is uniquely identified by its title.

• A page can have unlimited revisions.

• A revision is identified by its timestamp.

• A revision contains text and optionally a commit comment.

• A revision was made by an author, identified by name.

如下图所示：

修改table属性

hbase> disable ‘wiki’

hbase> alter ‘wiki’, { NAME => ‘text’, VERSIONS => org.apache.hadoop.hbase.HConstants::ALL_VERSIONS }