Things to Watch Out for with Hive Table Partition Columns - 技术博客集

Programming / posted by houtizong 3 years ago / 102

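In a recent project we used Hive for data statistics and created a number of Hive tables. Along the way we had to decide how to design the partitions. Here is a brief summary, using the new-install table as an example: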

Version V1:

CREATE TABLE IF NOT EXISTS stat_install (
    uuid          string,
    ver           int,
    version_code  int,
    channel       int,
    ipaddr        bigint,
    dpi           int,
    device        int,
    os            int,
    country       int,
    language      string,
    province      int,
    agent         string,
    network       int,
    upgrade       int,
    install_date  string
)
PARTITIONED BY (year int, month int, day int, hour int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "#";

Version V2:

CREATE TABLE IF NOT EXISTS stat_install (
    uuid          string,
    ver           int,
    version_code  int,
    channel       int,
    ipaddr        bigint,
    dpi           int,
    device        int,
    os            int,
    country       int,
    language      string,
    province      int,
    agent         string,
    network       int,
    upgrade       int,
    install_date  string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "#";

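V1 partitions by year, month, day and hour. The partition columns are fine-grained: an hourly job filters on hour = ?, a daily job on day = ?, and a yearly job on year = ? (each together with the coarser columns above it). That looks good, but for a weekly job or any job that spans several days, a time range is hard to express with year, month and day, so the range has to be expressed with the install_date column instead.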

For example, to fetch one week of data starting on 2015-01-19:

install_date >= '2015-01-19' and install_date < '2015-01-26';

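However, install_date is not a partition column, so Hive cannot prune partitions on it and the query is very slow. Another drawback of this partitioning scheme is that the data is split into too many small files.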
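To make the difficulty concrete, here is a minimal Python sketch (not part of the original post) that builds the partition filter a V1 query would need in order to cover a date range without falling back to install_date. The helper name `v1_partition_filter` is hypothetical; it only produces a HiveQL predicate string and does not talk to Hive.

```python
from datetime import date, timedelta


def v1_partition_filter(start: date, end: date) -> str:
    """Build a HiveQL predicate on (year, month, day) covering [start, end).

    Hypothetical helper for illustration: V1 has no single partition
    column that orders days, so every day becomes its own OR branch.
    An empty range yields an empty string.
    """
    branches = []
    day = start
    while day < end:
        branches.append(
            f"(year = {day.year} AND month = {day.month} AND day = {day.day})"
        )
        day += timedelta(days=1)
    return " OR ".join(branches)


if __name__ == "__main__":
    # The week starting 2015-01-19 expands into seven OR branches.
    print(v1_partition_filter(date(2015, 1, 19), date(2015, 1, 26)))
```

The predicate grows by one branch per day, which is workable for a week but quickly becomes unwieldy for longer ranges.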

V2 repartitions by dt, for example dt = '2015-01-19'. This handles multi-day ranges well, but an hourly job can only narrow down the hour through install_date.

For example, to fetch the data for hour = 10 on 2015-01-19:

dt = '2015-01-19' and install_date >= '2015-01-19 10:00:00' and install_date < '2015-01-19 11:00:00'
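As a rough illustration of how V2 filters can be generated, the Python sketch below builds both kinds of predicate. The function names are hypothetical, and it assumes dt is stored as 'yyyy-MM-dd' and install_date as 'yyyy-MM-dd HH:mm:ss', as in the examples above, so that string comparison matches chronological order.

```python
from datetime import date, datetime, timedelta


def v2_range_filter(start: date, end: date) -> str:
    """Predicate on the dt partition column covering [start, end)."""
    return f"dt >= '{start.isoformat()}' AND dt < '{end.isoformat()}'"


def v2_hour_filter(day: date, hour: int) -> str:
    """Predicate for one hour: restrict dt first, then narrow by install_date."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    begin = datetime(day.year, day.month, day.day, hour)
    end = begin + timedelta(hours=1)
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f"dt = '{day.isoformat()}' "
        f"AND install_date >= '{begin.strftime(fmt)}' "
        f"AND install_date < '{end.strftime(fmt)}'"
    )


if __name__ == "__main__":
    print(v2_range_filter(date(2015, 1, 19), date(2015, 1, 26)))
    print(v2_hour_filter(date(2015, 1, 19), 10))
```

Keeping the dt condition in the hourly filter lets Hive read a single day's partition; only the narrowing to one hour relies on the non-partition column install_date.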

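Compared with V1, V2 makes the data easier to work with. If the hourly data volume is large, a V3 scheme partitioned by (dt string, hour int) is also worth considering. If overseas data is involved, time zones must be taken into account: one option is to set the server time zone to UTC+8. Alternatively, if you would rather not set a time zone, partition by timestamp and use timestamps all the way from the statistics jobs to the front-end display. The downside of timestamps is that they are not human-readable, so when data in Hive is wrong it is harder to troubleshoot.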
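One way to combine the V3 layout with explicit time-zone handling is to derive the (dt, hour) partition values from an epoch timestamp in a fixed zone. The Python sketch below is only an illustration under assumptions not stated in the post: the helper `partition_for` is hypothetical, events are assumed to carry epoch seconds, and UTC+8 is modelled as a fixed offset.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Fixed UTC+8 offset (assumption: no daylight saving time is involved).
UTC_PLUS_8 = timezone(timedelta(hours=8))


def partition_for(epoch_seconds: int, tz: timezone = UTC_PLUS_8) -> Tuple[str, int]:
    """Map an epoch timestamp to V3-style (dt, hour) partition values in tz."""
    local = datetime.fromtimestamp(epoch_seconds, tz)
    return local.strftime("%Y-%m-%d"), local.hour


if __name__ == "__main__":
    # 2015-01-19 20:00:00 UTC already belongs to the next day in UTC+8.
    ts = int(datetime(2015, 1, 19, 20, tzinfo=timezone.utc).timestamp())
    print(partition_for(ts))                # partition as seen in UTC+8
    print(partition_for(ts, timezone.utc))  # partition as seen in UTC
```

Because the zone is passed explicitly, the result does not depend on the server's local time-zone setting, and the partition values stay readable when inspecting data in Hive.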

The partitioning scheme matters because it determines query performance. Discussion and suggestions are welcome.


