当前位置：首页 >> 编程语言 >> 【Python大数据笔记_day08_hive查询】,46英寸

【Python大数据笔记_day08_hive查询】,46英寸

0evadmin 2025-12-11 17:57:02 编程语言 1

文件名：【Python大数据笔记_day08_hive查询】,46英寸【Python大数据笔记_day08_hive查询】 hive查询语法结构: SELECT [ALL | DISTINCT] 字段名, 字段名, ...FROM 表名 [inner | left outer | right outer | full outer | left semi JOIN 表名 ON 关联条件 ][WHERE 非聚合条件][GROUP BY 分组字段名][HAVING 聚合条件][ORDER BY 排序字段名 asc | desc][CLUSTER BY 字段名 | [DISTRIBUTE BY 字段名 SORT BY 字段名]][LIMIT x,y] 类sql基本查询基础查询格式: select distinct 字段名 from 表名; 注意: *代表所有字段 distinct去重 as给表或者字段起别名条件查询格式: select distinct 字段名 from 表名 where 条件;比较运算符: > < >= <= != <>逻辑运算符: and or not模糊查询: %代表任意0个或者多个字符 _代表任意1个字符空判断: 为空is null 不为空is not null范围查询: x到y的连续范围:between x and y x或者y或者z类的非连续范围: in(x,y,z)排序查询格式: select distinct 字段名 from 表名 [where 条件] order by 排序字段名 asc|desc ;asc : 升序默认升序desc: 降序聚合查询格式: select 聚合函数(字段名) from 表名;聚合函数: 又叫分组函数或者统计函数聚合函数: count() sum() avg() max() min()分组查询格式: select 分组字段名,聚合函数(字段名) from 表名 [where 非聚合条件] group by 分组字段名 [having 聚合条件];注意: 当分组查询的时候,select后的字段名要么在groupby后出现过,要么放在聚合函数内,否则报错where和having区别? 区别1: 书写顺序不同,where在group by关键字前,having在group by关键字后区别2: 执行顺序不同,where在分组之前过滤数据,having在分组之后过滤数据区别3: 筛选数据不同,where只能在分组之前过滤非聚合数据,having在分组之后主要过滤聚合数据区别4: 操作对象不同,where底层操作伪表,having底层操作运算区分页查询格式: select 字段名 from 表名 [ order by 排序字段名 asc|desc] limit x,y;x: 起始索引默认从0开始,如果x为0可以省略计算格式: x=(页数-1)*yy: 本次查询记录数数据准备: -- 创建订单表CREATE TABLE orders (orderId bigint COMMENT '订单id',orderNo string COMMENT '订单编号',shopId bigint COMMENT '门店id',userId bigint COMMENT '用户id',orderStatus tinyint COMMENT '订单状态 -3:用户拒收 -2:未付款的订单 -1：用户取消 0:待发货 1:配送中 2:用户确认收货',goodsMoney double COMMENT '商品金额',deliverMoney double COMMENT '运费',totalMoney double COMMENT '订单金额（包括运费）',realTotalMoney double COMMENT '实际订单金额（折扣后金额）',payType tinyint COMMENT '支付方式,0:未知;1:支付宝，2：微信;3、现金；4、其他',isPay tinyint COMMENT '是否支付 0:未支付 1:已支付',userName string COMMENT '收件人姓名',userAddress string COMMENT '收件人地址',userPhone string COMMENT '收件人电话',createTime timestamp COMMENT '下单时间',payTime timestamp COMMENT '支付时间',totalPayFee int COMMENT '总支付金额') ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';-- 加载数据(因为是普通内部表可以直接上传文件到hfds表路径下) 数仓分层思想: -- 数仓分层: 本质就是分库分表-- 构建源数据层create database xls_ods;-- 构建数数据仓库create database xls_dw;-- 构建数据应用层create database xls_da;-- 转换应用场景-- 注意: 在大数据分析中转换完后为了以后方便使用一般存储起来create table xls_dw.dw_orders asselectorderid,orderno,shopid,userid,orderstatus,goodsmoney,delivermoney,totalmoney,realtotalmoney,casewhen payType=0 then '未知'when payType=1 then '支付宝'when payType=2 then '微信'when payType=3 then '现金'when payType=4 then '其他'end as payType,payType,username,useraddress,userphone,createtime,paytime,totalpayfeefrom orders;-- 修改字段类型alter table dw_orders change orderstatus orderstatus string;alter table dw_orders change ispay ispay string;-- 修改后重新修改了,需要覆盖数据insert overwrite table xls_dw.dw_ordersselectorderid,orderno,shopid,userid,casewhen orderstatus=-3 then '用户拒收'when orderstatus=-2 then '未付款的订单'when orderstatus=-1 then '用户取消'when orderstatus=0 then '待发货'when orderstatus=1 then '配送中'when orderstatus=2 then '用户确认收货'endas orderstatus,goodsmoney,delivermoney,totalmoney,realtotalmoney,casewhen payType=0 then '未知'when payType=1 then '支付宝'when payType=2 then '微信'when payType=3 then '现金'when payType=4 then '其他'end as payType,casewhen isPay=0 then '未支付'when isPay=1 then '已支付'end as isPay,username,useraddress,userphone,createtime,paytime,totalpayfeefrom orders; 课堂练习: -- 基础查询格式: select distinct 字段名 from 表名;-- 注意: *代表所有字段 distinct去重 as给表或者字段起别名且可以省略-- 指定字段查询select userName,userPhone from orders where userName='邓力夫';-- 指定字段并且取别名查询select distinct userName name,userPhone phone from orders where userName='邓力夫';-- 当然也可以给表起别名(目前单表即使起了也没有多大意义)select o.userName ,o.userPhone from orders as o ;-- 查询支付类型要求去重select distinct payType from orders;-- 2.演示where条件查询-- 查询广东省订单drop table if exists da_gd_orders;create table da_gd_orders asselect * from orders where userAddress like '广东省%';-- 3.演示聚合查询-- 查询广东省数据量select count(*) from orders where userAddress like '广东省%';-- 4.演示分组查询-- 注意: select后的字段要么在groupby后出现要么在聚合函数内出现,否则报错-- 统计已支付和未支付各自多少人select isPay,count(*) cnt from orders group by isPay;-- 5.演示条件查询,聚合查询,分组查询综合练习-- 在已支付订单中,统计每个用户最高的一笔消费金额select userId, username, max(realTotalMoney)from orderswhere isPay = 1group by userId, username;-- 统计每个用户的平均消费金额select userId, username, avg(realTotalMoney)from orderswhere isPay = 1group by userId, username;-- 统计每个用户的平均消费金额并且筛选大于10000的select userId, username, avg(realTotalMoney) as avg_moneyfrom orderswhere isPay = 1group by userId, usernamehaving avg_money > 10000;-- 统计每个用户的平均消费金额并且筛选大于10000的,平均值要求保留2位小数select userId, username,round(avg(realTotalMoney),2)from orderswhere isPay = 1group by userId, usernamehaving round(avg(realTotalMoney),2) > 10000;-- 6.演示排序查询-- asc默认升序 desc 降序-- 查询广东省订单,要求按照总价降序排序select * from orders where userAddress like '广东省%' order by realTotalMoney desc;-- 7.演示分页查询-- limit x,y 注意: x和y都是整数,x是从0开始起始索引,y是查询的条数-- 查询广东省订单总价最高的前5个订单select * from orders where userAddress like '广东省%' order by realTotalMoney desc limit 5; 类sql多表查询交叉连接格式: select 字段名 from 左表 cross join 右表;注意: 交叉连接产生的结果叫笛卡尔积,此种方式慎用!!!内连接格式: select 字段名 from 左表 inner join 右表 on 左右表关联条件;特点: 相当于只取两个表的交集左外连接格式: select 字段名 from 左表 left outer join 右表 on 左右表关联条件;特点: 以左表为主,左表数据全部保留,右表只保留和左表有交集的部分右外连接格式: select 字段名 from 左表 right outer join 右表 on 左右表关联条件;特点: 以右表为主,右表数据全部保留,左表只保留和右表有交集的部分自连接: 本质是一个特殊的内外连接,最大特点就是左右表是同一个表应用场景: 比较局限,场景1: 存储省市县三级数据的区域表场景2: 存储上下级信息的员工表子查询: 本质是一个select语句作为另外一个select语句的一部分(表或者条件)注意: 子查询作为表使用的话必须取别名数据准备: -- 创建用户表CREATE TABLE users (userId int,loginName string,loginSecret int,loginPwd string,userSex tinyint,userName string,trueName string,brithday date,userPhoto string,userQQ string,userPhone string,userScore int,userTotalScore int,userFrom tinyint,userMoney double,lockMoney double,createTime timestamp,payPwd string,rechargeMoney double) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';-- 加载数据load data inpath '/source/itheima_users.txt' into table users;-- 验证数据select * from users limit 1; 练习: -- 交叉连接格式: select 字段名 from 左表 cross join 右表;-- 注意: 交叉连接产生的结果叫笛卡尔积,此种方式慎用!!!select * from users cross join orders;-- 内连接格式: select 字段名 from 左表 inner join 右表 on 左右表关联条件;-- 特点: 相当于只取两个表的交集select * from users u inner join orders o on u.userId=o.userId ;-- 左外连接格式: select 字段名 from 左表 left outer join 右表 on 左右表关联条件;-- 特点: 以左表为主,左表数据全部保留,右表只保留和左表有交集的部分select * from users u left outer join orders o on u.userId=o.userId ;-- 右外连接格式: select 字段名 from 左表 right outer join 右表 on 左右表关联条件;-- 特点: 以右表为主,右表数据全部保留,左表只保留和右表有交集的部分select * from users u right outer join orders o on u.userId=o.userId ;-- 自连接: 本质是一个特殊的内外连接,最大特点就是左右表是同一个表-- 应用场景: 比较局限,场景1: 存储省市县三级数据的区域表场景2: 存储上下级信息的员工表-- 可以运行下基础班的areas.sql脚本,做以下练习-- 方式1: 建议select xian.titlefrom(select * from areas where title = '北京市' and pid is not null) cityjoinareas xianon city.id = xian.pid;-- 方式2:select xian.titlefromareas cityjoinareas xianon city.id = xian.pidwhere city.title = '北京市' and city.pid is not null;-- 子查询: 本质是一个select语句作为另外一个select语句的一部分(表或者条件)-- 注意: 子查询作为表使用的话必须取别名;select titlefrom areaswhere pid = (select id from areas where title = '北京市' and pid is not null); hive其他join操作

在Hive中除了支持cross join（交叉连接，也叫做笛卡尔积）,inner join（内连接）、left outer join（左外连接）、right outer join（右外连接）还支持full outer join（全外连接）、left semi join（左半开连接）

全外连接: 左表 full [outer] join 右表 on 条件左半开连接: 左表 left semi join 右表 on 条件 -- hive不同于mysql的join操作-- 全外连接(左表 full outer join 右表 on 条件) 大白话就是左外和右外结果合并同时去重select * from users u full outer join orders o on u.userId = o.userId;-- 左半开连接(左表 left semi join 右表 on 条件) 大白话就是内连接的一半select * from users u left semi join orders o on u.userId = o.userId; hive其他排序操作 set mapreduce.job.reduces: 查看当前设置的reduce数量默认结果是-1,代表自动匹配reduce数量和桶数量一致set mapreduce.job.reduces = 数量 : -- 修改reduces数量cluster by 字段名: 分桶且正序排序弊端: 分和排序是同一个字段,相对不灵活distribute by 字段名 sort by 字段名: distribute by负责分,sort by负责排序, 相对比较灵活order by 字段名:全局排序注意: cluster by 和 distribute by 字段名 sort by 字段名受当前设置的reduces数量影响,但是设置的reduces数量对order by无影响,因为orderby就是全局排序,就是一个reduce建表的时候指定分桶字段和排序字段: clustered by (字段名) sorted by (字段名) into 桶数量 buckets注意: 如果建表的时候设置了桶数量,那么reduces建议设置值-1或者值大于桶数量 -- 演示4个by区别-- 创建表create table students(id int,name string,gender string,age int,cls string)row format delimitedfields terminated by ',';-- 加载数据load data inpath '/source/students.txt' into table students;-- 验证数据select * from students limit 1;-- 查询reduces的数量set mapreduce.job.reduces; -- -1代表根据任务实时改变-- 1.cluster by 字段名查询的时候分桶且排序-- 注意: 如果是1个reduces那么cluster by全局升序排序select * from students cluster by id;-- 修改reduces数量为3set mapreduce.job.reduces=3;-- 再次使用cluster by查询,查看效果-- 效果: 如果多个reduces那么cluster by桶内局部排序select * from students cluster by age;-- 2.distribute by + sort by-- 设置reduces的数量为-1set mapreduce.job.reduces = -1;-- 默认1个ruduces数量,使用distribute by + sort by查询观察结果-- 注意: 如果是1个ruduces那么distribute by + sort by全局排序select * from students distribute by name sort by age desc;-- 修改reduces数量set mapreduce.job.reduces = 2;-- 再次distribute by + sort by查询-- 效果: 如果多个redueces,那么distribute by 分reduces数量个桶,sort by桶内局部排序select * from students distribute by name sort by age desc;-- 3.order by-- 注意: order by 永远都是全局排序,不受reduces数量影响,每次只用1个reducesselect * from students order by age desc; 抽样查询 TABLESAMPLE抽样好处: 尽可能实现随机抽样,并且不走MR查询效率相对较快基于随机分桶抽样格式: SELECT 字段名 FROM tbl TABLESAMPLE(BUCKET x OUT OF y ON(字段名 | rand()))y:表示将表数据随机划分成y份（y个桶）x:表示从y里面随机抽取x份数据作为取样| : 或者字段名: 表示随机的依据基于某个列的值,每次按相关规则取样结果都是一致rand(): 表示随机的依据基于整行,每次取样结果不同 -- 随机抽样函数 tablesample-- 参考字段分桶抽样,快且随机select * from orders tablesample ( bucket 1 out of 10 on orderid);-- 参考rand()随机数,快且真正达到随机select * from orders tablesample ( bucket 1 out of 10 on rand());-- 快速取前面部分数据 : 快但没有随机-- 前100条select * from orders tablesample ( 100 rows );-- 前10%数据select * from orders tablesample ( 10 percent );-- 取1k或者1m的数据select * from orders tablesample ( 16k );select * from orders tablesample ( 167k );select * from orders tablesample ( 1m );-- 随机取100条: 随机但是不快select * from orders distribute by rand() sort by rand() limit 100; 正则模糊查询 sql模糊查询关键字: like 任意0个或者多个: % 任意1个: _ 正则模糊查询关键字: rlike 任意0个或者多个: .* 任意1个: . 正则语法还有很多... -- 正则表达式查询-- 1.查询广东省数据-- sql模糊查询select * from orders where userAddress like '广东省%';-- 正则模糊查询select * from orders where userAddress rlike '广东省.*';-- 2. 查询满足'xx省 xx市 xx区'格式的信息-- sql模糊查询select * from orders where userAddress like '__省 __市 __区';-- 正则模糊查询select * from orders where userAddress rlike '..省 ..市 ..区';-- 3.查询所有姓张王邓的用户信息-- sql模糊查询select * from orders where username like '张%' or username like '王%' or username like '邓%' ;-- 正则模糊查询select * from orders where username rlike '[张王邓].*';select * from orders where username rlike "[张王邓].+";-- 4.查找所有188开头的手机号-- sql模糊查询select * from orders where userPhone like '188________' ;-- 正则模糊查询select * from orders where userPhone rlike '188........' ;select * from orders where userPhone rlike '188.{8}' ;select * from orders where userPhone rlike '188\\*{4}[0-9]{4}' ;select * from orders where userPhone rlike '188\\*{4}\\d{4}' ; union联合查询 union联合查询: 就是把两个select语句结果合并成一个临时结果集,整体可以用于其他sql操作union [distinct]: 去重,只是省略了distinctunion all : 不去重 -- 插入数据insert into product values('p1','联想','c1'),('p2','小米','c2'),('p3','华为',null);-- 创建分类表create table category(cid varchar(100),cname varchar(100));-- 插入数据insert into category values('c1','电脑'),('c2','手机'),('c3','服饰');-- 1.如果在mysql中,使用union实现全外连接-- 使用union关键字,自动去重-- 左外 union 右外select pid,pname,p.cid,cname from product p left join category c on p.cid = c.cidunionselect pid,pname,c.cid,cname from product p right join category c on p.cid = c.cid;-- 注意: 如果不想去重使用 union all-- 左外 union all 右外select pid,pname,p.cid,cname from product p left join category c on p.cid = c.cidunion allselect pid,pname,c.cid,cname from product p right join category c on p.cid = c.cid;-- 2.在hive中使用full outer join实现全外连接select pid,pname,c.cid,cname from product p full join category c on p.cid = c.cid; CTE表达式 CTE: 公用表表达式（CTE）是一个在查询中定义的临时命名结果集将在from子句中使用它。注意: 每个CTE仅被定义一次（但在其作用域内可以被引用任意次）,仅适用于当前运行的sql语句语法如下:with 临时结果集的别名1 as (子查询语句),临时结果集的别名2 as (子查询语句)...select 字段名 from (表别名); 内置虚拟列虚拟列是Hive内置的可以在查询语句中使用的特殊标记，可以查询数据本身的详细参数。Hive目前可用3个虚拟列：INPUT__FILE__NAME:显示数据行所在的具体文件BLOCK__OFFSET__INSIDE__FILE:显示数据行所在文件的偏移量ROW__OFFSET__INSIDE__BLOCK:显示数据所在HDFS块的偏移量注意: 此虚拟列需要设置：SET hive.exec.rowoffset=true 才可使用 -- 演示内置虚拟列-- 打开ROW__OFFSET__INSIDE__BLOCKSET hive.exec.rowoffset=true;-- 演示查询SELECT *, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, ROW__OFFSET__INSIDE__BLOCK FROM students_bucket;

一	二	三	四	五	六	日
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

【Python大数据笔记_day08_hive查询】,46英寸

【Python大数据笔记_day07_hive中的分区表、分桶表以及一些特殊类型】,a390

【Python大数据笔记_day10_Hive调优及Hadoop进阶】,DVD碟片日本黄

【PyTorch】PyTorch、Cuda 的安装和使用,hd高清

【PyTorch】模型选择、欠拟合和过拟合,麦博fc260

【Python COM】Word 自动纵向合并相同内容单元格,索爱k800

【Python CheckiO 题解】Between Markers (simplified),akg k402

【Python CheckiO 题解】Digits Multiplication,QQ音乐2009

【Python CheckiO 题解】Feed Pigeons,中恒d5

【Python CheckiO 题解】First Word (simplified),360ak47

【Python CheckiO 题解】Popular Words,firefox8

【Python CheckiO 题解】Speech Module,tcl k10

【Python大数据笔记_day08_hive查询】,46英寸

【Python大数据笔记_day07_hive中的分区表、分桶表以及一些特殊类型】,a390

【Python大数据笔记_day10_Hive调优及Hadoop进阶】,DVD碟片 日本黄

【PyTorch】PyTorch、Cuda 的安装和使用,hd高清

【PyTorch】模型选择、欠拟合和过拟合,麦博fc260

【Python COM】Word 自动纵向合并相同内容单元格,索爱k800

【Python CheckiO 题解】Between Markers (simplified),akg k402

【Python CheckiO 题解】Digits Multiplication,QQ音乐2009

【Python CheckiO 题解】Feed Pigeons,中恒d5

【Python CheckiO 题解】First Word (simplified),360ak47

【Python CheckiO 题解】Popular Words,firefox8

【Python CheckiO 题解】Speech Module,tcl k10

【Python大数据笔记_day10_Hive调优及Hadoop进阶】,DVD碟片日本黄