数据库概论.陈立军.04.非关系型数据

SQL 非关系型数据

  • 关系型数据库中的非关系数据
  • 关系型数据库的目标:One Size Fits All
  • 非关系数据
    • 序列、树、图、XML、JSON、RDF
    • ...

窗口函数

  • 股票走势的指标
    • 移动平均线
    • MACD:指数平滑异同平均线
    • KDJ
    • BOLL
    • K 线
  • 需要窗口函数的支持
  • 移动方式
    • 滑动窗口:最近
    • 跳动窗口:每隔
  • 窗口范围
    • 基于时间
    • 基于行
    • 基于值
  • 基于行的滑动窗口
    • 基准:current row

窗口函数声明

1
2
3
4
5
6
function_name(<argument>, <argument>, ...)
over (
<Partition by 子句>
<Order by 子句>
<Windowing 子句>
)
  • Partition by:对表进行分区,类似group by
  • Order by:排序
  • Windowing:窗口函数

一个例子

1
stock(stock_id, trade_day, open_price, high_price, low_price, close_price)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
partition by stock_id
-- 按照股票号分区

order by trade_day
-- 按照交易日期排序

rows between 5 preceding and 3 following
-- 基于行
-- 每行对应的数据窗口是之前 5 行, 之后 3 行
-- 一共 9 行

rows between unbounded preceding and unbounded following(current row)
-- 基于行
-- 每行对应的数据窗口是从第一行到最后一行(当前行)

range between 50 preceding and 150 following
-- 基于值
-- 每行对应的数据窗口包含比当前行值大50以及小于150的行
  • 应用例子
    • 下面例子的边界情况
    • 第一行没有前一行,那么就是当前行和下一行两行的平均
1
2
3
4
5
6
select stock_id,trade_day,close_price,
avg(close_price) over(
order by trade_day
rows between 1 preceding and 1 following
) as avg_price
from stock

窗口函数类型

  • 传统聚集函数:sum,avg,count,max,min
  • 排名函数:rank,dense_rank,row_number
  • 分布函数:percent_rank,cume_dist
  • 逆分布函数:percent_cont,percentile_disc
  • 偏移函数:lag,lead

示例表

testid studentid score
Test ABC Student E 50
Test ABC Student C 55
Test ABC Student D 55
Test ABC Student H 65
Test ABC Student I 75
Test ABC Student B 80
Test ABC Student F 80
Test ABC Student A 95
Test ABC Student G 95
Test XYZ Student E 50
Test XYZ Student C 55
Test XYZ Student D 55
Test XYZ Student H 65
Test XYZ Student I 75
Test XYZ Student B 80
Test XYZ Student F 80
Test XYZ Student A 95
Test XYZ Student G 95
Test XYZ Student J 95

一般聚合函数

1
2
3
4
5
6
7
select testid, studentid, score,
count(*) over(
partition by testid
order by score
rows between unbounded preceding and current row
) as cnt
from Stats.Scores

排名函数对比

1
2
3
4
5
select studentid, score,
rank() over (order by score desc) as rk,
dense_rank() over(order by score desc)as drk,
row_number() over(order by score desc)as rn
fromStats.Scores
  • 部分结果展示
    • rank、dense_rank、row_number 区别如下

ntile

  • 划档,尽可能让每个档中的人数相同
  • 每条记录的返回值就是一个序号
1
2
3
4
select studentid, score,
ntile(4) over(order by score desc) as nt1,
ntile(3) over(order by score desc) as nt2
from Stats.Scores
  • 参照效果如下(数据似乎和示例表有出入)

percent_rank

  • 返回分位数
  • 计算一个值在一组值当中的相对位置或排名
  • rk:rank 排名
  • nr:窗口内总行数
  • \(\mathrm{percent\_rank=\dfrac{rk-1}{nr-1}}\)
1
2
3
select studentid, score,
percent_rank() over(order by score desc) as pr
from Stats.Scores

cume_dist

  • 与 percent_rank 类似
  • 计算某个值一组值内的累积分布,也即计算某指定值在一组值中的相对位置
  • 对于值 r,假定采用升序,r 的 cume_dist 是值低于或等于 r 的值的行数除以整个行数

逆分布函数

  • percentile_disc:离散百分位
    • 找到的结果一定是数据中存在的
  • percentile_cont:连续百分位
    • 找到的结果不一定是数据中存在的
1
2
3
4
5
6
7
8
9
10
11
12
13
14
select distinct testid,
percentile_disc(0.5) with in group(order by score)
over(partition by testid) as d_median,
percentile_cont(0.5) with in group(order by score)
over(partition by testid) as c_median,
percentile_disc(0.25) with in group(order by score)
over(partition by testid) as d_lower_quartile,
percentile_cont(0.25) with in group(order by score)
over(partition by testid) as c_lower_quartile,
percentile_disc(0.75) with in group(order by score)
over(partition by testid) as d_upper_quartile,
percentile_cont(0.75) with in group(order by score)
over(partition by testid) as c_upper_quartile
from Stats.Scores

偏移函数

1
2
3
4
5
lag(表达式,偏移量,缺省值) over(...)
-- 将当前行与同一值集中的先前行进行比较

lead(表达式,偏移量,缺省值) over(...)
-- 将当前行与同一值集中的后续行进行比较
1
2
3
4
select lag(score,1,0) over(order by score) as pre_score,
score,
lead(score,1,0) over(order by score) as next_score
from Stats.Scores

层次结构

  • 数据结构
    • :职工之间的领导联系
    • 有向图:零件之间的构成联系
    • 无向图:交通网络
  • 操作需求
    • 返回指定节点的所有子(父)节点,显示格式
    • 改变隶属关系
    • 环路检测
    • 生成传递闭包
    • 最短路径

层次结构的例子

层次结构的关系表示方法

  • 对于树、图等数据结构,其关系存储有以下三种方式
graph TD;
A-->B;
A-->C;
B-->D;
B-->E;
C-->F
邻接表
  • adjacent(child,parent)
child parent
B A
C A
D B
E B
F C
物化路径
  • 记录所有的路径
  • materialize_path(node,path)
node path
A .A
B .A.B
C .A.C
D .A.B.D
E .A.B.E
F .A.C.F
嵌套集合
  • 子结点的范围是父结点的一个细分
  • nested_net(node,left_value,right_value)
node left_value right_value
A 1 12
B 2 7
C 8 11
D 3 4
E 5 6
F 9 10

应用需求

  • 找到某个结点的所有子孙节点
  • 递归查询

递归查询

  • 内部实现,多次连接直到最终形成的表为空

Oracle

  • Connect By
1
2
3
4
5
Select part, subpart
From Components
Start with part = 'trike'
Connect by prior subart=part
-- 上一条的 subpart 是本条的 part

SQL Server

  • 基本事实:儿子是孩子
  • 规则:儿子的孩子还是孩子
1
2
3
4
5
6
7
8
9
10
11
12
13
14
WITH RecursiveCTE
AS (
-- 定位点成员 Anchor Member
SELECT ...
FROM BaseTable

UNIONALL

-- 递归成员 Recursive Member
SELECT ...
FROM RecursiveCTE
);

SELECT * FROM Recursive CTE;
  • 实际例子
1
2
3
4
5
6
7
8
9
10
11
12
13
with Components(part, subpart) as (
select part, subpart
from Assembly

union all

-- C 的子零件也是 A 的子零件
select A.part, C.subpart
from Assembly A, Components C
where A.subpart=C.part
)

select * from Components where part='trike'
  • 从最基础的语法出发实现的递归查询
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
create function dbo.fn_subordinates(@root as int)
returns @Subs Table (
empid int not null primary key,
level int not null)
as
begin
declare @lvl as int;
set @lvl=0; --Initialize level counter with 0
insert into @Subs(empid,level) --Insertrootnodeto@Subs
select empid, @lvl
from emp
where empid=@root;
while @@rowcount>0 --while previous level had rows
begin
set @lvl = @lvl+1; --Increment level counter
--Insert next level of sub ordinates to@Subs
insert into @Subs(empid,level)
select C.empid,@lvl
from @Subs AS P --P=Parent
join emp AS C --C=Child
on P.lvl=@lvl-1 --Filter parents from previous level
and C.mgrid=P.empid;
end
return;
end

Hierachyid:SQL Server

  • 可以存储树的结点,存储树形结构
  • hierarchyid 表示层次结构中的位置,由应用程序来生成和分配 hierarchyid 值
1
2
3
4
5
6
7
create table emp (
NodeID hierarchyid primary key clustered,
NodeLevel as NodeID.GetLevel(),
eno int,
ename char(20),
title char(20)
)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
declare @root hierarchyid,
@child1 hierarchyid,
@child2 hierarchyid,
@grandchild1 hierarchyid

insert into emp(NodeID,eno,ename,title)
values(hierarchyid::GetRoot(),1,'tom','CEO')

set @root=hierarchyid::GetRoot()
set @child1=@root.GetDescendant(null,null)
insert into emp(NodeID,eno,ename,title)
values(@child1,2,'bob','VP')

set @child2=@root.GetDescendant(@child1,null)
insert into emp (NodeID,eno,ename,title)
values(@child2,3,'Arm','MS')

set @grandchild1=@child1.GetDescendant(null, null)
insert into emp(NodeID,eno,ename,title)
values(@grandchild1,4,'jerry','PM')

select NodeID.ToString() as NodePath,* from emp

select NodeID.GetAncestor(2) as GrandPa,ename
from emp where eno=4

SQL Server

  • 图 = 顶点表 + 边表

1
2
create table (...) as NODE
create table (...) as EDGE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
create table ForumMembers(
MemberID int,
MemberName varchar(100)
)
as node

create table ForumPosts(
PostID int,
PostTitle varchar(100),
Postbody varchar(1000)
)
as node

create table Written_By
as edge
create table Likes
as edge
create table Reply_To
as edge
1
2
3
4
5
6
insert ForumMembers values(1,'Mike')
insert ForumPosts values(7,'Intro','I''m Mike from Argentina')
insert Written_By($to_id,$from_id) values(
(select $node_id from ForumMembers where MemberId=1),
(select $node_id from dbo.ForumPosts where PostID=8)
)

XML

MySQL

1
2
3
4
5
6
7
8
9
10
11
set @xml='
<bibliography>
<book>
<title>Foundations…</title>
<author>Abiteboul</author>
<author>Hull</author>
<author>Vianu</author>
<publisher>AddisonWesley</publisher>
<year>1995</year>
</book>
</bibliography>'
1
2
3
4
5
select ExtractValue(@xml, '//author')
-- Abiteboul,Hull,Vianu

select ExtractValue(@xml, '//author[2]')
-- Hull

JSON

  • 对象:{属性名:属性值,属性名:属性值 ...}
  • 数组:[ value, value, value ...]

MySQL

1
2
3
4
5
-- 解析为一个 JSON 数组
select json_array(1, "abc", null, true, curtime())

-- 解析成一个 JSON 对象
select json_object('id', 87, 'name', 'carrot')
1
2
3
4
5
6
7
8
create table tj10 (a json, b int)
insert into tj10 values
('[3,10,5,"x",44]', 33),
('[3,10,5,17,[22,"y",66]]', 0)

select a->"$[3]", a->"$[4][1]" from tj10
-- "X", NULL
-- 17 , "Y"
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
create table jemp(
c json,
g int generated always as(c->"$.id"),
index i(g)
)

insert into jemp(c) values
('{"id":"1","name":"Fred"}'),
('{"id":"2","name":"Wilma"}'),
('{"id":"3","name":"Barney"}'),
('{"id":"4","name":"Betty"}');

select c, json_extract(c,"$.id"),g
from jemp
where json_extract(c,"$.id")>1
order by json_extract(c,"$.name")

RDF

  • 资源描述框架 RDF

  • 三元组
    • <标识符,属性名,属性值>