spark window的一些理解

avatar 2019年6月7日14:32:44 3 111

本文讨论apache spark的window

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

这篇文章说的比书上清楚,虽然写书的也是这个公司的
书是指 Spark - The Definitive Guide - Big data processing made simple
请自行购买,参考第七章 Window Functions部分

 

这篇博客不是全面的描述,因为别人已经说的很清楚了,是基于书和网文的基础上的思考和试验,先读上面相关网址和书籍有助于理解本文

Every input row can have a unique frame associated with it
每一行都有一个不同的frame,那么现在的问题是,处理了windows函数之后,结果的行数是多少?是和原来一样?还是分组之后的行数?

原始数据如下,我们把它注册为productRevenue表

+-------+--------+-------+
|product|category|revenue|
+-------+--------+-------+
| p1| c1| 1|
| p2| c1| 2|
| p2| c2| 3|
| p3| c1| 4|
| p3| c2| 5|
| p3| c3| 6|
+-------+--------+-------+

执行如下sql语句,功能:计算每个类别下的每个产品销售排名

  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank
  FROM productRevenue

结果

+-------+--------+-------+----+
|product|category|revenue|rank|
+-------+--------+-------+----+
|     p3|      c1|      4|   1|
|     p2|      c1|      2|   2|
|     p1|      c1|      1|   3|
|     p3|      c3|      6|   1|
|     p3|      c2|      5|   1|
|     p2|      c2|      3|   2|
+-------+--------+-------+----+

行数没有变,还是6行

下面写另外一功能,每个产品和该类别下的销售冠军的差别

google搜spark UNBOUNDEDPRECEDING

        select product,category,revenue,
        max(revenue) over (PARTITION BY  category order by revenue rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) - revenue as diff
        from productRevenue

结果

+-------+--------+-------+----+
|product|category|revenue|rank|
+-------+--------+-------+----+
|     p3|      c1|      4|   1|
|     p2|      c1|      2|   2|
|     p1|      c1|      1|   3|
|     p3|      c3|      6|   1|
|     p3|      c2|      5|   1|
|     p2|      c2|      3|   2|
+-------+--------+-------+----+

行数还是6行

这里可以 select product,category,revenue,每一列,如果是一般的group by,只能选择group by使用的列,否则必须使用聚合函数,还有这里max... - revenue

这里减去的并不是一个group by使用的列,而是一个平常的列

所以猜想,window函数不会把原始的每一组变成一行,而是分到不同的组中,然后全部展示,就有点像,按照category排序一样,但是各种计算会在同一组中分别进行,比排序order by和排名rank等等

参考网文中的

there was no way to both operate on a group of rows while still returning a single value for every input row

估计window使用后行数并不会减少,如下sql会报错:

          select sum(value2) over (partition by group2 order by value2) from df1
          group by group2

报错:

 org.apache.spark.sql.AnalysisException: expression 'df1.`value2`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Project [sum2(value2) OVER (PARTITION BY group2 ORDER BY value2 ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#1301]
+- Project [value2#38, group2#37, sum2(value2) OVER (PARTITION BY group2 ORDER BY value2 ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#1301, sum2(value2) OVER (PARTITION BY group2 ORDER BY value2 ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#1301]
   +- Window [sum2(value2#38, ch7Agg.Sum2@5d71b500, 0, 0) windowspecdefinition(group2#37, value2#38 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS sum2(value2) OVER (PARTITION BY group2 ORDER BY value2 ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#1301], [group2#37], [value2#38 ASC NULLS FIRST]
      +- Aggregate [group2#37], [value2#38, group2#37]
         +- SubqueryAlias `df1`
            +- Project [_1#34 AS group2#37, _2#35 AS value2#38]
               +- LocalRelation [_1#34, _2#35]

貌似window和group by不能同时使用

书有个好处就是比较系统,当你看到了一个知识点后,就可以上网去搜

RANGE frame
网页上的描述看的让人心惊肉跳还是不明白,但是一看下面的图就明白了

关于range frame和rows frame

刚才的语句可以改为

        select product,category,revenue,
        max(revenue) over (PARTITION BY  category order by revenue range ws between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) - revenue as diff
        from productRevenue

结果一样

说明range和rows后面使用的范围写法是一样的
区别在于range是找取值范围,而rows是找行数范围

来看

        select product,category,revenue,
        max(revenue) over (PARTITION BY  category order by revenue range between 0 PRECEDING and 1 FOLLOWING) - revenue as diff
        from productRevenue

结果

+-------+--------+-------+----+
|product|category|revenue|diff|
+-------+--------+-------+----+
|     p1|      c1|      1|   1|
|     p2|      c1|      2|   0|
|     p3|      c1|      4|   0|
|     p3|      c3|      6|   0|
|     p2|      c2|      3|   0|
|     p3|      c2|      5|   0|
+-------+--------+-------+----+

为什么只有第一行是1?
因为看revenue那一列,1和2相差1,和1 FOLLOWING导致2也在窗口的范围内,就是说,revenue的第一行和第二行的值相差1,而1 FOLLOWING说明了只有相差的值小于或等于1,才把这一行分到当前的窗口中
再说明白一点就是
如果当前行的值是1,那么只有取值为1到2的行才属于当前行的窗口
如果当前行的值是2,那么只有取值为2到3的行才属于当前行的窗口
如果当前行的值是4,那么只有取值为4到5的行才属于当前行的窗口
类推

而后面2和4相差2,4和6相差2等等,都不在1 FOLLOWING的范围内,这也进一步说明了range是基于取值的

再看

        select product,category,revenue,
        max(revenue) over (PARTITION BY  category order by revenue rows between 0 PRECEDING and 1 FOLLOWING) - revenue as diff
        from productRevenue

结果

+-------+--------+-------+----+
|product|category|revenue|diff|
+-------+--------+-------+----+
|     p1|      c1|      1|   1|
|     p2|      c1|      2|   2|
|     p3|      c1|      4|   0|
|     p3|      c3|      6|   0|
|     p2|      c2|      3|   2|
|     p3|      c2|      5|   0|
+-------+--------+-------+----+

这里看revenuey列,因为4 6 5都是当前类别下最大的值,所以和冠军差别为0,其余都是和下一行进行差值比较

avatar

发表评论

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

目前评论:3   其中:访客  2   博主  1

    • avatar 水寒 0

      互换个友链吧: https://dp2px.com

        • 久伴 久伴 Admin

          @水寒 可以

        • avatar 弥雅 4

          虽然全部看完,但是没看懂,来自技术渣的悲哀啊!