企业大数据基础平台搭建和实用开发代码


    目录
    理解Hadoop HDFS 6
    1 介绍 6
    2 HDFS设计原 6
    21 设计目标 7
    22 系统架构容错性设计 7
    23 HDFS适合应类型 7
    3 HDFS核心概念 7
    31 Blocks 7
    32 Namenode & Datanode 7
    33 Block Caching 8
    34 HDFS Federation 8
    35 HDFS HA(High Availability高性) 8
    4 命令行接口 9
    5 Hadoop文件系统 11
    6 Java接口 13
    61 读操作 13
    62 写数 14
    63 目录操作 14
    64 删数 15
    7 数流(读写流程) 15
    71 读文件 15
    72 写文件 16
    73 致性模型 17
    74 Hadoop节点距离 18
    8 相关运维工具 18
    81 distcp行复制 18
    82 衡HDFS集群 19
    9 HDFS机架感知概念配置实现 19
    机架感知什? 19
    二告诉呢? 19
    三什情况会涉机架感知? 19
    四机架感知需考虑情况(权衡性性带宽消耗) 19
    五通什方式够告知Hadoop NameNode Slaves机器属Rack?配置步骤 19
    六网络拓扑机器间距离 21
    10 HDFS理界面 21
    YARN 22
    基架构 22
    工作机制 22
    ResourceManager 23
    资源理 23
    务调度 23
    部结构 23
    作业提交全程 24
    资源调度器 25
    务推测执行 26
    YARNWEB UI说明 27
    集群运行状态查 29
    Hadoop 30 新特性 30
    Hadoop Common 30
    Hadoop HDFS 30
    Hadoop MapReduce 31
    Hadoop YARN 31
    总结 31
    Hive部表外部表区 34
    概念理解 34
    创建部表t1 34
    装载数(t1) 35
    创建外部表t2 37
    装载数(t2) 37
    查文件位置 37
    观察HDFS文件 40
    重新创建外部表t2 41
    官网解释 42
    Hive数仓库拉链表流水表全量表增量表 42
    Hadoop 320 完全分布式集群搭建 45
    集群环境搭建 46
    二Hadoop配置修改 47
    修改 hadoopenvsh 配置 jdk 路径定义集群操作户 47
    修改 coresitexml hadoop核心配置 47
    修改 hdfssitexml hadoop 节点配置 47
    修改 workers 告知 hadoop hdfsDataNode节点 48
    修改 yarnsitexml 配置yarn服务 48
    修改mapredsitexml 文件 49
    修改配置文件 分发 节点三台服务器 49
    三Hadoop服务启动 49
    四运行WordCount 52
    Linux台载MySQL 53
    yum安装MySQL 56
    Hive环境搭建 61
    安装 61
    二配置理 61
    三运行 64
    Apache Mahout环境搭建 65
    PySpark环境搭建 66
    Linux台载Python 38 66
    1赖安装 66
    2载安装包 67
    3解压 67
    4安装 67
    5添加软链接 67
    6测试 67
    Linux升级安装python38配置pipyum 67
    Linux安装Python 38环境卸载旧Python 70
    安装新版Python 2713Python 362(Python 2Python 3存修改默认版Python 362) 72
    Linux安装Apache Spark 310详细步骤 73
    Spark安装配置 76
    Spark集群安装设置 79
    Ubuntu 1204Hadoop 220 集群搭建 82
    Ubuntu 1404安装Hadoop240(单机模式) 87
    启动Spark集群 94
    Spark性优化 99
    1Spark作业基运行原理 101
    2资源参数调优 101
    numexecutors 101
    executormemory 102
    executorcores 102
    drivermemory 102
    sparkdefaultparallelism 102
    sparkstoragememoryFraction 102
    sparkshufflememoryFraction 102
    3资源参数参考示例 103
    4Spark中三种Join策略 103
    Broadcast Hash Join 103
    Shuffle Hash Join 104
    Sort Merge Join 104
    5Spark 30 中 AQE新特性 105
    6数仓库中数优化般原 109
    7Spark 中宽赖窄赖 109
    概述 109
    详细运行原理 110
    8Spark算子 113
    9Spark RDD 144
    Spark RDD特性 144
    Spark RDD核心特性 144
    关系型数库数性优化解决方案分表(前表历史表)表分区数清理原 147
    原目 147
    数否需清理阀值判断 147
    满负载周期判断 147
    迁移周期判断 147
    类型数分区方案 148
    历史表清理方案 148
    注意点 148
    数仓库缓慢变化维(Slow changing demenison) 实现方案 149
    MySQLTeradataPySpark代码互转表代码 151
    PySpark代码基结构 196
    PySparkMySQL导出数parquet文件 197
    PySparkTeradata导出数parquet文件 197
    PySparkParquent文件写入Hive表 198
    PySpark读取HiveSQL查询数写入parquet文件 198
    PySpark获取Dataframe采样数保存CSV文件 198
    PySpark连接MySQL数库插入数 198
    PySpark连接Teradata数库插入数 198
    PySpark遍历Dataframe行 199
    PySpark移动Parquet文件目录 200
    PySpark复制Parquet文件目录 200
    PySpark删Parquet文件目录 200
    PySpark修改Hive指存储路径 200
    PySpark显示HDFS路径文件 201
    PySpark显示普通Hive表容量(GB) 201
    PySpark显示Hive分区表容量(GB) 201
    PySpark显示HDFS目录子目录容量 201
    PySpark调SqoopHDFS导入Hive表 201
    HiveQLparquet文件创建Hive表 202
    HiveQLHive表创建Hive视图 202
    HiveQL格式化显示Hive查询结果数 202
    Hive导出Hive查询结果数CSV文件 202
    HiveQL显示Hive表 202
    HiveQL显示Hive数库 202
    Shell带日期参数运行HQL脚 202
    HiveQL更新视图指天表数 203
    HiveQL修改Hive表指存储文件 203
    Shell清HDFS里数 204
    Shell查HDFS数 204
    Sqoop显示MySQL中数库 204
    SqoopMySQL数库导入HDFS 204
    SqoopHDFS数库导入MySQL 206
    Sqoop显示MySQL中数库 206
    Teradata支持数类型 207
    MySQL支持数类型 209
    Hive支持数类型 209
    Parquet文件存储格式 210
    项目组成 210
    数模型 211
    StripingAssembly算法 212
    Parquet文件格式 215
    性 216
    项目发展 218
    总结 218
    Apache Airflow文档 218
    原理 221
    导 221
    Airflow环境安装配置 222
    通安装方法 222
    Airflow环境安装(Docker) 224
    Airflow环境配置(Docker) 224
    Airflow环境安装(Windows 10) 225
    容 227
    项目 227
    许证 228
    快速开始 231
    教程 231
    Operator教程 257
    图形界面截图 301
    概念 305
    数分析 323
    命令行参数 325
    调度触发器 345
    插件 348
    安全性 351
    时区 358
    Experimental Rest API 361
    整合 362
    Lineage 432
    常见问题 433
    API 参考 436
    Apache Airflow 20 新特性 600
    TaskFlow API(AIP31) 种新编写dags方式 600
    完全REST API(AIP32) 601
    调度器性显著提升 602
    调度器高兼容 (AIP15) 603
    务组 (AIP34) 603
    崭新户界面 603
    减少传感器负载智传感器 (AIP17) 604
    简化KubernetesExecutor 604
    Airflow core(核心)providers(第三方安装包) Airflow 拆分 60 包: 604
    安全性 605
    配置 605
    基Apache Airflow企业级数框架架构设计 605


    Understanding Hadoop HDFS
    This article walks through the main concepts of HDFS and should help in understanding the Hadoop Distributed File System.
    1 Introduction
    In a modern enterprise environment a single machine often cannot store the full volume of data, so data has to be spread across machines. A file system that manages storage across a cluster of machines is called a distributed file system. As soon as a network is involved, all the complications of network programming follow, for example guaranteeing that data is not lost when a node fails, which makes a distributed file system more complex than an ordinary disk file system.
    The traditional Network File System (NFS), although also called a distributed file system, has clear limitations. In NFS the files live on a single machine, so it offers no availability guarantee, and when many clients access the NFS server at the same time the server easily becomes a performance bottleneck. In addition, if a client modifies a file in NFS, the change must first be synchronized to the server before other clients can see it. In some sense NFS is not a typical distributed system, even though the files really are stored on a remote (single) server.


    In fact, the NFS protocol stack is just one implementation of VFS (the operating system's abstraction over file systems).
    HDFS, short for Hadoop Distributed File System, is one implementation of Hadoop's abstract file system. Because the file system is an abstraction, Hadoop can also integrate with other systems such as Amazon S3, and HDFS files can even be operated on over a web protocol (webhdfs). HDFS files are distributed across the machines of a cluster and replicas are kept for fault tolerance; clients read and write by talking to the cluster machines directly, so there is no single point of performance pressure.
    To build a complete cluster from scratch, see the referenced guide "Hadoop集群搭建详细步骤(2.6.0)" (httpblogcsdnnetbingduanlbdarticledetails51892750).
    2 HDFS设计原
    HDFS设计初非常明确应场景适什类型应适什应相明确指导原
    21 设计目标
    · HDFSHadoop核心项目现数领域事实存储标准高容错性设计运行量廉价商业硬件设计会假设前提
    · 首先会首先假设硬件障通常现象说块磁盘障概率非常服务器集群数千台甚万台节点时候(磁盘障)非常司空见惯事情说快速发现定位问题快速做障恢复重设计目标
    · 第二HDFS适合容量数流式访问样场景说数场景动辄百G甚T文件相低延时言会更加意批量处理高吞吐样诉求
    · 第三形成理念移动计算代价移动数代价数领域普遍识Hadoop推出HDFS存储时候推出MapReduce计算框架出目
    · 存储非常文件:里非常指百MG者TB级实际应中已集群存储数达PB级根Hadoop官网YahooHadoop集群约10万颗CPU运行4万机器节点更世界Hadoop集群情况参考Hadoop官网
    · 采流式数访问方式 HDFS基样假设:效数处理模式次写入次读取数集常数源生成者拷贝次然做分析工作 
    分析工作常读取中部分数全部 读取整数集需时间读取第条记录延时更重
    · 运行商业硬件 Hadoop需特贵reliable()机器运行普通商机器(家供应商采购) 商机器代表低端机器集群中(尤集群)节点失败率较高HDFS目标确保集群节点失败时候会户感觉明显中断
    22 系统架构容错性设计
    HFDS典型MasterSlave架构里NameNode存储文件系统元数说分块文件路径等等类数BLOCK形式存储DataNode整体HDFS提供client客户提供文件系统命名空间操作说文件开关闭重命名移动等等
    说HDFS存储文件提数副机制说里PART 0文件包括Block 1Block 3两Block时设置replica副数等213两Block会复制出外1外3存储外DataNode节点文件拆分两Block存储拥两副里面核心问题拆分数会存储方(DataNode)样调度策略问题Block副存节点(DataBlock)节点挂掉时候数丢失? HFDS解决问题呢?引入机架感知(Rack Awareness)样概念里引入两机架Rack 1Rack 2台机架会三样节点(DataNode)
    23 HDFS适合应类型
    场景适合HDFS存储数面列举:
    1) 低延时数访问 
    延时求毫秒级应适合采HDFSHDFS高吞吐数传输设计牺牲延时HBase更适合低延时数访问
    2)量文件 
    文件元数(目录结构文件block节点列表blocknode mapping)保存NameNode存中 整文件系统文件数量会受限NameNode存 
    验言文件目录文件块般占150字节元数存空间果100万文件文件占1文件块需约300M存十亿级文件数量现商机器难支持
    3)方读写需意文件修改 
    HDFS采追加(appendonly)方式写入数支持文件意offset修改支持写入器(writer)
    3 HDFS核心概念
    31 Blocks
    A physical disk has the concept of a block: the block is the unit of disk I/O, and reads and writes happen in whole blocks, typically 512 bytes. A file system block is an abstraction on top of the physical disk block, usually an integer multiple of it and typically a few KB. Hadoop ships with tools such as df and fsck that operate at the file-system block level.
    An HDFS block is much larger than in a single-machine file system, 128 MB by default. Files in HDFS are split into block-sized chunks and each chunk is stored as an independent unit. Unlike an ordinary file system, a file smaller than a block does not occupy the whole block; a 1 MB file stored in HDFS uses 1 MB of space, not 128 MB.
    Why are HDFS blocks so large?
    The goal is to keep the seek time small relative to the transfer time. For example, if locating the start of a block takes about 10 ms and the disk transfers at 100 MB/s, then to keep the seek time at roughly 1% of the transfer time the block needs to be about 100 MB.
    On the other hand, if the block size is set too large, a MapReduce job ends up with too few map or reduce tasks, possibly fewer than the machines in the cluster, and job efficiency drops.
    Benefits of the block abstraction:
    Because a file is split into blocks, a single file can be larger than any single disk; its blocks are spread across the whole cluster, so in theory one file could occupy the disks of every machine in the cluster.
    The block abstraction also simplifies the storage subsystem: blocks do not need to carry permissions, owners or other metadata (that is handled at the file level).
    Blocks are the unit of replication in the fault-tolerance and availability mechanism: data is replicated block by block.
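    To see how a file is actually split into blocks and where the replicas live, the fsck tool can be pointed at a path. This is a hedged example; the path /user/hadoop/test.txt is only a placeholder, substitute a file that exists in your cluster.
    hdfs fsck /user/hadoop/test.txt -files -blocks -locations
    The output lists each block id, its length, its replication factor and the DataNodes holding the replicas, which makes the block abstraction described above concrete.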
    32 Namenode & Datanode
    整HDFS集群NamenodeDatanode构成masterworker()模式Namenode负责构建命名空间理文件元数等Datanode负责实际存储数负责读写工作
    Namenode
    Namenode存放文件系统树文件目录元数元数持久化2种形式:
    · namespcae image
    · edit log
    持久化数中包括Block节点列表文件Block分布集群中节点信息系统重启时候重新构建(通Datanode汇报Block信息) 
    HDFS中Namenode成集群单点障Namenode时整文件系统HDFS针单点障提供2种解决机制: 
    1)备份持久化元数 
    文件系统元数时写文件系统 例时元数写文件系统NFS备份操作步原子
    2)Secondary Namenode 
    Secondary节点定期合Namenodenamespace imageedit log 避免edit log通创建检查点checkpoint合会维护合namespace image副 Namenode完全崩溃时恢复数图Secondary Namenode理界面:

    Secondary Namenode通常运行台机器合操作需耗费量CPU存数落NamenodeNamenode完全崩溃时会出现数丢失 通常做法拷贝NFS中备份元数Second作新Namenode 
    HA(High Availability高性)中运行Hot Standby作热备份Active Namenode障代原Namenode成Active Namenode
    Datanode
    数节点负责存储提取Block读写请求namenode直接客户端数节点周期性Namenode汇报节点存储Block相关信息
    33 Block Caching
    DataNode通常直接磁盘读取数频繁Block存中缓存默认情况Block数节点会缓存针文件性化配置 
    作业调度器利缓存提升性例MapReduce务运行Block缓存节点 
    户者应NameNode发送缓存指令(缓存文件缓存久) 缓存池概念理组缓存权限资源
    34 HDFS Federation
    知道NameNode存会制约文件数量HDFS Federation提供种横扩展NameNode方式Federation模式中NameNode理命名空间部分例NameNode理user目录文件 NameNode理share目录文件 
    NameNode理namespace volumnvolumn构成文件系统元数NameNode时维护Block Pool保存Block节点映射等信息NameNode间独立节点失败会导致节点理文件 
    客户端mount table文件路径映射NameNodemount tableNamenode群组封装层层Hadoop文件系统实现通viewfs协议访问
    35 HDFS HA(High Availability高性)
    HDFS集群中NameNode然单点障(SPOF Single Point Of Failure)元数时写文件系统Second NameNode定期checkpoint利保护数丢失提高性 
    NameNode唯文件元数fileblock映射负责方 挂包括MapReduce作业法进行读写
    NameNode障时常规做法元数备份重新启动NameNode元数备份源:
    · 文件系统写入中备份
    · Second NameNode检查点文件
    启动新Namenode需重新配置客户端DataNodeNameNode信息外重启耗时般较久稍具规模集群重启常需十分钟甚数时造成重启耗时原致: 
    1) 元数镜文件载入存耗时较长 
    2) 需重放edit log 
    3) 需收DataNode状态报告满足条件离开安全模式提供写服务
    HadoopHA方案
    采HAHDFS集群配置两NameNode分处ActiveStandby状态Active NameNode障Standby接责继续提供服务户没明显中断感觉般耗时十秒数分钟 
    HA涉实现逻辑
    1) 备需享edit log存储 
    NameNode命NameNode享份edit log备切换时Standby通回放edit log步数 
    享存储通常2种选择
    · NFS:传统网络文件系统
    · QJM:quorum journal manager
    QJM专门HDFSHA实现设计提供高edit logQJM运行组journal nodeedit log必须写部分journal nodes通常3节点允许节点失败类似ZooKeeper注意QJM没ZK然HDFS HA确ZK选举Namenode般推荐QJM
    2)DataNode需时备发送Block Report 
    Block映射数存储存中(磁盘)Active NameNode挂掉新NameNode够快速启动需等DatanodeBlock ReportDataNode需时备两NameNode发送Block Report
    3)客户端需配置failover模式(失效备援模式户透明) 
    Namenode切换客户端说感知通客户端库实现客户端配置文件中HDFS URI逻辑路径映射Namenode址客户端会断尝试Namenode址直成功
    4)Standby代Secondary NameNode 
    果没启HAHDFS独立运行守护进程作Secondary Namenode定期checkpoint合镜文件edit日志
    果Namenode失败时备份Namenode正关机(停止 Standby)运维员然头启动备份Namenode样没HA时候更省事算种改进重启整程已标准化Hadoop部需运维进行复杂切换操作
    NameNode切换通代failover controller实现failover controller种实现默认实现ZooKeeper保证Namenode处active状态
    Namenode运行轻量级failover controller进程该进程简单心跳机制监控Namenode存活状态Namenode失败时触发failoverFailover运维手动触发例日常维护中需切换Namenode种情况graceful(优雅) failover非手动触发failover称ungraceful failover
    ungraceful failover情况没办法确定失败(判定失败)节点否停止运行说触发failover前Namenode运行QJM次允许Namenode写edit log前Namenode然接受读请求Hadoopfencing杀掉前NamenodeFencing通收回前Namenode享edit log访问权限关闭网络端口原Namenode继续接受服务请求STONITH技术前Namenode关机
    HA方案中Namenode切换客户端说见前面已介绍通客户端库完成
    4 命令行接口
    HDFS提供种交互方式例通Java APIHTTPshell命令行命令行交互通hadoop fs操作例:
    hadoop fs -copyFromLocal    copy a local file into HDFS
    hadoop fs -mkdir            create a directory
    hadoop fs -ls               list files
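    As a quick round trip, the commands above can be combined as follows; the local file /tmp/test.txt and the HDFS directory /user/hadoop are placeholders chosen for illustration.
    hadoop fs -mkdir -p /user/hadoop
    hadoop fs -copyFromLocal /tmp/test.txt /user/hadoop/
    hadoop fs -ls /user/hadoop
    hadoop fs -cat /user/hadoop/test.txt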
    Hadoop中文件目录权限类似POSIX模型包括读写执行3种权限:
    · 读权限(r):读取文件者列出目录中容
    · 写权限(w):文件文件写权限目录写权限指该目录创建者删文件(目录)权限
    · 执行权限(x):文件没谓执行权限忽略目录执行权限访问器目录容
    文件目录ownergroupmode三属性owner指文件者group权限组mode 者权限文件属组中组员权限非者非组员权限组成图表示者root拥读写权限supergroup组组员读权限读权限

    文件权限否开启通dfspermissionsenabled属性控制属性默认false没开安全限制会客户端做授权校验果开启安全限制会操作文件户做权限校验特殊户superuserNamenode进程标识会针该户做权限校验
    ls命令执行结果:

    返回结果类似Unix系统ls命令第栏文件moded表示目录紧接着3种权限9位 第二栏指文件副数数量通dfsreplication配置目录表示没副说诸者组更新时间文件Unix系统中ls命令致
    果需查集群状态者浏览文件目录访问Namenode暴露Http Server查集群信息般namenode机器50070端口



    5 Hadoop文件系统
    前面Hadoop文件系统概念抽象HDFS中种实现Hadoop提供实现图:


    简单介绍Local文件系统抽象hdfs常见两种web形式(webhdfsswebhdfs)实现通HTTP提供文件操作接口harHadoop体系压缩文件文件时候压缩成文件效减少元数数量viewfs前面介绍HDFS Federation张提客户端屏蔽Namenode底层细节ftp顾名思义ftp协议实现文件操作转化ftp协议s3aAmazon云服务提供存储系统实现azure微软云服务台实现
    前面提命令行HDFS交互事实方式操作文件系统例Java应程序orgapachehadoopfsFileSystem操作形式操作基FileSystem进行封装里介绍HTTP交互方式 
    WebHDFSSWebHDFS协议文件系统暴露HTTP操作种交互方式原生Java客户端慢适合操作文件通HTTP2种访问方式直接访问通代理访问
    直接访问 
    直接访问示意图:

    NamenodeDatanode默认开嵌入式web serverdfswebhdfsenabled默认truewebhdfs通服务器交互元数操作通namenode完成文件读写首先发namenode然重定datanode读取(写入)实际数流
    通HDFS代理

    采代理示意图示 代理处通代理实现负载均衡者带宽进行限制者防火墙设置代理通HTTP者HTTPS暴露WebHDFS应webhdfsswebhdfs URL Schema
    代理作独立守护进程独立namenodedatanodehttpfssh脚默认运行14000端口
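    The WebHDFS interface described above can be exercised with nothing more than curl. The host name and file path below are placeholders; port 50070 matches the older default used elsewhere in this document (newer releases use 9870).
    # metadata operation, answered by the NameNode
    curl -i "http://namenode:50070/webhdfs/v1/user/hadoop?op=LISTSTATUS"
    # read operation: the NameNode replies with a 307 redirect to a DataNode, -L follows it
    curl -i -L "http://namenode:50070/webhdfs/v1/user/hadoop/test.txt?op=OPEN"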
    FileSystem直接操作命令行HTTTP外C语言APINFSFUSER等方式里做介绍
    6 Java接口
    实际应中HDFS数操作通FileSystem操作部分重点介绍相关接口关注HDFS实现类DistributedFileSystem相关类
    61 读操作
    URL读取数者直接FileSystem操作
    Reading data from a Hadoop URL
    java.net.URL provides a uniform abstraction for locating resources: you define a URL schema and register a handler class that does the actual work. The hdfs schema is one such implementation, so data can be read like this:
    InputStream in = null;
    try {
        in = new URL("hdfs://master/user/hadoop").openStream();
        // process in
    } finally {
        IOUtils.closeStream(in);
    }
    For the custom schema to be recognized, a URLStreamHandlerFactory has to be registered. This can only be done once per JVM, so it is normally done in a static initializer block, as in the example below.
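    A minimal, self-contained sketch of this pattern is shown below. FsUrlStreamHandlerFactory comes from the Hadoop client libraries; the class name UrlCat and the sample hdfs:// URI are placeholders for illustration.
    import java.io.InputStream;
    import java.net.URL;
    import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
    import org.apache.hadoop.io.IOUtils;

    public class UrlCat {
        static {
            // setURLStreamHandlerFactory may be called at most once per JVM,
            // hence the static initializer
            URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
        }
        public static void main(String[] args) throws Exception {
            InputStream in = null;
            try {
                in = new URL(args[0]).openStream();   // e.g. hdfs://master/user/hadoop/test.txt
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }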


    Reading data with the FileSystem API
    1) First obtain a FileSystem instance, normally through one of the static get() factory methods:
    public static FileSystem get(Configuration conf) throws IOException
    public static FileSystem get(URI uri, Configuration conf) throws IOException
    public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
    For the local file system, getLocal() returns a file system object:
    public static LocalFileSystem getLocal(Configuration conf) throws IOException
    2) Call FileSystem.open() to obtain an input stream:
    public FSDataInputStream open(Path f) throws IOException
    public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
    By default open() uses a 4 KB buffer; pass a buffer size explicitly if needed.
    3) Read data through the FSDataInputStream.
    FSDataInputStream is a specialization of java.io.DataInputStream that adds random access and positioned-read capabilities:
    public class FSDataInputStream extends DataInputStream
        implements Seekable, PositionedReadable,
            ByteBufferReadable, HasFileDescriptor, CanSetDropBehind, CanSetReadahead,
            HasEnhancedByteBufferAccess
    Random access is defined by the Seekable interface:
    public interface Seekable {
        void seek(long pos) throws IOException;
        long getPos() throws IOException;
    }
    seek() is an expensive operation and should be used sparingly.
    Positioned reads are defined by the PositionedReadable interface:
    public interface PositionedReadable {
        public int read(long position, byte[] buffer, int offset, int length) throws IOException;
        public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
        public void readFully(long position, byte[] buffer) throws IOException;
    }
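    Putting the pieces together, the sketch below opens a file, copies it to standard output, then seeks back to the beginning and reads it a second time. The class name FileSystemDoubleCat and the example URI are illustrative only.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemDoubleCat {
        public static void main(String[] args) throws Exception {
            String uri = args[0];                        // e.g. hdfs://master/user/hadoop/test.txt
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(new Path(uri));
                IOUtils.copyBytes(in, System.out, 4096, false);  // first pass
                in.seek(0);                                      // back to the start of the file
                IOUtils.copyBytes(in, System.out, 4096, false);  // read it again
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }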
    62 Writing data
    In HDFS a file is created with one of the overloaded create() methods of the FileSystem class. create() returns an output stream, FSDataOutputStream, whose getPos() method reports the current position in the file; seeking is not possible because HDFS only supports appending, not writing at arbitrary offsets.
    When creating a file you can pass a Progressable callback to receive progress information.
    append(Path f) appends to an existing file, but not every implementation provides it; the Amazon S3 file system, for example, does not support append.
    A small example:
    String localSrc = args[0];
    String dst = args[1];

    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);

    OutputStream out = fs.create(new Path(dst), new Progressable() {
        public void progress() {
            System.out.print(".");
        }
    });

    IOUtils.copyBytes(in, out, 4096, true);
    63 Directory operations
    The mkdirs() method creates a directory and automatically creates any missing parent directories.
    In HDFS, file metadata is wrapped in the FileStatus class, which includes the length, block size, replication, modification time, owner and permissions. FileSystem.getFileStatus() returns the FileStatus of a path, and exists() checks whether a file or directory exists.
    Listing is done with listStatus(), which returns the information of the files or directories under a path:
    public abstract FileStatus[] listStatus(Path f) throws FileNotFoundException, IOException
    When the Path is a single file, an array of length 1 is returned. FileUtil.stat2Paths() converts FileStatus objects into Path objects.
    globStatus() matches file paths against a wildcard pattern:
    public FileStatus[] globStatus(Path pathPattern) throws IOException
    PathFilter filters by file name (or any other property exposed through the Path), similar to java.io.FileFilter. The sketch below, for example, excludes paths that match a given regular expression:
    public interface PathFilter {
        boolean accept(Path path);
    }
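    A minimal sketch of such a filter follows; the class name RegexExcludePathFilter and the commented glob pattern are illustrative, not part of the Hadoop API itself.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Excludes paths whose string form matches the given regular expression.
    public class RegexExcludePathFilter implements PathFilter {
        private final String regex;
        public RegexExcludePathFilter(String regex) { this.regex = regex; }
        @Override
        public boolean accept(Path path) {
            return !path.toString().matches(regex);
        }
    }

    // usage, given a FileSystem handle fs:
    // FileStatus[] statuses = fs.globStatus(new Path("/2007/*/*"),
    //         new RegexExcludePathFilter("^.*/2007/12/31$"));
    // Path[] paths = FileUtil.stat2Paths(statuses);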
    64 Deleting data
    Use FileSystem.delete():
    public boolean delete(Path f, boolean recursive) throws IOException
    The recursive parameter is ignored when f is a file. When f is a non-empty directory, it is deleted only if recursive is true; otherwise an exception is thrown.
    7 数流(读写流程)
    接详细介绍HDFS读写数流程致性模型相关概念
    71 读文件
    致读文件流程:

    1)客户端传递文件PathFileSystemopen方法
    2)DFS采RPC远程获取文件开始blockdatanode址Namenode会根网络拓扑结构决定返回节点(前提节点block副)果客户端身Datanode节点刚block副直接读取
    3)客户端open方法返回FSDataInputStream象读取数(调read方法)
    4)DFSInputStream(FSDataInputStream实现改类)连接持第block节点反复调read方法读取数
    5)第block读取完毕寻找block佳datanode读取数果必DFSInputStream会联系Namenode获取批Block 节点信息(存放存持久化)寻址程客户端见
    6)数读取完毕客户端调close方法关闭流象
    读数程中果Datanode通信发生错误DFSInputStream象会尝试佳节点读取数记住该失败节点 续Block读取会连接该节点 
    读取BlockDFSInputStram会进行检验验证果Block损坏尝试节点读取数损坏block汇报Namenode 
    客户端连接datanode获取数namenode指导样支持量发客户端请求namenode流量均匀分布整集群 
    Block位置信息存储namenode存中相应位置请求非常高效会成瓶颈
    72 写文件

    步骤分解 
    1)客户端调DistributedFileSystemcreate方法
    2)DistributedFileSystem远程RPC调Namenode文件系统命名空间中创建新文件时该文件没关联block 程中Namenode会做校验工作例否已存名文件否权限果验证通返回FSDataOutputStream象 果验证通抛出异常客户端
    3)客户端写入数时候DFSOutputStream分解packets(数包)写入数队列中该队列DataStreamer消费
    4)DateStreamer负责请求Namenode分配新block存放数节点节点存放Block副构成道 DataStreamerpacket写入道第节点第节点存放packet转发节点节点存放 继续传递
    5)DFSOutputStream时维护ack queue队列等datanode确认消息道datanode确认packetack队列中移
    6)数写入完毕客户端close输出流packet刷新道中然安心等datanode确认消息全部确认告知Namenode文件完整 Namenode时已知道文件Block信息(DataStreamer请求Namenode分配block)需等达副数求然返回成功信息客户端
    Namenode决定副存Datanode?
    HDFS副存放策略性写带宽读带宽间权衡默认策略:
    · 第副放客户端相机器果机器集群外机选择(会选择容量太慢者前操作太繁忙)
    · 第二副机放第副机架
    · 第三副放第二副机架节点满足条件节点中机选择
    · 更副整集群机选择然会量避免太副机架 
    副位置确定建立写入道时候会考虑网络拓扑结构面存放策略:

    样选择滴衡性读写性
    · 性:Block分布两机架
    · 写带宽:写入道程需跨越交换机
    · 读带宽:两机架中选读取
    73 致性模型
    The coherency model describes the visibility of reads and writes on the file system. In HDFS, once a file is created it is visible in the file system namespace:
    Path p = new Path("p");
    fs.create(p);
    assertThat(fs.exists(p), is(true));
    However, content written to the file is not guaranteed to be visible, even if the stream has been flushed:
    Path p = new Path("p");
    OutputStream out = fs.create(p);
    out.write("content".getBytes("UTF-8"));
    out.flush();
    assertThat(fs.getFileStatus(p).getLen(), is(0L));   // still 0 even though flush() was called
    If the data must be forced out to the DataNodes, use the hflush() method of FSDataOutputStream. hflush() guarantees that everything written up to that point has reached all DataNodes in the write pipeline:
    Path p = new Path("p");
    FSDataOutputStream out = fs.create(p);
    out.write("content".getBytes("UTF-8"));
    out.hflush();
    assertThat(fs.getFileStatus(p).getLen(), is((long) "content".length()));
    Closing the stream internally calls hflush(). Note that hflush() does not guarantee that the DataNodes have written the data to disk, only that it sits in their memory, so a power failure can still lose data. For a guarantee that the data is on disk, use hsync() instead. hsync() behaves like the fsync() system call, which commits the buffered data of a file descriptor, as in the local-file analogue:
    FileOutputStream out = new FileOutputStream(localFile);
    out.write("content".getBytes("UTF-8"));
    out.flush();
    out.getFD().sync();
    assertThat(localFile.length(), is((long) "content".length()));
    Both hflush() and hsync() reduce throughput, so application design has to balance throughput against data robustness.
    In addition, while a file is being written, the block currently being written is not visible to readers.
    外文件写入程中前正写入BlockReader见
    74 Hadoop节点距离
    读取写入程中namenode分配Datanode时候会考虑节点间距离HDFS中距离没 
    采带宽衡量实际中难准确度量两台机器间带宽 
    Hadoop机器间拓扑结构组织成树结构达公父节点需跳转数作距离事实距离矩阵例子面例子简明说明距离计算:

    数中心机架节点距离0
    数中心机架节点距离2
    数中心机架节点距离4
    数中心机架节点距离6

    Hadoop集群拓扑结构需手动配置果没配置Hadoop默认节点位数中心机架
    8 相关运维工具
    81 distcp行复制
    So far the focus has been on single-threaded access. To process files in parallel you would otherwise have to write your own application, but Hadoop ships with the distcp tool for copying data in parallel into or out of Hadoop. Examples:
    hadoop distcp file1 file2          # an efficient replacement for hadoop fs -cp
    hadoop distcp dir1 dir2
    hadoop distcp -update dir1 dir2    # -update only copies files that have changed
    distcp is implemented as a MapReduce job that consists of map tasks only (no reduce); the files are copied in the maps, distcp tries to give every map an equal amount of data, and the number of maps can be set with the -m option.
    hadoop distcp -update -delete -p hdfs://master1:9000/foo hdfs://master2/foo
    This form is commonly used to copy data between two clusters. -update synchronizes changed files, -delete removes files in the destination that no longer exist in the source, and -p preserves file attributes such as permissions and block replication.
    If the two clusters run incompatible Hadoop versions, the webhdfs protocol can be used instead:
    hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/foo
    82 衡HDFS集群
    distcp工具中果指定map数量1仅速度慢Block第副全部落运行唯map节点直磁盘溢出distcp时候默认map数量20
    HDFSBlock均匀分布节点时候工作果没办法作业中量保持集群衡例限制map数量(便节点作业)balancer工具调整集群Block分布
    9 HDFS机架感知概念配置实现
    机架感知什?
    告诉 Hadoop 集群中台机器属机架
    二告诉呢?
    Hadoop机架感知非适应Hadoop集群分辨某台Slave 机器属Rack非智感知需 Hadoop理者告知 Hadoop台机器属Rack样HadoopNameNode启动初始化时会机器 rack 应信息保存存中作接 HDFS 写块操作分配 datanode
    列表时( 3 block 应三台 datanode)选择 datanode 策略量三副分布 rack
    三什情况会涉机架感知?
    Hadoop 集群规模情况
    四机架感知需考虑情况(权衡性性带宽消耗)
    (1)节点间通信够量发生机架跨机架
    (2)提高容错力NameNode会数块副放机架
    五通什方式够告知Hadoop NameNode Slaves机器属Rack?配置步骤
    1默认情况Hadoop机架感知没启通常情况Hadoop集群 HDFS 选机器时候机选择说写数时Hadoop第块数 block1写rack1然机选择block2 写入rack2 时两Rack间产生数传输流量接机情况block3 重新写回 rack1时两Rack间产生次数流量Job处理数量非常者Hadoop推送数量非常时候种情况会造成 Rack间网络流量成倍升成性瓶颈进影响作业性整集群服务
    Enabling rack awareness in Hadoop is straightforward: on the NameNode machine, add the following option to the site configuration file (core-site.xml, or hadoop-site.xml in very old releases):
    topology.script.file.name = /path/to/RackAware.py
    The value points to an executable program, usually a script, that accepts one parameter and prints one value. The parameter is typically the IP address of a DataNode, and the printed value is the rack that DataNode belongs to, for example /rack1. When the NameNode starts it checks whether this option is set; if it is, rack awareness is considered enabled, and whenever the NameNode receives a heartbeat from a DataNode it passes the DataNode's IP address to the script and stores the returned rack in an in-memory map.
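    In XML form the same setting looks like the snippet below; the script path is a placeholder, and newer Hadoop releases use the key net.topology.script.file.name for the same purpose.
    <property>
        <name>topology.script.file.name</name>
        <value>/path/to/RackAware.py</value>
    </property>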
    脚编写需真实网络拓朴机架信息解清楚通该脚够机器IP址正确映射相应机架
    简单实现:
    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    import sys

    # mapping from host name / IP address to rack id
    rack = {"hadoopnode176tj": "rack1",
            "hadoopnode178tj": "rack1",
            "hadoopnode179tj": "rack1",
            "hadoopnode180tj": "rack1",
            "hadoopnode186tj": "rack2",
            "hadoopnode187tj": "rack2",
            "hadoopnode188tj": "rack2",
            "hadoopnode190tj": "rack2",
            "192.168.1.15": "rack1",
            "192.168.1.17": "rack1",
            "192.168.1.18": "rack1",
            "192.168.1.19": "rack1",
            "192.168.1.25": "rack2",
            "192.168.1.26": "rack2",
            "192.168.1.27": "rack2",
            "192.168.1.29": "rack2",
            }

    if __name__ == "__main__":
        print("/" + rack.get(sys.argv[1], "rack0"))
    没确切文档说明 底机名 ip 址会传入脚脚中兼容机名 ip 址果机房架构较复杂话脚返回:dc1rack1 类似字符串
    Make the script executable: chmod +x RackAware.py
    Then restart the NameNode. If the configuration works, the NameNode startup log contains lines such as:
    INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /rack1/192.168.1.15:50010
    六网络拓扑机器间距离
    里基网络拓扑案例介绍复杂网络拓扑中Hadoop集群台机器间距离

    机架感知NameNode 画出图示DataNode网络拓扑图D1R1 交换机底层 datanode H1
    rackidD1R1H1H1 parent R1R1 D1 rackid信息通 topologyscriptfilename 配置 rackid 信息计算出意两台DataNode间距离
    distance(/D1/R1/H1, /D1/R1/H1) = 0   the same DataNode
    distance(/D1/R1/H1, /D1/R1/H2) = 2   different DataNodes on the same rack
    distance(/D1/R1/H1, /D1/R2/H4) = 4   DataNodes on different racks in the same IDC (internet data center / machine room)
    distance(/D1/R1/H1, /D2/R3/H7) = 6   DataNodes in different IDCs
    10 HDFS理界面
    里HDFS理情况hadoop001开浏览器进入HDFS理界面输入:19216821612850070

    点击DataNodes


    YARN
    Yarn资源调度台负责运算程序提供服务器运算资源相分布式操作系统台MapReduce等运算程序相运行操作系统应程序
    基架构
      YARNResourceManagerNodeManagerApplicationMasterContainer等组件构成

    工作机制
    1)运行机制

    2)工作机制详解
    (0)MapReduce程序提交客户端节点
    (1)Yarn RunnerResourceManager申请Application
    (2)ResourceManager该应程序资源路径返回Yarn Runner
    (3)该程序运行需资源提交HDFS
    (4)程序资源提交完毕申请运行MapReduce Application Master
    (5)RM户请求初始化成Task
    (6)中NodeManager领取Task务
    (7)该NodeManager创建容器Container产生MapReduce Application Master
    (8)ContainerHDFS拷贝资源
    (9)MapReduce Application MasterResourceManager申请运行Map Task资源
    (10)ResourceManager运行Map Task务分配外两NodeManager两NodeManager分领取务创建容器
    (11)MapReduce两接收务NodeManager发送程序启动脚两NodeManager分启动Map TaskMap Task数分区排序
    (12)MapReduce Application Master等Map Task运行完毕RM申请容器运行Reduce Task
    (13)Reduce TaskMap Task获取相应分区数
    (14)程序运行完毕MapReduce会ResourceManager申请注销
    ResourceManager
    负责全局资源理务调度整集群成计算资源池关注分配应负责容错
    资源理
    1 前资源节点分成Map slotReduce slot现ContainerContainer根需运行ApplicationMasterMapReduce者意程序
    2 前资源分配静态目前动态资源利率更高
    3 Container资源申请单位资源申请格式: resourcename:机名机架名*(代表意机器) resourcerequirement:目前支持CPU存
    4 户提交作业ResourceManager然某NodeManager分配Container运行ApplicationMasterApplicationMaster根身程序需ResourceManager申请资源
    5 YARN套Container生命周期理机制ApplicationMasterContainer间理应程序定义
    务调度
    1 关注资源情况根需求合理分配资源
    2 Scheluer根申请需特定机器申请特定资源(ApplicationMaster负责申请资源时数化考虑ResourceManager量满足申请需求指定机器分配Container减少数移动)
    部结构

    · Client Service 应提交终止输出信息(应队列集群等状态信息)
    · Adaminstration Service 队列节点Client权限理
    · ApplicationMasterService 注册终止ApplicationMaster 获取ApplicationMaster资源申请取消请求异步传Scheduler 单线程处理
    · ApplicationMaster Liveliness Monitor 接收ApplicationMaster心跳消息果某ApplicationMaster定时间没发送心跳务失效资源会回收然ResourceManager会重新分配ApplicationMaster运行该应(默认尝试2次)
    · Resource Tracker Service 注册节点 接收注册节点心跳消息
    · NodeManagers Liveliness Monitor 监控节点心跳消息果长时间没收心跳消息认该节点效 时该节点Container标记成效会调度务该节点运行
    · ApplicationManager 理应程序记录理已完成应
    · ApplicationMaster Launcher 应提交负责NodeManager交互分配Container加载ApplicationMaster负责终止销毁
    · YarnScheduler 资源调度分配 FIFO(with Priority)FairCapacity方式
    · ContainerAllocationExpirer 理已分配没启Container超定时间回收
    作业提交全程
    1)作业提交程YARN

    作业提交全程详解
    (1)作业提交
    第0步:client调jobwaitForCompletion方法整集群提交MapReduce作业
    第1步:clientRM申请作业id
    第2步:RMclient返回该job资源提交路径作业id
    第3步:client提交jar包切片信息配置文件指定资源提交路径
    第4步:client提交完资源RM申请运行MrAppMaster
    (2)作业初始化
    第5步:RM收client请求该job添加容量调度器中
    第6步:某空闲NM领取该job
    第7步:该NM创建Container产生MRAppmaster
    第8步:载client提交资源
    (3)务分配
    第9步:MrAppMasterRM申请运行maptask务资源
    第10步:RM运行maptask务分配外两NodeManager两NodeManager分领取务创建容器
    (4)务运行
    第11步:MR两接收务NodeManager发送程序启动脚两NodeManager分启动maptaskmaptask数分区排序
    第12步:MrAppMaster等maptask运行完毕RM申请容器运行reduce task
    第13步:reduce taskmaptask获取相应分区数
    第14步:程序运行完毕MR会RM申请注销
    (5)进度状态更新
    YARN中务进度状态(包括counter)返回应理器 客户端秒(通mapreduceclientprogressmonitorpollinterval设置)应理器请求进度更新 展示户
    (6)作业完成
    应理器请求作业进度外 客户端5分钟会通调waitForCompletion()检查作业否完成时间间隔通mapreduceclientcompletionpollinterval设置作业完成 应理器container会清理工作状态作业信息会作业历史服务器存储备户核查
    2)作业提交程MapReduce

    资源调度器
    目前Hadoop作业调度器三种:FIFOCapacity SchedulerFair SchedulerHadoop272默认资源调度器Capacity Scheduler
    具体设置详见:yarndefaultxml文件

    <property>
        <description>The class to use as the resource scheduler.</description>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>

    1)先进先出调度器(FIFO)

    2)容量调度器(Capacity Scheduler)

    3)公调度器(Fair Scheduler)
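    To switch from the default Capacity Scheduler to the Fair Scheduler, the same yarn-site.xml property shown above can be overridden. This is a hedged sketch; the class name is the standard one shipped with YARN.
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>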

    务推测执行
    1)作业完成时间取决慢务完成时间
    作业干Map务Reduce务构成硬件老化软件Bug等某务运行非常慢
    典型案例:系统中99Map务完成少数Map老进度慢完成办?
    2)推测执行机制:
    发现拖腿务某务运行速度远慢务均速度拖腿务启动备份务时运行谁先运行完采谁结果
    3)执行推测务前提条件
    (1)task备份务
    (2)前job已完成task必须005(5)
    (3)开启推测执行参数设置Hadoop272 mapredsitexml文件中默认开

    <property>
        <name>mapreduce.map.speculative</name>
        <value>true</value>
        <description>If true, then multiple instances of some map tasks may be executed in parallel.</description>
    </property>
    <property>
        <name>mapreduce.reduce.speculative</name>
        <value>true</value>
        <description>If true, then multiple instances of some reduce tasks may be executed in parallel.</description>
    </property>

    4)启推测执行机制情况
       (1)务间存严重负载倾斜
       (2)特殊务务数库中写数
    5)算法原理

    YARNWEB UI说明
    安装完Yarn浏览器中通httpmaster8088访问YarnWEB UI图:

    详细解释图中标记1(cluster)2(Nodes)两界面中资源关信息

    面7字段信息进行解释:
    1Active Nodes:表示Yarn集群理节点数实NodeManager数集群2NodeManager
    2Memory Total:表示Yarn集群理存总存总等NodeManager理存NodeManager理存通yarnsitexml中配置进行配置:

    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1630</value>
        <description>The amount of memory managed by this NodeManager.</description>
    </property>

    配置中NodeManager理存1630MB整Yarn集群理存总1630MB * 2 3260MB约等318GBMemory Total
    3Vcores Total:表示Yarn集群理cpu虚拟核心总数等NodeManager理虚拟核心NodeManager理虚拟核心数通yarnsitexml中配置进行配置

    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>2</value>
        <description>The number of virtual cores managed by this NodeManager.</description>
    </property>

    配置中NodeManager理虚拟核心数2整Yarn集群理虚拟核心总数2 * 2 4Vcores Total
    4Scheduler Type:表示资源分配类型Hadoopyarn安装文章中说三中资源调度
    5Minimum Allocation:分配资源说务Yarn申请资源时候Yarn少会分配资源务分配存核心数分配置yarnschedulerminimumallocationmb(默认值1024MB)yarnschedulerminimumallocationvcores(默认值1)控制
    6Maximum Allocation:分配资源说务Yarn申请资源时候Yarn会分配资源务分配存核心数分配置yarnschedulermaximumallocationmb(默认值8192MB)yarnschedulermaximumallocationvcores(默认值32)控制然两值肯定集群理资源

    面Yarn集群理两NodeManager状态信息分:
    1Rack:表示NodeManager机器机架
    2Node State:表示NodeManager状态
    3Mem Used:表示NodeManager已存Mem Avail:表示NodeManager剩少存VCores Used:表示NodeManager已VCores数量VCores Avail:表示NodeManager剩少VCores数量
    点击Node Address

    进入界面:

    界面信息slave2NodeManager详细信息中Total Vmem allocated for Containers表示NodeManager理虚拟存虚拟存yarnsitexml中配置设置:

    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>4.1</value>
        <description>The ratio of virtual memory to physical memory managed by the NodeManager.</description>
    </property>

    In the configuration above, yarn.nodemanager.vmem-pmem-ratio sets the virtual-to-physical memory ratio to 4.1, meaning virtual memory is 4.1 times physical memory: 1630 MB * 4.1 = 6683 MB, roughly 6.53 GB.
    集群运行状态查


    注:般资源超配置资源话  Staday Fair Shar mem  Min Resources mem

    发生然Staday Fair Shar mem  Min Resources mem

    暂时未遇单列超max资源配情况Staday Fair Shar mem  Min Resources mem情况


    Hadoop 30 新特性
    Hadoop 30功性方面hadoop核进行项重改进包括:
    Hadoop Common
    (1)精简Hadoop核包括剔期API实现默认组件实现换成高效实现(FileOutputCommitter缺省实现换v2版废hftp转webhdfs代移Hadoop子实现序列化库orgapachehadoopRecords
    (2)Classpath isolation防止版jar包突google Guava混合HadoopHBaseSpark时容易产生突(httpsissuesapacheorgjirabrowseHADOOP11656)
    (3)Shell脚重构 Hadoop 30Hadoop理脚进行重构修复量bug增加新特性支持动态命令等httpsissuesapacheorgjirabrowseHADOOP9902
    Hadoop HDFS
    (1)HDFS支持数擦编码HDFS降低性前提节省半存储空间(httpsissuesapacheorgjirabrowseHDFS7285)
    (2)NameNode支持支持集群中activestandby namenode部署方式注:ResourceManager特性hadoop 20中已支持(httpsissuesapacheorgjirabrowseHDFS6440)
    Hadoop MapReduce
    (1)Tasknative优化MapReduce增加CC++map output collector实现(包括SpillSortIFile等)通作业级参数调整切换该实现shuffle密集型应性提高约30(httpsissuesapacheorgjirabrowseMAPREDUCE2841)
    (2)MapReduce存参数动推断Hadoop 20中MapReduce作业设置存参数非常繁琐涉两参数:mapreduce{mapreduce}memorymbmapreduce{mapreduce}javaopts旦设置合理会存资源浪费严重前者设置4096MB者Xmx2g剩余2g实际法java heap(httpsissuesapacheorgjirabrowseMAPREDUCE5785)
    Hadoop YARN
    (1)基cgroup存隔离IO Disk隔离(httpsissuesapacheorgjirabrowseYARN2619)
    (2)curator实现RM leader选举(httpsissuesapacheorgjirabrowseYARN4438)
    (3)containerresizing(httpsissuesapacheorgjirabrowseYARN1197)
    (4)Timelineserver next generation (httpsissuesapacheorgjirabrowseYARN2928)
    hadoop30新参数
    hadoop30
    HADOOP
    Move to JDK8+
    Classpath isolation on by default HADOOP11656
    Shell script rewrite HADOOP9902
    Move default ports out of ephemeral range HDFS9427
    HDFS
    Removal of hftp in favor of webhdfs HDFS5570
    Support for more than two standby NameNodes HDFS6440
    Support for Erasure Codes in HDFS HDFS7285
    YARN
    MAPREDUCE
    Derive heap size or mapreduce*memorymb automatically MAPREDUCE5785
    HDFS7285中实现Erasure Coding新功鉴功远没发布阶段面块相关代码会进行进步改造做谓预分析帮助家提前解Hadoop社区目前实现功前没接触Erasure Coding技术中间程确实偶然相信文带家收获
    Erasure coding (EC) is a data-protection technique that originated in the communications industry as an error-tolerant encoding for recovering data lost in transmission. New parity data is computed from the original data so that the pieces become correlated, and within certain limits corrupted or missing data can be reconstructed. Conceptually, the original data is split into n data blocks and m parity blocks are added.
    The parity blocks together with the data blocks in one row form a stripe, so each stripe consists of n data blocks and m parity blocks, and any block can be recomputed from the surviving blocks of its stripe.
    If a parity block is lost, it is simply re-encoded from the original data blocks; if a data block is lost, it is decoded (reconstructed) from the remaining data and parity blocks.
    The values of m and n are not fixed and can be tuned. The underlying principle is simple linear algebra: the blocks form a matrix, and because the encoding matrix is invertible, the lost data can be recovered by matrix operations.
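    In Hadoop 3 this feature is exposed through the hdfs ec sub-command. The following is a hedged sketch of typical usage; the directory /cold and the built-in policy name RS-10-4-1024k (Reed-Solomon 10+4, matching the RS(10,4) scheme mentioned later in this section) are examples, so check hdfs ec -listPolicies on your own cluster first.
    hdfs ec -listPolicies
    hdfs ec -enablePolicy -policy RS-10-4-1024k
    hdfs ec -setPolicy -path /cold -policy RS-10-4-1024k
    hdfs ec -getPolicy -path /cold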
    总结
    Hadoop 30alpha版预计2016夏天发布GA版11月12月发布
    Hadoop 30中引入重功优化包括HDFS 擦编码Namenode支持MR Native Task优化YARN基cgroup存磁盘IO隔离YARN container resizing等

     

     

     

     
            相前生产发布版Hadoop 2Apache Hadoop 3整合许重增强功 Hadoop 3版提供稳定性高质量API实际产品开发面简介绍Hadoop3变化
    · 低Java版求Java7变Java8
        Hadoopjar基Java 8运行版进行编译执行Java 7更低Java版户需升级Java 8
    · HDFS支持纠删码(erasure coding)
        纠删码种副存储更节省存储空间数持久化存储方法ReedSolomon(104)标准编码技术需14倍空间开销标准HDFS副技术需3倍空间开销纠删码额外开销重建远程读写通常存储常数(冷数)外新特性时户需考虑网络CPU开销
    · YARN时间线服务 v2(YARN Timeline Service v2)
        YARN Timeline Service v2应两挑战:(1)提高时间线服务扩展性性(2)通引入流(flow)聚合(aggregation)增强性代Timeline Service v1xYARN Timeline Service v2 alpha 2提出样户开发者进行测试提供反馈建议YARN Timeline Service v2测试容器中
    · 重写Shell脚
        Hadoopshell脚重写修补许长期存bug增加新特性
    · 覆盖客户端jar(Shaded client jars)
        2x版中hadoopclient Maven artifact配置会拉取hadoop传递赖hadoop应程序环境变量回带传递赖版应程序版相突问题
        HADOOP11804 添加新 hadoopclientapihadoopclientruntime artifcathadoop赖隔离单Jar包中避免hadoop赖渗透应程序类路径中
    · 支持Opportunistic ContainersDistributed Scheduling
        ExecutionType概念引入样应够通Opportunistic执行类型请求容器调度时没资源种类型容器会分发NM中执行程序种情况容器放入NM队列中等资源便执行Opportunistic container优先级默认Guaranteedcontainer低需情况资源会抢占便Guaranteed container样需提高集群率
        Opportunistic container默认中央RM分配目前已增加分布式调度器支持该分布式调度器做AMRProtocol解析器实现
    · MapReduce务级优化
        MapReduce添加映射输出收集器化实现支持密集型洗牌操作(shuffleintensive)jobs带30性提升
    · 支持余2NameNodes
        针HDFS NameNode高性初实现方式提供活跃(active)NameNode备(Standby)NameNode通3JournalNode法定数量复制编辑种架构够系统中节点障进行容错
        该功够通运行更备NameNode提供更高容错性满足部署需求通配置3NameNode5JournalNode集群够实现两节点障容错
    · 修改重服务默认端口
        前Hadoop版中重Hadoop服务默认端口Linux时端口范围容(3276861000)意味着启动程中服务器端口突会启动失败突端口已时端口范围移NameNodeSecondary NameNodeDataNodeKMS会受影响文档已做相应修改通阅读发布说明 HDFS9427HADOOP12811详细解修改端口
    · 提供文件系统连接器(filesystem connnector)支持Microsoft Azure Data LakeAliyun象存储系统
        Hadoop支持Microsoft Azure Data LakeAliyun象存储系统集成作Hadoop兼容文件系统
    · 数节点置衡器(Intradatanode balancer)
        单DataNode理磁盘情况执行普通写操作时磁盘量较均添加者更换磁盘时会导致DataNode磁盘量严重均衡目前HDFS均衡器关注点DataNode间(inter)intra处理种均衡情况
        hadoop3 中通DataNode部均衡功已处理述情况通hdfs diskbalancer ClI调
    · 重写守护进程务堆理机制
        针Hadoop守护进程MapReduce务堆理机制Hadoop3 做系列修改
        HADOOP10950 引入配置守护进程堆新方法特HADOOP_HEAPSIZE配置方式已弃根机存进行动调整
        MAPREDUCE5785 简化MAP配置减少务堆需务配置Java选项中明确指出需堆已明确指出堆现配置会受该改变影响
    · S3GuradS3A文件系统客户端提供致性元数缓存
        HADOOP13345 亚马逊S3存储S3A客户端提供选特性:够DynamoDB表作文件目录元数快速致性存储
    · HDFS基路器互联(HDFS RouterBased Federation)
        HDFS RouterBased Federation添加RPC路层HDFS命名空间提供联合视图现ViewFsHDFS Federation功类似区通服务端理表加载原客户端理简化现存HDFS客户端接入federated cluster操作
    · 基API配置Capacity Scheduler queue configuration
        OrgQueue扩展capacity scheduler提供种编程方法该方法提供REST API修改配置户通远程调修改队列配置样队列administer_queue ACL理员实现动化队列配置理
    · YARN资源类型
        Yarn资源模型已般化支持户定义计算资源类型仅仅CPU存集群理员定义GPU数量软件序列号连接存储资源然Yarn务够资源进行调度


    Hive部表外部表区
    未external修饰部表(managed table)external修饰外部表(external table)
    区:
    部表数Hive身理外部表数HDFS理
    部表数存储位置hivemetastorewarehousedir(默认:userhivewarehouse)外部表数存储位置制定(果没LOCATIONHiveHDFSuserhivewarehouse文件夹外部表表名创建文件夹属表数存放里)
    删部表会直接删元数(metadata)存储数删外部表仅仅会删元数HDFS文件会删
    部表修改会修改直接步元数外部表表结构分区进行修改需修复(MSCK REPAIR TABLE table_name)
    进行试验进行理解
    概念理解
    创建部表t1
    create table t1(
        id      int,
        name    string,
        hobby   array<string>,
        add     map<string,string>
    )
    row format delimited
    fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':';


    2 查表描述:desc t1

    装载数(t1)
    Note: insert (or insert overwrite) statements are rarely used here, because inserting rows one at a time triggers a MapReduce job per statement; Load Data is used instead:
    LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2, ...)]
    创建文件粘贴述记录载图:

    文件容
    1,xiaoming,book-TV-code,beijing:chaoyang-shagnhai:pudong
    2,lilei,book-code,nanjing:jiangning-taiwan:taibei
    3,lihua,music-book,heilongjiang:haerbin
    Then load it:
    load data local inpath '/home/hadoop/Desktop/data' overwrite into table t1;
    Do not forget the file name "data" at the end of the path; the author once uploaded the whole Desktop directory by mistake, and every queried column came back as null or garbage.
    查表容:
    select * from t1

    创建外部表t2
    create external table t2(
        id      int,
        name    string,
        hobby   array<string>,
        add     map<string,string>
    )
    row format delimited
    fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':'
    location '/user/t2';


    装载数(t2)
    load data local inpath '/home/hadoop/Desktop/data' overwrite into table t2;

    查文件位置
    图NameNode50070explorerhtml#user目录t2文件

    t1呢?前配置默认路径里

    样通命令行获两者位置信息:
    desc formatted table_name


    注:图中managed table部表external table外部表
    ##分删部表外部表
    面分删部表外部表查区

    观察HDFS文件
    发现t1已存

    t2然存

    外部表仅仅删元数
    重新创建外部表t2
    create external table t2(
        id      int,
        name    string,
        hobby   array<string>,
        add     map<string,string>
    )
    row format delimited
    fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':'
    location '/user/t2';


    里面插入数select * 结果

    见数然
    官网解释
    官网中关external表介绍:
    A table created without the EXTERNAL clause is called a managed table because Hive manages its data
    Managed and External Tables
    By default Hive creates managed tables where files metadata and statistics are managed by internal Hive processes A managed table is stored under the hivemetastorewarehousedir path property by default in a folder path similar to appshivewarehousedatabasenamedbtablename The default location can be overridden by the location property during table creation If a managed table or partition is dropped the data and metadata associated with that table or partition are deleted If the PURGE option is not specified the data is moved to a trash folder for a defined duration
    Use managed tables when Hive should manage the lifecycle of the table or when generating temporary tables
    An external table describes the metadata schema on external files External table files can be accessed and managed by processes outside of Hive External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations If the structure or partitioning of an external table is changed an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information
    Use external tables when files are already present or in remote locations and the files should remain even if the table is dropped
    Managed or external tables can be identified using the DESCRIBE FORMATTED table_name command which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on table type
    Statistics can be managed on internal and external tables and partitions for query optimization
    Hive官网介绍:
    httpscwikiapacheorgconfluencedisplayHiveLanguageManual+DDL#LanguageManualDDLDescribeTableViewColumn
    Hive数仓库拉链表流水表全量表增量表
    1 全量表:天新状态数
    2 增量表:天新增数增量数次导出新数
    3 拉链表:维护历史状态新状态数种表拉链表根拉链粒度实际相快做优化部分变记录已通拉链表方便原出拉链时点客户记录
    4 流水表: 表修改会记录反映实际记录变更

    拉链表通常账户信息历史变动进行处理保留结果流水表天交易形成历史
    流水表统计业务相关情况拉链表统计账户客户情况
    数仓库拉链表(原理设计Hive中实现)

    情况保持历史状态需拉链表做样做目保留状态情况节省空间

    拉链表适种情况吧

    数量点表中某字段变化呢变化频率高业务需求呢需统计种变化状态天全量份呢点太现实

    仅浪费存储空间时业务统计点麻烦时拉链表作提现出节省空间满足需求

    般数仓中通增加begin_dateen_date表示例两列start_dateend_date

    1 20160820 20160820 创建 20160820 20160820
    1 20160820 20160821 支付 20160821 20160821
    1 20160820 20160822 完成 20160822 99991231
    2 20160820 20160820 创建 20160820 20160820
    2 20160820 20160821 完成 20160821 99991231
    3 20160820 20160820 创建 20160820 20160821
    3 20160820 20160822 支付 20160822 99991231
    4 20160821 20160821 创建 20160821 20160821
    4 20160821 20160822 支付 20160822 99991231
    5 20160822 20160822 创建 20160822 99991231
    begin_date表示该条记录生命周期开始时间end_date表示该条记录生命周期结束时间

    end_date 99991231’表示该条记录目前处效状态

    To query the currently valid records: select * from order_his where dw_end_date = '99991231';

    To query the historical snapshot as of 20160821: select * from order_his where begin_date <= '20160821' and end_date >= '20160821';

    简单介绍拉链表更新:

    假设天维度天状态天终状态

    张订单表例原始数天订单状态明细

    1 20160820 20160820 创建
    2 20160820 20160820 创建
    3 20160820 20160820 创建
    1 20160820 20160821 支付
    2 20160820 20160821 完成
    4 20160821 20160821 创建
    1 20160820 20160822 完成
    3 20160820 20160822 支付
    4 20160821 20160822 支付
    5 20160822 20160822 创建
    根拉链表希


    1 20160820 20160820 创建 20160820 20160820
    1 20160820 20160821 支付 20160821 20160821
    1 20160820 20160822 完成 20160822 99991231
    2 20160820 20160820 创建 20160820 20160820
    2 20160820 20160821 完成 20160821 99991231
    3 20160820 20160820 创建 20160820 20160821
    3 20160820 20160822 支付 20160822 99991231
    4 20160821 20160821 创建 20160821 20160821
    4 20160821 20160822 支付 20160822 99991231
    5 20160822 20160822 创建 20160822 99991231
    出 1234订单状态统计前效状态

    例hive例考虑实现性关

    首先创建表

    CREATE TABLE orders (
        orderid INT,
        createtime STRING,
        modifiedtime STRING,
        status STRING
    ) row format delimited fields terminated by '\t';


    CREATE TABLE ods_orders_inc (
        orderid INT,
        createtime STRING,
        modifiedtime STRING,
        status STRING
    ) PARTITIONED BY (day STRING)
    row format delimited fields terminated by '\t';


    CREATE TABLE dw_orders_his (
        orderid INT,
        createtime STRING,
        modifiedtime STRING,
        status STRING,
        dw_start_date STRING,
        dw_end_date STRING
    ) row format delimited fields terminated by '\t';
    首先全量更新先20160820止数

    初始化先20160820数初始化进

    INSERT overwrite TABLE ods_orders_inc PARTITION (day = '20160820')
    SELECT orderid, createtime, modifiedtime, status
    FROM orders
    WHERE createtime < '20160821' and modifiedtime < '20160821';
    刷dw中

    INSERT overwrite TABLE dw_orders_his
    SELECT orderid, createtime, modifiedtime, status,
        createtime AS dw_start_date,
        '99991231' AS dw_end_date
    FROM ods_orders_inc
    WHERE day = '20160820';

    结果

    select * from dw_orders_his
    OK
    1 20160820 20160820 创建 20160820 99991231
    2 20160820 20160820 创建 20160820 99991231
    3 20160820 20160820 创建 20160820 99991231
    剩余需进行增量更新


    INSERT overwrite TABLE ods_orders_inc PARTITION (day = '20160821')
    SELECT orderid, createtime, modifiedtime, status
    FROM orders
    WHERE (createtime = '20160821' and modifiedtime = '20160821') OR modifiedtime = '20160821';

    select * from ods_orders_inc where day = '20160821';
    OK
    1 20160820 20160821 支付 20160821
    2 20160820 20160821 完成 20160821
    4 20160821 20160821 创建 20160821
    先放增量表中然进行关联张时表中插入新表中


    DROP TABLE IF EXISTS dw_orders_his_tmp;
    CREATE TABLE dw_orders_his_tmp AS
    SELECT orderid,
        createtime,
        modifiedtime,
        status,
        dw_start_date,
        dw_end_date
    FROM (
        SELECT a.orderid,
            a.createtime,
            a.modifiedtime,
            a.status,
            a.dw_start_date,
            CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '20160821' THEN '20160821' ELSE a.dw_end_date END AS dw_end_date
        FROM dw_orders_his a
        left outer join (SELECT * FROM ods_orders_inc WHERE day = '20160821') b
        ON (a.orderid = b.orderid)
        UNION ALL
        SELECT orderid,
            createtime,
            modifiedtime,
            status,
            modifiedtime AS dw_start_date,
            '99991231' AS dw_end_date
        FROM ods_orders_inc
        WHERE day = '20160821'
    ) x
    ORDER BY orderid, dw_start_date;

    INSERT overwrite TABLE dw_orders_his
    SELECT * FROM dw_orders_his_tmp;
    根面步骤20160822号数更新进结果


    select * from dw_orders_his
    OK
    1 20160820 20160820 创建 20160820 20160820
    1 20160820 20160821 支付 20160821 20160821
    1 20160820 20160822 完成 20160822 99991231
    2 20160820 20160820 创建 20160820 20160820
    2 20160820 20160821 完成 20160821 99991231
    3 20160820 20160820 创建 20160820 20160821
    3 20160820 20160822 支付 20160822 99991231
    4 20160821 20160821 创建 20160821 20160821
    4 20160821 20160822 支付 20160822 99991231
    5 20160822 20160822 创建 20160822 99991231
    想数

    值注意订单表中数天次状态更新应天状态天终状态天订单状态创建支付完成应拉取终状态进行拉练表更新否面数会出现异常
    1 6 20160822 20160822 创建 20160822 99991231
    2 6 20160822 20160822 支付 20160822 99991231
    3 6 20160822 20160822 完成 20160822 99991231


    Hadoop 320 完全分布式集群搭建
    集群环境搭建
    首先准备4台服务器(虚拟机)
    设置静态ip址映射 centos7 修改静态ip设置址映射
    址映射

    然设置集群SSH免密登录分发脚 centos7配置集群SSH免密登录(包含群发文件脚)
    果防火墙记关闭防火墙
    systemctl stop firewalld.service
    systemctl disable firewalld.service
    载 hadoop312 分传4台服务器 roothadoop 目录
    址 httparchiveapacheorgdisthadoopcommonhadoop320hadoop320targz
    传 4台设备 分执行解压 重命名
    cd /root/hadoop
    tar -zxvf /root/hadoop/hadoop-3.2.0.tar.gz
     
    安装JDK18 jdk载解压 rootjdk 重命名 jdk8
    载址 
    httpswwworaclecomtechnetworkjavajavasedownloadsjdk8downloads2133151html

    编辑 etcprofile 设置环境变量
    vim /etc/profile
    Add the following after the line "export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL":
    export JAVA_HOME=/root/jdk/jdk8
    export JRE_HOME=/root/jdk/jdk8/jre
    export HADOOP_HOME=/root/hadoop/hadoop-3.2.0
    PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:/root/bin
    export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
     
    Reload the profile file:
    source /etc/profile
    Test that the environment variables took effect:
    java -version
    hadoop version

    二Hadoop配置修改
    Enter the Hadoop configuration directory:
    cd /root/hadoop/hadoop-3.2.0/etc/hadoop
    修改 hadoopenvsh 配置 jdk 路径定义集群操作户
    面增加
    export JAVA_HOME=/root/jdk/jdk8

    export HDFS_NAMENODE_USER=root
    export HDFS_DATANODE_USER=root
    export HDFS_SECONDARYNAMENODE_USER=root
    export YARN_RESOURCEMANAGER_USER=root
    export YARN_NODEMANAGER_USER=root

    export HADOOP_PID_DIR=/root/hadoop/data/pids
    export HADOOP_LOG_DIR=/root/hadoop/data/logs
     
    修改 coresitexml hadoop核心配置
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://hadoop1:8020</value>
        </property>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/root/hadoop/data/tmp</value>
        </property>
    </configuration>
    · fs.defaultFS is the NameNode address; hadoop.tmp.dir is Hadoop's temporary directory.
     
    修改 hdfssitexml hadoop 节点配置
    <configuration>
        <property>
            <name>dfs.namenode.http-address</name>
            <value>hadoop1:9870</value>
        </property>
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>hadoop2:50090</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>2</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/root/hadoop/data/hdfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/root/hadoop/data/hdfs/data</value>
        </property>
    </configuration>
    · dfs.replication: the number of replicas
    · dfs.namenode.secondary.http-address: the HTTP address and port of the SecondaryNameNode
    · Here hadoop2 is used as the SecondaryNameNode server.
     
    修改 workers 告知 hadoop hdfsDataNode节点
    1 hadoop2
    2 hadoop3
    3 hadoop4
     
    修改 yarnsitexml 配置yarn服务
    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.nodemanager.localizer.address</name>
            <value>0.0.0.0:8140</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>hadoop1</value>
        </property>
        <property>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>hadoop1:8088</value>
        </property>
        <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
        </property>
        <property>
            <name>yarn.log-aggregation.retain-seconds</name>
            <value>604800</value>
        </property>
        <property>
            <name>yarn.log.server.url</name>
            <value>http://hadoop4:19888/jobhistory/logs</value>
        </property>
    </configuration>
    · yarn.resourcemanager.webapp.address: address and port of the ResourceManager web UI
    · yarn.resourcemanager.hostname: the ResourceManager server
    · yarn.log-aggregation-enable: whether log aggregation is enabled
    · yarn.log-aggregation.retain-seconds: how long aggregated logs are kept in HDFS
    · yarn.log.server.url: the address of the YARN log server
    修改mapredsitexml 文件
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>yarn.app.mapreduce.am.env</name>
            <value>HADOOP_MAPRED_HOME=/root/hadoop/hadoop-3.2.0</value>
        </property>
        <property>
            <name>mapreduce.map.env</name>
            <value>HADOOP_MAPRED_HOME=/root/hadoop/hadoop-3.2.0</value>
        </property>
        <property>
            <name>mapreduce.reduce.env</name>
            <value>HADOOP_MAPRED_HOME=/root/hadoop/hadoop-3.2.0</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.address</name>
            <value>hadoop4:10020</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>hadoop4:19888</value>
        </property>
    </configuration>
    · yarnappmapreduceamenv   mapreducemapenv   mapreducereduceenv 
    · 三mapreduce指定hadoop目录 果配置会出现  运行mapreduce找main方法等错误
    · mapreducejobhistoryaddress  配置务历史服务器址
    · mapreducejobhistorywebappaddress 配置历史服务器web访问址
     
    修改配置文件 分发 节点三台服务器
    前目录 执行xsync分发脚
    1 xsync hadoopenvsh
    2 xsync coresitexml
    3 xsync hdfssitexml
    4 xsync workers
    5 xsync yarnsitexml
    6 xsync mapredsitexml
     
    配置完成
    三Hadoop服务启动
    hadoop1节点执行namenode初始格式化命令 (仅第次启动需执行)
    hdfs namenode -format

    执行成功生成目录
    cd /root/hadoop/data/dfs/name


    生成集群唯id说明执行成功
    hadoop1执行命令
    start-dfs.sh
    start-yarn.sh
    Or run both with:
    start-all.sh
    On hadoop4, start the job history service:
    mapred --daemon start historyserver
    执行完成分4台设备jps查进程

    启动成功 
    · hadoop1  NameNode ResourceManager 节点
    · hadoop2  SecondaryNameNode DataNode NodeManager 节点
    · hadoop3  DataNode NodeManager 节点
    · hadoop4  DataNode NodeManager JobHistoryServer 节点
    查HDFS Web界面 httphadoop19870

    查 YARN web界面  httphadoop18088 

    hadoop搭建完成
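    As an optional sanity check once the cluster is up, the following commands summarize HDFS and YARN state from the command line (a hedged suggestion, not part of the original walkthrough):
    hdfs dfsadmin -report
    yarn node -list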
    四运行WordCount
    首先 root 目录创建txt文件
    vim roottesttxt

    1 hadoop 1
    2 hadoop 2
    3 hadoop 3
    4 hadoop 4
    5 hadoop 5
    6 hadoop 6
    7 hadoop 7
    8 hadoop 8
    9 hadoop 9
    10 hadoop 10
    testtxt文件传 hdfs 执行命令
    hdfs dfs -mkdir -p /user/root
    hdfs dfs -put /root/test.txt /user/root
    找官方带案例jar
    cd roothadoophadoop320sharehadoopmapreduce

    运行jar执行MapReduce WordCount案例
    hadoop jar hadoop-mapreduce-examples-3.2.0.jar wordcount /user/root/test.txt /root/output
    · The first path after wordcount is the input file path.
    · The second path is the output path for the results (it must not already exist).

    After the job succeeds, check the results:
    hdfs dfs -ls -R /root/output

    success执行成功
    partr00000  m mapper 输出 r reduce 输出 00000 job 务编号 整文件结果输出文件
    hdfs dfs -cat /root/output/part-r-00000

    文件中 hadoop 词出现 10次 结果正确

    Linux台载MySQL
    1 官网载安装包
    载链接:点击开链接
    httpsdevmysqlcomdownloadsmysql

    果系统32位选择第64位选择第二
    wget 载
    wget httpsdevmysqlcomgetDownloadsMySQL80mysql8011linuxglibc212i686targz
    解压文件
    tar zxvf mysql8011linuxglibc212i686targz
    2  移动压缩包usrlocal目录重命名文件
    mv rootmysql8011linuxglibc212i686  usrlocalmysql
    3MySQL根目录新建文件夹data存放数
    mkdir data
    4创建 mysql 户组 mysql 户
    1 groupadd mysql
    2
    3 useradd g mysql mysql
     
    5改变 mysql 目录权限
    1 chown R mysqlmysql usrlocalmysql
    2
    3 者
    4
    5 chown R mysql
    6
    7 chgrp R mysql
    注意点
     
    6初始化数库
    创建mysql_install_db安装文件
    1 mkdir mysql_install_db
    2 chmod 777 mysql_install_db
    初始化 
    binmysqld initialize usermysql basedirusrlocalmysql datadirusrlocalmysqldata 初始化数库

    usrlocalmysqlbinmysqld initialize usermysql
    1 usrlocalmysqlbinmysqld initialize usermysql
    2
    3 usrlocalmysqlbinmysqld (mysqld 8011) initializing of server in progress as process 5826
    4
    5 [Server] A temporary password is generated for root@localhost twiTlsi<0O
    6
    7 usrlocalmysqlbinmysqld (mysqld 8011) initializing of server has completed
    记录时密码:
       twiTlsi<0O
     
    里遇问题没libnumaso1
     
    zsh command not found mysqld
     binmysqld initialize
    binmysqld error while loading shared libraries libnumaso1 cannot open shared object file No such file or directory
    20180429 170630 [WARNING] mysql_install_db is deprecated Please consider switching to mysqld initialize
    20180429 170630 [ERROR]   Can't locate the language directory
    需安装 libnuma
    1 yum install libnuma
    2
    3 yum y install  numactl
    4
    5 yum install libaio1 libaiodev
    安装文件
     
     7mysql配置
    cp usrlocalmysqlsupportfilesmysqlserver etcinitdmysqld
    修改mycnf文件
    vim  etcmycnf
    1
    2 [mysqld]
    3 basedir usrlocalmysql
    4 datadir usrlocalmysqldata
    5 socket usrlocalmysqlmysqlsock
    6 charactersetserverutf8
    7 port 3306
    8 sql_modeNO_ENGINE_SUBSTITUTIONSTRICT_TRANS_TABLES
    9 [client]
    10 socket usrlocalmysqlmysqlsock
    11 defaultcharactersetutf8
    esc保存
    wq 退出
     
    8建立MySQL服务
    cp a supportfilesmysqlserver etcinitdmysqld
    1 cp mysqlserver etcinitdmysql
    2 chmod +x etcinitdmysql
    添加系统服务
    chkconfig add mysql
    cp a supportfilesmysqlserver etcinitdmysqld
     chmod +x etcrcdinitdmysqld    
    chkconfig add mysqld
    检查服务否生效  
    chkconfig  list mysqld
    9 配置全局环境变量
    编辑 etcprofile 文件
    # vi etcprofile
    profile 文件底部添加两行配置保存退出
    export PATHPATHusrlocalmysqlbinusrlocalmysqllib
    export PATH
    设置环境变量立生效
     source etcprofile
    10启动MySQL服务
    service mysql start
    查初始密码
    cat rootmysql_secret
    11登录MySQL
    mysql uroot p密码
    修改密码:
    SET PASSWORD FOR 'root'@'localhost' = PASSWORD('123456');   # replace with your own password

    12设置远程登录
    mysql> use mysql;
    mysql> update user set host='%' where user='root' limit 1;
    -- refresh privileges
    mysql> flush privileges;
    Then check whether port 3306 is open:
    netstat -nupl | grep 3306
    Open port 3306:
    firewall-cmd --permanent --add-port=3306/tcp
    Reload the firewall:
    firewall-cmd --reload

    yum安装MySQL
    安装环境:AliyunLinux(阿里linux系统64位)
    cat etcosrelease

    getconf LONG_BIT


    首先系统中没带mysql东西先删掉
    查:
     
    find name mysql
    删:
     
    rm rf 边查找路径路径空格隔开
    #者边条命令
     
    find namemysql|xargs rm rf

    开始安装
    rpm Uvh httpsrepomysqlcommysql57communityreleaseel711noarchrpm

    yum enablerepomysql80community install mysqlcommunityserver

    步开始询问选择概意思:

    总371M否载?
    输入y然回车

    概意思文件中检索密钥MySQL导入GPG问否OK?(英文谅解)
    输入y然回车

    Complete 完成
    查mysql状态:
    service mysqld start

    接需查mysql创建默认密码首次登陆配置mysql时需
    grep A temporary password varlogmysqldlog

    mysql默认密码开始配置mysql
    mysql_secure_installation





    登陆数库:mysql u root p

    功告成咯
    需提醒阿里云版系统防火墙默认关闭设置果需外连接数库话记检查阿里云服务器安全组里否开放数库默认端口3306
    然进入mysql库中修改update user set host'' where user'root'
    sqlyog等工具连接数库
    坑:
    sqlyog连接数库时出现错误提示:Authentication plugin caching_sha2_password’ cannot be loaded
    客户端支持caching_sha2_password种密码加密方式
    需修改密码老版密码验证方式
    登陆数库进入mysql库
    update user set host='%' where user='root';
    Restart: service mysqld restart
    ALTER USER 'root'@'%' IDENTIFIED WITH mysql_native_password BY '<new password>';
    Restart again: service mysqld restart

    里 Abc123456a 新密码

    修改退出sqlyog连接试试?

    连接成功

    Hive环境搭建
    前提:
    1 安装Hive前求先预装:
    2 安装JDK 8
    3 安装Hadoop277
    4 安装MySQL
    安装
    1 载hive解压缩户目录:
    1 tar xzvf apachehive236bintargz
    2 改名:
    3 mv apachehive236bin hive
    2 设置环境变量:
    二配置理
    首先进入conf目录带template缀文件移缀
    中hivedefaultxml移缀需修改名hivesitexml

    1 通方法Hive进行配置:
    11 修改hiveenvsh
    1 cp hiveenvshtemplate hiveenvsh
    2 Hive Hadoop 需 hiveenvsh 文件中指定 Hadoop 安装路径:
    3
    4 vim hiveenvsh
    5
    6 开配置文件中添加行:
    7
    export JAVA_HOME=/usr/local/hadoop/jdk1.8.0_221
    export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HIVE_HOME=/usr/local/hive
    export HIVE_CONF_DIR=$HIVE_HOME/conf
    export HIVE_AUX_JARS_PATH=$HIVE_HOME/lib
    12 修改hivelog4j2properties配置hivelog
    1 cp hivelog4j2propertiestemplate hivelog4j2properties
    2
    3 vim confhivelog4j2properties
    4
    5 配置面参数(果没logs目录hive根目录创建):
    6
    7 propertyhivelogdirusrlocalhivelogs
    13 usrlocalhive215新建tmp目录tmp新建hive目录
    1 cd usrlocalhive
    2 mkdir tmp
    3 mkdir tmphive
    14 修改hivesitexml
    1 cp hivedefaultxmltemplate hivesitexml
    2
    3 hivesitexml文件中:
    4
    5 {systemjavaiotmpdir}换成homehduserhivetmp
    6
    7 {systemusername}换1921688101 节点名
    2) hivesitexml 中配置 MySQL 数库连接信息:
    面配置信息需改写出需文件弄外部ctrl+f进行搜索应里数然进行修改
    1
    2
    3
    4 < 设置面属性 >
    5
    6 hiveexecscratchdir
    7 tmphive
    8
    9
    10
    11 hiveexeclocalscratchdir
    12 usrlocalhivetmphive
    13 Local scratch space for Hive jobs
    14
    15
    16
    17 hivedownloadedresourcesdir
    18 usrlocalhivetmp{hivesessionid}_resources
    19 Temporary local directory for added resources in the remote file system
    20
    21
    22
    23 hivequeryloglocation
    24 usrlocalhivetmphive
    25 Location of Hive run time structured log file
    26
    27
    28
    29 hiveauxjarspath
    30 usrlocalhivelibusrlocalhivejdbc
    31 These JAR file are available to all users for all jobs
    32
    33
    34 hivemetastorewarehousedir
    35 hdfs 19216881019000userhivewarehouse
    36 相fsdefaultname关目录理表存储位置
    37
    38
    39 <配置Hive Metastore>
    40
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.8.101:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <!-- newer driver versions use com.mysql.cj.jdbc.Driver -->
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123</value>
        <!-- the MySQL password used here -->
    </property>
    59
    60 <配置hiveserver2机(里配置ip址便Windows连接)>
    61
    62 hiveserver2thriftbindhost
    63 1921688101
    64 Bind host on which to run the HiveServer2 Thrift service
    65
    66
    67 <配置beeline远程客户端连接时户名密码户名应hadoop配置文件coresitexml中配置>
    68
    69 hiveserver2thriftclientuser
    70 1921688101
    71 Username to use against thrift client default is 'anonymous'
    72
    73
    74
    75 hiveserver2thriftclientpassword
    76 123 里机户密码
    77 Password to use against thrift client default is 'anonymous'
    78
    79
    80 < 配置面两属性配置 hive 2x web ui >
    81
    82 hiveserver2webuihost
    83 1921688101
    84
    85 < 重启HiveServer2访问http172162121710002 >
    86
    14 配置Hive Metastore
    1 默认情况 Hive元数保存嵌derby数库里 般情况生产环境MySQL存放Hive元数
    2 mysqlconnectorjavaxxxjar 放入 HIVE_HOMElib (mysql jdbc驱动程序)
    里注意mysql版定mysqlconnectorjavaxxxjar版低然会报错兼容(里重时卡久问题时候问题)
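    Once HiveServer2 is running, a remote connection can be smoke-tested with beeline. This is a hedged example that reuses the host 192.168.8.101 configured above; the user and password are placeholders for the credentials set in hive-site.xml.
    hive --service hiveserver2 &
    beeline -u "jdbc:hive2://192.168.8.101:10000" -n <user> -p <password>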
    三运行
    1 运行Hive CLI
    命令行运行hive命令时必须保证HDFS已启动startdfssh启动HDFS
    (特说明: Hive 21 版开始 第次运行hive前需先运行schematool命令执行初始化操作)
    1 果MySQL数库:
    先启动mysql服务器:执行
    systemctl enable mysqldservice
    · 执行初始化操作
    schematool initSchema dbType mysql
    执行成功查MySQL中元数库hive否已创建成功
    2 果derby数库:
    schematool initSchema dbType derby
    进入hive命令行:
    hive
    (hive service metastore &)
    show tables 显示表
    1 hive> show tables
    2
    3 退出hive
    4 hive> quit
    创建数库然创建表
    1 hive> drop table chun
    2 OK
    3 Time taken 1125 seconds
    4 hive> create database chun
    5 OK
    6 Time taken 0099 seconds
    7 hive> use chun
    8 OK
    9 Time taken 0024 seconds
    ·
    面进入HDFS web端查Hive仓库
    浏览器输入:19216822813850070刚创建表

    Apache Mahout环境搭建
    1载解压Mahout
    httparchiveapacheorgdistmahout
    tar zxvf mahoutdistribution09targz

    2配置环境变量
    # set mahout environment
    export MAHOUT_HOME=/mnt/jediael/mahout/mahout-distribution-0.9
    export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
    export PATH=$MAHOUT_HOME/conf:$MAHOUT_HOME/bin:$PATH

    3安装mahout
    [jediael@master mahoutdistribution09] pwd
    mntjediaelmahoutmahoutdistribution09
    [jediael@master mahoutdistribution09] mvn install

    4验证Mahout否安装成功
        执行命令mahout列出算法成功:
    [jediael@master mahoutdistribution09] mahout
    Running on hadoop using mntjediaelhadoop121binhadoop and HADOOP_CONF_DIR
    MAHOUTJOB mntjediaelmahoutmahoutdistribution09examplestargetmahoutexamples09jobjar
    An example program must be given as the first argument
    Valid program names are
    arffvector Generate Vectors from an ARFF file or directory
    baumwelch BaumWelch algorithm for unsupervised HMM training
    canopy Canopy clustering
    cat Print a file or resource as the logistic regression models would see it
    cleansvd Cleanup and verification of SVD output
    clusterdump Dump cluster output to text
    clusterpp Groups Clustering Output In Clusters
    cmdump Dump confusion matrix in HTML or text formats
    concatmatrices Concatenates 2 matrices of same cardinality into a single matrix
    cvb LDA via Collapsed Variation Bayes (0th deriv approx)
    cvb0_local LDA via Collapsed Variation Bayes in memory locally
    evaluateFactorization compute RMSE and MAE of a rating matrix factorization against probes
    fkmeans Fuzzy Kmeans clustering
    hmmpredict Generate random sequence of observations by given HMM
    itemsimilarity Compute the itemitemsimilarities for itembased collaborative filtering
    kmeans Kmeans clustering
    lucenevector Generate Vectors from a Lucene index
    lucene2seq Generate Text SequenceFiles from a Lucene index
    matrixdump Dump matrix in CSV format
    matrixmult Take the product of two matrices
    parallelALS ALSWR factorization of a rating matrix
    qualcluster Runs clustering experiments and summarizes results in a CSV
    recommendfactorized Compute recommendations using the factorization of a rating matrix
    recommenditembased Compute recommendations using itembased collaborative filtering
    regexconverter Convert text files on a per line basis based on regular expressions
    resplit Splits a set of SequenceFiles into a number of equal splits
    rowid Map SequenceFile to {SequenceFile SequenceFile}
    rowsimilarity Compute the pairwise similarities of the rows of a matrix
    runAdaptiveLogistic Score new production data using a probably trained and validated AdaptivelogisticRegression model
    runlogistic Run a logistic regression model against CSV data
    seq2encoded Encoded Sparse Vector generation from Text sequence files
    seq2sparse Sparse Vector generation from Text sequence files
    seqdirectory Generate sequence files (of Text) from a directory
    seqdumper Generic Sequence File dumper
    seqmailarchives Creates SequenceFile from a directory containing gzipped mail archives
    seqwiki Wikipedia xml dump to sequence file
    spectralkmeans Spectral kmeans clustering
    split Split Input data into test and train sets
    splitDataset split a rating dataset into training and probe parts
    ssvd Stochastic SVD
    streamingkmeans Streaming kmeans clustering
    svd Lanczos Singular Value Decomposition
    testnb Test the Vectorbased Bayes classifier
    trainAdaptiveLogistic Train an AdaptivelogisticRegression model
    trainlogistic Train a logistic regression using stochastic gradient descent
    trainnb Train the Vectorbased Bayes classifier
    transpose Take the transpose of a matrix
    validateAdaptiveLogistic Validate an AdaptivelogisticRegression model against holdout data set
    vecdist Compute the distances between a set of Vectors (or Cluster or Canopy they must fit in memory) and a list of Vectors
    vectordump Dump vectors from a sequence file to text
    viterbi Viterbi decoding of hidden states from given output states sequence



    二、用一个简单示例验证mahout
    1、启动Hadoop
    2、下载测试数据
               从 http://archive.ics.uci.edu/ml/databases/synthetic_control/ 下载其中的 synthetic_control.data
    或者在百度上也很容易找到该示例数据。
    3、上传测试数据
    hadoop fs -put synthetic_control.data testdata
    4、使用Mahout中的kmeans聚类算法，执行命令:
    mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
    大约花费9分钟左右完成聚类。
    5、查看聚类结果
        执行 hadoop fs -ls /user/root/output 查看聚类结果。

    [jediael@master mahout-distribution-0.9]$ hadoop fs -ls output
    Found 15 items
    -rw-r--r--   2 jediael supergroup        194 2015-03-07 15:07 /user/jediael/output/_policy
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusteredPoints
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/clusters-0
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/clusters-1
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusters-10-final
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:03 /user/jediael/output/clusters-2
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:03 /user/jediael/output/clusters-3
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:04 /user/jediael/output/clusters-4
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:04 /user/jediael/output/clusters-5
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:05 /user/jediael/output/clusters-6
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:05 /user/jediael/output/clusters-7
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:06 /user/jediael/output/clusters-8
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusters-9
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/data
    drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/randomseeds

    PySpark环境搭建
    Linux平台下载安装Python 3.8
    1、依赖安装
    yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel

    yum -y install gcc
    2、下载安装包
    下载安装包: https://www.python.org/ftp/python/3.8.0/Python-3.8.0.tgz
    并放到Linux操作系统中。
    3、解压
    # 解压
    tar -zxvf Python-3.8.0.tgz -C <目标文件夹，可选>
    进入解压后的文件目录。

    4、安装
    ./configure --prefix=/usr/local/python3
    make && make install
    5、添加软链接
    ln -s /usr/local/python3/bin/python3.8 /usr/bin/python3
    ln -s /usr/local/python3/bin/pip3.8 /usr/bin/pip3
    6、测试
    python3 -V
    pip3 -V
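    作为补充，也可以用下面这段简单的 Python 脚本确认解释器版本，以及 ssl、sqlite3、zlib 等依赖编译期模块是否可用(示意性脚本，文件名 check_python_env.py 为假设):
    # check_python_env.py - 检查Python版本与常用编译依赖模块(示意)
    import importlib
    import sys

    # 打印解释器版本，确认是刚安装的3.8
    print("Python version:", sys.version)

    # 这些模块依赖安装前通过yum装好的devel包(openssl、sqlite、zlib、bzip2、readline等)
    for mod in ["ssl", "sqlite3", "zlib", "bz2", "readline", "ctypes"]:
        try:
            importlib.import_module(mod)
            print(mod, "OK")
        except ImportError as e:
            # 缺失说明对应的-devel依赖没装好，需要补装后重新编译Python
            print(mod, "MISSING:", e)
    运行方式: python3 check_python_env.py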



    Linux升级安装python3.8并配置pip和yum
    一、查看版本
    安装前先查看是否已安装python。这里自带的是python2.7.5版本，在不删除它的情况下安装python3.8.1版本:
    python -V
    二、安装Python3.8.1
    官网下载地址: https://www.python.org/downloads/source/

    # 解压
    tar -zxf Python-3.8.1.tgz
    # 安装依赖包
    yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc libffi-devel
    # 进入python目录
    cd Python-3.8.1
    # 编译
    ./configure --prefix=/usr/local/python3
    # 安装
    make && make install
    对系统默认的python做备份:
    这里之前自带的是python2.7.5版本，为避免文件重名，直接把名字改成python2.7.5。
    可以直接用Xftp修改，也可以用命令:
    mv /usr/bin/python /usr/bin/python2.7.5
    创建新的软连接:
    软连接相当于windows下新建的快捷方式，方便在Linux下不需要先找到文件，直接用命令调用。
    快捷方式: Windows提供的一种快速启动程序、打开文件或文件夹的方法，是应用程序的快速连接。
    ln -s /usr/local/python3/bin/python3.8 /usr/bin/python

    ln -s /usr/local/python3/bin/python3.8 /usr/bin/python3
    输入上面两个命令后，python 和 python3 命令都指向 python3.8。
    上面的命令有时候直接复制过去执行会出现一点问题:
    ln: invalid option -- ' '
    Try 'ln --help' for more information.
    如果出现这个问题的话，手动敲一遍上面的软连接命令即可。
    查看python版本，安装成功则显示: Python 3.8.1
    python -V

    python3 -V
    三、修改yum配置
    升级python3.8后yum命令可能无法运行，需要修改yum对应的文件头。
    将 /usr/bin/yum 和 /usr/libexec/urlgrabber-ext-down 两个文件中的 #!/usr/bin/python 改为 #!/usr/bin/python2.7:
    vi /usr/bin/yum
    vi /usr/libexec/urlgrabber-ext-down


    四、配置pip3
    安装完 python3.8.1 后，用pip install 下载插件时还是会自动下载到 python2.7 自带的 pip 包里。
    因为pip的软连接还指向 python2.7，所以先把之前 python2.7 版本的pip备份，再修改成 python3.8 版本的。
    备份2.7版本的软连接:
    mv /usr/bin/pip /usr/bin/pip2.7.5
    配置pip3软连接，pip3在python安装路径的 bin 目录下:
    ln -s /usr/local/python3/bin/pip3 /usr/bin/pip

    ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3
    查看pip版本:
    pip -V

    pip3 -V
    五、关于yum的删除和重新安装
    1 删除yum
    rpm -qa | grep yum | xargs rpm -e --nodeps
    2 查看Linux系统版本
    cat /etc/redhat-release
    3 查看Linux内核版本
    file /bin/ls
    4 安装yum
    接下来需要下载安装，具体地址和下载路径: http://mirrors.163.com
    依次执行下面3个命令，也可以到centos 7.8的网站地址找到指定包下载:
    rpm -ivh --nodeps http://mirrors.163.com/centos/7.8.2003/os/x86_64/Packages/yum-metadata-parser-1.1.4-10.el7.x86_64.rpm

    rpm -ivh --nodeps http://mirrors.163.com/centos/7.8.2003/os/x86_64/Packages/yum-plugin-fastestmirror-1.1.31-53.el7.noarch.rpm

    rpm -ivh --nodeps http://mirrors.163.com/centos/7.8.2003/os/x86_64/Packages/yum-3.4.3-167.el7.centos.noarch.rpm
    如果linux上安装过python 3.X版本，需要改yum文件中的配置，具体见本篇第三部分“修改yum配置”。
    Linux安装Python 3.8环境并卸载旧Python
    前提条件
    首先要能连接网络(不会的可参考“linux虚拟机连接网络”相关内容)，并搭建好网络yum源:
    cd /etc/yum.repos.d
    rm -rf *
    wget http://mirrors.163.com/.help/CentOS7-Base-163.repo
    yum clean all
    yum makecache
    安装环境
    yum install gcc patch libffi-devel python-devel zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel -y
    下载Python 3.8源代码
    Windows下载地址: python3.8
    也可以用wget命令在linux中下载:
    wget https://www.python.org/ftp/python/3.8.0/Python-3.8.0a2.tgz
    如果下载速度太慢，这里也分享一个百度云盘链接，可以直接下载:
    链接: https://pan.baidu.com/s/1O5W8G66nKoFVphheedNAfQ
    提取码: ysem
    安装Python 3.8
    将tar包放入linux中，然后执行以下操作:
    tar -zxf Python-3.8.0a2.tgz
    cd Python-3.8.0a2
    ./configure --prefix=/usr/local/python_38
    make -j 4 && make install
    配置
    配置PATH变量:
    ln -s /usr/local/python_38/bin/* /usr/bin/

    [root@test2 ~]# python3.8
    Python 3.8.0a2 (default, Mar 29 2020, 14:58:52)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print('hello')
    hello
    >>> # Ctrl+d 退出
    如果不想手动操作，也可以用shell脚本一键安装，可参考“CentOS7中用shell脚本安装python3.8环境”相关内容。
    卸载原来的Python环境
    CentOS7中默认安装了python2.7环境，如需卸载可执行以下命令。卸载可能导致系统崩溃，请谨慎处理:
    rpm -qa|grep python|xargs rpm -ev --allmatches --nodeps
    whereis python |xargs rm -frv
    whereis python

    安装新版Python 2713Python 362(Python 2Python 3存修改默认版Python 362)
    准备工作:
    1 安装wget命令(线载安装包命令)
      yum y install wget
    2 准备编译环境
    1   yum groupinstall 'Development Tools'
    2   yum install zlibdevel bzip2devel openssldevel ncursesdevel
     
    开始安装:
    1 进入载目录:
      cd usrlocalsrc
     
    2 载安装新版python2:
    1   wget httpswwwpythonorgftppython2713Python2713tgz
    2   tar zxvf Python2713tgz
    3   cd Python2713
    4   configure 
    5   make all
    6   make install
    7   make clean
    8   make distclean
    9   rm rf usrbinpython
    10   rm rf usrbinpython2
    11   rm rf usrbinpython27
    12   ln s usrlocalbinpython27 usrbinpython
    13   ln s usrlocalbinpython27 usrbinpython2
    14   ln s usrlocalbinpython27 usrbinpython27
    15   usrbinpython V
    16   usrbinpython2 V
    17   usrbinpython27 V
    18   rm rf usrlocalbinpython
    19   rm rf usrlocalbinpython2
    20   ln s usrlocalbinpython27 usrlocalbinpython
    21   ln s usrlocalbinpython27 usrlocalbinpython2
    22   python V
    23   python2 V
    24   python27 V

    3 载安装新版python3:
    1   wget httpswwwpythonorgftppython362Python362tgz
    2   tar zxvf Python362tgz
    3   cd Python362
    4   configure
    5   make all
    6   make install
    7   make clean
    8   make distclean
    9   rm rf usrbinpython
    10   rm rf usrbinpython3
    11   rm rf usrbinpython36
    12   ln s usrlocalbinpython36 usrbinpython
    13   ln s usrlocalbinpython36 usrbinpython3
    14   ln s usrlocalbinpython36 usrbinpython36
    15   usrbinpython V
    16   usrbinpython3 V
    17   usrbinpython36 V
    18   rm rf usrlocalbinpython
    19   rm rf usrlocalbinpython3
    20   ln s usrlocalbinpython36 usrlocalbinpython
    21   ln s usrlocalbinpython36 usrlocalbinpython3
    22   python V
    23   python3 V
    24   python36 V
    安装pip
    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py   # 下载安装脚本
    sudo python get-pip.py    # 运行安装脚本
    安装django
    su root
    pip install django==1.10.6
     
    附加: 安装sqlite 3.8的方法
    1. wget http://www.sqlite.org/2015/sqlite-autoconf-3081101.tar.gz

    2. tar -xvzf sqlite-autoconf-3081101.tar.gz


    Linux安装Apache Spark 310详细步骤
    Linux安装spark 前提部署Hadoop安装Scala
    应版
    名称

    JDK
    18271
    Hadoop
    260
    Scala
    2110
    Apache Spark
    310
     第一步 下载jdk 8u271 for linux x64
    下载地址: https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html
    wget nocookies nocheckcertificate header Cookie gpw_e24http3A2F2Fwwworaclecom2F oraclelicenseacceptsecurebackupcookie httpsdownloadoraclecomotnjavajdk8u271b0961ae65e088624f5aaa0b1d2d801acb16jdk8u271linuxx64targzAuthParam1610434774_54f5ca4ffe47aeb4b53c758f1306d437


    下载地址: https://spark.apache.org/downloads.html ，或者用命令行: wget https://mirrors.ocf.berkeley.edu/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz

     
     
    第二步，解压
    tar -zxvf spark-2.2.0-bin-hadoop2.6.tgz
    第三步，配置环境变量
    vi /etc/profile
    #SPARK_HOME
    export SPARK_HOME=/home/hadoop/spark-2.2.0-bin-hadoop2.6
    export PATH=$SPARK_HOME/bin:$PATH
    第四步，spark配置
    spark-env.sh:
    JAVA_HOME=/home/hadoop/jdk1.8.0_144
    SCALA_HOME=/home/hadoop/scala-2.11.0
    HADOOP_HOME=/home/hadoop/hadoop-2.6.0
    HADOOP_CONF_DIR=/home/hadoop/hadoop-2.6.0/etc/hadoop
    SPARK_MASTER_IP=ltt1.bg.cn
    SPARK_MASTER_PORT=7077
    SPARK_MASTER_WEBUI_PORT=8080
    SPARK_WORKER_CORES=1
    SPARK_WORKER_MEMORY=2g  # 每个worker可用内存，默认1g，可按机器内存调整
    SPARK_WORKER_PORT=7078
    SPARK_WORKER_WEBUI_PORT=8081
    SPARK_WORKER_INSTANCES=1
    spark-defaults.conf:
    spark.master spark://ltt1.bg.cn:7077
    slaves:
    ltt3.bg.cn
    ltt4.bg.cn
    ltt5.bg.cn

    如果要整合hive，并且hive用的是mysql数据库的话，需要把mysql数据库连接驱动(如 mysql-connector-java-5.1.7-bin.jar)放到$SPARK_HOME/jars目录下。

    第五步spark220binhadoop26 分发节点启动
    [hadoop@ltt1 sbin] startallsh
    starting orgapachesparkdeploymasterMaster logging to homehadoopspark220binhadoop26logssparkhadooporgapachesparkdeploymasterMaster1ltt1bgcnout
    ltt5bgcn starting orgapachesparkdeployworkerWorker logging to homehadoopspark220binhadoop26logssparkhadooporgapachesparkdeployworkerWorker1ltt5bgcnout
    ltt4bgcn starting orgapachesparkdeployworkerWorker logging to homehadoopspark220binhadoop26logssparkhadooporgapachesparkdeployworkerWorker1ltt4bgcnout
    ltt3bgcn starting orgapachesparkdeployworkerWorker logging to homehadoopspark220binhadoop26logssparkhadooporgapachesparkdeployworkerWorker1ltt3bgcnout
    查进程
    master节点
    [hadoop@ltt1 sbin] jps
    1346 NameNode
    1539 JournalNode
    1812 ResourceManager
    1222 QuorumPeerMain
    1706 DFSZKFailoverController
    2588 Master
    2655 Jps
    worker节点
    [hadoop@ltt5 ~] jps
    1299 NodeManager
    1655 Worker
    1720 Jps
    1192 DataNode
    进入SparkWeb理页面: httpltt1bgcn8080

     
     spark安装完成
    1 解压缩安装包 tar xvf jdk8u271linuxx64targz



    1 进入解压缩jdk文件中pwd查前工作路径然修改文件vi ~bash_profile 

     3 ~bash_profile 文件末尾加jdk环境变量容: 
    4刚修改文件生效java version查否配置成功
           
    Spark安装配置
    1 解压spark安装包tar xvf spark243binhadoop27tgz

    2 进入解压文件中进入conf目录查配置文件

    3修改配置文件sparkenvsh注意文件默认存里sparkenvshtemplate复制份命名新文件spaekenvsh

    4查前JAVA_HOME路径步中

    5修改文件sparkenvsh文件末尾添加容:

    6回spark目录中找sbin目录然启动spark命令sbinstartallsh

    7jps查否启动成功

    8spark根目录examplesjars目录jar文件里面存放例子

    9里jar包进行测试求圆周率

    10回spark目录运行命令里面100设置值选择更值进行测试会更精确

    结果显示:

    11创建两目录inputoutput作文件输入输出目录

    12输入目录中创建datatxt文件容


    13启动sparkshell交互式工具黄框标记日志表示变量sc操作Spark context

    14spark中scala语言统计单词出现次数
    sctextFile读取文件split( )空格分隔字符  map((_1))单词计数里元祖
    reduceByKey相进行累加

    Spark集群安装设置
    spark100新版20140530正式发布啦新spark版带新特性提供更API支持spark100增加Spark SQL组件增强标准库(MLstreamingGraphX)JAVAPython语言支持
    面首先进行spark100集群安装里两台服务器台作masternamenode机台作slavedatanode机增加更slave需重复slave部分容:
    系统版:
    · master:Ubuntu 1204
    · slave:Ubuntu 1204
    · hadoop:hadoop 220
    · spark:spark 100
    1 安装JDKhadoop集群
    安装程参见里httpwwwcnblogscomtecvegetablesp3778358html

    2 载安装Scala
    · scala载址httpwwwscalalangorgdownload2111html里载新版scala2111版
    · 解压scala放usrlib目录
        tar xzvf scala2111tgz
        mv scala2111 usrlib
    · 配置scala环境变量:sudo vi etcprofile
       文件末尾添加scala路径
       
       输入 source etcprofile 路径生效
    · 测试scala:scala version   #出现scala版信息说明安装成功
    PS:scala需slave节点配置
    3 载安装spark
    · spark100载址httpsparkapacheorgdownloadshtml解压spark放homehadoop
      tar xzvf spark100binhadoop2tgz
    · 配置spark环境变量:sudo vi etcprofile
      文件末尾添加spark路径

      输入  source etcprofile  路径生效
    · 配置confsparkenvsh文件
      没该文件 sparkenvshtemplate 文件重命名文件中添加scalajavahadoop路径master ip等信息
      mv sparkenvshtemplate sparkenvsh
      vi sparkenvsh
      
      
      
    · confslaves中添加slave节点hostname行:
      vi slaves
      
    4 slave机器安装配置spark
    现master机spark文件分发slave节点注意slavemasterspark目录必须致master会登录slave执行命令认slavespark路径样
    scp r spark100binhadoop2 hadoop@slavehomehadoop
    5启动spark集群
    master机执行命令:
    cd ~spark100binhadoop2sbin
    startallsh
    检测进程否启动:输入 jps

    配置完成
    6 面体验spark带例子
    binrunexample SparkPi

    scala实现spark app
    官方说明址例子统计输入文件中字母a字母b数网站提供scalajavapython三种实现里做scala吧里需安装SBT( sbt 创建测试运行提交作业简单SBT做Scala世界Maven)
    spark100木带sbt选择手动安装然选择sudo aptget install sbt方式(系统中木找sbt包手动安装咯)安装方法:
    · 载:sbt载址载现新版sbt0135
    · 解压sbthomehadoop目录(hadoop户名实HOME啦)
      tar zxvf sbt0135tgz
      cd sbtbin
      java jar sbtlaunchjar    #进行sbt安装时间约时吧会载东东记联网哦
    · 成功etcprofile中配置sbt环境变量
      sudo vi etcprofile
      
      输入source etcprofile 路径生效
    sbt安装完成面写简单spark app吧
    · 创建目录:mkdir ~SimpleApp
    · SimpleApp目录创建目录结构:
            
    · simple.sbt文件内容:
    name := "Simple Project"
    version := "1.0"
    scalaVersion := "2.10.4"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
    · SimpleApp.scala文件内容:
    /* SimpleApp.scala */
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf
    object SimpleApp {
    def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    }
    }
    PS:前spark配置程中hadoop路径配置里输入路径YOUR_SPARK_HOMEXXX实际HDFS文件系统中文件存储位置hadoop配置文件coresitexml中相关(具体参见里方容易出错)需先READMEmd文件puthdfs面:

    · 编译:
      cd ~/SimpleApp
      sbt package     # 打包过程时间会比较长，最后会出现[success]XXX
      PS: 打包成功后会生成许多文件，如 target/scala-2.10/simple-project_2.10-1.0.jar 等
    · 运行:
      spark-submit --class "SimpleApp" --master local target/scala-2.10/simple-project_2.10-1.0.jar
    · 结果:

    7 停止spark集群
    cd ~spark100binhadoop2sbin
    stopallsh

    JDKhadoop安装
    Ubuntu 1204Hadoop 220 集群搭建
    现家起实现Ubuntu 1204Hadoop 220 集群搭建里两台服务器台作masternamenode机台作slavedatanode机增加更slave需重复slave部分容
    系统版:
    · master:Ubuntu 1204
    · slave:Ubuntu 1204
    · hadoop:hadoop 220
    · 安装ssh服务:sudo aptget install ssh
    · 时更新vim:sudo aptget install vim  #刚安装系统会出现vi 命令键变成AB情况
    masterslave机器安装JDK环境
    载jdk果安装版170_60官方载址:java载
    解压jdk: tar xvf  jdk7u60linuxi586targz
    usrlocal新建java文件夹:mkdir usrlocaljava
    解压文件移动创建java文件夹:sudo mv jdk170_60  usrlocaljava
    修改etcprofile文件:sudo vi etcprofile
    文件末尾添加jdk路径:

    输入 source etcprofile java生效
    测试java否完全安装:java version   #出现版信息说明安装成功
    二修改namenode(master)子节点(slave)机器名:
    sudo vi etchostname   
     

    修改需重启生效:sudo reboot
    三修改namenode(master)节点映射ip
    sudo vi etchosts    #添加slavemaster机器名应ip
     
    PS:masterslave分namenodedatanode机器名hostname名字
    四masterslave分创建Hadoop户户组赋予sudo权限
    sudo addgroup hadoop
    sudo adduser ingroup hadoop hadoop   #第hadoop户组第二hadoop户名
    面hadoop户赋予sudo权限:修改 etcsudoers 文件
    sudo vi etcsudoers
    添加hadoop  ALL(ALLALL) ALL

    PS:该操作需masterslave机进行
    五建立ssh密码登陆环境
    hadoop身份登录系统:su hadoop
    生成密钥建立namenodedatanode信关系ssh生成密钥rsadsa方式默认采rsa方式:
    homehadoop目录输入: sshkeygen t rsa P
    确认信息回车会homehadoopssh生成文件:
    id_rsapub追加authorized_keys授权文件中: cat id_rsapub >> authorized_keys

    子节点生成密钥:sshkeygen t rsa P
    masterauthorized_keys发送子节点:
    scp ~sshauthorized_keys  hadoop@slave1~ssh
    面测试ssh互信: ssh hadoop@slave1
    果需输入密码登录成功表示ssh互信成功建立
    六安装hadoop(需配置master机slave机直接复制)
    载hadoopusrlocal:载址
    解压hadoop220targz:sudo tar zxf hadoop220targz 
    解压出文件夹重命名hadoop: sudo mv hadoop220 hadoop
    hadoop文件夹属户设hadoop:sudo chown R hadoophadoop hadoop
    1配置etchadoophadoopenvsh文件
    sudo vi usrlocalhadoopetchadoophadoopenvsh
    找export JAVA_HOME部分修改机jdk路径

    2配置etchadoopcoresitexml文件
    sudo vi usrlocalhadoopetchadoopcoresitexml
    中间添加容:

    PS:masternamenode机名etchosts文件里名字
    3配置etchadoopmapredsitexml文件路径没文件mapredsitexmltemplate重命名
    sudo vi usrlocalhadoopetchadoopmapredsitexml
    中间添加容:

    PS:masternamenode机名etchosts文件里名字
    4配置hdfssitexml文件路径没文件hdfssitexmltemplate重命名
    sudo vi hdfssitexml
    中间添加容:

    PS:slave节点间容改:
    · usrlocalhadoopdatalog1usrlocalhadoopdatalog2
    · usrlocalhadoopdata1usrlocalhadoopdata2
    · 1中间数字表示slave节点数
    5slaves文件中添加slave机名行

    七slave节点分发配置文件
    配置文件发送slave子节点先文件复制子节点homehadoop面(子节点hadoop户登录:su hadoop)
    sudo scp etchosts hadoop@slave1homehadoop
    scp r usrlocalhadoop hadoop@slave1homehadoop
    PS:slave1slave子节点名slave节点应全部分发
    datanode机器(slave节点)文件移动master相路径
    sudo mv homehadoophosts etchosts  (子节点执行)
    sudo mv homelocalhadoop usrlocal  (子节点执行)
    PS:提示mv文件夹加r 参数
    加入属户: sudo chown R hadoophadoop hadoop    (子节点执行)
    PS:子节点datanode机器复制hadoop里面data1data2logs删掉
    配置完成
    PS:hadoop命令路径写入etcprofile文件样hadoophdfs命令否命令时加入binhadoop样路径:
    sudo vi etcprofile

    输入:source etcprofile
    八、运行WordCount示例
    首先进入/usr/local/hadoop目录并重启hadoop:
    cd /usr/local/hadoop/sbin
    ./stop-all.sh
    cd /usr/local/hadoop/bin
    hdfs namenode -format    # 格式化集群
    cd /usr/local/hadoop/sbin
    ./start-all.sh
    在namenode上查看连接情况:
    hdfs dfsadmin -report      # 可以看到下面机器的结果

    假设测试文件为test1.txt和test2.txt，首先创建目录input:
    hadoop dfs -mkdir input
    将测试文件上传到hadoop:
    hadoop dfs -put test1.txt input
    hadoop dfs -put test2.txt input
    让子节点离开安全模式，否则可能会导致无法读取input文件:
    hdfs dfsadmin -safemode leave
    运行wordcount程序:
    hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount input output
    查看结果:
    hadoop dfs -cat output/part-r-00000
    PS: 再次运行时需要先删除output文件夹: hadoop dfs -rmr output
    参考资料:
    http://www.cnblogs.com/kinglau/p/3794433.html
    http://www.cnblogs.com/tec-vegetables/p/3778358.html
    http://blog.csdn.net/laoyi_grace/article/details/6254743

    Ubuntu 1404安装Hadoop240(单机模式)
    Ubuntu创建hadoop组hadoop户
        增加hadoop户组时该组里增加hadoop户续涉hadoop操作时该户
    1创建hadoop户组 
            
    2创建hadoop户    
        sudo adduser ingroup hadoop hadoop
        回车会提示输入新UNIX密码新建户hadoop密码输入回车
        果输入密码回车会重新提示输入密码密码空
        确认信息否正确果没问题输入 Y回车
     3hadoop户添加权限
         输入:sudo gedit etcsudoers
         回车开sudoers文件
    hadoop户赋予root户样权限




      
    二新增加hadoop户登录Ubuntu系统
     
    三安装ssh
    sudo aptget install opensshserver安装完成启动服务
    sudo etcinitdssh start
     
    查服务否正确启动:ps e | grep ssh

     
     
     
     
     
     
    设置免密码登录生成私钥公钥
    sshkeygen t rsa P

    cat ~sshid_rsapub >> ~sshauthorized_keys

     
     
     
    登录ssh
    ssh localhost


    退出
    exit
     
    四安装Java环境
    sudo aptget install openjdk7jdk

     查安装结果输入命令:java version结果表示安装成功

     
     
     
     
    五安装hadoop240
        1官网载httpmirrorbiteducnapachehadoopcommon
     
        2安装
     
            解压
            sudo tar xzf hadoop240targz        
            假hadoop安装usrlocal
            拷贝usrlocal文件夹hadoop
            sudo mv hadoop240 usrlocalhadoop        

             
    赋予户该文件夹读写权限
            sudo chmod 774 usrlocalhadoop

         
    3配置      
            1)配置~bashrc      
    配置该文件前需知道Java安装路径设置JAVA_HOME环境变量面命令行查安装路径
            updatealternatives config java
            执行结果:

    完整路径
        usrlibjvmjava7openjdkamd64jrebinjava
        取前面部分 usrlibjvmjava7openjdkamd64
        配置bashrc文件
        sudo gedit ~bashrc
        
        该命令会开该文件编辑窗口文件末尾追加面容然保存关闭编辑窗口
    #HADOOP VARIABLES START
    export JAVA_HOMEusrlibjvmjava7openjdkamd64
    export HADOOP_INSTALLusrlocalhadoop
    export PATHPATHHADOOP_INSTALLbin
    export PATHPATHHADOOP_INSTALLsbin
    export HADOOP_MAPRED_HOMEHADOOP_INSTALL
    export HADOOP_COMMON_HOMEHADOOP_INSTALL
    export HADOOP_HDFS_HOMEHADOOP_INSTALL
    export YARN_HOMEHADOOP_INSTALL
    export HADOOP_COMMON_LIB_NATIVE_DIRHADOOP_INSTALLlibnative
    export HADOOP_OPTSDjavalibrarypathHADOOP_INSTALLlib
    #HADOOP VARIABLES END
     
     
    终结果图:

     
    执行面命添加环境变量生效:
            source ~bashrc
    2)编辑usrlocalhadoopetchadoophadoopenvsh
             执行面命令开该文件编辑窗口
            sudo gedit usrlocalhadoopetchadoophadoopenvsh
    找JAVA_HOME变量修改变量
            export JAVA_HOMEusrlibjvmjava7openjdkamd64    
    修改hadoopenvsh文件示:     

    六WordCount测试
     单机模式安装完成面通执行hadoop带实例WordCount验证否安装成功
        usrlocalhadoop路径创建input文件夹    
    mkdir input
     
        拷贝READMEtxtinput    
    cp READMEtxt input
        执行WordCount
        binhadoop jar sharehadoopmapreducesourceshadoopmapreduceexamples240sourcesjar orgapachehadoopexamplesWordCount input output  执行结果:  
      执行 cat output*查字符统计结果




    启动Spark集群
    检查进程否启动Spark集群环境否搭建成功
    1)启动HDFS集群

    2)检查进程否启动

    3)启动Spark集群
    cd homehadoopspark121binhadoop24sbin
    startallsh
    jps (没启动hdfs 集群spark进程情况)
    jps(启动hdfs集群spark进程运行情况)
    浏览器输入httpmaster8080查spark集群运行状况


    6)进入Sparkbin目录启动sparkshell 控制台

    7)访问 httpmaster4040Spark WebUI页面Spark集群环境搭建成功








    8)运行sparkshell测试
    前userinput传READMEtxt文件现Spark读取HDFS中READMEtxt文件

    取HDFS文件
    (连接成功例子)

    连接成功例子似Hadoop26版识localhost机IP址

    CountREADMEtxt文件中文字总数


    滤READMEtxt文件
    包括单词



    通wcREADMEtxt统计4The 单词

    实现Hadoop wordcount 功
    首先读取readmeFile执行命令

    然collect命令提交job


    WebUI执行效果:

    cd
    binrunexample SparkPi

    停止集群
    sbinstopallsh


    OkSpark集群环境测试结束总结步骤:
    (1)cd usrlocalhadoopsbin
    startallsh
    (2)cd homehadoopspark121binhadoop24sbin
    startallsh
    (3)jps
    (4)http19216801188080
    (5)cd homehadoopspark121binhadoop24bin
    sparkshell
    (6)http19216801184040
    (7)scala编程程序提交测试
    (8)集群停止
    Spark性能优化
    一般在开发完Spark作业之后，就该为作业配置合适的资源了。Spark的资源参数，基本都可以在spark-submit命令中作为参数设置。资源参数设置得不合理，可能会导致没有充分利用集群资源，作业运行极其缓慢; 或者设置的资源过大，队列没有足够的资源提供，进而导致各种异常。

    下面以pyspark开发代码为例子进行说明。
    运行pyspark程序，可以用终端命令模式(在Linux终端输入pyspark，然后复制粘贴代码)，也可以用spark-submit命令行像Hive一样通过yarn调度运行。

    # -*- coding: utf-8 -*-
    from pyspark.sql import HiveContext, SparkSession

    # 初始化SparkContext，同时开启Hive支持
    # 终端命令行测试模式下，把输出字段长度设置为100字符
    spark = SparkSession.builder.appName("name").config(
        "spark.debug.maxToStringFields", "100").enableHiveSupport().getOrCreate()
    # 初始化HiveContext
    hive = HiveContext(spark.sparkContext)
    # 开启SparkSQL大表交叉连接支持
    spark.conf.set("spark.sql.crossJoin.enabled", "true")

    # 读取parquet文件数据的代码
    # Parquet是面向分析型业务的列式存储格式，由Twitter和Cloudera合作开发，常用于AWS中
    # parquet文件数据可以存储在AWS S3上
    # AWS的S3是对象数据存储服务，S3全名 Simple Storage Service，即简便的存储服务
    df1 = spark.read.load(
        path='<parquet文件路径>',
        format='parquet', header=True)

    # 读取CSV文件数据的代码
    # 这边CSV文件一般作为手工交换文件的标准
    # 原因是csv格式简单，数字类型的数据以字符串存储，精度有保证
    df2 = spark.read.load(
        path='<csv文件路径>',
        format='csv', header=True)

    # 读取Hive表或视图数据的代码
    df3 = hive.sql("""
        select
            *
        from <数据库名>.<表名>""")

    # 对多次使用的表数据集进行数据内存缓存(第一条Spark优化策略)
    # 这样的话，pyspark代码多次调用该数据集的时候，Spark不会重复读取相同的文件数据
    df4 = spark.read.load(
        path='<parquet文件路径>',
        format='parquet', header=True).cache()

    # 给刚才的数据集命名，以便放入SparkSQL编写查询语句
    df1.createOrReplaceTempView("DF1")

    df2.createOrReplaceTempView("DF2")

    df3.createOrReplaceTempView("DF3")

    df4.createOrReplaceTempView("DF4")

    # 创建SparkSQL数据集的代码
    # 如果数据量较大、业务逻辑复杂的话，可以把数据临时缓存到存储服务的磁盘上
    # 避免pyspark代码后面的SparkSQL调用这里的SparkSQL数据集的时候
    # 对这里的SparkSQL数据集重复运行计算逻辑，以节约计算资源(第二条Spark优化策略)
    df5 = spark.sql("""
        SELECT
            <字段列表>
        from DF1 AS D1
        LEFT JOIN DF2 AS D2
            ON <连接条件>
        LEFT JOIN DF4 AS D4
            ON <连接条件>
        WHERE <过滤条件>
        """).persist()
    # count是Action算子，会触发spark提交作业，使前面的persist()缓存操作立刻生效
    # 不加count()操作的话，persist()缓存操作会到下一个Action算子处或程序结束处才生效
    df5.count()
    df5.createOrReplaceTempView("DF5")

    # 创建SparkSQL数据集的代码
    df6 = spark.sql("""
        SELECT
            <字段列表>
        from DF5 AS D5
        LEFT JOIN DF3 AS D3
            ON <连接条件>
        LEFT JOIN DF4 AS D4
            ON <连接条件>
        WHERE <过滤条件>
        """)

    # 将结果数据集写入parquet文件
    df6.write.parquet(
        path='<输出路径>',
        mode="overwrite")

    # 释放缓存
    df5.unpersist()

    # 将sparkContext停止
    spark.stop()

    1、Spark作业基本运行原理

          详细原理见图sparksubmit提交Spark作业作业会启动应Driver进程根部署模式(deploymode)Driver进程启动集群中某工作节点启动Driver进程身会根设置参数占定数量存CPU coreDriver进程做第件事情集群理器(Spark Standalone集群资源理集群般YARN作资源理集群)申请运行Spark作业需资源里资源指Executor进程YARN集群理器会根Spark作业设置资源参数工作节点启动定数量Executor进程Executor进程占定数量存CPU core
      申请作业执行需资源Driver进程会开始调度执行编写作业代码Driver进程会编写Spark作业代码分拆stagestage执行部分代码片段stage创建批task然task分配Executor进程中执行task计算单元负责执行模样计算逻辑(编写某代码片段)task处理数已stagetask执行完毕会节点磁盘文件中写入计算中间结果然Driver会调度运行stagestagetask输入数stage输出中间结果循环复直编写代码逻辑全部执行完计算完数想结果止
      Spark根shuffle类算子进行stage划分果代码中执行某shuffle类算子(reduceByKeyjoin等)会该算子处划分出stage界限致理解shuffle算子执行前代码会划分stageshuffle算子执行代码会划分stagestage刚开始执行时候task会stagetask节点通网络传输拉取需处理key然拉取相key编写算子函数执行聚合操作(reduceByKey()算子接收函数)程shuffle
      代码中执行cachepersist等持久化操作时根选择持久化级task计算出数会保存Executor进程存者节点磁盘文件中
      Executor存分三块:第块task执行编写代码时默认占Executor总存20第二块task通shuffle程拉取stagetask输出进行聚合等操作时默认占Executor总存20第三块RDD持久化时默认占Executor总存60
      task执行速度Executor进程CPU core数量直接关系CPU core时间执行线程Executor进程分配tasktask条线程方式线程发运行果CPU core数量较充足分配task数量较合理通常说较快速高效执行完task线程
      Spark作业基运行原理说明家结合图理解理解作业基原理进行资源参数调优基前提
    2、资源参数调优
          解完Spark作业运行基原理资源相关参数容易理解谓Spark资源参数调优实Spark运行程中资源方通调节种参数优化资源效率提升Spark作业执行性参数Spark中资源参数参数应着作业运行原理中某部分时出调优参考值
    num-executors
      参数说明:该参数设置Spark作业总少Executor进程执行DriverYARN集群理器申请资源时YARN集群理器会设置集群工作节点启动相应数量Executor进程参数非常重果设置话默认会启动少量Executor进程时Spark作业运行速度非常慢
      参数调优建议:Spark作业运行般设置50~100左右Executor进程较合适设置太少太Executor进程设置太少法充分利集群资源设置太话部分队列法予充分资源
    executor-memory
      参数说明:该参数设置Executor进程存Executor存时候直接决定Spark作业性常见JVM OOM异常直接关联
      参数调优建议:Executor进程存设置4G~8G较合适参考值具体设置根部门资源队列定团队资源队列存限制少numexecutorsexecutormemory代表Spark作业申请总存量(Executor进程存总)量超队列存量外果团队里享资源队列申请总存量超资源队列总存13~12避免Spark作业占队列资源导致学作业法运行
    executor-cores
      参数说明:该参数设置Executor进程CPU core数量参数决定Executor进程行执行task线程力CPU core时间执行task线程Executor进程CPU core数量越越够快速执行完分配task线程
      参数调优建议:ExecutorCPU core数量设置2~4较合适样根部门资源队列定资源队列CPU core限制少设置Executor数量决定Executor进程分配CPU core样建议果享队列numexecutors * executorcores超队列总CPU core13~12左右较合适避免影响学作业运行
    driver-memory
      参数说明:该参数设置Driver进程存
      参数调优建议:Driver存通常说设置者设置1G左右应该够唯需注意点果需collect算子RDD数全部拉取Driver进行处理必须确保Driver存足够否会出现OOM存溢出问题
    spark.default.parallelism
      参数说明:该参数设置stage默认task数量参数极重果设置会直接影响Spark作业性
      参数调优建议:Spark作业默认task数量500~1000较合适学常犯错误设置参数时会导致Spark根底层HDFSblock数量设置task数量默认HDFS block应task通常说Spark默认设置数量偏少(十task)果task数量偏少话会导致前面设置Executor参数前功弃试想Executor进程少存CPUtask1者1090Executor进程根没task执行白白浪费资源Spark官网建议设置原设置该参数numexecutors * executorcores2~3倍较合适Executor总CPU core数量300设置1000task时充分利Spark集群资源
    spark.storage.memoryFraction
      参数说明:该参数设置RDD持久化数Executor存中占例默认06说默认Executor 60存保存持久化RDD数根选择持久化策略果存够时数会持久化者数会写入磁盘
      参数调优建议:果Spark作业中较RDD持久化操作该参数值适提高保证持久化数够容纳存中避免存够缓存数导致数写入磁盘中降低性果Spark作业中shuffle类操作较持久化操作较少参数值适降低较合适外果发现作业频繁gc导致运行缓慢(通spark web ui观察作业gc耗时)意味着task执行户代码存够样建议调低参数值
    spark.shuffle.memoryFraction
      参数说明:该参数设置shuffle程中task拉取stagetask输出进行聚合操作时够Executor存例默认02说Executor默认20存进行该操作shuffle操作进行聚合时果发现存超出20限制余数会溢写磁盘文件中时会极降低性
      参数调优建议:果Spark作业中RDD持久化操作较少shuffle操作较时建议降低持久化操作存占提高shuffle操作存占例避免shuffle程中数时存够必须溢写磁盘降低性外果发现作业频繁gc导致运行缓慢意味着task执行户代码存够样建议调低参数值
    资源参数调优没固定值需学根实际情况(包括Spark作业中shuffle操作数量RDD持久化操作数量spark web ui中显示作业gc情况)时参考篇文章中出原理调优建议合理设置述参数
    3、资源参数参考示例
          以下是一份spark-submit命令的示例，大家可以参考，并根据自己的实际情况进行调节:
    ./bin/spark-submit \
      --master yarn-cluster \
      --num-executors 100 \
      --executor-memory 6G \
      --executor-cores 4 \
      --driver-memory 1G \
      --conf spark.default.parallelism=1000 \
      --conf spark.storage.memoryFraction=0.5 \
      --conf spark.shuffle.memoryFraction=0.3 \
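    除了在spark-submit命令行里指定，上述大部分资源参数也可以在pyspark代码中通过SparkSession的config方式设置。下面是一段示意性代码(参数取值仅为示例，请结合前文的调优建议和自己的资源队列调整; 另外driver内存等JVM启动参数在代码里设置通常不生效，建议仍通过spark-submit指定):
    # 资源参数在代码中设置的示意写法(取值仅供参考)
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("resource-tuning-demo")
             .config("spark.executor.instances", "100")      # 对应 --num-executors
             .config("spark.executor.memory", "6g")          # 对应 --executor-memory
             .config("spark.executor.cores", "4")            # 对应 --executor-cores
             .config("spark.default.parallelism", "1000")
             .config("spark.storage.memoryFraction", "0.5")  # 旧版静态内存管理参数
             .config("spark.shuffle.memoryFraction", "0.3")
             .getOrCreate())

    print(spark.sparkContext.getConf().get("spark.executor.memory"))
    spark.stop()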
    4、Spark中的三种Join策略
    Spark通常有三种Join策略方式:
    1. Broadcast Hash Join(BHJ)
    2. Shuffle Hash Join(SHJ)
    3. Sort Merge Join(SMJ)
    Broadcast Hash Join
    当大表与小表进行Join操作时，为避免shuffle操作，可以把小表的数据分发到各个节点上与大表进行Join操作，即以牺牲空间来避免耗时的Shuffle操作。使用条件如下，示例代码见后:

    1. 被broadcast的表必须小于spark.sql.autoBroadcastJoinThreshold所配置的值(默认10M)，或者明确添加broadcast join hint
    2. 只有base table不能被broadcast，比如left outer join中，仅right表可以被broadcast
    3. 这种算法只适合broadcast较小的表，否则数据传输的成本比shuffle操作还高
    4. broadcast需要先把数据收集到driver，如果表太大，会给driver内存造成压力
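    下面给出一个使用broadcast hint强制走Broadcast Hash Join的pyspark小例子(示意性代码，表和字段均为假设的):
    # Broadcast Hash Join示意: 小表dim被广播到各节点与大表fact做join，避免shuffle
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("bhj-demo").getOrCreate()

    fact = spark.createDataFrame([(1, 100), (2, 200), (3, 300)], ["key", "amount"])   # 假设的大表
    dim = spark.createDataFrame([(1, "A"), (2, "B")], ["key", "name"])                # 假设的小表

    # 明确添加broadcast hint; 若小表小于spark.sql.autoBroadcastJoinThreshold(默认10M)，Spark通常也会自动选择BHJ
    joined = fact.join(broadcast(dim), on="key", how="left")
    joined.explain()   # 物理计划中可以看到BroadcastHashJoin
    joined.show()
    spark.stop()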
    Shuffle Hash Join
    broadcast策略需要首先把小表数据收集到driver节点，然后再分发到各executor节点。当表不够小时，broadcast的过程会给driver和executor都造成压力。
    Shuffle Hash Join则可以减少driver和executor的压力，操作步骤为:
    1. 两张表分别按连接列进行重组(shuffle)，目的是把相同连接列值的记录分配到同一个分区
    2. 对两张表中较小一方的每个分区构造hash表，较大表的相应记录再进行映射匹配

    Sort Merge Join
    上面两种方式不难发现都有一定的适用条件: 表要足够小。当两张表都很大时，上面的方法都会对内存造成压力，因为做Hash Join时，其中一张表必须完整加载到内存中。
    当两张表都很大时，Spark SQL会采用一种新的算法来做Join操作，叫做Sort Merge Join。这种算法不是先加载所有数据再开始Hash Join，而是在Join前先对数据进行排序。
    两张表都需要进行数据重组(shuffle)，保证相同连接列值的数据落在同一分区; 对分区内的数据排序之后，再把相应的记录进行关联。
    由于表是按分区处理的，Sort Merge Join不需要把一整张表的数据全部加载到内存。
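    Join策略的选择可以通过配置来影响，下面是一段示意性的pyspark配置片段(阈值取值仅为示例):
    # 通过配置影响Spark选择哪种Join策略(示意)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-strategy-demo").getOrCreate()

    # 调大广播阈值(此处示例为50MB)，让更多小表走Broadcast Hash Join
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

    # 设为-1则关闭自动广播，大表之间的等值连接通常会走Sort Merge Join
    # spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    # 该参数为false时，在满足条件的情况下Spark可能会考虑Shuffle Hash Join而不是Sort Merge Join
    # spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

    spark.stop()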


    5、Spark 3.0 中 AQE新特性
    近年来Spark SQL的优化中，CBO是最成功的特性之一。
    CBO会计算一些和业务数据相关的统计数据来优化查询，例如行数、去重后的行数、空值、最大最小值等。
    Spark会根据这些数据，自动选择BHJ或者SMJ，在多Join场景下通过Cost-based Join Reorder(可参考之前写的一篇文章)达到优化执行计划的目的。
    但是统计数据需要预先处理，而且不是实时的，无法在执行过程中根据实时数据进行判断，某些情况下反而会变成负面效果，拉低SQL执行效率。
    AQE则是在执行过程中利用运行时统计数据动态调节执行计划，来解决这些问题。
    1框架
    AQE言重问题什时候重新计算优化执行计划Spark务算子果道排列次行执行然shuffle者broadcast exchange会断算子排列执行称物化点(Materialization Points)Query Stages代表物化点分割片段Query Stage会产出中间结果仅该stage行stage执行完成游Query Stage执行游部分stage执行完成partitions统计数获取游未开始执行AQE提供reoptimization机会


    查询开始时生成完执行计划AQE框架首先会找执行存游stages旦stage完成AQE框架会physical plan中标记完成根已完成stages提供执行数更新整logical plan基新产出统计数AQE框架会执行optimizer根系列优化规进行优化AQE框架会执行生成普通physical planoptimizer适应执行专属优化规例分区合数倾斜处理等获新优化执行计划已执行完成stages次循环接着需继续重复面步骤直整query跑完
    Spark 30中AQE框架拥三特征:
    · 动态折叠shuffle程中partition
    · 动态选择join策略
    · 动态优化存数倾斜join
    接具体三特征
    ① 动态合shuffle partitions
    处理数量级非常时shuffle通常说影响性shuffle非常耗时算子需通网络移动数分发游算子
    shuffle中partition数量十分关键partition佳数量取决数数querystage会差异难确定具体数目:
    · 果partition少partition数量会会导致量数落磁盘拖慢查询
    · 果partitionpartition数量会少会产生额外网络开销影响Spark task scheduler拖慢查询
    解决该问题开始设置相较shuffle partition数通执行程中shuffle文件数合相邻partitions
    例假设执行SELECT max(i) FROM tbl GROUP BY j表tbl2partition数量非常初始shuffle partition设5分组会出现5partitions进行AQE优化会产生5tasks做聚合结果事实3partitions数量非常

    然种情况AQE会生成3reduce task

    ② 动态切换join策略
    Spark支持众join中broadcast hash join性果需广播表预估广播限制阈值应该设BHJ表估计会导致决策错误join表filter(容易表估)者join表算子(容易表估)仅仅全量扫描张表
    AQE拥精确游统计数解决该问题面例子右表实际15M该场景filter滤实际参join数8M默认broadcast阈值10M应该广播

    执行程中转化BHJ时甚传统shuffle优化shuffle(例shuffle读mapper基reducer)减网络开销
    ③ 动态优化数倾斜
    数倾斜集群数分区间分布均匀导致会拉慢join场景整查询AQE根shuffle文件统计数动检测倾斜数倾斜分区散成子分区然进行join
    场景Table A join Table B中Table Apartition A0数远分区

    AQE会partition A0切分成2子分区独Table Bpartition B0进行join

    果做优化SMJ会产生4tasks中执行时间远优化join会5taskstask执行耗时差相整查询带更性
    2、使用
    设置参数spark.sql.adaptive.enabled为true即可开启AQE，在Spark 3.0中默认为false，并且要满足以下条件:
    · 非流式查询
    · 包含至少一个exchange(如join、聚合、窗口算子)或者一个子查询
    AQE通过减少对静态统计数据的依赖，成功解决了Spark CBO一个难以处理的trade off(生成统计数据的开销和查询耗时)以及数据精度问题。相比之前具有局限性的CBO，现在的AQE显得非常灵活，不需要提前分析数据。
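    下面是一段开启AQE相关开关的示意性pyspark配置代码(配置键以Spark 3.0为准，取值仅为示例):
    # 开启Spark 3.0 AQE及其三大特性的示意配置
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("aqe-demo")
             .config("spark.sql.adaptive.enabled", "true")                     # 总开关，3.0中默认false
             .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # 动态合并shuffle分区
             .config("spark.sql.adaptive.skewJoin.enabled", "true")            # 动态优化数据倾斜join
             .getOrCreate())

    # AQE会在shuffle物化点之后根据运行时统计信息重新优化执行计划，
    # 例如把满足阈值的join动态切换为Broadcast Hash Join
    print(spark.conf.get("spark.sql.adaptive.enabled"))
    spark.stop()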
    6、数据仓库中数据优化的一般原则
    · 同一种数据只存放一份: 两张或多张表中，同一种数据的字段数量多于一个的，把它们抽出来变成一张独立的新表(数据仓库设计)。
    · ETL中对数据的筛选、聚合操作尽量前置，在取数的源头就把数据量减下来，以减少整个数据流的数据量(ETL设计)。
    · 多张大表连接时，提前对各张表中参与连接的数据进行排序，可以把表连接的时间复杂度从接近笛卡尔积的量级降下来，对性能提高较明显; 也应对筛选后数据中参与连接的字段添加索引，以便进一步提高表连接性能(见下面的pyspark示例)。
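    以上面第二、三条原则为例，下面用一段示意性的pyspark代码说明“筛选前置”的写法(库名、表名、字段均为假设的):
    # 筛选前置示意: 先把两张表按条件过滤、裁剪到只剩需要的字段，再做join
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("filter-first-demo").enableHiveSupport().getOrCreate()

    # 假设的订单表和客户表
    orders = spark.table("dw.orders").where("order_date >= '2023-01-01'") \
                  .select("order_id", "cust_id", "amount")
    custs = spark.table("dw.customers").where("status = 'ACTIVE'") \
                 .select("cust_id", "cust_name")

    # 参与连接的数据量在join之前已经被大幅缩小
    result = orders.join(custs, on="cust_id", how="inner")
    result.write.mode("overwrite").parquet("<输出路径>")
    spark.stop()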

    7、Spark 中的宽依赖和窄依赖
    Spark中RDD的高效与DAG图有着密不可分的关系，在DAG调度中需要对计算过程划分stage，而划分的依据就是RDD之间的依赖关系。针对不同的转换函数，RDD之间的依赖关系可以分为窄依赖(narrow dependency)和宽依赖(wide dependency，也称 shuffle dependency)。
    概述
    · 窄依赖是指父RDD的每个分区只被子RDD的一个分区所使用，子RDD分区通常对应常数个父RDD分区(O(1)，与数据规模无关)
    · 相应的，宽依赖是指父RDD的每个分区都有可能被多个子RDD分区所使用，子RDD分区通常对应所有的父RDD分区(O(n)，与数据规模有关)
    宽赖窄赖图示:

    相宽赖窄赖优化利 基两点:
    1 宽赖应着shuffle操作需运行程中父RDD分区传入子RDD分区中中间涉节点间数传输窄赖父RDD分区会传入子RDD分区中通常节点完成转换
    2 RDD分区丢失时(某节点障)spark会数进行重算
    1 窄赖父RDD分区应子RDD分区样需重算子RDD分区应父RDD分区重算数利率100
    2 宽赖重算父RDD分区应子RDD分区样实际父RDD 中部分数恢复丢失子RDD分区部分应子RDD未丢失分区造成余计算更般宽赖中子RDD分区通常父RDD分区极端情况父RDD分区进行重新计算
    3 图示b1分区丢失需重新计算a1a2a3产生冗余计算(a1a2a3中应b2数)

    详细运行原理

    图中左边宽赖父RDD4号分区数划分子RDD分区(分区分区)表明shuffle程父分区数shuffle程hash分区器(定义分区器)划分子RDD例GroupByKeyreduceByKeyjoinsortByKey等操作
    图右边窄赖父RDD分区数直接子RDD应分区(分区分区)例1号5号分区数进入子RDD分区程没shuffleSpark中Stage划分通shuffle划分(shuffle理解数原分区乱重组新分区):mapfilter
    总结: 如果父RDD的一个Partition只被子RDD的一个Partition所使用，就是窄依赖，否则就是宽依赖。
    宽窄赖容错性
    Spark基lineage容错性指果RDD出错父RDD重新计算果RDD仅父RDD(窄赖)种重新计算代价会非常

    Spark基Checkpoint(物化)容错机制解?图中宽赖结果(历Shuffle程)昂贵Spark结果物化磁盘备面
    join操作两种情况果join操作partition 仅仅已知Partition进行join时join操作窄赖情况join操作宽赖确定Partition数量赖关系窄赖出推窄赖仅包含窄赖包含固定数窄赖(说父RDD赖Partition数量会着RDD数规模改变改变)
    Stage划分
    名词解析
    1 job rdd action 触发动作简单理解需执行 rdd action 时候会生成 job
    2stage  stage job 组成单位说 job 会切分成 1 1 stage然 stage 会执行序次执行
    3task  stage 务执行单元般说 rdd 少partition会少 task task 处理partition 数
    划分规
    1前推理遇宽赖断开遇窄赖前RDD加入Stage中
    2Stage里面Task数量该Stage中 RDDPartition数量决定
    3Stage里面务类型ResultTask前面Stage里面务类型ShuffleMapTask
    4代表前Stage算子定该Stage计算步骤
    总结: spark中stage的划分就是根据shuffle来划分的，宽依赖必然有shuffle过程，也可以说spark是根据宽窄依赖来划分stage的。
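    下面用一小段pyspark代码直观感受窄依赖与宽依赖对stage划分的影响(示意性代码，toDebugString的输出中可以看到以shuffle为界的依赖链):
    # 窄依赖(map)不产生shuffle; 宽依赖(reduceByKey)会切分出新的stage
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dependency-demo").getOrCreate()
    sc = spark.sparkContext

    words = sc.parallelize(["a", "b", "a", "c", "b", "a"], 2)

    # map是窄依赖，父分区与子分区一一对应，可以在同一个stage内流水线执行
    pairs = words.map(lambda w: (w, 1))

    # reduceByKey是宽依赖(有shuffle)，这里会划分出新的stage
    counts = pairs.reduceByKey(lambda a, b: a + b)

    debug = counts.toDebugString()   # 打印RDD谱系，能看到shuffle边界
    print(debug.decode() if isinstance(debug, bytes) else debug)
    print(counts.collect())          # Action算子，真正触发job执行
    spark.stop()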
    Spark优化
    窄赖优化利逻辑RDD算子forkjoin(join非文join算子指步行务barrier):计算fork分区算完join然forkjoinRDD算子果直接翻译物理实现济:RDD( 中间结果)需物化存存储中费时费空间二join作全局barrier昂贵会慢节点拖死果子RDD分区 父RDD分区窄赖实施典fusion优化两forkjoin合果连续变换算子序列窄赖 forkjoin减少量全局barrier需物化中间结果RDD极提升性Spark做流水线(pipeline)优化
    Spark流水线优化:

    变换算子序列碰shuffle类操作宽赖发生流水线优化终止具体实现 中DAGScheduler前算子前回溯赖图碰宽赖生成stage容纳已遍历算子序列stage里安全实施流水线优化然宽赖开始继续回溯生成stage
    Pipeline
    Spark中pipelinepartition应partitionstage部窄赖pipeline详解
    stagestage间宽赖
    分布式计算程

    图Sparkwordcount例子根述stage划分原job划分2stage三行分数读取计算存储程
    仅代码户根体会数背行计算图中出数分布分区(理解机器)数flapMapmapreduceByKey算子RDD分区中流转(算子面说RDD进行计算函数)
    图更高角度:

    Spark运行架构Driver(理解master)Executor(理解workerslave)组成Driver负责户代码进行DAG切分划分Stage然Stage应task调度提交Executor进行计算样Executor行执行Stagetask
    (里DriverExecutor进程般分布机器)
    里理解Stagetask图Spark作业划分层次:
     
    Application户submit提交整体代码代码中action操作action算子Application划分jobjob根宽赖划分StageStage划分许(数量分区决定分区数task计算)功相task然task提交Executor进行计算执行结果返回Driver汇总存储
    体现 Driver端总规划–Executor端分计算–结果汇总回Driver 思想分布式计算思想
    8、Spark算子
    分类
    从大方向来说，Spark 算子大致可以分为两类:
         1)Transformation 变换/转换算子: 这种变换并不触发提交作业，只是完成作业中间过程的处理。
       Transformation 操作是延迟计算的，也就是说从一个RDD 转换生成另一个 RDD 的转换操作不是马上执行，需要等到有 Action 操作的时候才会真正触发运算。
         2)Action 行动算子: 这类算子会触发 SparkContext 提交 Job 作业。
        Action 算子会触发 Spark 提交作业(Job)，并将数据输出到Spark系统。
     
      从小方向来说，Spark 算子大致可以分为以下三类:
      1)Value数据类型的Transformation算子，这种变换并不触发提交作业，针对处理的数据项是Value型的数据。
      2)Key-Value数据类型的Transformation算子，这种变换并不触发提交作业，针对处理的数据项是Key-Value型的数据。
      3)Action算子，这类算子会触发SparkContext提交Job作业。
    1)Value数类型Transformation算子  
      输入分区输出分区型
        1map算子
        2flatMap算子
        3mapPartitions算子
        4glom算子
      二输入分区输出分区型 
        5union算子
        6cartesian算子
      三输入分区输出分区型
        7grouBy算子
      四输出分区输入分区子集型
        8filter算子
        9distinct算子
        10subtract算子
        11sample算子
            12takeSample算子
       五Cache型
        13cache算子  
        14persist算子
    2)KeyValue数类型Transfromation算子
      输入分区输出分区
        15mapValues算子
      二单RDD两RDD聚集
       单RDD聚集
        16combineByKey算子
        17reduceByKey算子
        18partitionBy算子
       两RDD聚集
        19Cogroup算子
      三连接
        20join算子
        21leftOutJoin rightOutJoin算子
      各Spark算子的作用详细可见 http://www.cnblogs.com/zlslch/p/5723979.html
     3)Action算子
      输出
        22foreach算子
      二HDFS
        23saveAsTextFile算子
        24saveAsObjectFile算子
      三Scala集合数类型
        25collect算子
        26collectAsMap算子
          27reduceByKeyLocally算子
          28lookup算子
        29count算子
        30top算子
        31reduce算子
        32fold算子
        33aggregate算子
      各Spark算子的作用详细可见 http://www.cnblogs.com/zlslch/p/5723979.html
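    在逐个介绍算子之前，先用一段简短的pyspark代码体会Transformation的惰性计算与Action的触发执行(示意性代码):
    # Transformation(map/filter)只记录血缘，Action(count/collect)才真正触发计算
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1, 11), 2)

    # 下面两行都是Transformation，执行到这里时并不会真正计算
    doubled = rdd.map(lambda x: x * 2)
    big = doubled.filter(lambda x: x > 10)

    # count是Action算子，此时才会提交Job并触发前面所有Transformation的计算
    print(big.count())      # 5
    print(big.collect())    # [12, 14, 16, 18, 20]
    spark.stop()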
     
     
         1 Transformations 算子
     (1) map
      原 RDD 数项通 map 中户定义函数 f 映射转变新元素源码中 map 算子相初始化 RDD 新 RDD 做 MappedRDD(this scclean(f))
         图 1中方框表示 RDD 分区左侧分区户定义函数 fT>U 映射右侧新 RDD 分区实际等 Action算子触发 f 函数会函数stage 中数进行运算图 1 中第分区数记录 V1 输入 f通 f 转换输出转换分区中数记录 V'1
                                

          图1    map 算子 RDD 转换                   
     
        (2) flatMap
         原 RDD 中元素通函数 f 转换新元素生成 RDD 集合中元素合集合部创建 FlatMappedRDD(thisscclean(f))
      图 2 表 示 RDD 分 区 进 行 flatMap函 数 操 作 flatMap 中 传 入 函 数 fT>U T U 意数类型分区中数通户定义函数 f 转换新数外部方框认 RDD 分区方框代表集合 V1 V2 V3 集合作 RDD 数项存储数组容器转换V'1 V'2 V'3 原数组容器结合拆散拆散数形成 RDD 中数项

    图2     flapMap 算子 RDD 转换
        (3) mapPartitions
          mapPartitions 函 数 获 取 分 区 迭 代器 函 数 中 通 分 区 整 体 迭 代 器 整 分 区 元 素 进 行 操 作 部 实 现 生 成
    MapPartitionsRDD图 3 中方框代表 RDD 分区图 3 中户通函数 f (iter)>iterf ilter(_>3) 分区中数进行滤等 3 数保留方块代表 RDD 分区含 1 2 3 分区滤剩元素 3


        图3  mapPartitions 算子 RDD 转换
      (4)glom
      glom函数分区形成数组部实现返回GlommedRDD 图4中方框代表RDD分区图4中方框代表分区 该图表示含V1 V2 V3分区通函数glom形成数组Array[(V1)(V2)(V3)]

          图 4   glom算子RDD转换
         (5) union
          union 函数时需保证两 RDD 元素数类型相返回 RDD 数类型合 RDD 元素数类型相进行重操作保存元素果想重
    distinct()时 Spark 提供更简洁 union API通 ++ 符号相 union 函数操作
         图 5 中左侧方框代表两 RDD方框方框代表 RDD 分区右侧方框代表合 RDD方框方框代表分区
      含V1V2U1U2U3U4RDD含V1V8U5U6U7U8RDD合元素形成RDDV1V1V2V8形成分区U1U2U3U4U5U6U7U8形成分区

      图 5  union 算子 RDD 转换 

    (6) cartesian
            两 RDD 元 素 进 行 笛 卡 尔 积 操 作 操 作 部 实 现 返 回CartesianRDD图6中左侧方框代表两 RDD方框方框代表 RDD 分区右侧方框代表合 RDD方框方框代表分区图6中方框代表RDD方框中方框代表RDD分区
          例 : V1 RDD 中 W1 W2 Q5 进 行 笛 卡 尔 积 运 算 形 成 (V1W1)(V1W2) (V1Q5)
         
           图 6  cartesian 算子 RDD 转换
    (7) groupBy
      groupBy :元素通函数生成相应 Key数转化 KeyValue 格式 Key 相元素分组
      函数实现:
      1)户函数预处理:
      val cleanF scclean(f)
      2)数 map 进行函数操作进行 groupByKey 分组操作
         thismap(t > (cleanF(t) t))groupByKey(p)
      中 p 确定分区数分区函数决定行化程度
      图7 中方框代表 RDD 分区相key 元素合组例 V1 V2 合 V Value V1V2形成 VSeq(V1V2)

      图 7 groupBy 算子 RDD 转换
    (8) filter
        filter 函数功元素进行滤 元 素 应 f 函 数 返 回 值 true 元 素 RDD 中保留返回值 false 元素滤掉 部 实 现 相 生 成 FilteredRDD(thisscclean(f))
        面代码函数质实现:
        deffilter(fT>Boolean)RDD[T]newFilteredRDD(thisscclean(f))
      图 8 中方框代表 RDD 分区 T 意类型通户定义滤函数 f数项操作满足条件返回结果 true 数项保留例滤掉 V2 V3 保留 V1区分命名 V'1

      图 8  filter 算子 RDD 转换
         
      (9)distinct
      distinctRDD中元素进行重操作图9中方框代表RDD分区通distinct函数数重 例重复数V1 V1重保留份V1

        图9  distinct算子RDD转换
    (10)subtract
      subtract相进行集合差操作RDD 1RDD 1RDD 2交集中元素图10中左侧方框代表两RDD方框方框代表RDD分区 右侧方框
    代表合RDD方框方框代表分区 V1两RDD中均根差集运算规新RDD保留V2第RDD第二RDD没新RDD元素中包含V2
      
              图10   subtract算子RDD转换
    (11) sample
           sample RDD 集合元素进行采样获取元素子集户设定否放回抽样百分机种子进决定采样方式部实现生成 SampledRDD(withReplacement fraction seed)
      函数参数设置:
    ‰   withReplacementtrue表示放回抽样
    ‰   withReplacementfalse表示放回抽样
      图 11中 方 框 RDD 分 区 通 sample 函 数 采 样 50 数 V1 V2 U1 U2U3U4 采样出数 V1 U1 U2 形成新 RDD
         
           图11  sample 算子 RDD 转换
      (12)takeSample
      takeSample()函数面sample函数原理相例采样设定采样数进行采样时返回结果RDD相采样数进行
    Collect()返回结果集合单机数组
      图12中左侧方框代表分布式节点分区右侧方框代表单机返回结果数组 通takeSample数采样设置采样份数返回结果V1

      图12    takeSample算子RDD转换
      (13) cache
         cache  RDD 元素磁盘缓存存 相 persist(MEMORY_ONLY) 函数功
         图13 中方框代表 RDD 分区左侧相数分区存储磁盘通 cache 算子数缓存存
          
          图 13 Cache 算子 RDD 转换
      (14) persist
          persist 函数 RDD 进行缓存操作数缓存里 StorageLevel 枚举类型进行确定 种类型组合(见10) DISK 代表磁盘MEMORY 代表存 SER 代表数否进行序列化存储
      面函数定义 StorageLevel 枚举类型代表存储模式户通图 141 需进行选择
      persist(newLevelStorageLevel)
      图 141 中列出persist 函数进行缓存模式例MEMORY_AND_DISK_SER 代表数存储存磁盘序列化方式存储理

                图 141  persist 算子 RDD 转换
      图 142 中方框代表 RDD 分区 disk 代表存储磁盘 mem 代表存储存数初全部存储磁盘通 persist(MEMORY_AND_DISK) 数缓存存
    分区法容纳存含 V1 V2 V3 RDD存储磁盘含U1U2RDD旧存储存

          图 142   Persist 算子 RDD 转换
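      对应到pyspark，persist的存储级别通过StorageLevel指定，下面是一段示意性代码:
    # persist存储级别示意: MEMORY_AND_DISK表示内存放不下的分区溢写到磁盘
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("persist-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1000))

    rdd.persist(StorageLevel.MEMORY_AND_DISK)   # RDD的cache()等价于persist(MEMORY_ONLY)
    print(rdd.count())      # 第一次Action: 计算并缓存
    print(rdd.sum())        # 第二次Action: 直接读缓存
    rdd.unpersist()         # 用完释放
    spark.stop()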
    (15) mapValues
          mapValues :针(Key Value)型数中 Value 进行 Map 操作 Key 进行处理
        图 15 中方框代表 RDD 分区 a>a+2 代表 (V11) 样 Key Value 数数 Value 中 1 进行加 2 操作返回结果 3
         
          图 15   mapValues 算子 RDD 转换
    (16) combineByKey
      面代码 combineByKey 函数定义:
      combineByKey[C](createCombiner(V) C
      mergeValue(C V) C
      mergeCombiners(C C) C
      partitionerPartitioner
      mapSideCombineBooleantrue
      serializerSerializernull)RDD[(KC)]
    说明:
    ‰   createCombiner: V > C C 存情况通 V 创建 seq C
    ‰   mergeValue: (C V) > C C 已存情况需 merge item V
    加 seq C 中者叠加
       mergeCombiners: (C C) > C合两 C
    ‰   partitioner: Partitioner Shuff le 时需 Partitioner
    ‰   mapSideCombine : Boolean true减传输量 combine map
    端先做叠加先 partition 中相 key value 叠加
    shuff le
    ‰   serializerClass: String null传输需序列化户定义序列化类:
      例相元素 (Int Int) RDD 转变 (Int Seq[Int]) 类型元素 RDD图 16中方框代表 RDD 分区图通 combineByKey (V12) (V11)数合( V1Seq(21))
      
          图 16  comBineByKey 算子 RDD 转换
    (17) reduceByKey
         reduceByKey combineByKey 更简单种情况两值合成值( Int Int V)to (Int Int C)叠加 createCombiner reduceBykey 简单直接返回 v mergeValue mergeCombiners 逻辑相没区
        函数实现:
        def reduceByKey(partitioner Partitioner func (V V) > V) RDD[(K V)]
    {
    combineByKey[V]((v V) > v func func partitioner)
    }
      图17中方框代表 RDD 分区通户定义函数 (AB) > (A + B) 函数相 key 数 (V12) (V11) value 相加运算结果( V13)
         
            图 17 reduceByKey 算子 RDD 转换
    (18)partitionBy
      partitionBy函数RDD进行分区操作
      函数定义
      partitionBy(partitioner:Partitioner)
      果原RDD分区器现分区器(partitioner)致重分区果致相根分区器生成新ShuffledRDD
      图18中方框代表RDD分区 通新分区策略原分区V1 V2数合分区

     
        图18  partitionBy算子RDD转换
     (19)Cogroup
       cogroup函数两RDD进行协划分cogroup函数定义
      cogroup[W](other: RDD[(K W)] numPartitions: Int): RDD[(K (Iterable[V] Iterable[W]))]
      两RDD中KeyValue类型元素RDD相Key元素分聚合集合返回两RDD中应Key元素集合迭代器
      (K (Iterable[V] Iterable[W]))
      中KeyValueValue两RDD相Key两数集合迭代器构成元组
      图19中方框代表RDD方框方框代表RDD中分区 RDD1中数(U11) (U12)RDD2中数(U12)合(U1((12)(2)))

            图19  Cogroup算子RDD转换
     (20) join
           join 两需连接 RDD 进行 cogroup函数操作相 key 数够放分区 cogroup 操作形成新 RDD key 元素进行笛卡尔积操作返回结果展应 key 元组形成集合返回 RDD[(K (V W))]
       面 代 码 join 函 数 实 现 质 通 cogroup 算 子 先 进 行 协 划 分 通 flatMapValues 合数散
           thiscogroup(otherpartitioner)f latMapValues{case(vsws) > for(v图 20两 RDD join 操作示意图方框代表 RDD方框代表 RDD 中分区函数相 key 元素 V1 key 做连接结果 (V1(11)) (V1(12))

                        图 20   join 算子 RDD 转换
    (21)eftOutJoinrightOutJoin
      LeftOutJoin(左外连接)RightOutJoin(右外连接)相join基础先判断侧RDD元素否空果空填充空 果空数进行连接运算
    返回结果
    面代码leftOutJoin实现
    if (wsisEmpty) {
    vsmap(v > (v None))
    } else {
    for (v < vs w < ws) yield (v Some(w))
    }
     
    2 Actions 算子
      质 Action 算子中通 SparkContext 进行提交作业 runJob 操作触发RDD DAG 执行
    例 Action 算子 collect 函数代码感兴趣读者着入口进行源码剖析:

    **
    * Return an array that contains all of the elements in this RDD
    *
    def collect() Array[T] {
    * 提交 Job*
    val results scrunJob(this (iter Iterator[T]) > itertoArray)
    Arrayconcat(results _*)
    }
    (22) foreach
      foreach RDD 中元素应 f 函数操作返回 RDD Array 返回Uint图22表示 foreach 算子通户定义函数数项进行操作例中定义函数 println()控制台印数项
      
          图 22 foreach 算子 RDD 转换
      (23) saveAsTextFile
      函数数输出存储 HDFS 指定目录
    面 saveAsTextFile 函数部实现部
      通调 saveAsHadoopFile 进行实现:
    thismap(x > (NullWritableget() new Text(xtoString)))saveAsHadoopFile[TextOutputFormat[NullWritable Text]](path)
    RDD 中元素映射转变 (null xtoString)然写入 HDFS
      图 23中左侧方框代表 RDD 分区右侧方框代表 HDFS Block通函数RDD 分区存储 HDFS 中 Block
      
                图 23   saveAsHadoopFile 算子 RDD 转换
      (24)saveAsObjectFile
      saveAsObjectFile分区中10元素组成Array然Array序列化映射(NullBytesWritable(Y))元素写入HDFSSequenceFile格式
      面代码函数部实现
      map(x>(NullWritableget()new BytesWritable(Utilsserialize(x))))
      图24中左侧方框代表RDD分区右侧方框代表HDFSBlock 通函数RDD分区存储HDFSBlock

                图24 saveAsObjectFile算子RDD转换
     
     (25) collect
      collect 相 toArray toArray 已时推荐 collect 分布式 RDD 返回单机 scala Array 数组数组运 scala 函数式操作
      图 25中左侧方框代表 RDD 分区右侧方框代表单机存中数组通函数操作结果返回 Driver 程序节点数组形式存储

      图 25   Collect 算子 RDD 转换  
    (26)collectAsMap
      collectAsMap(KV)型RDD数返回单机HashMap 重复KRDD元素面元素覆盖前面元素
      图26中左侧方框代表RDD分区右侧方框代表单机数组 数通collectAsMap函数返回Driver程序计算结果结果HashMap形式存储

     
              图26 CollectAsMap算子RDD转换
     
     (27)reduceByKeyLocally
      实现先reducecollectAsMap功先RDD整体进行reduce操作然收集结果返回HashMap
     (28)lookup
    面代码lookup声明
    lookup(key:K):Seq[V]
    Lookup函数(KeyValue)型RDD操作返回指定Key应元素形成Seq 函数处理优化部分果RDD包含分区器会应处理K分区然返回(KV)形成Seq 果RDD包含分区器需全RDD元素进行暴力扫描处理搜索指定K应元素
      图28中左侧方框代表RDD分区右侧方框代表Seq结果返回Driver节点应中

          图28  lookupRDD转换
    (29) count
      count 返回整 RDD 元素数
      部函数实现:
      defcount()LongscrunJob(thisUtilsgetIteratorSize_)sum
      图 29中返回数数 5方块代表 RDD 分区

         图29 count RDD 算子转换
    (30)top
    top返回k元素 函数定义
    top(num:Int)(implicit ord:Ordering[T]):Array[T]
    相函数说明
    ·top返回k元素
    ·take返回k元素
    ·takeOrdered返回k元素返回数组中保持元素序
    ·first相top(1)返回整RDD中前k元素定义排序方式Ordering[T]
    返回含前k元素数组
     
    (31)reduce
      reduce函数相RDD中元素进行reduceLeft函数操作 函数实现
      Some(iterreduceLeft(cleanF))
      reduceLeft先两元素进行reduce函数操作然结果迭代器取出元素进行reduce函数操作直迭代器遍历完元素结果RDD中先分区中元素集合分进行reduceLeft 分区形成结果相元素结果集合进行reduceleft操作
      例:户定义函数
      f:(AB)>(A_1+@+B_1A_2+B_2)
      图31中方框代表RDD分区通户定函数f数进行reduce运算 示例
    返回结果V1@[1]V2U@U2@U3@U412

     
    图31 reduce算子RDD转换
    (32)fold
      foldreduce原理相reduce相reduce时迭代器取第元素zeroValue
      图32中通面户定义函数进行fold运算图中方框代表RDD分区 读者参reduce函数理解
      fold((V0@2))( (AB)>(A_1+@+B_1A_2+B_2))


              图32  fold算子RDD转换
     (33)aggregate
       aggregate先分区元素进行aggregate操作分区结果进行fold操作
      aggreagatefoldreduce处aggregate相采方式进行数聚集种聚集行化 foldreduce函数运算程中分区中需进行串行处理分区串行计算完结果结果前方式进行聚集返回终聚集结果
      函数定义
    aggregate[B](z: B)(seqop: (BA) > Bcombop: (BB) > B): B
      图33通户定义函数RDD 进行aggregate聚集操作图中方框代表RDD分区
      rddaggregate(V0@2)((AB)>(A_1+@+B_1A_2+B_2))(AB)>(A_1+@+B_1A_@+B_2))
      下面介绍计算模型中的两个特殊变量。
      广播(broadcast)变量: 广泛用于广播Map Side Join中的小表、广播大变量等场景。这些数据集合在单节点内存中能够容纳，不需要像RDD那样在节点间打散存储。
    Spark运行时把广播变量数据发到各个节点并保存下来，后续计算可以复用。相比Hadoop的distributed cache，广播的内容可以跨作业共享。Broadcast的底层实现采用了BT机制。

            图33  aggregate算子对RDD转换
    ②代表V
    ③代表U
    accumulator变量: 允许做全局累加操作。accumulator变量被广泛用于在应用中记录当前运行指标等情景。
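    下面给出广播变量和accumulator变量的一个示意性pyspark用法(小表map端join与全局计数，数据均为虚构):
    # 广播变量: 把小的维表字典分发到各节点; accumulator: 做全局累加统计
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("broadcast-acc-demo").getOrCreate()
    sc = spark.sparkContext

    # 小表(字典)用广播变量发到各个executor，做map side join，避免shuffle
    dim = sc.broadcast({1: "A", 2: "B"})
    # accumulator用来记录未匹配上的记录数这类运行指标
    miss = sc.accumulator(0)

    def lookup(pair):
        key, amount = pair
        name = dim.value.get(key)
        if name is None:
            miss.add(1)          # 只在executor端累加，结果在driver端读取
        return (key, name, amount)

    fact = sc.parallelize([(1, 100), (2, 200), (3, 300)])
    result = fact.map(lookup).collect()   # Action触发后accumulator的值才可靠

    print(result)
    print("未匹配记录数:", miss.value)     # 3号key没有维表记录，输出1
    spark.stop()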

    详细介绍
    官方文档列举32种常见算子包括Transformation20种操作Action12种操作
    (注:截图windows运行结果)
    Transformation:
    1map
    map输入变换函数应RDD中元素mapPartitions应分区区mapPartitions调粒度parallelize(1 to 10 3)map函数执行10次mapPartitions函数执行3次

    2filter(function)
    滤操作满足filterfunction函数trueRDD元素组成新数集:filter(a 1)

    3flatMap(function)
    mapRDD中元素逐进行函数操作映射外RDDflatMap操作函数应RDD中元素返回迭代器容构成新RDDflatMap操作函数应RDD中元素返回迭代器容构成RDD
    flatMapmap区map映射flatMap先映射扁化map次(func)产生元素返回象flatMap步象合象

    4mapPartitions(function)
    区foreachPartition(属Action返回值)mapPartitions获取返回值map区前面已提单独运行RDD分区(block)类型TRDD运行时(function)必须Iterator > Iterator类型方法(入参)

    5mapPartitionsWithIndex(function)
    mapPartitions类似需提供表示分区索引值整型值作参数function必须(int Iterator)>Iterator类型

    6sample(withReplacement fraction seed)
    采样操作样中取出部分数withReplacement否放回fraction采样例seed指定机数生成器种子(否放回抽样分truefalsefraction取样例(0 1]seed种子整型实数)

    7union(otherDataSet)
    源数集数集求集重

    8intersection(otherDataSet)
    源数集数集求交集重序返回

    9distinct([numTasks])
    返回源数集重新数集重局部序整体序返回(详细介绍见
    https://blog.csdn.net/Fortuna_i/article/details/81506936)
    注:groupByKeyreduceByKeyaggregateByKeysortByKeyjoincogroup等Transformation操作均包含[numTasks]务数参数参考行链接理解

    注:pairRDD进行操作添加pairRDD简易创建程

    10groupByKey([numTasks])
    PairRDD(kv)RDD调返回(kIterable)作相键值分组集合序列中序确定groupByKey键值集合加载存中存储计算键应值太易导致存溢出
    前求集union方法pair1pair2变相键值pair3进行groupByKey

    11reduceByKey(function[numTasks])
    groupByKey类似(a1) (a2) (b1) (b2)groupByKey产生中间结果( (a1) (a2) ) ( (b1) (b2) )reduceByKey(a3) (b3)
    reduceByKey作聚合groupByKey作分组(functionkey值进行聚合)

    12aggregateByKey(zeroValue)(seqOp combOp [numTasks])
    类似reduceByKeypairRDD中想key值进行聚合操作初始值(seqOp中combOpenCL中未)应返回值pairRDD区aggregate(返回值非RDD)

    13sortByKey([ascending] [numTasks])
    样基pairRDD根key值进行排序ascending升序默认true升序numTasks

    14join(otherDataSet[numTasks])
    加入RDD(kv)(kw)类型dataSet调返回(k(vw))pair dataSet

    15cogroup(otherDataSet[numTasks])
    合两RDD生成新RDD实例中包含两Iterable值第表示RDD1中相值第二表示RDD2中相值(key值)操作需通partitioner进行重新分区需执行次shuffle操作(两RDD前进行shuffle需)

    16cartesian(otherDataSet)
    求笛卡尔积该操作会执行shuffle操作

    17pipe(command[envVars])
    通shell命令RDD分区进行道化通pipe变换shell命令Spark中生成新RDD:

    (图莫怪^_^)
    18coalesce(numPartitions)
    重新分区减少RDD中分区数量numPartitions

    19repartition(numPartitions)
    repartitioncoalesce接口中shuffletrue简易实现Reshuffle RDD机分区分区数量衡分区分区数远原分区数需shuffle

    20repartitionAndSortWithinPartitions(partitioner)
    该方法根partitionerRDD进行分区结果分区中key进行排序
     
    Action:
    1reduce(function)
    reduceRDD中元素两两传递输入函数时产生新值新值RDD中元素传递输入函数直值止

    2collect()
    RDDArray数组形式返回中元素(具体容参见:
    https://blog.csdn.net/Fortuna_i/article/details/80851775)

    3count()
    返回数集中元素数默认Long类型

    4first()
    返回数集第元素(类似take(1))

    5takeSample(withReplacement num [seed])
    数集进行机抽样返回包含num机抽样元素数组withReplacement表示否放回抽样参数seed指定生成机数种子
    该方法仅预期结果数组情况数加载driver端存中

    6take(n)
    返回包含数集前n元素数组(0标n1标元素)排序

    7takeOrdered(n[ordering])
    返回RDD中前n元素默认序排序(升序)者定义较器序排序

    8saveAsTextFile(path)
    dataSet中元素文文件形式写入文件系统者HDFS等Spark元素调toString方法数元素转换文文件中行记录
    文件保存文件系统会保存executor机器目录


    9saveAsSequenceFile(path)(Java and Scala)
    dataSet中元素Hadoop SequenceFile形式写入文件系统者HDFS等(pairRDD操作)


    10saveAsObjectFile(path)(Java and Scala)
    数集中元素ObjectFile形式写入文件系统者HDFS等


    11countByKey()
    统计RDD[KV]中K数量返回具key计数(kint)pairshashMap

    12foreach(function)
    数集中元素运行函数function

    补充:Spark23官方文档中原[numTasks]务数参数改[numPartitions]分区数
    实践
    SparkRDD算子分两类:TransformationAction
    Transformation:延迟加载数Transformation会记录元数信息计算务触发Action时会真正开始计算
    Action:立加载数开始计算
    创建RDD方式两种:
    1通sctextFile(rootwordstxt)文件系统中创建 RDD
    2#通行化scala集合创建RDD:val rdd1 scparallelize(Array(12345678))
    1简单算子说明
    里先说简单Transformation算子
    通行化scala集合创建RDD
    val rdd1  scparallelize(Array(12345678))
    查该rdd分区数量
    rdd1partitionslength
    map方法scala中样List中数出做函数运算
    sortBy:数进行排序
    val rdd2 scparallelize(List(56473829110))map(_*2)sortBy(x>xtrue)
    filter:List中数进行函数造作挑选出10值
    val rdd3 rdd2filter(_>10)
    collect:终结果显示出
    flatMap数先进行map操作进行flat(碾压)操作
    rdd4flatMap(_split(’ ))collect
    运行效果图


    val rdd1 scparallelize(List(56473829110))
    val rdd2 scparallelize(List(56473829110))map(_*2)sortBy(x>xtrue)
    val rdd3 rdd2filter(_>10)
    val rdd2 scparallelize(List(56473829110))map(_*2)sortBy(x>x+true)
    val rdd2 scparallelize(List(56473829110))map(_*2)sortBy(x>xtoStringtrue)


    intersection求交集
    val rdd9 rdd6intersection(rdd7)
    val rdd1 scparallelize(List((tom 1) (jerry 2) (kitty 3)))
    val rdd2 scparallelize(List((jerry 9) (tom 8) (shuke 7)))


    join
    val rdd3 rdd1join(rdd2)

    val rdd3 rdd1leftOuterJoin(rdd2)

    val rdd3 rdd1rightOuterJoin(rdd2)

    union:求集注意类型致
    val rdd6 scparallelize(List(5647))
    val rdd7 scparallelize(List(1234))
    val rdd8 rdd6union(rdd7)
    rdd8distinctsortBy(x>x)collect


    groupByKey
    val rdd3 rdd1 union rdd2
    rdd3groupByKey
    rdd3groupByKeymap(x>(x_1x_2sum))


    cogroup
    val rdd1 scparallelize(List((tom 1) (tom 2) (jerry 3) (kitty 2)))
    val rdd2 scparallelize(List((jerry 2) (tom 1) (shuke 2)))
    val rdd3 rdd1cogroup(rdd2)
    val rdd4 rdd3map(t>(t_1 t_2_1sum + t_2_2sum))


    cartesian笛卡尔积
    val rdd1 scparallelize(List(tom jerry))
    val rdd2 scparallelize(List(tom kitty shuke))
    val rdd3 rdd1cartesian(rdd2)


    接说简单Action算子
    val rdd1 scparallelize(List(12345) 2)
    #collect
    rdd1collect
    #reduce
    val rdd2 rdd1reduce(+)
    #count
    rdd1count
    #top
    rdd1top(2)
    #take
    rdd1take(2)
    #first(similer to take(1))
    rdd1first
    #takeOrdered
    rdd1takeOrdered(3)

    2复杂算子说明
    mapPartitionsWithIndex  partition中分区号应值出 源码
    val func (index Int iter Iterator[(Int)]) > {
    itertoListmap(x > [partID + index + val + x + ])iterator
    }
    val rdd1 scparallelize(List(123456789) 2)
    rdd1mapPartitionsWithIndex(func)collect


    aggregate
    def func1(index Int iter Iterator[(Int)]) Iterator[String] {
    itertoListmap(x > [partID + index + val + x + ])iterator
    }
    val rdd1 scparallelize(List(123456789) 2)
    rdd1mapPartitionsWithIndex(func1)collect
    ###action操作 第参数初始值 二2函数(第函数先分区进行合 第二函数分区合结果进行合)
    ###0 + (0+1+2+3+4 + 0+5+6+7+8+9)
    rdd1aggregate(0)(_+_ _+_)


    rdd1aggregate(0)(mathmax( ) _ + _)
    ###0分01分区List元素分区中值里分37然0+3+710


    ###51 52345 –> 567899 –> 5 + (5+9)
    rdd1aggregate(5)(mathmax( ) _ + _)

    val rdd3 scparallelize(List(12233454567)2)
    rdd3aggregate()((xy) > mathmax(xlength ylength)toString (xy) > x + y)
    ######### length分两分区元素length进行较0分区字符串21分区字符串4然结果返回分先结果2442


    val rdd4 scparallelize(List(1223345)2)
    rdd4aggregate()((xy) > mathmin(xlength ylength)toString (xy) > x + y)
    ######## length012较字符串0然字符串023较值1


    aggregateByKey
    val pairRDD scparallelize(List( (cat2) (cat 5) (mouse 4)(cat 12) (dog 12) (mouse 2)) 2)
    def func2(index Int iter Iterator[(String Int)]) Iterator[String] {
    itertoListmap(x > [partID + index + val + x + ])iterator
    }
    pairRDDmapPartitionsWithIndex(func2)collect


    pairRDDaggregateByKey(0)(mathmax( ) _ + _)collect
    ########## 先0号分区中数进行操作(初始值数进行较)(cat5)(mouse4)然1号分区中数进行操作(cat12)(dog12)(mouse2)然两分区数进行相加终结果


    coalesce
    #coalesce(2 false)代表数重新分成2区进行shuffle(数重新进行机分配数通网络分配机器)
    val rdd1 scparallelize(1 to 10 10)
    val rdd2 rdd1coalesce(2 false)
    rdd2partitionslength


    repartition
    repartition效果等coalesce(x true)

    collectAsMap Map(b > 2 a > 1)
    val rdd scparallelize(List((a 1) (b 2)))
    rddcollectAsMap


    combineByKey reduceByKey相效果
    ###第参数x原封动取出 第二参数函数 局部运算 第三函数 局部运算结果做运算
    ###分区中key中value中第值 (hello1)(hello1)(good1)–>(hello(11)good(1))–>x相hello第1 good中1
    val rdd1 sctextFile(hdfsmaster9000wordcountinput)flatMap(split( ))map(( 1))
    val rdd2 rdd1combineByKey(x > x (a Int b Int) > a + b (m Int n Int) > m + n)
    rdd1collect


    ###input3文件时(3block块分三区 3文件3block ) 会加310
    val rdd3 rdd1combineByKey(x > x + 10 (a Int b Int) > a + b (m Int n Int) > m + n)
    rdd3collect


    val rdd4 scparallelize(List(dogcatgnusalmonrabbitturkeywolfbearbee) 3)
    val rdd5 scparallelize(List(112221222) 3)
    val rdd6 rdd5zip(rdd4)


    第参数List(_)代表第元素转换List第 二参数x List[String] y String) > x + y代表元素y加入list中第三参数(m List[String] n List[String]) > m ++ n)代表两分区list合成新List
    val rdd7 rdd6combineByKey(List(_) (x List[String] y String) > x + y (m List[String] n List[String]) > m ++ n)


    countByKey
    val rdd1 scparallelize(List((a 1) (b 2) (b 2) (c 2) (c 1)))
    rdd1countByKey
    rdd1countByValue


    filterByRange
    val rdd1 scparallelize(List((e 5) (c 3) (d 4) (c 2) (a 1)))
    val rdd2 rdd1filterByRange(b d)
    rdd2collect


    flatMapValues  Array((a1) (a2) (b3) (b4))
    val rdd3 scparallelize(List((a 1 2) (b 3 4)))
    val rdd4 rdd3flatMapValues(_split( ))
    rdd4collect


    foldByKey
    val rdd1 scparallelize(List(dog wolf cat bear) 2)
    val rdd2 rdd1map(x > (xlength x))
    val rdd3 rdd2foldByKey()(+)


    keyBy 传入参数做key
    val rdd1 scparallelize(List(dog salmon salmon rat elephant) 3)
    val rdd2 rdd1keyBy(_length)
    rdd2collect


    keys values
    val rdd1 scparallelize(List(dog tiger lion cat panther eagle) 2)
    val rdd2 rdd1map(x > (xlength x))
    rdd2keyscollect
    rdd2valuescollect


    方法英文解释
    #
    map(func)
    Return a new distributed dataset formed by passing each element of the source through a function func
    filter(func)
    Return a new dataset formed by selecting those elements of the source on which func returns true
    flatMap(func)(部执行序右左先执行Map执行Flat)
    Similar to map but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
    mapPartitions(func)
    Similar to map but runs separately on each partition (block) of the RDD so func must be of type Iterator > Iterator when running on an RDD of type T
    mapPartitionsWithIndex(func)
    Similar to mapPartitions but also provides func with an integer value representing the index of the partition so func must be of type (Int Iterator) > Iterator when running on an RDD of type T
    sample(withReplacement fraction seed)
    Sample a fraction fraction of the data with or without replacement using a given random number generator seed
    union(otherDataset)
    Return a new dataset that contains the union of the elements in the source dataset and the argument
    intersection(otherDataset)
    Return a new RDD that contains the intersection of elements in the source dataset and the argument
    distinct([numTasks]))
    Return a new dataset that contains the distinct elements of the source dataset
    groupByKey([numTasks])
    When called on a dataset of (K V) pairs returns a dataset of (K Iterable) pairs
    reduceByKey(func [numTasks])
    When called on a dataset of (K V) pairs returns a dataset of (K V) pairs where the values for each key are aggregated using the given reduce function func which must be of type (VV) > V Like in groupByKey the number of reduce tasks is configurable through an optional second argument
    aggregateByKey(zeroValue)(seqOp combOp [numTasks])
    When called on a dataset of (K V) pairs returns a dataset of (K U) pairs where the values for each key are aggregated using the given combine functions and a neutral zero value Allows an aggregated value type that is different than the input value type while avoiding unnecessary allocations Like in groupByKey the number of reduce tasks is configurable through an optional second argument
    sortByKey([ascending] [numTasks])
    When called on a dataset of (K V) pairs where K implements Ordered returns a dataset of (K V) pairs sorted by keys in ascending or descending order as specified in the boolean ascending argument
    join(otherDataset [numTasks])
    When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin and fullOuterJoin
    cogroup(otherDataset [numTasks])
    When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith
    cartesian(otherDataset)
    When called on datasets of types T and U returns a dataset of (T U) pairs (all pairs of elements)
    pipe(command [envVars])
    Pipe each partition of the RDD through a shell command eg a Perl or bash script RDD elements are written to the process’s stdin and lines output to its stdout are returned as an RDD of strings
    coalesce(numPartitions)
    Decrease the number of partitions in the RDD to numPartitions Useful for running operations more efficiently after filtering down a large dataset
    repartition(numPartitions)
    Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them This always shuffles all data over the network
    repartitionAndSortWithinPartitions(partitioner)
    Repartition the RDD according to the given partitioner and within each resulting partition sort records by their keys This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery
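    To make the reference above concrete, here is a minimal PySpark sketch that strings a few of these transformations together (flatMap, map, reduceByKey, filter), ending with the collect action. It assumes an existing SparkContext named sc and is illustrative only:
    lines = sc.parallelize(["spark makes rdds", "rdds are lazy", "spark is fast"])
    words = lines.flatMap(lambda line: line.split(" "))    # flatMap: one line -> many words
    pairs = words.map(lambda w: (w, 1))                    # map: word -> (word, 1)
    counts = pairs.reduceByKey(lambda a, b: a + b)         # reduceByKey: sum the counts per word
    frequent = counts.filter(lambda kv: kv[1] > 1)         # filter: keep words that occur more than once
    print(frequent.collect())                              # collect is an action and triggers the computation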

    9. Spark RDD
    Over time data analytics has reached a new level, which in turn has changed how organizations operate and what they expect. Today's analytics must not only handle large volumes of data but also deliver results with a fast turnaround. Hadoop, the technology behind much of modern data analytics, falls short when it comes to fast processing; with the arrival of Spark, expectations for processing speed rose accordingly.
    The first term that comes to mind when talking about Spark is the Resilient Distributed Dataset (RDD). Besides making data processing faster, a key feature of Spark RDDs is that the dataset is logically partitioned during computation.
    This section looks at the technical side of Spark RDDs, digs into the underlying details, and gives an overview of how RDDs are used in Spark.
    Spark RDD characteristics
    RDD stands for Resilient Distributed Dataset, and each word in the name describes a property:
    · Resilient: fault tolerance through the RDD lineage graph (DAG), so data can be recomputed when a node fails.
    · Distributed: the dataset resides across multiple nodes of the cluster.
    · Dataset: the collection of data records you work with.
    RDDs address challenges found in Hadoop's design, and the solution is efficient largely because of lazy evaluation: RDDs in Spark compute only when they are needed, which saves a great deal of processing time and makes the whole pipeline more efficient.
    Many shortcomings of Hadoop MapReduce are overcome by these RDD properties, which is a big part of why Spark RDDs became popular.
    Core properties of Spark RDDs
    · In-memory computation
    · Lazy evaluation
    · Fault tolerance
    · Immutability
    · Partitioning
    · Persistence
    · Coarse-grained operations
    · Data locality
    The subsections below walk through these topics step by step.

    A Spark RDD represents a dataset distributed across the nodes of a cluster that can be operated on in parallel. In other words, the RDD is Apache Spark's fault-tolerant abstraction and its fundamental data structure.
    In Spark an RDD is an immutable distributed collection of objects, and it supports two methods for keeping data around:
    · cache()
    · persist()

    Spark RDDs rely on in-memory caching, and the dataset inside an RDD is logically partitioned. The benefit of in-memory caching is that when the data does not fit, the excess is spilled to disk or recomputed, which is part of what "resilient" means. You can pull an RDD back whenever you need it, which makes the overall processing faster.
    For data processing Spark is often quoted as being up to 100 times faster than Hadoop; the points below are among the reasons Apache Spark is faster.


    Operations supported by Spark RDDs
    RDDs in Spark support two types of operations:
    1. Transformations
    2. Actions

    Transformations
    A transformation creates a new dataset from an existing one. A typical example is map: it passes each element of the dataset through a function and returns a new RDD representing the results.
    Scala:
    val l = sc.textFile("example.txt")

    val lLengths = l.map(s => s.length)

    val totalLength = lLengths.reduce((a, b) => a + b)
    If you want to reuse lLengths later, call persist() as shown:
    lLengths.persist()
    See the API documentation at https://spark.apache.org for the full list of transformations supported by Spark RDDs.

    Spark RDDs support two kinds of transformations:
    1. Narrow transformations
    2. Wide transformations
    In a narrow transformation, each partition of the output RDD depends on a single partition of the parent RDD. In a wide transformation, an output partition may be built from many parent partitions; in other words, it is a shuffle transformation.
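    A quick way to see the difference is to inspect the lineage: a narrow transformation such as mapValues adds no shuffle stage, while a wide one such as reduceByKey does. A minimal sketch, assuming an existing SparkContext sc:
    rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)
    narrow = rdd.mapValues(lambda v: v * 10)        # narrow: each output partition depends on one parent partition
    wide = narrow.reduceByKey(lambda a, b: a + b)   # wide: data must be re-grouped by key (shuffle)
    print(wide.toDebugString().decode())            # the lineage shows the shuffle stage introduced by reduceByKey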
     
    Transformations on Spark RDDs are lazy: they do not compute their results right away. Instead, Spark only remembers the transformations applied to the base dataset (the file in the example above). The transformations are computed only when an action needs a result, which leads to faster and more efficient data processing.

    By default a transformed RDD is recomputed every time you run an action on it. With the persist method, Spark keeps the elements in memory across the cluster for much faster access the next time you query them; persisting to disk and replicating the RDD across nodes are also supported.
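    A minimal caching sketch in PySpark, assuming an existing SparkContext sc and the example.txt file used above:
    from pyspark import StorageLevel

    lengths = sc.textFile("example.txt").map(lambda s: len(s))
    lengths.persist(StorageLevel.MEMORY_AND_DISK)   # keep in memory, spill to disk if it does not fit

    total = lengths.reduce(lambda a, b: a + b)      # first action: computes the RDD and fills the cache
    count = lengths.count()                         # second action: served from the cached blocks
    lengths.unpersist()                             # release the cached blocks when no longer needed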
     
     
    Actions
    An action runs a computation on the RDD and returns a value to the driver program. For example, reduce is an action that aggregates the elements of an RDD with a function and returns the final result to the driver.
    There are three ways to create a Spark RDD:
    1. Parallelizing a collection
    2. Referencing an external dataset (in an external storage system such as a shared file system, HBase or HDFS)
    3. Transforming an existing Apache Spark RDD
    Each of these ways of creating a Spark RDD is discussed below.

    The Resilient Distributed Dataset (RDD) is one of Apache Spark's most important features, and understanding it matters for understanding Spark's place in the data industry.

    Parallelized collections
    You create a parallelized collection by calling the parallelize method of the SparkContext with an existing collection from your driver program (Java, Scala or Python). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

    Example of creating a Spark RDD from a parallelized collection in Scala,
    where the numbers 2 to 6 are stored in a parallelized collection:
    val collection = Array(2, 3, 4, 5, 6)

    val prData = spark.sparkContext.parallelize(collection)

    The distributed dataset prData created here can be operated on in parallel; for example, you can call prData.reduce() to add up the elements of the array.

    One key parameter of parallelized collections is the number of partitions the dataset is cut into. Spark runs one task per partition, and 2 to 4 partitions per CPU in the cluster is usually a good target. Spark sets the number of partitions automatically based on the cluster, but you can also set it manually by passing it as a second argument to parallelize.
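    For example, in PySpark (assuming an existing SparkContext sc) the partition count can be passed as the second argument:
    prData = sc.parallelize([2, 3, 4, 5, 6], 4)     # 4 is the manually chosen number of partitions
    print(prData.getNumPartitions())                # -> 4
    print(prData.reduce(lambda a, b: a + b))        # -> 20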
     
    External datasets
    Apache Spark can create distributed datasets from any storage source supported by Hadoop, including:
    · Local file system
    · HDFS
    · Cassandra
    · HBase
    · Amazon S3
    Spark likewise supports several file formats:
    · Text files
    · Sequence Files
    · CSV
    · JSON
    · Any Hadoop Input Format
    For example, RDDs of text files can be created with the textFile method of the SparkContext. The method takes a URI for the file (a path on the local system, hdfs://, and so on) and reads the file as a collection of lines.

    One important point: if you use a path on the local file system, the file must be accessible at the same path on every worker node, so you either copy the data file to all workers or use a network-mounted shared file system.

    You can also load an external dataset through the DataFrame reader interface and then convert it to an RDD with the .rdd method.

    Below is an example of converting a text file; it returns a dataset of strings which is then turned into an RDD:
    val exDataRDD = spark.read.textFile("path/of/textfile").rdd


    Existing RDDs
    RDDs are immutable and cannot be changed in place, but a transformation lets you create a new RDD from an existing one. Because no mutation takes place, consistency is maintained across the cluster. Some of the operations used for this are:
    · map
    · filter
    · count
    · distinct
    · flatMap
    Example:
    val seasons = spark.sparkContext.parallelize(Seq("summer", "monsoon", "spring", "winter"))

    val seasons1 = seasons.map(s => (s.charAt(0), s))

    Relational database performance optimization: current/history split tables, table partitioning, and data-cleanup principles
    Principles and goals
    · Growing transaction volume, or simple accumulation over time, makes the database larger and larger and eventually drags down system performance, so data in some business tables should be archived and cleaned up.
    · Reducing the data volume improves response time and, in turn, the user experience.
    Thresholds for deciding whether data needs to be cleaned up
    As a rule of thumb, once a table exceeds about 5 GB on disk, or a table in an OLTP (online transaction processing) system exceeds roughly 30 million rows, partitioning or splitting the table should be considered.
    Beyond those thresholds, the database performance metrics should also be taken into account. If single-table tuning measures (table design, index design, query design, and so on) have been fully exploited and still cannot meet the business requirements, partitioning or splitting can be considered even before the thresholds above are reached; in that case the row count at that point becomes the threshold for that table.
    In general a row-count threshold is easier to work with than a disk-size threshold, so the row count at the moment the limit is reached is usually taken as the threshold. The rest of this section uses row count as the threshold; it depends on the table's data format, its index design, and the acceptable per-transaction processing time.
    Estimating the full-load cycle
    The full-load cycle is how long it takes an empty table to reach the threshold.
    If the business volume is stable, the full-load cycle is simply the time it takes to accumulate the threshold amount of data. If the business volume keeps growing, estimate the full-load cycle from the growth rate; in other words, know in advance when the threshold will be reached and when migration will be needed, so the work can be planned ahead of time.
    Choosing the migration cycle
    Once the table threshold is known, decide the migration cycle: how often should data be moved out of the table?
    Migrating to the history table every day (a one-day cycle) is simple to operate and can be fully automated as a scheduled job, but it touches the system frequently.
    A rule of thirds is an alternative: migrate once the data reaches one third of the full-load cycle, i.e. the migration cycle is one third of the full-load cycle. This is more plannable but slightly more complex to operate (especially when the business volume is not stable), and each migration needs development and operations involvement.
    Data flow across history
    current table -> history table -> backup table

    In the original diagram, green stands for the current table: it still takes high-frequency writes and its query performance must be very high. Yellow stands for the history table: writes have stopped but queries are still needed, so query performance should be good. Grey stands for the backup table: its data no longer takes writes and only occasionally serves queries, so average query performance is acceptable.
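    On the Hadoop side of this handbook the same current-to-history movement can be sketched in PySpark. The paths, the created_time column and the 90-day cutoff below are illustrative assumptions, not values from the original text; an RDBMS would do the same with INSERT ... SELECT plus DELETE:
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("archive_demo").getOrCreate()
    current = spark.read.parquet("/data/trade/current")
    cutoff = F.date_sub(F.current_date(), 90)

    aged = current.filter(F.col("created_time") < cutoff)     # rows old enough to archive
    kept = current.filter(F.col("created_time") >= cutoff)    # rows that stay in the current dataset

    aged.write.mode("append").parquet("/data/trade/history")          # append the aged rows to the history dataset
    kept.write.mode("overwrite").parquet("/data/trade/current_tmp")   # rewrite the remaining rows to a new directory,
                                                                      # then swap directories (a path being read cannot be overwritten)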
    Partitioning schemes by data type
    · Online transactions (real-time, write-heavy), which serve same-day transaction queries: set up an A/B table pair with a daily switch-over. After each switch the previously active table is migrated into the online-history table and then truncated (whichever of A or B it is), while the other table serves new traffic. Partition the history table by RANGE on the transaction creation time (or update/completion time), for example one partition per month depending on how fast the volume grows.
    · Clearing/settlement transactions (not real-time, read-heavy), serving queries on transactions other than the current day's, back-office queries, clearing and similar work: split the clearing data into one table per month. The front-end application then has to route clearing queries to the right table and merge the results of cross-month queries itself. The clearing tables also follow the current-table/history-table scheme and are migrated on the regular migration cycle.
    · Log data: treat it like non-real-time transactions.
    · User authentication data (heavily accessed, read more than written): partition by RANGE on time according to the query patterns (RANGE is recommended because it extends well and normally copes even when business volume jumps), or partition by HASH or RANGE on the user id. Pick the partition column with the caveats listed in the notes below.
    History-table cleanup policy
    · Transaction data (including online, clearing and profit-sharing data): purge every 5 years.
    · Log data: purge every 5 years.
    · User and merchant authentication data: keep permanently.
    Notes
    · Log-type tables (operation logs, system logs, result records, job records, etc.): partition by time.
    · Transaction-type tables (journals, details, ledgers, differences, etc.): partition by creation time or update time.
    · Notification-type tables: partition by time.
    · Choosing the partition column: partition on a column that queries frequently filter on, which helps query speed. Partitioning on an id is not recommended unless the business always looks rows up by a fixed id; otherwise the extra partitions only bloat the index and queries can end up slower than with no partitions at all. (The sketch after these notes shows the same idea expressed as a partitioned PySpark write.)
    · For both RANGE and HASH partitioning, plan for growth when creating partitions: if two years of partitions are created now, schedule the re-partitioning work (adding the next two years of partitions, and so on) before they run out, based on how the business volume grows.
    · Keep partitions reasonably even, neither too large nor too small, so that the partition column can quickly narrow a query down to a small range.
    · How much data per partition is appropriate? The principle is that the range should be small enough that even a full scan of a single partition is fast; depending on the case that may mean hundreds of thousands or a few million rows.
    · Keep related operations (SQL) inside a single partition as much as possible; if data must be pulled from several partitions, consider fetching the partitions in parallel to speed things up.
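    As referenced in the notes above, the same idea (partition on the column most queries filter by, usually a date) is expressed in the PySpark world as a partitioned write. The table, column and path names here are illustrative assumptions:
    from pyspark.sql import functions as F

    df = spark.sql("SELECT * FROM ods.trade_log")            # assumes an existing SparkSession named spark
    (df.withColumn("dt", F.to_date("created_time"))          # derive the partition column from the creation time
       .write
       .partitionBy("dt")                                    # one HDFS sub-directory per day
       .mode("overwrite")
       .parquet("/data/ods/trade_log_partitioned"))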

    Implementation options for slowly changing dimensions (SCD) in a data warehouse
    Definition of a slowly changing dimension
    Wikipedia defines it as follows:
    Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information or product information.
    Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes.
    In other words, a dimension whose data changes slowly over time is a slowly changing dimension.
    An example makes this clearer:
    In a retail data warehouse, the fact table stores the sales records of each salesperson. One day a salesperson is transferred from the Beijing branch to the Shanghai branch. Should this change be recorded, and how? In other words, how should the salesperson dimension deal with the change? First, why record it at all? Because when sales are totalled per region, the question arises whether that salesperson's records count towards Beijing or Shanghai. Naturally, sales made before the transfer should count towards Beijing and sales made afterwards towards Shanghai; so how do we mark which region the salesperson belonged to at any given time? Handling exactly this kind of change in dimension data is what "handling a slowly changing dimension" means.
    In general there are several ways to handle a slowly changing dimension:
    Method 1: overwrite the old data with the new data

    The precondition for this method is that you do not care about the history of the change. For example, if a salesperson's English name changes and nobody cares what it used to be, the value in the warehouse can simply be overwritten (updated).

    Method 2: keep multiple records and add columns to tell them apart
    In this case a new record is added while the original record is kept, and a dedicated column distinguishes them:
    (In the tables below, Supplier_State plays the role of the "region" in the example above; a surrogate key is used to keep the description clear.)

    Supplier_key Supplier_Code Supplier_Name Supplier_State Disable  
    001 ABC Phlogistical Supply Company CA Y  
    002 ABC Phlogistical Supply Company IL N  
    :  

    Supplier_key Supplier_Code Supplier_Name Supplier_State Version  
    001 ABC Phlogistical Supply Company CA 0  
    002 ABC Phlogistical Supply Company IL 1  

    Both variants above add a version or flag column to the data to mark which records are old and which are current.
    The variant below instead adds an effective date and an expiry date to each record to mark old versus current data:

    Supplier_key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date  
    001 ABC Phlogistical Supply Company CA 01Jan2000 21Dec2004  
    002 ABC Phlogistical Supply Company IL 22Dec2004  

    An empty End_Date marks the current version; a default far-future date (e.g. 12/31/9999) can be used instead of NULL so that the column stays easy to index.

    Method 3: extra columns that keep the previous value

    Supplier_key Supplier_Name Original_Supplier_State Effective_Date Current_Supplier_State  
    001 Phlogistical Supply Company CA 22Dec2004 IL  

    This method keeps the trace of the change in extra columns. Unlike Method 2 it does not keep one record per change; only the original and the current value are stored, so it suits dimensions that change at most once (or whose older history does not matter).

    Method 4: a separate history table

    A separate history table stores the change history, while the dimension table itself only keeps the current data.

    Supplier  
    Supplier_key Supplier_Name Supplier_State  
    001 Phlogistical Supply Company IL  

    Supplier_History  
    Supplier_key Supplier_Name Supplier_State Create_Date  
    001 Phlogistical Supply Company CA 22Dec2004  

    This method only records the historical trace of changes; in practice it is not convenient when statistics need to be computed against it.

    Method 5: hybrid mode
    This mode combines the approaches above. Compared with any single one of them it is more complete and copes better with complicated, frequently changing user requirements, so it is used fairly often.

    Row_Key Supplier_key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date Current Indicator 
    1 001 ABC001 Phlogistical Supply Company CA 22Dec2004 15Jan2007 N  
    2 001 ABC001 Phlogistical Supply Company IL 15Jan2007 1Jan2099 Y  


    This approach has several advantages:
    1. A simple filter condition is enough to select the current value of the dimension.
    2. It is easy to join facts to the dimension values that were valid at any historical point in time.
    3. If the fact table carries several time columns (e.g. Order Date, Shipping Date, Confirmation Date), it is easy to choose which one to use when joining to the dimension for analysis.

    The Row_Key and Current Indicator columns are added purely for convenience; dimension tables are small, so a little redundancy does not cost much space and it speeds up queries.
    With this design the fact table still references Supplier_key as a foreign key, but that column no longer uniquely identifies one dimension row, so the relationship between fact and dimension becomes many-to-many; when joining facts to the dimension, the timestamp columns (or the Indicator column) must be added to the join condition.

    Method 6: a non-standard hybrid
    The fifth method above has a drawback: the relationship between the fact table and the dimension table is many-to-many. A many-to-many relationship should be resolved at modelling time rather than at report time; otherwise a time filter has to be added in the BI semantic layer or in every report, which is tedious.
    The following variant resolves the many-to-many relationship by changing the fact table:

    Supplier Dimension  
    Version_Number Supplier_key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date  
    1 001 ABC001 Phlogistical Supply Company CA 22Dec2004 15Jan2007  
    0 001 ABC001 Phlogistical Supply Company IL 15Jan2007 1Jan2099  

    Fact Delivery (描述清晰样代理键标识维度)  
    Delivery_Key Supplier_key Supplier_version_number Quantity Product Delivery_Date Order_Date  
    1 001 0 132 Bags 22Dec2006 15Oct2006  
    2 001 0 324 Chairs 15Jan2007 1Jan2007  

    In this scheme the current row in the dimension table always has version number 0. When new dimension data arrives, first change the old version's version_number to 1 (or increment the existing versions), then insert the new row so that the current row keeps version number 0.
    Rows inserted into the fact table therefore always carry dimension version number 0.
    This scheme completely removes the many-to-many problem between the fact table and the dimension table. A further benefit is that referential integrity between fact and dimension can be kept: in modelling tools such as ERwin or PowerDesigner, Version_Number and Supplier_key together form a composite key that links the two entities.
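    To connect this to the PySpark sections later in the handbook, here is a sketch of a Method-2 style merge (start date, end date, current flag) expressed with Spark SQL. All table names, column names and parquet paths are illustrative assumptions; surrogate-key generation and the handling of brand-new keys are omitted:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("scd2_demo").getOrCreate()
    spark.read.parquet("/dw/dim_supplier").createOrReplaceTempView("dim")      # existing dimension (history + current rows)
    spark.read.parquet("/stg/supplier_daily").createOrReplaceTempView("stg")   # today's snapshot from the source system

    new_dim = spark.sql("""
      -- keep history rows; close current rows whose tracked attribute changed
      SELECT d.supplier_key, d.supplier_code, d.supplier_name, d.supplier_state,
             d.start_date,
             CASE WHEN d.current_flag = 'Y' AND s.supplier_code IS NOT NULL
                       AND s.supplier_state <> d.supplier_state
                  THEN current_date() ELSE d.end_date END AS end_date,
             CASE WHEN d.current_flag = 'Y' AND s.supplier_code IS NOT NULL
                       AND s.supplier_state <> d.supplier_state
                  THEN 'N' ELSE d.current_flag END AS current_flag
      FROM dim d LEFT JOIN stg s ON d.supplier_code = s.supplier_code

      UNION ALL

      -- open a new current row for every key whose tracked attribute changed
      SELECT d.supplier_key, s.supplier_code, s.supplier_name, s.supplier_state,
             current_date() AS start_date, DATE '9999-12-31' AS end_date, 'Y' AS current_flag
      FROM stg s JOIN dim d
        ON s.supplier_code = d.supplier_code AND d.current_flag = 'Y'
       AND s.supplier_state <> d.supplier_state
    """)
    new_dim.write.mode("overwrite").parquet("/dw/dim_supplier_new")   # write to a new directory and swap afterwards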






    MySQL / Teradata SQL / PySpark code conversion reference
    Description
    MySQL
    Teradata SQL
    PySpark
    添加删列
    1添加列
    alter table [`<架构名称>`] `<表名>` add column <字段名> <类型>

    2删列
    alter table [`<架构名称>`] `<表名>` drop column <字段名>
    1添加列
    ALTER TABLE [<架构名称>]<表名> ADD <字段名> <类型>

    2删列
    ALTER TABLE [<架构名称>]<表名> DROP <字段名>
    1添加列

    withColumn('<字段名>' sum( [col] for col in columns))

    2删列
    drop('<字段名>')
    Drop a database
    DROP DATABASE [IF EXISTS] <库名>
    DELETE DATABASE <库名> ALL
    For parquet files:
    import subprocess

    subprocess.check_call('rm -r <存储路径>', shell=True)

    For Hive tables:
    from pyspark.sql import HiveContext
    hive = HiveContext(spark.sparkContext)
    hive.sql('drop database if exists <库名> cascade')
    Drop a table
    DROP TABLE [`<架构名称>`.]`<表名>`
    DROP TABLE [<架构名称>.]<表名>
    For parquet files:
    import subprocess

    subprocess.check_call('rm -r <存储路径>/<表名>', shell=True)

    For Hive tables:
    from pyspark.sql import HiveContext
    hive = HiveContext(spark.sparkContext)

    hive.sql('drop table if exists [`<库名>`.]`<表名>` purge')
    Truncate (empty) a table
    TRUNCATE TABLE [`<架构名称>`.]`<表名>`
    DELETE [<架构名称>.]<表名> ALL
    For parquet files:
    import copy
    import subprocess

    # read the existing parquet data to capture its schema
    df1 = spark.read.load(
        path='<存储路径>/<表名>',
        format='parquet', header=True)

    # keep a copy of the schema and build an empty DataFrame with the same schema
    _schema = copy.deepcopy(df1.schema)
    df2 = spark.createDataFrame([], _schema)
    subprocess.check_call('rm -r <存储路径>/<表名>', shell=True)

    # write the empty dataset back to the parquet location
    df2.write.parquet(
        path='<存储路径>/<表名>',
        mode='overwrite')

    For Hive managed tables:
    from pyspark.sql import HiveContext
    hive = HiveContext(spark.sparkContext)
    hive.sql('truncate table [<架构名称>.]<表名>')
    For Hive external tables:
    from pyspark.sql import HiveContext
    hive = HiveContext(spark.sparkContext)
    hive.sql("insert overwrite table [<架构名称>.]<表名> select * from [<架构名称>.]<表名> where 1=2")
    复制表结构新表
    CREATE TABLE [`<架构名称2>`] `<表名2>` LIKE [`<架构名称1>`] `<表名1>`

    通show create table [`<架构名称1>`] `<表名1>`语句旧表创建命令列出需该命令拷贝出更改table名字变成[`<架构名称2>`] `<表名2>`建立完全样表
    CREATE TABLE [<架构名称2>]<表名2> AS [<架构名称1>]<表名1> WITH NO DATA

    通show table [<架构名称1>] <表名1>语句旧表创建命令列出需该命令拷贝出更改table名字变成[<架构名称2>] <表名2>建立完全样表
    Parquet文件中:
    import pysparksqlfunctions as F
    from pysparksqltypes import LongType
    import copy
     
    # 读取parquet文件数代码
    df1 sparkreadload(
    path'<存储路径1><表名1>'
    format'parquet' headerTrue)
     
    # 获取表结构
    _schema copydeepcopy(df1schema)
    df2 df1rddzipWithIndex()map(lambda l list(l[0]) + [l[1]])toDF(_schema)
     
    # 写入空数集parquet文件
    df2writeparquet(
    path'<存储路径2><表名2>'
    modeoverwrite)
     
    Hive表中:
    CREATE TABLE [<架构名称2>]<表名2> LIKE [<架构名称1>]<表名1>

    通desc formmated [<架构名称1>] <表名1>语句show create table [<架构名称1>] <表名1>语句旧表创建命令列出需该命令拷贝出更改表名字变成[<架构名称2>] <表名2>建立完全样表
    创建表插入查询数
    CREATE TABLE [`<架构名称>`] `<表名>` (
    <字段名1> <类型1>[ AUTO_INCREMENT]
    <字段名2> <类型2>[ AUTO_INCREMENT]
    <字段名3> <类型3>[ AUTO_INCREMENT]

    <字段名n> <类型3n>[ AUTO_INCREMENT] [
    PRIMARY KEY (<键字段名>)][
    UNIQUE (<唯值字段名1> <唯值字段名2><唯值字段名3>…<唯值字段名m>)]
    ) [ENGINE{InnoDB|MYISAM|BDB} DEFAULT CHARSET{utf8|gbk}]
     
    INSERT INTO [`<架构名称>`]`<表名>`

     

     
    CREATE TABLE [`<架构名称>`] `<表名>`
     
     
    CREATE {MULTISET|SET} TABLE [<架构名称>]<表名>[
    <参数1>
    <参数2>
    <参数3>

    <参数n>]
    (
    <字段名1> <类型1> [CHARACTER SET <字符集1> NOT CASESPECIFIC]
    <字段名2> <类型2> [CHARACTER SET <字符集2> NOT CASESPECIFIC]
    <字段名3> <类型3> [CHARACTER SET <字符集3> NOT CASESPECIFIC]

    <字段名n> <类型n> [CHARACTER SET <字符集3> NOT CASESPECIFIC]
    )
    [UNIQUE] [PRIMARY INDEX (<键字段名>)]
     
    INSERT INTO [<架构名称>]<表名>

     

     
    CREATE TABLE [<架构名称>]<表名> AS (

    ) WITH DATA


     
    CREATE TABLE [<架构名称1>]<表名1> AS [<架构名称2>]<表名2> WITH DATA
    For parquet files:
    spark.sql("""
    <查询语句>
    """).write.parquet(
        path='<存储路径>/<表名>',
        mode='overwrite')

    Hive表中:
    部表
    from pysparksql import HiveContext
     
    hive HiveContext(sparksparkContext)
    hivesql(
    CREATE TABLE [`<库名>`]`<表名>`(
    `<字段名1>` <类型1>
    `<字段名2>` <类型2>
    `<字段名3>` <类型3>

    `<字段名n>` <类型n>)
    [PARTITIONED BY (
    `<分区字段1>` <分区字段类型1>
    `<分区字段2>` <分区字段类型2>
    `<分区字段3>` <分区字段类型3>

    `<分区字段n>` <分区字段类型n>
    )]
    ROW FORMAT SERDE
    'orgapachehadoophiveqlioparquetserdeParquetHiveSerDe'
    STORED AS INPUTFORMAT
    'orgapachehadoophiveqlioparquetMapredParquetInputFormat'
    OUTPUTFORMAT
    'orgapachehadoophiveqlioparquetMapredParquetOutputFormat'
    LOCATION
    'hdfs<表名>')

    hivesql(
    CREATE TABLE [`<库名>`]`<表名>`(
    `<字段名1>` <类型1>
    `<字段名2>` <类型2>
    `<字段名3>` <类型3>

    `<字段名n>` <类型n>)
    [PARTITIONED BY (
    `<分区字段1>` <分区字段类型1>
    `<分区字段2>` <分区字段类型2>
    `<分区字段3>` <分区字段类型3>

    `<分区字段n>` <分区字段类型n>
    )]
    ROW FORMAT SERDE
    'orgapachehadoophiveqlioparquetserdeParquetHiveSerDe'
    STORED AS PARQUET)
     
    外部表
    from pysparksql import HiveContext
     
    hive HiveContext(sparksparkContext)
    hivesql(
    CREATE EXTERNAL TABLE [IF NOT EXISTS] [`<库名>`]`<表名>`(
    `<字段名1>` <类型1>
    `<字段名2>` <类型2>
    `<字段名3>` <类型3>

    `<字段名n>` <类型n>)
    [PARTITIONED BY (
    `<分区字段1>` <分区字段类型1>
    `<分区字段2>` <分区字段类型2>
    `<分区字段3>` <分区字段类型3>

    `<分区字段n>` <分区字段类型n>
    )]
    ROW FORMAT SERDE
    'orgapachehadoophiveqlioparquetserdeParquetHiveSerDe'
    STORED AS INPUTFORMAT
    'orgapachehadoophiveqlioparquetMapredParquetInputFormat'
    OUTPUTFORMAT
    'orgapachehadoophiveqlioparquetMapredParquetOutputFormat'
    LOCATION
    'hdfs<表名>')

    hivesql(INSERT OVERWRITE TABLE [`<库名>`]`<表名>` <查询语句> )
    插入少量数
    INSERT INTO [`<架构名称>`]`<表名>` (<字段名1><字段名2><字段名3><字段名n>) VALUES
    (<值1><值2><值3>…<值n>)
    (<值n+1><值n+2><值n+3>…<值2n>)
    (<值2n+1><值2n+2><值2n+3>…<值3n>)

    (<值mn+1><值mn+2><值mn+3>…<值(m+1)n >)
    INSERT INTO [<架构名称>]<表名>
    (<字段名1><字段名2><字段名3><字段名n>)
    SELECT *
    FROM (SELECT *
    FROM (SELECT <值1><值2><值3>…<值n>) t
    UNION ALL
    SELECT *
    FROM (SELECT <值n+1><值n+2><值n+3>…<值2n>) t
    UNION ALL
    SELECT *
    FROM (SELECT <值2n+1><值2n+2><值2n+3>…<值3n>) t

    UNION ALL
    SELECT *
    FROM (SELECT <值mn+1><值mn+2><值mn+3>…<值(m+1)n>) t
    ) tt
    In PySpark:
    <表名>_df = spark.createDataFrame([(<值1>, <值2>, <值3>, …, <值n>), (<值n+1>, <值n+2>, <值n+3>, …, <值2n>), (<值2n+1>, <值2n+2>, <值2n+3>, …, <值3n>), …, (<值mn+1>, <值mn+2>, <值mn+3>, …, <值(m+1)n>)], ['<字段名1>', '<字段名2>', '<字段名3>', …, '<字段名n>'])
    <表名>_df.write.parquet(
        path='<存储路径>/<表名>[/<分区字段>=<分区值>]',
        mode='overwrite')

    <表名>_df = spark.sparkContext.parallelize([(<值1>, <值2>, <值3>, …, <值n>), (<值n+1>, <值n+2>, <值n+3>, …, <值2n>), (<值2n+1>, <值2n+2>, <值2n+3>, …, <值3n>), …, (<值mn+1>, <值mn+2>, <值mn+3>, …, <值(m+1)n>)]).toDF(['<字段名1>', '<字段名2>', '<字段名3>', …, '<字段名n>'])
    <表名>_df.write.parquet(
        path='<存储路径>/<表名>[/<分区字段>=<分区值>]',
        mode='overwrite')

    Hive表中:
    INSERT INTO TABLE [<库名>]<表名> [PARTITION (<分区字段> '<分区值>')]
    VALUES (<值1><值2><值3><值n>) (<值n+1><值n+2><值n+3><值2n>) (<值2n+1><值2n+2><值2n+3><值3n>)…(<值mn+1><值mn+2><值mn+3><值(m+1)n>)
    限制查询返回行数
    SELECT <字段列表>
    FROM [`<架构名称1>`]`<表名>`
    {INNERLEFTRIGHTFULL} JOIN [`<架构名称2>`]`<维度表名1>`
    ON <表连接条件1>
    {INNERLEFTRIGHTFULL} JOIN [`<架构名称3>`]`<度量表名1>`

    ON <表连接条件2>
    [WHERE <筛选条件>] LIMIT <限制返回行数>
    SELECT TOP <限制返回行数> <字段列表>
    FROM [<架构名称1>]<表名1>
    {INNERLEFTRIGHTFULL} JOIN [<架构名称2>]<维度表名1>
    ON <表连接条件1>
    {INNERLEFTRIGHTFULL} JOIN [<架构名称3>]<度量表名1>

    ON <表连接条件2>
    [WHERE <筛选条件>]
    spark.sql("""
    SELECT * FROM [<架构名称1>.]<表名1>
    """).createOrReplaceTempView("<表名1>")

    spark.sql("""
    SELECT * FROM [<架构名称2>.]<维度表名1>
    """).cache().createOrReplaceTempView("<表名2>")

    spark.sql("""
    SELECT * FROM [<架构名称3>.]<度量表名1>
    """).createOrReplaceTempView("<表名3>")

    spark.sql("""
    SELECT <字段列表>
    FROM <表名1>
    JOIN <表名2> ON <表连接条件1>
    {INNER|LEFT|RIGHT|FULL} JOIN <表名3> ON <表连接条件2>
    [WHERE <筛选条件>]""").limit(<限制返回行数>)
    带表连接查询
    SELECT <字段列表>
    FROM [`<架构名称1>`]`<表名1>`
    {INNERLEFTRIGHTFULL} JOIN [`<架构名称2>`]`<维度表名1>`
    ON <表连接条件1>
    {INNERLEFTRIGHTFULL} JOIN [`<架构名称3>`]`<度量表名1>`

    ON <表连接条件2>
    [WHERE <筛选条件>]
    SELECT <字段列表>
    FROM [<架构名称1>]<表名1>
    {INNERLEFTRIGHTFULL} JOIN [<架构名称2>]<维度表名1>
    ON <表连接条件1>
    {INNERLEFTRIGHTFULL} JOIN [<架构名称3>]<度量表名1>

    ON <表连接条件2>
    [WHERE <筛选条件>]
    sparksql(
    SELECT * FROM [<架构名称1>]<表名1>
    )
    createOrReplaceTempView(<表名1>)

    sparksql(
    SELECT * FROM [<架构名称2>]<维度表名1>
    )cache()
    createOrReplaceTempView(<表名2>)

    sparksql(
    SELECT * FROM [<架构名称3>]<度量表名1>
    )
    createOrReplaceTempView(<表名3>)

    sparksql(
    SELECT <字段列表>
    FROM <表名1>
    JOIN <表名2> ON <表连接条件1>
    {INNERLEFTRIGHTFULL} JOIN <表名3> ON <表连接条件2>
    [WHERE <筛选条件>])
    带表连接更新表记录
    CREATE TABLE [`<架构名称1>`] `<表名1>` (
    <字段名1> <类型1>[ AUTO_INCREMENT]
    <字段名2> <类型2>[ AUTO_INCREMENT]
    <字段名3> <类型3>[ AUTO_INCREMENT]

    <字段名n> <类型3n>[ AUTO_INCREMENT] [
    PRIMARY KEY (<键字段名>)][
    UNIQUE (<唯值字段名1> <唯值字段名2><唯值字段名3>…<唯值字段名m>)]
    ) [ENGINE{InnoDB|MYISAM|BDB} DEFAULT CHARSET{utf8|gbk}]
     
    INSERT INTO [`<架构名称1>`] `<表名1>`
    SELECT <键字段>
    <值变字段1>
    <值变字段2>
    <值变字段3>

    <值变字段n>
    <值改变字段1>
    <值改变字段1>
    <值改变字段2>

    <值改变字段n>
    FROM [`<架构名称2>`]`<表名2>`
    WHERE <筛选条件>
     
     
    UPDATE <名1>
    FROM [`<架构名称1>`]`<表名1>` AS <名1>[`<架构名称3>``<表名3>`] SET
    <值改变字段1><改变值1>
    <值改变字段2><改变值2>
    <值改变字段3><改变值3>

    <值改变字段n><改变值n>
     
    WHERE <表连接条件>
    [AND <筛选条件>]
    CREATE [MULTISET] TABLE [<架构名称1>]<表名1>[
    <参数1>
    <参数2>
    <参数3>

    <参数n>]
    (
    <字段名1> <类型1> [CHARACTER SET <字符集1> NOT CASESPECIFIC]
    <字段名2> <类型2> [CHARACTER SET <字符集2> NOT CASESPECIFIC]
    <字段名3> <类型3> [CHARACTER SET <字符集3> NOT CASESPECIFIC]

    <字段名n> <类型3> [CHARACTER SET <字符集3> NOT CASESPECIFIC]
    )
    [UNIQUE] [PRIMARY INDEX (<键字段名>)]
     
    INSERT INTO [<架构名称1>]<表名1>
    SELECT <键字段>
    <值变字段1>
    <值变字段2>
    <值变字段3>

    <值变字段n>
    <值改变字段1>
    <值改变字段1>
    <值改变字段2>

    <值改变字段n>
    FROM [<架构名称2>]<表名2>
    WHERE <筛选条件>
     
     
    UPDATE <名1>
    FROM [<架构名称1>]<表名1> AS <名1>[<架构名称3><表名3>] SET
    <值改变字段1><改变值1>
    <值改变字段2><改变值2>
    <值改变字段3><改变值3>

    <值改变字段n><改变值n>
     
    WHERE <表连接条件>
    [AND <筛选条件>]
    sparksql(
    SELECT * FROM <架构名称2><表名2>
    )
    createOrReplaceTempView(<表名1>)

    sparksql(
    SELECT * FROM <架构名称3><表名3>
    )
    createOrReplaceTempView(<表名2>)

    sparksql(
    SELECT <名1><键字段>
    <值变字段1>
    <值变字段2>
    <值变字段3>

    <值变字段n>
    if(<名2><键字段> is null <名1><值改变字段1> <改变值1>) AS <值改变字段1>
    if(<名2><键字段> is null <名1><值改变字段2> <改变值2>) AS <值改变字段2>
    if(<名2><键字段> is null <名1><值改变字段3> <改变值3>) AS <值改变字段3>

    if(<名2><键字段> is null <名1><值改变字段n> <改变值n>) AS <值改变字段n>
    FROM <表名1> AS <名1>
    INNER JOIN <表名2> AS <名2>
    ON <表连接条件>
    [WHERE <筛选条件>])
    writeparquet(
    path'<存储路径><表名1>'
    modeoverwrite)
    合数
    REPLACE INTO [`<架构名称>`] `<表名>` (<键字段名> <字段名1> <字段名2> <字段名3> … <字段名n>) VALUES (<键值> <值1> <值2> <值3> … <值n>)

    LOAD DATA LOCAL INFILE '<存储路径><文件名>' REPLACE INTO TABLE [`<架构名称>`] `<表名>`

    INSERT INTO [`<架构名称>`] `<表名>` (<键字段名><字段名1> <字段名2> <字段名3> … <字段名n>)
    VALUES (<键值1> <值1> <值2> <值3> … <值n>) (<键值2> <值n+1> <值n+2> <值n+3> … <值2n>)
    ON DUPLICATE KEY UPDATE <字段名1> VALUES(<字段名1>)<字段名2> VALUES(<字段名2>)<字段名3> VALUES(<字段名3>)…<字段名n> VALUES(<字段名n>)

    insert into [`<架构名称>`] `<表名>`(<键字段名><字段名1> <字段名2> <字段名3> … <字段名n>) select * from dupnew on duplicate key update <字段名1> VALUES(<字段名1>)<字段名2> VALUES(<字段名2>)<字段名3> VALUES(<字段名3>)…<字段名n> VALUES(<字段名n>)

    insert  ignore  into  [`<架构名称>`] `<表名>`(<键字段名><字段名1> <字段名2> <字段名3> … <字段名n>) values (<键值1> <值1> <值2> <值3> … <值n>)  

    INSERT IGNORE INTO [`<架构名称>`] `<表名1>` SELECT <键字段名> <字段名1> <字段名2> <字段名3> … <字段名n> FROM [`<架构名称>`] `<表名2>`

    SELECT <键字段名> <字段名1> <字段名2> <字段名3> … <字段名n> FROM [`<架构名称>`] `<表名1>` UNION DISTINCT SELECT <键字段名> <字段名1> <字段名2> <字段名3> … <字段名n> FROM [`<架构名称>`] `<表名2>`

    创建测试表
    drop table test_a
    create table test_a(
    id VARCHAR (16)
    name VARCHAR (16)
    Operatime datetime
    )
    drop table test_b
    create table test_b(
    id VARCHAR (16)
    name VARCHAR (16)
    Operatime datetime
    )

    插入模拟数
    INSERT into test_b values(11now())(22now())
    INSERT into test_a values(11now())(33now())

    查询数
    SELECT * FROM test_b
    SELECT * FROM test_a



    delimiter
    CREATE PROCEDURE merge_a_to_b () BEGIN
    定义需插入a表插入b表程变量
    DECLARE _ID VARCHAR (16)
    DECLARE _NAME VARCHAR (16)
    游标遍历数结束标志
    DECLARE done INT DEFAULT FALSE
    游标指a表结果集第条1位置
    DECLARE cur_account CURSOR FOR SELECT ID NAME FROM test_a
    游标指a表结果集条加1位置 设置结束标志
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done TRUE
    开游标
    OPEN cur_account
    遍历游标
    read_loop
    LOOP
    取值a表前位置数时变量
    FETCH NEXT FROM cur_account INTO _ID_NAME

    果取值结束 跳出循环
    IF done THEN LEAVE read_loop
    END IF

    前数做 果b表存更新时间 存插入
    IF NOT EXISTS ( SELECT 1 FROM TEST_B WHERE ID _ID AND NAME_NAME )
    THEN
    INSERT INTO TEST_B (ID NAMEoperatime) VALUES (_ID_NAMEnow())
    ELSE
    UPDATE TEST_B set operatime now() WHERE ID _ID AND NAME_NAME
    END IF

    END LOOP
    CLOSE cur_account

    END


    delimiter
    CREATE PROCEDURE merge_a_to_b () BEGIN
    定义需插入a表插入b表程变量
    DECLARE _ID VARCHAR (16)
    DECLARE _NAME VARCHAR (16)
    游标遍历数结束标志
    DECLARE done INT DEFAULT FALSE
    游标指a表结果集第条1位置
    DECLARE cur_account CURSOR FOR SELECT ID NAME FROM test_a
    游标指a表结果集条加1位置 设置结束标志
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done TRUE
    开游标
    OPEN cur_account
    遍历游标
    read_loop
    LOOP
    取值a表前位置数时变量
    FETCH NEXT FROM cur_account INTO _ID_NAME

    果取值结束 跳出循环
    IF done THEN LEAVE read_loop
    END IF

    前数做 果b表存更新时间 存插入
    IF NOT EXISTS ( SELECT 1 FROM TEST_B WHERE ID _ID AND NAME_NAME )
    THEN
    INSERT INTO TEST_B (ID NAMEoperatime) VALUES (_ID_NAMEnow())
    ELSE
    UPDATE TEST_B set operatime now() WHERE ID _ID AND NAME_NAME
    END IF

    END LOOP
    CLOSE cur_account

    END
    merge into [<架构名称1>]<表名1> <名1>
    using [<架构名称2>]<表名2> <名2>
    on (<名1><连接字段名1> <名2><连接字段名2>)
    when matched then
    update set <字段名1> <名2><字段名1><字段名2> <名2><字段名3><字段名3> <名2><字段名3>…<字段名n> <名2><字段名n>
    when not matched then
    insert values(<名2><连接字段名2><名2><字段名3><名2><字段名3>…<字段名n> <名2><字段名n>)
    py
    <表名1>_df sparkreadload(
    path'<存储路径1><表名1>'
    format'parquet' headerTrue)

    <表名2>_df sparkreadload(
    path'<存储路径2><表名2>'
    format'parquet' headerTrue)

    <表名1>_dfcreateOrReplaceTempView(<表名1>)
    <表名2>_dfcreateOrReplaceTempView(<表名2>)

    <表名1>_merge_df sparksql(
    SELECT ifnull(ODS<键字段名>STG<键字段名>) AS <键字段名>ifnull(ODS<字段名1>STG<字段名1>) AS <字段名1>ifnull(ODS<字段名2>STG<字段名2>) AS <字段名2>ifnull(ODS<字段名3>STG<字段名3>) AS <字段名3> …ifnull(ODS<字段名n>STG<字段名n>) AS <字段名n> FROM
    (
    SELECT <键字段名> <字段名1> <字段名2> <字段名3> … <字段名n>
    FROM <表名2>
    ) STG
    FULL JOIN <表名1> AS ODS ON STG<键字段名> ODS<键字段名>
    )

    <表名1>_merge_dfwriteparquet(
    path'<存储路径1><表名1>_merge'
    modeoverwrite)

    py
    <表名1>_merge_df sparkreadload(
    path'<存储路径1><表名1>_merge'
    format'parquet' headerTrue)

    <表名1>_merge_dfwriteparquet(
    path'<存储路径1><表名1>'
    modeoverwrite)

    Hive表中:
    CREATE TABLE [`<库名>`]`<表名>` (
    `<字段名1>` <类型1>
    `<字段名2>` <类型2>
    `<字段名3>` <类型3>

    `<字段名n>` <类型n>
    )
    CLUSTERED BY (<键字段>) INTO <数字> buckets
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ''
    STORED AS orc
    TBLPROPERTIES('transactional''true')

    MERGE INTO [`<库名>`]`<表名>` AS <名1>
    USING (
    <查询语句>
    ) AS <名2>
    ON …
    WHEN MATCHED
    THEN
    UPDATE
    SET <字段名1> <值1>
    <字段名2> <值2>
    <字段名3> <值3>

    <字段名n> <值n>
    查询分组排名数
    SELECT <字段1>
    <字段2>
    <字段3>

    <字段n>
    FROM (
    SELECT <字段1>
    <字段2>
    <字段3>

    <字段n>
    ROW_NUMBER() OVER (
    PARTITION BY <分组字段> ORDER BY <排序字段> [DESC]
    ) AS rn
    FROM [`<架构名称>`] ` <表名> `
    [WHERE <筛选条件>]
    ) t
    WHERE rn 1
    SELECT <字段1>
    <字段2>
    <字段3>

    <字段n>
    FROM [<架构名称>]<表名>
    QUALIFY ROW_NUMBER() OVER(PARTITION BY <分组字段> ORDER BY <排序字段> [DESC]) 1
    [WHERE <筛选条件>] 
    <表名>_df sparksql(
    SELECT * FROM [<架构名称>]<表名>
    )
    <表名>_dfcreateOrReplaceTempView(<表名>)
    <表名>_unique_df sparksql(
    SELECT <字段1>
    <字段2>
    <字段3>

    <字段n>
    FROM (
    SELECT <字段1>
    <字段2>
    <字段3>

    <字段n>
    ROW_NUMBER() OVER (
    PARTITION BY <分组字段> ORDER BY <排序字段> [DESC]
    ) AS rn
    FROM <表名>
    [WHERE <筛选条件>]
    ) t
    WHERE rn 1)

    Hive表中:
    from pysparksql import HiveContext

    hive HiveContext(sparksparkContext)
    hivesql(
    SELECT
    <字段1>
    <字段2>
    <字段3>

    <字段n>
    FROM (
    SELECT
    <字段1>
    <字段2>
    <字段3>

    <字段n>
    ROW_NUMBER() OVER (
    PARTITION BY <分组字段> ORDER BY <排序字段>
    ) AS rn
    FROM [`<库名>`]`<表名>`
    [WHERE <筛选条件>]
    ) t
    WHERE rn 1)
    字符串连接
    SELECT CONCAT(<字符串变量字段常量1><字符串变量字段常量2>)
    SELECT <字符串变量字段常量1> || <字符串变量字段常量2>
    sparksql(
    SELECT CONCAT(<字符串变量字段常量1><字符串变量字段常量2>))
    查询分组里数字
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    <聚合函数1>(<度量字段1>) AS <名1>
    <聚合函数2>(<度量字段2>) AS <名2>
     
    <聚合函数3>(<度量字段3>) AS <名3>

    <聚合函数m>(<度量字段m>) AS <名m>
    FROM [<架构名称>]<表名>
    GROUP BY 123…n
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    <聚合函数1>(<度量字段1>) AS <名1>
    <聚合函数2>(<度量字段2>) AS <名2>
     
    <聚合函数3>(<度量字段3>) AS <名3>

    <聚合函数m>(<度量字段m>) AS <名m>
    FROM [<架构名称>]<表名>
    GROUP BY 123…n
    sparksql(
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    <聚合函数1>(<度量字段1>) AS <名1>
    <聚合函数2>(<度量字段2>) AS <名2>
     
    <聚合函数3>(<度量字段3>) AS <名3>

    <聚合函数m>(<度量字段m>) AS <名m>
    FROM [<架构名称>]<表名>
    GROUP BY <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>)
     
    DECIMAL类型转换
    SELECT (CAST(<数值字段变量常量1> AS DECIMAL(382)) CAST(<数值字段变量常量2> AS DECIMAL(382)))
    SELECT (CAST(<数值字段变量常量1> AS DECIMAL(382)) CAST(<数值字段变量常量2> AS DECIMAL(382)))
    <变量> sparksql(
    SELECT <数值字段变量常量1> * 100 <数值字段变量常量2> 100)
    NULL值换
    IFNULL(exp1exp2)

    COALESCE(exp1exp2)
    字段exp1NULL值时返回exp1否返回exp2
    NULLIF(exp1exp2)

    COALESCE(exp1exp2)
    字段exp1NULL值时返回exp1否返回exp2
    >>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
    >>> df.show()
    +----+
    | col|
    +----+
    |   1|
    |   2|
    |   3|
    |null|
    +----+
    >>> df = df.fillna({'col': 4})
    >>> df.show()
    or df.fillna({'col': 4}).show()
    +---+
    |col|
    +---+
    |  1|
    |  2|
    |  3|
    |  4|
    +---+

    sparksql(
    SELECT IFNULL(exp1exp2) …
    )
    Get year/month/day and today's date in the China time zone
    SELECT YEAR(CURRENT_DATE()), MONTH(CURRENT_DATE()), DAY(CURRENT_DATE()), CONVERT_TZ(create_time, @@session.time_zone, '+8:00')
    SET time_zone = 'Asia/Shanghai';
    select now();
    SELECT EXTRACT(YEAR FROM CURRENT_DATE), EXTRACT(MONTH FROM CURRENT_DATE), EXTRACT(DAY FROM CURRENT_DATE), CAST(CONVERT_TIMEZONE('Asia/Shanghai', CAST(GETDATE() AS TIMESTAMP)) AS DATE)
    <变量> = spark.sql(
    "SELECT YEAR(CURRENT_DATE), MONTH(CURRENT_DATE), DAY(CURRENT_DATE), CAST(CONVERT_TIMEZONE('Asia/Shanghai', CAST(GETDATE() AS TIMESTAMP)) AS DATE)")
    Days between two timestamps
    SELECT TIMESTAMPDIFF(DAY, <开始时间戳>, <结束时间戳>)
    SELECT EXTRACT(DAY FROM ((<结束时间戳> - <开始时间戳>) DAY(4) TO SECOND)) * 86400
    from pyspark.sql.functions import *

    dates = [("1", "2019-07-01 12:01:19.111"),
        ("2", "2019-06-24 12:01:19.222"),
        ("3", "2019-11-16 16:44:55.406"),
        ("4", "2019-11-16 16:50:59.406")
        ]

    df = spark.createDataFrame(data=dates, schema=["id", "input_timestamp"])

    # Calculate the time difference (in days)
    <变量> = df.withColumn('from_timestamp', to_timestamp(col('input_timestamp')))\
        .withColumn('end_timestamp', current_timestamp())\
        .withColumn('DiffInDays', (col('end_timestamp').cast("long") - col('from_timestamp').cast("long"))/24/60/60)
    <变量>.show(truncate=False)
    列出表字段信息
    查系统表
    SELECT
    ColumnId 字段键(建表序致)
    DataBaseName属库
    TableName表名
    DefaultValue默认值        
    ColumnName字段名
    ColumnTitle字段名
    ColumnType字段类型
    ColumnLength字段长度
    DecimalTotalDigits精度
    DecimalFractionalDigits 标度
    ColumnFormat 格式
    FROM
    DBCColumns
    WHERE
    DATABASENAME'<库名>'
    AND TABLENAME'<表名>'
    ORDER BY 1
    查表结构        
    SHOW TABLE [`<库名>`] `<表名>`
    Column type code mapping (DBC.Columns.ColumnType):

    ColumnType  Mapped type  Rendering rule
    CF          CHAR         a. ASCII (LATIN) encoding: CHAR(length); b. UNICODE encoding: CHAR(length/2)
    CV          VARCHAR      a. ASCII (LATIN) encoding: VARCHAR(length); b. UNICODE encoding: VARCHAR(length/2)
    D           DECIMAL      DECIMAL(precision, scale)
    DA          DATE         DATE FORMAT '<格式>'
    I           INTEGER      INTEGER
    I8          BIGINT       BIGINT
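    When generating DDL from this metadata (for example while converting Teradata tables), the mapping above can be captured in a small helper. This is only a sketch following the rules listed in the table; codes not listed fall through unchanged:
    def td_type_to_ddl(column_type, length=None, precision=None, scale=None, unicode=False):
        # Map a DBC.Columns ColumnType code to a DDL type string.
        code = (column_type or "").strip()
        if code == "CF":
            return "CHAR({})".format(length // 2 if unicode else length)    # UNICODE stores 2 bytes per character
        if code == "CV":
            return "VARCHAR({})".format(length // 2 if unicode else length)
        if code == "D":
            return "DECIMAL({},{})".format(precision, scale)
        if code == "DA":
            return "DATE"
        if code == "I":
            return "INTEGER"
        if code == "I8":
            return "BIGINT"
        return code  # codes not covered by the table above are returned as-is

    print(td_type_to_ddl("CV", length=200, unicode=True))   # -> VARCHAR(100)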

    SELECT

    COLUMN_NAME AS '字段名'

    DATA_TYPE AS `数类型`

    CHARACTER_MAXIMUM_LENGTH AS `字符长度`

    NUMERIC_PRECISION AS `数字长度`

    NUMERIC_SCALE AS `数位数`

    IS_NULLABLE AS `否允许非空`

    CASE WHEN EXTRA 'auto_increment' THEN 1 ELSE 0 END AS `否增`

    COLUMN_DEFAULT AS `默认值`

    COLUMN_COMMENT AS `备注`

    FROM information_schemaCOLUMNS WHERE TABLE_SCHEMA'<库名>' AND TABLE_NAME'<表名>'
    df = …

    df.schema

    df.printSchema()

    for name, dtype in df.dtypes:
        print(name, dtype)
    分区操作
    查MySQL否支持分区
    1MySQL56前版
    show variables like 'partition'

    2MySQL57
    show plugins

     
    二分区表分类限制
    1分区表分类
    RANGE分区:基属定连续区间列值行分配分区
     
    LIST分区:类似RANGE分区区LIST分区基列值匹配离散值集合中某值进行选择
     
    HASH分区:基户定义表达式返回值进行选择分区该表达式插入表中行列值进行计算函数包含MySQL 中效产生非负整数值表达式
     
    KEY分区:类似HASH分区区KEY分区支持计算列列MySQL服务器提供身哈希函数必须列列包含整数值
     
    复合分区:MySQL 56版中支持RANGELIST子分区子分区类型HASHKEY
     
    2分区表限制
    1)分区键必须包含表键唯键中
     
    2)MYSQL分区函数列身进行较时滤分区根表达式值滤分区表达式分区函数行
     
    3)分区数: NDB存储引擎定表分区数8192(包括子分区)果分区数未达8192时提示 Got error … from storage engine Out of resources when opening file通增加open_files_limit系统变量值解决问题然时开文件数量操作系统限制
     
    4)支持查询缓存: 分区表支持查询缓存涉分区表查询动禁 查询缓存法启类查询
     
    5)分区innodb表支持外键
     
    6)服务器SQL_mode影响分区表步复制 机机SQL_mode会导致sql语句 导致分区间数分配定位置甚导致插入机成功分区表库失败 获佳效果您应该始终机机相服务器SQL模式
     
    7)ALTER TABLE … ORDER BY: 分区表运行ALTER TABLE … ORDER BY列语句会导致分区中行排序
     
    8)全文索引 分区表支持全文索引InnoDBMyISAM存储引擎分区表
    9)分区表法外键约束
    10)Spatial columns: 具空间数类型(POINTGEOMETRY)列分区表中
    11)时表: 时表分区
    12)subpartition问题: subpartition必须HASHKEY分区 RANGELIST分区分区 HASHKEY分区子分区
    13)分区表支持mysqlcheckmyisamchkmyisampack

    三创建分区表
    1range分区
    行数基定连续区间列值放入分区
    CREATE TABLE `test_11` (
    `id` int(11) NOT NULL,
    `t` date NOT NULL,
    PRIMARY KEY (`id`,`t`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    PARTITION BY RANGE (to_days(t))
    (PARTITION p20170801 VALUES LESS THAN (736907) ENGINE = InnoDB,
    PARTITION p20170901 VALUES LESS THAN (736938) ENGINE = InnoDB,
    PARTITION pmax VALUES LESS THAN maxvalue ENGINE = InnoDB);
    Then insert 4 rows:
    insert into test_11 values (1,'2017-07-22'),(2,'2017-08-22'),(3,'2017-08-23'),(4,'2017-08-24');
    Then check the partition statistics in information_schema.partitions:
    select PARTITION_NAME as 分区, TABLE_ROWS as 行数 from information_schema.partitions where table_schema='mysql_test' and table_name='test_11';
    +-----------+------+
    | 分区      | 行数 |
    +-----------+------+
    | p20170801 |    1 |
    | p20170901 |    3 |
    +-----------+------+
    2 rows in set (0.00 sec)
    As the output shows, 1 row went into partition p20170801 and 3 rows into p20170901.
    yearto_daysunix_timestamp等函数相应时间字段进行转换然分区
    2list分区
    range分区样list分区面离散值
    mysql> CREATE TABLE h2 (
        ->   c1 INT,
        ->   c2 INT
        -> )
        -> PARTITION BY LIST(c1) (
        ->   PARTITION p0 VALUES IN (1, 4, 7),
        ->   PARTITION p1 VALUES IN (2, 5, 8)
        -> );
    Query OK, 0 rows affected (0.11 sec)
    Unlike RANGE partitioning, LIST partitioning has no catch-all such as MAXVALUE: every expected value of the partitioning expression should be covered by some PARTITION ... VALUES IN (...) clause, and an INSERT whose partitioning-column value matches no partition fails with an error, as shown below:
    mysql> INSERT INTO h2 VALUES (3, 5);
    ERROR 1525 (HY000): Table has no partition for value 3
    3hash分区
    根户定义表达式返回值进行分区返回值负数
    CREATE TABLE t1 (col1 INT col2 CHAR(5) col3 DATE)
    PARTITION BY HASH( YEAR(col3) )
    PARTITIONS 4;
    If the value '2005-09-15' is inserted into col3, the target partition is chosen by the following calculation:
    MOD(YEAR('2005-09-01'), 4)
    = MOD(2005, 4)
    = 1
    4key分区
    根MySQL数库提供散列函数进行分区
    CREATE TABLE k1 (
    id INT NOT NULL
    name VARCHAR(20)
    UNIQUE KEY (id)
    )
    PARTITION BY KEY()
    PARTITIONS 2;
    KEY仅列出零列名称 作分区键列必须包含表键部分全部果该表具 果没列名称作分区键表键(果)果没键唯键唯键分区键果唯键列未定义NOT NULL条语句失败
    分区类型KEY分区限整数空值 例CREATE TABLE语句效:
    CREATE TABLE tm1 (
    s1 CHAR(32) PRIMARY KEY
    )
    PARTITION BY KEY(s1)
    PARTITIONS 10;
    注意:key分区表执行ALTER TABLE DROP PRIMARY KEY样做会生成错误 ERROR 1466 (HY000) Field in list of fields for partition function not found in table
    5Column分区
    COLUMN分区55开始引入分区功RANGE COLUMNLIST COLUMN两种分区支持整形日期字符串RANGELIST分区方式非常相似
    COLUMNSRANGELIST分区区
    1)针日期字段分区需函数进行转换例针date字段进行分区需YEAR()表达式进行转换
    2)COLUMN分区支持字段作分区键支持表达式作分区键
    column支持数类型:
    1)整型floatdecimal支持
    2)日期类型:datedatetime支持
    3)字符类型:CHAR VARCHAR BINARYVARBINARYblobtext支持
    单列column range分区mysql> show create table list_c
    CREATE TABLE `list_c` (
    `c1` int(11) DEFAULT NULL
    `c2` int(11) DEFAULT NULL
    ) ENGINEInnoDB DEFAULT CHARSETlatin1
    *50500 PARTITION BY RANGE COLUMNS(c1)
    (PARTITION p0 VALUES LESS THAN (5) ENGINE InnoDB
    PARTITION p1 VALUES LESS THAN (10) ENGINE InnoDB) *
    列column range分区mysql> show create table list_c
    CREATE TABLE `list_c` (
    `c1` int(11) DEFAULT NULL
    `c2` int(11) DEFAULT NULL
    `c3` char(20) DEFAULT NULL
    ) ENGINEInnoDB DEFAULT CHARSETlatin1
    *50500 PARTITION BY RANGE COLUMNS(c1c3)
    (PARTITION p0 VALUES LESS THAN (5'aaa') ENGINE InnoDB
    PARTITION p1 VALUES LESS THAN (10'bbb') ENGINE InnoDB) *
    单列column list分区mysql> show create table list_c
    CREATE TABLE `list_c` (
    `c1` int(11) DEFAULT NULL
    `c2` int(11) DEFAULT NULL
    `c3` char(20) DEFAULT NULL
    ) ENGINEInnoDB DEFAULT CHARSETlatin1
    *50500 PARTITION BY LIST COLUMNS(c3)
    (PARTITION p0 VALUES IN ('aaa') ENGINE InnoDB
    PARTITION p1 VALUES IN ('bbb') ENGINE InnoDB) *
    6子分区(组合分区)
    分区基础进步分区时成复合分区
    MySQL数库允许rangelist分区进行HASHKEY子分区例:
    CREATE TABLE ts (id INT purchased DATE)
    PARTITION BY RANGE( YEAR(purchased) )
    SUBPARTITION BY HASH( TO_DAYS(purchased) )
    SUBPARTITIONS 2 (
    PARTITION p0 VALUES LESS THAN (1990)
    PARTITION p1 VALUES LESS THAN (2000)
    PARTITION p2 VALUES LESS THAN MAXVALUE
    )
    [root@mycat3 ~]# ll datamysql_data_3306mysql_testts*
    rwr 1 mysql mysql 8596 Aug 8 1354 datamysql_data_3306mysql_testtsfrm
    rwr 1 mysql mysql 98304 Aug 8 1354 datamysql_data_3306mysql_testts#P#p0#SP#p0sp0ibd
    rwr 1 mysql mysql 98304 Aug 8 1354 datamysql_data_3306mysql_testts#P#p0#SP#p0sp1ibd
    rwr 1 mysql mysql 98304 Aug 8 1354 datamysql_data_3306mysql_testts#P#p1#SP#p1sp0ibd
    rwr 1 mysql mysql 98304 Aug 8 1354 datamysql_data_3306mysql_testts#P#p1#SP#p1sp1ibd
    rwr 1 mysql mysql 98304 Aug 8 1354 datamysql_data_3306mysql_testts#P#p2#SP#p2sp0ibd
    rwr 1 mysql mysql 98304 Aug 8 1354 datamysql_data_3306mysql_testts#P#p2#SP#p2sp1ibd
    ts表根purchased进行range分区然进行次hash分区形成3*2分区物理文件证实分区方式通subpartition语法显示指定子分区名称
    注意:子分区数量必须相果分区表子分区已subpartition必须表明子分区名称subpartition子句必须包括子分区名字子分区名字必须致
    外MyISAM表index directorydata direactory指定分区数索引目录innodb表说该存储引擎表空间动进行数索引理会忽略指定indexdata语法


    四普通表转换分区表
    1alter table table_name partition by命令重建分区表
     
    alter table jxfp_data_bak PARTITION BY KEY(SH) PARTITIONS 8
     
    五分区表操作
    CREATE TABLE t1 (
    id INT
    year_col INT
    )
    PARTITION BY RANGE (year_col) (
    PARTITION p0 VALUES LESS THAN (1991)
    PARTITION p1 VALUES LESS THAN (1995)
    PARTITION p2 VALUES LESS THAN (1999)
    )
     
    1ADD PARTITION (新增分区)
    ALTER TABLE t1 ADD PARTITION (PARTITION p3 VALUES LESS THAN (2002))
     
    2DROP PARTITION (删分区)
    ALTER TABLE t1 DROP PARTITION p0 p1
     
    3TRUNCATE PARTITION(截取分区)
    ALTER TABLE t1 TRUNCATE PARTITION p0
     
    ALTER TABLE t1 TRUNCATE PARTITION p1 p3
     
    4COALESCE PARTITION(合分区)
    CREATE TABLE t2 (
    name VARCHAR (30)
    started DATE
    )
    PARTITION BY HASH( YEAR(started) )
    PARTITIONS 6
     
    ALTER TABLE t2 COALESCE PARTITION 2
     
    5REORGANIZE PARTITION(拆分重组分区)
    1)拆分分区
     
    ALTER TABLE table ALGORITHMINPLACE REORGANIZE PARTITION
     
    ALTER TABLE employees ADD PARTITION (
    PARTITION p5 VALUES LESS THAN (2010)
    PARTITION p6 VALUES LESS THAN MAXVALUE
    )
     
    2)重组分区
     
    ALTER TABLE members REORGANIZE PARTITION s0s1 INTO (
    PARTITION p0 VALUES LESS THAN (1970)
    )
    ALTER TABLE tbl_name
    REORGANIZE PARTITION partition_list
    INTO (partition_definitions)
    ALTER TABLE members REORGANIZE PARTITION p0p1p2p3 INTO (
    PARTITION m0 VALUES LESS THAN (1980)
    PARTITION m1 VALUES LESS THAN (2000)
    )
    ALTER TABLE tt ADD PARTITION (PARTITION np VALUES IN (4 8))
    ALTER TABLE tt REORGANIZE PARTITION p1np INTO (
    PARTITION p1 VALUES IN (6 18)
    PARTITION np VALUES in (4 8 12)
    )
     
    6ANALYZE CHECK PARTITION(分析检查分区)
    1)ANALYZE 读取存储分区中值分布情况
     
    ALTER TABLE t1 ANALYZE PARTITION p1 ANALYZE PARTITION p2
     
    ALTER TABLE t1 ANALYZE PARTITION p1 p2
     
    2)CHECK 检查分区否存错误
     
    ALTER TABLE t1 ANALYZE PARTITION p1 CHECK PARTITION p2
     
    7REPAIR分区
    修复破坏分区
     
    ALTER TABLE t1 REPAIR PARTITION p0p1
     
    8OPTIMIZE
    该命令回收空闲空间分区碎片整理分区执行该命令相次分区执行 CHECK PARTITION ANALYZE PARTITIONREPAIR PARTITION命令
     

     
    ALTER TABLE t1 OPTIMIZE PARTITION p0 p1
     
    9REBUILD分区
    重建分区相先删分区中数然重新插入分区碎片整理
     
    ALTER TABLE t1 REBUILD PARTITION p0 p1
     
    10EXCHANGE PARTITION(分区交换)
    分区交换语法
     
    ALTER TABLE pt EXCHANGE PARTITION p WITH TABLE nt
     
    中pt分区表ppt分区(注子分区)nt目标表
     
    实分区交换限制蛮
     
    1) nt分区表
     
    2)nt时表
     
    3)ntpt结构必须致
     
    4)nt存外键约束键外键
     
    5)nt中数位p分区范围外
     
    具体参考MySQL官方文档
     
    11迁移分区(DISCARD IMPORT )
    ALTER TABLE t1 DISCARD PARTITION p2 p3 TABLESPACE
     
    ALTER TABLE t1 IMPORT PARTITION p2 p3 TABLESPACE
     
    实验环境:(mysql57)
    源库:1921682200 mysql5716 zhangdbemp_2分区表
    目标库:1921682100 mysql5718 test (zhangdbemp表导入目标库test schema)
    :源数库中创建测试分区表emp_2然导入数
    MySQL [zhangdb]> CREATE TABLE emp_2(
    id BIGINT unsigned NOT NULL AUTO_INCREMENT
    x VARCHAR(500) NOT NULL
    y VARCHAR(500) NOT NULL
    PRIMARY KEY(id)
    )
    PARTITION BY RANGE COLUMNS(id)
    (
    PARTITION p1 VALUES LESS THAN (1000)
    PARTITION p2 VALUES LESS THAN (2000)
    PARTITION p3 VALUES LESS THAN (3000)
    )
    (接着创建存储程导入测试数)
    DELIMITER
    CREATE PROCEDURE insert_batch()
    begin
    DECLARE num INT
    SET num1
    WHILE num < 3000 DO
    IF (num100000) THEN
    COMMIT
    END IF
    INSERT INTO emp_2 VALUES(NULL REPEAT('X' 500) REPEAT('Y' 500))
    SET numnum+1
    END WHILE
    COMMIT
    END
    DELIMITER
    mysql> select TABLE_NAMEPARTITION_NAME from information_schemapartitions where table_schema'zhangdb'
    +++
    | TABLE_NAME | PARTITION_NAME |
    +++
    | emp | NULL |
    | emp_2 | p1 |
    | emp_2 | p2 |
    | emp_2 | p3 |
    +++
    4 rows in set (000 sec)
    mysql> select count(*) from emp_2 partition (p1)
    ++
    | count(*) |
    ++
    | 999 |
    ++
    1 row in set (000 sec)
    mysql> select count(*) from emp_2 partition (p2)
    ++
    | count(*) |
    ++
    | 1000 |
    ++
    1 row in set (000 sec)
    mysql> select count(*) from emp_2 partition (p3)
    ++
    | count(*) |
    ++
    | 1000 |
    ++
    1 row in set (000 sec)
    面出emp_2分区表已创建完成3子分区分区点数
    :目标数库中创建emp_2表结构数(源库show create table emp_2\G 方法 查创建该表sql)
    MySQL [test]> CREATE TABLE `emp_2` (
    `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT
    `x` varchar(500) NOT NULL
    `y` varchar(500) NOT NULL
    PRIMARY KEY (`id`)
    ) ENGINEInnoDB AUTO_INCREMENT3000 DEFAULT CHARSETutf8mb4
    *50500 PARTITION BY RANGE COLUMNS(id)
    (PARTITION p1 VALUES LESS THAN (1000) ENGINE InnoDB
    PARTITION p2 VALUES LESS THAN (2000) ENGINE InnoDB
    PARTITION p3 VALUES LESS THAN (3000) ENGINE InnoDB) *
    [root@localhost test]# ll
    rwr 1 mysql mysql 98304 May 25 1558 emp_2#P#p0ibd
    rwr 1 mysql mysql 98304 May 25 1558 emp_2#P#p1ibd
    rwr 1 mysql mysql 98304 May 25 1558 emp_2#P#p2ibd
    注意:
    ※约束条件字符集等等必须致建议show create table t1 获取创建表SQL否新服务器导入表空间时候会提示1808错误
    :目标数库丢弃分区表表空间
    MySQL [test]> alter table emp_2 discard tablespace
    Query OK 0 rows affected (012 sec)
    [root@localhost test]# ll 时候刚3分区idb文件没
    rwr 1 mysql mysql 8604 May 25 0414 emp_2frm
    :源数库运行FLUSH TABLES … FOR EXPORT 锁定表生成cfg元数文件cfgibd文件传输目标数库中
    mysql> flush tables emp_2 for export
    Query OK 0 rows affected (000 sec)
    [root@localhost zhangdb]# scp emp_2* root@1921682100mysqldatatest 文件cp目标数库
    mysql> unlock tables 表锁否
    :目标数库中文件授权然导入表空间查数否完整
    [root@localhost test]# chown mysqlmysql emp_2#*
    MySQL [test]> alter table emp_2 import tablespace
    Query OK 0 rows affected (096 sec)
    MySQL [test]> select count(*) from emp_2
    ++
    | count(*) |
    ++
    | 2999 |
    ++
    1 row in set (063 sec)
    面查知分区表已导入目标数库中
    外部分子分区导入目标数库中(整分区表会需子分区导入目标数库中)
    部分子分区导入目标数库方法:
    1)创建目标表时候需创建导入分区: 创建p2 p3两分区
    CREATE TABLE `emp_2` (
    `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT
    `x` varchar(500) NOT NULL
    `y` varchar(500) NOT NULL
    PRIMARY KEY (`id`)
    ) ENGINEInnoDB AUTO_INCREMENT3000 DEFAULT CHARSETutf8mb4
    *50500 PARTITION BY RANGE COLUMNS(id)
    (
    PARTITION p2 VALUES LESS THAN (2000) ENGINE InnoDB
    PARTITION p3 VALUES LESS THAN (3000) ENGINE InnoDB) *
    2)源库cp目标库文件然俩需分区
    3)操作方法样
     
    六获取分区相关信息
    1 通 SHOW CREATE TABLE 语句查分区表分区子句
    譬mysql> show create table eG
     
    2 通 SHOW TABLE STATUS 语句查表否分区应Create_options字段

    mysql> show table statusG
     
    *************************** 1 row ***************************
     
    Name e Engine InnoDB Version 10 Row_format Compact Rows 6 Avg_row_length 10922 Data_length 65536Max_data_length 0 Index_length 0 Data_free 0 Auto_increment NULL Create_time 20151207 222606 Update_time NULL Check_time NULL Collation latin1_swedish_ci Checksum NULL Create_options partitioned Comment
     
    3 查 INFORMATION_SCHEMAPARTITIONS表
    4 通 EXPLAIN PARTITIONS SELECT 语句查具体SELECT语句会访问分区
     
    七MySQL57partition表改进
    HANDLER statements:MySQL 571分区表开始支持HANDLER语句
    index condition pushdown:MySQL573分区表开始支持ICP
    load data:MySQL57开始缓存实现性提升分区130KB缓区实现点
    Perpartition索引缓存:MySQL57开始支持CACHE INDEXLOAD INDEX INTO CACHE语句分区MyISAM表支持索引缓存
    FOR EXPORT选项(FLUSH TABLES)MySQL 574分区InnoDB表开始支持FLUSH TABLES语句FOR EXPORT选项
    MySQL 572开始子分区支持ANALYZECHECKOPTIMIZEREPAIRTRUNCATE操作
    Teradata分区中常时间分区例添加create table语句末尾实现2013年全年天分区(省事次分510年):
     
    PARTITION BY RANGE_N(
     
    Rcd_Dt BETWEEN DATE '20130101' AND DATE '20131231'
     
    EACH INTERVAL '1' DAY NO RANGE
     
    )
     
    外常(容易掌握)字符串取值分区述时间分区中RANGE_N关键字值分区采CASE_N关键字例示:
     
    PARTITION BY CASE_N(
     
    (CASE WHEN (my_field'A') THEN (1) ELSE (0) END)1
     
    (CASE WHEN (my_field'B') THEN (2) ELSE (0) END)2
     
    (CASE WHEN (my_field'C') THEN (3) ELSE (0) END)3
     
    NO CASE OR UNKNOWN)
     
    更进步中面语法元素:
     
    my_field'A'
     
    修改类似样形式:
     
    SUBSTR(my_field11) IN ('E''F''G')
     
    现实中访问数全表扫描变成分区扫描原某步骤达成10100倍性提升复杂耗时较长作业总够缩短半运行时间
     
    1. Data partitioning

    In a distributed program the cost of communication is high, so controlling how a dataset is partitioned across nodes in order to minimise network traffic can noticeably improve overall performance. If an RDD is only going to be scanned once there is no point in partitioning it up front; partitioning helps when a dataset is reused repeatedly in key-based operations such as joins.

    Spark cannot explicitly control which worker node a particular key lands on, but it does ensure that a given set of keys ends up together on the same node.

    Take join as an example: if the RDDs have not been re-partitioned by key, the join by default computes the hash of the keys of both datasets, sends the records with equal hash values to the same machine over the network, and then joins the records with the same key on that machine.
     
    2. The partitionBy() operator
    from pyspark import SparkContext, SparkConf
    if __name__ == '__main__':
        conf = SparkConf().setMaster("local").setAppName("word_count")
        sc = SparkContext(conf=conf)
        pair_rdd = sc.parallelize([('a', 1), ('b', 10), ('c', 4), ('d', 7), ('e', 9), ('f', 10)])
        rdd_1 = pair_rdd.partitionBy(2, partitionFunc=lambda x: ord(x) % 2).persist()
        print(rdd_1.glom().collect())
     
    Result:
    rdd_1
    [[('b', 10), ('d', 7), ('f', 10)], [('a', 1), ('c', 4), ('e', 9)]]
    partitionBy() applies to pair RDDs: each key of the pair RDD is passed into the partitionFunc function. Note that the result of partitionBy() should be persisted; otherwise the partitioning is redone every time the RDD is used, which cancels out the benefit of calling partitionBy() in the first place. Defining a custom partitioner in Python is less cumbersome than in Scala (Spark itself also ships HashPartitioner and RangePartitioner).

    3. Operations affected by the partitioning

    The following operators produce result RDDs with a pre-set partitioner:

    cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort(), mapValues() (if the parent RDD has a partitioner), flatMapValues() (if the parent RDD has a partitioner), filter() (if the parent RDD has a partitioner)
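    A quick sketch for checking whether an operator kept the partitioner set by partitionBy(); it assumes an existing SparkContext sc. mapValues preserves the partitioner because it cannot change the keys, while a plain map drops it:
    pair_rdd = sc.parallelize([("a", 1), ("b", 10), ("c", 4), ("d", 7)]).partitionBy(2).persist()

    kept = pair_rdd.mapValues(lambda v: v * 2)            # keys unchanged -> partitioner is preserved
    lost = pair_rdd.map(lambda kv: (kv[0], kv[1] * 2))    # map may change keys -> partitioner is dropped

    print(pair_rdd.partitioner is not None)   # True
    print(kept.partitioner is not None)       # True
    print(lost.partitioner is not None)       # False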
     
    写入分区
     <变量> sparksql(
    <查询语句>
    )
    <变量>writeparquet(
    path'<存储路径><表名>{par_col}{par_val}'format(par_col'<分区列名>'par_val'<分区值>')
    modeoverwrite)
     
    删分区
    Parquet文件中:
    import subprocess
     
    subprocesscheck_call('rm r <存储路径><表名>{par_col}{par_val}'format(par_col'<分区列名>'par_val'<分区值>')shellTrue)
     
    Hive表中:
    from pysparksql import HiveContext
     
    hive HiveContext(sparksparkContext)
    hivesql('alter table [`<库名>`]`<表名>` drop [if exists] partition(<分区列名>'<分区值>')')
     
    查询分区:
    Hive表中:
    from pysparksql import HiveContext
     
    hive HiveContext(sparksparkContext)
    hivesql('show partitions `[<库名>]<表名>`')
     
    显示分区HDFS存储路径
    Parquet文件中:
    import subprocess
     
    subprocesscheck_call('hdfs dfs ls hdfs<表名>')shellTrue)
     
    Hive表中:
    from pysparksql import HiveContext
     
    hive HiveContext(sparksparkContext)
    hivesql(
    DESC FORMATTED [`<库名>`]`<表名>` partition (<分区列名>'<分区值>'))

    hivesql(
    DESC EXTENDED [`<库名>`]`<表名>` partition (<分区列名>'<分区值>'))

    hivesql(
    USE `<库名>`
    SHOW TABLE EXTENDED LIKE `<表名>` PARTITION(<分区列名>'<分区值>'))
     
    添加分区加载分区数:
    from pysparksql import HiveContext
     
    hive HiveContext(sparksparkContext)
    hivesql(
    alter table [`<库名>`]`<表名>` add partition (<分区列名>'<分区值>') location 'hdfs<表名>') #改变源数存储位置
    hivesql(
    load data inpath '<存储路径><表名>[<分区列名><分区值>]' [overwrite] into table [`<库名>`]`<表名>` partition(<分区列名>'<分区值>')) #会源数切hive表指定路径
     
    修改表分区名称
    ALTER TABLE [`<库名>`]`<表名>` PARTITION ((<分区列名>'<分区值1>') RENAME TO PARTITION ((<分区列名>'<分区值2>')
     
    修复Hive表元数分区(般加表创建语句分区操作语句面)
    from pysparksql import HiveContext
     
    hive HiveContext(sparksparkContext)
    hivesql(
    MSCK REPAIR TABLE [`<库名>`]`<表名>`)

    Shell命令提交运行脚
    vim [<存储路径>]<文件名>sql

    mysql [h ] [u <账号>] [p<密码>] [<数库名>] < [<存储路径>]<文件名>sql

    mysql [h ] [u <账号>] [p<密码>]
    mysql> use <数库名>
    mysql> source [<存储路径>]<文件名>sql
    mysql> QUIT
    File contents:
    #!/bin/bash
    /usr/bin/bteq <<EOF
    .LOGON <账号>,<密码>
    <查询语句数操作语句1>
    <查询语句数操作语句2>
    <查询语句数操作语句3>
    …
    <查询语句数操作语句n>
    .IF ERRORCODE <> 0 THEN .QUIT ERRORCODE
    .LOGOFF
    .QUIT
    EOF
     

    运行方式:
    sh [<存储路径>]<文件名>sh
    nohup sh <文件名>sh & #台运行登录断开中断运行
    sh <文件名>sh #前台运行登录断开中断运行



    File contents:
    vim [<存储路径>/]<文件名>.sh
    script_path=<脚本存储路径>
    current_data=$(date +%Y%m%d)
    current_time=$(date +%H%M%S)
    scripts_sql=${script_path}/sql
    sql_file_content=`cat ${scripts_sql}/$1`
    log_path=${scripts_sql}/log
    log_file=${log_path}/$1.${current_time}.log
    tdserver=
    dbuser=<账号>
    dbpass=<密码>
    dbinfo=${tdserver}/${dbuser},${dbpass}

    bteq << END >> ${log_file} 2>&1
    .LOGON ${dbinfo}
    .MAXERROR 1

    ${sql_file_content}

    .LOGOFF
    END

    if [[ $? -ne 0 ]]; then
        echo -e "Bteq execution of $1 failed.\n Please check the log in \n ${log_file}"
    fi

    RETCODE=$?
    if [ ${RETCODE} != 0 ]; then
        echo "Please check the error log file ${log_file}"
        exit 1
    else
        echo "Query executed successfully"
    fi

    cat ${log_file}

    运行方式:
    nohup sh <文件名>sh sql & #台运行登录断开中断运行
    sh <文件名>sh sql #前台运行登录断开中断运行
    sparkenvsh
    vim HADOOP_HOMEconfsparkenvsh

    vim [<存储路径>]sh
    SCRIPT_PATH`dirname 0`

    SCRIPT_FILENAME`basename 0`
    SCRIPT_PATH`dirname 0`
    SCRIPT_NAME{SCRIPT_FILENAME*}
    SCRIPT_FULL_PATH(readlink f 0)
    SCRIPT_ROOT_DIR(dirname {SCRIPT_FULL_PATH})

    SCALA_VERSION
    SCALA_PATH
    SCALA_HOME{SCALA_PATH}scala{SCALA_VERSION}
    SPARK_HOME{SPARK_HOME}
    SPARK_CONF_DIR
    HADOOP_CONF_DIRHADOOP_HOMEetchadoop
    HADOOP_VERSION
    HBASE_CONF_DIR
    STAGE_DIRhdfs
    QUEUE{QUEUE}
    JAR_DIR{JAR_DIRSCRIPT_ROOT_DIRlib}
    LOG_DIRSCRIPT_ROOT_DIRlogs
    CONF_DIR{CONF_DIRSCRIPT_ROOT_DIRconf}
    EMAIL_SERVER{EMAIL_SERVER<邮件服务器域名IP>}
    EMAIL_FROM{EMAIL_FROM<发件邮箱址>}
    EMAIL_TO{EMAIL_TO<收件邮箱址>}
    TODAY`date +Ymd`
    THIS_HOUR`date +HM`
    YEAR`date d {TODAY} +Y`
    MONTH`date d {TODAY} +m`
    DAY`date d {TODAY} +d`
    ALERT_EMAILS<通知邮件收件邮箱>
    SPARK_DRIVER_CORE{SPARK_DRIVER_CORE2}
    SPARK_DRIVER_MEMORY{SPARK_DRIVER_MEMORY8G}
    SPARK_EXECUTOR_MEMORY{SPARK_EXECUTOR_MEMORY16G}
    SPARK_EXECUTOR_CORE{SPARK_EXECUTOR_CORE2}
    SPARK_DEFAULT_PARALLELISM{SPARK_EXECUTOR_CORE150}
    SPARK_YARN_TAGS{SPARK_YARN_TAGSLLAMASLAtrueproject_namegalaxi}
    EXECUTOR_MEMORY_OVERHEAD{EXECUTOR_MEMORY_OVERHEAD8192}
    DRIVER_MEMORY_OVERHEAD{DRIVER_MEMORY_OVERHEAD1024}
    BROADCAST_JOIN_THRESHOLD{BROADCAST_JOIN_THRESHOLD104857600}
    SHUFFLE_PARTITIONS{SHUFFLE_PARTITIONS6001}
    SPARK_DYNAMICALLOCATION_ENABLED{SPARK_DYNAMICALLOCATION_ENABLEDtrue}
    SPARK_DYNAMICALLOCATION_MINEXECUTORS{SPARK_DYNAMICALLOCATION_MINEXECUTORS10}
    SPARK_DYNAMICALLOCATION_MAXEXECUTORS{SPARK_DYNAMICALLOCATION_MAXEXECUTORS500}
    NUM_EXECUTORS{NUM_EXECUTORS{SPARK_DYNAMICALLOCATION_MINEXECUTORS}}
    SPARK_MEMORY_FRACTION{SPARK_MEMORY_FRACTION06}
    SPARK_SHUFFLE_SORT_BYPASSMERGETHRESHOLD{SPARK_SHUFFLE_SORT_BYPASSMERGETHRESHOLD200}
    SQL_OUTPUT_PARTITIONS{SQL_OUTPUT_PARTITIONS200}
    SPARK_SHUFFLE_SERVICE_ENABLED{SPARK_DYNAMICALLOCATION_ENABLED}

    umask 000

    input

    while [ 1 ] do
    if [ 1 i ] then
    shift
    # readlinklinux找出符号链接指位置
    file(readlink f 1)
    export full_name(dirname {file})
    export files`find {full_name} regex *\\(py) | paste sd `
    PY_NAME`basename 1`
    input1
    fi
    shift
    done

    PY_NAME`echo 1 | grep o '^[^\]*'`
    CUR_PATH(cd (dirname 0)pwd)
    echo CUR_PATH

    DEPLOY_MODE{{cluster|client}}
    JOB_NAMEPY{PY_NAME}{USER}_`date +s`

    {SPARK_HOME}binsparksubmit \
    master yarn \
    deploymode cluster \
    name {JOB_NAME} \
    queue {QUEUE} \
    executormemory {SPARK_EXECUTOR_MEMORY} \
    executorcores {SPARK_EXECUTOR_CORE} \
    conf sparkdriverextraJavaOptionsDhdpversion{HADOOP_VERSION} Dhadoop{HADOOP_CONF_DIR} Dlog4jconfigurationlog4jproperties DLOG_DIR{LOG_DIR} DJOB_NAME{JOB_NAME} DEMAIL_SERVER{EMAIL_SERVER} DEMAIL_FROM{EMAIL_FROM} DEMAIL_TO{EMAIL_TO} \
    conf sparkexecutorextraJavaOptionsDhdpversion{HADOOP_VERSION} XX+PrintGCDateStamps XX+PrintFlagsFinal XX+PrintGCDetails XX+PrintGC XX+PrintGCTimeStamps \
    conf sparkyarnamextraJavaOptionsDhdpversion{HADOOP_VERSION} \
    conf sparksqlautoBroadcastJoinThreshold{BROADCAST_JOIN_THRESHOLD} \
    conf sparkdrivermemory{SPARK_DRIVER_MEMORY} \
    conf sparkdrivercores{SPARK_DRIVER_CORE} \
    conf sparkdriverextraClassPathusrhdpcurrenthadoopclientlibsnappy*jar \
    conf sparkdriverextraLibraryPathusrhdpcurrenthadoopclientlibnative \
    conf sparkexecutorextraLibraryPathusrhdpcurrenthadoopclientlibnative \
    conf sparkyarndrivermemoryOverhead{DRIVER_MEMORY_OVERHEAD} \
    conf sparkyarnexecutormemoryOverhead{EXECUTOR_MEMORY_OVERHEAD} \
    conf sparkyarnmaxAppAttempts1 \
    conf sparkshuffleiopreferDirectBufsfalse \
    conf sparkdrivermaxResultSize{SPARK_DRIVER_MEMORY} \
    conf sparktaskmaxFailures10 \
    conf sparknetworktimeout600s \
    conf sparksqlshufflepartitions{SHUFFLE_PARTITIONS} \
    conf sparkyarnstagingDir{STAGE_DIR} \
    conf sparkhadoopyarntimelineserviceenabledfalse \
    conf sparkdynamicAllocationenabled{SPARK_DYNAMICALLOCATION_ENABLED} \
    conf sparkdynamicAllocationminExecutors{SPARK_DYNAMICALLOCATION_MINEXECUTORS} \
    conf sparkdynamicAllocationmaxExecutors{SPARK_DYNAMICALLOCATION_MAXEXECUTORS} \
    conf sparkdynamicAllocationexecutorIdleTimeout3600s \
    conf sparkdynamicAllocationschedulerBacklogTimeout600s \
    conf sparkyarntagsLLAMASLAtrueproject_name<项目名称> \
    numexecutors {NUM_EXECUTORS} \
    conf sparkmemoryfraction{SPARK_MEMORY_FRACTION} \
    conf sparkshufflesortbypassMergeThreshold{SPARK_SHUFFLE_SORT_BYPASSMERGETHRESHOLD} \
    conf sqloutputpartitions{SQL_OUTPUT_PARTITIONS} \
    conf sparkshuffleserviceenabled{SPARK_SHUFFLE_SERVICE_ENABLED} \
    conf sparkyarnaccesshadoopFileSystemshdfs
    files {files}{HADOOP_CONF_DIR}hdfssitexml{SPARK_CONF_DIR}hivesitexml{CONF_DIR}zookeeperproperties{CONF_DIR}dragonkeytab{CONF_DIR}graphjson{CONF_DIR}tablejson \
    [principal <账号>@<密钥分发中心 KDC> \
    keytab <存储路径><认证文件>keytab \]
    archives <存储路径><包Python虚拟环境>zip \
    conf sparkyarnappMasterEnvPYSPARK_PYTHONusrbinpython3 \
    input

    运行Shell:
    sh [<存储路径>]sh i py

    Shell脚容:
    vim [<存储路径>]sh

    for i in @
    do
    case i in
    i*|input*)
    INPUT{i#*}
    shift # past argumentvalue

    *)

    esac
    done

    echo input sql path{INPUT}

    commandsh [<存储路径>]sh i {INPUT}
    echo {command}
    eval command 2
    res
    if [ {res}X 0X ]then
    echo INFO Run PySpark file successfully
    else
    echo ERROR Run PySpark failed and check log for detail please
    fi
    return res

    运行Shell:
    sh [<存储路径>]sh ipy

    readlink:

    readlinklinux找出符号链接指位置
    例1:
    readlink f usrbinawk
    结果:
    usrbingawk #usrbinawk软连接指gawk
    例2:
    readlink f homesoftwarelog
    homesoftwarelog #果没链接显示身绝路径

    获取前脚路径:
    pathsh
    #binbash
    path(cd `dirname 0`pwd)
    echo path
    path2(dirname 0)
    echo path2
    前脚存路径:homesoftware
    sh pathsh
    homesoftware

    解释:
    dirname 0 获取前脚相路径
    cd `dirname 0`pwd 先cd前路径然pwd印成绝路径

    方法二:
    pathsh
    #binbash
    path(dirname 0)
    path2(readlink f path)
    echo path2
    sh pathsh
    homesoftware
    解释:
    readlink f path 果path没链接显示身绝路径
    获取路径较
    pathsh
    #binbash
    PATH1(dirname 0)
    PATH2(cd `dirname 0`pwd)
    PATH3(readlink f PATH1)
    echo PATH1
    echo PATH2
    echo PATH3
    前脚存路径:homesoftware
    sh pathsh
    echo PATH1
    homesoftware echo PATH2
    home echo PATH3
    Shell命令行交互式运行代码
    mysql [h ] [u <账号>] [p<密码>] [<数库名>]
    mysql><查询语句数操作语句1>

    mysql><查询语句数操作语句2>

    mysql><查询语句数操作语句3>



    mysql><查询语句数操作语句n>
    mysql>QUIT
    usrbinbteq  c UTF8
    LOGON <账号><密码>
     
    <查询语句数操作语句1>

    <查询语句数操作语句2>

    <查询语句数操作语句3>



    <查询语句数操作语句n>
     
    IF ERRORCODE <> 0 THEN QUIT ERRORCODE
     
    LOGOFF  
    QUIT
     


    BTEQBasic Teradata QueryTeradata发行提交SQL查询前端工具BTEQ命令必须开头结尾者什

    BTEQ常报表格式化输出设置:
    SET DEFAULTS:输出格式定义成默认值
    SET ECHOREQ ONOFF:否SQL请求BTEQ命令复制输出报表中
    SET FOLDLINE ON 1:第1字段显示第1行字段值显示第2行
    SET FOOTING [NULL'string']:定义脚注包含&DATE&TIME&PAGE&n
    SET FORMAT ONOFF:设置OFF时BTEQ忽略FOOTINGFORMCHARRTITLEHEADINGPAGEBREAK等设置
    SET HEADING [NULL'string']:定义页头FOOTING样
    SET NULL AS 'string':NULL默认值问号改变缺省值
    SET OMIT ONOFF [nALL]:指定字段包括报表中页头脚注中
    SET PAGEBREAK ONOFF [nALL]:指定字段值发生变化时插入分页符开始新页
    SET PAGELENGTH n:定义页面长度默认值55行
    SET RTITLE ['string']:定义页方标题动包含日期页号
    SET SEPARATOR ['string'n]:定义字段间分隔符n表示空格
    SET SUPPRESS ONOFF [nALL]:指定字段果遇连续重复值空格代
    SET SKIPLINE ONOFF [nALL]:指定字段值发生变化时插入空行
    SET SKIPDOUBLE ONOFF [nALL]:指定字段值发生变化时插入两空行
    SET UNDERLINE ONOFF [nALL]:行输出中指定字段加划线
    SET WIDTH n:设置报表宽度默认值75


    假定Teradata数库DEMO(名字必须HOSTS文件中进行定义)SQL01户名进行登录键入logon demosql01退出Teradata命令logoff退出BTEQ命令quit果需BTEQ中运行unix命令必须运行os xxxx

    BTEQ交互方式运行批处理方式运行BTEQ输出保存文件中重新恢复标准输出:export filexxxxexport reset


    编写BTEQ脚时插入行进行注释采单行进行注释:
    SET SESSION TRANSACTION ANSI
    脚文件中必须户密码logon命令中提供
    LOGIN sql01sql01
    SELECT FROM WHERE
    QUIT
    保存脚testScript通run filetestScript进行脚运行
    binpyspark
    >>>

    >>>

    >>>



    >>>
    >>> exit()

    hive
    hive>

    hive>

    hive>


    hive>

    hive> exit

    导入导出CSV文件
    导入:
    LOAD DATA INFILE '<存储路径>/<文件名>.csv' INTO TABLE [`<库名>`.]`<表名>` FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
    常用参数:
    FIELDS TERMINATED BY ',': the field separator
    OPTIONALLY ENCLOSED BY '"': treat double-quoted text as a single field; when Excel saves to CSV, fields containing special characters (commas etc.) are automatically wrapped in double quotes
    LINES TERMINATED BY '\n': the line separator; note that files created on Windows use '\r\n'

    导出:
    mysql> SELECT <字段名1>,<字段名2>,<字段名3>,...,<字段名n> INTO OUTFILE '<存储路径>/<文件名>.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    FROM [`<库名>`.]`<表名>`
    [WHERE <筛选条件>];
    导入:
    编辑文件:
    vim <文件名>in
    SET width 64000
    SET session transaction btet
    logmech ldap
    logon <户名><密码>

    DATABASE <库名>

    PACK 1000
    IMPORT VARTEXT '' FILE<存储路径>\<文件名>csv
    REPEAT *
    USING(<字段名1> <类型1>
    <字段名2> <类型2>
    <字段名3> <类型3>

    <字段名n> <类型n>)

    insert into <库名><表名> (
    <字段名1>
    <字段名2>
    <字段名3>

    <字段名n>
    )
    values
    ( <字段名1>
    <字段名2>
    <字段名3>

    <字段名n>
    )
    LOGOFF
    EXIT

    执行文件:
    binbteq < <文件名>in

    导出:
    vim <文件名>out
    SET SESSION TRANSACTION BTET
    LOGON <户名><密码>
    EXPORT FILE <存储路径>\<文件名>csv
    SET SEPARATOR ''
    DATABASE <库名>
    SELECT * FROM <表名>
    [筛选条件]
     
    LOGOFF
    EXIT

    执行文件:
    binbteq < <文件名>out
    导入:
    df = spark.read.load(
        path='<存储路径>/<文件名>.csv',
        format='csv', header=True)


    导出:
    df = ...
    df.repartition(1).write.csv(path='<存储路径>/<表名>[<分区字段>=<分区值>].csv', header=True, sep=',', mode='overwrite')

    df.write.format('com.databricks.spark.csv').save('<存储路径>/<表名>[<分区字段>=<分区值>].csv')

    df.toPandas().to_csv('<存储路径>/<表名>[<分区字段>=<分区值>].csv')

    # index: whether to write the row index; header: whether to write column names (True if needed)
    outputpath = '<存储路径>/<表名>[<分区字段>=<分区值>].csv'
    df.toPandas().to_csv(outputpath, sep=',', index=False, header=False)


    #方法一

    df = spark.read.csv(r'<存储路径>/<表名>[<分区字段>=<分区值>].csv', encoding='gbk', header=True, inferSchema=True)  # header: whether the first row holds column names; inferSchema: infer the schema automatically when none is given

    或者:

    df = spark.read.csv(r'<存储路径>/<表名>[<分区字段>=<分区值>].csv', encoding='gbk', header=True, schema=schema)  # with an explicit schema

    #方法二

    df = spark.read.format('csv').option('header', True).option('encoding', 'gbk').load(r'<存储路径>/<表名>[<分区字段>=<分区值>].csv')

    或者:

    df = spark.read.format('csv').option('encoding', 'gbk').option('header', True).load(r'<存储路径>/<表名>[<分区字段>=<分区值>].csv', schema=schema)

    # Appending data when writing CSV:

    df.write.mode('append').option(...).option(...).format('csv').save('<存储路径>')

    # Note: if the CSV is created without a header row, drop the header option when reading it
    权限控制
    查询户权限:
    Global level privileges
    SELECT CONCAT(user '@' host)delete_privdrop_priv FROM mysqluser

    Table level privileges
    select CONCAT(user '@' host)usertable from mysqltables_priv

    SHOW GRANTS
    查询户权限:
    SELECT
    UserName
    DatabaseName
    TableName
    ColumnName
    AccessRight
    GrantAuthority
    GrantorName
    AllnessFlag
    CreatorName
    CreateTimeStamp
    FROM dbcallrights
    WHERE username'<户ID>'
    AND databasename'<库名>'
     
    查询户权限
    execute <库名>AllUserRights ('<户名>')
    UDF宏定义
    create macro <库名>AllUserRights (UserName char(128)) as (
    locking row for access select
    UserName (varchar(128))
    AccessType (varchar(128))
    RoleName (varchar(128))
    DatabaseName (varchar(128))
    TableName (varchar(128))
    ColumnName (varchar(128))
    AccessRight
    case
    when accessright'AE' then 'ALTER EXTERNALPROCEDURE'
    when accessright'AF' then 'ALTER FUNCTION'
    when accessright'AP' then 'ALTER PROCEDURE'
    when accessright'AS' then 'ABORT SESSION'
    when accessright'CA' then 'CREATE AUTHORIZATION'
    when accessright'CD' then 'CREATE DATABASE'
    when accessright'CE' then 'CREATE EXTERNAL PROCEDURE'
    when accessright'CF' then 'CREATE FUNCTION'
    when accessright'CG' then 'CREATE TRIGGER'
    when accessright'CM' then 'CREATE MACRO'
    when accessright'CO' then 'CREATE PROFILE'
    when accessright'CP' then 'CHECKPOINT'
    when accessright'CR' then 'CREATE ROLE'
    when accessright'CS' then 'CREATE SERVER'
    when accessright'CT' then 'CREATE TABLE'
    when accessright'CU' then 'CREATE USER'
    when accessright'CV' then 'CREATE VIEW'
    when accessright'CZ' then 'CREATE ZONE'
    when accessright'C1' then 'CREATE DATASET SCHEMA'
    when accessright'D' then 'DELETE'
    when accessright'DA' then 'DROP AUTHORIZATION'
    when accessright'DD' then 'DROP DATABASE'
    when accessright'DF' then 'DROP FUNCTION'
    when accessright'DG' then 'DROP TRIGGER'
    when accessright'DM' then 'DROP MACRO'
    when accessright'DO' then 'DROP PROFILE'
    when accessright'DP' then 'DUMP'
    when accessright'DR' then 'DROP ROLE'
    when accessright'DS' then 'DROP SERVER'
    when accessright'DT' then 'DROP TABLE'
    when accessright'DU' then 'DROP USER'
    when accessright'DV' then 'DROP VIEW'
    when accessright'DZ' then 'DROP ZONE'
    when accessright'D1' then 'DROP DATASET SCHEMA'
    when accessright'E' then 'EXECUTE'
    when accessright'EF' then 'EXECUTE FUNCTION'
    when accessright'GC' then 'CREATE GLOP'
    when accessright'GD' then 'DROP GLOP'
    when accessright'GM' then 'GLOP MEMBER'
    when accessright'I' then 'INSERT'
    when accessright'IX' then 'INDEX'
    when accessright'MC' then 'CREATE MAP'
    when accessright'MD' then 'DROP MAP'
    when accessright'MR' then 'MONITOR RESOURCE'
    when accessright'MS' then 'MONITOR SESSION'
    when accessright'NT' then 'NONTEMPORAL'
    when accessright'OD' then 'OVERRIDE DELETE POLICY'
    when accessright'OI' then 'OVERRIDE INSERT POLICY'
    when accessright'OP' then 'CREATE OWNER PROCEDURE'
    when accessright'OS' then 'OVERRIDE SELECT POLICY'
    when accessright'OU' then 'OVERRIDE UPDATE POLICY'
    when accessright'PC' then 'CREATE PROCEDURE'
    when accessright'PD' then 'DROP PROCEDURE'
    when accessright'PE' then 'EXECUTE PROCEDURE'
    when accessright'R' then 'RETRIEVESELECT'
    when accessright'RF' then 'REFERENCES'
    when accessright'RS' then 'RESTORE'
    when accessright'SA' then 'SECURITY CONSTRAINT ASSIGNMENT'
    when accessright'SD' then 'SECURITY CONSTRAINT DEFINITION'
    when accessright'ST' then 'STATISTICS'
    when accessright'SS' then 'SET SESSION RATE'
    when accessright'SR' then 'SET RESOURCE RATE'
    when accessright'TH' then 'CTCONTROL'
    when accessright'U' then 'UPDATE'
    when accessright'UU' then 'UDT Usage'
    when accessright'UT' then 'UDT Type'
    when accessright'UM' then 'UDT Method'
    when accessright'W1' then 'WITH DATASET SCHEMA'
    when accessright'ZO' then 'ZONE OVERRIDE'
    else''
    end (varchar(26)) as AccessRightDesc
    GrantAuthority
    GrantorName (varchar(128))
    AllnessFlag
    CreatorName (varchar(128))
    CreateTimeStamp
    from
    (
    select
    UserName
    'User' (varchar(128)) as AccessType
    '' (varchar(128)) as RoleName
    DatabaseName
    TableName
    ColumnName
    AccessRight
    GrantAuthority
    GrantorName
    AllnessFlag
    CreatorName
    CreateTimeStamp
    from dbcallrights
    where UserName username
    and CreatorName not username
    union all
    select
    Grantee as UserName
    'Member' as UR
    rRoleName
    DatabaseName
    TableName
    ColumnName
    AccessRight
    null (char(1)) as GrantAuthority
    GrantorName
    null (char(1)) as AllnessFlag
    null (char(1)) as CreatorName
    CreateTimeStamp
    from dbcallrolerights r
    join dbcrolemembers m
    on mRoleName rRoleName
    where UserName username
    union all
    select
    User as UserName
    mGrantee as UR
    rRoleName
    DatabaseName
    TableName
    ColumnName
    AccessRight
    null (char(1)) as GrantAuthority
    GrantorName
    null (char(1)) as AllnessFlag
    null (char(1)) as CreatorName
    CreateTimeStamp
    from dbcallrolerights r
    join dbcrolemembers m
    on mRoleName rRoleName
    where mgrantee in (select rolename from dbcrolemembers where grantee
    username)
    ) AllRights
    order by 4567 )
     
    通dbcallrights表中UserName列DatabaseName列TableName列AccessRight列查询获取指定指定数库中指定表操作权限执行某条SQL语句前判定前户否执行语句权限权限足时尝试动授权(太安全执行完应revoke)等措施
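
    A hedged PySpark sketch of that pre-check (it reuses the Teradata JDBC read pattern shown later in this document; `spark` and `url` are assumed to already exist, and all angle-bracket names are placeholders):

    rights_df = spark.read.jdbc(
        url=url,
        table="(SELECT AccessRight FROM dbc.allrights "
              "WHERE UserName='<户ID>' AND DatabaseName='<库名>' AND TableName='<表名>') t",
        properties={"user": "<户名>", "password": "<密码>",
                    "driver": "com.teradata.jdbc.TeraDriver"})

    # 'R' = RETRIEVE/SELECT; refuse to run the query if the right is missing
    if rights_df.filter("TRIM(AccessRight) = 'R'").count() == 0:
        raise RuntimeError("current user has no SELECT right on the target table; request a GRANT first")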
    AccessRight列缩写词应列表(40):
     
    AccessRight
    含义
    AF
    ALTER FUNCTION
    AP
    ALTER PROCEDURE
    AS
    ABORT SESSION
    CD
    CREATE DATABASE
    CF
    CREATE FUNCTION
    CG
    CREATE TRIGGER
    CM
    CREATE MACRO
    CO
    CREATE PROFILE
    CP
    CHECKPOINT
    CR
    CREATE ROLE
    CT
    CREATE TABLE
    CU
    CREATE USER
    CV
    CREATE VIEW
    D
    DELETE
    DD
    DROP DATABASE
    DF
    DROP FUNCTION
    DG
    DROP TRIGGER
    DM
    DROP MACRO
    DO
    DROP PROFILE
    DP
    DUMP
    DR
    DROP ROLE
    DT
    DROP TABLE
    DU
    DROP USER
    DV
    DROP VIEW
    E
    EXECUTE
    EF
    EXECUTE FUNCTION
    I
    INSERT
    IX
    INDEX
    MR
    MONITOR RESOURCE
    MS
    MONITOR SESSION
    PC
    CREATE PROCEDURE
    PD
    DROP PROCEDURE
    PE
    EXECUTE PROCEDURE
    RO
    REPLICATION OVERRIDE
    R
    RETRIEVESELECT
    RF
    REFERENCE
    RS
    RESTORE
    SS
    SET SESSION RATE
    SR
    SET RESOURCE RATE
    U
    UPDATE
    示例SQL语句:
     
    select username databasename tablename accessright from dbcallrights
    where databasename'systemfe' and username'dbc' and tablename'opt_ras_table'
    述语句执行结果:
    *** Query completed 12 rows found 4 columns returned
    *** Total elapsed time was 1 second

    UserName DatabaseName TableName AccessRight

    DBC SystemFe opt_ras_table DT
    DBC SystemFe opt_ras_table U
    DBC SystemFe opt_ras_table DG
    DBC SystemFe opt_ras_table RF
    DBC SystemFe opt_ras_table RS
    DBC SystemFe opt_ras_table R
    DBC SystemFe opt_ras_table I
    DBC SystemFe opt_ras_table CG
    DBC SystemFe opt_ras_table ST
    DBC SystemFe opt_ras_table DP
    DBC SystemFe opt_ras_table D
    DBC SystemFe opt_ras_table IX
    SQL语句动构建出授予权限SQL语句( GRANT语句):
     
    SEL
    TRIM(username)
    TRIM(databasename)
    TRIM(tablename)
    'GRANT '|| CASE
    WHEN AccessRight 'AF ' THEN 'ALTER FUNCTION'
    WHEN AccessRight 'AP ' THEN 'ALTER PROCEDURE'
    WHEN AccessRight 'AS ' THEN 'ABORT SESSION'
    WHEN AccessRight 'CD ' THEN 'CREATE DATABASE'
    WHEN AccessRight 'CF ' THEN 'CREATE FUNCTION'
    WHEN AccessRight 'CG ' THEN 'CREATE TRIGGER'
    WHEN AccessRight 'CM ' THEN 'CREATE MACRO'
    WHEN AccessRight 'CO ' THEN 'CREATE PROFILE'
    WHEN AccessRight 'CP ' THEN 'CHECKPOINT'
    WHEN AccessRight 'CR ' THEN 'CREATE ROLE'
    WHEN AccessRight 'CT ' THEN 'CREATE TABLE'
    WHEN AccessRight 'CU ' THEN 'CREATE USER'
    WHEN AccessRight 'CV ' THEN 'CREATE VIEW'
    WHEN AccessRight 'D ' THEN 'DELETE'
    WHEN AccessRight 'DD ' THEN 'DROP DATABASE'
    WHEN AccessRight 'DF ' THEN 'DROP FUNCTION'
    WHEN AccessRight 'DG ' THEN 'DROP TRIGGER'
    WHEN AccessRight 'DM ' THEN 'DROP MACRO'
    WHEN AccessRight 'DO ' THEN 'DROP PROFILE'
    WHEN AccessRight 'DP ' THEN 'DUMP'
    WHEN AccessRight 'DR ' THEN 'DROP ROLE'
    WHEN AccessRight 'DT ' THEN 'DROP TABLE'
    WHEN AccessRight 'DU ' THEN 'DROP USER'
    WHEN AccessRight 'DV ' THEN 'DROP VIEW'
    WHEN AccessRight 'E ' THEN 'EXECUTE'
    WHEN AccessRight 'EF ' THEN 'EXECUTE FUNCTION'
    WHEN AccessRight 'I ' THEN 'INSERT'
    WHEN AccessRight 'IX ' THEN 'INDEX'
    WHEN AccessRight 'MR ' THEN 'MONITOR RESOURCE'
    WHEN AccessRight 'MS ' THEN 'MONITOR SESSION'
    WHEN AccessRight 'PC ' THEN 'CREATE PROCEDURE'
    WHEN AccessRight 'PD ' THEN 'DROP PROCEDURE'
    WHEN AccessRight 'PE ' THEN 'EXECUTE PROCEDURE'
    WHEN AccessRight 'RO ' THEN 'REPLICATION OVERRIDE'
    WHEN AccessRight 'R ' THEN 'RETRIEVESELECT'
    WHEN AccessRight 'RF ' THEN 'REFERENCE'
    WHEN AccessRight 'RS ' THEN 'RESTORE'
    WHEN AccessRight 'SS ' THEN 'SET SESSION RATE'
    WHEN AccessRight 'SR ' THEN 'SET RESOURCE RATE'
    WHEN AccessRight 'U ' THEN 'UPDATE'
    END || ' ON '||TRIM(databasename)||''||TRIM(tablename)||' to '||TRIM(username)||'' AS Permission
    FROM dbcAllRights
    WHERE DatabaseName '<库名>' and USERNAME '<户名>' AND TABLENAME '<表名>'
    Hive表中:
    认证 (authentication): verifying who the user is.
    授权 (authorization): verifying whether that identity is allowed to perform an operation.
    Hive (as of version 0.12.0) only provides a simple permission model and it is disabled by default, so every user has the same rights and is effectively a super administrator who can query or modify any table; this does not meet normal data-warehouse security requirements. Hive permissions can be managed either at the metadata level or at the storage-file level. This section covers metadata-based management, which is enabled with the following configuration so that Hive performs permission checks:

    配置
    1. Enable authorization; once enabled, users must be granted privileges before they can operate on an object
    hive.security.authorization.enabled = true

    2. Privileges granted automatically to the owner user/roles/users when a table is created
    hive.security.authorization.createtable.owner.grants = ALL
    hive.security.authorization.createtable.role.grants = admin_role:ALL
    hive.security.authorization.createtable.user.grants = user1,user2:select;user3:create

    3. (If the error "Error while compiling statement: FAILED: SemanticException The current builtin authorization in Hive is incomplete and disabled" appears, also set the property below)
    hive.security.authorization.task.factory = org.apache.hadoop.hive.ql.parse.authorization.HiveAuthorizationTaskFactoryImpl

    角色理
    创建删角色
    create role role_name
    drop role role_name
    展示roles
    show roles
    赋予角色权限
    grant select on database db_name to role role_name
    grant select on [table] t_name to role role_name
    查角色权限
    show grant role role_name on database db_name
    show grant role role_name on [table] t_name
    角色赋予户
    grant role role_name to user user_name
    回收角色权限
    revoke select on database db_name from role role_name
    revoke select on [table] t_name from role role_name
    查某户角色
    show role grant user user_name
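
    If these role-management statements need to be issued from Python rather than from the hive shell, a hedged sketch is to batch them through "hive -e" with subprocess (the role, database and user names below are examples only):

    import subprocess

    statements = [
        "CREATE ROLE analyst_role",
        "GRANT SELECT ON DATABASE demo_db TO ROLE analyst_role",
        "GRANT ROLE analyst_role TO USER demo_user",
        "SHOW ROLE GRANT USER demo_user",
    ]
    # run all statements in one hive CLI session
    subprocess.check_call('hive -e "{};"'.format("; ".join(statements)), shell=True)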

    超级权限
    Hive权限功需完善方超级理员
    Hive中没超级理员户进行GrantRevoke操作完善超级理员必须添加hivesemanticanalyzerhook配置实现权限控制类

    hivesemanticanalyzerhook commycompanyAuthHook

    编译面代码(需导入赖antlrruntime34jarhiveexec0120cdh512jar)
    包成jar放置hiveclasspath(客户端hive shell机hiveenvsh 中环境变量:HIVE_AUX_JARS_PATH指路径配置仅hive shell生效)
    hivesitexml中添加参数hiveauxjarspath(目前仅支持路径) fileusrlibhivelibHiveAuthHookjar(配置仅hive server效)重启hiveserver


    package com.newland;

    import org.apache.hadoop.hive.ql.parse.ASTNode;
    import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
    import org.apache.hadoop.hive.ql.parse.HiveParser;
    import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
    import org.apache.hadoop.hive.ql.parse.SemanticException;
    import org.apache.hadoop.hive.ql.session.SessionState;

    public class AuthHook extends AbstractSemanticAnalyzerHook {
        private static String[] admin = { "root", "hadoop" };

        @Override
        public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context,
                ASTNode ast) throws SemanticException {
            switch (ast.getToken().getType()) {
            case HiveParser.TOK_CREATEDATABASE:
            case HiveParser.TOK_DROPDATABASE:
            case HiveParser.TOK_CREATEROLE:
            case HiveParser.TOK_DROPROLE:
            case HiveParser.TOK_GRANT:
            case HiveParser.TOK_REVOKE:
            case HiveParser.TOK_GRANT_ROLE:
            case HiveParser.TOK_REVOKE_ROLE:
                String userName = null;
                if (SessionState.get() != null
                        && SessionState.get().getAuthenticator() != null) {
                    userName = SessionState.get().getAuthenticator().getUserName();
                }
                // only the configured admin users may run the statements above
                if (!admin[0].equalsIgnoreCase(userName)
                        && !admin[1].equalsIgnoreCase(userName)) {
                    throw new SemanticException(userName
                            + " can't use ADMIN options, except " + admin[0] + ","
                            + admin[1] + ".");
                }
                break;
            default:
                break;
            }
            return ast;
        }

        public static void main(String[] args) throws SemanticException {
            String[] admin = { "admin", "root" };
            String userName = "root";
            for (String tmp : admin) {
                System.out.println(tmp);
                if (!tmp.equalsIgnoreCase(userName)) {
                    throw new SemanticException(userName
                            + " can't use ADMIN options, except " + admin[0] + ","
                            + admin[1] + ".");
                }
            }
        }
    }

    HIVE支持权限:

    权限名称 含义
    ALL 权限
    ALTER 允许修改元数(modify metadata data of object)表信息数
    UPDATE 允许修改物理数(modify physical data of object)实际数
    CREATE 允许进行Create操作
    DROP 允许进行DROP操作
    INDEX 允许建索引(目前没实现)
    LOCK 出现发允许户进行LOCKUNLOCK操作
    SELECT 允许户进行SELECT操作
    SHOW_DATABASE 允许户查数库
    附:
    登录hive元数库发现表
    Db_privs记录UserRoleDB权限
    Tbl_privs记录UserRoletable权限
    Tbl_col_privs:记录UserRoletable column权限
    Roles:记录创建role
    Role_map:记录UserRole应关系
    行列转换
    列转行:
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    SUM(CASE <分类字段> WHEN <分类值1> THEN <度量字段> ELSE 0 END) AS <分类值名1>
    SUM(CASE <分类字段> WHEN <分类值2> THEN <度量字段> ELSE 0 END) AS <分类值名2>
    SUM(CASE <分类字段> WHEN <分类值3> THEN <度量字段> ELSE 0 END) AS <分类值名3>

    SUM(CASE <分类字段> WHEN <分类值n> THEN <度量字段> ELSE 0 END) AS <分类值名n>
    FROM [`<架构名称>`] `<表名>`
    GROUP BY <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>

    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    <分类字段><度量字段1> <度量字段2> <度量字段3>… <度量字段n>
    FROM [`<架构名称>`] `<表名>`
    PIVOT(SUM(<度量字段1>) AS <度量字段1>SUM(<度量字段2>) AS <度量字段2>SUM(<度量字段3>) AS <度量字段3>…SUM(<度量字段n>) AS <度量字段n> FOR <分类字段> IN (<分类值1><分类值2><分类值3>…<分类值n>))

    SET @sql NULL
    SELECT
    GROUP_CONCAT(DISTINCT
    CONCAT(
    'SUM(CASE <分类字段> WHEN '''
    <分类字段>
    ''' THEN IFNULL(<度量字段>0) ELSE 0 END) AS `'
    <分类字段> '`'
    )
    ) INTO @sql
    FROM
    (
    SELECT DISTINCT <分类字段>
    FROM [`<架构名称>`] `<表名>`
    ORDER BY <分类字段>
    ) T

    SET @sql
    CONCAT('SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>

    @sql
    ' FROM [`<架构名称>`] `<表名>`
    GROUP BY <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>')

    PREPARE stmt FROM @sql
    EXECUTE stmt
    DEALLOCATE PREPARE stmt

    列转行分类值字符串连接:
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    GROUP_CONCAT(TRIM(<分类字段>)) AS <分类字段> GROUP_CONCAT(CAST(<度量字段> AS VARCHAR(30))) AS <度量字段>
    FROM [`<架构名称>`] `<表名>`
    GROUP BY <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>

    行转列:
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    <分类字段><度量字段> from [`<架构名称>`] `<表名>`
    UNPIVOT
    (<度量字段> FOR <分类字段> IN (<分类值1><分类值2><分类值3>…<分类值n>))

    逗号分隔数拆分成行:
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    substring_index(substring_index(a<带逗号数字段>''bhelp_topic_id+1)''1)
    FROM [`<架构名称>`] `<表名>` a
    JOIN mysqlhelp_topic b
    ON bhelp_topic_id < (length(a<带逗号数字段>) length(replace(a<带逗号数字段>''''))+1)
    ORDER BY <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    列转行:
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    <分类字段><度量字段1> <度量字段2> <度量字段3>… <度量字段n>
    FROM [<架构名称>] <表名>
    PIVOT(SUM(<度量字段1>) AS <度量字段1>SUM(<度量字段2>) AS <度量字段2>SUM(<度量字段3>) AS <度量字段3>…SUM(<度量字段n>) AS <度量字段n> for <分类字段> IN (<分类值1><分类值2><分类值3>…<分类值n>))

    列转行分类值字符串连接:
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    CAST(tdstatsudfconcat(TRIM(<分类字段>)) AS varchar(500)) AS <分类字段>tdstatsudfconcat(CAST(<度量字段> AS VARCHAR(500))) AS <度量字段>
    FROM [<架构名称>] <表名>
    GROUP BY 123…n

    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    TRIM(TRAILING '' FROM (XMLAGG(<分类字段> || '')(VARCHAR(500)))) AS <分类字段>TRAILING '' FROM (XMLAGG(CAST(<度量字段> AS VARCHAR(500)) || '')) AS <度量字段>
    FROM [<架构名称>] <表名>
    GROUP BY 123…n

    行转列:
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    <分类字段><度量字段1> <度量字段2> <度量字段3>… <度量字段n>
    FROM [<架构名称>] <表名> UNPIVOT [{INCLUDE|EXCLUDE} NULLS] (
    (<度量字段1> <度量字段2> <度量字段3>… <度量字段n>)
    FOR <分类字段> IN (
    (<分类值1> <分类值2> <分类值3>…<分类值n>) AS '<分类值123…n名>'
    (<分类值n+1> <分类值n+2> <分类值n+3>…<分类值2n>) AS '<分类值n+1n+2n+3…2n名>'
    (<分类值2n+1> <分类值2n+2> <分类值2n+3>…<分类值3n>) AS '<分类值2n+12n+22n+3…3n名>'

    (<分类值mn+1> <分类值mn+2> <分类值mn+3>…<分类值(m+1)n>) AS '<分类值mn+1mn+2mn+3…(m+1)n名>'
    )
    ) T

    逗号分隔数拆分成行:
    USE [<架构名称>]

    SELECT A* FROM TABLE (strtok_split_to_table( <表名><维度字段1>
    <表名><维度字段2>
    <表名><维度字段3>

    <表名><维度字段n>
    <表名><带逗号数字段> '')
    RETURNS (<维度字段名1> <维度字段类型1>
    <维度字段名2> <维度字段类型2>
    <维度字段名3> <维度字段类型3>

    <维度字段名n> <维度字段类型n>
    <带逗号数字段名>_num integer <带逗号数字段名> varchar(100) character set unicode) ) AS A
    ORDER BY 123…n
    PySpark里:
    列转行:
    import pysparksqlfunctions as func

    <表名>_df…
    <表名>_dfgroupBy('<分类字段>') \
    pivot('项目' ['<度量字段名1>' '<度量字段名2>' '<度量字段名3>'… <度量字段名m>]) \
    agg(funcsum('<度量字段>')) \
    fillna(0)

    列转行分类值字符串连接:
    from pysparksql import functions as func
    dfgroupby(<分类字段>)agg(funccollect_set(<度量字段1>)funccollect_list(<度量字段2>)funccollect_list(<度量字段3>)…funccollect_list(<度量字段m>))

    行转列:
    sparksql('''
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    stack(m '<度量字段名1>' <度量字段1> '<度量字段名2>' <度量字段2> '<度量字段名3>' <度量字段3> '<度量字段名m>' <度量字段m>) AS (
    <分类字段>
    <度量字段名>
    )
    FROM <表名>
    [WHERE <筛选条件>]
    [ORDER BY <排序字段>]
    ''')

    <表名>_df…
    <表名>_dfselectExpr(`<分类字段>`
    stack(m '<度量字段名1>' `<度量字段1>` '<度量字段名2>' `<度量字段2>` '<度量字段名3>' `<度量字段3>` '<度量字段名m>' `<度量字段m>`) AS (`<分类字段>``<度量字段名>`)) \
    [filter([筛选条件]) \]
    [orderBy([`年月` `项目`])]

    列转行分类值字符串连接:
    from pysparksqlfunctions import split explode
    <表名>_df…
    <表名>_dfwithColumn('<带逗号数字段名>'explode(split('<带逗号数字段>'' ')))

    Hive表中:
    列转行:
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    SUM(CASE <分类字段> WHEN <分类值1> THEN <度量字段> ELSE 0 END) AS <分类值名1>
    SUM(CASE <分类字段> WHEN <分类值2> THEN <度量字段> ELSE 0 END) AS <分类值名2>
    SUM(CASE <分类字段> WHEN <分类值3> THEN <度量字段> ELSE 0 END) AS <分类值名3>

    SUM(CASE <分类字段> WHEN <分类值n> THEN <度量字段> ELSE 0 END) AS <分类值名n>
    FROM [<架构名称>] <表名>
    [WHERE <筛选条件>]
    GROUP BY <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>

    列转行分类值字符串连接:
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    concat_ws(''collect_set(TRIM(<分类字段>))) AS <分类字段>concat_ws(''collect_set(CAST(<度量字段> AS STRING))) AS <度量字段>
    FROM [<架构名称>] <表名>
    GROUP BY <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>

    行转列:
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    <表名名><字段名>
    FROM [<架构名称>] <表名>
    explode(<度量字段>) <表名名> AS <字段名>

    逗号分隔数拆分成行:
    SELECT <维度字段1>
    <维度字段2>
    <维度字段3>

    <维度字段n>
    <表名名><带逗号数字段名>
    FROM [<架构名称>] <表名>
    explode(split(<带逗号数字段>'')) <表名名> AS <带逗号数字段名>
    数据采样
    SELECT AVG(<度量字段>) FROM [`<架构名称>`.]`<表名>` GROUP BY <键字段> DIV 10000;
    This samples the table in blocks of 10000 values of <键字段> and returns the average of each block; AVG() can be replaced by MAX(), MIN(), SUM(), COUNT() or other aggregates.
    <表名>_df.sample(withReplacement=<boolean>, fraction=<double>, seed=<long>)

    The sample operator draws a random sample and takes three parameters:

    withReplacement: whether sampled rows are put back; True means sampling with replacement, so the same row can be drawn more than once.

    fraction: what fraction of the rows to draw, a double between 0 and 1; e.g. 0.3 means roughly 30%.

    seed: the random seed used for sampling. Usually only the first two parameters are set; fixing the seed is mainly useful for debugging, to make the sample reproducible and tell whether a problem comes from the program or from the data.
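
    A minimal sketch of these three parameters (df is any existing DataFrame; the values are examples):

    sampled_df = df.sample(withReplacement=False, fraction=0.3, seed=2345)
    print(sampled_df.count())   # roughly 30% of the rows; fraction is approximate, not exact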

    机采样列值满足特定条件Pyspark数框sample方法根列值机选择行假设数框:

    ++++++
    | id|code| amt|flag_outliers|result|
    ++++++
    | 1| a| 109| 0| 00|
    | 2| b| 207| 0| 00|
    | 3| c| 304| 0| 10|
    | 4| d| 4098| 0| 10|
    | 5| e| 5021| 0| 20|
    | 6| f| 607| 0| 20|
    | 7| g| 708| 0| 20|
    | 8| h| 8043| 0| 30|
    | 9| i| 9012| 0| 30|
    | 10| j|10065| 0| 30|
    ++++++
    想0 1 2 3基该result列抽样1(特定数量)终点:

    ++++++
    | id|code| amt|flag_outliers|result|
    ++++++
    | 1| a| 109| 0| 00|
    | 3| c| 304| 0| 10|
    | 5| e| 5021| 0| 20|
    | 8| h| 8043| 0| 30|
    ++++++
    否种良编程方式实现目标某列中出值采相数量行?帮助非常感谢


    解决方案
    You can use sampleBy(), which returns a stratified sample; the sampling fraction is given per stratum (per value of the key column):

    >>> from pyspark.sql.functions import col
    >>> dataset = sqlContext.range(0, 100).select((col("id") % 3).alias("result"))
    >>> sampled = dataset.sampleBy("result", fractions={0: 0.1, 1: 0.2}, seed=0)
    >>> sampled.groupBy("result").count().orderBy("result").show()

    +------+-----+
    |result|count|
    +------+-----+
    |     0|    5|
    |     1|    9|
    +------+-----+
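
    Note that sampleBy() only approximates the requested counts. If exactly one row per value of `result` is required, as in the question above, a hedged alternative is to rank the rows randomly within each group and keep rank 1 (df is the example DataFrame from the question):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # rank rows randomly within each result group and keep the first one
    w = Window.partitionBy("result").orderBy(F.rand(seed=0))
    one_per_group = (df.withColumn("rn", F.row_number().over(w))
                       .filter("rn = 1")
                       .drop("rn"))
    one_per_group.show()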
    SELECT * FROM [<架构名称>] <表名> SAMPLE 1000   采样1000条数
    SELECT * FROM [<架构名称>] <表名> SAMPLE 025   采样25数
    时表
    CREATE TEMPORARY TABLE `<表名>` (
    <字段名1> <字段类型1> [NOT NULL] [DEFAULT <默认值1>]
    <字段名2> <字段类型2> [NOT NULL] [DEFAULT <默认值2>]
    <字段名3> <字段类型3> [NOT NULL] [DEFAULT <默认值3>]

    <字段名n> <字段类型n> [NOT NULL] [DEFAULT <默认值4>]
    )

    INSERT INTO `<表名>`
    <查询语句>

    CREATE TEMPORARY TABLE `<表名>` AS
    (
    <查询语句>
    )
    CREATE {MULTISET|SET} [GLOBAL] TEMPORARY TABLE <表名> AS
    (
    <查询语句>
    )
    WITH NO DATA
    ON COMMIT PRESERVE ROWS

    CREATE VOLATILE TABLE <表名> LOG
    (
    <字段名1> <字段类型1>
    <字段名2> <字段类型2>
    <字段名3> <字段类型3>

    <字段名n> <字段类型n>
    )
    ON COMMIT PRESERVE ROWS

    INSERT INTO <表名>
    <查询语句>
    <表名>_dfsparksql(
    <查询语句>
    )

    <表名>_dfregisterTempTable(<表名>)

    <表名>_df sparkreadload(
    path'<存储路径><表名>'
    format'parquet' headerTrue)

    <表名>_df registerTempTable(<表名>)

    from pysparksql import HiveContext SparkSession

    # 初始化SparkContext时启Hive支持
    # 终端命令行测试模式输出字段长度设置100字符
    spark SparkSessionbuilderappName(name)config(
    sparkdebugmaxToStringFields 100)enableHiveSupport()getOrCreate()
    hive HiveContext(sparksparkContext)
    <表名>_df hivesql(
    <查询语句>)

    <表名>_df registerTempTable(<表名>)
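
    Note: registerTempTable() has been deprecated since Spark 2.0; createOrReplaceTempView() is the equivalent call (it is also what the PySpark skeleton later in this document uses). A minimal sketch with a hypothetical table name:

    df = spark.read.load(path='<存储路径>/demo_table', format='parquet', header=True)  # demo_table is a hypothetical name
    df.createOrReplaceTempView("demo_table")
    spark.sql("SELECT COUNT(*) FROM demo_table").show()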
    时间日期转换
    1时间格式转换
    把时间 '2019-01-22 15:45:06' 转换成 unix 时间戳
    SELECT UNIX_TIMESTAMP(<时间日期字段>) FROM [`<架构名称>`.]`<表名>`;

    字符串转时间并格式化
    SELECT STR_TO_DATE('<年><月><日>','%Y%m%d') FROM [`<架构名称>`.]`<表名>`;

    时间转字符串并格式化
    SELECT DATE_FORMAT(<时间日期字段>,'%Y-%m-%d %H:%i:%s') FROM [`<架构名称>`.]`<表名>`;

    2 时区转换
    加N小时
    SELECT DATE_ADD(<时间日期字段>, INTERVAL N HOUR) FROM [`<架构名称>`.]`<表名>`;

    减N小时
    SELECT DATE_SUB(<时间日期字段>, INTERVAL N HOUR) FROM [`<架构名称>`.]`<表名>`;

    转换时区
    SELECT CONVERT_TZ(<时间日期字段>,'+08:00','+01:00') FROM [`<架构名称>`.]`<表名>`;

    11 获前日期+时间(date + time)函数:now()
    now() 函数获前日期时间外MySQL 中面函数:
    current_timestamp()   current_timestamp
    localtime()   localtime
    localtimestamp()   localtimestamp    
    日期时间函数等 now()鉴 now() 函数简短易记建议总 now() 代面列出函数
     
    12 获前日期+时间(date + time)函数:sysdate()
    sysdate() 日期时间函数 now() 类似处:now() 执行开始时值 sysdate() 函数执行时动态值
     
    2 获前日期(date)函数:curdate()
    中面两日期函数等 curdate(): current_date()current_date
     
    3 获前时间(time)函数:curtime()
    中面两时间函数等 curtime():current_time()current_time
     
    4 获前 UTC 日期时间函数:utc_date() utc_time() utc_timestamp()
    国位东八时区时间 UTC 时间 + 8 时UTC 时间业务涉国家区时候非常
     
    二MySQL 日期时间 Extract(选取) 函数
    1 选取日期时间部分:日期时间年季度月日时分钟秒微秒
    set @dt '20080910 071530123456'
     
    select date(@dt) 20080910
    select time(@dt) 071530123456
    select year(@dt) 2008
    select quarter(@dt) 3
    select month(@dt) 9
    select week(@dt) 36
    select day(@dt) 10
    select hour(@dt) 7
    select minute(@dt) 15
    select second(@dt) 30
    select microsecond(@dt) 123456
     
    2 MySQL Extract() 函数面实现类似功:
    set @dt '20080910 071530123456'
     
    select extract(year from @dt) 2008
    select extract(quarter from @dt) 3
    select extract(month from @dt) 9
    select extract(week from @dt) 36
    select extract(day from @dt) 10
    select extract(hour from @dt) 7
    select extract(minute from @dt) 15
    select extract(second from @dt) 30
    select extract(microsecond from @dt) 123456
    select extract(year_month from @dt) 200809
    select extract(day_hour from @dt) 1007
    select extract(day_minute from @dt) 100715
    select extract(day_second from @dt) 10071530
    select extract(day_microsecond from @dt) 10071530123456
    select extract(hour_minute from @dt) 715
    select extract(hour_second from @dt) 71530
    select extract(hour_microsecond from @dt) 71530123456
    select extract(minute_second from @dt) 1530
    select extract(minute_microsecond from @dt) 1530123456
    select extract(second_microsecond from @dt) 30123456
    MySQL Extract() 函数没date()time()
    功外功应具全具选取day_microsecond’ 等功注意里选取 day
    microsecond日期 day 部分直选取 microsecond 部分
    MySQL Extract() 函数唯方:需敲次键盘
     
    3 MySQL dayof… 函数:dayofweek() dayofmonth() dayofyear()
    分返回日期参数周月年中位置
    set @dt '20080808'
    select dayofweek(@dt) 6
    select dayofmonth(@dt) 8
    select dayofyear(@dt) 221
    日期 20080808′ 周中第 6 天(1 Sunday 2 Monday … 7 Saturday)月中第 8 天年中第 221 天
     
    4 MySQL week… 函数:week() weekofyear() dayofweek() weekday() yearweek()
    set @dt '20080808'
    select week(@dt) 31
    select week(@dt3) 32
    select weekofyear(@dt) 32
    select dayofweek(@dt) 6
    select weekday(@dt) 4
    select yearweek(@dt) 200831
    MySQL week() 函数两参数具体手册 weekofyear() week() 样计算某天位年中第周 weekofyear(@dt) 等价 week(@dt3)
    MySQL weekday() 函数 dayofweek() 类似返回某天周中位置点参考标准
    weekday:(0 Monday 1 Tuesday … 6 Sunday) dayofweek:(1 Sunday
    2 Monday … 7 Saturday)
    MySQL yearweek() 函数返回 year(2008) + week 位置(31)
     
    5 MySQL 返回星期月份名称函数:dayname() monthname()
    set @dt '20080808'
    select dayname(@dt) Friday
    select monthname(@dt) August
     
    6 MySQL last_day() 函数:返回月份中天
    select last_day('20080201') 20080229
    select last_day('20080808') 20080831
     
     
    三MySQL 日期时间计算函数
    1 MySQL 日期增加时间间隔:date_add()
    set @dt now()
    select date_add(@dt interval 1 day) add 1 day
    select date_add(@dt interval 1 hour) add 1 hour
    select date_add(@dt interval 1 minute)
    select date_add(@dt interval 1 second)
    select date_add(@dt interval 1 microsecond)
    select date_add(@dt interval 1 week)
    select date_add(@dt interval 1 month)
    select date_add(@dt interval 1 quarter)
    select date_add(@dt interval 1 year)select date_add(@dt interval 1 day) sub 1 day
     
    MySQL adddate() addtime()函数 date_add() 代面 date_add() 实现 addtime() 功示例:
    mysql> set @dt '20080809 121233'
    mysql> select date_add(@dt interval '011530' hour_second)
    ++
    | date_add(@dt interval '011530' hour_second) |
    ++
    | 20080809 132803 |
    ++
    mysql> select date_add(@dt interval '1 011530' day_second)
    ++
    | date_add(@dt interval '1 011530' day_second) |
    ++
    | 20080810 132803 |
    ++
    date_add() 函数分 @dt 增加1时 15分 30秒 1天 1时 15分 30秒
    建议:总 date_add() 日期时间函数代 adddate() addtime()


     
    2 MySQL 日期减时间间隔:date_sub()
    MySQL date_sub() 日期时间函数 date_add() 法致赘述外MySQL 中两函数 subdate() subtime()建议 date_sub() 代
     
    3 MySQL 类日期函数:period_add(PN) period_diff(P1P2)
    函数参数P 格式YYYYMM 者 YYMM第二参数N 表示增加减 N month(月)
    MySQL period_add(PN):日期加减N月
     
    4 MySQL 日期时间相减函数:datediff(date1date2) timediff(time1time2)
    MySQL datediff(date1date2):两日期相减 date1 date2返回天数
    select datediff('20080808' '20080801') 7
    select datediff('20080801' '20080808') 7
    MySQL timediff(time1time2):两日期相减 time1 time2返回 time 差值
    select timediff('20080808 080808' '20080808 000000') 080808
    select timediff('080808' '000000') 080808
    注意:timediff(time1time2) 函数两参数类型必须相
     
    四MySQL 日期转换函数时间转换函数
    1 MySQL (时间秒)转换函数:time_to_sec(time) sec_to_time(seconds)
    select time_to_sec('010005') 3605
    select sec_to_time(3605) '010005'
     
    2 MySQL (日期天数)转换函数:to_days(date) from_days(days)
    select to_days('00000000') 0
    select to_days('20080808') 733627
    select from_days(0) '00000000'
    select from_days(733627) '20080808'
     
    3 MySQL Str to Date (字符串转换日期)函数:str_to_date(str format)
    select str_to_date('08092008' 'mdY') 20080809
    select str_to_date('080908' 'mdy') 20080809
    select str_to_date('08092008' 'mdY') 20080809
    select str_to_date('080930' 'his') 080930
    select str_to_date('08092008 080930' 'mdY his') 20080809 080930
    str_to_date(strformat) 转换函数杂乱章字符串转换日期格式外转换时间format 参 MySQL 手册
     
    4 MySQL DateTime to Str(日期时间转换字符串)函数:date_format(dateformat) time_format(timeformat)
    MySQL 日期时间转换函数:date_format(dateformat) time_format(timeformat)
    够日期时间转换成种样字符串格式 str_to_date(strformat) 函数 逆转换
     
    5 MySQL 获国家区时间格式函数:get_format()
    MySQL get_format() 语法:
    get_format(date|time|datetime 'eur'|'usa'|'jis'|'iso'|'internal'
    MySQL get_format() 法全部示例:
    select get_format(date'usa') 'mdY'
    select get_format(date'jis') 'Ymd'
    select get_format(date'iso') 'Ymd'
    select get_format(date'eur') 'dmY'
    select get_format(date'internal') 'Ymd'
    select get_format(datetime'usa') 'Ymd His'
    select get_format(datetime'jis') 'Ymd His'
    select get_format(datetime'iso') 'Ymd His'
    select get_format(datetime'eur') 'Ymd His'
    select get_format(datetime'internal') 'YmdHis'
    select get_format(time'usa') 'his p'
    select get_format(time'jis') 'His'
    select get_format(time'iso') 'His'
    select get_format(time'eur') 'His'
    select get_format(time'internal') 'His'
    MySQL get_format() 函数实际中机会较少
     
    6 MySQL 拼凑日期时间函数:makdedate(yeardayofyear) maketime(hourminutesecond)
    select makedate(200131) '20010131'
    select makedate(200132) '20010201'select maketime(121530) '121530'
     
    五MySQL 时间戳(Timestamp)函数
    1 MySQL 获前时间戳函数:current_timestamp current_timestamp()
    2 MySQL (Unix 时间戳日期)转换函数:
    unix_timestamp()
    unix_timestamp(date)
    from_unixtime(unix_timestamp)
    from_unixtime(unix_timestampformat)
     
    3 MySQL 时间戳(timestamp)转换增减函数:
    timestamp(date) date to timestamp
    timestamp(dttime) dt + time
    timestampadd(unitintervaldatetime_expr)
    timestampdiff(unitdatetime_expr1datetime_expr2)
    MySQL timestampdiff() 函数 datediff() 功强datediff() 计算两日期(date)间相差天数
     
    六MySQL 时区(timezone)转换函数convert_tz(dtfrom_tzto_tz)select
    convert_tz('20080808 120000' '+0800' '+0000') 20080808
    040000
    时区转换通 date_add date_sub timestampadd 实现
    select date_add('20080808 120000' interval 8 hour) 20080808 040000
    select date_sub('20080808 120000' interval 8 hour) 20080808 040000
    select timestampadd(hour 8 '20080808 120000') 20080808 040000
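
    For reference, a hedged PySpark counterpart of convert_tz(): from_utc_timestamp()/to_utc_timestamp() convert between UTC and a named time zone (the column and zone below are examples only):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([('2008-08-08 12:00:00',)], ['ts'])
    df.select(
        F.to_utc_timestamp('ts', 'Asia/Shanghai').alias('utc_ts'),     # treat ts as +08:00, convert to UTC
        F.from_utc_timestamp('ts', 'Asia/Shanghai').alias('cst_ts')    # treat ts as UTC, convert to +08:00
    ).show(truncate=False)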
    1时间格式转换
    字符串转时间格式化
    SELECT CAST('<年><月><日>' AS DATE FORMAT 'YYYYMMDD')

    毫秒转换时间戳
    SELECT CAST (to_timestamp(CAST(15253140630001000 as BIGINT)) AS DATE) 结果20180503 022103000000

    时间转字符串 格式化
    SELECT CAST((<时间戳字段> (FORMAT ' YYYYMMDDBHHMISSS(6)')) AS VARCHAR(26))

    SELECT CAST ((curent_timestamp(6) (FORMAT 'YYYYMMDDHHMISS') ) AS VARCHAR(19)) 结果20180615164201

    SELECT CAST((<时间日期字段> (FORMAT 'YYYYMMDD')) AS VARCHAR(10))


    格式化间互转
    格式912 1445 91200 144500
    SELECT CASE WHEN INDEX(dt_time’’)2 THEN 0’||dt_time||’00’ ELSE dt_time||’00’ END

    格式2016012020160120
    SELECT SUBSTRING(dt_date from 1 for 4)||’|| SUBSTRING(dt_date from 5 for 2) ||’|| SUBSTRING(dt_date from 7 for 2)


    select to_month_end(date)
    select extract(yearmonthday from date)
    select lastday(datetimestamp)
    select (datedate) day(3)month(3)year(3)
    select months_between(datedate)
    select (timetime) hour(3)minute(3)second(3)
    select current_timestamp interval '2' hourmibutesecond
    select current_date interval '2' yearmonthdayhourminutesecond
    select next_day(date'friday''fri')
    select numtoyminterval(20'month''year')
    from datetime import datetime
    from datetime import timedelta

    # Format a date as a string
    NOW = datetime.now()
    TODAY = NOW.strftime("%Y%m%d")
    YESTERDAY = (NOW - timedelta(days=1)).strftime("%Y%m%d")

    # Parse a string into a date
    d1 = str(20180301)
    d2 = str(20180226)

    print(type(d1))

    d1 = datetime.strptime(d1, '%Y%m%d')
    d2 = datetime.strptime(d2, '%Y%m%d')

    # Difference in days between d1 and d2
    print((d1 - d2).days)

    from pysparksqlfunctions import unix_timestamp from_unixtime

    1 获取前日期

    from pysparksqlfunctions import current_date

    sparkrange(3)withColumn('date'current_date())show()
    # +++
    # | id| date|
    # +++
    # | 0|20180323|
    # | 1|20180323|
    2 获取前日期时间
    from pysparksqlfunctions import current_timestamp

    sparkrange(3)withColumn('date'current_timestamp())show()
    # +++
    # | id| date|
    # +++
    # | 0|20180323 1740|
    # | 1|20180323 1740|
    # | 2|20180323 1740|
    # +++
    3 日期格式转换

    from pysparksqlfunctions import date_format

    df sparkcreateDataFrame([('20150408')] ['a'])

    dfselect(date_format('a' 'MMddyyy')alias('date'))show()

    4 字符转日期

    from pysparksqlfunctions import to_date to_timestamp

    # 1转日期
    df sparkcreateDataFrame([('19970228 103000')] ['t'])
    dfselect(to_date(dft)alias('date'))show()
    # [Row(datedatetimedate(1997 2 28))]


    # 2带时间日期

    df sparkcreateDataFrame([('19970228 103000')] ['t'])
    dfselect(to_timestamp(dft)alias('dt'))show()
    # [Row(dtdatetimedatetime(1997 2 28 10 30))]

    # 指定日期格式
    df sparkcreateDataFrame([('19970228 103000')] ['t'])
    dfselect(to_timestamp(dft 'yyyyMMdd HHmmss')alias('dt'))show()
    # [Row(dtdatetimedatetime(1997 2 28 10 30))]
    5 获取日期中年月日

    from pysparksqlfunctions import year month dayofmonth

    df sparkcreateDataFrame([('20150408')] ['a'])
    dfselect(year('a')alias('year')
    month('a')alias('month')
    dayofmonth('a')alias('day')
    )show()
    6 获取时分秒

    from pysparksqlfunctions import hour minute second
    df sparkcreateDataFrame([('20150408 130815')] ['a'])
    dfselect(hour('a')alias('hour')
    minute('a')alias('minute')
    second('a')alias('second')
    )show()
    7 获取日期应季度

    from pysparksqlfunctions import quarter

    df sparkcreateDataFrame([('20150408')] ['a'])
    dfselect(quarter('a')alias('quarter'))show()
    8 日期加减

    from pysparksqlfunctions import date_add date_sub
    df sparkcreateDataFrame([('20150408')] ['d'])
    dfselect(date_add(dfd 1)alias('dadd')
    date_sub(dfd 1)alias('dsub')
    )show()
    9 月份加减

    from pysparksqlfunctions import add_months
    df sparkcreateDataFrame([('20150408')] ['d'])

    dfselect(add_months(dfd 1)alias('d'))show()
    10 日期差月份差

    from pysparksqlfunctions import datediff months_between

    # 1日期差
    df sparkcreateDataFrame([('20150408''20150510')] ['d1' 'd2'])
    dfselect(datediff(dfd2 dfd1)alias('diff'))show()

    # 2月份差
    df sparkcreateDataFrame([('19970228 103000' '19961030')] ['t' 'd'])
    dfselect(months_between(dft dfd)alias('months'))show()
    11 计算日子日期

    计算前日期星期1234567具体日子属实函数

    from pysparksqlfunctions import next_day

    # Mon Tue Wed Thu Fri Sat Sun
    df sparkcreateDataFrame([('20150727')] ['d'])
    dfselect(next_day(dfd 'Sun')alias('date'))show()
    12 月日期

    from pysparksqlfunctions import last_day

    df sparkcreateDataFrame([('19970210')] ['d'])
    dfselect(last_day(dfd)alias('date'))show()

    Hive常日期格式转换
    固定日期转换成时间戳
    SELECT unix_timestamp('20160816''yyyyMMdd') 1471276800
    SELECT unix_timestamp('20160816''yyyyMMdd') 1471276800
    SELECT unix_timestamp('20160816T100241Z' yyyyMMdd'T'HHmmss'Z') 1471312961

    16Mar2017122501 +0800 转成正常格式(yyyyMMdd hhmmss)
    SELECT from_unixtime(to_unix_timestamp('16Mar2017122501 +0800' 'ddMMMyyyHHmmss Z'))

    时间戳转换程固定日期
    SELECT from_unixtime(1471276800'yyyyMMdd') 20160816
    SELECT from_unixtime(1471276800'yyyyMMdd') 20160816
    SELECT from_unixtime(1471312961)     20160816 100241
    SELECT from_unixtime( unix_timestamp('20160816''yyyyMMdd')'yyyyMMdd')  20160816
    SELECT date_format('20160816''yyyyMMdd') 20160816

    返回日期时间字段中日期部分
    SELECT to_date('20160816 100301') 20160816
    取前时间
    SELECT from_unixtime(unix_timestamp()'yyyyMMdd HHmmss')
    SELECT from_unixtime(unix_timestamp()'yyyyMMdd')
    返回日期中年
    SELECT year('20160816 100301') 2016
    返回日期中月
    SELECT month('20160816 100301') 8
    返回日期中日
    SELECT day('20160816 100301') 16
    返回日期中时
    SELECT hour('20160816 100301') 10
    返回日期中分
    SELECT minute('20160816 100301') 3
    返回日期中秒
    SELECT second('20160816 100301') 1

    返回日期前周数
    SELECT weekofyear('20160816 100301') 33

    返回结束日期减开始日期天数
    SELECT datediff('20160816''20160811') 

    返回开始日期startdate增加days天日期
    SELECT date_add('20160816'10)

    返回开始日期startdate减少days天日期
    SELECT date_sub('20160816'10)

    返回天三种方式
    SELECT CURRENT_DATE
    20170615
    SELECT CURRENT_TIMESTAMP返回时分秒
    20170615 195444
    SELECT from_unixtime(unix_timestamp())
    20170615 195504
    返回前时间戳
    SELECT current_timestamp20180618 103753278

    返回月第天
    SELECT trunc('20160816''MM') 20160801
    返回年第天
    SELECT trunc('20160816''YEAR') 20160101

    df sparkcreateDataFrame(

    [(11251991) (11241991) (11301991)]

    ['date_str']

    )

    df2 dfselect(

    'date_str'

    from_unixtime(unix_timestamp('date_str' 'MMddyyy'))alias('date')

    )

    print(df2)

    #DataFrame[date_str string date timestamp]

    df2show(truncateFalse)

    #+++

    #|date_str |date |

    #+++

    #|11251991|19911125 000000|

    #|11241991|19911124 000000|

    #|11301991|19911130 000000|

    #+++

    更新(1102018):

    Since Spark 2.2+ the cleanest way is to_timestamp, which accepts a format argument; from the documentation:

    >>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])

    >>> df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()

    [Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]

    from datetime import datetime

    from pysparksqlfunctions import col udf

    from pysparksqltypes import DateType

    # Creation of a dummy dataframe

    df1 sqlContextcreateDataFrame([(112519911124199111301991)

    (112513911124199211301992)] schema['first' 'second' 'third'])

    # Setting an user define function

    # This function converts the string cell into a date

    func udf (lambda x datetimestrptime(x 'mdY') DateType())

    df df1withColumn('test' func(col('first')))

    dfshow()

    dfprintSchema()

    输出:

    +++++

    | first| second| third| test|

    +++++

    |11251991|11241991|11301991|19910125|

    |11251391|11241992|11301992|13910117|

    +++++

    root

    | first string (nullable true)

    | second string (nullable true)

    | third string (nullable true)

    | test date (nullable true)

    strptime()方法起作 更清洁解决方案演员:

    from pysparksqltypes import DateType

    spark_df1 spark_dfwithColumn(record_datespark_df['order_submitted_date']cast(DateType()))

    #below is the result

    spark_df1select('order_submitted_date''record_date')show(10False)

    +++

    |order_submitted_date |record_date|

    +++

    |20150819 1254160|20150819 |

    |20160414 1355500|20160414 |

    |20131011 1823360|20131011 |

    |20150819 2018550|20150819 |

    |20150820 1207400|20150820 |

    |20131011 2124120|20131011 |

    |20131011 2329280|20131011 |

    |20150820 1659350|20150820 |

    |20150820 1732030|20150820 |

    |20160413 1656210|20160413 |

    接受答案更新中您没to_date函数示例该函数种解决方案:

    from pysparksql import functions as F

    df dfwithColumn(

    'new_date'

    Fto_date(

    Funix_timestamp('STRINGCOLUMN' 'MMddyyyy')cast('timestamp'))

    尝试:

    df sparkcreateDataFrame([('20180727 103000')] ['Date_col'])

    dfselect(from_unixtime(unix_timestamp(dfDate_col 'yyyyMMdd HHmmss'))alias('dt_col'))

    dfshow()

    ++

    | Date_col|

    ++

    |20180727 103000|

    ++

    没答案想分享代码帮助某

    from pysparksql import SparkSession

    from pysparksqlfunctions import to_date

    spark SparkSessionbuilderappName(Python Spark SQL basic example)\

    config(sparksomeconfigoption somevalue)getOrCreate()

    df sparkcreateDataFrame([('20190622')] ['t'])

    df1 dfselect(to_date(dft 'yyyyMMdd')alias('dt'))

    print df1

    print df1show()

    输出

    DataFrame[dt date]

    ++

    | dt|

    ++

    |20190622|

    ++

    果转换日期时间述代码转换日期然to_timestamp
    PySpark代码基本结构
    # -*- coding: utf-8 -*-
    from pyspark.sql import HiveContext, SparkSession

    # Initialize the SparkSession with Hive support enabled;
    # for command-line testing, allow up to 100 fields when printing rows/plans
    spark = SparkSession.builder.appName("name").config(
        "spark.debug.maxToStringFields", 100).enableHiveSupport().getOrCreate()
    # Initialize the HiveContext
    hive = HiveContext(spark.sparkContext)
    # Allow cross joins in Spark SQL
    spark.conf.set("spark.sql.crossJoin.enabled", "true")

    # Read data from parquet files.
    # Parquet is a columnar storage format for analytical workloads, developed by Twitter and Cloudera;
    # on AWS the parquet files are stored in S3 (Simple Storage Service), the AWS object storage service.
    df1 = spark.read.load(
        path='<存储路径>',
        format='parquet', header=True)

    # Read data from CSV files.
    # CSV is used here as the standard format for manually exchanged files; its drawback is that
    # everything is stored as text, so numeric precision is not guaranteed.
    df2 = spark.read.load(
        path='<存储路径>',
        format='csv', header=True)

    # Read data from a Hive table or view
    df3 = hive.sql("""
        select
        *
        from <数库名>.<表名>""")

    # Cache a dataset that is referenced more than once (Spark optimization #1),
    # so the job does not re-read the same files every time the dataset is used.
    df4 = spark.read.load(
        path='<存储路径>',
        format='parquet', header=True).cache()

    # Register the datasets as temp views so they can be referenced in Spark SQL
    df1.createOrReplaceTempView("DF1")

    df2.createOrReplaceTempView("DF2")

    df3.createOrReplaceTempView("DF3")

    df4.createOrReplaceTempView("DF4")

    # Build an intermediate Spark SQL dataset.
    # If the data volume is large and the logic complex, persist the intermediate result
    # to memory/disk so that later Spark SQL statements referencing it do not recompute it
    # (Spark optimization #2).
    df5 = spark.sql("""
        SELECT
        ...
        FROM DF1 AS D1
        LEFT JOIN DF2 AS D2
        ON ...
        LEFT JOIN DF4 AS D4
        ON ...
        WHERE ...
        """).persist()
    # count() is an action and triggers execution, so the persist() above takes effect here;
    # without it the cache would only materialize at the next action, near the end of the program.
    df5.count()
    df5.createOrReplaceTempView("DF5")

    # Build the final Spark SQL dataset
    df6 = spark.sql("""
        SELECT
        ...
        FROM DF5 AS D5
        LEFT JOIN DF3 AS D3
        ON ...
        LEFT JOIN DF4 AS D4
        ON ...
        WHERE ...
        """)

    # Write the result dataset to parquet files
    df6.write.parquet(
        path='<存储路径>',
        mode="overwrite")

    # Release the persisted cache
    df5.unpersist()

    # Stop the SparkSession
    spark.stop()


    PySparkMySQL导出数parquet文件
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    url = "jdbc:mysql://<主机>:[<端口号>]/<数库名>?useTimezone=false&serverTimezone=UTC"
    mysql_df = spark.read.jdbc(url=url, table="<查询语句>", properties={"user": "<户名>", "password": "<密码>", "database": "<选择数库>"})
    mysql_df.write.parquet(
        path='<存储路径>/<表名>',
        mode="overwrite")

    spark.stop()

    PySparkTeradata导出数parquet文件
    from pysparksql import SparkSession

    spark SparkSessionBuilder()getOrCreate()
    url jdbcteradata{ip} \
    DATABASE{database} \
    DBS_PORT{dbs_port} \
    LOGMECH{LDAP} \
    CHARSET{ASCII|UTF8} \
    COLUMN_NAMEON \
    MAYBENULLONformat(ipdatabasedbs_port<端口号>)
    teradata_df sparkreadjdbc (urlurl
    table<查询语句>
    properties{
    user <户名>
    password <密码>
    driver comteradatajdbcTeraDriver
    })
    teradata_dfwriteparquet(
    path'<存储路径><表名>'
    modeoverwrite)

    sparkstop()

    PySparkParquent文件写入Hive表
    from pysparksql import SparkSession
    # 开动态分区
    sparksql(set hiveexecdynamicpartitionmode nonstrict)
    sparksql(set hiveexecdynamicpartitiontrue)

    # 普通hivesql写入分区表
    sparksql(
    insert overwrite table aida_aipurchase_dailysale_hive
    partition (saledate)
    select productid propertyid processcenterid saleplatform sku poa salecount saledate
    from szy_aipurchase_tmp_szy_dailysale distribute by saledate
    )

    # 者次重建分区表方式
    jdbcDFwritemode(overwrite)partitionBy(saledate)insertInto(aida_aipurchase_dailysale_hive)
    jdbcDFwritesaveAsTable(aida_aipurchase_dailysale_hive None append partitionBy'saledate')

    # 写分区表简单导入hive表
    jdbcDFwritesaveAsTable(aida_aipurchase_dailysale_for_ema_predict None overwrite None)

    PySpark读取HiveSQL查询数写入parquet文件
    from pysparksql import HiveContext SparkSession

    spark SparkSessionbuilderappName(<配置名称>)config(sparkdebugmaxToStringFields 100)enableHiveSupport()getOrCreate()

    hive HiveContext(sparksparkContext)
    df hivesql(
    )
    dfwriteparquet(<存储路径><文件名>
    mode'overwrite')

    PySpark获取Dataframe采样数保存CSV文件
    from pysparksql import SparkSession

    spark SparkSessionBuilder()getOrCreate()

    df …

    dfsample(False0122345)repartition(1)writecsv(path'<存储路径><文件名>csv' headerTrue sep mode'overwrite')

    PySpark连接MySQL数库插入数
    <表名>_df sparkcreateDataFrame([(<值1><值2><值3><值n>)(<值n+1><值n+2><值n+3><值2n>)(<值2n+1><值2n+2><值2n+3><值3n>)(<值mn+1><值mn+2><值mn+3><值(m+1)n>)]['<字段名1>''<字段名2>''<字段名3>''<字段名n>'])

    <表名>_dfwritejdbc(url url table<表名>'append' properties{user<户名> password <密码> database <选择数库>})

    PySpark连接Teradata数库插入数
    <表名>_df sparkcreateDataFrame([(<值1><值2><值3><值n>)(<值n+1><值n+2><值n+3><值2n>)(<值2n+1><值2n+2><值2n+3><值3n>)(<值mn+1><值mn+2><值mn+3><值(m+1)n>)]['<字段名1>''<字段名2>''<字段名3>''<字段名n>'])
    url jdbcteradata{ip} \
    DATABASE{database} \
    DBS_PORT{dbs_port} \
    LOGMECH{LDAP} \
    CHARSET{ASCII|UTF8} \
    COLUMN_NAMEON \
    MAYBENULLONformat(ipdatabasedbs_port<端口号>)
    <表名>_dfwritejdbc (urlurl
    table<表名>'append'
    properties{
    user <户名>
    password <密码>
    driver comteradatajdbcTeraDriver
    })

    PySpark遍历Dataframe行
    There is a configuration table (data1) and a wide data table (data2); the configuration table describes how missing values in each column of the wide table should be handled. COLUMN_NAME holds the feature name in the wide table and NULL_PROCESS_METHON holds the null-handling method for that feature column. Assume there are four methods: drop, zero, mean and other.

    需求
    Iterate over COLUMN_NAME in the configuration table (data1), look up the corresponding null-handling method (NULL_PROCESS_METHON), and apply it to the matching feature column of the wide table (data2).
    实现
    rows = data1.collect()
    cols = data1.columns
    cols_len = len(data1.columns) - 1

    for row in rows:
        row_data_temp = []
        for idx, col in enumerate(cols):
            row_data_temp.append(row[col])
            if idx == cols_len:
                row_data = row_data_temp
                print(row_data[4])

    I hacked the version above together at the end of a tiring day; looking at it again the next morning it was obviously over-complicated, so it can simply be replaced with:
    rows = data1.collect()

    for row in rows:
        print(row[4])

    了解collect
    Calling collect() on a DataFrame returns the data in the form [[...], [...], [...], ...]: the outer list holds the rows (Row objects) and each inner element holds that row's column values, so the result can be traversed like a two-dimensional array.
    Note that collect() is expensive: it pulls all the data into the driver's memory, so it is only suitable for small datasets. On a large DataFrame, collect() can easily cause an out-of-memory error; push the work to the executors (e.g. with map) instead. A hedged sketch of applying the configured null handling follows below.
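
    A hedged sketch of the requirement described at the top of this section: walk the configuration rows of data1 and apply the corresponding null handling to each feature column of data2 (COLUMN_NAME and NULL_PROCESS_METHON come from the text above; everything else is illustrative):

    from pyspark.sql import functions as F

    for row in data1.collect():
        col_name = row["COLUMN_NAME"]
        method = row["NULL_PROCESS_METHON"]
        if method == "drop":
            data2 = data2.dropna(subset=[col_name])          # drop rows where the feature is null
        elif method == "zero":
            data2 = data2.fillna({col_name: 0})              # replace nulls with 0
        elif method == "mean":
            mean_val = data2.select(F.mean(col_name)).first()[0]
            data2 = data2.fillna({col_name: mean_val})       # replace nulls with the column mean
        # "other": leave the column as-is (placeholder for custom logic)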

    PySpark移动Parquet文件目录
    import subprocess

    subprocesscheck_call(
    'mv <存储路径1><表名>[<分区字段><分区值>] <存储路径2><表名>[<分区字段><分区值>]' shellTrue)

    PySpark复制Parquet文件目录
    import subprocess

    subprocesscheck_call(cp r <存储路径1><表名>[<分区字段><分区值>] <存储路径2><表名>[<分区字段><分区值>] shellTrue)

    PySpark删Parquet文件目录
    import subprocess

    subprocesscheck_call('rm r <存储路径1><表名>[<分区字段><分区值>]'shellTrue)

    PySpark修改Hive指存储路径
    from pysparksql import HiveContext

    hive HiveContext(sparksparkContext)
    hivesql(alter table [`<库名>`]`<表名>` set location '<存储路径><表名>[<分区列名><分区值>]')

    PySpark显示HDFS路径文件
    import subprocess

    subprocess.check_call('$HADOOP_HOME/bin/hdfs dfs -ls hdfs://<存储路径>/<表名>', shell=True)

    PySpark显示普通Hive表容量(GB)
    import subprocess

    subprocess.check_call("$HADOOP_HOME/bin/hadoop fs -du hdfs://<存储路径>/<表名> | awk '{ SUM += $1 } END { print SUM/(1024*1024*1024) }'", shell=True)

    PySpark显示Hive分区表容量(GB)
    import subprocess

    subprocess.check_call("$HADOOP_HOME/bin/hadoop fs -ls hdfs://<存储路径>/<表名>/<分区列名>=<分区值> | awk -F ' ' '{print $5}' | awk '{ a += $1 } END { print a/(1024*1024*1024) }'", shell=True)

    PySpark显示HDFS目录子目录容量
    import subprocess

    subprocess.check_call('hdfs dfs -du -h hdfs://<存储路径>', shell=True)

    PySpark调SqoopHDFS导入Hive表
    import subprocess
    from pysparksql import HiveContext

    hive HiveContext(sparksparkContext)

    hivesql('drop table if exists [`<库名>`]`<表名>` purge')
     
    hivesql(
    CREATE TABLE [`<库名>`]`<表名>`(
    `<字段名1>` <类型1>
    `<字段名2>` <类型2>
    `<字段名3>` <类型3>

    `<字段名n>` <类型n>)
    [PARTITIONED BY (
    `<分区字段1>` <分区字段类型1>
    `<分区字段2>` <分区字段类型2>
    `<分区字段3>` <分区字段类型3>

    `<分区字段n>` <分区字段类型n>
    )]
    ROW FORMAT SERDE
    'orgapachehadoophiveqlioparquetserdeParquetHiveSerDe'
    STORED AS PARQUET)

    subprocesscheck_call('HADOOP_HOMEbinsqoop import connect jdbcmysql<数库名> username passwordfile password> table <表名> hiveimport m 1 hivetable ')shellTrue)

    HiveQLparquet文件创建Hive表
    DROP TABLE IF EXISTS `<库名>``<表名>`
    CREATE EXTERNAL TABLE [IF NOT EXISTS] `<库名>``<表名>` (
    `<字段1>` <类型1>
    `<字段2>` <类型2>
    `<字段3>` <类型3>

    `<字段n>` <类型n>
    )
    ROW FORMAT SERDE
    'orgapachehadoophiveqlioparquetserdeParquetHiveSerDe'
    STORED AS INPUTFORMAT
    'orgapachehadoophiveqlioparquetMapredParquetInputFormat'
    OUTPUTFORMAT
    'orgapachehadoophiveqlioparquetMapredParquetOutputFormat'
    LOCATION
    '<存储路径><表名>[<分区字段><分区值>]'

    HiveQLHive表创建Hive视图
    CREATE OR REPLACE VIEW `<库名>``<表名>` AS


    HiveQL格式化显示Hive查询结果数
    set hivecliprintheadertrue
    set hiveresultsetuseuniquecolumnnamestrue

    Hive导出Hive查询结果数CSV文件
    hive e set hivecliprintheadertrue <查询语句> | sed 's[\t]g' > <文件名>csv

    set hivecliprintheadertrue表头输出
    sed s[\t]g’ \t换成
    shell里印容输出文件

    HiveQL显示Hive表
    前数库显示表
    show tables
    指定数库显示表
    show tables in <数库名>
    前数库显示带关键字表
    show tables *<关键字>*’
    指定数库显示关键字表
    show tables in <数库名> *<关键字>*’

    HiveQL显示Hive数库
    show databases

    Shell带日期参数运行HQL脚
    vim <文件名>.sh
    #!/bin/bash

    hql_path=$1
    if [ -z "$2" ]; then
    batch_date=$(TZ=Asia/Shanghai date -d @`date +%s` +%Y%m%d)  # current date in China time
    else
    batch_date=$2
    fi

    delete_partition_date=`date -d "$batch_date -7 day" +%Y%m%d`  # drop data from 7 days ago
    hql_file_name=`basename $hql_path`
    log_file_name=${hql_file_name%.*}_`date +%s`.log
    log_file_path=log/$log_file_name
    partition_date=`date -d "$batch_date" +%Y%m%d`  # date in YYYYMMDD format

    hive --hivevar current_date=$partition_date --hivevar filter_date=$batch_date --hivevar drop_date=$delete_partition_date -f $hql_path >> $log_file_path 2>&1

    vim hql
    全量写入Hive数:
    INSERT OVERWRITE TABLE <库名><表名> partition(<时间分区字段(格式YYYYMMDD)>'{current_date}')
    SELECT
    <字段名1>
    <字段名2>
    <字段名3>

    <字段名n>
    FROM <库名><表名>

    alter table <库名><表名> drop partition (<时间分区字段(格式YYYYMMDD)><'{drop_date}' )

    增量写入Hive数:
    INSERT OVERWRITE TABLE <库名><表名> partition(<时间分区字段(格式YYYYMMDD)>'{current_date}')
    SELECT
    <字段名1>
    <字段名2>
    <字段名3>

    <字段名n>
    FROM <库名><表名>
    WHERE substr(<记录创建时间戳字段(格式YYYYMMDD)>110) '{filter_date}' OR substr(<记录修改时间戳字段(格式YYYYMMDD)>110) '{filter_date}'

    执行方法:
    <文件名>sh hql

    HiveQL更新视图指天表数
    vim hql

    drop view <库名><视图名>
    CREATE VIEW <库名><视图名>
    AS
    SELECT
    <字段名1>
    <字段名2>
    <字段名3>

    <字段名n>
    FROM <库名><表名>
    WHERE <时间分区字段(格式YYYYMMDD)> '{current_date}'

    HiveQL修改Hive表指存储文件
    alter table <库名><表名>
    set location
    'hdfs<表名>'

    Shell清HDFS里数
    $HADOOP_HOME/bin/hadoop fs -rm -r /<表名>

    $HADOOP_HOME/bin/hadoop fs -rm /<表名>/part-m-*

    Shell查HDFS数
    $HADOOP_HOME/bin/hadoop fs -cat /<表名>/part-m-*

    Sqoop显示MySQL中数库
    $HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<主机>:<端口> --username <户名> --password <密码>

    $HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<主机>:<端口> --username <户名> -P
    Enter password:

    echo -n <密码> > <存储目录>/password
    chmod 400 <存储目录>/password
    $HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<主机>:<端口> --username <户名> --password-file file://<存储目录>/password

    echo -n <密码> > <存储目录>/password
    chmod 400 <存储目录>/password
    $HADOOP_HOME/bin/hadoop dfs -put <存储目录>/password
    $HADOOP_HOME/bin/hadoop dfs -chmod 400 password
    rm <存储目录>/password
    rm: remove write-protected regular file `password'? y
    $HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<主机>:<端口> --username <户名> --password-file password

    SqoopMySQL数库导入HDFS
    指定目录:
    HADOOP_HOMEbinsqoop import connect 'jdbcmysql<数库名>useUnicodetrue&characterEncodingutf8' username passwordfile password table <表名> [m <行度>] [fieldsterminatedby <分隔符>]

    指定目录:
    sqoop import connect 'jdbcmysql<数库名>useUnicodetrue&characterEncodingutf8' username passwordfile password> table <表名> targetdir <级目录><表名> deletetargetdir [m <行度>] [fieldsterminatedby <分隔符>]

    增量导入:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --target-dir <parent_dir>/<table> --append --check-column '<key_column>' --incremental append --last-value <last_imported_value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Create a Sqoop job for the incremental import and run it:
    $HADOOP_HOME/bin/sqoop job --create job_import_<table> -- import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --target-dir <parent_dir>/<table> --append --check-column '<key_column>' --incremental append --last-value <last_imported_value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']
    $HADOOP_HOME/bin/sqoop job --exec job_import_<table>

    Delete a job:
    $HADOOP_HOME/bin/sqoop job --delete job_import_<table>

    List the existing jobs:
    $HADOOP_HOME/bin/sqoop job --list

    Show the details of a specific job:
    $HADOOP_HOME/bin/sqoop job --show job_import_<table>

    Full import with gzip compression:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --target-dir <parent_dir>/<table> --delete-target-dir -z [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Full import, replacing NULL values with a given character:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --target-dir <parent_dir>/<table> --delete-target-dir --null-non-string '<replacement>' --null-string '<replacement>' [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Full import of selected columns only:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --columns '<col1>,<col2>,<col3>,...,<colN>' --target-dir <parent_dir>/<table> --delete-target-dir [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Full import with a filter condition:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --where '<condition>' --target-dir <parent_dir>/<table> --delete-target-dir [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Full import from a free-form query:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --query '<query>' --split-by <table_in_query>.<primary_or_foreign_key> --target-dir <parent_dir>/<table> --delete-target-dir [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Full import, specifying the file format of the imported data:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --target-dir <parent_dir>/<table> --as-sequencefile --delete-target-dir [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Full import, specifying the number of concurrent map tasks:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --target-dir <parent_dir>/<table> --num-mappers <n> --delete-target-dir [--fields-terminated-by '<delimiter>']

    Incremental import with gzip compression:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --target-dir <parent_dir>/<table> --append --check-column '<key_column>' --incremental append --last-value <last_imported_value> -z [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Incremental import, replacing NULL values with a given character:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --target-dir <parent_dir>/<table> --append --check-column '<key_column>' --incremental append --last-value <last_imported_value> --null-non-string '<replacement>' --null-string '<replacement>' [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Incremental import of selected columns only:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --columns '<col1>,<col2>,<col3>,...,<colN>' --target-dir <parent_dir>/<table> --append --check-column '<key_column>' --incremental append --last-value <last_imported_value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Incremental import with a filter condition:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --where '<condition>' --target-dir <parent_dir>/<table> --append --check-column '<key_column>' --incremental append --last-value <last_imported_value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Incremental import from a free-form query:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --query '<query>' --split-by <table_in_query>.<primary_or_foreign_key> --target-dir <parent_dir>/<table> --append --check-column '<key_column>' --incremental append --last-value <last_imported_value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Incremental import, specifying the file format of the imported data:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --target-dir <parent_dir>/<table> --as-sequencefile --append --check-column '<key_column>' --incremental append --last-value <last_imported_value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']

    Incremental import, specifying the number of concurrent map tasks:
    $HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf8' --username <user> --password-file /password --table <table> --target-dir <parent_dir>/<table> --num-mappers <n> --append --check-column '<key_column>' --incremental append --last-value <last_imported_value> [--fields-terminated-by '<delimiter>']

    Sqoop: export data from HDFS into MySQL
    $HADOOP_HOME/bin/sqoop export --connect 'jdbc:mysql://<host>/<database>' --username <user> --password-file /password --table <table> --export-dir hdfs://<parent_dir>/<table>

    Sqoop: list the databases on a MySQL server
    $HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<host> --username <user> --password <password>

    Or prompt for the password interactively:
    $HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<host> --username <user> -P
    Enter password:

    Or read the password from a protected local file:
    echo -n '<password>' > <local_dir>/password
    chmod 400 <local_dir>/password
    $HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<host> --username <user> --password-file <local_dir>/password

    Or store the password file on HDFS and reference it from there:
    echo -n '<password>' > <local_dir>/password
    chmod 400 <local_dir>/password
    $HADOOP_HOME/bin/hadoop dfs -put <local_dir>/password /
    $HADOOP_HOME/bin/hadoop dfs -chmod 400 /password
    rm <local_dir>/password
    rm: remove write-protected regular file 'password'? y
    $HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<host> --username <user> --password-file /password
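    The same incremental import can also be driven from Python, matching the PySpark-calls-Sqoop pattern used later in this document. A minimal sketch; every host, table, path and value below is a made-up placeholder:
    # Sketch: run a Sqoop incremental import from Python via subprocess.
    import os
    import subprocess

    sqoop = os.path.join(os.environ['HADOOP_HOME'], 'bin', 'sqoop')

    cmd = [
        sqoop, 'import',
        '--connect', 'jdbc:mysql://db-host:3306/sales?useUnicode=true&characterEncoding=utf8',
        '--username', 'etl_user',
        '--password-file', '/password',        # password file uploaded to HDFS as shown above
        '--table', 'orders',
        '--target-dir', '/warehouse/orders',
        '--append',
        '--check-column', 'order_id',
        '--incremental', 'append',
        '--last-value', '1000000',
        '-m', '4',
        '--fields-terminated-by', '\t',
    ]

    subprocess.run(cmd, check=True)            # raises CalledProcessError if Sqoop exits non-zero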

    Data types supported by Teradata

    Data types supported by MySQL
    Name            Category                 Description
    INT             integer                  4-byte integer, range roughly +/-2.1 billion
    BIGINT          big integer              8-byte integer, range roughly +/-9.22 * 10^18
    REAL            floating point           4-byte floating-point number, range roughly +/-10^38
    DOUBLE          floating point           8-byte floating-point number, range roughly +/-10^308
    DECIMAL(M,N)    exact decimal            user-defined precision; e.g. DECIMAL(20,10) means 20 digits in total with 10 after the decimal point; commonly used for financial data
    CHAR(N)         fixed-length string      always stores exactly N characters, e.g. CHAR(100) always stores a 100-character string
    VARCHAR(N)      variable-length string   stores up to N characters, e.g. VARCHAR(100) stores a string of 0-100 characters
    BOOLEAN         boolean                  stores True or False
    DATE            date                     stores a date, e.g. 2018-06-22
    TIME            time                     stores a time of day, e.g. 12:20:59
    DATETIME        date and time            stores a date plus a time, e.g. 2018-06-22 12:20:59
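    To see how these MySQL types surface in PySpark, the sketch below (host, table and credentials are made up) loads a table through the JDBC reader and prints the inferred schema: INT arrives as IntegerType, BIGINT as LongType, DECIMAL(M,N) as DecimalType(M,N), DATETIME as TimestampType, and so on.
    # Sketch: read a MySQL table with the PySpark JDBC data source and inspect the type mapping.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mysql_type_demo").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/sales?useUnicode=true&characterEncoding=utf8")
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "secret")
          .option("driver", "com.mysql.jdbc.Driver")
          .load())

    df.printSchema()   # shows the Spark SQL type chosen for each MySQL column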

    Data types supported by Hive
    1. Primitive types
    Hive type    Java type    Description                          Example
    TINYINT      byte         1-byte signed integer                20
    SMALLINT     short        2-byte signed integer                20
    INT          int          4-byte signed integer                20
    BIGINT       long         8-byte signed integer                20
    BOOLEAN      boolean      boolean, true or false               TRUE, FALSE
    FLOAT        float        single-precision floating point      3.14159
    DOUBLE       double       double-precision floating point      3.14159
    STRING       string       character sequence; any character set; written with single or double quotes    'now is the time', "for all good men"
    TIMESTAMP    -            timestamp                            -
    BINARY       -            byte array                           -
    Hive's STRING type corresponds roughly to a database VARCHAR: it is a variable-length string, but no maximum length can be declared for it; in theory it can hold up to 2 GB of characters.
    2. Complex (collection) types
    Type      Description                                                                                                     Literal syntax
    STRUCT    like a C struct: elements are accessed with dot notation; for a column of type STRUCT{first STRING, last STRING}, the first element is referenced as <column>.first    struct()
    MAP       a set of key-value pairs accessed with array notation; for a column of type MAP with entries 'first'->'John', 'last'->'Doe', the last name is read as <column>['last']    map()
    ARRAY     an ordered collection of values of the same type, indexed from zero; for the array value ['John', 'Doe'], the second element is referenced as <column>[1]    array()
    Hive has three complex data types: ARRAY, MAP and STRUCT. ARRAY and MAP behave like Java's Array and Map, while STRUCT resembles a C struct in that it wraps a named collection of fields. Complex types can be nested to any depth.
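    A minimal sketch (the table name is made up) that exercises the three complex types through spark.sql, in the PySpark-plus-HiveQL style used elsewhere in this document:
    # Sketch: create and query a Hive table with STRUCT, MAP and ARRAY columns from PySpark.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.enableHiveSupport()
             .appName("complex_types_demo").getOrCreate())

    spark.sql("""
      CREATE TABLE IF NOT EXISTS demo_people (
        name      STRUCT<first:STRING, last:STRING>,
        phones    MAP<STRING, STRING>,
        children  ARRAY<STRING>
      )
    """)

    # insert one row built with the struct/map/array constructor functions
    spark.sql("""
      INSERT INTO demo_people
      SELECT named_struct('first', 'John', 'last', 'Doe'),
             map('home', '555-0100'),
             array('Anna', 'Ben')
    """)

    # dot access, bracket-by-key access and bracket-by-index access, as described above
    spark.sql("SELECT name.first, phones['home'], children[0] FROM demo_people").show()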

    The Parquet storage format
    Parquet is only a storage format: it is language- and platform-neutral and not tied to any particular data processing framework. The components that can already work with Parquet include most of the commonly used query engines and computation frameworks listed below, and the popular serialization tools can have their data converted into Parquet as well.
    · Query engines: Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, IBM Big SQL
    · Computation frameworks: MapReduce, Spark, Cascading, Crunch, Scalding, Kite
    · Data models: Avro, Thrift, Protocol Buffers, POJOs
    Project layout
    The Parquet project is made up of several sub-projects:
    · parquet-format (Java) defines the Parquet metadata objects; the metadata is serialized with Apache Thrift and stored in the footer of the Parquet file.
    · parquet-mr (Java) contains the modules that implement reading and writing Parquet files and provides adapters for other components, for example Hadoop Input/Output Formats, a Hive SerDe (Hive now ships with Parquet support), Pig loaders, and so on.
    · parquet-compatibility contains cross-language (Java, C/C++) read/write compatibility tests.
    · parquet-cpp is the C++ library for reading and writing Parquet files.
    The figure below (not reproduced here) showed how these Parquet components interact, layer by layer.

    · Data storage layer: the Parquet file format itself, i.e. the metadata defined in parquet-format, including Parquet's primitive types, page types, encodings and compression codecs.
    · Object conversion layer: maps between an object model and Parquet's internal data model and applies Parquet's encoding, the striping and assembly algorithm.
    · Object model layer: defines how the content read from a Parquet file is converted, with adapters for serialization formats such as Avro, Thrift and Protocol Buffers and for the Hive SerDe. To help readers understand the format, Parquet also provides the org.apache.parquet.example package, which converts between plain Java objects and Parquet files.
    Data model
    Parquet supports a nested data model similar to Protocol Buffers. A schema contains multiple fields, and each field can in turn contain further fields. A field has three attributes: a repetition, a data type and a field name. The repetition is one of required (occurs exactly once), repeated (occurs zero or more times) or optional (occurs zero or one time). Data types fall into two kinds: group (complex) and primitive (basic). For example, the Document schema from the Dremel paper is defined as:
    message Document {
    required int64 DocId
    optional group Links {
    repeated int64 Backward
    repeated int64 Forward
    }
    repeated group Name {
    repeated group Language {
    required string Code
    optional string Country
    }
    optional string Url
    }
    }
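    For experimentation, the sketch below (the output path is illustrative) expresses the same Document structure as a PySpark schema and writes it out as Parquet: required maps to nullable=False, optional to nullable=True, and repeated to an ArrayType.
    # Sketch: the Document schema above as a PySpark schema, written to Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   LongType, ArrayType)

    spark = SparkSession.builder.appName("document_schema_demo").getOrCreate()

    document_schema = StructType([
        StructField("DocId", LongType(), nullable=False),             # required int64
        StructField("Links", StructType([                             # optional group
            StructField("Backward", ArrayType(LongType()), True),     # repeated int64
            StructField("Forward", ArrayType(LongType()), True),      # repeated int64
        ]), nullable=True),
        StructField("Name", ArrayType(StructType([                    # repeated group
            StructField("Language", ArrayType(StructType([            # repeated group
                StructField("Code", StringType(), nullable=False),    # required string
                StructField("Country", StringType(), nullable=True),  # optional string
            ])), True),
            StructField("Url", StringType(), nullable=True),          # optional string
        ])), nullable=True),
    ])

    df = spark.createDataFrame([], schema=document_schema)
    df.write.mode("overwrite").parquet("/tmp/document_demo")   # illustrative output path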
    Converted into a tree, with the root treated as a repeated group, the schema looks like the figure below (not reproduced here).

    Every primitive-typed field in the schema is a leaf of this tree; the schema above has six leaves. If this schema were flattened into a relational model, the table would have six columns. Parquet itself has no Map or Array construct; such structures are expressed by combining repeated and group. In a single record the six columns can occur the following number of times:
    DocId (int64): exactly once
    Links.Backward (int64): any number of times; if it occurs zero times, a NULL marker is needed
    Links.Forward (int64): likewise
    Name.Language.Code (string): likewise
    Name.Language.Country (string): likewise
    Name.Url (string): likewise
    Because some columns may occur any number of times, the format must record how many times each column occurred and mark the cases where it is absent or NULL. This is what the striping and assembly algorithm provides.
    StripingAssembly算法
    文介绍Parquet数模型Document中存非required列Parquet条记录数分散存储列中组合列值组成条记录StripingAssembly算法决定该算法中列值包含三部分:valuerepetition leveldefinition level
    Repetition Levels
    支持repeated类型节点写入时候该值等前面值层节点享读取时候根该值推导出层需创建新节点例样schema两条记录
    message nested {
    repeated group leve1 {
    repeated string leve2
    }
    }
    r1[[abc] [defg]]
    r2[[h] [ij]]
    计算repetition level值程:
    · valuea条记录开始前面值(已没值)根节点(第0层)享repeated level0
    · valueb前面值享level1节点level2节点享repeated level2
    · 理valuec repeated level2
    · valued前面值享根节点(属相记录)level1节点享repeated level1
    · valueh前面值属条记录享节点repeated level0
    根分析value需记录repeated level值:

    读取时候序读取值然根repeated level创建象读取valuea时repeated level0表示需创建新根节点(新记录)valueb时repeated level2表示需创建新level2节点valued时repeated level1表示需创建新level1节点列读取完成创建条新记录例中读取文件构建条记录结果:

    出repeated level0表示条记录开始repeated level值针路径repeated类型节点计算该值时候忽略非repeated类型节点写入时候理解该节点路径repeated节点享读取时候理解需层创建新repeated节点样话列repeated level值等路径repeated节点数(包括根节点)减repeated level处够存储更加紧凑编码方式节省存储空间
    Definition Levels
    repeated level构造出记录什需definition levels呢?repeatedoptional类型存条记录中某列没值假设记录样值会导致该属条记录值做前记录部分造成数错误种情况需占位符标示种情况
    definition level值仅仅空值效表示该值路径第层开始未定义非空值没意义非空值叶子节点定义父节点肯定定义总等该列definition levels例面schema
    message ExampleDefinitionLevel {
    optional group a {
    optional group b {
    optional string c
    }
    }
    }
    包含列abc列节点optional类型c定义时ab肯定已定义c未定义时需标示出层开始时未定义面值:

    definition level需考虑未定义值repeated类型节点父节点已定义该节点必须定义(例Document中DocId条记录该列必须值样Language节点定义Code必须值)计算definition level值时忽略路径required节点样减definition level值优化存储
    完整例子
    节Dremel文中Document示例定两值r1r2展示计算repeated leveldefinition level程里未定义值记录NULLR表示repeated levelD表示definition level

    首先DocuId列r1DocId10记录开始已定义R0D0样r2中DocId20R0D0
    LinksForward列r1中未定义Links已定义该记录中第值R0D1r1中该列两值value110R0(记录中该列第值)D2(该列definition level)
    NameUrl列r1中三值分url1’httpAr1中该列第值定义R0D2value2’httpB值value1Name层相R1D2value3NULL值value2Name层相R1未定义Name层定义D1r2中该列值value3’httpCR0D2
    NameLanguageCode列r1中4值value1’enus’r1中第值已定义R0D2(Coderequired类型列repeated level值等2)value2’en’value1Language节点享R2D2value3NULL未定义前值Name节点享Name节点已定义R1D1value4’engb’前值Name层享R1D2r2中该列值未定义Name层已定义R0D1
     
    Parquet file layout
    Parquet files are stored in binary form and cannot be read directly; a file contains both the data and its metadata, so a Parquet file is self-describing and can be parsed on its own. Several concepts apply when Parquet files are stored on HDFS:
    · HDFS block: the unit of replication in HDFS. HDFS stores a file as blocks and keeps replicas of each block on different machines; a block is typically 256 MB or 512 MB.
    · HDFS file: an HDFS file consists of data plus metadata, with the data spread across blocks.
    · Row group (Row Group): a horizontal slice of the data. Each row group holds a certain number of rows, and an HDFS file stores at least one row group. Parquet buffers an entire row group in memory when reading or writing, so the row group size is limited by available memory; when records are small, a row group can hold more rows.
    · Column chunk (Column Chunk): within a row group, each column is stored in its own column chunk, and all column chunks of a row group are stored contiguously in the file. Because the values in a column chunk share the same type, different columns can be compressed with different algorithms.
    · Page (Page): each column chunk is divided into pages. The page is the unit of encoding; different pages of the same column chunk may use different encodings.
    File layout
    Normally the row group size is set close to the HDFS block size when writing Parquet. Because a Mapper usually processes one block of input, this lets each row group be handled by a single Mapper task and increases task parallelism. The Parquet file layout is shown in the figure below (not reproduced here).

    A Parquet file can contain several row groups. The file begins and ends with a magic code used to verify that it is a Parquet file, and the footer records the length of the file metadata, from which the metadata offset can be computed. The file metadata includes the metadata of every row group and the schema of the data stored in the file. In addition to the per-row-group metadata, every page starts with its own page header. Parquet has three page types: data pages, dictionary pages and index pages. A data page stores the values of a column within the current row group; a dictionary page stores the value dictionary for that column chunk (at most one per column chunk); index pages would store an index of the column within the row group, but they are not supported in current versions and are planned for a later release.
    When a MapReduce job takes multiple Parquet files as input, each Mapper's InputSplit marks the range of the file it processes; if an InputSplit crosses a row group boundary, Parquet guarantees that the row group is still processed by a single Mapper task.
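    To look at the row groups, column chunks and footer statistics described above on an actual file, here is a small sketch using the pyarrow package (assumed to be installed; the file path is illustrative):
    # Sketch: inspect Parquet footer metadata with pyarrow.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("/warehouse/orders/part-00000.snappy.parquet")  # illustrative path
    meta = pf.metadata
    print("row groups:", meta.num_row_groups, " total rows:", meta.num_rows)

    for rg_index in range(meta.num_row_groups):
        rg = meta.row_group(rg_index)
        for col_index in range(rg.num_columns):
            col = rg.column(col_index)       # one column chunk
            print(rg_index, col.path_in_schema, col.compression,
                  col.total_compressed_size,
                  col.statistics)            # min / max / null_count used for predicate push-down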
    Projection push-down (Project PushDown)
    One of the most prominent advantages of columnar storage is projection push-down: instead of fetching the full original rows of a table, only the columns the query needs are scanned. Because each column's values are stored contiguously in their own region of the file, a TableScan operator can read just those columns and avoid scanning the entire table.
    Parquet supports projection push-down natively. When a query is executed, the required columns are passed through the Configuration; they must be a subset of the schema. Parquet scans one row group at a time and reads the column chunks of the required columns from that row group into memory in a single pass, which reduces random reads. In addition, Parquet checks whether the requested columns are adjacent in the file; if several required columns are stored contiguously, it reads them with one sequential read.
    Predicate push-down (Predicate PushDown)
    Predicate push-down is a common optimization in database-style query systems: pushing filter conditions down to the lowest possible layer reduces the amount of data exchanged between layers and improves performance. For example, in the SQL query select count(1) from A join B on A.id = B.id where A.a > 10 and B.b < 100, the engine would normally TableScan A and B before the join, then filter, compute the aggregate and return the result; pushing A.a > 10 into the scan of A and B.b < 100 into the scan of B shrinks the input of the join.
    Both row-oriented and column-oriented formats can evaluate the filter condition on each record as it is read and decide whether to return it to the caller. Parquet goes a step further: when each row group and column chunk is written, it also records statistics, including the minimum value, maximum value and null count of the column chunk. With these statistics and the filter condition, the reader can decide whether a row group needs to be scanned at all. Future Parquet versions add structures such as Bloom filters and indexes so that predicate push-down can be applied even more effectively.
    Two practical strategies improve query performance when using Parquet: (1) as with a primary key in a relational database, sort the data on the columns that are filtered most often before writing, so that the min/max statistics make predicate push-down highly selective; (2) reduce the row group and page sizes so that whole row groups can be skipped more easily, at the cost of lower compression and encoding efficiency and a higher IO load, which has to be weighed.
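    A minimal PySpark sketch (path and column names are made up) showing both push-downs at work; explain() prints a ReadSchema limited to the projected columns and a PushedFilters entry for the predicate:
    # Sketch: projection and predicate push-down when reading Parquet with Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pushdown_demo").getOrCreate()

    orders = spark.read.parquet("/warehouse/orders")          # illustrative path

    result = (orders
              .select("order_id", "amount")                   # projection push-down: only two columns are scanned
              .filter(F.col("amount") > 100))                 # predicate push-down: row groups whose min/max cannot match are skipped

    result.explain()   # look for ReadSchema and PushedFilters: [GreaterThan(amount,100)] in the physical plan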

    相传统行式存储Hadoop生态圈年涌现出诸RCORCParquet列式存储格式性优势体现两方面:1更高压缩相类型数更容易针类型列高效编码压缩方式2更IO操作映射推谓词推减少部分必数扫描尤表结构较庞时候更加明显够带更查询性

    图展示格式存储TPCHTPCDS数集中两表数文件出Parquet较二进制文件存储格式够更效利存储空间新版Parquet(20版)更加高效页存储方式进步提升存储空间

    图展示TwitterImpala中格式文件执行TPCDS基准测试结果测试结果出Parquet较行式存储格式较明显性提升

    图展示criteo公司Hive中ORCParquet两种列式存储格式执行TPCDS基准测试结果测试结果出数存储方面两种存储格式snappy压缩情况量中存储格式占空间相差查询结果显示Parquet格式稍ORC格式两者功优缺点Parquet原生支持嵌套式数结构ORC支持较差种复杂Schema查询相较差Parquet支持数修改ACIDORC提供支持OLAP环境少会单条数修改更批量导入
    项目发展
    2012年TwitterCloudera研发Parquet开始该项目直处高速发展中项目初贡献开源社区2013年Criteo公司加入开发Hive社区提交hive集成Parquetpatch(HIVE5783)Hive 013版正式加入Parquet支持越越查询引擎进行支持进步带动Parquet发展
    目前Parquet正处20版迈进阶段新版中实现新Page存储格式针类型优化编码算法外丰富支持原始类型增加DecimalTimestamp等类型支持增加更加丰富统计信息例Bloon Filter够谓词推元数层完成
    总结
    文介绍种支持嵌套数模型列式存储系统Parquet作数系统中OLAP查询优化方案已种查询引擎原生支持部分高性引擎作默认文件存储格式通数编码压缩映射推谓词推功Parquet性较文件格式提升预见着数模型丰富Ad hoc查询需求Parquet会更广泛
    Apache Airflow文档

    拒绝条款: Apache Airflow正Apache Incubator赞助Apache软件基金会(ASF)进行孵化新接受项目需孵化直进步审查表明基础设施通信决策程已成功ASF项目致方式稳定然孵化状态定反映代码完整性稳定性确实表明该项目尚未ASF完全认
    Apache Airflow 灵活扩展工作流动化调度系统编集理数百 PB 数流项目轻松编排复杂计算工作流通智调度数库赖关系理错误处理日志记录Airflow 动化单服务器规模集群资源理项目采 Python 编写具高扩展性够运行语言编写务允许常体系结构项目集成 AWS S3DockerKubernetesMySQLPostgres 等
    airflow工作流实现务非循环图(DAG)airflow调度程序遵循指定赖关系时组工作程序执行您务丰富命令行实程序轻松DAG执行复杂操作丰富户界面您轻松生产环境中运行视化数道监控进度需时解决问题
    工作流定义代码时变更易维护版化测试协作

    Apache Airflow is currently used by more than 200 organizations, including Adobe, Airbnb, Astronomer, Etsy, Google, ING, Lyft, NYC City Planning, Paypal, Polidea, Qubole, Quizlet, Reddit, Reply, Solita, Square, Twitter and others.

    Airflow lets workflow developers easily create, maintain and periodically schedule workflows (directed acyclic graphs, DAGs). At Airbnb these workflows include data warehousing, growth analytics, email sending, A/B testing and more, spanning many departments. The platform provides hooks for interacting with Hive, Presto, MySQL, HDFS, Postgres, S3 and similar systems, an extensible command line interface, and a web-based user interface for visualizing pipeline dependencies, monitoring progress, triggering tasks and so on.
    Airflow consists of the following components:
    · a metadata database (MySQL or Postgres)
    · a group of Airflow worker nodes
    · a message broker (Redis or RabbitMQ)
    · the Airflow web server
    截图:


    道定义示例:

    Code that goes along with the Airflow tutorial located at
    httpsgithubcomairbnbairflowblobmasterairflowexample_dagstutorialpy

    from airflow import DAG
    from airflowoperatorsbash_operator import BashOperator
    from datetime import datetime timedelta


    default_args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' datetime(2015 6 1)
    'email' ['airflow@airflowcom']
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    # 'queue' 'bash_queue'
    # 'pool' 'backfill'
    # 'priority_weight' 10
    # 'end_date' datetime(2016 1 1)
    }

    dag DAG('tutorial' default_argsdefault_args)

    # t1 t2 and t3 are examples of tasks created by instantiating operators
    t1 BashOperator(
    task_id'print_date'
    bash_command'date'
    dagdag)

    t2 BashOperator(
    task_id'sleep'
    bash_command'sleep 5'
    retries3
    dagdag)

    templated_command
    { for i in range(5) }
    echo {{ ds }}
    echo {{ macrosds_add(ds 7)}}
    echo {{ paramsmy_param }}
    { endfor }


    t3 BashOperator(
    task_id'templated'
    bash_commandtemplated_command
    params{'my_param' 'Parameter I passed in'}
    dagdag)

    t2set_upstream(t1)
    t3set_upstream(t1)

    Apache Airflow: https://airflow.apache.org

    Principles
    · Dynamic: Airflow pipelines are configuration as code (Python), allowing dynamic pipeline generation. This makes it possible to write code that instantiates pipelines dynamically.
    · Extensible: you can easily define your own operators and executors and extend the library so that it fits the level of abstraction that suits your environment.
    · Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow via the powerful Jinja templating engine.
    · Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.

    Airflow is not a data streaming solution. Tasks do not move data between each other (although tasks can exchange metadata). Airflow is not in the Spark Streaming or Storm space; it is more comparable to Oozie or Azkaban.
    Workflows are expected to be mostly static or slowly changing. You can think of the task structure in your workflow as slightly more dynamic than a database structure would be. Airflow workflows are expected to look similar from one run to the next; this gives a clear unit of work and continuity.
    Installing and configuring an Airflow environment
    Airflow does not support running on Windows; on Windows, the environment can be built on Docker instead.
    General installation
    Download and install Airflow
    The easiest way to install the latest stable version of Airflow is with pip:
    pip install apache-airflow
    To install Airflow with extra features such as s3 and postgres support:
    pip install apache-airflow[postgres,s3]
    Note
    GPL dependency
    One of the dependencies of Apache Airflow by default pulls in a GPL library ('unidecode').
    If you do not mind the GPL dependency, export the following environment variable before installing Airflow: export AIRFLOW_GPL_UNIDECODE=yes
    If you are concerned about it, you can force a non-GPL library by setting export SLUGIFY_USES_TEXT_UNIDECODE=yes and then proceed with the normal installation. Note that this needs to be specified on every upgrade. Also note that this is not needed if unidecode is already present on the system.
    Extra packages
    The apache-airflow package on PyPI is just the bare bones needed to get started. Sub-packages can be installed depending on what your environment needs; for example, if you do not need a Postgres connection you are spared the trouble of installing the postgres-devel yum package, or whatever the equivalent is for the distribution you are using.
    Behind the scenes, Airflow does conditional imports of operators that require these extra dependencies.
    The sub-packages, their install commands and the features they enable:
    subpackage          install command                                 enables
    all                 pip install apache-airflow[all]                 All Airflow features known to man
    all_dbs             pip install apache-airflow[all_dbs]             All database integrations
    async               pip install apache-airflow[async]               Async worker classes for Gunicorn
    celery              pip install apache-airflow[celery]              CeleryExecutor
    cloudant            pip install apache-airflow[cloudant]            Cloudant hook
    crypto              pip install apache-airflow[crypto]              Encrypt connection passwords in the metadata db
    devel               pip install apache-airflow[devel]               Minimum dev tools requirements
    devel_hadoop        pip install apache-airflow[devel_hadoop]        Airflow + dependencies on the Hadoop stack
    druid               pip install apache-airflow[druid]               Druid related operators & hooks
    gcp_api             pip install apache-airflow[gcp_api]             Google Cloud Platform hooks and operators (using google-api-python-client)
    github_enterprise   pip install apache-airflow[github_enterprise]   GitHub Enterprise auth backend
    google_auth         pip install apache-airflow[google_auth]         Google auth backend
    hdfs                pip install apache-airflow[hdfs]                HDFS hooks and operators
    hive                pip install apache-airflow[hive]                All Hive related operators
    jdbc                pip install apache-airflow[jdbc]                JDBC hooks and operators
    kerberos            pip install apache-airflow[kerberos]            Kerberos integration for Kerberized Hadoop
    ldap                pip install apache-airflow[ldap]                LDAP authentication for users
    mssql               pip install apache-airflow[mssql]               Microsoft SQL Server operators and hook, support as an Airflow backend
    mysql               pip install apache-airflow[mysql]               MySQL operators and hook, support as an Airflow backend. The version of MySQL server has to be 5.6.4+. The exact version upper bound depends on the version of the mysqlclient package; for example, mysqlclient 1.3.12 can only be used with MySQL server 5.6.4 through 5.7.
    password            pip install apache-airflow[password]            Password authentication for users
    postgres            pip install apache-airflow[postgres]            PostgreSQL operators and hook, support as an Airflow backend
    qds                 pip install apache-airflow[qds]                 Enable QDS (Qubole Data Service) support
    rabbitmq            pip install apache-airflow[rabbitmq]            RabbitMQ support as a Celery backend
    redis               pip install apache-airflow[redis]               Redis hooks and sensors
    s3                  pip install apache-airflow[s3]                  S3KeySensor, S3PrefixSensor
    samba               pip install apache-airflow[samba]               Hive2SambaOperator
    slack               pip install apache-airflow[slack]               SlackAPIPostOperator
    ssh                 pip install apache-airflow[ssh]                 SSH hooks and operator
    vertica             pip install apache-airflow[vertica]             Vertica hook support as an Airflow backend
    启动Airflow数库
    您运行务前Airflow需启动数库果您试验学Airflow您坚持默认SQLite选项果您想SQLite请查初始化数库端设置数库
    配置完成您需先初始化数库然运行务:
    airflow initdb
    Airflow环境安装(Docker)
    Step 1: install Docker
    Download and install Docker from:
    https://store.docker.com/editions/community/docker-ce-desktop-windows
    Step 2: pull the image and start the Airflow service
    Run the commands below to pull the Airflow image and start an Airflow container.
    Once the container is up, open http://localhost:8080 to reach the Airflow admin console.
    docker pull puckel/docker-airflow
    docker run -d -p 8080:8080 -e LOAD_EX=y puckel/docker-airflow
    Configuring the Airflow environment (Docker)
    Inspecting the container
    docker ps                     # list running containers
    docker stop <container_id>    # stop a container
    docker start <container_id>   # start a container
    docker rm <container_id>      # remove a container
    Opening a shell inside the container
    docker exec -it <container_id> /bin/bash
    Copying the Airflow configuration file out of the container
    Copy the airflow.cfg from the container into the current directory:
    docker cp <container_id>:/usr/local/airflow/airflow.cfg .
    Uploading a DAG file into the container
    docker exec <container_id> mkdir /usr/local/airflow/dags    # the dags folder has to be created the first time

    docker cp testing.py <container_id>:/usr/local/airflow/dags/testing.py
    Editing airflow.cfg and starting the airflow scheduler
    The image uses the LocalExecutor by default and does not start the airflow scheduler, so the scheduler has to be configured and started as a background service.
    Download the Airflow configuration file from the container (/usr/local/airflow/airflow.cfg),
    change the following parameter in it:
    max_threads = 1
    and then start the scheduler with:
    docker exec <container_id> airflow scheduler -D

    Airflow环境安装(Windows 10)
    开命令行提示符输入 pip install apacheairflow会报错找ssl模块参考解决教程:WIN10 Anaconda安装Python3pip时报没ssl模块错误 
    错误解决继续输入 pip install apacheairflow

    报错Microsoft Visual C++ 140 is required

    原Windows环境版问题需官网载新安装果安装生效原Microsoft NET Framework版太旧重新官网载新安装
    Visual C++ Build Tools 2015 Microsoft
     httpwwwlfduciedu~gohlkepythonlibs#twisted 载twisted应版whl文件cp面python版amd64代表64位运行命令: pip install 载whl文件名
    果行话分享安装器进行安装:visualcppbuildtools_fullzip
    解决问题继续安装airflow

    报错VS140 linkexe failed with exit status 1158

    问题关键:
    Visual Studio can't build due to rcexe
    通步骤解决问题:
    1. Add the following path to the PATH environment variable:
    C:\Program Files (x86)\Windows Kits\10\bin\x64
    2. Copy the files rc.exe and rcdll.dll from C:\Program Files (x86)\Windows Kits\8.1\bin\x86 to C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin
    3. Re-run pip install apache-airflow from the command prompt.




    项目
    历史
    Airflow项目2014年10月AirbnbMaxime Beauchemin启动 第次提交时开源2015年6月宣布正式加入Airbnb Github
    该项目2016年3月加入Apache Software Foundation孵化计划
    2019年1月8日Apache 软件基金会宣布Apache Airflow 已成功孵化毕业成基金会新顶级项目
    公告址:
    httpblogsapacheorgfoundationentrytheapachesoftwarefoundationannounces44
    提交者
    · @mistercrunch (Maxime Max Beauchemin)
    · @r39132 (Siddharth Sid Anand)
    · @criccomini (Chris Riccomini)
    · @bolkedebruin (Bolke de Bruin)
    · @artwr (Arthur Wiedmer)
    · @jlowin (Jeremiah Lowin)
    · @patrickleotardif (Patrick Leo Tardif)
    · @aoen (Dan Davydov)
    · @syvineckruyk (Steven YvinecKruyk)
    · @msumit (Sumit Maheshwari)
    · @alexvanboxel (Alex Van Boxel)
    · @saguziel (Alex Guziel)
    · @joygao (Joy Gao)
    · @fokko (Fokko Driesprong)
    · @ash (Ash BerlinTaylor)
    · @kaxilnaik (Kaxil Naik)
    · @fengtao (Tao Feng)
    关贡献者完整列表请查AirflowGithub贡献者页面:
    资源链接
    · Airflow’s official documentation
    · Mailing list (send emails to devsubscribe@airflowincubatorapacheorg andor commitssubscribe@airflowincubatorapacheorg to subscribe to each)
    · Issues on Apache’s Jira
    · Gitter (chat) Channel
    · More resources and links to Airflow related content on the Wiki
    路线图
    Please refer to the Roadmap on the wiki
    许证

    Apache License
    Version 20 January 2004
    httpwwwapacheorglicenses

    TERMS AND CONDITIONS FOR USE REPRODUCTION AND DISTRIBUTION

    1 Definitions

    License shall mean the terms and conditions for use reproduction
    and distribution as defined by Sections 1 through 9 of this document

    Licensor shall mean the copyright owner or entity authorized by
    the copyright owner that is granting the License

    Legal Entity shall mean the union of the acting entity and all
    other entities that control are controlled by or are under common
    control with that entity For the purposes of this definition
    control means (i) the power direct or indirect to cause the
    direction or management of such entity whether by contract or
    otherwise or (ii) ownership of fifty percent (50) or more of the
    outstanding shares or (iii) beneficial ownership of such entity

    You (or Your) shall mean an individual or Legal Entity
    exercising permissions granted by this License

    Source form shall mean the preferred form for making modifications
    including but not limited to software source code documentation
    source and configuration files

    Object form shall mean any form resulting from mechanical
    transformation or translation of a Source form including but
    not limited to compiled object code generated documentation
    and conversions to other media types

    Work shall mean the work of authorship whether in Source or
    Object form made available under the License as indicated by a
    copyright notice that is included in or attached to the work
    (an example is provided in the Appendix below)

    Derivative Works shall mean any work whether in Source or Object
    form that is based on (or derived from) the Work and for which the
    editorial revisions annotations elaborations or other modifications
    represent as a whole an original work of authorship For the purposes
    of this License Derivative Works shall not include works that remain
    separable from or merely link (or bind by name) to the interfaces of
    the Work and Derivative Works thereof

    Contribution shall mean any work of authorship including
    the original version of the Work and any modifications or additions
    to that Work or Derivative Works thereof that is intentionally
    submitted to Licensor for inclusion in the Work by the copyright owner
    or by an individual or Legal Entity authorized to submit on behalf of
    the copyright owner For the purposes of this definition submitted
    means any form of electronic verbal or written communication sent
    to the Licensor or its representatives including but not limited to
    communication on electronic mailing lists source code control systems
    and issue tracking systems that are managed by or on behalf of the
    Licensor for the purpose of discussing and improving the Work but
    excluding communication that is conspicuously marked or otherwise
    designated in writing by the copyright owner as Not a Contribution

    Contributor shall mean Licensor and any individual or Legal Entity
    on behalf of whom a Contribution has been received by Licensor and
    subsequently incorporated within the Work

    2 Grant of Copyright License Subject to the terms and conditions of
    this License each Contributor hereby grants to You a perpetual
    worldwide nonexclusive nocharge royaltyfree irrevocable
    copyright license to reproduce prepare Derivative Works of
    publicly display publicly perform sublicense and distribute the
    Work and such Derivative Works in Source or Object form

    3 Grant of Patent License Subject to the terms and conditions of
    this License each Contributor hereby grants to You a perpetual
    worldwide nonexclusive nocharge royaltyfree irrevocable
    (except as stated in this section) patent license to make have made
    use offer to sell sell import and otherwise transfer the Work
    where such license applies only to those patent claims licensable
    by such Contributor that are necessarily infringed by their
    Contribution(s) alone or by combination of their Contribution(s)
    with the Work to which such Contribution(s) was submitted If You
    institute patent litigation against any entity (including a
    crossclaim or counterclaim in a lawsuit) alleging that the Work
    or a Contribution incorporated within the Work constitutes direct
    or contributory patent infringement then any patent licenses
    granted to You under this License for that Work shall terminate
    as of the date such litigation is filed

    4 Redistribution You may reproduce and distribute copies of the
    Work or Derivative Works thereof in any medium with or without
    modifications and in Source or Object form provided that You
    meet the following conditions

    (a) You must give any other recipients of the Work or
    Derivative Works a copy of this License and

    (b) You must cause any modified files to carry prominent notices
    stating that You changed the files and

    (c) You must retain in the Source form of any Derivative Works
    that You distribute all copyright patent trademark and
    attribution notices from the Source form of the Work
    excluding those notices that do not pertain to any part of
    the Derivative Works and

    (d) If the Work includes a NOTICE text file as part of its
    distribution then any Derivative Works that You distribute must
    include a readable copy of the attribution notices contained
    within such NOTICE file excluding those notices that do not
    pertain to any part of the Derivative Works in at least one
    of the following places within a NOTICE text file distributed
    as part of the Derivative Works within the Source form or
    documentation if provided along with the Derivative Works or
    within a display generated by the Derivative Works if and
    wherever such thirdparty notices normally appear The contents
    of the NOTICE file are for informational purposes only and
    do not modify the License You may add Your own attribution
    notices within Derivative Works that You distribute alongside
    or as an addendum to the NOTICE text from the Work provided
    that such additional attribution notices cannot be construed
    as modifying the License

    You may add Your own copyright statement to Your modifications and
    may provide additional or different license terms and conditions
    for use reproduction or distribution of Your modifications or
    for any such Derivative Works as a whole provided Your use
    reproduction and distribution of the Work otherwise complies with
    the conditions stated in this License

    5 Submission of Contributions Unless You explicitly state otherwise
    any Contribution intentionally submitted for inclusion in the Work
    by You to the Licensor shall be under the terms and conditions of
    this License without any additional terms or conditions
    Notwithstanding the above nothing herein shall supersede or modify
    the terms of any separate license agreement you may have executed
    with Licensor regarding such Contributions

    6 Trademarks This License does not grant permission to use the trade
    names trademarks service marks or product names of the Licensor
    except as required for reasonable and customary use in describing the
    origin of the Work and reproducing the content of the NOTICE file

    7 Disclaimer of Warranty Unless required by applicable law or
    agreed to in writing Licensor provides the Work (and each
    Contributor provides its Contributions) on an AS IS BASIS
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND either express or
    implied including without limitation any warranties or conditions
    of TITLE NONINFRINGEMENT MERCHANTABILITY or FITNESS FOR A
    PARTICULAR PURPOSE You are solely responsible for determining the
    appropriateness of using or redistributing the Work and assume any
    risks associated with Your exercise of permissions under this License

    8 Limitation of Liability In no event and under no legal theory
    whether in tort (including negligence) contract or otherwise
    unless required by applicable law (such as deliberate and grossly
    negligent acts) or agreed to in writing shall any Contributor be
    liable to You for damages including any direct indirect special
    incidental or consequential damages of any character arising as a
    result of this License or out of the use or inability to use the
    Work (including but not limited to damages for loss of goodwill
    work stoppage computer failure or malfunction or any and all
    other commercial damages or losses) even if such Contributor
    has been advised of the possibility of such damages

    9 Accepting Warranty or Additional Liability While redistributing
    the Work or Derivative Works thereof You may choose to offer
    and charge a fee for acceptance of support warranty indemnity
    or other liability obligations andor rights consistent with this
    License However in accepting such obligations You may act only
    on Your own behalf and on Your sole responsibility not on behalf
    of any other Contributor and only if You agree to indemnify
    defend and hold each Contributor harmless for any liability
    incurred by or claims asserted against such Contributor by reason
    of your accepting any such warranty or additional liability
    Quick start
    The installation is quick and straightforward.
    # airflow needs a home, ~/airflow is the default,
    # but you can lay foundation somewhere else if you prefer
    # (optional)
    export AIRFLOW_HOME=~/airflow

    # install from pypi using pip
    pip install apache-airflow

    # initialize the database
    airflow initdb

    # start the web server, default port is 8080
    airflow webserver -p 8080

    # start the scheduler
    airflow scheduler

    # visit localhost:8080 in the browser and enable the example dag in the home page
    Running these commands makes Airflow create the $AIRFLOW_HOME folder and an airflow.cfg file with defaults that get you going fast. You can inspect the file either in $AIRFLOW_HOME/airflow.cfg or through the UI in the Admin->Configuration menu. The PID file for the web server is stored in $AIRFLOW_HOME/airflow-webserver.pid, or in /run/airflow/webserver.pid if started by systemd.
    Out of the box, Airflow uses a SQLite database, which you should quickly outgrow since no parallelization is possible with this backend. It works together with the SequentialExecutor, which only runs task instances sequentially. While quite limiting, this lets you get up and running quickly and take a tour of the UI and the command line utilities.
    Here are a few commands that trigger task instances. You should see the status of the jobs change in the example1 DAG as you run the commands below.
    # run your first task instance
    airflow run example_bash_operator runme_0 2015-01-01
    # run a backfill over 2 days
    airflow backfill example_bash_operator -s 2015-01-01 -e 2015-01-02
    步什?
    点开始您前教程部分获取更示例者果您已准备弄清楚请参阅操作指南部分
    教程
    教程您介绍基Airflow概念象编写第道时法
    示例道定义
    基道定义示例果起复杂请担心面逐行说明

    Code that goes along with the Airflow tutorial located at
    httpsgithubcomapacheincubatorairflowblobmasterairflowexample_dagstutorialpy

    from airflow import DAG
    from airflowoperatorsbash_operator import BashOperator
    from datetime import datetime timedelta


    default_args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' datetime(2015 6 1)
    'email' ['airflow@examplecom']
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    # 'queue' 'bash_queue'
    # 'pool' 'backfill'
    # 'priority_weight' 10
    # 'end_date' datetime(2016 1 1)
    }

    dag DAG('tutorial' default_argsdefault_args schedule_intervaltimedelta(days1))

    # t1 t2 and t3 are examples of tasks created by instantiating operators
    t1 BashOperator(
    task_id'print_date'
    bash_command'date'
    dagdag)

    t2 BashOperator(
    task_id'sleep'
    bash_command'sleep 5'
    retries3
    dagdag)

    templated_command
    { for i in range(5) }
    echo {{ ds }}
    echo {{ macrosds_add(ds 7)}}
    echo {{ paramsmy_param }}
    { endfor }


    t3 BashOperator(
    task_id'templated'
    bash_commandtemplated_command
    params{'my_param' 'Parameter I passed in'}
    dagdag)

    t2set_upstream(t1)
    t3set_upstream(t1)
    The DAG definition file
    One thing to wrap your head around (it may not be very intuitive at first) is that this Airflow Python script is really just a configuration file that specifies the DAG's structure as code. The actual tasks defined here run in a different context from the context of this script: different tasks run on different workers at different points in time, which means that this script cannot be used for cross-communication between tasks. Note that there is a more advanced feature for that purpose called XCom.
    People sometimes think of the DAG definition file as a place where they can do some actual data processing; that is not the case at all. The script's purpose is to define a DAG object, and it needs to evaluate quickly (seconds, not minutes), since the scheduler executes it periodically to pick up any changes.
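    Since XCom is only mentioned in passing here, the following is a minimal sketch (DAG id and task ids are made up) of how one task can hand a small value to another through XCom:
    # Sketch: passing a value between tasks with XCom.
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from datetime import datetime

    dag = DAG('xcom_demo', start_date=datetime(2015, 6, 1), schedule_interval=None)

    def push_value(**context):
        # the return value of a PythonOperator callable is stored as an XCom
        return 42

    def pull_value(**context):
        value = context['ti'].xcom_pull(task_ids='push_value')
        print('got %s from upstream' % value)

    push = PythonOperator(task_id='push_value', python_callable=push_value,
                          provide_context=True, dag=dag)
    pull = PythonOperator(task_id='pull_value', python_callable=pull_value,
                          provide_context=True, dag=dag)
    pull.set_upstream(push)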
    Importing modules
    An Airflow pipeline is just a Python script that happens to define an Airflow DAG object. Start by importing the libraries we will need.
    # The DAG object; we'll need this to instantiate a DAG
    from airflow import DAG

    # Operators; we need this to operate!
    from airflow.operators.bash_operator import BashOperator
    Default arguments
    When creating a DAG and some tasks, we can either pass a set of arguments to each task's constructor explicitly (which becomes redundant) or, better, define a dictionary of default parameters to use when creating tasks.
    from datetime import datetime, timedelta

    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': datetime(2015, 6, 1),
        'email': ['airflow@example.com'],
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
        # 'queue': 'bash_queue',
        # 'pool': 'backfill',
        # 'priority_weight': 10,
        # 'end_date': datetime(2016, 1, 1),
    }
    For more information about the BaseOperator's parameters and what they do, refer to the airflow.models.BaseOperator documentation.
    Also note that you could easily define different sets of arguments that serve different purposes, for example for the production and development environments.
    Instantiate a DAG
    We need a DAG object to nest our tasks into. Here we pass a string that defines the dag_id, which serves as a unique identifier for the DAG, pass the default argument dictionary we just defined, and define a schedule_interval of one day for the DAG.
    dag = DAG(
        'tutorial', default_args=default_args, schedule_interval=timedelta(days=1))

    Tasks are generated when operator objects are instantiated. An object instantiated from an operator is called a task. The first argument, task_id, acts as a unique identifier for the task.
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag)

    t2 = BashOperator(
        task_id='sleep',
        bash_command='sleep 5',
        retries=3,
        dag=dag)
    Notice how we pass a mix of operator-specific arguments (bash_command) and an argument common to all operators (retries), inherited from BaseOperator, to the operator's constructor. This is simpler than passing every argument in every constructor call. Also notice that in the second task we override the retries parameter with 3.
    The precedence rules for a task's arguments are:
    1. explicitly passed arguments
    2. values that exist in the default_args dictionary
    3. the operator's own default value, if one exists
    A task must include or inherit the arguments task_id and owner, otherwise Airflow raises an exception.
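    A minimal sketch of these precedence rules (the DAG id is made up): retries is passed explicitly and overrides default_args, while owner falls back to default_args.
    # Sketch: argument precedence between explicit arguments and default_args.
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from datetime import datetime

    dag = DAG('precedence_demo',
              default_args={'owner': 'airflow', 'retries': 1,
                            'start_date': datetime(2015, 6, 1)})

    t = BashOperator(task_id='demo', bash_command='echo hi', retries=3, dag=dag)

    print(t.retries)  # 3: the explicitly passed argument wins
    print(t.owner)    # 'airflow': inherited from default_args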
    Templating with Jinja
    Airflow leverages the power of Jinja Templating and provides the pipeline author with a set of built-in parameters and macros. Airflow also provides hooks for the pipeline author to define their own parameters, macros and templates.
    This tutorial barely scratches the surface of what you can do with templating in Airflow; the goal of this section is to let you know this feature exists, get you familiar with the double curly brackets, and point to the most common template variable: {{ ds }} (today's date stamp).
    templated_command = """
    {% for i in range(5) %}
        echo "{{ ds }}"
        echo "{{ macros.ds_add(ds, 7) }}"
        echo "{{ params.my_param }}"
    {% endfor %}
    """

    t3 = BashOperator(
        task_id='templated',
        bash_command=templated_command,
        params={'my_param': 'Parameter I passed in'},
        dag=dag)
    Notice that templated_command contains code logic in {% %} blocks, references parameters like {{ ds }}, calls a function as in {{ macros.ds_add(ds, 7) }}, and references a user-defined parameter in {{ params.my_param }}.
    The params hook in BaseOperator allows you to pass a dictionary of parameters and/or objects to your templates. Please take the time to understand how the parameter my_param makes it through to the template.
    Files can also be passed to the bash_command argument, like bash_command='templated_command.sh', where the file location is relative to the directory containing the pipeline file (tutorial.py in this case). This may be desirable for many reasons, such as separating your script's logic from the pipeline code, allowing proper code highlighting in files composed in different languages, and general flexibility in structuring pipelines. It is also possible to define your template_searchpath as pointing to any folder locations in the DAG constructor call.
    Using that same DAG constructor call, it is possible to define user_defined_macros, which lets you specify your own variables. For example, passing dict(foo='bar') to this argument allows you to use {{ foo }} in your templates. Moreover, specifying user_defined_filters lets you register your own filters. For example, passing dict(hello=lambda name: 'Hello %s' % name) to this argument allows you to use {{ 'world' | hello }} in your templates. For more information on custom filters, have a look at the Jinja documentation.
    For more information on the variables and macros that can be referenced in templates, make sure to read through the Macros section.
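    As a concrete illustration of user_defined_macros and user_defined_filters (the DAG id and task id are made up):
    # Sketch: custom macros and filters available inside templated fields.
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from datetime import datetime

    dag = DAG(
        'macro_filter_demo',
        start_date=datetime(2015, 6, 1),
        schedule_interval=None,
        user_defined_macros=dict(foo='bar'),
        user_defined_filters=dict(hello=lambda name: 'Hello %s' % name),
    )

    t = BashOperator(
        task_id='render_demo',
        # renders to: echo bar && echo Hello world
        bash_command='echo {{ foo }} && echo {{ "world" | hello }}',
        dag=dag,
    )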
    Setting up dependencies
    We have two simple tasks that do not depend on each other. Here are a few ways you can define dependencies between them:
    t2.set_upstream(t1)

    # This means that t2 will depend on t1
    # running successfully to run.
    # It is equivalent to:
    # t1.set_downstream(t2)

    t3.set_upstream(t1)

    # all of this is equivalent to:
    # dag.set_dependency('print_date', 'sleep')
    # dag.set_dependency('print_date', 'templated')
    Note that when executing your script, Airflow raises an exception when it finds a cycle in your DAG or when a dependency is referenced more than once.
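    Newer Airflow releases also accept the bitshift composition shown in this small sketch, which expresses the same dependencies more compactly (t1, t2 and t3 are the tasks defined above):
    t1 >> t2            # same as t2.set_upstream(t1)
    t1 >> t3            # same as t3.set_upstream(t1)
    # recent releases also accept a list on the right-hand side:
    t1 >> [t2, t3]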
    概括
    吧非常基DAG时您代码应示:

    Code that goes along with the Airflow tutorial located at
    httpsgithubcomapacheincubatorairflowblobmasterairflowexample_dagstutorialpy

    from airflow import DAG
    from airflowoperatorsbash_operator import BashOperator
    from datetime import datetime timedelta


    default_args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' datetime(2015 6 1)
    'email' ['airflow@examplecom']
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    # 'queue' 'bash_queue'
    # 'pool' 'backfill'
    # 'priority_weight' 10
    # 'end_date' datetime(2016 1 1)
    }

    dag DAG(
    'tutorial' default_argsdefault_args schedule_intervaltimedelta(days1))

    # t1 t2 and t3 are examples of tasks created by instantiating operators
    t1 BashOperator(
    task_id'print_date'
    bash_command'date'
    dagdag)

    t2 BashOperator(
    task_id'sleep'
    bash_command'sleep 5'
    retries3
    dagdag)

    templated_command
    { for i in range(5) }
    echo {{ ds }}
    echo {{ macrosds_add(ds 7)}}
    echo {{ paramsmy_param }}
    { endfor }


    t3 BashOperator(
    task_id'templated'
    bash_commandtemplated_command
    params{'my_param' 'Parameter I passed in'}
    dagdag)

    t2set_upstream(t1)
    t3set_upstream(t1)
    AWS应实例 – EMR集群
    This DAG demonstrates DAG parameters shows the use of the Airflow Operators and build dependencies The DAG accomplishes the following DAG演示DAG参数显示Airflow操作符构建赖关系 DAG完成务:
    · Spin up an EMR cluster
    · Copy some data files from S3 to HDFS
    · Run Hive, Pig and Spark tasks on that sample data
    · Terminate the EMR cluster
    from airflow import DAG
    from airflowoperators import EmrOperator
    from airflowoperators import GenieHiveOperator GenieS3DistCpOperator \
        GeniePigOperator GenieSparkOperator
    from datetime import datetime timedelta
    from batch_common import DMError BICommon
     
    # import profile properties
    profile BICommon()get_profile()
    job_user profile['JOB_USER']
     
     
    # Static Values
    CLUSTER_NAME 'gck_af_beta_test_demo_129'
    ON_FAILURE_CB EmrOperator(owner'noowner' task_id'notask' cluster_action'terminate' cluster_nameCLUSTER_NAME)execute
     
    default_args {
        'owner' job_user
        'wait_for_downstream' True
        'start_date' datetime(2017 1 29)
        'email' ['LstDigitalTechNGAPAlerts@nikecom']
        'email_on_failure' True
        'email_on_retry' False
        'retries' 1
        'queue' 'airflow'
        'retry_delay' timedelta(seconds10)
        'on_failure_callback' ON_FAILURE_CB
    }
     
    dag DAG('gck_af_beta_test_demo_129' default_argsdefault_args schedule_intervaltimedelta(days1))
     
    distcp_cmd src s3ainbounddatascienceadhocngap2_beta_testsdata dest hdfsuserhadoopdatalanding
    hive_cmd f s3inbounddatascienceadhocngap2_beta_testsaf_test_scriptssamplehivejobhql
    pig_cmd f s3inbounddatascienceadhocngap2_beta_testsaf_test_scriptssamplepigjobpig
    spark_cmd s3inbounddatascienceadhocngap2_beta_testsaf_test_scriptssamplesparkjobpy
     
    task1 EmrOperator(   task_id 'EmrSpinup'
        cluster_action 'spinup'
        cluster_name CLUSTER_NAME
        num_task_nodes 1
        num_core_nodes 1
        classification 'bronze'
        queue 'airflow'
        cost_center 'COST_CENTER'
        emr_version '500'
        applications ['hive' 'pig' 'spark']
        dag dag
    )
     
    task2 GenieS3DistCpOperator(   task_id 'TestDistcpStep'
        command distcp_cmd
        job_name distcpgeniejob
        queue 'airflow'
        sched_type CLUSTER_NAME
        dag dag
    )
     
    task3 GenieHiveOperator(   task_id 'TestHiveStep'
        command hive_cmd
        job_name 'hivegeniejob'
        queue 'airflow'
        sched_type CLUSTER_NAME
        dag dag
    )
     
    task4 GeniePigOperator(   task_id 'TestPigStep'
        command pig_cmd
        job_name 'piggeniejob'
        queue 'airflow'
        sched_type CLUSTER_NAME
        dag dag
    )
     
    task5 GenieSparkOperator(   task_id 'TestSparkStep'
        command spark_cmd
        job_name 'sparkgeniejob'
        queue 'airflow'
        sched_type CLUSTER_NAME
        dag dag
    )
     
    task6 EmrOperator(   task_id 'EmrTerminate'
        cluster_action 'terminate'
        cluster_name CLUSTER_NAME
        queue 'airflow'
        dag dag
    )
     
    # Set Dependencies
    task2.set_upstream(task1)
    task3.set_upstream(task2)
    task4.set_upstream(task3)
    task5.set_upstream(task4)
    task6.set_upstream(task5)
    task1.set_downstream(task6)

    Testing
    Running the script
    Time to run some tests. First, let's make sure that the pipeline is parsed. Assume we are saving the code from the previous step in tutorial.py in the DAGs folder referenced in your airflow.cfg. The default location for your DAGs is ~/airflow/dags.
    python ~/airflow/dags/tutorial.py
    If the script does not raise an exception, it means that you haven't done anything horribly wrong and that your Airflow environment is somewhat sound.
    Command line metadata validation
    Let's run a few commands to validate the script further:
    # print the list of active DAGs
    airflow list_dags

    # prints the list of tasks in the "tutorial" dag_id
    airflow list_tasks tutorial

    # prints the hierarchy of tasks in the tutorial DAG
    airflow list_tasks tutorial --tree
    Testing
    Let's test by running the actual task instances for a specific date. The date specified in this context is an execution_date, which simulates the scheduler running your task or dag at that specific date and time:
    # command layout: command subcommand dag_id task_id date

    # testing print_date
    airflow test tutorial print_date 2015-06-01

    # testing sleep
    airflow test tutorial sleep 2015-06-01
    Now remember what we did with templating earlier? See how this template gets rendered and executed by running this command:
    # testing templated
    airflow test tutorial templated 2015-06-01
    This should result in displaying a verbose log of events and ultimately running your bash command and printing the result.
    Note that the airflow test command runs task instances locally, outputs their log to stdout (on screen), does not bother with dependencies, and does not communicate state (running, success, failed, ...) to the database. It simply allows testing a single task instance.
    Backfill
    Everything looks like it's running fine, so let's run a backfill. backfill will respect your dependencies, emit logs into files and talk to the database to record status. If you do have a webserver up, you will be able to track the progress. airflow webserver will start a web server if you are interested in tracking the progress visually as the backfill progresses.
    Note that if you use depends_on_past=True, individual task instances will depend on the success of the preceding task instance, except for the start_date specified itself, for which this dependency is disregarded.
    The date range in this context is a start_date and optionally an end_date, which are used to populate the run schedule with task instances from this dag.
    # optional, start a web server in debug mode in the background
    # airflow webserver --debug &

    # start your backfill on a date range
    airflow backfill tutorial -s 2015-06-01 -e 2015-06-07
    步什?
    That’s it you’ve written tested and backfilled your very first Airflow pipeline Merging your code into a code repository that has a master scheduler running against it should get it to get triggered and run every day 样已编写测试回填第Airflow道 代码合具针运行调度程序代码存储库中应该天触发运行
    Here’s a few things you might want to do next 您想做事情:
    · 深入解户界面 点击容
    · Keep reading the docs Especially the sections on
    o Command line interface
    o Operators
    o Macros
    · Write your first pipeline
    AWS部署运行
    Step 1 Copy the DAG to the Airflow Scheduler
    · Click on httpttygckngap2nikecom8080
    · You will be in the airflow home directory  varlibairflow
    · Change directory to varlibairflowdags
    · cd dags
    · Make sure the start date of the dag is updated
    · If you don't have a subdirectory underneath the dags folder you will want to create one as the push_dags command expects a subdirectory 
    · mkdir
    · cd
    · Copy your DAG into a python file or optionally use the AWS S3 CLI
    · aws s3 cp s3nikeemrbingckairflowdagsgck_af_beta_test_demo_129py gck_af_beta_test_demo_228py
    · Sync your DAG from the Airflow Scheduler to the Workers
    · push_dag
    · follow the prompts
    · For a development environment this might be the way you copy your DAGs to the Airflow Scheduler and Workers
    · For a test or production environment this would typically be handled through a Jenkins build
    You're not limited by the EMR versions in the PaaSUI
    If you are spinning up a cluster in your dag you can choose any EMR version
    Airflow Scheduler TTY Shell

    Additionally you can capture the IP address from this TTY shell or from Flower and login through your local terminal with Jumpbox
     Step 2 Refresh Airflow DAG and Turn it On from the Scheduler GUI
    · Click on httpairflowgckngap2nikecom8080
    · Locate your DAG
    · Refresh DAG
    · Turn DAG on
     Airflow Scheduler GUI DAG Refresh

     
    Step 3 Open Airflow DAG and Run
    · Click on first task and Run if not already on a schedule
    Airflow Scheduler GUI Run DAG

    Step 4 View DAG Tasks in Celery Flower
    · Click on httpflowergckngap2nikecom8080
    · Click on Tasks
    Celery Flower GUI Tasks

    AWS查日志
    View Airflow Logs
    · Click on a Genie Operator Task and select the View Logs button
    · Notice the link to Kibana to visualize your logs
    · Click the Kibana Link and notice the EMR Cluster Name is now pinned you should only see your cluster logs
    Airflow Scheduler GUI Task Log

    Kibana Dashboard for EMR Cluster Metrics

    Airflow GPU WorkersTTY
    Airflow GPU worker nodes will have tty access when the Instances is spun up and the access will be revoked when the Instance is terminated This applies to Tensorflow workers (p2xlarge p28xlarge and p216xlarge) Instances and AnacondaR worker Instances Airflow Operators TensorflowSpinUp and TensorflowTerminate is used in the Airflow DAG's to Spinup and Terminate the Tensorflow Instances For more information click here Similarly Airflow Operators AnacondaRSpinUpOperator and AnacondaRTerminateOperator is used in the Airflow DAG's to Spinup and Terminate the AnacondaR Instances For more information click here
     
    An example of a tty endpoint of a Tensorflow worker node is given below
    ttytfclusternamequeuenamengap2nikecom8080
    where 
    clustername will be the name of the Airflow Cluster (ex nikeplus nikeplusqa)
    queuename will be the name of the Tensorflow Queue (ex tensorflow_tfp2xlarge1)
    If an instance is spunup in nikeplusqa environment in the queue tensorflow_tfp2xlarge2 the endpoint will be like
    ttytfnikeplusqatensorflow_tfp2xlarge2ngap2nikecom8080

     
    An example of a tty endpoint of a AnacondaR worker node is given below
    ttytfclusternamequeuenamengap2nikecom8080
    where
    clustername will be the name of the Airflow Cluster (ex dtc dtcdev dtcqa gckengineering etc)
    queuename will be the name of the AnacondaR Queue (ex anacondarr38xlarge1 )
    If an instance is spunup in dtcdev environment in the queue  anacondarr38xlarge1 the endpoint will be like
    ttytfdtcdevanacondarr38xlarge1ngap2nikecom8080
    关Airflow
    Airflow User Guides Documentation
    2Airflow Clusters
    3 Team Access to Clusters
    4Accessing Airflow GPU Nodes

    Apache Airflow is our jobscheduling utility This is used to create pipelines usually for ETL operationsJobs are ran on our EMR clusters and managed via a variety of tools

    Access
    The following services are provided
    Airflow
    DAG interface
    Celery
    Cluster health and monitoring

    These services and clusters can be found in the PaaSUI The master node can also be accessed via the SSHRemotely tool
    Security
    Access is determined by your NGAP2 AD group Only group members can see (and access) your clusters and jobs
    To access Okera clusters utilize these paramaters in the EMR spinup step of your DAG
    cerebro_cdasTrue
    tags[
    {Keycerebro_clusterValueprofile['ENV']}
    ]

    Airflow User Guides Documentation
    Topic
    Link
    Notes
    Topic
    Link
    Notes
    Airflow Tutorial
    httpsairflowincubatorapacheorgtutorialhtml
    REQUIRED reference for all Airflow users Start here Return when you are stuck
    DAG Development
    Airflow Best Practices DAG Development
    Validating and Testing DAGs
    Airflow Operators (NGAP 2)
    Airflow REST API
    Airflow Basics Overview Presentation
    Internal documents to guide Airflow developers
    Release 32 adds SQS Operator Snowflake python connector Cerberus libraries R Libraries
    Airflow Operations
    FAQs DAG problems Airflow tasks etc
    Airflow Operations Best Practices
    Content on common Airflow problems
    Key points on Airflow operations 
    Airflow Production DAG Status Dashboard
    For DTC Users NGAP2 Airflow Status Dashboard
    For GCK Users NGAP2 Airflow Status Dashboard
    For Default Site NGAP 20 Airflow DAG Status Dashboard
    These Dashboards are meant to check the realtime status for the DAG runs in the Production Environments for past 3 days
    Currently published in Default DTCAnalytics and GCK Sites
    Airflow Clusters
    Each NGAP 2 team that needs Airflow will get a dedicated Airflow cluster with 3 user components
    Airflow URLhttpairflowngap2nikecom8080
    Celery Flower URLhttpflowerngap2nikecom8080
    If found necessary teams can request multiple AF clusters for use in their development workflow (devqaprod) 
    Team Access to Clusters
    Team
    URLs
    Notesdetail
    Cluster Names(lower case)
    plugin version
    Team
    URLs
    Notesdetail
    Cluster Names(lower case)
    plugin version
    GCK
    httpairflowgckngap2nikecom8080
    httpflowergckngap2nikecom8080
    httpttygckngap2nikecom8080
    GCK PROD with Tensorflow instances
    airflowtropospheretensorflowr
    gck

    0126
    GCKWG
    httpairflowgckwgngap2nikecom8080
    httpflowergckwgngap2nikecom8080
    httpttygckwgngap2nikecom8080
    GCKWG Prod
    gckwg
    0127
    GCK Engineering
    httpairflowgckengineeringngap2nikecom8080
    httpflowergckengineeringngap2nikecom8080
    httpttygckengineeringngap2nikecom8080
    GCK Engineering PROD
    gckengineering
    anacondar
    0126

    httpairflowgckengineeringdevngap2nikecom8080
    httpflowergckengineeringdevngap2nikecom8080
    httpttygckengineeringdevngap2nikecom8080
    GCK Engineering Dev
    gckengineeringdev
    anacondar
    0127

    httpairflowgckengineeringqangap2nikecom8080
    httpflowergckengineeringqangap2nikecom8080
    httpttygckengineeringqangap2nikecom8080
    GCK Engineering QA
    gckengineeeringqa
    anacondar
    0127

    httpairflowgckengineeringuatngap2nikecom8080
    httpflowergckengineeringuatngap2nikecom8080
    httpttygckengineeringuatngap2nikecom8080
    GCK Engineering UAT
    gckengineeeringuat
    0127
    GCK Engineering WG
    httpairflowgckengineeringwgngap2nikecom8080
    httpflowergckengineeringwgngap2nikecom8080
    httpttygckengineeringwgngap2nikecom8080
    GCK Eng WG PROD
    gckengineeringwg
    0126
    ChinaBIEngineering
    httpairflowchinabiengineeringngap2nikecom8080
    httpflowerchinabiengineeringngap2nikecom8080
    httpttychinabiengineeringngap2nikecom8080
    ChinaBI Engineering Prod
    ChinaBIEngineering
    0128
    DTC
     httpairflowdtcngap2nikecom8080
    httpflowerdtcngap2nikecom8080
    httpttydtcngap2nikecom8080
    DTC PROD
    dtc
    anacondar
    0126

    httpairflowdtcdevngap2nikecom8080
    httpflowerdtcdevngap2nikecom8080
    httpttydtcdevngap2nikecom8080
    DTC DEV
    dtcdev
    anacondar
    0127

    httpairflowdtcqangap2nikecom8080
    httpflowerdtcqangap2nikecom8080
    httpttydtcqangap2nikecom8080
    DTC QA
    dtcqa
    anacondar
    0126
    DTC Engineering
    httpairflowdtcengineeringngap2nikecom8080
    httpflowerdtcengineeringngap2nikecom8080
    httpttydtcengineeringngap2nikecom8080
    DTCengineering PROD
    dtcengineering

    0125

    httpairflowdtcengineeringdevngap2nikecom8080
    httpflowerdtcengineeringdevngap2nikecom8080
    httpttydtcengineeringdevngap2nikecom8080
    DTCengineering DEV
    dtcengineeringdev

    0127

    httpairflowdtcengineeringqangap2nikecom8080
    httpflowerdtcengineeringqangap2nikecom8080
    httpttydtcengineeringqangap2nikecom8080
    DTCengineering QA
    dtcengineeringqa

    0126
    DTC Engineering WG
    httpairflowdtcengineeringwgngap2nikecom8080
    httpflowerdtcengineeringwgngap2nikecom8080
    httpttydtcengineeringwgngap2nikecom8080
    DTCengineeringWG PROD
    dtcengineeringwg
    0125
    RDF Engineering
    httpairflowrdfengineeringngap2nikecom8080
    httpflowerrdfengineeringngap2nikecom8080
    httpttyrdfengineeringngap2nikecom8080
    RDF Engineering Prod
    rdfengineering
    0124

    httpairflowrdfengineeringdevngap2nikecom8080
    httpflowerrdfengineeringdevngap2nikecom8080
    httpttyrdfengineeringdevngap2nikecom8080
    RDF Engineering DEV
    rdfengineeringdev
    0127

    httpairflowrdfengineeringqangap2nikecom8080
    httpflowerrdfengineeringqangap2nikecom8080
    httpttyrdfengineeringqangap2nikecom8080
    RDF Engineering QA
    rdfengineeringqa
    0127
    RDF Engineering WG
    httpairflowrdfengineeringwgngap2nikecom8080
    httpflowerrdfengineeringwgngap2nikecom8080
    httpttyrdfengineeringwgngap2nikecom8080
    RDF Engineering WG Prod
    rdfengineering
    0126

    httpairflowrdfengineeringwgqangap2nikecom8080
    httpflowerrdfengineeringwgqangap2nikecom8080
    httpttyrdfengineeringwgqangap2nikecom8080
    RDF Engineering WG QA
    rdfengineeringqa
    0127
    EU BI
    httpairfloweubidevngap2nikecom8080
    httpflowereubidevngap2nikecom8080
    httpttyeubidevngap2nikecom8080
    EU BI Dev
    eubidev
    0127

    httpairfloweubiqangap2nikecom8080
    httpflowereubiqangap2nikecom8080
    httpttyeubiqangap2nikecom8080
    EU BI QA
    eubiqa
    0127

    httpairfloweubingap2nikecom8080
    httpflowereubingap2nikecom8080
    httpttyeubingap2nikecom8080
    EU BI Prod
    eubi
    0124
    NikePlus
    httpairflownikeplusngap2nikecom8080
    httpflowernikeplusngap2nikecom8080
    httpttynikeplusngap2nikecom8080
    NIKEPLUS PROD
    nikeplus

    0126

    httpairflownikeplusqangap2nikecom8080
    httpflowernikeplusqangap2nikecom8080
    httpttynikeplusqangap2nikecom8080
    NIKEPLUS NONPROD QA with Tensorflow instances

    nikeplusqa
    0127
    RTLengineering
    httpairflowrtlengineeringngap2nikecom8080
    httpflowerrtlengineeringngap2nikecom8080
    httpttyrtlengineeringngap2nikecom8080
    RTLEngineering PROD
    rtlengineering
    0126

    httpairflowrtlengineeringdevngap2nikecom8080
    httpflowerrtlengineeringdevngap2nikecom8080
    httpttyrtlengineeringdevngap2nikecom8080
    RTLEngineering Dev
    rtlengineeringdev
    0127

    httpairflowrtlengineeringqangap2nikecom8080
    httpflowerrtlengineeringqangap2nikecom8080
    httpttyrtlengineeringqangap2nikecom8080
    RTLEngineering QA
    rtlengineeringqa
    0127
    UserServicesSocial
    httpairflowuserservicessocialngap2nikecom
    httpfloweruserservicessocialngap2nikecom
    httpttyuserservicessocialngap2nikecom
    User Services Social Prod
    userservicessocial
    0124

    httpairflowuserservicessocialdevngap2nikecom
    httpfloweruserservicessocialdevngap2nikecom
    httpttyuserservicessocialdevngap2nikecom
     User Service Social Dev
    userservicessocialdev
    0127
    NDeInnovation
    airflowndeinnovationdevngap2nikecom8080
    flowerndeinnovationdevngap2nikecom8080
    ttyndeinnovationdevngap2nikecom8080
    NDeInnovation Dev
    ndeinnovationdev
    0127

    airflowndeinnovationngap2nikecom8080
    flowerndeinnovationngap2nikecom8080
    ttyndeinnovationngap2nikecom8080
    NDeInnovation Prod
    ndeinnovation
    0124
    RetailExperience
    httpairflowretailexperiencedevngap2nikecom8080
    httpflowerretailexperiencedevngap2nikecom8080
    httpttyretailexperiencedevngap2nikecom8080
    RetailExperience Dev
    retailexperiencedev
    0127

    httpairflowretailexperiencengap2nikecom8080
    httpflowerretailexperiencengap2nikecom8080
    httpttyretailexperiencengap2nikecom8080
    RetailExperience Prod
    retailexperience
    0124
    RFSengineeringWG
    httpairflowrfsengineeringwgqangap2nikecom8080
    httpflowerrfsengineeringwgqangap2nikecom8080
    httpttyrfsengineeringwgqangap2nikecom8080
    RFSengineeringWG QA
    rfsengineeringwgqa
    0127

    httpairflowrfsengineeringwgngap2nikecom8080
    httpflowerrfsengineeringwgngap2nikecom8080
    httpttyrfsengineeringwgngap2nikecom8080
    RFSengineeringWG Prod
    rfsengineeringwg
    0122
    MPAA
    httpairflowmpaangap2nikecom8080
    httpflowermpaangap2nikecom8080
    httpttympaangap2nikecom8080
    MPAA Prod
    mpaa
    0124

    httpairflowmpaaqangap2nikecom8080
    httpflowermpaaqangap2nikecom8080
    httpttympaaqangap2nikecom8080
    MPAA QA
    mpaaqa
    0127
    Integrated Knowledge
    httpairflowintegratedknowledgengap2nikecom8080
    httpflowerintegratedknowledgengap2nikecom8080
    httpttyintegratedknowledgengap2nikecom8080
    Integrated Knowledge Prod
    integratedknowledge
    0125
    Search Engineering
    httpairflowsearchengineeringngap2nikecom
    httpflowersearchengineeringngap2nikecom
    httpttysearchengineeringngap2nikecom
    Search Engineering Prod
    searchengineering
    0125
    GlobalSupplyChain
    httpairflowglobalsupplychainngap2nikecom
    httpflowerglobalsupplychainngap2nikecom
    httpttyglobalsupplychainngap2nikecom
    GlobalSupplyChain Prod
    globalsupplychain
    0125
    Platform Support
    httpairflowplatformsupportngap2nikecom8080admin
    httpflowerplatformsupportngap2nikecom8080
    httpttyplatformsupportngap2nikecom8080
    Platform Support
    platformsupport
    0125
    PME
    httpairflowpmengap2nikecom8080
    httpflowerpmengap2nikecom8080
    httpttypmengap2nikecom8080
    PME
    pme
    0126
    Adhoc Query Cluster for spinning up Presto Enabled EMR clusters everyday
    httpairflowaqclusterngap2nikecom8080
    httpfloweraqclusterngap2nikecom8080
    httpttyaqclusterngap2nikecom8080
    Alpha
    aqcluster
    0126
    DSMPlanningAnalytics
    httpairflowdsmplanninganalyticsngap2nikecom8080
    httpflowerdsmplanninganalyticsngap2nikecom8080
    httpttydsmplanninganalyticsngap2nikecom8080
    DSMPlanningAnalytics
    dsmplanninganalytics
    0127

    httpairflowdsmplanninganalyticsqangap2nikecom8080
    httpflowerdsmplanninganalyticsqangap2nikecom8080
    httpttydsmplanninganalyticsqangap2nikecom8080
    DSMPlanningAnalytics QA
    dsmplanninganalyticsqa
    0127
    GlobalROSE
    httpairflowglobalrosengap2nikecom8080
    httpflowerglobalrosengap2nikecom8080
    httpttyglobalrosengap2nikecom8080
    GlobalRose
    globalrose
    0127
    EDAAML
    httpairflowedaamlngap2nikecom8080
    httpfloweredaamlngap2nikecom8080
    httpttyedaamlngap2nikecom8080
    EDAAML
    edaaml
    0127
    Accessing Airflow GPU Nodes
Airflow GPU worker nodes have tty access while the instance is spun up; the access is revoked when the instance is terminated.
This applies to Tensorflow worker instances (p2.xlarge, p2.8xlarge and p2.16xlarge) and AnacondaR worker instances.
The Airflow operators TensorflowSpinUp and TensorflowTerminate are used in Airflow DAGs to spin up and terminate the Tensorflow instances (see the linked operator documentation for details). Similarly, the operators AnacondaRSpinUpOperator and AnacondaRTerminateOperator are used in Airflow DAGs to spin up and terminate the AnacondaR instances.
AnacondaR
An example of a tty endpoint of an AnacondaR worker node is given below:
ttytfclusternamequeuenamengap2nikecom8080
where
clustername is the name of the Airflow cluster (e.g. dtc, dtcdev, dtcqa, gckengineering, etc.)
queuename is the name of the AnacondaR queue (e.g. anacondarr38xlarge1)
If an instance is spun up in the dtcdev environment in the queue anacondarr38xlarge1, the endpoint will look like:
ttytfdtcdevanacondarr38xlarge1ngap2nikecom8080
Tensorflow
An example of a tty endpoint of a Tensorflow worker node is given below:
ttytfclusternamequeuenamengap2nikecom8080
where
clustername is the name of the Airflow cluster (e.g. nikeplus, nikeplusqa)
queuename is the name of the Tensorflow queue (e.g. tensorflow_tfp2xlarge1)
If an instance is spun up in the nikeplusqa environment in the queue tensorflow_tfp2xlarge2, the endpoint will look like:
ttytfnikeplusqatensorflow_tfp2xlarge2ngap2nikecom8080
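A rough sketch of how the spin-up/terminate operators mentioned above are typically wired around a GPU task. The operator arguments are omitted because their exact signatures are not shown in this document; the training script path is a placeholder, and the queue name simply reuses the Tensorflow queue example above.

from airflow.operators import TensorflowSpinUp, TensorflowTerminate, BashOperator

# assumes `dag` is defined as in the DAG examples later in this document
spin_up = TensorflowSpinUp(task_id='spin_up_tf', dag=dag)           # real parameters: see the NGAP2 operator docs
train = BashOperator(task_id='train_model',
                     bash_command='python /app/bin/train.py ',       # placeholder training script
                     queue='tensorflow_tfp2xlarge1',                  # queue name from the example above
                     dag=dag)
terminate = TensorflowTerminate(task_id='terminate_tf', dag=dag)     # real parameters: see the NGAP2 operator docs

spin_up.set_downstream(train)
train.set_downstream(terminate)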


    Airflow Plugin Version
0126: EMR plugin changes for Cerebro
0125: GenieSnowflakeOperator
0124: Critical security updates for the Intel CPU issues; OS moved from Ubuntu to Amazon Linux; operator changes (Athena moved to boto, EMR clusters with dstools instance fleet, timeout_duration fixes, Cerebro integration, Genie Pig operator)
0123: GeniePigBatchOperator; plugin changes so airflow does not error out when the EMR already exists; extended timeout for airflow EMR spinup in case of Instance Fleet; extended timeout for dstools cluster spinup; airflow operator change for Cerebro
0122: Support for Instance Fleet with AnacondaR and TF workers; additional enhancements for the AnacondaR worker
0121: Plugin changes for Instance Fleet
0120: Fix for SLAWatcher; TTY access for dynamic AnacondaR instances
0119: Tableau Extract Refresh operator; AnacondaR worker spin up & terminate; TTY access for TF instances
Airflow Best Practices: DAG Development
1. It's recommended to go through the Airflow tutorial to understand the basics of Airflow: http://pythonhosted.org/airflow/tutorial.html
2. Default Args: fill in your team information (owner, email) to identify your DAG on the UI and get notified if the DAG fails.

Here is the DEFAULT_ARGS we have been using in the development environment:
DEFAULT_ARGS = {
    'owner': 'etldev',
    'depends_on_past': True,
    'wait_for_downstream': True,
    'start_date': datetime(2017, 1, 18),
    'email': ['xxx@nike.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'provide_context': True,
    'retry_delay': timedelta(minutes=2),
}
3. When setting up the default args we recommend setting depends_on_past and wait_for_downstream to True and setting the last task as the downstream of the first task. This way the DAG will wait for the previous day's run to complete before running the current day. However, this is also the common cause of a task not being kicked off: the previous day's task wasn't complete.
4. Airflow triggers a schedule only after the last minute of the covered day is complete. That means if you want a DAG to start running on Feb 7 2017, the DAG will run at 4:00pm (or 5:00pm during summer hours) on Feb 7, which is already Feb 8 in UTC.
5. Make sure to update the start date: if the start date is in the past, Airflow will try to run all the previous days before catching up.
6. For EMR DAGs we recommend retrying two times before failing the task.
7. Name your DAG so you can easily identify which directory it resides under in the /var/lib/airflow/dags directory.
8. Validate the DAG before deploying it to the cluster to check for errors by running 'airflow list_dags -sd <dag file>', e.g. airflow list_dags -sd test.py
9. Use persistent clusters.
10. Use a sensor prior to the BatchStartOperator (see the sketch below).
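A minimal sketch of item 10 above, wiring a sensor in front of the NGAP2 BatchStartOperator and reusing the DEFAULT_ARGS shown earlier; the dag_id, schedule and upstream dag/task ids are placeholders.

from airflow import DAG
from airflow.operators import ExternalTaskSensor, BatchStartOperator

dag = DAG('my_batch_dag',                      # placeholder dag_id
          default_args=DEFAULT_ARGS,           # the default args defined above
          schedule_interval='@daily')          # placeholder schedule

# gate the batch on the final task of an upstream DAG
wait_for_upstream = ExternalTaskSensor(
    task_id='wait_for_upstream',
    external_dag_id='upstream_dag',            # placeholder
    external_task_id='final_task',             # placeholder
    dag=dag)

start = BatchStartOperator(dag=dag)
start.set_upstream(wait_for_upstream)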




     
    Parameters
    · task_id (string) – a unique meaningful id for the task
    · owner (string) – the owner of the task using the unix username is recommended
    · retries (int) – the number of retries that should be performed before failing the task
    · retry_delay (timedelta) – delay between retries
· start_date (datetime) – start date for the task; the scheduler will start from this point in time. Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
    · end_date (datetime) – if specified the scheduler won’t go beyond this date
    · depends_on_past (bool) – when set to true task instances will run sequentially while relying on the previous task’s schedule to succeed The task instance for the start_date is allowed to run
    · wait_for_downstream (bool) – when set to true an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully before it runs This is useful if the different instances of a task X alter the same asset and this asset is used by tasks downstream of task X Note that depends_on_past is forced to True wherever wait_for_downstream is used
    · queue (str) – which queue to target when running this job Not all executors implement queue management the CeleryExecutor does support targeting specific queues
    · dag (DAG) – a reference to the dag the task is attached to (if any)
    · priority_weight (int) – priority weight of this task against other task This allows the executor to trigger higher priority tasks before others when things get backed up
    · pool (str) – the slot pool this task should run in slot pools are a way to limit concurrency for certain tasks
· sla (datetime.timedelta) – time by which the job is expected to succeed. Note that this represents the timedelta after the period is closed. For example, if you set an SLA of 1 hour, the scheduler would send an email soon after 1:00AM on 2016-01-02 if the 2016-01-01 instance has not succeeded yet. The scheduler pays special attention to jobs with an SLA and sends alert emails for SLA misses. SLA misses are also recorded in the database for future reference. All tasks that share the same SLA time get bundled in a single email sent soon after that time. SLA notifications are sent once and only once for each task instance.
· execution_timeout (datetime.timedelta) – max time allowed for the execution of this task instance; if it goes beyond, it will raise and fail
· on_failure_callback (callable) – a function to be called when a task instance of this task fails. A context dictionary is passed as a single parameter to this function. Context contains references to related objects of the task instance and is documented under the macros section of the API
· on_retry_callback – much like the on_failure_callback except that it is executed when retries occur
· on_success_callback (callable) – much like the on_failure_callback except that it is executed when the task succeeds
· trigger_rule (str) – defines the rule by which dependencies are applied for the task to get triggered. Options are { all_success | all_failed | all_done | one_success | one_failed | dummy }; default is all_success. Options can be set as a string or using the constants defined in the static class airflow.utils.TriggerRule
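A short sketch pulling several of these parameters together; the dag_id, bash command, callback body, SLA and timeout values are illustrative placeholders rather than NGAP2 requirements.

from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta

def notify_failure(context):
    # placeholder failure handler; `context` carries the task-instance details
    print('Task failed: {}'.format(context['task_instance']))

dag = DAG('params_demo', default_args={'owner': 'airflow', 'start_date': datetime(2017, 1, 18)})

task = BashOperator(
    task_id='nightly_extract',
    bash_command='echo extract ',
    retries=2,
    retry_delay=timedelta(minutes=5),
    sla=timedelta(hours=1),                  # expect success within 1 hour of the period close
    execution_timeout=timedelta(minutes=30),
    on_failure_callback=notify_failure,
    trigger_rule='all_success',
    dag=dag)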
    Validating and Testing the DAG
    Reference
http://pythonhosted.org/airflow/tutorial.html
     
    Validating the Script
Time to run some tests. First, let's make sure that the pipeline parses. Let's assume we're saving the code from the previous step in tutorial.py in the DAGs folder referenced in your airflow.cfg. The default location for your DAGs is ~/airflow/dags.
     
python ~/airflow/dags/tutorial.py
     
    If the script does not raise an exception it means that you haven’t done anything horribly wrong and that your Airflow environment is somewhat sound
     
    Command Line Metadata Validation
    Let’s run a few commands to validate this script further
     
# validate dag for one dag file
airflow list_dags -sd <dag file>
e.g. airflow list_dags -sd /var/lib/airflow/dags/platform/sample.py

# prints the list of tasks for a dag that is named sample
airflow list_tasks -sd <dag file> <dag_id>
e.g. airflow list_tasks -sd /var/lib/airflow/dags/platform/sample.py sample

# prints the hierarchy of tasks in the tutorial DAG
airflow list_tasks tutorial -sd <dag file> --tree
     
     
Testing
Let's test by running the actual task instances for a specific date. The date specified in this context is an execution_date, which simulates the scheduler running your task or DAG at a specific date + time.
# command layout: command subcommand dag_id task_id date
# testing print_date
airflow test -sd <dag file> <dag_id> <task_id> <date>
e.g. airflow test -sd /var/lib/airflow/dags/sample.py sample sample_task 2015-06-01

     
    This should result in displaying a verbose log of events and ultimately running your bash command and printing the result
Note that the airflow test command runs task instances locally, outputs their log to stdout (on screen), doesn't bother with dependencies, and doesn't communicate state (running, success, failed, ...) to the database. It simply allows testing a single task instance.
    Airflow operators
Import airflow operators in the DAG:
from airflow.operators import *
    NGAP2 Airflow Operators 
    All Airflow out of the box operators are available
    · AnacondaR Operator
    · AthenaOperator
    · BashOperator (Ngap2)
    · BatchStartOperator (NGAP2)
    · BoxFileTransferOperator
    · CanaryOperator (Ngap2)
    · CatchUpOperator
    · DMDoneOperator(Ngap2)
    · DMErrorOperator (Ngap2)
    · DMStartOperator (Ngap2)
    · EMR Operator (Ngap2)
    · GenieDistCpOperator (Ngap2)
    · GenieHadoopOperator (Ngap2)
    · GenieHiveOperator (Ngap2)
    · GeniePigOperator (Ngap2)
    · GenieSnowflakeOperator (Ngap2)
    · GenieSparkOperator (Ngap2)
    · GenieSqoopBatchOperator(Ngap2)
    · GenieSqoopOperator(Ngap2)
    · InsightsOperator(Ngap2)
    · Slack Operator
    · SnowFlakeOperator (Ngap 2)
    · SQS Operators
    · TabExtractRefreshOperator (NGAP2)
    · TaskDependencySensor (NGAP2)
    · TensorFlow Operators
    Standard Airflow Operators
See the Airflow documentation for details: http://pythonhosted.org/airflow/code.html#operators
    Airflow FAQ (Ngap2)
Refer to the Airflow FAQ: http://pythonhosted.org/airflow/faq.html
    · How do I deploy the DAG scripts to my airflow environment
    · How do I know if my task was forced run or marked success
    · How to create a common module to be used for airflow dags
    · How to deploy profile to airflow cluster
    · How to schedule a dag to run weekday only
    · How to schedule multiple runs in a day
    · My dag is not refreshed in the web UI
    · Use Hive shared metastore for Pig jobs
    · Why is my task not running
    Airflow Best Practices Operations
1. Make sure to run 'airflow list_dags -sd <dag file>' to validate the metadata before deploying the DAG.
2. Clean up DAGs that are not needed in the cluster. The number of DAGs in the cluster has an impact on the performance of the cluster: you will see more delay with more DAGs. See Purging Unused Airflow DAGS.
3. When you put a DAG on hold, make sure to clear out any running tasks. Any sensor task that is running will keep occupying a slot on the worker node.
4. When you want to mark a running task success, you have to hold the DAG, clear the task and then mark the task success. This will clear out the slot on the worker. If you just mark a running task success, the task will continue to run in the backend, keep occupying a slot, and you won't be able to clear the slot.
5. When promoting a DAG from one environment to another, make sure to update the start date of the DAG.
6. To clean up the xcom table, use the attached delete_xcom.py script from any Airflow TTY instance. This will clean up any data older than 30 days from the relevant metadata table:
> python delete_xcom.py
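The bundled script is not reproduced in this document; below is only a minimal sketch of the cleanup it describes (delete xcom rows older than 30 days), using the Airflow metadata session.

# sketch of an xcom cleanup similar to the delete_xcom.py described above
from datetime import datetime, timedelta

from airflow import settings
from airflow.models import XCom

session = settings.Session()
cutoff = datetime.utcnow() - timedelta(days=30)
deleted = session.query(XCom).filter(XCom.execution_date < cutoff).delete(synchronize_session=False)
session.commit()
print("Deleted {} xcom rows older than 30 days".format(deleted))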
    Rest API for Airflow
The source of this Rest API plugin is https://github.com/teamclairvoyant/airflow-rest-api-plugin
The Rest API can be called using curl commands to run Airflow operations such as running a task or pausing a DAG.
To use the Rest API you will have to get the authentication token, just like when you need to access any URL from NGAP2.
The detailed documentation of the API is in your airflow UI:
http://<your airflow host>:8080/admin/rest_api/
    Or
    From the Airflow UI Admin → Rest API Plugin
     

     
    Examples
AUTH_TOKEN=<your authentication token>

# UNPAUSE_DAG
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflowgckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/unpause?dag_id=test_R"

# PAUSE_DAG
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflowgckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/pause?dag_id=test_R"

# TASK STATE
# run_r
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflowgckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/task_state?dag_id=test_R&task_id=run_r&execution_date=2017-04-18&subdir="

# run_r_beeline
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflowgckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/task_state?dag_id=test_R&task_id=run_r_beeline&execution_date=2017-04-18&subdir="

# TEST A TASK: run_r_beeline
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflowgckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/test?dag_id=test_R&task_id=run_r_beeline&execution_date=2017-04-18&subdir=&task_params="

# RUN A TASK
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflowgckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/run?dag_id=test_R&task_id=run_r&execution_date=2017-04-18&subdir=&force=on&ignore_dependencies=on&ignore_depends_on_past=on"

# UNPAUSE_DAG
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflowgckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/unpause?dag_id=test_R"
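For anyone scripting against the plugin from Python instead of curl, a minimal sketch using requests; the host, cookie name and endpoint follow the curl examples above, and the token value is a placeholder.

import requests

AUTH_TOKEN = '<your token>'
BASE = 'http://airflowgckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0'

# unpause the test_R dag, same as the first curl example above
resp = requests.get(BASE + '/unpause',
                    params={'dag_id': 'test_R'},
                    cookies={'cdtplatformauth': AUTH_TOKEN})
print(resp.status_code, resp.text)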
    Batch code deployment in Airflow
To deploy batch code from the BIG project (https://bitbucket.nike.com/projects/BIG) to an NGAP2 Airflow cluster:
Create an 'airflow_deploy' folder in the awsbatch project and put the scripts or prop folder inside it.
Any folder/scripts under the airflow_deploy directory will be synced down to all the airflow nodes: the scheduler and all worker nodes.
See example:

In airflow the code will land in /app/bin/dtccommerce/airflow_deploy/prop and /app/bin/dtccommerce/airflow_deploy/scripts.


Go to Jenkins (http://jenkins.ngap.nike.com:8080/view/deployment/job/airflowbatchdeploy) and enter the repository, branch and the target airflow cluster name.

After the Jenkins deploy, the code will be synced down to the airflow cluster every 5 minutes.

When you need to access the code from a DAG, use the exact path under /app/bin (e.g. /app/bin/dtccommerce/airflow_deploy).
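A minimal sketch of calling a deployed script from a DAG, reusing the dtccommerce example path above; the dag_id and script name my_job.sh are placeholders.

from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime

dag = DAG('deploy_demo', default_args={'owner': 'etldev', 'start_date': datetime(2017, 1, 18)})

run_deployed_script = BashOperator(
    task_id='run_deployed_script',
    # trailing space keeps Airflow from treating the path as a Jinja template file
    bash_command='bash /app/bin/dtccommerce/airflow_deploy/scripts/my_job.sh ',   # my_job.sh is a placeholder
    queue='airflow',
    dag=dag)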
    Running R script from Airflow AnacondaR worker node
To run an AnacondaR script in Airflow, the cluster has to be set up with an AnacondaR worker node. If you need this, please submit a JIRA ticket to the platform support team to update your Airflow cluster.
If your cluster has the AnacondaR worker node set up, use the 'anacondar' queue to submit the job to that worker.
Use BashOperator to submit a bash command that runs your R script. You will need to activate a virtual environment to run the R script:
source /opt/dstools/Anaconda3/envs/renv/bin/activate renv

If you need R to connect via Beeline to a running EMR cluster, make sure to export the required variables:
export JAVA_HOME=/usr/lib/jvm/default-java
export HIVE_HOME=/usr/local/apache-hive-2.1.0-bin
export PATH=$HIVE_HOME/bin:$PATH
export CLASSPATH=$CLASSPATH:/usr/local/apache-hive-2.1.0-bin/lib/*
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export CLASSPATH=$CLASSPATH:/usr/local/hadoop-2.7.3/lib/*

Note: refer to Data Science Tools (Old) for more information about virtualenv and R script execution on the data science Anaconda node.
Example (running just an R script):

from airflow.operators import BashOperator
task2 = BashOperator(
    task_id='run_r',
    bash_command='source /opt/dstools/Anaconda3/envs/renv/bin/activate renv && Rscript /app/bin/r/a.R',
    queue='anacondar',
    dag=dag)
     
a.R code in the example:
a <- 42
A <- a * 2
print(a)
print(A)
Example (running an R script that connects via Beeline to a running EMR cluster):

task3 = BashOperator(
    task_id='run_r_beeline',
    bash_command='source /opt/dstools/Anaconda3/envs/renv/bin/activate renv && '
                 'export JAVA_HOME=/usr/lib/jvm/default-java && '
                 'export HIVE_HOME=/usr/local/apache-hive-2.1.0-bin && '
                 'export PATH=$HIVE_HOME/bin:$PATH && '
                 'export CLASSPATH=$CLASSPATH:/usr/local/apache-hive-2.1.0-bin/lib/* && '
                 'export HADOOP_HOME=/usr/local/hadoop-2.7.3 && '
                 'export CLASSPATH=$CLASSPATH:/usr/local/hadoop-2.7.3/lib/* && '
                 'Rscript /app/bin/r/rbeeline.R',
    queue='anacondar',
    dag=dag)

rbeeline.R code in the example:
# <emr-master> is a placeholder; the Hive host value was not legible in the source
rBeeQry <- function(qry, host = '<emr-master>.ngap2.nike.com:10000', user = 'hadoop') {
  beecmd <- paste('beeline -u jdbc:hive2://', host, ' -n ', user, " -e '", qry, "'", sep = '')
  res <- system(beecmd, intern = TRUE)
  res
}
hqltxt <- 'select preferred_gender, count(1) as n from member.member group by preferred_gender'
result <- rBeeQry(hqltxt)
    New Relic for Airflow
New Relic is a performance monitoring and management tool which is primarily used to get more insight at the instance level. New Relic will be a part of all the long-running Airflow nodes. The installation link for New Relic is given below:
https://rpm.newrelic.com/get_started_with_server_monitoring
    A sample dashboard is given below

     
     
    New Relic also integrates with Cloud Health to provide more insights about Cost Management and Resource Utilization 
Deleting a DAG
This feature helps to delete a DAG from Airflow.
How it works
It accepts the DAG name as the input parameter and deletes the DAG-related entries from S3, the Airflow scheduler and the Airflow database.
Usage
delete_dag <dag_name>

Note: the sync script running every 15 minutes in the airflow workers will delete the DAG from the workers.
Using BashOperator to persist a variable
You can use BashOperator to persist a variable as long as the value is in the last line of the standard output.
Here is an example of how you do it in a task:
get_latest_s3_file_membership_results = BashOperator(
    task_id='get_latest_s3_file_membership_results',
    bash_command='echo $(aws s3 ls s3://' + profile['S3_NIKEBI_MANAGED'] + '/ck/memberID/userengagement/membership_results --recursive | sort | tail -n 1 | cut -c84-91)',
    queue='airflow',
    xcom_push=True,
    dag=dag
)
The above task echoes the latest s3 directory and persists the value into xcom.
You can pull the value in a successor task by referring to the task_id and the key 'return_value', e.g.:
latest_ms_lkp_visitor_upm_member_file = "{{ task_instance.xcom_pull(key='return_value', task_ids='get_latest_s3_file_ms_lkp_visitor_upm_member') }}"
hive_cmd_membership_results_alter_table = alter_table_membership_results_query.format(
    profile['MEMBER_DB'].lower(), 'membership_results', s3_emr_membership_results, latest_membership_results_file)
To pull the xcom value into the successor task you have to set the flag xcom_pull=True:
alter_membership_results_table = AthenaOperator(
    task_id='alter_membership_results_table',
    queue='airflow',
    xcom_pull=True,
    command=hive_cmd_membership_results_alter_table,
    dag=dag)
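For reference, the same xcom value can also be pulled in a plain BashOperator through Jinja templating; this sketch reuses the task_id and dag from the example above.

downstream_echo = BashOperator(
    task_id='echo_latest_membership_results',
    bash_command='echo "{{ task_instance.xcom_pull(task_ids=\'get_latest_s3_file_membership_results\') }}"',
    queue='airflow',
    dag=dag)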
    Airflow Operation Support Policy
We encourage the user to enter a support ticket to report any Airflow issues using the BIPS Airflow link: Business Intelligence and Analytics Production Support.
1. A support ticket is required for any non-production airflow issues.
2. A support ticket is required for any production airflow issues that need investigation and troubleshooting.
3. The user is encouraged to report Airflow production issues via the Airflow support slack channel #ngapairflow.
a. Please make sure that you address your message to the platform team if a prompt response is expected.

Operator Tutorial
Setting up the sandbox in the Quick Start section was easy; building a production-grade environment requires a bit more work.
These how-to guides will step you through common tasks in using and configuring an Airflow environment.
Setting Configuration Options
The first time you run Airflow, it will create a file called airflow.cfg in your AIRFLOW_HOME directory (~/airflow by default). This file contains Airflow's configuration and you can edit it to change any of the settings. You can also set options with environment variables by using this format: AIRFLOW__{SECTION}__{KEY} (note the double underscores).
For example, the metadata database connection string can either be set in airflow.cfg like this:
[core]
sql_alchemy_conn = my_conn_string
or by creating a corresponding environment variable:
AIRFLOW__CORE__SQL_ALCHEMY_CONN=my_conn_string
You can also derive the connection string at run time by appending _cmd to the key, like this:
[core]
sql_alchemy_conn_cmd = bash_command_to_run
The following config options support this _cmd version:
    · sql_alchemy_conn in [core] section
    · fernet_key in [core] section
    · broker_url in [celery] section
    · result_backend in [celery] section
    · password in [atlas] section
    · smtp_password in [smtp] section
    · bind_password in [ldap] section
    · git_password in [kubernetes] section
The idea behind this is to not store passwords on boxes in plain text files.
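For example, the connection string can be produced by a command at run time instead of being stored in the config file; the secret path below is just a placeholder for wherever your secret lives.

[core]
sql_alchemy_conn_cmd = cat /home/airflow/.secrets/sql_alchemy_conn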
The order of precedence for all config options is as follows:
    1 environment variable
    2 configuration in airflowcfg
    3 command in airflowcfg
    4 Airflow’s built in defaults
Initializing a Database Backend
If you want to take a real test drive of Airflow, you should consider setting up a real database backend and switching to the LocalExecutor.
As Airflow was built to interact with its metadata using the great SqlAlchemy library, you should be able to use any database backend supported as a SqlAlchemy backend. We recommend using MySQL or Postgres.
Note
We rely on more strict ANSI SQL settings for MySQL in order to have sane defaults. Make sure to have specified explicit_defaults_for_timestamp=1 in your my.cnf under [mysqld].
Note
If you decide to use Postgres, we recommend using the psycopg2 driver and specifying it in your SqlAlchemy connection string. Also note that since SqlAlchemy does not expose a way to target a specific schema in the Postgres connection URI, you may want to set a default schema for your role with a command similar to ALTER ROLE username SET search_path = airflow, foobar.
Once you've set up your database to host Airflow, you'll need to alter the SqlAlchemy connection string located in your configuration file $AIRFLOW_HOME/airflow.cfg. You should then also change the executor setting to use LocalExecutor, an executor that can parallelize task instances locally.
    # initialize the database
    airflow initdb
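A minimal sketch of the two airflow.cfg settings discussed above, with a placeholder Postgres connection string:

[core]
executor = LocalExecutor
# placeholder credentials and host; point this at your own database
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow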
    Operators
An operator represents a single, ideally idempotent, task. Operators determine what actually executes when your DAG runs.
See the Operator Concepts documentation and the Operators API Reference for more information.
    · BashOperator
    o Templating
    o Troubleshooting
    § Jinja template not found
    · PythonOperator
    o Passing in arguments
    o Templating
    · Google Cloud Platform Operators
    o GoogleCloudStorageToBigQueryOperator
    o GceInstanceStartOperator
    o GceInstanceStopOperator
    o GceSetMachineTypeOperator
    o GcfFunctionDeleteOperator
    § Troubleshooting
    o GcfFunctionDeployOperator
    § Troubleshooting
Airflow Project Examples
The following shares Airflow how-to examples. All examples have been tested on Airflow 1.7.1.3; they should also work on newer Airflow versions but have not been tested there. (NGAP2 runs Airflow 1.7.1.3, the version used for testing.)
Basic examples
· airflow_example_ui_setting: configure how a DAG looks in the Airflow admin web console
· airflow_example_task_branch: 3 examples of choosing a task branch in one DAG
· airflow_example_python_function: invoking a Python function
· airflow_example_pass_params: passing parameters from the CLI when manually triggering a DAG
· airflow_example_skip_task: skipping tasks in a DAG
· airflow_example_depends_dag_first / airflow_example_depends_dag_second: task dependencies across DAGs
· airflow_example_trigger_controller_dag / airflow_example_trigger_target_dag: triggering another DAG
Configuring how a DAG looks in the Airflow admin web console
airflow_example_ui_setting.py
# -*- coding: utf-8 -*-

# Setup how DAG looks in Airflow admin web console

from airflow import DAG
from airflow.operators import DummyOperator
from datetime import datetime, timedelta

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_ui_setting', default_args=args)

# Set description for DAG in Markdown format
dag.doc_md = '''
### DAG Documentation
Put summary information here to describe your DAG.
[airflow](https://airflow.apache.org)
'''

#
# task
#
task1 = DummyOperator(task_id='task1', dag=dag)
task2 = DummyOperator(task_id='task2', dag=dag)

# Set description for Task in Markdown format
task1.doc_md = '''
### Task Documentation
Put summary information here to describe your task; it gets
rendered in the UI's Task Instance Details page.
'''

# Set the background color for the task node in Graph View / Tree View
task1.ui_color = '#FF2D00'
# Set the font color for the task node in Graph View
task1.ui_fgcolor = '#003AFF'

#
# Dependency
#
task2.set_upstream(task1)

if __name__ == "__main__":
    dag.cli()
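To check that this example parses and to run one of its tasks locally, the commands follow the validation section earlier in this document (the execution date is a placeholder):

python airflow_example_ui_setting.py
airflow test -sd airflow_example_ui_setting.py airflow_example_ui_setting task1 2017-01-18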
Task branch with a dynamic condition, a parameter passed by the CLI, or a parameter from a previous task
airflow_example_task_branch.py
    # *codingutf8*

    # Example for task branch with dynamic condition parameter passed by cli paramter from pervious task

    from airflow import DAG
    from airflowoperators import PythonOperator DummyOperator BranchPythonOperator
    from airflowutilstrigger_rule import TriggerRule
    from datetime import datetime timedelta
    import logging

    #
    # args
    #
    days_ago datetimecombine(datetimetoday() timedelta(1) datetimemintime())
    args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' days_ago
    'email' []
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    'max_active_runs' 1
    }

    #
    # dag
    #
    dag DAG('airflow_example_task_branch' default_args args)

    #
    # python function
    #

    def branch_example1_fun()
    return 'example1_fork{}'format(int(datetimenow()strftime('S')) 2 + 1)

    def branch_example2_fun(**kwargs)
    fork_num 1
    if(kwargs['dag_run']conf is not None and kwargs['dag_run']confget('fork_num') is not None)
    fork_num kwargs['dag_run']confget('fork_num')

    return 'example2_fork{}'format(fork_num)

    # there are 2 ways to transfer variable via XCom
    # 1 pushes an XCom with a specific key
    # 2 pushes an XCom just by returning it
    def branch_example3_xcom1_fun(**kwargs)
    kwargs['ti']xcom_push(key'branch_task1' value'example3_fork')
    def branch_example3_xcom2_fun(**kwargs)
    return int(datetimenow()strftime('S')) 2 + 1

    # get variables from both example3_xcom1 and example3_xcom2
    # and connect the variables as the task name for choosing branch
    def branch_example3_fun(**kwargs)
    ti kwargs['ti']
    logginginfo('')
    logginginfo('1 Xcom variable {} 2 Xcom variable {}'format(
    tixcom_pull(keyNone task_ids'example3_xcom1')
    tixcom_pull(task_ids'example3_xcom2')
    ))
    logginginfo('')

    return '{}{}'format(tixcom_pull(keyNone task_ids'example3_xcom1') tixcom_pull(task_ids'example3_xcom2'))

    #
    # Dynamicly choose the branch according to the seconds
    #
    example1_start DummyOperator(task_id'example1_start' dagdag)
    example1_branch BranchPythonOperator(
    task_id'example1_branch'
    python_callablebranch_example1_fun
    dagdag)
    example1_fork1 DummyOperator(task_id'example1_fork1' dagdag)
    example1_fork2 DummyOperator(task_id'example1_fork2' dagdag)
    # the default is ALL_SUCCESS which will skip all downstream tasks Since there are one task skipped
    # so the example1_done need to setup trigger_rule to ONE_SUCCESS
    example1_done DummyOperator(task_id'example1_done' trigger_ruleTriggerRuleONE_SUCCESS dagdag)
    example1_end DummyOperator(task_id'example1_end' dagdag)

    example1_startset_downstream(example1_branch)
    example1_branchset_downstream(example1_fork1)
    example1_branchset_downstream(example1_fork2)
    example1_fork1set_downstream(example1_done)
    example1_fork2set_downstream(example1_done)
    example1_doneset_downstream(example1_end)

    #
    # Default Branch can change branch by parameter passed by cli when manully trigger DAG
    #
    # default branch is fork1 branch use the following cli to trigger dag to execute fork2 branch
    # airflow trigger_dag airflow_example_task_branch c '{fork_num2}'
    example2_start DummyOperator(task_id'example2_start' dagdag)
    example2_branch BranchPythonOperator(
    task_id'example2_branch'
    provide_contextTrue
    python_callablebranch_example2_fun
    dagdag)
    example2_fork1 DummyOperator(task_id'example2_fork1' dagdag)
    example2_fork2 DummyOperator(task_id'example2_fork2' dagdag)
    example2_done DummyOperator(task_id'example2_done' trigger_ruleTriggerRuleONE_SUCCESS dagdag)
    example2_end DummyOperator(task_id'example2_end' dagdag)

    example2_startset_downstream(example2_branch)
    example2_branchset_downstream(example2_fork1)
    example2_branchset_downstream(example2_fork2)
    example2_fork1set_downstream(example2_done)
    example2_fork2set_downstream(example2_done)
    example2_doneset_downstream(example2_end)

    #
    # Dynamic choose branch based on the variable transfered via XComs
    #
    # default branch is fork1 branch use the following cli to trigger dag to execute fork2 branch
    # airflow trigger_dag airflow_example_task_branch c '{fork_num2}'
    example3_start DummyOperator(task_id'example3_start' dagdag)
    example3_xcom1 PythonOperator(
    task_id'example3_xcom1'
    provide_contextTrue
    python_callablebranch_example3_xcom1_fun
    dagdag)
    example3_xcom2 PythonOperator(
    task_id'example3_xcom2'
    provide_contextTrue
    python_callablebranch_example3_xcom2_fun
    dagdag)
    example3_branch BranchPythonOperator(
    task_id'example3_branch'
    provide_contextTrue
    python_callablebranch_example3_fun
    dagdag)
    example3_fork1 DummyOperator(task_id'example3_fork1' dagdag)
    example3_fork2 DummyOperator(task_id'example3_fork2' dagdag)
    example3_done DummyOperator(task_id'example3_done' trigger_ruleTriggerRuleONE_SUCCESS dagdag)
    example3_end DummyOperator(task_id'example3_end' dagdag)

    example3_startset_downstream(example3_xcom1)
    example3_startset_downstream(example3_xcom2)
    example3_xcom1set_downstream(example3_branch)
    example3_xcom2set_downstream(example3_branch)
    example3_branchset_downstream(example3_fork1)
    example3_branchset_downstream(example3_fork2)
    example3_fork1set_downstream(example3_done)
    example3_fork2set_downstream(example3_done)
    example3_doneset_downstream(example3_end)

    if __name__ __main__
    dagcli()
Invoking a Python function
airflow_example_python_function.py
    # *codingutf8*

    # Example for how to invoke python funtion

    from airflow import DAG
    from airflowoperators import PythonOperator
    from datetime import datetime timedelta
    import logging

    #
    # args
    #
    days_ago datetimecombine(datetimetoday() timedelta(1) datetimemintime())
    args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' days_ago
    'email' []
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    'max_active_runs' 1
    }

    #
    # dag
    #
    dag DAG('airflow_example_python_function' default_args args)

    #
    # python function
    #
    def print_context(*args **kwargs)

    # check if args has 2 parameters
    args args if len(args) 2 else [None]*2

    # write log information to the Airflow logs
    logginginfo('')
    logginginfo('Airflow Task Context {}'format(kwargs))
    logginginfo('Airflow PythonOperator op_args Parameters args[0]{} args[1]{}'format(args[0] args[1]))
    logginginfo('Airflow PythonOperator op_kwargs Parameters param1{} param2{}'format(kwargsget('param1') kwargsget('param2')))
    logginginfo('')

    return 'This will be write into airflow log'

    #
    # task
    #
    # python callable without context
    task1 PythonOperator(
    task_id'python_without_context'
    provide_contextFalse
    python_callableprint_context
    dagdag)

    # python callable with context
    task2 PythonOperator(
    task_id'python_with_context'
    provide_contextTrue
    python_callableprint_context
    dagdag)

    # python callable pass parameters
    task3 PythonOperator(
    task_id'python_pass_params'
    provide_contextTrue
    python_callableprint_context
    op_args['args value1' datetimenow()]
    op_kwargs{'param1' 'kwargs value1' 'param2' datetimenow() }
    dagdag)

    #
    # Dependency
    #
    task2set_upstream(task1)

    if __name__ __main__
    dagcli()
Passing parameters from the CLI when manually triggering a DAG
airflow_example_pass_params.py
    # *codingutf8*

    # This is the example of pass parameters from CLI when manully tigger the DAG
    # The following is the Airflow CLI

    # airflow trigger_dag airflow_example_pass_params c '{messagemanual value}'

    from airflow import DAG
    from airflowoperators import BashOperator PythonOperator
    from datetime import datetime timedelta
    import logging

    #
    # args
    #
    days_ago datetimecombine(datetimetoday() timedelta(1) datetimemintime())
    args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' days_ago
    'email' []
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    'max_active_runs' 1
    }

    #
    # dag
    #
    dag DAG('airflow_example_pass_params' default_args args)

    #
    # python function
    #
    def print_context(**kwargs)
    msg None
    if(kwargs['dag_run']conf is not None)
    msg kwargs['dag_run']confget('message')

    logginginfo('Messages from CLI {}'format(msg if msg is not None else 'default value'))

    #
    # task
    #
    task1 PythonOperator(
    task_id'python_task'
    provide_contextTrue
    python_callableprint_context
    dagdag)

    task2 BashOperator(
    task_id'bash_task'
    bash_command'echo Messages from CLI {{ dag_runconf[message] if dag_run and dag_runconf[message] else default value }}'
    dagdag)

    #
    # Dependency
    #
    task2set_upstream(task1)

    if __name__ __main__
    dagcli()
Skipping tasks in a DAG
airflow_example_skip_task.py
    # *codingutf8*

    # Example for skip task

    from airflow import DAG
    from airflowoperators import PythonOperator DummyOperator BranchPythonOperator ShortCircuitOperator
    from airflowutilstrigger_rule import TriggerRule
    from airflowexceptions import AirflowSkipException
    from datetime import datetime timedelta

    #
    # args
    #
    days_ago datetimecombine(datetimetoday() timedelta(1) datetimemintime())
    args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' days_ago
    'email' []
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    'max_active_runs' 1
    }

    #
    # dag
    #
    dag DAG('airflow_example_skip_task' default_args args)

    #
    # python function
    #
    def shortcircuit_example1_fun()
    '''
    return 0(False) or 1(True) according to seconds of execution time
    '''
    return bool(int(datetimenow()strftime('S')) 2)

    def branch_example2_fun(**kwargs)
    '''
    if dag is triggered by scheduler or manually triggered without parameter fork_num
    the default task the example2_fork1
    '''
    fork_num 1
    if(kwargs['dag_run']conf is not None and kwargs['dag_run']confget('fork_num') is not None)
    fork_num kwargs['dag_run']confget('fork_num')

    return 'example2_fork{}'format(fork_num)

    def exception_example3_fun()
    '''
    manully raise a skip exception
    '''
    raise AirflowSkipException('Airflow skip exection is raised')

    #
    # Use ShortCircuitOperator to skip task and all downstream tasks
    #
    example1_start DummyOperator(task_id'example1_start' dagdag)
    example1_branch DummyOperator(task_id'example1_branch' dagdag)
    example1_fork1 DummyOperator(task_id'example1_fork1' dagdag)
    example1_fork2_shortcircuit ShortCircuitOperator(task_id'example1_fork2_shortcircuit'
    python_callableshortcircuit_example1_fun
    dagdag)
    example1_fork2 DummyOperator(task_id'example1_fork2' dagdag)
    example1_done DummyOperator(task_id'example1_done' trigger_ruleTriggerRuleALL_DONE dagdag)
    example1_end DummyOperator(task_id'example1_end' dagdag)

    example1_startset_downstream(example1_branch)
    example1_branchset_downstream(example1_fork1)
    example1_branchset_downstream(example1_fork2_shortcircuit)
    example1_fork2_shortcircuitset_downstream(example1_fork2)
    example1_fork1set_downstream(example1_done)
    example1_fork2set_downstream(example1_done)
    example1_doneset_downstream(example1_end)

    #
    # Default Branch can change branch by parameter passed by cli when manully
    # trigger DAG
    #
    # default branch is fork1 branch use the following cli to trigger dag to
    # execute fork2 branch
    # so you can manully choose which task to skip
    # airflow trigger_dag airflow_example_task_branch c '{fork_num2}'
    example2_start DummyOperator(task_id'example2_start' dagdag)
    example2_branch BranchPythonOperator(task_id'example2_branch'
    provide_contextTrue
    python_callablebranch_example2_fun
    dagdag)
    example2_fork1 DummyOperator(task_id'example2_fork1' dagdag)
    example2_fork2 DummyOperator(task_id'example2_fork2' dagdag)
    example2_done DummyOperator(task_id'example2_done' trigger_ruleTriggerRuleALL_DONE dagdag)
    example2_end DummyOperator(task_id'example2_end' dagdag)

    example2_startset_downstream(example2_branch)
    example2_branchset_downstream(example2_fork1)
    example2_branchset_downstream(example2_fork2)
    example2_fork1set_downstream(example2_done)
    example2_fork2set_downstream(example2_done)
    example2_doneset_downstream(example2_end)

    #
    # Raise Exception to skip task
    #
    example3_start DummyOperator(task_id'example3_start' dagdag)
    example3_branch DummyOperator(task_id'example3_branch' dagdag)
    example3_fork1 DummyOperator(task_id'example3_fork1' dagdag)
    example3_fork2_exception ShortCircuitOperator(task_id'example3_fork2_exception'
    python_callableexception_example3_fun
    dagdag)
    example3_fork2 DummyOperator(task_id'example3_fork2' dagdag)
    example3_done DummyOperator(task_id'example3_done' trigger_ruleTriggerRuleALL_DONE dagdag)
    example3_end DummyOperator(task_id'example3_end' dagdag)

    example3_startset_downstream(example3_branch)
    example3_branchset_downstream(example3_fork1)
    example3_branchset_downstream(example3_fork2_exception)
    example3_fork2_exceptionset_downstream(example3_fork2)
    example3_fork1set_downstream(example3_done)
    example3_fork2set_downstream(example3_done)
    example3_doneset_downstream(example3_end)

    if __name__ __main__
    dagcli()
Task dependencies across DAGs
airflow_example_depends_dag_first.py
    # *codingutf8*

    # Example for tasks dependency that not in the same DAG
    # airflow_example_depends_dag_first > airflow_example_depends_dag_second
    # airflow_example_depends_dag_second will depends on bash_task in airflow_example_depends_dag_first

    from airflow import DAG
    from airflowoperators import DummyOperator BashOperator
    from datetime import datetime timedelta

    #
    # args
    #
    days_ago datetimecombine(datetimetoday() timedelta(1) datetimemintime())
    args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' days_ago
    'email' []
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    'max_active_runs' 1
    }

    #
    # dag
    #
    dag DAG('airflow_example_depends_dag_first' default_args args)

    #
    # task
    #
    start DummyOperator(task_id'start' dagdag)
    bash_task BashOperator(task_id'bash_task'
    bash_command'sleep 300'
    dagdag)
    end DummyOperator(task_id'end' dagdag)

    #
    # Dependency
    #
    startset_downstream(bash_task)
    bash_taskset_downstream(end)

    if __name__ __main__
    dagcli()
    airflow_example_depends_dag_secondpy
    # *codingutf8*

    # Example for tasks dependency that not in the same DAG
    # airflow_example_depends_dag_first > airflow_example_depends_dag_second
    # airflow_example_depends_dag_second will depends on bash_task in airflow_example_depends_dag_first

    # Note ExternalTaskSensor assumes that you are dependent on a task in a dag run with the same execution date
    # or you can setup execution timedelta via either execution_delta or execution_date_fn

    from airflow import DAG
    from airflowoperators import DummyOperator BashOperator
    from airflowoperators import ExternalTaskSensor
    from datetime import datetime timedelta

    #
    # args
    #
    days_ago datetimecombine(datetimetoday() timedelta(1) datetimemintime())
    args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' days_ago
    'email' []
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    'max_active_runs' 1
    }

    #
    # dag
    #
    dag DAG('airflow_example_depends_dag_second' default_args args)

    #
    # python function
    #
    def exec_time()
    '''
    return datetimetimedelta object which determine the execution time
    '''
    return ''

    #
    # task
    #
    start DummyOperator(task_id'start' dagdag)
    sensor ExternalTaskSensor(task_id'sensor'
    external_dag_id'airflow_example_depends_dag_first'
    external_task_id'bash_task'
    dagdag)
    task DummyOperator(task_id'task' dagdag)
    end DummyOperator(task_id'end' dagdag)

    #
    # Dependency
    #
    startset_downstream(sensor)
    sensorset_downstream(task)
    taskset_downstream(end)

    if __name__ __main__
    dagcli()
Triggering another DAG
airflow_example_trigger_controller_dag.py
    # *codingutf8*

    # Example for trigger DAG
    # DAG info reads from Airflow Variables

    This example illustrates the use of the TriggerDagRunOperator There are 2
    entities at work in this scenario
    1 The Controller DAG the DAG that conditionally executes the trigger
    2 The Target DAG DAG being triggered (in example_trigger_target_dagpy)

    A TriggerDagRunOperator will be generated dynamically according to the
    Airflow Variable TRIGGER_DAG TRIGGER_DAG stors the dag_ids that need to be triggered


    from airflow import DAG
    from airflowoperators import DummyOperator PythonOperator TriggerDagRunOperator
    from airflowmodels import Variable
    from datetime import datetime timedelta
    import logging

    # init the Airflow Vairiables
    # for prod env you should setup the Vairiable out side DAG script in advance
    Variableset('TRIGGER_DAGS' ['airflow_example_trigger_target_dag'] serialize_jsonTrue)

    #
    # args
    #
    days_ago datetimecombine(datetimetoday() timedelta(1) datetimemintime())
    args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' days_ago
    'email' []
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    'max_active_runs' 1
    }

    #
    # dag
    #
    dag DAG('airflow_example_trigger_controller_dag' default_args args)

    #
    # python function
    #
    def gen_trigger_dag_operator(upstream_task)
    '''generate TriggerDagRunOperator according to Variable TRIGGER_DAGS'''
    # get Vairable TRIGGER_DAGS
    trigger_dags Variableget('TRIGGER_DAGS' deserialize_jsonTrue) or []
    for dag_id in trigger_dags
    task TriggerDagRunOperator(task_id'trigger_{}'format(dag_id)
    trigger_dag_iddag_id
    python_callablelambda ctx dag_run_obj dag_run_obj
    dagdag)
    taskset_upstream(upstream_task)

    #
    # Task
    #
    start DummyOperator(task_id'start' dagdag)
    gen_trigger_dag_operator(start)

    if __name__ __main__
    dagcli()
    airflow_example_trigger_target_dagpy
    # *codingutf8*

    # Example for trigger DAG
    # DAG info reads from Airflow Variables

    This example illustrates the use of the TriggerDagRunOperator There are 2
    entities at work in this scenario
    1 The Controller DAG the DAG that conditionally executes the trigger
    2 The Target DAG DAG being triggered (in example_trigger_target_dagpy)

    A TriggerDagRunOperator will be generated dynamically according to the
    Airflow Variable TRIGGER_DAG TRIGGER_DAG stors the dag_ids that need to be triggered

    from airflowmodels import DAG
    from airflowoperators import DummyOperator
    from datetime import datetime timedelta
    import logging

    #
    # args
    #
    days_ago datetimecombine(datetimetoday() timedelta(1) datetimemintime())
    args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' days_ago
    'email' []
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    'max_active_runs' 1
    }

    #
    # dag
    #
    dag DAG('airflow_example_trigger_target_dag'
    default_argsargs
    schedule_intervalNone)

    #
    # Task
    #
    run_this DummyOperator(task_id'run_this' dagdag)

    if __name__ __main__
    dagcli()
Advanced examples
· airflow_example_dag.py: reading metadata of a specified DAG
· airflow_example_dagbag.py: getting DAG objects through the DagBag
· airflow_example_user_defined_macros.py: defining macros when creating a DAG
· airflow_example_inherit_operator.py: creating a new operator by inheriting from an existing operator
Reading metadata of a specified DAG
airflow_example_dag.py
    # *codingutf8*

    # Example of DAG

    from airflow import DAG
    from airflowoperators import PythonOperator
    from airflowmodels import DagBag
    from datetime import datetime timedelta

    #
    # args
    #
    days_ago datetimecombine(datetimetoday() timedelta(1) datetimemintime())
    args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' days_ago
    'email' []
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    'max_active_runs' 1
    }

    #
    # dag
    #
    dag DAG('airflow_example_dag' default_args args)

    #
    # python function
    #
    def dag_fun(**kwargs)
    dagbag DagBag()

    # get specific DAG Object
    dag_id 'airflow_example_dagbag'
    dag dagbagdagsget(dag_id)

    dagbagloggerinfo('')
    dagbagloggerinfo(kwargs)
    # Get all Tasks from DAG object
    # task objects
    dagbagloggerinfo('task objects{}'format(dagtasks))
    # task_ids
    dagbagloggerinfo('task ids{}'format(dagtask_ids))

    # Get execute date information
    # Returns the latest date for which at least one task instance exists
    dagbagloggerinfo('latest executtion date{}'format(daglatest_execution_date))

    # Get all DAGrun object
    dagbagloggerinfo('')

    #
    # task
    #
    python_task PythonOperator(task_id'python_task'
    provide_contextTrue
    python_callabledag_fun
    dagdag)

    if __name__ __main__
    dagcli()

Getting DAG objects through the DagBag
airflow_example_dagbag.py
    # *codingutf8*

    # Example of DAGbag

    from airflow import DAG
    from airflowoperators import PythonOperator
    from airflowmodels import DagBag
    from datetime import datetime timedelta

    #
    # args
    #
    days_ago datetimecombine(datetimetoday() timedelta(1) datetimemintime())
    args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' days_ago
    'email' []
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    'max_active_runs' 1
    }

    #
    # dag
    #
    dag DAG('airflow_example_dagbag' default_args args)

    #
    # python function
    #
    def dagbag_fun(**kwargs)
    dagbag DagBag()

    dagbagloggerinfo('')
    dagbagloggerinfo(kwargs)
    # Get all DAGs from Dagbag object
    for key val in dagbagdagsitems()
    dagbagloggerinfo('{}{}'format(key valtask_ids))
    dagbagloggerinfo('')

    #
    # task
    #
    python_task PythonOperator(task_id'python_task'
    provide_contextTrue
    python_callabledagbag_fun
    dagdag)

    if __name__ __main__
    dagcli()
Defining macros when creating a DAG
airflow_example_user_defined_macros.py
# -*- coding: utf-8 -*-

# Example of user defined macros

from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# python function
#
def user_defined_fun(ds):
    '''
    get the date 30 days before the current execution_date
    '''
    ds = datetime.strptime(ds, '%Y-%m-%d')
    ds = ds - timedelta(30)
    return ds.isoformat()[:10]

#
# dag
#
dag = DAG('airflow_example_user_defined_macros',
          default_args=args,
          user_defined_macros={
              'user_defined_fun': user_defined_fun,
          })

#
# task
#
bash_task = BashOperator(task_id='bash_task',
                         bash_command='echo {{ ds }} {{ user_defined_fun(ds) }}',
                         dag=dag)

if __name__ == "__main__":
    dag.cli()
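As a quick worked example of the macro above: for execution date 2017-01-18, bash_task would echo something like '2017-01-18 2016-12-19', since the user-defined macro returns the date 30 days earlier.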
Creating a new operator by inheriting from an existing operator
airflow_example_inherit_operator.py
    # *codingutf8*

    # Example of inherit operator

    from airflow import DAG
    from airflowoperators import BashOperator
    from airflowmodels import BaseOperator
    from datetime import datetime timedelta
    import logging

    #
    # args
    #
    days_ago datetimecombine(datetimetoday() timedelta(1) datetimemintime())
    args {
    'owner' 'airflow'
    'depends_on_past' False
    'start_date' days_ago
    'email' []
    'email_on_failure' False
    'email_on_retry' False
    'retries' 1
    'retry_delay' timedelta(minutes5)
    'max_active_runs' 1

    }

    class InheritBaseOperator(BaseOperator)
    '''
    Inherit from BaseOperator to create user own new Operator
    here is only for demostration that this new Operator is just
    to record context info into log
    '''

    def execute(self context)
    logginginfo('')
    logginginfo('This is user defined Operator the following is info from context')
    logginginfo(context)
    logginginfo('')


    class InheritBashOperator(BashOperator)
    '''
    Inherit from BashOperator overwriting method render_template
    set a new variable user_defined_ds which is date that 30 days before the current execution_date
    '''
    def render_template(self attr content context)
    execution_date context['execution_date']
    context['user_defined_ds'] execution_date + timedelta(30)

    return super(InheritBashOperator self)render_template(attr content context)

    #
    # dag
    #
    dag DAG('airflow_example_inherit_operator' default_argsargs)

    #
    # task
    #
    new_op InheritBaseOperator(task_id'new_op_task' dagdag)
    bash_task InheritBashOperator(task_id'bash_task'
    bash_command'echo {{ user_defined_ds }}'
    dagdag)

    if __name__ __main__
    dagcli()
Complete example
baozun_network_inventory
http://airflowchinabiengineeringe1.ngap2.nike.com:8080/admin/airflow/graph?root=&dag_id=baozun_network_inventory




DAG code for the PySpark jobs
# -*- encoding: utf-8 -*-


from airflow import DAG
from airflow.models import Variable
from airflow.operators import (
    BatchStartOperator,
    BatchEndOperator,
    EmrOperator,
    GenieSparkOperator,
    PythonOperator,
    BashOperator,
    GenieSqoopOperator,
    SlackOperator,
    SnowFlakeOperator,
)
from batch_common import DMError, BICommon
from datetime import datetime, timedelta
import ast


#
# VARIABLES
#
# system variables
profile = BICommon().get_profile()
job_user = profile["JOB_USER"]
env = profile["ENV"]
DEFAULT_QUEUE = "airflow"

# project variables
env_config = ast.literal_eval(Variable.get("se_project_config"))
CLUSTER_NAME = env_config.get("CLUSTER_NAME")
S3_BUCKET = env_config.get("S3_BUCKET")
SCRIPT_BUCKET = env_config.get("SCRIPT_BUCKET")
DML_PATH = env_config.get("DML_PATH")
schedule_interval = env_config.get('SCHEDULE_INTERVAL_BAOZUN_NETWORK_INVENTORY')

# code path
BAOZUN_NETWORK_INVENTORY_SCRIPT = env_config.get('BAOZUN_NETWORK_INVENTORY_SCRIPT')
ONETIME_TABLE_SCRIPT = env_config.get('ONETIME_TABLE_SCRIPT')

#
# dag
#
dag_id = "baozun_network_inventory"

default_args = {
    'owner': 'Vivian Zhao',
    'start_date': datetime(2019, 10, 10),
    'email': ['VivianZhao2@nike.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(seconds=60),
    'queue': DEFAULT_QUEUE,
    'wait_for_downstream': False,
}


dag = DAG(dag_id,
          default_args=default_args,
          schedule_interval=schedule_interval)


#
# tasks
#
start = BatchStartOperator(dag=dag)

spark_s3_cleansed = GenieSparkOperator(task_id='onetime_tables_cleansed',
                                       command='{onetime_table_script}'.format(onetime_table_script=ONETIME_TABLE_SCRIPT),
                                       queue=DEFAULT_QUEUE,
                                       sched_type=CLUSTER_NAME,
                                       dag=dag)


# semantic
spark_s3_semantic = GenieSparkOperator(task_id='baozun_network_inventory_semantic',
                                       command='{sementic_script}'.format(sementic_script=BAOZUN_NETWORK_INVENTORY_SCRIPT),
                                       queue=DEFAULT_QUEUE,
                                       sched_type=CLUSTER_NAME,
                                       dag=dag)


end = BatchEndOperator(dag=dag)


#
# snowflake operator
#
load_snowflake = SnowFlakeOperator(task_id='load_snowflake',
                                   sql_file='{dml_path}BAOZUN_NETWORK_INVENTORY.sql'.format(dml_path=DML_PATH),
                                   conn_id='snowflake',
                                   dag=dag)

#
# dependency
#
start.set_downstream(spark_s3_cleansed)
spark_s3_cleansed.set_downstream(spark_s3_semantic)
spark_s3_semantic.set_downstream(load_snowflake)
load_snowflake.set_downstream(end)

BashOperator
The BashOperator executes commands in a Bash shell.
run_this = BashOperator(
    task_id='run_after_loop', bash_command='echo 1', dag=dag)
Templating
You can use Jinja templates to parameterize the bash_command argument:
task = BashOperator(
    task_id='also_run_this',
    bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
    dag=dag)
Troubleshooting
Jinja template not found
Add a space after the script name when directly calling a Bash script with the bash_command argument. This is because Airflow tries to apply a Jinja template to it, which will fail.
t2 = BashOperator(
    task_id='bash_example',

    # This fails with `Jinja template not found` error
    # bash_command="/home/batcher/test.sh",

    # This works (has a space after)
    bash_command="/home/batcher/test.sh ",
    dag=dag)
    EMR Operator
    Operator to spin up or terminate an EMR

    Parameters
    · cluster_action (Required)  Must be either 'spinup' or 'terminate'
    · cluster_name (Required)  Specify any name you like just be advised that if you have more than one DAG using the same cluster_name that they can submit their jobs to each others' clusters  Cluster name characters limit is set to less than 55 Please make sure not to exceed the limit (Genie routing is by cluster_name)
    · custom_dns (Optional) Custom_dns Field to be used for creating Route53 DNS entries
    · num_core_nodes (Required) Specify the number of EMR core vm nodes to spinup  Please use appropriate discretion and do not allocate more cores than necessary
    · classification (Required) Data security level choose from bronze silver gold platinum
    · project_id (Required) Enter your Relay code  Currently the selection is FYSHAREDBI FY150038CK FY150062DEF FY12126320 Acxiom GCD FY12126395RDF FY12126348CBIT3170113
    · cost_center (Required) Enter the cost center number for your project
    · num_task_nodes (Optional) Specify a number of EMR task vm nodes to request using ec2 Spot Instances  Spot Instances may or may not be available to meet the request
· master_inst_type (Optional) ec2 type for master instances. Allowed values are m3.2xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge. If you omit this parameter, the first value (m3.2xlarge) is used.
· core_inst_type (Optional) ec2 type for core instances. Allowed values are m3.2xlarge, r3.2xlarge, r3.4xlarge. If you omit this parameter, the first value (m3.2xlarge) is used.
· task_int_type (Optional) ec2 type for task instances. Allowed values are m3.2xlarge, r3.2xlarge, r3.4xlarge. If you omit this parameter, core_int_type or the first value (m3.2xlarge) is used.
· core_bid_type (Not used) bid type for the core instances, can be SPOT or ON_DEMAND. For NGAP2 it currently defaults to ON_DEMAND.
· task_bid_type (Not used) bid type for the task instances, can be SPOT or ON_DEMAND. For NGAP2 it currently defaults to SPOT.
· emr_version (Required) version of EMR used to spin up the cluster; allowed values are 5.7.0, 5.9.0, 5.11.0, 5.13.0.
· applications (Optional) List of applications to be installed in EMR. By default hive, pig and spark will be installed. Based on the EMR version, the user can give the list of applications; if an application given is not in the specified EMR version, the EMR spin up will fail. Also, if sqoop is selected as an application, only hive will be installed with it; no other application is allowed with sqoop.
    · properties (Optional) List of configuration properties to pass to configure each instance group see Note
    · bootstrap_actions (Optional) Bootstrap scripts to run at the EMR nodes spin up time See Note
    · long_running_cluster (Optional) True or False Default is False
    · auto_scaling To use Auto Scaling feature in EMR set auto_scaling as True by default it will be False
    1
    a core_max  Maximum limit for core nodes to scale out
    b core_min  Minimum limit for core nodes to scale in
    c task_max  Maximum limit for task nodes to scale out
    d task_min  Minimum limit for task nodes to scale in
    e core_scale_up  Number of core nodes to scale out each timeDefault value is 1
    f core_scale_down  Number of core nodes to scale in each timeDefault value is 1
    g task_scale_up  Number of task nodes to scale out each timeDefault value is 2
    h task_scale_down  Number of task nodes to scale in each timeDefault value is 2
    · email (Required) Specify one or more email addresses to receive alerts in case of error  The first email address will be added as an EC2 tag on all EMR nodes for auditing purposes
    · on_failure_callback (Optional) Specify a function that terminates the EMR cluster in case of error  (Otherwise the cluster will keep running and costing you money)
    · queue (required) always use 'airflow'
    · is_instance_fleet (optional boolean) A boolean used to specify if you are using instance fleets
· instance_fleets (Required when using instance fleets; list) A list of InstanceFleet objects; refer to the Instance Fleet documentation. Example below.
1 instance_fleet_type (str) The node type of the instance fleet [optional]
2 target_on_demand_capacity (int) The target capacity of On-Demand units for the instance fleet [optional]
3 target_spot_capacity (int) The target capacity of Spot units for the instance fleet [optional]
4 instance_type_configs (InstanceTypeConfig) The InstanceTypeConfigs for EMR instance fleets [optional]
5 launch_specifications (LaunchSpecification) An empty wrapper object around the spot spec, assuming aws will add props later [optional]
    · cerebro_cdas (Optional) Cerebro Field to use both metastore and cerebro data access service
    · cerebro_hms(Optional) Cerebro Field to use only the cerebro metastore
    · tags (Optional) We can use this option to add custom tags to the EMR Cluster For Enabling Cerebro using tags please see the description below



    NOTE

1 If sqoop is selected as an application, only hive will be installed alongside it in the EMR. Sqoop and other applications like spark/zeppelin cannot be in the same EMR.
2 properties for setting up the EMRFS consistent view:
· properties=[{"Classification": "emrfs-site", "Properties": {"fs.s3.consistent": "true"}}]
3 By default speculative execution will be false; to set it to true:
· properties=[{"Classification": "mapred-site", "Properties": {"mapreduce.map.speculative": "true", "mapreduce.reduce.speculative": "true"}}, {"Classification": "hive-site", "Properties": {"hive.mapred.reduce.tasks.speculative.execution": "true"}}]
4 To add a bootstrap action to the EMR spin up, here is the format:
bootstrap_actions=[{"Name": ...,
    "ScriptBootstrapAction": {
        "Path": ...,
        "Args": [arguments]
    }}]
e.g.
bootstrap_actions=[{"Name": "EMR Bootstrap smoke test",
    "ScriptBootstrapAction": {
        "Path": "s3://nike-emr-bin/emr_bootstrap_hello2.sh",
        "Args": []
    }}]


Example:
To spin up a cluster:
task1 = EmrOperator(
    task_id='EmrSpinup',
    cluster_action='spinup',
    cluster_name='CLUSTER_NAME',
    custom_dns='True',
    num_task_nodes=1,
    num_core_nodes=1,
    classification='bronze',
    project_id='PROJECT_ID',
    queue='airflow',
    cost_center='COST_CENTER',
    emr_version='5.11.0',
    master_inst_type='m3.xlarge',  # optional
    applications=['hive', 'pig', 'spark'],
    auto_scaling=True,       # optional; use it only if autoscaling is required
    core_max=2,              # optional; required only if autoscaling is true
    core_min=1,              # optional; required only if autoscaling is true
    task_max=2,              # optional; required only if autoscaling is true
    task_min=1,              # optional; required only if autoscaling is true
    core_scale_up=1,         # optional
    core_scale_down=1,       # optional
    task_scale_up=1,         # optional
    task_scale_down=1,       # optional
    dag=dag
)
To terminate a cluster:
task6 = EmrOperator(
    task_id='EmrTerminate',
    cluster_action='terminate',
    cluster_name='CLUSTER_NAME',
    queue='airflow',
    dag=dag
)
Instance fleet example:


from airflow import DAG
from plugins.emr_plugin import EmrOperator
from datetime import datetime
from datetime import timedelta
from emr_client.models.instance_fleet import InstanceFleet
from emr_client.models.instance_type_config import InstanceTypeConfig
from emr_client.models.launch_specification import LaunchSpecification
from emr_client.models.spot_specification import SpotSpecification


CLUSTER_NAME = 'instance_fleet_spot_master_mixed_core_mixed_task'

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(20xx, x, xx),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'queue': 'airflow',
}

instance_fleets = [
    InstanceFleet(
        instance_fleet_type='MASTER',
        target_spot_capacity=1,
        instance_type_configs=[
            InstanceTypeConfig(
                bid_price_as_percentage_of_on_demand_price=100,
                instance_type='m3.xlarge',
                weighted_capacity=1
            )
        ],
        launch_specifications=LaunchSpecification(
            spot_specification=SpotSpecification(
                block_duration_minutes=60,
                timeout_action='TERMINATE_CLUSTER',
                timeout_duration_minutes=5
            )
        )
    ),
    InstanceFleet(
        instance_fleet_type='CORE',
        target_on_demand_capacity=1,
        target_spot_capacity=3,
        instance_type_configs=[
            InstanceTypeConfig(
                bid_price_as_percentage_of_on_demand_price=100,
                instance_type='m3.xlarge',
                weighted_capacity=1
            ),
            InstanceTypeConfig(
                bid_price_as_percentage_of_on_demand_price=100,
                instance_type='m3.2xlarge',
                weighted_capacity=2
            )
        ],
        launch_specifications=LaunchSpecification(
            spot_specification=SpotSpecification(
                block_duration_minutes=60,
                timeout_action='TERMINATE_CLUSTER',
                timeout_duration_minutes=5
            )
        )
    ),
    InstanceFleet(
        instance_fleet_type='TASK',
        target_on_demand_capacity=1,
        target_spot_capacity=3,
        instance_type_configs=[
            InstanceTypeConfig(
                bid_price_as_percentage_of_on_demand_price=100,
                instance_type='m3.xlarge',
                weighted_capacity=1
            ),
            InstanceTypeConfig(
                bid_price_as_percentage_of_on_demand_price=100,
                instance_type='m3.2xlarge',
                weighted_capacity=2
            )
        ],
        launch_specifications=LaunchSpecification(
            spot_specification=SpotSpecification(
                block_duration_minutes=60,
                timeout_action='TERMINATE_CLUSTER',
                timeout_duration_minutes=5
            )
        )
    )
]

dag = DAG(
    'test_instance_fleet',
    default_args=default_args,
    schedule_interval='0 0 * * *'
)

task1 = EmrOperator(
    task_id='EmrSpinup',
    cluster_action='spinup',
    cluster_name=CLUSTER_NAME,
    custom_dns='True',
    num_task_nodes=1,
    num_core_nodes=1,
    classification='bronze',
    project_id='PROJECT_ID',
    queue='airflow',
    cost_center='COST_CENTER',
    emr_version='5.3.0',
    master_inst_type='m3.xlarge',  # optional
    applications=['hive', 'pig', 'spark'],
    is_instance_fleet=True,
    instance_fleets=instance_fleets,
    dag=dag
)

task2 = EmrOperator(
    task_id='EmrTerminate',
    cluster_action='terminate',
    cluster_name=CLUSTER_NAME,
    queue='airflow',
    dag=dag
)

task2.set_upstream(task1)

Cerebro Changes for EMR Operator
We can enable role-based read access through Cerebro/Okera by passing the cerebro_cdas flag to the EmrOperator.
By default, when the flag is set to True, the EMR spun up will look up the environment-specific CDAS cluster,
i.e. an EMR spun up from Airflow DEV will use the CDAS cluster mapped as DEV for the respective NGAP2 AD group.
In addition, a variable can be set up in Airflow Variables which will override the default environment-based CDAS cluster selection for all the DAGs hosted in an Airflow cluster.
Example
task1 = EmrOperator(
    task_id='EmrSpinup',
    cluster_action='spinup',
    cluster_name=CLUSTER_NAME,
    num_task_nodes=1,
    num_core_nodes=1,
    classification='bronze',
    queue='airflow',
    cost_center='COST_CENTER',
    emr_version='5.2.0',
    applications=['hive', 'pig', 'spark'],
    cerebro_cdas=True,
    dag=dag
)
Airflow Variable to Override Environment based CDAS selection

In case we need to override the default setting, we can provide a custom TAG pointing to the specific CDAS cluster:
task1 = EmrOperator(
    task_id='EmrSpinup',
    cluster_action='spinup',
    cluster_name=CLUSTER_NAME,
    num_task_nodes=1,
    num_core_nodes=1,
    classification='bronze',
    queue='airflow',
    cost_center='COST_CENTER',
    emr_version='5.2.0',
    applications=['hive', 'pig', 'spark'],
    cerebro_cdas=True,
    tags=[{"Key": "cerebro_cluster", "Value": "UNIFYTEST"}],
    dag=dag
)

Custom_dns for EMR Operator
If a team wants to create custom DNS entries, follow the syntax below:
task1 = EmrOperator(
    task_id='EmrSpinup',
    cluster_action='spinup',
    cluster_name=CLUSTER_NAME,
    num_task_nodes=1,
    num_core_nodes=1,
    classification='bronze',
    queue='airflow',
    cost_center='COST_CENTER',
    emr_version='5.21.0',
    applications=['hive', 'spark'],
    custom_dns=True,
    dag=dag
)

S3BoxFileTransferOperator
Uploads files from Box to S3. Available only on NGAP 2.0 Airflow.
Prerequisite: in order to use this operator you will have to create an Airflow support request to create a folder in the ngap2 box application httpsnikeentboxcomprofile4003105259
Parameters
· conn_id (string) – The box connection name as configured in Admin -> Connections
· box_dir (string) – Location of the file which needs to be uploaded
· box_file_name (string) file name for renaming
· s3path (string) Full S3 path
· s3_Region (string) provide the respective region where your bucket resides
· file_type (string) provide file types
· box_folder_id (string) For every folder there is an ID (which is provided by the Platform team)
NOTE
1 This is an NGAP 2.0 only feature. All the connection details pertaining to your batch user need to be configured in connections.
2 This operator moves all files from the given box folder to the specified s3 prefix.
Example:
s3_box_transfer_task2 = S3BoxFileTransferOperator(
    task_id='box_s3_step',
    queue='airflow',
    s3path='s3nikeemrbinshabarishtestairflowdagstest',
    box_folder_id='53639622254',
    s3_Region='us-east-1',
    dag=dag
)

PythonOperator
The PythonOperator executes Python callables.
def print_context(ds, **kwargs):
    pprint(kwargs)
    print(ds)
    return 'Whatever you return gets printed in the logs'


run_this = PythonOperator(
    task_id='print_the_context',
    provide_context=True,
    python_callable=print_context,
    dag=dag)
Passing arguments
Use the op_args and op_kwargs arguments to pass additional arguments to the Python callable.
def my_sleeping_function(random_base):
    """This is a function that will run within the DAG execution"""
    time.sleep(random_base)


# Generate 5 sleeping tasks, sleeping from 0 to 0.4 seconds respectively
for i in range(5):
    task = PythonOperator(
        task_id='sleep_for_' + str(i),
        python_callable=my_sleeping_function,
        op_kwargs={'random_base': float(i) / 10},
        dag=dag)

    task.set_upstream(run_this)
Templating
When you set the provide_context argument to True, Airflow passes in an additional set of keyword arguments: one for each of the Jinja template variables, plus a templates_dict argument.
The templates_dict argument is templated, so each value in the dictionary is evaluated as a Jinja template.
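As a minimal sketch of the two points above (the task name, callable, and dictionary key below are illustrative, not part of the original example): with provide_context=True the callable receives the rendered Jinja variables plus the templates_dict argument.
def show_templates(ds, templates_dict=None, **kwargs):
    # ds is the execution date stamp; templates_dict values were already rendered by Jinja
    print('execution date:', ds)
    print('rendered value:', templates_dict['run_date'])

templated_task = PythonOperator(
    task_id='print_templates_dict',
    provide_context=True,
    python_callable=show_templates,
    templates_dict={'run_date': '{{ ds }}'},  # each value is evaluated as a Jinja template
    dag=dag)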
GenieHiveOperator
Submits a Hive command to Genie.
Parameters
· command (string) – the command submitted to Genie
· job_name (string) – Job name to be displayed on the Genie console. The execution date will be appended by the Genie operator automatically.
· sched_type (string) Optional – the type of cluster the job is submitted to; values include adhoc and ETL
· file_dependencies (string) Optional – A comma delimited list of files to be uploaded to Genie for use when submitting the job
e.g. file:///home/ubuntu/test/SimpleApp.py,file:///home/ubuntu/test/lorem.txt
· queue (required) always use 'airflow'
Example:
hive_cmd = '-f s3inbounddatascienceadhocngap2_beta_testsaf_test_scriptssamplehivejobhql'
task3 = GenieHiveOperator(
    task_id='TestHiveStep',
    command=hive_cmd,
    job_name='hivegeniejob',
    queue='airflow',
    sched_type='CLUSTER_NAME',
    dag=dag
)
Note:
For Okera-enabled EMR clusters (for PII data access), the Hive query files (.hql) need to have the commands below added at the start:
    add jar hdfsuserhadooplibokerahivemetastorejar
    add jar hdfsuserhadooplibrecordservicehivejar

    Google Cloud Platform Operators
    GoogleCloudStorageToBigQueryOperator
    Use the GoogleCloudStorageToBigQueryOperator to execute a BigQuery load job
load_csv = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_to_bq_example',
    bucket='cloud-samples-data',
    source_objects=['bigquery/us-states/us-states.csv'],
    destination_project_dataset_table='airflow_test.gcs_to_bq_table',
    schema_fields=[
        {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'post_abbr', 'type': 'STRING', 'mode': 'NULLABLE'},
    ],
    write_disposition='WRITE_TRUNCATE',
    dag=dag)
    GceInstanceStartOperator
    Allows to start an existing Google Compute Engine instance
    In this example parameter values are extracted from Airflow variables Moreover the default_args dict is used to pass common arguments to all operators in a single DAG
PROJECT_ID = models.Variable.get('PROJECT_ID', '')
LOCATION = models.Variable.get('LOCATION', '')
INSTANCE = models.Variable.get('INSTANCE', '')
SHORT_MACHINE_TYPE_NAME = models.Variable.get('SHORT_MACHINE_TYPE_NAME', '')
SET_MACHINE_TYPE_BODY = {
    'machineType': 'zones/{}/machineTypes/{}'.format(LOCATION, SHORT_MACHINE_TYPE_NAME)
}

default_args = {
    'start_date': airflow.utils.dates.days_ago(1)
}
Define the GceInstanceStartOperator by passing the required arguments to the constructor:
gce_instance_start = GceInstanceStartOperator(
    project_id=PROJECT_ID,
    zone=LOCATION,
    resource_id=INSTANCE,
    task_id='gcp_compute_start_task'
)
    GceInstanceStopOperator
    Allows to stop an existing Google Compute Engine instance
    For parameter definition take a look at GceInstanceStartOperator above
    Define the GceInstanceStopOperator by passing the required arguments to the constructor
gce_instance_stop = GceInstanceStopOperator(
    project_id=PROJECT_ID,
    zone=LOCATION,
    resource_id=INSTANCE,
    task_id='gcp_compute_stop_task'
)
    GceSetMachineTypeOperator
    Allows to change the machine type for a stopped instance to the specified machine type
    For parameter definition take a look at GceInstanceStartOperator above
    Define the GceSetMachineTypeOperator by passing the required arguments to the constructor
gce_set_machine_type = GceSetMachineTypeOperator(
    project_id=PROJECT_ID,
    zone=LOCATION,
    resource_id=INSTANCE,
    body=SET_MACHINE_TYPE_BODY,
    task_id='gcp_compute_set_machine_type'
)
    GcfFunctionDeleteOperator
    Use the default_args dict to pass arguments to the operator
PROJECT_ID = models.Variable.get('PROJECT_ID', '')
LOCATION = models.Variable.get('LOCATION', '')
ENTRYPOINT = models.Variable.get('ENTRYPOINT', '')
# A fully-qualified name of the function to delete

FUNCTION_NAME = 'projects/{}/locations/{}/functions/{}'.format(PROJECT_ID, LOCATION,
                                                               ENTRYPOINT)
default_args = {
    'start_date': airflow.utils.dates.days_ago(1)
}
Use the GcfFunctionDeleteOperator to delete a function from Google Cloud Functions:
t1 = GcfFunctionDeleteOperator(
    task_id='gcf_delete_task',
    name=FUNCTION_NAME
)
    Troubleshooting
    If you want to run or deploy an operator using a service account and get forbidden 403 errors it means that your service account does not have the correct Cloud IAM permissions
    1 Assign your Service Account the Cloud Functions Developer role
    2 Grant the user the Cloud IAM Service Account User role on the Cloud Functions runtime service account
    The typical way of assigning Cloud IAM permissions with gcloud is shown below Just replace PROJECT_ID with ID of your Google Cloud Platform project and SERVICE_ACCOUNT_EMAIL with the email ID of your service account
gcloud iam service-accounts add-iam-policy-binding \
  PROJECT_ID@appspot.gserviceaccount.com \
  --member="serviceAccount:[SERVICE_ACCOUNT_EMAIL]" \
  --role="roles/iam.serviceAccountUser"
    See Adding the IAM service agent user role to the runtime service for details
    GcfFunctionDeployOperator
    Use the GcfFunctionDeployOperator to deploy a function from Google Cloud Functions
    The following examples of Airflow variables show various variants and combinations of default_args that you can use The variables are defined as follows
PROJECT_ID = models.Variable.get('PROJECT_ID', '')
LOCATION = models.Variable.get('LOCATION', '')
SOURCE_ARCHIVE_URL = models.Variable.get('SOURCE_ARCHIVE_URL', '')
SOURCE_UPLOAD_URL = models.Variable.get('SOURCE_UPLOAD_URL', '')
SOURCE_REPOSITORY = models.Variable.get('SOURCE_REPOSITORY', '')
ZIP_PATH = models.Variable.get('ZIP_PATH', '')
ENTRYPOINT = models.Variable.get('ENTRYPOINT', '')
FUNCTION_NAME = 'projects/{}/locations/{}/functions/{}'.format(PROJECT_ID, LOCATION,
                                                               ENTRYPOINT)
RUNTIME = 'nodejs6'
VALIDATE_BODY = models.Variable.get('VALIDATE_BODY', True)

With those variables you can define the body of the request:
body = {
    "name": FUNCTION_NAME,
    "entryPoint": ENTRYPOINT,
    "runtime": RUNTIME,
    "httpsTrigger": {}
}
    When you create a DAG the default_args dictionary can be used to pass the body and other arguments
default_args = {
    'start_date': dates.days_ago(1),
    'project_id': PROJECT_ID,
    'location': LOCATION,
    'body': body,
    'validate_body': VALIDATE_BODY
}
Note that neither the body nor the default args are complete in the above examples. Depending on the variables set, there might be different variants for how to pass the source-code-related fields. Currently you can pass either sourceArchiveUrl, sourceRepository or sourceUploadUrl as described in the Cloud Functions API specification. Additionally, default_args might contain a zip_path parameter to run the extra step of uploading the source code before deploying it. In the last case, you also need to provide an empty sourceUploadUrl parameter in the body.
Based on the variables defined above, example logic for setting the source-code-related fields is shown here:
if SOURCE_ARCHIVE_URL:
    body['sourceArchiveUrl'] = SOURCE_ARCHIVE_URL
elif SOURCE_REPOSITORY:
    body['sourceRepository'] = {
        'url': SOURCE_REPOSITORY
    }
elif ZIP_PATH:
    body['sourceUploadUrl'] = ''
    default_args['zip_path'] = ZIP_PATH
elif SOURCE_UPLOAD_URL:
    body['sourceUploadUrl'] = SOURCE_UPLOAD_URL
else:
    raise Exception("Please provide one of the source_code parameters")
    The code to create the operator
deploy_task = GcfFunctionDeployOperator(
    task_id='gcf_deploy_task',
    name=FUNCTION_NAME
)
Troubleshooting
If you want to run or deploy an operator using a service account and get forbidden 403 errors, it means that your service account does not have the correct Cloud IAM permissions.
1 Assign your Service Account the Cloud Functions Developer role.
2 Grant the user the Cloud IAM Service Account User role on the Cloud Functions runtime service account.
The typical way of assigning Cloud IAM permissions with gcloud is shown below. Just replace PROJECT_ID with the ID of your Google Cloud Platform project and SERVICE_ACCOUNT_EMAIL with the email ID of your service account.
gcloud iam service-accounts add-iam-policy-binding \
  PROJECT_ID@appspot.gserviceaccount.com \
  --member="serviceAccount:[SERVICE_ACCOUNT_EMAIL]" \
  --role="roles/iam.serviceAccountUser"
See Adding the IAM service agent user role to the runtime service for details.
If the source code for your function is in Google Source Repository, make sure that your service account has the Source Repository Viewer role so that the source code can be downloaded if necessary.
    Slack Operator
    This operator is used to send message to slack channel
     
    Parameters
    · task_id
    · channel (required) slack channel
    · message (required) message
    Note You may have to add the user 'aslack_admin' to the private channel in order to be able to send message to a private channel
Example:
from airflow import SlackOperator
task3 = SlackOperator(
    task_id='send_slack',
    channel='test_cwong',
    message='hi catherine',
    dag=dag)
     
If you want to code the DAG so that it sends a failure message to the channel, you can define the on_failure_cb as below:
on_failure_cb = SlackOperator(owner='owner', task_id='task_id', channel='test_channel', message='sorry, fail').execute
and add the following argument to the task:
on_failure_callback=on_failure_cb
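A hedged sketch of wiring that callback into a task (the task name, channel, and failing command are illustrative); Airflow calls on_failure_callback when the task instance fails:
on_failure_cb = SlackOperator(owner='owner',
                              task_id='notify_failure',
                              channel='test_channel',
                              message='sorry, fail').execute

risky_task = BashOperator(task_id='may_fail',
                          bash_command='exit 1',            # a command that fails, for illustration
                          on_failure_callback=on_failure_cb,
                          dag=dag)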

    SnowFlakeOperator
Executes SQL queries in a Snowflake database. Available only on NGAP 2.0 Airflow.
    Parameters
    · conn_id (string) – The snowflake connection name as configured in Admin → Connections
    · sql_file (string) – Location of the query file which needs to be executed
    · parameters(dictionary) Parameter that can be passed as a Key value pair
    NOTE
    1 This is a NGAP 20 only feature All the connection details pertaining to your batch user needs to be configured in connections
    2 Accepts multiple queries in a single file
    3 If AWS Key and Secret key is required in a query then substitute with aws_s3_key and aws_s3_secret_key Operator is designed to pick the s3_default keys for your environment
    4 The variable names(key) in the sql file should be appended with (only in the sql file and not in the input dictionary parameters)
Example:
sample_snowflake_tasks1 = SnowFlakeOperator(
    task_id='sample_snowflake_task1',
    sql_file='/app/bin/common/scripts/test_snowflake_s3.sql',
    conn_id='snowflake',
    dag=dag,
    parameters={'location': 's3nikebimanagedbidevdtc_commercetesttable'}
)
    SnowflakeSensor
Runs a SQL statement until a criteria is met. It will keep trying while the SQL returns no row, or if the first cell is in (0, '0', '').
    Parameters
    · conn_id (string) – The snowflake connection name as configured in Admin → Connections
    · sql (string) – The query which needs to be executed
    · poke_interval(integer) How often in seconds to run the sql statement
    · timeout(integer) How long to wait before failing
    · soft_fail(bool) Set to true to mark the task as SKIPPED on failure 

Example:
sample_snowflake_tasks1 = SnowflakeSensor(
    task_id='sample_snowflake_task1',
    sql='select * from some_table',
    poke_interval=60,  # 1 minute
    timeout=3600,      # 1 hour
    conn_id='snowflake',
    dag=dag
)
    TaskDependencySensor
    Waits for a task to complete in a different DAG similar to ExternalTaskSensor but with more options
    Parameters
    · external_dag_id (string) – The dag_id that contains the task you want to wait for
    · external_task_id (string) – The task_id that contains the task you want to wait for
· allowed_states (list) – list of allowed states; the default is ['success'] or ['failed', 'upstream_failed']
    · queue queue name
· execution_delta (datetime.timedelta) – time difference with the previous execution to look at; the default is the same execution_date as the current task. For yesterday, use [positive] datetime.timedelta(days=1)
    · execution_delta_json (json) – time difference with the previous execution to look at the default is the same execution_date as the current task For yesterday use [positive] datetimetimedelta(days1) eg execution_delta_json{00 09 08 550 151110} for two dags that run at hours 0815 and waiting for the completion of the other previous run of the other dag that runs at 0 8 15 ( dag1 0th hour will poke for dag2 15th hour run on previous day dag1 8th hour run will poke for dag2 1350th hour run dag1 15th hour run will poke for dag2 0350th hour run )
    · cluster_id (string) Optional– cluster id of another airflow cluster Note connection of the backend mysql database will have to be set up on the airflow → admin → connection
    · queue (string) queue for the task to be sent
    NOTE only one of the execution_delta execution_delta_json can be set
Example:
task1 = TaskDependencySensor(
    task_id='task_dep',
    external_dag_id='CheckEMRHealthV2',
    external_task_id='success',
    allowed_states=['success'],
    cluster_id='dev1713',
    queue='airflow',
    dag=dag
)

Managing Connections
Airflow needs to know how to connect to your environment. Information such as hostnames, ports, logins and passwords to other systems and services is handled in the Admin -> Connections section of the UI. The pipeline code you author will reference the 'conn_id' of the Connection objects.

Connections can be created and managed using either the UI or environment variables.
See the Connections Concepts documentation for more information.
Creating a Connection with the UI
Open the Admin -> Connections section of the UI. Click the Create link to create a new connection.

    1 Fill in the Conn Id field with the desired connection ID It is recommended that you use lowercase characters and separate words with underscores
    2 Choose the connection type with the Conn Type field
    3 Fill in the remaining fields See Connection Types for a description of the fields belonging to the different connection types
    4 Click the Save button to create the connection
    Editing a Connection with the UI
    Open the Admin>Connections section of the UI Click the pencil icon next to the connection you wish to edit in the connection list

    Modify the connection properties and click the Save button to save your changes
    Creating a Connection with Environment Variables
    Connections in Airflow pipelines can be created using environment variables The environment variable needs to have a prefix of AIRFLOW_CONN_ for Airflow with the value in a URI format to use the connection properly
When referencing the connection in the Airflow pipeline, the conn_id should be the name of the variable without the prefix. For example, if the conn_id is named postgres_master, the environment variable should be named AIRFLOW_CONN_POSTGRES_MASTER (note that the environment variable must be all uppercase). Airflow assumes the value returned from the environment variable to be in a URI format (e.g. postgres://user:password@localhost:5432/master or s3://accesskey:secretkey@S3).
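A minimal sketch of the environment-variable approach (the connection name and credentials are illustrative placeholders):
import os

# The connection is defined by an AIRFLOW_CONN_-prefixed variable holding a URI;
# pipeline code then refers to it by the conn_id 'postgres_master' (the name without the prefix).
os.environ['AIRFLOW_CONN_POSTGRES_MASTER'] = 'postgres://user:password@localhost:5432/master'
# e.g. PostgresOperator(task_id='query', postgres_conn_id='postgres_master', sql='SELECT 1', dag=dag)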
    Connection Types
Google Cloud Platform
    The Google Cloud Platform connection type enables the GCP Integrations
    Authenticating to GCP
    There are two ways to connect to GCP using Airflow
    1 Use Application Default Credentials such as via the metadata server when running on Google Compute Engine
    2 Use a service account key file (JSON format) on disk
    Default Connection IDs
    The following connection IDs are used by default
    bigquery_default
    Used by the BigQueryHook hook
    google_cloud_datastore_default
    Used by the DatastoreHook hook
    google_cloud_default
    Used by the GoogleCloudBaseHook DataFlowHook DataProcHook MLEngineHook and GoogleCloudStorageHook hooks
    Configuring the Connection
    Project Id (required)
    The Google Cloud project ID to connect to
    Keyfile Path
    Path to a service account key file (JSON format) on disk
    Not required if using application default credentials
    Keyfile JSON
    Contents of a service account key file (JSON format) on disk It is recommended to Secure your connections if using this method to authenticate
    Not required if using application default credentials
    Scopes (comma separated)
    A list of commaseparated Google Cloud scopes to authenticate with
    Note
    Scopes are ignored when using application default credentials See issue AIRFLOW2522
    MySQL
    The MySQL connect type allows to connect with MySQL database
    Configuring the Connection
    Host (required)
    The host to connect to
    Schema (optional)
    Specify the schema name to be used in the database
    Login (required)
    Specify the user name to connect
    Password (required)
    Specify the password to connect
    Extra (optional)
Specify the charset. Example: {"charset": "utf8"}
Note
If you encounter UnicodeDecodeError while working with a MySQL connection, check that the charset defined matches the database charset.
    Securing Connections
By default, Airflow will save the passwords for the connection in plain text within the metadata database. The crypto package is highly recommended during installation. The crypto package does require that your operating system has libffi-dev installed.
If the crypto package was not installed initially, you can still enable encryption for connections by following the steps below:
1 Install the crypto package: pip install apache-airflow[crypto]
2 Generate a fernet_key using the code snippet below; fernet_key must be a base64-encoded 32-byte key:
from cryptography.fernet import Fernet
fernet_key = Fernet.generate_key()
print(fernet_key.decode())  # your fernet_key, keep it in a secured place
3 Replace the airflow.cfg fernet_key value with the one from step 2. Alternatively, you can store your fernet_key in an OS environment variable. You do not need to change airflow.cfg in this case, as Airflow will use the environment variable over the value in airflow.cfg:
# Note the double underscores
export AIRFLOW__CORE__FERNET_KEY=your_fernet_key
4 Restart the Airflow webserver.
5 For existing connections (the ones that you had defined before installing airflow[crypto] and creating a Fernet key), you need to open each connection in the connection admin UI, re-type the password, and save it.
    Writing Logs
    Writing Logs Locally
Users can specify a logs folder in airflow.cfg using the base_log_folder setting. By default, it is in the AIRFLOW_HOME directory.
In addition, users can supply a remote location for storing logs and log backups in cloud storage.
In the Airflow Web UI, local logs take precedence over remote logs. If local logs can not be found or accessed, the remote logs will be displayed. Note that logs are only sent to remote storage once a task completes (including failure); in other words, remote logs for running tasks are unavailable. Logs are stored in the log folder as {dag_id}/{task_id}/{execution_date}/{try_number}.log.
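A hypothetical helper, only to illustrate the log layout described above (the base folder and names are placeholders):
import os
from datetime import datetime

def task_log_path(base_log_folder, dag_id, task_id, execution_date, try_number):
    # {base_log_folder}/{dag_id}/{task_id}/{execution_date}/{try_number}.log
    return os.path.join(base_log_folder, dag_id, task_id,
                        execution_date.isoformat(), '{}.log'.format(try_number))

print(task_log_path('/home/airflow/logs', 'my_dag', 'my_task', datetime(2019, 10, 10), 1))
# /home/airflow/logs/my_dag/my_task/2019-10-10T00:00:00/1.log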
    Writing Logs to Amazon S3
    Before you begin
    Remote logging uses an existing Airflow connection to readwrite logs If you don’t have a connection properly setup this will fail
    Enabling remote logging
To enable this feature, airflow.cfg must be configured as in this example:
[core]
# Airflow can store logs remotely in AWS S3. Users must supply a remote
# location URL (starting with 's3://') and an Airflow connection
# id that provides access to the storage location.
remote_base_log_folder = s3://my-bucket/path/to/logs
remote_log_conn_id = MyS3Conn
# Use server-side encryption for logs stored in S3
encrypt_s3_logs = False
In the above example, Airflow will try to use S3Hook('MyS3Conn').
    Writing Logs to Azure Blob Storage
    Airflow can be configured to read and write task logs in Azure Blob Storage Follow the steps below to enable Azure Blob Storage logging
1 Airflow's logging system requires a custom .py file to be located in the PYTHONPATH, so that it's importable from Airflow. Start by creating a directory to store the config file; $AIRFLOW_HOME/config is recommended.
2 Create empty files called $AIRFLOW_HOME/config/log_config.py and $AIRFLOW_HOME/config/__init__.py.
3 Copy the contents of airflow/config_templates/airflow_local_settings.py into the log_config.py file that was just created in the step above.
4 Customize the following portions of the template:
5 # wasb buckets should start with "wasb" just to help Airflow select the correct handler
6 REMOTE_BASE_LOG_FOLDER = 'wasb
7
8 # Rename DEFAULT_LOGGING_CONFIG to LOGGING_CONFIG
9 LOGGING_CONFIG
10 Make sure an Azure Blob Storage (Wasb) connection hook has been defined in Airflow. The hook should have read and write access to the Azure Blob Storage bucket defined above in REMOTE_BASE_LOG_FOLDER.
11 Update $AIRFLOW_HOME/airflow.cfg to contain:
12 remote_logging = True
13 logging_config_class = log_config.LOGGING_CONFIG
14 remote_log_conn_id
    15 Restart the Airflow webserver and scheduler and trigger (or wait for) a new task execution
    16 Verify that logs are showing up for newly executed tasks in the bucket you’ve defined
    Writing Logs to Google Cloud Storage
    Follow the steps below to enable Google Cloud Storage logging
To enable this feature, airflow.cfg must be configured as in this example:
[core]
# Airflow can store logs remotely in AWS S3, Google Cloud Storage or Elastic Search.
# Users must supply an Airflow connection id that provides access to the storage
# location. If remote_logging is set to true, see UPDATING.md for additional
# configuration requirements.
remote_logging = True
remote_base_log_folder = gs://my-bucket/path/to/logs
remote_log_conn_id = MyGCSConn
1 Install the gcp_api package first, like so: pip install apache-airflow[gcp_api]
2 Make sure a Google Cloud Platform connection hook has been defined in Airflow. The hook should have read and write access to the Google Cloud Storage bucket defined above in remote_base_log_folder.
3 Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.
4 Verify that logs are showing up for newly executed tasks in the bucket you've defined.
5 Verify that the Google Cloud Storage viewer is working in the UI. Pull up a newly executed task and verify that you see something like:
6 *** Reading remote log from gs://example_bash_operator/run_this_last/2017-10-03T00:00:00/16.log
7 [2017-10-03 21:57:50,056] {cli.py:377} INFO - Running on host chrisr-00532
8 [2017-10-03 21:57:50,093] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run example_bash_operator run_this_last 2017-10-03T00:00:00 --job_id 47 --raw -sd DAGS_FOLDER/example_dags/example_bash_operator.py']
9 [2017-10-03 21:57:51,264] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,263] {__init__.py:45} INFO - Using executor SequentialExecutor
10 [2017-10-03 21:57:51,306] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,306] {models.py:186} INFO - Filling up the DagBag from /airflow/dags/example_dags/example_bash_operator.py
    Note the top line that says it’s reading from the remote log file
    Scaling Out with Celery
CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to set up a Celery backend (RabbitMQ, Redis, ...) and change your airflow.cfg to point the executor parameter to CeleryExecutor and provide the related Celery settings.
    For more information about setting up a Celery broker refer to the exhaustive Celery documentation on the topic
    Here are a few imperative requirements for your workers
    · airflow needs to be installed and the CLI needs to be in the path
    · Airflow configuration settings should be homogeneous across the cluster
    · Operators that are executed on the worker need to have their dependencies met in that context For example if you use the HiveOperator the hive CLI needs to be installed on that box or if you use the MySqlOperator the required Python library needs to be available in the PYTHONPATH somehow
    · The worker needs to have access to its DAGS_FOLDER and you need to synchronize the filesystems by your own means A common setup would be to store your DAGS_FOLDER in a Git repository and sync it across machines using Chef Puppet Ansible or whatever you use to configure machines in your environment If all your boxes have a common mount point having your pipelines files shared there should work as well
    To kick off a worker you need to setup Airflow and kick off the worker subcommand
    airflow worker
    Your worker should start picking up tasks as soon as they get fired in its direction
    Note that you can also run Celery Flower a web UI built on top of Celery to monitor your workers You can use the shortcut command airflow flower to start a Flower web server
    Some caveats
    · Make sure to use a database backed result backend
    · Make sure to set a visibility timeout in [celery_broker_transport_options] that exceeds the ETA of your longest running task
    · Tasks can consume resources Make sure your worker has enough resources to run worker_concurrency tasks
    Scaling Out with Dask
    DaskExecutor allows you to run Airflow tasks in a Dask Distributed cluster
    Dask clusters can be run on a single machine or on remote networks For complete details consult the Distributed documentation
    To create a cluster first start a Scheduler
# default settings for a local cluster
DASK_HOST=127.0.0.1
DASK_PORT=8786

dask-scheduler --host $DASK_HOST --port $DASK_PORT
Next start at least one Worker on any machine that can connect to the host:
dask-worker $DASK_HOST:$DASK_PORT
Edit your airflow.cfg to set your executor to DaskExecutor and provide the Dask Scheduler address in the [dask] section.
    Please note
    · Each Dask worker must be able to import Airflow and any dependencies you require
    · Dask does not support queues If an Airflow task was created with a queue a warning will be raised but the task will be submitted to the cluster
    Scaling Out with Mesos (community contributed)
    There are two ways you can run airflow as a mesos framework
    1 Running airflow tasks directly on mesos slaves requiring each mesos slave to have airflow installed and configured
    2 Running airflow tasks inside a docker container that has airflow installed which is run on a mesos slave
    Tasks executed directly on mesos slaves
    MesosExecutor allows you to schedule airflow tasks on a Mesos cluster For this to work you need a running mesos cluster and you must perform the following steps
1 Install airflow on a mesos slave where the web server and scheduler will run; let's refer to this as the Airflow server.
2 On the Airflow server, install mesos python eggs from mesos downloads.
3 On the Airflow server, use a database (such as mysql) which can be accessed from all mesos slaves, and add the configuration in airflow.cfg.
4 Change your airflow.cfg to point the executor parameter to MesosExecutor and provide the related Mesos settings.
5 On all mesos slaves, install airflow. Copy the airflow.cfg from the Airflow server (so that it uses the same sql alchemy connection).
6 On all mesos slaves, run the following for serving logs:
airflow serve_logs
7 On the Airflow server, to start processing/scheduling DAGs on mesos, run:
airflow scheduler -p
Note: the -p parameter is needed to pickle the DAGs.
    You can now see the airflow framework and corresponding tasks in mesos UI The logs for airflow tasks can be seen in airflow UI as usual
    For more information about mesos refer to mesos documentation For any queriesbugs on MesosExecutor please contact @kapilmalik
    Tasks executed in containers on mesos slaves
    This gist contains all files and configuration changes necessary to achieve the following
    1 Create a dockerized version of airflow with mesos python eggs installed
    We recommend taking advantage of docker’s multi stage builds in order to achieve this We have one Dockerfile that defines building a specific version of mesos from source (Dockerfilemesos) in order to create the python eggs In the airflow Dockerfile (Dockerfileairflow) we copy the python eggs from the mesos image
    2 Create a mesos configuration block within the airflowcfg
    The configuration block remains the same as the default airflow configuration (default_airflowcfg) but has the addition of an option docker_image_slave This should be set to the name of the image you would like mesos to use when running airflow tasks Make sure you have the proper configuration of the DNS record for your mesos master and any sort of authorization if any exists
    3 Change your airflowcfg to point the executor parameter to MesosExecutor (executor SequentialExecutor)
    4 Make sure your mesos slave has access to the docker repository you are using for your docker_image_slave
    Instructions are available in the mesos docs
    The rest is up to you and how you want to work with a dockerized airflow configuration
    Running Airflow with systemd
Airflow can integrate with systemd based systems. This makes watching your daemons easy, as systemd can take care of restarting a daemon on failure. In the scripts/systemd directory you can find unit files that have been tested on Redhat based systems. You can copy those to /usr/lib/systemd/system. It is assumed that Airflow will run under airflow:airflow. If not (or if you are running on a non-Redhat based system) you probably need to adjust the unit files.
Environment configuration is picked up from /etc/sysconfig/airflow. An example file is supplied. Make sure to specify the SCHEDULER_RUNS variable in this file when you run the scheduler. You can also define here, for example, AIRFLOW_HOME or AIRFLOW_CONFIG.
Running Airflow with upstart
Airflow can integrate with upstart based systems. Upstart automatically starts all airflow services for which you have a corresponding *.conf file in /etc/init upon system boot. On failure, upstart automatically restarts the process (until it reaches the respawn limit set in a *.conf file).
You can find sample upstart job files in the scripts/upstart directory. These files have been tested on Ubuntu 14.04 LTS. You may have to adjust the start on and stop on stanzas to make it work on other upstart systems. Some of the possible options are listed in scripts/upstart/README.
Modify *.conf files as needed and copy to the /etc/init directory. It is assumed that airflow will run under airflow:airflow. Change setuid and setgid in the *.conf files if you use another user/group.
You can use initctl to manually start, stop, or view the status of the airflow process that has been integrated with upstart:
initctl airflow-webserver status
    Using the Test Mode Configuration
Airflow has a fixed set of test mode configuration options. You can load these at any time by calling airflow.configuration.load_test_config() (note this operation is not reversible). However, some options (like the DAG_FOLDER) are loaded before you have a chance to call load_test_config(). In order to eagerly load the test configuration, set test_mode in airflow.cfg:
[tests]
unit_test_mode = True
Due to Airflow's automatic environment variable expansion (see Setting Configuration Options), you can also set the env var AIRFLOW__CORE__UNIT_TEST_MODE to temporarily overwrite airflow.cfg.
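A minimal sketch of the two approaches named above (the environment variable must be set before Airflow reads its configuration; whether your deployment allows calling this at runtime is an assumption):
import os

# Option 1: temporarily override airflow.cfg through environment variable expansion
os.environ['AIRFLOW__CORE__UNIT_TEST_MODE'] = 'True'

# Option 2: load the fixed test-mode settings at runtime (not reversible)
from airflow import configuration
configuration.load_test_config()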
UI Screenshots
    The Airflow UI makes it easy to monitor and troubleshoot your data pipelines Here’s a quick overview of some of the features and visualizations you can find in the Airflow UI
    DAGs View
    List of the DAGs in your environment and a set of shortcuts to useful pages You can see exactly how many tasks succeeded failed or are currently running at a glance



    Tree View
    A tree representation of the DAG that spans across time If a pipeline is late you can quickly see where the different steps are and identify the blocking ones



    Graph View
    The graph view is perhaps the most comprehensive Visualize your DAG’s dependencies and their current status for a specific run



    Variable View
The variable view allows you to list, create, edit or delete the key-value pairs of variables used during jobs. The value of a variable will be hidden if the key contains any of the words ('password', 'secret', 'passwd', 'authorization', 'api_key', 'apikey', 'access_token') by default, but can be configured to show in cleartext.



    Gantt Chart
    The Gantt chart lets you analyse task duration and overlap You can quickly identify bottlenecks and where the bulk of the time is spent for specific DAG runs



    Task Duration
    The duration of your different tasks over the past N runs This view lets you find outliers and quickly understand where the time is spent in your DAG over many runs



    Code View
    Transparency is everything While the code for your pipeline is in source control this is a quick way to get to the code that generates the DAG and provide yet more context



    Task Instance Context Menu
    From the pages seen above (tree view graph view gantt …) it is always possible to click on a task instance and get to this rich context menu that can take you to more detailed metadata and perform some actions


Concepts
The Airflow platform is a tool for describing, executing, and monitoring workflows.
Core Ideas
DAG
In Airflow, a DAG (Directed Acyclic Graph) is a collection of the tasks you want to run, organized in a way that reflects their relationships and dependencies.
For example, a simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say that the workflow runs every night at 10pm, but should not start until a certain date.
In this way, a DAG describes how you want to carry out your workflow; but notice that we haven't said anything about what we actually want to do. A, B, and C could be anything. Maybe A prepares data for B to analyze while C sends an email. Or perhaps A monitors your location so B can open your garage door while C turns on your house lights. The important thing is that the DAG isn't concerned with what its constituent tasks do; its job is to make sure that whatever they do happens at the right time, in the right order, and with the right handling of any unexpected issues.
DAGs are defined in standard Python files that are placed in Airflow's DAG_FOLDER. Airflow will execute the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow.
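A hedged sketch of the A/B/C example above (the DAG name, start date, and use of DummyOperator are illustrative):
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('abc_example',
          start_date=datetime(2019, 10, 10),     # do not start before this date
          schedule_interval='0 22 * * *')        # run every night at 10pm

a = DummyOperator(task_id='A',
                  execution_timeout=timedelta(minutes=5),  # A times out after 5 minutes
                  dag=dag)
b = DummyOperator(task_id='B', retries=5, dag=dag)         # B can be restarted up to 5 times
c = DummyOperator(task_id='C', dag=dag)                    # C can run anytime

a >> b   # A has to run successfully before B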
Note
When searching for DAGs, Airflow will only consider files where the strings "airflow" and "DAG" both appear in the contents of the .py file.
Scope
Airflow will load any DAG object it can import from a DAG file. Critically, that means the DAG must appear in globals(). Consider the following two DAGs: only dag_1 will be loaded; the other one only appears in a local scope.
dag_1 = DAG('this_dag_will_be_discovered')

def my_function():
    dag_2 = DAG('but_this_dag_will_not')

my_function()
Sometimes this can be put to good use. For example, a common pattern with SubDagOperator is to define the subdag inside a function so that Airflow doesn't try to load it as a standalone DAG.
Default Arguments
If a dictionary of default_args is passed to a DAG, it will apply them to any of its operators. This makes it easy to apply a common parameter to many operators without having to type it many times.
default_args = {
    'start_date': datetime(2016, 1, 1),
    'owner': 'Airflow'
}

dag = DAG('my_dag', default_args=default_args)
op = DummyOperator(task_id='dummy', dag=dag)
print(op.owner)  # Airflow
    Context Manager
Added in Airflow 1.8
DAGs can be used as context managers to automatically assign new operators to that DAG.
with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
    op = DummyOperator('op')

op.dag is dag  # True
    Operator
While DAGs describe how to run a workflow, Operators determine what actually gets done.
An operator describes a single task in a workflow. Operators are usually (but not always) atomic, meaning they can stand on their own and don't need to share resources with any other operators. The DAG will make sure that operators run in the correct order; other than those dependencies, operators generally run independently. In fact, they may run on two completely different machines.
This is a subtle but very important point: in general, if two operators need to share information, like a filename or a small amount of data, you should consider combining them into a single operator. If it absolutely can't be avoided, Airflow does have a feature for operator cross-communication called XCom that is described elsewhere in this document.
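A minimal XCom sketch (the task names and pushed value are illustrative), reusing the PythonOperator with provide_context=True shown earlier:
def push_filename(**context):
    context['ti'].xcom_push(key='filename', value='/tmp/example.csv')  # publish a small value

def pull_filename(**context):
    filename = context['ti'].xcom_pull(task_ids='push_task', key='filename')
    print('received', filename)

push_task = PythonOperator(task_id='push_task', provide_context=True,
                           python_callable=push_filename, dag=dag)
pull_task = PythonOperator(task_id='pull_task', provide_context=True,
                           python_callable=pull_filename, dag=dag)
push_task >> pull_task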
Airflow provides operators for many common tasks, including:
· BashOperator - executes a bash command
· PythonOperator - calls an arbitrary Python function
    · EmailOperator sends an email
    · SimpleHttpOperator sends an HTTP request
    · MySqlOperator SqliteOperator PostgresOperator MsSqlOperator OracleOperator JdbcOperator etc executes a SQL command
    · Sensor waits for a certain time file database row S3 key etc…
In addition to these basic building blocks, there are many more specific operators: DockerOperator, HiveOperator, S3FileTransformOperator, PrestoToMysqlOperator, SlackOperator... you get the idea.
The airflow/contrib/ directory contains yet more operators built by the community. These operators aren't always as complete or well-tested as those in the main distribution, but allow users to more easily add new functionality to the platform.
Operators are only loaded by Airflow if they are assigned to a DAG.
See Using Operators for how to use Airflow operators.
    DAG Assignment
Added in Airflow 1.8
Operators do not have to be assigned to DAGs immediately (previously dag was a required argument). However, once an operator is assigned to a DAG, it can not be transferred or unassigned. DAG assignment can be done explicitly when the operator is created, through deferred assignment, or even inferred from other operators.
dag = DAG('my_dag', start_date=datetime(2016, 1, 1))

# sets the DAG explicitly
explicit_op = DummyOperator(task_id='op1', dag=dag)

# deferred DAG assignment
deferred_op = DummyOperator(task_id='op2')
deferred_op.dag = dag

# inferred DAG assignment (linked operators must be in the same DAG)
inferred_op = DummyOperator(task_id='op3')
inferred_op.set_upstream(deferred_op)
Bitshift Composition
Added in Airflow 1.8
Traditionally, operator relationships are set with the set_upstream() and set_downstream() methods. In Airflow 1.8, this can be done with the Python bitshift operators >> and <<. The following four statements are all functionally equivalent:
op1 >> op2
op1.set_downstream(op2)

op2 << op1
op2.set_upstream(op1)
When using the bitshift to compose operators, the relationship is set in the direction that the bitshift operator points. For example, op1 >> op2 means that op1 runs first and op2 runs second. Multiple operators can be composed; keep in mind the chain is executed left-to-right and the rightmost object is always returned. For example:
op1 >> op2 >> op3 << op4
is equivalent to:
op1.set_downstream(op2)
op2.set_downstream(op3)
op3.set_upstream(op4)
For convenience, the bitshift operators can also be used with DAGs. For example:
dag >> op1 >> op2
is equivalent to:
op1.dag = dag
op1.set_downstream(op2)
We can put this all together to build a simple pipeline:
with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
    (
        DummyOperator(task_id='dummy_1')
        >> BashOperator(
            task_id='bash_1',
            bash_command='echo HELLO')
        >> PythonOperator(
            task_id='python_1',
            python_callable=lambda: print("GOODBYE"))
    )
    Task
Once an operator is instantiated, it is referred to as a task. The instantiation defines specific values when calling the abstract operator, and the parameterized task becomes a node in a DAG.
Task Instance
A task instance represents a specific run of a task and is characterized as the combination of a DAG, a task, and a point in time. Task instances also have an indicative state, which could be "running", "success", "failed", "skipped", "up for retry", etc.
Workflow
You're now familiar with the core building blocks of Airflow. Some of the concepts may sound very similar, but the vocabulary can be conceptualized like this:
    · DAG a description of the order in which work should take place
    · Operator a class that acts as a template for carrying out some work
    · Task a parameterized instance of an operator
    · Task Instance a task that 1) has been assigned to a DAG and 2) has a state associated with a specific run of the DAG
    By combining DAGs and Operators to create TaskInstances you can build complex workflows 通组合DAG运算符创建TaskInstances您构建复杂工作流
Additional Functionality
In addition to the core Airflow objects, there are a number of more complex features that enable behaviors like limiting simultaneous access to resources, cross-communication, conditional execution, and more.
Hook
Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS and Pig. Hooks implement a common interface when possible and act as a building block for operators. They also use the airflow.models.Connection model to retrieve hostnames and authentication information. Hooks keep authentication code and information out of pipelines, centralized in the metadata database.
Hooks are also very useful on their own to use in Python scripts, in the Airflow airflow.operators.PythonOperator, and in interactive environments like iPython or Jupyter Notebook.
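A minimal sketch of using a hook inside a PythonOperator callable, assuming a DAG object named dag is in scope. The connection id 'postgres_default' and the table name are illustrative assumptions; any conn_id registered under Admin > Connections would work the same way.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def count_rows(**context):
    # the hook resolves host/login/password from the stored connection
    hook = PostgresHook(postgres_conn_id='postgres_default')
    return hook.get_records('SELECT COUNT(*) FROM my_table')  # my_table is hypothetical

count_task = PythonOperator(
    task_id='count_rows',
    python_callable=count_rows,
    provide_context=True,
    dag=dag)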
    Pool
Some systems can get overwhelmed when too many processes hit them at the same time. Airflow pools can be used to limit the execution parallelism on arbitrary sets of tasks. The list of pools is managed in the UI (Menu > Admin > Pools) by giving each pool a name and assigning it a number of worker slots. Tasks can then be associated with one of the existing pools by using the pool parameter when creating tasks (i.e., instantiating operators).
aggregate_db_message_job = BashOperator(
    task_id='aggregate_db_message_job',
    execution_timeout=timedelta(hours=3),
    pool='ep_data_pipeline_db_msg_agg',
    bash_command=aggregate_db_message_job_cmd,
    dag=dag)
aggregate_db_message_job.set_upstream(wait_for_empty_queue)
The pool parameter can be used in conjunction with priority_weight to define priorities in the queue, and which tasks get executed first as slots open up in the pool. The default priority_weight is 1 and can be bumped to any number. When sorting the queue to evaluate which task should be executed next, we use the priority_weight summed up with all of the priority_weight values from tasks downstream of this task. You can use this to bump a specific important task, and the whole path to that task gets prioritized accordingly.
Tasks will be scheduled as usual while the slots fill up. Once capacity is reached, runnable tasks get queued and their state will show as such in the UI. As slots free up, queued tasks start running based on the priority_weight (of the task and its descendants).
Note that by default tasks aren't assigned to any pool, and their execution parallelism is only limited by the executor's setting.
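A hedged sketch of combining pool and priority_weight, assuming a DAG object named dag is in scope: two tasks share the same small pool, and the one with the higher weight is picked first when a slot frees up. The pool name 'small_db_pool' is an assumption and would have to be created in the UI first.
from airflow.operators.bash_operator import BashOperator

urgent = BashOperator(
    task_id='urgent_extract',
    bash_command='echo urgent',
    pool='small_db_pool',      # hypothetical pool created under Admin > Pools
    priority_weight=10,
    dag=dag)

routine = BashOperator(
    task_id='routine_extract',
    bash_command='echo routine',
    pool='small_db_pool',
    priority_weight=1,         # waits behind the higher-weight task when slots are scarce
    dag=dag)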
    Connection
The connection information to external systems is stored in the Airflow metadata database and managed in the UI (Menu > Admin > Connections). A conn_id is defined there, with hostname, login, password and schema information attached to it. Airflow pipelines can simply refer to the centrally managed conn_id without having to hard code any of this information anywhere.
Many connections with the same conn_id can be defined; when that is the case, and when the hooks use the get_connection method from BaseHook, Airflow will choose one connection randomly, allowing for some basic load balancing and fault tolerance when used in conjunction with retries.
Airflow also has the ability to reference connections via environment variables from the operating system, but it only supports the URI format. If you need to specify extra information for your connection, please use the web UI.
If connections with the same conn_id are defined in both the Airflow metadata database and environment variables, only the one in the environment variables will be referenced by Airflow (for example, given conn_id postgres_master, Airflow will search for AIRFLOW_CONN_POSTGRES_MASTER in environment variables first and directly reference it if found, before it starts to search the metadata database).
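A minimal sketch of supplying the connection described above through an environment variable. The host, credentials and database name are placeholders; in practice the variable would normally be exported in the shell or service definition that launches Airflow rather than set from Python.
import os

# URI-format connection for conn_id 'postgres_master' (placeholder credentials)
os.environ['AIRFLOW_CONN_POSTGRES_MASTER'] = 'postgres://user:password@db.example.com:5432/mydb'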
Many hooks have a default conn_id, where operators using that hook do not need to supply an explicit connection ID. For example, the default conn_id for the PostgresHook is postgres_default.
See Managing Connections for how to create and manage connections.
    Queue
When using the CeleryExecutor, the Celery queues that tasks are sent to can be specified. queue is an attribute of BaseOperator, so any task can be assigned to any queue. The default queue for the environment is defined in airflow.cfg's celery > default_queue. This defines the queue that tasks get assigned to when not specified, as well as which queue Airflow workers listen to when started.
Workers can listen to one or multiple queues of tasks. When a worker is started (using the command airflow worker), a set of comma-delimited queue names can be specified (e.g. airflow worker -q spark). This worker will then only pick up tasks wired to the specified queue(s).
This can be useful if you need specialized workers, either from a resource perspective (for say very lightweight tasks where one worker could take thousands of tasks without a problem), or from an environment perspective (you want a worker running from within the Spark cluster itself because it needs a very specific environment and security rights).
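A short sketch of routing one task to a dedicated queue as described above, assuming a DAG object named dag is in scope. The queue name 'spark' and the spark-submit command are assumptions; only a worker started with "airflow worker -q spark" would pick this task up under the CeleryExecutor.
from airflow.operators.bash_operator import BashOperator

spark_task = BashOperator(
    task_id='submit_spark_job',
    bash_command='spark-submit my_job.py',  # my_job.py is a placeholder
    queue='spark',                          # routed away from the default queue
    dag=dag)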
    XCom
XComs let tasks exchange messages, allowing more nuanced forms of control and shared state. The name is an abbreviation of "cross-communication". XComs are principally defined by a key, value, and timestamp, but also track attributes like the task/DAG that created the XCom and when it should become visible. Any object that can be pickled can be used as an XCom value, so users should make sure to use objects of appropriate size.
XComs can be "pushed" (sent) or "pulled" (received). When a task pushes an XCom, it makes it generally available to other tasks. Tasks can push XComs at any time by calling the xcom_push() method. In addition, if a task returns a value (either from its Operator's execute() method or from a PythonOperator's python_callable function), then an XCom containing that value is automatically pushed.
Tasks call xcom_pull() to retrieve XComs, optionally applying filters based on criteria like key, source task_ids, and source dag_id. By default, xcom_pull() filters for the keys that are automatically given to XComs when they are pushed by being returned from execute functions (as opposed to XComs that are pushed manually).
If xcom_pull is passed a single string for task_ids, then the most recent XCom value from that task is returned; if a list of task_ids is passed, then a corresponding list of XCom values is returned.
# inside a PythonOperator called 'pushing_task'
def push_function():
    return value

# inside another PythonOperator where provide_context=True
def pull_function(**context):
    value = context['task_instance'].xcom_pull(task_ids='pushing_task')
It is also possible to pull an XCom directly in a template; here's an example of what this may look like:
SELECT * FROM {{ task_instance.xcom_pull(task_ids='foo', key='table_name') }}
Note that XComs are similar to Variables, but are specifically designed for inter-task communication rather than global settings.
    Variables
Variables are a generic way to store and retrieve arbitrary content or settings as a simple key-value store within Airflow. Variables can be listed, created, updated and deleted from the UI (Admin > Variables), code or CLI. In addition, json settings files can be bulk uploaded through the UI. While your pipeline code definition and most of your constants and variables should be defined in code and stored in source control, it can be useful to have some variables or configuration items accessible and modifiable through the UI.
from airflow.models import Variable
foo = Variable.get("foo")
bar = Variable.get("bar", deserialize_json=True)
The second call assumes json content and will be deserialized into bar. Note that Variable is a sqlalchemy model and can be used as such.
You can use a variable from a jinja template with the syntax:
echo {{ var.value.<variable_name> }}
or if you need to deserialize a json object from the variable:
echo {{ var.json.<variable_name> }}
    Branching
Sometimes you need a workflow to branch, or only go down a certain path based on an arbitrary condition, which is typically related to something that happened in an upstream task. One way to do this is by using the BranchPythonOperator.
The BranchPythonOperator is much like the PythonOperator except that it expects a python_callable that returns a task_id. The task_id returned is followed, and all of the other paths are skipped. The task_id returned by the Python function has to reference a task directly downstream from the BranchPythonOperator task.
Note that using tasks with depends_on_past=True downstream from a BranchPythonOperator is logically unsound, as the skipped status will invariably lead to blocked tasks that depend on their past successes. Skipped states propagate wherever all directly upstream tasks are skipped.
If you want to skip some tasks, keep in mind that you can't have an empty path; if so, make a dummy task. A minimal sketch follows the figure captions below.
Like this, where the dummy task "branch_false" is skipped:

Not like this, where the "join" task is skipped:
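A minimal sketch of the branching pattern described above, assuming an enclosing DAG object named dag. The branch condition (even/odd day of the execution date) and the task names are illustrative only.
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator

def choose_branch(**context):
    # hypothetical condition: branch on the day of the execution date
    if context['execution_date'].day % 2 == 0:
        return 'branch_true'
    return 'branch_false'

branching = BranchPythonOperator(
    task_id='branching',
    python_callable=choose_branch,
    provide_context=True,
    dag=dag)

branch_true = DummyOperator(task_id='branch_true', dag=dag)
branch_false = DummyOperator(task_id='branch_false', dag=dag)

# only the task_id returned by choose_branch will run; the other is skipped
branching >> branch_true
branching >> branch_false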

    SubDAGs
SubDAGs are perfect for repeating patterns. Defining a function that returns a DAG object is a nice design pattern when using Airflow.
Airbnb uses the stage-check-exchange pattern when loading data. Data is staged in a temporary table, after which data quality checks are performed against that table. Once the checks all pass, the partition is moved into the production table.
As an example, consider the following DAG:

We can combine all of the parallel task-* operators into a single SubDAG, so that the resulting DAG resembles the following:

Note that SubDAG operators should contain a factory method that returns a DAG object. This will prevent the SubDAG from being treated like a separate DAG in the main UI. For example:
# dags/subdag.py
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator


# DAG is returned by a factory method
def sub_dag(parent_dag_name, child_dag_name, start_date, schedule_interval):
    dag = DAG(
        '%s.%s' % (parent_dag_name, child_dag_name),
        schedule_interval=schedule_interval,
        start_date=start_date,
    )

    dummy_operator = DummyOperator(
        task_id='dummy_task',
        dag=dag,
    )

    return dag
This SubDAG can then be referenced in your main DAG file:
# main_dag.py
from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
from dags.subdag import sub_dag


PARENT_DAG_NAME = 'parent_dag'
CHILD_DAG_NAME = 'child_dag'

main_dag = DAG(
    dag_id=PARENT_DAG_NAME,
    schedule_interval=timedelta(hours=1),
    start_date=datetime(2016, 1, 1),
)

sub_dag = SubDagOperator(
    subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, main_dag.start_date,
                   main_dag.schedule_interval),
    task_id=CHILD_DAG_NAME,
    dag=main_dag,
)
You can zoom into a SubDagOperator from the graph view of the main DAG to show the tasks contained within the SubDAG:

Some other tips when using SubDAGs:
· by convention, a SubDAG's dag_id should be prefixed by its parent and a dot, as in parent.child
· share arguments between the main DAG and the SubDAG by passing arguments to the SubDAG operator (as demonstrated above)
· SubDAGs must have a schedule and be enabled. If the SubDAG's schedule is set to None or @once, the SubDAG will succeed without having done anything
· clearing a SubDagOperator also clears the state of the tasks within it
· marking success on a SubDagOperator does not affect the state of the tasks within it
· refrain from using depends_on_past=True in tasks within the SubDAG as this can be confusing
· it is possible to specify an executor for the SubDAG. It is common to use the SequentialExecutor if you want to run the SubDAG in-process and effectively limit its parallelism to one. Using LocalExecutor can be problematic as it may over-subscribe your worker, running multiple tasks in a single slot
See airflow/example_dags for a demonstration.
    SLAs
Service Level Agreements, or the time by which a task or DAG should have succeeded, can be set at a task level as a timedelta. If one or many instances have not succeeded by that time, an alert email is sent detailing the list of tasks that missed their SLA. The event is also recorded in the database and made available in the web UI under Browse > Missed SLAs, where events can be analyzed and documented.
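A hedged sketch of attaching an SLA to a task, assuming a DAG object named dag is in scope. The 30-minute window and the task itself are illustrative; an SLA miss email is only sent if email delivery is configured for the deployment.
from datetime import timedelta
from airflow.operators.bash_operator import BashOperator

load_task = BashOperator(
    task_id='load_daily_partition',
    bash_command='echo loading',
    sla=timedelta(minutes=30),   # must succeed within 30 minutes of the period end
    dag=dag)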
Trigger Rules
Though the normal workflow behavior is to trigger tasks when all their directly upstream tasks have succeeded, Airflow allows for more complex dependency settings.
All operators have a trigger_rule argument which defines the rule by which the generated task gets triggered. The default value for trigger_rule is all_success and can be defined as "trigger this task when all directly upstream tasks have succeeded". All other rules described here are based on direct parent tasks and are values that can be passed to any operator while creating tasks:
· all_success: (default) all parents have succeeded
· all_failed: all parents are in a failed or upstream_failed state
· all_done: all parents are done with their execution
· one_failed: fires as soon as at least one parent has failed; it does not wait for all parents to be done
· one_success: fires as soon as at least one parent succeeds; it does not wait for all parents to be done
· dummy: dependencies are just for show, trigger at will
Note that these can be used in conjunction with depends_on_past (boolean) which, when set to True, keeps a task from getting triggered if the previous schedule for the task hasn't succeeded. A short example of overriding the default rule follows below.
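A minimal sketch of overriding the default all_success rule for a clean-up style task that should run once its parents have finished, whatever their outcome. A DAG object named dag and the upstream task variables extract_task and transform_task are assumed to exist elsewhere in the DAG.
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

cleanup = DummyOperator(
    task_id='cleanup',
    trigger_rule=TriggerRule.ALL_DONE,   # run after parents finish, success or failure
    dag=dag)

# extract_task and transform_task are hypothetical upstream tasks
cleanup.set_upstream([extract_task, transform_task])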
Latest Run Only
Standard workflow behavior involves running a series of tasks for a particular date/time range. Some workflows, however, perform tasks that are independent of run time but need to be run on a schedule, much like a standard cron job. In these cases, backfills or running jobs missed during a pause just waste CPU cycles.
For situations like this, you can use the LatestOnlyOperator to skip tasks that are not being run during the most recent scheduled run for a DAG. The LatestOnlyOperator skips all immediate downstream tasks, and itself, if the time right now is not between its execution_time and the next scheduled execution_time.
One must be aware of the interaction between skipped tasks and trigger rules. Skipped tasks will cascade through the trigger rules all_success and all_failed, but not all_done, one_failed, one_success and dummy. If you would like to use the LatestOnlyOperator with trigger rules that do not cascade skips, you will need to ensure that the LatestOnlyOperator is directly upstream of the task you would like to skip.
It is possible, through the use of trigger rules, to mix tasks that should run in the typical date/time dependent mode with those using the LatestOnlyOperator.
For example, consider the following DAG:
# dags/latest_only_with_trigger.py
import datetime as dt

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.latest_only_operator import LatestOnlyOperator
from airflow.utils.trigger_rule import TriggerRule


dag = DAG(
    dag_id='latest_only_with_trigger',
    schedule_interval=dt.timedelta(hours=4),
    start_date=dt.datetime(2016, 9, 20),
)

latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)

task1 = DummyOperator(task_id='task1', dag=dag)
task1.set_upstream(latest_only)

task2 = DummyOperator(task_id='task2', dag=dag)

task3 = DummyOperator(task_id='task3', dag=dag)
task3.set_upstream([task1, task2])

task4 = DummyOperator(task_id='task4', dag=dag,
                      trigger_rule=TriggerRule.ALL_DONE)
task4.set_upstream([task1, task2])
In the case of this DAG, the latest_only task will show up as skipped for all runs except the latest run. task1 is directly downstream of latest_only and will also skip for all runs except the latest. task2 is entirely independent of latest_only and will run in all scheduled periods. task3 is downstream of task1 and task2, and because the default trigger_rule is all_success it will receive a cascaded skip from task1. task4 is downstream of task1 and task2, but since its trigger_rule is set to all_done it will trigger as soon as task1 has been skipped (a valid completion state) and task2 has succeeded.

Zombies & Undeads
Task instances die all the time, usually as part of their normal life cycle, but sometimes unexpectedly.
Zombie tasks are characterized by the absence of a heartbeat (emitted by the job periodically) and a "running" status in the database. They can occur when a worker node can't reach the database, when Airflow processes are killed externally, or when a node gets rebooted, for instance. Zombie killing is performed periodically by the scheduler's process.
Undead processes are characterized by the existence of a process and a matching heartbeat, but Airflow isn't aware of this task as running in the database. This mismatch typically occurs as the state of the database is altered, most likely by deleting rows in the "Task Instances" view in the UI. Tasks are instructed to verify their state as part of the heartbeat routine, and terminate themselves upon figuring out that they are in this "undead" state.
Cluster Policy
Your local Airflow settings file can define a policy function that has the ability to mutate task attributes based on other task or DAG attributes. It receives a single argument as a reference to task objects and is expected to alter its attributes.
For example, this function could apply a specific queue property when using a specific operator, or enforce a task timeout policy, making sure that no tasks run for more than 48 hours. Here's an example of what this may look like inside your airflow_settings.py:
def policy(task):
    if task.__class__.__name__ == 'HivePartitionSensor':
        task.queue = "sensor_queue"
    if task.timeout > timedelta(hours=48):
        task.timeout = timedelta(hours=48)
Documentation & Notes
It's possible to add documentation or notes to your DAG and task objects that become visible in the web interface ("Graph View" for DAGs, "Task Details" for tasks). There is a set of special task attributes that get rendered as rich content if defined:

attribute    rendered to
doc          monospace
doc_json     json
doc_yaml     yaml
doc_md       markdown
doc_rst      reStructuredText

Please note that for DAGs, doc_md is the only attribute interpreted.
This is especially useful if your tasks are built dynamically from configuration files, as it allows you to expose the configuration that led to the related tasks in Airflow.

"""
### My great DAG
"""

dag = DAG('my_dag', default_args=default_args)
dag.doc_md = __doc__

t = BashOperator("foo", dag=dag)
t.doc_md = """\
# Title
Here's a [url](www.airbnb.com)
"""
This content will get rendered as markdown, respectively in the Graph View and Task Details pages.
Jinja Templating
Airflow leverages the power of Jinja Templating, and this can be a powerful tool to use in combination with macros (see the Macros section).
For example, say you want to pass the execution date as an environment variable to a Bash script using the BashOperator:
# The execution date as YYYY-MM-DD
date = "{{ ds }}"
t = BashOperator(
    task_id='test_env',
    bash_command='/tmp/test.sh ',
    dag=dag,
    env={'EXECUTION_DATE': date})
Here, {{ ds }} is a macro, and because the env parameter of the BashOperator is templated with Jinja, the execution date will be available as an environment variable named EXECUTION_DATE in your Bash script.
You can use Jinja templating with every parameter that is marked as "templated" in the documentation. Template substitution occurs just before the pre_execute function of your operator is called.
Packaged DAGs
While often you will specify DAGs in a single .py file, it might sometimes be required to combine a DAG and its dependencies. For example, you might want to combine several DAGs together to version them together, you might want to manage them together, or you might need an extra module that is not available by default on the system you are running Airflow on. To allow this, you can create a zip file that contains the DAG(s) in the root of the zip file and has the extra modules unpacked in directories.
For instance, you can create a zip file that looks like this:
my_dag1.py
my_dag2.py
package1/__init__.py
package1/functions.py
Airflow will scan the zip file and try to load my_dag1.py and my_dag2.py. It will not go into subdirectories, as these are considered to be potential packages.
In case you would like to add module dependencies to your DAG you basically would do the same, but then it is more convenient to use a virtualenv and pip:
virtualenv zip_dag
source zip_dag/bin/activate

mkdir zip_dag_contents
cd zip_dag_contents

pip install --install-option="--install-lib=$PWD" my_useful_package
cp ~/my_dag.py .

zip -r zip_dag.zip *
Note
The zip file will be inserted at the beginning of the module search list (sys.path), and as such it will be available to any other code that resides within the same interpreter.
Note
Packaged DAGs cannot be used with pickling turned on.
Note
Packaged DAGs cannot contain dynamic libraries (e.g. libz.so); these need to be available on the system if a module needs them. In other words, only pure Python modules can be packaged.
.airflowignore
A .airflowignore file specifies the directories or files in DAG_FOLDER that Airflow should intentionally ignore. Each line in .airflowignore specifies a regular expression pattern, and directories or files whose names (not DAG id) match any of the patterns are ignored (under the hood, re.findall() is used to match the pattern). Overall it works like a .gitignore file.
The .airflowignore file should be put in your DAG_FOLDER. For example, you can prepare a .airflowignore file with the contents:
project_a
tenant_[\d]
Then files like "project_a_dag_1.py", "TESTING_project_a.py", "tenant_1.py", "project_a/dag_1.py" and "tenant_1/dag_1.py" in your DAG_FOLDER would be ignored. (If a directory's name matches any of the patterns, this directory and all its subfolders will not be scanned by Airflow at all. This improves the efficiency of DAG finding.)
The scope of a .airflowignore file is the directory it is in plus all its subfolders. You can also prepare a .airflowignore file for a subfolder in DAG_FOLDER, and it would only be applicable to that subfolder.
Data Profiling
Note
Adhoc Queries and Charts are no longer supported in the new FAB-based webserver and UI, due to security concerns.
Part of being productive with data is having the right weapons to profile the data you are working with. Airflow provides a simple query interface to write SQL and get results quickly, and a charting application letting you visualize data.
Ad Hoc Query
The Ad Hoc Query UI allows for simple SQL interactions with the database connections registered in Airflow.

Charts
A simple UI built on top of flask-admin and highcharts allows building data visualizations and charts easily. Fill in a form with a label, SQL, chart type, pick a source database from your environment's connections, select a few other options, and save it for later use.
You can even use the same templating and macros available when writing Airflow pipelines, parameterizing your queries and modifying parameters directly in the URL.
These charts are basic, but they're easy to create, modify and share.

Chart Screenshot


Chart Form Screenshot

Command Line Interface
The Airflow command line interface is very feature-rich: it can perform all kinds of operations on DAGs, start services, and support development and testing.
usage: airflow [-h]
{resetdb, render, variables, connections, users, pause, sync_perm, task_failed_deps, version, trigger_dag, initdb, test, unpause, list_dag_runs, dag_state, run, list_tasks, backfill, list_dags, kerberos, worker, webserver, flower, scheduler, task_state, pool, serve_logs, clear, next_execution, upgradedb, delete_dag}

Positional Arguments
subcommand
Possible choices: resetdb, render, variables, connections, users, pause, sync_perm, task_failed_deps, version, trigger_dag, initdb, test, unpause, list_dag_runs, dag_state, run, list_tasks, backfill, list_dags, kerberos, worker, webserver, flower, scheduler, task_state, pool, serve_logs, clear, next_execution, upgradedb, delete_dag
subcommand help
Subcommands:
    resetdb
Burn down and rebuild the metadata database.
airflow resetdb [-h] [-y]
Named Arguments
-y, --yes
Do not prompt to confirm reset. Use with care!
Default: False
    render
Render a task instance's template(s).
airflow render [-h] [-sd SUBDIR] dag_id task_id execution_date
Positional Arguments
dag_id
The id of the dag
task_id
The id of the task
execution_date
The execution date of the DAG
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags', where [AIRFLOW_HOME] is the value you set for the 'AIRFLOW_HOME' config in 'airflow.cfg'.
Default: "[AIRFLOW_HOME]/dags"
    variables
    CRUD operations on variables
    airflow variables [h] [s KEY VAL] [g KEY] [j] [d VAL] [i FILEPATH]
    [e FILEPATH] [x KEY]
    命名参数
    s set
    Set a variable
    g get
    Get value of a variable
    j json
    Deserialize JSON variable
    Default False
    d default
    Default value returned if variable does not exist
    i import
    Import variables from JSON file
    e export
    Export variables to JSON file
    x delete
    Delete a variable
    connections
    ListAddDelete connections
    airflow connections [h] [l] [a] [d] [conn_id CONN_ID]
    [conn_uri CONN_URI] [conn_extra CONN_EXTRA]
    [conn_type CONN_TYPE] [conn_host CONN_HOST]
    [conn_login CONN_LOGIN] [conn_password CONN_PASSWORD]
    [conn_schema CONN_SCHEMA] [conn_port CONN_PORT]
    命名参数
    l list
    List all connections
    Default False
    a add
    Add a connection
    Default False
    d delete
    Delete a connection
    Default False
    conn_id
    Connection id required to adddelete a connection
    conn_uri
    Connection URI required to add a connection without conn_type
    conn_extra
    Connection Extra field optional when adding a connection
    conn_type
    Connection type required to add a connection without conn_uri
    conn_host
    Connection host optional when adding a connection
    conn_login
    Connection login optional when adding a connection
    conn_password
     
    Connection password optional when adding a connection
    conn_schema
    Connection schema optional when adding a connection
    conn_port
    Connection port optional when adding a connection
    users
    ListCreateDelete users
    airflow users [h] [l] [c] [d] [username USERNAME] [email EMAIL]
    [firstname FIRSTNAME] [lastname LASTNAME] [role ROLE]
    [password PASSWORD] [use_random_password]
    命名参数
    l list
    List all users
    Default False
    c create
    Create a user
    Default False
    d delete
    Delete a user
    Default False
    username
    Username of the user required to createdelete a user
    email
    Email of the user required to create a user
    firstname
    First name of the user required to create a user
    lastname
    Last name of the user required to create a user
    role
    Role of the user Existing roles include Admin User Op Viewer and Public Required to create a user
    password
    Password of the user required to create a user without –use_random_password
    use_random_password
     
    Do not prompt for password Use random string instead Required to create a user without –password
    Default False
    pause
Pause a DAG
airflow pause [-h] [-sd SUBDIR] dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags', where [AIRFLOW_HOME] is the value you set for the 'AIRFLOW_HOME' config in 'airflow.cfg'.
Default: "[AIRFLOW_HOME]/dags"
    sync_perm
Update existing role's permissions.
    airflow sync_perm [h]
    task_failed_deps
Returns the unmet dependencies for a task instance from the perspective of the scheduler. In other words, why a task instance doesn't get scheduled and then queued by the scheduler, and then run by an executor.
airflow task_failed_deps [-h] [-sd SUBDIR] dag_id task_id execution_date
Positional Arguments
dag_id
The id of the dag
task_id
The id of the task
execution_date
The execution date of the DAG
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags', where [AIRFLOW_HOME] is the value you set for the 'AIRFLOW_HOME' config in 'airflow.cfg'.
Default: "[AIRFLOW_HOME]/dags"
    version
Show the version.
    airflow version [h]
    trigger_dag
Trigger a DAG run.
airflow trigger_dag [-h] [-sd SUBDIR] [-r RUN_ID] [-c CONF] [-e EXEC_DATE]
                    dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags', where [AIRFLOW_HOME] is the value you set for the 'AIRFLOW_HOME' config in 'airflow.cfg'.
Default: "[AIRFLOW_HOME]/dags"
-r, --run_id
Helps to identify this run
-c, --conf
JSON string that gets pickled into the DagRun's conf attribute
-e, --exec_date
The execution date of the DAG
    initdb
Initialize the metadata database.
    airflow initdb [h]
    test
Test a task instance. This will run a task without checking for dependencies or recording its state in the database.
airflow test [-h] [-sd SUBDIR] [-dr] [-tp TASK_PARAMS]
             dag_id task_id execution_date
Positional Arguments
dag_id
The id of the dag
task_id
The id of the task
execution_date
The execution date of the DAG
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags', where [AIRFLOW_HOME] is the value you set for the 'AIRFLOW_HOME' config in 'airflow.cfg'.
Default: "[AIRFLOW_HOME]/dags"
-dr, --dry_run
Perform a dry run
Default: False
-tp, --task_params
Sends a JSON params dict to the task
    unpause
    Resume a paused DAG
    airflow unpause [h] [sd SUBDIR] dag_id
    位置参数
    dag_id
    The id of the dag
    命名参数
    sd subdir
    File location or directory from which to look for the dag Defaults to [AIRFLOW_HOME]dags’ where [AIRFLOW_HOME] is the value you set for AIRFLOW_HOME’ config you set in airflowcfg’
    Default [AIRFLOW_HOME]dags
    list_dag_runs
List dag runs given a DAG id. If the state option is given, it will only search for the dag runs with the given state. If the no_backfill option is given, it will filter out all backfill dag runs for the given dag id.
airflow list_dag_runs [-h] [--no_backfill] [--state STATE] dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
    no_backfill
    filter all the backfill dagruns given the dag id
    Default False
    state
    Only list the dag runs corresponding to the state
    dag_state
Get the status of a dag run.
airflow dag_state [-h] [-sd SUBDIR] dag_id execution_date
Positional Arguments
dag_id
The id of the dag
execution_date
The execution date of the DAG
Named Arguments
    sd subdir
    File location or directory from which to look for the dag Defaults to [AIRFLOW_HOME]dags’ where [AIRFLOW_HOME] is the value you set for AIRFLOW_HOME’ config you set in airflowcfg’
    Default [AIRFLOW_HOME]dags
    run
    Run a single task instance
    airflow run [h] [sd SUBDIR] [m] [f] [pool POOL] [cfg_path CFG_PATH]
    [l] [A] [i] [I] [ship_dag] [p PICKLE] [int]
    dag_id task_id execution_date
    位置参数
    dag_id
    The id of the dag
    task_id
    The id of the task
    execution_date
    The execution date of the DAG
    命名参数
    sd subdir
    File location or directory from which to look for the dag Defaults to [AIRFLOW_HOME]dags’ where [AIRFLOW_HOME] is the value you set for AIRFLOW_HOME’ config you set in airflowcfg’
    Default [AIRFLOW_HOME]dags
    m mark_success
     
    Mark jobs as succeeded without running them
    Default False
    f force
    Ignore previous task instance state rerun regardless if task already succeededfailed
    Default False
    pool
    Resource pool to use
    cfg_path
    Path to config file to use instead of airflowcfg
    l local
    Run the task using the LocalExecutor
    Default False
    A ignore_all_dependencies
     
    Ignores all noncritical dependencies including ignore_ti_state and ignore_task_deps
    Default False
    i ignore_dependencies
     
    Ignore taskspecific dependencies eg upstream depends_on_past and retry delay dependencies
    Default False
    I ignore_depends_on_past
     
    Ignore depends_on_past dependencies (but respect upstream dependencies)
    Default False
    ship_dag
    Pickles (serializes) the DAG and ships it to the worker
    Default False
    p pickle
    Serialized pickle object of the entire dag (used internally)
    int interactive
     
    Do not capture standard output and error streams (useful for interactive debugging)
    Default False
    list_tasks
    List the tasks within a DAG
    airflow list_tasks [h] [t] [sd SUBDIR] dag_id
    位置参数
    dag_id
    The id of the dag
    命名参数
    t tree
    Tree view
    Default False
    sd subdir
    File location or directory from which to look for the dag Defaults to [AIRFLOW_HOME]dags’ where [AIRFLOW_HOME] is the value you set for AIRFLOW_HOME’ config you set in airflowcfg’
    Default [AIRFLOW_HOME]dags
    backfill
Run subsections of a DAG for a specified date range. If the reset_dag_run option is used, backfill will first prompt users whether Airflow should clear all the previous dag_run and task_instances within the backfill date range. If rerun_failed_tasks is used, backfill will automatically re-run the previously failed task instances within the backfill date range.
    airflow backfill [h] [t TASK_REGEX] [s START_DATE] [e END_DATE] [m] [l]
    [x] [i] [I] [sd SUBDIR] [pool POOL]
    [delay_on_limit DELAY_ON_LIMIT] [dr] [v] [c CONF]
    [reset_dagruns] [rerun_failed_tasks]
    dag_id
    位置参数
    dag_id
    The id of the dag
    命名参数
    t task_regex
     
    The regex to filter specific task_ids to backfill (optional)
    s start_date
     
    Override start_date YYYYMMDD
    e end_date
    Override end_date YYYYMMDD
    m mark_success
     
    Mark jobs as succeeded without running them
    Default False
    l local
    Run the task using the LocalExecutor
    Default False
    x donot_pickle
     
    Do not attempt to pickle the DAG object to send over to the workers just tell the workers to run their version of the code
    Default False
    i ignore_dependencies
     
    Skip upstream tasks run only the tasks matching the regexp Only works in conjunction with task_regex
    Default False
    I ignore_first_depends_on_past
     
    Ignores depends_on_past dependencies for the first set of tasks only (subsequent executions in the backfill DO respect depends_on_past)
    Default False
    sd subdir
    File location or directory from which to look for the dag Defaults to [AIRFLOW_HOME]dags’ where [AIRFLOW_HOME] is the value you set for AIRFLOW_HOME’ config you set in airflowcfg’
    Default [AIRFLOW_HOME]dags
    pool
    Resource pool to use
    delay_on_limit
     
    Amount of time in seconds to wait when the limit on maximum active dag runs (max_active_runs) has been reached before trying to execute a dag run again
    Default 10
    dr dry_run
    Perform a dry run
    Default False
    v verbose
    Make logging output more verbose
    Default False
    c conf
    JSON string that gets pickled into the DagRun’s conf attribute
    reset_dagruns
     
    if set the backfill will delete existing backfillrelated DAG runs and start anew with fresh running DAG runs
    Default False
    rerun_failed_tasks
     
    if set the backfill will autorerun all the failed tasks for the backfill date range instead of throwing exceptions
    Default False
    list_dags
List all the DAGs.
    airflow list_dags [h] [sd SUBDIR] [r]
    命名参数
    sd subdir
    File location or directory from which to look for the dag Defaults to [AIRFLOW_HOME]dags’ where [AIRFLOW_HOME] is the value you set for AIRFLOW_HOME’ config you set in airflowcfg’
    Default [AIRFLOW_HOME]dags
    r report
    Show DagBag loading report
    Default False
    kerberos
    Start a kerberos ticket renewer
    airflow kerberos [h] [kt [KEYTAB]] [pid [PID]] [D] [stdout STDOUT]
    [stderr STDERR] [l LOG_FILE]
    [principal]
    位置参数
    principal
    kerberos principal
    Default airflow
    命名参数
    kt keytab
    keytab
    Default airflowkeytab
    pid
    PID file location
    D daemon
    Daemonize instead of running in the foreground
    Default False
    stdout
    Redirect stdout to this file
    stderr
    Redirect stderr to this file
    l logfile
    Location of the log file
    worker
    Start a Celery worker node
    airflow worker [h] [p] [q QUEUES] [c CONCURRENCY] [cn CELERY_HOSTNAME]
    [pid [PID]] [D] [stdout STDOUT] [stderr STDERR]
    [l LOG_FILE] [a AUTOSCALE]
    命名参数
    p do_pickle
     
    Attempt to pickle the DAG object to send over to the workers instead of letting workers run their version of the code
    Default False
    q queues
    Comma delimited list of queues to serve
    Default default
    c concurrency
     
    The number of worker processes
    Default 16
    cn celery_hostname
     
    Set the hostname of celery worker if you have multiple workers on a single machine
    pid
    PID file location
    D daemon
    Daemonize instead of running in the foreground
    Default False
    stdout
    Redirect stdout to this file
    stderr
    Redirect stderr to this file
    l logfile
    Location of the log file
    a autoscale
     
    Minimum and Maximum number of worker to autoscale
    webserver
    Start a Airflow webserver instance
    airflow webserver [h] [p PORT] [w WORKERS]
    [k {synceventletgeventtornado}] [t WORKER_TIMEOUT]
    [hn HOSTNAME] [pid [PID]] [D] [stdout STDOUT]
    [stderr STDERR] [A ACCESS_LOGFILE] [E ERROR_LOGFILE]
    [l LOG_FILE] [ssl_cert SSL_CERT] [ssl_key SSL_KEY] [d]
    命名参数
    p port
    The port on which to run the server
    Default 8080
    w workers
    Number of workers to run the webserver on
    Default 4
    k workerclass
     
    Possible choices sync eventlet gevent tornado
    The worker class to use for Gunicorn
    Default sync
    t worker_timeout
     
    The timeout for waiting on webserver workers
    Default 120
    hn hostname
     
    Set the hostname on which to run the web server
    Default 0000
    pid
    PID file location
    D daemon
    Daemonize instead of running in the foreground
    Default False
    stdout
    Redirect stdout to this file
    stderr
    Redirect stderr to this file
    A access_logfile
     
    The logfile to store the webserver access log Use to print to stderr
    Default
    E error_logfile
     
    The logfile to store the webserver error log Use to print to stderr
    Default
    l logfile
    Location of the log file
    ssl_cert
    Path to the SSL certificate for the webserver
    ssl_key
    Path to the key to use with the SSL certificate
    d debug
    Use the server that ships with Flask in debug mode
    Default False
    flower
Start a Celery Flower.
    airflow flower [h] [hn HOSTNAME] [p PORT] [fc FLOWER_CONF] [u URL_PREFIX]
    [a BROKER_API] [pid [PID]] [D] [stdout STDOUT]
    [stderr STDERR] [l LOG_FILE]
    命名参数
    hn hostname
     
    Set the hostname on which to run the server
    Default 0000
    p port
    The port on which to run the server
    Default 5555
    fc flower_conf
     
    Configuration file for flower
    u url_prefix
     
    URL prefix for Flower
    a broker_api
     
    Broker api
    pid
    PID file location
    D daemon
    Daemonize instead of running in the foreground
    Default False
    stdout
    Redirect stdout to this file
    stderr
    Redirect stderr to this file
    l logfile
    Location of the log file
    scheduler
Start a scheduler instance.
    airflow scheduler [h] [d DAG_ID] [sd SUBDIR] [r RUN_DURATION]
    [n NUM_RUNS] [p] [pid [PID]] [D] [stdout STDOUT]
    [stderr STDERR] [l LOG_FILE]
    命名参数
    d dag_id
    The id of the dag to run
    sd subdir
    File location or directory from which to look for the dag Defaults to [AIRFLOW_HOME]dags’ where [AIRFLOW_HOME] is the value you set for AIRFLOW_HOME’ config you set in airflowcfg’
    Default [AIRFLOW_HOME]dags
    r runduration
     
    Set number of seconds to execute before exiting
    n num_runs
    Set the number of runs to execute before exiting
    Default 1
    p do_pickle
     
    Attempt to pickle the DAG object to send over to the workers instead of letting workers run their version of the code
    Default False
    pid
    PID file location
    D daemon
    Daemonize instead of running in the foreground
    Default False
    stdout
    Redirect stdout to this file
    stderr
    Redirect stderr to this file
    l logfile
    Location of the log file
    task_state
Get the status of a task instance.
    airflow task_state [h] [sd SUBDIR] dag_id task_id execution_date
    位置参数
    dag_id
    The id of the dag
    task_id
    The id of the task
    execution_date
    The execution date of the DAG
    命名参数
    sd subdir
    File location or directory from which to look for the dag Defaults to [AIRFLOW_HOME]dags’ where [AIRFLOW_HOME] is the value you set for AIRFLOW_HOME’ config you set in airflowcfg’
    Default [AIRFLOW_HOME]dags
    pool
    CRUD operations on pools
    airflow pool [h] [s NAME SLOT_COUNT POOL_DESCRIPTION] [g NAME] [x NAME]
    [i FILEPATH] [e FILEPATH]
    命名参数
    s set
    Set pool slot count and description respectively
    g get
    Get pool info
    x delete
    Delete a pool
    i import
    Import pool from JSON file
    e export
    Export pool to JSON file
    serve_logs
    Serve logs generate by worker
    airflow serve_logs [h]
    clear
    Clear a set of task instance as if they never ran
    airflow clear [h] [t TASK_REGEX] [s START_DATE] [e END_DATE] [sd SUBDIR]
    [u] [d] [c] [f] [r] [x] [xp] [dx]
    dag_id
    位置参数
    dag_id
    The id of the dag
    命名参数
    t task_regex
     
    The regex to filter specific task_ids to backfill (optional)
    s start_date
     
    Override start_date YYYYMMDD
    e end_date
    Override end_date YYYYMMDD
    sd subdir
    File location or directory from which to look for the dag Defaults to [AIRFLOW_HOME]dags’ where [AIRFLOW_HOME] is the value you set for AIRFLOW_HOME’ config you set in airflowcfg’
    Default [AIRFLOW_HOME]dags
    u upstream
    Include upstream tasks
    Default False
    d downstream
     
    Include downstream tasks
    Default False
    c no_confirm
     
    Do not request confirmation
    Default False
    f only_failed
     
    Only failed jobs
    Default False
    r only_running
     
    Only running jobs
    Default False
    x exclude_subdags
     
    Exclude subdags
    Default False
    xp exclude_parentdag
     
    Exclude ParentDAGS if the task cleared is a part of a SubDAG
    Default False
    dx dag_regex
     
    Search dag_id as regex instead of exact string
    Default False
    next_execution
Get the next execution datetime of a DAG.
    airflow next_execution [h] [sd SUBDIR] dag_id
    位置参数
    dag_id
    The id of the dag
    命名参数
    sd subdir
    File location or directory from which to look for the dag Defaults to [AIRFLOW_HOME]dags’ where [AIRFLOW_HOME] is the value you set for AIRFLOW_HOME’ config you set in airflowcfg’
    Default [AIRFLOW_HOME]dags
    upgradedb
Upgrade the metadata database to the latest version.
    airflow upgradedb [h]
    delete_dag
Delete all DB records related to the specified DAG.
    airflow delete_dag [h] [y] dag_id
    位置参数
    dag_id
    The id of the dag
    命名参数
    y yes
    Do not prompt to confirm reset Use with care
    Default False
Scheduling & Triggers
The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all the DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. To kick it off, all you need to do is execute the airflow scheduler command; it will use the configuration specified in airflow.cfg.
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's Repeat That: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
The scheduler starts an instance of the executor specified in your airflow.cfg. If it happens to be the LocalExecutor, tasks will be executed as subprocesses; in the case of the CeleryExecutor and MesosExecutor, tasks are executed remotely.
To start a scheduler, simply run the command:
    airflow scheduler
    DAG Run
A DAG Run is an object representing an instantiation of the DAG in time.
Each DAG may or may not have a schedule, which informs how DAG Runs are created. schedule_interval is defined as a DAG argument and receives preferably a cron expression as a str, or a datetime.timedelta object. Alternatively, you can also use one of these cron presets:
preset      meaning                                                        cron
None        Don't schedule; use for exclusively "externally triggered" DAGs
@once       Schedule once and only once
@hourly     Run once an hour at the beginning of the hour                  0 * * * *
@daily      Run once a day at midnight                                     0 0 * * *
@weekly     Run once a week at midnight on Sunday morning                  0 0 * * 0
@monthly    Run once a month at midnight of the first day of the month     0 0 1 * *
@yearly     Run once a year at midnight of January 1                       0 0 1 1 *
Note: Use schedule_interval=None and not schedule_interval='None' when you don't want to schedule your DAG.
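A short sketch of the three ways to express schedule_interval mentioned above: a preset, a raw cron string, and a datetime.timedelta. The DAG names and dates are placeholders.
from datetime import datetime, timedelta
from airflow.models import DAG

preset_dag = DAG('daily_report', start_date=datetime(2016, 1, 1),
                 schedule_interval='@daily')          # cron preset

cron_dag = DAG('hourly_on_weekdays', start_date=datetime(2016, 1, 1),
               schedule_interval='0 * * * 1-5')       # raw cron expression

delta_dag = DAG('every_four_hours', start_date=datetime(2016, 1, 1),
                schedule_interval=timedelta(hours=4)) # timedelta interval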
Your DAG will be instantiated for each schedule, while creating a DAG Run entry for each schedule.
DAG Runs have a state associated with them (running, failed, success) and inform the scheduler on which set of schedules should be evaluated for task submissions. Without the metadata at the DAG Run level, the Airflow scheduler would have much more work to do in order to figure out what tasks should be triggered, and would come to a crawl. It might also create undesired processing when changing the shape of your DAG, by say adding in new tasks.
Backfill and Catchup
An Airflow DAG with a start_date, possibly an end_date, and a schedule_interval defines a series of intervals which the scheduler turns into individual DAG Runs and executes. A key capability of Airflow is that these DAG Runs are atomic, idempotent items, and the scheduler, by default, will examine the lifetime of the DAG (from start to end/now, one interval at a time) and kick off a DAG Run for any interval that has not been run (or has been cleared). This concept is called Catchup.
If your DAG is written to handle its own catchup (i.e., not limited to the interval, but instead to "Now" for instance), then you will want to turn catchup off (either on the DAG itself with dag.catchup = False, or by default at the configuration file level with catchup_by_default = False). What this will do is instruct the scheduler to only create a DAG Run for the most current instance of the DAG interval series.

"""
Code that goes along with the Airflow tutorial located at:
https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval='@hourly',
    catchup=False)
In the example above, if the DAG is picked up by the scheduler daemon on 2016-01-02 at 6 AM (or from the command line), a single DAG Run will be created with an execution_date of 2016-01-01, and the next one will be created just after midnight on the morning of 2016-01-03, with an execution date of 2016-01-02.
If the dag.catchup value had been True instead, the scheduler would have created a DAG Run for each completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, as that interval hasn't completed), and the scheduler would execute them sequentially. This behavior is great for atomic datasets that can easily be split into periods. Turning catchup off is great if your DAG Runs perform backfill internally.
External Triggers
Note that DAG Runs can also be created manually through the CLI by running an "airflow trigger_dag" command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated with the trigger's timestamp and will be displayed in the UI alongside scheduled DAG Runs.
In addition, you can also manually trigger a DAG Run using the web UI (tab DAGs > column Links > button "Trigger Dag").
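For instance, a manual trigger from the command line might look like the sketch below; the DAG id, run_id and conf payload are placeholders, and the -r/-c flags follow the trigger_dag options listed in the CLI reference above.
airflow trigger_dag my_dag -r manual_fix_2016_01_01 -c '{"reason": "reprocess"}'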
To Keep in Mind
· The first DAG Run is created based on the minimum start_date for the tasks in your DAG.
· Subsequent DAG Runs are created by the scheduler process, based on your DAG's schedule_interval, sequentially.
· When clearing a set of tasks' state in the hope of getting them to re-run, it is important to keep in mind the DAG Run's state too, as it defines whether the scheduler should look into triggering tasks for that run.
Here are some of the ways you can unblock tasks:
· From the UI, you can clear (as in delete the status of) individual task instances from the task instances dialog, while defining whether you want to include the past/future and the upstream/downstream dependencies. Note that a confirmation window comes next and allows you to see the set you are about to clear. You can also clear all task instances associated with the dag.
· The CLI command "airflow clear -h" has lots of options when it comes to clearing task instance states, including specifying date ranges, targeting task_ids by specifying a regular expression, flags for including upstream and downstream relatives, and targeting task instances in specific states (failed or success); a sketch of a typical invocation follows this list.
· Clearing a task instance will no longer delete the task instance record. Instead it updates max_tries and sets the current task instance state to None.
· Marking task instances as failed can be done through the UI. This can be used to stop running task instances.
· Marking task instances as successful can be done through the UI. This is mostly to fix false negatives, or for instance when the fix has been applied outside of Airflow.
· The "airflow backfill" CLI subcommand has a flag to mark_success and allows selecting subsections of the DAG as well as specifying date ranges.
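As referenced in the list above, a hedged sketch of clearing only the failed instances of tasks matching a regex over one week; the DAG id, task regex and dates are placeholders, and the flags follow the clear options listed in the CLI reference above.
airflow clear my_dag -t 'extract_.*' -s 2016-01-01 -e 2016-01-07 --only_failed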
Plugins
Airflow has a simple plugin manager built in that can integrate external features to its core by simply dropping files in your $AIRFLOW_HOME/plugins folder.
The python modules in the plugins folder get imported, and hooks, operators, sensors, macros, executors and web views get integrated into Airflow's main collections and become available for use.
What for?
Airflow offers a generic toolbox for working with data. Different organizations have different stacks and different needs. Using Airflow plugins can be a way for companies to customize their Airflow installation to reflect their ecosystem.
Plugins can be used as an easy way to write, share and activate new sets of features.
There's also a need for a set of more complex applications to interact with different flavors of data and metadata.
Examples:
· A set of tools to parse Hive logs and expose Hive metadata (CPU / IO / phases / skew / ...)
· An anomaly detection framework, allowing people to collect metrics, set thresholds and alerts
· An auditing tool, helping understand who accesses what
· A config-driven SLA monitoring tool, allowing you to set monitored tables and at what time they should land, alert people, and expose visualizations of outages
    · …
Why build on top of Airflow?
Airflow has many components that can be reused when building an application:
    · A web server you can use to render your views
    · A metadata database to store your models
    · Access to your databases and knowledge of how to connect to them
    · An array of workers that your application can push workload to
    · Airflow is deployed you can just piggy back on its deployment logistics
    · Basic charting capabilities underlying libraries and abstractions
    Interface
To create a plugin you will need to derive the airflow.plugins_manager.AirflowPlugin class and reference the objects you want to plug into Airflow. Here's what the class you need to derive looks like:
class AirflowPlugin(object):
    # The name of your plugin (str)
    name = None
    # A list of class(es) derived from BaseOperator
    operators = []
    # A list of class(es) derived from BaseSensorOperator
    sensors = []
    # A list of class(es) derived from BaseHook
    hooks = []
    # A list of class(es) derived from BaseExecutor
    executors = []
    # A list of references to inject into the macros namespace
    macros = []
    # A list of objects created from a class derived
    # from flask_admin.BaseView
    admin_views = []
    # A list of Blueprint object created from flask.Blueprint
    flask_blueprints = []
    # A list of menu links (flask_admin.base.MenuLink)
    menu_links = []
You can derive it by inheritance (please refer to the example below). Please note that name inside this class must be specified.
After the plugin is imported into Airflow, you can invoke it using a statement like:
from airflow.{type, like "operators", "sensors"}.{name specified inside the plugin class} import *
When you write your own plugins, make sure you understand them well. There are some essential properties for each type of plugin. For example (see the sketch after this list):
· For an Operator plugin, an execute method is compulsory.
· For a Sensor plugin, a poke method returning a Boolean value is compulsory.
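As a hedged illustration of these two requirements, the sketch below defines a bare-bones operator and sensor; the class names and the file path being polled are assumptions, not part of the Airflow docs.
import os

from airflow.models import BaseOperator
from airflow.sensors.base_sensor_operator import BaseSensorOperator


class MyOperator(BaseOperator):
    def execute(self, context):
        # execute() runs once per task instance; its return value is pushed to XCom.
        return "done"


class MyFileSensor(BaseSensorOperator):
    def poke(self, context):
        # poke() is called repeatedly until it returns True (or the sensor times out).
        return os.path.exists('/tmp/_trigger_file')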
Example
The code below defines a plugin that injects a set of dummy object definitions in Airflow.
# This is the class you derive to create a plugin
from airflow.plugins_manager import AirflowPlugin

from flask import Blueprint
from flask_admin import BaseView, expose
from flask_admin.base import MenuLink

# Importing base classes that we need to derive
from airflow.hooks.base_hook import BaseHook
from airflow.models import BaseOperator
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.executors.base_executor import BaseExecutor

# Will show up under airflow.hooks.test_plugin.PluginHook
class PluginHook(BaseHook):
    pass

# Will show up under airflow.operators.test_plugin.PluginOperator
class PluginOperator(BaseOperator):
    pass

# Will show up under airflow.sensors.test_plugin.PluginSensorOperator
class PluginSensorOperator(BaseSensorOperator):
    pass

# Will show up under airflow.executors.test_plugin.PluginExecutor
class PluginExecutor(BaseExecutor):
    pass

# Will show up under airflow.macros.test_plugin.plugin_macro
def plugin_macro():
    pass

# Creating a flask admin BaseView
class TestView(BaseView):
    @expose('/')
    def test(self):
        # in this example, put your test_plugin/test.html template at airflow/plugins/templates/test_plugin/test.html
        return self.render("test_plugin/test.html", content="Hello galaxy!")
v = TestView(category="Test Plugin", name="Test View")

# Creating a flask blueprint to integrate the templates and static folder
bp = Blueprint(
    "test_plugin", __name__,
    template_folder='templates',  # registers airflow/plugins/templates as a Jinja template folder
    static_folder='static',
    static_url_path='/static/test_plugin')

ml = MenuLink(
    category='Test Plugin',
    name='Test Menu Link',
    url='https://airflow.incubator.apache.org/')

# Defining the plugin class
class AirflowTestPlugin(AirflowPlugin):
    name = "test_plugin"
    operators = [PluginOperator]
    sensors = [PluginSensorOperator]
    hooks = [PluginHook]
    executors = [PluginExecutor]
    macros = [plugin_macro]
    admin_views = [v]
    flask_blueprints = [bp]
    menu_links = [ml]
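A possible usage sketch, assuming the plugin above is saved as a file under $AIRFLOW_HOME/plugins: once Airflow imports it, the dummy operator becomes importable under airflow.operators.test_plugin, as noted in the comments of the example. The dag id and start_date below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.test_plugin import PluginOperator

dag = DAG('plugin_demo', start_date=datetime(2016, 1, 1), schedule_interval=None)
# PluginOperator is only a dummy (it has no execute method), so this is just a wiring check.
run_it = PluginOperator(task_id='run_plugin_operator', dag=dag)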
Security
By default, all gates are opened. An easy way to restrict access to the web application is to do it at the network level, or by using SSH tunnels.
It is however possible to switch on authentication by either using one of the supplied backends or creating your own.
Be sure to check out Experimental Rest API for securing the API.
    Note
Airflow uses the config parser of Python. This config parser interpolates '%' signs. Make sure to escape any '%' signs in your config file (but not environment variables), as otherwise Airflow might leak these passwords on a config parser exception to a log.
Web Authentication
Password
    Note
This is for the flask-admin based web UI only. If you are using the FAB-based web UI with the RBAC feature, please use the command line interface airflow users create to create accounts, or do that in the FAB-based UI itself.
One of the simplest mechanisms for authentication is requiring users to specify a password before logging in. Password authentication requires the use of the password subpackage in your requirements file. Password hashing uses bcrypt before storing passwords.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
When password auth is enabled, an initial user credential will need to be created before anyone can login. An initial user was not created in the migrations for this authentication backend, to prevent default Airflow installations from attack. Creating a new user has to be done via a Python REPL on the same machine Airflow is installed.
# navigate to the airflow installation directory
cd ~/airflow
python
Python 2.7.9 (default, Feb 10 2015, 03:28:08)
Type "help", "copyright", "credits" or "license" for more information.
>>> import airflow
>>> from airflow import models, settings
>>> from airflow.contrib.auth.backends.password_auth import PasswordUser
>>> user = PasswordUser(models.User())
>>> user.username = 'new_user_name'
>>> user.email = 'new_user_email@example.com'
>>> user.password = 'set_the_password'
>>> session = settings.Session()
>>> session.add(user)
>>> session.commit()
>>> session.close()
>>> exit()
    LDAP
To turn on LDAP authentication, configure your airflow.cfg as follows. Please note that the example uses an encrypted connection to the ldap server, as you probably do not want passwords to be readable on the network level. It is however possible to configure without encryption if you really want to.
Additionally, if you are using Active Directory and are not explicitly specifying an OU that your users are in, you will need to change search_scope to SUBTREE.
Valid search_scope options can be found in the ldap3 Documentation.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.ldap_auth

[ldap]
# set a connection without encryption: uri = ldap://<your.ldap.server>:<port>
uri = ldaps://<your.ldap.server>:<port>
user_filter = objectClass=*
# in case of Active Directory you would use: user_name_attr = sAMAccountName
user_name_attr = uid
# group_member_attr should be set accordingly with *_filter
# e.g.
#     group_member_attr = groupMembership
#     superuser_filter = groupMembership=CN=airflow-super-users...
group_member_attr = memberOf
superuser_filter = memberOf=CN=airflow-super-users,OU=Groups,OU=RWC,OU=US,OU=NORAM,DC=example,DC=com
data_profiler_filter = memberOf=CN=airflow-data-profilers,OU=Groups,OU=RWC,OU=US,OU=NORAM,DC=example,DC=com
bind_user = cn=Manager,dc=example,dc=com
bind_password = insecure
basedn = dc=example,dc=com
cacert = /etc/ca/ldap_ca.crt
# Set search_scope to one of them: BASE, LEVEL, SUBTREE
# Set search_scope to SUBTREE if using Active Directory, and not specifying an Organizational Unit
search_scope = LEVEL
The superuser_filter and data_profiler_filter are optional. If defined, these configurations allow you to specify LDAP groups that users must belong to in order to have superuser (admin) and data-profiler permissions. If undefined, all users will be superusers and data profilers.
    Roll your own
Airflow uses flask_login and exposes a set of hooks in the airflow.default_login module. You can alter the content, make it part of the PYTHONPATH, and configure it as a backend in airflow.cfg.
[webserver]
authenticate = True
auth_backend = mypackage.auth
    Multitenancy
You can filter the list of dags in the webserver by owner name when authentication is turned on, by setting webserver:filter_by_owner in your config. With this, a user will see only the dags which it is owner of, unless it is a superuser.
[webserver]
filter_by_owner = True
    Kerberos
Airflow has initial support for Kerberos. This means that Airflow can renew kerberos tickets for itself and store them in the ticket cache. The hooks and dags can make use of the ticket to authenticate against kerberized services.
Limitations
Please note that at this time not all hooks have been adjusted to make use of this functionality. Also it does not integrate kerberos into the web interface, so you will have to rely on network level security for now to make sure your service remains secure.
Celery integration has not been tried and tested yet. However, if you generate a key tab for every host and launch a ticket renewer next to every worker, it will most likely work.
    Enabling kerberos
    Airflow
To enable kerberos you will need to generate a (service) key tab.
# in the kadmin.local or kadmin shell, create the airflow principal
kadmin: addprinc -randkey airflow/fully.qualified.domain.name@YOUR-REALM.COM

# Create the airflow keytab file that will contain the airflow principal
kadmin: xst -norandkey -k airflow.keytab airflow/fully.qualified.domain.name
Now store this file in a location where the airflow user can read it (chmod 600). Then add the following to your airflow.cfg:
[core]
security = kerberos

[kerberos]
keytab = /etc/airflow/airflow.keytab
reinit_frequency = 3600
principal = airflow
    Launch the ticket renewer by
    # run ticket renewer
    airflow kerberos
    Hadoop
If you want to use impersonation, this needs to be enabled in core-site.xml of your hadoop config.

<property>
  <name>hadoop.proxyuser.airflow.groups</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.airflow.users</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.airflow.hosts</name>
  <value>*</value>
</property>
Of course, if you need to tighten your security, replace the asterisk with something more appropriate.
    Using kerberos authentication
The hive hook has been updated to take advantage of kerberos authentication. To allow your DAGs to use it, simply update the connection details with, for example:
{ "use_beeline": true, "principal": "hive/_HOST@EXAMPLE.COM"}
Adjust the principal to your settings. The _HOST part will be replaced by the fully qualified domain name of the server.
You can specify whether you would like to use the dag owner as the user for the connection, or the user specified in the login section of the connection. For the login user, specify the following as extra:
{ "use_beeline": true, "principal": "hive/_HOST@EXAMPLE.COM", "proxy_user": "login"}
For the DAG owner, use:
{ "use_beeline": true, "principal": "hive/_HOST@EXAMPLE.COM", "proxy_user": "owner"}
and in your DAG, when initializing the HiveOperator, specify:
run_as_owner=True
To use kerberos authentication, you must install Airflow with the kerberos extras group:
    pip install airflow[kerberos]
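A hedged sketch of a DAG task that uses the kerberized hive connection settings described above; the dag, query and connection id are placeholders, and only the run_as_owner parameter comes from the text.
from datetime import datetime

from airflow import DAG
from airflow.operators.hive_operator import HiveOperator

dag = DAG('kerberized_hive_demo', start_date=datetime(2016, 1, 1), schedule_interval=None)

count_rows = HiveOperator(
    task_id='count_rows',
    hql='SELECT COUNT(*) FROM default.my_table',
    hive_cli_conn_id='hive_cli_default',
    run_as_owner=True,  # use the DAG owner as the proxy user, as explained above
    dag=dag,
)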
OAuth Authentication
    GitHub Enterprise (GHE) Authentication
The GitHub Enterprise authentication backend can be used to authenticate users against an installation of GitHub Enterprise using OAuth2. You can optionally specify a team whitelist (composed of slug cased team names) to restrict login to only members of those teams.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.github_enterprise_auth

[github_enterprise]
host = github.example.com
client_id = oauth_key_from_github_enterprise
client_secret = oauth_secret_from_github_enterprise
oauth_callback_route = /example/ghe_oauth/callback
allowed_teams = 1, 345, 23
    Note
If you do not specify a team whitelist, anyone with a valid account on your GHE installation will be able to login to Airflow.
To use GHE authentication, you must install Airflow with the github_enterprise extras group:
    pip install airflow[github_enterprise]
Setting up GHE Authentication
An application must be set up in GHE before you can use the GHE authentication backend. In order to set up an application:
1. Navigate to your GHE profile
2. Select 'Applications' from the left hand nav
3. Select the 'Developer Applications' tab
4. Click 'Register new application'
5. Fill in the required information (the 'Authorization callback URL' must be fully qualified, e.g. http://airflow.example.com/example/ghe_oauth/callback)
6. Click 'Register application'
7. Copy 'Client ID', 'Client Secret', and your callback route to your airflow.cfg according to the above example
Using GHE Authentication with github.com
It is possible to use GHE authentication with github.com:
1. Create an OAuth App
2. Copy 'Client ID', 'Client Secret' to your airflow.cfg according to the above example
3. Set host = github.com and oauth_callback_route = /oauth/callback in airflow.cfg
Google Authentication
The Google authentication backend can be used to authenticate users against Google using OAuth2. You must specify the email domains to restrict login, separated with a comma, to only members of those domains.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.google_auth

[google]
client_id = google_client_id
client_secret = google_client_secret
oauth_callback_route = /oauth2callback
domain = example1.com,example2.com
To use Google authentication, you must install Airflow with the google_auth extras group:
    pip install airflow[google_auth]
Setting up Google Authentication
An application must be set up in the Google API Console before you can use the Google authentication backend. In order to set up an application:
1. Navigate to https://console.developers.google.com/apis/
2. Select 'Credentials' from the left hand nav
3. Click 'Create credentials' and choose 'OAuth client ID'
4. Choose 'Web application'
5. Fill in the required information (the 'Authorized redirect URIs' must be fully qualified, e.g. http://airflow.example.com/oauth2callback)
6. Click 'Create'
7. Copy 'Client ID', 'Client Secret', and your redirect URI to your airflow.cfg according to the above example
    SSL
SSL can be enabled by providing a certificate and key. Once enabled, be sure to use https:// in your browser.
[webserver]
web_server_ssl_cert = <path to cert>
web_server_ssl_key = <path to key>
Enabling SSL will not automatically change the web server port. If you want to use the standard port 443, you'll need to configure that too. Be aware that super user privileges (or cap_net_bind_service on Linux) are required to listen on port 443.
# Optionally, set the server to listen on the standard SSL port.
web_server_port = 443
base_url = http://<hostname or IP>:443
To enable the CeleryExecutor with SSL, make sure to properly generate client and server certs and keys.
[celery]
ssl_active = True
ssl_key = <path to key>
ssl_cert = <path to cert>
ssl_cacert = <path to cacert>
Impersonation
Airflow has the ability to impersonate a unix user while running task instances, based on the task's run_as_user parameter, which takes a user's name (see the sketch below).
NOTE: For impersonations to work, Airflow must be run with sudo, as subtasks are run with sudo -u and permissions of files are changed. Furthermore, the unix user needs to exist on the worker. Here is what a simple sudoers file entry could look like to achieve this, assuming airflow is running as the airflow user. Note that this means that the airflow user must be trusted and treated the same way as the root user.
airflow ALL=(ALL) NOPASSWD: ALL
Subtasks with impersonation will still log to the same folder, except that the files they log to will have permissions changed such that only the unix user can write to them.
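A minimal sketch of the run_as_user parameter described above; the dag, command and unix user name are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('impersonation_demo', start_date=datetime(2016, 1, 1), schedule_interval=None)

whoami = BashOperator(
    task_id='whoami',
    bash_command='whoami',
    run_as_user='etl_user',  # the task instance is executed as this unix user via sudo -u
    dag=dag,
)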
Default Impersonation
To prevent tasks that don't use impersonation from being run with sudo privileges, you can set the core:default_impersonation config, which sets a default user to impersonate if run_as_user is not set.
[core]
default_impersonation = airflow
Time zones
Support for time zones is enabled by default. Airflow stores datetime information in UTC internally and in the metadata database, which allows you to run DAGs with time-zone-dependent schedules. At the moment, Airflow does not convert datetimes to the end user's time zone in the user interface; there they are always displayed in UTC. Also, templates used in Operators are not converted: time zone information is exposed, and it is up to the DAG author to decide what to do with it.
This would be handy if your users live in more than one time zone and you want to display datetime information according to each user's wall clock.
Even if you are running Airflow in only one time zone, it is still good practice to store data in UTC in your database (this was also recommended, or even required, before Airflow became time zone aware). The main reason is Daylight Saving Time (DST). Many countries have a system of DST, where clocks are moved forward in spring and backward in autumn. If you're working in local time, you're likely to encounter errors twice a year when the transitions happen. (The pendulum and pytz documentation discusses these issues in greater detail.) This probably doesn't matter for a simple DAG, but it's a problem if you are in, for example, financial services, where you have end of day deadlines to meet.
The time zone is set in airflow.cfg. By default it is set to utc, but you can change it to use the system's settings or an arbitrary IANA time zone, e.g. Europe/Amsterdam. Airflow relies on pendulum, which is more accurate than pytz; pendulum is installed automatically when you install Airflow.
Please note that the Airflow web UI currently shows times in UTC only, and the scheduler also runs jobs based on UTC.
Concepts
Naive and aware datetime objects
Python's datetime.datetime objects have a tzinfo attribute that can be used to store time zone information, represented as an instance of a subclass of datetime.tzinfo. When this attribute is set and describes an offset, a datetime object is aware; otherwise, it is naive.
You can use timezone.is_aware() and timezone.is_naive() to determine whether datetimes are aware or naive.
Because Airflow uses time-zone-aware datetime objects, if your code creates datetime objects they need to be aware too.
from airflow.utils import timezone

now = timezone.utcnow()
a_date = timezone.datetime(2017, 1, 1)
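A quick, hedged check with the helpers mentioned above (purely illustrative):
from datetime import datetime

from airflow.utils import timezone

timezone.is_aware(timezone.utcnow())     # True
timezone.is_naive(datetime(2017, 1, 1))  # True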
Interpretation of naive datetime objects
Although Airflow operates fully time zone aware, it still accepts naive date time objects for start_dates and end_dates in your DAG definitions. This is mostly in order to preserve backwards compatibility. In case a naive start_date or end_date is encountered, the default time zone is applied. It is applied in such a way that it is assumed that the naive date time is already in the default time zone. In other words, if you have a default time zone setting of Europe/Amsterdam and create a naive datetime start_date of datetime(2017, 1, 1), it is assumed to be a start_date of Jan 1, 2017 Amsterdam time.
default_args = dict(
    start_date=datetime(2016, 1, 1),
    owner='Airflow'
)

dag = DAG('my_dag', default_args=default_args)
op = DummyOperator(task_id='dummy', dag=dag)
print(op.owner)  # Airflow
Unfortunately, during DST transitions some datetimes don't exist or are ambiguous. In such situations, pendulum raises an exception. That's why you should always create aware datetime objects when time zone support is enabled.
In practice, this is rarely an issue. Airflow gives you aware datetime objects in the models and DAGs, and most often new datetime objects are created from existing ones through timedelta arithmetic. The only datetime that's often created in application code is the current time, and timezone.utcnow() automatically does the right thing.
Default time zone
The default time zone is the time zone defined by the default_timezone setting under [core]. If you just installed Airflow it will be set to utc, which is recommended. You can also set it to system or an IANA time zone (e.g. Europe/Amsterdam). DAGs are also evaluated on Airflow workers, so it is important to make sure this setting is equal on all Airflow nodes.
[core]
default_timezone = utc
Time zone aware DAGs
Creating a time zone aware DAG is quite simple. Just make sure to supply a time zone aware start_date. It is recommended to use pendulum for this, but pytz (to be installed manually) can also be used.
import pendulum

local_tz = pendulum.timezone("Europe/Amsterdam")

default_args = dict(
    start_date=datetime(2016, 1, 1, tzinfo=local_tz),
    owner='Airflow'
)

dag = DAG('my_tz_dag', default_args=default_args)
op = DummyOperator(task_id='dummy', dag=dag)
print(dag.timezone)  # <Timezone [Europe/Amsterdam]>
Please note that while it is possible to set a start_date and end_date for Tasks, the DAG timezone or global timezone (in that order) will always be used to calculate the next execution date. Upon first encounter, the start date or end date will be converted to UTC using the timezone associated with start_date or end_date; then, for calculations, this timezone information will be disregarded.
Templates
Airflow returns time zone aware datetimes in templates, but does not convert them to local time, so they remain in UTC. It is left up to the DAG to handle this.
import pendulum

local_tz = pendulum.timezone("Europe/Amsterdam")
local_tz.convert(execution_date)
Cron schedules
In case you set a cron schedule, Airflow assumes you will always want to run at the exact same time. It will then ignore daylight savings time. Thus, if you have a schedule that says run at the end of interval every day at 08:00 GMT+1, it will always run at the end of interval 08:00 GMT+1, regardless of whether daylight savings time is in place.
Time deltas
For schedules with time deltas, Airflow assumes you always will want to run with the specified interval. So if you specify a timedelta(hours=2), you will always want to run two hours later. In this case, daylight savings time will be taken into account, as in the sketch below.
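A hedged sketch of a DAG scheduled with a time delta, as described above; the dag id and start date are placeholders.
from datetime import datetime, timedelta

import pendulum
from airflow import DAG

local_tz = pendulum.timezone("Europe/Amsterdam")

dag = DAG(
    'every_two_hours',
    start_date=datetime(2016, 1, 1, tzinfo=local_tz),
    schedule_interval=timedelta(hours=2),  # DST is taken into account for time-delta schedules
)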
    Experimental Rest API
Airflow exposes an experimental Rest API. It is available through the webserver. Endpoints are available at /api/experimental/. Please note that we expect the endpoint definitions to change.
    Endpoints
This is a place holder until the swagger definitions are active:
· /api/experimental/dags/<DAG_ID>/tasks/<TASK_ID> returns info for a task (GET)
· /api/experimental/dags/<DAG_ID>/dag_runs creates a dag_run for a given dag id (POST); see the sketch below
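A hedged sketch of calling the trigger endpoint listed above with the requests library, assuming the webserver runs on localhost:8080 and API authentication is left open; the dag id and conf payload are placeholders.
import requests

resp = requests.post(
    "http://localhost:8080/api/experimental/dags/my_dag/dag_runs",
    json={"conf": "{}"},  # conf is passed through to the created dag_run
)
print(resp.status_code, resp.text)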
    CLI
For some functions the CLI can use the API. To configure the CLI to use the API when available, configure as follows:
[cli]
api_client = airflow.api.client.json_client
endpoint_url = http://<WEBSERVER>:<PORT>
    Authentication
Authentication for the API is handled separately from the Web Authentication. The default is to not require any authentication on the API, i.e. wide open by default. This is not recommended if your Airflow webserver is publicly accessible, and you should probably use the deny all backend:
[api]
auth_backend = airflow.api.auth.backend.deny_all
Two "real" methods for authentication are currently supported for the API.
To enable Password authentication, set the following in the configuration:
[api]
auth_backend = airflow.contrib.auth.backends.password_auth
Its usage is similar to the Password Authentication used for the Web interface.
To enable Kerberos authentication, set the following in the configuration:
[api]
auth_backend = airflow.api.auth.backend.kerberos_auth

[kerberos]
keytab = <KEYTAB>
The Kerberos service is configured as airflow/fully.qualified.domain.name@REALM. Make sure this principal exists in the keytab file.
Integration
    · Reverse Proxy
    · Azure Microsoft Azure
    · AWS Amazon Web Services
    · Databricks
    · GCP Google Cloud Platform
    · Qubole
Reverse Proxy
Airflow can be set up behind a reverse proxy, with the ability to set its endpoint with great flexibility.
For example, you can configure your reverse proxy to serve:
https://lab.mycompany.com/myorg/airflow/
To do so, you need to set the following setting in your airflow.cfg:
base_url = http://my_host/myorg/airflow
Additionally, if you use Celery Executor, you can get Flower at /myorg/flower/ with:
flower_url_prefix = /myorg/flower
Your reverse proxy (ex: nginx) should be configured as follows:
· pass the url and http header as is for the Airflow webserver, without any rewrite, for example:
server {
    listen 80;
    server_name lab.mycompany.com;

    location /myorg/airflow/ {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_redirect off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
· rewrite the url for the flower endpoint:
server {
    listen 80;
    server_name lab.mycompany.com;

    location /myorg/flower/ {
        rewrite ^/myorg/flower/(.*)$ /$1 break;  # remove prefix from http header
        proxy_pass http://localhost:5555;
        proxy_set_header Host $host;
        proxy_redirect off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
To ensure that Airflow generates URLs with the correct scheme when running behind a TLS-terminating proxy, you should configure the proxy to set the X-Forwarded-Proto header, and enable the ProxyFix middleware in your airflow.cfg:
enable_proxy_fix = True
Note: you should only enable the ProxyFix middleware when running Airflow behind a trusted proxy (AWS ELB, nginx, etc.).
    Azure Microsoft Azure
Airflow has limited support for Microsoft Azure: interfaces exist only for Azure Blob Storage and Azure Data Lake. The Hook, Sensor and Operator for Blob Storage and the Azure Data Lake Hook are in the contrib section.
    Azure Blob Storage
All classes communicate via the Window Azure Storage Blob protocol. Make sure that an Airflow connection of type wasb exists. Authorization can be done by supplying a login (=Storage account name) and password (=KEY), or a login and SAS token in the extra field (see connection wasb_default for an example).
    · WasbBlobSensor Checks if a blob is present on Azure Blob storage
    · WasbPrefixSensor Checks if blobs matching a prefix are present on Azure Blob storage
    · FileToWasbOperator Uploads a local file to a container as a blob
    · WasbHook Interface with Azure Blob Storage
    WasbBlobSensor
    WasbPrefixSensor
    FileToWasbOperator
    WasbHook
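A hedged sketch of using the wasb connection with WasbHook; the connection id, container and blob names are placeholders, and the load_file signature is an assumption based on the contrib hook.
from airflow.contrib.hooks.wasb_hook import WasbHook

hook = WasbHook(wasb_conn_id='wasb_default')
# upload a local file into a container as a blob (paths are placeholders)
hook.load_file('/tmp/report.csv', container_name='reports', blob_name='2018/report.csv')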
    Azure File Share
Cloud variant of an SMB file share. Make sure that an Airflow connection of type wasb exists. Authorization can be done by supplying a login (=Storage account name) and password (=Storage account key), or a login and SAS token in the extra field (see connection wasb_default for an example).
    AzureFileShareHook
    Logging
Airflow can be configured to read and write task logs in Azure Blob Storage. See Writing Logs to Azure Blob Storage.
    Azure Data Lake
AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an Airflow connection of type azure_data_lake exists. Authorization can be done by supplying a login (=Client ID), a password (=Client Secret), and the extra fields tenant (Tenant) and account_name (Account Name) (see connection azure_data_lake_default for an example).
· AzureDataLakeHook: Interface with Azure Data Lake
    AzureDataLakeHook
    AWS Amazon Web Services
Airflow has extensive support for Amazon Web Services. But note that the Hooks, Sensors and Operators are in the contrib section.
    AWS EMR
    · EmrAddStepsOperator Adds steps to an existing EMR JobFlow
    · EmrCreateJobFlowOperator Creates an EMR JobFlow reading the config from the EMR connection
    · EmrTerminateJobFlowOperator Terminates an EMR JobFlow
    · EmrHook Interact with AWS EMR
    EmrAddStepsOperator
class airflow.contrib.operators.emr_add_steps_operator.EmrAddStepsOperator(job_flow_id, aws_conn_id='s3_default', steps=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
An operator that adds steps to an existing EMR job_flow (see the usage sketch after the parameter list).
    Parameters
    · job_flow_id (str) – id of the JobFlow to add steps to (templated)
    · aws_conn_id (str) – aws connection to uses
    · steps (list) – boto3 style steps to be added to the jobflow (templated)
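A hedged usage sketch for the operator documented above; the dag, job flow id, connection id and the boto3-style step are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator

dag = DAG('emr_demo', start_date=datetime(2018, 1, 1), schedule_interval=None)

add_step = EmrAddStepsOperator(
    task_id='add_spark_step',
    job_flow_id='j-XXXXXXXX',  # id of an existing EMR JobFlow
    aws_conn_id='aws_default',
    steps=[{                   # boto3-style step definition
        'Name': 'spark-step',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {'Jar': 'command-runner.jar', 'Args': ['spark-submit', 'app.py']},
    }],
    dag=dag,
)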
    EmrCreateJobFlowOperator
class airflow.contrib.operators.emr_create_job_flow_operator.EmrCreateJobFlowOperator(aws_conn_id='s3_default', emr_conn_id='emr_default', job_flow_overrides=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates an EMR JobFlow, reading the config from the EMR connection. A dictionary of JobFlow overrides can be passed that overrides the config from the connection.
    Parameters
    · aws_conn_id (str) – aws connection to uses
    · emr_conn_id (str) – emr connection to use
    · job_flow_overrides – boto3 style arguments to override emr_connection extra (templated)
    EmrTerminateJobFlowOperator
class airflow.contrib.operators.emr_terminate_job_flow_operator.EmrTerminateJobFlowOperator(job_flow_id, aws_conn_id='s3_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator to terminate EMR JobFlows
    Parameters
    · job_flow_id (str) – id of the JobFlow to terminate (templated)
    · aws_conn_id (str) – aws connection to uses
    EmrHook
class airflow.contrib.hooks.emr_hook.EmrHook(emr_conn_id=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS EMR. emr_conn_id is only necessary for using the create_job_flow method.
    create_job_flow(job_flow_overrides)[source]
Creates a job flow using the config from the EMR connection. Keys of the json extra hash may have the arguments of the boto3 run_job_flow method. Overrides for this config may be passed as the job_flow_overrides.
    AWS S3
    · S3Hook Interact with AWS S3
    · S3FileTransformOperator Copies data from a source S3 location to a temporary location on the local filesystem
    · S3ListOperator Lists the files matching a key prefix from a S3 location
    · S3ToGoogleCloudStorageOperator Syncs an S3 location with a Google Cloud Storage bucket
    · S3ToHiveTransfer Moves data from S3 to Hive The operator downloads a file from S3 stores the file locally before loading it into a Hive table
    S3Hook
class airflow.hooks.S3_hook.S3Hook(aws_conn_id='aws_default', verify=None)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS S3, using the boto3 library. A combined usage sketch appears at the end of this class reference.
    check_for_bucket(bucket_name)[source]
    Check if bucket_name exists
    Parameters
    bucket_name (str) – the name of the bucket
    check_for_key(key bucket_nameNone)[source]
Checks if a key exists in a bucket
    Parameters
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which the file is stored
    check_for_prefix(bucket_name prefix delimiter)[source]
Checks that a prefix exists in a bucket
    Parameters
    · bucket_name (str) – the name of the bucket
    · prefix (str) – a key prefix
    · delimiter (str) – the delimiter marks key hierarchy
    check_for_wildcard_key(wildcard_key bucket_nameNone delimiter'')[source]
Checks that a key matching a wildcard expression exists in a bucket
    Parameters
    · wildcard_key (str) – the path to the key
    · bucket_name (str) – the name of the bucket
    · delimiter (str) – the delimiter marks key hierarchy
copy_object(source_bucket_key, dest_bucket_key, source_bucket_name=None, dest_bucket_name=None, source_version_id=None)[source]
Creates a copy of an object that is already stored in S3.
Note: the S3 connection used here needs to have access to both source and destination bucket/key.
Parameters
· source_bucket_key (str) –
The key of the source object.
It can be either a full s3:// style url or a relative path from the root level.
When it's specified as a full s3:// url, please omit source_bucket_name.
· dest_bucket_key (str) –
The key of the object to copy to.
The convention to specify dest_bucket_key is the same as source_bucket_key.
· source_bucket_name (str) –
Name of the S3 bucket where the source object is in.
It should be omitted when source_bucket_key is provided as a full s3:// url.
· dest_bucket_name (str) –
Name of the S3 bucket to where the object is copied.
It should be omitted when dest_bucket_key is provided as a full s3:// url.
    · source_version_id (str) – Version ID of the source object (OPTIONAL)
    delete_objects(bucket keys)[source]
    Parameters
    · bucket (str) – Name of the bucket in which you are going to delete object(s)
    · keys (str or list) –
    The key(s) to delete from S3 bucket
    When keys is a string it’s supposed to be the key name of the single object to delete
    When keys is a list it’s supposed to be the list of the keys to delete
    get_bucket(bucket_name)[source]
    Returns a boto3S3Bucket object
    Parameters
    bucket_name (str) – the name of the bucket
    get_key(key bucket_nameNone)[source]
    Returns a boto3s3Object
    Parameters
    · key (str) – the path to the key
    · bucket_name (str) – the name of the bucket
    get_wildcard_key(wildcard_key bucket_nameNone delimiter'')[source]
    Returns a boto3s3Object object matching the wildcard expression
    Parameters
    · wildcard_key (str) – the path to the key
    · bucket_name (str) – the name of the bucket
    · delimiter (str) – the delimiter marks key hierarchy
    list_keys(bucket_name prefix'' delimiter'' page_sizeNone max_itemsNone)[source]
    Lists keys in a bucket under prefix and not containing delimiter
    Parameters
    · bucket_name (str) – the name of the bucket
    · prefix (str) – a key prefix
    · delimiter (str) – the delimiter marks key hierarchy
    · page_size (int) – pagination size
    · max_items (int) – maximum items to return
    list_prefixes(bucket_name prefix'' delimiter'' page_sizeNone max_itemsNone)[source]
    Lists prefixes in a bucket under prefix
    Parameters
    · bucket_name (str) – the name of the bucket
    · prefix (str) – a key prefix
    · delimiter (str) – the delimiter marks key hierarchy
    · page_size (int) – pagination size
    · max_items (int) – maximum items to return
    load_bytes(bytes_data key bucket_nameNone replaceFalse encryptFalse)[source]
    Loads bytes to S3
    This is provided as a convenience to drop a string in S3 It uses the boto infrastructure to ship a file to s3
    Parameters
    · bytes_data (bytes) – bytes to set as content for the key
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which to store the file
    · replace (bool) – A flag to decide whether or not to overwrite the key if it already exists
    · encrypt (bool) – If True the file will be encrypted on the serverside by S3 and will be stored in an encrypted form while at rest in S3
    load_file(filename key bucket_nameNone replaceFalse encryptFalse)[source]
    Loads a local file to S3
    Parameters
    · filename (str) – name of the file to load
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which to store the file
    · replace (bool) – A flag to decide whether or not to overwrite the key if it already exists If replace is False and the key exists an error will be raised
    · encrypt (bool) – If True the file will be encrypted on the serverside by S3 and will be stored in an encrypted form while at rest in S3
    load_string(string_data key bucket_nameNone replaceFalse encryptFalse encoding'utf8')[source]
    Loads a string to S3
    This is provided as a convenience to drop a string in S3 It uses the boto infrastructure to ship a file to s3
    Parameters
    · string_data (str) – str to set as content for the key
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which to store the file
    · replace (bool) – A flag to decide whether or not to overwrite the key if it already exists
    · encrypt (bool) – If True the file will be encrypted on the serverside by S3 and will be stored in an encrypted form while at rest in S3
    read_key(key bucket_nameNone)[source]
    Reads a key from S3
    Parameters
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which the file is stored
    select_key(key bucket_nameNone expression'SELECT * FROM S3Object' expression_type'SQL' input_serializationNone output_serializationNone)[source]
    Reads a key with S3 Select
    Parameters
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which the file is stored
    · expression (str) – S3 Select expression
    · expression_type (str) – S3 Select expression type
    · input_serialization (dict) – S3 Select input data serialization format
    · output_serialization (dict) – S3 Select output data serialization format
    Returns
    retrieved subset of original data by S3 Select
    Return type
    str
    See also
    For more details about S3 Select parameters httpboto3readthedocsioenlatestreferenceservicess3html#S3Clientselect_object_content
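A hedged sketch tying together a few of the S3Hook methods documented above; the connection id, bucket and key names are placeholders.
from airflow.hooks.S3_hook import S3Hook

s3 = S3Hook(aws_conn_id='aws_default')
if not s3.check_for_key('raw/events.csv', bucket_name='data'):
    # upload a small CSV payload if the key is missing
    s3.load_string('id,value\n1,42\n', key='raw/events.csv',
                   bucket_name='data', replace=True)
print(s3.read_key('raw/events.csv', bucket_name='data'))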
    S3FileTransformOperator
class airflow.operators.s3_file_transform_operator.S3FileTransformOperator(source_s3_key, dest_s3_key, transform_script=None, select_expression=None, source_aws_conn_id='aws_default', source_verify=None, dest_aws_conn_id='aws_default', dest_verify=None, replace=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Copies data from a source S3 location to a temporary location on the local filesystem Runs a transformation on this file as specified by the transformation script and uploads the output to a destination S3 location
    The locations of the source and the destination files in the local filesystem is provided as an first and second arguments to the transformation script The transformation script is expected to read the data from source transform it and write the output to the local destination file The operator then takes over control and uploads the local destination file to S3
    S3 Select is also available to filter the source contents Users can omit the transformation script if S3 Select expression is specified
    Parameters
    · source_s3_key (str) – The key to be retrieved from S3 (templated)
    · source_aws_conn_id (str) – source s3 connection
    · source_verify (bool or str) –
    Whether or not to verify SSL certificates for S3 connetion By default SSL certificates are verified You can provide the following values
    o False do not validate SSL certificates SSL will still be used
    (unless use_ssl is False) but SSL certificates will not be verified
    o pathtocertbundlepem A filename of the CA cert bundle to uses
    You can specify this argument if you want to use a different CA cert bundle than the one used by botocore
    This is also applicable to dest_verify
    · dest_s3_key (str) – The key to be written from S3 (templated)
    · dest_aws_conn_id (str) – destination s3 connection
    · replace (bool) – Replace dest S3 key if it already exists
    · transform_script (str) – location of the executable transformation script
    · select_expression (str) – S3 Select expression
    S3ListOperator
class airflow.contrib.operators.s3_list_operator.S3ListOperator(bucket, prefix='', delimiter='', aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    List all objects from the bucket with the given string prefix in name
    This operator returns a python list with the name of objects which can be used by xcom in the downstream task
    Parameters
    · bucket (str) – The S3 bucket where to find the objects (templated)
    · prefix (str) – Prefix string to filters the objects whose name begin with such prefix (templated)
    · delimiter (str) – the delimiter marks key hierarchy (templated)
    · aws_conn_id (str) – The connection ID to use when connecting to S3 storage
    Parame verify
    Whether or not to verify SSL certificates for S3 connection By default SSL certificates are verified You can provide the following values False do not validate SSL certificates SSL will still be used
    (unless use_ssl is False) but SSL certificates will not be verified
    · pathtocertbundlepem A filename of the CA cert bundle to uses
    You can specify this argument if you want to use a different CA cert bundle than the one used by botocore
    Example
The following operator would list all the files (excluding subfolders) from the S3 customers/2018/04/ key in the data bucket.
s3_file = S3ListOperator(
    task_id='list_3s_files',
    bucket='data',
    prefix='customers/2018/04/',
    delimiter='/',
    aws_conn_id='aws_customers_conn'
)
    S3ToGoogleCloudStorageOperator
class airflow.contrib.operators.s3_to_gcs_operator.S3ToGoogleCloudStorageOperator(bucket, prefix='', delimiter='', aws_conn_id='aws_default', verify=None, dest_gcs_conn_id=None, dest_gcs=None, delegate_to=None, replace=False, *args, **kwargs)[source]
Bases: airflow.contrib.operators.s3_list_operator.S3ListOperator
    Synchronizes an S3 key possibly a prefix with a Google Cloud Storage destination path
    Parameters
    · bucket (str) – The S3 bucket where to find the objects (templated)
    · prefix (str) – Prefix string which filters objects whose name begin with such prefix (templated)
    · delimiter (str) – the delimiter marks key hierarchy (templated)
    · aws_conn_id (str) – The source S3 connection
    · dest_gcs_conn_id (str) – The destination connection ID to use when connecting to Google Cloud Storage
    · dest_gcs (str) – The destination Google Cloud Storage bucket and prefix where you want to store the files (templated)
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · replace (bool) – Whether you want to replace existing destination files or not
    Parame verify
    Whether or not to verify SSL certificates for S3 connection By default SSL certificates are verified You can provide the following values False do not validate SSL certificates SSL will still be used
    (unless use_ssl is False) but SSL certificates will not be verified
    · pathtocertbundlepem A filename of the CA cert bundle to uses
    You can specify this argument if you want to use a different CA cert bundle than the one used by botocore
    Example
s3_to_gcs_op = S3ToGoogleCloudStorageOperator(
    task_id='s3_to_gcs_example',
    bucket='my-s3-bucket',
    prefix='data/customers-201804',
    dest_gcs_conn_id='google_cloud_default',
    dest_gcs='gs://my.gcs.bucket/some/customers/',
    replace=False,
    dag=my_dag)
Note that bucket, prefix, delimiter and dest_gcs are templated, so you can use variables in them if you wish.
    S3ToHiveTransfer
class airflow.operators.s3_to_hive_operator.S3ToHiveTransfer(s3_key, field_dict, hive_table, delimiter=',', create=True, recreate=False, partition=None, headers=False, check_headers=False, wildcard_match=False, aws_conn_id='aws_default', verify=None, hive_cli_conn_id='hive_cli_default', input_compressed=False, tblproperties=None, select_expression=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from S3 to Hive. The operator downloads a file from S3 and stores the file locally before loading it into a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor's metadata. A usage sketch follows the parameter list below.
Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table, before loading it into its final destination using a HiveOperator.
    Parameters
    · s3_key (str) – The key to be retrieved from S3 (templated)
    · field_dict (dict) – A dictionary of the fields name in the file as keys and their Hive types as values
    · hive_table (str) – target Hive table use dot notation to target a specific database (templated)
    · create (bool) – whether to create the table if it doesn’t exist
    · recreate (bool) – whether to drop and recreate the table at every execution
    · partition (dict) – target partition as a dict of partition columns and values (templated)
    · headers (bool) – whether the file contains column names on the first line
    · check_headers (bool) – whether the column names on the first line should be checked against the keys of field_dict
    · wildcard_match (bool) – whether the s3_key should be interpreted as a Unix wildcard pattern
    · delimiter (str) – field delimiter in the file
    · aws_conn_id (str) – source s3 connection
    · hive_cli_conn_id (str) – destination hive connection
    · input_compressed (bool) – Boolean to determine if file decompression is required to process headers
    · tblproperties (dict) – TBLPROPERTIES of the hive table being created
    · select_expression (str) – S3 Select expression
    Parame verify
    Whether or not to verify SSL certificates for S3 connection By default SSL certificates are verified You can provide the following values False do not validate SSL certificates SSL will still be used
    (unless use_ssl is False) but SSL certificates will not be verified
    · pathtocertbundlepem A filename of the CA cert bundle to uses
    You can specify this argument if you want to use a different CA cert bundle than the one used by botocore
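A hedged usage sketch for the transfer documented above; the dag, S3 key, field mapping, table and connection ids are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.s3_to_hive_operator import S3ToHiveTransfer

dag = DAG('s3_to_hive_demo', start_date=datetime(2018, 1, 1), schedule_interval=None)

s3_to_hive = S3ToHiveTransfer(
    task_id='load_customers',
    s3_key='customers/2018/04/customers.csv',
    field_dict={'id': 'BIGINT', 'name': 'STRING'},  # file columns and their Hive types
    hive_table='staging.customers',
    delimiter=',',
    headers=True,
    recreate=True,
    aws_conn_id='aws_default',
    hive_cli_conn_id='hive_cli_default',
    dag=dag,
)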
    AWS EC2 Container Service
    · ECSOperator Execute a task on AWS EC2 Container Service
    ECSOperator
class airflow.contrib.operators.ecs_operator.ECSOperator(task_definition, cluster, overrides, aws_conn_id=None, region_name=None, launch_type='EC2', group=None, placement_constraints=None, platform_version='LATEST', network_configuration=None, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute a task on AWS EC2 Container Service (see the usage sketch after the parameter list)
    Parameters
    · task_definition (str) – the task definition name on EC2 Container Service
    · cluster (str) – the cluster name on EC2 Container Service
    · overrides (dict) – the same parameter that boto3 will receive (templated) httpboto3readthedocsorgenlatestreferenceservicesecshtml#ECSClientrun_task
    · aws_conn_id (str) – connection id of AWS credentials region name If None credential boto3 strategy will be used (httpboto3readthedocsioenlatestguideconfigurationhtml)
    · region_name (str) – region name to use in AWS Hook Override the region_name in connection (if provided)
    · launch_type (str) – the launch type on which to run your task (EC2’ or FARGATE’)
    · group (str) – the name of the task group associated with the task
    · placement_constraints (list) – an array of placement constraint objects to use for the task
    · platform_version (str) – the platform version on which your task is running
    · network_configuration (dict) – the network configuration for the task
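A hedged usage sketch for ECSOperator as documented above; the dag, cluster, task definition and container override are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ecs_operator import ECSOperator

dag = DAG('ecs_demo', start_date=datetime(2018, 1, 1), schedule_interval=None)

run_container = ECSOperator(
    task_id='run_container',
    task_definition='my-task-def',
    cluster='my-ecs-cluster',
    overrides={'containerOverrides': [{'name': 'app', 'command': ['python', 'job.py']}]},
    aws_conn_id='aws_default',
    region_name='us-east-1',
    launch_type='EC2',
    dag=dag,
)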
    AWS Batch Service
    · AWSBatchOperator Execute a task on AWS Batch Service
    AWSBatchOperator
class airflow.contrib.operators.awsbatch_operator.AWSBatchOperator(job_name, job_definition, job_queue, overrides, max_retries=4200, aws_conn_id=None, region_name=None, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Execute a job on AWS Batch Service
    Parameters
    · job_name (str) – the name for the job that will run on AWS Batch (templated)
    · job_definition (str) – the job definition name on AWS Batch
    · job_queue (str) – the queue name on AWS Batch
    · overrides (dict) – the same parameter that boto3 will receive on containerOverrides (templated) httpboto3readthedocsioenlatestreferenceservicesbatchhtml#submit_job
    · max_retries (int) – exponential backoff retries while waiter is not merged 4200 48 hours
    · aws_conn_id (str) – connection id of AWS credentials region name If None credential boto3 strategy will be used (httpboto3readthedocsioenlatestguideconfigurationhtml)
    · region_name (str) – region name to use in AWS Hook Override the region_name in connection (if provided)
    AWS RedShift
    · AwsRedshiftClusterSensor Waits for a Redshift cluster to reach a specific status
    · RedshiftHook Interact with AWS Redshift using the boto3 library
    · RedshiftToS3Transfer Executes an unload command to S3 as CSV with or without headers
    · S3ToRedshiftTransfer Executes an copy command from S3 as CSV with or without headers
    AwsRedshiftClusterSensor
class airflow.contrib.sensors.aws_redshift_cluster_sensor.AwsRedshiftClusterSensor(cluster_identifier, target_status='available', aws_conn_id='aws_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits for a Redshift cluster to reach a specific status
    Parameters
    · cluster_identifier (str) – The identifier for the cluster being pinged
    · target_status (str) – The cluster status desired
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
    RedshiftHook
class airflow.contrib.hooks.redshift_hook.RedshiftHook(aws_conn_id='aws_default', verify=None)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Redshift, using the boto3 library (see the sketch after the method list)
    cluster_status(cluster_identifier)[source]
    Return status of a cluster
    Parameters
    cluster_identifier (str) – unique identifier of a cluster
    create_cluster_snapshot(snapshot_identifier cluster_identifier)[source]
    Creates a snapshot of a cluster
    Parameters
    · snapshot_identifier (str) – unique identifier for a snapshot of a cluster
    · cluster_identifier (str) – unique identifier of a cluster
    delete_cluster(cluster_identifier skip_final_cluster_snapshotTrue final_cluster_snapshot_identifier'')[source]
    Delete a cluster and optionally create a snapshot
    Parameters
    · cluster_identifier (str) – unique identifier of a cluster
    · skip_final_cluster_snapshot (bool) – determines cluster snapshot creation
    · final_cluster_snapshot_identifier (str) – name of final cluster snapshot
    describe_cluster_snapshots(cluster_identifier)[source]
    Gets a list of snapshots for a cluster
    Parameters
    cluster_identifier (str) – unique identifier of a cluster
    restore_from_cluster_snapshot(cluster_identifier snapshot_identifier)[source]
    Restores a cluster from its snapshot
    Parameters
    · cluster_identifier (str) – unique identifier of a cluster
    · snapshot_identifier (str) – unique identifier for a snapshot of a cluster
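A hedged sketch combining the RedshiftHook methods documented above; the connection id, cluster identifier and snapshot name are placeholders.
from airflow.contrib.hooks.redshift_hook import RedshiftHook

redshift = RedshiftHook(aws_conn_id='aws_default')
if redshift.cluster_status('analytics-cluster') == 'available':
    # only snapshot a cluster that is up
    redshift.create_cluster_snapshot('nightly-snapshot', 'analytics-cluster')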
    RedshiftToS3Transfer
class airflow.operators.redshift_to_s3_operator.RedshiftToS3Transfer(schema, table, s3_bucket, s3_key, redshift_conn_id='redshift_default', aws_conn_id='aws_default', verify=None, unload_options=(), autocommit=False, include_header=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Executes an UNLOAD command to s3 as a CSV with headers
    Parameters
    · schema (str) – reference to a specific schema in redshift database
    · table (str) – reference to a specific table in redshift database
    · s3_bucket (str) – reference to a specific S3 bucket
    · s3_key (str) – reference to a specific S3 key
    · redshift_conn_id (str) – reference to a specific redshift database
    · aws_conn_id (str) – reference to a specific S3 connection
    · unload_options (list) – reference to a list of UNLOAD options
    Parame verify
    Whether or not to verify SSL certificates for S3 connection By default SSL certificates are verified You can provide the following values
    · False do not validate SSL certificates SSL will still be used
    (unless use_ssl is False) but SSL certificates will not be verified
    · pathtocertbundlepem A filename of the CA cert bundle to uses
    You can specify this argument if you want to use a different CA cert bundle than the one used by botocore
    S3ToRedshiftTransfer
class airflow.operators.s3_to_redshift_operator.S3ToRedshiftTransfer(schema, table, s3_bucket, s3_key, redshift_conn_id='redshift_default', aws_conn_id='aws_default', verify=None, copy_options=(), autocommit=False, parameters=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Executes an COPY command to load files from s3 to Redshift
    Parameters
    · schema (str) – reference to a specific schema in redshift database
    · table (str) – reference to a specific table in redshift database
    · s3_bucket (str) – reference to a specific S3 bucket
    · s3_key (str) – reference to a specific S3 key
    · redshift_conn_id (str) – reference to a specific redshift database
    · aws_conn_id (str) – reference to a specific S3 connection
    · copy_options (list) – reference to a list of COPY options
Param verify:
Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used
(unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
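As an illustrative sketch (not from the original text), the two transfer operators above can be chained in a small DAG, assuming Airflow 1.10-style imports and existing 'redshift_default' / 'aws_default' connections; the schema, table, bucket and key names are hypothetical placeholders:
from datetime import datetime
from airflow import DAG
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer
from airflow.operators.s3_to_redshift_operator import S3ToRedshiftTransfer

dag = DAG('redshift_s3_example', start_date=datetime(2020, 1, 1), schedule_interval=None)

# UNLOAD a Redshift table to S3 as CSV with a header row
unload = RedshiftToS3Transfer(
    task_id='unload_orders',
    schema='public', table='orders',                    # placeholder schema/table
    s3_bucket='my-bucket', s3_key='exports/orders',     # placeholder bucket/key
    include_header=True,
    dag=dag)

# COPY the exported files into another Redshift table
load = S3ToRedshiftTransfer(
    task_id='load_orders_copy',
    schema='staging', table='orders_copy',
    s3_bucket='my-bucket', s3_key='exports/orders',
    copy_options=['CSV', 'IGNOREHEADER 1'],
    dag=dag)

unload >> load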
    AWS DynamoDB
    · HiveToDynamoDBTransferOperator Moves data from Hive to DynamoDB
    · AwsDynamoDBHook Interact with AWS DynamoDB
    HiveToDynamoDBTransferOperator
class airflow.contrib.operators.hive_to_dynamodb.HiveToDynamoDBTransferOperator(sql, table_name, table_keys, pre_process=None, pre_process_args=None, pre_process_kwargs=None, region_name=None, schema='default', hiveserver2_conn_id='hiveserver2_default', aws_conn_id='aws_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from Hive to DynamoDB. Note that for now the data is loaded into memory before being pushed to DynamoDB, so this operator should be used for smallish amounts of data.
    Parameters
    · sql (str) – SQL query to execute against the hive database (templated)
    · table_name (str) – target DynamoDB table
    · table_keys (list) – partition key and sort key
    · pre_process (function) – implement preprocessing of source data
    · pre_process_args (list) – list of pre_process function arguments
    · pre_process_kwargs (dict) – dict of pre_process function arguments
    · region_name (str) – aws region name (example useast1)
    · schema (str) – hive database schema
    · hiveserver2_conn_id (str) – source hive connection
    · aws_conn_id (str) – aws connection
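A minimal usage sketch (an assumption-laden illustration, not taken from the original docs); the Hive table, DynamoDB table and region are placeholders:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.hive_to_dynamodb import HiveToDynamoDBTransferOperator

dag = DAG('hive_to_dynamodb_example', start_date=datetime(2020, 1, 1), schedule_interval=None)

hive_to_ddb = HiveToDynamoDBTransferOperator(
    task_id='hive_to_dynamodb',
    sql="SELECT user_id, score FROM default.user_scores WHERE ds = '{{ ds }}'",  # placeholder Hive table
    table_name='user_scores',        # placeholder DynamoDB table
    table_keys=['user_id'],
    region_name='us-east-1',
    dag=dag)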
    AwsDynamoDBHook
class airflow.contrib.hooks.aws_dynamodb_hook.AwsDynamoDBHook(table_keys=None, table_name=None, region_name=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
    Interact with AWS DynamoDB
    Parameters
    · table_keys (list) – partition key and sort key
    · table_name (str) – target DynamoDB table
    · region_name (str) – aws region name (example useast1)
    write_batch_data(items)[source]
Write batch items to a DynamoDB table with provisioned throughput capacity
    AWS Lambda
    · AwsLambdaHook Interact with AWS Lambda
    AwsLambdaHook
class airflow.contrib.hooks.aws_lambda_hook.AwsLambdaHook(function_name, region_name=None, log_type='None', qualifier='$LATEST', invocation_type='RequestResponse', *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
    Interact with AWS Lambda
    Parameters
    · function_name (str) – AWS Lambda Function Name
· region_name (str) – AWS Region Name (example: us-west-2)
· log_type (str) – Tail Invocation Request
· qualifier (str) – AWS Lambda Function Version or Alias Name
· invocation_type (str) – AWS Lambda Invocation Type (RequestResponse, Event, etc.)
    invoke_lambda(payload)[source]
    Invoke Lambda Function
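A minimal sketch of calling this hook from a Python task (illustrative only; the function name, region and payload are hypothetical placeholders):
import json
from airflow.contrib.hooks.aws_lambda_hook import AwsLambdaHook

def trigger_lambda():
    hook = AwsLambdaHook(function_name='my-function',    # placeholder Lambda name
                         region_name='us-west-2',
                         invocation_type='RequestResponse')
    # payload is passed through to boto3 invoke(); a JSON string is a safe choice
    response = hook.invoke_lambda(payload=json.dumps({'action': 'ping'}))
    print(response['StatusCode'])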
    AWS Kinesis
    · AwsFirehoseHook Interact with AWS Kinesis Firehose
    AwsFirehoseHook
class airflow.contrib.hooks.aws_firehose_hook.AwsFirehoseHook(delivery_stream, region_name=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Kinesis Firehose
Parameters
· delivery_stream (str) – Name of the delivery stream
· region_name (str) – AWS region name (example: us-east-1)
    get_conn()[source]
    Returns AwsHook connection object
    put_records(records)[source]
    Write batch records to Kinesis Firehose
    Databricks
Databricks has contributed an Airflow operator which enables submitting runs to the Databricks platform. Internally the operator talks to the api/2.0/jobs/runs/submit endpoint.
    DatabricksSubmitRunOperator
class airflow.contrib.operators.databricks_operator.DatabricksSubmitRunOperator(json=None, spark_jar_task=None, notebook_task=None, new_cluster=None, existing_cluster_id=None, libraries=None, run_name=None, timeout_seconds=None, databricks_conn_id='databricks_default', polling_period_seconds=30, databricks_retry_limit=3, databricks_retry_delay=1, do_xcom_push=False, **kwargs)[source]
Bases: airflow.models.BaseOperator
Submits a Spark job run to Databricks using the api/2.0/jobs/runs/submit API endpoint.
    There are two ways to instantiate this operator
In the first way, you can take the JSON payload that you typically use to call the api/2.0/jobs/runs/submit endpoint and pass it directly to our DatabricksSubmitRunOperator through the json parameter. For example:
json = {
    'new_cluster': {
        'spark_version': '2.1.0-db3-scala2.11',
        'num_workers': 2
    },
    'notebook_task': {
        'notebook_path': '/Users/airflow@example.com/PrepareData'
    }
}
notebook_run = DatabricksSubmitRunOperator(task_id='notebook_run', json=json)
Another way to accomplish the same thing is to use the named parameters of the DatabricksSubmitRunOperator directly. Note that there is exactly one named parameter for each top level parameter in the runs/submit endpoint. In this method your code would look like this:
new_cluster = {
    'spark_version': '2.1.0-db3-scala2.11',
    'num_workers': 2
}
notebook_task = {
    'notebook_path': '/Users/airflow@example.com/PrepareData'
}
notebook_run = DatabricksSubmitRunOperator(
    task_id='notebook_run',
    new_cluster=new_cluster,
    notebook_task=notebook_task)
    In the case where both the json parameter AND the named parameters are provided they will be merged together If there are conflicts during the merge the named parameters will take precedence and override the top level json keys
    Currently the named parameters that DatabricksSubmitRunOperator supports are
    · spark_jar_task
    · notebook_task
    · new_cluster
    · existing_cluster_id
    · libraries
    · run_name
    · timeout_seconds
    Parameters
· json (dict) –
A JSON object containing API parameters which will be passed directly to the api/2.0/jobs/runs/submit endpoint. The other named parameters (i.e. spark_jar_task, notebook_task) to this operator will be merged with this json dictionary if they are provided. If there are conflicts during the merge, the named parameters will take precedence and override the top level json keys. (templated)
    See also
For more information about templating see Jinja Templating. https://docs.databricks.com/api/latest/jobs.html#runs-submit
    · spark_jar_task (dict) –
    The main class and parameters for the JAR task Note that the actual JAR is specified in the libraries EITHER spark_jar_task OR notebook_task should be specified This field will be templated
    See also
https://docs.databricks.com/api/latest/jobs.html#jobssparkjartask
    · notebook_task (dict) –
    The notebook path and parameters for the notebook task EITHER spark_jar_task OR notebook_task should be specified This field will be templated
    See also
https://docs.databricks.com/api/latest/jobs.html#jobsnotebooktask
    · new_cluster (dict) –
    Specs for a new cluster on which this task will be run EITHER new_cluster OR existing_cluster_id should be specified This field will be templated
    See also
https://docs.databricks.com/api/latest/jobs.html#jobsclusterspecnewcluster
    · existing_cluster_id (str) – ID for existing cluster on which to run this task EITHER new_cluster OR existing_cluster_id should be specified This field will be templated
    · libraries (list of dicts) –
    Libraries which this run will use This field will be templated
    See also
https://docs.databricks.com/api/latest/libraries.html#managedlibrarieslibrary
    · run_name (str) – The run name used for this task By default this will be set to the Airflow task_id This task_id is a required parameter of the superclass BaseOperator This field will be templated
    · timeout_seconds (int32) – The timeout for this run By default a value of 0 is used which means to have no timeout This field will be templated
    · databricks_conn_id (str) – The name of the Airflow connection to use By default and in the common case this will be databricks_default To use token based authentication provide the key token in the extra field for the connection
    · polling_period_seconds (int) – Controls the rate which we poll for the result of this run By default the operator will poll every 30 seconds
    · databricks_retry_limit (int) – Amount of times retry if the Databricks backend is unreachable Its value must be greater than or equal to 1
    · databricks_retry_delay (float) – Number of seconds to wait between retries (it might be a floating point number)
    · do_xcom_push (bool) – Whether we should push run_id and run_page_url to xcom
GCP: Google Cloud Platform
Airflow has extensive support for the Google Cloud Platform. Note, however, that most Hooks and Operators are in the contrib section, which means they have beta status and can have breaking changes between minor releases.
    See the GCP connection type documentation to configure connections to GCP
    Logging
    Airflow can be configured to read and write task logs in Google Cloud Storage See Writing Logs to Google Cloud Storage
    BigQuery
    BigQuery Operators
    · BigQueryCheckOperator Performs checks against a SQL query that will return a single row with different values
    · BigQueryValueCheckOperator Performs a simple value check using SQL code
    · BigQueryIntervalCheckOperator Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before
    · BigQueryGetDataOperator Fetches the data from a BigQuery table and returns data in a python list
    · BigQueryCreateEmptyTableOperator Creates a new empty table in the specified BigQuery dataset optionally with schema
    · BigQueryCreateExternalTableOperator Creates a new external table in the dataset with the data in Google Cloud Storage
    · BigQueryDeleteDatasetOperator Deletes an existing BigQuery dataset
    · BigQueryCreateEmptyDatasetOperator Creates an empty BigQuery dataset
    · BigQueryOperator Executes BigQuery SQL queries in a specific BigQuery database
    · BigQueryToBigQueryOperator Copy a BigQuery table to another BigQuery table
    · BigQueryToCloudStorageOperator Transfers a BigQuery table to a Google Cloud Storage bucket
    BigQueryCheckOperator
class airflow.contrib.operators.bigquery_check_operator.BigQueryCheckOperator(sql, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
Bases: airflow.operators.check_operator.CheckOperator
Performs checks against BigQuery. The BigQueryCheckOperator expects a sql query that will return a single row. Each value on that first row is evaluated using python bool casting. If any of the values return False, the check is failed and errors out.
    Note that Python bool casting evals the following as False
    · False
    · 0
    · Empty string ()
    · Empty list ([])
    · Empty dictionary or set ({})
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft a much more complex query that could, for instance, check that the table has the same number of rows as the source table upstream, that the count of today's partition is greater than yesterday's partition, or that a set of metrics are less than 3 standard deviations from the 7 day average.
This operator can be used as a data quality check in your pipeline. Depending on where you put it in your DAG, you can either stop the critical path, preventing dubious data from being published, or place it on the side and receive email alerts without stopping the progress of the DAG.
    Parameters
    · sql (str) – the sql to be executed
    · bigquery_conn_id (str) – reference to the BigQuery database
    · use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
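As an illustrative sketch (the project, dataset and table names are hypothetical placeholders), a data quality gate that fails when the partition for the execution date is empty:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator

dag = DAG('bq_check_example', start_date=datetime(2020, 1, 1), schedule_interval='@daily')

# Fails the task (and blocks downstream tasks) if the day's partition has no rows
check_rows = BigQueryCheckOperator(
    task_id='check_rows_today',
    sql="SELECT COUNT(*) FROM `my_project.my_dataset.events` WHERE ds = '{{ ds }}'",
    use_legacy_sql=False,
    bigquery_conn_id='bigquery_default',
    dag=dag)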
    BigQueryValueCheckOperator
class airflow.contrib.operators.bigquery_check_operator.BigQueryValueCheckOperator(sql, pass_value, tolerance=None, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
Bases: airflow.operators.check_operator.ValueCheckOperator
Performs a simple value check using SQL code
    Parameters
    · sql (str) – the sql to be executed
    · use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
    BigQueryIntervalCheckOperator
class airflow.contrib.operators.bigquery_check_operator.BigQueryIntervalCheckOperator(table, metrics_thresholds, date_filter_column='ds', days_back=-7, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
Bases: airflow.operators.check_operator.IntervalCheckOperator
Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before
This method constructs a query like so:
SELECT {metrics_threshold_dict_key} FROM {table}
WHERE {date_filter_column}=<date>
    Parameters
    · table (str) – the table name
    · days_back (int) – number of days between ds and the ds we want to check against Defaults to 7 days
· metrics_thresholds (dict) – a dictionary of ratios indexed by metrics; for example 'COUNT(*)': 1.5 would require a 50 percent or less difference between the current day and the prior days_back
    · use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
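A short illustrative sketch (not from the original text; the table name is a placeholder, and a dag object is assumed to exist as in the previous example):
from airflow.contrib.operators.bigquery_check_operator import BigQueryIntervalCheckOperator

# Fails if today's row count differs from the value 7 days back by more than 50%
interval_check = BigQueryIntervalCheckOperator(
    task_id='check_row_count_drift',
    table='my_dataset.events',                 # placeholder table
    metrics_thresholds={'COUNT(*)': 1.5},
    date_filter_column='ds',
    use_legacy_sql=False,
    dag=dag)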
    BigQueryGetDataOperator
class airflow.contrib.operators.bigquery_get_data.BigQueryGetDataOperator(dataset_id, table_id, max_results='100', selected_fields=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Fetches the data from a BigQuery table (alternatively fetch data for selected columns) and returns data in a python list. The number of elements in the returned list will be equal to the number of rows fetched. Each element in the list will again be a list, where each element represents the column values for that row.
Example Result: [['Tony', '10'], ['Mike', '20'], ['Steve', '15']]
    Note
If you pass fields to selected_fields which are in a different order than the order of columns already in the BQ table, the data will still be in the order of the BQ table. For example, if the BQ table has 3 columns as [A,B,C] and you pass 'B,A' in selected_fields, the data would still be of the form 'A,B'.
    Example
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results='100',
    selected_fields='DATE',
    bigquery_conn_id='airflow-service-account'
)
    Parameters
    · dataset_id (str) – The dataset ID of the requested table (templated)
    · table_id (str) – The table ID of the requested table (templated)
    · max_results (str) – The maximum number of records (rows) to be fetched from the table (templated)
    · selected_fields (str) – List of fields to return (commaseparated) If unspecified all fields are returned
    · bigquery_conn_id (str) – reference to a specific BigQuery hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    BigQueryCreateEmptyTableOperator
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyTableOperator(dataset_id, table_id, project_id=None, schema_fields=None, gcs_schema_object=None, time_partitioning=None, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a new, empty table in the specified BigQuery dataset, optionally with schema
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google cloud storage must be a JSON file with the schema fields in it. You can also create a table without schema.
    Parameters
    · project_id (str) – The project to create the table into (templated)
    · dataset_id (str) – The dataset to create the table into (templated)
    · table_id (str) – The Name of the table to be created (templated)
    · schema_fields (list) –
    If set the schema field list as defined here httpscloudgooglecombigquerydocsreferencerestv2jobs#configurationloadschema
    Example
schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
               {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
· gcs_schema_object (str) – Full path to the JSON file containing schema (templated). For example: gs://test-bucket/dir1/dir2/employee_schema.json
    · time_partitioning (dict) –
    configure optional time partitioning fields ie partition by field type and expiration as per API specifications
    See also
    httpscloudgooglecombigquerydocsreferencerestv2tables#timePartitioning
    · bigquery_conn_id (str) – Reference to a specific BigQuery hook
    · google_cloud_storage_conn_id (str) – Reference to a specific Google cloud storage hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · labels (dict) –
    a dictionary containing labels for the table passed to BigQuery
Example (with schema JSON in GCS):
CreateTable = BigQueryCreateEmptyTableOperator(
    task_id='BigQueryCreateEmptyTableOperator_task',
    dataset_id='ODS',
    table_id='Employees',
    project_id='internal-gcp-project',
    gcs_schema_object='gs://schema-bucket/employee_schema.json',
    bigquery_conn_id='airflow-service-account',
    google_cloud_storage_conn_id='airflow-service-account'
)
Corresponding Schema file (employee_schema.json):
[
    {
        "mode": "NULLABLE",
        "name": "emp_name",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "salary",
        "type": "INTEGER"
    }
]
Example (with schema in the DAG):
CreateTable = BigQueryCreateEmptyTableOperator(
    task_id='BigQueryCreateEmptyTableOperator_task',
    dataset_id='ODS',
    table_id='Employees',
    project_id='internal-gcp-project',
    schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
                   {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}],
    bigquery_conn_id='airflow-service-account',
    google_cloud_storage_conn_id='airflow-service-account'
)
    BigQueryCreateExternalTableOperator
class airflow.contrib.operators.bigquery_operator.BigQueryCreateExternalTableOperator(bucket, source_objects, destination_project_dataset_table, schema_fields=None, schema_object=None, source_format='CSV', compression='NONE', skip_leading_rows=0, field_delimiter=',', max_bad_records=0, quote_character=None, allow_quoted_newlines=False, allow_jagged_rows=False, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, src_fmt_configs={}, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a new external table in the dataset with the data in Google Cloud Storage
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google cloud storage must be a JSON file with the schema fields in it.
    Parameters
    · bucket (str) – The bucket to point the external table to (templated)
· source_objects (list) – List of Google cloud storage URIs to point the table to (templated). If source_format is 'DATASTORE_BACKUP', the list must only contain a single URI
· destination_project_dataset_table (str) – The dotted (<project>.)<dataset>.<table> BigQuery table to load data into (templated). If <project> is not included, project will be the project defined in the connection json
    · schema_fields (list) –
If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example:
schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
               {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
Should not be set when source_format is 'DATASTORE_BACKUP'
    · schema_object (str) – If set a GCS object path pointing to a json file that contains the schema for the table (templated)
    · source_format (str) – File format of the data
    · compression (str) – [Optional] The compression type of the data source Possible values include GZIP and NONE The default value is NONE This setting is ignored for Google Cloud Bigtable Google Cloud Datastore backups and Avro formats
    · skip_leading_rows (int) – Number of rows to skip when loading from a CSV
    · field_delimiter (str) – The delimiter to use for the CSV
    · max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job
    · quote_character (str) – The value that is used to quote data sections in a CSV file
    · allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false)
    · allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns The missing values are treated as nulls If false records with missing trailing columns are treated as bad records and if there are too many bad records an invalid error is returned in the job result Only applicable to CSV ignored for other formats
    · bigquery_conn_id (str) – Reference to a specific BigQuery hook
    · google_cloud_storage_conn_id (str) – Reference to a specific Google cloud storage hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · src_fmt_configs (dict) – configure optional fields specific to the source format
· labels (dict) – a dictionary containing labels for the table, passed to BigQuery
    BigQueryDeleteDatasetOperator
class airflow.contrib.operators.bigquery_operator.BigQueryDeleteDatasetOperator(dataset_id, project_id=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
This operator deletes an existing dataset from your Project in BigQuery: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/delete
Parameters
· project_id (str) – The project id of the dataset
· dataset_id (str) – The dataset to be deleted
Example
delete_temp_data = BigQueryDeleteDatasetOperator(dataset_id='temp-dataset',
                                                 project_id='temp-project',
                                                 bigquery_conn_id='_my_gcp_conn_',
                                                 task_id='Deletetemp',
                                                 dag=dag)
    BigQueryCreateEmptyDatasetOperator
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyDatasetOperator(dataset_id, project_id=None, dataset_reference=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
This operator is used to create a new dataset for your Project in BigQuery: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
Parameters
· project_id (str) – The name of the project where we want to create the dataset. Not needed if projectId is set in dataset_reference
· dataset_id (str) – The id of the dataset. Not needed if datasetId is set in dataset_reference
· dataset_reference – Dataset reference that could be provided with the request body. More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
    BigQueryOperator
class airflow.contrib.operators.bigquery_operator.BigQueryOperator(sql=None, destination_dataset_table=False, write_disposition='WRITE_EMPTY', allow_large_results=False, flatten_results=None, bigquery_conn_id='bigquery_default', delegate_to=None, udf_config=False, use_legacy_sql=True, maximum_billing_tier=None, maximum_bytes_billed=None, create_disposition='CREATE_IF_NEEDED', schema_update_options=(), query_params=None, labels=None, priority='INTERACTIVE', time_partitioning=None, api_resource_configs=None, cluster_fields=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Executes BigQuery SQL queries in a specific BigQuery database
Parameters
· sql (Can receive a str representing a sql statement, a list of str (sql statements), or a reference to a template file. Template references are recognized by str ending in '.sql') – the sql code to be executed (templated)
· destination_dataset_table (str) – A dotted (<project>.|<project>:)<dataset>.<table>
that, if set, will store the results of the query (templated)
    · write_disposition (str) – Specifies the action that occurs if the destination table already exists (default WRITE_EMPTY’)
    · create_disposition (str) – Specifies whether the job is allowed to create new tables (default CREATE_IF_NEEDED’)
    · allow_large_results (bool) – Whether to allow large results
    · flatten_results (bool) – If true and query uses legacy SQL dialect flattens all nested and repeated fields in the query results allow_large_results must be true if this is set to false For standard SQL queries this flag is ignored and results are never flattened
    · bigquery_conn_id (str) – reference to a specific BigQuery hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · udf_config (list) – The User Defined Function configuration for the query See httpscloudgooglecombigqueryuserdefinedfunctions for details
    · use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
    · maximum_billing_tier (int) – Positive integer that serves as a multiplier of the basic price Defaults to None in which case it uses the value set in the project
    · maximum_bytes_billed (float) – Limits the bytes billed for this job Queries that will have bytes billed beyond this limit will fail (without incurring a charge) If unspecified this will be set to your project default
· api_resource_configs (dict) – a dictionary that contains params 'configuration' applied for the Google BigQuery Jobs API: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs, for example {'query': {'useQueryCache': False}}. You could use it if you need to provide some params that are not supported by BigQueryOperator, like args
    · schema_update_options (tuple) – Allows the schema of the destination table to be updated as a side effect of the load job
    · query_params (dict) – a dictionary containing query parameter types and values passed to BigQuery
    · labels (dict) – a dictionary containing labels for the jobquery passed to BigQuery
    · priority (str) – Specifies a priority for the query Possible values include INTERACTIVE and BATCH The default value is INTERACTIVE
    · time_partitioning (dict) – configure optional time partitioning fields ie partition by field type and expiration as per API specifications
    · cluster_fields (list of str) – Request that the result of this query be stored sorted by one or more columns This is only available in conjunction with time_partitioning The order of columns given determines the sort order
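A brief illustrative sketch of this operator (not from the original docs); the project, dataset and table names are placeholders and a dag object is assumed to exist:
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

# Run a standard-SQL aggregation and write the result into a staging table
aggregate = BigQueryOperator(
    task_id='aggregate_events',
    sql=("SELECT event_type, COUNT(*) AS n "
         "FROM `my_project.my_dataset.events` WHERE ds = '{{ ds }}' "
         "GROUP BY event_type"),
    destination_dataset_table='my_project.staging.daily_events',
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    use_legacy_sql=False,
    dag=dag)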
    BigQueryTableDeleteOperator
class airflow.contrib.operators.bigquery_table_delete_operator.BigQueryTableDeleteOperator(deletion_dataset_table, bigquery_conn_id='bigquery_default', delegate_to=None, ignore_if_missing=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Deletes BigQuery tables
Parameters
· deletion_dataset_table (str) – A dotted (<project>.|<project>:)<dataset>.<table>
that indicates which table will be deleted (templated)
    · bigquery_conn_id (str) – reference to a specific BigQuery hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · ignore_if_missing (bool) – if True then return success even if the requested table does not exist
    BigQueryToBigQueryOperator
class airflow.contrib.operators.bigquery_to_bigquery.BigQueryToBigQueryOperator(source_project_dataset_tables, destination_project_dataset_table, write_disposition='WRITE_EMPTY', create_disposition='CREATE_IF_NEEDED', bigquery_conn_id='bigquery_default', delegate_to=None, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copies data from one BigQuery table to another
See also
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.copy
Parameters
· source_project_dataset_tables (list|string) – One or more dotted (<project>.|<project>:)<dataset>.<table>
BigQuery tables to use as the source data. If <project> is not included, project will be the project defined in the connection json. Use a list if there are multiple source tables (templated)
· destination_project_dataset_table (str) – The destination BigQuery table. Format is (<project>.|<project>:)<dataset>.<table>
(templated)
    · write_disposition (str) – The write disposition if the table already exists
    · create_disposition (str) – The create disposition if the table doesn’t exist
    · bigquery_conn_id (str) – reference to a specific BigQuery hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · labels (dict) – a dictionary containing labels for the jobquery passed to BigQuery
    BigQueryToCloudStorageOperator
class airflow.contrib.operators.bigquery_to_gcs.BigQueryToCloudStorageOperator(source_project_dataset_table, destination_cloud_storage_uris, compression='NONE', export_format='CSV', field_delimiter=',', print_header=True, bigquery_conn_id='bigquery_default', delegate_to=None, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Transfers a BigQuery table to a Google Cloud Storage bucket
See also
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs
Parameters
· source_project_dataset_table (str) – The dotted (<project>.|<project>:)<dataset>.<table>
BigQuery table to use as the source data. If <project> is not included, project will be the project defined in the connection json (templated)
· destination_cloud_storage_uris (list) – The destination Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt) (templated). Follows the convention defined here: https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple
    · compression (str) – Type of compression to use
    · export_format (str) – File format to export
    · field_delimiter (str) – The delimiter to use when extracting to a CSV
    · print_header (bool) – Whether to print a header for a CSV file extract
    · bigquery_conn_id (str) – reference to a specific BigQuery hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · labels (dict) – a dictionary containing labels for the jobquery passed to BigQuery
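An illustrative sketch of exporting a table to GCS (not from the original docs; the table, bucket and dag object are placeholders/assumptions):
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

# Export a BigQuery table to sharded, gzipped CSV files in GCS
export_to_gcs = BigQueryToCloudStorageOperator(
    task_id='export_events',
    source_project_dataset_table='my_project.my_dataset.events',            # placeholder
    destination_cloud_storage_uris=['gs://my-bucket/exports/events-*.csv.gz'],
    compression='GZIP',
    export_format='CSV',
    print_header=True,
    dag=dag)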
    BigQueryHook
class airflow.contrib.hooks.bigquery_hook.BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=True)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook, airflow.hooks.dbapi_hook.DbApiHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with BigQuery. This hook uses the Google Cloud Platform connection.
    get_conn()[source]
    Returns a BigQuery PEP 249 connection object
get_pandas_df(sql, parameters=None, dialect=None)[source]
Returns a Pandas DataFrame for the results produced by a BigQuery query. The DbApiHook method must be overridden because Pandas doesn't support PEP 249 connections, except for SQLite. See:
https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L447 https://github.com/pydata/pandas/issues/6900
    Parameters
    · sql (str) – The BigQuery SQL to execute
    · parameters (mapping or iterable) – The parameters to render the SQL query with (not used leave to override superclass method)
    · dialect (str in {'legacy' 'standard'}) – Dialect of BigQuery SQL – legacy SQL or standard SQL defaults to use selfuse_legacy_sql if not specified
    get_service()[source]
    Returns a BigQuery service object
insert_rows(table, rows, target_fields=None, commit_every=1000)[source]
    Insertion is currently unsupported Theoretically you could use BigQuery’s streaming API to insert rows into a table but this hasn’t been implemented
table_exists(project_id, dataset_id, table_id)[source]
    Checks for the existence of a table in Google BigQuery
    Parameters
    · project_id (str) – The Google cloud project in which to look for the table The connection supplied to the hook must provide access to the specified project
    · dataset_id (str) – The name of the dataset in which to look for the table
    · table_id (str) – The name of the table to check the existence of
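A minimal sketch combining table_exists() and get_pandas_df() inside a Python callable (illustrative only; project, dataset and table names are hypothetical):
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

def events_to_dataframe():
    hook = BigQueryHook(bigquery_conn_id='bigquery_default', use_legacy_sql=False)
    if hook.table_exists(project_id='my_project', dataset_id='my_dataset', table_id='events'):
        # get_pandas_df() runs the query and returns the result as a Pandas DataFrame
        df = hook.get_pandas_df('SELECT event_type, COUNT(*) AS n '
                                'FROM `my_project.my_dataset.events` GROUP BY event_type')
        print(df.head())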
    Compute Engine
    Compute Engine Operators
    · GceInstanceStartOperator start an existing Google Compute Engine instance
    · GceInstanceStopOperator stop an existing Google Compute Engine instance
    · GceSetMachineTypeOperator change the machine type for a stopped instance
    GceInstanceStartOperator
class airflow.contrib.operators.gcp_compute_operator.GceInstanceStartOperator(project_id, zone, resource_id, gcp_conn_id='google_cloud_default', api_version='v1', *args, **kwargs)[source]
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
    Start an instance in Google Compute Engine
    Parameters
    · project_id (str) – Google Cloud Platform project where the Compute Engine instance exists
    · zone (str) – Google Cloud Platform zone where the instance exists
    · resource_id (str) – Name of the Compute Engine instance resource
    · gcp_conn_id (str) – The connection ID used to connect to Google Cloud Platform
    · api_version (str) – API version used (eg v1)
    GceInstanceStopOperator
class airflow.contrib.operators.gcp_compute_operator.GceInstanceStopOperator(project_id, zone, resource_id, gcp_conn_id='google_cloud_default', api_version='v1', *args, **kwargs)[source]
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
    Stop an instance in Google Compute Engine
    Parameters
    · project_id (str) – Google Cloud Platform project where the Compute Engine instance exists
    · zone (str) – Google Cloud Platform zone where the instance exists
    · resource_id (str) – Name of the Compute Engine instance resource
    · gcp_conn_id (str) – The connection ID used to connect to Google Cloud Platform
    · api_version (str) – API version used (eg v1)
    GceSetMachineTypeOperator
class airflow.contrib.operators.gcp_compute_operator.GceSetMachineTypeOperator(project_id, zone, resource_id, body, gcp_conn_id='google_cloud_default', api_version='v1', validate_body=True, *args, **kwargs)[source]
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
    Changes the machine type for a stopped instance to the machine type specified in the request
    Parameters
    · project_id (str) – Google Cloud Platform project where the Compute Engine instance exists
    · zone (str) – Google Cloud Platform zone where the instance exists
    · resource_id (str) – Name of the Compute Engine instance resource
    · body (dict) – Body required by the Compute Engine setMachineType API as described in httpscloudgooglecomcomputedocsreferencerestv1instancessetMachineType#requestbody
    · gcp_conn_id (str) – The connection ID used to connect to Google Cloud Platform
    · api_version (str) – API version used (eg v1)
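An illustrative sketch of starting and stopping an instance with these operators (not from the original docs; the project, zone and instance names are placeholders):
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.gcp_compute_operator import (
    GceInstanceStartOperator, GceInstanceStopOperator)

dag = DAG('gce_example', start_date=datetime(2020, 1, 1), schedule_interval=None)

start_vm = GceInstanceStartOperator(
    task_id='start_vm',
    project_id='my-gcp-project',       # placeholder project/zone/instance
    zone='europe-west1-b',
    resource_id='my-instance',
    dag=dag)

stop_vm = GceInstanceStopOperator(
    task_id='stop_vm',
    project_id='my-gcp-project',
    zone='europe-west1-b',
    resource_id='my-instance',
    dag=dag)

start_vm >> stop_vm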
    Cloud Functions
    Cloud Functions Operators
    · GcfFunctionDeployOperator deploy Google Cloud Function to Google Cloud Platform
    · GcfFunctionDeleteOperator delete Google Cloud Function in Google Cloud Platform
    GcfFunctionDeployOperator
class airflow.contrib.operators.gcp_function_operator.GcfFunctionDeployOperator(project_id, location, body, gcp_conn_id='google_cloud_default', api_version='v1', zip_path=None, validate_body=True, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Creates a function in Google Cloud Functions
    Parameters
    · project_id (str) – Google Cloud Platform Project ID where the function should be created
    · location (str) – Google Cloud Platform region where the function should be created
    · body (dict or googlecloudfunctionsv1CloudFunction) – Body of the Cloud Functions definition The body must be a Cloud Functions dictionary as described in httpscloudgooglecomfunctionsdocsreferencerestv1projectslocationsfunctions Different API versions require different variants of the Cloud Functions dictionary
    · gcp_conn_id (str) – The connection ID to use to connect to Google Cloud Platform
    · api_version (str) – API version used (for example v1 or v1beta1)
    · zip_path (str) – Path to zip file containing source code of the function If the path is set the sourceUploadUrl should not be specified in the body or it should be empty Then the zip file will be uploaded using the upload URL generated via generateUploadUrl from the Cloud Functions API
    · validate_body (bool) – If set to False body validation is not performed
    GcfFunctionDeleteOperator
class airflow.contrib.operators.gcp_function_operator.GcfFunctionDeleteOperator(name, gcp_conn_id='google_cloud_default', api_version='v1', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Deletes the specified function from Google Cloud Functions
    Parameters
    · name (str) – A fullyqualified function name matching the pattern ^projects[^]+locations[^]+functions[^]+
    · gcp_conn_id (str) – The connection ID to use to connect to Google Cloud Platform
    · api_version (str) – API version used (for example v1 or v1beta1)
    Cloud Functions Hook
class airflow.contrib.hooks.gcp_function_hook.GcfHook(api_version, gcp_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
    Hook for the Google Cloud Functions APIs
    create_new_function(full_location body)[source]
    Creates a new function in Cloud Function in the location specified in the body
    Parameters
    · full_location (str) – full location including the project in the form of of projectslocation
    · body (dict) – body required by the Cloud Functions insert API
    Returns
    response returned by the operation
    Return type
    dict
    delete_function(name)[source]
    Deletes the specified Cloud Function
    Parameters
    name (str) – name of the function
    Returns
    response returned by the operation
    Return type
    dict
    get_conn()[source]
    Retrieves the connection to Cloud Functions
    Returns
    Google Cloud Function services object
    Return type
    dict
    get_function(name)[source]
    Returns the Cloud Function with the given name
    Parameters
    name (str) – name of the function
    Returns
    a CloudFunction object representing the function
    Return type
    dict
    list_functions(full_location)[source]
    Lists all Cloud Functions created in the location
    Parameters
    full_location (str) – full location including the project in the form of of projectslocation
    Returns
    array of CloudFunction objects representing functions in the location
    Return type
    [dict]
    update_function(name body update_mask)[source]
    Updates Cloud Functions according to the specified update mask
    Parameters
    · name (str) – name of the function
    · body (str) – body required by the cloud function patch API
    · update_mask ([str]) – update mask array of fields that should be patched
    Returns
    response returned by the operation
    Return type
    dict
    upload_function_zip(parent zip_path)[source]
    Uploads zip file with sources
    Parameters
    · parent (str) – Google Cloud Platform project id and region where zip file should be uploaded in the form of projectslocation
    · zip_path (str) – path of the valid zip file to upload
    Returns
    Upload URL that was returned by generateUploadUrl method
    Cloud DataFlow
    DataFlow Operators
    · DataFlowJavaOperator launching Cloud Dataflow jobs written in Java
    · DataflowTemplateOperator launching a templated Cloud DataFlow batch job
    · DataFlowPythonOperator launching Cloud Dataflow jobs written in python
    DataFlowJavaOperator
class airflow.contrib.operators.dataflow_operator.DataFlowJavaOperator(jar, job_name='{{task.task_id}}', dataflow_default_options=None, options=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, job_class=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Java Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
    See also
    For more detail on job submission have a look at the reference httpscloudgooglecomdataflowpipelinesspecifyingexecparams
    Parameters
    · jar (str) – The reference to a self executing DataFlow jar (templated)
    · job_name (str) – The jobName’ to use when executing the DataFlow job (templated) This ends up being set in the pipeline options so any entry with key 'jobName' in options will be overwritten
    · dataflow_default_options (dict) – Map of default job options
    · options (dict) – Map of job specific options
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state
· job_class (str) – The name of the dataflow job class to be executed; it is often not the main class configured in the dataflow jar file
    jar options and job_name are templated so you can use variables in them
    Note that both dataflow_default_options and options will be merged to specify pipeline execution parameter and dataflow_default_options is expected to save highlevel options for instances project and zone information which apply to all dataflow operators in the DAG
    It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project zone and staging location
default_args = {
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'europe-west1-d',
        'stagingLocation': 'gs://my-staging-bucket/staging/'
    }
}
You need to pass the path to your dataflow as a file reference with the jar parameter; the jar needs to be a self-executing jar (see documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use options to pass on options to your job.
t1 = DataFlowJavaOperator(
    task_id='dataflow_example',
    jar='{{var.value.gcp_dataflow_base}}pipeline/build/libs/pipeline-example-1.0.jar',
    options={
        'autoscalingAlgorithm': 'BASIC',
        'maxNumWorkers': '50',
        'start': '{{ds}}',
        'partitionType': 'DAY',
        'labels': {'foo': 'bar'}
    },
    gcp_conn_id='gcp-airflow-service-account',
    dag=my_dag)
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 8, 1),
    'email': ['alex@vanboxel.be'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=30),
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'us-central1-f',
        'stagingLocation': 'gs://bucket/tmp/dataflow/staging/'
    }
}

dag = DAG('test-dag', default_args=default_args)

task = DataFlowJavaOperator(
    gcp_conn_id='gcp_default',
    task_id='normalize-cal',
    jar='{{var.value.gcp_dataflow_base}}pipeline-ingress-cal-normalize-1.0.jar',
    options={
        'autoscalingAlgorithm': 'BASIC',
        'maxNumWorkers': '50',
        'start': '{{ds}}',
        'partitionType': 'DAY'
    },
    dag=dag)
    DataflowTemplateOperator
class airflow.contrib.operators.dataflow_operator.DataflowTemplateOperator(template, job_name='{{task.task_id}}', dataflow_default_options=None, parameters=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Templated Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
    Parameters
    · template (str) – The reference to the DataFlow template
    · job_name – The jobName’ to use when executing the DataFlow template (templated)
    · dataflow_default_options (dict) – Map of default job environment options
    · parameters (dict) – Map of job specific parameters for the template
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state
    It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project zone and staging location
    See also
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
default_args = {
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'europe-west1-d',
        'tempLocation': 'gs://my-staging-bucket/staging/'
    }
}
    You need to pass the path to your dataflow template as a file reference with the template parameter Use parameters to pass on parameters to your job Use environment to pass on runtime environment variables to your job
t1 = DataflowTemplateOperator(
    task_id='dataflow_example',
    template='{{var.value.gcp_dataflow_base}}',
    parameters={
        'inputFile': 'gs://bucket/input/my_input.txt',
        'outputFile': 'gs://bucket/output/my_output.txt'
    },
    gcp_conn_id='gcp-airflow-service-account',
    dag=my_dag)
    template dataflow_default_options parameters and job_name are templated so you can use variables in them
    Note that dataflow_default_options is expected to save highlevel options for project information which apply to all dataflow operators in the DAG
    See also
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment For more detail on job template execution have a look at the reference: https://cloud.google.com/dataflow/docs/templates/executing-templates
    DataFlowPythonOperator
class airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator(py_file, job_name='{{task.task_id}}', py_options=None, dataflow_default_options=None, options=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Launching Cloud Dataflow jobs written in python. Note that both dataflow_default_options and options will be merged to specify pipeline execution parameters, and dataflow_default_options is expected to save high-level options, for instance project and zone information, which apply to all dataflow operators in the DAG.
    See also
    For more detail on job submission have a look at the reference httpscloudgooglecomdataflowpipelinesspecifyingexecparams
    Parameters
· py_file (str) – Reference to the python dataflow pipeline file.py, e.g. /some/local/file/path/to/your/python/pipeline/file
    · job_name (str) – The job_name’ to use when executing the DataFlow job (templated) This ends up being set in the pipeline options so any entry with key 'jobName' or 'job_name' in options will be overwritten
    · py_options – Additional python options
    · dataflow_default_options (dict) – Map of default job options
    · options (dict) – Map of job specific options
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state
    execute(context)[source]
    Execute the python dataflow job
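A short illustrative sketch of this operator (not from the original docs; the pipeline file, buckets, project and dag object are placeholders/assumptions):
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

run_beam_job = DataFlowPythonOperator(
    task_id='run_beam_pipeline',
    py_file='/home/airflow/dags/pipelines/wordcount.py',    # placeholder local pipeline file
    job_name='wordcount-{{ ds_nodash }}',
    options={'input': 'gs://my-bucket/input/*.txt',         # placeholder buckets
             'output': 'gs://my-bucket/output/results'},
    dataflow_default_options={'project': 'my-gcp-project',
                              'tempLocation': 'gs://my-bucket/tmp/'},
    dag=dag)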
    DataFlowHook
class airflow.contrib.hooks.gcp_dataflow_hook.DataFlowHook(gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
    get_conn()[source]
    Returns a Google Cloud Dataflow service object
    Cloud DataProc
    DataProc Operators
    · DataprocClusterCreateOperator Create a new cluster on Google Cloud Dataproc
    · DataprocClusterDeleteOperator Delete a cluster on Google Cloud Dataproc
    · DataprocClusterScaleOperator Scale up or down a cluster on Google Cloud Dataproc
    · DataProcPigOperator Start a Pig query Job on a Cloud DataProc cluster
    · DataProcHiveOperator Start a Hive query Job on a Cloud DataProc cluster
    · DataProcSparkSqlOperator Start a Spark SQL query Job on a Cloud DataProc cluster
    · DataProcSparkOperator Start a Spark Job on a Cloud DataProc cluster
    · DataProcHadoopOperator Start a Hadoop Job on a Cloud DataProc cluster
    · DataProcPySparkOperator Start a PySpark Job on a Cloud DataProc cluster
    · DataprocWorkflowTemplateInstantiateOperator Instantiate a WorkflowTemplate on Google Cloud Dataproc
    · DataprocWorkflowTemplateInstantiateInlineOperator Instantiate a WorkflowTemplate Inline on Google Cloud Dataproc
    DataprocClusterCreateOperator
class airflow.contrib.operators.dataproc_operator.DataprocClusterCreateOperator(cluster_name, project_id, num_workers, zone, network_uri=None, subnetwork_uri=None, internal_ip_only=None, tags=None, storage_bucket=None, init_actions_uris=None, init_action_timeout='10m', metadata=None, custom_image=None, image_version=None, properties=None, master_machine_type='n1-standard-4', master_disk_type='pd-standard', master_disk_size=500, worker_machine_type='n1-standard-4', worker_disk_type='pd-standard', worker_disk_size=500, num_preemptible_workers=0, labels=None, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, service_account=None, service_account_scopes=None, idle_delete_ttl=None, auto_delete_time=None, auto_delete_ttl=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Create a new cluster on Google Cloud Dataproc. The operator will wait until the creation is successful or an error occurs in the creation process.
    The parameters allow to configure the cluster Please refer to
https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters
    for a detailed explanation on the different parameters Most of the configuration parameters detailed in the link are available as a parameter to this operator
    Parameters
    · cluster_name (str) – The name of the DataProc cluster to create (templated)
    · project_id (str) – The ID of the google cloud project in which to create the cluster (templated)
    · num_workers (int) – The # of workers to spin up If set to zero will spin up cluster in a single node mode
    · storage_bucket (str) – The storage bucket to use setting to None lets dataproc generate a custom one for you
    · init_actions_uris (list[string]) – List of GCS uri’s containing dataproc initialization scripts
    · init_action_timeout (str) – Amount of time executable scripts in init_actions_uris has to complete
    · metadata (dict) – dict of keyvalue google compute engine metadata entries to add to all instances
    · image_version (str) – the version of software inside the Dataproc cluster
    · custom_image – custom Dataproc image for more info see httpscloudgooglecomdataprocdocsguidesdataprocimages
    · properties (dict) – dict of properties to set on config files (eg sparkdefaultsconf) see httpscloudgooglecomdataprocdocsreferencerestv1projectsregionsclusters#SoftwareConfig
    · master_machine_type (str) – Compute engine machine type to use for the master node
    · master_disk_type (str) – Type of the boot disk for the master node (default is pdstandard) Valid values pdssd (Persistent Disk Solid State Drive) or pdstandard (Persistent Disk Hard Disk Drive)
    · master_disk_size (int) – Disk size for the master node
    · worker_machine_type (str) – Compute engine machine type to use for the worker nodes
    · worker_disk_type (str) – Type of the boot disk for the worker node (default is pdstandard) Valid values pdssd (Persistent Disk Solid State Drive) or pdstandard (Persistent Disk Hard Disk Drive)
    · worker_disk_size (int) – Disk size for the worker nodes
    · num_preemptible_workers (int) – The # of preemptible worker nodes to spin up
    · labels (dict) – dict of labels to add to the cluster
    · zone (str) – The zone where the cluster will be located (templated)
    · network_uri (str) – The network uri to be used for machine communication cannot be specified with subnetwork_uri
    · subnetwork_uri (str) – The subnetwork uri to be used for machine communication cannot be specified with network_uri
    · internal_ip_only (bool) – If true all instances in the cluster will only have internal IP addresses This can only be enabled for subnetwork enabled networks
    · tags (list[string]) – The GCE tags to add to all instances
    · region (str) – leave as global’ might become relevant in the future (templated)
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · service_account (str) – The service account of the dataproc instances
    · service_account_scopes (list[string]) – The URIs of service account scopes to be included
    · idle_delete_ttl (int) – The longest duration that cluster would keep alive while staying idle Passing this threshold will cause cluster to be autodeleted A duration in seconds
    · auto_delete_time (datetimedatetime) – The time when cluster will be autodeleted
    · auto_delete_ttl (int) – The life duration of cluster the cluster will be autodeleted at the end of this duration A duration in seconds (If auto_delete_time is set this parameter will be ignored)
    Type
    custom_image str
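An illustrative sketch of creating an ephemeral cluster with this operator (not from the original docs; the project, zone, bucket and dag object are placeholders/assumptions):
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

create_cluster = DataprocClusterCreateOperator(
    task_id='create_dataproc_cluster',
    cluster_name='ephemeral-cluster-{{ ds_nodash }}',   # cluster_name is templated
    project_id='my-gcp-project',                        # placeholder project/zone
    zone='us-central1-a',
    num_workers=2,
    num_preemptible_workers=2,
    master_machine_type='n1-standard-4',
    worker_machine_type='n1-standard-4',
    storage_bucket='my-dataproc-staging-bucket',
    idle_delete_ttl=3600,                               # auto-delete after one idle hour
    dag=dag)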
    DataprocClusterScaleOperator
class airflow.contrib.operators.dataproc_operator.DataprocClusterScaleOperator(cluster_name, project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, num_workers=2, num_preemptible_workers=0, graceful_decommission_timeout=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Scale, up or down, a cluster on Google Cloud Dataproc. The operator will wait until the cluster is re-scaled.
    Example
t1 = DataprocClusterScaleOperator(
    task_id='dataproc_scale',
    project_id='my-project',
    cluster_name='cluster-1',
    num_workers=10,
    num_preemptible_workers=10,
    graceful_decommission_timeout='1h',
    dag=dag)
    See also
For more detail on scaling clusters have a look at the reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters
    Parameters
    · cluster_name (str) – The name of the cluster to scale (templated)
    · project_id (str) – The ID of the google cloud project in which the cluster runs (templated)
    · region (str) – The region for the dataproc cluster (templated)
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · num_workers (int) – The new number of workers
    · num_preemptible_workers (int) – The new number of preemptible workers
· graceful_decommission_timeout (str) – Timeout for graceful YARN decommissioning. Maximum value is 1d
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    DataprocClusterDeleteOperator
class airflow.contrib.operators.dataproc_operator.DataprocClusterDeleteOperator(cluster_name, project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Delete a cluster on Google Cloud Dataproc. The operator will wait until the cluster is destroyed.
    Parameters
    · cluster_name (str) – The name of the cluster to create (templated)
    · project_id (str) – The ID of the google cloud project in which the cluster runs (templated)
    · region (str) – leave as global’ might become relevant in the future (templated)
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    DataProcPigOperator
class airflow.contrib.operators.dataproc_operator.DataProcPigOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_pig_properties=None, dataproc_pig_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Pig query Job on a Cloud DataProc cluster. The parameters of the operation will be passed to the cluster.
    It’s a good practice to define dataproc_* parameters in the default_args of the dag like the cluster name and UDFs
default_args = {
    'cluster_name': 'cluster-1',
    'dataproc_pig_jars': [
        'gs://example/udf/jar/datafu/1.2.0/datafu.jar',
        'gs://example/udf/jar/gpig/1.2/gpig.jar'
    ]
}
You can pass a pig script as string or file reference. Use variables to pass on variables for the pig script to be resolved on the cluster, or use the parameters to be resolved in the script as template parameters.
    Example
t1 = DataProcPigOperator(
    task_id='dataproc_pig',
    query='a_pig_script.pig',
    variables={'out': 'gs://example/output/{{ds}}'},
    dag=dag)
    See also
For more detail about job submission, have a look at the reference: https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs
    Parameters
· query (str) – The query or reference to the query file (.pg or .pig extension) (templated)
· query_uri (str) – The uri of a pig script on Cloud Storage
· variables (dict) – Map of named parameters for the query (templated)
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
    · cluster_name (str) – The name of the DataProc cluster (templated)
    · dataproc_pig_properties (dict) – Map for the Pig properties Ideal to put in default arguments
    · dataproc_pig_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
· job_error_states (list) – Job states that should be considered error states. Any states in this list will result in an error being raised and failure of the task. E.g., if the CANCELLED state should also be considered a task failure, pass in ['ERROR', 'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but could change in the future. Defaults to ['ERROR'].
Variables
dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API. This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as the actual jobId submitted to the Dataproc API is appended with an 8 character random string.
    DataProcHiveOperator
class airflow.contrib.operators.dataproc_operator.DataProcHiveOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_hive_properties=None, dataproc_hive_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Hive query Job on a Cloud DataProc cluster.
    Parameters
· query (str) – The query or reference to the query file (.q extension)
    · query_uri (str) – The uri of a hive script on Cloud Storage
    · variables (dict) – Map of named parameters for the query
    · job_name (str) – The job name used in the DataProc cluster This name by default is the task_id appended with the execution data but can be templated The name will always be appended with a random number to avoid name clashes
    · cluster_name (str) – The name of the DataProc cluster
· dataproc_hive_properties (dict) – Map for the Hive properties. Ideal to put in default arguments.
    · dataproc_hive_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
    · job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
    Variables
    dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
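A hedged sketch analogous to the Pig example above (the GCS script path and variables are assumptions, not values from this document): running a HiveQL script stored on Cloud Storage, with the run date passed in as a Hive variable.
t2 = DataProcHiveOperator(
    task_id='dataproc_hive',
    query_uri='gs://example/hql/etl_daily.hql',  # hypothetical HQL script on Cloud Storage
    variables={'run_date': '{{ ds }}'},          # resolved inside the script on the cluster
    cluster_name='cluster-1',
    dag=dag)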
    DataProcSparkSqlOperator
class airflow.contrib.operators.dataproc_operator.DataProcSparkSqlOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_spark_properties=None, dataproc_spark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Spark SQL query Job on a Cloud DataProc cluster.
    Parameters
· query (str) – The query or reference to the query file (.q extension) (templated)
    · query_uri (str) – The uri of a spark sql script on Cloud Storage
    · variables (dict) – Map of named parameters for the query (templated)
    · job_name (str) – The job name used in the DataProc cluster This name by default is the task_id appended with the execution data but can be templated The name will always be appended with a random number to avoid name clashes (templated)
    · cluster_name (str) – The name of the DataProc cluster (templated)
· dataproc_spark_properties (dict) – Map for the Spark properties. Ideal to put in default arguments.
    · dataproc_spark_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
    · job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
    Variables
    dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
    DataProcSparkOperator
class airflow.contrib.operators.dataproc_operator.DataProcSparkOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_spark_properties=None, dataproc_spark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Spark Job on a Cloud DataProc cluster.
    Parameters
    · main_jar (str) – URI of the job jar provisioned on Cloud Storage (use this or the main_class not both together)
    · main_class (str) – Name of the job class (use this or the main_jar not both together)
    · arguments (list) – Arguments for the job (templated)
    · archives (list) – List of archived files that will be unpacked in the work directory Should be stored in Cloud Storage
    · files (list) – List of files to be copied to the working directory
    · job_name (str) – The job name used in the DataProc cluster This name by default is the task_id appended with the execution data but can be templated The name will always be appended with a random number to avoid name clashes (templated)
    · cluster_name (str) – The name of the DataProc cluster (templated)
· dataproc_spark_properties (dict) – Map for the Spark properties. Ideal to put in default arguments.
    · dataproc_spark_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
    · job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
    Variables
    dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
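A hedged sketch of submitting a Spark job by main class (the jar path, class name and arguments are illustrative assumptions):
t3 = DataProcSparkOperator(
    task_id='dataproc_spark',
    main_class='com.example.analytics.DailyJob',              # hypothetical job class
    dataproc_spark_jars=['gs://example/jars/daily-job.jar'],  # jar that contains the class
    arguments=['--date', '{{ ds }}'],
    cluster_name='cluster-1',
    dag=dag)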
    DataProcHadoopOperator
class airflow.contrib.operators.dataproc_operator.DataProcHadoopOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_hadoop_properties=None, dataproc_hadoop_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Hadoop Job on a Cloud DataProc cluster.
    Parameters
    · main_jar (str) – URI of the job jar provisioned on Cloud Storage (use this or the main_class not both together)
    · main_class (str) – Name of the job class (use this or the main_jar not both together)
    · arguments (list) – Arguments for the job (templated)
    · archives (list) – List of archived files that will be unpacked in the work directory Should be stored in Cloud Storage
    · files (list) – List of files to be copied to the working directory
    · job_name (str) – The job name used in the DataProc cluster This name by default is the task_id appended with the execution data but can be templated The name will always be appended with a random number to avoid name clashes (templated)
    · cluster_name (str) – The name of the DataProc cluster (templated)
· dataproc_hadoop_properties (dict) – Map for the Hadoop properties. Ideal to put in default arguments.
    · dataproc_hadoop_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
    · job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
    Variables
    dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
    DataProcPySparkOperator
class airflow.contrib.operators.dataproc_operator.DataProcPySparkOperator(main, arguments=None, archives=None, pyfiles=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_pyspark_properties=None, dataproc_pyspark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a PySpark Job on a Cloud DataProc cluster.
    Parameters
· main (str) – [Required] The Hadoop Compatible Filesystem (HCFS) URI of the main Python file to use as the driver. Must be a .py file.
    · arguments (list) – Arguments for the job (templated)
    · archives (list) – List of archived files that will be unpacked in the work directory Should be stored in Cloud Storage
    · files (list) – List of files to be copied to the working directory
· pyfiles (list) – List of Python files to pass to the PySpark framework. Supported file types: .py, .egg and .zip
    · job_name (str) – The job name used in the DataProc cluster This name by default is the task_id appended with the execution data but can be templated The name will always be appended with a random number to avoid name clashes (templated)
    · cluster_name (str) – The name of the DataProc cluster
· dataproc_pyspark_properties (dict) – Map for the PySpark properties. Ideal to put in default arguments.
    · dataproc_pyspark_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
    · job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
    Variables
    dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
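A minimal sketch of a PySpark submission (the driver script, zip of helper modules and arguments are assumptions for illustration):
t4 = DataProcPySparkOperator(
    task_id='dataproc_pyspark',
    main='gs://example/pyspark/daily_report.py',        # hypothetical driver .py file
    pyfiles=['gs://example/pyspark/libs/common.zip'],   # hypothetical helper modules
    arguments=['--run-date', '{{ ds }}'],
    cluster_name='cluster-1',
    dag=dag)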
    DataprocWorkflowTemplateInstantiateOperator
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateOperator(template_id, *args, **kwargs)[source]
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate is finished executing.
    See also
Please refer to: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiate
    Parameters
    · template_id (str) – The id of the template (templated)
    · project_id (str) – The ID of the google cloud project in which the template runs
    · region (str) – leave as global’ might become relevant in the future
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
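A hedged example of instantiating an existing workflow template (the template id and project are placeholders, not values from this document):
run_template = DataprocWorkflowTemplateInstantiateOperator(
    task_id='run_workflow_template',
    template_id='nightly-etl-template',  # hypothetical template id
    project_id='my-project',             # hypothetical GCP project
    region='global',
    dag=dag)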
    DataprocWorkflowTemplateInstantiateInlineOperator
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateInlineOperator(template, *args, **kwargs)[source]
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate Inline on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate is finished executing.
    See also
Please refer to: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiateInline
    Parameters
    · template (map) – The template contents (templated)
    · project_id (str) – The ID of the google cloud project in which the template runs
    · region (str) – leave as global’ might become relevant in the future
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    Cloud Datastore
    Datastore Operators
    · DatastoreExportOperator Export entities from Google Cloud Datastore to Cloud Storage
    · DatastoreImportOperator Import entities from Cloud Storage to Google Cloud Datastore
    DatastoreExportOperator
class airflow.contrib.operators.datastore_export_operator.DatastoreExportOperator(bucket, namespace=None, datastore_conn_id='google_cloud_default', cloud_storage_conn_id='google_cloud_default', delegate_to=None, entity_filter=None, labels=None, polling_interval_in_seconds=10, overwrite_existing=False, xcom_push=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Export entities from Google Cloud Datastore to Cloud Storage.
    Parameters
    · bucket (str) – name of the cloud storage bucket to backup data
· namespace (str) – optional namespace path in the specified Cloud Storage bucket to backup data. If this namespace does not exist in GCS, it will be created.
· datastore_conn_id (str) – the name of the Datastore connection id to use
· cloud_storage_conn_id (str) – the name of the cloud storage connection id to force-write backup
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· entity_filter (dict) – description of what data from the project is included in the export, refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/EntityFilter
    · labels (dict) – clientassigned labels for cloud storage
    · polling_interval_in_seconds (int) – number of seconds to wait before polling for execution status again
    · overwrite_existing (bool) – if the storage bucket + namespace is not empty it will be emptied prior to exports This enables overwriting existing backups
    · xcom_push (bool) – push operation name to xcom for reference
    DatastoreImportOperator
class airflow.contrib.operators.datastore_import_operator.DatastoreImportOperator(bucket, file, namespace=None, entity_filter=None, labels=None, datastore_conn_id='google_cloud_default', delegate_to=None, polling_interval_in_seconds=10, xcom_push=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Import entities from Cloud Storage to Google Cloud Datastore.
    Parameters
    · bucket (str) – container in Cloud Storage to store data
· file (str) – path of the backup metadata file in the specified Cloud Storage bucket. It should have the extension .overall_export_metadata
    · namespace (str) – optional namespace of the backup metadata file in the specified Cloud Storage bucket
· entity_filter (dict) – description of what data from the project is included in the export, refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/EntityFilter
    · labels (dict) – clientassigned labels for cloud storage
    · datastore_conn_id (str) – the name of the connection id to use
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · polling_interval_in_seconds (int) – number of seconds to wait before polling for execution status again
    · xcom_push (bool) – push operation name to xcom for reference
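A minimal sketch of a nightly Datastore backup (the bucket and namespace are made-up placeholders):
export_entities = DatastoreExportOperator(
    task_id='datastore_export',
    bucket='my-datastore-backups',        # hypothetical Cloud Storage bucket
    namespace='nightly/{{ ds_nodash }}',  # one backup folder per execution date
    overwrite_existing=True,
    dag=dag)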
    DatastoreHook
class airflow.contrib.hooks.datastore_hook.DatastoreHook(datastore_conn_id='google_cloud_datastore_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Datastore. This hook uses the Google Cloud Platform connection.
This object is not thread safe. If you want to make multiple requests simultaneously, you will need to create a hook per thread.
    allocate_ids(partialKeys)[source]
Allocate IDs for incomplete keys. See https://cloud.google.com/datastore/docs/reference/rest/v1/projects/allocateIds
    Parameters
    partialKeys – a list of partial keys
    Returns
    a list of full keys
    begin_transaction()[source]
    Get a new transaction handle
    See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/beginTransaction
    Returns
    a transaction handle
    commit(body)[source]
    Commit a transaction optionally creating deleting or modifying some entities
    See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/commit
    Parameters
    body – the body of the commit request
    Returns
    the response body of the commit request
delete_operation(name)[source]
Deletes the long-running operation.
    Parameters
    name – the name of the operation resource
export_to_storage_bucket(bucket, namespace=None, entity_filter=None, labels=None)[source]
Export entities from Cloud Datastore to Cloud Storage for backup.
get_conn(version='v1')[source]
Returns a Google Cloud Datastore service object.
    get_operation(name)[source]
    Gets the latest state of a longrunning operation
    Parameters
    name – the name of the operation resource
import_from_storage_bucket(bucket, file, namespace=None, entity_filter=None, labels=None)[source]
    Import a backup from Cloud Storage to Cloud Datastore
lookup(keys, read_consistency=None, transaction=None)[source]
    Lookup some entities by key
    See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/lookup
    Parameters
    · keys – the keys to lookup
    · read_consistency – the read consistency to use default strong or eventual Cannot be used with a transaction
    · transaction – the transaction to use if any
    Returns
    the response body of the lookup request
poll_operation_until_done(name, polling_interval_in_seconds)[source]
Poll backup operation state until it's completed.
    rollback(transaction)[source]
    Roll back a transaction
    See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/rollback
    Parameters
    transaction – the transaction to roll back
    run_query(body)[source]
    Run a query for entities
    See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/runQuery
    Parameters
    body – the body of the query request
    Returns
    the batch of query results
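A hedged usage sketch of the hook from a PythonOperator callable; the entity kind, query body and the shape of the returned batch are assumptions for illustration only, not part of this documentation.
from airflow.contrib.hooks.datastore_hook import DatastoreHook

def count_orders(**_):
    hook = DatastoreHook(datastore_conn_id='google_cloud_datastore_default')
    body = {'query': {'kind': [{'name': 'Order'}], 'limit': 100}}  # hypothetical query body
    batch = hook.run_query(body)                # documented above as returning the batch of query results
    return len(batch.get('entityResults', []))  # assumes the batch dict exposes 'entityResults'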
    Cloud ML Engine
    Cloud ML Engine Operators
    · MLEngineBatchPredictionOperator Start a Cloud ML Engine batch prediction job
    · MLEngineModelOperator Manages a Cloud ML Engine model
    · MLEngineTrainingOperator Start a Cloud ML Engine training job
    · MLEngineVersionOperator Manages a Cloud ML Engine model version
    MLEngineBatchPredictionOperator
class airflow.contrib.operators.mlengine_operator.MLEngineBatchPredictionOperator(project_id, job_id, region, data_format, input_paths, output_path, model_name=None, version_name=None, uri=None, max_worker_count=None, runtime_version=None, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Google Cloud ML Engine prediction job.
NOTE: For model origin, users should consider exactly one from the three options below:
1. Populate 'uri' field only, which should be a GCS location that points to a tensorflow savedModel directory.
2. Populate 'model_name' field only, which refers to an existing model, and the default version of the model will be used.
3. Populate both 'model_name' and 'version_name' fields, which refers to a specific version of a specific model.
In options 2 and 3, both model and version name should contain the minimal identifier. For instance, call
MLEngineBatchPredictionOperator(
    ...,
    model_name='my_model',
    version_name='my_version',
    ...)
if the desired model version is projects/my_project/models/my_model/versions/my_version.
See https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs for further documentation on the parameters.
    Parameters
    · project_id (str) – The Google Cloud project name where the prediction job is submitted (templated)
    · job_id (str) – A unique id for the prediction job on Google Cloud ML Engine (templated)
· data_format (str) – The format of the input data. It will default to 'DATA_FORMAT_UNSPECIFIED' if it is not provided or is not one of [TEXT, TF_RECORD, TF_RECORD_GZIP].
    · input_paths (list of string) – A list of GCS paths of input data for batch prediction Accepting wildcard operator * but only at the end (templated)
    · output_path (str) – The GCS path where the prediction results are written to (templated)
    · region (str) – The Google Compute Engine region to run the prediction job in (templated)
    · model_name (str) – The Google Cloud ML Engine model to use for prediction If version_name is not provided the default version of this model will be used Should not be None if version_name is provided Should be None if uri is provided (templated)
    · version_name (str) – The Google Cloud ML Engine model version to use for prediction Should be None if uri is provided (templated)
    · uri (str) – The GCS path of the saved model to use for prediction Should be None if model_name is provided It should be a GCS path pointing to a tensorflow SavedModel (templated)
    · max_worker_count (int) – The maximum number of workers to be used for parallel processing Defaults to 10 if not specified
    · runtime_version (str) – The Google Cloud ML Engine runtime version to use for batch prediction
    · gcp_conn_id (str) – The connection ID used for connection to Google Cloud Platform
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
Raises
ValueError: if a unique model/version origin cannot be determined.
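A hedged sketch of a batch prediction task using option 3 above (the project, bucket, model and region values are placeholders, not from this document):
predict = MLEngineBatchPredictionOperator(
    task_id='batch_prediction',
    project_id='my-project',
    job_id='prediction_{{ ds_nodash }}',
    region='us-central1',
    data_format='TEXT',
    input_paths=['gs://example/input/*.json'],
    output_path='gs://example/output/{{ ds_nodash }}/',
    model_name='my_model',
    version_name='my_version',
    dag=dag)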
    MLEngineModelOperator
class airflow.contrib.operators.mlengine_operator.MLEngineModelOperator(project_id, model, operation='create', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine model.
    Parameters
    · project_id (str) – The Google Cloud project name to which MLEngine model belongs (templated)
    · model (dict) –
    A dictionary containing the information about the model If the operation is create then the model parameter should contain all the information about this model such as name
    If the operation is get the model parameter should contain the name of the model
    · operation (str) –
    The operation to perform Available operations are
    o create Creates a new model as provided by the model parameter
    o get Gets a particular model where the name is specified in model
    · gcp_conn_id (str) – The connection ID to use when fetching connection info
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    MLEngineTrainingOperator
class airflow.contrib.operators.mlengine_operator.MLEngineTrainingOperator(project_id, job_id, package_uris, training_python_module, training_args, region, scale_tier=None, runtime_version=None, python_version=None, job_dir=None, gcp_conn_id='google_cloud_default', delegate_to=None, mode='PRODUCTION', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator for launching a MLEngine training job.
    Parameters
    · project_id (str) – The Google Cloud project name within which MLEngine training job should run (templated)
    · job_id (str) – A unique templated id for the submitted Google MLEngine training job (templated)
    · package_uris (str) – A list of package locations for MLEngine training job which should include the main training program + any additional dependencies (templated)
    · training_python_module (str) – The Python module name to run within MLEngine training job after installing package_uris’ packages (templated)
    · training_args (str) – A list of templated command line arguments to pass to the MLEngine training program (templated)
    · region (str) – The Google Compute Engine region to run the MLEngine training job in (templated)
    · scale_tier (str) – Resource tier for MLEngine training job (templated)
    · runtime_version (str) – The Google Cloud ML runtime version to use for training (templated)
    · python_version (str) – The version of Python used in training (templated)
    · job_dir (str) – A Google Cloud Storage path in which to store training outputs and other data needed for training (templated)
    · gcp_conn_id (str) – The connection ID to use when fetching connection info
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· mode (str) – Can be one of 'DRY_RUN' / 'CLOUD'. In 'DRY_RUN' mode, no real training job will be launched, but the MLEngine training job request will be printed out. In 'CLOUD' mode, a real MLEngine training job creation request will be issued.
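A minimal sketch of a training job (the package URI, module name and scale tier are assumptions for illustration):
train = MLEngineTrainingOperator(
    task_id='ml_training',
    project_id='my-project',
    job_id='train_{{ ds_nodash }}',
    package_uris=['gs://example/packages/trainer-0.1.tar.gz'],  # hypothetical trainer package
    training_python_module='trainer.task',                      # hypothetical entry module
    training_args=['--epochs', '10'],
    region='us-central1',
    scale_tier='STANDARD_1',
    dag=dag)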
    MLEngineVersionOperator
class airflow.contrib.operators.mlengine_operator.MLEngineVersionOperator(project_id, model_name, version_name=None, version=None, operation='create', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine version.
    Parameters
    · project_id (str) – The Google Cloud project name to which MLEngine model belongs
    · model_name (str) – The name of the Google Cloud ML Engine model that the version belongs to (templated)
    · version_name (str) – A name to use for the version being operated upon If not None and the version argument is None or does not have a value for the name key then this will be populated in the payload for the name key (templated)
    · version (dict) – A dictionary containing the information about the version If the operation is create version should contain all the information about this version such as name and deploymentUrl If the operation is get or delete the version parameter should contain the name of the version If it is None the only operation possible would be list (templated)
    · operation (str) –
    The operation to perform Available operations are
    o create Creates a new version in the model specified by model_name in which case the version parameter should contain all the information to create that version (eg name deploymentUrl)
    o get Gets full information of a particular version in the model specified by model_name The name of the version should be specified in the version parameter
    o list Lists all available versions of the model specified by model_name
o delete: Deletes the version specified in the version parameter from the model specified by model_name. The name of the version should be specified in the version parameter.
    · gcp_conn_id (str) – The connection ID to use when fetching connection info
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    Cloud ML Engine Hook
    MLEngineHook
class airflow.contrib.hooks.gcp_mlengine_hook.MLEngineHook(gcp_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
create_job(project_id, job, use_existing_job_fn=None)[source]
Launches a MLEngine job and waits for it to reach a terminal state.
    Parameters
    · project_id (str) – The Google Cloud project id within which MLEngine job will be launched
    · job (dict) –
    MLEngine Job object that should be provided to the MLEngine API such as
{
    'jobId': 'my_job_id',
    'trainingInput': {
        'scaleTier': 'STANDARD_1',
        ...
    }
}
· use_existing_job_fn (function) – In case that a MLEngine job with the same job_id already exists, this method (if provided) will decide whether we should use this existing job: continue waiting for it to finish and return the job object. It should accept a MLEngine job object and return a boolean value indicating whether it is OK to reuse the existing job. If 'use_existing_job_fn' is not provided, we by default reuse the existing MLEngine job.
Returns
The MLEngine job object if the job successfully reaches a terminal state (which might be FAILED or CANCELLED state).
    Return type
    dict
create_model(project_id, model)[source]
Create a Model. Blocks until finished.
create_version(project_id, model_name, version_spec)[source]
Creates the Version on Google Cloud ML Engine.
Returns the operation if the version was created successfully and raises an error otherwise.
delete_version(project_id, model_name, version_name)[source]
Deletes the given version of a model. Blocks until finished.
    get_conn()[source]
    Returns a Google MLEngine service object
get_model(project_id, model_name)[source]
Gets a Model. Blocks until finished.
list_versions(project_id, model_name)[source]
Lists all available versions of a model. Blocks until finished.
set_default_version(project_id, model_name, version_name)[source]
Sets a version to be the default. Blocks until finished.
    Cloud Storage
    Storage Operators
    · FileToGoogleCloudStorageOperator Uploads a file to Google Cloud Storage
    · GoogleCloudStorageCreateBucketOperator Creates a new cloud storage bucket
    · GoogleCloudStorageListOperator List all objects from the bucket with the give string prefix and delimiter in name
    · GoogleCloudStorageDownloadOperator Downloads a file from Google Cloud Storage
    · GoogleCloudStorageToBigQueryOperator Loads files from Google cloud storage into BigQuery
    · GoogleCloudStorageToGoogleCloudStorageOperator Copies objects from a bucket to another with renaming if requested
    FileToGoogleCloudStorageOperator
class airflow.contrib.operators.file_to_gcs.FileToGoogleCloudStorageOperator(src, dst, bucket, google_cloud_storage_conn_id='google_cloud_default', mime_type='application/octet-stream', delegate_to=None, gzip=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Uploads a file to Google Cloud Storage. Optionally can compress the file for upload.
    Parameters
    · src (str) – Path to the local file (templated)
    · dst (str) – Destination path within the specified bucket (templated)
    · bucket (str) – The bucket to upload to (templated)
    · google_cloud_storage_conn_id (str) – The Airflow connection ID to upload with
    · mime_type (str) – The mimetype string
    · delegate_to (str) – The account to impersonate if any
    · gzip (bool) – Allows for file to be compressed and uploaded as gzip
    execute(context)[source]
    Uploads the file to Google cloud storage
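A minimal sketch of uploading a locally produced file to a bucket (the local path and bucket name are illustrative placeholders):
upload_report = FileToGoogleCloudStorageOperator(
    task_id='upload_report_to_gcs',
    src='/tmp/exports/report_{{ ds_nodash }}.csv',  # hypothetical local file
    dst='reports/{{ ds_nodash }}.csv',
    bucket='example-bucket',                        # hypothetical bucket
    gzip=True,
    dag=dag)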
    GoogleCloudStorageCreateBucketOperator
class airflow.contrib.operators.gcs_operator.GoogleCloudStorageCreateBucketOperator(bucket_name, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can't create a bucket with a name that is already in use.
    See also
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
    Parameters
    · bucket_name (str) – The name of the bucket (templated)
    · storage_class (str) –
    This defines how objects in the bucket are stored and determines the SLA and the cost of storage (templated) Values include
    o MULTI_REGIONAL
    o REGIONAL
    o STANDARD
    o NEARLINE
    o COLDLINE
    If this value is not specified when the bucket is created it will default to STANDARD
    · location (str) –
    The location of the bucket (templated) Object data for objects in the bucket resides in physical storage within this region Defaults to US
    See also
https://developers.google.com/storage/docs/bucketlocations
    · project_id (str) – The ID of the GCP Project (templated)
    · labels (dict) – Userprovided labels in keyvalue pairs
    · google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    Example
The following Operator would create a new bucket test-bucket with MULTI_REGIONAL storage class in EU region:
CreateBucket = GoogleCloudStorageCreateBucketOperator(
    task_id='CreateNewBucket',
    bucket_name='test-bucket',
    storage_class='MULTI_REGIONAL',
    location='EU',
    labels={'env': 'dev', 'team': 'airflow'},
    google_cloud_storage_conn_id='airflow-service-account'
)
    GoogleCloudStorageDownloadOperator
class airflow.contrib.operators.gcs_download_operator.GoogleCloudStorageDownloadOperator(bucket, object, filename=None, store_to_xcom_key=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Downloads a file from Google Cloud Storage.
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is (templated)
    · object (str) – The name of the object to download in the Google cloud storage bucket (templated)
· filename (str) – The file path on the local file system (where the operator is being executed) that the file should be downloaded to (templated). If no filename is passed, the downloaded data will not be stored on the local file system.
· store_to_xcom_key (str) – If this param is set, the operator will push the contents of the downloaded file to XCom with the key set in this parameter. If not set, the downloaded data will not be pushed to XCom. (templated)
    · google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    GoogleCloudStorageListOperator
class airflow.contrib.operators.gcs_list_operator.GoogleCloudStorageListOperator(bucket, prefix=None, delimiter=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix and delimiter in name.
This operator returns a python list with the names of objects, which can be used by xcom in the downstream task.
    Parameters
    · bucket (str) – The Google cloud storage bucket to find the objects (templated)
    · prefix (str) – Prefix string which filters objects whose name begin with this prefix (templated)
· delimiter (str) – The delimiter by which you want to filter the objects (templated). For example, to list the CSV files from a directory in GCS you would use delimiter='.csv'.
    · google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    Example
The following Operator would list all the Avro files from the sales/sales-2017 folder in the data bucket:
GCS_Files = GoogleCloudStorageListOperator(
    task_id='GCS_Files',
    bucket='data',
    prefix='sales/sales-2017/',
    delimiter='.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
    GoogleCloudStorageToBigQueryOperator
class airflow.contrib.operators.gcs_to_bq.GoogleCloudStorageToBigQueryOperator(bucket, source_objects, destination_project_dataset_table, schema_fields=None, schema_object=None, source_format='CSV', compression='NONE', create_disposition='CREATE_IF_NEEDED', skip_leading_rows=0, write_disposition='WRITE_EMPTY', field_delimiter=',', max_bad_records=0, quote_character=None, ignore_unknown_values=False, allow_quoted_newlines=False, allow_jagged_rows=False, max_id_key=None, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, schema_update_options=(), src_fmt_configs=None, external_table=False, time_partitioning=None, cluster_fields=None, autodetect=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Loads files from Google cloud storage into BigQuery.
    The schema to be used for the BigQuery table may be specified in one of two ways You may either directly pass the schema fields in or you may point the operator to a Google cloud storage object name The object in Google cloud storage must be a JSON file with the schema fields in it
    Parameters
    · bucket (str) – The bucket to load from (templated)
    · source_objects (list of str) – List of Google cloud storage URIs to load from (templated) If source_format is DATASTORE_BACKUP’ the list must only contain a single URI
· destination_project_dataset_table (str) – The dotted (<project>.)<dataset>.<table> BigQuery table to load data into. If <project> is not included, the project will be the project defined in the connection json (templated)
· schema_fields (list) – If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load Should not be set when source_format is 'DATASTORE_BACKUP'.
    · schema_object (str) – If set a GCS object path pointing to a json file that contains the schema for the table (templated)
    · source_format (str) – File format to export
    · compression (str) – [Optional] The compression type of the data source Possible values include GZIP and NONE The default value is NONE This setting is ignored for Google Cloud Bigtable Google Cloud Datastore backups and Avro formats
    · create_disposition (str) – The create disposition if the table doesn’t exist
    · skip_leading_rows (int) – Number of rows to skip when loading from a CSV
    · write_disposition (str) – The write disposition if the table already exists
    · field_delimiter (str) – The delimiter to use when loading from a CSV
    · max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job
    · quote_character (str) – The value that is used to quote data sections in a CSV file
    · ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow extra values that are not represented in the table schema If true the extra values are ignored If false records with extra columns are treated as bad records and if there are too many bad records an invalid error is returned in the job result
    · allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false)
    · allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns The missing values are treated as nulls If false records with missing trailing columns are treated as bad records and if there are too many bad records an invalid error is returned in the job result Only applicable to CSV ignored for other formats
    · max_id_key (str) – If set the name of a column in the BigQuery table that’s to be loaded This will be used to select the MAX value from BigQuery after the load occurs The results will be returned by the execute() command which in turn gets stored in XCom for future operators to use This can be helpful with incremental loads–during future executions you can pick up from the max ID
    · bigquery_conn_id (str) – Reference to a specific BigQuery hook
    · google_cloud_storage_conn_id (str) – Reference to a specific Google cloud storage hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · schema_update_options (list) – Allows the schema of the destination table to be updated as a side effect of the load job
    · src_fmt_configs (dict) – configure optional fields specific to the source format
    · external_table (bool) – Flag to specify if the destination table should be a BigQuery external table Default Value is False
    · time_partitioning (dict) – configure optional time partitioning fields ie partition by field type and expiration as per API specifications Note that field’ is not available in concurrency with datasettablepartition
    · cluster_fields (list of str) – Request that the result of this load be stored sorted by one or more columns This is only available in conjunction with time_partitioning The order of columns given determines the sort order Not applicable for external tables
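A hedged sketch of loading CSV files from Cloud Storage into a BigQuery table; the dataset, table and schema below are assumptions for illustration only.
load_sales = GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_to_bq_sales',
    bucket='data',
    source_objects=['sales/sales-2017/*.csv'],
    destination_project_dataset_table='my_project.sales.sales_2017',  # hypothetical target table
    schema_fields=[
        {'name': 'order_id', 'type': 'STRING', 'mode': 'REQUIRED'},
        {'name': 'amount', 'type': 'FLOAT', 'mode': 'NULLABLE'},
    ],
    skip_leading_rows=1,
    write_disposition='WRITE_TRUNCATE',
    dag=dag)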
    GoogleCloudStorageToGoogleCloudStorageOperator
class airflow.contrib.operators.gcs_to_gcs.GoogleCloudStorageToGoogleCloudStorageOperator(source_bucket, source_object, destination_bucket=None, destination_object=None, move_object=False, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, last_modified_time=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copies objects from a bucket to another, with renaming if requested.
    Parameters
    · source_bucket (str) – The source Google cloud storage bucket where the object is (templated)
    · source_object (str) – The source name of the object to copy in the Google cloud storage bucket (templated) You can use only one wildcard for objects (filenames) within your bucket The wildcard can appear inside the object name or at the end of the object name Appending a wildcard to the bucket name is unsupported
    · destination_bucket (str) – The destination Google cloud storage bucket where the object should be (templated)
· destination_object (str) – The destination name of the object in the destination Google cloud storage bucket (templated). If a wildcard is supplied in the source_object argument, this is the prefix that will be prepended to the final destination objects' paths. Note that the source path's part before the wildcard will be removed; if it needs to be retained it should be appended to destination_object. For example, with prefix foo/* and destination_object blah/, the file foo/baz will be copied to blah/baz; to retain the prefix write the destination_object as e.g. blah/foo, in which case the copied file will be named blah/foo/baz.
    · move_object (bool) – When move object is True the object is moved instead of copied to the new location This is the equivalent of a mv command as opposed to a cp command
    · google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· last_modified_time (datetime) – When specified, the object(s) will be copied/moved only if they were modified after last_modified_time. If tzinfo has not been set, UTC will be assumed.
    Examples
The following Operator would copy a single file named sales/sales-2017/january.avro in the data bucket to the file named copied_sales/2017/january-backup.avro in the data_backup bucket:
copy_single_file = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_single_file',
    source_bucket='data',
    source_object='sales/sales-2017/january.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/january-backup.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
The following Operator would copy all the Avro files from the sales/sales-2017 folder (i.e. with names starting with that prefix) in the data bucket to the copied_sales/2017 folder in the data_backup bucket:
copy_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
The following Operator would move all the Avro files from the sales/sales-2017 folder (i.e. with names starting with that prefix) in the data bucket to the same folder in the data_backup bucket, deleting the original files in the process:
move_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='move_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    move_object=True,
    google_cloud_storage_conn_id=google_cloud_conn_id
)
    GoogleCloudStorageHook
class airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Storage. This hook uses the Google Cloud Platform connection.
copy(source_bucket, source_object, destination_bucket=None, destination_object=None)[source]
Copies an object from a bucket to another, with renaming if requested.
destination_bucket or destination_object can be omitted, in which case the source bucket/object is used, but not both.
    Parameters
    · source_bucket (str) – The bucket of the object to copy from
    · source_object (str) – The object to copy
    · destination_bucket (str) – The destination of the object to copied to Can be omitted then the same bucket is used
    · destination_object (str) – The (renamed) path of the object if given Can be omitted then the same name is used
create_bucket(bucket_name, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None)[source]
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can't create a bucket with a name that is already in use.
See also
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
    Parameters
    · bucket_name (str) – The name of the bucket
    · storage_class (str) –
    This defines how objects in the bucket are stored and determines the SLA and the cost of storage Values include
    o MULTI_REGIONAL
    o REGIONAL
    o STANDARD
    o NEARLINE
    o COLDLINE
    If this value is not specified when the bucket is created it will default to STANDARD
    · location (str) –
    The location of the bucket Object data for objects in the bucket resides in physical storage within this region Defaults to US
    See also
https://developers.google.com/storage/docs/bucketlocations
    · project_id (str) – The ID of the GCP Project
    · labels (dict) – Userprovided labels in keyvalue pairs
    Returns
    If successful it returns the id of the bucket
delete(bucket, object, generation=None)[source]
Delete an object if versioning is not enabled for the bucket, or if the generation parameter is used.
    Parameters
    · bucket (str) – name of the bucket where the object resides
    · object (str) – name of the object to delete
    · generation (str) – if present permanently delete the object of this generation
    Returns
    True if succeeded
download(bucket, object, filename=None)[source]
    Get a file from Google Cloud Storage
    Parameters
    · bucket (str) – The bucket to fetch from
    · object (str) – The object to fetch
    · filename (str) – If set a local file path where the file should be written to
    exists(bucket object)[source]
    Checks for the existence of a file in Google Cloud Storage
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is
    · object (str) – The name of the object to check in the Google cloud storage bucket
    get_conn()[source]
    Returns a Google Cloud Storage service object
    get_crc32c(bucket object)[source]
    Gets the CRC32c checksum of an object in Google Cloud Storage
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is
    · object (str) – The name of the object to check in the Google cloud storage bucket
    get_md5hash(bucket object)[source]
    Gets the MD5 hash of an object in Google Cloud Storage
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is
    · object (str) – The name of the object to check in the Google cloud storage bucket
    get_size(bucket object)[source]
    Gets the size of a file in Google Cloud Storage
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is
    · object (str) – The name of the object to check in the Google cloud storage bucket
    is_updated_after(bucket object ts)[source]
    Checks if an object is updated in Google Cloud Storage
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is
    · object (str) – The name of the object to check in the Google cloud storage bucket
    · ts (datetime) – The timestamp to check against
list(bucket, versions=None, maxResults=None, prefix=None, delimiter=None)[source]
List all objects from the bucket with the given string prefix in name.
    Parameters
    · bucket (str) – bucket name
    · versions (bool) – if true list all versions of the objects
    · maxResults (int) – max count of items to return in a single page of responses
    · prefix (str) – prefix string which filters objects whose name begin with this prefix
    · delimiter (str) – filters objects based on the delimiter (for eg csv’)
    Returns
    a stream of object names matching the filtering criteria
rewrite(source_bucket, source_object, destination_bucket, destination_object=None)[source]
Has the same functionality as copy, except that it will work on files over 5 TB, as well as when copying between locations and/or storage classes.
destination_object can be omitted, in which case source_object is used.
    Parameters
    · source_bucket (str) – The bucket of the object to copy from
    · source_object (str) – The object to copy
    · destination_bucket (str) – The destination of the object to copied to
    · destination_object (str) – The (renamed) path of the object if given Can be omitted then the same name is used
upload(bucket, object, filename, mime_type='application/octet-stream', gzip=False)[source]
    Uploads a local file to Google Cloud Storage
    Parameters
    · bucket (str) – The bucket to upload to
    · object (str) – The object name to set when uploading the local file
    · filename (str) – The local file path to the file to be uploaded
    · mime_type (str) – The MIME type to set when uploading the file
    · gzip (bool) – Option to compress file for upload
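A hedged usage sketch of the hook from a PythonOperator callable (the bucket and object names are placeholders):
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

def archive_latest_report(**_):
    hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
    if hook.exists('example-bucket', 'reports/latest.csv'):
        # copy within the same bucket; destination_bucket is omitted so the source bucket is reused
        hook.copy('example-bucket', 'reports/latest.csv',
                  destination_object='reports/archive/latest.csv')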
    Google Kubernetes Engine
    Google Kubernetes Engine Cluster Operators
· GKEClusterCreateOperator: Creates a Kubernetes Cluster in Google Cloud Platform
· GKEClusterDeleteOperator: Deletes a Kubernetes Cluster in Google Cloud Platform
    GKEClusterCreateOperator
    GKEClusterDeleteOperator
    GKEPodOperator
    Google Kubernetes Engine Hook
class airflow.contrib.hooks.gcp_container_hook.GKEClusterHook(project_id, location)[source]
Bases: airflow.hooks.base_hook.BaseHook
create_cluster(cluster, retry, timeout)[source]
Creates a cluster, consisting of the specified number and type of Google Compute Engine instances.
    Parameters
· cluster (dict or google.cloud.container_v1.types.Cluster) – A Cluster protobuf or dict. If dict is provided, it must be of the same form as the protobuf message google.cloud.container_v1.types.Cluster
· retry (google.api_core.retry.Retry) – A retry object (google.api_core.retry.Retry) used to retry requests. If None is specified, requests will not be retried.
· timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.
    Returns
    The full url to the new or existing cluster
Raises
ParseError: on JSON parsing problems when trying to convert dict. AirflowException: cluster is not dict type nor Cluster proto type.
delete_cluster(name, retry, timeout)[source]
    Deletes the cluster including the Kubernetes endpoint and all worker nodes Firewalls and routes that were configured during cluster creation are also deleted Other Google Compute Engine resources that might be in use by the cluster (eg load balancer resources) will not be deleted if they weren’t present at the initial create time
    Parameters
    · name (str) – The name of the cluster to delete
· retry (google.api_core.retry.Retry) – Retry object used to determine when/if to retry requests. If None is specified, requests will not be retried.
    · timeout (float) – The amount of time in seconds to wait for the request to complete Note that if retry is specified the timeout applies to each individual attempt
    Returns
    The full url to the delete operation if successful else None
get_cluster(name, retry, timeout)[source]
Gets details of the specified cluster.
    Parameters
    · name (str) – The name of the cluster to retrieve
· retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.
· timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.
    Returns
A google.cloud.container_v1.types.Cluster instance
    get_operation(operation_name)[source]
    Fetches the operation from Google Cloud
    Parameters
    operation_name (str) – Name of operation to fetch
    Returns
    The new updated operation from Google Cloud
    wait_for_operation(operation)[source]
    Given an operation continuously fetches the status from Google Cloud until either completion or an error occurring
    Parameters
operation (A google.cloud.container_v1.gapic.enums.Operator) – The Operation to wait for
    Returns
    A new updated operation fetched from Google Cloud
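Putting the hook methods above together, the following is a minimal sketch (not an official example) of creating a cluster from plain Python. The project, zone and cluster values are placeholders, and the dict follows the Cluster protobuf shape described above.
from airflow.contrib.hooks.gcp_container_hook import GKEClusterHook

def create_demo_cluster():
    # hypothetical project/zone values; replace with your own
    hook = GKEClusterHook(project_id='my-gcp-project', location='us-central1-a')
    cluster_spec = {
        'name': 'demo-cluster',        # cluster name inside the project
        'initial_node_count': 1,       # smallest possible cluster
    }
    # create_cluster() accepts a dict shaped like the Cluster protobuf;
    # retry=None disables retries, timeout is in seconds
    cluster_url = hook.create_cluster(cluster=cluster_spec, retry=None, timeout=300)
    print(cluster_url)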
    Qubole
    Apache Airflow has a native operator and hooks to talk to Qubole which lets you submit your big data jobs directly to Qubole from Apache Airflow
    QuboleOperator
    QubolePartitionSensor
    QuboleFileSensor
    Lineage
    Note
    Lineage support is very experimental and subject to change
Airflow can help track origins of data, what happens to it and where it moves over time. This can aid having audit trails and data governance, but also debugging of data flows
Airflow tracks data by means of inlets and outlets of the tasks. Let's work from an example and see how it works
import airflow.utils.dates
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.lineage.datasets import File
from airflow.models import DAG
from datetime import timedelta

FILE_CATEGORIES = ["CAT1", "CAT2", "CAT3"]

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2)
}

dag = DAG(
    dag_id='example_lineage', default_args=args,
    schedule_interval='0 0 * * *',
    dagrun_timeout=timedelta(minutes=60))

f_final = File("/tmp/final")
run_this_last = DummyOperator(task_id='run_this_last', dag=dag,
    inlets={"auto": True},
    outlets={"datasets": [f_final]})

f_in = File("/tmp/whole_directory/")
outlets = []
for file in FILE_CATEGORIES:
    f_out = File("/tmp/{}/{{{{ execution_date }}}}".format(file))
    outlets.append(f_out)
run_this = BashOperator(
    task_id='run_me_first', bash_command='echo 1', dag=dag,
    inlets={"datasets": [f_in]},
    outlets={"datasets": outlets}
)
run_this.set_downstream(run_this_last)
Tasks take the parameters inlets and outlets. Inlets can be manually defined by a list of datasets {"datasets": [dataset1, dataset2]}, can be configured to look for outlets from upstream tasks {"task_ids": [task_id1, task_id2]}, can be configured to pick up outlets from direct upstream tasks {"auto": True}, or a combination of them. Outlets are defined as a list of datasets {"datasets": [dataset1, dataset2]}. Any fields for the dataset are templated with the context when the task is being executed.
    Note
    Operators can add inlets and outlets automatically if the operator supports it
In the example DAG, task run_me_first is a BashOperator that takes three outlets (CAT1, CAT2, CAT3) generated from a list. Note that execution_date is a templated field and will be rendered when the task is running.
    Note
Behind the scenes, Airflow prepares the lineage metadata as part of the pre_execute method of a task. When the task has finished execution, post_execute is called and lineage metadata is pushed into XCom. Thus, if you are creating your own operators that override these methods, make sure to decorate your methods with prepare_lineage and apply_lineage respectively.
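As a hedged illustration of that note, a custom operator that overrides pre_execute/post_execute might keep lineage collection intact like this (the operator name and logic are made up):
from airflow.lineage import prepare_lineage, apply_lineage
from airflow.models import BaseOperator

class MyLineageAwareOperator(BaseOperator):

    @prepare_lineage
    def pre_execute(self, context):
        # custom preparation; the decorator collects the lineage metadata
        self.log.info('preparing %s', self.task_id)

    def execute(self, context):
        self.log.info('doing the actual work')

    @apply_lineage
    def post_execute(self, context, result=None):
        # custom cleanup; the decorator pushes lineage metadata into XCom
        self.log.info('finished %s', self.task_id)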
    Apache Atlas
Airflow can send its lineage metadata to Apache Atlas. You need to enable the atlas backend and configure it properly, e.g. in your airflow.cfg:
[lineage]
backend = airflow.lineage.backend.atlas

[atlas]
username = my_username
password = my_password
host = host
port = 21000
    Please make sure to have the atlasclient package installed
    常见问题
Why isn’t my task getting scheduled?
    There are very many reasons why your task might not be getting scheduled Here are some of the common causes
· Does your script "compile"? Can the Airflow engine parse it and find your DAG object? To test this, you can run airflow list_dags and confirm that your DAG shows up in the list. You can also run airflow list_tasks foo_dag_id --tree and confirm that your task shows up in the list as expected. If you use the CeleryExecutor, you may want to confirm that this works both where the scheduler runs as well as where the worker runs.
· Does the file containing your DAG contain the strings "airflow" and "DAG" somewhere in the contents? When searching the DAG directory, Airflow ignores files not containing "airflow" and "DAG" in order to prevent the DagBag parsing from importing all python files collocated with user's DAGs.
· Is your start_date set properly? The Airflow scheduler triggers the task soon after the start_date + schedule_interval is passed.
· Is your schedule_interval set properly? The default schedule_interval is one day (datetime.timedelta(1)). You must specify a different schedule_interval directly to the DAG object you instantiate, not as a default_param, as task instances do not override their parent DAG's schedule_interval.
· Is your start_date beyond where you can see it in the UI? If you set your start_date to some time, say 3 months ago, you won't be able to see it in the main view in the UI, but you should be able to see it in Menu > Browse > Task Instances.
· Are the dependencies for the task met? The task instances directly upstream from the task need to be in a success state. Also, if you have set depends_on_past=True, the previous task instance needs to have succeeded (except if it is the first run for that task). Also, if wait_for_downstream=True, make sure you understand what it means. You can view how these properties are set from the Task Instance Details page for your task.
· Are the DagRuns you need created and active? A DagRun represents a specific execution of an entire DAG and has a state (running, success, failed, …). The scheduler creates new DagRuns as it moves forward, but never goes back in time to create new ones. The scheduler only evaluates running DagRuns to see what task instances it can trigger. Note that clearing task instances (from the UI or CLI) does set the state of a DagRun back to running. You can bulk view the list of DagRuns and alter states by clicking on the schedule tag for a DAG.
· Is the concurrency parameter of your DAG reached? concurrency defines how many running task instances a DAG is allowed to have, beyond which point things get queued.
· Is the max_active_runs parameter of your DAG reached? max_active_runs defines how many running concurrent instances of a DAG there are allowed to be.
    You may also want to read the Scheduler section of the docs and make sure you fully understand how it proceeds
How do I trigger tasks based on another task’s failure?
    Check out the Trigger Rule section in the Concepts section of the documentation
Why are connection passwords still not encrypted in the metadata db after I installed airflow[crypto]?
    Check out the Securing Connections section in the Howto Guides section of the documentation
What’s the deal with start_date?
    start_date is partly legacy from the preDagRun era but it is still relevant in many ways When creating a new DAG you probably want to set a global start_date for your tasks using default_args The first DagRun to be created will be based on the min(start_date) for all your task From that point on the scheduler creates new DagRuns based on your schedule_interval and the corresponding task instances run as your dependencies are met When introducing new tasks to your DAG you need to pay special attention to start_date and may want to reactivate inactive DagRuns to get the new task onboarded properly
We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
Previously we also recommended using rounded start_date in relation to your schedule_interval. This meant an @hourly would be at 00:00 minutes:seconds, a @daily job at midnight, a @monthly job on the first of the month. This is no longer required. Airflow will now auto align the start_date and the schedule_interval, by using the start_date as the moment to start looking.
    You can use any sensor or a TimeDeltaSensor to delay the execution of tasks within the schedule interval While schedule_interval does allow specifying a datetimetimedelta object we recommend using the macros or cron expressions instead as it enforces this idea of rounded schedules
When using depends_on_past=True, it's important to pay special attention to start_date, as the past dependency is not enforced only on the specific schedule of the start_date specified for the task. It's also important to watch DagRun activity status in time when introducing new depends_on_past=True tasks, unless you are planning on running a backfill for the new task(s).
Also important to note is that the tasks' start_date, in the context of a backfill CLI command, gets overridden by the backfill command's start_date. This allows for a backfill on tasks that have depends_on_past=True to actually start; if that wasn't the case, the backfill just wouldn't start.
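A minimal sketch of the recommended pattern, assuming a daily DAG: use a fixed, rounded start_date rather than datetime.now(). The dag_id is illustrative.
from datetime import datetime
from airflow.models import DAG

dag = DAG(
    dag_id='example_fixed_start_date',   # illustrative dag_id
    start_date=datetime(2019, 1, 1),     # fixed and aligned to midnight
    schedule_interval='@daily',          # cron preset instead of a raw timedelta
)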
How can I create DAGs dynamically?
Airflow looks in your DAGS_FOLDER for modules that contain DAG objects in their global namespace, and adds the objects it finds to the DagBag. Knowing this, all we need is a way to dynamically assign variables in the global namespace, which is easily done in Python using the globals() function from the standard library, which behaves like a simple dictionary:
for i in range(10):
    dag_id = 'foo_{}'.format(i)
    globals()[dag_id] = DAG(dag_id)
    # or better, call a function that returns a DAG object!
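Following the comment above, a hedged sketch of the "function that returns a DAG object" variant; create_dag and its contents are illustrative only.
from datetime import datetime
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

def create_dag(dag_id):
    # build one DAG with a single placeholder task and hand it back
    dag = DAG(dag_id, start_date=datetime(2019, 1, 1), schedule_interval='@daily')
    DummyOperator(task_id='placeholder', dag=dag)
    return dag

for i in range(10):
    dag_id = 'bar_{}'.format(i)
    globals()[dag_id] = create_dag(dag_id)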
What are all the airflow run commands in my process list?
There are many layers of airflow run commands, meaning it can call itself.
· Basic: airflow run fires up an executor and tells it to run an airflow run --local command. If using Celery, this means it puts a command in the queue for it to run remotely on the worker. If using LocalExecutor, that translates into running it in a subprocess pool.
· Local: airflow run --local starts an airflow run --raw command (described below) as a subprocess, and is in charge of emitting heartbeats, listening for external kill signals, and ensuring some cleanup takes place if the subprocess fails.
· Raw: airflow run --raw runs the actual operator's execute method and performs the actual work.
How can my Airflow DAG run faster?
There are three variables we could control to improve Airflow DAG performance (see the sketch after this list):
· parallelism: this variable controls the number of task instances that the Airflow worker can run simultaneously. The user could increase the parallelism variable in airflow.cfg.
· concurrency: the Airflow scheduler will run no more than concurrency task instances for your DAG at any given time. Concurrency is defined in your Airflow DAG. If you do not set the concurrency on your DAG, the scheduler will use the default value from the dag_concurrency entry in your airflow.cfg.
· max_active_runs: the Airflow scheduler will run no more than max_active_runs DagRuns of your DAG at a given time. If you do not set max_active_runs in your DAG, the scheduler will use the default value from the max_active_runs_per_dag entry in your airflow.cfg.
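The sketch below shows where the last two knobs live; the values are illustrative, and parallelism itself can only be changed in airflow.cfg, not on the DAG.
from datetime import datetime
from airflow.models import DAG

dag = DAG(
    dag_id='example_tuned_dag',
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
    concurrency=16,       # at most 16 running task instances for this DAG
    max_active_runs=2,    # at most 2 concurrent DagRuns of this DAG
)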
How can we reduce the Airflow UI page load time?
If your DAG takes a long time to load, you can reduce the value of the default_dag_run_display_number configuration in airflow.cfg to a smaller value. This configuration controls the number of DAG runs to show in the UI, with a default value of 25.
How to fix Exception: Global variable explicit_defaults_for_timestamp needs to be on (1)?
This means explicit_defaults_for_timestamp is disabled in your MySQL server and you need to enable it by:
1. Setting explicit_defaults_for_timestamp = 1 under the mysqld section in your my.cnf file.
2. Restarting the MySQL server.
How to reduce Airflow DAG scheduling latency in production?
· max_threads: the scheduler will spawn multiple threads in parallel to schedule DAGs. This is controlled by max_threads, with a default value of 2. The user should increase this value to a larger one (e.g. the number of CPUs where the scheduler runs + 1) in production.
· scheduler_heartbeat_sec: the user should consider increasing the scheduler_heartbeat_sec config to a higher value (e.g. 60 seconds), which controls how frequently the Airflow scheduler gets the heartbeat and updates the job's entry in the database. A sample configuration follows this list.
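A sample airflow.cfg fragment with illustrative values (tune them for your own deployment):
[scheduler]
# more scheduling threads, e.g. number of CPUs + 1
max_threads = 5
# less frequent heartbeats reduce database load at the cost of latency
scheduler_heartbeat_sec = 60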
    API 参考
    Operator
    Operators allow for generation of certain types of tasks that become nodes in the DAG when instantiated All operators derive from BaseOperator and inherit many attributes and methods that way Refer to the BaseOperator documentation for more details
    There are 3 main types of operators
    · Operators that performs an action or tell another system to perform an action
    · Transfer operators move data from one system to another
    · Sensors are a certain type of operator that will keep running until a certain criterion is met Examples include a specific file landing in HDFS or S3 a partition appearing in Hive or a specific time of the day Sensors are derived from BaseSensorOperator and run a poke method at a specified poke_interval until it returns True
    BaseOperator
    All operators are derived from BaseOperator and acquire much functionality through inheritance Since this is the core of the engine it’s worth taking the time to understand the parameters of BaseOperator to understand the primitive features that can be leveraged in your DAGs
class airflow.models.BaseOperator(task_id, owner='Airflow', email=None, email_on_retry=True, email_on_failure=True, retries=0, retry_delay=datetime.timedelta(0, 300), retry_exponential_backoff=False, max_retry_delay=None, start_date=None, end_date=None, schedule_interval=None, depends_on_past=False, wait_for_downstream=False, dag=None, params=None, default_args=None, adhoc=False, priority_weight=1, weight_rule=u'downstream', queue='default', pool=None, sla=None, execution_timeout=None, on_failure_callback=None, on_success_callback=None, on_retry_callback=None, trigger_rule=u'all_success', resources=None, run_as_user=None, task_concurrency=None, executor_config=None, inlets=None, outlets=None, *args, **kwargs)[source]
Bases: airflow.utils.log.logging_mixin.LoggingMixin
Abstract base class for all operators. Since operators create objects that become nodes in the DAG, BaseOperator contains many recursive methods for DAG crawling behavior. To derive this class, you are expected to override the constructor as well as the 'execute' method.
    Operators derived from this class should perform or trigger certain tasks synchronously (wait for completion) Example of operators could be an operator that runs a Pig job (PigOperator) a sensor operator that waits for a partition to land in Hive (HiveSensorOperator) or one that moves data from Hive to MySQL (Hive2MySqlOperator) Instances of these operators (tasks) target specific operations running specific scripts functions or data transfers
    This class is abstract and shouldn’t be instantiated Instantiating a class derived from this one results in the creation of a task object which ultimately becomes a node in DAG objects Task dependencies should be set by using the set_upstream andor set_downstream methods
    Parameters
    · task_id (str) – a unique meaningful id for the task
    · owner (str) – the owner of the task using the unix username is recommended
    · retries (int) – the number of retries that should be performed before failing the task
    · retry_delay (timedelta) – delay between retries
    · retry_exponential_backoff (bool) – allow progressive longer waits between retries by using exponential backoff algorithm on retry delay (delay will be converted into seconds)
    · max_retry_delay (timedelta) – maximum delay interval between retries
    · start_date (datetime) – The start_date for the task determines the execution_date for the first task instance The best practice is to have the start_date rounded to your DAG’s schedule_interval Daily jobs have their start_date some day at 000000 hourly jobs have their start_date at 0000 of a specific hour Note that Airflow simply looks at the latest execution_date and adds the schedule_interval to determine the next execution_date It is also very important to note that different tasks’ dependencies need to line up in time If task A depends on task B and their start_date are offset in a way that their execution_date don’t line up A’s dependencies will never be met If you are looking to delay a task for example running a daily task at 2AM look into the TimeSensor and TimeDeltaSensor We advise against using dynamic start_date and recommend using fixed ones Read the FAQ entry about start_date for more information
    · end_date (datetime) – if specified the scheduler won’t go beyond this date
    · depends_on_past (bool) – when set to true task instances will run sequentially while relying on the previous task’s schedule to succeed The task instance for the start_date is allowed to run
    · wait_for_downstream (bool) – when set to true an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully before it runs This is useful if the different instances of a task X alter the same asset and this asset is used by tasks downstream of task X Note that depends_on_past is forced to True wherever wait_for_downstream is used
    · queue (str) – which queue to target when running this job Not all executors implement queue management the CeleryExecutor does support targeting specific queues
    · dag (DAG) – a reference to the dag the task is attached to (if any)
    · priority_weight (int) – priority weight of this task against other task This allows the executor to trigger higher priority tasks before others when things get backed up Set priority_weight as a higher number for more important tasks
    · weight_rule (str) – weighting method used for the effective total priority weight of the task Options are { downstream | upstream | absolute } default is downstream When set to downstream the effective weight of the task is the aggregate sum of all downstream descendants As a result upstream tasks will have higher weight and will be scheduled more aggressively when using positive weight values This is useful when you have multiple dag run instances and desire to have all upstream tasks to complete for all runs before each dag can continue processing downstream tasks When set to upstream the effective weight is the aggregate sum of all upstream ancestors This is the opposite where downtream tasks have higher weight and will be scheduled more aggressively when using positive weight values This is useful when you have multiple dag run instances and prefer to have each dag complete before starting upstream tasks of other dags When set to absolute the effective weight is the exact priority_weight specified without additional weighting You may want to do this when you know exactly what priority weight each task should have Additionally when set to absolute there is bonus effect of significantly speeding up the task creation process as for very large DAGS Options can be set as string or using the constants defined in the static class airflowutilsWeightRule
    · pool (str) – the slot pool this task should run in slot pools are a way to limit concurrency for certain tasks
    · sla (datetimetimedelta) – time by which the job is expected to succeed Note that this represents the timedelta after the period is closed For example if you set an SLA of 1 hour the scheduler would send an email soon after 100AM on the 20160102 if the 20160101 instance has not succeeded yet The scheduler pays special attention for jobs with an SLA and sends alert emails for sla misses SLA misses are also recorded in the database for future reference All tasks that share the same SLA time get bundled in a single email sent soon after that time SLA notification are sent once and only once for each task instance
    · execution_timeout (datetimetimedelta) – max time allowed for the execution of this task instance if it goes beyond it will raise and fail
    · on_failure_callback (callable) – a function to be called when a task instance of this task fails a context dictionary is passed as a single parameter to this function Context contains references to related objects to the task instance and is documented under the macros section of the API
    · on_retry_callback (callable) – much like the on_failure_callback except that it is executed when retries occur
    · on_success_callback (callable) – much like the on_failure_callback except that it is executed when the task succeeds
    · trigger_rule (str) – defines the rule by which dependencies are applied for the task to get triggered Options are { all_success | all_failed | all_done | one_success | one_failed | dummy} default is all_success Options can be set as string or using the constants defined in the static class airflowutilsTriggerRule
    · resources (dict) – A map of resource parameter names (the argument names of the Resources constructor) to their values
    · run_as_user (str) – unix username to impersonate while running the task
    · task_concurrency (int) – When set a task will be able to limit the concurrent runs across execution_dates
    · executor_config (dict) –
    Additional tasklevel configuration parameters that are interpreted by a specific executor Parameters are namespaced by the name of executor
    Example to run this task in a specific docker container through the KubernetesExecutor
MyOperator(
    executor_config={
        "KubernetesExecutor":
            {"image": "myCustomDockerImage"}
    }
)
    clear(**kwargs)[source]
    Clears the state of task instances associated with the task following the parameters specified
    dag
    Returns the Operator’s DAG if set otherwise raises an error
    deps
    Returns the list of dependencies for the operator These differ from execution context dependencies in that they are specific to tasks and can be extendedoverridden by subclasses
    downstream_list
    @property list of tasks directly downstream
    execute(context)[source]
    This is the main method to derive when creating an operator Context is the same dictionary used as when rendering jinja templates
    Refer to get_template_context for more context
    get_direct_relative_ids(upstreamFalse)[source]
    Get the direct relative ids to the current task upstream or downstream
    get_direct_relatives(upstreamFalse)[source]
    Get the direct relatives to the current task upstream or downstream
    get_flat_relative_ids(upstreamFalse found_descendantsNone)[source]
    Get a flat list of relatives’ ids either upstream or downstream
    get_flat_relatives(upstreamFalse)[source]
    Get a flat list of relatives either upstream or downstream
    get_task_instances(session start_dateNone end_dateNone)[source]
    Get a set of task instance related to this task for a specific date range
    has_dag()[source]
    Returns True if the Operator has been assigned to a DAG
    on_kill()[source]
    Override this method to cleanup subprocesses when a task instance gets killed Any use of the threading subprocess or multiprocessing module within an operator needs to be cleaned up or it will leave ghost processes behind
    post_execute(context *args **kwargs)[source]
    This hook is triggered right after selfexecute() is called It is passed the execution context and any results returned by the operator
    pre_execute(context *args **kwargs)[source]
    This hook is triggered right before selfexecute() is called
    prepare_template()[source]
    Hook that is triggered after the templated fields get replaced by their content If you need your operator to alter the content of the file before the template is rendered it should override this method to do so
    render_template(attr content context)[source]
    Renders a template either from a file or directly in a field and returns the rendered result
    render_template_from_field(attr content context jinja_env)[source]
    Renders a template from a field If the field is a string it will simply render the string and return the result If it is a collection or nested set of collections it will traverse the structure and render all strings in it
    run(start_dateNone end_dateNone ignore_first_depends_on_pastFalse ignore_ti_stateFalse mark_successFalse)[source]
    Run a set of task instances for a date range
    schedule_interval
    The schedule interval of the DAG always wins over individual tasks so that tasks within a DAG always line up The task still needs a schedule_interval as it may not be attached to a DAG
    set_downstream(task_or_task_list)[source]
    Set a task or a task list to be directly downstream from the current task
    set_upstream(task_or_task_list)[source]
    Set a task or a task list to be directly upstream from the current task
    upstream_list
    @property list of tasks directly upstream
xcom_pull(context, task_ids=None, dag_id=None, key=u'return_value', include_prior_dates=None)[source]
See TaskInstance.xcom_pull()
xcom_push(context, key, value, execution_date=None)[source]
See TaskInstance.xcom_push()
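A hedged sketch of a minimal operator derived from BaseOperator, overriding the constructor and execute() as described above; the class name and behavior are illustrative.
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class HelloOperator(BaseOperator):

    @apply_defaults
    def __init__(self, name, *args, **kwargs):
        super(HelloOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        message = 'Hello {}'.format(self.name)
        self.log.info(message)
        return message   # returned values can be pulled later via xcom_pull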
    BaseSensorOperator
    All sensors are derived from BaseSensorOperator All sensors inherit the timeout and poke_interval on top of the BaseOperator attributes
class airflow.sensors.base_sensor_operator.BaseSensorOperator(poke_interval=60, timeout=604800, soft_fail=False, mode='poke', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator, airflow.models.SkipMixin
    Sensor operators are derived from this class and inherit these attributes
    Sensor operators keep executing at a time interval and succeed when a criteria is met and fail if and when they time out
    Parameters
    · soft_fail (bool) – Set to true to mark the task as SKIPPED on failure
· poke_interval (int) – Time in seconds that the job should wait in between each try
· timeout (int) – Time in seconds before the task times out and fails
· mode (str) – How the sensor operates. Options are: { poke | reschedule }, default is poke. When set to poke, the sensor takes up a worker slot for its whole execution time and sleeps between pokes. Use this mode if the expected runtime of the sensor is short or if a short poke interval is required. When set to reschedule, the sensor task frees the worker slot when the criteria is not yet met and it is rescheduled at a later time. Use this mode if the time before the criteria is met is expected to be quite long. The poke interval should be more than one minute to prevent too much load on the scheduler.
    deps
    Adds one additional dependency for all sensor operators that checks if a sensor task instance can be rescheduled
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
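A hedged sketch of a sensor built on BaseSensorOperator; poke() is the only method that must be overridden, and the file path checked here is illustrative.
import os

from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults

class LocalFileSensor(BaseSensorOperator):

    @apply_defaults
    def __init__(self, filepath, *args, **kwargs):
        super(LocalFileSensor, self).__init__(*args, **kwargs)
        self.filepath = filepath

    def poke(self, context):
        self.log.info('checking for %s', self.filepath)
        return os.path.exists(self.filepath)   # True finishes the sensor successfully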
    Core Operators
    Operators
class airflow.operators.bash_operator.BashOperator(bash_command, xcom_push=False, env=None, output_encoding='utf-8', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Execute a Bash script command or set of commands
    Parameters
· bash_command (str) – The command, set of commands or reference to a bash script (must be '.sh') to be executed (templated)
    · xcom_push (bool) – If xcom_push is True the last line written to stdout will also be pushed to an XCom when the bash command completes
    · env (dict) – If env is not None it must be a mapping that defines the environment variables for the new process these are used instead of inheriting the current process environment which is the default behavior (templated)
    · output_encoding (str) – Output encoding of bash command
    execute(context)[source]
    Execute the bash command in a temporary directory which will be cleaned afterwards
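A hedged usage sketch: run a templated bash command and push its last stdout line to XCom; the surrounding dag object is assumed to exist.
from airflow.operators.bash_operator import BashOperator

print_date = BashOperator(
    task_id='print_execution_date',
    bash_command='echo {{ ds }}',   # ds (the execution date) is rendered by Jinja
    xcom_push=True,
    dag=dag,
)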
class airflow.operators.python_operator.BranchPythonOperator(python_callable, op_args=None, op_kwargs=None, provide_context=False, templates_dict=None, templates_exts=None, *args, **kwargs)[source]
Bases: airflow.operators.python_operator.PythonOperator, airflow.models.SkipMixin
    Allows a workflow to branch or follow a single path following the execution of this task
It derives from the PythonOperator and expects a Python function that returns the task_id to follow. The task_id returned should point to a task directly downstream from {self}. All other branches or directly downstream tasks are marked with a state of "skipped" so that these paths can't move forward. The skipped states are propagated downstream to allow for the DAG state to fill up and the DAG run's state to be inferred.
Note that using tasks with depends_on_past=True downstream from BranchPythonOperator is logically unsound, as the skipped status will invariably lead to blocked tasks that depend on their past successes. Skipped states propagate where all directly upstream tasks are skipped.
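A hedged sketch of branching on the execution date; the downstream task ids ('weekday_task', 'weekend_task') and the dag object are assumed to exist elsewhere.
from airflow.operators.python_operator import BranchPythonOperator

def choose_branch(**context):
    # Monday-Friday go one way, the weekend goes the other
    if context['execution_date'].weekday() < 5:
        return 'weekday_task'
    return 'weekend_task'

branch = BranchPythonOperator(
    task_id='branch_on_day',
    python_callable=choose_branch,
    provide_context=True,
    dag=dag,
)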
class airflow.operators.check_operator.CheckOperator(sql, conn_id=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Performs checks against a db The CheckOperator expects a sql query that will return a single row Each value on that first row is evaluated using python bool casting If any of the values return False the check is failed and errors out
    Note that Python bool casting evals the following as False
    · False
    · 0
    · Empty string ()
    · Empty list ([])
    · Empty dictionary or set ({})
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much more complex queries that could, for instance, check that the table has the same number of rows as the source table upstream, or that the count of today's partition is greater than yesterday's partition, or that a set of metrics are less than 3 standard deviations from the 7-day average.
    This operator can be used as a data quality check in your pipeline and depending on where you put it in your DAG you have the choice to stop the critical path preventing from publishing dubious data or on the side and receive email alerts without stopping the progress of the DAG
    Note that this is an abstract class and get_db_hook needs to be defined Whereas a get_db_hook is hook that gets a single record from an external source
    Parameters
    sql (str) – the sql to be executed (templated)
class airflow.operators.docker_operator.DockerOperator(image, api_version=None, command=None, cpus=1.0, docker_url='unix://var/run/docker.sock', environment=None, force_pull=False, mem_limit=None, network_mode=None, tls_ca_cert=None, tls_client_cert=None, tls_client_key=None, tls_hostname=None, tls_ssl_version=None, tmp_dir='/tmp/airflow', user=None, volumes=None, working_dir=None, xcom_push=False, xcom_all=False, docker_conn_id=None, dns=None, dns_search=None, *args, **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Execute a command inside a docker container
    A temporary directory is created on the host and mounted into a container to allow storing files that together exceed the default disk size of 10GB in a container The path to the mounted directory can be accessed via the environment variable AIRFLOW_TMP_DIR
    If a login to a private registry is required prior to pulling the image a Docker connection needs to be configured in Airflow and the connection ID be provided with the parameter docker_conn_id
    Parameters
    · image (str) – Docker image from which to create the container If image tag is omitted latest will be used
    · api_version (str) – Remote API version Set to auto to automatically detect the server’s version
    · command (str or list) – Command to be run in the container (templated)
· cpus (float) – Number of CPUs to assign to the container. This value gets multiplied with 1024. See https://docs.docker.com/engine/reference/run/#cpu-share-constraint
    · dns (list of strings) – Docker custom DNS servers
    · dns_search (list of strings) – Docker custom DNS search domain
· docker_url (str) – URL of the host running the docker daemon. Default is unix://var/run/docker.sock
    · environment (dict) – Environment variables to set in the container (templated)
    · force_pull (bool) – Pull the docker image on every run Default is False
    · mem_limit (float or str) – Maximum amount of memory the container can use Either a float value which represents the limit in bytes or a string like 128m or 1g
    · network_mode (str) – Network mode for the container
    · tls_ca_cert (str) – Path to a PEMencoded certificate authority to secure the docker connection
    · tls_client_cert (str) – Path to the PEMencoded certificate used to authenticate docker client
    · tls_client_key (str) – Path to the PEMencoded key used to authenticate docker client
    · tls_hostname (str or bool) – Hostname to match against the docker server certificate or False to disable the check
    · tls_ssl_version (str) – Version of SSL to use when communicating with docker daemon
    · tmp_dir (str) – Mount point inside the container to a temporary directory created on the host by the operator The path is also made available via the environment variable AIRFLOW_TMP_DIR inside the container
    · user (int or str) – Default user inside the docker container
· volumes – List of volumes to mount into the container, e.g. ['/host/path:/container/path', '/host/path2:/container/path2:ro']
· working_dir (str) – Working directory to set on the container (equivalent to the -w switch of the docker client)
· xcom_push (bool) – Whether the stdout will be pushed to the next step using XCom. The default is False
    · xcom_all (bool) – Push all the stdout or just the last line The default is False (last line)
    · docker_conn_id (str) – ID of the Airflow connection to use
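A hedged usage sketch; the image, command and dag object are placeholders.
from airflow.operators.docker_operator import DockerOperator

docker_task = DockerOperator(
    task_id='run_in_container',
    image='python:3.8-slim',                  # any pullable image
    command='python -c "print(1)"',
    docker_url='unix://var/run/docker.sock',  # local docker daemon
    network_mode='bridge',
    dag=dag,
)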
    class airflowoperatorsdummy_operatorDummyOperator(*args **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Operator that does literally nothing It can be used to group tasks in a DAG
    class airflowoperatorsdruid_check_operatorDruidCheckOperator(sql druid_broker_conn_id'druid_broker_default' *args **kwargs)[source]
    Bases airflowoperatorscheck_operatorCheckOperator
    Performs checks against Druid The DruidCheckOperator expects a sql query that will return a single row Each value on that first row is evaluated using python bool casting If any of the values return False the check is failed and errors out
    Note that Python bool casting evals the following as False
    · False
    · 0
    · Empty string ()
    · Empty list ([])
    · Empty dictionary or set ({})
    Given a query like SELECT COUNT(*) FROM foo it will fail only if the count 0 You can craft much more complex query that could for instance check that the table has the same number of rows as the source table upstream or that the count of today’s partition is greater than yesterday’s partition or that a set of metrics are less than 3 standard deviation for the 7 day average This operator can be used as a data quality check in your pipeline and depending on where you put it in your DAG you have the choice to stop the critical path preventing from publishing dubious data or on the side and receive email alterts without stopping the progress of the DAG
    Parameters
    · sql (str) – the sql to be executed
    · druid_broker_conn_id (str) – reference to the druid broker
    get_db_hook()[source]
    Return the druid db api hook
    get_first(sql)[source]
    Executes the druid sql to druid broker and returns the first resulting row
    Parameters
    sql (str) – the sql statement to be executed (str)
class airflow.operators.email_operator.EmailOperator(to, subject, html_content, files=None, cc=None, bcc=None, mime_subtype='mixed', mime_charset='utf-8', *args, **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Sends an email
    Parameters
    · to (list or string (comma or semicolon delimited)) – list of emails to send the email to (templated)
    · subject (str) – subject line for the email (templated)
    · html_content (str) – content of the email html markup is allowed (templated)
    · files (list) – file names to attach in email
    · cc (list or string (comma or semicolon delimited)) – list of recipients to be added in CC field
    · bcc (list or string (comma or semicolon delimited)) – list of recipients to be added in BCC field
    · mime_subtype (str) – MIME sub content type
    · mime_charset (str) – character set parameter added to the ContentType header
class airflow.operators.generic_transfer.GenericTransfer(sql, destination_table, source_conn_id, destination_conn_id, preoperator=None, *args, **kwargs)[source]
    Bases airflowmodelsBaseOperator
Moves data from one connection to another, assuming that they both provide the required methods in their respective hooks. The source hook needs to expose a get_records method and the destination an insert_rows method.
    This is meant to be used on smallish datasets that fit in memory
    Parameters
    · sql (str) – SQL query to execute against the source database (templated)
    · destination_table (str) – target table (templated)
    · source_conn_id (str) – source connection
· destination_conn_id (str) – destination connection
    · preoperator (str or list of str) – sql statement or list of statements to be executed prior to loading the data (templated)
    class airflowoperatorshive_to_druidHiveToDruidTransfer(sql druid_datasource ts_dim metric_specNone hive_cli_conn_id'hive_cli_default' druid_ingest_conn_id'druid_ingest_default' metastore_conn_id'metastore_default' hadoop_dependency_coordinatesNone intervalsNone num_shards1 target_partition_size1 query_granularity'NONE' segment_granularity'DAY' hive_tblpropertiesNone job_propertiesNone *args **kwargs)[source]
    Bases airflowmodelsBaseOperator
Moves data from Hive to Druid
    Parameters
    · sql (str) – SQL query to execute against the Druid database (templated)
    · druid_datasource (str) – the datasource you want to ingest into in druid
    · ts_dim (str) – the timestamp dimension
    · metric_spec (list) – the metrics you want to define for your data
    · hive_cli_conn_id (str) – the hive connection id
    · druid_ingest_conn_id (str) – the druid ingest connection id
    · metastore_conn_id (str) – the metastore connection id
· hadoop_dependency_coordinates (list of str) – list of coordinates to squeeze into the ingest json
    · intervals (list) – list of time intervals that defines segments this is passed as is to the json object (templated)
    · hive_tblproperties (dict) – additional properties for tblproperties in hive for the staging table
    · job_properties (dict) – additional properties for job
    construct_ingest_query(static_path columns)[source]
    Builds an ingest query for an HDFS TSV load
    Parameters
    · static_path (str) – The path on hdfs where the data is
    · columns (list) – List of all the columns that are available
class airflow.operators.hive_to_mysql.HiveToMySqlTransfer(sql, mysql_table, hiveserver2_conn_id='hiveserver2_default', mysql_conn_id='mysql_default', mysql_preoperator=None, mysql_postoperator=None, bulk_load=False, *args, **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Moves data from Hive to MySQL note that for now the data is loaded into memory before being pushed to MySQL so this operator should be used for smallish amount of data
    Parameters
    · sql (str) – SQL query to execute against Hive server (templated)
    · mysql_table (str) – target MySQL table use dot notation to target a specific database (templated)
    · mysql_conn_id (str) – source mysql connection
    · hiveserver2_conn_id (str) – destination hive connection
· mysql_preoperator (str) – sql statement to run against mysql prior to import, typically used to truncate or delete the data in place, allowing the task to be idempotent (running the task twice won't double load data) (templated)
    · mysql_postoperator (str) – sql statement to run against mysql after the import typically used to move data from staging to production and issue cleanup commands (templated)
· bulk_load (bool) – flag to use the bulk_load option. This loads mysql directly from a tab-delimited text file using the LOAD DATA LOCAL INFILE command. This option requires an extra connection parameter for the destination MySQL connection: {'local_infile': true}
class airflow.operators.hive_operator.HiveOperator(hql, hive_cli_conn_id=u'hive_cli_default', schema=u'default', hiveconfs=None, hiveconf_jinja_translate=False, script_begin_tag=None, run_as_owner=False, mapred_queue=None, mapred_queue_priority=None, mapred_job_name=None, *args, **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Executes hql code or hive script in a specific Hive database
    Parameters
    · hql (str) – the hql to be executed Note that you may also use a relative path from the dag file of a (template) hive script (templated)
    · hive_cli_conn_id (str) – reference to the Hive database (templated)
· hiveconfs (dict) – if defined, these key value pairs will be passed to hive as -hiveconf "key"="value"
· hiveconf_jinja_translate (bool) – when True, hiveconf-type templating ${var} gets translated into jinja-type templating {{ var }} and ${hiveconf:var} gets translated into jinja-type templating {{ var }}. Note that you may want to use this along with the DAG(user_defined_macros=myargs) parameter. View the DAG object documentation for more details.
    · script_begin_tag (str) – If defined the operator will get rid of the part of the script before the first occurrence of script_begin_tag
    · mapred_queue (str) – queue used by the Hadoop CapacityScheduler (templated)
    · mapred_queue_priority (str) – priority within CapacityScheduler queue Possible settings include VERY_HIGH HIGH NORMAL LOW VERY_LOW
    · mapred_job_name (str) – This name will appear in the jobtracker This can make monitoring easier
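A hedged usage sketch; the HQL, connection id, queue name and dag object are placeholders.
from airflow.operators.hive_operator import HiveOperator

add_partition = HiveOperator(
    task_id='add_daily_partition',
    hql="ALTER TABLE my_db.my_table ADD IF NOT EXISTS PARTITION (ds='{{ ds }}')",
    hive_cli_conn_id='hive_cli_default',
    mapred_queue='etl',            # illustrative CapacityScheduler queue
    dag=dag,
)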
    class airflowoperatorshive_stats_operatorHiveStatsCollectionOperator(table partition extra_exprsNone col_blacklistNone assignment_funcNone metastore_conn_id'metastore_default' presto_conn_id'presto_default' mysql_conn_id'airflow_db' *args **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Gathers partition statistics using a dynamically generated Presto query inserts the stats into a MySql table with this format Stats overwrite themselves if you rerun the same datepartition
CREATE TABLE hive_stats (
    ds VARCHAR(16),
    table_name VARCHAR(500),
    metric VARCHAR(200),
    value BIGINT
);
    Parameters
    · table (str) – the source table in the format databasetable_name (templated)
· partition (dict of {col: value}) – the source partition (templated)
    · extra_exprs (dict) – dict of expression to run against the table where keys are metric names and values are Presto compatible expressions
    · col_blacklist (list) – list of columns to blacklist consider blacklisting blobs large json columns …
    · assignment_func (function) – a function that receives a column name and a type and returns a dict of metric names and an Presto expressions If None is returned the global defaults are applied If an empty dictionary is returned no stats are computed for that column
class airflow.operators.check_operator.IntervalCheckOperator(table, metrics_thresholds, date_filter_column='ds', days_back=-7, conn_id=None, *args, **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before
    Note that this is an abstract class and get_db_hook needs to be defined Whereas a get_db_hook is hook that gets a single record from an external source
    Parameters
    · table (str) – the table name
    · days_back (int) – number of days between ds and the ds we want to check against Defaults to 7 days
    · metrics_threshold (dict) – a dictionary of ratios indexed by metrics
    class airflowoperatorslatest_only_operatorLatestOnlyOperator(task_id owner'Airflow' emailNone email_on_retryTrue email_on_failureTrue retries0 retry_delaydatetimetimedelta(0 300) retry_exponential_backoffFalse max_retry_delayNone start_dateNone end_dateNone schedule_intervalNone depends_on_pastFalse wait_for_downstreamFalse dagNone paramsNone default_argsNone adhocFalse priority_weight1 weight_ruleu'downstream' queue'default' poolNone slaNone execution_timeoutNone on_failure_callbackNone on_success_callbackNone on_retry_callbackNone trigger_ruleu'all_success' resourcesNone run_as_userNone task_concurrencyNone executor_configNone inletsNone outletsNone *args **kwargs)[source]
    Bases airflowmodelsBaseOperator airflowmodelsSkipMixin
    Allows a workflow to skip tasks that are not running during the most recent schedule interval
    If the task is run outside of the latest schedule interval all directly downstream tasks will be skipped
class airflow.operators.mssql_operator.MsSqlOperator(sql, mssql_conn_id='mssql_default', parameters=None, autocommit=False, database=None, *args, **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Executes sql code in a specific Microsoft SQL database
    Parameters
    · sql (str or string pointing to a template file with sql extension (templated)) – the sql code to be executed
    · mssql_conn_id (str) – reference to a specific mssql database
    · parameters (mapping or iterable) – (optional) the parameters to render the SQL query with
    · autocommit (bool) – if True each command is automatically committed (default value False)
    · database (str) – name of database which overwrite defined one in connection
    class airflowoperatorsmssql_to_hiveMsSqlToHiveTransfer(sql hive_table createTrue recreateFalse partitionNone delimiteru'x01' mssql_conn_id'mssql_default' hive_cli_conn_id'hive_cli_default' tblpropertiesNone *args **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Moves data from Microsoft SQL Server to Hive The operator runs your query against Microsoft SQL Server stores the file locally before loading it into a Hive table If the create or recreate arguments are set to True a CREATE TABLE and DROP TABLE statements are generated Hive data types are inferred from the cursor’s metadata Note that the table generated in Hive uses STORED AS textfile which isn’t the most efficient serialization format If a large amount of data is loaded andor if the table gets queried considerably you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator
    Parameters
    · sql (str) – SQL query to execute against the Microsoft SQL Server database (templated)
    · hive_table (str) – target Hive table use dot notation to target a specific database (templated)
    · create (bool) – whether to create the table if it doesn’t exist
    · recreate (bool) – whether to drop and recreate the table at every execution
    · partition (dict) – target partition as a dict of partition columns and values (templated)
    · delimiter (str) – field delimiter in the file
    · mssql_conn_id (str) – source Microsoft SQL Server connection
    · hive_conn_id (str) – destination hive connection
    · tblproperties (dict) – TBLPROPERTIES of the hive table being created
class airflow.operators.mysql_operator.MySqlOperator(sql, mysql_conn_id='mysql_default', parameters=None, autocommit=False, database=None, *args, **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Executes sql code in a specific MySQL database
    Parameters
    · sql (Can receive a str representing a sql statement a list of str (sql statements) or reference to a template file Template reference are recognized by str ending in 'sql') – the sql code to be executed (templated)
    · mysql_conn_id (str) – reference to a specific mysql database
    · parameters (mapping or iterable) – (optional) the parameters to render the SQL query with
    · autocommit (bool) – if True each command is automatically committed (default value False)
    · database (str) – name of database which overwrite defined one in connection
class airflow.operators.mysql_to_hive.MySqlToHiveTransfer(sql, hive_table, create=True, recreate=False, partition=None, delimiter=u'\x01', mysql_conn_id='mysql_default', hive_cli_conn_id='hive_cli_default', tblproperties=None, *args, **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Moves data from MySql to Hive The operator runs your query against MySQL stores the file locally before loading it into a Hive table If the create or recreate arguments are set to True a CREATE TABLE and DROP TABLE statements are generated Hive data types are inferred from the cursor’s metadata Note that the table generated in Hive uses STORED AS textfile which isn’t the most efficient serialization format If a large amount of data is loaded andor if the table gets queried considerably you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator
    Parameters
    · sql (str) – SQL query to execute against the MySQL database (templated)
    · hive_table (str) – target Hive table use dot notation to target a specific database (templated)
    · create (bool) – whether to create the table if it doesn’t exist
    · recreate (bool) – whether to drop and recreate the table at every execution
    · partition (dict) – target partition as a dict of partition columns and values (templated)
    · delimiter (str) – field delimiter in the file
    · mysql_conn_id (str) – source mysql connection
    · hive_conn_id (str) – destination hive connection
    · tblproperties (dict) – TBLPROPERTIES of the hive table being created
    class airflowoperatorspig_operatorPigOperator(pig pig_cli_conn_id'pig_cli_default' pigparams_jinja_translateFalse *args **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Executes pig script
    Parameters
    · pig (str) – the pig latin script to be executed (templated)
    · pig_cli_conn_id (str) – reference to the Hive database
    · pigparams_jinja_translate (bool) – when True pig paramstype templating {var} gets translated into jinjatype templating {{ var }} Note that you may want to use this along with the DAG(user_defined_macrosmyargs) parameter View the DAG object documentation for more details
class airflow.operators.postgres_operator.PostgresOperator(sql, postgres_conn_id='postgres_default', autocommit=False, parameters=None, database=None, *args, **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Executes sql code in a specific Postgres database
    Parameters
    · sql (Can receive a str representing a sql statement a list of str (sql statements) or reference to a template file Template reference are recognized by str ending in 'sql') – the sql code to be executed (templated)
    · postgres_conn_id (str) – reference to a specific postgres database
    · autocommit (bool) – if True each command is automatically committed (default value False)
    · parameters (mapping or iterable) – (optional) the parameters to render the SQL query with
    · database (str) – name of database which overwrite defined one in connection
    class airflowoperatorspresto_check_operatorPrestoCheckOperator(sql presto_conn_id'presto_default' *args **kwargs)[source]
    Bases airflowoperatorscheck_operatorCheckOperator
    Performs checks against Presto The PrestoCheckOperator expects a sql query that will return a single row Each value on that first row is evaluated using python bool casting If any of the values return False the check is failed and errors out
    Note that Python bool casting evals the following as False
    · False
    · 0
    · Empty string ()
    · Empty list ([])
    · Empty dictionary or set ({})
    Given a query like SELECT COUNT(*) FROM foo it will fail only if the count 0 You can craft much more complex query that could for instance check that the table has the same number of rows as the source table upstream or that the count of today’s partition is greater than yesterday’s partition or that a set of metrics are less than 3 standard deviation for the 7 day average
    This operator can be used as a data quality check in your pipeline and depending on where you put it in your DAG you have the choice to stop the critical path preventing from publishing dubious data or on the side and receive email alterts without stopping the progress of the DAG
    Parameters
    · sql (str) – the sql to be executed
    · presto_conn_id (str) – reference to the Presto database
    class airflowoperatorspresto_check_operatorPrestoIntervalCheckOperator(table metrics_thresholds date_filter_column'ds' days_back7 presto_conn_id'presto_default' *args **kwargs)[source]
    Bases airflowoperatorscheck_operatorIntervalCheckOperator
    Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before
    Parameters
    · table (str) – the table name
    · days_back (int) – number of days between ds and the ds we want to check against Defaults to 7 days
    · metrics_threshold (dict) – a dictionary of ratios indexed by metrics
    · presto_conn_id (str) – reference to the Presto database
    class airflowoperatorspresto_to_mysqlPrestoToMySqlTransfer(sql mysql_table presto_conn_id'presto_default' mysql_conn_id'mysql_default' mysql_preoperatorNone *args **kwargs)[source]
    Bases airflowmodelsBaseOperator
    Moves data from Presto to MySQL note that for now the data is loaded into memory before being pushed to MySQL so this operator should be used for smallish amount of data
    Parameters
    · sql (str) – SQL query to execute against Presto (templated)
    · mysql_table (str) – target MySQL table use dot notation to target a specific database (templated)
    · mysql_conn_id (str) – source mysql connection
    · presto_conn_id (str) – source presto connection
    · mysql_preoperator (str) – sql statement to run against mysql prior to import typically use to truncate of delete in place of the data coming in allowing the task to be idempotent (running the task twice won’t double load data) (templated)
    class airflowoperatorspresto_check_operatorPrestoValueCheckOperator(sql pass_value toleranceNone presto_conn_id'presto_default' *args **kwargs)[source]
    Bases airflowoperatorscheck_operatorValueCheckOperator
    Performs a simple value check using sql code
    Parameters
    · sql (str) – the sql to be executed
    · presto_conn_id (str) – reference to the Presto database
class airflow.operators.python_operator.PythonOperator(python_callable, op_args=None, op_kwargs=None, provide_context=False, templates_dict=None, templates_exts=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Executes a Python callable
    Parameters
    · python_callable (python callable) – A reference to an object that is callable
    · op_kwargs (dict) – a dictionary of keyword arguments that will get unpacked in your function
· op_args (list) – a list of positional arguments that will get unpacked when calling your callable
    · provide_context (bool) – if set to true Airflow will pass a set of keyword arguments that can be used in your function This set of kwargs correspond exactly to what you can use in your jinja templates For this to work you need to define **kwargs in your function header
    · templates_dict (dict of str) – a dictionary where the values are templates that will get templated by the Airflow engine sometime between __init__ and execute takes place and are made available in your callable’s context after the template has been applied (templated)
· templates_exts (list(str)) – a list of file extensions to resolve while processing templated fields, for example ['.sql', '.hql']
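A hedged usage sketch passing op_kwargs and the Airflow context into a callable; the dag object is assumed to exist.
from airflow.operators.python_operator import PythonOperator

def greet(name, **context):
    print('Hello {} on {}'.format(name, context['ds']))

greet_task = PythonOperator(
    task_id='greet',
    python_callable=greet,
    op_kwargs={'name': 'airflow'},
    provide_context=True,     # injects the template context as keyword arguments
    dag=dag,
)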
class airflow.operators.python_operator.PythonVirtualenvOperator(python_callable, requirements=None, python_version=None, use_dill=False, system_site_packages=True, op_args=None, op_kwargs=None, string_args=None, templates_dict=None, templates_exts=None, *args, **kwargs)[source]
    Bases airflowoperatorspython_operatorPythonOperator
    Allows one to run a function in a virtualenv that is created and destroyed automatically (with certain caveats)
The function must be defined using def, and not be part of a class. All imports must happen inside the function and no variables outside of the scope may be referenced. A global scope variable named virtualenv_string_args will be available (populated by string_args). In addition, one can pass things through op_args and op_kwargs, and one can use a return value. Note that if your virtualenv runs in a different Python major version than Airflow, you cannot use return values, op_args, or op_kwargs. You can use string_args though.
Parameters
· python_callable – A python function with no references to outside variables, defined with def, which will be run in a virtualenv
· requirements (list(str)) – A list of requirements as specified in a pip install command
· python_version (str) – The Python version to run the virtualenv with. Note that both '2' and '2.7' are acceptable forms.
    · use_dill (bool) – Whether to use dill to serialize the args and result (pickle is default) This allow more complex types but requires you to include dill in your requirements
    · system_site_packages (bool) – Whether to include system_site_packages in your virtualenv See virtualenv documentation for more information
    · op_args – A list of positional arguments to pass to python_callable
    · op_kwargs (dict) – A dict of keyword arguments to pass to python_callable
    · string_args (list(str)) – Strings that are present in the global var virtualenv_string_args available to python_callable at runtime as a list(str) Note that args are split by newline
    · templates_dict (dict of str) – a dictionary where the values are templates that will get templated by the Airflow engine sometime between __init__ and execute takes place and are made available in your callable’s context after the template has been applied
    · templates_exts (list(str)) – a list of file extensions to resolve while processing templated fields, for example ['.sql', '.hql']
    class airflow.operators.s3_file_transform_operator.S3FileTransformOperator(source_s3_key, dest_s3_key, transform_script=None, select_expression=None, source_aws_conn_id='aws_default', source_verify=None, dest_aws_conn_id='aws_default', dest_verify=None, replace=False, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Copies data from a source S3 location to a temporary location on the local filesystem. Runs a transformation on this file as specified by the transformation script and uploads the output to a destination S3 location.
    The locations of the source and the destination files in the local filesystem are provided as the first and second arguments to the transformation script. The transformation script is expected to read the data from source, transform it, and write the output to the local destination file. The operator then takes over control and uploads the local destination file to S3.
    S3 Select is also available to filter the source contents Users can omit the transformation script if S3 Select expression is specified
    Parameters
    · source_s3_key (str) – The key to be retrieved from S3 (templated)
    · source_aws_conn_id (str) – source s3 connection
    · source_verify (bool or str) –
    Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values:
    o False: do not validate SSL certificates. SSL will still be used
    (unless use_ssl is False), but SSL certificates will not be verified.
    o path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
    You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
    This is also applicable to dest_verify.
    · dest_s3_key (str) – The key to be written from S3 (templated)
    · dest_aws_conn_id (str) – destination s3 connection
    · replace (bool) – Replace dest S3 key if it already exists
    · transform_script (str) – location of the executable transformation script
    · select_expression (str) – S3 Select expression
    class airflow.operators.s3_to_hive_operator.S3ToHiveTransfer(s3_key, field_dict, hive_table, delimiter=',', create=True, recreate=False, partition=None, headers=False, check_headers=False, wildcard_match=False, aws_conn_id='aws_default', verify=None, hive_cli_conn_id='hive_cli_default', input_compressed=False, tblproperties=None, select_expression=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Moves data from S3 to Hive. The operator downloads a file from S3, stores the file locally before loading it into a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor’s metadata.
    Note that the table generated in Hive uses STORED AS textfile, which isn’t the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
    Parameters
    · s3_key (str) – The key to be retrieved from S3 (templated)
    · field_dict (dict) – A dictionary of the fields name in the file as keys and their Hive types as values
    · hive_table (str) – target Hive table use dot notation to target a specific database (templated)
    · create (bool) – whether to create the table if it doesn’t exist
    · recreate (bool) – whether to drop and recreate the table at every execution
    · partition (dict) – target partition as a dict of partition columns and values (templated)
    · headers (bool) – whether the file contains column names on the first line
    · check_headers (bool) – whether the column names on the first line should be checked against the keys of field_dict
    · wildcard_match (bool) – whether the s3_key should be interpreted as a Unix wildcard pattern
    · delimiter (str) – field delimiter in the file
    · aws_conn_id (str) – source s3 connection
    · hive_cli_conn_id (str) – destination hive connection
    · input_compressed (bool) – Boolean to determine if file decompression is required to process headers
    · tblproperties (dict) – TBLPROPERTIES of the hive table being created
    · select_expression (str) – S3 Select expression
    Param verify
    Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values: False: do not validate SSL certificates. SSL will still be used
    (unless use_ssl is False), but SSL certificates will not be verified.
    · path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
    You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
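    A hedged usage sketch of S3ToHiveTransfer; the bucket key, field names, and Hive table below are made up for illustration, and `dag` is assumed to exist:
    # Illustrative S3ToHiveTransfer sketch; s3_key, field_dict and hive_table are hypothetical
    from airflow.operators.s3_to_hive_operator import S3ToHiveTransfer

    s3_to_hive = S3ToHiveTransfer(
        task_id='s3_to_hive_staging',
        s3_key='incoming/orders/{{ ds }}/orders.csv',        # assumed key layout
        field_dict={'order_id': 'BIGINT', 'amount': 'DOUBLE'},
        hive_table='staging.orders',
        partition={'ds': '{{ ds }}'},
        headers=True,
        check_headers=True,
        aws_conn_id='aws_default',
        hive_cli_conn_id='hive_cli_default',
        dag=dag)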
    class airflow.operators.s3_to_redshift_operator.S3ToRedshiftTransfer(schema, table, s3_bucket, s3_key, redshift_conn_id='redshift_default', aws_conn_id='aws_default', verify=None, copy_options=(), autocommit=False, parameters=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Executes a COPY command to load files from S3 into Redshift
    Parameters
    · schema (str) – reference to a specific schema in redshift database
    · table (str) – reference to a specific table in redshift database
    · s3_bucket (str) – reference to a specific S3 bucket
    · s3_key (str) – reference to a specific S3 key
    · redshift_conn_id (str) – reference to a specific redshift database
    · aws_conn_id (str) – reference to a specific S3 connection
    · copy_options (list) – reference to a list of COPY options
    Param verify
    Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values: False: do not validate SSL certificates. SSL will still be used
    (unless use_ssl is False), but SSL certificates will not be verified.
    · path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
    You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
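    A hedged usage sketch of S3ToRedshiftTransfer; the schema, table, bucket, and key names are placeholders, and `dag` is assumed to exist:
    # Illustrative S3ToRedshiftTransfer sketch; names are hypothetical
    from airflow.operators.s3_to_redshift_operator import S3ToRedshiftTransfer

    load_orders = S3ToRedshiftTransfer(
        task_id='s3_to_redshift_orders',
        schema='analytics',
        table='orders',
        s3_bucket='my-data-lake',
        s3_key='exports/orders',
        redshift_conn_id='redshift_default',
        aws_conn_id='aws_default',
        copy_options=['CSV', 'IGNOREHEADER 1'],   # passed through to the COPY command
        dag=dag)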
    class airflow.operators.python_operator.ShortCircuitOperator(python_callable, op_args=None, op_kwargs=None, provide_context=False, templates_dict=None, templates_exts=None, *args, **kwargs)[source]
    Bases: airflow.operators.python_operator.PythonOperator, airflow.models.SkipMixin
    Allows a workflow to continue only if a condition is met Otherwise the workflow shortcircuits and downstream tasks are skipped
    The ShortCircuitOperator is derived from the PythonOperator It evaluates a condition and shortcircuits the workflow if the condition is False Any downstream tasks are marked with a state of skipped If the condition is True downstream tasks proceed as normal
    The condition is determined by the result of python_callable
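    A short sketch of the short-circuit pattern; the callable and downstream task are illustrative, and `dag` is assumed to exist:
    # ShortCircuitOperator sketch: downstream tasks run only on weekdays
    from datetime import datetime
    from airflow.operators.python_operator import ShortCircuitOperator

    def is_weekday(**kwargs):
        # Returning False marks all downstream tasks as skipped
        return datetime.now().weekday() < 5

    check_weekday = ShortCircuitOperator(
        task_id='check_weekday',
        python_callable=is_weekday,
        dag=dag)

    check_weekday >> some_downstream_task   # some_downstream_task is assumed to exist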
    class airflow.operators.http_operator.SimpleHttpOperator(endpoint, method='POST', data=None, headers=None, response_check=None, extra_options=None, xcom_push=False, http_conn_id='http_default', *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Calls an endpoint on an HTTP system to execute an action
    Parameters
    · http_conn_id (str) – The connection to run the operator against
    · endpoint (str) – The relative part of the full url (templated)
    · method (str) – The HTTP method to use default POST
    · data (For POST/PUT, depends on the content-type parameter; for GET a dictionary of key/value string pairs) – The data to pass. POST data in POST/PUT and params in the URL for a GET request (templated)
    · headers (a dictionary of string key/value pairs) – The HTTP headers to be added to the GET request
    · response_check (A lambda or defined function) – A check against the requests’ response object. Returns True for ‘pass’ and False otherwise
    · extra_options (A dictionary of options, where key is string and value depends on the option that's being modified) – Extra options for the requests library, see the requests documentation (options to modify timeout, ssl, etc)
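    A hedged usage sketch of SimpleHttpOperator; the endpoint and connection are placeholders, and `dag` is assumed to exist:
    # SimpleHttpOperator sketch; the host comes from the http connection
    from airflow.operators.http_operator import SimpleHttpOperator

    trigger_job = SimpleHttpOperator(
        task_id='post_to_api',
        http_conn_id='http_default',
        endpoint='api/v1/jobs',                       # hypothetical endpoint
        method='POST',
        data='{"date": "{{ ds }}"}',
        headers={'Content-Type': 'application/json'},
        response_check=lambda response: response.status_code == 200,
        dag=dag)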
    class airflow.operators.slack_operator.SlackAPIOperator(slack_conn_id=None, token=None, method=None, api_params=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Base Slack Operator The SlackAPIPostOperator is derived from this operator In the future additional Slack API Operators will be derived from this class as well
    Parameters
    · slack_conn_id (str) – Slack connection ID whose password is the Slack API token
    · token (str) – Slack API token (https://api.slack.com/web)
    · method (str) – The Slack API Method to Call (https://api.slack.com/methods)
    · api_params (dict) – API Method call parameters (https://api.slack.com/methods)
    construct_api_call_params()[source]
    Used by the execute function Allows templating on the source fields of the api_call_params dict before construction
    Override in child classes. Each SlackAPIOperator child class is responsible for having a construct_api_call_params function which sets self.api_call_params with a dict of API call parameters (https://api.slack.com/methods)
    execute(**kwargs)[source]
    SlackAPIOperator calls will not fail even if the call is unsuccessful. It should not prevent a DAG from completing in success.
    class airflow.operators.slack_operator.SlackAPIPostOperator(channel='#general', username='Airflow', text='No message has been set.\nHere is a cat video instead\nhttps://www.youtube.com/watch?v=J---aiyznGQ', icon_url='https://raw.githubusercontent.com/apache/incubator-airflow/master/airflow/www/static/pin_100.jpg', attachments=None, *args, **kwargs)[source]
    Bases: airflow.operators.slack_operator.SlackAPIOperator
    Posts messages to a slack channel
    Parameters
    · channel (str) – channel in which to post message on slack name (#general) or ID (C12318391) (templated)
    · username (str) – Username that airflow will be posting to Slack as (templated)
    · text (str) – message to send to slack (templated)
    · icon_url (str) – url to icon used for this message
    · attachments (array of hashes) – extra formatting details (templated); see https://api.slack.com/docs/attachments
    construct_api_call_params()[source]
    Used by the execute function Allows templating on the source fields of the api_call_params dict before construction
    Override in child classes. Each SlackAPIOperator child class is responsible for having a construct_api_call_params function which sets self.api_call_params with a dict of API call parameters (https://api.slack.com/methods)
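    A hedged usage sketch of SlackAPIPostOperator; the connection, channel, and message are placeholders, and `dag` is assumed to exist:
    # SlackAPIPostOperator sketch; slack_conn_id's password holds the API token
    from airflow.operators.slack_operator import SlackAPIPostOperator

    notify_slack = SlackAPIPostOperator(
        task_id='notify_slack',
        slack_conn_id='slack_default',
        channel='#data-pipeline',
        username='Airflow',
        text='DAG {{ dag.dag_id }} finished for {{ ds }}',
        dag=dag)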
    class airflow.operators.sqlite_operator.SqliteOperator(sql, sqlite_conn_id='sqlite_default', parameters=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Executes sql code in a specific Sqlite database
    Parameters
    · sql (str or string pointing to a template file; the file must have a '.sql' extension) – the sql code to be executed (templated)
    · sqlite_conn_id (str) – reference to a specific sqlite database
    · parameters (mapping or iterable) – (optional) the parameters to render the SQL query with
    class airflow.operators.subdag_operator.SubDagOperator(**kwargs)[source]
    Bases: airflow.models.BaseOperator
    This runs a sub dag. By convention, a sub dag’s dag_id should be prefixed by its parent and a dot, as in parent.child
    Parameters
    · subdag (airflowDAG) – the DAG object to run as a subdag of the current DAG
    · dag (airflowDAG) – the parent DAG for the subdag
    · executor (airflow.executors) – the executor for this subdag. Defaults to SequentialExecutor. Please see AIRFLOW-74 for more details
    class airflow.operators.dagrun_operator.TriggerDagRunOperator(trigger_dag_id, python_callable=None, execution_date=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Triggers a DAG run for a specified dag_id
    Parameters
    · trigger_dag_id (str) – the dag_id to trigger
    · python_callable (python callable) – a reference to a python function that will be called while passing it the context object and a placeholder object obj for your callable to fill and return if you want a DagRun created. This obj object contains a run_id and payload attribute that you can modify in your function. The run_id should be a unique identifier for that DAG run, and the payload has to be a picklable object that will be made available to your tasks while executing that DAG run. Your function header should look like def foo(context, dag_run_obj)
    · execution_date (datetime.datetime) – Execution date for the dag
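    A hedged usage sketch of TriggerDagRunOperator; 'target_dag' and the callable are illustrative, and `dag` is assumed to exist:
    # TriggerDagRunOperator sketch; return the dag_run_obj to create the run, None to skip
    from airflow.operators.dagrun_operator import TriggerDagRunOperator

    def conditionally_trigger(context, dag_run_obj):
        # Attach a picklable payload that the triggered DAG's tasks can read
        if context['params'].get('trigger', True):
            dag_run_obj.payload = {'message': 'triggered by upstream DAG'}
            return dag_run_obj

    trigger = TriggerDagRunOperator(
        task_id='trigger_target_dag',
        trigger_dag_id='target_dag',
        python_callable=conditionally_trigger,
        params={'trigger': True},
        dag=dag)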
    class airflow.operators.check_operator.ValueCheckOperator(sql, pass_value, tolerance=None, conn_id=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Performs a simple value check using sql code
    Note that this is an abstract class and get_db_hook needs to be defined, where get_db_hook is a hook that gets a single record from an external source.
    Parameters
    sql (str) – the sql to be executed (templated)
    class airflow.operators.redshift_to_s3_operator.RedshiftToS3Transfer(schema, table, s3_bucket, s3_key, redshift_conn_id='redshift_default', aws_conn_id='aws_default', verify=None, unload_options=(), autocommit=False, include_header=False, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Executes an UNLOAD command to s3 as a CSV with headers
    Parameters
    · schema (str) – reference to a specific schema in redshift database
    · table (str) – reference to a specific table in redshift database
    · s3_bucket (str) – reference to a specific S3 bucket
    · s3_key (str) – reference to a specific S3 key
    · redshift_conn_id (str) – reference to a specific redshift database
    · aws_conn_id (str) – reference to a specific S3 connection
    · unload_options (list) – reference to a list of UNLOAD options
    Param verify
    Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values:
    · False: do not validate SSL certificates. SSL will still be used
    (unless use_ssl is False), but SSL certificates will not be verified.
    · path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
    You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
    Sensors
    class airflow.sensors.external_task_sensor.ExternalTaskSensor(external_dag_id, external_task_id, allowed_states=None, execution_delta=None, execution_date_fn=None, *args, **kwargs)[source]
    Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits for a task to complete in a different DAG
    Parameters
    · external_dag_id (str) – The dag_id that contains the task you want to wait for
    · external_task_id (str) – The task_id that contains the task you want to wait for
    · allowed_states (list) – list of allowed states default is ['success']
    · execution_delta (datetime.timedelta) – time difference with the previous execution to look at; the default is the same execution_date as the current task. For yesterday, use [positive] datetime.timedelta(days=1). Either execution_delta or execution_date_fn can be passed to ExternalTaskSensor, but not both.
    · execution_date_fn (callable) – function that receives the current execution date and returns the desired execution dates to query Either execution_delta or execution_date_fn can be passed to ExternalTaskSensor but not both
    poke(**kwargs)[source]
    Function that the sensors defined while deriving this class should override
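    A hedged usage sketch of ExternalTaskSensor; the upstream DAG and task ids are placeholders, and `dag` is assumed to exist:
    # ExternalTaskSensor sketch: wait for a task in another DAG to succeed
    from datetime import timedelta
    from airflow.sensors.external_task_sensor import ExternalTaskSensor

    wait_for_upstream = ExternalTaskSensor(
        task_id='wait_for_upstream_load',
        external_dag_id='upstream_etl',
        external_task_id='load_orders',
        execution_delta=timedelta(hours=1),   # assumes the upstream DAG runs one hour earlier
        allowed_states=['success'],
        dag=dag)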
    class airflow.sensors.hdfs_sensor.HdfsSensor(filepath, hdfs_conn_id='hdfs_default', ignored_ext=None, ignore_copying=True, file_size=None, hook, *args, **kwargs)[source]
    Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits for a file or folder to land in HDFS
    static filter_for_filesize(result sizeNone)[source]
    Will test the filepath result and test if its size is at least self.filesize
    Parameters
    · result – a list of dicts returned by Snakebite ls
    · size – the file size in MB a file should be at least to trigger True
    Returns
    (bool) depending on the matching criteria
    static filter_for_ignored_ext(result ignored_ext ignore_copying)[source]
    Will filter if instructed to do so the result to remove matching criteria
    Parameters
    · result – (list) of dicts returned by Snakebite ls
    · ignored_ext – (list) of ignored extensions
    · ignore_copying – (bool) shall we ignore
    Returns
    (list) of dicts which were not removed
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
    class airflow.sensors.hive_partition_sensor.HivePartitionSensor(table, partition="ds='{{ ds }}'", metastore_conn_id='metastore_default', schema='default', poke_interval=180, *args, **kwargs)[source]
    Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits for a partition to show up in Hive
    Note Because partition supports general logical operators it can be inefficient Consider using NamedHivePartitionSensor instead if you don’t need the full flexibility of HivePartitionSensor
    Parameters
    · table (str) – The name of the table to wait for; supports the dot notation (my_database.my_table)
    · partition (str) – The partition clause to wait for. This is passed as is to the metastore Thrift client get_partitions_by_filter method, and apparently supports SQL-like notation as in ds='2015-01-01' AND type='value' and comparison operators as in ds>='2015-01-01'
    · metastore_conn_id (str) – reference to the metastore thrift service connection id
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
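    A hedged usage sketch of HivePartitionSensor; the database and table names are placeholders, and `dag` is assumed to exist:
    # HivePartitionSensor sketch: block until today's partition exists
    from airflow.sensors.hive_partition_sensor import HivePartitionSensor

    wait_for_partition = HivePartitionSensor(
        task_id='wait_for_orders_partition',
        table='my_database.orders',
        partition="ds='{{ ds }}'",
        metastore_conn_id='metastore_default',
        poke_interval=300,
        dag=dag)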
    class airflow.sensors.http_sensor.HttpSensor(endpoint, http_conn_id='http_default', method='GET', request_params=None, headers=None, response_check=None, extra_options=None, *args, **kwargs)[source]
    Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Executes a HTTP GET statement and returns False on failure caused by 404 Not Found or response_check returning False
    HTTP Error codes other than 404 (like 403) or Connection Refused Error would fail the sensor itself directly (no more poking)
    Parameters
    · http_conn_id (str) – The connection to run the sensor against
    · method (str) – The HTTP request method to use
    · endpoint (str) – The relative part of the full url
    · request_params (a dictionary of string key/value pairs) – The parameters to be added to the GET url
    · headers (a dictionary of string key/value pairs) – The HTTP headers to be added to the GET request
    · response_check (A lambda or defined function) – A check against the requests’ response object. Returns True for ‘pass’ and False otherwise
    · extra_options (A dictionary of options, where key is string and value depends on the option that's being modified) – Extra options for the requests library, see the requests documentation (options to modify timeout, ssl, etc)
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
    class airflow.sensors.metastore_partition_sensor.MetastorePartitionSensor(table, partition_name, schema='default', mysql_conn_id='metastore_mysql', *args, **kwargs)[source]
    Bases: airflow.sensors.sql_sensor.SqlSensor
    An alternative to the HivePartitionSensor that talks directly to the MySQL db. This was created as a result of observing sub-optimal queries generated by the Metastore thrift service when hitting subpartitioned tables. The Thrift service’s queries were written in a way that wouldn’t leverage the indexes.
    Parameters
    · schema (str) – the schema
    · table (str) – the table
    · partition_name (str) – the partition name, as defined in the PARTITIONS table of the Metastore. Order of the fields does matter. Examples: ds=2016-01-01 or, for a sub-partitioned table, ds=2016-01-01/sub=foo
    · mysql_conn_id (str) – a reference to the MySQL conn_id for the metastore
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
    class airflow.sensors.named_hive_partition_sensor.NamedHivePartitionSensor(partition_names, metastore_conn_id='metastore_default', poke_interval=180, hook=None, *args, **kwargs)[source]
    Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits for a set of partitions to show up in Hive
    Parameters
    · partition_names (list of strings) – List of fully qualified names of the partitions to wait for. A fully qualified name is of the form schema.table/pk1=pv1/pk2=pv2, for example default.users/ds=2016-01-01. This is passed as is to the metastore Thrift client get_partitions_by_name method. Note that you cannot use logical or comparison operators as in HivePartitionSensor.
    · metastore_conn_id (str) – reference to the metastore thrift service connection id
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
    class airflow.sensors.s3_key_sensor.S3KeySensor(bucket_key, bucket_name=None, wildcard_match=False, aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
    Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits for a key (a file-like instance on S3) to be present in a S3 bucket. S3 being a key/value store, it does not support folders. The path is just a key, a resource.
    Parameters
    · bucket_key (str) – The key being waited on Supports full s3 style url or relative path from root level When it’s specified as a full s3 url please leave bucket_name as None
    · bucket_name (str) – Name of the S3 bucket Only needed when bucket_key is not provided as a full s3 url
    · wildcard_match (bool) – whether the bucket_key should be interpreted as a Unix wildcard pattern
    · aws_conn_id (str) – a reference to the s3 connection
    · verify (bool or str) –
    Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values: False: do not validate SSL certificates. SSL will still be used
    (unless use_ssl is False), but SSL certificates will not be verified.
    o path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
    You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
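    A hedged usage sketch of S3KeySensor; the bucket and key pattern are placeholders, and `dag` is assumed to exist:
    # S3KeySensor sketch: wait for a daily export file to appear
    from airflow.sensors.s3_key_sensor import S3KeySensor

    wait_for_file = S3KeySensor(
        task_id='wait_for_daily_export',
        bucket_name='my-data-lake',
        bucket_key='exports/{{ ds }}/orders_*.csv',
        wildcard_match=True,                 # treat bucket_key as a Unix wildcard pattern
        aws_conn_id='aws_default',
        dag=dag)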
    class airflow.sensors.s3_prefix_sensor.S3PrefixSensor(bucket_name, prefix, delimiter='/', aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
    Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits for a prefix to exist. A prefix is the first part of a key, thus enabling checking of constructs similar to glob airfl* or SQL LIKE 'airfl%'. There is the possibility to precise a delimiter to indicate the hierarchy of keys, meaning that the match will stop at that delimiter. Current code accepts sane delimiters, i.e. characters that are NOT special characters in the Python regex engine.
    Parameters
    · bucket_name (str) – Name of the S3 bucket
    · prefix (str) – The prefix being waited on Relative path from bucket root level
    · delimiter (str) – The delimiter intended to show hierarchy. Defaults to '/'
    · aws_conn_id (str) – a reference to the s3 connection
    · verify (bool or str) –
    Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values: False: do not validate SSL certificates. SSL will still be used
    (unless use_ssl is False), but SSL certificates will not be verified.
    o path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
    You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
    class airflow.sensors.sql_sensor.SqlSensor(conn_id, sql, *args, **kwargs)[source]
    Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Runs a sql statement until a criteria is met. It will keep trying while sql returns no row, or if the first cell is in (0, '0', '').
    Parameters
    · conn_id (str) – The connection to run the sensor against
    · sql – The sql to run. To pass, it needs to return at least one cell that contains a non-zero / non-empty string value.
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
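    A hedged usage sketch of SqlSensor; the connection and query are placeholders, and `dag` is assumed to exist:
    # SqlSensor sketch: poke until today's rows have landed
    from airflow.sensors.sql_sensor import SqlSensor

    wait_for_rows = SqlSensor(
        task_id='wait_for_todays_rows',
        conn_id='mysql_default',
        sql="SELECT COUNT(*) FROM orders WHERE order_date = '{{ ds }}'",
        dag=dag)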
    class airflow.sensors.time_sensor.TimeSensor(target_time, *args, **kwargs)[source]
    Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits until the specified time of the day
    Parameters
    target_time (datetime.time) – time after which the job succeeds
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
    class airflow.sensors.time_delta_sensor.TimeDeltaSensor(delta, *args, **kwargs)[source]
    Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits for a timedelta after the task’s execution_date + schedule_interval. In Airflow, the daily task stamped with execution_date 2016-01-01 can only start running on 2016-01-02. The timedelta here represents the time after the execution period has closed.
    Parameters
    delta (datetime.timedelta) – time length to wait after execution_date before succeeding
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
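    A hedged usage sketch of TimeDeltaSensor; `dag` is assumed to exist:
    # TimeDeltaSensor sketch: wait two hours past the end of the schedule interval
    from datetime import timedelta
    from airflow.sensors.time_delta_sensor import TimeDeltaSensor

    wait_two_hours = TimeDeltaSensor(
        task_id='wait_two_hours',
        delta=timedelta(hours=2),
        dag=dag)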
    class airflow.sensors.web_hdfs_sensor.WebHdfsSensor(filepath, webhdfs_conn_id='webhdfs_default', *args, **kwargs)[source]
    Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits for a file or folder to land in HDFS
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
    Community-contributed Operators
    Operators
    class airflow.contrib.operators.awsbatch_operator.AWSBatchOperator(job_name, job_definition, job_queue, overrides, max_retries=4200, aws_conn_id=None, region_name=None, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Execute a job on AWS Batch Service
    Parameters
    · job_name (str) – the name for the job that will run on AWS Batch (templated)
    · job_definition (str) – the job definition name on AWS Batch
    · job_queue (str) – the queue name on AWS Batch
    · overrides (dict) – the same parameter that boto3 will receive on containerOverrides (templated): http://boto3.readthedocs.io/en/latest/reference/services/batch.html#submit_job
    · max_retries (int) – exponential backoff retries while waiter is not merged; 4200 = 48 hours
    · aws_conn_id (str) – connection id of AWS credentials / region name. If None, the credential boto3 strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.html)
    · region_name (str) – region name to use in AWS Hook Override the region_name in connection (if provided)
    class airflow.contrib.operators.bigquery_check_operator.BigQueryCheckOperator(sql, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
    Bases: airflow.operators.check_operator.CheckOperator
    Performs checks against BigQuery The BigQueryCheckOperator expects a sql query that will return a single row Each value on that first row is evaluated using python bool casting If any of the values return False the check is failed and errors out
    Note that Python bool casting evals the following as False
    · False
    · 0
    · Empty string ('')
    · Empty list ([])
    · Empty dictionary or set ({})
    Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much more complex queries that could, for instance, check that the table has the same number of rows as the source table upstream, or that the count of today’s partition is greater than yesterday’s partition, or that a set of metrics are less than 3 standard deviations from the 7 day average.
    This operator can be used as a data quality check in your pipeline, and depending on where you put it in your DAG, you have the choice to stop the critical path, preventing publication of dubious data, or put it on the side and receive email alerts without stopping the progress of the DAG.
    Parameters
    · sql (str) – the sql to be executed
    · bigquery_conn_id (str) – reference to the BigQuery database
    · use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
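    A hedged sketch of BigQueryCheckOperator used as a data quality gate; the project, dataset, and table names are placeholders, and `dag` is assumed to exist:
    # BigQueryCheckOperator sketch: fail the task if today's partition is empty
    from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator

    check_not_empty = BigQueryCheckOperator(
        task_id='check_partition_not_empty',
        sql="SELECT COUNT(*) FROM `my_project.my_dataset.orders` WHERE ds = '{{ ds }}'",
        use_legacy_sql=False,
        bigquery_conn_id='bigquery_default',
        dag=dag)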
    class airflow.contrib.operators.bigquery_check_operator.BigQueryValueCheckOperator(sql, pass_value, tolerance=None, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
    Bases: airflow.operators.check_operator.ValueCheckOperator
    Performs a simple value check using sql code
    Parameters
    · sql (str) – the sql to be executed
    · use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
    class airflow.contrib.operators.bigquery_check_operator.BigQueryIntervalCheckOperator(table, metrics_thresholds, date_filter_column='ds', days_back=-7, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
    Bases: airflow.operators.check_operator.IntervalCheckOperator
    Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before
    This method constructs a query like so
    SELECT {metrics_threshold_dict_key} FROM {table}
    WHERE {date_filter_column}=<date>
    Parameters
    · table (str) – the table name
    · days_back (int) – number of days between ds and the ds we want to check against Defaults to 7 days
    · metrics_threshold (dict) – a dictionary of ratios indexed by metrics, for example 'COUNT(*)': 1.5 would require a 50 percent or less difference between the current day and the prior days_back.
    · use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
    class airflow.contrib.operators.bigquery_get_data.BigQueryGetDataOperator(dataset_id, table_id, max_results='100', selected_fields=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Fetches the data from a BigQuery table (alternatively fetches data for selected columns) and returns data in a python list. The number of elements in the returned list will be equal to the number of rows fetched. Each element in the list will again be a list, where each element represents the column values for that row.
    Example Result [['Tony' '10'] ['Mike' '20'] ['Steve' '15']]
    Note
    If you pass fields to selected_fields which are in a different order than the order of columns already in the BQ table, the data will still be in the order of the BQ table. For example, if the BQ table has 3 columns as [A, B, C] and you pass 'B,A' in selected_fields, the data would still be of the form 'A,B'.
    Example
    get_data = BigQueryGetDataOperator(
        task_id='get_data_from_bq',
        dataset_id='test_dataset',
        table_id='Transaction_partitions',
        max_results='100',
        selected_fields='DATE',
        bigquery_conn_id='airflow-service-account'
    )
    Parameters
    · dataset_id (str) – The dataset ID of the requested table (templated)
    · table_id (str) – The table ID of the requested table (templated)
    · max_results (str) – The maximum number of records (rows) to be fetched from the table (templated)
    · selected_fields (str) – List of fields to return (commaseparated) If unspecified all fields are returned
    · bigquery_conn_id (str) – reference to a specific BigQuery hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyTableOperator(dataset_id, table_id, project_id=None, schema_fields=None, gcs_schema_object=None, time_partitioning=None, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, labels=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Creates a new empty table in the specified BigQuery dataset optionally with schema
    The schema to be used for the BigQuery table may be specified in one of two ways You may either directly pass the schema fields in or you may point the operator to a Google cloud storage object name The object in Google cloud storage must be a JSON file with the schema fields in it You can also create a table without schema
    Parameters
    · project_id (str) – The project to create the table into (templated)
    · dataset_id (str) – The dataset to create the table into (templated)
    · table_id (str) – The Name of the table to be created (templated)
    · schema_fields (list) –
    If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
    Example:
    schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
                   {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
    · gcs_schema_object (str) – Full path to the JSON file containing the schema (templated). For example: gs://test-bucket/dir1/dir2/employee_schema.json
    · time_partitioning (dict) –
    configure optional time partitioning fields ie partition by field type and expiration as per API specifications
    See also
    https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#timePartitioning
    · bigquery_conn_id (str) – Reference to a specific BigQuery hook
    · google_cloud_storage_conn_id (str) – Reference to a specific Google cloud storage hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · labels (dict) –
    a dictionary containing labels for the table passed to BigQuery
    Example (with schema JSON in GCS)
    CreateTable = BigQueryCreateEmptyTableOperator(
        task_id='BigQueryCreateEmptyTableOperator_task',
        dataset_id='ODS',
        table_id='Employees',
        project_id='internal-gcp-project',
        gcs_schema_object='gs://schema-bucket/employee_schema.json',
        bigquery_conn_id='airflow-service-account',
        google_cloud_storage_conn_id='airflow-service-account'
    )
    Corresponding Schema file (employee_schemajson)
    [
      {
        "mode": "NULLABLE",
        "name": "emp_name",
        "type": "STRING"
      },
      {
        "mode": "REQUIRED",
        "name": "salary",
        "type": "INTEGER"
      }
    ]
    Example (with schema in the DAG)
    CreateTable = BigQueryCreateEmptyTableOperator(
        task_id='BigQueryCreateEmptyTableOperator_task',
        dataset_id='ODS',
        table_id='Employees',
        project_id='internal-gcp-project',
        schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
                       {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}],
        bigquery_conn_id='airflow-service-account',
        google_cloud_storage_conn_id='airflow-service-account'
    )
    class airflow.contrib.operators.bigquery_operator.BigQueryCreateExternalTableOperator(bucket, source_objects, destination_project_dataset_table, schema_fields=None, schema_object=None, source_format='CSV', compression='NONE', skip_leading_rows=0, field_delimiter=',', max_bad_records=0, quote_character=None, allow_quoted_newlines=False, allow_jagged_rows=False, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, src_fmt_configs={}, labels=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Creates a new external table in the dataset with the data in Google Cloud Storage
    The schema to be used for the BigQuery table may be specified in one of two ways You may either directly pass the schema fields in or you may point the operator to a Google cloud storage object name The object in Google cloud storage must be a JSON file with the schema fields in it
    Parameters
    · bucket (str) – The bucket to point the external table to (templated)
    · source_objects (list) – List of Google cloud storage URIs to point the table to (templated). If source_format is 'DATASTORE_BACKUP', the list must only contain a single URI.
    · destination_project_dataset_table (str) – The dotted (<project>.)<dataset>.<table>
    BigQuery table to load data into (templated). If <project> is not included, project will be the project defined in the connection json.
    · schema_fields (list) –
    If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
    Example:
    schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
                   {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
    Should not be set when source_format is 'DATASTORE_BACKUP'.
    · schema_object (str) – If set a GCS object path pointing to a json file that contains the schema for the table (templated)
    · source_format (str) – File format of the data
    · compression (str) – [Optional] The compression type of the data source Possible values include GZIP and NONE The default value is NONE This setting is ignored for Google Cloud Bigtable Google Cloud Datastore backups and Avro formats
    · skip_leading_rows (int) – Number of rows to skip when loading from a CSV
    · field_delimiter (str) – The delimiter to use for the CSV
    · max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job
    · quote_character (str) – The value that is used to quote data sections in a CSV file
    · allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false)
    · allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns The missing values are treated as nulls If false records with missing trailing columns are treated as bad records and if there are too many bad records an invalid error is returned in the job result Only applicable to CSV ignored for other formats
    · bigquery_conn_id (str) – Reference to a specific BigQuery hook
    · google_cloud_storage_conn_id (str) – Reference to a specific Google cloud storage hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · src_fmt_configs (dict) – configure optional fields specific to the source format
    · labels (dict) – a dictionary containing labels for the table, passed to BigQuery
    class airflow.contrib.operators.bigquery_operator.BigQueryDeleteDatasetOperator(dataset_id, project_id=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    This operator deletes an existing dataset from your Project in BigQuery: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/delete
    Parameters
    · project_id (str) – The project id of the dataset
    · dataset_id (str) – The dataset to be deleted
    Example
    delete_temp_data = BigQueryDeleteDatasetOperator(
        dataset_id='temp-dataset',
        project_id='temp-project',
        bigquery_conn_id='_my_gcp_conn_',
        task_id='Deletetemp',
        dag=dag)
    class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyDatasetOperator(dataset_id, project_id=None, dataset_reference=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    This operator is used to create a new dataset for your Project in BigQuery: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
    Parameters
    · project_id (str) – The name of the project where we want to create the dataset Don’t need to provide if projectId in dataset_reference
    · dataset_id (str) – The id of dataset Don’t need to provide if datasetId in dataset_reference
    · dataset_reference – Dataset reference that could be provided with request body More info httpscloudgooglecombigquerydocsreferencerestv2datasets#resource
    class airflow.contrib.operators.bigquery_operator.BigQueryOperator(sql=None, destination_dataset_table=False, write_disposition='WRITE_EMPTY', allow_large_results=False, flatten_results=None, bigquery_conn_id='bigquery_default', delegate_to=None, udf_config=False, use_legacy_sql=True, maximum_billing_tier=None, maximum_bytes_billed=None, create_disposition='CREATE_IF_NEEDED', schema_update_options=(), query_params=None, labels=None, priority='INTERACTIVE', time_partitioning=None, api_resource_configs=None, cluster_fields=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Executes BigQuery SQL queries in a specific BigQuery database
    Parameters
    · sql (Can receive a str representing a sql statement, a list of str (sql statements), or a reference to a template file. Template references are recognized by str ending in '.sql') – the sql code to be executed (templated)
    · destination_dataset_table (str) – A dotted (<project>.|<project>:)<dataset>.<table>
    that, if set, will store the results of the query (templated)
    · write_disposition (str) – Specifies the action that occurs if the destination table already exists (default WRITE_EMPTY’)
    · create_disposition (str) – Specifies whether the job is allowed to create new tables (default CREATE_IF_NEEDED’)
    · allow_large_results (bool) – Whether to allow large results
    · flatten_results (bool) – If true and query uses legacy SQL dialect flattens all nested and repeated fields in the query results allow_large_results must be true if this is set to false For standard SQL queries this flag is ignored and results are never flattened
    · bigquery_conn_id (str) – reference to a specific BigQuery hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · udf_config (list) – The User Defined Function configuration for the query See httpscloudgooglecombigqueryuserdefinedfunctions for details
    · use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
    · maximum_billing_tier (int) – Positive integer that serves as a multiplier of the basic price Defaults to None in which case it uses the value set in the project
    · maximum_bytes_billed (float) – Limits the bytes billed for this job Queries that will have bytes billed beyond this limit will fail (without incurring a charge) If unspecified this will be set to your project default
    · api_resource_configs (dict) – a dictionary that contains params 'configuration' applied for Google BigQuery Jobs API (https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs), for example {'query': {'useQueryCache': False}}. You could use it if you need to provide some params that are not supported by BigQueryOperator, like args.
    · schema_update_options (tuple) – Allows the schema of the destination table to be updated as a side effect of the load job
    · query_params (dict) – a dictionary containing query parameter types and values passed to BigQuery
    · labels (dict) – a dictionary containing labels for the jobquery passed to BigQuery
    · priority (str) – Specifies a priority for the query Possible values include INTERACTIVE and BATCH The default value is INTERACTIVE
    · time_partitioning (dict) – configure optional time partitioning fields ie partition by field type and expiration as per API specifications
    · cluster_fields (list of str) – Request that the result of this query be stored sorted by one or more columns This is only available in conjunction with time_partitioning The order of columns given determines the sort order
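    A hedged usage sketch of BigQueryOperator; project, dataset, and table names are placeholders, and `dag` is assumed to exist:
    # BigQueryOperator sketch: run a daily aggregation into a partitioned destination table
    from airflow.contrib.operators.bigquery_operator import BigQueryOperator

    aggregate_orders = BigQueryOperator(
        task_id='aggregate_daily_orders',
        sql="SELECT ds, SUM(amount) AS total FROM `my_project.ods.orders` "
            "WHERE ds = '{{ ds }}' GROUP BY ds",
        destination_dataset_table='my_project.dwh.daily_order_totals${{ ds_nodash }}',
        write_disposition='WRITE_TRUNCATE',
        use_legacy_sql=False,
        bigquery_conn_id='bigquery_default',
        dag=dag)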
    class airflow.contrib.operators.bigquery_table_delete_operator.BigQueryTableDeleteOperator(deletion_dataset_table, bigquery_conn_id='bigquery_default', delegate_to=None, ignore_if_missing=False, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Deletes BigQuery tables
    Parameters
    · deletion_dataset_table (str) – A dotted (<project>.|<project>:)<dataset>.<table>
    that indicates which table will be deleted (templated)
    · bigquery_conn_id (str) – reference to a specific BigQuery hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · ignore_if_missing (bool) – if True then return success even if the requested table does not exist
    class airflow.contrib.operators.bigquery_to_bigquery.BigQueryToBigQueryOperator(source_project_dataset_tables, destination_project_dataset_table, write_disposition='WRITE_EMPTY', create_disposition='CREATE_IF_NEEDED', bigquery_conn_id='bigquery_default', delegate_to=None, labels=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Copies data from one BigQuery table to another
    See also
    For more details about these parameters httpscloudgooglecombigquerydocsreferencev2jobs#configurationcopy
    Parameters
    · source_project_dataset_tables (list|string) – One or more dotted (<project>.|<project>:)<dataset>.<table>
    BigQuery tables to use as the source data. If <project> is not included, project will be the project defined in the connection json. Use a list if there are multiple source tables. (templated)
    · destination_project_dataset_table (str) – The destination BigQuery table. Format is (<project>.|<project>:)<dataset>.<table>
    (templated)
    · write_disposition (str) – The write disposition if the table already exists
    · create_disposition (str) – The create disposition if the table doesn’t exist
    · bigquery_conn_id (str) – reference to a specific BigQuery hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · labels (dict) – a dictionary containing labels for the jobquery passed to BigQuery
    class airflow.contrib.operators.bigquery_to_gcs.BigQueryToCloudStorageOperator(source_project_dataset_table, destination_cloud_storage_uris, compression='NONE', export_format='CSV', field_delimiter=',', print_header=True, bigquery_conn_id='bigquery_default', delegate_to=None, labels=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Transfers a BigQuery table to a Google Cloud Storage bucket
    See also
    For more details about these parameters httpscloudgooglecombigquerydocsreferencev2jobs
    Parameters
    · source_project_dataset_table (str) – The dotted (<project>.|<project>:)<dataset>.<table>
    BigQuery table to use as the source data. If <project> is not included, project will be the project defined in the connection json (templated)
    · destination_cloud_storage_uris (list) – The destination Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt) (templated). Follows the convention defined here: https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple
    · compression (str) – Type of compression to use
    · export_format (str) – File format to export
    · field_delimiter (str) – The delimiter to use when extracting to a CSV
    · print_header (bool) – Whether to print a header for a CSV file extract
    · bigquery_conn_id (str) – reference to a specific BigQuery hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · labels (dict) – a dictionary containing labels for the jobquery passed to BigQuery
    class airflow.contrib.operators.cassandra_to_gcs.CassandraToGoogleCloudStorageOperator(cql, bucket, filename, schema_filename=None, approx_max_file_size_bytes=1900000000, cassandra_conn_id=u'cassandra_default', google_cloud_storage_conn_id=u'google_cloud_default', delegate_to=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Copy data from Cassandra to Google cloud storage in JSON format
    Note Arrays of arrays are not supported
    classmethod convert_map_type(name value)[source]
    Converts a map to a repeated RECORD that contains two fields, 'key' and 'value'; each will be converted to its corresponding data type in BQ.
    classmethod convert_tuple_type(name value)[source]
    Converts a tuple to a RECORD that contains n fields; each will be converted to its corresponding data type in BQ and will be named 'field_<index>', where index is determined by the order of the tuple elements defined in cassandra.
    classmethod convert_user_type(name value)[source]
    Converts a user type to RECORD that contains n fields where n is the number of attributes Each element in the user type class will be converted to its corresponding data type in BQ
    class airflow.contrib.operators.databricks_operator.DatabricksSubmitRunOperator(json=None, spark_jar_task=None, notebook_task=None, new_cluster=None, existing_cluster_id=None, libraries=None, run_name=None, timeout_seconds=None, databricks_conn_id='databricks_default', polling_period_seconds=30, databricks_retry_limit=3, databricks_retry_delay=1, do_xcom_push=False, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Submits a Spark job run to Databricks using the api/2.0/jobs/runs/submit API endpoint
    There are two ways to instantiate this operator
    In the first way, you can take the JSON payload that you typically use to call the api/2.0/jobs/runs/submit endpoint and pass it directly to our DatabricksSubmitRunOperator through the json parameter. For example:
    json = {
        'new_cluster': {
            'spark_version': '2.1.0-db3-scala2.11',
            'num_workers': 2
        },
        'notebook_task': {
            'notebook_path': '/Users/airflow@example.com/PrepareData'
        }
    }
    notebook_run = DatabricksSubmitRunOperator(task_id='notebook_run', json=json)
    Another way to accomplish the same thing is to use the named parameters of the DatabricksSubmitRunOperator directly. Note that there is exactly one named parameter for each top level parameter in the runs/submit endpoint. In this method, your code would look like this:
    new_cluster = {
        'spark_version': '2.1.0-db3-scala2.11',
        'num_workers': 2
    }
    notebook_task = {
        'notebook_path': '/Users/airflow@example.com/PrepareData'
    }
    notebook_run = DatabricksSubmitRunOperator(
        task_id='notebook_run',
        new_cluster=new_cluster,
        notebook_task=notebook_task)
    In the case where both the json parameter AND the named parameters are provided they will be merged together If there are conflicts during the merge the named parameters will take precedence and override the top level json keys
    Currently the named parameters that DatabricksSubmitRunOperator supports are
    · spark_jar_task
    · notebook_task
    · new_cluster
    · existing_cluster_id
    · libraries
    · run_name
    · timeout_seconds
    Parameters
    · json (dict) –
    A JSON object containing API parameters which will be passed directly to the api/2.0/jobs/runs/submit endpoint. The other named parameters (i.e. spark_jar_task, notebook_task) to this operator will be merged with this json dictionary if they are provided. If there are conflicts during the merge, the named parameters will take precedence and override the top level json keys. (templated)
    See also
    For more information about templating see Jinja Templating. https://docs.databricks.com/api/latest/jobs.html#runs-submit
    · spark_jar_task (dict) –
    The main class and parameters for the JAR task Note that the actual JAR is specified in the libraries EITHER spark_jar_task OR notebook_task should be specified This field will be templated
    See also
    https://docs.databricks.com/api/latest/jobs.html#jobssparkjartask
    · notebook_task (dict) –
    The notebook path and parameters for the notebook task EITHER spark_jar_task OR notebook_task should be specified This field will be templated
    See also
    https://docs.databricks.com/api/latest/jobs.html#jobsnotebooktask
    · new_cluster (dict) –
    Specs for a new cluster on which this task will be run EITHER new_cluster OR existing_cluster_id should be specified This field will be templated
    See also
    https://docs.databricks.com/api/latest/jobs.html#jobsclusterspecnewcluster
    · existing_cluster_id (str) – ID for existing cluster on which to run this task EITHER new_cluster OR existing_cluster_id should be specified This field will be templated
    · libraries (list of dicts) –
    Libraries which this run will use This field will be templated
    See also
    https://docs.databricks.com/api/latest/libraries.html#managedlibrarieslibrary
    · run_name (str) – The run name used for this task By default this will be set to the Airflow task_id This task_id is a required parameter of the superclass BaseOperator This field will be templated
    · timeout_seconds (int32) – The timeout for this run By default a value of 0 is used which means to have no timeout This field will be templated
    · databricks_conn_id (str) – The name of the Airflow connection to use By default and in the common case this will be databricks_default To use token based authentication provide the key token in the extra field for the connection
    · polling_period_seconds (int) – Controls the rate which we poll for the result of this run By default the operator will poll every 30 seconds
    · databricks_retry_limit (int) – Amount of times retry if the Databricks backend is unreachable Its value must be greater than or equal to 1
    · databricks_retry_delay (float) – Number of seconds to wait between retries (it might be a floating point number)
    · do_xcom_push (bool) – Whether we should push run_id and run_page_url to xcom
    class airflow.contrib.operators.dataflow_operator.DataFlowJavaOperator(jar, job_name='{{task.task_id}}', dataflow_default_options=None, options=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, job_class=None, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Start a Java Cloud DataFlow batch job The parameters of the operation will be passed to the job
    See also
    For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/specifying-exec-params
    Parameters
    · jar (str) – The reference to a self executing DataFlow jar (templated)
    · job_name (str) – The jobName’ to use when executing the DataFlow job (templated) This ends up being set in the pipeline options so any entry with key 'jobName' in options will be overwritten
    · dataflow_default_options (dict) – Map of default job options
    · options (dict) – Map of job specific options
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state
    · job_class (str) – The name of the dataflow job class to be executed; it is often not the main class configured in the dataflow jar file
    jar options and job_name are templated so you can use variables in them
    Note that both dataflow_default_options and options will be merged to specify pipeline execution parameter and dataflow_default_options is expected to save highlevel options for instances project and zone information which apply to all dataflow operators in the DAG
    It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project zone and staging location
    default_args = {
        'dataflow_default_options': {
            'project': 'my-gcp-project',
            'zone': 'europe-west1-d',
            'stagingLocation': 'gs://my-staging-bucket/staging/'
        }
    }
    You need to pass the path to your dataflow as a file reference with the jar parameter; the jar needs to be a self executing jar (see documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use options to pass on options to your job.
    t1 = DataFlowJavaOperator(
        task_id='datapflow_example',
        jar='{{var.value.gcp_dataflow_base}}pipeline/build/libs/pipeline-example-1.0.jar',
        options={
            'autoscalingAlgorithm': 'BASIC',
            'maxNumWorkers': '50',
            'start': '{{ds}}',
            'partitionType': 'DAY',
            'labels': {'foo': 'bar'}
        },
        gcp_conn_id='gcp-airflow-service-account',
        dag=my_dag)
    class airflow.contrib.operators.dataflow_operator.DataflowTemplateOperator(template, job_name='{{task.task_id}}', dataflow_default_options=None, parameters=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, *args, **kwargs)[source]
    Bases: airflow.models.BaseOperator
    Start a Templated Cloud DataFlow batch job The parameters of the operation will be passed to the job
    Parameters
    · template (str) – The reference to the DataFlow template
    · job_name – The jobName’ to use when executing the DataFlow template (templated)
    · dataflow_default_options (dict) – Map of default job environment options
    · parameters (dict) – Map of job specific parameters for the template
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state
It's a good practice to define dataflow_* parameters in the default_args of the dag, like the project, zone and staging location.
See also
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
default_args = {
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'europe-west1-d',
        'tempLocation': 'gs://my-staging-bucket/staging/'
    }
}
You need to pass the path to your dataflow template as a file reference with the template parameter. Use parameters to pass on parameters to your job. Use environment to pass on runtime environment variables to your job.
t1 = DataflowTemplateOperator(
    task_id='dataflow_example',
    template='{{var.value.gcp_dataflow_base}}',
    parameters={
        'inputFile': 'gs://bucket/input/my_input.txt',
        'outputFile': 'gs://bucket/output/my_output.txt'
    },
    gcp_conn_id='gcp-airflow-service-account',
    dag=my_dag)
template, dataflow_default_options, parameters, and job_name are templated, so you can use variables in them.
Note that dataflow_default_options is expected to save high-level options for project information, which apply to all dataflow operators in the DAG.
See also
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
For more detail on job template execution have a look at the reference: https://cloud.google.com/dataflow/docs/templates/executing-templates
class airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator(py_file, job_name='{{task.task_id}}', py_options=None, dataflow_default_options=None, options=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Launching Cloud Dataflow jobs written in Python. Note that both dataflow_default_options and options will be merged to specify pipeline execution parameters, and dataflow_default_options is expected to save high-level options, for instance project and zone information, which apply to all dataflow operators in the DAG.
    See also
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/specifying-exec-params
    Parameters
· py_file (str) – Reference to the Python dataflow pipeline file (.py), e.g. /some/local/file/path/to/your/python/pipeline/file.py
· job_name (str) – The 'job_name' to use when executing the DataFlow job (templated). This ends up being set in the pipeline options, so any entry with key 'jobName' or 'job_name' in options will be overwritten.
    · py_options – Additional python options
    · dataflow_default_options (dict) – Map of default job options
    · options (dict) – Map of job specific options
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state
    execute(context)[source]
    Execute the python dataflow job
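The reference above gives no usage snippet for the Python operator. A minimal sketch follows; the DAG name, local pipeline path, project, and bucket are all placeholder assumptions, not values from the original documentation:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

# Placeholder DAG; schedule and start date are illustrative only.
dag = DAG('dataflow_python_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Assumed pipeline file and bucket names; options are forwarded to the pipeline.
run_pipeline = DataFlowPythonOperator(
    task_id='run_python_pipeline',
    py_file='/home/airflow/dags/pipelines/wordcount.py',
    py_options=[],
    dataflow_default_options={
        'project': 'my-gcp-project',
        'temp_location': 'gs://my-bucket/tmp/'
    },
    options={'output': 'gs://my-bucket/output/'},
    dag=dag)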
class airflow.contrib.operators.dataproc_operator.DataprocClusterCreateOperator(cluster_name, project_id, num_workers, zone, network_uri=None, subnetwork_uri=None, internal_ip_only=None, tags=None, storage_bucket=None, init_actions_uris=None, init_action_timeout='10m', metadata=None, custom_image=None, image_version=None, properties=None, master_machine_type='n1-standard-4', master_disk_type='pd-standard', master_disk_size=500, worker_machine_type='n1-standard-4', worker_disk_type='pd-standard', worker_disk_size=500, num_preemptible_workers=0, labels=None, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, service_account=None, service_account_scopes=None, idle_delete_ttl=None, auto_delete_time=None, auto_delete_ttl=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Create a new cluster on Google Cloud Dataproc. The operator will wait until the creation is successful or an error occurs in the creation process.
The parameters allow to configure the cluster. Please refer to
https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters
for a detailed explanation on the different parameters. Most of the configuration parameters detailed in the link are available as a parameter to this operator.
    Parameters
    · cluster_name (str) – The name of the DataProc cluster to create (templated)
    · project_id (str) – The ID of the google cloud project in which to create the cluster (templated)
    · num_workers (int) – The # of workers to spin up If set to zero will spin up cluster in a single node mode
    · storage_bucket (str) – The storage bucket to use setting to None lets dataproc generate a custom one for you
    · init_actions_uris (list[string]) – List of GCS uri’s containing dataproc initialization scripts
    · init_action_timeout (str) – Amount of time executable scripts in init_actions_uris has to complete
    · metadata (dict) – dict of keyvalue google compute engine metadata entries to add to all instances
    · image_version (str) – the version of software inside the Dataproc cluster
· custom_image – custom Dataproc image; for more info see https://cloud.google.com/dataproc/docs/guides/dataproc-images
· properties (dict) – dict of properties to set on config files (e.g. spark-defaults.conf), see https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#SoftwareConfig
    · master_machine_type (str) – Compute engine machine type to use for the master node
· master_disk_type (str) – Type of the boot disk for the master node (default is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or pd-standard (Persistent Disk Hard Disk Drive)
    · master_disk_size (int) – Disk size for the master node
    · worker_machine_type (str) – Compute engine machine type to use for the worker nodes
· worker_disk_type (str) – Type of the boot disk for the worker node (default is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or pd-standard (Persistent Disk Hard Disk Drive)
    · worker_disk_size (int) – Disk size for the worker nodes
    · num_preemptible_workers (int) – The # of preemptible worker nodes to spin up
    · labels (dict) – dict of labels to add to the cluster
    · zone (str) – The zone where the cluster will be located (templated)
    · network_uri (str) – The network uri to be used for machine communication cannot be specified with subnetwork_uri
    · subnetwork_uri (str) – The subnetwork uri to be used for machine communication cannot be specified with network_uri
    · internal_ip_only (bool) – If true all instances in the cluster will only have internal IP addresses This can only be enabled for subnetwork enabled networks
    · tags (list[string]) – The GCE tags to add to all instances
· region (str) – leave as 'global'; might become relevant in the future (templated)
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · service_account (str) – The service account of the dataproc instances
    · service_account_scopes (list[string]) – The URIs of service account scopes to be included
    · idle_delete_ttl (int) – The longest duration that cluster would keep alive while staying idle Passing this threshold will cause cluster to be autodeleted A duration in seconds
    · auto_delete_time (datetimedatetime) – The time when cluster will be autodeleted
    · auto_delete_ttl (int) – The life duration of cluster the cluster will be autodeleted at the end of this duration A duration in seconds (If auto_delete_time is set this parameter will be ignored)
Type
custom_image: str
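A minimal cluster-creation sketch is shown below; the cluster, project, zone, and bucket names are placeholder assumptions:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

dag = DAG('dataproc_create_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Creates a small two-worker cluster; all names are illustrative.
create_cluster = DataprocClusterCreateOperator(
    task_id='create_dataproc_cluster',
    cluster_name='analytics-cluster',
    project_id='my-gcp-project',
    num_workers=2,
    zone='europe-west1-d',
    master_machine_type='n1-standard-4',
    worker_machine_type='n1-standard-4',
    storage_bucket='my-dataproc-staging-bucket',
    dag=dag)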
class airflow.contrib.operators.dataproc_operator.DataprocClusterScaleOperator(cluster_name, project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, num_workers=2, num_preemptible_workers=0, graceful_decommission_timeout=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Scale, up or down, a cluster on Google Cloud Dataproc. The operator will wait until the cluster is re-scaled.
    Example
t1 = DataprocClusterScaleOperator(
    task_id='dataproc_scale',
    project_id='my-project',
    cluster_name='cluster-1',
    num_workers=10,
    num_preemptible_workers=10,
    graceful_decommission_timeout='1h',
    dag=dag)
    See also
For more detail on scaling clusters have a look at the reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters
    Parameters
    · cluster_name (str) – The name of the cluster to scale (templated)
    · project_id (str) – The ID of the google cloud project in which the cluster runs (templated)
    · region (str) – The region for the dataproc cluster (templated)
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · num_workers (int) – The new number of workers
    · num_preemptible_workers (int) – The new number of preemptible workers
· graceful_decommission_timeout (str) – Timeout for graceful YARN decommissioning. Maximum value is 1d
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
class airflow.contrib.operators.dataproc_operator.DataprocClusterDeleteOperator(cluster_name, project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Delete a cluster on Google Cloud Dataproc. The operator will wait until the cluster is destroyed.
    Parameters
· cluster_name (str) – The name of the cluster to delete (templated)
    · project_id (str) – The ID of the google cloud project in which the cluster runs (templated)
· region (str) – leave as 'global'; might become relevant in the future (templated)
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
class airflow.contrib.operators.dataproc_operator.DataProcPigOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_pig_properties=None, dataproc_pig_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Pig query Job on a Cloud DataProc cluster. The parameters of the operation will be passed to the cluster.
It's a good practice to define dataproc_* parameters in the default_args of the dag, like the cluster name and UDFs.
default_args = {
    'cluster_name': 'cluster-1',
    'dataproc_pig_jars': [
        'gs://example/udf/jar/datafu/1.2.0/datafu.jar',
        'gs://example/udf/jar/gpig/1.2/gpig.jar'
    ]
}
You can pass a pig script as string or file reference. Use variables to pass on variables for the pig script to be resolved on the cluster, or use the parameters to be resolved in the script as template parameters.
    Example
t1 = DataProcPigOperator(
    task_id='dataproc_pig',
    query='a_pig_script.pig',
    variables={'out': 'gs://example/output/{{ds}}'},
    dag=dag)
    See also
For more detail on job submission have a look at the reference: https://cloud.google.com/dataproc/reference/rest/v1/projects.regions.jobs
    Parameters
· query (str) – The query or reference to the query file (.pg or .pig extension) (templated)
    · query_uri (str) – The uri of a pig script on Cloud Storage
    · variables (dict) – Map of named parameters for the query (templated)
    · job_name (str) – The job name used in the DataProc cluster This name by default is the task_id appended with the execution data but can be templated The name will always be appended with a random number to avoid name clashes (templated)
    · cluster_name (str) – The name of the DataProc cluster (templated)
    · dataproc_pig_properties (dict) – Map for the Pig properties Ideal to put in default arguments
    · dataproc_pig_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
    · job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
    Variables
    dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
class airflow.contrib.operators.dataproc_operator.DataProcHiveOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_hive_properties=None, dataproc_hive_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Hive query Job on a Cloud DataProc cluster.
    Parameters
· query (str) – The query or reference to the query file (.q extension)
    · query_uri (str) – The uri of a hive script on Cloud Storage
    · variables (dict) – Map of named parameters for the query
    · job_name (str) – The job name used in the DataProc cluster This name by default is the task_id appended with the execution data but can be templated The name will always be appended with a random number to avoid name clashes
    · cluster_name (str) – The name of the DataProc cluster
· dataproc_hive_properties (dict) – Map for the Hive properties. Ideal to put in default arguments
    · dataproc_hive_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
    · job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
    Variables
    dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
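A minimal Hive-query sketch follows; the DAG name, inline HiveQL, and cluster name are placeholder assumptions:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcHiveOperator

dag = DAG('dataproc_hive_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Runs an inline HiveQL statement on an existing Dataproc cluster.
hive_count = DataProcHiveOperator(
    task_id='hive_row_count',
    query="SELECT COUNT(*) FROM default.t1;",
    cluster_name='analytics-cluster',
    region='global',
    dag=dag)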
class airflow.contrib.operators.dataproc_operator.DataProcSparkSqlOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_spark_properties=None, dataproc_spark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Spark SQL query Job on a Cloud DataProc cluster.
    Parameters
· query (str) – The query or reference to the query file (.q extension) (templated)
    · query_uri (str) – The uri of a spark sql script on Cloud Storage
    · variables (dict) – Map of named parameters for the query (templated)
    · job_name (str) – The job name used in the DataProc cluster This name by default is the task_id appended with the execution data but can be templated The name will always be appended with a random number to avoid name clashes (templated)
    · cluster_name (str) – The name of the DataProc cluster (templated)
· dataproc_spark_properties (dict) – Map for the Spark SQL properties. Ideal to put in default arguments
    · dataproc_spark_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
    · job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
    Variables
    dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
class airflow.contrib.operators.dataproc_operator.DataProcSparkOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_spark_properties=None, dataproc_spark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Spark Job on a Cloud DataProc cluster.
    Parameters
    · main_jar (str) – URI of the job jar provisioned on Cloud Storage (use this or the main_class not both together)
    · main_class (str) – Name of the job class (use this or the main_jar not both together)
    · arguments (list) – Arguments for the job (templated)
    · archives (list) – List of archived files that will be unpacked in the work directory Should be stored in Cloud Storage
    · files (list) – List of files to be copied to the working directory
    · job_name (str) – The job name used in the DataProc cluster This name by default is the task_id appended with the execution data but can be templated The name will always be appended with a random number to avoid name clashes (templated)
    · cluster_name (str) – The name of the DataProc cluster (templated)
· dataproc_spark_properties (dict) – Map for the Spark properties. Ideal to put in default arguments
    · dataproc_spark_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
    · job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
    Variables
    dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
class airflow.contrib.operators.dataproc_operator.DataProcHadoopOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_hadoop_properties=None, dataproc_hadoop_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Hadoop Job on a Cloud DataProc cluster.
    Parameters
    · main_jar (str) – URI of the job jar provisioned on Cloud Storage (use this or the main_class not both together)
    · main_class (str) – Name of the job class (use this or the main_jar not both together)
    · arguments (list) – Arguments for the job (templated)
    · archives (list) – List of archived files that will be unpacked in the work directory Should be stored in Cloud Storage
    · files (list) – List of files to be copied to the working directory
    · job_name (str) – The job name used in the DataProc cluster This name by default is the task_id appended with the execution data but can be templated The name will always be appended with a random number to avoid name clashes (templated)
    · cluster_name (str) – The name of the DataProc cluster (templated)
· dataproc_hadoop_properties (dict) – Map for the Hadoop properties. Ideal to put in default arguments
    · dataproc_hadoop_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
    · job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
    Variables
    dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
class airflow.contrib.operators.dataproc_operator.DataProcPySparkOperator(main, arguments=None, archives=None, pyfiles=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_pyspark_properties=None, dataproc_pyspark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a PySpark Job on a Cloud DataProc cluster.
    Parameters
· main (str) – [Required] The Hadoop Compatible Filesystem (HCFS) URI of the main Python file to use as the driver. Must be a .py file.
    · arguments (list) – Arguments for the job (templated)
    · archives (list) – List of archived files that will be unpacked in the work directory Should be stored in Cloud Storage
    · files (list) – List of files to be copied to the working directory
· pyfiles (list) – List of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip
    · job_name (str) – The job name used in the DataProc cluster This name by default is the task_id appended with the execution data but can be templated The name will always be appended with a random number to avoid name clashes (templated)
    · cluster_name (str) – The name of the DataProc cluster
· dataproc_pyspark_properties (dict) – Map for the PySpark properties. Ideal to put in default arguments
    · dataproc_pyspark_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · region (str) – The specified region where the dataproc cluster is created
    · job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
    Variables
    dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
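A minimal PySpark-job sketch follows; the GCS driver path, cluster name, and DAG are placeholder assumptions:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

dag = DAG('dataproc_pyspark_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# The driver file must already exist in GCS; bucket and cluster names are illustrative.
pyspark_job = DataProcPySparkOperator(
    task_id='run_pyspark_job',
    main='gs://my-bucket/jobs/etl_job.py',
    arguments=['--run-date', '{{ ds }}'],
    cluster_name='analytics-cluster',
    region='global',
    dag=dag)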
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator(project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateOperator(template_id, *args, **kwargs)[source]
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate is finished executing.
See also
Please refer to https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiate
    Parameters
    · template_id (str) – The id of the template (templated)
    · project_id (str) – The ID of the google cloud project in which the template runs
    · region (str) – leave as global’ might become relevant in the future
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateInlineOperator(template, *args, **kwargs)[source]
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate Inline on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate is finished executing.
See also
Please refer to https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiateInline
    Parameters
    · template (map) – The template contents (templated)
    · project_id (str) – The ID of the google cloud project in which the template runs
    · region (str) – leave as global’ might become relevant in the future
    · gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
class airflow.contrib.operators.datastore_export_operator.DatastoreExportOperator(bucket, namespace=None, datastore_conn_id='google_cloud_default', cloud_storage_conn_id='google_cloud_default', delegate_to=None, entity_filter=None, labels=None, polling_interval_in_seconds=10, overwrite_existing=False, xcom_push=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Export entities from Google Cloud Datastore to Cloud Storage
    Parameters
    · bucket (str) – name of the cloud storage bucket to backup data
    · namespace (str) – optional namespace path in the specified Cloud Storage bucket to backup data If this namespace does not exist in GCS it will be created
    · datastore_conn_id (str) – the name of the Datastore connection id to use
    · cloud_storage_conn_id (str) – the name of the cloud storage connection id to forcewrite backup
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· entity_filter (dict) – description of what data from the project is included in the export; refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/EntityFilter
    · labels (dict) – clientassigned labels for cloud storage
    · polling_interval_in_seconds (int) – number of seconds to wait before polling for execution status again
    · overwrite_existing (bool) – if the storage bucket + namespace is not empty it will be emptied prior to exports This enables overwriting existing backups
    · xcom_push (bool) – push operation name to xcom for reference
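A minimal export sketch follows; the bucket name, namespace pattern, and DAG are placeholder assumptions:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.datastore_export_operator import DatastoreExportOperator

dag = DAG('datastore_export_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Exports Datastore entities to a GCS bucket; overwrite_existing empties the target prefix first.
export_entities = DatastoreExportOperator(
    task_id='export_datastore_to_gcs',
    bucket='my-datastore-backups',
    namespace='daily/{{ ds_nodash }}',
    overwrite_existing=True,
    dag=dag)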
class airflow.contrib.operators.datastore_import_operator.DatastoreImportOperator(bucket, file, namespace=None, entity_filter=None, labels=None, datastore_conn_id='google_cloud_default', delegate_to=None, polling_interval_in_seconds=10, xcom_push=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Import entities from Cloud Storage to Google Cloud Datastore
    Parameters
    · bucket (str) – container in Cloud Storage to store data
    · file (str) – path of the backup metadata file in the specified Cloud Storage bucket It should have the extension overall_export_metadata
    · namespace (str) – optional namespace of the backup metadata file in the specified Cloud Storage bucket
· entity_filter (dict) – description of what data from the project is included in the export; refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/EntityFilter
    · labels (dict) – clientassigned labels for cloud storage
    · datastore_conn_id (str) – the name of the connection id to use
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · polling_interval_in_seconds (int) – number of seconds to wait before polling for execution status again
    · xcom_push (bool) – push operation name to xcom for reference
class airflow.contrib.operators.discord_webhook_operator.DiscordWebhookOperator(http_conn_id=None, webhook_endpoint=None, message='', username=None, avatar_url=None, tts=False, proxy=None, *args, **kwargs)[source]
Bases: airflow.operators.http_operator.SimpleHttpOperator
This operator allows you to post messages to Discord using incoming webhooks. Takes a Discord connection ID with a default relative webhook endpoint. The default endpoint can be overridden using the webhook_endpoint parameter (https://discordapp.com/developers/docs/resources/webhook).
    Each Discord webhook can be preconfigured to use a specific username and avatar_url You can override these defaults in this operator
    Parameters
· http_conn_id (str) – Http connection ID with host as "https://discord.com/api/" and default webhook endpoint in the extra field in the form of {"webhook_endpoint": "webhooks/{webhook.id}/{webhook.token}"}
· webhook_endpoint (str) – Discord webhook endpoint in the form of "webhooks/{webhook.id}/{webhook.token}"
    · message (str) – The message you want to send to your Discord channel (max 2000 characters) (templated)
    · username (str) – Override the default username of the webhook (templated)
    · avatar_url (str) – Override the default avatar of the webhook
    · tts (bool) – Is a texttospeech message
    · proxy (str) – Proxy to use to make the Discord webhook call
    execute(context)[source]
    Call the DiscordWebhookHook to post message
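A minimal notification sketch follows; the connection ID 'discord_default', the message, and the DAG are placeholder assumptions:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.discord_webhook_operator import DiscordWebhookOperator

dag = DAG('discord_notify_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# 'discord_default' is an assumed Airflow connection holding the webhook endpoint.
notify = DiscordWebhookOperator(
    task_id='notify_channel',
    http_conn_id='discord_default',
    message='Pipeline finished for {{ ds }}',
    username='airflow-bot',
    dag=dag)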
class airflow.contrib.operators.druid_operator.DruidOperator(json_index_file, druid_ingest_conn_id='druid_ingest_default', max_ingestion_time=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Allows to submit a task directly to Druid
    Parameters
    · json_index_file (str) – The filepath to the druid index specification
    · druid_ingest_conn_id (str) – The connection id of the Druid overlord which accepts index jobs
class airflow.contrib.operators.ecs_operator.ECSOperator(task_definition, cluster, overrides, aws_conn_id=None, region_name=None, launch_type='EC2', group=None, placement_constraints=None, platform_version='LATEST', network_configuration=None, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute a task on AWS EC2 Container Service
    Parameters
    · task_definition (str) – the task definition name on EC2 Container Service
    · cluster (str) – the cluster name on EC2 Container Service
· overrides (dict) – the same parameter that boto3 will receive (templated): http://boto3.readthedocs.org/en/latest/reference/services/ecs.html#ECS.Client.run_task
· aws_conn_id (str) – connection id of AWS credentials / region name. If None, the credential boto3 strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.html).
· region_name (str) – region name to use in AWS Hook. Overrides the region_name in connection (if provided)
· launch_type (str) – the launch type on which to run your task ('EC2' or 'FARGATE')
    · group (str) – the name of the task group associated with the task
    · placement_constraints (list) – an array of placement constraint objects to use for the task
    · platform_version (str) – the platform version on which your task is running
    · network_configuration (dict) – the network configuration for the task
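A minimal ECS task sketch follows; the task definition, cluster, connection, and DAG names are placeholder assumptions:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.ecs_operator import ECSOperator

dag = DAG('ecs_task_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Runs a registered task definition on an existing ECS cluster; names are illustrative.
run_ecs_task = ECSOperator(
    task_id='run_batch_container',
    task_definition='my-batch-task',
    cluster='my-ecs-cluster',
    overrides={'containerOverrides': []},
    aws_conn_id='aws_default',
    region_name='us-east-1',
    launch_type='EC2',
    dag=dag)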
class airflow.contrib.operators.emr_add_steps_operator.EmrAddStepsOperator(job_flow_id, aws_conn_id='s3_default', steps=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
An operator that adds steps to an existing EMR job_flow.
    Parameters
    · job_flow_id (str) – id of the JobFlow to add steps to (templated)
· aws_conn_id (str) – aws connection to use
    · steps (list) – boto3 style steps to be added to the jobflow (templated)
class airflow.contrib.operators.emr_create_job_flow_operator.EmrCreateJobFlowOperator(aws_conn_id='s3_default', emr_conn_id='emr_default', job_flow_overrides=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates an EMR JobFlow, reading the config from the EMR connection. A dictionary of JobFlow overrides can be passed that override the config from the connection.
    Parameters
· aws_conn_id (str) – aws connection to use
    · emr_conn_id (str) – emr connection to use
    · job_flow_overrides – boto3 style arguments to override emr_connection extra (templated)
class airflow.contrib.operators.emr_terminate_job_flow_operator.EmrTerminateJobFlowOperator(job_flow_id, aws_conn_id='s3_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator to terminate EMR JobFlows
    Parameters
    · job_flow_id (str) – id of the JobFlow to terminate (templated)
· aws_conn_id (str) – aws connection to use
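A sketch chaining the three EMR operators follows; the connections, step definition, and DAG are placeholder assumptions:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator

dag = DAG('emr_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Creates a job flow from the 'emr_default' connection config.
create_flow = EmrCreateJobFlowOperator(
    task_id='create_job_flow',
    aws_conn_id='aws_default',
    emr_conn_id='emr_default',
    dag=dag)

# The created job flow id is read back from XCom; the step itself is illustrative.
add_steps = EmrAddStepsOperator(
    task_id='add_steps',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=[{
        'Name': 'example-step',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {'Jar': 'command-runner.jar', 'Args': ['echo', 'hello']}
    }],
    dag=dag)

terminate_flow = EmrTerminateJobFlowOperator(
    task_id='terminate_job_flow',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    dag=dag)

create_flow >> add_steps >> terminate_flow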
class airflow.contrib.operators.file_to_gcs.FileToGoogleCloudStorageOperator(src, dst, bucket, google_cloud_storage_conn_id='google_cloud_default', mime_type='application/octet-stream', delegate_to=None, gzip=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Uploads a file to Google Cloud Storage. Optionally can compress the file for upload.
    Parameters
    · src (str) – Path to the local file (templated)
    · dst (str) – Destination path within the specified bucket (templated)
    · bucket (str) – The bucket to upload to (templated)
    · google_cloud_storage_conn_id (str) – The Airflow connection ID to upload with
    · mime_type (str) – The mimetype string
    · delegate_to (str) – The account to impersonate if any
    · gzip (bool) – Allows for file to be compressed and uploaded as gzip
    execute(context)[source]
    Uploads the file to Google cloud storage
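A minimal upload sketch follows; the local path, bucket, and DAG names are placeholder assumptions:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.file_to_gcs import FileToGoogleCloudStorageOperator

dag = DAG('file_to_gcs_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Uploads a locally produced report to GCS; gzip=True would compress before upload.
upload_report = FileToGoogleCloudStorageOperator(
    task_id='upload_daily_report',
    src='/tmp/report_{{ ds_nodash }}.csv',
    dst='reports/{{ ds_nodash }}/report.csv',
    bucket='my-report-bucket',
    mime_type='text/csv',
    dag=dag)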
class airflow.contrib.operators.gcs_download_operator.GoogleCloudStorageDownloadOperator(bucket, object, filename=None, store_to_xcom_key=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Downloads a file from Google Cloud Storage.
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is (templated)
    · object (str) – The name of the object to download in the Google cloud storage bucket (templated)
    · filename (str) – The file path on the local file system (where the operator is being executed) that the file should be downloaded to (templated) If no filename passed the downloaded data will not be stored on the local file system
    · store_to_xcom_key (str) – If this param is set the operator will push the contents of the downloaded file to XCom with the key set in this parameter If not set the downloaded data will not be pushed to XCom (templated)
    · google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
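A minimal download sketch follows; the bucket, object path, and DAG names are placeholder assumptions:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.gcs_download_operator import GoogleCloudStorageDownloadOperator

dag = DAG('gcs_download_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Downloads a GCS object to the local filesystem of the worker.
download_config = GoogleCloudStorageDownloadOperator(
    task_id='download_config_file',
    bucket='my-config-bucket',
    object='configs/settings.json',
    filename='/tmp/settings.json',
    dag=dag)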
class airflow.contrib.operators.gcs_list_operator.GoogleCloudStorageListOperator(bucket, prefix=None, delimiter=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix and delimiter in name.
This operator returns a python list with the names of objects, which can be used by XCom in the downstream task.
    Parameters
    · bucket (str) – The Google cloud storage bucket to find the objects (templated)
    · prefix (str) – Prefix string which filters objects whose name begin with this prefix (templated)
· delimiter (str) – The delimiter by which you want to filter the objects (templated). For example, to list the CSV files in a directory in GCS you would use delimiter='.csv'.
    · google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    Example
The following Operator would list all the Avro files from the sales/sales-2017 folder in the data bucket:
GCS_Files = GoogleCloudStorageListOperator(
    task_id='GCS_Files',
    bucket='data',
    prefix='sales/sales-2017/',
    delimiter='.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
class airflow.contrib.operators.gcs_operator.GoogleCloudStorageCreateBucketOperator(bucket_name, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can't create a bucket with a name that is already in use.
See also
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
    Parameters
    · bucket_name (str) – The name of the bucket (templated)
    · storage_class (str) –
    This defines how objects in the bucket are stored and determines the SLA and the cost of storage (templated) Values include
    o MULTI_REGIONAL
    o REGIONAL
    o STANDARD
    o NEARLINE
    o COLDLINE
    If this value is not specified when the bucket is created it will default to STANDARD
    · location (str) –
    The location of the bucket (templated) Object data for objects in the bucket resides in physical storage within this region Defaults to US
    See also
https://developers.google.com/storage/docs/bucket-locations
    · project_id (str) – The ID of the GCP Project (templated)
    · labels (dict) – Userprovided labels in keyvalue pairs
    · google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    Example
The following Operator would create a new bucket test-bucket with MULTI_REGIONAL storage class in the EU region:
CreateBucket = GoogleCloudStorageCreateBucketOperator(
    task_id='CreateNewBucket',
    bucket_name='test-bucket',
    storage_class='MULTI_REGIONAL',
    location='EU',
    labels={'env': 'dev', 'team': 'airflow'},
    google_cloud_storage_conn_id='airflow-service-account'
)
class airflow.contrib.operators.gcs_to_bq.GoogleCloudStorageToBigQueryOperator(bucket, source_objects, destination_project_dataset_table, schema_fields=None, schema_object=None, source_format='CSV', compression='NONE', create_disposition='CREATE_IF_NEEDED', skip_leading_rows=0, write_disposition='WRITE_EMPTY', field_delimiter=',', max_bad_records=0, quote_character=None, ignore_unknown_values=False, allow_quoted_newlines=False, allow_jagged_rows=False, max_id_key=None, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, schema_update_options=(), src_fmt_configs=None, external_table=False, time_partitioning=None, cluster_fields=None, autodetect=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Loads files from Google cloud storage into BigQuery.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google cloud storage must be a JSON file with the schema fields in it.
    Parameters
    · bucket (str) – The bucket to load from (templated)
· source_objects (list of str) – List of Google cloud storage URIs to load from (templated). If source_format is 'DATASTORE_BACKUP', the list must only contain a single URI.
· destination_project_dataset_table (str) – The dotted (<project>.)<dataset>.<table> BigQuery table to load data into. If <project> is not included, the project defined in the connection json will be used (templated).
· schema_fields (list) – If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load Should not be set when source_format is 'DATASTORE_BACKUP'.
    · schema_object (str) – If set a GCS object path pointing to a json file that contains the schema for the table (templated)
    · source_format (str) – File format to export
    · compression (str) – [Optional] The compression type of the data source Possible values include GZIP and NONE The default value is NONE This setting is ignored for Google Cloud Bigtable Google Cloud Datastore backups and Avro formats
    · create_disposition (str) – The create disposition if the table doesn’t exist
    · skip_leading_rows (int) – Number of rows to skip when loading from a CSV
    · write_disposition (str) – The write disposition if the table already exists
    · field_delimiter (str) – The delimiter to use when loading from a CSV
    · max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job
    · quote_character (str) – The value that is used to quote data sections in a CSV file
    · ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow extra values that are not represented in the table schema If true the extra values are ignored If false records with extra columns are treated as bad records and if there are too many bad records an invalid error is returned in the job result
    · allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false)
    · allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns The missing values are treated as nulls If false records with missing trailing columns are treated as bad records and if there are too many bad records an invalid error is returned in the job result Only applicable to CSV ignored for other formats
    · max_id_key (str) – If set the name of a column in the BigQuery table that’s to be loaded This will be used to select the MAX value from BigQuery after the load occurs The results will be returned by the execute() command which in turn gets stored in XCom for future operators to use This can be helpful with incremental loads–during future executions you can pick up from the max ID
    · bigquery_conn_id (str) – Reference to a specific BigQuery hook
    · google_cloud_storage_conn_id (str) – Reference to a specific Google cloud storage hook
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · schema_update_options (list) – Allows the schema of the destination table to be updated as a side effect of the load job
    · src_fmt_configs (dict) – configure optional fields specific to the source format
    · external_table (bool) – Flag to specify if the destination table should be a BigQuery external table Default Value is False
    · time_partitioning (dict) – configure optional time partitioning fields ie partition by field type and expiration as per API specifications Note that field’ is not available in concurrency with datasettablepartition
    · cluster_fields (list of str) – Request that the result of this load be stored sorted by one or more columns This is only available in conjunction with time_partitioning The order of columns given determines the sort order Not applicable for external tables
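A minimal GCS-to-BigQuery load sketch follows; the bucket, object pattern, target table, and inline schema are placeholder assumptions:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

dag = DAG('gcs_to_bq_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Loads daily CSV partitions into a BigQuery table, replacing its contents each run.
load_sales = GoogleCloudStorageToBigQueryOperator(
    task_id='load_sales_to_bq',
    bucket='my-data-bucket',
    source_objects=['sales/{{ ds_nodash }}/*.csv'],
    destination_project_dataset_table='my-gcp-project.analytics.sales',
    schema_fields=[
        {'name': 'order_id', 'type': 'STRING', 'mode': 'REQUIRED'},
        {'name': 'amount', 'type': 'FLOAT', 'mode': 'NULLABLE'},
    ],
    source_format='CSV',
    skip_leading_rows=1,
    write_disposition='WRITE_TRUNCATE',
    dag=dag)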
class airflow.contrib.operators.gcs_to_gcs.GoogleCloudStorageToGoogleCloudStorageOperator(source_bucket, source_object, destination_bucket=None, destination_object=None, move_object=False, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, last_modified_time=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copies objects from a bucket to another, with renaming if requested.
    Parameters
    · source_bucket (str) – The source Google cloud storage bucket where the object is (templated)
    · source_object (str) – The source name of the object to copy in the Google cloud storage bucket (templated) You can use only one wildcard for objects (filenames) within your bucket The wildcard can appear inside the object name or at the end of the object name Appending a wildcard to the bucket name is unsupported
    · destination_bucket (str) – The destination Google cloud storage bucket where the object should be (templated)
· destination_object (str) – The destination name of the object in the destination Google cloud storage bucket (templated). If a wildcard is supplied in the source_object argument, this is the prefix that will be prepended to the final destination objects' paths. Note that the source path's part before the wildcard will be removed; if it needs to be retained it should be appended to destination_object. For example, with prefix foo/* and destination_object 'blah/', the file foo/baz will be copied to blah/baz; to retain the prefix, write the destination_object as e.g. blah/foo, in which case the copied file will be named blah/foo/baz.
    · move_object (bool) – When move object is True the object is moved instead of copied to the new location This is the equivalent of a mv command as opposed to a cp command
    · google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· last_modified_time (datetime) – When specified, the object(s) will be copied or moved only if they were modified after last_modified_time. If tzinfo has not been set, UTC will be assumed.
    Examples
The following Operator would copy a single file named sales/sales-2017/january.avro in the data bucket to the file named copied_sales/2017/january-backup.avro in the data_backup bucket:
copy_single_file = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_single_file',
    source_bucket='data',
    source_object='sales/sales-2017/january.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/january-backup.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
The following Operator would copy all the Avro files from the sales/sales-2017 folder (i.e. with names starting with that prefix) in the data bucket to the copied_sales/2017 folder in the data_backup bucket:
copy_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
The following Operator would move all the Avro files from the sales/sales-2017 folder (i.e. with names starting with that prefix) in the data bucket to the same folder in the data_backup bucket, deleting the original files in the process:
move_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='move_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    move_object=True,
    google_cloud_storage_conn_id=google_cloud_conn_id
)
class airflow.contrib.operators.gcs_to_s3.GoogleCloudStorageToS3Operator(bucket, prefix=None, delimiter=None, google_cloud_storage_conn_id='google_cloud_storage_default', delegate_to=None, dest_aws_conn_id=None, dest_s3_key=None, dest_verify=None, replace=False, *args, **kwargs)[source]
Bases: airflow.contrib.operators.gcs_list_operator.GoogleCloudStorageListOperator
Synchronizes a Google Cloud Storage bucket with an S3 bucket.
    Parameters
    · bucket (str) – The Google Cloud Storage bucket to find the objects (templated)
    · prefix (str) – Prefix string which filters objects whose name begin with this prefix (templated)
· delimiter (str) – The delimiter by which you want to filter the objects (templated). For example, to list the CSV files in a directory in GCS you would use delimiter='.csv'.
    · google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google Cloud Storage
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · dest_aws_conn_id (str) – The destination S3 connection
    · dest_s3_key (str) – The base S3 key to be used to store the files (templated)
Param dest_verify:
Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
class airflow.contrib.operators.hipchat_operator.HipChatAPIOperator(token, base_url='https://api.hipchat.com/v2', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Base HipChat Operator. All derived HipChat operators reference HipChat's official REST API documentation at https://www.hipchat.com/docs/apiv2. Before using any HipChat API operators you need to get an authentication token at https://www.hipchat.com/docs/apiv2/auth. In the future additional HipChat operators will be derived from this class as well.
    Parameters
    · token (str) – HipChat REST API authentication token
    · base_url (str) – HipChat REST API base url
prepare_request()[source]
Used by the execute function. Sets the request method, url, and body of HipChat's REST API call. Override in child class: each HipChatAPI child operator is responsible for having a prepare_request method which sets self.method, self.url, and self.body.
class airflow.contrib.operators.hipchat_operator.HipChatAPISendRoomNotificationOperator(room_id, message, message_format='html', color='yellow', frm='airflow', attach_to=None, notify=False, card=None, *args, **kwargs)[source]
Bases: airflow.contrib.operators.hipchat_operator.HipChatAPIOperator
Send a notification to a specific HipChat room. More info: https://www.hipchat.com/docs/apiv2/method/send_room_notification
    Parameters
    · room_id (str) – Room in which to send notification on HipChat (templated)
    · message (str) – The message body (templated)
    · frm (str) – Label to be shown in addition to sender’s name
    · message_format (str) – How the notification is rendered html or text
    · color (str) – Background color of the msg yellow green red purple gray or random
    · attach_to (str) – The message id to attach this notification to
    · notify (bool) – Whether this message should trigger a user notification
    · card (dict) – HipChatdefined card object
prepare_request()[source]
Used by the execute function. Sets the request method, url, and body of HipChat's REST API call. Override in child class: each HipChatAPI child operator is responsible for having a prepare_request method which sets self.method, self.url, and self.body.
class airflow.contrib.operators.hive_to_dynamodb.HiveToDynamoDBTransferOperator(sql, table_name, table_keys, pre_process=None, pre_process_args=None, pre_process_kwargs=None, region_name=None, schema='default', hiveserver2_conn_id='hiveserver2_default', aws_conn_id='aws_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from Hive to DynamoDB. Note that for now the data is loaded into memory before being pushed to DynamoDB, so this operator should be used for smallish amounts of data.
    Parameters
    · sql (str) – SQL query to execute against the hive database (templated)
    · table_name (str) – target DynamoDB table
    · table_keys (list) – partition key and sort key
    · pre_process (function) – implement preprocessing of source data
    · pre_process_args (list) – list of pre_process function arguments
    · pre_process_kwargs (dict) – dict of pre_process function arguments
· region_name (str) – aws region name (example: us-east-1)
    · schema (str) – hive database schema
    · hiveserver2_conn_id (str) – source hive connection
    · aws_conn_id (str) – aws connection
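A minimal transfer sketch follows; the Hive table, DynamoDB table, keys, and connections are placeholder assumptions:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.hive_to_dynamodb import HiveToDynamoDBTransferOperator

dag = DAG('hive_to_dynamodb_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Keep result sets small: the rows are held in memory before being written to DynamoDB.
hive_to_dynamo = HiveToDynamoDBTransferOperator(
    task_id='hive_to_dynamodb',
    sql="SELECT user_id, score FROM default.user_scores WHERE ds = '{{ ds }}'",
    table_name='user_scores',
    table_keys=['user_id'],
    region_name='us-east-1',
    hiveserver2_conn_id='hiveserver2_default',
    aws_conn_id='aws_default',
    dag=dag)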
class airflow.contrib.operators.mlengine_operator.MLEngineBatchPredictionOperator(project_id, job_id, region, data_format, input_paths, output_path, model_name=None, version_name=None, uri=None, max_worker_count=None, runtime_version=None, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Google Cloud ML Engine prediction job.
NOTE: For model origin, users should consider exactly one of the three options below: 1. Populate the 'uri' field only, which should be a GCS location that points to a tensorflow SavedModel directory. 2. Populate the 'model_name' field only, which refers to an existing model; the default version of the model will be used. 3. Populate both 'model_name' and 'version_name' fields, which refers to a specific version of a specific model.
In options 2 and 3, both model and version name should contain the minimal identifier. For instance, call
MLEngineBatchPredictionOperator(
    ...,
    model_name='my_model',
    version_name='my_version',
    ...)
if the desired model version is projects/my_project/models/my_model/versions/my_version.
See https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs for further documentation on the parameters.
    Parameters
    · project_id (str) – The Google Cloud project name where the prediction job is submitted (templated)
    · job_id (str) – A unique id for the prediction job on Google Cloud ML Engine (templated)
· data_format (str) – The format of the input data. It will default to 'DATA_FORMAT_UNSPECIFIED' if it is not provided or is not one of ["TEXT", "TF_RECORD", "TF_RECORD_GZIP"].
· input_paths (list of string) – A list of GCS paths of input data for batch prediction. Accepts the wildcard operator *, but only at the end (templated).
    · output_path (str) – The GCS path where the prediction results are written to (templated)
    · region (str) – The Google Compute Engine region to run the prediction job in (templated)
    · model_name (str) – The Google Cloud ML Engine model to use for prediction If version_name is not provided the default version of this model will be used Should not be None if version_name is provided Should be None if uri is provided (templated)
    · version_name (str) – The Google Cloud ML Engine model version to use for prediction Should be None if uri is provided (templated)
    · uri (str) – The GCS path of the saved model to use for prediction Should be None if model_name is provided It should be a GCS path pointing to a tensorflow SavedModel (templated)
    · max_worker_count (int) – The maximum number of workers to be used for parallel processing Defaults to 10 if not specified
    · runtime_version (str) – The Google Cloud ML Engine runtime version to use for batch prediction
    · gcp_conn_id (str) – The connection ID used for connection to Google Cloud Platform
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
    Raises
ValueError: if a unique model/version origin cannot be determined
class airflow.contrib.operators.mlengine_operator.MLEngineModelOperator(project_id, model, operation='create', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Operator for managing a Google Cloud ML Engine model
    Parameters
    · project_id (str) – The Google Cloud project name to which MLEngine model belongs (templated)
    · model (dict) –
    A dictionary containing the information about the model If the operation is create then the model parameter should contain all the information about this model such as name
    If the operation is get the model parameter should contain the name of the model
    · operation (str) –
    The operation to perform Available operations are
    o create Creates a new model as provided by the model parameter
    o get Gets a particular model where the name is specified in model
    · gcp_conn_id (str) – The connection ID to use when fetching connection info
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
class airflow.contrib.operators.mlengine_operator.MLEngineVersionOperator(project_id, model_name, version_name=None, version=None, operation='create', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Operator for managing a Google Cloud ML Engine version
    Parameters
    · project_id (str) – The Google Cloud project name to which MLEngine model belongs
    · model_name (str) – The name of the Google Cloud ML Engine model that the version belongs to (templated)
    · version_name (str) – A name to use for the version being operated upon If not None and the version argument is None or does not have a value for the name key then this will be populated in the payload for the name key (templated)
    · version (dict) – A dictionary containing the information about the version If the operation is create version should contain all the information about this version such as name and deploymentUrl If the operation is get or delete the version parameter should contain the name of the version If it is None the only operation possible would be list (templated)
    · operation (str) –
    The operation to perform Available operations are
    o create Creates a new version in the model specified by model_name in which case the version parameter should contain all the information to create that version (eg name deploymentUrl)
    o get Gets full information of a particular version in the model specified by model_name The name of the version should be specified in the version parameter
    o list Lists all available versions of the model specified by model_name
    o delete Deletes the version specified in version parameter from the model specified by model_name) The name of the version should be specified in the version parameter
    · gcp_conn_id (str) – The connection ID to use when fetching connection info
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
class airflow.contrib.operators.mlengine_operator.MLEngineTrainingOperator(project_id, job_id, package_uris, training_python_module, training_args, region, scale_tier=None, runtime_version=None, python_version=None, job_dir=None, gcp_conn_id='google_cloud_default', delegate_to=None, mode='PRODUCTION', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    Operator for launching a MLEngine training job
    Parameters
    · project_id (str) – The Google Cloud project name within which MLEngine training job should run (templated)
    · job_id (str) – A unique templated id for the submitted Google MLEngine training job (templated)
    · package_uris (str) – A list of package locations for MLEngine training job which should include the main training program + any additional dependencies (templated)
    · training_python_module (str) – The Python module name to run within MLEngine training job after installing package_uris’ packages (templated)
    · training_args (str) – A list of templated command line arguments to pass to the MLEngine training program (templated)
    · region (str) – The Google Compute Engine region to run the MLEngine training job in (templated)
    · scale_tier (str) – Resource tier for MLEngine training job (templated)
    · runtime_version (str) – The Google Cloud ML runtime version to use for training (templated)
    · python_version (str) – The version of Python used in training (templated)
    · job_dir (str) – A Google Cloud Storage path in which to store training outputs and other data needed for training (templated)
    · gcp_conn_id (str) – The connection ID to use when fetching connection info
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· mode (str) – Can be one of 'DRY_RUN'/'CLOUD'. In 'DRY_RUN' mode no real training job will be launched, but the MLEngine training job request will be printed out. In 'CLOUD' mode a real MLEngine training job creation request will be issued.
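For orientation, a minimal sketch of submitting a training job with this operator; the project, bucket, package and module names are placeholders, and a working google_cloud_default connection is assumed:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.mlengine_operator import MLEngineTrainingOperator

with DAG('mlengine_training_demo', start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    train = MLEngineTrainingOperator(
        task_id='submit_training',
        project_id='my-gcp-project',                        # placeholder project
        job_id='train_{{ ds_nodash }}',                     # templated, unique per run
        package_uris=['gs://my-bucket/trainer-0.1.tar.gz'], # placeholder trainer package
        training_python_module='trainer.task',
        training_args=['--epochs', '10'],
        region='us-central1',
        scale_tier='BASIC',
        runtime_version='1.13',
        python_version='3.5',
        job_dir='gs://my-bucket/jobs/{{ ds_nodash }}',
        gcp_conn_id='google_cloud_default',
        mode='DRY_RUN')   # only print the request instead of launching a real job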
class airflow.contrib.operators.mongo_to_s3.MongoToS3Operator(mongo_conn_id, s3_conn_id, mongo_collection, mongo_query, s3_bucket, s3_key, mongo_db=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Mongo -> S3
A more specific baseOperator meant to move data from mongo via pymongo to s3 via boto.
Things to note:
execute() is written to depend on transform(). transform() is meant to be extended by child classes to perform transformations unique to those operators' needs.
    execute(context)[source]
    Executed by task_instance at runtime
    static transform(docs)[source]
    Processes pyMongo cursor and returns an iterable with each element being
    a JSON serializable dictionary
    Base transform() assumes no processing is needed ie docs is a pyMongo cursor of documents and cursor just needs to be passed through
    Override this method for custom transformations
class airflow.contrib.operators.mysql_to_gcs.MySqlToGoogleCloudStorageOperator(sql, bucket, filename, schema_filename=None, approx_max_file_size_bytes=1900000000, mysql_conn_id='mysql_default', google_cloud_storage_conn_id='google_cloud_default', schema=None, delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copy data from MySQL to Google Cloud Storage in JSON format.
    classmethod type_map(mysql_type)[source]
    Helper function that maps from MySQL fields to BigQuery fields Used when a schema_filename is set
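A short sketch of how this operator is typically wired up, assuming a mysql_default and a google_cloud_default connection already exist; the table and bucket names are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator

with DAG('mysql_to_gcs_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    # Dump one day of rows as JSON into GCS; also emit a BigQuery schema file.
    export = MySqlToGoogleCloudStorageOperator(
        task_id='export_orders',
        sql="SELECT * FROM orders WHERE order_date = '{{ ds }}'",
        bucket='my-export-bucket',                   # placeholder bucket
        filename='orders/{{ ds }}/part-{}.json',     # {} is filled with a chunk number when files are split
        schema_filename='orders/{{ ds }}/schema.json',
        mysql_conn_id='mysql_default',
        google_cloud_storage_conn_id='google_cloud_default')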
class airflow.contrib.operators.postgres_to_gcs_operator.PostgresToGoogleCloudStorageOperator(sql, bucket, filename, schema_filename=None, approx_max_file_size_bytes=1900000000, postgres_conn_id='postgres_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, parameters=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copy data from Postgres to Google Cloud Storage in JSON format.
    classmethod convert_types(value)[source]
Takes a value from Postgres and converts it to a value that's safe for JSON/Google Cloud Storage/BigQuery. Dates are converted to UTC seconds. Decimals are converted to floats. Times are converted to seconds.
    classmethod type_map(postgres_type)[source]
    Helper function that maps from Postgres fields to BigQuery fields Used when a schema_filename is set
class airflow.contrib.operators.pubsub_operator.PubSubTopicCreateOperator(project, topic, fail_if_exists=False, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Create a PubSub topic.
By default, if the topic already exists, this operator will not cause the DAG to fail.
with DAG('successful DAG') as dag:
    (
        dag
        >> PubSubTopicCreateOperator(project='myproject',
                                     topic='my_new_topic')
        >> PubSubTopicCreateOperator(project='myproject',
                                     topic='my_new_topic')
    )
The operator can be configured to fail if the topic already exists.
with DAG('failing DAG') as dag:
    (
        dag
        >> PubSubTopicCreateOperator(project='myproject',
                                     topic='my_new_topic')
        >> PubSubTopicCreateOperator(project='myproject',
                                     topic='my_new_topic',
                                     fail_if_exists=True)
    )
Both project and topic are templated, so you can use variables in them.
class airflow.contrib.operators.pubsub_operator.PubSubTopicDeleteOperator(project, topic, fail_if_not_exists=False, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Delete a PubSub topic.
By default, if the topic does not exist, this operator will not cause the DAG to fail.
with DAG('successful DAG') as dag:
    (
        dag
        >> PubSubTopicDeleteOperator(project='myproject',
                                     topic='non_existing_topic')
    )
The operator can be configured to fail if the topic does not exist.
with DAG('failing DAG') as dag:
    (
        dag
        >> PubSubTopicDeleteOperator(project='myproject',
                                     topic='non_existing_topic',
                                     fail_if_not_exists=True)
    )
Both project and topic are templated, so you can use variables in them.
class airflow.contrib.operators.pubsub_operator.PubSubSubscriptionCreateOperator(topic_project, topic, subscription=None, subscription_project=None, ack_deadline_secs=10, fail_if_exists=False, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Create a PubSub subscription.
By default, the subscription will be created in topic_project. If subscription_project is specified and the GCP credentials allow, the Subscription can be created in a different project from its topic.
By default, if the subscription already exists, this operator will not cause the DAG to fail. However, the topic must exist in the project.
with DAG('successful DAG') as dag:
    (
        dag
        >> PubSubSubscriptionCreateOperator(
            topic_project='myproject', topic='mytopic',
            subscription='mysubscription')
        >> PubSubSubscriptionCreateOperator(
            topic_project='myproject', topic='mytopic',
            subscription='mysubscription')
    )
The operator can be configured to fail if the subscription already exists.
with DAG('failing DAG') as dag:
    (
        dag
        >> PubSubSubscriptionCreateOperator(
            topic_project='myproject', topic='mytopic',
            subscription='mysubscription')
        >> PubSubSubscriptionCreateOperator(
            topic_project='myproject', topic='mytopic',
            subscription='mysubscription', fail_if_exists=True)
    )
Finally, subscription is not required. If not passed, the operator will generate a universally unique identifier for the subscription's name.
with DAG('DAG') as dag:
    (
        dag >> PubSubSubscriptionCreateOperator(
            topic_project='myproject', topic='mytopic')
    )
topic_project, topic, subscription, and subscription_project are templated, so you can use variables in them.
class airflow.contrib.operators.pubsub_operator.PubSubSubscriptionDeleteOperator(project, subscription, fail_if_not_exists=False, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Delete a PubSub subscription.
By default, if the subscription does not exist, this operator will not cause the DAG to fail.
with DAG('successful DAG') as dag:
    (
        dag
        >> PubSubSubscriptionDeleteOperator(project='myproject',
                                            subscription='nonexisting')
    )
The operator can be configured to fail if the subscription does not exist.
with DAG('failing DAG') as dag:
    (
        dag
        >> PubSubSubscriptionDeleteOperator(
            project='myproject', subscription='nonexisting',
            fail_if_not_exists=True)
    )
project and subscription are templated, so you can use variables in them.
class airflow.contrib.operators.pubsub_operator.PubSubPublishOperator(project, topic, messages, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Publish messages to a PubSub topic.
Each Task publishes all provided messages to the same topic in a single GCP project. If the topic does not exist, this task will fail.
from base64 import b64encode as b64e

m1 = {'data': b64e('Hello World'),
      'attributes': {'type': 'greeting'}
      }
m2 = {'data': b64e('Knock knock')}
m3 = {'attributes': {'foo': ''}}

t1 = PubSubPublishOperator(
    project='myproject', topic='my_topic',
    messages=[m1, m2, m3],
    create_topic=True,
    dag=dag)

``project``, ``topic``, and ``messages`` are templated so you can use
variables in them
class airflow.contrib.operators.s3_copy_object_operator.S3CopyObjectOperator(source_bucket_key, dest_bucket_key, source_bucket_name=None, dest_bucket_name=None, source_version_id=None, aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a copy of an object that is already stored in S3.
Note: the S3 connection used here needs to have access to both source and destination bucket/key.
    Parameters
    · source_bucket_key (str) –
    The key of the source object
    It can be either full s3 style url or relative path from root level
    When it’s specified as a full s3 url please omit source_bucket_name
    · dest_bucket_key (str) –
    The key of the object to copy to
    The convention to specify dest_bucket_key is the same as source_bucket_key
    · source_bucket_name (str) –
    Name of the S3 bucket where the source object is in
    It should be omitted when source_bucket_key is provided as a full s3 url
    · dest_bucket_name (str) –
    Name of the S3 bucket to where the object is copied
    It should be omitted when dest_bucket_key is provided as a full s3 url
    · source_version_id (str) – Version ID of the source object (OPTIONAL)
    · aws_conn_id (str) – Connection id of the S3 connection to use
    · verify (bool or str) –
    Whether or not to verify SSL certificates for S3 connection By default SSL certificates are verified
    You can provide the following values
o False: do not validate SSL certificates. SSL will still be used,
but SSL certificates will not be verified.
o path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
class airflow.contrib.operators.s3_delete_objects_operator.S3DeleteObjectsOperator(bucket, keys, aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
    To enable users to delete single object or multiple objects from a bucket using a single HTTP request
    Users may specify up to 1000 keys to delete
    Parameters
    · bucket (str) – Name of the bucket in which you are going to delete object(s)
    · keys (str or list) –
    The key(s) to delete from S3 bucket
    When keys is a string it’s supposed to be the key name of the single object to delete
    When keys is a list it’s supposed to be the list of the keys to delete
    You may specify up to 1000 keys
    · aws_conn_id (str) – Connection id of the S3 connection to use
    · verify (bool or str) –
    Whether or not to verify SSL certificates for S3 connection By default SSL certificates are verified
    You can provide the following values
o False: do not validate SSL certificates. SSL will still be used,
but SSL certificates will not be verified.
o path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
class airflow.contrib.operators.s3_list_operator.S3ListOperator(bucket, prefix='', delimiter='', aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix in name.
    This operator returns a python list with the name of objects which can be used by xcom in the downstream task
    Parameters
    · bucket (str) – The S3 bucket where to find the objects (templated)
    · prefix (str) – Prefix string to filters the objects whose name begin with such prefix (templated)
    · delimiter (str) – the delimiter marks key hierarchy (templated)
    · aws_conn_id (str) – The connection ID to use when connecting to S3 storage
Param verify:
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values: False: do not validate SSL certificates. SSL will still be used
(unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
Example:
The following operator would list all the files (excluding subfolders) from the S3 customers/2018/04/ key in the data bucket.
s3_file = S3ListOperator(
    task_id='list_3s_files',
    bucket='data',
    prefix='customers/2018/04/',
    delimiter='/',
    aws_conn_id='aws_customers_conn'
)
class airflow.contrib.operators.s3_to_gcs_operator.S3ToGoogleCloudStorageOperator(bucket, prefix='', delimiter='', aws_conn_id='aws_default', verify=None, dest_gcs_conn_id=None, dest_gcs=None, delegate_to=None, replace=False, *args, **kwargs)[source]
Bases: airflow.contrib.operators.s3_list_operator.S3ListOperator
Synchronizes an S3 key, possibly a prefix, with a Google Cloud Storage destination path.
    Parameters
    · bucket (str) – The S3 bucket where to find the objects (templated)
    · prefix (str) – Prefix string which filters objects whose name begin with such prefix (templated)
    · delimiter (str) – the delimiter marks key hierarchy (templated)
    · aws_conn_id (str) – The source S3 connection
    · dest_gcs_conn_id (str) – The destination connection ID to use when connecting to Google Cloud Storage
    · dest_gcs (str) – The destination Google Cloud Storage bucket and prefix where you want to store the files (templated)
    · delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
    · replace (bool) – Whether you want to replace existing destination files or not
Param verify:
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values: False: do not validate SSL certificates. SSL will still be used
(unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
Example:
s3_to_gcs_op = S3ToGoogleCloudStorageOperator(
    task_id='s3_to_gcs_example',
    bucket='my-s3-bucket',
    prefix='data/customers/201804',
    dest_gcs_conn_id='google_cloud_default',
    dest_gcs='gs://my.gcs.bucket/some/customers/',
    replace=False,
    dag=my_dag)
Note that bucket, prefix, delimiter and dest_gcs are templated, so you can use variables in them if you wish.
class airflow.contrib.operators.sftp_operator.SFTPOperator(ssh_hook=None, ssh_conn_id=None, remote_host=None, local_filepath=None, remote_filepath=None, operation='put', confirm=True, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
SFTPOperator for transferring files from a remote host to local or vice versa. This operator uses ssh_hook to open an sftp transport channel that serves as the basis for file transfer.
    Parameters
    · ssh_hook (SSHHook) – predefined ssh_hook to use for remote execution Either ssh_hook or ssh_conn_id needs to be provided
· ssh_conn_id (str) – connection id from airflow Connections. ssh_conn_id will be ignored if ssh_hook is provided.
    · remote_host (str) – remote host to connect (templated) Nullable If provided it will replace the remote_host which was defined in ssh_hook or predefined in the connection of ssh_conn_id
    · local_filepath (str) – local file path to get or put (templated)
    · remote_filepath (str) – remote file path to get or put (templated)
· operation – specify operation 'get' or 'put', defaults to 'put'
    · confirm (bool) – specify if the SFTP operation should be confirmed defaults to True
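A minimal sketch of a put-style transfer with this operator; the connection id and file paths are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.sftp_operator import SFTPOperator

with DAG('sftp_put_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    # Upload a locally generated file to the host defined by the SSH/SFTP connection.
    upload = SFTPOperator(
        task_id='upload_report',
        ssh_conn_id='sftp_server',                            # placeholder connection id
        local_filepath='/tmp/report_{{ ds_nodash }}.csv',
        remote_filepath='/data/incoming/report_{{ ds_nodash }}.csv',
        operation='put',
        confirm=True)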
class airflow.contrib.operators.slack_webhook_operator.SlackWebhookOperator(http_conn_id=None, webhook_token=None, message='', channel=None, username=None, icon_emoji=None, link_names=False, proxy=None, *args, **kwargs)[source]
Bases: airflow.operators.http_operator.SimpleHttpOperator
This operator allows you to post messages to Slack using incoming webhooks. It takes either a Slack webhook token directly or a connection that has a Slack webhook token. If both are supplied, the Slack webhook token will be used.
Each Slack webhook token can be pre-configured to use a specific channel, username and icon. You can override these defaults in this hook.
    Parameters
    · http_conn_id (str) – connection that has Slack webhook token in the extra field
    · webhook_token (str) – Slack webhook token
    · message (str) – The message you want to send on Slack
    · channel (str) – The channel the message should be posted to
    · username (str) – The username to post to slack with
    · icon_emoji (str) – The emoji to use as icon for the user posting to Slack
    · link_names (bool) – Whether or not to find and link channel and usernames in your message
    · proxy (str) – Proxy to use to make the Slack webhook call
    execute(context)[source]
    Call the SlackWebhookHook to post the provided Slack message
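A minimal sketch of posting a message through a webhook connection; the connection id and channel are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.slack_webhook_operator import SlackWebhookOperator

with DAG('slack_notify_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    notify = SlackWebhookOperator(
        task_id='notify_channel',
        http_conn_id='slack_webhook',        # connection whose extra field holds the webhook token
        message='Daily load finished for {{ ds }}',
        channel='#data-alerts',              # placeholder channel
        username='airflow-bot')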
class airflow.contrib.operators.spark_jdbc_operator.SparkJDBCOperator(spark_app_name='airflow-spark-jdbc', spark_conn_id='spark-default', spark_conf=None, spark_py_files=None, spark_files=None, spark_jars=None, num_executors=None, executor_cores=None, executor_memory=None, driver_memory=None, verbose=False, keytab=None, principal=None, cmd_type='spark_to_jdbc', jdbc_table=None, jdbc_conn_id='jdbc-default', jdbc_driver=None, metastore_table=None, jdbc_truncate=False, save_mode=None, save_format=None, batch_size=None, fetch_size=None, num_partitions=None, partition_column=None, lower_bound=None, upper_bound=None, create_table_column_types=None, *args, **kwargs)[source]
Bases: airflow.contrib.operators.spark_submit_operator.SparkSubmitOperator
This operator extends the SparkSubmitOperator specifically for performing data transfers to/from JDBC-based databases with Apache Spark. As with the SparkSubmitOperator, it assumes that the spark-submit binary is available on the PATH.
    Parameters
    · spark_app_name (str) – Name of the job (default airflowsparkjdbc)
    · spark_conn_id (str) – Connection id as configured in Airflow administration
    · spark_conf (dict) – Any additional Spark configuration properties
    · spark_py_files (str) – Additional python files used (zip egg or py)
    · spark_files (str) – Additional files to upload to the container running the job
    · spark_jars (str) – Additional jars to upload and add to the driver and executor classpath
    · num_executors (int) – number of executor to run This should be set so as to manage the number of connections made with the JDBC database
    · executor_cores (int) – Number of cores per executor
    · executor_memory (str) – Memory per executor (eg 1000M 2G)
    · driver_memory (str) – Memory allocated to the driver (eg 1000M 2G)
    · verbose (bool) – Whether to pass the verbose flag to sparksubmit for debugging
    · keytab (str) – Full path to the file that contains the keytab
    · principal (str) – The name of the kerberos principal used for keytab
· cmd_type (str) – Which way the data should flow. 2 possible values: spark_to_jdbc (data written by spark from metastore to jdbc) or jdbc_to_spark (data written by spark from jdbc to metastore).
    · jdbc_table (str) – The name of the JDBC table
    · jdbc_conn_id – Connection id used for connection to JDBC database
    · jdbc_driver (str) – Name of the JDBC driver to use for the JDBC connection This driver (usually a jar) should be passed in the jars’ parameter
    · metastore_table (str) – The name of the metastore table
    · jdbc_truncate (bool) – (spark_to_jdbc only) Whether or not Spark should truncate or drop and recreate the JDBC table This only takes effect if save_mode’ is set to Overwrite Also if the schema is different Spark cannot truncate and will drop and recreate
    · save_mode (str) – The Spark savemode to use (eg overwrite append etc)
    · save_format (str) – (jdbc_to_sparkonly) The Spark saveformat to use (eg parquet)
    · batch_size (int) – (spark_to_jdbc only) The size of the batch to insert per round trip to the JDBC database Defaults to 1000
    · fetch_size (int) – (jdbc_to_spark only) The size of the batch to fetch per round trip from the JDBC database Default depends on the JDBC driver
    · num_partitions (int) – The maximum number of partitions that can be used by Spark simultaneously both for spark_to_jdbc and jdbc_to_spark operations This will also cap the number of JDBC connections that can be opened
    · partition_column (str) – (jdbc_to_sparkonly) A numeric column to be used to partition the metastore table by If specified you must also specify num_partitions lower_bound upper_bound
    · lower_bound (int) – (jdbc_to_sparkonly) Lower bound of the range of the numeric partition column to fetch If specified you must also specify num_partitions partition_column upper_bound
    · upper_bound (int) – (jdbc_to_sparkonly) Upper bound of the range of the numeric partition column to fetch If specified you must also specify num_partitions partition_column lower_bound
    · create_table_column_types – (spark_to_jdbconly) The database column data types to use instead of the defaults when creating the table Data type information should be specified in the same format as CREATE TABLE columns syntax (eg name CHAR(64) comments VARCHAR(1024)) The specified types should be valid spark sql data types
    Type
    jdbc_conn_id str
    execute(context)[source]
    Call the SparkSubmitHook to run the provided spark job
class airflow.contrib.operators.spark_sql_operator.SparkSqlOperator(sql, conf=None, conn_id='spark_sql_default', total_executor_cores=None, executor_cores=None, executor_memory=None, keytab=None, principal=None, master='yarn', name='default-name', num_executors=None, yarn_queue='default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute Spark SQL query.
    Parameters
    · sql (str) – The SQL query to execute (templated)
· conf (str (format: PROP=VALUE)) – arbitrary Spark configuration property
    · conn_id (str) – connection_id string
    · total_executor_cores (int) – (Standalone & Mesos only) Total cores for all executors (Default all the available cores on the worker)
    · executor_cores (int) – (Standalone & YARN only) Number of cores per executor (Default 2)
    · executor_memory (str) – Memory per executor (eg 1000M 2G) (Default 1G)
    · keytab (str) – Full path to the file that contains the keytab
· master (str) – spark://host:port, mesos://host:port, yarn, or local
    · name (str) – Name of the job
    · num_executors (int) – Number of executors to launch
    · verbose (bool) – Whether to pass the verbose flag to sparksql
    · yarn_queue (str) – The YARN queue to submit to (Default default)
    execute(context)[source]
    Call the SparkSqlHook to run the provided sql query
class airflow.contrib.operators.spark_submit_operator.SparkSubmitOperator(application='', conf=None, conn_id='spark_default', files=None, py_files=None, driver_classpath=None, jars=None, java_class=None, packages=None, exclude_packages=None, repositories=None, total_executor_cores=None, executor_cores=None, executor_memory=None, driver_memory=None, keytab=None, principal=None, name='airflow-spark', num_executors=None, application_args=None, env_vars=None, verbose=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
This hook is a wrapper around the spark-submit binary to kick off a spark-submit job. It requires that the spark-submit binary is in the PATH or the spark-home is set in the extra on the connection.
    Parameters
    · application (str) – The application that submitted as a job either jar or py file (templated)
    · conf (dict) – Arbitrary Spark configuration properties
    · conn_id (str) – The connection id as configured in Airflow administration When an invalid connection_id is supplied it will default to yarn
    · files (str) – Upload additional files to the executor running the job separated by a comma Files will be placed in the working directory of each executor For example serialized objects
    · py_files (str) – Additional python files used by the job can be zip egg or py
    · jars (str) – Submit additional jars to upload and place them in executor classpath
    · driver_classpath (str) – Additional driverspecific classpath settings
    · java_class (str) – the main class of the Java application
    · packages (str) – Commaseparated list of maven coordinates of jars to include on the driver and executor classpaths (templated)
    · exclude_packages (str) – Commaseparated list of maven coordinates of jars to exclude while resolving the dependencies provided in packages’
    · repositories (str) – Commaseparated list of additional remote repositories to search for the maven coordinates given with packages’
    · total_executor_cores (int) – (Standalone & Mesos only) Total cores for all executors (Default all the available cores on the worker)
    · executor_cores (int) – (Standalone & YARN only) Number of cores per executor (Default 2)
    · executor_memory (str) – Memory per executor (eg 1000M 2G) (Default 1G)
    · driver_memory (str) – Memory allocated to the driver (eg 1000M 2G) (Default 1G)
    · keytab (str) – Full path to the file that contains the keytab
    · principal (str) – The name of the kerberos principal used for keytab
    · name (str) – Name of the job (default airflowspark) (templated)
    · num_executors (int) – Number of executors to launch
    · application_args (list) – Arguments for the application being submitted
    · env_vars (dict) – Environment variables for sparksubmit It supports yarn and k8s mode too
    · verbose (bool) – Whether to pass the verbose flag to sparksubmit process for debugging
    execute(context)[source]
    Call the SparkSubmitHook to run the provided spark job
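A minimal sketch of submitting a PySpark application with this operator, assuming a spark_default connection pointing at the cluster; the application path and tuning values are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

with DAG('spark_submit_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    # Kick off a PySpark job through the spark-submit binary on the worker.
    etl = SparkSubmitOperator(
        task_id='daily_etl',
        application='/opt/jobs/daily_etl.py',        # placeholder job file
        conn_id='spark_default',
        conf={'spark.sql.shuffle.partitions': '200'},
        executor_cores=2,
        executor_memory='4G',
        driver_memory='2G',
        num_executors=10,
        name='daily_etl_{{ ds_nodash }}',
        application_args=['--run-date', '{{ ds }}'],
        verbose=False)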
class airflow.contrib.operators.sqoop_operator.SqoopOperator(conn_id='sqoop_default', cmd_type='import', table=None, query=None, target_dir=None, append=None, file_type='text', columns=None, num_mappers=None, split_by=None, where=None, export_dir=None, input_null_string=None, input_null_non_string=None, staging_table=None, clear_staging_table=False, enclosed_by=None, escaped_by=None, input_fields_terminated_by=None, input_lines_terminated_by=None, input_optionally_enclosed_by=None, batch=False, direct=False, driver=None, verbose=False, relaxed_isolation=False, properties=None, hcatalog_database=None, hcatalog_table=None, create_hcatalog_table=False, extra_import_options=None, extra_export_options=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute a Sqoop job. Documentation for Apache Sqoop can be found here:
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
    execute(context)[source]
    Execute sqoop job
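A minimal import sketch with this operator, assuming a sqoop_default connection to the source database; the table, target directory and split column are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.sqoop_operator import SqoopOperator

with DAG('sqoop_import_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    # Import one relational table into HDFS as text files with 4 parallel mappers.
    sqoop_import = SqoopOperator(
        task_id='import_orders',
        conn_id='sqoop_default',
        cmd_type='import',
        table='orders',
        target_dir='/user/hive/warehouse/staging/orders/{{ ds_nodash }}',
        num_mappers=4,
        split_by='order_id',
        file_type='text',
        verbose=True)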
class airflow.contrib.operators.ssh_operator.SSHOperator(ssh_hook=None, ssh_conn_id=None, remote_host=None, command=None, timeout=10, do_xcom_push=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
SSHOperator to execute commands on a given remote host using the ssh_hook.
    Parameters
    · ssh_hook (SSHHook) – predefined ssh_hook to use for remote execution Either ssh_hook or ssh_conn_id needs to be provided
· ssh_conn_id (str) – connection id from airflow Connections. ssh_conn_id will be ignored if ssh_hook is provided.
    · remote_host (str) – remote host to connect (templated) Nullable If provided it will replace the remote_host which was defined in ssh_hook or predefined in the connection of ssh_conn_id
    · command (str) – command to execute on remote host (templated)
    · timeout (int) – timeout (in seconds) for executing the command
    · do_xcom_push (bool) – return the stdout which also get set in xcom by airflow platform
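A minimal sketch of running a remote command over an SSH connection; the connection id and command are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

with DAG('ssh_command_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    # Run a shell command on the remote host defined by the connection.
    cleanup = SSHOperator(
        task_id='cleanup_tmp',
        ssh_conn_id='edge_node_ssh',                 # placeholder connection id
        command='find /data/tmp -mtime +7 -delete',
        timeout=60,
        do_xcom_push=False)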
class airflow.contrib.operators.vertica_operator.VerticaOperator(sql, vertica_conn_id='vertica_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Executes sql code in a specific Vertica database.
    Parameters
    · vertica_conn_id (str) – reference to a specific Vertica database
· sql (Can receive a str representing a sql statement, a list of str (sql statements), or a reference to a template file. Template references are recognized by str ending in '.sql') – the sql code to be executed (templated)
class airflow.contrib.operators.vertica_to_hive.VerticaToHiveTransfer(sql, hive_table, create=True, recreate=False, partition=None, delimiter=u'\x01', vertica_conn_id='vertica_default', hive_cli_conn_id='hive_cli_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from Vertica to Hive. The operator runs your query against Vertica, stores the file locally before loading it into a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor's metadata. Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
    Parameters
· sql (str) – SQL query to execute against the Vertica database (templated)
    · hive_table (str) – target Hive table use dot notation to target a specific database (templated)
    · create (bool) – whether to create the table if it doesn’t exist
    · recreate (bool) – whether to drop and recreate the table at every execution
    · partition (dict) – target partition as a dict of partition columns and values (templated)
    · delimiter (str) – field delimiter in the file
    · vertica_conn_id (str) – source Vertica connection
    · hive_conn_id (str) – destination hive connection
    Sensors
class airflow.contrib.sensors.aws_redshift_cluster_sensor.AwsRedshiftClusterSensor(cluster_identifier, target_status='available', aws_conn_id='aws_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a Redshift cluster to reach a specific status.
    Parameters
    · cluster_identifier (str) – The identifier for the cluster being pinged
    · target_status (str) – The cluster status desired
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.bash_sensor.BashSensor(bash_command, env=None, output_encoding='utf-8', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Executes a bash command/script and returns True if and only if the return code is 0.
    Parameters
· bash_command (str) – The command, set of commands, or reference to a bash script (must be '.sh') to be executed.
    · env (dict) – If env is not None it must be a mapping that defines the environment variables for the new process these are used instead of inheriting the current process environment which is the default behavior (templated)
    · output_encoding (str) – output encoding of bash command
    poke(context)[source]
    Execute the bash command in a temporary directory which will be cleaned afterwards
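A minimal sketch of waiting for a marker file with this sensor (poke_interval and timeout come from BaseSensorOperator); the path is a placeholder:
from datetime import datetime

from airflow import DAG
from airflow.contrib.sensors.bash_sensor import BashSensor

with DAG('bash_sensor_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    # Succeeds only when the bash command exits with return code 0,
    # i.e. when the _SUCCESS marker for the run date exists.
    wait_for_marker = BashSensor(
        task_id='wait_for_marker',
        bash_command='test -f /data/landing/{{ ds_nodash }}/_SUCCESS',
        poke_interval=60,       # seconds between checks
        timeout=60 * 60 * 4)    # give up after 4 hours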
class airflow.contrib.sensors.bigquery_sensor.BigQueryTableSensor(project_id, dataset_id, table_id, bigquery_conn_id='bigquery_default_conn', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks for the existence of a table in Google Bigquery.
param project_id: The Google cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project.
type project_id: str
param dataset_id: The name of the dataset in which to look for the table storage bucket.
type dataset_id: str
param table_id: The name of the table to check the existence of.
type table_id: str
param bigquery_conn_id: The connection ID to use when connecting to Google BigQuery.
type bigquery_conn_id: str
param delegate_to: The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
type delegate_to: str
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.cassandra_record_sensor.CassandraRecordSensor(table, keys, cassandra_conn_id, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks for the existence of a record in a Cassandra cluster.
For example, if you want to wait for a record that has values 'v1' and 'v2' for the primary keys 'p1' and 'p2' to be populated in keyspace 'k' and table 't', instantiate it as follows:
>>> cassandra_sensor = CassandraRecordSensor(table="k.t",
...                                          keys={"p1": "v1", "p2": "v2"},
...                                          cassandra_conn_id="cassandra_default",
...                                          task_id="cassandra_sensor")
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.cassandra_table_sensor.CassandraTableSensor(table, cassandra_conn_id, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks for the existence of a table in a Cassandra cluster.
For example, if you want to wait for a table called 't' to be created in a keyspace 'k', instantiate it as follows:
>>> cassandra_sensor = CassandraTableSensor(table="k.t",
...                                         cassandra_conn_id="cassandra_default",
...                                         task_id="cassandra_sensor")
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.emr_base_sensor.EmrBaseSensor(aws_conn_id='aws_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Contains general sensor behavior for EMR. Subclasses should implement the get_emr_response() and state_from_response() methods. Subclasses should also implement the NON_TERMINAL_STATES and FAILED_STATE constants.
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.emr_job_flow_sensor.EmrJobFlowSensor(job_flow_id, *args, **kwargs)[source]
Bases: airflow.contrib.sensors.emr_base_sensor.EmrBaseSensor
Asks for the state of the JobFlow until it reaches a terminal state. If it fails, the sensor errors, failing the task.
    Parameters
    job_flow_id (str) – job_flow_id to check the state of
class airflow.contrib.sensors.emr_step_sensor.EmrStepSensor(job_flow_id, step_id, *args, **kwargs)[source]
Bases: airflow.contrib.sensors.emr_base_sensor.EmrBaseSensor
Asks for the state of the step until it reaches a terminal state. If it fails, the sensor errors, failing the task.
    Parameters
    · job_flow_id (str) – job_flow_id which contains the step check the state of
    · step_id (str) – step to check the state of
class airflow.contrib.sensors.file_sensor.FileSensor(filepath, fs_conn_id='fs_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits for a file or folder to land in a filesystem
    If the path given is a directory then this sensor will only return true if any files exist inside it (either directly or within a subdirectory)
    Parameters
    · fs_conn_id (str) – reference to the File (path) connection id
    · filepath – File or folder name (relative to the base path set within the connection)
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
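A minimal sketch of waiting for a daily extract with this sensor, assuming an fs_default connection whose base path contains the extracts directory (placeholder path):
from datetime import datetime

from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor

with DAG('file_sensor_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    # Wait until the daily extract lands under the base path of the fs connection.
    wait_for_file = FileSensor(
        task_id='wait_for_extract',
        fs_conn_id='fs_default',
        filepath='extracts/{{ ds_nodash }}/orders.csv',  # relative to the connection's base path
        poke_interval=120)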
class airflow.contrib.sensors.ftp_sensor.FTPSensor(path, ftp_conn_id='ftp_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Waits for a file or directory to be present on FTP
    Parameters
    · path (str) – Remote file or directory path
    · ftp_conn_id (str) – The connection to run the sensor against
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.ftp_sensor.FTPSSensor(path, ftp_conn_id='ftp_default', *args, **kwargs)[source]
Bases: airflow.contrib.sensors.ftp_sensor.FTPSensor
    Waits for a file or directory to be present on FTP over SSL
class airflow.contrib.sensors.gcs_sensor.GoogleCloudStorageObjectSensor(bucket, object, google_cloud_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks for the existence of a file in Google Cloud Storage. Create a new GoogleCloudStorageObjectSensor.
param bucket: The Google cloud storage bucket where the object is.
type bucket: str
param object: The name of the object to check in the Google cloud storage bucket.
type object: str
param google_cloud_storage_conn_id: The connection ID to use when connecting to Google cloud storage.
type google_cloud_storage_conn_id: str
param delegate_to: The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
type delegate_to: str
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.gcs_sensor.GoogleCloudStorageObjectUpdatedSensor(bucket, object, ts_func, google_cloud_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks if an object is updated in Google Cloud Storage. Create a new GoogleCloudStorageObjectUpdatedSensor.
param bucket: The Google cloud storage bucket where the object is.
type bucket: str
param object: The name of the object to download in the Google cloud storage bucket.
type object: str
param ts_func: Callback for defining the update condition. The default callback returns execution_date + schedule_interval. The callback takes the context as parameter.
type ts_func: function
param google_cloud_storage_conn_id: The connection ID to use when connecting to Google cloud storage.
type google_cloud_storage_conn_id: str
param delegate_to: The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
type delegate_to: str
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.gcs_sensor.GoogleCloudStoragePrefixSensor(bucket, prefix, google_cloud_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks for the existence of files at a prefix in a Google Cloud Storage bucket. Create a new GoogleCloudStoragePrefixSensor.
param bucket: The Google cloud storage bucket where the object is.
type bucket: str
param prefix: The name of the prefix to check in the Google cloud storage bucket.
type prefix: str
param google_cloud_storage_conn_id: The connection ID to use when connecting to Google cloud storage.
type google_cloud_storage_conn_id: str
param delegate_to: The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
type delegate_to: str
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.hdfs_sensor.HdfsSensorFolder(be_empty=False, *args, **kwargs)[source]
Bases: airflow.sensors.hdfs_sensor.HdfsSensor
    poke(context)[source]
    poke for a non empty directory
    Returns
    Bool depending on the search criteria
class airflow.contrib.sensors.hdfs_sensor.HdfsSensorRegex(regex, *args, **kwargs)[source]
Bases: airflow.sensors.hdfs_sensor.HdfsSensor
poke(context)[source]
poke matching files in a directory with self.regex
    Returns
    Bool depending on the search criteria
class airflow.contrib.sensors.pubsub_sensor.PubSubPullSensor(project, subscription, max_messages=5, return_immediately=False, ack_messages=False, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
    Pulls messages from a PubSub subscription and passes them through XCom
    This sensor operator will pull up to max_messages messages from the specified PubSub subscription When the subscription returns messages the poke method’s criteria will be fulfilled and the messages will be returned from the operator and passed through XCom for downstream tasks
    If ack_messages is set to True messages will be immediately acknowledged before being returned otherwise downstream tasks will be responsible for acknowledging them
    project and subscription are templated so you can use variables in them
    execute(context)[source]
    Overridden to allow messages to be passed
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.sftp_sensor.SFTPSensor(path, sftp_conn_id='sftp_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a file or directory to be present on SFTP.
param path: Remote file or directory path
type path: str
param sftp_conn_id: The connection to run the sensor against
type sftp_conn_id: str
    poke(context)[source]
    Function that the sensors defined while deriving this class should override
    Macros
    Here’s a list of variables and macros that can be used in templates
    Default Variables
    The Airflow engine passes a few variables by default that are accessible in all templates
{{ ds }} – the execution date as YYYY-MM-DD
{{ ds_nodash }} – the execution date as YYYYMMDD
{{ prev_ds }} – the previous execution date as YYYY-MM-DD; if {{ ds }} is 2016-01-08 and schedule_interval is @weekly, {{ prev_ds }} will be 2016-01-01
{{ prev_ds_nodash }} – the previous execution date as YYYYMMDD if it exists, else None
{{ next_ds }} – the next execution date as YYYY-MM-DD; if {{ ds }} is 2016-01-01 and schedule_interval is @weekly, {{ next_ds }} will be 2016-01-08
{{ next_ds_nodash }} – the next execution date as YYYYMMDD if it exists, else None
{{ yesterday_ds }} – yesterday's date as YYYY-MM-DD
{{ yesterday_ds_nodash }} – yesterday's date as YYYYMMDD
{{ tomorrow_ds }} – tomorrow's date as YYYY-MM-DD
{{ tomorrow_ds_nodash }} – tomorrow's date as YYYYMMDD
{{ ts }} – same as execution_date.isoformat()
{{ ts_nodash }} – same as ts without - and :
{{ execution_date }} – the execution_date (datetime.datetime)
{{ prev_execution_date }} – the previous execution date (if available) (datetime.datetime)
{{ next_execution_date }} – the next execution date (datetime.datetime)
{{ dag }} – the DAG object
{{ task }} – the Task object
{{ macros }} – a reference to the macros package, described below
{{ task_instance }} – the task_instance object
{{ end_date }} – same as {{ ds }}
{{ latest_date }} – same as {{ ds }}
{{ ti }} – same as {{ task_instance }}
{{ params }} – a reference to the user-defined params dictionary, which can be overridden by the dictionary passed through trigger_dag -c if you enabled dag_run_conf_overrides_params in airflow.cfg
{{ var.value.my_var }} – globally defined variables represented as a dictionary
{{ var.json.my_var.path }} – globally defined variables represented as a dictionary with a deserialized JSON object; append the path to the key within the JSON object
{{ task_instance_key_str }} – a unique, human-readable key to the task instance, formatted {dag_id}_{task_id}_{ds}
{{ conf }} – the full configuration object located at airflow.configuration.conf, which represents the content of your airflow.cfg
{{ run_id }} – the run_id of the current DAG run
{{ dag_run }} – a reference to the DagRun object
{{ test_mode }} – whether the task instance was called using the CLI's test subcommand
Note that you can access the object's attributes and methods with simple dot notation. Here are some examples of what is possible: {{ task.owner }}, {{ task.task_id }}, {{ ti.hostname }}, ... Refer to the models documentation for more information on the objects' attributes and methods.
The var template variable allows you to access variables defined in Airflow's UI. You can access them as either plain text or JSON. If you use JSON, you are also able to walk nested structures, such as dictionaries, like {{ var.json.my_dict_var.key1 }}.
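A short sketch showing how these default variables and the var macro render inside a templated field; my_var is a hypothetical Airflow Variable:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG('template_vars_demo', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    # The Jinja expressions below are rendered by Airflow just before execution.
    show_dates = BashOperator(
        task_id='show_dates',
        bash_command='echo "run date: {{ ds }}, compact: {{ ds_nodash }}, '
                     'next run: {{ next_ds }}, dag: {{ dag.dag_id }}, '
                     'my setting: {{ var.value.my_var }}"')   # my_var is a placeholder Variable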
    Macros
    Macros are a way to expose objects to your templates and live under the macros namespace in your templates
    A few commonly used libraries and methods are made available
macros.datetime – The standard lib's datetime.datetime
macros.timedelta – The standard lib's datetime.timedelta
macros.dateutil – A reference to the dateutil package
macros.time – The standard lib's time
macros.uuid – The standard lib's uuid
macros.random – The standard lib's random
    Some airflow specific macros are also defined
airflow.macros.ds_add(ds, days)[source]
Add or subtract days from a YYYY-MM-DD.
Parameters
· ds (str) – anchor date in YYYY-MM-DD format to add to
· days (int) – number of days to add to the ds, you can use negative values
>>> ds_add('2015-01-01', 5)
'2015-01-06'
>>> ds_add('2015-01-06', -5)
'2015-01-01'
airflow.macros.ds_format(ds, input_format, output_format)[source]
Takes an input string and outputs another string as specified in the output format.
Parameters
· ds (str) – input string which contains a date
· input_format (str) – input string format, e.g. %Y-%m-%d
· output_format (str) – output string format, e.g. %Y-%m-%d
>>> ds_format('2015-01-01', "%Y-%m-%d", "%m-%d-%y")
'01-01-15'
>>> ds_format('1/5/2015', "%m/%d/%Y", "%Y-%m-%d")
'2015-01-05'
airflow.macros.random() → x in the interval [0, 1)
airflow.macros.hive.closest_ds_partition(table, ds, before=True, schema='default', metastore_conn_id='metastore_default')[source]
This function finds the date in a list closest to the target date. An optional parameter can be given to get the closest date before or after.
Parameters
· table (str) – A hive table name
· ds (datetime.date list) – A datestamp %Y-%m-%d, e.g. yyyy-mm-dd
    · before (bool or None) – closest before (True) after (False) or either side of ds
    Returns
    The closest date
    Return type
    str or None
>>> tbl = 'airflow.static_babynames_partitioned'
>>> closest_ds_partition(tbl, '2015-01-02')
'2015-01-01'
airflow.macros.hive.max_partition(table, schema='default', field=None, filter_map=None, metastore_conn_id='metastore_default')[source]
Gets the max partition for a table.
Parameters
· schema (str) – The hive schema the table lives in
· table (str) – The hive table you are interested in; supports the dot notation as in my_database.my_table. If a dot is found, the schema param is disregarded.
· metastore_conn_id (str) – The hive connection you are interested in. If your default is set, you don't need to use this parameter.
· filter_map (map) – partition_key:partition_value map used for partition filtering, e.g. {'key1': 'value1', 'key2': 'value2'}. Only partitions matching all partition_key:partition_value pairs will be considered as candidates of max partition.
· field (str) – the field to get the max value from. If there's only one partition field, this will be inferred.
>>> max_partition('airflow.static_babynames_partitioned')
'2015-01-01'
    Model
    Models are built on top of the SQLAlchemy ORM Base class and instances are persisted in the database
class airflow.models.BaseOperator(task_id, owner='Airflow', email=None, email_on_retry=True, email_on_failure=True, retries=0, retry_delay=datetime.timedelta(0, 300), retry_exponential_backoff=False, max_retry_delay=None, start_date=None, end_date=None, schedule_interval=None, depends_on_past=False, wait_for_downstream=False, dag=None, params=None, default_args=None, adhoc=False, priority_weight=1, weight_rule=u'downstream', queue='default', pool=None, sla=None, execution_timeout=None, on_failure_callback=None, on_success_callback=None, on_retry_callback=None, trigger_rule=u'all_success', resources=None, run_as_user=None, task_concurrency=None, executor_config=None, inlets=None, outlets=None, *args, **kwargs)[source]
Bases: airflow.utils.log.logging_mixin.LoggingMixin
Abstract base class for all operators. Since operators create objects that become nodes in the dag, BaseOperator contains many recursive methods for dag crawling behavior. To derive this class, you are expected to override the constructor as well as the 'execute' method (a minimal subclass sketch is given after the parameter list below).
Operators derived from this class should perform or trigger certain tasks synchronously (wait for completion). Examples of operators could be an operator that runs a Pig job (PigOperator), a sensor operator that waits for a partition to land in Hive (HiveSensorOperator), or one that moves data from Hive to MySQL (Hive2MySqlOperator). Instances of these operators (tasks) target specific operations, running specific scripts, functions or data transfers.
This class is abstract and shouldn't be instantiated. Instantiating a class derived from this one results in the creation of a task object, which ultimately becomes a node in DAG objects. Task dependencies should be set by using the set_upstream and/or set_downstream methods.
    Parameters
    · task_id (str) – a unique meaningful id for the task
    · owner (str) – the owner of the task using the unix username is recommended
    · retries (int) – the number of retries that should be performed before failing the task
    · retry_delay (timedelta) – delay between retries
    · retry_exponential_backoff (bool) – allow progressive longer waits between retries by using exponential backoff algorithm on retry delay (delay will be converted into seconds)
    · max_retry_delay (timedelta) – maximum delay interval between retries
· start_date (datetime) – The start_date for the task determines the execution_date for the first task instance. The best practice is to have the start_date rounded to your DAG's schedule_interval. Daily jobs have their start_date some day at 00:00:00, hourly jobs have their start_date at 00:00 of a specific hour. Note that Airflow simply looks at the latest execution_date and adds the schedule_interval to determine the next execution_date. It is also very important to note that different tasks' dependencies need to line up in time. If task A depends on task B and their start_date are offset in a way that their execution_date don't line up, A's dependencies will never be met. If you are looking to delay a task, for example running a daily task at 2AM, look into the TimeSensor and TimeDeltaSensor. We advise against using dynamic start_date and recommend using fixed ones. Read the FAQ entry about start_date for more information.
    · end_date (datetime) – if specified the scheduler won’t go beyond this date
    · depends_on_past (bool) – when set to true task instances will run sequentially while relying on the previous task’s schedule to succeed The task instance for the start_date is allowed to run
    · wait_for_downstream (bool) – when set to true an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully before it runs This is useful if the different instances of a task X alter the same asset and this asset is used by tasks downstream of task X Note that depends_on_past is forced to True wherever wait_for_downstream is used
    · queue (str) – which queue to target when running this job Not all executors implement queue management the CeleryExecutor does support targeting specific queues
    · dag (DAG) – a reference to the dag the task is attached to (if any)
    · priority_weight (int) – priority weight of this task against other task This allows the executor to trigger higher priority tasks before others when things get backed up Set priority_weight as a higher number for more important tasks
    · weight_rule (str) – weighting method used for the effective total priority weight of the task Options are { downstream | upstream | absolute } default is downstream When set to downstream the effective weight of the task is the aggregate sum of all downstream descendants As a result upstream tasks will have higher weight and will be scheduled more aggressively when using positive weight values This is useful when you have multiple dag run instances and desire to have all upstream tasks to complete for all runs before each dag can continue processing downstream tasks When set to upstream the effective weight is the aggregate sum of all upstream ancestors This is the opposite where downtream tasks have higher weight and will be scheduled more aggressively when using positive weight values This is useful when you have multiple dag run instances and prefer to have each dag complete before starting upstream tasks of other dags When set to absolute the effective weight is the exact priority_weight specified without additional weighting You may want to do this when you know exactly what priority weight each task should have Additionally when set to absolute there is bonus effect of significantly speeding up the task creation process as for very large DAGS Options can be set as string or using the constants defined in the static class airflowutilsWeightRule
    · pool (str) – the slot pool this task should run in slot pools are a way to limit concurrency for certain tasks
    · sla (datetime.timedelta) – time by which the job is expected to succeed Note that this represents the timedelta after the period is closed For example if you set an SLA of 1 hour the scheduler would send an email soon after 1:00AM on 2016-01-02 if the 2016-01-01 instance has not succeeded yet The scheduler pays special attention for jobs with an SLA and sends alert emails for sla misses SLA misses are also recorded in the database for future reference All tasks that share the same SLA time get bundled in a single email sent soon after that time SLA notifications are sent once and only once for each task instance
    · execution_timeout (datetimetimedelta) – max time allowed for the execution of this task instance if it goes beyond it will raise and fail
    · on_failure_callback (callable) – a function to be called when a task instance of this task fails a context dictionary is passed as a single parameter to this function Context contains references to related objects to the task instance and is documented under the macros section of the API
    · on_retry_callback (callable) – much like the on_failure_callback except that it is executed when retries occur
    · on_success_callback (callable) – much like the on_failure_callback except that it is executed when the task succeeds
    · trigger_rule (str) – defines the rule by which dependencies are applied for the task to get triggered Options are { all_success | all_failed | all_done | one_success | one_failed | dummy} default is all_success Options can be set as string or using the constants defined in the static class airflow.utils.TriggerRule
    · resources (dict) – A map of resource parameter names (the argument names of the Resources constructor) to their values
    · run_as_user (str) – unix username to impersonate while running the task
    · task_concurrency (int) – When set a task will be able to limit the concurrent runs across execution_dates
    · executor_config (dict) –
    Additional task-level configuration parameters that are interpreted by a specific executor Parameters are namespaced by the name of executor
    Example to run this task in a specific docker container through the KubernetesExecutor
    MyOperator(
        executor_config={
            "KubernetesExecutor": {"image": "myCustomDockerImage"}
        }
    )
    clear(**kwargs)[source]
    Clears the state of task instances associated with the task following the parameters specified
    dag
    Returns the Operator’s DAG if set otherwise raises an error
    deps
    Returns the list of dependencies for the operator These differ from execution context dependencies in that they are specific to tasks and can be extendedoverridden by subclasses
    downstream_list
    @property list of tasks directly downstream
    execute(context)[source]
    This is the main method to derive when creating an operator Context is the same dictionary used as when rendering jinja templates
    Refer to get_template_context for more context
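    For illustration, the following is a minimal sketch of a custom operator that overrides execute() as described above; the operator name, the my_param argument and the log message are hypothetical and only serve to show the pattern.
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    class MyHelloOperator(BaseOperator):
        # my_param is a hypothetical argument used only for this sketch
        @apply_defaults
        def __init__(self, my_param, *args, **kwargs):
            super(MyHelloOperator, self).__init__(*args, **kwargs)
            self.my_param = my_param

        def execute(self, context):
            # context is the same dictionary used when rendering jinja templates
            self.log.info("Running with my_param=%s on %s", self.my_param, context['ds'])
            return self.my_param  # a returned value is pushed to XCom automatically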
    get_direct_relative_ids(upstream=False)[source]
    Get the direct relative ids to the current task upstream or downstream
    get_direct_relatives(upstream=False)[source]
    Get the direct relatives to the current task upstream or downstream
    get_flat_relative_ids(upstream=False, found_descendants=None)[source]
    Get a flat list of relatives’ ids either upstream or downstream
    get_flat_relatives(upstream=False)[source]
    Get a flat list of relatives either upstream or downstream
    get_task_instances(session, start_date=None, end_date=None)[source]
    Get a set of task instance related to this task for a specific date range
    has_dag()[source]
    Returns True if the Operator has been assigned to a DAG
    on_kill()[source]
    Override this method to cleanup subprocesses when a task instance gets killed Any use of the threading subprocess or multiprocessing module within an operator needs to be cleaned up or it will leave ghost processes behind
    post_execute(context *args **kwargs)[source]
    This hook is triggered right after selfexecute() is called It is passed the execution context and any results returned by the operator
    pre_execute(context *args **kwargs)[source]
    This hook is triggered right before selfexecute() is called
    prepare_template()[source]
    Hook that is triggered after the templated fields get replaced by their content If you need your operator to alter the content of the file before the template is rendered it should override this method to do so
    render_template(attr content context)[source]
    Renders a template either from a file or directly in a field and returns the rendered result
    render_template_from_field(attr content context jinja_env)[source]
    Renders a template from a field If the field is a string it will simply render the string and return the result If it is a collection or nested set of collections it will traverse the structure and render all strings in it
    run(start_date=None, end_date=None, ignore_first_depends_on_past=False, ignore_ti_state=False, mark_success=False)[source]
    Run a set of task instances for a date range
    schedule_interval
    The schedule interval of the DAG always wins over individual tasks so that tasks within a DAG always line up The task still needs a schedule_interval as it may not be attached to a DAG
    set_downstream(task_or_task_list)[source]
    Set a task or a task list to be directly downstream from the current task
    set_upstream(task_or_task_list)[source]
    Set a task or a task list to be directly upstream from the current task
    upstream_list
    @property list of tasks directly upstream
    xcom_pull(context, task_ids=None, dag_id=None, key=u'return_value', include_prior_dates=None)[source]
    See TaskInstance.xcom_pull()
    xcom_push(context, key, value, execution_date=None)[source]
    See TaskInstance.xcom_push()
    class airflow.models.Chart(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    class airflow.models.Connection(conn_id=None, conn_type=None, host=None, login=None, password=None, schema=None, port=None, extra=None, uri=None)[source]
    Bases: sqlalchemy.ext.declarative.api.Base, airflow.utils.log.logging_mixin.LoggingMixin
    Placeholder to store information about different database instances connection information The idea here is that scripts use references to database instances (conn_id) instead of hard coding hostname logins and passwords when using operators or hooks
    extra_dejson
    Returns the extra property by deserializing json
    class airflow.models.DAG(dag_id, description=u'', schedule_interval=datetime.timedelta(1), start_date=None, end_date=None, full_filepath=None, template_searchpath=None, user_defined_macros=None, user_defined_filters=None, default_args=None, concurrency=16, max_active_runs=16, dagrun_timeout=None, sla_miss_callback=None, default_view=u'tree', orientation='LR', catchup=True, on_success_callback=None, on_failure_callback=None, params=None)[source]
    Bases: airflow.dag.base_dag.BaseDag, airflow.utils.log.logging_mixin.LoggingMixin
    A dag (directed acyclic graph) is a collection of tasks with directional dependencies A dag also has a schedule a start date and an end date (optional) For each schedule (say daily or hourly) the DAG needs to run each individual tasks as their dependencies are met Certain tasks have the property of depending on their own past meaning that they can’t run until their previous schedule (and upstream tasks) are completed
    DAGs essentially act as namespaces for tasks A task_id can only be added once to a DAG
    Parameters
    · dag_id (str) – The id of the DAG
    · description (str) – The description for the DAG to eg be shown on the webserver
    · schedule_interval (datetime.timedelta or dateutil.relativedelta.relativedelta or str that acts as a cron expression) – Defines how often that DAG runs this timedelta object gets added to your latest task instance’s execution_date to figure out the next schedule
    · start_date (datetime.datetime) – The timestamp from which the scheduler will attempt to backfill
    · end_date (datetime.datetime) – A date beyond which your DAG won’t run leave to None for open ended scheduling
    · template_searchpath (str or list of strings) – This list of folders (non relative) defines where jinja will look for your templates Order matters Note that jinja/airflow includes the path of your DAG file by default
    · user_defined_macros (dict) – a dictionary of macros that will be exposed in your jinja templates For example passing dict(foo='bar') to this argument allows you to use {{ foo }} in all jinja templates related to this DAG Note that you can pass any type of object here
    · user_defined_filters (dict) – a dictionary of filters that will be exposed in your jinja templates For example passing dict(hello=lambda name: 'Hello %s' % name) to this argument allows you to use {{ 'world' | hello }} in all jinja templates related to this DAG
    · default_args (dict) – A dictionary of default parameters to be used as constructor keyword parameters when initialising operators Note that operators have the same hook and precede those defined here meaning that if your dict contains 'depends_on_past': True here and 'depends_on_past': False in the operator’s call default_args the actual value will be False
    · params (dict) – a dictionary of DAG level parameters that are made accessible in templates namespaced under params These params can be overridden at the task level
    · concurrency (int) – the number of task instances allowed to run concurrently
    · max_active_runs (int) – maximum number of active DAG runs beyond this number of DAG runs in a running state the scheduler won’t create new active DAG runs
    · dagrun_timeout (datetimetimedelta) – specify how long a DagRun should be up before timing out failing so that new DagRuns can be created
    · sla_miss_callback (typesFunctionType) – specify a function to call when reporting SLA timeouts
    · default_view (str) – Specify DAG default view (tree graph duration gantt landing_times)
    · orientation (str) – Specify DAG orientation in graph view (LR TB RL BT)
    · catchup (bool) – Perform scheduler catchup (or only run latest) Defaults to True
    · on_failure_callback (callable) – A function to be called when a DagRun of this dag fails A context dictionary is passed as a single parameter to this function
    · on_success_callback (callable) – Much like the on_failure_callback except that it is executed when the dag succeeds
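    To tie the parameters above together, here is a minimal sketch of a DAG definition; the dag_id, dates and task ids are hypothetical, and the defaults shown are only one reasonable choice.
    from datetime import datetime, timedelta
    from airflow.models import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # default_args are passed to every operator in the DAG unless overridden
    default_args = {
        'owner': 'airflow',
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    dag = DAG(
        dag_id='example_daily_dag',       # hypothetical id
        default_args=default_args,
        start_date=datetime(2021, 1, 1),  # fixed start_date, as recommended above
        schedule_interval='@daily',
        catchup=False,
    )

    start = DummyOperator(task_id='start', dag=dag)
    end = DummyOperator(task_id='end', dag=dag)
    start >> end  # equivalent to start.set_downstream(end)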
    add_task(task)[source]
    Add a task to the DAG
    Parameters
    task (task) – the task you want to add
    add_tasks(tasks)[source]
    Add a list of tasks to the DAG
    Parameters
    tasks (list of tasks) – a list of tasks you want to add
    clear(**kwargs)[source]
    Clears a set of task instances associated with the current dag for a specified date range
    cli()[source]
    Exposes a CLI specific to this DAG
    concurrency_reached
    Returns a boolean indicating whether the concurrency limit for this DAG has been reached
    create_dagrun(**kwargs)[source]
    Creates a dag run from this dag including the tasks associated with this dag Returns the dag run
    Parameters
    · run_id (str) – defines the run id for this dag run
    · execution_date (datetime) – the execution date of this dag run
    · state (State) – the state of the dag run
    · start_date (datetime) – the date this dag run should be evaluated
    · external_trigger (bool) – whether this dag run is externally triggered
    · session (Session) – database session
    static deactivate_stale_dags(*args **kwargs)[source]
    Deactivate any DAGs that were last touched by the scheduler before the expiration date These DAGs were likely deleted
    Parameters
    expiration_date (datetime) – set inactive DAGs that were touched before this time
    Returns
    None
    static deactivate_unknown_dags(*args **kwargs)[source]
    Given a list of known DAGs deactivate any other DAGs that are marked as active in the ORM
    Parameters
    active_dag_ids (list[unicode]) – list of DAG IDs that are active
    Returns
    None
    filepath
    File location of where the dag object is instantiated
    folder
    Folder location of where the dag object is instantiated
    following_schedule(dttm)[source]
    Calculates the following schedule for this dag in local time
    Parameters
    dttm – utc datetime
    Returns
    utc datetime
    get_active_runs(**kwargs)[source]
    Returns a list of dag run execution dates currently running
    Parameters
    session –
    Returns
    List of execution dates
    get_dagrun(**kwargs)[source]
    Returns the dag run for a given execution date if it exists otherwise none
    Parameters
    · execution_date – The execution date of the DagRun to find
    · session –
    Returns
    The DagRun if found otherwise None
    get_last_dagrun(**kwargs)[source]
    Returns the last dag run for this dag None if there was none Last dag run can be any type of run eg scheduled or backfilled Overridden DagRuns are ignored
    get_num_active_runs(**kwargs)[source]
    Returns the number of active running dag runs
    Parameters
    · external_trigger (bool) – True for externally triggered active dag runs
    · session –
    Returns
    number greater than 0 for active dag runs
    static get_num_task_instances(*args **kwargs)[source]
    Returns the number of task instances in the given DAG
    Parameters
    · session – ORM session
    · dag_id (unicode) – ID of the DAG to get the task concurrency of
    · task_ids (list[unicode]) – A list of valid task IDs for the given DAG
    · states (list[state]) – A list of states to filter by if supplied
    Returns
    The number of running tasks
    Return type
    int
    get_run_dates(start_date end_dateNone)[source]
    Returns a list of dates between the interval received as parameter using this dag’s schedule interval Returned dates can be used for execution dates
    Parameters
    · start_date (datetime) – the start date of the interval
    · end_date (datetime) – the end date of the interval defaults to timezoneutcnow()
    Returns
    a list of dates within the interval following the dag’s schedule
    Return type
    list
    get_template_env()[source]
    Returns a jinja2 Environment while taking into account the DAGs template_searchpath user_defined_macros and user_defined_filters
    handle_callback(**kwargs)[source]
    Triggers the appropriate callback depending on the value of success namely the on_failure_callback or on_success_callback This method gets the context of a single TaskInstance part of this DagRun and passes that to the callable along with a 'reason', primarily to differentiate DagRun failures Note:
    The logs end up in $AIRFLOW_HOME/logs/scheduler/latest/PROJECT/DAG_FILE.py.log
    Parameters
    · dagrun – DagRun object
    · success – Flag to specify if failure or success callback should be called
    · reason – Completion reason
    · session – Database session
    is_paused
    Returns a boolean indicating whether this DAG is paused
    latest_execution_date
    Returns the latest date for which at least one dag run exists
    normalize_schedule(dttm)[source]
    Returns dttm + interval unless dttm is first interval then it returns dttm
    previous_schedule(dttm)[source]
    Calculates the previous schedule for this dag in local time
    Parameters
    dttm – utc datetime
    Returns
    utc datetime
    run(start_date=None, end_date=None, mark_success=False, local=False, executor=None, donot_pickle=False, ignore_task_deps=False, ignore_first_depends_on_past=False, pool=None, delay_on_limit_secs=1.0, verbose=False, conf=None, rerun_failed_tasks=False)[source]
    Runs the DAG
    Parameters
    · start_date (datetime) – the start date of the range to run
    · end_date (datetime) – the end date of the range to run
    · mark_success (bool) – True to mark jobs as succeeded without running them
    · local (bool) – True to run the tasks using the LocalExecutor
    · executor (BaseExecutor) – The executor instance to run the tasks
    · donot_pickle (bool) – True to avoid pickling DAG object and send to workers
    · ignore_task_deps (bool) – True to skip upstream tasks
    · ignore_first_depends_on_past (bool) – True to ignore depends_on_past dependencies for the first set of tasks only
    · pool (str) – Resource pool to use
    · delay_on_limit_secs (float) – Time in seconds to wait before next attempt to run dag run when max_active_runs limit has been reached
    · verbose (bool) – Make logging output more verbose
    · conf (dict) – user defined dictionary passed from CLI
    set_dependency(upstream_task_id downstream_task_id)[source]
    Simple utility method to set dependency between two tasks that already have been added to the DAG using add_task()
    sub_dag(task_regex, include_downstream=False, include_upstream=True)[source]
    Returns a subset of the current dag as a deep copy of the current dag based on a regex that should match one or many tasks and includes upstream and downstream neighbours based on the flag passed
    subdags
    Returns a list of the subdag objects associated to this DAG
    sync_to_db(**kwargs)[source]
    Save attributes about this DAG to the DB Note that this method can be called for both DAGs and SubDAGs A SubDag is actually a SubDagOperator
    Parameters
    · dag (DAG) – the DAG object to save to the DB
    · sync_time (datetime) – The time that the DAG should be marked as sync’ed
    Returns
    None
    test_cycle()[source]
    Check to see if there are any cycles in the DAG Returns False if no cycle found otherwise raises exception
    topological_sort()[source]
    Sorts tasks in topological order such that a task comes after any of its upstream dependencies
    Heavily inspired by http://blog.jupo.org/2012/04/06/topological-sorting-acyclic-directed-graphs/
    Returns
    list of tasks in topological order
    tree_view()[source]
    Shows an ascii tree representation of the DAG
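    A small usage sketch of the inspection helpers above, assuming dag is a DAG object such as the one defined in the earlier sketch:
    # iterate tasks so that each task appears after its upstream dependencies
    for task in dag.topological_sort():
        print(task.task_id)

    dag.tree_view()                   # prints an ascii tree of the DAG to stdout
    print(dag.latest_execution_date)  # latest date with at least one dag run, or None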
    class airflow.models.DagBag(dag_folder=None, executor=None, include_examples=True)[source]
    Bases: airflow.dag.base_dag.BaseDagBag, airflow.utils.log.logging_mixin.LoggingMixin
    A dagbag is a collection of dags parsed out of a folder tree and has high level configuration settings like what database to use as a backend and what executor to use to fire off tasks This makes it easier to run distinct environments for say production and development tests or for different teams or security profiles What would have been system level settings are now dagbag level so that one system can run multiple independent settings sets
    Parameters
    · dag_folder (unicode) – the folder to scan to find DAGs
    · executor – the executor to use when executing task instances in this DagBag
    · include_examples (bool) – whether to include the examples that ship with airflow or not
    · has_logged – an instance boolean that gets flipped from False to True after a file has been skipped This is to prevent overloading the user with logging messages about skipped files Therefore only once per DagBag is a file logged being skipped
    bag_dag(dag parent_dag root_dag)[source]
    Adds the DAG into the bag recurses into sub dags Throws AirflowDagCycleException if a cycle is detected in this dag or its subdags
    collect_dags(dag_folder=None, only_if_updated=True)[source]
    Given a file path or a folder this method looks for python modules imports them and adds them to the dagbag collection
    Note that if a .airflowignore file is found while processing the directory it will behave much like a .gitignore ignoring files that match any of the regex patterns specified in the file
    Note: The patterns in .airflowignore are treated as unanchored regexes not shell-like glob patterns
    dagbag_report()[source]
    Prints a report around DagBag loading stats
    get_dag(dag_id)[source]
    Gets the DAG out of the dictionary and refreshes it if expired
    kill_zombies(**kwargs)[source]
    Fails tasks that haven’t had a heartbeat in too long
    process_file(filepath, only_if_updated=True, safe_mode=True)[source]
    Given a path to a python module or zip file this method imports the module and looks for dag objects within it
    size()[source]
    Returns
    the amount of dags contained in this dagbag
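    A minimal sketch of loading DAGs through a DagBag; the folder path and dag_id are hypothetical.
    from airflow.models import DagBag

    # '/path/to/dags' is a hypothetical folder containing DAG definition files
    dagbag = DagBag(dag_folder='/path/to/dags', include_examples=False)

    print(dagbag.size())                       # number of dags parsed out of the folder
    print(dagbag.dagbag_report())              # report around DagBag loading stats
    dag = dagbag.get_dag('example_daily_dag')  # hypothetical dag_id; refreshed if expired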
    class airflow.models.DagModel(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    class airflow.models.DagPickle(dag)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    Dags can originate from different places (user repos master repo …) and also get executed in different places (different executors) This object represents a version of a DAG and becomes a source of truth for a BackfillJob execution A pickle is a native python serialized object and in this case gets stored in the database for the duration of the job
    The executors pick up the DagPickle id and read the dag definition from the database
    class airflow.models.DagRun(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base, airflow.utils.log.logging_mixin.LoggingMixin
    DagRun describes an instance of a Dag It can be created by the scheduler (for regular runs) or by an external trigger
    static find(*args **kwargs)[source]
    Returns a set of dag runs for the given search criteria
    Parameters
    · dag_id (int list) – the dag_id to find dag runs for
    · run_id (str) – defines the run id for this dag run
    · execution_date (datetime) – the execution date
    · state (State) – the state of the dag run
    · external_trigger (bool) – whether this dag run is externally triggered
    · no_backfills (bool) – return no backfills (True), return all (False). Defaults to False
    · session (Session) – database session
    get_dag()[source]
    Returns the Dag associated with this DagRun
    Returns
    DAG
    classmethod get_latest_runs(**kwargs)[source]
    Returns the latest DagRun for each DAG
    get_previous_dagrun(**kwargs)[source]
    The previous DagRun if there is one
    get_previous_scheduled_dagrun(**kwargs)[source]
    The previous SCHEDULED DagRun if there is one
    static get_run(session dag_id execution_date)[source]
    Parameters
    · dag_id (unicode) – DAG ID
    · execution_date (datetime) – execution date
    Returns
    DagRun corresponding to the given dag_id and execution date if one exists, None otherwise
    Return type
    DagRun
    get_task_instance(**kwargs)[source]
    Returns the task instance specified by task_id for this dag run
    Parameters
    task_id – the task id
    get_task_instances(**kwargs)[source]
    Returns the task instances for this dag run
    refresh_from_db(**kwargs)[source]
    Reloads the current dagrun from the database
    Parameters
    session – database session
    update_state(**kwargs)[source]
    Determines the overall state of the DagRun based on the state of its TaskInstances
    Returns
    State
    verify_integrity(**kwargs)[source]
    Verifies the DagRun by checking for removed tasks or tasks that are not in the database yet It will set state to removed or add the task if required
    class airflow.models.DagStat(dag_id, state, count=0, dirty=False)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    static create(*args, **kwargs)[source]
    Creates the missing states in the stats table for the dag specified
    Parameters
    · dag_id – dag id of the dag to create stats for
    · session – database session
    Returns

    static set_dirty(*args **kwargs)[source]
    Parameters
    · dag_id – the dag_id to mark dirty
    · session – database session
    Returns

    static update(*args **kwargs)[source]
    Updates the stats for dirty/out-of-sync dags
    Parameters
    · dag_ids (list) – dag_ids to be updated
    · dirty_only (bool) – only update those marked dirty, defaults to True
    · session (Session) – db session to use
    class airflow.models.ImportError(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    exception airflow.models.InvalidFernetToken[source]
    Bases: exceptions.Exception
    class airflow.models.KnownEvent(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    class airflow.models.KnownEventType(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    class airflow.models.KubeResourceVersion(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    class airflow.models.KubeWorkerIdentifier(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    class airflow.models.Log(event, task_instance, owner=None, extra=None, **kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    Used to actively log events to the database
    class airflow.models.NullFernet[source]
    Bases: future.types.newobject.newobject
    A Null encryptor class that doesn’t encrypt or decrypt but that presents a similar interface to Fernet
    The purpose of this is to make the rest of the code not have to know the difference and to only display the message once not 20 times when airflow initdb is run
    class airflow.models.Pool(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    open_slots(**kwargs)[source]
    Returns the number of slots open at the moment
    queued_slots(**kwargs)[source]
    Returns the number of slots queued at the moment
    used_slots(**kwargs)[source]
    Returns the number of slots used at the moment
    class airflow.models.SlaMiss(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    Model that stores a history of the SLA that have been missed It is used to keep track of SLA failures over time and to avoid double triggering alert emails
    class airflow.models.TaskFail(task, execution_date, start_date, end_date)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    TaskFail tracks the failed run durations of each task instance
    class airflow.models.TaskInstance(task, execution_date, state=None)[source]
    Bases: sqlalchemy.ext.declarative.api.Base, airflow.utils.log.logging_mixin.LoggingMixin
    Task instances store the state of a task instance This table is the authority and single source of truth around what tasks have run and the state they are in
    The SqlAlchemy model doesn’t have a SqlAlchemy foreign key to the task or dag model deliberately to have more control over transactions
    Database transactions on this table should guard against double triggers and any confusion around what task instances are or aren’t ready to run even while multiple schedulers may be firing task instances
    are_dependencies_met(**kwargs)[source]
    Returns whether or not all the conditions are met for this task instance to be run given the context for the dependencies (eg a task instance being force run from the UI will ignore some dependencies)
    Parameters
    · dep_context (DepContext) – The execution context that determines the dependencies that should be evaluated
    · session (Session) – database session
    · verbose (bool) – whether log details on failed dependencies on info or debug log level
    are_dependents_done(**kwargs)[source]
    Checks whether the dependents of this task instance have all succeeded This is meant to be used by wait_for_downstream
    This is useful when you do not want to start processing the next schedule of a task until the dependents are done For instance if the task DROPs and recreates a table
    clear_xcom_data(**kwargs)[source]
    Clears all XCom data from the database for the task instance
    command(mark_success=False, ignore_all_deps=False, ignore_depends_on_past=False, ignore_task_deps=False, ignore_ti_state=False, local=False, pickle_id=None, raw=False, job_id=None, pool=None, cfg_path=None)[source]
    Returns a command that can be executed anywhere where airflow is installed This command is part of the message sent to executors by the orchestrator
    command_as_list(mark_success=False, ignore_all_deps=False, ignore_task_deps=False, ignore_depends_on_past=False, ignore_ti_state=False, local=False, pickle_id=None, raw=False, job_id=None, pool=None, cfg_path=None)[source]
    Returns a command that can be executed anywhere where airflow is installed This command is part of the message sent to executors by the orchestrator
    current_state(**kwargs)[source]
    Get the very latest state from the database; if a session is passed, we use it and looking up the state becomes part of the session, otherwise a new session is used
    error(**kwargs)[source]
    Forces the task instance’s state to FAILED in the database
    static generate_command(dag_id, task_id, execution_date, mark_success=False, ignore_all_deps=False, ignore_depends_on_past=False, ignore_task_deps=False, ignore_ti_state=False, local=False, pickle_id=None, file_path=None, raw=False, job_id=None, pool=None, cfg_path=None)[source]
    Generates the shell command required to execute this task instance
    Parameters
    · dag_id (unicode) – DAG ID
    · task_id (unicode) – Task ID
    · execution_date (datetime) – Execution date for the task
    · mark_success (bool) – Whether to mark the task as successful
    · ignore_all_deps (bool) – Ignore all ignorable dependencies Overrides the other ignore_* parameters
    · ignore_depends_on_past (bool) – Ignore depends_on_past parameter of DAGs (eg for Backfills)
    · ignore_task_deps (bool) – Ignore taskspecific dependencies such as depends_on_past and trigger rule
    · ignore_ti_state (bool) – Ignore the task instance’s previous failuresuccess
    · local (bool) – Whether to run the task locally
    · pickle_id (unicode) – If the DAG was serialized to the DB the ID associated with the pickled DAG
    · file_path – path to the file containing the DAG definition
    · raw – raw mode (needs more details)
    · job_id – job ID (needs more details)
    · pool (unicode) – the Airflow pool that the task should run in
    · cfg_path (basestring) – the Path to the configuration file
    Returns
    shell command that can be used to run the task instance
    get_dagrun(**kwargs)[source]
    Returns the DagRun for this TaskInstance
    Parameters
    session –
    Returns
    DagRun
    init_on_load()[source]
    Initialize the attributes that aren’t stored in the DB
    init_run_context(rawFalse)[source]
    Sets the log context
    is_eligible_to_retry()[source]
    Is the task instance eligible for retry
    is_premature
    Returns whether a task is in UP_FOR_RETRY state and its retry interval has elapsed
    key
    Returns a tuple that identifies the task instance uniquely
    next_retry_datetime()[source]
    Get datetime of the next retry if the task instance fails For exponential backoff retry_delay is used as base and will be converted to seconds
    pool_full(**kwargs)[source]
    Returns a boolean as to whether the slot pool has room for this task to run
    previous_ti
    The task instance for the task that ran before this task instance
    ready_for_retry()[source]
    Checks on whether the task instance is in the right state and timeframe to be retried
    refresh_from_db(**kwargs)[source]
    Refreshes the task instance from the database based on the primary key
    Parameters
    lock_for_update – if True indicates that the database should lock the TaskInstance (issuing a FOR UPDATE clause) until the session is committed
    try_number
    Return the try number that this task number will be when it is actually run
    If the TI is currently running, this will match the column in the database; in all other cases this will be incremented
    xcom_pull(task_ids=None, dag_id=None, key=u'return_value', include_prior_dates=False)[source]
    Pull XComs that optionally meet certain criteria
    The default value for key limits the search to XComs that were returned by other tasks (as opposed to those that were pushed manually) To remove this filter pass key=None (or any desired value)
    If a single task_id string is provided the result is the value of the most recent matching XCom from that task_id If multiple task_ids are provided a tuple of matching values is returned None is returned whenever no matches are found
    Parameters
    · key (str) – A key for the XCom If provided only XComs with matching keys will be returned The default key is 'return_value', also available as a constant XCOM_RETURN_KEY This key is automatically given to XComs returned by tasks (as opposed to being pushed manually) To remove the filter pass key=None
    · task_ids (str or iterable of strings (representing task_ids)) – Only XComs from tasks with matching ids will be pulled Can pass None to remove the filter
    · dag_id (str) – If provided only pulls XComs from this DAG If None (default) the DAG of the calling task is used
    · include_prior_dates (bool) – If False only XComs from the current execution_date are returned If True XComs from previous dates are returned as well
    xcom_push(key, value, execution_date=None)[source]
    Make an XCom available for tasks to pull
    Parameters
    · key (str) – A key for the XCom
    · value (any pickleable object) – A value for the XCom The value is pickled and stored in the database
    · execution_date (datetime) – if provided the XCom will not be visible until this date This can be used for example to send a message to a task on a future date without it being immediately visible
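    As a usage sketch of xcom_push and xcom_pull, the following pair of PythonOperator callables exchange a value through XCom; the task ids, the key name and the surrounding dag object are hypothetical, and the Airflow 1.x style provide_context=True is assumed.
    from airflow.operators.python_operator import PythonOperator

    def producer(**context):
        # push a value explicitly; a returned value would also be stored
        # under the default key 'return_value'
        context['ti'].xcom_push(key='row_count', value=42)

    def consumer(**context):
        count = context['ti'].xcom_pull(task_ids='produce', key='row_count')
        print('row_count pulled from XCom: %s' % count)

    # `dag` is assumed to be an existing DAG object
    produce = PythonOperator(task_id='produce', python_callable=producer,
                             provide_context=True, dag=dag)
    consume = PythonOperator(task_id='consume', python_callable=consumer,
                             provide_context=True, dag=dag)
    produce >> consume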
    class airflow.models.TaskReschedule(task, execution_date, try_number, start_date, end_date, reschedule_date)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    TaskReschedule tracks rescheduled task instances
    static find_for_task_instance(*args **kwargs)[source]
    Returns all task reschedules for the task instance and try number in ascending order
    Parameters
    task_instance (TaskInstance) – the task instance to find task reschedules for
    class airflow.models.User(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base
    class airflow.models.Variable(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base, airflow.utils.log.logging_mixin.LoggingMixin
    classmethod setdefault(key, default, deserialize_json=False)[source]
    Like a Python builtin dict object, setdefault returns the current value for a key, and if it isn’t there, stores the default value and returns it
    Parameters
    · key (String) – Dict key for this Variable
    · default (Mixed) – Default value to set and return if the variable isn’t already in the DB
    · deserialize_json – Store this as a JSON encoded value in the DB and unencode it when retrieving a value
    Returns
    Mixed
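    A short usage sketch for Variable; the key names are hypothetical, and Variable.get / Variable.set are companion class methods of the same model that are not listed above.
    from airflow.models import Variable

    # 'etl_batch_size' and 'etl_config' are hypothetical keys used only for this sketch
    batch_size = Variable.setdefault('etl_batch_size', '1000')  # stored if not present yet

    # serialize_json / deserialize_json handle JSON values stored in the DB
    Variable.set('etl_config', {'batch_size': 1000}, serialize_json=True)
    config = Variable.get('etl_config', deserialize_json=True)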
    class airflow.models.XCom(**kwargs)[source]
    Bases: sqlalchemy.ext.declarative.api.Base, airflow.utils.log.logging_mixin.LoggingMixin
    Base class for XCom objects
    classmethod get_many(**kwargs)[source]
    Retrieve an XCom value optionally meeting certain criteria TODO: pickling has been deprecated and JSON is preferred; pickling will be removed in Airflow 2.0
    classmethod get_one(**kwargs)[source]
    Retrieve an XCom value optionally meeting certain criteria TODO: pickling has been deprecated and JSON is preferred; pickling will be removed in Airflow 2.0
    Returns
    XCom value
    classmethod set(**kwargs)[source]
    Store an XCom value TODO: pickling has been deprecated and JSON is preferred; pickling will be removed in Airflow 2.0
    Returns
    None
    airflow.models.clear_task_instances(tis, session, activate_dag_runs=True, dag=None)[source]
    Clears a set of task instances but makes sure the running ones get killed
    Parameters
    · tis – a list of task instances
    · session – current session
    · activate_dag_runs – flag to check for active dag run
    · dag – DAG object
    airflow.models.get_fernet()[source]
    Deferred load of Fernet key
    This function could fail either because Cryptography is not installed or because the Fernet key is invalid
    Returns
    Fernet object
    Raises
    AirflowException if there’s a problem trying to load Fernet
    Hook
    Hooks are interfaces to external platforms and databases implementing a common interface when possible and acting as building blocks for operators
    class airflow.hooks.dbapi_hook.DbApiHook(*args, **kwargs)[source]
    Bases: airflow.hooks.base_hook.BaseHook
    Abstract base class for sql hooks
    bulk_dump(table tmp_file)[source]
    Dumps a database table into a tabdelimited file
    Parameters
    · table (str) – The name of the source table
    · tmp_file (str) – The path of the target file
    bulk_load(table tmp_file)[source]
    Loads a tabdelimited file into a database table
    Parameters
    · table (str) – The name of the target table
    · tmp_file (str) – The path of the file to load into the table
    get_autocommit(conn)[source]
    Get autocommit setting for the provided connection Return True if conn.autocommit is set to True Return False if conn.autocommit is not set or set to False or conn does not support autocommit
    Parameters
    conn (connection object) – Connection to get autocommit setting from
    Returns
    connection autocommit setting
    Return type
    bool
    get_conn()[source]
    Returns a connection object
    get_cursor()[source]
    Returns a cursor
    get_first(sql parametersNone)[source]
    Executes the sql and returns the first resulting row
    Parameters
    · sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
    · parameters (mapping or iterable) – The parameters to render the SQL query with
    get_pandas_df(sql parametersNone)[source]
    Executes the sql and returns a pandas dataframe
    Parameters
    · sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
    · parameters (mapping or iterable) – The parameters to render the SQL query with
    get_records(sql parametersNone)[source]
    Executes the sql and returns a set of records
    Parameters
    · sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
    · parameters (mapping or iterable) – The parameters to render the SQL query with
    insert_rows(table, rows, target_fields=None, commit_every=1000, replace=False)[source]
    A generic way to insert a set of tuples into a table a new transaction is created every commit_every rows
    Parameters
    · table (str) – Name of the target table
    · rows (iterable of tuples) – The rows to insert into the table
    · target_fields (iterable of strings) – The names of the columns to fill in the table
    · commit_every (int) – The maximum number of rows to insert in one transaction Set to 0 to insert all rows in one transaction
    · replace (bool) – Whether to replace instead of insert
    run(sql, autocommit=False, parameters=None)[source]
    Runs a command or a list of commands Pass a list of sql statements to the sql parameter to get them to execute sequentially
    Parameters
    · sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
    · autocommit (bool) – What to set the connection’s autocommit setting to before executing the query
    · parameters (mapping or iterable) – The parameters to render the SQL query with
    set_autocommit(conn autocommit)[source]
    Sets the autocommit flag on the connection
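    Since DbApiHook is abstract, the sketch below uses MySqlHook (documented further down in this section) as a concrete subclass to show the common workflow; the connection id, table names and SQL are hypothetical.
    from airflow.hooks.mysql_hook import MySqlHook

    # 'mysql_default' is the conventional connection id; tables and SQL are hypothetical
    hook = MySqlHook(mysql_conn_id='mysql_default')

    rows = hook.get_records('SELECT id, name FROM users WHERE active = 1')
    df = hook.get_pandas_df('SELECT * FROM users')

    # insert a batch of tuples, committing every 1000 rows
    hook.insert_rows(table='users_archive',
                     rows=rows,
                     target_fields=['id', 'name'],
                     commit_every=1000)

    # run one or several statements sequentially
    hook.run(['TRUNCATE TABLE staging_users',
              'INSERT INTO staging_users SELECT * FROM users'], autocommit=True)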
    class airflow.hooks.docker_hook.DockerHook(docker_conn_id='docker_default', base_url=None, version=None, tls=None)[source]
    Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
    Interact with a private Docker registry
    Parameters
    docker_conn_id (str) – ID of the Airflow connection where credentials and extra configuration are stored
    class airflow.hooks.hive_hooks.HiveCliHook(hive_cli_conn_id=u'hive_cli_default', run_as=None, mapred_queue=None, mapred_queue_priority=None, mapred_job_name=None)[source]
    Bases: airflow.hooks.base_hook.BaseHook
    Simple wrapper around the hive CLI
    It also supports the beeline, a lighter CLI that runs JDBC and is replacing the heavier traditional CLI To enable beeline set the use_beeline param in the extra field of your connection as in {"use_beeline": true}
    Note that you can also set default hive CLI parameters using the hive_cli_params to be used in your connection as in {"hive_cli_params": "-hiveconf mapred.job.tracker=some.jobtracker:444"} Parameters passed here can be overridden by run_cli’s hive_conf param
    The extra connection parameter auth gets passed as in the jdbc connection string as is
    Parameters
    · mapred_queue (str) – queue used by the Hadoop Scheduler (Capacity or Fair)
    · mapred_queue_priority (str) – priority within the job queue Possible settings include VERY_HIGH HIGH NORMAL LOW VERY_LOW
    · mapred_job_name (str) – This name will appear in the jobtracker This can make monitoring easier
    load_df(df, table, field_dict=None, delimiter=u',', encoding=u'utf8', pandas_kwargs=None, **kwargs)[source]
    Loads a pandas DataFrame into hive
    Hive data types will be inferred if not passed but column names will not be sanitized
    Parameters
    · df (DataFrame) – DataFrame to load into a Hive table
    · table (str) – target Hive table use dot notation to target a specific database
    · field_dict (OrderedDict) – mapping from column name to hive data type Note that it must be OrderedDict so as to keep columns’ order
    · delimiter (str) – field delimiter in the file
    · encoding (str) – str encoding to use when writing DataFrame to file
    · pandas_kwargs (dict) – passed to DataFrameto_csv
    · kwargs – passed to selfload_file
    load_file(filepath, table, delimiter=u',', field_dict=None, create=True, overwrite=True, partition=None, recreate=False, tblproperties=None)[source]
    Loads a local file into Hive
    Note that the table generated in Hive uses STORED AS textfile which isn’t the most efficient serialization format If a large amount of data is loaded and/or if the table gets queried considerably you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator
    Parameters
    · filepath (str) – local filepath of the file to load
    · table (str) – target Hive table use dot notation to target a specific database
    · delimiter (str) – field delimiter in the file
    · field_dict (OrderedDict) – A dictionary of the fields name in the file as keys and their Hive types as values Note that it must be OrderedDict so as to keep columns’ order
    · create (bool) – whether to create the table if it doesn’t exist
    · overwrite (bool) – whether to overwrite the data in table or partition
    · partition (dict) – target partition as a dict of partition columns and values
    · recreate (bool) – whether to drop and recreate the table at every execution
    · tblproperties (dict) – TBLPROPERTIES of the hive table being created
    run_cli(hql, schema=None, verbose=True, hive_conf=None)[source]
    Run an hql statement using the hive cli If hive_conf is specified it should be a dict and the entries will be set as key/value pairs in HiveConf
    Parameters
    hive_conf (dict) – if specified these key value pairs will be passed to hive as -hiveconf "key"="value" Note that they will be passed after the hive_cli_params and thus will override whatever values are specified in the database
    >>> hh = HiveCliHook()
    >>> result = hh.run_cli("USE airflow;")
    >>> ("OK" in result)
    True
    test_hql(hql)[source]
    Test an hql statement using the hive cli and EXPLAIN
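    A small sketch of loading a pandas DataFrame into Hive with load_df and then querying it with run_cli; the target table name is hypothetical and the hook falls back to the hive_cli_default connection.
    import pandas as pd
    from collections import OrderedDict
    from airflow.hooks.hive_hooks import HiveCliHook

    hh = HiveCliHook()  # uses the 'hive_cli_default' connection

    df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})

    # field_dict must be an OrderedDict so column order is preserved;
    # 'tmp.users_stage' is a hypothetical target table
    hh.load_df(df,
               table='tmp.users_stage',
               field_dict=OrderedDict([('id', 'INT'), ('name', 'STRING')]),
               delimiter=',')

    hh.run_cli('SELECT COUNT(*) FROM tmp.users_stage;')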
    class airflow.hooks.hive_hooks.HiveMetastoreHook(metastore_conn_id=u'metastore_default')[source]
    Bases: airflow.hooks.base_hook.BaseHook
    Wrapper to interact with the Hive Metastore
    check_for_named_partition(schema table partition_name)[source]
    Checks whether a partition with a given name exists
    Parameters
    · schema (str) – Name of hive schema (database) @table belongs to
    · table – Name of hive table @partition belongs to
    Partition
    Name of the partitions to check for (eg a=b/c=d)
    Return type
    bool
    >>> hh = HiveMetastoreHook()
    >>> t = 'static_babynames_partitioned'
    >>> hh.check_for_named_partition('airflow', t, "ds=2015-01-01")
    True
    >>> hh.check_for_named_partition('airflow', t, "ds=xxx")
    False
    check_for_partition(schema table partition)[source]
    Checks whether a partition exists
    Parameters
    · schema (str) – Name of hive schema (database) @table belongs to
    · table – Name of hive table @partition belongs to
    Partition
    Expression that matches the partitions to check for (eg a = 'b' AND c = 'd')
    Return type
    bool
    >>> hh = HiveMetastoreHook()
    >>> t = 'static_babynames_partitioned'
    >>> hh.check_for_partition('airflow', t, "ds='2015-01-01'")
    True
    get_databases(pattern=u'*')[source]
    Get a metastore table object
    get_metastore_client()[source]
    Returns a Hive thrift client
    get_partitions(schema, table_name, filter=None)[source]
    Returns a list of all partitions in a table Works only for tables with less than 32767 (java short max val) For subpartitioned table the number might easily exceed this
    >>> hh = HiveMetastoreHook()
    >>> t = 'static_babynames_partitioned'
    >>> parts = hh.get_partitions(schema='airflow', table_name=t)
    >>> len(parts)
    1
    >>> parts
    [{'ds': '2015-01-01'}]
    get_table(table_name, db=u'default')[source]
    Get a metastore table object
    >>> hh = HiveMetastoreHook()
    >>> t = hh.get_table(db='airflow', table_name='static_babynames')
    >>> t.tableName
    'static_babynames'
    >>> [col.name for col in t.sd.cols]
    ['state', 'year', 'name', 'gender', 'num']
    get_tables(db, pattern=u'*')[source]
    Get a metastore table object
    max_partition(schema, table_name, field=None, filter_map=None)[source]
    Returns the maximum value for all partitions with given field in a table If only one partition key exist in the table the key will be used as field filter_map should be a partition_key:partition_value map and will be used to filter out partitions
    Parameters
    · schema (str) – schema name
    · table_name (str) – table name
    · field (str) – partition key to get max partition from
    · filter_map (map) – partition_key:partition_value map used for partition filtering
    >>> hh = HiveMetastoreHook()
    >>> filter_map = {'ds': '2015-01-01', 'ds': '2014-01-01'}
    >>> t = 'static_babynames_partitioned'
    >>> hh.max_partition(schema='airflow', table_name=t, field='ds', filter_map=filter_map)
    '2015-01-01'
    table_exists(table_name, db=u'default')[source]
    Check if table exists
    >>> hh = HiveMetastoreHook()
    >>> hh.table_exists(db='airflow', table_name='static_babynames')
    True
    >>> hh.table_exists(db='airflow', table_name='does_not_exist')
    False
    class airflow.hooks.hive_hooks.HiveServer2Hook(hiveserver2_conn_id=u'hiveserver2_default')[source]
    Bases: airflow.hooks.base_hook.BaseHook
    Wrapper around the pyhive library
    Note that the default authMechanism is PLAIN to override it you can specify it in the extra of your connection in the UI as in
    get_pandas_df(hql, schema=u'default')[source]
    Get a pandas dataframe from a Hive query
    >>> hh = HiveServer2Hook()
    >>> sql = "SELECT * FROM airflow.static_babynames LIMIT 100"
    >>> df = hh.get_pandas_df(sql)
    >>> len(df.index)
    100
    get_records(hql, schema=u'default')[source]
    Get a set of records from a Hive query
    >>> hh = HiveServer2Hook()
    >>> sql = "SELECT * FROM airflow.static_babynames LIMIT 100"
    >>> len(hh.get_records(sql))
    100
    get_results(hql, schema=u'default', fetch_size=None, hive_conf=None)[source]
    Get results of the provided hql in target schema
    Parameters
    · hql – hql to be executed
    · schema – target schema, defaults to 'default'
    · fetch_size – max size of result to fetch
    · hive_conf – hive_conf to execute along with the hql
    Returns
    results of hql execution
    to_csv(hql, csv_filepath, schema=u'default', delimiter=u',', lineterminator=u'\r\n', output_header=True, fetch_size=1000, hive_conf=None)[source]
    Execute hql in target schema and write results to a csv file
    Parameters
    · hql – hql to be executed
    · csv_filepath – filepath of csv to write results into
    · schema – target schema, defaults to 'default'
    · delimiter – delimiter of the csv file
    · lineterminator – lineterminator of the csv file
    · output_header – header of the csv file
    · fetch_size – number of result rows to write into the csv file
    · hive_conf – hive_conf to execute along with the hql
    airflow.hooks.hive_hooks.get_context_from_env_var()[source]
    Extract context from env variables (e.g. dag_id, task_id and execution_date) so that they can be used inside BashOperator and PythonOperator
    Returns
    The context of interest
    class airflow.hooks.http_hook.HttpHook(method='POST', http_conn_id='http_default')[source]
    Bases: airflow.hooks.base_hook.BaseHook
    Interact with HTTP servers
    Parameters
    · http_conn_id (str) – connection that has the base API url i.e. https://www.google.com/ and optional authentication credentials Default headers can also be specified in the Extra field in json format
    · method (str) – the API method to be called
    check_response(response)[source]
    Checks the status code and raise an AirflowException exception on non 2XX or 3XX status codes
    Parameters
    response (requests.response) – A requests response object
    get_conn(headers=None)[source]
    Returns http session for use with requests
    Parameters
    headers (dict) – additional headers to be passed through as a dictionary
    run(endpoint, data=None, headers=None, extra_options=None)[source]
    Performs the request
    Parameters
    · endpoint (str) – the endpoint to be called i.e. resource/v1/query
    · data (dict) – payload to be uploaded or request parameters
    · headers (dict) – additional headers to be passed through as a dictionary
    · extra_options (dict) – additional options to be used when executing the request i.e. {'check_response': False} to avoid checking raising exceptions on non 2XX or 3XX status codes

    run_and_check(session, prepped_request, extra_options)[source]
    Grabs extra options like timeout and actually runs the request checking for the result
    Parameters
    · session (requests.Session) – the session to be used to execute the request
    · prepped_request (session.prepare_request) – the prepared request generated in run()
    · extra_options (dict) – additional options to be used when executing the request i.e. {'check_response': False} to avoid checking raising exceptions on non 2XX or 3XX status codes

    run_with_advanced_retry(_retry_args, *args, **kwargs)[source]
    Runs Hook.run() with a Tenacity decorator attached to it This is useful for connectors which might be disturbed by intermittent issues and should not instantly fail
    Parameters
    _retry_args – Arguments which define the retry behaviour See Tenacity documentation at https://github.com/jd/tenacity
    Example
    hook = HttpHook(http_conn_id='my_conn', method='GET')
    retry_args = dict(
        wait=tenacity.wait_exponential(),
        stop=tenacity.stop_after_attempt(10),
        retry=requests.exceptions.ConnectionError
    )
    hook.run_with_advanced_retry(
        endpoint='v1/test',
        _retry_args=retry_args
    )
    class airflow.hooks.druid_hook.DruidDbApiHook(*args, **kwargs)[source]
    Bases: airflow.hooks.dbapi_hook.DbApiHook
    Interact with Druid broker
    This hook is purely for users to query druid broker For ingestion please use DruidHook
    get_conn()[source]
    Establish a connection to druid broker
    get_pandas_df(sql parametersNone)[source]
    Executes the sql and returns a pandas dataframe
    Parameters
    · sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
    · parameters (mapping or iterable) – The parameters to render the SQL query with
    get_uri()[source]
    Get the connection uri for druid broker
    e.g. druid://localhost:8082/druid/v2/sql/
    insert_rows(table, rows, target_fields=None, commit_every=1000)[source]
    A generic way to insert a set of tuples into a table a new transaction is created every commit_every rows
    Parameters
    · table (str) – Name of the target table
    · rows (iterable of tuples) – The rows to insert into the table
    · target_fields (iterable of strings) – The names of the columns to fill in the table
    · commit_every (int) – The maximum number of rows to insert in one transaction Set to 0 to insert all rows in one transaction
    · replace (bool) – Whether to replace instead of insert
    set_autocommit(conn autocommit)[source]
    Sets the autocommit flag on the connection
    class airflow.hooks.druid_hook.DruidHook(druid_ingest_conn_id='druid_ingest_default', timeout=1, max_ingestion_time=None)[source]
    Bases: airflow.hooks.base_hook.BaseHook
    Connection to Druid overlord for ingestion
    Parameters
    · druid_ingest_conn_id (str) – The connection id to the Druid overlord machine which accepts index jobs
    · timeout (int) – The interval between polling the Druid job for the status of the ingestion job Must be greater than or equal to 1
    · max_ingestion_time (int) – The maximum ingestion time before assuming the job failed
    class airflow.hooks.hdfs_hook.HDFSHook(hdfs_conn_id='hdfs_default', proxy_user=None, autoconfig=False)[source]
    Bases: airflow.hooks.base_hook.BaseHook
    Interact with HDFS This class is a wrapper around the snakebite library
    Parameters
    · hdfs_conn_id – Connection id to fetch connection info
    · proxy_user (str) – effective user for HDFS operations
    · autoconfig (bool) – use snakebite’s automatically configured client
    get_conn()[source]
    Returns a snakebite HDFSClient object
    class airflow.hooks.mssql_hook.MsSqlHook(*args, **kwargs)[source]
    Bases: airflow.hooks.dbapi_hook.DbApiHook
    Interact with Microsoft SQL Server
    get_conn()[source]
    Returns a mssql connection object
    set_autocommit(conn autocommit)[source]
    Sets the autocommit flag on the connection
    class airflow.hooks.mysql_hook.MySqlHook(*args, **kwargs)[source]
    Bases: airflow.hooks.dbapi_hook.DbApiHook
    Interact with MySQL
    You can specify charset in the extra field of your connection as {"charset": "utf8"} Also you can choose cursor as {"cursor": "SSCursor"} Refer to the MySQLdb.cursors for more details
    bulk_dump(table tmp_file)[source]
    Dumps a database table into a tabdelimited file
    bulk_load(table tmp_file)[source]
    Loads a tabdelimited file into a database table
    get_autocommit(conn)[source]
    MySql connection gets autocommit in a different way
    Parameters
    conn (connection object) – connection to get autocommit setting from
    Returns
    connection autocommit setting
    Return type
    bool
    get_conn()[source]
    Returns a mysql connection object
    set_autocommit(conn autocommit)[source]
    MySql connection sets autocommit in a different way
    class airflow.hooks.pig_hook.PigCliHook(pig_cli_conn_id='pig_cli_default')[source]
    Bases: airflow.hooks.base_hook.BaseHook
    Simple wrapper around the pig CLI
    Note that you can also set default pig CLI properties using the pig_properties to be used in your connection as in {"pig_properties": "-Dpig.tmpfilecompression=true"}
    run_cli(pig, verbose=True)[source]
    Run a pig script using the pig cli
    >>> ph = PigCliHook()
    >>> result = ph.run_cli("ls /;")
    >>> ("hdfs://" in result)
    True
    class airflow.hooks.postgres_hook.PostgresHook(*args, **kwargs)[source]
    Bases: airflow.hooks.dbapi_hook.DbApiHook
    Interact with Postgres You can specify ssl parameters in the extra field of your connection as {"sslmode": "require", "sslcert": "/path/to/cert.pem", etc}
    Note: For Redshift use keepalives_idle in the extra connection parameters and set it to less than 300 seconds
    bulk_dump(table tmp_file)[source]
    Dumps a database table into a tabdelimited file
    bulk_load(table tmp_file)[source]
    Loads a tabdelimited file into a database table
    copy_expert(sql, filename, open=open)[source]
    Executes SQL using psycopg2 copy_expert method Necessary to execute COPY command without access to a superuser
    Note: if this method is called with a "COPY FROM" statement and the specified input file does not exist it creates an empty file and no data is loaded but the operation succeeds So if users want to be aware when the input file does not exist they have to check its existence by themselves
    get_conn()[source]
    Returns a connection object
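    A brief sketch of copy_expert for exporting and re-loading data with COPY; the connection id, table names and file path are hypothetical.
    from airflow.hooks.postgres_hook import PostgresHook

    pg = PostgresHook(postgres_conn_id='postgres_default')

    # Export a table to CSV with COPY ... TO STDOUT, no superuser required;
    # 'public.users' and '/tmp/users.csv' are hypothetical
    pg.copy_expert("COPY public.users TO STDOUT WITH CSV HEADER",
                   filename='/tmp/users.csv')

    # Load the file back into a staging table with COPY ... FROM STDIN
    pg.copy_expert("COPY staging.users FROM STDIN WITH CSV HEADER",
                   filename='/tmp/users.csv')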
    class airflow.hooks.presto_hook.PrestoHook(*args, **kwargs)[source]
    Bases: airflow.hooks.dbapi_hook.DbApiHook
    Interact with Presto through PyHive
    >>> ph = PrestoHook()
    >>> sql = "SELECT count(1) AS num FROM airflow.static_babynames"
    >>> ph.get_records(sql)
    [[340698]]
    get_conn()[source]
    Returns a connection object
    get_first(hql parametersNone)[source]
    Returns only the first row regardless of how many rows the query returns
    get_pandas_df(hql parametersNone)[source]
    Get a pandas dataframe from a sql query
    get_records(hql parametersNone)[source]
    Get a set of records from Presto
    insert_rows(table, rows, target_fields=None)[source]
    A generic way to insert a set of tuples into a table
    Parameters
    · table (str) – Name of the target table
    · rows (iterable of tuples) – The rows to insert into the table
    · target_fields (iterable of strings) – The names of the columns to fill in the table
    run(hql parametersNone)[source]
    Execute the statement against Presto Can be used to create views
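A hedged usage sketch, assuming a presto_default connection and the airflow.static_babynames table from the example above:
from airflow.hooks.presto_hook import PrestoHook

hook = PrestoHook(presto_conn_id="presto_default")
# Fetch the query result as a pandas DataFrame.
df = hook.get_pandas_df("SELECT count(1) AS num FROM airflow.static_babynames")
print(df)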
class airflow.hooks.S3_hook.S3Hook(aws_conn_id='aws_default', verify=None)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS S3, using the boto3 library
    check_for_bucket(bucket_name)[source]
    Check if bucket_name exists
    Parameters
    bucket_name (str) – the name of the bucket
    check_for_key(key bucket_nameNone)[source]
    Checks if a key exists in a bucket
    Parameters
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which the file is stored
    check_for_prefix(bucket_name prefix delimiter)[source]
    Checks that a prefix exists in a bucket
    Parameters
    · bucket_name (str) – the name of the bucket
    · prefix (str) – a key prefix
    · delimiter (str) – the delimiter marks key hierarchy
    check_for_wildcard_key(wildcard_key bucket_nameNone delimiter'')[source]
    Checks that a key matching a wildcard expression exists in a bucket
    Parameters
    · wildcard_key (str) – the path to the key
    · bucket_name (str) – the name of the bucket
    · delimiter (str) – the delimiter marks key hierarchy
copy_object(source_bucket_key, dest_bucket_key, source_bucket_name=None, dest_bucket_name=None, source_version_id=None)[source]
Creates a copy of an object that is already stored in S3
Note: the S3 connection used here needs to have access to both source and destination bucket/key
Parameters
· source_bucket_key (str) –
The key of the source object
It can be either full s3:// style url or relative path from root level
When it's specified as a full s3:// url, please omit source_bucket_name
    · dest_bucket_key (str) –
    The key of the object to copy to
    The convention to specify dest_bucket_key is the same as source_bucket_key
    · source_bucket_name (str) –
    Name of the S3 bucket where the source object is in
    It should be omitted when source_bucket_key is provided as a full s3 url
    · dest_bucket_name (str) –
    Name of the S3 bucket to where the object is copied
    It should be omitted when dest_bucket_key is provided as a full s3 url
    · source_version_id (str) – Version ID of the source object (OPTIONAL)
    delete_objects(bucket keys)[source]
    Parameters
    · bucket (str) – Name of the bucket in which you are going to delete object(s)
    · keys (str or list) –
    The key(s) to delete from S3 bucket
    When keys is a string it’s supposed to be the key name of the single object to delete
    When keys is a list it’s supposed to be the list of the keys to delete
    get_bucket(bucket_name)[source]
    Returns a boto3S3Bucket object
    Parameters
    bucket_name (str) – the name of the bucket
    get_key(key bucket_nameNone)[source]
    Returns a boto3s3Object
    Parameters
    · key (str) – the path to the key
    · bucket_name (str) – the name of the bucket
    get_wildcard_key(wildcard_key bucket_nameNone delimiter'')[source]
    Returns a boto3s3Object object matching the wildcard expression
    Parameters
    · wildcard_key (str) – the path to the key
    · bucket_name (str) – the name of the bucket
    · delimiter (str) – the delimiter marks key hierarchy
    list_keys(bucket_name prefix'' delimiter'' page_sizeNone max_itemsNone)[source]
    Lists keys in a bucket under prefix and not containing delimiter
    Parameters
    · bucket_name (str) – the name of the bucket
    · prefix (str) – a key prefix
    · delimiter (str) – the delimiter marks key hierarchy
    · page_size (int) – pagination size
    · max_items (int) – maximum items to return
    list_prefixes(bucket_name prefix'' delimiter'' page_sizeNone max_itemsNone)[source]
    Lists prefixes in a bucket under prefix
    Parameters
    · bucket_name (str) – the name of the bucket
    · prefix (str) – a key prefix
    · delimiter (str) – the delimiter marks key hierarchy
    · page_size (int) – pagination size
    · max_items (int) – maximum items to return
    load_bytes(bytes_data key bucket_nameNone replaceFalse encryptFalse)[source]
    Loads bytes to S3
    This is provided as a convenience to drop a string in S3 It uses the boto infrastructure to ship a file to s3
    Parameters
    · bytes_data (bytes) – bytes to set as content for the key
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which to store the file
    · replace (bool) – A flag to decide whether or not to overwrite the key if it already exists
    · encrypt (bool) – If True the file will be encrypted on the serverside by S3 and will be stored in an encrypted form while at rest in S3
    load_file(filename key bucket_nameNone replaceFalse encryptFalse)[source]
    Loads a local file to S3
    Parameters
    · filename (str) – name of the file to load
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which to store the file
    · replace (bool) – A flag to decide whether or not to overwrite the key if it already exists If replace is False and the key exists an error will be raised
    · encrypt (bool) – If True the file will be encrypted on the serverside by S3 and will be stored in an encrypted form while at rest in S3
    load_string(string_data key bucket_nameNone replaceFalse encryptFalse encoding'utf8')[source]
    Loads a string to S3
    This is provided as a convenience to drop a string in S3 It uses the boto infrastructure to ship a file to s3
    Parameters
    · string_data (str) – str to set as content for the key
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which to store the file
    · replace (bool) – A flag to decide whether or not to overwrite the key if it already exists
    · encrypt (bool) – If True the file will be encrypted on the serverside by S3 and will be stored in an encrypted form while at rest in S3
    read_key(key bucket_nameNone)[source]
    Reads a key from S3
    Parameters
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which the file is stored
    select_key(key bucket_nameNone expression'SELECT * FROM S3Object' expression_type'SQL' input_serializationNone output_serializationNone)[source]
    Reads a key with S3 Select
    Parameters
    · key (str) – S3 key that will point to the file
    · bucket_name (str) – Name of the bucket in which the file is stored
    · expression (str) – S3 Select expression
    · expression_type (str) – S3 Select expression type
    · input_serialization (dict) – S3 Select input data serialization format
    · output_serialization (dict) – S3 Select output data serialization format
    Returns
    retrieved subset of original data by S3 Select
    Return type
    str
See also
For more details about S3 Select parameters: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.select_object_content
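A minimal usage sketch for the S3Hook methods above; the bucket name "my-bucket" and key "reports/latest.csv" are illustrative assumptions:
from airflow.hooks.S3_hook import S3Hook

hook = S3Hook(aws_conn_id="aws_default")
if not hook.check_for_key("reports/latest.csv", bucket_name="my-bucket"):
    # Upload a small string object, overwriting any existing key.
    hook.load_string("id,value\n1,42\n", key="reports/latest.csv",
                     bucket_name="my-bucket", replace=True)
# Read the object back as a string.
print(hook.read_key("reports/latest.csv", bucket_name="my-bucket"))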
class airflow.hooks.slack_hook.SlackHook(token=None, slack_conn_id=None)[source]
Bases: airflow.hooks.base_hook.BaseHook
Interact with Slack, using the slackclient library
class airflow.hooks.sqlite_hook.SqliteHook(*args, **kwargs)[source]
Bases: airflow.hooks.dbapi_hook.DbApiHook
Interact with SQLite
get_conn()[source]
Returns a sqlite connection object
Community contributed hooks
class airflow.contrib.hooks.aws_dynamodb_hook.AwsDynamoDBHook(table_keys=None, table_name=None, region_name=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS DynamoDB
    Parameters
    · table_keys (list) – partition key and sort key
    · table_name (str) – target DynamoDB table
· region_name (str) – aws region name (example: us-east-1)
write_batch_data(items)[source]
Write batch items to dynamodb table with provisioned throughput capacity
class airflow.contrib.hooks.aws_hook.AwsHook(aws_conn_id='aws_default', verify=None)[source]
Bases: airflow.hooks.base_hook.BaseHook
Interact with AWS. This class is a thin wrapper around the boto3 python library
get_credentials(region_name=None)[source]
Get the underlying botocore.Credentials object
This contains the attributes: access_key, secret_key and token
get_session(region_name=None)[source]
Get the underlying boto3.session
class airflow.contrib.hooks.aws_lambda_hook.AwsLambdaHook(function_name, region_name=None, log_type='None', qualifier='$LATEST', invocation_type='RequestResponse', *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Lambda
Parameters
· function_name (str) – AWS Lambda Function Name
· region_name (str) – AWS Region Name (example: us-west-2)
· log_type (str) – Tail Invocation Request
· qualifier (str) – AWS Lambda Function Version or Alias Name
· invocation_type (str) – AWS Lambda Invocation Type (RequestResponse, Event etc)
invoke_lambda(payload)[source]
Invoke Lambda Function
class airflow.contrib.hooks.aws_firehose_hook.AwsFirehoseHook(delivery_stream, region_name=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Kinesis Firehose
Parameters
· delivery_stream (str) – Name of the delivery stream
· region_name (str) – AWS region name (example: us-east-1)
get_conn()[source]
Returns AwsHook connection object
put_records(records)[source]
Write batch records to Kinesis Firehose
class airflow.contrib.hooks.bigquery_hook.BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=True)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook, airflow.hooks.dbapi_hook.DbApiHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with BigQuery. This hook uses the Google Cloud Platform connection
get_conn()[source]
Returns a BigQuery PEP 249 connection object
get_pandas_df(sql, parameters=None, dialect=None)[source]
Returns a Pandas DataFrame for the results produced by a BigQuery query. The DbApiHook method must be overridden because Pandas doesn't support PEP 249 connections, except for SQLite. See:
https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L447 https://github.com/pydata/pandas/issues/6900
Parameters
· sql (str) – The BigQuery SQL to execute
· parameters (mapping or iterable) – The parameters to render the SQL query with (not used, leave to override superclass method)
· dialect (str in {'legacy', 'standard'}) – Dialect of BigQuery SQL – legacy SQL or standard SQL, defaults to use self.use_legacy_sql if not specified
get_service()[source]
Returns a BigQuery service object
insert_rows(table, rows, target_fields=None, commit_every=1000)[source]
Insertion is currently unsupported. Theoretically, you could use BigQuery's streaming API to insert rows into a table, but this hasn't been implemented
table_exists(project_id, dataset_id, table_id)[source]
Checks for the existence of a table in Google BigQuery
Parameters
· project_id (str) – The Google cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project
· dataset_id (str) – The name of the dataset in which to look for the table
· table_id (str) – The name of the table to check the existence of
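A hedged sketch combining table_exists and get_pandas_df; the project, dataset and table names are placeholders:
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

hook = BigQueryHook(bigquery_conn_id="bigquery_default", use_legacy_sql=False)
if hook.table_exists("my-project", "analytics", "daily_events"):
    # Standard SQL syntax, because use_legacy_sql=False.
    df = hook.get_pandas_df(
        "SELECT COUNT(*) AS n FROM `my-project.analytics.daily_events`")
    print(df)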
class airflow.contrib.hooks.cassandra_hook.CassandraHook(cassandra_conn_id='cassandra_default')[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
Hook used to interact with Cassandra
Contact points can be specified as a comma-separated string in the 'hosts' field of the connection
Port can be specified in the port field of the connection
If SSL is enabled in Cassandra, pass in a dict in the extra field as kwargs for ssl.wrap_socket(). For example:
{
    'ssl_options': {
        'ca_certs': PATH_TO_CA_CERTS
    }
}
Default load balancing policy is RoundRobinPolicy. To specify a different LB policy:
· DCAwareRoundRobinPolicy
{
    'load_balancing_policy': 'DCAwareRoundRobinPolicy',
    'load_balancing_policy_args': {
        'local_dc': LOCAL_DC_NAME,                      # optional
        'used_hosts_per_remote_dc': SOME_INT_VALUE      # optional
    }
}
· WhiteListRoundRobinPolicy
{
    'load_balancing_policy': 'WhiteListRoundRobinPolicy',
    'load_balancing_policy_args': {
        'hosts': ['HOST1', 'HOST2', 'HOST3']
    }
}
· TokenAwarePolicy
{
    'load_balancing_policy': 'TokenAwarePolicy',
    'load_balancing_policy_args': {
        'child_load_balancing_policy': CHILD_POLICY_NAME,        # optional
        'child_load_balancing_policy_args': { ... }              # optional
    }
}
For details of the Cluster config, see cassandra.cluster
    get_conn()[source]
    Returns a cassandra Session object
    record_exists(table keys)[source]
    Checks if a record exists in Cassandra
    Parameters
    · table (str) – Target Cassandra table Use dot notation to target a specific keyspace
    · keys (dict) – The keys and their values to check the existence
    shutdown_cluster()[source]
    Closes all sessions and connections associated with this Cluster
    table_exists(table)[source]
    Checks if a table exists in Cassandra
    Parameters
    table (str) – Target Cassandra table Use dot notation to target a specific keyspace
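A minimal sketch of the CassandraHook methods above; the keyspace, table and key values are illustrative assumptions:
from airflow.contrib.hooks.cassandra_hook import CassandraHook

hook = CassandraHook(cassandra_conn_id="cassandra_default")
# Dot notation targets a specific keyspace.
print(hook.table_exists("my_keyspace.users"))
print(hook.record_exists("my_keyspace.users", {"user_id": "42"}))
hook.shutdown_cluster()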
class airflow.contrib.hooks.cloudant_hook.CloudantHook(cloudant_conn_id='cloudant_default')[source]
Bases: airflow.hooks.base_hook.BaseHook
Interact with Cloudant
This class is a thin wrapper around the cloudant python library. See the documentation here
db()[source]
Returns the Database object for this hook
See the documentation for cloudant-python here: https://github.com/cloudant-labs/cloudant-python
class airflow.contrib.hooks.databricks_hook.DatabricksHook(databricks_conn_id='databricks_default', timeout_seconds=180, retry_limit=3, retry_delay=1.0)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with Databricks
run_now(json)[source]
Utility function to call the api/2.0/jobs/run-now endpoint
Parameters
json (dict) – The data used in the body of the request to the run-now endpoint
Returns
the run_id as a string
Return type
str
submit_run(json)[source]
Utility function to call the api/2.0/jobs/runs/submit endpoint
Parameters
json (dict) – The data used in the body of the request to the submit endpoint
Returns
the run_id as a string
Return type
str
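A hedged sketch of submit_run; the cluster spec and notebook path below are illustrative assumptions, see the Databricks runs/submit API for the full payload schema:
from airflow.contrib.hooks.databricks_hook import DatabricksHook

hook = DatabricksHook(databricks_conn_id="databricks_default")
# Example payload: an ephemeral cluster running a notebook task.
run_id = hook.submit_run({
    "new_cluster": {"spark_version": "7.3.x-scala2.12", "num_workers": 2},
    "notebook_task": {"notebook_path": "/Users/me/etl_notebook"},
})
print("run_id:", run_id)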
class airflow.contrib.hooks.datastore_hook.DatastoreHook(datastore_conn_id='google_cloud_datastore_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Datastore. This hook uses the Google Cloud Platform connection
This object is not threads safe. If you want to make multiple requests simultaneously, you will need to create a hook per thread
allocate_ids(partialKeys)[source]
Allocate IDs for incomplete keys, see https://cloud.google.com/datastore/docs/reference/rest/v1/projects/allocateIds
    Parameters
    partialKeys – a list of partial keys
    Returns
    a list of full keys
    begin_transaction()[source]
    Get a new transaction handle
    See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/beginTransaction
    Returns
    a transaction handle
    commit(body)[source]
    Commit a transaction optionally creating deleting or modifying some entities
    See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/commit
    Parameters
    body – the body of the commit request
    Returns
    the response body of the commit request
    delete_operation(name)[source]
    Deletes the longrunning operation
    Parameters
    name – the name of the operation resource
    export_to_storage_bucket(bucket namespaceNone entity_filterNone labelsNone)[source]
    Export entities from Cloud Datastore to Cloud Storage for backup
    get_conn(version'v1')[source]
    Returns a Google Cloud Storage service object
    get_operation(name)[source]
    Gets the latest state of a longrunning operation
    Parameters
    name – the name of the operation resource
    import_from_storage_bucket(bucket file namespaceNone entity_filterNone labelsNone)[source]
    Import a backup from Cloud Storage to Cloud Datastore
    lookup(keys read_consistencyNone transactionNone)[source]
    Lookup some entities by key
    See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/lookup
    Parameters
    · keys – the keys to lookup
    · read_consistency – the read consistency to use default strong or eventual Cannot be used with a transaction
    · transaction – the transaction to use if any
    Returns
    the response body of the lookup request
    poll_operation_until_done(name polling_interval_in_seconds)[source]
    Poll backup operation state until it’s completed
    rollback(transaction)[source]
    Roll back a transaction
    See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/rollback
    Parameters
    transaction – the transaction to roll back
    run_query(body)[source]
    Run a query for entities
    See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/runQuery
    Parameters
    body – the body of the query request
    Returns
    the batch of query results
class airflow.contrib.hooks.discord_webhook_hook.DiscordWebhookHook(http_conn_id=None, webhook_endpoint=None, message='', username=None, avatar_url=None, tts=False, proxy=None, *args, **kwargs)[source]
Bases: airflow.hooks.http_hook.HttpHook
This hook allows you to post messages to Discord using incoming webhooks. Takes a Discord connection ID with a default relative webhook endpoint. The default endpoint can be overridden using the webhook_endpoint parameter (https://discordapp.com/developers/docs/resources/webhook)
Each Discord webhook can be pre-configured to use a specific username and avatar_url. You can override these defaults in this hook
Parameters
· http_conn_id (str) – Http connection ID with host as "https://discord.com/api/" and default webhook endpoint in the extra field in the form of {"webhook_endpoint": "webhooks/{webhook.id}/{webhook.token}"}
· webhook_endpoint (str) – Discord webhook endpoint in the form of "webhooks/{webhook.id}/{webhook.token}"
· message (str) – The message you want to send to your Discord channel (max 2000 characters)
· username (str) – Override the default username of the webhook
· avatar_url (str) – Override the default avatar of the webhook
· tts (bool) – Is a text-to-speech message
· proxy (str) – Proxy to use to make the Discord webhook call
    execute()[source]
    Execute the Discord webhook call
class airflow.contrib.hooks.emr_hook.EmrHook(emr_conn_id=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS EMR. emr_conn_id is only necessary for using the create_job_flow method
create_job_flow(job_flow_overrides)[source]
Creates a job flow using the config from the EMR connection. Keys of the json extra hash may have the arguments of the boto3 run_job_flow method. Overrides for this config may be passed as the job_flow_overrides
class airflow.contrib.hooks.fs_hook.FSHook(conn_id='fs_default')[source]
Bases: airflow.hooks.base_hook.BaseHook
Allows for interaction with a file server
Connection should have a name and a path specified under extra
example: Conn Id: fs_test, Conn Type: File (path), Host / Schema / Login / Password / Port: empty, Extra: {"path": "/tmp"}
class airflow.contrib.hooks.ftp_hook.FTPHook(ftp_conn_id='ftp_default')[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with FTP
    Errors that may occur throughout but should be handled downstream
    close_conn()[source]
    Closes the connection An error will occur if the connection wasn’t ever opened
    create_directory(path)[source]
    Creates a directory on the remote system
    Parameters
    path (str) – full path to the remote directory to create
    delete_directory(path)[source]
    Deletes a directory on the remote system
    Parameters
    path (str) – full path to the remote directory to delete
    delete_file(path)[source]
    Removes a file on the FTP Server
    Parameters
    path (str) – full path to the remote file
    describe_directory(path)[source]
    Returns a dictionary of {filename {attributes}} for all files on the remote system (where the MLSD command is supported)
    Parameters
    path (str) – full path to the remote directory
    get_conn()[source]
    Returns a FTP connection object
    get_mod_time(path)[source]
    Returns a datetime object representing the last time the file was modified
    Parameters
    path (string) – remote file path
    get_size(path)[source]
    Returns the size of a file (in bytes)
    Parameters
    path (string) – remote file path
    list_directory(path nlstFalse)[source]
    Returns a list of files on the remote system
    Parameters
    path (str) – full path to the remote directory to list
    rename(from_name to_name)[source]
    Rename a file
    Parameters
    · from_name – rename file from name
    · to_name – rename file to name
retrieve_file(remote_full_path, local_full_path_or_buffer, callback=None)[source]
Transfers the remote file to a local location
If local_full_path_or_buffer is a string path, the file will be put at that location; if it is a file-like buffer, the file will be written to the buffer but not closed
Parameters
· remote_full_path (str) – full path to the remote file
· local_full_path_or_buffer (str or file-like buffer) – full path to the local file or a file-like buffer
· callback (callable) – callback which is called each time a block of data is read. If you do not use a callback, these blocks will be written to the file or buffer passed in; if you do pass in a callback, note that writing to a file or buffer will need to be handled inside the callback [default: output_handle.write()]
Example:
hook = FTPHook(ftp_conn_id='my_conn')
remote_path = '/path/to/remote/file'
local_path = '/path/to/local/file'

# with a custom callback (in this case displaying progress on each read)
def print_progress(percent_progress):
    self.log.info('Percent Downloaded: %s%%' % percent_progress)

total_downloaded = 0
total_file_size = hook.get_size(remote_path)
output_handle = open(local_path, 'wb')

def write_to_file_with_progress(data):
    total_downloaded += len(data)
    output_handle.write(data)
    percent_progress = (total_downloaded / total_file_size) * 100
    print_progress(percent_progress)

hook.retrieve_file(remote_path, None, callback=write_to_file_with_progress)

# without a custom callback, data is written to the local_path
hook.retrieve_file(remote_path, local_path)
store_file(remote_full_path, local_full_path_or_buffer)[source]
Transfers a local file to the remote location
If local_full_path_or_buffer is a string path, the file will be read from that location; if it is a file-like buffer, the file will be read from the buffer but not closed
Parameters
· remote_full_path (str) – full path to the remote file
· local_full_path_or_buffer (str or file-like buffer) – full path to the local file or a file-like buffer
class airflow.contrib.hooks.ftp_hook.FTPSHook(ftp_conn_id='ftp_default')[source]
Bases: airflow.contrib.hooks.ftp_hook.FTPHook
get_conn()[source]
Returns a FTPS connection object
class airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook(gcp_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
A base hook for Google cloud-related hooks. Google cloud has a shared REST API client that is built in the same way no matter which service you use. This class helps construct and authorize the credentials needed to then call apiclient.discovery.build() to actually discover and build a client for a Google cloud service
The class also contains some miscellaneous helper functions
All hooks derived from this base hook use the 'Google Cloud Platform' connection type. Two ways of authentication are supported:
Default credentials: Only the 'Project Id' is required. You'll need to have set up default credentials, such as by the GOOGLE_APPLICATION_DEFAULT environment variable or from the metadata server on Google Compute Engine
JSON key file: Specify 'Project Id', 'Key Path' and 'Scope'
Legacy P12 key files are not supported
    class airflowcontribhooksgcp_container_hookGKEClusterHook(project_id location)[source]
    Bases airflowhooksbase_hookBaseHook
    create_cluster(cluster retry timeout)[source]
    Creates a cluster consisting of the specified number and type of Google Compute Engine instances
    Parameters
    · cluster (dict or googlecloudcontainer_v1typesCluster) – A Cluster protobuf or dict If dict is provided it must be of the same form as the protobuf message googlecloudcontainer_v1typesCluster
    · retry (googleapi_coreretryRetry) – A retry object (googleapi_coreretryRetry) used to retry requests If None is specified requests will not be retried
    · timeout (float) – The amount of time in seconds to wait for the request to complete Note that if retry is specified the timeout applies to each individual attempt
    Returns
    The full url to the new or existing cluster
    raises
    ParseError On JSON parsing problems when trying to convert dict AirflowException cluster is not dict type nor Cluster proto type
    delete_cluster(name retry timeout)[source]
    Deletes the cluster including the Kubernetes endpoint and all worker nodes Firewalls and routes that were configured during cluster creation are also deleted Other Google Compute Engine resources that might be in use by the cluster (eg load balancer resources) will not be deleted if they weren’t present at the initial create time
    Parameters
    · name (str) – The name of the cluster to delete
    · retry (googleapi_coreretryRetry) – Retry object used to determine whenif to retry requests If None is specified requests will not be retried
    · timeout (float) – The amount of time in seconds to wait for the request to complete Note that if retry is specified the timeout applies to each individual attempt
    Returns
    The full url to the delete operation if successful else None
    get_cluster(name retry timeout)[source]
    Gets details of specified cluster
    Parameters
    · name (str) – The name of the cluster to retrieve
    · retry (googleapi_coreretryRetry) – A retry object used to retry requests If None is specified requests will not be retried
    · timeout (float) – The amount of time in seconds to wait for the request to complete Note that if retry is specified the timeout applies to each individual attempt
    Returns
    A googlecloudcontainer_v1typesCluster instance
    get_operation(operation_name)[source]
    Fetches the operation from Google Cloud
    Parameters
    operation_name (str) – Name of operation to fetch
    Returns
    The new updated operation from Google Cloud
    wait_for_operation(operation)[source]
    Given an operation continuously fetches the status from Google Cloud until either completion or an error occurring
    Parameters
    operation (A googlecloudcontainer_V1gapicenumsOperator) – The Operation to wait for
    Returns
    A new updated operation fetched from Google Cloud
    class airflowcontribhooksgcp_dataflow_hookDataFlowHook(gcp_conn_id'google_cloud_default' delegate_toNone poll_sleep10)[source]
    Bases airflowcontribhooksgcp_api_base_hookGoogleCloudBaseHook
    get_conn()[source]
    Returns a Google Cloud Dataflow service object
    class airflowcontribhooksgcp_dataproc_hookDataProcHook(gcp_conn_id'google_cloud_default' delegate_toNone api_version'v1beta2')[source]
    Bases airflowcontribhooksgcp_api_base_hookGoogleCloudBaseHook
    Hook for Google Cloud Dataproc APIs
    await(operation)
    Awaits for Google Cloud Dataproc Operation to complete
    get_conn()[source]
    Returns a Google Cloud Dataproc service object
    wait(operation)[source]
    Awaits for Google Cloud Dataproc Operation to complete
class airflow.contrib.hooks.gcp_mlengine_hook.MLEngineHook(gcp_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
create_job(project_id, job, use_existing_job_fn=None)[source]
Launches a MLEngine job and waits for it to reach a terminal state
Parameters
· project_id (str) – The Google Cloud project id within which MLEngine job will be launched
· job (dict) –
MLEngine Job object that should be provided to the MLEngine API, such as:
{
    'jobId': 'my_job_id',
    'trainingInput': {
        'scaleTier': 'STANDARD_1',
        ...
    }
}
· use_existing_job_fn (function) – In case that a MLEngine job with the same job_id already exists, this method (if provided) will decide whether we should use this existing job, continue waiting for it to finish, and return the job object. It should accept a MLEngine job object and return a boolean value indicating whether it is OK to reuse the existing job. If 'use_existing_job_fn' is not provided, we by default reuse the existing MLEngine job
    Returns
    The MLEngine job object if the job successfully reach a terminal state (which might be FAILED or CANCELLED state)
    Return type
    dict
    create_model(project_id model)[source]
    Create a Model Blocks until finished
    create_version(project_id model_name version_spec)[source]
    Creates the Version on Google Cloud ML Engine
    Returns the operation if the version was created successfully and raises an error otherwise
    delete_version(project_id model_name version_name)[source]
    Deletes the given version of a model Blocks until finished
    get_conn()[source]
    Returns a Google MLEngine service object
    get_model(project_id model_name)[source]
    Gets a Model Blocks until finished
    list_versions(project_id model_name)[source]
    Lists all available versions of a model Blocks until finished
    set_default_version(project_id model_name version_name)[source]
    Sets a version to be the default Blocks until finished
class airflow.contrib.hooks.gcp_pubsub_hook.PubSubHook(gcp_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Hook for accessing Google PubSub
    The GCP project against which actions are applied is determined by the project embedded in the Connection referenced by gcp_conn_id
    acknowledge(project subscription ack_ids)[source]
    Pulls up to max_messages messages from PubSub subscription
    Parameters
    · project (str) – the GCP project name or ID in which to create the topic
    · subscription (str) – the PubSub subscription name to delete do not include the projects{project}topics’ prefix
    · ack_ids (list) – List of ReceivedMessage ackIds from a previous pull response
    create_subscription(topic_project topic subscriptionNone subscription_projectNone ack_deadline_secs10 fail_if_existsFalse)[source]
    Creates a PubSub subscription if it does not already exist
    Parameters
    · topic_project (str) – the GCP project ID of the topic that the subscription will be bound to
    · topic (str) – the PubSub topic name that the subscription will be bound to create do not include the projects{project}subscriptions prefix
    · subscription (str) – the PubSub subscription name If empty a random name will be generated using the uuid module
    · subscription_project (str) – the GCP project ID where the subscription will be created If unspecified topic_project will be used
    · ack_deadline_secs (int) – Number of seconds that a subscriber has to acknowledge each message pulled from the subscription
    · fail_if_exists (bool) – if set raise an exception if the topic already exists
    Returns
    subscription name which will be the systemgenerated value if the subscription parameter is not supplied
    Return type
    str
    create_topic(project topic fail_if_existsFalse)[source]
    Creates a PubSub topic if it does not already exist
    Parameters
    · project (str) – the GCP project ID in which to create the topic
    · topic (str) – the PubSub topic name to create do not include the projects{project}topics prefix
    · fail_if_exists (bool) – if set raise an exception if the topic already exists
    delete_subscription(project subscription fail_if_not_existsFalse)[source]
    Deletes a PubSub subscription if it exists
    Parameters
    · project (str) – the GCP project ID where the subscription exists
    · subscription (str) – the PubSub subscription name to delete do not include the projects{project}subscriptions prefix
    · fail_if_not_exists (bool) – if set raise an exception if the topic does not exist
    delete_topic(project topic fail_if_not_existsFalse)[source]
    Deletes a PubSub topic if it exists
    Parameters
    · project (str) – the GCP project ID in which to delete the topic
    · topic (str) – the PubSub topic name to delete do not include the projects{project}topics prefix
    · fail_if_not_exists (bool) – if set raise an exception if the topic does not exist
    get_conn()[source]
    Returns a PubSub service object
    Return type
    apiclientdiscoveryResource
publish(project, topic, messages)[source]
Publishes messages to a PubSub topic
Parameters
· project (str) – the GCP project ID in which to publish
· topic (str) – the PubSub topic to which to publish; do not include the 'projects/{project}/topics/' prefix
· messages (list of PubSub messages; see http://cloud.google.com/pubsub/docs/reference/rest/v1/PubsubMessage) – messages to publish; if the data field in a message is set, it should already be base64 encoded
pull(project, subscription, max_messages, return_immediately=False)[source]
Pulls up to max_messages messages from PubSub subscription
Parameters
· project (str) – the GCP project ID where the subscription exists
· subscription (str) – the PubSub subscription name to pull from; do not include the 'projects/{project}/topics/' prefix
· max_messages (int) – The maximum number of messages to return from the PubSub API
· return_immediately (bool) – If set, the PubSub API will immediately return if no messages are available. Otherwise, the request will block for an undisclosed, but bounded period of time
return A list of PubSub ReceivedMessage objects, each containing an ackId property and a message property, which includes the base64-encoded message content. See https://cloud.google.com/pubsub/docs/reference/rest/v1/projects.subscriptions/pull#ReceivedMessage
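A minimal sketch of create_topic and publish; the project and topic names are placeholders, and message data must already be base64 encoded as noted above:
import base64

from airflow.contrib.hooks.gcp_pubsub_hook import PubSubHook

hook = PubSubHook(gcp_conn_id="google_cloud_default")
hook.create_topic("my-project", "demo-topic", fail_if_exists=False)
hook.publish("my-project", "demo-topic",
             [{"data": base64.b64encode(b"hello").decode("ascii")}])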
class airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Storage. This hook uses the Google Cloud Platform connection
    copy(source_bucket source_object destination_bucketNone destination_objectNone)[source]
    Copies an object from a bucket to another with renaming if requested
    destination_bucket or destination_object can be omitted in which case source bucketobject is used but not both
    Parameters
    · source_bucket (str) – The bucket of the object to copy from
    · source_object (str) – The object to copy
    · destination_bucket (str) – The destination of the object to copied to Can be omitted then the same bucket is used
    · destination_object (str) – The (renamed) path of the object if given Can be omitted then the same name is used
    create_bucket(bucket_name storage_class'MULTI_REGIONAL' location'US' project_idNone labelsNone)[source]
    Creates a new bucket Google Cloud Storage uses a flat namespace so you can’t create a bucket with a name that is already in use
    See also
    For more information see Bucket Naming Guidelines httpscloudgooglecomstoragedocsbucketnaminghtml#requirements
    Parameters
    · bucket_name (str) – The name of the bucket
    · storage_class (str) –
    This defines how objects in the bucket are stored and determines the SLA and the cost of storage Values include
    o MULTI_REGIONAL
    o REGIONAL
    o STANDARD
    o NEARLINE
    o COLDLINE
    If this value is not specified when the bucket is created it will default to STANDARD
    · location (str) –
    The location of the bucket Object data for objects in the bucket resides in physical storage within this region Defaults to US
    See also
    httpsdevelopersgooglecomstoragedocsbucketlocations
    · project_id (str) – The ID of the GCP Project
    · labels (dict) – Userprovided labels in keyvalue pairs
    Returns
    If successful it returns the id of the bucket
    delete(bucket object generationNone)[source]
    Delete an object if versioning is not enabled for the bucket or if generation parameter is used
    Parameters
    · bucket (str) – name of the bucket where the object resides
    · object (str) – name of the object to delete
    · generation (str) – if present permanently delete the object of this generation
    Returns
    True if succeeded
    download(bucket object filenameNone)[source]
    Get a file from Google Cloud Storage
    Parameters
    · bucket (str) – The bucket to fetch from
    · object (str) – The object to fetch
    · filename (str) – If set a local file path where the file should be written to
    exists(bucket object)[source]
    Checks for the existence of a file in Google Cloud Storage
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is
    · object (str) – The name of the object to check in the Google cloud storage bucket
    get_conn()[source]
    Returns a Google Cloud Storage service object
    get_crc32c(bucket object)[source]
    Gets the CRC32c checksum of an object in Google Cloud Storage
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is
    · object (str) – The name of the object to check in the Google cloud storage bucket
    get_md5hash(bucket object)[source]
    Gets the MD5 hash of an object in Google Cloud Storage
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is
    · object (str) – The name of the object to check in the Google cloud storage bucket
    get_size(bucket object)[source]
    Gets the size of a file in Google Cloud Storage
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is
    · object (str) – The name of the object to check in the Google cloud storage bucket
    is_updated_after(bucket object ts)[source]
    Checks if an object is updated in Google Cloud Storage
    Parameters
    · bucket (str) – The Google cloud storage bucket where the object is
    · object (str) – The name of the object to check in the Google cloud storage bucket
    · ts (datetime) – The timestamp to check against
    list(bucket versionsNone maxResultsNone prefixNone delimiterNone)[source]
    List all objects from the bucket with the give string prefix in name
    Parameters
    · bucket (str) – bucket name
    · versions (bool) – if true list all versions of the objects
    · maxResults (int) – max count of items to return in a single page of responses
    · prefix (str) – prefix string which filters objects whose name begin with this prefix
    · delimiter (str) – filters objects based on the delimiter (for eg csv’)
    Returns
    a stream of object names matching the filtering criteria
    rewrite(source_bucket source_object destination_bucket destination_objectNone)[source]
    Has the same functionality as copy except that will work on files over 5 TB as well as when copying between locations andor storage classes
    destination_object can be omitted in which case source_object is used
    Parameters
    · source_bucket (str) – The bucket of the object to copy from
    · source_object (str) – The object to copy
    · destination_bucket (str) – The destination of the object to copied to
    · destination_object (str) – The (renamed) path of the object if given Can be omitted then the same name is used
upload(bucket, object, filename, mime_type='application/octet-stream', gzip=False)[source]
Uploads a local file to Google Cloud Storage
Parameters
· bucket (str) – The bucket to upload to
· object (str) – The object name to set when uploading the local file
· filename (str) – The local file path to the file to be uploaded
· mime_type (str) – The MIME type to set when uploading the file
· gzip (bool) – Option to compress file for upload
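A hedged usage sketch for upload/exists/download; the bucket, object name and local file paths are illustrative assumptions:
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

hook = GoogleCloudStorageHook(google_cloud_storage_conn_id="google_cloud_default")
hook.upload(bucket="my-bucket", object="data/report.parquet",
            filename="/tmp/report.parquet")
if hook.exists("my-bucket", "data/report.parquet"):
    hook.download("my-bucket", "data/report.parquet",
                  filename="/tmp/report_copy.parquet")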
class airflow.contrib.hooks.mongo_hook.MongoHook(conn_id='mongo_default', *args, **kwargs)[source]
Bases: airflow.hooks.base_hook.BaseHook
PyMongo Wrapper to Interact With Mongo Database. Mongo Connection Documentation: https://docs.mongodb.com/manual/reference/connection-string/index.html You can specify connection string options in the extra field of your connection: https://docs.mongodb.com/manual/reference/connection-string/index.html#connection-string-options ex:
{"replicaSet": "test", "ssl": True, "connectTimeoutMS": 30000}
aggregate(mongo_collection, aggregate_query, mongo_db=None, **kwargs)[source]
Runs an aggregation pipeline and returns the results: https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.aggregate https://api.mongodb.com/python/current/examples/aggregation.html
find(mongo_collection, query, find_one=False, mongo_db=None, **kwargs)[source]
Runs a mongo find query and returns the results: https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.find
get_collection(mongo_collection, mongo_db=None)[source]
Fetches a mongo collection object for querying
Uses connection schema as DB unless specified
get_conn()[source]
Fetches PyMongo Client
insert_many(mongo_collection, docs, mongo_db=None, **kwargs)[source]
Inserts many docs into a mongo collection: https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert_many
insert_one(mongo_collection, doc, mongo_db=None, **kwargs)[source]
Inserts a single document into a mongo collection: https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert_one
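A minimal sketch of insert_one and find; the collection name and documents are placeholders:
from airflow.contrib.hooks.mongo_hook import MongoHook

hook = MongoHook(conn_id="mongo_default")
hook.insert_one("events", {"type": "click", "count": 1})
for doc in hook.find("events", {"type": "click"}):
    print(doc)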
class airflow.contrib.hooks.pinot_hook.PinotDbApiHook(*args, **kwargs)[source]
Bases: airflow.hooks.dbapi_hook.DbApiHook
Connect to pinot db (https://github.com/linkedin/pinot) to issue pql
get_conn()[source]
Establish a connection to pinot broker through pinot dbapi
    get_first(sql)[source]
    Executes the sql and returns the first resulting row
    Parameters
    sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
    get_pandas_df(sql parametersNone)[source]
    Executes the sql and returns a pandas dataframe
    Parameters
    · sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
    · parameters (mapping or iterable) – The parameters to render the SQL query with
    get_records(sql)[source]
    Executes the sql and returns a set of records
    Parameters
    sql (str) – the sql statement to be executed (str) or a list of sql statements to execute
get_uri()[source]
Get the connection uri for pinot broker
e.g. http://localhost:9000/pql
    insert_rows(table rows target_fieldsNone commit_every1000)[source]
    A generic way to insert a set of tuples into a table a new transaction is created every commit_every rows
    Parameters
    · table (str) – Name of the target table
    · rows (iterable of tuples) – The rows to insert into the table
    · target_fields (iterable of strings) – The names of the columns to fill in the table
    · commit_every (int) – The maximum number of rows to insert in one transaction Set to 0 to insert all rows in one transaction
    · replace (bool) – Whether to replace instead of insert
    set_autocommit(conn autocommit)[source]
    Sets the autocommit flag on the connection
class airflow.contrib.hooks.redshift_hook.RedshiftHook(aws_conn_id='aws_default', verify=None)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Redshift, using the boto3 library
    cluster_status(cluster_identifier)[source]
    Return status of a cluster
    Parameters
    cluster_identifier (str) – unique identifier of a cluster
    create_cluster_snapshot(snapshot_identifier cluster_identifier)[source]
    Creates a snapshot of a cluster
    Parameters
    · snapshot_identifier (str) – unique identifier for a snapshot of a cluster
    · cluster_identifier (str) – unique identifier of a cluster
    delete_cluster(cluster_identifier skip_final_cluster_snapshotTrue final_cluster_snapshot_identifier'')[source]
    Delete a cluster and optionally create a snapshot
    Parameters
    · cluster_identifier (str) – unique identifier of a cluster
    · skip_final_cluster_snapshot (bool) – determines cluster snapshot creation
    · final_cluster_snapshot_identifier (str) – name of final cluster snapshot
    describe_cluster_snapshots(cluster_identifier)[source]
    Gets a list of snapshots for a cluster
    Parameters
    cluster_identifier (str) – unique identifier of a cluster
    restore_from_cluster_snapshot(cluster_identifier snapshot_identifier)[source]
    Restores a cluster from its snapshot
    Parameters
    · cluster_identifier (str) – unique identifier of a cluster
    · snapshot_identifier (str) – unique identifier for a snapshot of a cluster
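A hedged sketch of cluster_status and create_cluster_snapshot; the cluster and snapshot identifiers are illustrative assumptions:
from airflow.contrib.hooks.redshift_hook import RedshiftHook

hook = RedshiftHook(aws_conn_id="aws_default")
status = hook.cluster_status("analytics-cluster")
if status == "available":
    hook.create_cluster_snapshot("analytics-snap-20240101", "analytics-cluster")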
class airflow.contrib.hooks.salesforce_hook.SalesforceHook(conn_id, *args, **kwargs)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
    describe_object(obj)[source]
    Get the description of an object from Salesforce
    This description is the object’s schema and some extra metadata that Salesforce stores for each object
    Parameters
    obj – Name of the Salesforce object that we are getting a description of
    get_available_fields(obj)[source]
    Get a list of all available fields for an object
    This only returns the names of the fields
get_object_from_salesforce(obj, fields)[source]
Get all instances of the object from Salesforce. For each model, only get the fields specified in fields
All we really do underneath the hood is run:
SELECT <fields> FROM <obj>
    make_query(query)[source]
    Make a query to Salesforce Returns result in dictionary
    Parameters
    query – The query to make to Salesforce
    sign_in()[source]
    Sign into Salesforce
    If we have already signed it this will just return the original object
    write_object_to_file(query_results filename fmt'csv' coerce_to_timestampFalse record_time_addedFalse)[source]
    Write query results to file
    Acceptable formats are
    · csv
    commaseparatedvalues file This is the default format
    · json
    JSON array Each element in the array is a different row
    · ndjson
    JSON array but each element is newline delimited instead of comma delimited like in json
    This requires a significant amount of cleanup Pandas doesn’t handle output to CSV and json in a uniform way This is especially painful for datetime types Pandas wants to write them as strings in CSV but as millisecond Unix timestamps
    By default this function will try and leave all values as they are represented in Salesforce You use the coerce_to_timestamp flag to force all datetimes to become Unix timestamps (UTC) This is can be greatly beneficial as it will make all of your datetime fields look the same and makes it easier to work with in other database environments
    Parameters
    · query_results – the results from a SQL query
    · filename – the name of the file where the data should be dumped to
    · fmt – the format you want the output in Default csv
    · coerce_to_timestamp – True if you want all datetime fields to be converted into Unix timestamps False if you want them to be left in the same format as they were in Salesforce Leaving the value as False will result in datetimes being strings Defaults to False
    · record_time_added – (optional) True if you want to add a Unix timestamp field to the resulting data that marks when the data was fetched from Salesforce Default False
    class airflowcontribhookssftp_hookSFTPHook(ftp_conn_id'sftp_default' *args **kwargs)[source]
    Bases airflowcontribhooksssh_hookSSHHook
    This hook is inherited from SSH hook Please refer to SSH hook for the input arguments
    Interact with SFTP Aims to be interchangeable with FTPHook
    Pitfalls In contrast with FTPHook describe_directory only returns size type and
    modify It doesn’t return unixowner unixmode perm unixgroup and unique
    · retrieve_file and store_file only take a local full path and not a buffer
    · If no mode is passed to create_directory it will be created with 777 permissions
    Errors that may occur throughout but should be handled downstream
    close_conn()[source]
    Closes the connection An error will occur if the connection wasnt ever opened
    create_directory(path mode777)[source]
    Creates a directory on the remote system param path full path to the remote directory to create type path str param mode int representation of octal mode for directory
    delete_directory(path)[source]
    Deletes a directory on the remote system param path full path to the remote directory to delete type path str
    delete_file(path)[source]
    Removes a file on the FTP Server param path full path to the remote file type path str
    describe_directory(path)[source]
    Returns a dictionary of {filename {attributes}} for all files on the remote system (where the MLSD command is supported) param path full path to the remote directory type path str
    get_conn()[source]
    Returns an SFTP connection object
    list_directory(path)[source]
    Returns a list of files on the remote system param path full path to the remote directory to list type path str
    retrieve_file(remote_full_path local_full_path)[source]
    Transfers the remote file to a local location If local_full_path is a string path the file will be put at that location param remote_full_path full path to the remote file type remote_full_path str param local_full_path full path to the local file type local_full_path str
    store_file(remote_full_path local_full_path)[source]
    Transfers a local file to the remote location If local_full_path_or_buffer is a string path the file will be read from that location param remote_full_path full path to the remote file type remote_full_path str param local_full_path full path to the local file type local_full_path str
class airflow.contrib.hooks.slack_webhook_hook.SlackWebhookHook(http_conn_id=None, webhook_token=None, message='', channel=None, username=None, icon_emoji=None, link_names=False, proxy=None, *args, **kwargs)[source]
Bases: airflow.hooks.http_hook.HttpHook
This hook allows you to post messages to Slack using incoming webhooks. Takes both Slack webhook token directly and connection that has Slack webhook token. If both are supplied, the Slack webhook token will be used
    Each Slack webhook token can be preconfigured to use a specific channel username and icon You can override these defaults in this hook
    Parameters
    · http_conn_id (str) – connection that has Slack webhook token in the extra field
    · webhook_token (str) – Slack webhook token
    · message (str) – The message you want to send on Slack
    · channel (str) – The channel the message should be posted to
    · username (str) – The username to post to slack with
    · icon_emoji (str) – The emoji to use as icon for the user posting to Slack
    · link_names (bool) – Whether or not to find and link channel and usernames in your message
    · proxy (str) – Proxy to use to make the Slack webhook call
    execute()[source]
    Remote Popen (actually execute the slack webhook call)
    Parameters
    · cmd – command to remotely execute
    · kwargs – extra arguments to Popen (see subprocessPopen)
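A minimal sketch, assuming a connection slack_default whose extra field stores the webhook token; the channel and message text are placeholders:
from airflow.contrib.hooks.slack_webhook_hook import SlackWebhookHook

hook = SlackWebhookHook(http_conn_id="slack_default",
                        message="ETL job finished",
                        channel="#data-alerts",
                        username="airflow-bot")
hook.execute()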
class airflow.contrib.hooks.spark_jdbc_hook.SparkJDBCHook(spark_app_name='airflow-spark-jdbc', spark_conn_id='spark-default', spark_conf=None, spark_py_files=None, spark_files=None, spark_jars=None, num_executors=None, executor_cores=None, executor_memory=None, driver_memory=None, verbose=False, principal=None, keytab=None, cmd_type='spark_to_jdbc', jdbc_table=None, jdbc_conn_id='jdbc-default', jdbc_driver=None, metastore_table=None, jdbc_truncate=False, save_mode=None, save_format=None, batch_size=None, fetch_size=None, num_partitions=None, partition_column=None, lower_bound=None, upper_bound=None, create_table_column_types=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.spark_submit_hook.SparkSubmitHook
This hook extends the SparkSubmitHook specifically for performing data transfers to/from JDBC-based databases with Apache Spark
    Parameters
    · spark_app_name (str) – Name of the job (default airflowsparkjdbc)
    · spark_conn_id (str) – Connection id as configured in Airflow administration
    · spark_conf (dict) – Any additional Spark configuration properties
    · spark_py_files (str) – Additional python files used (zip egg or py)
    · spark_files (str) – Additional files to upload to the container running the job
    · spark_jars (str) – Additional jars to upload and add to the driver and executor classpath
    · num_executors (int) – number of executor to run This should be set so as to manage the number of connections made with the JDBC database
    · executor_cores (int) – Number of cores per executor
    · executor_memory (str) – Memory per executor (eg 1000M 2G)
    · driver_memory (str) – Memory allocated to the driver (eg 1000M 2G)
    · verbose (bool) – Whether to pass the verbose flag to sparksubmit for debugging
    · keytab (str) – Full path to the file that contains the keytab
    · principal (str) – The name of the kerberos principal used for keytab
    · cmd_type (str) – Which way the data should flow 2 possible values spark_to_jdbc data written by spark from metastore to jdbc jdbc_to_spark data written by spark from jdbc to metastore
    · jdbc_table (str) – The name of the JDBC table
    · jdbc_conn_id – Connection id used for connection to JDBC database
    · jdbc_driver (str) – Name of the JDBC driver to use for the JDBC connection This driver (usually a jar) should be passed in the jars’ parameter
    · metastore_table (str) – The name of the metastore table
    · jdbc_truncate (bool) – (spark_to_jdbc only) Whether or not Spark should truncate or drop and recreate the JDBC table This only takes effect if save_mode’ is set to Overwrite Also if the schema is different Spark cannot truncate and will drop and recreate
    · save_mode (str) – The Spark savemode to use (eg overwrite append etc)
    · save_format (str) – (jdbc_to_sparkonly) The Spark saveformat to use (eg parquet)
    · batch_size (int) – (spark_to_jdbc only) The size of the batch to insert per round trip to the JDBC database Defaults to 1000
    · fetch_size (int) – (jdbc_to_spark only) The size of the batch to fetch per round trip from the JDBC database Default depends on the JDBC driver
    · num_partitions (int) – The maximum number of partitions that can be used by Spark simultaneously both for spark_to_jdbc and jdbc_to_spark operations This will also cap the number of JDBC connections that can be opened
    · partition_column (str) – (jdbc_to_sparkonly) A numeric column to be used to partition the metastore table by If specified you must also specify num_partitions lower_bound upper_bound
    · lower_bound (int) – (jdbc_to_sparkonly) Lower bound of the range of the numeric partition column to fetch If specified you must also specify num_partitions partition_column upper_bound
    · upper_bound (int) – (jdbc_to_sparkonly) Upper bound of the range of the numeric partition column to fetch If specified you must also specify num_partitions partition_column lower_bound
    · create_table_column_types – (spark_to_jdbconly) The database column data types to use instead of the defaults when creating the table Data type information should be specified in the same format as CREATE TABLE columns syntax (eg name CHAR(64) comments VARCHAR(1024)) The specified types should be valid spark sql data types
Type
jdbc_conn_id: str
class airflow.contrib.hooks.spark_sql_hook.SparkSqlHook(sql, conf=None, conn_id='spark_sql_default', total_executor_cores=None, executor_cores=None, executor_memory=None, keytab=None, principal=None, master='yarn', name='default-name', num_executors=None, verbose=True, yarn_queue='default')[source]
Bases: airflow.hooks.base_hook.BaseHook
This hook is a wrapper around the spark-sql binary. It requires that the spark-sql binary is in the PATH
Parameters
· sql (str) – The SQL query to execute
· conf (str, format: PROP=VALUE) – arbitrary Spark configuration property
· conn_id (str) – connection_id string
· total_executor_cores (int) – (Standalone & Mesos only) Total cores for all executors (Default: all the available cores on the worker)
    · executor_cores (int) – (Standalone & YARN only) Number of cores per executor (Default 2)
    · executor_memory (str) – Memory per executor (eg 1000M 2G) (Default 1G)
    · keytab (str) – Full path to the file that contains the keytab
    · master (str) – sparkhostport mesoshostport yarn or local
    · name (str) – Name of the job
    · num_executors (int) – Number of executors to launch
    · verbose (bool) – Whether to pass the verbose flag to sparksql
    · yarn_queue (str) – The YARN queue to submit to (Default default)
    run_query(cmd'' **kwargs)[source]
    Remote Popen (actually execute the Sparksql query)
    Parameters
    · cmd – command to remotely execute
    · kwargs – extra arguments to Popen (see subprocessPopen)
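A minimal usage sketch of SparkSqlHook, assuming a spark_sql_default connection has been configured in Airflow and that the spark-sql binary is on the PATH; the query, database and queue names below are illustrative only:

# Hypothetical example: run one Spark SQL statement through the hook (Airflow 1.10.x contrib path).
from airflow.contrib.hooks.spark_sql_hook import SparkSqlHook

hook = SparkSqlHook(
    sql="SELECT count(*) FROM db_example.orders",  # illustrative query
    conn_id="spark_sql_default",
    master="yarn",
    yarn_queue="default",
)
hook.run_query()  # wraps the spark-sql binary via Popen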
class airflow.contrib.hooks.spark_submit_hook.SparkSubmitHook(conf=None, conn_id='spark_default', files=None, py_files=None, driver_classpath=None, jars=None, java_class=None, packages=None, exclude_packages=None, repositories=None, total_executor_cores=None, executor_cores=None, executor_memory=None, driver_memory=None, keytab=None, principal=None, name='default-name', num_executors=None, application_args=None, env_vars=None, verbose=False)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
This hook is a wrapper around the spark-submit binary to kick off a spark-submit job. It requires that the spark-submit binary is in the PATH or the spark_home to be supplied.
Parameters
· conf (dict) – Arbitrary Spark configuration properties
· conn_id (str) – The connection id as configured in Airflow administration. When an invalid connection_id is supplied, it will default to yarn
· files (str) – Upload additional files to the executor running the job, separated by a comma. Files will be placed in the working directory of each executor. For example, serialized objects
· py_files (str) – Additional python files used by the job, can be .zip, .egg or .py
· driver_classpath (str) – Additional, driver-specific, classpath settings
· jars (str) – Submit additional jars to upload and place them in executor classpath
· java_class (str) – the main class of the Java application
· packages (str) – Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths
· exclude_packages (str) – Comma-separated list of maven coordinates of jars to exclude while resolving the dependencies provided in 'packages'
· repositories (str) – Comma-separated list of additional remote repositories to search for the maven coordinates given with 'packages'
· total_executor_cores (int) – (Standalone & Mesos only) Total cores for all executors (Default: all the available cores on the worker)
· executor_cores (int) – (Standalone, YARN and Kubernetes only) Number of cores per executor (Default: 2)
· executor_memory (str) – Memory per executor (e.g. 1000M, 2G) (Default: 1G)
· driver_memory (str) – Memory allocated to the driver (e.g. 1000M, 2G) (Default: 1G)
· keytab (str) – Full path to the file that contains the keytab
· principal (str) – The name of the kerberos principal used for keytab
· name (str) – Name of the job (default: airflow-spark)
· num_executors (int) – Number of executors to launch
· application_args (list) – Arguments for the application being submitted
· env_vars (dict) – Environment variables for spark-submit. It supports yarn and k8s mode too
· verbose (bool) – Whether to pass the verbose flag to spark-submit process for debugging
submit(application='', **kwargs)[source]
Remote Popen to execute the spark-submit job
Parameters
· application (str) – Submitted application, jar or py file
· kwargs – extra arguments to Popen (see subprocess.Popen)
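A minimal sketch of kicking off a job with SparkSubmitHook, assuming a spark_default connection exists; the application path, job name and configuration values are placeholders:

# Hypothetical example: submit a PySpark application via the spark-submit wrapper hook.
from airflow.contrib.hooks.spark_submit_hook import SparkSubmitHook

hook = SparkSubmitHook(
    conn_id="spark_default",
    conf={"spark.sql.shuffle.partitions": "200"},  # illustrative Spark conf
    executor_cores=2,
    executor_memory="2G",
    num_executors=4,
    name="example_etl_job",
    verbose=True,
)
hook.submit(application="/opt/jobs/etl_job.py")  # placeholder application path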
class airflow.contrib.hooks.sqoop_hook.SqoopHook(conn_id='sqoop_default', verbose=False, num_mappers=None, hcatalog_database=None, hcatalog_table=None, properties=None)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
This hook is a wrapper around the sqoop 1 binary. To be able to use the hook, it is required that "sqoop" is in the PATH.
Additional arguments that can be passed via the 'extra' JSON field of the sqoop connection:
· job_tracker: Job tracker local|jobtracker:port
· namenode: Namenode
· lib_jars: Comma separated jar files to include in the classpath
· files: Comma separated files to be copied to the map reduce cluster
· archives: Comma separated archives to be unarchived on the compute machines
· password_file: Path to file containing the password
Parameters
· conn_id (str) – Reference to the sqoop connection
· verbose (bool) – Set sqoop to verbose
· num_mappers (int) – Number of map tasks to import in parallel
· properties (dict) – Properties to set via the -D argument
Popen(cmd, **kwargs)[source]
Remote Popen
Parameters
· cmd – command to remotely execute
· kwargs – extra arguments to Popen (see subprocess.Popen)
Returns
handle to subprocess
export_table(table, export_dir, input_null_string, input_null_non_string, staging_table, clear_staging_table, enclosed_by, escaped_by, input_fields_terminated_by, input_lines_terminated_by, input_optionally_enclosed_by, batch, relaxed_isolation, extra_export_options=None)[source]
Exports a Hive table to a remote location. Arguments are copies of direct sqoop command line arguments.
Parameters
· table – Table remote destination
· export_dir – Hive table to export
· input_null_string – The string to be interpreted as null for string columns
· input_null_non_string – The string to be interpreted as null for non-string columns
· staging_table – The table in which data will be staged before being inserted into the destination table
· clear_staging_table – Indicate that any data present in the staging table can be deleted
· enclosed_by – Sets a required field enclosing character
· escaped_by – Sets the escape character
· input_fields_terminated_by – Sets the field separator character
· input_lines_terminated_by – Sets the end-of-line character
· input_optionally_enclosed_by – Sets a field enclosing character
· batch – Use batch mode for underlying statement execution
· relaxed_isolation – Transaction isolation to read uncommitted for the mappers
· extra_export_options – Extra export options to pass as a dict. If a key doesn't have a value, just pass an empty string to it. Don't include the "--" prefix for sqoop options
import_query(query, target_dir, append=False, file_type='text', split_by=None, direct=None, driver=None, extra_import_options=None)[source]
Imports a specific query from the rdbms to hdfs.
Parameters
· query – Free format query to run
· target_dir – HDFS destination dir
· append – Append data to an existing dataset in HDFS
· file_type – "avro", "sequence", "text" or "parquet". Imports data to hdfs in the specified format. Defaults to text
· split_by – Column of the table used to split work units
· direct – Use direct import fast path
· driver – Manually specify the JDBC driver class to use
· extra_import_options – Extra import options to pass as a dict. If a key doesn't have a value, just pass an empty string to it. Don't include the "--" prefix for sqoop options
import_table(table, target_dir=None, append=False, file_type='text', columns=None, split_by=None, where=None, direct=False, driver=None, extra_import_options=None)[source]
Imports a table from a remote location to the target dir. Arguments are copies of direct sqoop command line arguments.
Parameters
· table – Table to read
· target_dir – HDFS destination dir
· append – Append data to an existing dataset in HDFS
· file_type – "avro", "sequence", "text" or "parquet". Imports data in the specified format. Defaults to text
· columns – Columns to import from table
· split_by – Column of the table used to split work units
· where – WHERE clause to use during import
· direct – Use direct connector if it exists for the database
· driver – Manually specify the JDBC driver class to use
· extra_import_options – Extra import options to pass as a dict. If a key doesn't have a value, just pass an empty string to it. Don't include the "--" prefix for sqoop options
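A minimal sketch of import_table, assuming a sqoop_default connection pointing at the source database; the table name, HDFS target directory and split column are placeholders:

# Hypothetical example: import one RDBMS table into HDFS as text files with Sqoop.
from airflow.contrib.hooks.sqoop_hook import SqoopHook

hook = SqoopHook(conn_id="sqoop_default", num_mappers=4, verbose=True)
hook.import_table(
    table="orders",                  # placeholder source table
    target_dir="/data/raw/orders",   # placeholder HDFS directory
    file_type="text",
    split_by="order_id",             # numeric column used to split work units
)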
class airflow.contrib.hooks.ssh_hook.SSHHook(ssh_conn_id=None, remote_host=None, username=None, password=None, key_file=None, port=None, timeout=10, keepalive_interval=30)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
Hook for ssh remote execution using Paramiko (ref: https://github.com/paramiko/paramiko). This hook also lets you create ssh tunnels and serves as the basis for SFTP file transfer.
Parameters
· ssh_conn_id (str) – connection id from airflow Connections, from where all the required parameters can be fetched, like username, password or key_file. Though the priority is given to the params passed during init
· remote_host (str) – remote host to connect
· username (str) – username to connect to the remote_host
· password (str) – password of the username to connect to the remote_host
· key_file (str) – key file to use to connect to the remote_host
· port (int) – port of remote host to connect (Default is paramiko SSH_PORT)
· timeout (int) – timeout for the attempt to connect to the remote_host
· keepalive_interval (int) – send a keepalive packet to the remote host every keepalive_interval seconds
get_conn()[source]
Opens an ssh connection to the remote host
Returns: paramiko.SSHClient object
get_tunnel(remote_port, remote_host='localhost', local_port=None)[source]
Creates a tunnel between two hosts. Like ssh -L <host>.
Parameters
· remote_port (int) – The remote port to create a tunnel to
· remote_host (str) – The remote host to create a tunnel to (default localhost)
· local_port (int) – The local port to attach the tunnel to
Returns
sshtunnel.SSHTunnelForwarder object
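A minimal sketch of remote execution through SSHHook, assuming an ssh_default connection has been configured; the remote command is illustrative:

# Hypothetical example: open a Paramiko connection via the hook and run a command.
from airflow.contrib.hooks.ssh_hook import SSHHook

hook = SSHHook(ssh_conn_id="ssh_default", timeout=10)
client = hook.get_conn()                     # paramiko.SSHClient
stdin, stdout, stderr = client.exec_command("hdfs dfs -ls /data")  # illustrative command
print(stdout.read().decode())
client.close()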
class airflow.contrib.hooks.vertica_hook.VerticaHook(*args, **kwargs)[source]
Bases: airflow.hooks.dbapi_hook.DbApiHook
Interact with Vertica.
get_conn()[source]
Returns a Vertica connection object
Executor
Executors are the mechanism by which task instances get run.
class airflow.executors.local_executor.LocalExecutor(parallelism=32)[source]
Bases: airflow.executors.base_executor.BaseExecutor
LocalExecutor executes tasks locally in parallel. It uses the multiprocessing Python library and queues to parallelize the execution of tasks.
end()[source]
This method is called when the caller is done submitting jobs and wants to wait synchronously for the jobs submitted previously to be all done.
execute_async(key, command, queue=None, executor_config=None)[source]
This method will execute the command asynchronously.
start()[source]
Executors may need to get things started. For example, LocalExecutor starts N workers.
sync()[source]
Sync will get called periodically by the heartbeat method. Executors should override this to gather statuses.
class airflow.executors.sequential_executor.SequentialExecutor[source]
Bases: airflow.executors.base_executor.BaseExecutor
This executor will only run one task instance at a time and can be used for debugging. It is also the only executor that can be used with sqlite, since sqlite doesn't support multiple connections.
Since we want airflow to work out of the box, it defaults to this SequentialExecutor alongside sqlite when you first install it.
end()[source]
This method is called when the caller is done submitting jobs and wants to wait synchronously for the jobs submitted previously to be all done.
execute_async(key, command, queue=None, executor_config=None)[source]
This method will execute the command asynchronously.
sync()[source]
Sync will get called periodically by the heartbeat method. Executors should override this to gather statuses.
Community-contributed executors
class airflow.contrib.executors.mesos_executor.MesosExecutor(parallelism=32)[source]
Bases: airflow.executors.base_executor.BaseExecutor, airflow.www.utils.LoginMixin
MesosExecutor allows distributing the execution of task instances to multiple mesos workers.
Apache Mesos is a distributed systems kernel which abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. See http://mesos.apache.org/
end()[source]
This method is called when the caller is done submitting jobs and wants to wait synchronously for the jobs submitted previously to be all done.
execute_async(key, command, queue=None, executor_config=None)[source]
This method will execute the command asynchronously.
start()[source]
Executors may need to get things started. For example, LocalExecutor starts N workers.
sync()[source]
Sync will get called periodically by the heartbeat method. Executors should override this to gather statuses.

Apache Airflow 2.0 New Features
This section is organized from the Apache Airflow 2.0 release documents. Its content comes from the official Apache Airflow announcement "Apache Airflow 2.0 is here" and from Astronomer (an Apache Airflow cloud service provider) "Introducing Airflow 2.0".
It explains the new features of Apache Airflow 2.0 by translating and annotating the official documentation.
TaskFlow API (AIP-31): a new way of writing DAGs
DAGs are now much easier to write, especially when using the PythonOperator: dependencies between tasks are clearer and XCom is nicer to use.
An example DAG written with the TaskFlow API:
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago

@dag(default_args={'owner': 'airflow'}, schedule_interval=None, start_date=days_ago(2))
def tutorial_taskflow_api_etl():
    @task
    def extract():
        return {"1001": 301.27, "1002": 433.21, "1003": 502.22}

    @task
    def transform(order_data_dict: dict) -> dict:
        total_order_value = 0
        for value in order_data_dict.values():
            total_order_value += value
        return {"total_order_value": total_order_value}

    @task()
    def load(total_order_value: float):
        print(f"Total order value is {total_order_value:.2f}")

    order_data = extract()
    order_summary = transform(order_data)
    load(order_summary["total_order_value"])

tutorial_etl_dag = tutorial_taskflow_api_etl()
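Note that calling extract(), transform() and load() inside the decorated DAG function does not run them immediately: it creates tasks and wires up their dependencies, and the return values are passed between tasks through XCom automatically.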
With the code above, writing Airflow tasks becomes much simpler: you only need the @dag and @task decorators on top of ordinary Python functions, which greatly improves productivity.
For more details, see:
    TaskFlow API Tutorial
    TaskFlow API Documentation
A full REST API (AIP-32)
There is now a fully supported (no longer experimental) API with a comprehensive OpenAPI specification.
The REST API ships with two online documentation UIs to interact with, an interactive Swagger UI and a static Redoc page, which is very friendly to users who need the Airflow REST API.

Swagger UI (screenshot)

Redoc (screenshot)


For more details on the REST API, see:
    REST API Documentation
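A minimal sketch of calling the new stable REST API from Python, assuming basic authentication is enabled on the webserver and that it runs at http://localhost:8080 with the credentials shown (all assumptions); the /api/v1/dags endpoint is part of the Airflow 2.0 OpenAPI specification:

# Hypothetical example: list DAGs through the Airflow 2.0 stable REST API.
import requests

AIRFLOW_URL = "http://localhost:8080"   # assumed webserver address
AUTH = ("airflow", "airflow")           # assumed basic-auth credentials

resp = requests.get(f"{AIRFLOW_URL}/api/v1/dags", auth=AUTH)
resp.raise_for_status()
for dag in resp.json().get("dags", []):
    print(dag["dag_id"], "paused:", dag["is_paused"])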
Significant scheduler performance improvements
As part of AIP-15 (Scheduler HA + performance) and other work done by Kamil, the performance of the Airflow Scheduler has been significantly improved; it now starts tasks much faster.
Astronomer.io has benchmarked the scheduler, and it is fast (the numbers were checked three times because at first they seemed too good to believe).
For the detailed performance tests, see the two tables below: a task latency benchmark and a scheduler horizontal-scaling benchmark.
Latency benchmark
We have been using task latency as the key metric to benchmark scheduler performance and validate improvements. Often evident in the Gantt view of the Airflow UI, task latency is defined as the time it takes for a task to begin executing once its dependencies have been met.
Along with the architectural changes above, Airflow 2.0 also incorporates optimizations in the task startup process and in the scheduler loop, which reduce task latency.
To sufficiently test this without skewing numbers based on the actual task work time, the benchmark uses a simple BashOperator task with a trivial execution time. The benchmarking configuration was: 4 Celery Workers, PostgreSQL DB, 1 Web Server, 1 Scheduler.
Results for 1000 task runs, measured as total task latency (referenced below as task lag):
Scenario | DAG shape | 1.10.10 Total Task Lag | 2.0 beta Total Task Lag | Speedup
100 DAG files, 1 DAG per file, 10 Tasks per DAG | Linear | 200 seconds | 11.6 seconds | 17 times
10 DAG files, 1 DAG per file, 100 Tasks per DAG | Linear | 144 seconds | 14.3 seconds | 10 times
10 DAG files, 10 DAGs per file, 10 Tasks per DAG | Binary Tree | 200 seconds | 12 seconds | 16 times
As the table shows, task-to-task latency in Airflow has dropped dramatically, roughly a ten-fold or better improvement over the 1.10 series.
Scalability benchmark
We have been using task throughput as the key metric for measuring Airflow scalability and to identify bottlenecks. Task throughput is measured in tasks per minute, and represents the number of tasks that can be scheduled, queued, executed, and monitored by Airflow every minute.
To sufficiently test this without skewing numbers based on the actual task work time, the benchmark uses a simple PythonOperator task with a trivial execution time. The benchmarking configuration was: Celery Workers, PostgreSQL DB, 1 Web Server.
Results for task throughput (metric explained above) using Airflow 2.0 beta builds, run with 5000 DAGs, each with 10 parallel tasks, on a single Airflow deployment. The benchmark was performed on Google Cloud and each Scheduler was run on an n1-standard-1 machine type.
Schedulers | Workers | Task Throughput (average) | Task Throughput (low) | Task Throughput (high)
1 | 12 | 285 | 248 | 323
2 | 12 | 541 | 492 | 578
3 | 12 | 698 | 632.5 | 774

From the table above we can see that the Airflow 2.0 scheduler scales roughly linearly with the number of scheduler instances, which significantly improves Airflow's scheduling capacity.
Scheduler high availability (AIP-15)
It is now supported to run more than one scheduler instance. This is great both for resiliency (in case a scheduler goes down) and for scheduling performance.
To fully use this feature you need Postgres 9.6+ or MySQL 8+ (MySQL 5 and MariaDB unfortunately do not work with multiple schedulers).
No configuration or setup is required to run more than one scheduler: just start up a scheduler somewhere else (making sure it has access to the DAG files) and it will cooperate with your existing schedulers through the database.
For more information, read the Scheduler HA documentation.
Task Groups (AIP-34)
SubDAGs were commonly used for grouping tasks in the UI, but their execution behaviour had many drawbacks (primarily that they could only execute a single task in parallel). To improve this experience, Airflow 2.0 introduces Task Groups: a way of organizing tasks that provides the same grouping behaviour as a SubDAG without any of the execution-time drawbacks.
SubDAGs still work for now, but we believe that anything previously done with a SubDAG can now be done with a Task Group; if you find an example where this is not the case, please let us know by opening an issue on GitHub.
For more details, see the Task Group documentation.
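A minimal sketch of grouping tasks with a Task Group; the DAG id and task names are illustrative:

# Hypothetical example: a small DAG whose middle steps are grouped in the UI with TaskGroup.
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup
from airflow.utils.dates import days_ago

with DAG("taskgroup_demo", start_date=days_ago(1), schedule_interval=None) as dag:
    start = DummyOperator(task_id="start")

    with TaskGroup("transform_group") as transform_group:
        step_1 = DummyOperator(task_id="step_1")
        step_2 = DummyOperator(task_id="step_2")
        step_1 >> step_2  # dependencies inside the group work as usual

    end = DummyOperator(task_id="end")
    start >> transform_group >> end  # the group collapses into one node in the UI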
A refreshed user interface
The Airflow UI has been given a visual refresh and the styling has been updated.

An option to auto-refresh task states in the Graph View has also been added, so you no longer need to keep hitting the refresh button.
Check out the screenshots in the documentation for more information.
Smart Sensors for reduced sensor load (AIP-17)
If you make heavy use of sensors in your Airflow cluster, you may find that sensor execution takes up a significant proportion of the cluster even in "reschedule" mode. To improve this, a new mode called Smart Sensors has been added.
This feature is in "early access": it has been well tested at Airbnb and is stable, but we reserve the right to make backwards-incompatible changes to it in a future release (if we have to; we will try very hard not to).
    Read more about it in the Smart Sensors documentation
Simplified KubernetesExecutor
For Airflow 2.0, the KubernetesExecutor has been re-architected to be faster, easier to understand, and more flexible for Airflow users. Users can now access the full Kubernetes API to create a .yaml pod_template_file instead of specifying parameters in airflow.cfg.
The executor_config dictionary has also been replaced with the pod_override parameter, which takes a Kubernetes V1Pod object for a 1:1 setting override. These changes removed more than three thousand lines of code from the KubernetesExecutor, making it run faster and with fewer potential errors.
For more information on the simplified KubernetesExecutor, see:
    Docs on pod_template_file
    Docs on pod_override
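A minimal sketch of the pod_override mechanism, assuming the KubernetesExecutor is in use; the DAG id, image name and resource values are placeholders:

# Hypothetical example: override the worker pod for one task via executor_config["pod_override"].
from kubernetes.client import models as k8s
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

def heavy_compute():
    print("running with a customized pod spec")

with DAG("pod_override_demo", start_date=days_ago(1), schedule_interval=None) as dag:
    heavy_task = PythonOperator(
        task_id="heavy_task",
        python_callable=heavy_compute,
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # the KubernetesExecutor worker container is named "base"
                            image="apache/airflow:2.0.0",  # placeholder image
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "1", "memory": "2Gi"}  # placeholder resources
                            ),
                        )
                    ]
                )
            )
        },
    )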
Airflow core and providers: splitting Airflow into 60+ packages

Airflow 2.0 is no longer one monolithic "one to rule them all" package. Airflow has been split into core and 61 (for now) provider packages. Each provider package is for a specific external service (Google, Amazon, Microsoft, Snowflake), a database (Postgres, MySQL), or a protocol (HTTP/FTP). You can now create a customized Airflow installation from building blocks and choose only what you need, plus whatever other requirements you may have. Some common providers are installed automatically (ftp, http, imap, sqlite); other providers are installed automatically when you choose the appropriate extras while installing Airflow.
The provider architecture should make it much easier to get a fully customized, yet consistent runtime with the right set of Python dependencies.
But that is not all: you can write your own custom providers and, in a manageable way, add things such as custom connection types, customizations of connection forms, and extra links for your operators. You can build your own provider, install it as a Python package, and have your customizations show up straight in the Airflow UI.
Jarek Potiuk has written about providers in much more detail on his blog.
For more information on providers, see:
    Docs on the providers concept and writing custom providers
    Docs on the all providers packages available
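A small sketch of how the split into provider packages changes import paths, assuming the apache-airflow-providers-postgres and apache-airflow-providers-apache-spark packages have been installed alongside core Airflow; connection ids and the application path are placeholders:

# Hypothetical example: hooks/operators now live in provider packages instead of airflow.contrib.
# pip install apache-airflow-providers-postgres apache-airflow-providers-apache-spark
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

pg_hook = PostgresHook(postgres_conn_id="postgres_default")   # assumed connection id
spark_task = SparkSubmitOperator(
    task_id="spark_submit_job",
    application="/opt/jobs/etl_job.py",   # placeholder path
    conn_id="spark_default",              # assumed connection id
)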
Security
As part of the Airflow 2.0 effort there has been a conscious focus on security and on reducing areas of exposure. This manifests itself in different forms in different functional areas. For example, in the new REST API all operations now require authorization. Similarly, in the configuration settings, the Fernet key is now required to be specified.
Configuration
Configuration in the form of the airflow.cfg file has been further rationalized, particularly around the "core" section. In addition, a significant number of configuration options have been deprecated or moved to component-specific configuration files, such as the pod-template-file for Kubernetes execution-related configuration.

Architecture design of an enterprise data framework based on Apache Airflow
With Spark as the data computation engine, this framework, built on Airflow, generates DAG data pipelines directly from yml configuration files, so no DAG code needs to be written by hand. The yml configuration specifies the Spark SQL, PySpark code or Hive SQL that a DAG runs; the DML SQL that saves results to a database (MySQL); the file paths of data to be imported; data quality checks; connections to external data sources; exporting data to files; sending data emails and notification emails; and API integrations with third-party software platforms. This configuration-driven development approach improves both development efficiency and long-term maintenance efficiency, and the runtime performance of any project in the framework can be improved by tuning the framework as a whole. The framework is developed in Python, which raises development efficiency while remaining compatible with Airflow DAGs written directly in Python code. A minimal sketch of the configuration-driven idea is given after the folder list below.
Under the root directory of the framework, 10 folders are created, used as follows:
1. Stores the .py files of Airflow DAGs written directly in Python code (for compatibility).
2. Stores the initialization, configuration and startup scripts of the data computation cluster (for example, the .sh scripts that start an AWS EMR cluster).
3. Stores the utility-function and function-library .py files: operating on files in the cluster storage (querying, copying, moving, deleting, compressing/decompressing, encrypting/decrypting, and so on), validating business data (data completeness checks and data quality checks), validating the correctness and completeness of the yml configuration files, integrating with third-party software platforms through API calls, running programs and starting processes, writing and inspecting logs, sending emails, securely transferring and storing system accounts and passwords, and exchanging data files through shared folders.
4. Stores the structured collection of yml configuration files, organized into directories by business model and business subject, with each business subject in its own yml file. A yml file contains the contact email list, the accounts of the instant-messaging software integrated with the framework, the DAG start date and time, the number of parallel tasks, the number of active parallel runs, and so on. The yml files of all business subjects are kept together in folders under a sub-directory of the framework root. By type of data acquisition and processing, the yml configuration files are further divided into: acquiring base data from manual files and data sources; filtering and processing base data into derived data; joining, grouping and aggregating derived data into statistical data; extracting report data from statistical data; and exporting data to files, data sources, shared folders, instant-messaging software and email. These are placed in separate folders under a sub-directory of the framework root.
5. Stores templates of the various Airflow DAG and sensor definitions and of various kinds of configuration information.
6. Stores the Airflow and framework logs.
7. Stores the jar libraries that the framework depends on.
8. Stores the collection of PySpark script files.
9. Stores the collection of SQL script files (Spark SQL, Hive SQL, and the DDL/DML/DCL of report database tables and views).
10. Stores the Python virtual environments.
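A minimal sketch of the configuration-driven idea described above, under assumed names: a hypothetical configs/sales/daily_orders.yml file is read at DAG-parse time and turned into a DAG whose tasks run the Spark SQL / PySpark scripts listed in it. The yml layout, file paths and helper names are illustrative, not the framework's actual implementation:

# Hypothetical sketch: generate Airflow DAGs from yml configuration files.
import glob
import yaml
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

# Example yml (configs/sales/daily_orders.yml), illustrative only:
#   dag_id: sales_daily_orders
#   start_date_days_ago: 1
#   schedule_interval: "0 2 * * *"
#   tasks:
#     - id: load_orders
#       type: pyspark
#       script: pyspark/load_orders.py
#     - id: build_report
#       type: spark_sql
#       script: sql/build_orders_report.sql

def build_dag(cfg: dict) -> DAG:
    dag = DAG(
        dag_id=cfg["dag_id"],
        start_date=days_ago(cfg.get("start_date_days_ago", 1)),
        schedule_interval=cfg.get("schedule_interval"),
    )
    previous = None
    for task_cfg in cfg["tasks"]:
        if task_cfg["type"] == "pyspark":
            cmd = f"spark-submit {task_cfg['script']}"
        else:  # spark_sql / hive sql scripts
            cmd = f"spark-sql -f {task_cfg['script']}"
        task = BashOperator(task_id=task_cfg["id"], bash_command=cmd, dag=dag)
        if previous is not None:
            previous >> task  # simple linear dependency for the sketch
        previous = task
    return dag

# Register one DAG per yml file so the Airflow scheduler can discover them.
for path in glob.glob("configs/**/*.yml", recursive=True):
    with open(path) as f:
        config = yaml.safe_load(f)
    globals()[config["dag_id"]] = build_dag(config)

In the real framework each yml would also carry the email lists, data-quality checks and export targets described above; the sketch only shows how a parse-time loop can turn declarative configuration into DAG objects.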
