Environment
Fully distributed cluster (I): base cluster environment and zookeeper-3.4.10 deployment
Fully distributed cluster (II): hadoop-2.6.5 deployment -- the Hadoop cluster must be installed and running before Hive.

Install hive-2.1.1
Download apache-hive-2.1.1-bin.tar.gz, upload it to the server with an FTP tool, extract it, and rename the directory. Since node225 currently serves only as a DataNode of the cluster, Hive is installed on node225.

[root@node225 ~]# gtar -xzf /home/hadoop/apache-hive-2.1.1-bin.tar.gz -C /usr/local/
[root@node225 ~]# mv /usr/local/apache-hive-2.1.1-bin /usr/local/hive-2.1.1

Configure the Hive environment variables

[root@node225 ~]# vi /etc/profile
# append the following; adjust the paths to your layout
export HIVE_HOME=/usr/local/hive-2.1.1
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export PATH=${HIVE_HOME}/bin:$PATH
# make the settings take effect
[root@node225 ~]# source /etc/profile

On any cluster node, create the HDFS directories Hive needs and open up their permissions (the Hadoop cluster must already be running). Here this is done on node222:

[hadoop@node222 ~]$ hdfs dfs -mkdir -p /user/hive/warehouse
[hadoop@node222 ~]$ hdfs dfs -chmod 777 /user/hive/warehouse
[hadoop@node222 ~]$ hdfs dfs -mkdir -p /tmp/hive
[hadoop@node222 ~]$ hdfs dfs -chmod 777 /tmp/hive

This installation uses multi-user mode, so the Hive metastore database has to be created in MySQL first:

-- create the hive database
ipems_dvp@localhost : (none) 10:26:05> create database hive;
Query OK, 1 row affected (0.01 sec)
-- create the hive user and set its password
ipems_dvp@localhost : (none) 10:27:01> create user 'hive'@'%' identified by 'Aa123456789';
Query OK, 0 rows affected (0.07 sec)
-- grant privileges
ipems_dvp@localhost : (none) 10:36:12> grant all privileges on hive.* to 'hive'@'%';
Query OK, 0 rows affected (0.07 sec)
-- flush privileges
ipems_dvp@localhost : (none) 10:36:26> flush privileges;
Query OK, 0 rows affected (0.02 sec)

Copy the template to create the Hive configuration file, then edit it. Every connection otherwise prints "WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set." This is MySQL's SSL warning advising against connections without server identity verification. To suppress it, append "&amp;useSSL=false" to the javax.jdo.option.ConnectionURL value, where "&amp;" is the XML escape for the "&" character.

[root@node225 ~]# cp /usr/local/hive-2.1.1/conf/hive-default.xml.template /usr/local/hive-2.1.1/conf/hive-site.xml
# hive-site.xml ships with a very large number of properties; empty it first, then fill in the content below
[root@node225 ~]# cat /dev/null > /usr/local/hive-2.1.1/conf/hive-site.xml
[root@node225 ~]# vi /usr/local/hive-2.1.1/conf/hive-site.xml
# properties
<configuration>
  <property>
    <name>hive.default.fileformat</name>
    <value>TextFile</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.0.200:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>Aa123456789</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
</configuration>

Upload the MySQL JDBC driver into Hive's lib directory:

[root@node225 ~]# ls /usr/local/hive-2.1.1/lib/mysql-connector-java-5.1.40-bin.jar
/usr/local/hive-2.1.1/lib/mysql-connector-java-5.1.40-bin.jar

Append the Hadoop home and Hive configuration directory settings to /usr/local/hive-2.1.1/conf/hive-env.sh:

[root@node225 ~]# vi /usr/local/hive-2.1.1/conf/hive-env.sh
# append
export HADOOP_HOME=/usr/local/hadoop-2.6.5
export HIVE_CONF_DIR=/usr/local/hive-2.1.1/conf
export HIVE_AUX_JARS_PATH=/usr/local/hive-2.1.1/lib

Initialize the Hive metastore:

[root@node225 ~]# /usr/local/hive-2.1.1/bin/schematool -initSchema -dbType mysql
which: no hbase in (.:/usr/local/jdk1.8.0_66//bin:/usr/local/zookeeper-3.4.10/bin:...:/root/bin)
......
Sun Sep 30 10:52:41 CST 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
schemaTool completed

Test the connection through the Hive CLI; it starts normally and simple HiveQL statements run:

[root@node225 ~]# /usr/local/hive-2.1.1/bin/hive
which: no hbase in (.:/usr/local/jdk1.8.0_66//bin:/usr/local/zookeeper-3.4.10/bin:...:/root/bin)
Logging initialized using configuration in jar:file:/usr/local/hive-2.1.1/lib/hive-common-2.1.1.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
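The "&amp;" in the JDBC URL above is just the XML escape of "&" -- the raw URL you would type on a shell command line has a literal "&" in it. As a quick illustration (Python, using only the standard library):

```python
from xml.sax.saxutils import escape

# The JDBC URL as you would use it outside of XML.
raw = "jdbc:mysql://192.168.0.200:3306/hive?createDatabaseIfNotExist=true&useSSL=false"

# Inside hive-site.xml the literal "&" must be written as "&amp;".
xml_value = escape(raw)
print(xml_value)  # ...createDatabaseIfNotExist=true&amp;useSSL=false
```

Writing the unescaped "&" into hive-site.xml makes the file malformed XML, which is why the escaped form appears in the configuration above.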
hive> show databases;

If connecting prints "SLF4J: Class path contains multiple SLF4J bindings.", the SLF4J messages mean two logging jars are in conflict. Here Hadoop's jar is kept, so the corresponding Hive jar is renamed out of the way.

# messages
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
# fix
[root@node225 ~]# mv /usr/local/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar /usr/local/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar.bak

If it reports "Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby", both NameNodes of the HA pair are in standby state. Confirm this through the web UI on port 50070, then start the zkfc service on the NameNode hosts:

[hadoop@node222 ~]$ /usr/local/hadoop-2.6.5/sbin/hadoop-daemon.sh start zkfc
[hadoop@node224 ~]$ /usr/local/hadoop-2.6.5/sbin/hadoop-daemon.sh start zkfc
NO.1
In The Alchemy of Finance (1987), George Soros put forward an important proposition: "I believe the market prices are always wrong in the sense that they present a biased view of the future." The efficient-market hypothesis is only a theoretical assumption. In practice, market participants are not always rational; at any given moment they cannot fully obtain and objectively interpret all available information; and even given the same information, everyone reacts differently. In other words, prices themselves already embody the participants' mistaken expectations, so market prices are essentially always wrong. This may well be the arbitrageur's source of profit.
II. Fetching and computing the data
Step 1: obtain the base data object, the account balance, and the BOLL indicator data for the trading logic to use.

// trading logic
function onTick() {
    var data = new Data(tradeTypeA, tradeTypeB); // create a base data object
    var accountStocks = data.accountData.Stocks; // account balance
    var boll = data.boll(dataLength, timeCycle); // fetch the BOLL indicator data
    if (!boll) return;                           // return if there is no BOLL data yet
}
III. Placing orders and follow-up handling
Step 1: execute the buy/sell operations according to the strategy logic above. First check whether the price and indicator conditions hold, then whether the position conditions hold, and finally call the trade() order function.

// trading logic
function onTick() {
    var data = new Data(tradeTypeA, tradeTypeB); // create a base data object
    var accountStocks = data.accountData.Stocks; // account balance
    var boll = data.boll(dataLength, timeCycle); // fetch the BOLL indicator data
    if (!boll) return;                           // return if there is no BOLL data yet
    // spread definitions
    // basb = (ask price of contract A - bid price of contract B)
    // sabb = (bid price of contract A - ask price of contract B)
    if (data.sabb > boll.middle && data.sabb < boll.up) {          // sabb is above the middle band
        if (data.mp(tradeTypeA, 0)) {                              // check for an A long position before ordering
            data.trade(tradeTypeA, "closebuy");                    // close the A long
        }
        if (data.mp(tradeTypeB, 1)) {                              // check for a B short position before ordering
            data.trade(tradeTypeB, "closesell");                   // close the B short
        }
    } else if (data.basb < boll.middle && data.basb > boll.down) { // basb is below the middle band
        if (data.mp(tradeTypeA, 1)) {                              // check for an A short position before ordering
            data.trade(tradeTypeA, "closesell");                   // close the A short
        }
        if (data.mp(tradeTypeB, 0)) {                              // check for a B long position before ordering
            data.trade(tradeTypeB, "closebuy");                    // close the B long
        }
    }
    if (accountStocks * Math.max(data.askA, data.askB) > 1) {      // the account has free balance
        if (data.basb < boll.down) {                               // the basb spread is below the lower band
            if (!data.mp(tradeTypeA, 0)) {                         // no A long position yet
                data.trade(tradeTypeA, "buy");                     // open an A long
            }
            if (!data.mp(tradeTypeB, 1)) {                         // no B short position yet
                data.trade(tradeTypeB, "sell");                    // open a B short
            }
        } else if (data.sabb > boll.up) {                          // the sabb spread is above the upper band
            if (!data.mp(tradeTypeA, 1)) {                         // no A short position yet
                data.trade(tradeTypeA, "sell");                    // open an A short
            }
            if (!data.mp(tradeTypeB, 0)) {                         // no B long position yet
                data.trade(tradeTypeB, "buy");                     // open a B long
            }
        }
    }
}
Step 2: after orders are placed, abnormal situations such as unfilled orders or holding only one leg of the pair have to be handled, and the chart has to be drawn.

// trading logic
function onTick() {
    var data = new Data(tradeTypeA, tradeTypeB); // create a base data object
    var accountStocks = data.accountData.Stocks; // account balance
    var boll = data.boll(dataLength, timeCycle); // fetch the BOLL indicator data
    if (!boll) return;                           // return if there is no BOLL data yet
    // spread definitions
    // basb = (ask price of contract A - bid price of contract B)
    // sabb = (bid price of contract A - ask price of contract B)
    if (data.sabb > boll.middle && data.sabb < boll.up) {          // sabb is above the middle band
        if (data.mp(tradeTypeA, 0)) {                              // check for an A long position before ordering
            data.trade(tradeTypeA, "closebuy");                    // close the A long
        }
        if (data.mp(tradeTypeB, 1)) {                              // check for a B short position before ordering
            data.trade(tradeTypeB, "closesell");                   // close the B short
        }
    } else if (data.basb < boll.middle && data.basb > boll.down) { // basb is below the middle band
        if (data.mp(tradeTypeA, 1)) {                              // check for an A short position before ordering
            data.trade(tradeTypeA, "closesell");                   // close the A short
        }
        if (data.mp(tradeTypeB, 0)) {                              // check for a B long position before ordering
            data.trade(tradeTypeB, "closebuy");                    // close the B long
        }
    }
    if (accountStocks * Math.max(data.askA, data.askB) > 1) {      // the account has free balance
        if (data.basb < boll.down) {                               // the basb spread is below the lower band
            if (!data.mp(tradeTypeA, 0)) {                         // no A long position yet
                data.trade(tradeTypeA, "buy");                     // open an A long
            }
            if (!data.mp(tradeTypeB, 1)) {                         // no B short position yet
                data.trade(tradeTypeB, "sell");                    // open a B short
            }
        } else if (data.sabb > boll.up) {                          // the sabb spread is above the upper band
            if (!data.mp(tradeTypeA, 1)) {                         // no A short position yet
                data.trade(tradeTypeA, "sell");                    // open an A short
            }
            if (!data.mp(tradeTypeB, 0)) {                         // no B long position yet
                data.trade(tradeTypeB, "buy");                     // open a B long
            }
        }
    }
    data.cancelOrders();     // cancel pending orders
    data.drawingChart(boll); // draw the chart
    data.isEven();           // handle a single-leg position
}
Hadoop cluster setup:

Configure hosts (identical on all 4 nodes)
192.168.83.11 hd1
192.168.83.22 hd2
192.168.83.33 hd3
192.168.83.44 hd4

Configure the hostname (takes effect after a reboot)
[hadoop@hd1 ~]$ more /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hd1

Configure the user and group
[hadoop@hd1 ~]$ id hadoop
uid=1001(hadoop) gid=10010(hadoop) groups=10010(hadoop)

Configure the JDK
[hadoop@hd1 ~]$ env|grep JAVA
JAVA_HOME=/usr/java/jdk1.8.0_11
[hadoop@hd1 ~]$ java -version
java version "1.8.0_11"
Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)

Configure passwordless ssh login
ssh-keygen -t rsa
ssh-keygen -t dsa
cat ~/.ssh/*.pub > ~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys hadoop@hd2:~/.ssh/authorized_keys

Configure environment variables:
export JAVA_HOME=/usr/java/jdk1.8.0_11
export JRE_HOME=/usr/java/jdk1.8.0_11/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export HADOOP_INSTALL=/home/hadoop/hadoop-2.7.1
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
export PATH=$PATH:/home/hadoop/zookeeper-3.4.6/bin

Note: Hadoop has a small quirk -- a JAVA_HOME set in ~/.bash_profile does not take effect; JAVA_HOME must be configured in hadoop-env.sh:
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/java/jdk1.8.0_11

Software preparation:
[hadoop@hd1 software]$ ls -l
total 931700
-rw-r--r-- 1 hadoop hadoop 17699306 Oct 6 17:30 zookeeper-3.4.6.tar.gz
-rw-r--r--. 1 hadoop hadoop 336424960 Jul 18 23:13 hadoop-2.7.1.tar

tar -xvf /usr/hadoop/hadoop-2.7.1.tar -C /home/hadoop/
tar -xvf ../software/zookeeper-3.4.6.tar.gz -C /home/hadoop/

Hadoop HA configuration
Reference: http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
Goal: "Using the Quorum Journal Manager or Conventional Shared Storage". The Quorum JournalNodes (JN) exist precisely to solve the shared-storage problem. The docs also describe an NFS-based setup, but I have performance concerns about NFS and don't dare use it.

Architecture
The official document describes the architecture in detail but is heavy going; the following is adapted from yameing's CSDN blog (full text: https://blog.csdn.net/yameing/article/details/39696151?utm_source=copy).

In a typical HA cluster, two separate machines are configured as NameNodes (NN). At any point in time only one is active while the other stands by. The active NN handles all client communication, while the standby NN is a simple slave that maintains just enough state to fail over quickly when needed.

So that the standby can keep its state in sync with the active NN, both nodes communicate with a group of separate JournalNode (JN) processes. Any edit performed on the active NN is durably logged to a majority of the JNs. The standby NN reads the edits from the JNs and constantly watches for new records; as it reads them it applies them to its own namespace, which keeps the two NNs in sync. On failover, the standby makes sure it has read all edits from the JNs before promoting itself to active, which guarantees the namespace is fully synchronized before the failover completes.

For fast failover it is also important that the standby NN has up-to-date block locations. To achieve this, the DataNodes (DN) are configured with the addresses of both NNs and send block reports and heartbeats to both.

It is vital that only one NN is active at a time; otherwise the two would quickly diverge, risking data loss or other incorrect results. To enforce this property and prevent the so-called split-brain scenario, the JNs only ever allow a single NN to write edits. During a failover, the NN that is becoming active simply takes over the role of writing to the JNs, which effectively prevents the other NN from continuing as active and lets the new active node fail over safely.

Node and instance planning:
NameNode machines: the hardware for the active and standby NN should be equivalent, just as in a non-HA cluster.
JournalNode machines: the JN daemon is relatively lightweight, so it can reasonably be collocated with other Hadoop daemons, e.g. on the NN, JobTracker, or ResourceManager machines. Note: there must be at least 3 JN daemons, since edits must be written to a majority of them; that lets the system tolerate the failure of a single machine. You may run more than 3, but to actually increase fault tolerance you should run an odd number. With N JNs, the system keeps working while tolerating at most (N-1)/2 failed machines.

Note: in an HA cluster the standby NN also performs checkpoints of the namespace, so there is no need to run a Secondary NameNode, CheckpointNode, or BackupNode; doing so is in fact an error. This also lets you reuse the hardware previously dedicated to the Secondary NameNode when reconfiguring a non-HA HDFS cluster to be HA.

Configuration details
To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml configuration file.
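The (N-1)/2 majority rule above is simple arithmetic; a one-liner makes it concrete for typical quorum sizes:

```python
def jn_fault_tolerance(n):
    """With n JournalNodes, edits must reach a majority, so the cluster
    keeps working with up to (n - 1) // 2 failed JNs."""
    if n < 3 or n % 2 == 0:
        raise ValueError("run an odd number of JNs, at least 3")
    return (n - 1) // 2

for n in (3, 5, 7):
    print(n, "JNs tolerate", jn_fault_tolerance(n), "failure(s)")
```

So the 3-JN layout used in this walkthrough (hd2, hd3, hd4) survives exactly one JN failure; going to 5 JNs would survive two.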
dfs.nameservices - the logical name for this new nameservice
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>

dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>

dfs.namenode.rpc-address.[nameservice ID].[name node ID] - the fully-qualified RPC address for each NameNode to listen on
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8020</value>
</property>

dfs.namenode.http-address.[nameservice ID].[name node ID] - the fully-qualified HTTP address for each NameNode to listen on
<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>machine1.example.com:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>machine2.example.com:50070</value>
</property>

dfs.namenode.shared.edits.dir - the URI which identifies the group of JNs where the NameNodes will write/read edits
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>

dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients use to contact the Active NameNode
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover. Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario.
sshfence - SSH to the Active NameNode and kill the process
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/exampleuser/.ssh/id_rsa</value>
</property>

fs.defaultFS - the default path prefix used by the Hadoop FS client when none is given. In your core-site.xml file:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>

dfs.journalnode.edits.dir - the path where the JournalNode daemon will store its local state
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/path/to/journal/node/local/data</value>
</property>

Configure the DataNodes
[hadoop@hd1 hadoop]$ more slaves
hd2
hd3
hd4

With the configuration above complete, copy the configuration files to the other nodes to finish the Hadoop cluster configuration.

Startup:
Start the JNs by running the command "hadoop-daemon.sh start journalnode".
For a brand-new cluster, first format the NN with hdfs namenode -format on one of the NN nodes.
If formatting is already done, the NN metadata must be copied to the other NN node by running "hdfs namenode -bootstrapStandby" on the unformatted NN node. (Note: before copying the metadata, start the already-formatted NN -- and only that one node -- with hadoop-daemon.sh start namenode.)
When converting a non-HA cluster to HA, run "hdfs namenode -initializeSharedEdits". Note: This is not yet implemented, and at present will always return success, unless the given NameNode is completely down.

Formatting the NN with hdfs namenode -format reported an error:
18/10/07 21:42:34 INFO ipc.Client: Retrying connect to server: hd2/192.168.83.22:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/10/07 21:42:34 WARN namenode.NameNode: Encountered exception during format:
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Unable to check if JNs are ready for formatting. 1 exceptions thrown:
192.168.83.22:8485: Call From hd1/192.168.83.11 to hd2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.hasSomeData(QuorumJournalManager.java:232)
at org.apache.hadoop.hdfs.server.common.Storage.confirmFormat(Storage.java:900)
at org.apache.hadoop.hdfs.server.namenode.FSImage.confirmFormat(FSImage.java:184)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:987)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1429)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
18/10/07 21:42:34 INFO ipc.Client: Retrying connect to server: hd3/192.168.83.33:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

Fix: formatting the NN requires connecting to the JNs, and this error appears whenever that connection fails or times out. First check that the JNs are actually started. If the failure is merely due to network latency, the timeout can be raised in core-site.xml to work around it:

<property>
  <name>ipc.client.connect.max.retries</name>
  <value>100</value>
  <description>Indicates the number of retries a client will make to establish a server connection.</description>
</property>
<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>10000</value>
  <description>Indicates the number of milliseconds a client will wait for before retrying to establish a server connection.</description>
</property>
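The total time these two settings buy is just retries times interval, which is worth checking before relying on it:

```python
max_retries = 100          # ipc.client.connect.max.retries
retry_interval_ms = 10000  # ipc.client.connect.retry.interval (milliseconds)

total_seconds = max_retries * retry_interval_ms / 1000
print(total_seconds)  # -> 1000.0
```

So the client keeps retrying the JN connection for up to 1000 s; if the JNs take longer than that to come up, the format still fails.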
---------------------
This section draws on 锐湃's CSDN blog; full text: https://blog.csdn.net/chuyouyinghe/article/details/78976933?utm_source=copy

Notes:
1) Adjusting the ipc parameters in core-site.xml only helps with connection timeouts caused by a service that has not finished starting. If the target service itself failed to start, tuning the ipc parameters has no effect.
2) This configuration raises the NameNode's maximum JournalNode connection time to 1000 s (maxRetries=100, sleepTime=10000). If the cluster has many nodes or the network is unstable and connecting takes longer than 1000 s, the NameNode will still die.

Automatic Failover:
The sections above describe how to configure manual failover. In that mode, the system will not automatically trigger a failover even if the active NN dies. This section describes how to configure and deploy automatic failover.

Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).

Introduction
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:

Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.

Active NameNode election - ZooKeeper provides a simple mechanism to exclusively select a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active.

The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client which also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:

Health monitoring - the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.

ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special "lock" znode. This lock uses ZooKeeper's support for "ephemeral" nodes; if the session expires, the lock node will be automatically deleted.

ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has "won the election", and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.

Deploying ZooKeeper
Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.
Install ZooKeeper: cluster-mode deployment
Reference: https://zookeeper.apache.org/doc/r3.4.6/zookeeperStarted.html

Extract zookeeper-3.4.6.tar.gz, then edit conf/zoo.cfg:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/home/hadoop/zookeeper-3.4.6/tmp
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.1=hd1:2888:3888
server.2=hd2:2888:3888
server.3=hd3:2888:3888

Configure the server ID
The entries of the form server.X list the servers that make up the ZooKeeper service. When a server starts up, it determines which server it is by looking for the file myid in the data directory; that file contains the server number X.

mkdir -p /home/hadoop/zookeeper-3.4.6/tmp
vi /home/hadoop/zookeeper-3.4.6/tmp/myid
[hadoop@hd1 tmp]$ more myid
1

Write 1 into the myid file on hd1, 2 on hd2, 3 on hd3, and so on. (The myid file must live under the dataDir configured in zoo.cfg.)

This completes the ZK configuration; copy the configuration files to the other ZK nodes to finish the ZK cluster setup.

Start ZK:
[hadoop@hd1 bin]$ sh zkServer.sh start
JMX enabled by default
Using config: /home/hadoop/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[hadoop@hd1 bin]$ jps
1957 QuorumPeerMain
1976 Jps

Configuring automatic failover:
In your hdfs-site.xml file, add:
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
In your core-site.xml file, add:
<property>
  <name>ha.zookeeper.quorum</name>
  <value>hd1:2181,hd2:2181,hd3:2181</value>
</property>
This lists the host-port pairs running the ZooKeeper service.

Initializing HA state in ZooKeeper
After the configuration keys have been added, the next step is to initialize the required state in ZooKeeper. Do so by running the following command from one of the NameNode hosts:
hdfs zkfc -formatZK
This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.

Starting the cluster with "start-dfs.sh"
Since automatic failover has been enabled in the configuration, the start-dfs.sh script will now automatically start a ZKFC daemon on any machine that runs a NameNode. When the ZKFCs start, they will automatically select one of the NameNodes to become active.

HA startup summary
Start the JNs (hd2, hd3, hd4):
[hadoop@hd4 ~]$ hadoop-daemon.sh start journalnode
starting journalnode, logging to /usr/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-journalnode-hd4.out
[hadoop@hd4 ~]$ jps
1843 JournalNode
1879 Jps

Format the NN (run on either NN node, hd1 or hd2):
[hadoop@hd1 sbin]$ hdfs namenode -format
18/10/07 05:54:30 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hd1/192.168.83.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.1
........
18/10/07 05:54:34 INFO namenode.FSImage: Allocated new BlockPoolId: BP-841723191-192.168.83.11-1538862874971
18/10/07 05:54:34 INFO common.Storage: Storage directory /usr/hadoop/hadoop-2.7.1/dfs/name has been successfully formatted.
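The server.X/myid mapping is easy to get wrong when copying configs between nodes. As a small illustration (a hypothetical helper, not part of ZooKeeper), each host's myid can be derived mechanically from the server.X lines of zoo.cfg:

```python
def myid_for(zoo_cfg_text, hostname):
    """Return the myid value for hostname, read from server.X=host:... lines."""
    for line in zoo_cfg_text.splitlines():
        line = line.strip()
        if line.startswith("server.") and "=" in line:
            key, value = line.split("=", 1)
            if value.split(":")[0] == hostname:
                return key.split(".", 1)[1]  # the X in server.X
    raise KeyError(hostname)

cfg = """server.1=hd1:2888:3888
server.2=hd2:2888:3888
server.3=hd3:2888:3888"""
print(myid_for(cfg, "hd2"))  # -> 2
```

In the walkthrough the files are written by hand; the point is only that the number in each host's myid must match the X of its own server.X entry.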
18/10/07 05:54:35 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/10/07 05:54:35 INFO util.ExitUtil: Exiting with status 0
18/10/07 05:54:35 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hd1/192.168.83.11
************************************************************/

Copy the NN metadata (note: before copying, start the already-formatted NN -- and only that node).
Start the formatted NN:
[hadoop@hd1 current]$ hadoop-daemon.sh start namenode
starting namenode, logging to /usr/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-namenode-hd1.out
[hadoop@hd1 current]$ jps
1777 QuorumPeerMain
2177 Jps

On the unformatted NN node (hd2), run the metadata copy command:
[hadoop@hd2 ~]$ hdfs namenode -bootstrapStandby
18/10/07 06:07:15 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost.localdomain/127.0.0.1
STARTUP_MSG: args = [-bootstrapStandby]
STARTUP_MSG: version = 2.7.1
......
************************************************************/
18/10/07 06:07:15 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
18/10/07 06:07:15 INFO namenode.NameNode: createNameNode [-bootstrapStandby]
18/10/07 06:07:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
=====================================================
About to bootstrap Standby ID nn2 from:
Nameservice ID: mycluster
Other Namenode ID: nn1
Other NN's HTTP address: http://hd1:50070
Other NN's IPC address: hd1/192.168.83.11:8020
Namespace ID: 1626081692
Block pool ID: BP-841723191-192.168.83.11-1538862874971
Cluster ID: CID-230e9e54-e6d1-4baf-a66a-39cc69368ed8
Layout version: -63
isUpgradeFinalized: true
=====================================================
18/10/07 06:07:17 INFO common.Storage: Storage directory /usr/hadoop/hadoop-2.7.1/dfs/name has been successfully formatted.
18/10/07 06:07:18 INFO namenode.TransferFsImage: Opening connection to http://hd1:50070/imagetransfer?getimage=1&txid=0&storageInfo=-63:1626081692:0:CID-230e9e54-e6d1-4baf-a66a-39cc69368ed8
18/10/07 06:07:18 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000 milliseconds
18/10/07 06:07:18 INFO namenode.TransferFsImage: Transfer took 0.01s at 0.00 KB/s
18/10/07 06:07:18 INFO namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000000000 size 353 bytes.
18/10/07 06:07:18 INFO util.ExitUtil: Exiting with status 0
18/10/07 06:07:18 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/

Format the HA state in ZK:
hdfs zkfc -formatZK
Formatting ZK reported an error:
18/10/07 22:34:06 INFO zookeeper.ClientCnxn: Opening socket connection to server hd1/192.168.83.11:2181. Will not attempt to authenticate using SASL (unknown error)
18/10/07 22:34:06 INFO zookeeper.ClientCnxn: Socket connection established to hd1/192.168.83.11:2181, initiating session
18/10/07 22:34:06 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
18/10/07 22:34:07 INFO zookeeper.ClientCnxn: Opening socket connection to server hd2/192.168.83.22:2181. Will not attempt to authenticate using SASL (unknown error)
18/10/07 22:34:07 INFO zookeeper.ClientCnxn: Socket connection established to hd2/192.168.83.22:2181, initiating session
18/10/07 22:34:07 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
18/10/07 22:34:07 ERROR ha.ActiveStandbyElector: Connection timed out: couldn't connect to ZooKeeper in 5000 milliseconds
18/10/07 22:34:07 INFO zookeeper.ZooKeeper: Session: 0x0 closed
18/10/07 22:34:07 INFO zookeeper.ClientCnxn: EventThread shut down
18/10/07 22:34:07 FATAL ha.ZKFailoverController: Unable to start failover controller. Unable to connect to ZooKeeper quorum at hd1:2181,hd2:2181,hd3:2181. Please check the configured value for ha.zookeeper.quorum and ensure that ZooKeeper is running.
[hadoop@hd1 ~]$
It looks like the ZooKeeper quorum was not able to elect a master -- probably a ZooKeeper misconfiguration. Make sure all 3 servers are entered in zoo.cfg with a unique ID, that the same config is present on all 3 machines, and that every server is using the correct myid as specified in the cfg.

After fixing the configuration, rerun:
[hadoop@hd1 bin]$ hdfs zkfc -formatZK
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/home/hadoop/hadoop-2.7.1/lib/native
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:java.compiler=
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-504.el6.x86_64
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:user.name=hadoop
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/hadoop
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/hadoop/zookeeper-3.4.6/bin
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hd1:2181,hd2:2181,hd3:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5119fb47
18/10/09 20:27:21 INFO zookeeper.ClientCnxn: Opening socket connection to server hd1/192.168.83.11:2181. Will not attempt to authenticate using SASL (unknown error)
18/10/09 20:27:22 INFO zookeeper.ClientCnxn: Socket connection established to hd1/192.168.83.11:2181, initiating session
18/10/09 20:27:22 INFO zookeeper.ClientCnxn: Session establishment complete on server hd1/192.168.83.11:2181, sessionid = 0x16658c662c80000, negotiated timeout = 5000
18/10/09 20:27:22 INFO ha.ActiveStandbyElector: Session connected.
18/10/09 20:27:22 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/mycluster in ZK.
18/10/09 20:27:22 INFO zookeeper.ZooKeeper: Session: 0x16658c662c80000 closed
18/10/09 20:27:22 INFO zookeeper.ClientCnxn: EventThread shut down

Start ZK:
[hadoop@hd1 bin]$ ./zkServer.sh start
JMX enabled by default
Using config: /home/hadoop/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

Start everything:
[hadoop@hd1 bin]$ start-dfs.sh
18/10/09 20:36:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hd1 hd2]
hd2: namenode running as process 2065. Stop it first.
hd1: namenode running as process 2011. Stop it first.
hd2: starting datanode, logging to /home/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-datanode-hd2.out
hd4: starting datanode, logging to /home/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-datanode-hd4.out
hd3: starting datanode, logging to /home/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-datanode-hd3.out
Starting journal nodes [hd2 hd3 hd4]
hd4: journalnode running as process 1724. Stop it first.
hd2: journalnode running as process 1839. Stop it first.
hd3: journalnode running as process 1725. Stop it first.
18/10/09 20:37:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting ZK Failover Controllers on NN hosts [hd1 hd2]
hd1: zkfc running as process 3045. Stop it first.
hd2: zkfc running as process 2601. Stop it first.
Check the DataNode log on hd2:

[hadoop@hd2 logs]$ jps
1984 QuorumPeerMain
2960 Jps
2065 NameNode
2601 DFSZKFailoverController
1839 JournalNode

2018-10-09 20:37:07,674 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /home/hadoop/hadoop-2.7.1/dfs/data/in_use.lock acquired by nodename 2787@hd2
2018-10-09 20:37:07,674 WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /home/hadoop/hadoop-2.7.1/dfs/data: namenode clusterID = CID-e28f1182-d452-4f23-9b37-9a59d4bdeaa0; datanode clusterID = CID-876d5634-38e8-464c-be02-714ee8c72878
2018-10-09 20:37:07,675 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to hd2/192.168.83.22:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:477)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1361)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1326)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:316)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:801)
        at java.lang.Thread.run(Thread.java:745)
2018-10-09 20:37:07,676 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to hd1/192.168.83.11:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:477)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1361)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1326)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:316)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:801)
        at java.lang.Thread.run(Thread.java:745)
2018-10-09 20:37:07,683 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool (Datanode Uuid unassigned) service to hd1/192.168.83.11:8020
2018-10-09 20:37:07,684 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool (Datanode Uuid unassigned) service to hd2/192.168.83.22:8020
2018-10-09 20:37:07,687 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool (Datanode Uuid unassigned)
2018-10-09 20:37:09,688 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2018-10-09 20:37:09,689 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2018-10-09 20:37:09,698 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at hd2/192.168.83.22
************************************************************/

The DataNode on hd2 did not start, and the log shows why:

namenode clusterID = CID-e28f1182-d452-4f23-9b37-9a59d4bdeaa0; datanode clusterID = CID-876d5634-38e8-464c-be02-714ee8c72878

The NameNode and DataNode clusterIDs do not match, so startup failed. Looking back at my own steps: reformatting the NameNode changed its clusterID, while the DataNode's clusterID stayed the same, producing the mismatch. The fix is simple: delete the DataNode's data directory and restart the DataNode.

Check the hd2 node again:

[hadoop@hd2 dfs]$ jps
1984 QuorumPeerMain
2065 NameNode
3123 DataNode
3268 Jps
2601 DFSZKFailoverController
1839 JournalNode

At this point every daemon on every node is up. Here is the full list:

hd1:
[hadoop@hd1 bin]$ jps
4180 Jps
3045 DFSZKFailoverController
2135 QuorumPeerMain
2011 NameNode

hd2:
[hadoop@hd2 dfs]$ jps
1984 QuorumPeerMain
2065 NameNode
3123 DataNode
3268 Jps
2601 DFSZKFailoverController
1839 JournalNode

hd3:
[hadoop@hd3 bin]$ jps
2631 Jps
2523 DataNode
1725 JournalNode
1807 QuorumPeerMain

hd4:
[hadoop@hd4 ~]$ jps
2311 DataNode
2425 Jps
1724 JournalNode
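The check behind that fix can be sketched as a small helper: read the clusterID out of each VERSION file and verify they agree. On this cluster the VERSION files sit under /home/hadoop/hadoop-2.7.1/dfs/name/current and .../dfs/data/current; the helper name is mine, not part of Hadoop.

```shell
# same_cluster_id NN_VERSION DN_VERSION
# Succeeds when both VERSION files record the same clusterID; a mismatch
# is exactly the "Incompatible clusterIDs" condition seen in the log above.
same_cluster_id() {
  nn_id=$(grep '^clusterID=' "$1" | cut -d= -f2)
  dn_id=$(grep '^clusterID=' "$2" | cut -d= -f2)
  [ -n "$nn_id" ] && [ "$nn_id" = "$dn_id" ]
}
```

When the IDs differ, the repair used here was to wipe the DataNode's data directory and restart the DataNode so it re-registers under the NameNode's current clusterID.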
Access the NameNode web UI (either NN works):

http://192.168.83.11:50070
http://192.168.83.22:50070

Quick smoke test with a file upload:

[hadoop@hd1 bin]$ hdfs dfs -put zookeeper.out /
18/10/09 21:11:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hd1 bin]$ hdfs dfs -ls /
18/10/09 21:11:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   3 hadoop supergroup      25698 2018-10-09 21:11 /zookeeper.out

Next, configure MapReduce.

mapred-site.xml:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

yarn-site.xml:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hd1</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

Manage MapReduce through the web UI:

http://hd1:8088/

Default service ports can be found under the configuration section of http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/ClusterSetup.html.

Manual NameNode management:

[hadoop@hd1 bin]$ hdfs haadmin
Usage: haadmin
    [-transitionToActive [--forceactive] <serviceId>]
    [-transitionToStandby <serviceId>]    -- the nn1 and nn2 defined earlier
    [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
    [-getServiceState <serviceId>]
    [-checkHealth <serviceId>]
    [-help <command>]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

This guide describes high-level uses of each of these subcommands. For specific usage information of each subcommand, you should run "hdfs haadmin -help <command>".

transitionToActive and transitionToStandby - transition the state of the given NameNode to Active or Standby

These subcommands cause a given NameNode to transition to the Active or Standby state, respectively. These commands do not attempt to perform any fencing, and thus should rarely be used. Instead, one should almost always prefer to use the "hdfs haadmin -failover" subcommand.
failover - initiate a failover between two NameNodes

This subcommand causes a failover from the first provided NameNode to the second. If the first NameNode is in the Standby state, this command simply transitions the second to the Active state without error. If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, the fencing methods (as configured by dfs.ha.fencing.methods) will be attempted in order until one succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned.

getServiceState - determine whether the given NameNode is Active or Standby

Connect to the provided NameNode to determine its current state, printing either "standby" or "active" to STDOUT appropriately. This subcommand might be used by cron jobs or monitoring scripts which need to behave differently based on whether the NameNode is currently Active or Standby.

checkHealth - check the health of the given NameNode

Connect to the provided NameNode to check its health. The NameNode is capable of performing some diagnostics on itself, including checking if internal services are running as expected. This command will return 0 if the NameNode is healthy, non-zero otherwise. One might use this command for monitoring purposes.

Note: This is not yet implemented, and at present will always return success, unless the given NameNode is completely down.
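A monitoring script of the kind the getServiceState notes describe might look like the sketch below. report_states is a name I made up; in production the command argument would wrap `hdfs haadmin -getServiceState`, and nn1/nn2 are the service IDs assumed for this cluster's dfs.ha.namenodes setting.

```shell
# report_states CMD ID...
# Prints "ID is STATE" for each NameNode service ID, where STATE is
# whatever CMD prints for that ID. In a real cron job CMD would be a
# wrapper around `hdfs haadmin -getServiceState`.
report_states() {
  cmd=$1
  shift
  for nn in "$@"; do
    printf '%s is %s\n' "$nn" "$("$cmd" "$nn")"
  done
}
```

Passing the state-reading command as a parameter keeps the script testable without a live cluster: any function or binary that prints "active" or "standby" can stand in for haadmin.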
First, let's get acquainted with HDFS. HDFS (Hadoop Distributed File System) is Hadoop's distributed file system. It splits a large file into blocks stored across nodes on different servers, and over the network gives users the impression that they are browsing a local file. To reduce the impact of lost files, it keeps multiple replicas of each block (three by default), enabling file sharing and storage by many users across many machines.

HDFS characteristics:
① It keeps multiple replicas (three by default) and provides fault tolerance: lost replicas and crashed nodes are recovered automatically.
② It runs on inexpensive commodity machines.
③ It is suited to big data processing. Even a small file occupies a block, so the more small files there are (say 1000 files of 1 KB), the more blocks there are and the greater the pressure on the NameNode.
The read flow is:
a. The client sends a read request to the namenode.
b. The namenode consults its metadata and returns the locations of fileA's blocks:
block1: host2, host1, host3
block2: host7, host8, host4
c. The blocks are read in order: block1 first, then block2. block1 is read from host2; then block2 is read from host7.
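The ordering in steps b and c can be sketched as follows. The block-to-host map is the example from the text, and picking the first host in each list stands in for the HDFS client's real network-distance-aware replica selection:

```shell
# For each "block host1,host2,..." line, pick the first replica —
# mirroring the flow above where block1 is read from host2 and
# block2 from host7.
read_order() {
  while read -r block hosts; do
    printf '%s -> %s\n' "$block" "${hosts%%,*}"
  done
}

printf 'block1 host2,host1,host3\nblock2 host7,host8,host4\n' | read_order
# prints:
#   block1 -> host2
#   block2 -> host7
```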
Introduction

As I start learning es, I like to take notes along the way, summarizing the problems I hit and how I solved them. In this article, three nodes are installed across two Linux virtual machines, so this setup exercises both modes at once: single-machine and distributed. If conditions allow, you can place the es nodes on multiple machines; if your hardware is limited, you can configure multiple nodes on a single VM.

The figure shows the layout of the 3 nodes in this setup.

VM hostname  IP  es node  master
References:
kubeflow
open mpi
Kubeflow: Cloud-native machine learning with Kubernetes
Bringing Your Data Pipeline Into The Machine Learning Era
Introducing Argo — A Container-Native Workflow Engine for Kubernetes
Introducing Seldon Deploy
jupyterhub documentation
MPI AND SCALABLE DISTRIBUTED MACHINE LEARNING
Chainer: making complex neural networks simple
pachyderm documentation
https://github.com/fnproject/fn-helm/issues/21