Building a Highly Available Web Cluster with Corosync + Pacemaker + NFS - Linux Discussion - 黑帽联盟

Posted by 定位 on 2017-5-10 21:22:05


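Contents:
I. Environment
II. Preparation before configuration
III. Installing corosync and pacemaker and providing the configuration
IV. Starting and checking corosync
V. Installing crmsh and a brief introduction to using it
VI. Configuring cluster resources with crmsh
VII. Testing the resources
VIII. Introduction to resource constraints and defining resources with them

I. Environment
1. Operating system
- CentOS 6.7, x86_64 (64-bit)
2. Software
- Corosync 1.4.7
- Pacemaker 1.1.15
- crmsh 3.0.0
3. Topology

node1        172.16.120.176
node2        172.16.120.180
NFS server   172.16.120.88
webip        172.16.120.188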

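II. Preparation before configuration
1. Make sure the host names of all nodes resolve to each other
2. Synchronize the time on all nodes
3. Set up key-based ssh between the nodes
4. Turn off the firewall and SELinux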
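
As a quick way to check the first prerequisite, here is a minimal sketch that reports which node names have no entry in a hosts file. The function name `missing_hosts` is made up for this example, and the node names are the ones that appear in the logs later in this post; adjust both to your environment.

```shell
# missing_hosts HOSTS_FILE NAME... : print each NAME that has no entry in HOSTS_FILE.
# (Hypothetical helper for illustration; it only reads the file.)
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Skip comment lines, then look for the name in any field after the address.
        if ! awk -v n="$name" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit found ? 0 : 1 }
        ' "$hosts_file"; then
            echo "$name"
        fi
    done
}

# Example run on a cluster node:
#   missing_hosts /etc/hosts node1.test.com node2.test.com
```

If the function prints nothing, every name given has an entry in the file. It does not check DNS, only the hosts file it is pointed at.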

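I covered the environment setup in an earlier post: "heartbeat和corosync高可用集群前提环境搭建" (setting up the prerequisite environment for heartbeat and corosync high-availability clusters).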

III. Installing corosync and pacemaker and providing the configuration
1. Installation (run the following on both node1 and node2)
yum --nogpgcheck localinstall *.rpm

The rpm files are:
cluster-glue-1.0.5-6.el6.x86_64.rpm
cluster-glue-libs-1.0.5-6.el6.x86_64.rpm
corosync-1.4.7-5.el6.x86_64.rpm
corosynclib-1.4.7-5.el6.x86_64.rpm
heartbeat-3.0.4-2.el6.x86_64.rpm
heartbeat-libs-3.0.4-2.el6.x86_64.rpm
pacemaker-1.1.15-5.el6.x86_64.rpm
pacemaker-cts-1.1.15-5.el6.x86_64.rpm
pacemaker-libs-1.1.15-5.el6.x86_64.rpm
resource-agents-3.9.5-46.el6.x86_64.rpm
libesmtp-1.0.4-15.el6.x86_64.rpm

2. Provide the configuration file
# cd /etc/corosync/
# ls
amf.conf.example  corosync.conf.example.udpu  uidgid.d
corosync.conf.example  service.d

# corosync ships a sample configuration file; we only need to copy it and use the copy as our configuration:

# cp corosync.conf.example corosync.conf

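3. Define the configuration
The configuration file explained (the trailing # comments are annotations for this walkthrough; it is safest to leave them out of the real file):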

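compatibility: whitetank  # stay backward compatible with whitetank (openais 0.80.z)
totem {  # heartbeat-layer messaging between the cluster nodes
      version: 2
      secauth: on  # enable authentication of cluster messages; should be on
      threads: 2  # number of threads used to encrypt and send messages
      interface {  # interface used for heartbeat messages
            ringnumber: 0  # ring number of this interface; numbering starts at 0
            bindnetaddr: 172.16.120.1  # address to bind to; this should be the network address of the NIC's network, not a host address
            mcastaddr: 226.94.1.1  # multicast address
            mcastport: 5405
            ttl: 1  # time to live of the multicast packets
      }
}
logging {  # logging settings
      fileline: off
      to_stderr: no  # whether to send log messages to stderr (the screen)
      to_logfile: yes
      to_syslog: no  # whether to log to /var/log/messages; changed to no
      logfile: /var/log/cluster/corosync.log
      debug: off
      timestamp: on  # timestamp each message; turning this off saves system calls and resources
      logger_subsys {
            subsys: AMF
            debug: off
      }
}
amf {
      mode: disabled
}

service {  # service definition
      ver: 0
      name: pacemaker  # start pacemaker as a corosync plugin
}
aisexec {  # user and group the process runs as
      user: root
      group: root
}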

# man corosync.conf explains what every option means.
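
The bindnetaddr comment asks for a network address rather than a host address, while the value shown (172.16.120.1) looks like a host address. As a sketch of how the network address can be worked out, the function below masks an IPv4 address with a prefix length. The name `network_addr` is made up for this example, and the /16 prefix is an assumption taken from the cidr_netmask=16 used for the web IP later in the post; check the real netmask of your heartbeat interface.

```shell
# network_addr IP PREFIX : print the IPv4 network address of IP/PREFIX.
# (Hypothetical helper; plain POSIX shell arithmetic, needs 64-bit integers.)
network_addr() {
    ip=$1
    prefix=$2
    # Split the dotted quad into $1..$4.
    old_ifs=$IFS
    IFS=.
    set -- $ip
    IFS=$old_ifs
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    if [ "$prefix" -eq 0 ]; then
        mask=0
    else
        mask=$(( (4294967295 << (32 - prefix)) & 4294967295 ))
    fi
    net=$(( addr & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))"
}

# With the node1 address from the topology and an assumed /16 netmask:
network_addr 172.16.120.176 16
```

Under that assumption the value for bindnetaddr would be 172.16.0.0; with a /24 netmask it would be 172.16.120.0.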

4. Generate the key file
Because secauth: on was set above, a key file must be provided.

# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Press keys on your keyboard to generate entropy (bits = 192).

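# Note: corosync-keygen reads from the /dev/random device by default. If the system's entropy pool (fed by interrupts) runs low, key generation can block for a long time. To save time in a test environment, we point random at urandom before generating the key: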
mv /dev/{random,random.bak}
ln -s /dev/urandom /dev/random

5. Give node2 the same configuration, i.e. copy the key file authkey and the configuration file corosync.conf to node2
# scp authkey corosync.conf root@node2:/etc/corosync/
authkey          100%  128   0.1KB/s   00:00
corosync.conf    100%  541   0.5KB/s   00:00
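
Since both nodes must hold identical copies of authkey and corosync.conf, it can be worth comparing checksums after the copy. This is only a sketch: `same_content` is a made-up helper name, and the ssh line in the comment assumes the key-based ssh access set up in the preparation section.

```shell
# same_content FILE_A FILE_B : succeed if both files have the same checksum and size.
# Returns 2 if either file cannot be read. (Hypothetical helper for illustration.)
same_content() {
    sum_a=$(cksum < "$1") || return 2
    sum_b=$(cksum < "$2") || return 2
    [ "$sum_a" = "$sum_b" ]
}

# Comparing against the remote copy could look like this on node1:
#   [ "$(cksum < /etc/corosync/authkey)" = "$(ssh node2 'cksum < /etc/corosync/authkey')" ] \
#       && echo "authkey matches"
```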

# corosync is now installed and configured.

IV. Starting and checking corosync
1. Start the service
# service corosync start
Starting Corosync Cluster Engine (corosync): [ OK ]

2. Check that the corosync engine started normally
# grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log
Feb 26 17:33:28 corosync Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Feb 26 17:33:28 corosync Successfully read main configuration file '/etc/corosync/corosync.conf'.

3. Check that the initial membership notifications were sent
# grep TOTEM /var/log/cluster/corosync.log
Feb 26 17:33:28 corosync Initializing transport (UDP/IP Multicast).
Feb 26 17:33:28 corosync Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Feb 26 17:33:28 corosync The network interface is now up.
Feb 26 17:33:28 corosync Process pause detected for 616 ms, flushing membership messages.
Feb 26 17:33:28 corosync A processor joined or left the membership and a new membership was formed.
Feb 26 17:33:46 corosync A processor joined or left the membership and a new membership was formed.

4. Check whether any errors occurred during startup
# grep ERROR: /var/log/cluster/corosync.log
Feb 26 17:33:28 corosync ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Feb 26 17:33:28 corosync ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN

# These errors mean that pacemaker will soon no longer run as a corosync plugin, so CMAN is recommended as the cluster infrastructure layer. They can be safely ignored here.

5. Check that pacemaker started normally
# grep pcmk_startup /var/log/cluster/corosync.log
Feb 26 17:33:28 corosync info: pcmk_startup: CRM: Initialized
Feb 26 17:33:28 corosync Logging: Initialized pcmk_startup
Feb 26 17:33:28 corosync info: pcmk_startup: Maximum core file size is: 4294967295
Feb 26 17:33:28 corosync info: pcmk_startup: Service: 9
Feb 26 17:33:28 corosync info: pcmk_startup: Local hostname: node1.test.com
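
The four grep checks above can be rolled into one small function. This is a sketch only: `check_corosync_log` is a made-up name, the patterns are the same strings used in steps 2 to 5, and ERROR lines are just counted, because the plugin warning shown in step 4 is expected in this setup.

```shell
# check_corosync_log LOG : run the start-up checks from steps 2 to 5 against LOG.
# Returns non-zero if any expected pattern is missing. (Hypothetical helper.)
check_corosync_log() {
    log=$1
    status=0
    for pattern in "Corosync Cluster Engine" "configuration file" "TOTEM" "pcmk_startup"; do
        if grep -q -e "$pattern" "$log"; then
            echo "ok: $pattern"
        else
            echo "MISSING: $pattern"
            status=1
        fi
    done
    # Count ERROR lines without failing on them; they still need a human look.
    errors=$(grep -c "ERROR:" "$log")
    echo "error lines: $errors"
    return $status
}

# Example on a cluster node:
#   check_corosync_log /var/log/cluster/corosync.log
```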

6. If all of the commands above ran without problems, start corosync on node2 with the following command
# ssh node2 "service corosync start"
Starting Corosync Cluster Engine (corosync): [ OK ]

7. Check the status
# crm_mon
Last updated: Wed Feb 26 17:41:58 2014
Last change: Wed Feb 26 17:33:51 2014 via crmd on node1.test.com
Stack: classic openais (with plugin)
Current DC: node1.test.com - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
0 Resources configured
Online:

# The output shows that the services started normally and that node1 is currently the DC, but 0 resources are configured. Next we define the resources.

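V. Installing crmsh and a brief introduction to using it
I posted about this before as well: "CentOS6.x 安装crmsh" (installing crmsh on CentOS 6.x).
For an overview of the commands, see: "pacemaker资源管理器（CRM）命令注解" (annotated commands of the pacemaker resource manager, CRM).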

VI. Configuring cluster resources with crmsh
1. Define the web IP as a resource named Webip (nic is followed by the name of the network interface on the server; mine is eth1, so I write eth1; if yours is eth0, write eth0)
# crm configure
crm(live)configure# primitive Webip ocf:heartbeat:IPaddr params ip=172.16.120.188 nic=eth1 cidr_netmask=16
crm(live)configure# verify  # check that the definition above is valid

# verify reports errors at this point, mainly because no STONITH resources are defined. There is no STONITH device in this lab, so we disable that property first.

Define the global property:
crm(live)configure# property stonith-enabled=false

crm(live)configure# verify  # validate the configuration; no errors this time
crm(live)configure# commit  # finally, commit and save

2. Define the apache service resource, named httpd
crm(live)configure# primitive httpd lsb:httpd
crm(live)configure# verify  # validate the configuration; no errors
crm(live)configure# commit  # finally, commit and save

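3. Check the resource status
crm(live)configure# cd ..
crm(live)# status

I won't paste the output here.
# The output shows that Webip and httpd are started automatically on two different nodes. Because httpd and Webip are only useful together, the two resources must be bound together.
There are two ways to bind resources. One is to define a group and add Webip and httpd to it; the other is to define resource constraints so that the resources run on the same node. Here we bind them with a group.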

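4. Define a group resource (this can also be done with constraints, which are covered later)
crm(live)configure# group Webservice Webip httpd
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status

We can see that the two resources now run on the same node.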

VII. Testing the resources
1. Install the apache service on node1 and node2 and provide a test page on each

On node1:
# echo "<h1>node1.test.com</h1>" > /var/www/html/index.html
# service httpd start
# curl node1.test.com
<h1>node1.test.com</h1>
# service httpd stop
# chkconfig httpd off

On node2:
# echo "<h1>node2.test.com</h1>" > /var/www/html/index.html
# service httpd start
# curl node2.test.com
<h1>node2.test.com</h1>
# service httpd stop
# chkconfig httpd off

# After confirming that apache works on node1 and node2, stop the service and make sure it does not start at boot, since the cluster will manage it.

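2. Test
Enter the web IP (172.16.120.188) in a browser.

3. Simulate a service failure and test whether the resources move automatically

Stop corosync on node1, then check the status from node2:
# service corosync stop
crm(live)# status
Last updated: Thu Feb 27 10:34:09 2014
Last change: Thu Feb 27 10:33:27 2014 via crm_attribute on node1.test.com
Stack: classic openais (with plugin)
Current DC: node2.test.com - partition WITHOUT quorum
# node2 is still online, but the resources did not move to it. Why?
The line "node2.test.com - partition WITHOUT quorum" shows that the cluster is now without quorum, i.e. it no longer holds the required majority of votes, so the cluster considers itself unfit to run services and stops them. For a cluster with only two nodes this behaviour is unreasonable, so we set a global property that tells the cluster to ignore the loss of quorum.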

4. Set a global property to ignore the loss of quorum
crm(live)# configure property no-quorum-policy=ignore
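
To make the quorum reasoning concrete, here is a small arithmetic sketch. It assumes the usual majority rule (more than half of the expected votes) and one vote per node; `has_quorum` is a made-up helper name, not a cluster command.

```shell
# has_quorum EXPECTED_VOTES ONLINE_VOTES : succeed if the online votes are a strict majority.
# (Hypothetical helper that only does the arithmetic.)
has_quorum() {
    expected=$1
    online=$2
    needed=$(( expected / 2 + 1 ))
    [ "$online" -ge "$needed" ]
}

# Two expected votes, one node left: 1 < 2, so the partition has no quorum.
if has_quorum 2 1; then echo "quorum"; else echo "WITHOUT quorum"; fi
# Three expected votes, two nodes left: 2 >= 2, so quorum is kept.
if has_quorum 3 2; then echo "quorum"; else echo "WITHOUT quorum"; fi
```

With two nodes a single survivor can never reach the majority, which is why no-quorum-policy=ignore is set here. With three or more nodes the default policy can usually stay in place.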

5. Test again
Start the corosync service that was stopped on node1, then check the status again.
crm(live)# status
Last updated: Thu Feb 27 10:48:24 2014
Last change: Thu Feb 27 10:48:08 2014 via crm_attribute on node1.test.com
Stack: classic openais (with plugin)
Current DC: node1.test.com - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online:
Resource Group: Webservice
    Webip (ocf::heartbeat:IPaddr): Started node2.test.com
    httpd (lsb:httpd): Started node2.test.com
crm(live)#
# The status shows that node1 and node2 are both running normally and that the resources are on node2.

# ssh node2 "service corosync stop"  # stop corosync on node2
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:..[ OK ]

# crm status  # check the status again to verify
Last updated: Thu Feb 27 10:51:10 2014
Last change: Thu Feb 27 10:48:08 2014 via crm_attribute on node1.test.com
Stack: classic openais (with plugin)
Current DC: node1.test.com - partition WITHOUT quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online:
OFFLINE:
Resource Group: Webservice
    Webip (ocf::heartbeat:IPaddr): Started node1.test.com
    httpd (lsb:httpd): Started node1.test.com

# This time the resources have moved to node1 automatically.

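VIII. Introduction to resource constraints
1. Overview of resource constraints
Resource constraints specify on which cluster nodes resources run, in what order resources are started, and which other resources a given resource depends on. pacemaker provides three kinds of resource constraints:
1) Resource Location: defines which nodes a resource prefers to run on;
2) Resource Colocation: defines whether cluster resources may or may not run together on the same node;
3) Resource Order: defines the order in which cluster resources are started on a node.
A constraint is defined together with a score. Scores of all kinds are central to how the cluster works: everything from migrating a resource to deciding which resources to stop in a degraded cluster is achieved by manipulating scores in some way. Scores are calculated per resource, and any node with a negative score for a resource cannot run that resource. After calculating the scores for a resource, the cluster picks the node with the highest score. INFINITY is currently defined as 1,000,000. Adding and subtracting INFINITY follows three basic rules:
1) any value + INFINITY = INFINITY
2) any value - INFINITY = -INFINITY
3) INFINITY - INFINITY = -INFINITY
When defining resource constraints, you can also give each constraint its own score. The score is the value assigned to that constraint. Constraints with higher scores are applied before those with lower scores. By creating additional location constraints with different scores for a given resource, you can specify the order of the nodes that the resource will fail over to.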
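
The three INFINITY rules above can be sketched as a saturating addition. This is only an illustration of the arithmetic described in the text, not pacemaker's own code; `score_value` and `score_add` are made-up helper names, and subtracting INFINITY is written as adding -INFINITY.

```shell
INFINITY=1000000

# score_value S : turn INFINITY / -INFINITY into numbers, pass integers through.
score_value() {
    case $1 in
        INFINITY|+INFINITY) echo "$INFINITY" ;;
        -INFINITY)          echo "-$INFINITY" ;;
        *)                  echo "$1" ;;
    esac
}

# score_add A B : add two scores, saturating at +/-INFINITY.
score_add() {
    a=$(score_value "$1")
    b=$(score_value "$2")
    if [ "$a" -le "-$INFINITY" ] || [ "$b" -le "-$INFINITY" ]; then
        echo "-INFINITY"        # rules 2 and 3: -INFINITY always wins
    elif [ "$a" -ge "$INFINITY" ] || [ "$b" -ge "$INFINITY" ]; then
        echo "INFINITY"         # rule 1
    else
        sum=$(( a + b ))
        if [ "$sum" -ge "$INFINITY" ]; then
            echo "INFINITY"
        elif [ "$sum" -le "-$INFINITY" ]; then
            echo "-INFINITY"
        else
            echo "$sum"
        fi
    fi
}

score_add 100 INFINITY         # rule 1
score_add 100 -INFINITY        # rule 2
score_add INFINITY -INFINITY   # rule 3
```

The practical consequence of rule 3 is that a -INFINITY constraint (a ban) beats any INFINITY preference for the same node.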

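As mentioned earlier, there are two ways to bind resources: define a group and add the resources to it, or define resource constraints so that the resources run on the same node.
To reinforce the idea, this time we bind the resources with constraints.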

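2. Bind the resources using resource constraints
crm(live)resource# stop Webservice  # stop the existing group
crm(live)configure# delete Webservice  # delete the group defined earlier
crm(live)configure# commit

crm(live)# status  # the group Webservice has been deleted, and Webip and httpd are now placed on different nodes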
Last updated: Thu Feb 27 11:26:59 2014
Last change: Thu Feb 27 11:26:53 2014 via cibadmin on node1.test.com
Stack: classic openais (with plugin)
Current DC: node1.test.com - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.test.com node2.test.com ]

Webip (ocf::heartbeat:IPaddr): Started node1.test.com
httpd (lsb:httpd): Started node2.test.com

3. Add shared storage
To provide shared storage, an NFS server was added; it exports 172.16.120.88:/web.
An index.html was created in the /web directory with the content: from nfs server

4. Add the NFS filesystem to the cluster as a resource
crm(live)configure# primitive Webstore ocf:heartbeat:Filesystem params device="172.16.120.88:/web" directory="/var/www/html" fstype="nfs"
crm(live)configure# verify
WARNING: Webstore: default timeout 20s for start is smaller than the advised 60
WARNING: Webstore: default timeout 20s for stop is smaller than the advised 60
crm(live)configure# commit

5. Define the resource constraints:
crm(live)configure# colocation httpd_with_webstore INFINITY: httpd Webstore
crm(live)configure# colocation httpd_with_webip INFINITY: httpd Webip
crm(live)configure# verify
WARNING: Webstore: default timeout 20s for start is smaller than the advised 60
WARNING: Webstore: default timeout 20s for stop is smaller than the advised 60
crm(live)configure# commit
crm(live)# status
Last updated: Thu Feb 27 12:39:14 2014
Last change: Thu Feb 27 12:38:09 2014 via cibadmin on node1.test.com
Stack: classic openais (with plugin)
Current DC: node1.test.com - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
3 Resources configured

Online: [ node1.test.com node2.test.com ]

Webip (ocf::heartbeat:IPaddr): Started node2.test.com
httpd (lsb:httpd): Started node2.test.com
Webstore (ocf::heartbeat:Filesystem): Started node2.test.com
# The status shows that Webip, httpd and Webstore run on the same node.
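
The colocation constraints keep the three resources on one node but say nothing about start order, which is what the Resource Order constraint type is for. The post itself does not define any order constraints, so the following is only a sketch: the constraint IDs are made up, the resource IDs are the ones shown in the status output (Webip, Webstore, httpd), and the function merely writes the crmsh lines to a file without touching the cluster.

```shell
# write_order_constraints FILE : write two hypothetical ordering constraints to FILE.
# Nothing here talks to the cluster; the file can be reviewed before it is loaded.
write_order_constraints() {
    cat > "$1" <<'EOF'
order Webstore_before_httpd Mandatory: Webstore httpd
order Webip_before_httpd Mandatory: Webip httpd
EOF
}

# Example (assumes crmsh is installed on the node where this runs):
#   write_order_constraints /tmp/web-order.crm
#   crm configure load update /tmp/web-order.crm
```

The intent is that the NFS mount and the IP address are up before httpd starts. A group gives this ordering implicitly, which is one reason the group approach used earlier is shorter.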

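6. Test
The test shows that the page content comes from the NFS server.

7. Failure simulation
Simulate a failure on node2 by putting it into standby manually. The three resources move to node1 automatically, so the experiment succeeds.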
crm(live)node# standby
crm(live)node# cd
crm(live)# status
Last updated: Thu Feb 27 12:42:46 2014
Last change: Thu Feb 27 12:42:40 2014 via crm_attribute on node2.test.com
Stack: classic openais (with plugin)
Current DC: node1.test.com - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
3 Resources configured
Node node2.test.com: standby
Online: [ node1.test.com ]
Webip (ocf::heartbeat:IPaddr): Started node1.test.com
httpd (lsb:httpd): Started node1.test.com
Webstore (ocf::heartbeat:Filesystem): Started node1.test.com
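
Checking by eye that all resources ended up on one node works for three resources; for a scripted check, the status text can be parsed. This is a sketch that assumes the "name (agent): Started node" line format shown in the output above; `resource_nodes` and `all_on_same_node` are made-up helper names.

```shell
# resource_nodes : read `crm status` text on stdin, print "resource node" per started resource.
resource_nodes() {
    awk '/ Started / { print $1, $NF }'
}

# all_on_same_node : succeed if every started resource on stdin runs on one single node.
all_on_same_node() {
    count=$(resource_nodes | awk '{ print $2 }' | sort -u | wc -l | tr -d ' ')
    [ "$count" -eq 1 ]
}

# Example on a cluster node:
#   crm status | all_on_same_node && echo "all resources are on one node"
```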
