
IRC log for #fuel, 2014-01-20


All times shown according to UTC.

Time Nick Message
02:39 topshare joined #fuel
02:47 IlyaE joined #fuel
03:28 vkozhukalov joined #fuel
03:58 topshare joined #fuel
04:07 oleseyuk joined #fuel
04:20 besem9krispy joined #fuel
04:55 ArminderS joined #fuel
04:57 Arminder ok, I got a situation here
04:57 Arminder where I have fuel 4.0 deployed in HA with neutron+gre segmentation
04:58 Arminder so 3 controllers+2 ceph nodes+1 compute
04:59 Arminder each node got 4 NICs, so admin, public, storage & mgmt on each separate NIC and no VLAN IDs set
04:59 Arminder now I wish to increase the throughput for storage transfers, so added a dual NIC to each of 2 ceph nodes
05:00 Arminder now, is it possible to bond these 2 addl. NICs to storage network?
05:00 Arminder with the openstack already deployed, can we add this later
05:01 Arminder since it's not in production, I can break this
05:01 Arminder just wanted to know how to do it, if it's possible
05:02 Arminder also by doing this, will it really increase the available bandwidth for data transfers between ceph nodes
05:02 Arminder just by adding addl NICs to ceph nodes only
05:02 Arminder since I'll be adding more of compute nodes later
05:03 Arminder inputs from anyone here would be highly appreciated
05:03 Arminder leave me a pointer here even if I'm not around
05:13 AndreyDanin joined #fuel
05:17 IlyaE joined #fuel
06:07 akupko joined #fuel
06:12 e0ne joined #fuel
06:38 xarses http://docs.mirantis.com/fuel/fuel-4.0/reference-architecture.html?highlight=bond#id30
06:39 xarses Arminder: ^
06:39 xarses better link http://docs.mirantis.com/fuel/fuel-4.0/reference-architecture.html?highlight=bond#adjust-the-network-configuration-via-cli
06:43 ArminderS thanks xarses
06:44 ArminderS that's pre-deployment, right?
06:46 bookwar joined #fuel
06:46 ArminderS is it possible once the environment is deployed
06:52 ArminderS xarses: ^
07:48 mihgen joined #fuel
08:01 e0ne joined #fuel
08:17 anotchenko joined #fuel
08:27 meow-nofer__ joined #fuel
08:28 vkozhukalov joined #fuel
08:29 anotchenko joined #fuel
08:40 mattymo joined #fuel
08:43 mihgen joined #fuel
08:51 Arminder joined #fuel
08:51 vkozhukalov joined #fuel
08:53 mihgen_ joined #fuel
09:02 teran joined #fuel
09:09 jkirnosova joined #fuel
09:11 miguitas joined #fuel
09:34 mrasskazov joined #fuel
09:38 jouston joined #fuel
09:51 anotchenko joined #fuel
10:16 vkozhukalov joined #fuel
10:22 e0ne_ joined #fuel
10:47 ogelbukh joined #fuel
10:51 e0ne joined #fuel
11:05 teran joined #fuel
11:05 teran joined #fuel
11:07 e0ne_ joined #fuel
11:08 mihgen joined #fuel
11:13 anotchenko joined #fuel
11:17 e0ne joined #fuel
11:42 besem9krispy joined #fuel
12:41 e0ne joined #fuel
12:42 anotchenko joined #fuel
12:49 MiroslavAnashkin joined #fuel
12:53 e0ne_ joined #fuel
12:58 MiroslavAnashkin joined #fuel
13:02 akupko joined #fuel
13:38 mrasskazov joined #fuel
13:54 MiroslavAnashkin jhurlbert: In case you need to shut down one of controllers in HA mode:
13:54 anotchenko joined #fuel
13:55 MiroslavAnashkin jhurlbert: It is possible to shut it down with the poweroff command, but there is a chance Pacemaker starts to sync at the time of shutdown and forgets about the shutdown entirely.
13:56 MiroslavAnashkin jhurlbert: So we recommend running `crm resource list` or `crm resource status` to get the list of running services under Pacemaker
14:00 MiroslavAnashkin jhurlbert: And then run `crm resource stop p_neutron-openvswitch-agent`, then `crm resource stop p_mysql`
14:01 MiroslavAnashkin jhurlbert: These commands should stop the local resource only. Please avoid stopping anything named like clone_p_* - those commands stop the resource cluster-wide.
14:02 MiroslavAnashkin jhurlbert: After you have stopped the local resources on the controller, simply shut it down the usual way, and don't worry about rabbitmq.
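Put together, the shutdown sequence described above might look like the following sketch (resource names should be confirmed with `crm resource list` on your own controllers first):

    crm resource list                              # confirm the exact resource names
    crm resource stop p_neutron-openvswitch-agent  # stops the local instance, per the advice above
    crm resource stop p_mysql
    poweroff                                       # then shut the node down the usual way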
14:32 kpimenova joined #fuel
14:57 anotchenko joined #fuel
14:59 besem9krispy joined #fuel
15:07 richardkiene joined #fuel
15:13 richardkiene_ joined #fuel
15:31 MiroslavAnashkin richardkiene_: Regarding the packet loss. Please try to reduce the MTU on the guest instances to 1400-1465. The following bug contains a description of how to do it. https://bugs.launchpad.net/fuel/+bug/1256289
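The approach usually referenced for that bug is to have the Neutron DHCP agent's dnsmasq push a smaller MTU to instances; the sketch below uses assumed file paths and the 1400 value picked from the range above (the bug report has the authoritative steps):

    # /etc/neutron/dnsmasq-neutron.conf (assumed path)
    dhcp-option-force=26,1400
    # /etc/neutron/dhcp_agent.ini - point the agent at that file
    dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf
    # restart the DHCP agent afterwards; instances pick up the MTU on lease renewal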
15:31 richardkiene_ MiroslavAnashkin: I believe I tried that and things actually got worse
15:31 richardkiene_ But I'll give it another go :)
15:32 richardkiene_ I'll look up my NIC model for the GRO bug in a little bit
15:33 MiroslavAnashkin richardkiene_: You may also turn off TSO (tcp segmentation offload)  on the instance machine for outbound traffic with `ethtool -K eth0 tso off`
15:33 richardkiene_ MiroslavAnashkin: That would be a VM level change, not on the host?
15:34 MiroslavAnashkin richardkiene_: Yes, on VM
15:34 richardkiene_ I'll give both of those a try here in a few minutes. Very much appreciate the help.
15:44 richardkiene_ I'm not sure the TSO fix applies to my VMs
15:44 richardkiene_ We're running Ubuntu 12.04 LTS VMs
15:44 richardkiene_ our host machines are all CentOS, but it looks like the bug applies to CentOS if I'm not mistaken
15:44 richardkiene_ CentOS VMs that is
15:45 richardkiene_ Or am I misunderstanding that bug?
15:46 IlyaE joined #fuel
15:49 MiroslavAnashkin richardkiene_: Yes, this bug is in CentOS, but setting TSO off on the VM affects packet transmission between the VM and the host.
15:50 richardkiene_ Gotcha. Setting TSO off by itself has not reduced packet loss (I've got a ping going from a physical machine to the floating IP of one of the VMs)
15:52 richardkiene_ MiroslavAnashkin: When I apply the DHCP change, is there anything special to do with the HA controller setup or should I just apply it and restart the dhcp agent on each controller node?
15:58 xarses Arminder: no, not through fuel
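A manual approach outside Fuel might look roughly like the sketch below, run on each ceph node; the bridge name br-storage and the NIC names eth4/eth5 are assumptions, and the existing storage NIC would still have to be folded into (or replaced by) the bond:

    ovs-vsctl add-bond br-storage bond1 eth4 eth5 bond_mode=balance-slb
    ovs-vsctl list port bond1    # verify the members and bond_mode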
15:58 angdraug joined #fuel
16:00 richardkiene_ MiroslavAnashkin: I applied that workaround to all 3 controllers and now I have 100% packet loss when pinging the Floating IP of a VM
16:00 MiroslavAnashkin richardkiene_: In case of HA mode please run `crm resource restart clone_p_neutron-plugin-openswitch-agent` on any controller. This command restarts the agent on every controller and restarts the dependent p_neutron-dhcp-agent and p_neutron-l3-agent respectively.
16:00 albionandrew joined #fuel
16:00 richardkiene_ Ah, that may be why things are weird now :)
16:01 MiroslavAnashkin richardkiene_: Please check the correct name of this agent with `crm resource list` first.
16:01 MiroslavAnashkin richardkiene_: Agent restart may take up to 5 minutes
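In shell form, the restart sequence being suggested (verify the clone resource name first, since it differs slightly between releases, as the next lines show):

    crm resource list | grep openvswitch                    # find the exact clone_p_* name
    crm resource restart clone_p_neutron-openvswitch-agent
    crm status                                               # re-check after a few minutes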
16:03 albionandrew HI I just joined this and I can only see richard's last post, but we are getting a few failed agents, for example p_neutron-openvswitch-agent_monitor_20000. Is the way to deal with it to go to the node shown in the output of crm status and do a crm resource start ..., or is there another way I should be looking at this?
16:03 richardkiene_ MiroslavAnashkin: `crm resource restart clone_p_neutron-openvswitch-agent` appears to be the correct command
16:04 mihgen joined #fuel
16:04 richardkiene_ I'll wait 5 min and see if I can get a ping through
16:08 MiroslavAnashkin richardkiene_: please check with `crm status` first.
16:09 richardkiene_ MiroslavAnashkin: That is showing some errors...
16:09 richardkiene_ I rebooted the VM and now it is pingable
16:10 richardkiene_ no packet loss so far
16:10 MiroslavAnashkin richardkiene_: Yes, it collects errors from previous runs. But if there is no error in the current status section, then crm is OK.
16:11 richardkiene_ MiroslavAnashkin: Here is what the output looks like http://paste.openstack.org/show/61575/
16:13 MiroslavAnashkin richardkiene_: It shows everything is OK with crm currently, but there were errors in the past.
16:18 richardkiene_ Still seeing periods of packet loss, unfortunately
16:23 vk joined #fuel
16:23 MiroslavAnashkin richardkiene_: Hmm, I would first check the physical network connectivity, without GRE.
16:23 e0ne joined #fuel
16:26 e0ne joined #fuel
16:28 ihamad joined #fuel
16:41 rmoe joined #fuel
16:43 dhblaz joined #fuel
16:43 dhblaz Our crmd is running out of file descriptors and it is causing problems with services
16:44 dhblaz See log lines like this in crmd.log:
16:44 dhblaz 2014-01-20T16:08:42.302022+00:00 err:     error: qb_ipcs_us_connection_acceptor: Could not accept client connection: Too many open files (24)
16:45 mrasskazov joined #fuel
16:46 mattymo dhblaz, I ran into that issue. Raise the nofile limits in /etc/security/limits.conf to 1024000 (an extra 0)
16:46 vk joined #fuel
16:47 mattymo and this short workaround fixed the running crmd: echo -n "Max core file size=unlimited:unlimited" >/proc/`pidof crmd`/limits
16:47 dhblaz This is what that file looks like on our controllers:
16:47 dhblaz # Raising open file limit for OpenStack services
16:47 dhblaz *         soft    nofile          102400
16:47 dhblaz *         hard    nofile          112640
16:47 albionandrew mattymo - would you suggest doing that just on the controllers? I work with dhblaz
16:47 mattymo just controllers
16:47 albionandrew mattymo thanks
16:47 mattymo computes just run a handful of VMs and make API calls.
16:47 jhurlbert MiroslavAnashkin: Thanks for the controller shutdown steps.
16:48 dhblaz If I understand mattymo correctly, raise the hard and soft limits to 1024000
16:48 Nikolay Hi, dhblaz! Could you run  sysctl -a  |  grep file-max
16:48 mattymo yes
16:48 mattymo oh yes, file-max as well!
16:49 Nikolay sysctl sets limits for kernel, while /etc/security/limits.conf sets limits for processes
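For reference, the two knobs being discussed are separate; a sketch of raising both, using mattymo's 1024000 figure (whether that value is appropriate for a given controller is a judgment call):

    # /etc/security/limits.conf - per-process limits, picked up at the next service start / login
    *    soft    nofile    1024000
    *    hard    nofile    1024000
    # kernel-wide ceiling on open files
    sysctl -w fs.file-max=1024000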
16:49 dhblaz do you mean Max open files?
16:49 kaliningrad joined #fuel
16:49 dhblaz [root@node-17 ~]# sysctl -a  |  grep file-max
16:49 dhblaz fs.file-max = 811542
16:49 Nikolay I mean you should check your kernel settings first
16:51 dhblaz I don't understand what limit we are hitting:
16:51 dhblaz [root@node-17 ~]# lsof | fgrep -c crmd
16:51 dhblaz 1084
16:51 dhblaz mattymo: I don't think the "short workaround" you describe above would do what you say it would
16:52 dhblaz perhaps echo -n "Max Max open files=unlimited:unlimited" >/proc/`pidof crmd`/limits
16:53 dhblaz It looks like I really need a way to adjust the ulimits for crmd
16:53 dhblaz The soft limit listed in /proc/`pidof crmd`/limits is 1024
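Since /proc/<pid>/limits is not writable like that, one way to raise the limit on an already-running crmd is prlimit from util-linux; a sketch, assuming the installed util-linux is new enough to ship prlimit (older CentOS releases may not have it):

    prlimit --pid $(pidof crmd) --nofile=102400:112640   # soft:hard, applied without a restart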
16:53 Nikolay I think fs.file-max = 811542 is OK for most cases.
16:55 Nikolay Try to check how many files are opened in system with
16:55 Nikolay # cat /proc/sys/fs/file-nr
16:55 Nikolay or with
16:55 Nikolay # lsof | wc -l
16:56 dhblaz [root@node-17 ~]# cat /proc/sys/fs/file-nr
16:56 dhblaz 5088	0	811542
16:58 Nikolay Well... so you have a total of 5088 files opened in the system with a limit of 811542. Not so many... So the kernel settings are not the limit.
17:02 dhblaz Right the problem is clearly the Max open files for crmd
17:02 dhblaz It doesn't appear that that ulimit is set from /etc/security/limits.conf as mattymo suggests
17:02 dhblaz because [root@node-17 ~]# for PID in `pidof crmd`; do fgrep 'Max open files' /proc/$PID/limits; done
17:02 dhblaz Max open files            1024                 4096                 files
17:03 dhblaz But our soft and hard limits in limits.conf were 102400 and 112640 respectively
17:03 dhblaz FYI all three of our controllers just went down
17:03 dhblaz Well perhaps down isn't a good description
17:03 dhblaz [root@fuelpxe01 ~]# ssh node-17
17:03 dhblaz Warning: Permanently added 'node-17' (RSA) to the list of known hosts.
17:04 dhblaz Last login: Mon Jan 20 16:55:35 2014 from 10.29.5.2
17:04 dhblaz Connection to node-17 closed.
17:09 vkozhukalov joined #fuel
17:26 besem9krispy joined #fuel
17:26 richardkiene_ MiroslavAnashkin: The physical network appears fine, even during periods of high packet loss to the VMs I have no connectivity issues with the host machines
17:27 dhblaz are you using neutron?  If so consider testing with vlan splinters on just to rule that out as an issue.
17:28 richardkiene_ MiroslavAnashkin: Though, now I'm no longer experiencing packet loss between VMs, and lower packet loss pinging the floating IPs from a physical machine. So that is an improvement
17:28 richardkiene_ dhblaz: Yes Neutron with GRE
17:28 dhblaz Do you use any vlan tagging?  Or do you have an interface for each network?
17:29 richardkiene_ VLAN tagging for public and management
17:29 dhblaz centos?
17:29 richardkiene_ storage is tagged but on its own interface
17:30 richardkiene_ Centos host machines, Ubuntu VMs
17:30 dhblaz I would try using vlan splinters just to test.  There is a performance hit so if it doesn't fix it you can turn it off.
17:31 dhblaz I used this one liner to turn it on all interfaces:
17:31 dhblaz for interface in eth0 eth1 eth2 eth3 ; do ovs-vsctl set interface $interface other-config:enable-vlan-splinters=true; done
17:31 dhblaz If you have different interfaces you should modify the list
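To confirm the setting took effect on an interface, the OVS database can be queried; a small sketch (the column is other_config in the schema):

    ovs-vsctl --columns=name,other_config list interface eth0   # should show enable-vlan-splinters=true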
17:32 dhblaz I'm not sure if you realize it, but if you are using ceph, at least some of the storage traffic goes over the management network.
17:32 dhblaz Only osd replication traffic is on the storage network.
17:32 dhblaz so we chose to put public/storage on one network and have management by itself.
17:33 dhblaz Because our storage nodes doesn't have any public traffic
17:33 dhblaz sorry for bad English there
17:33 richardkiene_ I did not know that, glad I found out early!
17:34 richardkiene_ does subnet size or tunnel ID range play a role in Neutron GRE performance
17:34 richardkiene_ we have a really wide subnet and tunnel ID range
17:34 dhblaz I don't know perhaps a mirantis employee here would
17:35 dhblaz I get the impression that ubuntu nodes would give more reliable operation
17:35 dhblaz mostly because the kernel that comes with ubuntu
17:36 dhblaz We use centos here but it is because all our vms are RHEL based and we have a lot of experience with RHEL.  If you have more experience with ubuntu you might consider using it for the nodes instead of centos.
17:39 richardkiene_ Yeah we are definitely more of an Ubuntu shop
17:39 richardkiene_ we went with Centos because we had issues deploying HA controllers with Ubuntu
17:39 dhblaz interesting
17:39 richardkiene_ though that may be due to an OpenStack / Fuel bug where you need to choose your password carefully :)
17:40 dhblaz Ah, I read about your multiple $
17:40 richardkiene_ Ubuntu failed at a much earlier stage, so we thought it was the race condition bug (that is supposedly solved); it may have been the password issue manifesting itself earlier due to the differences in the puppet scripts
17:44 richardkiene_ We're not so far down the road that we can't re-deploy with Ubuntu again, so we might just try that first
17:45 richardkiene_ we're better at Ubuntu, and it appears to have fewer issues with Neutron GRE... I'm guessing kernel differences?
17:45 dhblaz Does ubuntu use a 3.x kernel?
17:45 dhblaz Because centos is using a 2.6 kernel
17:46 dhblaz kernels above 3.3 do not require vlan splinters
17:47 IlyaE joined #fuel
17:50 dhblaz If your instances can talk to one another I would not suspect that the Neutron GRE is actually the problem.
17:50 dhblaz I think the issue is a problem with the neutron l3 agent
17:51 dhblaz I should point out that I am no expert here.  I have just experienced a lot of the pain you describe myself.
17:51 jhurlbert i think ubuntu is using 3.2
17:51 jhurlbert maybe 3.3 though
18:02 angdraug joined #fuel
18:10 e0ne joined #fuel
18:15 MiroslavAnashkin Ubuntu, deployed with Mirantis OpenStack, has a 3.8 kernel.
18:17 richardkiene_ Must be injecting a newer kernel?
18:17 vk joined #fuel
18:17 richardkiene_ The latest kernel on our 12.04 LTS boxes is 3.2.0-58
18:18 MiroslavAnashkin yes, all the older kernels have different issues with OVS, VLAN splinters, etc.
18:21 richardkiene_ makes sense
18:25 dhblaz I'm seeing this in my crm resource list:
18:26 dhblaz p_neutron-dhcp-agent   (ocf::mirantis:neutron-agent-dhcp):     Started (unmanaged) FAILED
18:26 dhblaz Any suggestions on fixing that resource?
18:29 MiroslavAnashkin dhblaz: `crm resource restart clone_p_neutron-openvswitch-agent` on any controller.
18:32 besem9krispy joined #fuel
18:46 vk joined #fuel
18:55 IlyaE joined #fuel
19:05 jhurlbert when we try to install openstack in a multi-node ha config, the first controller to be setup fails because it can't pull the packages down from the fuel controller
19:05 jhurlbert here's the fuel log for the controller node: http://paste.openstack.org/show/61592/
19:08 vk joined #fuel
19:12 dhblaz If anyone has a redhat subscription and tell me what the solution is here I would greatly appreciate it:
19:12 dhblaz https://access.redhat.com/site/solutions/118153
19:18 MiroslavAnashkin jhurlbert: can you ping 10.20.0.2 from the failed controller? You may also run `mco ping` from the master node to see which nodes are online and accessible.
19:19 jhurlbert yes, i can ping 10.20.0.2 from the failed controller, and mco ping returns all of my nodes
19:20 jhurlbert i can also successfully wget one of the packages from the failed controller
19:23 IlyaE joined #fuel
19:23 jhurlbert the problems we are having look similar to this actually: https://bugs.launchpad.net/fuel/+bug/1269765
19:26 MiroslavAnashkin jhurlbert: Yes, true. Though this bug is logged against a very early version of Fuel 4.1. Which interface are you using for the Fuel Admin network?
19:27 jhurlbert oh, we are using fuel 4.0 if that matters. we are using eth0.
19:27 xarses joined #fuel
19:27 dhblaz xarses: I see you participated in this thread:
19:27 dhblaz http://serverfault.com/questions/374852/how-to-set-ulimits-for-a-service-starting-at-boot
19:28 dhblaz How did you decide to remedy this issue
19:28 dhblaz our crmd is soft limited to 1024 files and this is not enough for a production system
19:30 MiroslavAnashkin jhurlbert: Could you please generate diagnostic snapshot and share it somewhere?
19:30 xarses I think in our case we put umask in the init scripts that were an issue. keep in mind that my issue is related to services started by rc init
19:31 jhurlbert sure, one moment
19:32 dhblaz xarses: umask or ulimit?
19:32 xarses dhblaz: ^^
19:32 xarses ulimit
19:33 xarses sorry
19:33 dhblaz I tried adding ulimit -n unlimited to /etc/init.d/corosync
19:33 dhblaz and that didn't help
19:33 xarses odd
19:33 dhblaz unless I restart the service logged into a shell
19:33 dhblaz If I changed it to ulimit -n soft I get hard/soft limit set to 1024/1024
19:34 dhblaz ulimit -n unlimited
19:34 dhblaz 1024/4096
19:34 Nikolay http://qupera.blogspot.ru/2012/07/solving-too-many-open-files-under-linux.html
19:34 dhblaz which is the same as nothing
19:34 Nikolay you can also make a wrapper shell script for corosync and put something like
19:35 Nikolay ulimit -n xxx
19:35 Nikolay to this script to change the limit of opened file for process
19:36 teran joined #fuel
19:37 jhurlbert MiroslavAnashkin: here is the diag snapshot: https://drive.google.com/file/d/0BwII9gsxwO6UbjkzNHQyRzg3aVU/edit?usp=sharing
19:41 MiroslavAnashkin jhurlbert: OK, got it. Thank you!
19:42 jhurlbert MiroslavAnashkin: awesome, thanks! if it matters, node-43 was the first failed controller
19:43 ssarychev joined #fuel
19:44 jhurlbert also, that snapshot is in the middle of the deployment failing, so if i need to take another after the deployment is done, let me know
19:46 dhblaz Okay I missed this before: /etc/sysconfig/corosync: line 2: ulimit: open files: cannot modify limit: Operation not permitted
19:55 dhblaz I was able to resolve the crmd limit with a /etc/sysconfig/corosync on the controllers
19:55 dhblaz [root@node-16 ~]# cat /etc/sysconfig/corosync
19:55 dhblaz #Set ulimit for corosync so crmd gets more file descriptors
19:55 dhblaz ulimit -n 8192
19:56 dhblaz replacing 8192 with unlimited didn't appear to work as I got the error message I pasted above
19:56 dhblaz Also, trying to use the fix provided by mattymo locked me out of the controllers where I applied his proposed changes.  Andrew was able to reverse the changes in single user mode.
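Collected in one place, the fix that worked above amounts to the following sketch; the corosync init script sources /etc/sysconfig/corosync, so the new limit only applies after the service is restarted (one controller at a time):

    cat > /etc/sysconfig/corosync <<'EOF'
    # Set ulimit for corosync so crmd gets more file descriptors
    ulimit -n 8192
    EOF
    /etc/init.d/corosync restart   # restart controllers one at a time and let the cluster settle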
20:01 mrasskazov joined #fuel
20:09 gnutovd joined #fuel
20:25 dhblaz dubious, all of a sudden my instances aren't reachable and all of my nova services are down.
20:29 jhurlbert here is a diag snapshot after fuel tried to complete the deployment: https://drive.google.com/file/d/0BwII9gsxwO6US3lWLWhzLWllVlE/edit?usp=sharing
20:30 dhblaz Seems to be a problem with rabbitmq
20:35 mrasskazov joined #fuel
20:43 mrasskazov joined #fuel
20:49 vk joined #fuel
20:55 IlyaE joined #fuel
20:55 werweg joined #fuel
20:59 teran_ joined #fuel
21:04 teran joined #fuel
21:06 dhblaz Getting this in my rabbitmq log:
21:06 dhblaz AMQPLAIN login refused: user 'nova' - invalid credentials
21:06 dhblaz I'm not sure what happened to make this happen suddenly
21:08 rmoe_ joined #fuel
21:09 vk joined #fuel
21:15 xarses dhblaz: the expected credentials that fuel used to build will be in /etc/astute.yaml you can check that it's still set correctly
21:17 dhblaz xarses: thank you,  I'm not sure what is trying to connect to the rabbitmq server
21:17 xarses with the nova user, it should be nova
21:18 xarses I think each service has its own user in rabbit
21:19 dhblaz Is this bad?
21:19 dhblaz [root@node-16 ~]# rabbitmqctl list_queues
21:19 dhblaz Listing queues ...
21:19 dhblaz ...done.
21:19 xarses yes
21:19 xarses do all three show that?
21:19 dhblaz yes
21:20 xarses aight, stop them all
21:20 dhblaz Did mysql lose more tables?
21:20 xarses let me find the restart docs
21:20 xarses rabbit doesn't use mysql
21:20 xarses its because they didn't start right
21:20 xarses and rejoined wrong
21:21 dhblaz I did this:
21:21 dhblaz sudo /etc/init.d/rabbitmq-server stop
21:21 dhblaz on all three controllers
21:22 xarses ok, can you ping all of the other controllers by dns name (node-17, node-18, etc..)
21:22 xarses you will want to check from each node
21:24 xarses then on the last node stopped, we can start rabbit and wait for it to come up
21:25 xarses do a 'rabbitmqctl list_queues' to ensure that the queues restored
21:25 xarses and  'rabbitmqctl cluster_status' to verify the cluster status
21:26 xarses if the queues came up, then we can start the other two at the same time
21:26 xarses and then check the cluster_status again and verify that all of them joined
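The recovery sequence xarses is walking through, collapsed into commands (a sketch; the controller that was stopped last should be the first one started):

    # on every controller
    /etc/init.d/rabbitmq-server stop
    # on the controller that was stopped last
    /etc/init.d/rabbitmq-server start
    rabbitmqctl list_queues      # the queues should reappear here
    rabbitmqctl cluster_status
    # then start rabbitmq-server on the remaining controllers and re-check cluster_status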
21:27 dhblaz confirmed can ping the other two controllers from each of the three controllers
21:27 dhblaz Hmmm, I don't remember which node was the last one stopped
21:27 dhblaz Is there another way to check which one has more recent messages on it?
21:30 xarses doesn't exactly matter, only that the last one is the best one to start
21:30 dhblaz I found a file /var/lib/rabbitmq/mnesia/rabbit\@node-*/nodes_running_at_shutdown
21:30 dhblaz It looks like node-16 was the last to shutdown
21:30 dhblaz [root@node-16 ~]# rabbitmqctl list_queues
21:30 dhblaz Listing queues ...
21:30 dhblaz ...done.
21:30 xarses blah, it should have restored it
21:31 dhblaz :(
21:31 xarses we can try stopping it and starting one of the others
21:31 xarses see if it will restore the queue info
21:32 xarses let me see if I can find some docs to do it by hand
21:35 dhblaz When I started node-17
21:35 dhblaz it took a while to give up on the master
21:35 dhblaz but the queues are there now
21:35 xarses oh, yummy
21:36 xarses now you should be able to start the other two
21:36 dhblaz should I start node-16 (the previous master) or node-18 first?
21:36 xarses and they will replicate from the running node
21:37 IlyaE joined #fuel
21:37 dhblaz starting node-18
21:37 dhblaz counting down like this:
21:37 dhblaz 3 attempts left to start RabbitMQ Server before consider start failed.
21:38 dhblaz [root@node-18 ~]# rabbitmqctl cluster status
21:38 xarses ya, not abnormal, rabbit is slow to start
21:38 dhblaz Clustering node 'rabbit@node-18' with [status] ...
21:38 dhblaz Error: mnesia_unexpectedly_running
21:38 xarses well that is though
21:38 dhblaz Sorry had the wrong command
21:38 dhblaz [root@node-18 ~]# rabbitmqctl cluster_status
21:38 dhblaz Cluster status of node 'rabbit@node-18' ...
21:38 dhblaz [{nodes,[{disc,['rabbit@node-18','rabbit@node-17']}]},
21:38 dhblaz {running_nodes,['rabbit@node-17','rabbit@node-18']}]
21:38 dhblaz ...done.
21:42 dhblaz queues look good now and so does nova service-list
21:42 dhblaz but still can't get to instance
21:43 dhblaz doing: crm resource restart clone_p_neutron-openvswitch-agent
21:43 dhblaz and: crm resource restart p_neutron-l3-agent
21:46 dhblaz hmmm
21:46 dhblaz crm status shows:
21:46 dhblaz p_neutron-dhcp-agent   (ocf::mirantis:neutron-agent-dhcp):     Started node-16
21:47 dhblaz but neutron agent-list shows a failed dhcp-agent on node-17
21:47 dhblaz should I neutron agent-delete the node-17 dhcp-agent?
21:53 xarses try applying this patch https://bugs.launchpad.net/fuel/+bug/1269334
21:53 xarses and then a crm resource cleanup
21:55 dhblaz do that on all controllers?
21:55 xarses ya
21:56 xarses you will want to patch the actual files, not the ones stuck in the puppet manifests
21:56 xarses since you are already deployed
21:56 xarses it should probably also fix that occasional issue you have seen where the ports will go down
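A sketch of what patching the deployed files (rather than the puppet manifests) might look like; the OCF agent directory is an assumption based on the ocf::mirantis resource class seen earlier, and the patch file name is hypothetical (the patch itself comes from the linked bug):

    cd /usr/lib/ocf/resource.d/mirantis
    patch -p0 < /root/neutron-agents-1269334.patch   # hypothetical file saved from the bug report
    # repeat on each controller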
22:10 dhblaz oh
22:10 dhblaz I missed that last thing you just said
22:10 dhblaz ... trying again
22:12 dhblaz so crm resource cleanup isn't a valid command
22:12 dhblaz what do I need to get crmd to use the new files?
22:17 xarses from my co-workers: you have to restart all of the crmd services one at a time; it sounds like it will bounce all of the managed services
22:18 dhblaz crm resource restart for each service or /etc/init.d/corosync restart?
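For reference, crm's cleanup subcommand normally takes a resource name, which may be why the bare form was rejected above; a sketch of both options (the resource name is the one from this cluster's earlier output):

    crm resource cleanup p_neutron-dhcp-agent   # clear the failed-action history for one resource
    # or, heavier: restart corosync on one controller at a time and
    # wait for crm status to settle before moving on
    /etc/init.d/corosync restart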
22:21 dhblaz When you give this kind of advice it is probably worthwhile to let people know they need to wait for Galera to sync
22:24 dhblaz the corosync init.d script also has a reload option; I wonder if that would have been enough
22:28 dhblaz The fact that crm thinks mysql is up on a node long before it is actually ready to service requests
22:28 dhblaz causes a problem for anything that has a dependency on mysql
22:40 dhblaz If that patch was supposed to clean up the neutron dhcp agent when the resource moved nodes, it doesn't appear to work.
