
IRC log for #fuel, 2014-02-27


All times shown according to UTC.

Time Nick Message
00:00 dmit2k but please comment on "to ensure that settings that changed at the cluster are re-written to the surviving members"
00:00 xarses dmit2k: i spoke with some people, and unless I get a clear message back from the person who wrote the disable change, we will probably revert it.
00:00 dmit2k what you mean by "changed at the cluster"?
00:01 xarses if you inspect /etc/astute.yaml on any of the nodes
00:01 dmit2k yes?..
00:01 xarses it contains all of the parameters that are passed in; one of them is the list of nodes, and that data will change when you re-deploy the controllers
00:02 xarses that data is used by some of the cluster services to update their config files
00:02 xarses and when necessary, the services will be restarted
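
A rough sketch of the kind of inspection xarses describes above; the exact keys in /etc/astute.yaml vary between Fuel versions, so treat the field names as illustrative:

    # on any deployed node: which nodes does this host currently know about?
    grep -A3 'nodes:' /etc/astute.yaml | head -n 30
    # or just pull out the per-node identity fields
    grep -E 'fqdn|uid|role' /etc/astute.yaml
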
00:04 dmit2k so when I actually "redeploy" a node, Fuel just creates another one instead, right? and if I delete the faulty one from Fuel and then deploy - it will replace all instances of that removed node with credentials of the new one?
00:06 dmit2k rebuilding all related configs on ALL nodes
00:06 dmit2k correct?
00:08 xarses when rebuilding a node, it actually just provisions it as if it was a new node, yes. for controllers the old config references are replaced with whatever is relevant for the "new" node. they might get the same IPs, but they will get a new hostname node-# since # is the id of the node in the DB
00:08 xarses I've got to go, back in a bit
00:12 rvyalov joined #fuel
00:48 ilbot3 joined #fuel
00:48 Topic for #fuel is now Fuel for Openstack: http://fuel.mirantis.com/ | Paste here http://paste.openstack.org/ | IRC logs http://irclog.perlgeek.de/fuel/
00:57 e0ne joined #fuel
01:12 xarses joined #fuel
01:15 rmoe joined #fuel
01:24 xarses ls
01:24 xarses oops
01:41 dmit2k xarses: Thanks a lot! You made it much clearer for me!
01:42 dmit2k Would be really handy to have a list of affected files and configs replaced by Fuel if I add or delete a controller (or some other type) node.
01:42 dmit2k I suppose that similar process also happens every time I add a new CEPH or compute node as well, just different configs will be affected?
01:51 xarses nope
01:52 xarses dmit2k: it only applies to nodes that are also controllers
01:57 e0ne joined #fuel
02:27 crandquist joined #fuel
02:57 e0ne joined #fuel
03:57 e0ne joined #fuel
04:09 rmoe joined #fuel
04:57 e0ne joined #fuel
05:36 vkozhukalov joined #fuel
05:51 mihgen joined #fuel
05:56 Ch00k joined #fuel
05:57 e0ne joined #fuel
06:48 saju_m joined #fuel
06:57 e0ne joined #fuel
07:48 pbrooko joined #fuel
07:51 Ch00k joined #fuel
07:54 vkozhukalov joined #fuel
07:57 e0ne joined #fuel
08:01 pbrooko joined #fuel
08:06 mrasskazov1 joined #fuel
08:10 saju_m joined #fuel
08:12 evgeniyl joined #fuel
08:26 saju_m joined #fuel
08:37 miguitas joined #fuel
08:43 amartellone joined #fuel
08:43 alex_didenko joined #fuel
08:44 amartellone Yesterday, @orsetto in #fuel-dev asked a question about a failure in the live migration process, in an environment deployed with Fuel 4.0? Any idea? Many thanks
08:47 e0ne joined #fuel
09:24 vk joined #fuel
09:25 rvyalov joined #fuel
09:34 e0ne joined #fuel
09:57 mihgen joined #fuel
09:57 vk joined #fuel
09:58 tatyana joined #fuel
10:09 jeremydei joined #fuel
10:35 saju_m joined #fuel
10:41 mihgen joined #fuel
11:35 Ch00k joined #fuel
12:11 anotchenko joined #fuel
12:19 Ch00k joined #fuel
12:23 tatyana_ joined #fuel
12:37 TVR__ joined #fuel
12:39 TVR__ so... my adding 2 controllers +OSD to my HA 3 Controller +OSD cluster worked and failed....
12:39 TVR__ it worked as in it looks good from outside of fuel
12:40 TVR__ it failed as fuel had an error on my primary node
12:40 TVR__ it would seem the mysql sync timed out
12:40 TVR__ (/Stage[main]/Galera/Exec[wait-for-synced-state]/returns) change from notrun to 0 failed: /usr/bin/mysql -Nbe "show status like 'wsrep_local_state_comment'" | /bin/grep -q Synced && sleep 10 returned 1 instead of one of [0] at /etc/puppet/modules/galera/manifests/init.pp:274
12:41 TVR__ I am allowing fuel to try again
12:41 TVR__ all the ceph backend adding of mons and OSD's worked just fine
12:41 TVR__ health HEALTH_OK
12:44 Dr_Drache joined #fuel
12:55 TVR__ Dr_Drache .. do you have a cluster up?
12:55 Dr_Drache TVR__, yea
12:56 Dr_Drache small one, but its up for now
12:56 TVR__ how many OSD's in it?
12:56 Dr_Drache only 2.
12:56 TVR__ only 2 disks?
12:57 Dr_Drache ohh
12:57 Dr_Drache no.
12:57 TVR__ ceph -s
12:59 Dr_Drache 4
12:59 TVR__ ok... so from the ceph node, you can benchmark it with rados bench...
12:59 TVR__ rados bench 16 -p volumes -b 1048576 -t 80 write
12:59 TVR__ as an example
13:00 TVR__ rados bench 16 -p volumes -b 4096 -t 4000 write
13:00 TVR__ would be for small file access
13:00 TVR__ the -p needs to point to one of your pools
13:01 TVR__ -t is concurrent connections
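
A minimal sketch tying the options above together, assuming a pool named 'volumes' exists (the default in this conversation):

    # 16-second write benchmarks against the 'volumes' pool
    # -b = object size in bytes, -t = concurrent operations
    rados bench 16 -p volumes -b 1048576 -t 80 write    # fewer, larger (1 MiB) objects
    rados bench 16 -p volumes -b 4096 -t 4000 write     # many small (4 KiB) objects
    ceph -s                                             # cluster health during/after the run
    ceph osd tree                                       # confirm how many OSDs are actually in play
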
13:01 Dr_Drache 4000
13:01 Dr_Drache damn
13:01 TVR__ well 4000 x 4k files is not a lot
13:01 TVR__ well... 4k object
13:03 Dr_Drache oh, I misunderstood
13:03 Dr_Drache I thought it was 4000 x 4000
13:03 TVR__ bytes, yea..
13:03 Dr_Drache 4000 connections doing 4000 files
13:04 Dr_Drache of size 4K
13:04 TVR__ anyway.. I notice an ~ 80% linear improvement with number of OSDs added to throughput of object store...
13:04 Dr_Drache 80%, so if i go to 8, it would be 80% faster.
13:05 Dr_Drache or the inverse?
13:06 TVR__ it is about 80% of the multiplier...so.. going from 4 disks to 16 disks you would expect a 4x improvement.. but instead you get a little better than a 3x improvement
13:07 Dr_Drache ahh, ok.
13:07 Dr_Drache stupid math.
13:07 Dr_Drache but that's what I'd expect anyway.
13:07 TVR__ and... with 9 disks I cannot sustain 5k connections at 4096 bytes... but with 15 disks I can
13:08 TVR__ I expected similar, but now I have some numbers to justify it...
13:09 anotchenko joined #fuel
13:10 Dr_Drache I need to change my disk configurations. I have a fair # of ML-Sata
13:10 TVR__ Over here they were thinking of one or two 12 disk x 4T types, but I will now argue for 48 1T 2.5" disks instead for performance.... which makes sense, but now I have some numbers to show the gains
13:12 Dr_Drache are you using arrays, or straight drives?
13:13 TVR__ JBOD
13:14 Dr_Drache k.
13:20 justif joined #fuel
13:20 Dr_Drache I want 4.1 today
13:20 Dr_Drache :P
13:28 TVR__ ok.. so my redeploy with fuel had the same timeouts.. interesting...
13:32 Dr_Drache time outs?
13:34 TVR__ (/Stage[main]/Galera/Exec[wait-for-synced-state]/returns) change from notrun to 0 failed: /usr/bin/mysql -Nbe "show status like 'wsrep_local_state_comment'" | /bin/grep -q Synced && sleep 10 returned 1 instead of one of [0] at /etc/puppet/modules/galera/manifests/init.pp:274
13:35 Dr_Drache wth
13:38 TVR__ so the issue is with mysql...
13:38 TVR__ from node-2
13:38 TVR__ [root@node-2 ~]# /usr/bin/mysql -Nbe "show status like 'wsrep_local_state_comment'" | /bin/grep -q Synced
13:38 TVR__ [root@node-2 ~]#
13:39 TVR__ from node-1
13:39 TVR__ [root@node-1 ~]# /usr/bin/mysql -Nbe "show status like 'wsrep_local_state_comment'" | /bin/grep -q Synced
13:39 TVR__ ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
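
A hedged sketch of the checks behind the error above, using the same wsrep status variables the puppet manifest polls; the values in the comments are what a healthy 3-controller Galera cluster would be expected to report:

    ps -ef | grep [m]ysqld                                       # is mysqld even running?
    ls -l /var/lib/mysql/mysql.sock                              # the socket the client needs
    mysql -Nbe "SHOW STATUS LIKE 'wsrep_local_state_comment'"    # expect: Synced
    mysql -Nbe "SHOW STATUS LIKE 'wsrep_cluster_size'"           # expect: 3
    mysql -Nbe "SHOW STATUS LIKE 'wsrep_cluster_status'"         # expect: Primary
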
13:39 Dr_Drache but, if it's just a redeploy.
13:40 Dr_Drache shouldn't everything be better?
13:40 Dr_Drache err.. "good"
13:40 TVR__ one would think...
13:40 TVR__ I will file a ticket
13:52 TVR__ ticket filed
13:53 Dr_Drache woot
13:53 Dr_Drache sadly, I wish my ubuntu issue wasn't back burnered.
13:58 mattymo which ubuntu issue?
13:59 mattymo Dr_Drache, ^
14:00 Dr_Drache mattymo, can't deploy ubuntu on Dell R415's
14:00 mattymo you mean the NIC ordering bug/
14:00 mattymo ?
14:00 Dr_Drache no
14:00 Dr_Drache won't boot.
14:02 Dr_Drache all logs show it's perfect.
14:02 Dr_Drache i'd think it's that machine, but there are 3 of them.
14:05 Dr_Drache xarses, and I recently figured out that it has nothing to do with the bug they were looking at. (previous bug on HP's SmartArray, was patched)
14:11 anotchenko joined #fuel
14:25 TVR__ It would seem support says adding any additional controllers is considered "impossible".. and that is their final word for the fuel 4.0 release, without a timeline for a release that will have this feature...
14:27 MiroslavAnashkin TVR__: Could you please grab /etc/mysql/conf.d/wsrep.cnf from the failed controller, remove all vital passwords and share it?
14:29 mihgen joined #fuel
14:29 TVR__ ok.. when it comes back up...with support stating it was impossible... and my seeing it was a mysql related issue.. I rebooted it to bring it to fresh state
14:30 TVR__ it is also dev environment.. so my pass will be default
14:30 TVR__ heh
14:32 MiroslavAnashkin OK, anyway I am checking the new controller addition scenario right now
14:32 TVR__ cool...
14:32 crandquist joined #fuel
14:33 anotchenko joined #fuel
14:34 TVR__ It would seem to me from what I know... this should be possible.... the ceph part is working, so it would seem from my basic knowledge of the way the controllers work, it would be getting the mysql part to play nice that is the issue
14:34 TVR__ again... limited knowledge here..
14:35 Dr_Drache well, i KNOW it can be done manually, on a manually deployed cluster
14:38 TVR__ http://pastebin.com/LPwWJQSk
14:38 TVR__ there is my /etc/mysql/conf.d/wsrep.cnf from node-1
14:42 TVR__ after reboot /var/lib/mysql/mysql.sock now exists and the puppet command that failed.. now will run to completion...
14:43 TVR__ MiroslavAnashkin .. any objections to me clicking Deploy Changes again? I believe it will complete now
14:43 MiroslavAnashkin No objection
14:45 anotchenko joined #fuel
14:46 e0ne_ joined #fuel
14:47 TVR__ watching logs.. so far all green
14:50 jobewan joined #fuel
14:59 Ch00k_ joined #fuel
15:04 piontec joined #fuel
15:12 amartellone_ joined #fuel
15:14 GeertJohan_ joined #fuel
15:16 Bomfunk_ joined #fuel
15:19 TVR__ ok.. so my assessment of it being a mysql issue was spot on.... and after restarting the problem server, and then clicking Deploy Changes, everything worked. The issue was that mysql was not creating /var/lib/mysql/mysql.sock or starting correctly...
15:19 piontec hi everyone
15:20 TVR__ so I will say I am disappointed support simply closed my ticket with words like "impossible" as a simple reboot and redeploy made it quite possible....
15:20 piontec can I ask a support related question about a basic fuel deployment?
15:20 mattymo piontec, go ahead and ask
15:20 mattymo TVR__, do you know the # of your support case?
15:20 piontec I'm deploying 6 machines with 4 NICs each
15:20 Dr_Drache TVR__, that means it wasn't actually investigated.
15:20 Dr_Drache IMO.
15:20 TVR__ I am looking it up.. one sec
15:21 mattymo piontec, 4 nics ! nice
15:21 piontec eth0 - PXE, eth1.12 - public, eth1.14 - management, eth4 - private, eth5 - storage
15:21 piontec i'm using neutron with default VLAN segment
15:21 TVR__ 1464
15:22 piontec all network related configs seem to be OK, 'network verification' before deployment works
15:22 piontec now, just after the deployment one of the health checks fails
15:23 piontec it's "Check network connectivity from instance via floating IP" - "5. Check that public IP 8.8.8.8 can be pinged from instance."
15:23 sanek joined #fuel
15:23 Dr_Drache that's normal
15:23 piontec ok, so i'm creating some VMs manually
15:23 mattymo TVR__, I work on the dev team of Fuel. We don't have logic to handle more than 3 controllers
15:23 mattymo and it's a bug that you can add more
15:23 Dr_Drache mattymo, so a bug to actually scale?
15:24 xarses TVR__: btw the nic down timeout issue you were talking about has been fixed
15:24 piontec and the result is (I have 2 default subnets: net04 and net04_ext):
15:24 piontec 1) a VM attached to net04 works as expected, it has connectivity with the Internet
15:24 piontec 2) I can add a floating IP to the machine and it seems to work
15:25 piontec 3) I'm creating another machine, this time attached to net04_ext only
15:25 piontec the result is that it doesn't get IP configuration from DHCP at all
15:25 piontec i'm testing using fuel 4.0 and default cirros image
15:26 Dr_Drache piontec, you need to share, and enable DHCP
15:26 piontec via VNC console i see, that eth0 has no address
15:26 Dr_Drache in the subnet configuration of net04_ext
15:26 xarses piontec: net04_ext doesn't have dhcp on it, you have to manually set the address
15:26 piontec yup, I tried 2 solutions: 1) enabled DHCP on sub04_ext
15:26 TVR__ mattymo cool man.. good to hear
15:26 xarses piontec: or assign net04 and a floating ip
15:27 piontec result: IP is there (good), but there is no way I can connect using the public IP to that machine :(
15:27 piontec the only warning I found in neutron's logs is here: http://paste.openstack.org/show/70263/
15:28 xarses piontec: did you update the security policy to allow traffic to the vm?
15:28 Dr_Drache piontec, you make a new security policy that allows ingress?
15:28 piontec the other thing I tried was to enable a DHCP server on my router for the public net
15:28 vk joined #fuel
15:28 meow-nofer_ joined #fuel
15:28 piontec but it logs only "ip offered" and nothing more, and the VM is stuck without IP
15:28 piontec Dr_Drache, I tried with default security policy and with custom (ingress ssh and icmp)
15:29 piontec sub04 with floating IP works with the same policy
15:29 mattymo the problem is you have to redeploy the other controllers with the info from the other controllers
15:29 piontec works == can connect using public IP
15:29 mattymo I think specifically the logic for galera is missing to provide more contrllers
15:30 mattymo controllers*
15:30 mattymo the other disadvantage to Fuel HA is we don't provide a load-balanced HA. It's really just 1 master and 2 standby controllers for 90% of all activity
15:30 Dr_Drache mattymo, but you already have all the data, can't you inject the new info to the old controllers?
15:30 TVR__ piontec... as stated by others... 1. share the ext network... 2. create an access and security policy allowing ingress access .. by default, the ext network would be considered the network to the real world, so your network dhcp server handles that... doing step 1 and step 2 you should be able to get an address...
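
A sketch of those two steps with the Havana-era CLI; net04_ext and the 'default' security group are the Fuel defaults assumed from this conversation, and the --shared flag syntax differs slightly between client versions:

    # 1. make the external network usable by tenants
    neutron net-update net04_ext --shared True
    # 2. allow inbound ICMP and SSH in the default security group
    nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0
    nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
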
15:31 mattymo Dr_Drache, probably, but we never tried it. We made a reference architecture, designed it, and someone here finds a bug, adds more controllers, and discovers it doesn't work :(
15:31 mattymo We can try to fix it, but what's the point of 1 master and 3 standby controllers?
15:32 Dr_Drache of course, whats the point of non-HA in a HA setup :P
15:32 mattymo high performance
15:32 piontec TVR__, yeah, I got that far; I got a public IP, I can communicate from the VM with the public net, but I still cannot connect to the VM using the public IP; I'll just check the sharing option - I think I might have it disabled
15:32 Dr_Drache so, in that case, adding more, would be higher performance?
15:32 mattymo one part is that OpenStack isn't designed to handle HA correctly
15:33 Dr_Drache I thought HA was a huge calling of OS.
15:33 mattymo if you attach to AMQP and a DB, it will never give up that session
15:33 mattymo nova-api will stay married to the same amqp and db until it gets restarted
15:33 Dr_Drache it's been paraded around as being HA for a while.
15:33 TVR__ most likely, it is in the access policy... if you have both default AND a custom one... the default will drop before the other one allows
15:33 mattymo we could introduce periodic restarts to properly do LB
15:35 Dr_Drache mattymo, I guess I was put off, by being told that OS scales, then not really being able to scale fully. (doesn't affect me)
15:36 mattymo Dr_Drache, believe me, we know many points about OpenStack scalability
15:37 warpig joined #fuel
15:37 mattymo 1000 is the cap for # of compute nodes right now. It's too much strain on a dedicated DB server to handle any more compute node updates
15:38 mattymo and requests for resource availability
15:38 warpig hi all, quick question for you...
15:38 mattymo warpig, go ahead
15:38 warpig I'm deploying OpenStack with Fuel and every time a node fails to deploy cleanly, Fuel can't seem to remove the environment and release the associated nodes...
15:38 warpig any way to do a full reset?
15:38 warpig v4.0
15:39 warpig The only way I've got around this so far is a fresh install of Fuel... :o(
15:40 mattymo warpig, 4.1 is almost ready
15:40 mattymo we have a lot of fixes for resetting in 4.1
15:41 xarses warpig, if you delete the cluster and that doesn't work, you can remove the nodes from cobbler ("cobbler system remove --name node-10") for each of the deployed nodes and then reboot them; they should 'rediscover' again
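
If every node cobbler lists is meant to be re-discovered, the per-node removal above can be looped; a sketch run on the Fuel master (only do this when no listed system should be kept):

    cobbler system list
    for s in $(cobbler system list); do
        cobbler system remove --name "$s"
    done
    # then reboot the affected nodes so they PXE-boot the bootstrap image and re-register
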
15:41 mattymo xarses, beat me to it
15:41 warpig OK, will try that, thanks.
15:42 warpig The nodes have already rebooted into the microkernel, but haven't registered with Fuel as usable nodes.
15:43 TVR__ I updated my ticket with the additional information.
15:43 Dr_Drache mattymo, so, do i get 4.1 tomorrow? LOL
15:43 Dr_Drache actually
15:43 Dr_Drache real question.
15:43 Dr_Drache this changes things.
15:44 zer0tweets joined #fuel
15:44 Dr_Drache I cannot use (as it stands) fuel deployed OS as HA solution, if 1/2 my cluster dies, i'm SOL.
15:44 TVR__ how long did you wait for them to remove?
15:45 xarses warpig: so they all have the hostname of "bootstrap"?
15:46 TVR__ warpig I have found that deleting a cluster sometimes takes a bit of time... sometimes greater than 10 min if it was a big cluster
15:47 warpig xarses: yep.  However, Fuel interface says "3 total nodes, 0 unallocated nodes" and the environment says "Removing"
15:48 warpig TVR:  Only a small cluster - 1 controller, 2 compute
15:48 mattymo Dr_Drache, if you have 1 controller left alive, it should live
15:48 mattymo you will lose running instances
15:48 Dr_Drache mattymo, so, HA works in that type of situation.
15:48 warpig [root@fuel ~]# fuel --env 11 env delete
15:48 warpig HTTP Error 400: Bad Request (Environment removal already started)
15:49 Dr_Drache mattymo, but, I could restart them correct?
15:49 warpig Has been over 10 min now...  :o(
15:49 TVR__ I would go with the cobbler system remove <name> for each one and then give the power a poke on each one
15:49 mattymo Dr_Drache, yes - the most likely failure you'll notice is Galera gives up and doesn't know who is master
15:50 Dr_Drache mattymo, sweet, as long as I wasn't understanding things incorrectly.
15:50 mattymo either consult our Fuel documentation or contact support. I've gotten the hang of recovering Galera if all hosts shut down in a short period of time
15:51 mattymo it involves comparing a file on the controllers, changing a setting, starting mysql, waiting, then repeat on the others, then back in business
15:51 mattymo that's in the event of total power loss
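
Roughly what mattymo's recovery steps look like written out; on a Fuel HA cluster mysql is actually managed by pacemaker, so treat this as the generic Galera procedure rather than the exact Fuel commands:

    # 1. on every controller, compare the replication position
    cat /var/lib/mysql/grastate.dat            # note the seqno field on each node
    # 2. on the node with the highest seqno, bootstrap a new cluster
    #    (or set wsrep_cluster_address="gcomm://" in wsrep.cnf before starting)
    service mysql start --wsrep-new-cluster
    # 3. start mysql on the remaining controllers one at a time, waiting for each to sync
    service mysql start
    mysql -Nbe "SHOW STATUS LIKE 'wsrep_local_state_comment'"    # wait for: Synced
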
15:51 xarses warpig: that's not right; if you can, create a bug at https://bugs.launchpad.net/fuel/+filebug and upload a diagnostic snapshot to the ticket so we can look over all the logs
15:52 TVR__ which is what might have caused my mysql issue when I added 2 controllers... Galera probably did not realize my node-1 was still up... assumed a different master, and node-1 thought it was master... and why my rebooting of the node-1 allowed the Deploy changes to work after..
15:52 piontec mattymo, when can we expect 4.1?
15:52 Dr_Drache mattymo, yea, we'd use support in that case, we are holding off until I can prove a ubuntu test deployment.
15:53 warpig xarses:  Will do.  Do you have a link describing how to create the snapshot?
15:53 warpig Is it in the docs?
15:53 Dr_Drache warpig, click support.
15:53 Dr_Drache in the fuel UI
15:53 warpig OK, cool - thanks.
15:53 Dr_Drache 1/2 way down
15:53 Dr_Drache create snapshot
15:53 xarses Dr_Drache: it's simpler than that; in most cases, corosync will do all that stuff for you and just start correctly. There are cases where you will manually have to "rebuild" the galera like mattymo describes
15:53 mattymo piontec, soon. I can't say more than that
15:54 Dr_Drache wait for it, then download and upload.
15:54 TVR__ mattymo having the recovery procedure you do in a doc somewhere would be fantastic, if you have it
15:54 Dr_Drache xarses, I was getting confused. glad mattymo steered me right.
15:55 mattymo xarses, in most cases, yes it's fine. I caused a couple situations on my own where I broke corosync and it, in turn, broke Galera
15:55 xarses piontec: fuel 4.1 should drop tomorrow
15:56 xarses mattymo: thats the side case that still requires doing it manually
15:56 Dr_Drache anyone a dell expert? LOL, I have some questions, that's not fuel/OS related
15:58 xarses I can answer some questions
15:58 piontec ok, still no luck with my config: here some basic output http://paste.openstack.org/show/70277/
15:59 Dr_Drache well, xarses, I created an iso with the repository manager, and it doesn't boot on the R415's either.
15:59 piontec the result is still no IP on the VM (I want to use DHCP on my router, but it still only sends an offer, never gets ACK)
16:04 anotchenko joined #fuel
16:20 xdeller joined #fuel
16:24 xarses Dr_Drache: ya sorry, I never did any of that with them, It kind of sounds like its time to see if there is a bios update
16:25 Dr_Drache xarses, there isn't, and no way to flash the bios anyway, that iso is the bios updater.
16:25 Dr_Drache LOL
16:25 Dr_Drache shit
16:25 Dr_Drache lol
16:29 anotchenko joined #fuel
16:39 vkozhukalov joined #fuel
16:50 Ch00k joined #fuel
17:08 zer0tweets joined #fuel
17:27 rmoe joined #fuel
17:50 xarses joined #fuel
17:52 xarses joined #fuel
18:11 angdraug joined #fuel
18:14 Ch00k joined #fuel
18:36 e0ne joined #fuel
18:37 saju_m joined #fuel
18:41 vk joined #fuel
18:46 rverchikov joined #fuel
18:51 saju_m joined #fuel
18:55 vkozhukalov joined #fuel
19:06 AndreyDanin_ joined #fuel
19:28 dburmistrov joined #fuel
19:40 mutex has anyone here done a migration from one fuel installation to another for instances/images etc ?
20:00 nmarkov joined #fuel
20:18 MiroslavAnashkin joined #fuel
20:26 meow-nofer joined #fuel
20:26 nmarkov joined #fuel
20:26 meow-nofer__ joined #fuel
20:39 dmit2k hello everybody
20:40 dmit2k TVR__: hi, have two reference ceph clusters and can proceed with some tests for you if you want
20:41 TVR__ ok.. what / which tests?
20:41 TVR__ heh
20:41 TVR__ many projects here
20:41 dmit2k TVR__: BTW, how's your testing with HA?.. I see that you had problems with mysql when creating additional controllers
20:41 TVR__ adding controllers? bench testing ?
20:41 dmit2k both :)
20:42 dmit2k TVR__: just as a side note - I have had this same problem today when deploying the clean cluster from scratch
20:42 TVR__ rados bench has shown decent results, as the more OSDs the better the performance... right now I am fighting to get cosbench working so I can hammer the ceph backend
20:42 dmit2k TVR__: one of the controller nodes did not start with the same problem
20:43 TVR__ how many controllers do you have?
20:44 TVR__ ok.. so my "fix" was ... seeing as how the sock was not there and mysql ~believed~ is was up... was to reboot the server and then apply changes again.. once I verified the mysql socket was there
20:44 dmit2k I flushed my test cluster which had 3 controllers (each of them also OSD) node and 2 compute nodes
20:45 dmit2k then redeployed from scratch, with new cluster config
20:45 TVR__ have you run any bench tests on them before you add (added) any more OSDs?
20:45 dmit2k Fuel failed to deploy one single node with the same error you had
20:46 dmit2k then I simply restarted that node and fired Deploy again
20:46 TVR__ ok.. so the others were 'ready' but that one failed...
20:46 xarses_ joined #fuel
20:46 dmit2k things went smooth, everything UP
20:46 TVR__ what was the output of ceph -s on it?
20:46 TVR__ was ceph OK on it?
20:46 dmit2k I could not get to it, as the cluster was new
20:47 dmit2k it was partially deployed
20:47 TVR__ ok.. so a bit different timing.. but same issue..
20:47 dmit2k so I suppose there are dragons inside
20:47 dmit2k and maybe it is more common issue
20:48 dmit2k I'm much concerned about HA / recovery things
20:49 dmit2k TVR__: did you try to destroy one controller node of 3 and redeploy it back?
20:50 dmit2k parses yesterday explained a lot to me about redeployment procedures, think you can look through the IRC logs for last night
20:50 dmit2k *xarses
20:53 dmit2k TVR__: so did you try to destroy one working controller and redeploy a substitute?..
20:55 dmit2k I suppose that adding +1 controller and replacing a faulty one may work in different ways when redeploying
20:55 TVR__ no... but that is my next task if I cannot get this cosbench to work
20:55 dmit2k TVR__: OK, I will do some tests myself tomorrow, can compare results
20:56 TVR__ I will be killing node-1 when I do.. so as to be sure it is the primary at that time
20:56 dmit2k That was also my question - how can I know which one is master at the moment?
20:56 Dr_Drache wonder what time 4.1 will be posted
20:57 dmit2k or tends to be master
20:57 dmit2k Dr_Drache: hello
20:57 dmit2k Dr_Drache: waiting for Christmas? :)
20:57 Dr_Drache dmit2k, yea
20:58 dmit2k Dr_Drache: suppose that was a kind of 1st of April :)
20:58 Dr_Drache we were hinted tomorrow.
20:58 dmit2k too many uncovered issues for me, don't think it will be released now
20:59 Dr_Drache ?
20:59 Arminder joined #fuel
20:59 mihgen hi folks
20:59 mihgen I heard word issues
20:59 mihgen in master? I'd like to learn more :)
20:59 rvyalov joined #fuel
21:00 mihgen we called for code freeze, so stable/4.1 branch was created
21:00 dmit2k BTW, what did devs decide on that "feature" to disallow addition of a controller to an existing cluster? to revert it or not?
21:00 mihgen dmit2k: it needs to be QAed first
21:00 dmit2k we had a discussion about it last night
21:01 mihgen then we can re-enable that
21:01 dmit2k mihgen: good point
21:01 mihgen did you ever try it?
21:01 dmit2k yes
21:01 dmit2k 50/50
21:01 mihgen what does it mean?)
21:01 mihgen works in 50% of cases?
21:01 dmit2k yes :)
21:01 mihgen nice :)
21:02 mihgen so we will see how it works…
21:02 mihgen so what about issues you are talking about?
21:02 dmit2k I've heard a good explanation of that process from parses last night, and so tried it myself
21:02 dmit2k *xaresr
21:02 dmit2k fck, sorry, autocorrection
21:03 mihgen )
21:03 dmit2k so I destroyed 1 controller from HA of 3, then deleted that node, then reassigned the same roles to it
21:04 dmit2k Fuel redeployed it back OK, I had to wait some more minutes while it settled, but then it all worked OK
21:04 dmit2k next
21:04 dmit2k I turned off another controller
21:05 dmit2k waited until things settled
21:05 dmit2k and added another physical server with the same roles
21:06 dmit2k it finally started, but had errors on the OpenStack stuff provisioning stage
21:07 dmit2k after several reboots it deployed, but then I had to wait-reboot-wait several times before it became a part of the cluster
21:07 dmit2k that's why I think it works 50/50
21:07 mihgen wait "added physical with same roles" - what roles?
21:07 dmit2k controller + OSD
21:08 mihgen oh ok. so you added third to the env, right?
21:08 mihgen after you shut down previous one
21:09 mihgen yep, I got it. okey dude sounds like a real issue
21:09 mihgen but what openstack stuff provisioning issues you had?
21:09 mihgen you mean puppet or when you tried to spin up VM?
21:10 dmit2k unfortunately I was in a hurry to redeploy the cluster to test Infiniband, so didn't note it exactly :( something about a timeout waiting for mysql
21:10 dmit2k puppet
21:11 dmit2k and I also had an issue which I saw somewhere in bug reports recently - I was unable to start VMs afterwards
21:12 dmit2k sorry, was in a hurry, so I just destroyed VM and created back, then it worked
21:12 Dr_Drache hmmm
21:13 Dr_Drache mihgen, so 4.1 isn't coming tomarrow?
21:14 mihgen dmit2k: dude please take diagnostic snapshot next time and file a bug :)
21:14 mihgen yep there is one bug where you restart a controller and your VM can't reattach its volume (if it had one attached)
21:14 mihgen Dr_Drache: we are trying to buy two more days for testing
21:15 dmit2k mihgen: agree! :)
21:15 mihgen will see how it goes. it's more or less good so far..
21:15 dmit2k mihgen: guys @ CloudStack were in such a hurry to release 4.2.1 and left so many bugs there...
21:16 dmit2k mihgen: let it take more time but be more solid
21:16 dmit2k sorry, have to turn off autocorrection
21:16 mihgen yep I'm fully on same position..
21:17 Dr_Drache dmit2k, i'm here to find more bugs :P
21:17 mihgen :)
21:17 Dr_Drache mihgen, that's fine, i'm not trying to push, other than I want 5.0 sooner
21:17 dmit2k Dr_Drache: you already got one! BIG :)
21:17 dmit2k Dr_Drache: DELL
21:17 Dr_Drache (14.04 ubuntu, firefly, icehouse)
21:17 mihgen Dr_Drache: 5.0 will include icehouse
21:17 Dr_Drache mihgen, yea, i know. i'm giddy.
21:18 Dr_Drache dmit2k, the dells are fine.
21:18 mihgen so our goal would be to release asap after packages are ready..
21:18 dmit2k mihgen: would you please explain your view
21:18 dmit2k :
21:18 dmit2k so you disable the feature to add another controller
21:18 dmit2k then
21:19 Dr_Drache mihgen, i may have fixed ubuntu
21:19 dmit2k if I have one faulty and absolutely dead
21:19 dmit2k my steps?
21:19 dmit2k Dr_Drache: how?
21:19 mihgen Dr_Drache: what do you mean? make sure that all works under 14.04 ?)
21:20 mihgen dmit2k: you remove one faulty, and add new one
21:20 Dr_Drache mihgen, well, i thought 5.0 would come with 14.04, since it's the next LTS, and 12.04 would be EOL
21:20 mihgen will not that patch allow you to do such?
21:20 mihgen Dr_Drache: yep I would love to but I'm not sure how much effort it would take to backport everything to the new version
21:20 mihgen especially all those tricks around ovs+kernel versions
21:21 Dr_Drache mihgen, well, icehouse is being built on 14.04. figured it would be fine. but what do i know :P
21:23 dmit2k mihgen: if you disable the option to add another controller, will it still allow me to add one _instead_ of removed?
21:23 mihgen Dr_Drache: it's true.. but all those ovs things, likely provisioning / disk stuff..
21:23 mihgen dmit2k: it should. if it's not - then it's gonna be bug for sure
21:23 dmit2k so Fuel will remember the initial count of controllers and simply will not allow to exceed it?
21:24 mihgen I didn't do that patch and also wondering now :)
21:24 mihgen vk: are you around?
21:24 dmit2k same for us...
21:26 dmit2k xarses_ was also surprised by that last night, and promised to contact the submitter
21:27 mihgen I'm trying to reproduce on fake UI..
21:27 mihgen vk knows about it, but it looks like that he went to bed already
21:28 xarses mihgen: I'll have a deployment soon to retest, and if i confirm its not needed im submitting a request to remove the block in the UI
21:28 mihgen meanwhile I can confirm the issue dmit2k pointed out
21:28 mihgen it's real crap
21:29 mihgen if my controller fails, I can't replace it with another one
21:29 mihgen thanks guys for attention on it
21:29 mihgen xarses: I think we rather need to revert that patch now
21:30 dmit2k mihgen: we are here to help you guys :)
21:30 mihgen and not to play with logic at that late moment at all
21:30 xarses mihgen: sounds good to me
21:30 dmit2k mihgen: I would vote for this
21:30 xarses I'll propose a commit for it.
21:30 mihgen dmit2k: I owe you a beer dude)
21:31 Dr_Drache hmmm
21:31 dmit2k heh, maybe exchange for some good advice sometimes :)
21:31 mihgen come to moscow or .. where you reside?) I might be in your district, who knows )
21:31 dmit2k I'm from LV
21:32 mihgen vegas?
21:32 dmit2k .LV :) Latvia
21:32 mihgen ))
21:32 mihgen never been.. I thought you were somewhere in the US as you're still not sleeping ..
21:33 dmit2k heh... being an IT engineer means that you sleep by day and work at night :)
21:33 xarses or dont sleep at all
21:33 mihgen sorry, but I'll buy beer to xarses sooner, I'm traveling to CA next week :)
21:34 dmit2k xarses: agree ;(
21:34 dmit2k :)
21:35 dmit2k may I ask one more tech question?..
21:35 Dr_Drache no
21:35 Dr_Drache LOL
21:35 dmit2k Dr_Drache: go patch your DELLs :))))
21:35 dmit2k so
21:36 Dr_Drache dmit2k, dells have a lower TCO than much else.
21:36 Dr_Drache plus, I came into this shop being a dell shop
21:36 dmit2k Dr_Drache: just kidding :)
21:36 Dr_Drache dmit2k, all good.
21:36 Dr_Drache only thing I hate about them.... well not enough LEDs
21:37 dmit2k so... I have a need to use Infiniband for inter-VM traffic
21:37 mihgen dmit2k: did you ask about this via LP questions?
21:37 dmit2k and have to reuse some older cards from Mellanox without ethernet bridges or EoIB
21:38 dmit2k mihgen: yes I did today and got a lot of help from A. Korolev
21:38 dmit2k but this is another question :)
21:38 mihgen I won't help here for sure… why don't you join https://launchpad.net/~fuel-dev mailing list?
21:39 mihgen that's perhaps even better way for such heavy questions)
21:39 dmit2k which network model should I use to accomplish this without OVS/linux bridges?
21:39 dmit2k OK...
21:40 dmit2k BTW, was quite surpassed you do not support infiniband and nobody asked questions so far
21:40 dmit2k *surprised
21:43 dmit2k OK, not so heavy question! :))) Is there going to be any support for soft raid in Fuel?
21:44 dmit2k due to some hardware and CEPH optimisation issues I have no option to use hardware raid
21:45 Dr_Drache why use raid at all?
21:45 Dr_Drache at least for ceph
21:45 dmit2k for system
21:46 dmit2k we use SSD for booting, and SSDs tend to die
21:47 dmit2k there may be no problem for OSD-only or compute nodes which are easily redeployed, but for controllers it can become a pain
21:48 dmit2k when the disk dies some in-the-ram components continue working and may result in data corruption or cluster failures
21:48 Dr_Drache and you can't hardware raid them?
21:49 dmit2k unfortunately :( and moreover - I have not trusted hardware raids for some time
21:49 Dr_Drache .......
21:50 dmit2k as I was able to recover soft raid 1 arrays manually, but if your hardware one dies - ....
21:50 Dr_Drache how can you not raid drives, and not trust hardware raid over software?
21:50 Dr_Drache plug them into another controller of same type
21:50 Dr_Drache boom
21:50 Dr_Drache data is there.
21:50 dmit2k not always :) really
21:51 Dr_Drache either way.
21:51 Dr_Drache I can't say I agree that it's easier to get data from a software raid than to restore a backup.
21:51 mihgen well we configure raid… I think raid1 for /boot ..
21:51 dmit2k often when HW raid dies it corrupts both disks in raid1 array
21:51 mihgen not sure though
21:51 mihgen but it's not exposed to the UI
21:52 Dr_Drache dmit2k, stop using hardware that dies :P
21:52 dmit2k Dr_Drache: stop using DELLs! :))))
21:52 mihgen some info is here: http://docs.mirantis.com/fuel-dev/develop/nailgun/partitions.html
21:53 dmit2k mihgen: yes, I saw it, but for me it is not raid - you just create /boot partition on every disk (which itself may be inappropriate in some cases)
21:53 mihgen ok.. then that's what we have for now :)
21:53 dmit2k but I'd like to have an option for a software raid1 to keep the system on it
21:53 mihgen we are planning to write python library for doing all required disk configurations
21:54 mihgen and get rid of anaconda at all
21:54 dmit2k anaconda is not evil :)
21:54 mihgen library would be more powerful than what anaconda provides now
21:54 mihgen we had a number of issues around disks with it, and with preseed no less
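
Coming back to the soft-raid question: Fuel 4.0 won't build this itself, but a hand-rolled sketch of the RAID1 system array dmit2k describes might look like the following (device names are hypothetical):

    # mirror the system partitions across the two SSDs
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
    mkfs.ext4 /dev/md0
    mdadm --detail --scan >> /etc/mdadm.conf    # persist the array definition
    # /boot stays a plain partition on each disk, as Fuel already lays it out
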
21:55 xarses dmit2k: https://review.openstack.org/#/c/76979/ and https://review.openstack.org/76982
21:56 dmit2k xarses: THANX!
21:57 * dmit2k happy
21:57 xarses vote! or die
21:57 mihgen why don't you +1 too, folks?)
21:58 xarses angdraug: ^^
21:58 dmit2k don't have a plus button
22:00 xarses dmit2k: you have to be all set up in review.openstack.org (login and such) and then there is a review button
22:00 dmit2k she check....
22:01 dmit2k *shd
22:06 dmit2k hmm, so if I vote +1 -- is it for reverting the patch?
22:07 angdraug merged
22:09 dmit2k yepp
22:14 xarses =)
22:16 e0ne joined #fuel
22:24 dhblaz joined #fuel
22:24 dhblaz I am having trouble with vms not getting leases from dhcp
22:24 dhblaz I don't see anything too telling in the logs
22:25 dhblaz but I do see that two of my controllers is running dnsmasq and the other isn't
22:25 dhblaz Can someone tell me if all of my controllers should be running dnsmasq or just one?
22:26 dhblaz 2 of 3 doesn't sound right
22:27 xarses dhblaz: just one instance of dhcp-agent and l3-agent
22:27 xarses somewhere across your controllers
22:30 dhblaz is there a way for me to fix crm to maintain this requirement?
22:32 xarses it should only run one anyways
22:34 xarses you will probably find that while dnsmasq might be running on two systems, only one has dhcp-agent running
22:34 dhblaz This is true
22:35 xarses if two dhcp-agents are running then there is a big problem with crm
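
A quick way to cross-check where the agent is supposed to be against what is actually running (resource name as used in this conversation):

    crm status | grep -A1 dhcp        # where pacemaker has p_neutron-dhcp-agent placed
    neutron agent-list                # what neutron itself reports
    # then on each controller:
    ps -ef | grep [d]hcp-agent
    ps -ef | grep [d]nsmasq
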
22:35 xarses you probably have this bug https://bugs.launchpad.net/fuel/+bug/1269334
22:35 xarses which could have caused the split
22:36 dhblaz I think I applied that patch already
22:38 xarses I'll poke around the script to see if it kills off dnsmasq as part of the shutdown
22:43 xarses ok, there is a chance that if it's sent SIGKILL dnsmasq might not be cleaned out, I'll test as soon as my deployment finishes.
22:45 dhblaz based on the logs
22:45 dhblaz it looks like node-17 was (at least trying to) answer queries
22:46 dhblaz at 22:06 I did crm resource restart p_neutron-dhcp-agent
22:47 dhblaz at this point node-16 started answering queries and things started working
22:47 dhblaz right before I did the crm resource restart I did
22:47 dhblaz neutron agent-list
22:47 xarses does crm resource cleanup p_neutron-dhcp-agent remove the "defunct" dnsmasq
22:48 dhblaz No
22:49 dhblaz But I don't know how important it is
22:49 dhblaz the neutron agent-list showed
22:49 dhblaz DHCP agent         | node-17.mumms.com | :-)   | True
22:49 dhblaz After the resource restart it showed:
22:50 dhblaz DHCP agent         | node-16.mumms.com | :-)   | True
22:50 dhblaz I don't know if it matters but neutron agent-list shows two L3 agent
22:50 dhblaz one of them xxx
22:51 dhblaz the dhcp server has also given out an IP address to two different vms
22:51 dhblaz one of them is shutdown
22:51 dhblaz so it isn't causing a conflict on the wire
22:52 dhblaz but it is causing a conflict in the config
22:54 xarses ya, you should be ok to just kill off the defunct node's dnsmasq processes
22:55 xarses if you see that in the neutron data (duplicate ip) it's not likely related to dnsmasq not closing on the one node
22:55 dhblaz I confirmed that patch is applied
22:56 dhblaz It shows in the dnsmasq log:
22:56 dhblaz 2014-02-27T22:05:44.594360+00:00 err:  duplicate dhcp-host IP address 10.29.8.54 at line 12 of /var/lib/neutron/dhcp/50644057-b518-4e85-843a-3321c9a4073f/host
22:56 xarses ok, then it's likely the SIGTERM vs SIGKILL, I'll have that tested shortly
22:56 dhblaz It seems like node-16 and node-17 don't have their configs sync'd
22:57 xarses you can just kill it all and crm will restart it =)
22:57 xarses just whack dnsmasq and dhcp-agent
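
A sketch of that cleanup on the controller holding the leftover processes; pacemaker should restart the agent where it belongs afterwards:

    pkill -f neutron-dhcp-agent
    pkill dnsmasq
    crm resource cleanup p_neutron-dhcp-agent    # clear any failed-action state in pacemaker
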
22:58 dhblaz but what is responsible for maintaining /var/lib/neutron/dhcp/50644057-b518-4e85-843a-3321c9a4073f/host?
22:58 dhblaz for instance node-16 has an entry for 10.29.8.55 but none for 10.29.8.54
22:58 dhblaz node-17 has no entry for .55 and 4 for .54
22:59 dhblaz correction I see a single entry for .54 in node-16, just not in the same order as it is for node-17
23:00 dhblaz I confirmed that there is only one copy of dhcp-agent running
23:00 dhblaz xarses: the main concern I have is that I have to run crm resource restart p_neutron-dhcp-agent often
23:00 dhblaz and I don't have a good way to determine that it is broken until images stop launching
23:04 dhblaz Based on the logs the problem is actually with neutron-dhcp-agent
23:04 dhblaz node-17 shows that neutron-dhcp-agent wasn't running when it was supposed to
23:05 dhblaz Not sure why the agent list showed it as happy
23:05 dhblaz I have to run, I'll check the IRC logs later if you have some more to add I'll read it.  Thanks for all your help xarses
23:11 xarses joined #fuel
23:14 richardkiene_ joined #fuel
23:36 zer0tweets joined #fuel
23:38 GeertJohan joined #fuel
23:45 xarses joined #fuel
23:45 xarses joined #fuel
23:45 mattymo joined #fuel
23:46 sanek joined #fuel
