
IRC log for #fuel, 2016-01-11


All times shown according to UTC.

Time Nick Message
00:18 vsedelnik joined #fuel
00:23 tzn joined #fuel
00:47 zhangjn joined #fuel
00:53 zhangjn_ joined #fuel
00:59 dancn joined #fuel
01:24 tzn joined #fuel
02:05 dancn joined #fuel
02:22 vsedelnik joined #fuel
02:25 tzn joined #fuel
02:47 zhangjn joined #fuel
02:49 ilbot3 joined #fuel
02:49 Topic for #fuel is now Fuel 7.0 (Kilo) https://www.fuel-infra.org/ | Paste here http://paste.openstack.org/ | IRC logs http://irclog.perlgeek.de/fuel/
02:56 zhangjn joined #fuel
03:03 zhangjn joined #fuel
03:08 zhangjn joined #fuel
03:17 zhangjn joined #fuel
03:28 iamroddo joined #fuel
04:17 zerda joined #fuel
04:25 vsedelnik joined #fuel
04:27 tzn joined #fuel
04:34 vsedelnik joined #fuel
05:20 krypto joined #fuel
05:25 vsedelnik joined #fuel
05:28 zhangjn joined #fuel
05:47 javeriak joined #fuel
05:49 javeriak_ joined #fuel
06:00 vsedelnik joined #fuel
06:08 iamroddo joined #fuel
06:09 iamroddo joined #fuel
06:21 javeriak joined #fuel
06:28 iamroddo_ joined #fuel
06:29 tzn joined #fuel
06:41 zhangjn joined #fuel
06:52 vsedelnik joined #fuel
07:05 RageLtMan anyone around running Ceph?
07:06 RageLtMan Looks like the version of firefly deployed via Fuel7 is susceptible to http://tracker.ceph.com/issues/7915 - which causes ceph to go down completely, with all images, volumes, etc in it
07:07 RageLtMan kind of nuked all of openstack... so if anyone is around, would love a hand with figuring out how to fix this mess (i just took two volume snapshots within 10s of each other and that caused total ceph death)
07:22 krypto joined #fuel
07:29 magicboiz joined #fuel
07:30 tzn joined #fuel
07:33 iamroddo_ Anyone with experience using Fuel 5.1.1 to push a package installation onto a particular node type?
07:37 javeriak joined #fuel
07:44 krypto for adding a new iscsi backend alongside the existing cinder ceph backend, do i need to make any changes in the existing cinder.conf on the 3 controller nodes?
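A minimal sketch of the multi-backend change being asked about, assuming a Kilo-era cinder.conf; the section names, the LVM/iSCSI driver choice and the pool name are illustrative placeholders rather than values from this deployment:

    # /etc/cinder/cinder.conf on each of the 3 controllers
    [DEFAULT]
    enabled_backends = ceph-backend,iscsi-backend

    [ceph-backend]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    volume_backend_name = ceph-backend
    rbd_pool = volumes

    [iscsi-backend]
    volume_driver = cinder.volume.drivers.lvm.LVMISCSIDriver
    volume_backend_name = iscsi-backend

After editing, cinder-volume is restarted on the controllers and volume types are mapped to each volume_backend_name (e.g. `cinder type-create iscsi; cinder type-key iscsi set volume_backend_name=iscsi-backend`) so users can target a specific backend.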
07:51 iamroddo joined #fuel
07:51 fzhadaev1 joined #fuel
07:51 krypto- joined #fuel
07:56 magicboiz joined #fuel
08:12 magicboiz joined #fuel
08:18 vsedelnik joined #fuel
08:21 vsedelnik joined #fuel
08:31 tzn joined #fuel
08:33 javeriak joined #fuel
08:49 pal_bth joined #fuel
08:51 javeriak joined #fuel
08:51 zinovik joined #fuel
08:59 javeriak joined #fuel
09:05 e0ne joined #fuel
09:09 zhangjn joined #fuel
09:22 samuelBartel joined #fuel
09:23 javeriak joined #fuel
09:32 holser_ RageLtMan: What has happened?
09:32 holser_ Could you be more specific?
09:32 RageLtMan I issued two snapshot requests within ~10s of each other, while the instance was down
09:33 RageLtMan next thing i know the compute node which usually holds it is down hard, required an ipmi reboot
09:33 RageLtMan then libvirt started to complain about rados not working, so i restarted the OSDs, and none came back up
09:33 holser_ hmmm, that’s something unexpected. Could you upload logs somewhere?
09:34 holser_ rados relies on apache
09:34 RageLtMan http://pastebin.com/iNdDsf1a
09:34 holser_ but again, without logs it’s very hard to say what has happened
09:34 RageLtMan its not rados
09:34 RageLtMan its ceph
09:34 RageLtMan the OSDs are failing
09:34 holser_ I see
09:34 RageLtMan the snapshots which were being created are apparently corrupt
09:35 RageLtMan there's an existing issue in the ceph tracker, marked as dup, but no link to the duplicate or any solution
09:35 RageLtMan http://tracker.ceph.com/issues/7915
09:35 RageLtMan the assert being tripped seems to think there's a zero length snap
09:35 RageLtMan or less than anyway
09:36 tzn joined #fuel
09:36 holser_ hmm
09:36 RageLtMan currently going through a guide on recovering the most critical VMs, but this could be the end of our openstack experiment - months of work down the drain from a normal operation
09:37 RageLtMan I can see why Ceph isnt recommended by the sales guys :)
09:37 holser_ :\
09:37 RageLtMan seems a good idea till something like this pops up and its BTRFS-like hell all over again
09:37 RageLtMan desperately need a ceph pro to help me determine if i can comment out the assert or if that'll do even more damage
09:38 RageLtMan if i can get the OSDs up ill be happy to migrate all volumes and glance images the hell out to something
09:38 RageLtMan also, why in the world does Fuel use such an old Ceph release?
09:39 RageLtMan all the guides on repairing damage rely on tools we dont have :(
09:41 holser_ There will be the latest release in 8.0
09:42 holser_ but I doubt the issue persists in latest ceph also
09:52 holser_ RageLtMan: I would recommend to ask someone on #ceph #ceph-devel
09:52 holser_ I asked our Ceph folks to have a look, as the issue is well beyond my knowledge
09:53 RageLtMan i've asked there all day, not a whisper back - seems like a quiet chan
09:54 RageLtMan ironically enough our old cloudstack is currently holding up the few recovered remnants of the workload, but this is just catastrophic.
09:54 holser_ I agree
09:54 holser_ Give me a bit … I am chatting with our Ceph developers
09:55 RageLtMan thanks a ton
09:55 holser_ The issue is critical, so if it’s present in 8.0
09:55 holser_ we will need to create a bug, as we cannot ship it with such an issue
09:55 RageLtMan i'm trying to track down fragments of our ops system in RBD chunks. damned filenames dont lend themselves to scripted copying
09:55 RageLtMan agreed
09:55 RageLtMan well, latest is 10.0.1
09:55 RageLtMan i assume you're 9.2
09:56 RageLtMan which i believe does still have the issue
09:56 holser_ as a simple command will cause a catastrophic outage
09:56 holser_ 9.2 has that issue
09:56 holser_ I checked the ceph tracker
09:56 RageLtMan my own damned fault for using Ceph - Frank even told me not to
09:57 RageLtMan not only is my system hosed but my ego is bruised too now, sales people being right and all ;)
09:57 zhangjn joined #fuel
09:58 RageLtMan all this while wrapping up the cloudstack decom, actually snapped the last workload migrated from there which started this mess.
10:00 RageLtMan i also just pulled the latest fuel 8 build image, is it anywhere near testable yet? i can probably cut cloudstack in half and see about standing up a liberty cluster if 8 is stable-ish
10:02 holser_ http://tracker.ceph.com/issues/11493
10:02 kgalanov Hello RageLtMan, I will try to reproduce the bug in 7.0 and 8.0 releases
10:03 holser_ I doubt we have the bug in 7.0, 6.1. We need to produce backports for these releases. 8.0 should be ok… though we’ll test if this bug hits it
10:04 RageLtMan kgalanov: i've created many snapshots before, only difference here was the short time between issued commands - happened in a bash loop
10:04 RageLtMan no idea why a host died, could be from the op, or what caused it
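For context, a hypothetical reconstruction of the trigger (volume names are placeholders; the only relevant detail is two snapshot requests landing within ~10 seconds):

    # issue back-to-back snapshot requests with no delay between them
    for vol in ops-volume-1 ops-volume-2; do
        cinder snapshot-create "$vol"
    done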
10:04 RageLtMan currently building firefly-backports branch with 0.80.11 merged in and remotes/origin/historic/snap-workaround
10:10 RageLtMan holser: seems that issue references a commit - bbec53edf9e585af4e20bbc9ba9057d6fdfda342 - which claims to fix this. looked over the diff, not sure how this is addressed, but added into the build
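For reference, a hedged sketch of folding a commit like that into a local firefly build; it assumes a build host with the Ceph build dependencies installed and uses the Debian packaging shipped in the source tree:

    git clone https://github.com/ceph/ceph.git && cd ceph
    git checkout firefly                                      # the 0.80.x stable branch
    git cherry-pick bbec53edf9e585af4e20bbc9ba9057d6fdfda342  # the fix referenced above
    dpkg-buildpackage -us -uc                                 # produces the replacement .debs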
10:10 aarefiev joined #fuel
10:10 RageLtMan unfortunately ceph takes forever to build out
10:10 holser_ it depends on network bandwidth
10:18 magicboiz RageLtMan: did you try with ceph-users ML? there are a lot of experts there.......
10:19 RageLtMan I joined the ML yesterday, so no response yet
10:19 RageLtMan finally figured out how to copy the bloody rbd files for the image i'm seeking anyway
10:20 holser_ kgalanov is reproducing the issue on vanilla 7.0, just to understand whether we need to add backports to the next maintenance updates
10:21 holser_ give him a bit of time
10:22 RageLtMan of course, thank you guys so much - you're the only ones who've rogered up to my pleas for assistance :)
10:22 holser_ yw
10:23 dhg joined #fuel
10:23 RageLtMan reading through some of the workarounds in the git history im starting to think that Ceph sort of needs a CoW with transactional semantics, a lot of this is edge-case handling for partial writes and other f'ups
10:23 zhangjn joined #fuel
10:23 RageLtMan oh christ, 0530 already
10:25 zhangjn joined #fuel
10:30 RageLtMan holser: i need to grab a couple hours of sleep if i'm to be of any use to clients and team members today. Will you guys be around in ~6 hours?
10:37 RageLtMan Please PM me if you guys find an approach to getting the OSDs back up, i'll check in here in a couple of hours before first meeting. If there's any log data you need, anything like that, please ping me as well.
10:38 asaprykin joined #fuel
10:39 asaprykin_ joined #fuel
10:39 trociny joined #fuel
10:39 hyperbaba joined #fuel
10:41 trociny RageLtMan: I think the duplicate for http://tracker.ceph.com/issues/7915 is http://tracker.ceph.com/issues/11493, although it is marked as resolved and your version (0.80.11) contains the fix
10:41 trociny RageLtMan: could you please send the output of your current `ceph -s` and `ceph osd tree`?
10:41 hyperbaba hi there. Again I have a problem with token expiration in glance when saving a large snapshot. The image is stuck in the saving state and the upload is aborted with a notAuthorized error. I've changed the token expiration to more than one hour, and also a few other settings in keystone.conf (found on the net in relation to this problem), and the problem persists. The Fuel is 5.1.1, deployed with Icehouse
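The knob usually involved here is the token lifetime in keystone.conf; a minimal sketch for an Icehouse-era deployment, assuming the upload simply has to finish within the token's lifetime (the value and the restart command depend on how long the upload takes and on the distro Fuel deployed):

    # /etc/keystone/keystone.conf on the controllers
    [token]
    expiration = 14400        # seconds; the default is 3600 (1 hour)

    # then restart keystone, e.g. on Ubuntu:
    service keystone restart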
10:41 RageLtMan sure
10:42 trociny RageLtMan: do all osd report the same error in logs on start?
10:42 RageLtMan http://pastebin.com/HaUPzC6w
10:42 RageLtMan yes
10:43 RageLtMan exactly the same
10:43 trociny RageLtMan: I suggest writing to ceph-dev if there is no response on ceph-user. This is the best place to ask for this kind of problem. Developers are too busy to read the user list
10:43 RageLtMan all OSDs are down and out
10:43 pal_bth joined #fuel
10:43 RageLtMan will ping ceph-dev as well
10:44 trociny RageLtMan: are you using cache tiering?
10:44 trociny RageLtMan: as for recovery tools, they should also be available in 0.80.11; they are just in a separate package, ceph-test, which is not installed by fuel by default. Are you looking for some specific tool?
10:46 RageLtMan 80.11 is building now, Fuel installs 80.9 IIRC. dont remember the names of the tools off the top of my head, but the ceph disaster recovery docs reference binaries that the debian packages i built dont contain, according to dpkg -c
10:46 RageLtMan iirc, i did look in all of the built debs
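A quicker check than rebuilding, assuming the packaged release in use: install the matching ceph-test package and list what it ships, since tool names shift between releases:

    apt-get install ceph-test
    dpkg -L ceph-test | grep -i tool    # e.g. ceph-osdomap-tool and friends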
10:47 RageLtMan currently going through this mess - https://ceph.com/planet/ceph-recover-a-rbd-image-from-a-dead-cluster/
10:47 RageLtMan educational...
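The approach in that guide amounts to harvesting one image's data objects straight out of the OSD filestores and then reassembling them into a raw image; a rough sketch, with the object-name prefix being a placeholder that would come from `rbd info` or the image header:

    # collect every on-disk chunk of the image from all local OSD filestores
    find /var/lib/ceph/osd/*/current -name 'rb.0.75c8.238e1f29.*' \
         -exec cp --parents {} /mnt/recovery/ \;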
10:48 RageLtMan trociny: have you ever seen this before? it strikes me as kind of crazy that a bad snapshot can bring a whole distributed FS to its knees like this. seems like there'd be a rational workaround or some sort of "safe mode" operation to remedy this.
10:49 RageLtMan and i never configured tiering past what Fuel would have done
10:49 RageLtMan journals are all SSD partitions in this setup, but dont think there's anything more advanced than that
10:50 RageLtMan oh and of course all disks are present and xfs_check doesnt find jack-all wrong with them
10:50 trociny RageLtMan: I have not seen this before. If you ask on ceph-dev, there might be an easier way to recover, e.g. using ceph-osdomap-tool and fixing a problematic map...
10:52 trociny RageLtMan: according to your pastebin log you are already running 0.80.11?
10:52 trociny ceph version 0.80.11-19-g130b0f7 (130b0f748332851eb2e3789e2b2fa4d3d08f3006)
10:53 RageLtMan thats the build from tonight
10:54 vsedelnik joined #fuel
10:54 RageLtMan so far i've been able to build firefly current-stable and now building the backports branch merged in
11:02 zhangjn joined #fuel
11:16 javeriak_ joined #fuel
11:22 vsedelnik joined #fuel
11:25 trociny RageLtMan: it looks like you have 'debug_osd = 0/5'; if you set it to 0/20 and try to start again you will have more debug info in your crash log, which might be useful
11:27 trociny RageLtMan, the output of `ceph osd dump` and `ceph pg dump` might be useful too.
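A minimal sketch of that debug bump, assuming the ceph.conf route since the OSDs will not stay up long enough for injectargs, and the upstart job naming used on Ubuntu 14.04:

    # /etc/ceph/ceph.conf on an OSD node
    [osd]
    debug osd = 0/20

    # retry a single OSD and capture the log around the assert
    start ceph-osd id=0
    less /var/log/ceph/ceph-osd.0.log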
11:30 dhg joined #fuel
11:53 bhaskarduvvuri joined #fuel
11:59 HeOS joined #fuel
12:01 vsedelni_ joined #fuel
12:15 javeriak joined #fuel
12:42 tzn joined #fuel
12:53 ilbot3 joined #fuel
12:53 Topic for #fuel is now Fuel 7.0 (Kilo) https://www.fuel-infra.org/ | Paste here http://paste.openstack.org/ | IRC logs http://irclog.perlgeek.de/fuel/
12:57 chaitu Hi all, my network verification fails but I can reach my repos from every node. Can someone please help me with this issue?
13:02 chaitu this is my nailgun.log http://paste.openstack.org/show/483419/
13:05 zhangjn joined #fuel
13:11 kaliya joined #fuel
13:14 ikalnitsky joined #fuel
13:17 ikalnitsky chaitu: this log snippet has nothing to do with network checker. could you please provide the whole diagnostic snapshot? or at least /var/log/docker-logs/nailgun & /var/log/docker-logs/astute ?
13:55 Guest83159 joined #fuel
13:55 Guest83159 how to restart all openstack services?
13:55 Guest83159 at once
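There is no single built-in command for that; a rough sketch for an Ubuntu 14.04 node, where the service-name pattern is illustrative and Pacemaker-managed resources on Fuel HA controllers would need `pcs`/`crm` instead of upstart:

    for svc in $(initctl list | awk '/^(nova|neutron|cinder|glance|heat)-|^keystone/ {print $1}'); do
        restart "$svc" 2>/dev/null || start "$svc"
    done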
14:02 omolchanov joined #fuel
14:03 criss78 joined #fuel
14:06 cartik joined #fuel
14:20 ekhomyakova joined #fuel
14:22 richoid joined #fuel
14:24 gangadhar_ joined #fuel
14:26 ekhomyakova left #fuel
14:42 zhangjn joined #fuel
14:45 richoid joined #fuel
14:49 javeriak joined #fuel
14:51 javeriak_ joined #fuel
15:04 javeriak joined #fuel
15:21 vvalyavskiy joined #fuel
15:33 claflico joined #fuel
15:39 xarses joined #fuel
16:27 RageLtMan holser, kgalanov: any chance you guys were able to reproduce this little issue?
16:37 javeriak_ joined #fuel
16:42 mykola joined #fuel
16:46 javeriak joined #fuel
17:13 CheKoLyN joined #fuel
17:18 fuel-slackbot joined #fuel
17:24 krypto joined #fuel
17:30 kgalanov RageLtMan, I am initiating an update for ceph for 7.0 with the patch included
17:33 javeriak_ joined #fuel
17:42 mykola RageLtMan, do you still have your cluster unrecovered?
17:44 claflico1 joined #fuel
17:54 elo joined #fuel
18:23 javeriak joined #fuel
18:24 pal_bth joined #fuel
18:30 elo joined #fuel
18:49 e0ne joined #fuel
19:34 dhblaz joined #fuel
19:47 vsedelnik joined #fuel
20:00 angdraug joined #fuel
20:28 javeriak_ joined #fuel
20:59 DevStok joined #fuel
21:13 DevStok edit_node_interfaces_screen.js:768 Uncaught TypeError: Cannot read property 'join' of undefined
21:13 DevStok i cant edit interfaces of a particular compute
21:13 DevStok using f12 i noticed :
21:13 DevStok edit_node_interfaces_screen.js:768 Uncaught TypeError: Cannot read property 'join' of undefined
21:15 mwhahaha is it not properly detecting all the interfaces?
21:16 DevStok sometimes i can see the interfaces
21:16 DevStok when i try to bond 2 interfaces the fuel gui gets blocked
21:18 DevStok i reload the web page and try to go to the interfaces, but the page is still stuck loading
21:19 DevStok i've deployed successfully 3 ctrl and 2 cmpt
21:19 DevStok this compute gives this problem
21:20 DevStok i can go ahead but without bonding the interfaces
21:23 mwhahaha sounds like a bug in the ui
21:23 mwhahaha you could always do the setup via the cli
21:26 tzn joined #fuel
21:28 DevStok i'll try fabio
21:28 DevStok or maybe can i try to destroy some docker
21:28 DevStok which one?
21:28 DevStok nginx
21:29 RageLtMan joined #fuel
21:30 RageLtMan holser, kgalanov: the wip-11493-b branch seems to have most of a fix for this problem, straight from Sage on ceph-devel ML. Cluster still hosed though
21:31 mwhahaha i doubt any docker stuff will fix it as it seems like it's a missing data/ui frontend error
21:37 DevStok how to bond the interfaces from cli?
22:13 Sesso joined #fuel
22:25 mwhahaha DevStok: i think you do it via the fuel node command; the section about selecting offloading modes has the command, and i think you'd just edit the yaml and upload it back: https://docs.mirantis.com/openstack/fuel/fuel-7.0/user-guide.html#nic-bonding-ui
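A hedged sketch of that CLI route, with node id 42 and the file layout as placeholders following the Fuel 7 CLI conventions:

    fuel node --node-id 42 --network --download --dir /tmp
    # edit /tmp/node_42/interfaces.yaml: add a bond entry listing the two slave NICs
    # and move the assigned networks from the slaves onto the bond
    fuel node --node-id 42 --network --upload --dir /tmp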
22:29 DevStok thanks
22:40 claflico joined #fuel
23:57 xarses joined #fuel
