
IRC log for #gluster, 2016-08-19


All times shown according to UTC.

Time Nick Message
00:02 om joined #gluster
00:04 om joined #gluster
00:09 om joined #gluster
00:40 shyam joined #gluster
01:20 shdeng joined #gluster
01:25 Gambit15 JoeJulian, btw on that topic, the man pages don't have any mention of arbiter yet (v3.7.13 at least)
01:37 wadeholler joined #gluster
01:47 wadeholler joined #gluster
01:49 ilbot3 joined #gluster
01:49 Topic for #gluster is now Gluster Community - http://gluster.org | Documentation - https://gluster.readthedocs.io/en/latest/ | Patches - http://review.gluster.org/ | Developers go to #gluster-dev | Channel Logs - https://botbot.me/freenode/gluster/ & http://irclog.perlgeek.de/gluster/
02:07 wadeholler joined #gluster
02:13 rafi joined #gluster
02:15 JoeJulian Gambit15: file a bug report.  :)
02:15 glusterbot https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS
02:24 dlambrig joined #gluster
02:47 Javezim So running a 'ls -laR' on our cluster we are seeing a tonne of file that are showing - Input/output error. It appears these are split brained files, however when we do a 'gluster volume heal gv0 info split-brain' they do not show. Any Reason for this?
02:57 ramky joined #gluster
03:02 Gambit15 joined #gluster
03:06 Lee1092 joined #gluster
03:07 JoeJulian I'm still trying to figure out how many files are in a metric tonne.
03:11 JoeJulian Depends on the file size, the type of drive, and the drive density. 4TB drives would give you 5.5 petabytes +- 10%
03:11 JoeJulian SSD's of course would give you significantly more / tonne
03:12 JoeJulian Javezim: Assuming you're on 3.8 now, try leaving off the "split-brain" after info.
03:13 Javezim @JoeJulian Will do :) and yes we are
03:14 Javezim @JoeJulian We've set 'cluster.favorite-child-policy: mtime', it doesn't appear to be doing much though
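(For reference, the two commands being compared above, using the volume name gv0 from earlier in the discussion — "info" lists everything pending heal, while "info split-brain" lists only entries gluster has flagged as split-brain:)
    gluster volume heal gv0 info
    gluster volume heal gv0 info split-brain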
03:22 ramky joined #gluster
03:26 magrawal joined #gluster
03:39 sanoj joined #gluster
03:42 atinm joined #gluster
03:45 RameshN joined #gluster
03:49 ashiq joined #gluster
03:51 kramdoss_ joined #gluster
03:53 hagarth joined #gluster
03:55 poornimag joined #gluster
03:55 shubhendu joined #gluster
03:56 skoduri joined #gluster
04:03 itisravi joined #gluster
04:05 Javezim With 3 Way Replication, if 2 Files are the same but the 3rd is split brained, does Gluster know that the 2 files are correct and read from that?
04:11 ZachLanich joined #gluster
04:12 itisravi Javezim: replication will always serve reads from the 'source' brick. ie. the brick which does not need heal for a given file.
04:13 itisravi so yes, if the 3rd brick has missed some writes, reads will be served from the 1st or 2nd one.
04:17 Javezim I am slightly worried with going into a 3 Way Replication in case of 3 Way Split brain, ie. File is different on all 3 bricks. We seem to have continuous issues with 2 way Replication where split brains occur and files become "Input/Output Error", we're trying to find a way to reduce the amount of these (Especially on directories with large amounts of smaller files like Photos, Videos etc etc) We upgraded to 3.8.2 to try the cluster.favorite-child-policy
04:17 Javezim but we still appear to be having the issue, along with some performance issues with reading from Windows Clients. Would an Arbiter brick help reduce the Split brains on a 3 Way Replication since it would hold the Metadata and always have a brick to read/write from?
04:21 itisravi Yes you can use arbiter volumes or replica-3 volumes to prevent split-brains.
04:22 itisravi I wonder why favorite-child-policy is not working for you. Are you still getting EIO despite setting it?
04:22 aspandey joined #gluster
04:22 itisravi what policy did you set?
04:22 Javezim mtime
04:23 Javezim Yes still getting EIO on the Volumes files
04:23 itisravi hmm..was there a difference in mtime on the bricks for that file?
04:27 Javezim @itisravi How do you mean sorry?
04:29 itisravi I was asking if the mtimes of the files directly on the bricks were indeed different so that AFR could pick up the latest mtime. But I see in the code that even if the mtimes are the same, we pick up the first brick as the source and do the heal.
04:30 itisravi Javezim: Also, do these files show up in 'gluster volume heal VOLNAME info split-brain' ?
04:31 Javezim itisravi Nope, I ran this earlier and it came back with like 3 files in split brain for all the bricks. When in reality its in the Thousands. Running a 'ls -laR' on the directory spills back tonnes of Input/Output Error files
04:31 Javezim I also ran it without the Split Brain like Joe said
04:32 Javezim and it came back with Thousands of gfids
04:33 itisravi The output that you get without "split-brain" will also list the files that need heal (and will be healed) but are not in split-brain.
04:33 anil joined #gluster
04:33 Javezim Fair enough
04:34 itisravi Can you share the getfattr output of a file that throws EIO but does not show in the info split brain command?
04:34 itisravi getfattr -d -m . -e hex /brick/path-to-file
04:34 itisravi From both bricks of the replica.
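(A minimal sketch of that check, with hypothetical brick path and hostname; it has to be run against the brick directories on each replica server, not the fuse mount:)
    # on server 1
    getfattr -d -m . -e hex /bricks/brick1/path/to/file
    # on server 2 (hostname is a placeholder)
    ssh server2 getfattr -d -m . -e hex /bricks/brick1/path/to/file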
04:41 Javezim itisravi here is both outputs
04:41 Javezim http://paste.ubuntu.com/23069207/
04:41 glusterbot Title: Ubuntu Pastebin (at paste.ubuntu.com)
04:44 itisravi Javezim: Ah, you have ended up in gfid split-brain. These cannot be resolved using policies or the CLI. You would need to delete the file and its hardlinks (the .glusterfs one and others if they exist) and then trigger the heal.
04:44 itisravi *delete from one of the bricks.
04:44 Javezim So if we have thousands of these?
04:45 itisravi I guess the only way is to script it out :(
04:46 Javezim You would need to delete the file and its hardlinks (the .glusterfs one and others if they exist) and then trigger the heal. - What do you mean by this sorry? I understand we need to delete the file from one of the bricks (Is there any way to know which one is correct) but what are the hardlinks?
04:47 Javezim itisravi - Would this have less chance of occurring if we had an Arbiter brick?
04:48 itisravi Javezim: See "Fixing Directory entry split-brain" in http://gluster.readthedocs.io/en/latest/Troubleshooting/split-brain/
04:48 glusterbot Title: Split Brain (Manual) - Gluster Docs (at gluster.readthedocs.io)
04:48 itisravi Javezim: yes it would.
04:49 Javezim If we added an Arbiter brick now, would it auto resolve them? Can you add an Arbiter brick to an already existing Volume?
04:49 itisravi Javezim: no you need to resolve these first.
04:50 itisravi and yes, you can use the add-brick command to convert replica-2 to arbiter volume.
04:51 itisravi Syntax: gluster volume add-brick <VOLNAME> replica 3 arbiter 1
04:51 itisravi <HOST:arbiter-brick-path>
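(Putting that syntax together, a hedged example with a hypothetical arbiter host and brick path, for the gv0 volume mentioned earlier:)
    # converts an existing replica-2 volume to replica 3 arbiter 1
    gluster volume add-brick gv0 replica 3 arbiter 1 arb1:/bricks/arbiter/gv0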
04:51 Javezim Alright, Thanks for your help itisravi :) Looks like Ill have to try script something up to autoresolve these
04:51 itisravi Javezim: np :)
04:51 Javezim itisravi Reckon there will ever be something like the child-policy but for these split brains?
04:54 itisravi Javezim: no, the thing is gluster relies on gfids to do operations on files. Since the gfids are different, it cannot resolve the file to do further operations on it.
04:54 Javezim itisravi Any idea what can cause it on such a large scale?
04:54 Javezim I mean there are so many files with EIO, its crazy
04:56 itisravi Javezim: I'm guessing your workload is opening files with O_CREAT. So brick-1 down, your application creates dir/file1.  Now brick-1 comes up, brick-2 goes down, the app creates dir/file1.
04:56 itisravi Bam! you have file1 with 2 different gfids.
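(A rough sketch of the per-file cleanup itisravi describes, assuming a hypothetical brick path and file name; check it against the split-brain doc linked above before scripting it, and only run it on the brick whose copy you have decided to discard:)
    BRICK=/bricks/brick1          # placeholder brick path
    F=dir/file1                   # placeholder file, relative to the brick root
    # the .glusterfs hardlink lives at .glusterfs/<aa>/<bb>/<gfid-as-uuid>
    gfid=$(getfattr -n trusted.gfid -e hex "$BRICK/$F" | awk -F'=0x' '/trusted.gfid/{print $2}')
    link="$BRICK/.glusterfs/${gfid:0:2}/${gfid:2:2}/${gfid:0:8}-${gfid:8:4}-${gfid:12:4}-${gfid:16:4}-${gfid:20:12}"
    rm -f "$BRICK/$F" "$link"
    gluster volume heal gv0       # then trigger the heal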
05:04 nbalacha joined #gluster
05:05 aravindavk joined #gluster
05:06 karthik_ joined #gluster
05:16 aspandey joined #gluster
05:18 ndarshan joined #gluster
05:21 Manikandan joined #gluster
05:35 hgowtham joined #gluster
05:38 jiffin joined #gluster
05:49 om joined #gluster
05:52 ramky joined #gluster
05:53 derjohn_mob joined #gluster
05:54 nishanth joined #gluster
05:56 Muthu joined #gluster
05:58 mhulsman joined #gluster
06:00 devyani7 joined #gluster
06:00 mhulsman joined #gluster
06:03 devyani7 joined #gluster
06:03 kshlm joined #gluster
06:09 Manikandan joined #gluster
06:14 Saravanakmr joined #gluster
06:19 msvbhat joined #gluster
06:20 satya4ever joined #gluster
06:22 kovshenin joined #gluster
06:23 [diablo] joined #gluster
06:25 klaas joined #gluster
06:27 Muthu joined #gluster
06:29 jtux joined #gluster
06:32 masber joined #gluster
06:37 kdhananjay joined #gluster
06:37 LinkRage joined #gluster
06:38 Alghost_ joined #gluster
06:53 rafi joined #gluster
06:57 Klas I'm currently failing to mount a volume with SSL on, but only with the client, here is my mnt.log: https://paste.fedoraproject.org/410561/47158968/
06:58 Klas I have made the /etc/glustefs/glusterfs.ca identical on all servers, and tried with and without the .pem from the client on the client
06:58 Klas anyone know what could be wrong?
06:58 Klas without SSL, it works fine to mount on the client
06:59 Klas secure-access is not enabled, there are no limits which hosts can mount the volume (at least none that I've configured)
07:01 atalur joined #gluster
07:03 jkroon joined #gluster
07:04 rastar joined #gluster
07:07 karnan joined #gluster
07:18 ppai joined #gluster
07:24 Sebbo1 joined #gluster
07:26 Alghost joined #gluster
07:38 arcolife joined #gluster
07:40 ankitraj joined #gluster
07:41 Chr1st1an joined #gluster
07:45 jri joined #gluster
07:51 loadtheacc joined #gluster
07:54 atalur joined #gluster
08:15 GoKuLe joined #gluster
08:16 GoKuLe Hello!!!
08:16 glusterbot GoKuLe: Despite the fact that friendly greetings are nice, please ask your question. Carefully identify your problem in such a way that when a volunteer has a few minutes, they can offer you a potential solution. These are volunteers, so be patient. Answers may come in a few minutes, or may take hours. If you're still in the channel, someone will eventually offer an answer.
08:18 GoKuLe Have a problem with gluster: only the first volume has active bricks, all others have their bricks offline. When I stop the first volume on the list and restart the server, the next volume in the list has active bricks and all other volumes have bricks offline. The only way to activate the other volumes' bricks is to stop/start them, or to use the force command. Please help.
08:18 pur joined #gluster
08:21 derjohn_mob joined #gluster
08:21 magrawal klaas, from the logs it seems the certificate has expired, you need to regenerate the certificate
08:30 Klas magrawal: huh, it did indeed work, but the cert was about a month old, so very strange
08:30 Klas but thanks =)
08:31 Klas GoKuLe: what version are you running? I had similar issues with default ubuntu xenial version 3.7.6.
08:31 GoKuLe On CENTOS7.1, Gluster Version 3.8.2
08:31 magrawal Klas:welcome
08:33 GoKuLe Also firewalld is disabled, selinux=disabled
08:34 GoKuLe Error in usr-local-etc-glusterfs-glusterd.vol.log is
08:34 GoKuLe [2016-08-19 08:29:47.313995] E [MSGID: 106243] [glusterd.c:1652:init] 0-management: creation of 1 listeners failed, c ontinuing with succeeded transport
08:34 Slashman joined #gluster
08:38 Philambdo joined #gluster
08:39 derjohn_mob joined #gluster
08:39 kshlm GoKuLe, Can you give the volume status output for one such occurrence?
08:41 GoKuLe Ok, just a sec
08:42 Manikandan joined #gluster
08:44 GoKuLe Status of volume: storage8483_17 Gluster process                             TCP Port  RDMA Port  Online  Pid ------------------------------------------------------------------------------ Brick Sb84:/brick1_8483/17                  N/A       N/A        N       N/A   Brick Sb83:/brick1_8483/17                  49153     0          Y       21611 Self-heal Daemon on localhost               N/A       N/A        Y       1561  Self-heal D
08:44 glusterbot GoKuLe: ----------------------------------------------------------------------------'s karma is now -17
08:44 kshlm GoKuLe, @paste
08:44 kshlm @paste
08:44 glusterbot kshlm: For a simple way to paste output, install netcat (if it's not already) and pipe your output like: | nc termbin.com 9999
08:44 kshlm GoKuLe, Don't paste output in the channel.
08:44 kshlm Use a paste service and paste the link here.
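(The pattern glusterbot is referring to — pipe the command's output to termbin and share the returned URL instead of pasting into the channel:)
    gluster volume status | nc termbin.com 9999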
08:45 anil joined #gluster
08:46 rastar joined #gluster
08:51 GoKuLe http://termbin.com/1d67
08:51 devyani7 joined #gluster
08:52 kshlm GoKuLe, The output is from sb84?
08:52 GoKuLe Yes
08:53 Bhaskarakiran joined #gluster
08:53 GoKuLe When I restart both server only storage8483_144 have both bricks online, all other bricks in other volumes are offline
08:53 GoKuLe if I stop volume storage8483_144
08:54 GoKuLe And restart servers than only storage8483_17 will have active bricks
08:55 kshlm This is not normal.
08:55 kshlm Do you restart both servers?
08:55 kshlm And when you say you restart, do you mean reboot, or restart glusterd?
08:56 GoKuLe Yes... I have tried all combinations: restarting one server, restarting both, stopping/starting some volumes, disabling a random volume, disabling rpcbind
08:57 GoKuLe In the end, when you restart the server, or even just systemctl restart glusterd, all bricks on all volumes except storage8483_144 are offline
08:58 kshlm Okay.
08:58 kshlm Could you share the glusterd log for the last restart.
08:58 kshlm Paste it to termbin as before.
08:58 GoKuLe Ok, I will clear all logs, and restart server
08:59 kshlm You could also share the log below the last 'Started running glusterd version v*' line.
08:59 GoKuLe I presume that you will need the glustershd.log
08:59 kshlm Nope.
08:59 kshlm etc-glusterfs-gluster.vol.log
08:59 GoKuLe ok
09:00 kshlm etc-glusterfs-glusterd.vol.log
09:02 GoKuLe http://termbin.com/qmq8
09:10 kshlm GoKuLe, Can you restart glusterd in debug mode and provide the log.
09:10 GoKuLe ok
09:10 kshlm You can enable debug log by editing /etc/sysconfig/glusterd and setting LOG_LEVEL to DEBUG
09:14 GoKuLe http://termbin.com/yng0
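(A sketch of that debug step on CentOS, assuming the stock /etc/sysconfig/glusterd shipped with the packages; adjust if your install layout differs:)
    echo 'LOG_LEVEL=DEBUG' >> /etc/sysconfig/glusterd   # or edit the existing LOG_LEVEL line
    systemctl restart glusterd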
09:21 kshlm GoKuLe, Thanks.
09:22 kshlm The logs show that the bricks were started (or at least a start was attempted).
09:22 kshlm But only one brick did a portmap signin.
09:22 kshlm Bricks need to do a portmap signin with glusterd before they can be shown online.
09:22 kshlm I'm not sure why the other bricks aren't performing a signin.
09:23 kshlm The brick logs for one of those bricks should help understand why better.
09:23 GoKuLe It was working correctly in 3.8.1; yesterday I upgraded to 3.8.2, and that stopped working
09:23 kshlm Ah!
09:23 kshlm Let me check if something has changed regarding brick start recently.
09:26 kshlm Looks like http://review.gluster.org/14876 might have something to do with this.
09:26 glusterbot Title: Gerrit Code Review (at review.gluster.org)
09:27 kshlm GoKuLe, Found the bug.
09:28 kshlm We will fix it in the upcoming release. This is major breakage.
09:29 kshlm For now, though, your only workaround is to do a force start.
09:30 atalur joined #gluster
09:30 GoKuLe Ok, thank you and thank you for your time
09:31 kshlm I'll file a bug, and you can track it for updates.
09:31 glusterbot https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS
09:32 devyani7 joined #gluster
09:33 ju5t joined #gluster
09:35 derjohn_mob joined #gluster
09:38 ju5t left #gluster
09:41 hackman joined #gluster
09:43 kshlm GoKuLe, FYI this has already been fixed in the 3.8 branch.
09:43 kshlm The fix will be present in the next release.
09:43 hackman joined #gluster
09:44 kshlm https://bugzilla.redhat.com/show_bug.cgi?id=1366813
09:44 glusterbot Bug 1366813: unspecified, unspecified, ---, sbairagy, MODIFIED , Second gluster volume is offline after daemon restart or server reboot
09:44 kshlm https://review.gluster.org/15186
09:44 glusterbot Title: Gerrit Code Review (at review.gluster.org)
09:47 aspandey joined #gluster
09:48 GoKuLe Ok then, until version 3.8.3 I'll put a force-start command for the volumes in rc.local.
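(One possible shape for that rc.local workaround — a sketch only, force-starting every volume at boot until the fixed release lands:)
    for v in $(gluster volume list); do
        gluster volume start "$v" force
    done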
09:49 rastar joined #gluster
09:50 hackman joined #gluster
09:54 Gnomethrower joined #gluster
09:56 hackman joined #gluster
09:58 hackman joined #gluster
10:00 jkroon joined #gluster
10:02 atinm joined #gluster
10:03 hackman joined #gluster
10:03 msvbhat joined #gluster
10:07 anoopcs Javezim, I came across this bug report which may be related to the performance issue you were facing while listing files/folders from Windows clients with GlusterFS and Samba. https://bugzilla.redhat.com/show_bug.cgi?id=1366284
10:07 glusterbot Bug 1366284: unspecified, unspecified, ---, atalur, POST , fix bug in protocol/client lookup callback
10:08 Arrfab hi guys, question about underlying devices and better IOs : if each storage node in the gluster setup has two disks, is it better to use the two disks in a VG, and use lvm stripping to distribute the IOs to the two underlying disks ?
10:08 Arrfab or one brick/disks for the same gluster vol ?
10:09 Arrfab s/disks/disk/
10:09 glusterbot What Arrfab meant to say was: or one brick/disk for the same gluster vol ?
10:57 kovshenin joined #gluster
11:01 mhulsman1 joined #gluster
11:09 lezo joined #gluster
11:14 B21956 joined #gluster
11:27 rwheeler joined #gluster
11:31 mhulsman joined #gluster
11:37 tjikkun I see multiple "Completed data selfheal on fe08ea0e-fe11-4202-be17-5ba602bbeb5a. source=1 sinks=0" for the same gfid. Anybody have an idea what that might mean? Is this normal behaviour for a heal, or is there something wrong?
11:37 tjikkun This is a VM-image file by the way
11:43 itisravi tjikkun: It means one of the bricks did not witness a write on the image and now the self-heal has happened. Nothing to worry.
11:50 mhulsman joined #gluster
12:14 nishanth joined #gluster
12:17 poornimag joined #gluster
12:19 tom[] is there something about replication that makes tarring a dir in a replicated volume likely to warn "file changed as we read it"?
12:19 tom[] i wouldn't expect any of the files involved to change during the tar
12:22 ppai joined #gluster
12:28 unclemarc joined #gluster
12:29 plarsen joined #gluster
12:30 msvbhat joined #gluster
12:33 hagarth joined #gluster
12:41 ahino joined #gluster
12:45 BitByteNybble110 joined #gluster
12:51 ramky joined #gluster
12:52 itisravi tom[]: what version of gluster are you on?
12:53 tom[] 3.4
12:53 itisravi That is way too old. There are some fixes for this issue in 3.7.11.
12:53 * itisravi gets the bug ID.
12:53 chirino_m joined #gluster
12:54 itisravi tom[]: https://bugzilla.redhat.com/show_bug.cgi?id=1058526#c27
12:54 glusterbot Bug 1058526: medium, medium, ---, pkarampu, ASSIGNED , tar keeps reporting "file changed as we read it" on random files
12:57 tom[] thanks itisravi
12:57 tom[] i'm going to ignore it for now and hurry to 3.7
12:57 dlambrig joined #gluster
13:02 mhulsman joined #gluster
13:02 itisravi tom[]: oh as long as you're upgrading, you should do it to the latest 3.8.x
13:03 tom[] ok
13:04 tom[] does gluster offer a ppa for ubuntu?
13:05 ndevos ~ppa | tom[]
13:05 glusterbot tom [] : The official glusterfs packages for Ubuntu are available here: 3.6: http://goo.gl/XyYImN, 3.7: https://goo.gl/aAJEN5, 3.8: https://goo.gl/gURh2q
13:05 tom[] aah tremenjous
13:06 tom[] glusterbot got my name wrong
13:13 satya4ever joined #gluster
13:17 cloph does such a repo also exist for debian?
13:31 julim joined #gluster
13:50 johnmilton joined #gluster
13:52 johnmilton joined #gluster
14:01 rafi1 joined #gluster
14:03 mhulsman1 joined #gluster
14:03 nbalacha joined #gluster
14:08 BitByteNybble110 joined #gluster
14:12 hagarth joined #gluster
14:21 nbalacha joined #gluster
14:24 shyam joined #gluster
14:25 ndevos cloph: there should be a repo for debian on download.gluster.org
14:26 guhcampos joined #gluster
14:29 cloph ah, that was too easy/obvious :-)
14:32 Gambit15 Hey guys, anyone here using alb bonding or LACP? Curious to know if anyone's seen any particular benefits of one over the other...
14:32 kshlm joined #gluster
14:35 dlambrig joined #gluster
14:37 hagarth joined #gluster
14:38 unicornio8 joined #gluster
14:38 Gambit15 Also, I'm currently only seeing 1Mbps throughput & am looking for any tips on what can be done to improve it. The setup is "replica 3 arbiter 1" across 4 nodes, each with 2*1GbE interfaces bonded with ALB & MTU 1500. I've got 10GbE kit coming & will be able to test jumbo frames & LACP this afternoon. So besides MTU, LACP & 10GbE, what else can I tweak to improve performance?
14:39 neofob joined #gluster
14:39 dnunez joined #gluster
14:41 Gambit15 A quick test with dd gave me 144Mbps write for a 1GB file on the local disk & I got 100Mbps scping that to another node. Bit slower than I expected, but shouldn't be the 1Mbps bottleneck I'm seeing here...
14:43 Gambit15 performance.low-prio-threads: 32, features.shard-block-size: 512MB, performance.readdir-ahead: on
14:43 kramdoss_ joined #gluster
14:52 Arrfab Gambit15: it can be multiple things, including type of files : for example a lot of small files and rsync will lead to poor results
14:57 Gambit15 Arrfab, the volumes are not in use at the moment. These are just my preliminary tests before continuing. I'm aware large sequential r/w should show better performance, which is why I started by testing 200M, 500M & 1G files. I'm still hitting a roof of 1Mbps
14:58 Gambit15 So...what could these "multiple" things be beyond what I mentioned above?
14:59 Gambit15 (The storage network is on its own dedicated switch btw, so it's not contention on the switch...)
15:03 bowhunter joined #gluster
15:06 wushudoin joined #gluster
15:07 BitByteNybble110 joined #gluster
15:24 bb joined #gluster
15:26 Guest3567 Do any of the volumes outside of striped show linear improvements in write speed? Ive so far been unsuccessful with any of my attempts at improving write speeds when adding nodes outside of setting the volume up as striped.
15:31 kpease joined #gluster
15:32 msvbhat joined #gluster
15:36 ira joined #gluster
15:37 ira joined #gluster
15:39 kpease joined #gluster
15:40 Gambit15 Guest3567, tried sharding? Striping is apparently being deprecated...
15:41 squizzi joined #gluster
15:46 kovshenin joined #gluster
16:00 rafi joined #gluster
16:02 rafi1 joined #gluster
16:06 kpease joined #gluster
16:06 Guest3567 gambit15 Ive not seen anything about sharding yet, ultimately im looking for a volume that is optimal for use as a backend for image files for virtual servers
16:07 Guest3567 Just did a quick google search on sharding and it sounds like that may be exactly the thing im looking for
16:07 Guest3567 Thanks
16:08 Gambit15 From what I've read, striping only gave a benefit if you deal with large sequential r/w
16:08 JoeJulian Whether or not it's "optimal" for virtual servers is arguable.
16:08 Guest3567 yeah I knew striping wasnt what I was looking for
16:09 Gambit15 Also, if VM images larger than your bricks aren't much of a concern, you might get better performance without sharding, just use dist-rep
16:10 Gambit15 Replication can also have a big impact on performance, so consider using rep2 if viable. (To prevent split-brain, use replicate 3 arbiter 1. See the docs for why)
16:11 Guest3567 Ive got a large number of varying image sizes so I will never be certain but the majority of them would be under the size of a brick. Ultimately my main goal is providing better performance for these images
16:11 Gambit15 JoeJulian, any chance you could have a quick look at my comments/Qs above?
16:12 JoeJulian Was planning on it... need to do a little $dayjob managment first.
16:12 Gambit15 I'm getting 1Mbps throughput & trying to improve on it.
16:18 Lee1092 joined #gluster
16:26 Gambit15 dd if=/dev/zero of=test.dd bs=64k count=1600 ... On gluster fuse mount: 104857600 bytes (105 MB) copied, 99,4895 s, 1,1 MB/s ... Local disk: 1048576000 bytes (1,0 GB) copied, 3,3671 s, 311 MB/s ... scp to local disk on a peer: 100% 1000MB 100.0MB/s   00:10
16:26 Gambit15 (Just to give an idea)
16:28 Gambit15 Currently using an ALB bond which should give gluster an edge over scp. I'll be able to test with MTU 9000 & LACP in a bit, but even without those, I'd expect to see far better performance than this...
16:29 jiffin joined #gluster
16:36 Guest3567 Alright, maybe I just don't quite interpret gluster to perform as I expected it would. From what I read I always thought that by doubling what ive got for nodes/bricks that I would see  a double in performance...Taking 8 distributed bricks with a write speed of 234 MB/s and making it 16 bricks would show something more like 400MB/s. Am I just wrong about this or am I just severely missing something in my setup.
16:37 shyam joined #gluster
16:42 Gambit15 Guest3567, what level of replication are you on, and what speed interfaces? AFAIK, those are the two biggest factors
16:43 Guest3567 This is my current volume setup but ive been breaking it down and recreating it multiple ways
16:43 Guest3567 Type: Distribute - Number of Bricks: 12 - Transport-type: rdma
16:44 Guest3567 Its running over infiniband with each interface being  4X - 14.0625 Gbps
16:44 JoeJulian Gambit15: if iperf can run at wire speeds, then you *should* be able to fill the pipe up to the smallest bottleneck.
16:46 JoeJulian Guest3567: with sufficient clients using a broad set of files, yes, a distributed volume can handle more of that and increase the aggregate throughput of the entire cluster.
16:54 JoeJulian Ok, Gambit15, let's go ahead and prove iperf to get a baseline for your network.
16:54 Guest3567 I could be wrong but I believe iperf to only be good for TCP/UDP and there are no flags for testing RDMA, that being said my ib_send_bw tests top out at 5647.02 MB/sec which is more than sufficient. JoeJulian the way im reading what you are saying is that im not necessarily going to see a DD write at 400MB/s but that overall with lots of clients writing to different files the overall max that the gluster configuration can support
16:54 Guest3567 is 400MB all at once but say a client would still be maxing that dd at 200MB/s?
16:55 JoeJulian Guest3567: Right, a single client operating on a single file will max out at the smallest bottleneck: cpu, ram, network, storage...
16:56 siavash joined #gluster
16:59 anil joined #gluster
16:59 Guest3567 Ah okay so ive been kind of interpreting gluster wrong all along :-) Thanks for the assistance
17:04 nbalacha joined #gluster
17:13 bowhunter joined #gluster
17:40 rafi joined #gluster
17:48 sc0 joined #gluster
17:49 julim joined #gluster
17:53 dgandhi I'm looking at setting up geo-rep on an existing 100TiB cluster , is there any mechanism for sneaker-netting the initial sync ?
17:54 post-factum sneaker-netting?
17:55 JoeJulian build the remote locally, geo-rep to it (use hostnames), then ship it?
17:56 JoeJulian Or at least do that with the drives?
17:56 dgandhi so build the whole system, and transport it config/OS and all?
17:56 JoeJulian It's an idea
17:58 * post-factum still wants "sneaker-netting" translation to simple english first
17:58 rafi joined #gluster
17:59 JoeJulian sneaker = shoe
17:59 JoeJulian netting as in network transmission
17:59 kkeithley Europeans like to call them trainers.  trainernet
17:59 JoeJulian So a sneaker-net is physically transporting the data.
17:59 kkeithley some europeans
18:00 post-factum bearing?
18:00 JoeJulian Gotta love colloquialisms.
18:00 post-factum JoeJulian++
18:00 glusterbot post-factum: JoeJulian's karma is now 30
18:02 dgandhi I have two clusters of 3 servers set up on the DCs, I have one spare machine at the source, I could rep to that machine, pull all the disks, and "recreate" it remote, but then I'll have to come up with some way to balance the blocks after the fact without breaking the rep state.
18:04 JoeJulian blocks?
18:06 dgandhi bricks/drives .. sry
18:07 kovsheni_ joined #gluster
18:09 JoeJulian You can run a rebalance on a geo-rep slave without breaking rep.
18:11 dgandhi but the initial bricks will all live on the first server, won't that keep gluster from spreading the redundancy bricks between servers?
18:12 dgandhi unless that has changed, I recall having to add bricks in distributed sets.
18:14 JoeJulian Let's start with what you have and what you want to end up with.
18:15 JoeJulian Use a pastebin for details, ie fpaste.org, gist, ,,(paste)
18:15 glusterbot For a simple way to paste output, install netcat (if it's not already) and pipe your output like: | nc termbin.com 9999
18:15 JoeJulian Like the volume info for the volume you want to geo-rep, for starters.
18:19 dgandhi http://pastebin.com/TZWPgBKv - basically we want to end with this on both ends with double the brick count when we are done.
18:19 glusterbot Please use http://fpaste.org or http://paste.ubuntu.com/ . pb has too many ads. Say @paste in channel for info about paste utils.
18:20 rafi joined #gluster
18:21 Jacob843 joined #gluster
18:23 rafi joined #gluster
18:25 JoeJulian So the geo-rep slave will have 60 bricks replica 2, right?
18:26 dgandhi yes
18:31 JoeJulian So here's what I would do... I would emulate a minimum subset of the new servers in containers with just enough storage to hold all the data. Use the hostnames that will match the new slaves. When you're satisfied with the amount of data that's synced, ship those drives and match the pseudo-layout you created. Copy /var/lib/glusterd from the fake machines to the real machines (you've used hostnames so all you have to do is change the ip addresses
18:31 JoeJulian associated with those hostnames)...
18:32 JoeJulian Then you can use add-brick to expand the volume to 30 bricks and rebalance.
18:33 JoeJulian After rebalance is complete (or really, you could even do it while it's running), add the additional 30 bricks to create the replicas (add-brick replica 2 $brick ...)
18:35 JoeJulian It *should* work smoothly.
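(A toy-scale sketch of that sequence with hypothetical hosts and paths — a 2-brick distribute slave volume expanded to 4 bricks, rebalanced, then converted to replica 2; the real volume does the same thing with 30+30 bricks:)
    gluster volume add-brick slavevol geo2:/bricks/b3/slavevol geo2:/bricks/b4/slavevol
    gluster volume rebalance slavevol start
    gluster volume rebalance slavevol status     # wait until it reports completed
    # pair every existing brick with a new one to go replica 2
    gluster volume add-brick slavevol replica 2 \
        geo3:/bricks/b1/slavevol geo3:/bricks/b2/slavevol \
        geo3:/bricks/b3/slavevol geo3:/bricks/b4/slavevol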
18:35 tjikkun if volume heal .. info lists 8 entries, then not all entries will be healed simultanuous, right?
18:36 tjikkun any way to influence order?
18:36 JoeJulian no :(
18:37 tjikkun ok, too bad.. I am waiting for almost 3 days now
18:38 tjikkun most of them should complete fast, but 2 of them are being written to a lot
18:38 tjikkun but of course now none of them get fixed
18:38 JoeJulian That shouldn't affect the ability to complete the heal.
18:38 tjikkun well it's a 100GB VM disk image
18:39 tjikkun without sharding
18:39 JoeJulian that's not as bad as the dozens of 20TB volumes that I get stuck healing. <sigh>
18:40 tjikkun heh, glad there is always somebody who has it worse
18:40 tjikkun although I am sure gluster is a "bit" more your expertise than mine
18:41 tjikkun so all in all I call it a draw
18:42 tjikkun but if I wait long enough, it will complete, right?
18:42 tjikkun Or is it possible it will never complete?
18:45 ZachLanich joined #gluster
18:46 dgandhi JoeJulian: thank you, presumably I would have to use a clean gluster install on the remote servers to avoid breaking anything when overwriting /var/lib/glusterd. I suppose I could run KVM to do the same thing on a server with existing volumes.
18:46 JoeJulian You can get a state dump of your bricks ( gluster volume statedump dumps into /var/run/gluster ) and look through it for self-heal. You can see what offset the lock is at for the heal which tells you how far it is.
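(A sketch of that check with a hypothetical volume name; the dump files land in /var/run/gluster and the heal lock offsets show up as start= values:)
    gluster volume statedump myvol
    grep -n 'start=' /var/run/gluster/*.dump.*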
18:47 JoeJulian dgandhi: Right, my expectation was that the remote is all new servers. If not, you can *just* copy /var/lib/glusterd/vols (with glusterd stopped) to their matching hosts.
18:47 JoeJulian s/vols/$slave_volume
18:48 JoeJulian s/vols/$slave_volume/
18:48 glusterbot What JoeJulian meant to say was: dgandhi: Right, my expectation was that the remote is all new servers. If not, you can *just* copy /var/lib/glusterd/$slave_volume (with glusterd stopped) to their matching hosts.
18:50 JoeJulian And yes, kvm is just as good. I use nspawn containers for efficiency, but that's just a matter of style.
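(A hedged sketch of that copy, with placeholder volume and host names — glusterd stopped on the receiving host first:)
    ssh realhost systemctl stop glusterd
    rsync -a /var/lib/glusterd/vols/slavevol/ realhost:/var/lib/glusterd/vols/slavevol/
    ssh realhost systemctl start glusterd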
18:55 DV_ joined #gluster
18:56 dgandhi i'm still far too unreasonably salty about systemd sneaking up on me to be using nspawn yet, but the shutdown and update the vol dirs seems like a cleaner solution.
18:57 JoeJulian systemd is amazing. I wish we'd done it 20 years ago.
18:57 tjikkun JoeJulian: is that start= in the output? It does not seem to only increase between state dumps
18:58 dgandhi I know I'm being unreasonable, I will be assimilated, I just need a little more time. Thank you for your help.
18:59 JoeJulian You're welcome.
18:59 JoeJulian tjikkun: iirc, yes.
19:00 tjikkun JoeJulian: ok, well it seems to jump all over the place
19:01 JoeJulian It should just be going up.
19:01 JoeJulian What version are you running?
19:01 tjikkun 3.7.3
19:01 JoeJulian Ah, yeah...
19:02 tjikkun I know I should update, and I'm going to. After it is done healing :)
19:02 johnmilton joined #gluster
19:02 Gambit15 JoeJulian, sorry - got called out. iperf looks pretty good: 1.10 GBytes   942 Mbits/sec
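(For reference, the kind of baseline that result comes from — plain TCP iperf between two peers; the address is a placeholder:)
    iperf -s                  # on one node
    iperf -c 10.0.0.2 -t 30   # from the other node, pointed at the first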
19:03 JoeJulian tjikkun: Try disabling client-side heals: cluster.{data,metadata,entry}-self-heal
19:03 JoeJulian Leaving it to the self-heal daemon.
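(Spelled out, those three options on a hypothetical volume name:)
    gluster volume set myvol cluster.data-self-heal off
    gluster volume set myvol cluster.metadata-self-heal off
    gluster volume set myvol cluster.entry-self-heal off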
19:05 tjikkun JoeJulian: yeah I disabled those more then a day ago
19:05 tjikkun s/then/than/
19:05 glusterbot What tjikkun meant to say was: JoeJulian: yeah I disabled those more than a day ago
19:09 JoeJulian tjikkun: And the start isn't just moving forward? Seems odd.
19:09 JoeJulian I might suggest a "gluster volume start $volname force" to restart (yeah, I know it sucks) the self-heal daemons.
19:10 tjikkun Well I did 3 dumps, first it was 63278153728, then 49767645184, then 57723846656
19:11 JoeJulian Gambit15: if you dd (without O_DIRECT) from /dev/zero to a file on the fuse mount with bs=1M what throughput do you get?
19:12 tjikkun JoeJulian: will this impact the running vm's?
19:12 JoeJulian tjikkun: no
19:14 g-lim_ joined #gluster
19:15 g-lim_ hi i'm new to glusterfs and the last time i've tried to set it up i couldn't get it to replicate
19:15 g-lim_ maybe you guys can help me understand the basics
19:16 JoeJulian Did you mount the client and access your volume through that, or were you trying to use gluster as a replication service to copy the stuff you wrote to its brick?
19:16 g-lim_ JoeJulian: i'm trying to set up a geo-replication
19:17 g-lim_ i have 3 sites and I want to replicate VMWare templates to the remote sites
19:17 JoeJulian Ah, ok. I misunderstood "get it to replicate" to mean the local replication.
19:17 g-lim_ do you know if glusterfs can do differential copies?
19:17 JoeJulian It does, yes.
19:18 g-lim_ ok I read somewhere that it uses rsync as the tool for copy?
19:19 JoeJulian It does, but it does it intelligently using a change log.
19:20 om joined #gluster
19:28 jiffin joined #gluster
19:32 g-lim_ JoeJulian: thanks Joe!
19:32 tjikkun JoeJulian: Ok it looks like start is increasing only now, I'll just have to wait to see how it goes
19:33 Gambit15 JoeJulian, echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/zero of=test.dd bs=1M count=100 ... 104857600 bytes (105 MB) copied, 127,928 s, 820 kB/s
19:34 Gambit15 Mount options:    v1:/data on /mnt type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
19:47 mhulsman joined #gluster
19:54 tjikkun JoeJulian: Thanks for all the help, things look promising now
19:58 shyam joined #gluster
20:02 Wizek_ joined #gluster
20:05 sputnik13 joined #gluster
20:17 JoeJulian ~pasteinfo | Gambit15
20:17 glusterbot Gambit15: Please paste the output of "gluster volume info" to http://fpaste.org or http://dpaste.org then paste the link that's generated here.
20:18 JoeJulian Gambit15: Is this a stand-alone client, or is it one of the servers?
20:18 dlambrig joined #gluster
20:19 Gambit15 JoeJulian: https://paste.fedoraproject.org/410962/
20:19 glusterbot Title: #410962 Fedora Project Pastebin (at paste.fedoraproject.org)
20:20 Gambit15 The clients & servers are one & the same
20:20 Gambit15 "self hosted"
20:30 kovshenin joined #gluster
20:31 nishanth joined #gluster
20:32 nishanth joined #gluster
20:32 JoeJulian Gambit15: have you tried it without all the custom settings?
20:33 kovsheni_ joined #gluster
20:35 hagarth joined #gluster
20:38 Gambit15 JoeJulian, these? features.shard, performance.low-prio-threads, cluster.data-self-heal-algorithm
20:40 Gambit15 I presume self-heal won't be the issue at least...
20:42 Gambit15 Am going through things one by one. Just sorting out the switch now to configure jumbo frames & test replacing ALB with LACP, after that I'll try without rep, then arbiter, etc, etc. But nothing else big that I've missed then?
20:43 JoeJulian I have yet to try shard, so I have no personal experience. I wouldn't expect a performance hit though.
20:43 JoeJulian cluster.data-self-heal-algorithm: full is going to really suck when healing VM images, unless your network is faster than your cpus.
20:44 Gambit15 Distributed, might it not have the same penalty as striping?
20:44 JoeJulian It might, but it's written more intelligently so /shrug
20:44 JoeJulian performance.low-prio-threads: 32 - no clue what that's going to do.
20:45 JoeJulian network.remote-dio: enable - /should/ make o_direct faster but no personal experience.
20:46 JoeJulian All the performance.* : off /shouldn't/ affect that dd test, but I would reset them all just to see.
20:47 JoeJulian If any of those fix the slowness, then I would add other things back in one at a time and test.
20:47 Gambit15 I didn't set those. The only performance config I made was low-prio-threads
20:47 JoeJulian And you applied the vm group settings.
20:47 JoeJulian Which set most of those.
20:47 JoeJulian Again, should be fine. Shouldn't be slow.
20:47 JoeJulian But! since it is, let's eliminate possibilities.
20:47 Gambit15 gluster volume set data group virt, gluster volume set data storage.owner-uid 36, gluster volume set data storage.owner-gid 36, gluster volume set data features.shard on, gluster volume set data features.shard-block-size 512MB, gluster volume set data performance.low-prio-threads 32, gluster volume set data cluster.data-self-heal-algorithm full
20:48 JoeJulian Yeah, the "group virt" set most of those.
20:49 JoeJulian You can see them at /var/lib/glusterd/groups/virt
20:49 Gambit15 Right, I actually looked for a lot of these settings in the docs & found nothing. I thought the group was just for management reasons - being able to create a new group for different purpose volumes
20:50 Gambit15 Aha, good to know then
20:50 JoeJulian That set of config parameters is pretty well tested, so I'd be surprised if they fix your dd problem.
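(To see what the virt group applied, and to reset individual options back to their defaults while testing — "data" is the volume name from the paste above, and the two options shown are just examples of the performance.* settings it sets:)
    cat /var/lib/glusterd/groups/virt
    # reset one option at a time while re-running the dd test
    gluster volume reset data performance.quick-read
    gluster volume reset data performance.read-ahead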
20:51 JoeJulian This is all low latency, too, right?
20:53 Gambit15 The network? Yeah
20:54 Gambit15 ICMP average of 0.12ms
20:56 g-lim_ what do I need to set up a geo-replication?
20:56 g-lim_ two independent glusterfs and peer them after?
20:56 JoeJulian http://gluster.readthedocs.io/en/latest/Administrator%20Guide/Geo%20Replication/
20:56 glusterbot Title: Geo Replication - Gluster Docs (at gluster.readthedocs.io)
20:57 g-lim_ need clarification on the master/slave terms
20:57 JoeJulian @glossary
20:57 glusterbot JoeJulian: A "server" hosts "bricks" (ie. server1:/foo) which belong to a "volume"  which is accessed from a "client"  . The "master" geosynchronizes a "volume" to a "slave" (ie. remote1:/data/foo).
20:57 JoeJulian And yes, a slave is a gluster volume.
20:58 g-lim_ ok thanks JoeJulian
20:58 JoeJulian But you don't have to "peer" them in the sense of "peer probe"
20:58 g-lim_ i think the best way is to get my hands dirty and start the installation
20:59 JoeJulian Yeah, dig in, make mistakes, ask questions. That's how I feel I'm best at helping.
20:59 g-lim_ thanks alot! :)
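(A minimal geo-replication sketch along the lines of that admin guide, with hypothetical master/slave volume and host names; the slave volume must already exist on the remote site:)
    # on a master node, after setting up passwordless ssh to the slave host
    gluster system:: execute gsec_create
    gluster volume geo-replication mastervol slavehost::slavevol create push-pem
    gluster volume geo-replication mastervol slavehost::slavevol start
    gluster volume geo-replication mastervol slavehost::slavevol status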
21:20 bowhunter joined #gluster
21:34 shyam joined #gluster
21:44 Wizek_ joined #gluster
21:47 [diablo] joined #gluster
21:54 wadeholler joined #gluster
22:42 hagarth joined #gluster
22:47 RobertTuples joined #gluster
23:06 ZachLanich joined #gluster
23:09 siavash joined #gluster
23:12 ZachLanich Hey folks, I have a couple questions I still need help with regarding my Gluster setup: https://notehub.org/q25ii - Please and ty!
23:12 glusterbot Title: NoteHub GlusterFS Use Case (at notehub.org)
23:13 ZachLanich The questions are in the "Maintainability" section of that doc I wrote up :)
23:15 ZachLanich I'm trying to wrap my head around the number of servers I need for certain setups. I realize Gluster doesn't care how many servers you spread your Bricks across, but I'm having a hard time wrapping my head around the fewest number of servers I need for H/A for my potential setups.
23:17 Gambit15 You can get away with 2 as a bit of a hack, but the ideal minimum is 3
23:18 Gambit15 However keep in mind that the more nodes you spread your bricks across, the better performance you can get - just make sure your network can keep up!
23:18 ZachLanich I'm fine with 3, but is that possible with a Distributed, replicated setup? And if so, what Bricks do I put on what servers? lol The demos show 4-6 servers anytime I see one for a DR setup. It's confusing.
23:20 RustyB joined #gluster
23:20 billputer joined #gluster
23:20 Gambit15 With 3 servers, the normal setup would probably be rep 3 or dispersed
23:21 Gambit15 You could also do 2 servers & have a dummy node as an arbiter (the arbiter just stores the directory tree & file attributes to help keep quorum).
23:21 Gambit15 Although for just two servers, I'd use DRBD
23:24 Gambit15 If you want to balance multiple bricks across 3 nodes, check out the image on this page to visualise a good setup: https://joejulian.name/blog/how-to-expand-glusterfs-replicated-clusters-by-one-server/
23:24 glusterbot Title: How to expand GlusterFS replicated clusters by one server (at joejulian.name)
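(The two layouts being compared, as hedged create examples with placeholder hosts and brick paths:)
    # plain replica 3 across three servers
    gluster volume create gv-rep3 replica 3 s1:/bricks/b1/gv s2:/bricks/b1/gv s3:/bricks/b1/gv
    # replica 3 arbiter 1: two data copies plus a metadata-only arbiter brick
    gluster volume create gv-arb replica 3 arbiter 1 s1:/bricks/b1/gv s2:/bricks/b1/gv arb1:/bricks/arb/gv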
23:25 Gambit15 Right, that's me done for the week!
23:25 * Gambit15 gets coat
23:28 ZachLanich Gambit15 So, if I were to do exactly what's in that article, does that formally turn it into a Distributed Replicated setup?
23:40 ZachLanich Something really tells me I need 6 servers for a H/A Distributed Replicated setup. Someone tell me if I'm wrong. I'm lost lol.
23:41 RobertTuples you need a minimum of 3 servers just for HA (no sharding). if you want to enable sharding, yes, you'd need more servers on top of that. 5 or 6.
23:42 JoeJulian You *need* at a minimum 2 servers for HA.
23:42 JoeJulian If they're replicated and one goes down, you have another that keeps working.
23:42 ZachLanich I thought I needed at least 3 even in a Replicated only env??
23:42 JoeJulian You leave yourself at risk of split-brain so it's not recommended.
23:43 ZachLanich Doesn't that shut off Writes to avoid split brain without 3?
23:43 ZachLanich Right ^^ :P
23:43 JoeJulian Not unless you enable quorum.
23:43 RobertTuples guess that depends how much risk you want to take.
23:43 ZachLanich I'd prefer to not have to constantly manually intervene for split brain.
23:44 JoeJulian Personally, I choose replica 3. With most server hardware that gets you six nines.
23:44 ZachLanich What does 6 9's mean?
23:44 JoeJulian From there, you can add servers in groups of 3 to either expand capacity or to spread the load (in some use cases).
23:44 JoeJulian 99.9999% up time.
23:45 JoeJulian roughly 30 seconds per year.
23:46 JoeJulian Wait.. no...
23:46 ZachLanich Then I'm down for Replica 3. So explain to me in brief the effectiveness of quorum in a replica 2 env. I've read all the docs, but came to the conclusion that split brain was too common without replica 3.
23:46 JoeJulian That calculates out to 52 minutes? Why does that sound completely wrong?
23:46 JoeJulian Oh, hehe, I know why
23:46 JoeJulian I was right the first time. 31.6 seconds.
23:47 ZachLanich 365*24*60
23:47 ZachLanich Oops lol. That was meant for my calculator :P
23:47 JoeJulian or "1 year * .000001"
23:47 ZachLanich So like 5.5mins/yr downtime
23:48 JoeJulian If 31.6 seconds is like 5.5 minutes, yes.
23:48 RobertTuples Despite what the 3.8.0 release notes say [#1158654: [FEAT] Journal Based Replication (JBR - formerly NSR)], Journal-Based Replication is not supported in 3.8, right? https://github.com/gluster/glusterfs/blob/release-3.8/doc/release-notes/3.8.0.md
23:48 glusterbot Title: glusterfs/3.8.0.md at release-3.8 · gluster/glusterfs · GitHub (at github.com)
23:48 ZachLanich Hmm..
23:48 JoeJulian probably dropped a 0.
23:49 JoeJulian Well, bug 1158654 says it's in current release.
23:49 glusterbot Bug https://bugzilla.redhat.com:443/show_bug.cgi?id=1158654 medium, unspecified, ---, jdarcy, CLOSED CURRENTRELEASE, [FEAT] Journal Based Replication (JBR - formerly NSR)
23:49 JoeJulian I would test the hell out of it because it's new.
23:49 RobertTuples i can't find any documentation on it.
23:49 ZachLanich @JoeJulian Right 32secs haha. Can't math today
23:50 JoeJulian Blame jdarcy. :)
23:50 ZachLanich @JoeJulian So back to my Quorum question. I'd like to better understand how it's possible to maintain writes with 1 node down in a 2 node replica 2.
23:50 JoeJulian "gluster volume set <volname> cluster.nsr <on/off>" according to the commit message
23:51 RobertTuples and this slide from 2016-05-26 says "JBR planned for Gluster 4.0 (end of the year?)": http://people.redhat.com/ndevos/talks/2016-05-NLUUG/20160526-replication-in-gluster.pdf
23:51 RobertTuples page 17
23:52 JoeJulian Yeah, sometimes features move up.
23:52 JoeJulian The devs get pretty excited.
23:52 RobertTuples that's fine. i saw the bugzilla note and got pretty excited myself.
23:52 JoeJulian I'm surprised as well. I'll have to try it out on monday.
23:53 RobertTuples some of those "new features" on the release page are pretty funny
23:53 JoeJulian ZachLanich: quorum in a replica 2 can reduce the probability of split brain at the cost of being able to make changes if one brick is down.
23:53 RobertTuples "#1131275: I currently have no idea what rfc.sh is doing during at any specific moment"
23:53 RobertTuples new feature!
23:54 ZachLanich @JoeJulian That's what I thought. I just wanted to make sure I understood correctly.
23:54 JoeJulian In a use case that can postpone writes, or one that only reads, it's perfectly acceptable.
23:56 JoeJulian Well, it's 90 degrees in the outskirts of Seattle (32C for the rest of the world) and I'm hot and done working for the week. I'm going to have a tall cool beverage and sit out on the porch in the shade. Have a good weekend.
23:57 ZachLanich @JoeJulian So here's what I'm "trying" to do, and I just need help understanding the reality of it: I need Replica 3 at least right off, so a minimum of 3 nodes for H/A. As time goes on, I will need to expand for performance and capacity. Once I've maxed out my 3 nodes vertically, I'll need to add more nodes for space, and will supposedly gain performance from that as well. So my thoughts are that I would expand it into a
23:57 ZachLanich Distributed Replicated setup, but I still need to maintain H/A with an allowance of at least 1 node being down while maintaining writability. How do I do this?
23:57 JoeJulian ZachLanich: simplest solution, add 3 more nodes.
23:58 JoeJulian (and anyone who knows me knows how badly I cringe at the word "node" so feel privileged)
23:58 ZachLanich So if I add 3 more nodes and maintain replica 3, it splits the distribution across two replica-3 sets?
23:58 JoeJulian yep
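(That expansion, sketched with placeholder names — adding three more bricks to a replica-3 volume makes it distributed-replicated with two replica sets, then a rebalance spreads existing files:)
    gluster volume add-brick myvol s4:/bricks/b1/myvol s5:/bricks/b1/myvol s6:/bricks/b1/myvol
    gluster volume rebalance myvol start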
23:58 ZachLanich @JoeJulian No like the word node? haha why?
23:58 JoeJulian It's an inaccurate vague term.
23:58 JoeJulian A node is also a printer.
23:58 ZachLanich @JoeJulian I have one more quick question that I'm dying to figure out, so bear with me and I'll send you free drinks some time haha
23:58 JoeJulian Or a finger.
23:59 ZachLanich @JoeJulian Ahh. Fair point. Comes from my Drupal background lol.
23:59 ZachLanich @JoeJulian So my last question:
