
IRC log for #rosettacode, 2012-03-03


All times are shown in UTC.

Time Nick Message
02:02 mwn3d_phone joined #rosettacode
02:23 mwn3d_phone1 joined #rosettacode
02:29 AlecTaylor joined #rosettacode
02:29 AlecTaylor hi
02:36 sorear Hello AlecTaylor.
02:37 * AlecTaylor is having trouble finding C, C++ or Python implementations of various useful data-structures :\
02:38 sorear try glib, boost, or the standard library, respectively
02:39 AlecTaylor Looking for: Beap, soft heap, Brodal queue
02:40 sorear never heard of the last one and I wouldn't describe the first two as generally useful
02:48 AlecTaylor sorear: I'm looking for a term-frequency analysis structure which is more efficient than a hashmap.
02:49 sorear for counting the frequency of words in a document?
02:49 sorear you aren't going to get much better than O(1)...
03:01 AlecTaylor Hashmaps have a worst case, remember!
03:03 sorear if you want an efficient data structure for a hard real time mapping system, you want to use a self-balancing binary search tree
03:03 sorear also known as std::map
03:04 sorear I don't know offhand of easy-to-use C or Python implementations
03:04 AlecTaylor sorear: Is that AVL or R/B?
03:04 sorear Doesn't matter, they have exactly the same asymptotic performance and API
03:04 mikemol AlecTaylor: Implementation-defined. But in general, I believe C++'s std::map uses R/B.
03:04 AlecTaylor kk
03:05 AlecTaylor But still, considering I don't care about space, something like a VList might be more effective
03:05 sorear Personally I'm a fan of 2-3 trees :p
03:06 AlecTaylor :P
03:06 AlecTaylor sorear: But do you know if they are more efficient than hashmaps for tf?
03:07 mwn3d_phone joined #rosettacode
03:07 sorear AlecTaylor: Do you care about efficiency or worst-cases?  You need to pick one.
03:08 AlecTaylor Worst case time-complexity
03:08 mikemol You're not going to do any better than O(1), then.
03:09 AlecTaylor Exactly
03:09 sorear mikemol: Is there an O(1) worst-case map structure for write-dominated workflows?
03:09 AlecTaylor hashmaps have a worst-case of O(n)
03:10 sorear I think you care too much about worst-case complexity
03:10 mikemol sorear: I can easily conceive of one if you allow for a fixed upper bound on contained item count.
03:11 mikemol For example, if I were trying to be crazy fast with building things for a markov bot, I might allow it to only remember a certain number of elements.
03:12 mikemol How you choose which elements to remember or bump would be very use-case specific.
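
To make the bounded-map idea above concrete, here is a hypothetical C++ sketch (not anything mikemol described in detail): a counting table with a fixed capacity and a bounded probe window, so every update inspects at most a constant number of slots, and a full window bumps its lowest-count entry. The capacity, probe length, and eviction policy are all invented for illustration.

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // Fixed capacity and a bounded probe window: every add() inspects at most
    // PROBE slots, so the worst case is O(1). Both values are illustrative.
    constexpr std::size_t CAPACITY = 1 << 16;
    constexpr std::size_t PROBE = 8;

    struct Slot {
        std::string key;
        std::uint64_t count = 0;
        bool used = false;
    };

    class BoundedCounter {
        std::vector<Slot> slots = std::vector<Slot>(CAPACITY);
    public:
        void add(const std::string& key) {
            const std::size_t h = std::hash<std::string>{}(key) % CAPACITY;
            std::size_t victim = h;  // lowest-count slot seen in the window
            for (std::size_t i = 0; i < PROBE; ++i) {
                Slot& s = slots[(h + i) % CAPACITY];
                if (s.used && s.key == key) { ++s.count; return; }
                if (!s.used) { s.used = true; s.key = key; s.count = 1; return; }
                if (s.count < slots[victim].count) victim = (h + i) % CAPACITY;
            }
            // Window full: bump the lowest-count entry -- one of many possible
            // use-case-specific eviction policies.
            Slot& v = slots[victim];
            v.key = key;
            v.count = 1;
            v.used = true;
        }
    };

    int main() {
        BoundedCounter counts;
        for (const char* w : {"the", "cat", "the", "dog", "the"}) counts.add(w);
        std::cout << "done\n";
    }

With both the table size and the probe length fixed, each add() is O(1) in the worst case; the price is that rare keys can be evicted, which is exactly the use-case-specific trade-off mentioned above.
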
03:13 AlecTaylor Yeah, but since there exist tree-based O(1) worst-case/best-case structures, shouldn't I compare their performance?
03:13 * AlecTaylor also notes worst-case O(log n), best-case O(1) ones
03:13 sorear AlecTaylor: beaps are O(N) worst case for the "find" operation, which is the one you're using here
03:13 AlecTaylor http://cstheory.stackexchange.com/questions/10508/optimal-term-frequency-analysis
03:13 fedaykin "ds.data structures - Optimal term frequency analysis - Theoretical Computer Science - Stack Exchange" http://rldn.net/Bn4
03:14 sorear If you used a hashmap with a randomized seed it'd be done by now.
03:14 AlecTaylor sorear: Your point?
03:15 * AlecTaylor already has the results from tf freq analysis on a hashmap
03:18 sorear I lost control of myself, I'm stepping down now.
03:18 mikemol AlecTaylor: It's really, really unclear what you're looking for.
03:19 AlecTaylor mikemol: A structure which presents better results than a hashmap for this particular tf problem
03:19 mikemol AlecTaylor: Your particular data set, or any theoretical data set?
03:20 AlecTaylor Any theoretical data set
03:20 mikemol Then you're looking for the wrong thing, IMO.
03:21 mikemol If you have a small data set, you're probably looking for something not O(1). If you have a large data set, you're probably looking for something O(1).
03:22 AlecTaylor For argument's sake, let's say I was analysing every article on Wikipedia, and every book on Google Books. I want to create a tf index.
03:23 AlecTaylor I would need an O(1) algorithm with a worst-case no more than O(log n), preferably O(1)
03:25 sorear No, this is an _amortized_ problem
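
To make the trade-off in the exchange above concrete, here is a minimal C++ sketch of the two containers that came up: std::unordered_map (a hash table, expected O(1) per update but O(n) in the worst case) and std::map (a self-balancing tree, typically red-black, with a guaranteed O(log n) worst case). The sample text is invented.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <unordered_map>

    int main() {
        // Invented sample input; in practice this would be the document stream.
        std::istringstream in("the quick brown fox jumps over the lazy dog the");

        std::unordered_map<std::string, long> hash_counts; // expected O(1), worst O(n)
        std::map<std::string, long> tree_counts;           // worst case O(log n)

        for (std::string word; in >> word; ) {
            ++hash_counts[word];   // amortized/expected constant time per update
            ++tree_counts[word];   // guaranteed logarithmic time per update
        }

        // std::map also keeps the keys ordered, which the hash table does not.
        for (const auto& kv : tree_counts)
            std::cout << kv.first << " " << kv.second << "\n";
    }
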
03:26 mwn3d_phone joined #rosettacode
04:21 mwn3d_phone1 joined #rosettacode
04:28 mikemol Have the entropykey running at home. Had a difficult time drawing off all the entropy it's generating. At this point, I've got 16 entropy buffers each drawing somewhere between 32 and 64 bytes per second.
04:28 mikemol It's actually *really* hard to do this smoothly.
04:33 mikemol I can push _almost_ 3kB/s of entropy from my box at home to the server.
04:45 Coderjoe by "entropy buffers" do you mean the capacitors, or what?
05:11 sorear any particular reason it needs to be smooth?
05:11 sorear why not send it to the server in 1MB chunks?
07:03 Coderjoe or let the server pull as it needs it
07:22 GlitchMr joined #rosettacode
08:33 saschakb joined #rosettacode
09:30 kpreid joined #rosettacode
10:31 pr0ggie joined #rosettacode
11:15 mikemol Coderjoe: Having the server pull it as it needs it would require me to have a way to perform actions based on entropy level events, rather than polling. I don't have a way to do that. If I did, I wouldn't have to use polling in the entropy buffer, either.
11:16 mikemol And, yes, by 'entropy buffer', I'm referring to the capacitors. The filename is actually entbuff.c; 'capacitor' was a convenient way to describe its charge/discharge behavior.
11:18 mikemol sorear: When I talk about it being difficult to do things smoothly, I'm referring to the polling behavior of the entropy buffers. Let's say you've got ten consumers that activate every 1000ms, but you don't have a good way to ensure that their activations are spread evenly over that 1000ms; you might have a few which all activate within 6ms of each other, and you might have a span of 800ms in which nothing happens.
11:21 mikemol The kernel's entropy pool caps out at 4096 bits, and the ioctl for reading the amount of entropy in that pool lies; it won't immediately reflect bits taken out from your (or anyone else's) actions.
11:22 mischi joined #rosettacode
11:24 mikemol The entropy buffer tries to keep the entropy pool between two threshold points. Below one threshold, it feeds into the pool. Above another threshold, it pulls from the pool. But since the value it reads from doesn't update quickly or synchronously, a race of several consuming buffers on the pool at the same time can cause the kernel pool to drain before the reported entropy level is adjusted.
11:24 mikemol So that's one problem.
11:26 mikemol Another problem occurs when the entropy count *is* updated quickly enough that one buffer draws the pool level down below $high_threshold, and then a burst of several other buffers see the pool level below $high_threshold and refuse to pull.
11:27 mikemol That wouldn't be so bad, except that, with the uneven spacing of the buffers, you'll have some large period where the buffers aren't pulling, and the entropy key firehoses bits until the pool is full, and then firehoses bits into an already-full pool.
11:28 mikemol All of this means that the entropy buffer(s) can't charge as quickly as I would like them to, which conversely means they can't discharge as quickly as I would like them to.
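
A rough C++ sketch of the polling loop being described (this is not entbuff.c; the thresholds and poll interval are made up): the Linux RNDGETENTCNT ioctl on /dev/random reports the kernel's entropy estimate in bits, capped at 4096, and the buffer compares that possibly stale number against a low and a high threshold to decide whether to feed or drain.

    #include <fcntl.h>
    #include <linux/random.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        // Illustrative thresholds in bits, not the values entbuff.c actually uses.
        const int low_threshold  = 1024;  // feed the pool below this
        const int high_threshold = 3072;  // drain the pool above this

        int fd = open("/dev/random", O_RDONLY);
        if (fd < 0) { perror("open /dev/random"); return 1; }

        for (;;) {
            int bits = 0;
            // The kernel's entropy estimate, capped at 4096 bits. As described
            // above, it is not updated synchronously with consumers' reads, so
            // several buffers polling it can race one another.
            if (ioctl(fd, RNDGETENTCNT, &bits) < 0) { perror("RNDGETENTCNT"); return 1; }

            if (bits < low_threshold) {
                // Feed: crediting entropy requires the RNDADDENTROPY ioctl
                // (root only); plain writes to /dev/random mix data in without
                // crediting the estimate.
            } else if (bits > high_threshold) {
                // Drain: pull some bytes out of the pool into this buffer.
                unsigned char buf[64];
                (void)read(fd, buf, sizeof buf);
            }
            usleep(100 * 1000);  // arbitrary 100 ms poll interval
        }
    }

The race described above shows up here as the gap between the read() and the next RNDGETENTCNT: several such loops can all see the same stale estimate and drain the pool together.
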
11:31 mikemol And while RC's server shouldn't generally be able to consume entropy fast enough to matter in that circumstance, my home network can.
11:32 mikemol (And with one entropykey to work with, I'm trying to spread things around.)
11:33 mikemol Measurement: I have 8MB of entropy buffered on RC's server; if the connection feeding from my home network dies, it should still be good for a while. At home, I've got about 10MB of entropy buffered between five buffers.
11:38 mikemol I take that back; apparently, I haven't been feeding entropy to the server since before I went to bed.
11:38 mikemol Weekend traffic is far easier to cope with.
11:55 pr0ggie joined #rosettacode
11:55 mikemol Cool. I managed to get entropy transferring from inara to rc's server at a rate of 20*(256B/s). The entropy key wasn't putting out that much, but my buffers were able to, by running five buffers with polling times of 125ms. Serverside, there were several buffers at 125ms as well.
12:04 mikemol Neat. The next thing I want to try is a configurable entropy distribution mesh.
12:05 mikemol So, let's say you have two machines you trust which generate entropy, but not always enough to satisfy their own needs.
12:07 mikemol Have a program running on each machine which behaves like the entropy capacitor, but which also uses a TCP connection to communicate with the other instance.
12:10 mikemol Each buffer would probably have four states: 'starving', 'critically low', 'low', 'ok'. When a buffer enters one of these states, it sends a message to its peer buffers.
12:15 mikemol When a peer is starving, a buffer in the 'ok' or 'low' states (but not itself in 'critically low' or 'starving') would send a packet of entropy, along with an offer to provide more. If the offer is accepted, more is sent until the peer's buffer state rises high enough that the offer wouldn't have been sent.
12:16 mikemol When a peer is low, a buffer in the 'ok' state (but not itself in 'starving', 'critically low' or 'low') would send an offer to provide entropy. If the offer is accepted, entropy is provided until the peer reaches the 'ok' state.
12:18 mikemol That's only a very rough description. Key points: if the state between two peers differs by two steps, an initial packet of entropy accompanies the offer for more, since a difference of two steps indicates both an urgent condition on the lower of the two peers and that the higher of the two has capacity to help (possibly to the point of waste, if not all of the packet can be used).
12:20 mikemol That's to try to deal with avalanche conditions where one might have many peers, and hitting a critically low state could result in an overly large influx.
12:23 mikemol Now, there would also be peers which you *don't* trust to give you entropy, but you want to be able to handle their starvation cases. In those events, you wouldn't send them notifications of your level, and you wouldn't accept any entropy from them. You would, however, respond to their own level announcements.
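
The decision rule just described, as a hypothetical C++ sketch: the state names come from the description above, but everything else, including the function name, is invented, and the actual design notes are in the entmesh.txt linked later in the log.

    #include <iostream>

    // The four buffer states described above, ordered from worst to best.
    enum class State { Starving = 0, CriticallyLow = 1, Low = 2, Ok = 3 };

    struct Decision {
        bool offer = false;           // send an offer to provide entropy
        bool include_packet = false;  // attach an initial packet of entropy
    };

    // Hypothetical encoding of the rules above: only a sufficiently healthy
    // buffer offers help, and a gap of two or more steps is treated as urgent,
    // so that offer carries an initial packet of entropy with it.
    Decision react_to_peer(State self, State peer) {
        Decision d;
        if (peer == State::Starving && self >= State::Low)
            d.offer = true;
        else if (peer == State::Low && self == State::Ok)
            d.offer = true;
        const int gap = static_cast<int>(self) - static_cast<int>(peer);
        if (d.offer && gap >= 2)
            d.include_packet = true;
        return d;
    }

    int main() {
        Decision d = react_to_peer(State::Ok, State::Starving);
        std::cout << "offer=" << d.offer << " packet=" << d.include_packet << "\n";
    }

A starving peer seen from an ok or low buffer is a gap of two or more steps, so those offers carry the initial packet; a merely low peer seen from an ok buffer gets only the offer.
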
12:34 Coderjoe holy crap that's a big wall of text
12:34 Coderjoe and it is way too early
12:37 Coderjoe one amusing way to test the speed of the entropy key: pipe /dev/random through the "buffer" program (with settings proper for how little data is normally there) and pipe the output of that into /dev/null. (this should be done for a very short period of time, though)
12:40 mikemol Coderjoe: Check out the 'pv' command.
12:46 Coderjoe I know about pv
12:49 Coderjoe heck, my desire to use PV to watch the copy progress of some large files was part of a screwup that cost me 28,000 friendster user profiles that are now lost forever.
12:49 mikemol Point being that pv would be a simple way to take the measurement you're describing.
12:50 mikemol And, actually, my means of pushing entropy to RC's server is to run several instances of "pv -a -L 256 | nc -6 rosettacode.org $port"
12:50 fedaykin "Rosetta Code"
12:51 Coderjoe the buffer program does pretty much the same thing, though without any of the logic for progress bars or the like. I also don't know offhand if it can have the read buffer size configured. I know the buffer command can.
12:51 Coderjoe I used buffer well before I knew of pv for watching throughput rates
12:52 mikemol The purpose of the several instances is to have several copies of rngd playing receiver on the server; rngd only polls the entropy pool at a rate of 1/s. The ratelimiting is because I know I can push data faster than an instance of rngd can usefully use it.
12:53 Coderjoe right, but it doesn't test the full POWAH of the key
12:53 mikemol I don't see the point of buffer over pv; if all I wanted was to measure throughput periodically, I'd normally do that with dd and SIGUSR1.
12:54 mikemol I don't need to test the full rate of the key right now; I have several entropy buffers filling on inara, and I don't want to kill them for a throughput test.
12:54 Coderjoe be careful. not all implementations of dd have that.
12:55 mikemol I check the manpage before I use it. The only implementation of dd I'm likely to touch that might not have it would be the one built-in to busybox.
12:55 Coderjoe busybox does not have it
12:56 Coderjoe at least not the last time I checked
12:56 Coderjoe and if you send it USR1 it dies
12:56 mikemol Figured as much.
12:56 Coderjoe (discovered empirically)
12:56 mikemol So, at the moment, I've got about 10.25M of entropy buffered on the server across five buffers.
12:57 Coderjoe well, I gotta go
12:57 * mikemol waves
12:57 mikemol Good luck at your tournament.
12:58 Coderjoe I'm not bowling, actually.
12:58 Coderjoe running side events.
12:58 mikemol ah
12:59 kpreid what's the origin of all this talk about entropy, anyway?
12:59 mikemol kpreid: http://rosettacode.org/cdc3/bin/index.cgi?hostname=prgmr2.rosettacode.org&plugin=entropy&timespan=2678400&action=show_selection&ok_button=OK
12:59 fedaykin "collection.cgi, Version 3" http://rldn.net/8Fp
12:59 Coderjoe "why's rosetta code responding so slowly?"
12:59 kpreid ah
13:00 Coderjoe with the answer being someone mirroring over ssl
13:00 mikemol Well, no.
13:00 mikemol RC doesn't implement SSL right now.
13:00 Coderjoe anyway, I'm gone. have fun with that entropy key.
13:00 Coderjoe oh
13:00 mikemol But, yeah, the lack of entropy is killing the performance of various things.
13:01 mikemol Or, at least, I believe that to be the case; that's one of the last things left on the server for performance tuning.
15:08 saschakb joined #rosettacode
16:26 maustin joined #rosettacode
16:58 mwn3d_phone1 joined #rosettacode
17:37 mwn3d_phone joined #rosettacode
17:44 saschakb joined #rosettacode
18:31 nimdAHK joined #rosettacode
18:38 mwn3d_phone joined #rosettacode
19:01 mwn3d_phone joined #rosettacode
19:10 saschakb joined #rosettacode
20:08 FireFly joined #rosettacode
20:10 mwn3d_phone joined #rosettacode
20:56 maustin joined #rosettacode
21:18 mwn3d_phone1 joined #rosettacode
21:21 mwn3d_phone joined #rosettacode
22:15 mwn3d_phone joined #rosettacode
22:29 mikemol https://github.com/mikemol/etools/blob/master/entmesh.txt
22:32 mikemol FWIW, the EntropyKey appears to generate around 2.75kB/s of data.
22:33 mischi joined #rosettacode
23:50 maustin joined #rosettacode
