On big buffers
2015-03-25 20:18:22.019561+00 by
Dan Lyke
8 comments
Edit: Yeah, I'm skeptical.
When In-Memory Computing is Slower than Heavy Disk Usage, Kamran Karimi, Diwakar Krishnamurthy, Parissa Mirjafari.
A quick skim suggests that this is about letting your OS do the buffering rather than blowing out your CPU cache trying to buffer it all yourself, and that incremental asynchronous writes while you're doing other stuff beat serializing everything.
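In other words, something like this (my own sketch of what I think they're comparing, not the paper's code; names and sizes are made up):
# Build everything in memory and write once, vs. write incrementally
# and let the OS page cache do the buffering.
import os, timeit
chunk = "1" * 4096
numChunks = 25000 # roughly 100 MB total; shrink it if that's too much

def in_memory_then_write(path):
    pieces = []
    for i in range(numChunks):
        pieces.append(chunk)
    f = open(path, "wb")
    f.write("".join(pieces))
    f.close()

def incremental_writes(path):
    f = open(path, "wb")
    for i in range(numChunks):
        f.write(chunk) # small writes, the OS buffers them
    f.close()

for fn in (in_memory_then_write, incremental_writes):
    start = timeit.default_timer()
    fn("scratch.bin")
    print fn.__name__ + ": " + str(timeit.default_timer() - start)
os.remove("scratch.bin")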
Via /., but reading the comments from people who've read the paper more thoroughly suggests that these folks are exploiting bad behavior from Python and/or Java, and really don't understand what's happening under the hood.
comments in ascending chronological order:
#Comment Re: On big buffers made: 2015-03-25 21:17:27.525827+00 by:
ebradway
I haven't tested the Java version, but the Python code in the article prepends the one-byte string to the longer string. If you change the code to the appropriate append, the times are exactly as expected. And I sure don't want to wait around while I try to insert a byte at the start of a file a million times.
Here's my test code:
# Modified code from Kamran Karimi (kkarimi@ucalgary.ca)
# Modifications by Eric Wolf
# Short version. Runs each experiment once
import timeit
numAdd = 1 #Additional changes are needed when numAdd = 1000000
totalMemory = 1000000 # bytes
addString = ""
for i in range(0, numAdd):
    addString = addString + "1"
# First part: in-memory
numIter = int(totalMemory / len(addString))
# Prepend the "1" to the string
concatString = ""
start = timeit.default_timer()
for i in range(0, numIter):
    concatString = addString + concatString
stop = timeit.default_timer()
prependTime = stop - start
# Append the "1" to the string
concatString = ""
start = timeit.default_timer()
for i in range(0, numIter):
    concatString = concatString + addString
stop = timeit.default_timer()
appendTime = stop - start
print "Prepend: " + str(prependTime)
print "Append: " + str(appendTime)
Results:
~$ python karimi.py
Prepend: 54.0762400627
Append: 0.134270906448
#Comment Re: On big buffers made: 2015-03-25 21:19:52.966956+00 by:
ebradway
Not certain what's going on with the comment formatter...
#Comment Re: On big buffers made: 2015-03-25 21:32:43.850235+00 by:
ebradway
As for the /. comments about Python and Java... That's why I don't read /. any more.
#Comment Re: On big buffers made: 2015-03-26 05:59:46.394013+00 by:
Jack William Bell
I like big buffers, and I cannot lie!
#Comment Re: On big buffers made: 2015-03-26 06:07:43.536317+00 by:
Jack William Bell
On a more serious note, if someone gave me those test requirements, and a check, I would implement it as a stream with a smart buffer that provided a run-length encoding scheme and wrote out the RLE data set instead of the data added to the buffer. This would handle their primary use case in less than a few hundred bytes of output and still provide reasonable results if they changed the character they were prepending once in a while.
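Something along these lines (a rough illustrative sketch only, made-up names):
# Illustrative sketch of the "smart buffer": store runs of repeated bytes
# as (byte, count) pairs and write those out instead of the raw data.
class RLEBuffer(object):
    def __init__(self):
        self.runs = [] # list of [byte, count]

    def add(self, byte):
        if self.runs and self.runs[-1][0] == byte:
            self.runs[-1][1] += 1 # extend the current run
        else:
            self.runs.append([byte, 1]) # start a new run

    def write(self, f):
        # a million identical adds come out as a single "1,1000000" line
        for byte, count in self.runs:
            f.write("%s,%d\n" % (byte, count))

buf = RLEBuffer()
for i in range(1000000):
    buf.add("1")
out = open("test.rle", "w")
buf.write(out)
out.close()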
Always good to try and anticipate future requirements if it doesn't add extra work to the current implementation.
OK. Not so serious a note...
#Comment Re: On big buffers made: 2015-03-26 14:20:28.467706+00 by:
Dan Lyke
Flutterby formatter: I didn't have a good way to tell when code started and stopped, so if I see code (ie: /^#!/) I keep going as a <pre> block 'til the next double space.
/. comments: I was shocked by how not bad they were here.
The big issue is that their string manipulation is doing all sorts of wrong, the operating system isn't, and this is a good reminder in higher level languages to avoid string += in favor of string builders, like stringstream.
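In Python, for instance, the builder-style equivalent is collecting pieces in a list and joining once at the end (my example, not from the paper):
import timeit
numIter = 1000000

# naive += in a loop; CPython sometimes optimizes appends in place,
# but you can't count on that in every runtime
def plus_equals():
    s = ""
    for i in range(numIter):
        s += "1"
    return s

# builder-style: collect pieces, join once at the end
def join_builder():
    pieces = []
    for i in range(numIter):
        pieces.append("1")
    return "".join(pieces)

for fn in (plus_equals, join_builder):
    start = timeit.default_timer()
    fn()
    print fn.__name__ + ": " + str(timeit.default_timer() - start)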
#Comment Re: On big buffers made: 2015-03-26 14:44:53.049269+00 by:
ebradway
Formatter: I tried using tags and that was worse.
/.: Except for the gratuitous Python/Java bashing, they were insightful.
#Comment Re: On big buffers made: 2015-03-26 15:56:21.316207+00 by:
Dan Lyke
Yeah, see, I'm no longer sure that Python/Java bashing is gratuitous...