Curious benchmarks
2007-11-28 23:08:46.67622+00 by
Dan Lyke
7 comments
Interesting. I'm playing with pthreads
a bit, so I've got a simple loop that increments a variable about a billion times, once just flat out, once locking around the increment. Linux laptop, Intel Core Duo T2450 at 2.00GHz, 3.6 seconds without locks, 58.8 seconds with locks. Mac Pro, 2 Dual-Core Intel Xeons at 2.66GHz, 2.7 seconds without locks, 62.1 seconds with locks.
Stupid benchmark, not significant in any way, except that the Linux laptop often feels way snappier than the Mac desktop (and the Mac laptop that's in the shop right now). If the overhead of the OS and system library implementations is chewing up that additional CPU speed, that may explain a lot...
[ related topics:
Open Source Macintosh
]
comments in ascending chronological order:
#Comment Re: made: 2007-11-28 23:49:53.445403+00 by:
Jim S
One significant hardware difference: I suspect the dual processor machine has to get the lock written out
to main memory while the T2450 only has to get it as far as the level 2 cache... or the linux pthread lock is
just faster.
#Comment Re: made: 2007-11-29 00:07:13.57475+00 by:
Dan Lyke
I'd expect that there'd be no particular need for the processor-local cache to get flushed, so I chalked it up to the Linux pthread lock taking roughly 8 units versus 11 units on OS X (where a unit is one iteration of the while (abc < 999999999) abc++; loop).
Edit: Huh, duh, I was just being obtuse there, I need to go look at how multi-processor Xeon boxes communicate internal cache dirty status to each other.
#Comment Re: made: 2007-11-29 06:36:02.204736+00 by:
spc476
Hmmm ... the locking seems to really take a toll. I did the following:
loop: mov eax,[gv]
      inc eax
      mov [gv],eax
      cmp eax,1000000000
      jl loop
And got 2.454s on a 2.6GHz dual Pentium system (running the code on a single core). I then did a spinlock version:
loop: mov al,1
spin: xchg al,[glock]
      or al,al
      jne spin
      mov eax,[gv]
      inc eax
      mov [gv],eax
      mov byte [glock],0
      cmp eax,1000000000
      jl loop
on the same system, and with one core running this segment, had a runtime of 39.752s. Even simple spin locks are expensive.
#Comment Re: made: 2007-11-29 07:53:50.425362+00 by:
spc476
I reran the test, this time dual-threaded (dual-core Pentium). I got some ... um ... curious results.
#Comment Re: made: 2007-11-29 16:10:21.597591+00 by:
Dan Lyke
If you feel like going further with that, how about getting rid of the 8-bit code in the locks and using full 16- or 32-bit words? The sequence:
t2:      mov al,1
t2.wait: xchg al,[glock]
         or al,al
         jne t2.wait
just screams pipeline stalls while the processor is splitting that poor register apart. I'll bet using ax or eax will give you at least 2x.
#Comment Re: made: 2007-11-29 19:29:57.831987+00 by:
spc476
16-bit would be worse (it needs an operand-size override prefix on the x86), and going full 32-bit didn't help.
#Comment Re: made: 2007-11-29 19:36:25.563757+00 by:
Dan Lyke
Huh. Weird. Thanks for the update.