AMD Athlon slow?

dmaz	(Posted 2009) [#1]

this test seems to show different results on AMD vs. Intel and PPC.
can others verify and/or explain these results? also, can someone post numbers for a newer AMD processor?

is there anything wrong with the test method?

SuperStrict
Framework BRL.GLMax2D
Import BRL.Random
Import BRL.StandardIO

Const SIZE:Int = 500000


Type tpart
	Global array:tpart[SIZE]
	Global list:TList = New TList
	Global first:tpart

	Field nextPart:tpart
	Field str:String = "name"
	Field a:Int = 89

	Function Allocate()
		For Local i:Int = 0 Until SIZE
			Local p:tpart = New tpart
			array[i] = p
			p.nextPart = first
			first = p

			list.AddLast p
		Next
	End Function

	Function LoopArray()
		For Local t:tpart = EachIn array
			t.a = 109
		Next
	End Function
	
	Function LoopTList()
		For Local t:tpart = EachIn list
			t.a = 109
		Next
	End Function

	Function LoopList()
		Local t:tpart = tpart.first
		While t
			t.a = 109
			t = t.nextPart
		Wend
	End Function

	
	Method m()
		Print a
	End Method

End Type

Delay 100
GCSetMode 2

tpart.Allocate

Delay 200

Local atimeI:Int = MilliSecs()
tpart.LoopArray
atimeI = MilliSecs()-atimeI

Local tltimeI:Int = MilliSecs()
tpart.LoopTList
tltimeI = MilliSecs()-tltimeI

Local ltimeI:Int = MilliSecs()
tpart.LoopList
ltimeI = MilliSecs()-ltimeI

Local atimeE:Int = MilliSecs()
For Local t:tpart= EachIn tpart.array
	t.a = 109
Next
atimeE = MilliSecs()-atimeE

Print "array internal: "+atimeI
Print "TList internal: "+tltimeI
Print " list internal: "+ltimeI
Print "array external: "+atimeE

my timings below seem to scale the same way on each computer, except that the AMD is about 2x slower processing a simple linked list than an array.

                 Intel Core 2    Intel Core 2    PPC 800 MHz     AMD 2.13 GHz
                 Duo 4 GHz       Duo 2.16 GHz    Mac 10.3.9      Athlon (old)
                 Windows Vista   Mac 10.5.6                      Windows XP

       objects   1,000,000       1,000,000       250,000         250,000
                 ms    ratio     ms    ratio     ms    ratio     ms    ratio
array internal   41              41              46              47
TList internal   42    1.02      43    1.05      65    1.41      66    1.40
 list internal   40    0.98      40    0.98      40    0.87      87    1.85
array external   41    1.00      40    0.98      47    1.02      46    0.98

(ratio = time relative to "array internal" on the same machine)

I even coded a test that adds an enumerator to the type so I could use EachIn on the hand-rolled list. That resulted in the same timings.
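
For anyone who wants to try the enumerator variant, a minimal sketch could look like the following. The names TNode, TChain and TChainEnum are made up for illustration and are not the code from the original test; the only part taken from BlitzMax itself is that EachIn asks the collection for ObjectEnumerator() and then calls HasNext() and NextObject() on the result.

```blitzmax
SuperStrict
Framework BRL.StandardIO

' Hypothetical node type, standing in for tpart with its nextPart link.
Type TNode
	Field nextNode:TNode
	Field a:Int = 89
End Type

' Enumerator object: EachIn calls HasNext() and NextObject() on it.
Type TChainEnum
	Field cur:TNode

	Method HasNext:Int()
		Return cur <> Null
	End Method

	Method NextObject:Object()
		Local n:TNode = cur
		cur = cur.nextNode
		Return n
	End Method
End Type

' Hypothetical container: EachIn can walk it because it has ObjectEnumerator().
Type TChain
	Field first:TNode

	' Push a node onto the front of the chain.
	Method Push(n:TNode)
		n.nextNode = first
		first = n
	End Method

	Method ObjectEnumerator:TChainEnum()
		Local e:TChainEnum = New TChainEnum
		e.cur = first
		Return e
	End Method
End Type

Local chain:TChain = New TChain
For Local i:Int = 0 Until 5
	chain.Push New TNode
Next

' Same loop body as the test, but driven by EachIn over the custom chain.
Local count:Int = 0
For Local n:TNode = EachIn chain
	n.a = 109
	count :+ 1
Next
Print "visited: " + count
```

Note that each EachIn loop allocates one enumerator object and goes through two method calls per element, so it is not expected to beat the plain While loop in LoopList.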

Perturbatio

(Posted 2009) [#2]

for my system:

array internal: 13
TList internal: 19
 list internal: 16
array external: 12

Nate the Great

(Posted 2009) [#3]

here's mine

at 500,000

array internal: 30
TList internal: 34
list internal: 31
array external: 29

at 1,000,000

array internal: 67
TList internal: 62
list internal: 37
array external: 35

thanks for this! It will definitely speed up my physics engine!

iprice

(Posted 2009) [#4]

On my work's new DELL machine - Intel Core2 Duo ~2.33Ghz
Intel Q35 Express

Array internal: 19
TList internal: 64
List internal: 20
Array External: 20

My ancient (4years+) Athlon 64 3000+ laptop ~ 800Mhz
ATI Radeon 9600

Array internal: 33
TList internal: 52
List internal: 42
Array External: 33

Considering the age difference, that's not so bad.

Both on Windows XP

dmaz	(Posted 2009) [#5]

thanks guys. so Nate you have an Intel then right?

take this test with a grain of salt though... it's not a good test at all for comparing the actual number of ms. what I was really looking for is the percent gain/loss between the methods.

maybe something is being optimized on AMDs for the array loop, or vice versa.

xlsior

(Posted 2009) [#6]

Intel core 2 Duo 2.4GHz

array internal: 19
TList internal: 21
list internal: 36
array external: 18

dmaz	(Posted 2009) [#7]

xlsior, did you run that just once? are multiple runs consistent with these numbers?

Nate the Great

(Posted 2009) [#8]

Yeah I have Intel core 2 Duo 2.6 GHz

xlsior

(Posted 2009) [#9]

xlsior, did you run that just once? are multiple runs consistent with these numbers?

I ran it a bunch of times -- the numbers do fluctuate quite a bit, but arrays are pretty much always faster than lists for me (which mirrors my own tests I've done in the past).

I just ran the test 10 times in a row (using 1,000,000)-- arrays won 9 times, lists 1.
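
Since single runs fluctuate this much, one way to make the comparison steadier is to repeat each loop and report the spread. The following is only a rough sketch of that idea for the array loop: TItem, RUNS and TimeArrayPass are names invented here, and the repeat count of 10 is an arbitrary choice.

```blitzmax
SuperStrict
Framework BRL.StandardIO

Const SIZE:Int = 500000
Const RUNS:Int = 10	' hypothetical repeat count

' Hypothetical stand-in for tpart.
Type TItem
	Field a:Int = 89
End Type

' The loop being measured; returns elapsed milliseconds for one pass.
Function TimeArrayPass:Int(items:TItem[])
	Local start:Int = MilliSecs()
	For Local t:TItem = EachIn items
		t.a = 109
	Next
	Return MilliSecs() - start
End Function

Local items:TItem[] = New TItem[SIZE]
For Local i:Int = 0 Until SIZE
	items[i] = New TItem
Next

' First pass seeds min/max/total, the remaining passes update them.
Local best:Int = TimeArrayPass(items)
Local worst:Int = best
Local total:Int = best
For Local run:Int = 1 Until RUNS
	Local ms:Int = TimeArrayPass(items)
	best = Min(best, ms)
	worst = Max(worst, ms)
	total :+ ms
Next

Print "array pass over " + RUNS + " runs: min " + best + " ms, max " + worst + " ms, mean " + (total / Float(RUNS)) + " ms"
```

The same wrapper could be applied to the TList and linked-list loops; comparing the minimums tends to be less sensitive to whatever else the machine is doing than comparing single runs.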

Something else to consider other than just AMD vs. Intel: CPU cache.

Different chip models have different amounts of on-chip cache, and if your data happens to fit inside the CPU cache it's much faster than having to fetch parts of it from RAM. In general, Intel chips have more cache than most AMD models.

The Core 2 Duo that I have has 4MB cache, but there are other models that have less.

kenshin

(Posted 2009) [#10]

500,000

array internal: 12
TList internal: 13
list internal: 12
array external: 12

1,000,000

array internal: 24
TList internal: 26
list internal: 24
array external: 25

I have the 4MB L2 cache as well.

degac

(Posted 2009) [#11]

500,000 tested on Athlon64 3500+ (2.2 GHz)

array internal: 18
TList internal: 22
list internal: 21
array external: 18

Panno

(Posted 2009) [#12]

Linking:untitled1
Executing:untitled1
array internal: 14
TList internal: 14
list internal: 13
array external: 14
Intel Core Duo 5200 at 3.05 GHz

and on a P4 with 3 GHz I get

Linking:untitled1.exe
Executing:untitled1.exe
array internal: 42
TList internal: 49
list internal: 53
array external: 46

HrdNutz

(Posted 2009) [#13]

I think xlsior got it - AMD chips have much less L1, L2 (and L3) cache than Intel. You have to pre-fetch effectively, which I'm not sure is possible with BlitzMax. AMD will generally have better memory bandwidth, and that will scale better with multi-core than Intel, but in any cache-friendly application Intel wins. Look into the SPECfp and SPECint benchmarks.

Try getting the AMD CPU driver and the AMD Dual-Core Optimizer from their website. It probably won't do much in your case, but it's worth a shot. I think they fixed some synchronization problems.

Please let us know if you figure this out.

Retimer

(Posted 2009) [#14]

array internal: 38
TList internal: 120
 list internal: 50
array external: 40

Strange TList result. AMD Athlon 64 X2 6100+

@Iprice

My ancient (4years+) Athlon 64 3000+ laptop ~ 800Mhz

Yet I bet the processor back then was expensive as hell =p A 2800 cost me $500 at TigerDirect 4~5 years ago.

dmaz	(Posted 2009) [#15]

@HrdNutz, yeah that makes sense except that xlsior's *was* intel with a big cache... his results are out of the norm according to the rest of the thread... it looks more like the AMD results.

iprice

(Posted 2009) [#16]

It was indeed expensive, but it's not caused me any problems, unlike my desktop which I got at exactly the same time. I thought laptops were supposedly less reliable - not in my case. My desktop has gone through 2 motherboards, 3 PSUs and a couple of GFX cards.

My laptop is waaaaay underspecced for games nowadays, but perfect still for my programming needs. It plays HalfLife 2 and Doom3 lovely though :)

xlsior

(Posted 2009) [#17]

@HrdNutz, yeah that makes sense except that xlsior's *was* intel with a big cache... his results are out of the norm according to the rest of the thread... it looks more like the AMD results.

In case it makes a difference: I'm running the 64-bit version of Vista

dmaz	(Posted 2009) [#18]

I don't think so, as I'm running it too... hmmm.

HrdNutz

(Posted 2009) [#19]

Try disabling Cool'n'Quiet on the AMD (if it's enabled in the BIOS) and see if that makes any difference.

xlsior

(Posted 2009) [#20]

There are also variations among the Core Duo lines of course -- Mine's an E6600 Conroe @65nm

Dreamora

(Posted 2009) [#21]

500'000

array internal: 4
TList internal: 9
list internal: 6
array external: 4

1'000'000

array internal: 9
TList internal: 17
list internal: 12
array external: 9

Core i7 920 (2.83Ghz), 6GB TriChannel

As for your question: one of the major problems with old AMDs is their almost nonexistent L2 cache. 256 KB/512 KB is nice for small things, but 500k entries is already 2 MB of pure pointer data, so you get a lot of cache misses, and every cache miss means a transfer from RAM to the CPU, which costs a lot of time.
Arrays work well because their elements sit next to each other in memory, so there are far fewer cache misses, which is the real breaking point here.

The small L2 cache also forces the CPU to swap in many blocks when requesting the actual data of the entries.
For the array, those entries are again laid out together in RAM to a much higher degree, so there are fewer cache misses on the actual data as well.

The internal local just makes this problem worse by allocating new variables over and over again.
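
The 2 MB figure is easy to check. The sketch below just does the arithmetic, assuming a 32-bit build with 4-byte object references (an assumption; it does not measure anything), and compares the result with the 512 KB and 4 MB cache sizes mentioned in this thread. The constant names are made up for the example.

```blitzmax
SuperStrict
Framework BRL.StandardIO

Const SIZE:Int = 500000
Const POINTER_BYTES:Int = 4	' assumption: 32-bit build, 4-byte references
Const L2_SMALL:Int = 512 * 1024	' 512 KB, the older AMD figure mentioned above
Const L2_LARGE:Int = 4 * 1024 * 1024	' 4 MB, the Core 2 Duo figure mentioned above

' Size of the references alone, not counting the objects they point to.
Local refBytes:Int = SIZE * POINTER_BYTES

Print "reference data: " + refBytes + " bytes (" + (refBytes / 1024) + " KB)"
Print "relative to a 512 KB L2: " + (refBytes / Float(L2_SMALL))
Print "relative to a 4 MB L2: " + (refBytes / Float(L2_LARGE))
```

So the references alone are several times larger than a 512 KB cache but fit in a 4 MB one. The objects themselves (three fields each plus whatever per-object overhead the runtime adds) come on top of that.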

Pete Carter

(Posted 2009) [#22]

Core 2 duo 7200 (2.53ghz) XP pro

500'000

array internal: 25
TList internal: 26
list internal: 24
array external: 25

1'000'000

array internal: 53
TList internal: 53
list internal: 51
array external: 54