assembly: data bypass delays - optimising black magic

by **KG_is_back** » Wed Oct 29, 2014 8:50 pm

While preparing an optimization article for FS guru, I dug deeper into how the processor works. Here is one of the things I found. When you mix instructions that run on different execution units (the circuits responsible for a specific class of operation, for example integer ops vs. floating-point ops), there is additional latency (in CPU cycles) for moving data between the units.
Here is a simple example:
In FS there are two operations that are functionally exactly the same: andps and pand. Although both perform a bitwise logical AND on 128-bit register data, they use different execution units (andps runs on the floating-point unit, pand on the integer unit). On their own they take the same amount of CPU, but when you use pand in between float operations (addps in this example), you pay additional latency to switch the data between units, so you use noticeably more CPU.
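
To make the idea concrete, here is a toy model in Python (a sketch, not a measurement). It adds up the latency of a chain of dependent instructions and charges an extra delay every time the data crosses from one execution domain to the other. The latencies and the one-cycle bypass delay are assumed, illustrative values, and the names `INSTRUCTIONS` and `chain_latency` are made up for this sketch; real numbers depend on the processor and may well be zero.

```python
# Toy model of a dependent instruction chain with a bypass delay.
# All numbers below are ASSUMED for illustration, not measured values.

# instruction name -> (execution domain, latency in cycles)
INSTRUCTIONS = {
    "addps": ("float", 3),
    "andps": ("float", 1),
    "pand": ("int", 1),
}


def chain_latency(chain, bypass_delay=1):
    """Sum the latencies of a dependent chain, adding bypass_delay
    whenever two consecutive instructions run in different domains."""
    total = 0
    prev_domain = None
    for name in chain:
        domain, latency = INSTRUCTIONS[name]
        if prev_domain is not None and domain != prev_domain:
            total += bypass_delay  # data has to cross between units
        total += latency
        prev_domain = domain
    return total


# Logical AND between two float additions, done in either domain.
float_only = chain_latency(["addps", "andps", "addps"])
mixed = chain_latency(["addps", "pand", "addps"])
print(float_only, mixed)
```

With a bypass delay of 0 both chains come out the same, which would correspond to a processor that shows no difference between the two variants.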

On my machine:
code bypassed (in2=0 bypasses the code execution within the component) .... 1% CPU
code with andps .... 3.6% CPU (2.6% after subtracting the other code in the schematic)
code with pand .... 4.3% CPU (3.3% after subtracting the other code in the schematic)

So in this example using pand takes roughly 27% more CPU (3.3% vs. 2.6%).

Note that this problem is processor specific - it may or may not be present on your machine.

by **MyCo** » Wed Oct 29, 2014 9:57 pm

Makes absolutely no difference on my system. Just for reference, it's an AMD FX8350.

by **KG_is_back** » Wed Oct 29, 2014 10:09 pm

I'm using an Intel Core i5-3210M.
I have tried the pand and andps operations in between paddd (integer) instructions and it also makes no difference, possibly because there is no penalty for using logic operations in between integer instructions. This is clearly a highly machine-dependent topic.

by **Youlean** » Wed Oct 29, 2014 11:49 pm

On my i5 4760k I get 1.1% CPU usage on the first and 1.3% on the second.

by **Walter Sommerfeld** » Thu Oct 30, 2014 7:46 pm

i7 3770:

0.40, 1.20 & 1.40 % CPU in FS
0.43, 0.55 & 0.55 % in Process Explorer

by **tulamide** » Thu Oct 30, 2014 9:15 pm

I'm not even sure if I understand what I'm expected to do :lol:

This is what I did:
1) Connected the module "output" with DS Out.
2) Set "in2" to false.
3) Looked at schematic CPU load: 0.7%
4) Switched selector from input 0 to input 1
5) Looked at schematic CPU load: 0.8%
6) Switched back to input 0 and set "in2" to true
7) Looked at schematic CPU load: 2.3%
8) Switched selector to input 1 again
9) Looked at schematic CPU load: 2.5%

If I've done it correctly I'm surprised, because you, KG, have higher values although my processor is a very old one: AMD Athlon X2 250. It seems that processor speed is more important than architecture (3 GHz with no turbo mode on mine, while yours is 2.5 GHz with the option to speed up to max. 3.1 GHz)?

But maybe I've done it all wrong

by **KG_is_back** » Thu Oct 30, 2014 9:47 pm

You've done it correctly. I'm not surprised by the difference in CPU load. When you run a program, your operating system reserves a certain maximum amount of CPU that the program may use. The % readout of the FS meter is actually a percentage of that maximum. The task manager shows the "true" value: how much of all the available processing power the program is actually using. Your OS constantly checks the CPU load of the different threads and may give a thread more room if it gets close to 100% of its maximum. From the schematic's point of view the internal CPU meter is more relevant, because once it reaches 100% your OS will not allow the program to use more, so that other applications can still run in parallel.
Multiple cores and multithreading also play a role here. For example, when the CPU meter in FS shows 100%, the task manager shows 25%, because I have a dual-core processor with 4 threads (so a single thread can take only 25% of the entire processing capacity).
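
As a rough sketch of that arithmetic in Python: it assumes the FS meter is relative to a single logical thread while the task manager is relative to all logical threads together. The function name `task_manager_percent` is made up for this illustration.

```python
def task_manager_percent(fs_meter_percent, logical_threads):
    """Convert a reading relative to ONE logical thread into a share of
    the whole machine (assumption: the load sits on a single thread)."""
    return fs_meter_percent / logical_threads


# Dual-core processor with 4 threads, FS meter at 100%.
print(task_manager_percent(100, 4))  # 25.0
```

On a dual-core processor without hyper-threading the same 100% would correspond to 50% of the machine, which is one reason readings from different machines are hard to compare directly.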

So the reason why you get a much smaller CPU reading with the same schematic might be that your operating system gives FS more CPU headroom at the time than it does on my machine.

by **tulamide** » Fri Oct 31, 2014 11:54 am

I was aware of the core issue. My CPU has 2 cores, but since yours also has 2 cores, I thought I could compare them directly. Since yours has hyper-threading, that doesn't work, of course.

The task manager is a different thing. In this case I didn't trust it that much, because in all of the four tests I did, it showed 0% with some rare peaks at 1%. But maybe it really is the real load. But then, to really see differences, the test should be heavier on processor load. I think.

by **KG_is_back** » Fri Oct 31, 2014 1:56 pm

You can actually observe and influence how Windows manages processes. Open the speed tester in two separate instances of FlowStone (so that you see two FS windows at the same time). Set up the same code in both of them (preferably using the pow() function, which takes the most CPU) and adjust them so that they both take about 30% CPU.
Now open the task manager and go to the Processes tab. You will see two "flowstone" processes, each running at about 10% CPU. Right-click each of them and use "Set affinity..." so that both are forced to run on the same processor core. Windows now has to prioritize between them. You can set the priority of one of them (right-click the process -> priority) and watch the internal CPU meter in the FS instances suddenly jump up or down depending on the priority settings of the two processes. Note that whatever priority you set, each FS instance takes the same amount of CPU (you can see this in the task manager, where it always fluctuates around the same value), but the FS meter shows completely different values, because it calculates the CPU load relative to the CPU maximum provided by your OS (which depends on the "priority" and "core affinity" settings as well as on the CPU load of other processes).
