assembly: data bypass delays - optimising black magic
As preparation for an optimization article for FS guru, I have dug deeper into how the processor works. Here is one of the things I've found out. When you mix instructions that run on different execution units (an execution unit is the circuit responsible for a specific class of operation, for example integer ops vs. floating-point ops), there is additional latency (CPU cycles) for passing data between the units.
Here is a simple example:
In FS there are two operations that are functionally exactly the same: andps and pand. Although they both perform a bitwise logical AND on 128-bit register data, they use different execution units (andps runs on the floating-point unit, pand on the integer unit). Taken on their own they both cost the same CPU, but when you use pand in between float operations (for example addps, as in the attached example) you pay additional latency to move the data between the units. Therefore you use more CPU.
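To make the "functionally the same" claim concrete, here is a minimal Python sketch (not FlowStone code). It models a 128-bit register as four 32-bit lanes; the helper names `float_bits`, `bits_float` and `and_lanes` are made up for this illustration. The point is only that a bitwise AND produces the same bits whether you think of the lanes as floats or as integers, so andps and pand differ in where they execute, not in what they output.

```python
import struct

def float_bits(x):
    """Return the 32-bit IEEE-754 pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_float(b):
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def and_lanes(a, b):
    """Bitwise AND of two 4-lane 'registers' given as lists of 32-bit ints."""
    return [x & y for x, y in zip(a, b)]

# Four float lanes and a mask that clears the sign bit in every lane.
values = [-1.5, 2.0, -0.25, 8.0]
mask = [0x7FFFFFFF] * 4

# The AND itself only ever sees bit patterns.
result_bits = and_lanes([float_bits(v) for v in values], mask)

# Read the same result back as integers and as floats.
as_int = [hex(b) for b in result_bits]
as_float = [bits_float(b) for b in result_bits]

print(as_int)    # ['0x3fc00000', '0x40000000', '0x3e800000', '0x41000000']
print(as_float)  # [1.5, 2.0, 0.25, 8.0]
```

Clearing the sign bit this way is just a convenient mask to demonstrate with; the attached schematic may use a different one.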
On my machine:
code bypassed (in2=0 bypasses the code execution within the component) .... 1% CPU
code with andps .... 3.6% CPU (2.6 if we subtract the other code in the schematic)
code with pand .... 4.3% CPU (3.3 if we subtract the other code in the schematic)
So in this example using pand takes roughly 27% more CPU (3.3 vs. 2.6).
Note that this problem is processor specific - it may or may not be present on your machine.
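For anyone who wants to compare their own readings, the relative overhead can be worked out from the three meter values quoted above. This is a small Python sketch; the function name `relative_overhead` is made up for illustration, and the numbers are simply the ones from this post.

```python
def relative_overhead(baseline, variant):
    """Extra cost of `variant` over `baseline`, in percent."""
    return (variant - baseline) / baseline * 100.0

bypassed = 1.0    # % CPU with the code bypassed (rest of the schematic)
andps_raw = 3.6   # % CPU with andps
pand_raw = 4.3    # % CPU with pand

# Overhead on the raw meter readings.
raw = relative_overhead(andps_raw, pand_raw)

# Overhead after subtracting the rest of the schematic from both readings.
net = relative_overhead(andps_raw - bypassed, pand_raw - bypassed)

print(f"raw: {raw:.0f}% more, net: {net:.0f}% more")  # raw: 19% more, net: 27% more
```

Plug in your own three readings to see whether your processor shows the penalty at all.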
- Attachments
- data_bypass_delays.fsm (4.8 KiB)
- KG_is_back
- Posts: 1196
- Joined: Tue Oct 22, 2013 5:43 pm
- Location: Slovakia
Re: assembly: data bypass delays - optimising black magic
Makes absolutely no difference on my system. Just for reference, it's an AMD FX8350.
- MyCo
- Posts: 718
- Joined: Tue Jul 13, 2010 12:33 pm
- Location: Germany
Re: assembly: data bypass delays - optimising black magic
I'm using an Intel Core i5-3210M.
I have tried the pand and andps operations in between paddd (integer) instructions and it also makes no difference, possibly because there is no penalty for using logic operations in between integer instructions. This is clearly a highly machine-dependent topic.
- KG_is_back
- Posts: 1196
- Joined: Tue Oct 22, 2013 5:43 pm
- Location: Slovakia
Re: assembly: data bypass delays - optimising black magic
On my i5 4760k I get 1.1% on first, and 1.3% CPU usage on second.
- Youlean
- Posts: 176
- Joined: Mon Jun 09, 2014 2:49 pm
Re: assembly: data bypass delays - optimising black magic
i7 3770:
0.40, 1.20 & 1.40 % CPU in FS
0.43, 0.55 & 0.55 % in Process Explorer
- Walter Sommerfeld
- Posts: 249
- Joined: Wed Jul 14, 2010 6:00 pm
- Location: HH - Made in Germany
Re: assembly: data bypass delays - optimising black magic
I'm not even sure if I understand what I'm expected to do
This is what I did:
1) Connected the module "output" with DS Out.
2) Set "in2" to false.
3) Looked at schematic CPU load: 0.7%
4) Switched selector from input 0 to input 1
5) Looked at schematic CPU load: 0.8%
6) Switched back to input 0 and set "in2" to true
7) Looked at schematic CPU load: 2.3%
8) Switched selector to input 1 again
9) Looked at schematic CPU load: 2.5%
If I've done it correctly, I'm surprised, because you, KG, have higher values although my processor is a very old one: an AMD Athlon X2 250. It seems that processor speed is more important than architecture (3 GHz with no turbo mode on mine, versus 2.5 GHz with the option to speed up to a maximum of 3.1 GHz on yours)?
But maybe I've done it all wrong.
"There lies the dog buried" (German saying translated literally)
- tulamide
- Posts: 2714
- Joined: Sat Jun 21, 2014 2:48 pm
- Location: Germany
Re: assembly: data bypass delays - optimising black magic
You've done it correctly. I'm not surprised by the CPU load difference. When you run a program, your operating system reserves some maximum amount of CPU that the program may use. The % readout in the FS meter is actually the percentage of that maximum. The task manager shows the "true" value: how much CPU the program is actually using out of all the processing power available. Your OS constantly checks the CPU load of the different threads and may opt to give a thread more room if it gets close to 100% of its maximum. From the schematic's point of view the internal CPU meter is more relevant, because once it reaches 100% your OS will not allow the schematic to use more, so that other applications can still run in parallel.
Multiple cores and multithreading come into play here too. For example, when the CPU meter in FS shows 100%, the task manager shows 25%, because I have a dual-core, four-thread processor (so a single thread can take only 25% of the entire processing potential).
So the reason you get a so much smaller CPU reading with the same schematic might be that your operating system gives FS more CPU headroom at the time than it does on my machine.
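The 100% vs. 25% relationship can be written down as a one-line conversion. This Python sketch assumes the FS meter measures load relative to a single hardware thread, which is my reading of the behaviour described above and not something taken from FlowStone documentation; the function name `whole_machine_share` is made up for illustration.

```python
def whole_machine_share(single_thread_percent, logical_cores):
    """Convert a load measured against one hardware thread into a share
    of the whole machine, as a task-manager-style percentage."""
    return single_thread_percent / logical_cores

# Dual-core, four-thread processor: a fully loaded thread is a quarter of the machine.
print(whole_machine_share(100.0, 4))  # 25.0

# The same full load on a plain dual-core without hyper-threading.
print(whole_machine_share(100.0, 2))  # 50.0
```

This is also why readings from machines with different thread counts cannot be compared directly in the task manager.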
- KG_is_back
- Posts: 1196
- Joined: Tue Oct 22, 2013 5:43 pm
- Location: Slovakia
Re: assembly: data bypass delays - optimising black magic
I was aware of the core issue. My CPU has 2 cores, and since yours has 2 cores as well, I thought I could compare them directly. Because of hyper-threading that doesn't work, of course.
The task manager is a different thing. In this case I didn't trust it that much, because in all four of the tests I did it showed 0% with some rare peaks at 1%. But maybe that really is the real load. Then again, to really see differences, the test should put a heavier load on the processor, I think.
- tulamide
- Posts: 2714
- Joined: Sat Jun 21, 2014 2:48 pm
- Location: Germany
Re: assembly: data bypass delays - optimising black magic
You can actually observe and affect how Windows manages processes. Open the speed tester in two separate instances of FlowStone (so you can see two FS windows at the same time). Set up the same code in both of them (preferably using the pow() function, which takes the most CPU) and adjust them so that each takes about 30% CPU.
Now open the task manager and go to the Processes tab. You will see two "flowstone" processes, each running at about 10% CPU. Right-click each of them and use "Set affinity..." so that both are forced to run on the same processor core. Windows must now prioritize between them. You can set the priority of one of them (right-click the process -> Priority) and observe how the internal CPU meter in the FS instances suddenly jumps up or down depending on the priority settings of the two processes. Note that whatever priority you set, each FS instance takes the same CPU (you can see this in the task manager, where it always fluctuates around the same value), but the FS meter shows completely different values, because it calculates the CPU load relative to the CPU maximum provided by your OS (which depends on the priority and core affinity settings as well as the CPU load of other processes).
- KG_is_back
- Posts: 1196
- Joined: Tue Oct 22, 2013 5:43 pm
- Location: Slovakia