assembly: data bypass delays - optimising black magic

Postby KG_is_back » Wed Oct 29, 2014 8:50 pm

As preparation for an optimization article for FS Guru, I have dug deeper into how the processor works. Here is one of the things I've found. When you use instructions that run on different execution units (the circuits responsible for a specific kind of operation, for example integer ops vs. floating-point ops), there is additional latency (CPU cycles) for moving data between the units.
Here is a simple example:
In FS there are two operations that are functionally exactly the same: andps and pand. Although they both perform a logical bitwise AND on 128-bit register data, they use different execution units (andps runs in the floating-point unit, pand in the integer unit). On their own they cost the same CPU, but when you use pand in between float operations (for example addps, as in the example), you pay additional latency to switch the data between units, so the code uses more CPU.
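Why the two instructions give the same result can be modelled in a few lines of Python. This is only an illustrative sketch, not FlowStone code: the helper names (float_to_bits, bits_to_float, and_lanes) are made up for this example, and the four-element lists stand in for the four 32-bit lanes of a 128-bit register. The AND itself does not care whether the bits are "floats" or "integers"; only the execution unit that handles them differs.

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a value as its 32-bit single-precision bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def and_lanes(a, b):
    """Bitwise AND of four 32-bit lanes (the result both andps and pand produce)."""
    return [x & y for x, y in zip(a, b)]

# Four float lanes, standing in for one 128-bit register.
values = [1.5, -2.0, 0.25, -8.0]
# Mask that clears the sign bit in every lane (the usual abs() trick).
mask = [0x7FFFFFFF] * 4

lanes = [float_to_bits(v) for v in values]
result = [bits_to_float(b) for b in and_lanes(lanes, mask)]
print(result)  # [1.5, 2.0, 0.25, 8.0]
```

The model says nothing about timing. The extra latency discussed here comes from where the hardware executes the operation, which a Python sketch cannot show.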

On my machine:
code bypassed (in2=0 bypasses the code execution within the component) .... 1% CPU
code with andps .... 3.6% CPU (2.6 if we subtract the other code in the schematic)
code with pand .... 4.3% CPU (3.3 if we subtract the other code in the schematic)

So in this example using pand takes roughly 27% more CPU (3.3 vs. 2.6).
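The relative overhead can be checked directly from the three readings above. A minimal sketch in Python, assuming the bypassed reading (1%) is the cost of everything else in the schematic; the variable names are only for illustration:

```python
# Readings from the FS CPU meter, in percent.
bypassed = 1.0   # code skipped, only the rest of the schematic runs
with_andps = 3.6
with_pand = 4.3

# Subtract the baseline to isolate the cost of the tested code.
andps_net = with_andps - bypassed
pand_net = with_pand - bypassed

# Extra cost of the pand version relative to the andps version.
extra_percent = (pand_net - andps_net) / andps_net * 100
print(f"{extra_percent:.1f}% more CPU")  # 26.9% more CPU
```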

Note that this problem is processor specific - it may or may not be present on your machine.
Attachments
data_bypass_delays.fsm
(4.8 KiB) Downloaded 1307 times
KG_is_back
 
Posts: 1196
Joined: Tue Oct 22, 2013 5:43 pm
Location: Slovakia

Re: assembly: data bypass delays - optimising black magic

Postby MyCo » Wed Oct 29, 2014 9:57 pm

Makes absolutely no difference on my system. Just for reference, it's an AMD FX8350.
MyCo
 
Posts: 718
Joined: Tue Jul 13, 2010 12:33 pm
Location: Germany

Re: assembly: data bypass delays - optimising black magic

Postby KG_is_back » Wed Oct 29, 2014 10:09 pm

I'm using an Intel Core i5-3210M.
I have also tried the pand and andps operations in between paddd (integer) instructions, and there it makes no difference, possibly because there is no penalty for using logic operations in between integer instructions. This is clearly a highly machine-dependent topic.
KG_is_back
 
Posts: 1196
Joined: Tue Oct 22, 2013 5:43 pm
Location: Slovakia

Re: assembly: data bypass delays - optimising black magic

Postby Youlean » Wed Oct 29, 2014 11:49 pm

On my i5 4760k I get 1.1% CPU usage on the first and 1.3% on the second.
Youlean
 
Posts: 176
Joined: Mon Jun 09, 2014 2:49 pm

Re: assembly: data bypass delays - optimising black magic

Postby Walter Sommerfeld » Thu Oct 30, 2014 7:46 pm

i7 3770:

0.40, 1.20 & 1.40 % CPU in FS
0.43, 0.55 & 0.55 % in Process Explorer
Walter Sommerfeld
 
Posts: 249
Joined: Wed Jul 14, 2010 6:00 pm
Location: HH - Made in Germany

Re: assembly: data bypass delays - optimising black magic

Postby tulamide » Thu Oct 30, 2014 9:15 pm

I'm not even sure if I understand what I'm expected to do :lol:

This is what I did:
1) Connected the module "output" with DS Out.
2) Set "in2" to false.
3) Looked at schematic CPU load: 0.7%
4) Switched selector from input 0 to input 1
5) Looked at schematic CPU load: 0.8%
6) Switched back to input 0 and set "in2" to true
7) Looked at schematic CPU load: 2.3%
8) Switched selector to input 1 again
9) Looked at schematic CPU load: 2.5%

If I've done it correctly, I'm surprised, because you, KG, get higher values although my processor is a very old one: an AMD Athlon X2 250. It seems that processor speed is more important than architecture (3 GHz with no turbo mode on mine, while yours runs at 2.5 GHz with the option to speed up to a maximum of 3.1 GHz)?

But maybe I've done it all wrong :P
"There lies the dog buried" (German saying translated literally)
tulamide
 
Posts: 2688
Joined: Sat Jun 21, 2014 2:48 pm
Location: Germany

Re: assembly: data bypass delays - optimising black magic

Postby KG_is_back » Thu Oct 30, 2014 9:47 pm

tulamide wrote:I'm not even sure if I understand what I'm expected to do :lol:

This is what I did:
1) Connected the module "output" with DS Out.
2) Set "in2" to false.
3) Looked at schematic CPU load: 0.7%
4) Switched selector from input 0 to input 1
5) Looked at schematic CPU load: 0.8%
6) Switched back to input 0 and set "in2" to true
7) Looked at schematic CPU load: 2.3%
8) Switched selector to input 1 again
9) Looked at schematic CPU load: 2.5%

If I've done it correctly, I'm surprised, because you, KG, get higher values although my processor is a very old one: an AMD Athlon X2 250. It seems that processor speed is more important than architecture (3 GHz with no turbo mode on mine, while yours runs at 2.5 GHz with the option to speed up to a maximum of 3.1 GHz)?

But maybe I've done it all wrong :P


You've done it correctly. I'm not surprised by the difference in CPU load. When you run a program, your operating system reserves a certain maximum amount of CPU that the program may use. The % readout of the FS meter is actually a percentage of that maximum. The task manager shows the "true" value: how much CPU the program is actually using out of all the processing power available. Your OS constantly checks the CPU load of the different threads and may opt to give a thread more room if it gets close to 100% of its maximum. From the schematic's point of view the internal CPU meter is more relevant, because once it reaches 100% your OS will not allow it to use more, so that other applications can still run in parallel.
Multiple cores and multithreading also play a role here. For example, when the CPU meter in FS shows 100%, the task manager shows 25%, because I have a dual-core, 4-thread processor (so a single thread can take only 25% of the entire processing potential).
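The 100% vs. 25% relationship can be written down as a one-line conversion. This is only a sketch of the simple model described here (the FS meter is measured against a single hardware thread, the task manager against all of them). The function name is made up, and this is not how FlowStone's meter is actually implemented:

```python
def task_manager_share(fs_meter_percent: float, hardware_threads: int) -> float:
    """Approximate whole-machine load for a given single-thread FS meter reading."""
    return fs_meter_percent / hardware_threads

# Dual-core CPU with 4 hardware threads, as in the example.
print(task_manager_share(100, 4))  # 25.0
print(task_manager_share(30, 4))   # 7.5
```

On a plain dual-core CPU without hyper-threading the divisor would be 2, so the same FS reading corresponds to a larger share in the task manager.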

So the reason you get a much smaller CPU reading with the same schematic might be that your operating system gives FS more CPU headroom at the time than it does on my machine.
KG_is_back
 
Posts: 1196
Joined: Tue Oct 22, 2013 5:43 pm
Location: Slovakia

Re: assembly: data bypass delays - optimising black magic

Postby tulamide » Fri Oct 31, 2014 11:54 am

I was aware of the core issue, my cpu has 2 cores, but since yours has 2 cores also, I thought I could compare them directly. Since there's hyper threading, it doesn't work, of course.

The task manager is a different thing. In this case I didn't trust it that much, because in all of the four tests I did, it showed 0% with some rare peaks at 1%. But maybe it really is the real load. But then, to really see differences, the test should be heavier on processor load. I think. :)
tulamide
 
Posts: 2688
Joined: Sat Jun 21, 2014 2:48 pm
Location: Germany

Re: assembly: data bypass delays - optimising black magic

Postby KG_is_back » Fri Oct 31, 2014 1:56 pm

You can actually observe and influence how Windows manages processes. Open the speed tester in two separate instances of FlowStone (so that you can see two FS windows at the same time). Set up the same code in both of them (preferably using the pow() function, which takes the most CPU) and adjust them so that each takes about 30% CPU.
Now open the task manager and go to the Processes tab. You will see two "flowstone" processes running at about 10% CPU. Right-click each of them and use "Set affinity..." so that both are forced to run on the same processor core. Windows now has to prioritize between them. You can set the priority of one of them (right-click the process -> priority) and watch the internal CPU meter in the FS instances suddenly jump up or down depending on the priority settings of the two processes. Note that whatever priority you set, each FS instance takes the same CPU (you can see this in the task manager, where it always fluctuates around the same value), but the FS meter shows completely different values, because it calculates the CPU load relative to the CPU maximum provided by your OS (which depends on the "priority" and "core affinity" settings as well as the CPU load of other processes).
KG_is_back
 
Posts: 1196
Joined: Tue Oct 22, 2013 5:43 pm
Location: Slovakia

