Fast Stream Array Access

by **KG_is_back** » Sun Oct 19, 2014 10:27 pm

Exo wrote:I was going to ask you guys is there any opcodes you really want/need? If you can give clear examples of benefits of certain opcodes I could get on to Malc to add them (I'm usually quite good at getting him to add little things if I give him a clear example and make it simple for him).

Maybe topic for another thread?

No. 1 choice: subtraction for integers, either sub reg,reg/var32 and/or psubd xmm0,xmm1/var. Also a fix for the nasty andnps coloring bug... and a logical NOT would be appreciated, even in the Code component.

by **MyCo** » Mon Oct 20, 2014 4:29 am

Wow, Martin is on a roll.

I hadn't noticed that FS supports the "movd r/m32, xmm" instruction, good to know... Unfortunately it doesn't support "movd xmm, r/m32", which would give another performance boost.

BTW: Don't trust the cycle counter method, it's pretty inaccurate. On my system, for example, the cycle counter outputs the same value for the "Simple Delay" and the "Simple Delay (Stock)", although I know there should be a huge difference. When I need a meaningful comparison, I make hundreds of synchronized copies of a module and feed them in parallel into a selector (as mono/packed mono stream). Then I switch between optimized/normal while watching the CPU usage, either in FS or in the Windows Resource Monitor.
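The "many copies" idea can be sketched outside FlowStone as well. The following C sketch is only an illustration under assumptions: the `Delay` struct, the two tick variants, and `run_bank` are hypothetical names, not the modules from this thread. It runs a bank of identical delay lines per sample and reports CPU time, so two variants can be compared at different copy counts.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#define DELAY_LEN 1024u   /* power of two, so the mask variant is valid */
#define MAX_COPIES 256

/* Hypothetical fixed-length delay line (not the FlowStone module). */
typedef struct {
    float buf[DELAY_LEN];
    unsigned pos;
} Delay;

/* Variant A: wrap the index with a compare-and-reset branch. */
static float delay_tick_branch(Delay *d, float in) {
    float out = d->buf[d->pos];   /* sample written DELAY_LEN ticks ago */
    d->buf[d->pos] = in;
    d->pos++;
    if (d->pos >= DELAY_LEN) d->pos = 0;
    return out;
}

/* Variant B: wrap the index with a bit mask instead of a branch. */
static float delay_tick_mask(Delay *d, float in) {
    float out = d->buf[d->pos];
    d->buf[d->pos] = in;
    d->pos = (d->pos + 1u) & (DELAY_LEN - 1u);
    return out;
}

typedef float (*TickFn)(Delay *, float);

static Delay bank[MAX_COPIES];

/* Runs `copies` delay lines for `samples` ticks each.
   Returns the sum of all outputs (so the work cannot be skipped)
   and stores the consumed CPU time in *seconds. */
static double run_bank(TickFn tick, int copies, int samples, double *seconds) {
    assert(copies > 0 && copies <= MAX_COPIES);
    memset(bank, 0, sizeof bank);
    double acc = 0.0;
    clock_t t0 = clock();
    for (int n = 0; n < samples; n++)
        for (int c = 0; c < copies; c++)
            acc += tick(&bank[c], (float)(n % 7));
    *seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return acc;
}
```

Calling `run_bank` with 10 and then 100 copies for each variant gives a rough picture of how the ratio shifts with the working-set size. The absolute numbers depend heavily on the compiler and machine.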

by **MyCo** » Mon Oct 20, 2014 5:07 am

Here is a test bench schematic that I use for optimizations. I've set it up with the delays.

by **Tronic** » Mon Oct 20, 2014 8:42 am

Exo wrote:I was going to ask you guys is there any opcodes you really want/need?

call [ reg ]
so we can call a function with address pointer from dll, directly in the Assembler, and use the dll as plugin.
Or any other way to call function from DLL in Code or Assembler.
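For readers less familiar with indirect calls: `call [reg]` loads a target address from the memory location that `reg` points to and then calls it. A rough C analogue is a call through a pointer to a function pointer. The sketch below is an assumption-laden illustration: `PluginFn`, `plugin_table`, and the two stand-in functions are hypothetical, and in a real setup the table entries would be addresses resolved from the DLL by the host rather than local functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plugin signature: one float in, one float out. */
typedef float (*PluginFn)(float);

/* Stand-ins for functions that would normally live in a DLL. */
static float plugin_gain(float x)   { return 0.5f * x; }
static float plugin_invert(float x) { return -x; }

/* Table of resolved addresses; a pointer into this table plays
   the role of the register in "call [reg]". */
static PluginFn plugin_table[] = { plugin_gain, plugin_invert };

/* Load the function address from memory, then call it. */
static float call_indirect(const PluginFn *slot, float x) {
    assert(slot != NULL && *slot != NULL);
    return (*slot)(x);
}
```

In assembler, the caller would also have to respect the calling convention of the DLL function (argument passing and register preservation), which this sketch leaves to the C compiler.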

by **KG_is_back** » Mon Oct 20, 2014 6:42 pm

MyCo wrote:Here is a test bench schematic that I use for optimizations. I've set it up with the delays.

Very interesting! The stock delays show about 20% and the "optimized" ones show 30-40% on my machine.

by **martinvicanek** » Mon Oct 20, 2014 10:01 pm

MyCo wrote:When I need a meaningful comparison, I make hundreds of synchronized copies of a module and feed them in parallel into a selector (as mono/packed mono stream). Then I switch between optimized/normal while watching the CPU usage, either in FS or in the Windows Resource Monitor.

Hm, very confusing. The mass test does not show a big difference between stock and "optimized" - if any, then the other way round.

When you go to 10 instead of 100 copies then the proportions change towards the analyzer result. For me this shows that performance is a complex beast, it depends very much on context. Measuring the performance of one isolated unit seems to have little meaning. But then again, is the mass setup with 100 delays in parallel more representative of a real scenario?

I have implemented "fast" lookup table modules but now I hesitate to post them ...

by **tester** » Mon Oct 20, 2014 10:21 pm

When I play with oscillators, I usually have a few hundred of them on board. So yes, it can be a real scenario, and it has practical uses. On the other hand, even if your oscillators perform better only within smaller designs, those designs can be heavy in other parts, so these few percent can be helpful too. I think I may be able to do a quick test of a multi-osc setup, to see what the real-life difference is between the stock and the custom-made part.

In fact, this is why I asked you about the possibility of making "multisine" oscillators. I'm not sure if there is any way to make a single "shape" oscillator that takes as input a list of arbitrary sine frequencies (at ca. 0.01 Hz accuracy each).

by **KG_is_back** » Mon Oct 20, 2014 10:54 pm

Actually, a more relevant test now occurs to me: we can put the module into a poly section and create a module that initiates a given number of voices. Because poly sections work in parallel, independently, and run only while a voice is on, we can avoid selectors.

by **martinvicanek** » Tue Oct 21, 2014 8:35 am

KG_is_back wrote:
Exo wrote:I was going to ask you guys is there any opcodes you really want/need?

No. 1 choice: subtraction for integers, either sub reg,reg/var32 and/or psubd xmm0,xmm1/var. Also a fix for the nasty andnps coloring bug... and a logical NOT would be appreciated, even in the Code component.

+1, and the following:

PSRLD xmm1, xmm2/m128
Shift doublewords in xmm1 right by amount specified in xmm2/m128 while shifting in 0s.
(Would be handy for some IEEE 754 trickery in log and exp approximations)

PMULUDQ xmm1, xmm2/m128
Multiply packed unsigned doubleword integers in xmm1 by packed unsigned doubleword integers in xmm2/m128, and store the quadword results in xmm1.
(Useful for a linear congruential random number generator)

PADDD xmm1, xmm2
Add packed doubleword integers from xmm2/m128 and xmm1.
(Current implementation only supports PADDD xmm1, m128)
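To make the two use cases above concrete, here is a scalar C sketch of what one lane of each instruction would compute. It is an illustration under assumptions, not FlowStone code: the function names are hypothetical, the LCG constants (1664525 and 1013904223) are the widely published Numerical Recipes pair rather than anything from this thread, and the log2 formula is just one simple variant of the exponent trick. Note that PMULUDQ only multiplies dword lanes 0 and 2, and that the register form of PSRLD applies one shift count to all lanes.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* One lane of PMULUDQ: unsigned 32 x 32 -> 64 bit multiply. */
static uint64_t mul_u32_wide(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* One LCG step: state = (a * state + c) mod 2^32.
   Truncating the 64-bit product to 32 bits performs the modulo. */
static uint32_t lcg_next(uint32_t state) {
    uint64_t prod = mul_u32_wide(state, 1664525u);
    return (uint32_t)prod + 1013904223u;
}

/* Reinterpret the bits of a float as an unsigned integer. */
static uint32_t float_bits(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* One lane of PSRLD by 23: drops the mantissa and leaves the biased
   exponent (sign bit is 0 for positive input). Subtracting the bias
   127 gives floor(log2(x)) for positive normal floats. */
static int32_t floor_log2_bits(float x) {
    assert(x >= FLT_MIN);   /* positive, normal numbers only */
    return (int32_t)(float_bits(x) >> 23) - 127;
}

/* Crude log2 approximation: exponent plus the mantissa taken as a
   linear fraction in [0, 1). Exact at powers of two. */
static float log2_approx(float x) {
    float frac = (float)(float_bits(x) & 0x7FFFFFu) / 8388608.0f;
    return (float)floor_log2_bits(x) + frac;
}
```

The linear mantissa term underestimates the true log2 between powers of two (for example, it gives 0.5 for 1.5), so a practical version would typically add a polynomial correction on top.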

Exo wrote:Maybe topic for another thread?

Yes, please

by **martinvicanek** » Tue Oct 21, 2014 8:42 am

martinvicanek wrote:Hm, very confusing. The mass test does not show a big difference between stock and "optimized" - if any, then the other way round.

Apparently this paradox has confused others before:
http://synthmaker.co.uk/forum/viewtopic ... =30#p77149
