tester wrote: Today I did some tests on translating them into mono4 (stereo2mono4 --> oscillator --> mono42stereo, to be precise), expecting that the general app performance would increase. To my surprise - it seemed that it worked a bit worse (similar speed, but sluggish work
For a very short mono4 section, like a basic sine osc, that's quite normal - packing and unpacking the data has some CPU overhead of its own, plus the extra memory reads and writes (and the toolbox prims always pack/unpack all four channels, even if you only use a couple of them).
Generally, the more modules or lines of code/assembly you can chain together without conversion, the greater the proportion of CPU saving. So its effectiveness will depend a lot on your design - if your synth needs lots of unpack->routing->pack in between very small "shared" sections, then it may not necessarily be worthwhile (and adds a lot of complication to the routing).
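To put rough numbers on that, here is a little back-of-envelope model in Python. The cost figures are invented purely for illustration (nothing here is measured), and it assumes that one mono4 instruction costs about the same as one mono instruction. The function name and parameters are just made up for the sketch.

```python
def mono4_saving(chain_cost, pack_cost, unpack_cost, channels=4):
    """Fraction of CPU saved by one mono4 chain versus `channels` mono chains.

    All costs are arbitrary units per sample, invented for illustration.
    Assumes a mono4 instruction costs the same as a mono instruction.
    """
    mono_total = channels * chain_cost                    # N separate mono chains
    mono4_total = pack_cost + chain_cost + unpack_cost    # one packed chain
    return 1.0 - mono4_total / mono_total


# Short, medium and long chains with the same (hypothetical) pack/unpack cost.
for chain_cost in (5, 20, 100):
    saving = mono4_saving(chain_cost, pack_cost=8, unpack_cost=8)
    print(chain_cost, round(saving, 3))
```

With these made-up costs the shortest chain comes out with a negative saving (the conversion costs more than it gains), and the saving grows as the chain gets longer. Setting channels=2 shows why a stereo-only design gains even less from the same conversion.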
If you take a look at my old "Trogz Toolz" kit over at the SM site, you'll find hopped versions of the pack/unpack that can also help a little bit - e.g. when packing 'slow' green values from controls for input parameters. Swapping the stock ones for those can save a few % if you have lots of parameters to pack.
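For anyone unsure what hopping buys you, here is a minimal Python sketch of the idea. It assumes "hopped" means the pack only runs once every N samples rather than on every sample; the function name and the tuple standing in for a packed value are hypothetical, not the actual code from the kit.

```python
def run_hopped(param_values, num_samples, hop=16):
    """Count how often four slow parameters get packed over a block of samples."""
    packed = None
    packs = 0
    for n in range(num_samples):
        if n % hop == 0:
            # Stand-in for the real pack operation (hypothetical).
            packed = tuple(param_values)
            packs += 1
        # Per-sample mono4 processing would read `packed` here.
    return packs


print(run_hopped((0.1, 0.2, 0.3, 0.4), num_samples=64, hop=16))
print(run_hopped((0.1, 0.2, 0.3, 0.4), num_samples=64, hop=1))
```

Since slider values change slowly compared with the sample rate, packing them less often should make little audible difference, while the pack work drops by roughly a factor of the hop size.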
The nature of the code inside a module also has a part to play. For example, the stock oscillators all use look-up tables of some sort. Depending on what the PC is doing at any given time, fetching the data from memory can sometimes be more of a bottleneck than running the instructions themselves. This caught me by surprise when I was coding the delays from the same "toolz" - changing the maximum buffer size affected the CPU load more than many of the instruction optimisations I attempted.
This can lead to some weird "oscillating" CPU loads if many routines are trying to access the same lookup table but out of phase with each other, and may also partly explain why the oscillators show little improvement...
If you use a single mono osc, it is really still running all four channels - but they are all synchronised, and therefore all reading from the same memory address in the lookup table. The moment you have four different frequencies running, the read pointers for the lookup will be bouncing around all over the place, making potentially four times as many 'page requests' from memory - and that would happen whether you used four mono oscs or a single mono4.
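A rough way to picture this is to count how many distinct chunks of the table get touched on each sample. The Python sketch below is only a toy model: the table size, the 64-byte cache line and the 4-byte float size are all assumptions, it counts cache lines rather than whatever the hardware really does, and it has nothing to do with how the stock oscillators are actually coded.

```python
TABLE_SIZE = 4096        # hypothetical lookup table length, in samples
SAMPLES_PER_LINE = 16    # assumes 64-byte cache lines holding 4-byte floats


def lines_touched_per_sample(freqs_hz, sample_rate=44100, num_samples=1000):
    """Average number of distinct cache lines the channels read per sample."""
    phases = [0.0] * len(freqs_hz)
    total = 0
    for _ in range(num_samples):
        lines = set()
        for ch, freq in enumerate(freqs_hz):
            # Table index for this channel's current phase (0.0 to 1.0).
            index = int(phases[ch] * TABLE_SIZE) % TABLE_SIZE
            lines.add(index // SAMPLES_PER_LINE)
            # Advance the phase by one sample and wrap around.
            phases[ch] = (phases[ch] + freq / sample_rate) % 1.0
        total += len(lines)
    return total / num_samples


synced = lines_touched_per_sample([440.0, 440.0, 440.0, 440.0])
spread = lines_touched_per_sample([110.0, 277.0, 523.0, 1319.0])
print(synced, spread)
```

With four identical frequencies every channel lands on the same line on every sample, so the average is exactly one. With four different frequencies the average climbs towards four, which is the same memory traffic you would get from four separate mono oscs.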