Stream FFT and iFFT

trogluddite · Post by **trogluddite** » Mon Jun 17, 2013 11:00 pm

It's just a delicate little baby at the moment, so treat it with care...

Stream FFT v001 Build 8.fsm: (158.47 KiB) Downloaded 2458 times

tester · Post by **tester** » Mon Jun 17, 2013 11:24 pm

First what I of course did - was to change that value in order to see the crash. But - it did not crashed. It looks that this value is changable when the sound is offline.

trogluddite · Post by **trogluddite** » Tue Jun 18, 2013 7:17 pm

Yes, that's right - it's only when the assembly blocks are running that there can be problems.
Because it is such an intensive algorithm, I left out the parts that check whether memory addresses are valid inside the code arrays - if they go out of range, there's an instant crash. Saves a fair chunk of CPU, but means that there are many hard-wired values that need to be right.

Ultimately, I'll probably just release a set of modules, each preset to a different buffer length - there's a big trade off with this job between versatility and CPU load due to the limitations of the FS assembly instruction set. There's still a lot more code to add yet too - FFT -> iFFT on its own is pretty dull; messing with the data in between is where things will get much more interesting.

tester · Post by **tester** » Tue Jun 18, 2013 9:52 pm

//edit:

redirecting this to new topic.
http://www.dsprobotics.com/support/view ... 1513#p6442

MyCo · Post by **MyCo** » Tue Jun 18, 2013 9:55 pm

The problem with realtime FFT and iFFT is, you have to overlap the buffers and apply the correct window to the FFT. And finally you have to find the right spot where you start the overlapping FFT. That's actually a very hard task.

Have you done the conversion to SSE from x86 code? It's very hard to read, but it looks like you don't really use the benefits of SSE.

When I understand it correctly, you buffer the 1024 samples linear into "inArray" and at the end of the buffer you calculate the FFT. Wouldn't it be faster to buffer the incoming samples at the bitshifted positions in an Array? Then you could leave this step (Loop1) out of the main code.

MyCo · Post by **MyCo** » Tue Jun 18, 2013 10:17 pm

I think I've found a bug in the FFT code. On line 125:

Code: Select all

movaps xmm1,xmm3;

xmm3 wasn't set before

MyCo · Post by **MyCo** » Tue Jun 18, 2013 10:29 pm

Another related bug:

Code: Select all

 mov eax,1;
cmp eax,0;
jnz JgteMtest;

the Jump is always executed because 1-0 is not zero

digitalwhitebyte · Post by **digitalwhitebyte** » Tue Jun 18, 2013 10:39 pm

@Trog:
your implementation seems to use this algorithm.

Code: Select all

class Array
	# DFT and inverse.
	# 
	# Algorithm from 
	# 'Meyberg, Vachenauer: Hoehere Mathematik II, Springer Berlin, 1991, page 332'
	#
	# See http://blog.mro.name/2011/04/simple-ruby-fast-fourier-transform/ 
	#
	def fft doinverse = true
		src = self
		# raise ArgumentError.new "Expected array input but was '#{src.class}'" unless src.kind_of? Array
		n = src.length
		nlog2 = Math.log( n ) / Math.log( 2 )
		raise ArgumentError.new "Input array size must be a power of two but was '#{n}'" unless nlog2.floor - nlog2 == 0.0
		n2 = n / 2
		phi = Math::PI / n2
		if doinverse
			phi = -phi
		else
			src.collect!{|c| c /= n.to_f}
		end
 
		# build a sine/cosine table
		wt = Array.new n2
		wt.each_index { |i| wt[i] = Complex(Math.cos(i * phi), Math.sin(i * phi)) }
 
		# bit reordering
		n1 = n - 1
		j = 0
		1.upto(n1) do |i|
			m = n2
			while j >= m
				j -= m
				m /= 2
			end
			j += m
			src[i],src[j] = src[j],src[i] if j > i
		end
 
		# 1d(N) Ueberlagerungen
		mm = 1
		begin
			m = mm
			mm *= 2
			0.upto(m - 1) do |k|
				w = wt[ k * n2 ]
				k.step(n1, mm) do |i|
					j = i + m
					src[j] = src[i] - (temp = w * src[j])
					src[i] += temp
				end
			end
			n2 /= 2
		end while mm != n
		src
	end
end

//usage
real = []
imag = []

@in.fft(false).each {|i| real.push i.real; imag.push i.imag}
output 0, real; output 1, imag

trogluddite · Post by **trogluddite** » Wed Jun 19, 2013 9:01 am

Hi Guys

MyCo wrote: you don't really use the benefits of SSE.

Saving that for later - I'll look into that once the other coding is done, but I thought I'd keep the other SSE channels free for now to give the option of parallel (e.g. stereo) operation.

Bugs! - nope!
There are a few unconditional jumps (why no 'jmp' opcode!) - they are just used to get the execution order right to switch between testing conditions at loop start/end; so the xmm3 value is actually set later in the code.

MyCo wrote:buffer the incoming samples at the bitshifted positions in an Array? Then you could leave this step (Loop1) out of the main code.

Yes, I have tried several methods of doing that - but oddly enough, they've all come out with higher CPU load so far, despite, as you say, less read/writes and way fewer opcodes. Seems that streamlining the memory pipeline is the crucial thing here - a handful less cache misses per iteration makes a huge difference.
I'll keep working on it though - that transfer is where scaling/windowing would be applied, so it needs tightening up!

digitalwhitebyte wrote:your implementation seems to use this algorithm.

Yup, that's the one!

Tronic · Post by **Tronic** » Wed Jun 19, 2013 11:28 am

I think it is a good read for understanding the world of the FFT
The FFT Demystified

Stream FFT and iFFT

Stream FFT and iFFT

Re: Stream FFT and iFFT

Re: Stream FFT and iFFT

Re: Stream FFT and iFFT

Re: Stream FFT and iFFT

Re: Stream FFT and iFFT

Re: Stream FFT and iFFT

Re: Stream FFT and iFFT

Re: Stream FFT and iFFT

Re: Stream FFT and iFFT