Basic UTF to ASCII decoder

by **trogluddite** » Mon Jan 21, 2013 6:27 pm

Hi All,
Following the text parsing issues experienced by 'mccy' with his SFZ translator, here is some rough and ready Ruby for reading text files that may be UTF encoded - as FS Ruby does not seem to have the libraries for doing this included in the install (and the toolbox 'Text Load' can't do UTF files either).

UTF Decoding.fsm: (1.58 KiB) Downloaded 1388 times

To open a file, use the syntax...
open_ascii ( <file_path> )
The method checks the file to see if it has UTF markers at the start of the file, and adapts the decoding to suit - you will be returned a string containing only the ASCII text, with unrecognised characters stripped out (by default).
If you are using non-ASCII characters in your files regularly - because you use an alternative language to English as your Windows default, say - it won't help much with that (unless you like your words without vowels!). It's designed more for when you need to parse data files that might just happen to be UTF encoded text.
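For anyone curious how the marker check described above can work, here is a minimal sketch. It is not the code from the attached schematic; `detect_bom`, `BOMS` and the `fallback` parameter are names made up for illustration. It compares the first few bytes of the data against the standard byte-order marks and falls back to a caller-supplied default when the file is untagged.

```ruby
# Standard byte-order marks, longest first, so that UTF-32LE (FF FE 00 00)
# is not mistaken for UTF-16LE (FF FE).
BOMS = [
  [:UTF_32BE, [0x00, 0x00, 0xFE, 0xFF]],
  [:UTF_32LE, [0xFF, 0xFE, 0x00, 0x00]],
  [:UTF_8,    [0xEF, 0xBB, 0xBF]],
  [:UTF_16BE, [0xFE, 0xFF]],
  [:UTF_16LE, [0xFF, 0xFE]]
]

# Returns [encoding_symbol, marker_length_in_bytes].
# 'fallback' (hypothetical parameter) is returned for untagged data.
def detect_bom(data, fallback = :ASCII)
  head = data.bytes.first(4)
  BOMS.each do |name, marker|
    return [name, marker.size] if head.first(marker.size) == marker
  end
  [fallback, 0]
end
```

The returned length tells the caller how many bytes to skip before decoding. One known weakness of this sketch: a UTF-16LE file whose first character is a NUL would be reported as UTF-32LE.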
There are also a couple of extra options...

- Add a symbol as a second argument to force a particular type of de-coding if the file is not tagged. The symbols are...
:ASCII, :UTF_8, :UTF_16LE, :UTF_16BE, :UTF_32LE or :UTF_32BE - :ASCII is assumed if you don't supply a value.
For example...
open_ascii ( <file_path>, :UTF_16LE )
Assumes that untagged files are utf-16 little-endian encoded.

- Pass a block to the method, and this will be used to decode any bytes outside the ASCII range. The block will be passed the data point as an integer, and the return value should be something that will insert into your text.
For example, to replace all non-ASCII characters with "<UTF:xxxxx>", where 'xxxxx' is the character code...
open_ascii ( <file_path> ) {|byte| "<UTF:#{byte}>"}
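To illustrate the stripping and the block option, here is a rough sketch that works on a string of raw bytes rather than a file. It is an assumption about one way to do it, not the schematic's actual code: `to_ascii` and `UNPACK` are hypothetical names, and it leans on core `String#unpack`, which needs no extra libraries.

```ruby
# Unpack directives that turn raw bytes into integer character codes.
UNPACK = {
  :ASCII    => 'C*',  # one byte per character
  :UTF_8    => 'U*',  # raises ArgumentError on malformed UTF-8
  :UTF_16LE => 'v*',  # 16-bit units; surrogate pairs are NOT combined
  :UTF_16BE => 'n*',
  :UTF_32LE => 'V*',
  :UTF_32BE => 'N*'
}

# Keeps codes below 128 as ASCII. Anything else is handed to the
# block if one was given, otherwise it is dropped.
def to_ascii(data, encoding = :ASCII)
  codes = data.unpack(UNPACK.fetch(encoding))
  codes.map { |code|
    if code < 128
      code.chr
    elsif block_given?
      yield(code).to_s
    else
      ''
    end
  }.join
end

bytes = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack('C*')   # "café" as UTF-8 bytes
to_ascii(bytes, :UTF_8)                             # => "caf"
to_ascii(bytes, :UTF_8) { |c| "<UTF:#{c}>" }        # => "caf<UTF:233>"
```

Any byte-order marker should be skipped before calling this, otherwise it is treated like any other non-ASCII character.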

I've tested so far with Notepad text files and a few .xml's, and it seems pretty reliable - but that may depend on whether the file-type you choose included the markers in the header or not, and it is as yet untested with UTF-32 formats - so I'm sure there's room for improvement!

Note that the toolbox "Text Load" primitive also can't handle UTF encoded files - so you might find this useful as a module in its own right - the schematic has an input for a regular FS file browser, so that it can be used without any further coding.

by **mccy** » Wed Jan 23, 2013 11:16 pm

I put your whole code into the new version 09e (not uploaded yet). Don't know if it will work better... no matter... Thanks for your great help, it is just unbelievable how helpful you are!

by **trogluddite** » Thu Jan 24, 2013 1:11 am

Thank you, you're welcome.
I do really enjoy myself here, and I'm sure that all of my own projects benefit from it as much as anybody else's do; especially considering the years at the SM forums before this one.
It's surprising how often considering someone else's problem throws up a solution to something I'm stuck on - just by making me look at the problem from a totally different angle.

And it sure beats anything on TV - got rid of mine about a decade ago, and I've never missed it.
(The ups and downs of my posting habits over the years probably correlate pretty well with my relationship status too! :lol: )

by **mccy** » Thu Jan 24, 2013 10:02 am

No TV here either!!! Not for many, many years.

Your converter makes tx2sfz come close to version 1.0. I have converted GIGANTIC programs with many many samples!!!
I just ran into one last big problem: Tx16Wx does not write the wav rootkey in the program (I couldn't find it, so it evidently reads the rootkey from the wav and saves the wav with a new root if necessary)! That can really screw up projects when also using sfz, which does not read the root key from the wav!

Would it be difficult to read the rootkey of a wav file so that the converter can integrate it into the sfz file?

by **mccy** » Thu Jan 24, 2013 6:15 pm

With a hex editor it seems to be easy to detect the location where the rootkey information is written in the wav file header. So maybe, with a similar strategy to your UTF translator, I should be able to read out those bytes.
Maybe this evening I'll find some time...

by **trogluddite** » Thu Jan 24, 2013 7:45 pm

Will be interesting to know what results you get with your wav reading.
Coincidentally I've been messing with wav files myself. I now have a routine that can read a standard wav file header to get the sample rate, number of samples, bit depth etc. It doesn't work with 24bit, float or surround channels yet, as that involves an extended header which I'm just working on decoding.
I've managed to successfully read and write standard 16bit files no problem. Unfortunately it's not a terribly fast process - I was hoping to come up with a workable audio recorder, but although the CPU is quite low, recording in real-time causes audio crackles despite trying several buffering methods. That's not entirely unexpected though, as I already knew that Ruby would not be a very efficient way to do it.
When I have some presentable code, I'll post it up - maybe it could still be useful for sample editing - especially if combined with what you learn about the embedded meta-data. Being able to read/write that meta-data was requested many times at the SM site by people interested in building their own sampling plugins.
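In the meantime, here is a minimal sketch of reading the simplest kind of header mentioned above. It assumes the plain 44-byte layout (RIFF tag, a 16-byte 'fmt ' chunk, then the 'data' tag) and raises an error for anything else, so extended 24bit/float files are out of scope. `read_simple_wav_header` is a made-up name, not the routine from my schematic.

```ruby
# Decodes a plain 44-byte wav header passed in as a binary string.
# All multi-byte fields in a wav file are little-endian.
def read_simple_wav_header(header)
  riff, _riff_size, wave, fmt, _fmt_size, format_tag, channels,
    rate, _byte_rate, block_align, bits, data_tag, data_size =
    header.unpack('a4Va4a4VvvVVvva4V')

  unless riff == 'RIFF' && wave == 'WAVE' && fmt == 'fmt ' && data_tag == 'data'
    raise ArgumentError, 'not a simple 44-byte wav header'
  end

  {
    :format      => format_tag,   # 1 means integer PCM
    :channels    => channels,
    :sample_rate => rate,
    :bit_depth   => bits,
    # block_align is the size in bytes of one frame (all channels)
    :frames      => block_align > 0 ? data_size / block_align : 0
  }
end

# Usage, with 'path' standing in for a real file name:
#   info = File.open(path, 'rb') { |f| read_simple_wav_header(f.read(44)) }
```

Note the 'rb' mode when opening the file. Without it, Windows newline translation can corrupt the binary data.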

by **mccy** » Thu Jan 24, 2013 9:57 pm

For now, I will experiment with my finding that at decimal position 82 (for 24bit) / 84 (for 16bit) you will find the rootkey, which ranges from hex 00 = c-1 to g9 (127 dec).

So I will start by building it with a manual selector for 16/24 bit files, but as there may be mixed samples in a sample program, I will have to read the bit depth from the file once I have the rootkey reading done.

by **trogluddite** » Fri Jan 25, 2013 12:31 am

mccy wrote: For now, I will experiment with my finding that at decimal position 82 (for 24bit) / 84 (for 16bit) you will find the rootkey, which ranges from hex 00 = c-1 to g9 (127 dec).
So I will start by building it with a manual selector for 16/24 bit files, but as there may be mixed samples in a sample program, I will have to read the bit depth from the file once I have the rootkey reading done.

A wav file is divided into chunks. The simplest form has two chunks - a 44 byte header, and a 'data' chunk with the audio. 'Extended' formats (24bit/float/multi-channel) add another smaller chunk after the header with extra info - hence the pitch value is further into the file.
The meta-data that you want to read will be in other chunks still - and many are possible, including cue points, playlists, sampler/instruments data etc. And there is no set order that these chunks have to appear in.
This implies that relying on the index position of the data is unlikely to be reliable unless you are always reading files created by the same source application. The only way to be certain is to do some pretty serious parsing of the file to read the chunk "tags".
I've found lots of useful stuff in this document... Wave File Format. Most of the other references I found only really cover standard wav files, but this one has good detail about the sample/instrument chunks.
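The tag parsing described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in the schematic below: `wav_chunks` and `smpl_root_key` are hypothetical names, the whole file is assumed to be already loaded into a binary string, and the root key is assumed to live in the 'smpl' chunk's MIDI unity note field (the fourth 32-bit value in that chunk per the RIFF layout). A given application may well store it in a different chunk.

```ruby
# Walks the chunks of a RIFF/WAVE file held in a binary string.
# Returns an array of [tag, data_offset, data_size] entries.
def wav_chunks(data)
  unless data.byteslice(0, 4) == 'RIFF' && data.byteslice(8, 4) == 'WAVE'
    raise ArgumentError, 'not a RIFF/WAVE file'
  end
  chunks = []
  pos = 12                                  # first chunk follows 'WAVE'
  while pos + 8 <= data.bytesize
    tag, size = data.byteslice(pos, 8).unpack('a4V')
    chunks << [tag, pos + 8, size]
    pos += 8 + size + (size & 1)            # data is padded to an even length
  end
  chunks
end

# Returns the MIDI unity note from the 'smpl' chunk, or nil if the
# file has no such chunk (ASSUMPTION: root key is stored there).
def smpl_root_key(data)
  _tag, offset, size = wav_chunks(data).find { |tag, _off, _len| tag == 'smpl' }
  return nil unless offset && size >= 16
  data.byteslice(offset + 12, 4).unpack('V').first
end

# Usage, with 'path' standing in for a real file name:
#   data = File.open(path, 'rb') { |f| f.read }
#   wav_chunks(data).each { |tag, off, len| puts "#{tag} @ #{off} (#{len} bytes)" }
```

Because the lookup is by tag rather than by a fixed byte position, it does not matter what order the chunks appear in or whether the file is 16 or 24 bit.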

I've just about got a "chunk" parser made, but each chunk has a different format, so it's slow going working out the unpacking of the data from each section. Here's the current state of progress...

WAV file extended reader.fsm: (1.15 KiB) Downloaded 1466 times

Note the data in the text box when you load it - it's a 24bit file recorded in Reaper. Audio data doesn't start until byte 736, and there's even a chunk called 'junk'!

by **mccy** » Fri Jan 25, 2013 1:48 pm

Uff. This is far beyond my abilities... So without your help the tx2sfz converter would not become real. To use the time, I worked on some workarounds like setting the high key as the rootkey etc... And I like the results so far.
For drums with fixed tuning it works anyway.

In the end I bet I have to use your code for the rootkey reading. Great work so far, I'll try to understand some things...

by **trogluddite** » Fri Jan 25, 2013 8:52 pm

mccy wrote:In the end I bet I have to use your code for the rootkey reading. Great work so far, I'll try to understand some things...

That's really no problem, that's why I post them.
And there's no need to be embarrassed or ashamed of 'borrowing' some code - 99% of the software you use every day consists largely of routines from pre-defined libraries, algorithms from university papers, chunks of code on forums etc. etc. Even when writing a whole routine, most code consists of archetypes and cliches, ingrained from seeing them so many times in examples of other code.
You can see this even in the FS application itself, with its borrowing of Ruby, and use of standard Windows routines for drawing to the screen.
In fact, that's part of the beauty of FS - the modular format makes sharing things so easy, which saves everyone from having to re-invent the wheel all the time, and means we can get on with implementing the "big picture" rather than getting bogged down in details.
And if you have even a sliver of curiosity, you'll soak up programming know-how without even noticing - just poking around inside things and trying to understand how they work - even if it just inspires you to look up one command that you never saw before.

For what it's worth, the speed that you've got your project together from a "standing start" is really very impressive - especially as the task is in a totally different league to a typical "beginners guide" example - not to mention having to fit it around your job and family!
