Converting modules into bitstream format
A study on tracker music conversion
I sometimes listen to tracker music. Some of my favourite game soundtrack music ever happens to be from the golden era of tracker music. For example, games made with the original Unreal engine usually had a really good module soundtrack - Unreal Tournament and Deus Ex being good examples. Then there are artists such as Skaven (Peter Hajba) whose older productions mainly circulate in module format.
For those who don't know what this is all about, a module file is basically a bunch of audio samples and instructions on how to play those samples, bundled together in one file that can be listened to using a module player. It's a bit like MIDI music, but with the samples included.
However, listening to tracker music in module format isn't very convenient. You need a special module player, and the end result depends a lot on the settings and capabilities of the player. Converting songs from module format into bitstream format solves these problems if you just want to listen to the music.
If I have a FastTracker 2 module I want to convert into WAV, the intuitive way to get as close as possible to the artist's vision is to use the tools the artist used, and thus export the module from the tracker. I used to compose a little tracker music with FastTracker 2 back in the day, so I have some experience with the problems of the original tools. The question is, how should one convert tracker music into bitstream format if one wants to have the best possible quality and still be as close to the original vision as possible?
- Note: this article uses the HTML5 audio element, and all audio is in Ogg Vorbis format. Apparently, Safari doesn't support Vorbis, so for best experience, use e.g. Chrome.
For this analysis I used music from the original Unreal Tournament and One Must Fall 2097. They have music in a format native to Scream Tracker 3, FastTracker 2 and Impulse Tracker. To run these trackers I used DOSBox. I did a comparison between FastTracker 2 and MilkyTracker, which is a clone of FT2, and they seemed identical in their results, so I used MilkyTracker for convenience and processing comparison. The most promising modern tracker option seemed to be the Open ModPlug Tracker. The versions I used were:
- OpenMPT 1.27.03.00
- MilkyTracker 1.00
- Impulse Tracker 2.15 with WAV writer plugin
- Scream Tracker 3.21
- DOSBox 0.74.3
I extracted the Unreal Tournament music directly from the UMX files of the game using Unreal Musix Ripper 2.0. Similarly, I converted the One Must Fall 2097 PSM files to modules using Chronos Module Converter 1.01.
I basically selected the tracker which was native to the module format in question. I used my ears to assess if the result sounds anything like in-game. I then tested different settings of the trackers to find out how they affect the results in practice. I recorded audio samples where the differences are prominent and easy to hear. I used lossless formats, but for this article compressed the samples. It doesn't really matter for the demonstration purposes.
The main technical things to consider when converting to bitstream format are conversion artefacts and how to interpret the data. The most visible tracker settings usually involved with these are interpolation and volume ramping, but there are also other things that need to be taken into account.
The modules consist of audio samples that may begin or end with the sample value far from zero. This usually results in a pop or crackle. Imagine a waveform with a step in it. How should a loudspeaker reproduce this sound? There will be a sudden pop as the speaker cone jumps to the desired position. This is almost always an unwanted effect. Volume ramping smooths the beginnings and endings of the instrument samples so that there won't be such a discontinuity between two samples.
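To make the idea concrete, here is a minimal Python sketch of volume ramping on a single sample. It is only an illustration: the function names and the linear fade shape are my assumptions, and real trackers apply the ramp inside their mixers rather than on a whole sample like this.

```python
def apply_volume_ramp(samples, ramp_up=16, ramp_down=42):
    """Fade the first `ramp_up` and the last `ramp_down` samples linearly."""
    out = list(samples)
    n = len(out)
    up = min(ramp_up, n)
    down = min(ramp_down, n)
    for i in range(up):
        out[i] *= i / up              # gain rises from 0 towards 1
    for i in range(down):
        out[n - 1 - i] *= i / down    # gain reaches 0 at the last sample
    return out


def largest_jump(samples):
    """Biggest step between neighbouring values, with silence on both sides."""
    padded = [0.0] + list(samples) + [0.0]
    return max(abs(b - a) for a, b in zip(padded, padded[1:]))


# A sample that starts and ends far from zero, like a step in the waveform.
step = [0.8] * 200
ramped = apply_volume_ramp(step)

print("largest jump without ramping:", largest_jump(step))
print("largest jump with ramping:   ", largest_jump(ramped))
```

The unramped sample jumps straight from silence to 0.8, which is the pop. With the ramp the largest step between neighbouring values is a small fraction of that, at the cost of a softer attack.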
The first example is the song Fire Breath from Unreal Tournament. It is exported using MilkyTracker with no volume ramping and no interpolation. The song sounds crisp, with some interesting high-pitched eerie noises. There are however also quite loud pops which sound as if they are not supposed to be there.
When volume ramping is turned on in MilkyTracker, the pops are smoothed out. However, in this case the volume ramping might be a bit too heavy. For example, in the beginning you have some thump-thump sounds. They pop badly without volume ramping, but the volume ramping of MilkyTracker makes them sound more like dhump-dhump and hump-hump. Heavy volume ramping might also smooth out some of the more aggressive synth sounds too much.
Fire Breath, MilkyTracker, no volume ramping.
Fire Breath, MilkyTracker, volume ramping.
Compare this to Fire Breath exported from OpenMPT, with a default volume ramping of 363 µs (16 samples) up, 952 µs (42 samples) down. The pops are only very small, but the thump-thumps are still prominent. If we increase the volume up ramping from 16 samples to 42 samples, the pops start to disappear and only the thump-thump remains. In the sample, I've also turned on some interpolation - more about it in the next section.
Fire Breath, OpenMPT, volume up ramping 16 samples.
Fire Breath, OpenMPT, volume up ramping 42 samples.
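The ramp lengths in microseconds and in samples are two views of the same number. A small Python check, assuming a mixing rate of 44100 Hz (the article does not state the rate, but the quoted values are consistent with it):

```python
def ramp_microseconds(samples, rate_hz=44100):
    """Duration of a ramp of `samples` samples at the given mixing rate."""
    return samples / rate_hz * 1_000_000


print(round(ramp_microseconds(16)))  # 363
print(round(ramp_microseconds(42)))  # 952
```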
And finally, if we listen to a recording made directly from the game, we immediately hear there's some interpolation, but there are also very prominent pops in the beginning, so there is no volume ramping.
Fire Breath, in-game recording.
If we open a module in OpenMPT and look at the samples it contains we might see a wild variety of formats. For example, let's examine the song Razorback, from Unreal Tournament. There are samples with common sampling rates of 32000 Hz and 44100 Hz, but also exotic ones such as 20000 Hz, 61000 Hz or even 9491 Hz. There are both 16-bit and 8-bit samples. In One Must Fall 2097 Menu, all the samples are 8448 Hz, 8-bit. For Deus Ex Intro, most samples are 44100 Hz or 22050 Hz, but not all.
Usually the older the module, the lower the sound quality of the instrument samples used in it. When mixing the final bitstream, which usually will be targeting the CD quality with a sample rate of 44100 Hz, the low-quality samples must be resampled to match. For example, if you have to double the sample rate, you could do it like in the picture.
However, imagine a smooth wave touching the samples on the left of the picture. Now imagine a wave touching the samples on the right. That wave wouldn't be that smooth anymore. But, if you imagine a high-frequency wave over the samples, you could have a continuous wave that touches the samples. In reality it's not exactly like this, but the point is, in the world of digital audio, this would mean that your new signal has high-frequency components, aliasing artefacts. Back in the day when computers weren't that powerful and sound quality was technically very bad, the aliasing artefacts were part of the composing process and as they commonly introduce high frequencies to the final mix, they were part of the sound.
The resampling method in the picture just doubles the samples. Another way with similarly bad resulting artefacts is to fill in the new samples as zeroes. To prevent most resampling artefacts from showing up, interpolation is used. For most module music, using interpolation results in the music seemingly losing some of its crispness. Seemingly, because the samples actually never had any content above half their sampling frequency (see the Nyquist-Shannon sampling theorem). An interpolated song will almost always sound a bit smoother, sometimes even a bit muffled, compared to a non-interpolated one.
Generally speaking, resampling artefacts are a bad thing, but in the case of old modules they may actually be part of the sound. Usually these modules involve low-fidelity chiptune samples. So in the end it's just trial and error, but usually a good starting point is to use high-quality interpolation. Common interpolation methods are linear, cubic spline and sinc.
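The difference between the naive methods and interpolation can be sketched in a few lines of Python. This is a toy example under my own assumptions (a 440 Hz sine sampled at 8000 Hz, doubled to 16000 Hz, and hypothetical function names), not the resampler of any tracker. It compares each method against the sine computed directly at the doubled rate.

```python
import math


def upsample_repeat(x):
    """Double the rate by repeating every sample."""
    return [v for s in x for v in (s, s)]


def upsample_zero_stuff(x):
    """Double the rate by inserting a zero after every sample."""
    return [v for s in x for v in (s, 0.0)]


def upsample_linear(x):
    """Double the rate by inserting the average of neighbouring samples."""
    out = []
    for i, s in enumerate(x):
        nxt = x[i + 1] if i + 1 < len(x) else s
        out.extend((s, (s + nxt) / 2))
    return out


def rms_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


rate, freq, n = 8000, 440.0, 400
low = [math.sin(2 * math.pi * freq * i / rate) for i in range(n)]
ideal = [math.sin(2 * math.pi * freq * i / (2 * rate)) for i in range(2 * n)]

methods = {
    "repeat": upsample_repeat,
    "zero-stuff": upsample_zero_stuff,
    "linear": upsample_linear,
}
errors = {name: rms_error(fn(low), ideal) for name, fn in methods.items()}
for name, err in errors.items():
    print(f"{name:10s} RMS error {err:.4f}")
```

The error that remains is the artefact content each method adds. Linear interpolation leaves the least of the three here, which matches the smoother and less crisp sound described above.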
Continuing with Fire Breath as an example, here's the track from MilkyTracker with fast sinc interpolation. The volume ramping is also on, so you may compare to the sample in the previous section. The interpolation obviously changes the high-pitched noise sounds quite a lot, but also some of the background details which aren't that obvious. For example, listen carefully between 20 and 24 seconds. In the non-interpolated version there's a syncopated, ominous sound in the lower register. It sounds as if someone was expectantly exhaling fire (as the track's name suggests). However, in the interpolated version it sounds as if someone opens and closes a gas burner. Small details add up and may change the overall listening experience.
Fire Breath, MilkyTracker, sinc interpolation.
Moving on to the song Razorback, let's compare a version from the History of Unreal music compilation CD to a non-interpolated and an interpolated export from OpenMPT. The CD version sounds pretty much as if it was just the module version linearly interpolated.
Comparing to the in-game version is interesting, as there are no sounds above 11 kHz. This means the music was probably played back at 22.05 kHz or heavily low-pass filtered. In any case, it sounds similar to the interpolated version, which has only little data above 11 kHz.
Razorback, OpenMPT, no interpolation.
Razorback, OpenMPT, linear interpolation.
Razorback, CD version.
Razorback, in-game recording.
The best interpolation method
To find out what is the best method to do interpolation I did some listening tests. I compared the following methods found in OpenMPT:
- 2-tap linear
- 4-tap cubic spline
- 8-tap XMMS-ModPlug = sinc
- 8-tap PolyPhase = sinc + low-pass filter
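The "tap" count is the number of neighbouring input samples used to compute one output value. As a rough Python sketch, here is a 2-tap linear interpolator next to a 4-tap cubic one. The cubic uses the Catmull-Rom formula, which is a common choice and my assumption here; I have not checked which kernel OpenMPT actually uses.

```python
import math


def interp_linear(x, pos):
    """2-tap: blend the two samples around the fractional position."""
    i = int(pos)
    t = pos - i
    return x[i] + t * (x[i + 1] - x[i])


def interp_cubic(x, pos):
    """4-tap Catmull-Rom spline; needs 1 <= pos < len(x) - 2."""
    i = int(pos)
    t = pos - i
    p0, p1, p2, p3 = x[i - 1], x[i], x[i + 1], x[i + 2]
    return p1 + 0.5 * t * (
        (p2 - p0)
        + t * ((2 * p0 - 5 * p1 + 4 * p2 - p3)
               + t * (3 * (p1 - p2) + p3 - p0))
    )


# A 440 Hz sine sampled at 8000 Hz, read halfway between two samples.
step = 2 * math.pi * 440.0 / 8000
wave = [math.sin(step * i) for i in range(64)]
pos = 5.5
true_value = math.sin(step * pos)

err_linear = abs(interp_linear(wave, pos) - true_value)
err_cubic = abs(interp_cubic(wave, pos) - true_value)
print("linear error:", err_linear)
print("cubic error: ", err_cubic)
```

On a smooth test tone the 4-tap version lands much closer to the true value. Whether that is audible in a dense mix is a separate question, which is what the listening tests are about.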
For the song Fire Breath I noticed a difference only in certain places. The syncopated ominous sound I mentioned earlier sounds exactly the same with everything 4-tap and above. I like the non-interpolated version most. In the beginning there's also an eerie high sound which is not there with the better interpolation methods (the 4-tap and above).
Fire Breath, linear interpolation.
Fire Breath, sinc interpolation.
In the Unreal Tournament song Run I noticed pretty much no difference at all between the interpolation methods. In Razorback I was quite sure the cubic spline interpolation sounded a bit brighter (and better) than the linear interpolation. Just for comparison, I also present a sample of the original Razorback (not the Unreal Tournament version). That version is found on the Return to Stage 9 musicdisk. The sound is noticeably brighter and meaner.
Razorback, linear interpolation.
Razorback, no interpolation.
Razorback, spline interpolation.
Razorback, musicdisk version.
After listening, I plotted the spectrum and there indeed was a tiny boost in the audible upper frequency range with cubic spline interpolation compared to linear. For the XMMS interpolation, the frequencies just at the upper end of audible spectrum were attenuated a bit further.
From One Must Fall 2097 soundtrack, I listened to the Menu and End songs. Both had just a bit brighter sound with cubic spline interpolation. Again, I did some spectrum plotting for visualization.
All in all, it is hard to find differences between the interpolation algorithms in loud music: the very subtle nuances are so easily lost under the major sounds of the music. For the same reason using a theoretically inferior interpolation algorithm might make sense: you don't exactly hear all the minuscule details anyway, but it's nice to have some high frequencies left, even if they were mainly artefacts. It gives some "airiness" to the sound. Essentially the question is whether one should interpolate at all, not so much which algorithm to use for doing it, but if I have to choose one, I'll take the cubic spline.
According to the OpenMPT Wiki, there's a special resampler that models the Amiga sound chip, so it might be worth checking out also for Amiga music. I was always a PC guy, so I never really listened to Amiga stuff.
Amplifying with old tracker software is tricky. FastTracker 2 for example has a slider where you can set the amplification yourself. I don't know about the quality of the internal mixers of the old tracker software, either. Imagine you export a song and it ends up being really quiet. Then you wouldn't be using all the available resolution, and might actually introduce errors because of quantization. But, amplify the song too much and you end up digitally clipping the samples, which always sounds horrible.
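Both failure modes are easy to show with a small Python sketch. The helper below is hypothetical and only mimics a 16-bit export with an amplification slider; it is not how any particular tracker mixes.

```python
import math


def to_int16(samples, gain=1.0):
    """Scale floats in [-1, 1] to 16-bit integers, counting clipped samples."""
    out = []
    clipped = 0
    for s in samples:
        v = round(s * gain * 32767)
        if v > 32767 or v < -32768:
            clipped += 1
            v = max(-32768, min(32767, v))
        out.append(v)
    return out, clipped


tone = [math.sin(0.0123 * i) for i in range(5000)]

full, clipped_full = to_int16(tone, gain=1.0)
quiet, clipped_quiet = to_int16(tone, gain=0.001)   # exported far too quiet
loud, clipped_loud = to_int16(tone, gain=2.0)       # amplified too much

print("distinct levels at full scale:", len(set(full)))
print("distinct levels when quiet:   ", len(set(quiet)))
print("clipped samples when too loud:", clipped_loud)
```

The quiet export only ever touches a few dozen of the 65536 available levels, so the waveform turns into coarse steps. The loud export flattens every peak against the limits.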
As for Impulse Tracker, the WAV writer plugin not only does some interpolation, it also amplifies the result too much. There's also no way of controlling the output volume. In the sample image are waveforms from the exported audio from Impulse Tracker and OpenMPT. The audio has been normalized using ReplayGain so that they sound approximately equally loud to a human. The OpenMPT export has more dynamics since the IT export is clipping all the time. See also the 2020-05-11 update at the end of article.
Now that I mentioned quantization, it's necessary to mention dithering also. Basically, dithering is a process where you intentionally introduce some noise to the end result so as to turn systematic, big artefacts into minuscule errors.
It's always a good idea to dither when rendering the final mix. Internally, tracker software hopefully uses a high-enough resolution buffer so there won't be any quantization errors during processing. For example, OpenMPT uses a 32-bit buffer. But when you export the final result to an audio CD format with 16-bit resolution, it's good to use dithering.
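Here is a minimal Python sketch of what dithering buys you. It uses plain triangular (TPDF) noise of one least significant bit, which is a textbook choice and my assumption; it is not OpenMPT's dither and has no noise shaping.

```python
import random


def quantize_16bit(samples, dither=True, rng=None):
    """Convert floats in [-1, 1] to 16-bit integers, optionally with TPDF dither."""
    rng = rng or random.Random(0)
    out = []
    for s in samples:
        v = s * 32767
        if dither:
            # Difference of two uniform values: triangular noise in (-1, 1) LSB.
            v += rng.random() - rng.random()
        out.append(max(-32768, min(32767, round(v))))
    return out


# A constant signal that sits at 0.3 of one 16-bit step.
tiny = [0.3 / 32767] * 20000

plain = quantize_16bit(tiny, dither=False)
dithered = quantize_16bit(tiny, dither=True)

print("without dither, average level:", sum(plain) / len(plain))
print("with dither, average level:   ", sum(dithered) / len(dithered))
```

Without dither the signal is rounded away to pure silence, a systematic error. With dither the individual samples are noisy, but their average stays close to the original 0.3, so the detail survives as signal plus a little noise.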
OpenMPT also offers the option to normalize the WAV after format conversion. This scales the audio so that the peak amplitude is 1.0. However, this seems to be done after the dithering, so in theory the highest-quality option is to save the files with more than 16 bits, normalize, and only then convert the format with dithering. For example, Audacity offers high-quality shaped dithering by default when exporting. Instead of normalizing, you might also want to do a ReplayGain analysis if you approach a bunch of songs with an album-like mentality, and lossily apply that information before format conversion.
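Why the order matters can be shown with a short Python sketch. The setup is my own: a quiet test tone, plain rounding to 16 bits without dither, and peak normalization. The point is only that any gain applied after quantization amplifies the quantization error along with the music.

```python
import math


def quantize(samples, bits=16):
    """Round to the nearest level of a signed `bits`-bit format (no dither)."""
    scale = 2 ** (bits - 1) - 1
    return [round(s * scale) / scale for s in samples]


def normalize(samples):
    """Scale so that the peak amplitude becomes 1.0."""
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples]


def rms_diff(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


quiet = [0.01 * math.sin(2 * math.pi * i / 97) for i in range(5000)]
reference = normalize(quiet)            # ideal result, never quantized

late = normalize(quantize(quiet))       # quantize first, normalize afterwards
early = quantize(normalize(quiet))      # normalize first, quantize last

err_late = rms_diff(late, reference)
err_early = rms_diff(early, reference)
print("error, normalize after quantizing: ", err_late)
print("error, normalize before quantizing:", err_early)
```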
For more about dithering, see the 2020-05-11 update at the end of article.
Another special thing about modules is their stereo effects. Panning channels left and right is usually done by giving a pan command with a hexadecimal number ranging from 00 to FF. Not all sound hardware is equal, though: for example, the Gravis UltraSound (GUS) does not support all panning positions, and the usable range is only from 10 to EF. The outermost 16 values are played as if they were in the middle; at least this is how it works in FastTracker 2.
While listening to One Must Fall 2097 music via DOSBox using GUS, I noticed the game had a narrower stereo image than what the OpenMPT output had. I tried playing a module in FastTracker 2, both with GUS and Sound Blaster. Both playing the module and exporting it resulted in a similar stereo effect, very much like in-game, so initially I thought the narrower stereo sound was actually the "correct" one. With a stereo expansion value of 50%, the OpenMPT output sounds similar to the in-game GUS recording. This is basically the same as mixing the channels with a ratio of 3:1.
Menu, in-game GUS.
Menu, OpenMPT, 50% stereo expansion.
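The 3:1 ratio follows from a simple mid/side calculation. A minimal Python sketch, assuming the stereo expansion setting scales the difference between the channels (I have not verified that OpenMPT implements it exactly this way, and the function name is mine):

```python
def narrow_stereo(left, right, separation=0.5):
    """Scale the difference between the channels, keep their average."""
    mid = (left + right) / 2
    side = (left - right) / 2 * separation
    return mid + side, mid - side


# A sound panned hard left: full level in the left channel, none in the right.
print(narrow_stereo(1.0, 0.0, separation=1.0))  # (1.0, 0.0)
print(narrow_stereo(1.0, 0.0, separation=0.5))  # (0.75, 0.25)
print(narrow_stereo(1.0, 0.0, separation=0.0))  # (0.5, 0.5)
```

At 50% a hard-left sound ends up as 0.75 in the left channel and 0.25 in the right, in other words a 3:1 mix of the original channels.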
In the End song there's the same stereo expansion weirdness, but also an interesting effect is heard in-game: the synth sound has more airiness to it, and it's also constantly playing, as if there was some reverb to it. The OpenMPT output sounds quite different.
With the End song another phenomenon is observed: in the in-game recording the song is approximately 0.8% faster than in any tracker I tried the song with. This actually happened with every OMF 2097 song except for Menu and Stadium, so 5/7 of the in-game music is played at a different speed when using Gravis UltraSound.
What is "original sound"?
The stereo expansion and playback speed differences are not the only things one might encounter when going down the rabbit hole. I did some testing with Scream Tracker using the desert arena soundtrack from OMF 2097. The recordings are from DOSBox output. The GUS version sounds very similar to how it sounds in-game, except for the speed difference. The Sound Blaster Pro version not only sounds worse, but has the stereo channels mixed.
The Desert, ST3, GUS.
The Desert, ST3, Sound Blaster Pro.
I then tested how different sound cards behaved in-game. I used the Ultra High Quality setting in OMF 2097 setup and tested the same desert arena music with GUS, Sound Blaster 16 and Sound Blaster Pro. There wasn't much difference between the Sound Blasters, even though one is a 16-bit card with a 44.1 kHz sampling rate and the other has specs half of those. The reason is probably that the samples used in the modules are of such low quality. In fact, in terms of technical audio quality many games even in the latter half of the 90s weren't even close to CD quality. For example, in Duke Nukem 3D, released in 1996, the samples are 8-bit, 11025 Hz. A lot of times the high-frequency aliasing artefacts were part of the "better sound". But, even if the samples weren't of that good quality, the end result was usually better when the final mix was closer to CD quality.
The Desert, in-game, GUS.
The Desert, in-game, SB 16.
The Desert, in-game, SB Pro.
Emulation versus hardware
I've been aware that emulation doesn't always provide the real deal, and testing things out now makes it seem even more so. I was intrigued by the stereo panning and speed differences in OMF 2097 and did some more testing. I created a stereo test sample WAV and downloaded Cubic Player for DOS. The surprise was quite huge when I realized that DOSBox has a buggy GUS emulation: the sample was played in mono. Now I don't know what exactly determines how a sound is played out, since the module files clearly had 50% of their stereo effect left, but in any case something was wrong.
Doing a quick search I found this Reddit thread where someone was wondering why GUS was always mono in DOSBox. A reply suggested it's a known bug and there's a fork of DOSBox addressing the issue. So I went to GitHub and cloned DOSBox-X. Surprisingly, it compiled without any errors and I got to test things out. Lo and behold, stereo in Gravis UltraSound worked, and the stereo expansion in OMF 2097 was wide again. It was all just because of buggy DOSBox GUS emulation.
However, the speed difference still remained. I once had a real Gravis UltraSound MAX in my old AMD 486 DX2 66MHz computer I inherited from my big brother, but unfortunately I was silly enough to get rid of them at some point. However, I found this very prolific, old YouTube channel Teppica, which has a lot of original game music recorded apparently using the real hardware. Checking against those, the speed difference is present there also, so it seems OMF 2097 indeed just plays some songs faster when using GUS.
This brings more flavours to the question: what is "original" or "the way it is meant to sound"? Also, going back to the stereo expansion thing, a very radical stereo expansion isn't actually that pleasant to listen to for long - especially when using headphones - so funnily enough, that DOSBox bug actually made the music sound better in a way.
Towards the CD era
Sometimes the best listening experience is achieved with a separate CD soundtrack. Unlike the History of Unreal music CD I mentioned in the Interpolation section, sometimes the CD soundtracks have their own unique song versions. Let's take a look at Deus Ex: Game of the Year soundtrack. In the comparison are the first 25 seconds of the Unatco theme and how it sounds in-game (or playing the module) versus the CD soundtrack.
Unatco, CD soundtrack.
The CD version is not only cleaner and of higher quality overall, it also has some added things like the cool deep bass hits, making the CD version the go-to version. However, sometimes a technically better quality CD version changes things too much, and not always for the better. I remember the first time I listened to the Deus Ex CD soundtrack and was eagerly waiting for a certain cool, trancy section in the title score, only to be a bit disappointed it was remixed and removed. The section I'm talking about is in the samples.
Deus Ex Title, in-game/module.
Deus Ex Title, CD soundtrack.
On a side note, now is a good time to reinstall Deus Ex if you haven't done so already for some reason.
I started to study this whole module conversion thing with the thought that I would find a single way that would work with all the situations and could be considered "the best". Along the way, however, I stumbled upon a lot of exceptions to the rules and started to truly think about the question how to define "the best", or "original" even. I realized that using the original tools - tracker software in this case - you are probably far from the best technical result. Even using real hardware might introduce phenomena one might consider errors, even if Gravis UltraSound would probably be your go-to option. The only definitive thing to say is that there is no clear "best way" when converting a module to bitstream. What is original and sounds best is a question that should be approached case-by-case, subjectively.
However, this doesn't prevent us from making educated guesses. As it turns out, the modern OpenMPT is technically much superior to old software. It offers so much leeway for fine-tuning even per channel or sample that one could easily spend weeks just fiddling with a single track. From a technical point of view, however, we can make a good initial guess for reasonable module conversion settings. So, no matter what type of module we are talking about, as a starting point I recommend exporting the music into a WAV using OpenMPT with the following settings:
- Default dithering, if exporting in 16-bit. If amplifying with external applications, use higher bit formats.
- Cubic spline interpolation. I simply think this sounded the best even though the technically superior 8-tap filters exist.
- Volume ramping 363 µs (16 samples) up, 952 µs (42 samples) down. These are the OpenMPT defaults. Sometimes more up ramping is necessary if a song has really bad samples, in which case 42 samples is a good value.
- If stereo effects hurt your head, they might actually be off, or need some subjective tweaking. But start with the default 100%.
With the above setup, the result will sound detailed and punchy, without most problems of tracker music. I'm actually quite happy to say the values are almost the default ones, so I trust the OpenMPT developers really know what they are doing. I wasn't a fan of interpolation back in the day, but good quality interpolation will usually make music easier on the ears in the long run, and also seems to be something the original composers quite often would have liked to have. Of course, the optimal solution is to just use high-quality samples in the first place, skipping the need for interpolation altogether.
Since Diversions Entertainment released One Must Fall 2097 as freeware already in 1999, I felt it's fine to upload the music to YouTube, using the wisdom gathered during the writing of this article. There are already plenty of uploads of the soundtrack, but most seem to be remixes or low-quality. I recommend checking out the Teppica YouTube channel for that real hardware sound - my versions aren't trying to sound 100% like real hardware, but simply good for everyday listening. I tested a LADSPA Narrower plugin, but eventually just used 50% stereo expansion and the settings I otherwise recommend. I did the final format conversion in Audacity, using shaped dither. The links are below, enjoy.
One Must Fall 2097 soundtrack:
2020-05-11 update: case Unreal Tournament soundtrack
I decided to create myself the best possible version of Unreal Tournament soundtrack using the wisdom gained writing the article. The OpenMPT settings I used were:
- Volume ramping 363 µs (16 samples) up, 952 µs (42 samples) down. For the Fire Breath track, I used 42 samples up as in the article.
- Export to floating-point WAV for possible gain tweaking and external dithering.
I decided to use SoX for converting the audio files, as it is a nice command-line tool and as such is easy to use for batch processing.
More about dithering
The SoX documentation has a nice page about noise-shaping algorithms and how they affect the spectrograms. From the spectrograms it's easy to see, without even hearing, how noise shaping would affect the audible end results.
The documentation also provides in a single image a comparison of different noise shaping algorithms. Based on that I chose to use the Gesemann algorithm for the Unreal Tournament soundtrack.
More about amplification
With everything set, I ran the following command to convert the OpenMPT WAV exports into CD audio format:
ls *.wav | xargs -I % sox -D % -b 16 final/% dither -f gesemann
The -D parameter disables automatic dithering; instead I manually set the Gesemann noise-shaped dithering. However, this command resulted in the following warnings:
sox WARN sox: `Credits.wav' input clipped 1 samples
sox WARN sox: `Ending.wav' input clipped 2763 samples
sox WARN sox: `firebr.wav' input clipped 62 samples
sox WARN sox: `Godown.wav' input clipped 203 samples
sox WARN sox: `Lock.wav' input clipped 2 samples
sox WARN sox: `Organic.wav' input clipped 382 samples
sox WARN sox: `Run.wav' input clipped 2 samples
sox WARN sox: `Savemeg.wav' input clipped 115 samples
sox WARN sox: `SaveMe.wav' input clipped 183 samples
So we see that, even with floating-point WAV files, the output of OpenMPT may in fact clip. To fix this I re-exported the nine files but manually lowered the volume per file until the results did not clip anymore. Of course a few clipped samples is an extremely theoretical problem, but I still wanted to see how everything would work if done perfectly.
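Instead of lowering the volume by trial and error, the required attenuation can be computed from the peak value. A small Python sketch, assuming the floating-point samples have already been loaded into a list (reading the WAV file is left out, and the function name is mine):

```python
import math


def clip_report(samples):
    """Return (clipped sample count, peak, attenuation in dB needed to avoid clipping)."""
    clipped = sum(1 for s in samples if abs(s) > 1.0)
    peak = max(abs(s) for s in samples)
    attenuation_db = 20 * math.log10(peak) if peak > 1.0 else 0.0
    return clipped, peak, attenuation_db


# Made-up floating-point samples where two values exceed full scale.
clipped, peak, attenuation_db = clip_report([0.5, 1.2, -1.1, 0.9])
print(clipped, peak, round(attenuation_db, 2))  # 2 1.2 1.58
```

Lowering the export volume by at least the reported number of decibels keeps every sample within full scale before the 16-bit conversion.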
Eventually I ended up with the final WAV files, which I converted to FLAC and tagged. As usual, I put the files through a ReplayGain scanner. The end result is seen in the picture. I'd say a peak of 0.9992 with less than a decibel of correction is a very good and dynamic result.
Today I listened to the soundtrack through at least three times. It just never gets old, and I'm happy to have it in such a high quality now.
What next - Deus Ex soundtrack maybe?