I'm using this library I found to read a midi file
As there is very little documentation, I have no idea how to read the output object?
question: What does: Channel, data, deltaTime and type keys mean?
In the end I would love to map this js object to some kind of visualization.
Channel: The MIDI format uses the concept of channels to allow different MIDI devices to only listen to specific MIDI events by listening to such a channel. This makes it possible to use a single MIDI file for multiple instruments that should play different notes, etc. So when you have a note on event you should check the channel of the event and only play the instruments that are interested in events that happen in this channel.
Data: Data is a bit arbitrary, but in your example we have an event of type 255 (0xFF) which is a meta event. It has a meta type of 3 (0x03) which means it's a Sequence/Track-name. This was probably assigned by the program that created the MIDI file you use. There's a pretty nifty and concise list of events here: http://www.ccarh.org/courses/253/handout/smf/
deltaTime: Since events in the MIDI file is tempo agnostic it uses the concept of ticks. It's basically a resolution expressed as ticks per quarter note. I think 480 ticks per quarter note is pretty standard, though that is purely based on my own experience, so YMMV. Events can then either be expressed in absolute time (ie. this note on events happens 4800 ticks from the start of the track) or delta time. Delta time is the number of ticks since the last MIDI event happened.
Type: Each MIDI event in a MIDI file has a type to identify what kind of an event it is. This matters since different types of events has different formats (and thus changes the way we decode it, since MIDI is a binary format), where some have a fixed length and others include information on how long the event is (the number of bytes that make up the event).
It's been a couple of years since I last worked with the MIDI format, but I think the above is accurate.
Related
I'd like to build a very low-level audio output node using the web audio API's AudioWorkletProcessor. It seems like all of the examples out there implement processors that transform input samples to output samples, but what I'd like to do is produce output only, all based on the timestamp of the samples.
According to MDN, BaseAudioContext.currentTime is not a precise source for this timestamp:
To offer protection against timing attacks and fingerprinting, the precision of audioCtx.currentTime might get rounded depending on browser settings.
In Firefox, the privacy.reduceTimerPrecision preference is enabled by default and defaults to 20us in Firefox 59; in 60 it will be 2ms.
One hacky solution may be to use BaseAudioContext.sampleRate, a running counter, and the size of the output arrays, but that only works if we can assume that every sample is computed without drops and that everything's computed in order, and I'm not sure if those are valid assumptions.
Within a processing frame, is there a reliable way to know the timestamp which correlates to given sample index?
Inside of the AudioWorkletGlobalScope you have access to the currentTime as well as to the currentFrame. They are both available as globals.
https://webaudio.github.io/web-audio-api/#AudioWorkletGlobalScope-attributes
As far as I know they are accurate in any browser which supports the AudioWorklet.
I'm working about audio but I'm a newbie in this area. I would like to matching sound from microphone to my source audio(just only 1 sound) like Coke Ads from Shazam. Example Video (0.45 minute) However, I want to make it on website by JavaScript. Thank you.
Building something similar to the backend of Shazam is not an easy task. We need to:
Acquire audio from the user's microphone (easy)
Compare it to the source and identify a match (hmm... how do... )
How can we perform each step?
Aquire Audio
This one is a definite no biggy. We can use the Web Audio API for this. You can google around for good tutorials on how to use it. This link provides some good fundametal knowledge that you may want to understand when using it.
Compare Samples to Audio Source File
Clearly this piece is going to be an algorithmic challenge in a project like this. There are probably various ways to approach this part, and not enough time to describe them all here, but one feasible technique (which happens to be what Shazam actually uses), and which is also described in greater detail here, is to create and compare against a sort of fingerprint for smaller pieces of your source material, which you can generate using FFT analysis.
This works as follows:
Look at small sections of a sample no more than a few seconds long (note that this is done using a sliding window, not discrete partitioning) at a time
Calculate the Fourier Transform of the audio selection. This decomposes our selection into many signals of different frequencies. We can analyze the frequency domain of our sample to draw useful conclusions about what we are hearing.
Create a fingerprint for the selection by identifying critical values in the FFT, such as peak frequencies or magnitudes
If you want to be able to match multiple samples like Shazam does, you should maintain a dictionary of fingerprints, but since you only need to match one source material, you can just maintain them in a list. Since your keys are going to be an array of numerical values, I propose that another possible data structure to quickly query your dataset would be a k-d tree. I don't think Shazam uses one, but the more I think about it, the closer their system seems to an n-dimensional nearest neighbor search, if you can keep the amount of critical points consistent. For now though, just keep it simple, use a list.
Now we have a database of fingerprints primed and ready for use. We need to compare them against our microphone input now.
Sample our microphone input in small segments with a sliding window, the same way we did our sources.
For each segment, calculate the fingerprint, and see if it matches close to any from storage. You can look for a partial match here and there are lots of tweaks and optimizations you could try.
This is going to be a noisy and inaccurate signal so don't expect every segment to get a match. If lots of them are getting a match (you will have to figure out what lots means experimentally), then assume you have one. If there are relatively few matches, then figure you don't.
Conclusions
This is not going to be an super easy project to do well. The amount of tuning and optimization required will prove to be a challenge. Some microphones are inaccurate, and most environments have other sounds, and all of that will mess with your results, but it's also probably not as bad as it sounds. I mean, this is a system that from the outside seems unapproachably complex, and we just broke it down into some relatively simple steps.
Also as a final note, you mention Javascript several times in your post, and you may notice that I mentioned it zero times up until now in my answer, and that's because language of implementation is not an important factor. This system is complex enough that the hardest pieces to the puzzle are going to be the ones you solve on paper, so you don't need to think in terms of "how can I do X in Y", just figure out an algorithm for X, and the Y should come naturally.
Say there are 20 segments in an MPEG-DASH stream, and the stream typically starts at index 0. Is it possible to start at index 13, assuming an init file/byte sequence has already been queued into the Media Source buffer? An example, in which this use-case would be practical, is for something like Netflix' resumption feature – where someone could continue streaming on another device/browser. (Presumably with the same init data as when started from the beginning.)
My only thought is that my assumption is wrong, and there would be a different initialization chunk for each various point in which the media could be paused… but that would just be silly… right?
The simple answer is that yes it is possible and as you suggest this can be used for resume playback features. It can also be used for 'start-over' on live streams and to jump forwards or backwards to a particular point in a video.
MPEG DASH supports two main file formats (or video container formats) - ISO Base Media File Format (ISOBMFF - which is often referred to as MP4 although it is strictly speaking a generalisation of MPEG-2) and MPEG-TS.
The MPEG DASH standard uses the concept of 'Periods' as one of its fundamental building blocks - the periods represents a part of the content stream and include a start time and a duration. To be able to playback the content in a given period you still need some initialisation data.
Looking at ISOBMFF, there is an init segment as you suggest which contains this required data and which is defined by W3C as:
Initialization Segment
A sequence of bytes that contain all of the initialization information required to decode a sequence of media segments. This includes codec initialization data, Track ID mappings for multiplexed segments, and timestamp offsets (e.g., edit lists).
I'm trying to determine the fundamental frequency of an input signal (from a tone generator, or possibly a musical instrument) using JavaScript's WebAudio API, along with some other SO articles (How to get frequency from fft result?, How do I obtain the frequencies of each value in an FFT?), but I can only seem to determine a frequency within about 5-10Hz.
I'm testing using a signal generator app for iPad positioned next to a high-quality microphone in a quiet room.
For 500Hz, it consistently returns 506; for 600Hz, I get 592. 1kHz is 1001Hz, but 2kHz is 1991. I've also noticed that I can modulate the frequency by 5Hz (at 1Hz increments) before seeing any change in data from the FFT. And I've only been able to get this accurate by averaging together the two highest bins.
Does this mean that there's not enough resolution in the FFT data to accurately determine the fundamental frequency within 1Hz, or have I gone about it the wrong way?
I've tried using both the native FFT libs (like this, for example):
var fFrequencyData = new Float32Array(analyser.frequencyBinCount);
analyser.getFloatFrequencyData(fFrequencyData);
(you can assume I've properly initialized and connected an Analyser node), which only shows a sensitivity/resolution of about
and also using DSP.js' FFT lib, like this:
var fft = new FFT();
fft.forward(e.inputBuffer.getChannelData(0));
fFrequencyData = fft.spectrum;
where e is the event object passed to onaudioprocess.
I seem to have a problem with FFT data (like FFT::spectrum) being null - is that normal? The data from fft.spectrum is naught unless I run it through analyser.getFloatFrequencyData, which I'm thinking overwrites the data with stuff from the native FFT, defeating the purpose entirely?
Hopefully someone out there can help steer me in the right direction - thanks! :)
You would need a very large FFT to get high-quality pitch detection this way. To get a 1Hz resolution, for example, with a 44,100kHz sample rate, you would need a 64k(-ish) FFT size. You're far better off using autocorrelation to do monophonic pitch detection.
I've been working on a tool to transcribe recordings of speech with Javascript. Basically I'm hooking up key events to play, pause, and loop a file read in with the audio tag.
There are a number of advanced existing desktop apps for doing this sort of thing (such as Transcriber -- here's a screenshot). Most transcription tools have a built-in waveform that can be used to jump around the audio file, which is very helpful because the transcriber can learn to visually find and repeat or loop phrases.
I'm wondering if it's possible to emulate a subset of this functionality in the browser, with Javascript. I don't know much about signal processing, perhaps it's not even feasible.
But what I envision is Javascript reading the sound stream from the file, and periodically sampling the amplitude. If the amplitude is very low for longer than a certain threshhold of time, then that would be labled as a phrase break.
Such labeling, I think, would be very useful for transcription. I could then set up key commands to jump to the previous period of silence. So hypothetically (imagining a jQuery-based API):
var audio = $('audio#someid');
var silences = silenceFindingVoodoo(audio);
silences will then contain a list of times, so I can hook up some way to let the user jump around through the various silences, and then set the currentTime to a chosen value, and play it.
Is it even conceivable to do this sort of thing with Javascript?
Yes it's possible with Web Audio API, to be more precise you will need AnalyserNode. To give you a short proof of concept you can get this example, and add following code to drawTimeDomain():
var threshold = 1000;
var sum = 0;
for (var i in amplitudeArray) {
sum += Math.abs(128 - amplitudeArray[i]);
}
var test = (sum < threshold) ? 'silent' : 'sound';
console.log('silent info', test);
You will just need a additional logic to filter silent by milliseconds (e.g. any silent taking more than 500 ms should be seen as real silent )
I think this is possible using javascript (although maybe not advisable, of course). This article:
https://developer.mozilla.org/En/Using_XMLHttpRequest#Handling_binary_data
... discusses how to access files as binary data, and once you have the audio file as binary data you could do whatever you like with it (I guess, anyway - I'm not real strong with javascript). With audio files in WAV format, this would be a trivial exercise, since the data is already organized by samples in the time domain. With audio files in a compressed format (like MP3), transforming the compressed data back into time-domain samples would be so insanely difficult to do in javascript that I would found a religion around you if you managed to do it successfully.
Update: after reading your question again, I realized that it might actually be possible to do what you're discussing in javascript, even if the files are in MP3 format and not WAV format. As I understand your question, you're actually just looking to locate points of silence within the audio stream, as opposed to actually stripping out the silent stretches.
To locate the silent stretches, you wouldn't necessarily need to convert the frequency-domain data of an MP3 file back into the time-domain of a WAV file. In fact, identifying quiet stretches in audio can actually be done more reliably in the frequency domain than in the time domain. Quiet stretches tend to have a distinctively flat frequency response graph, whereas in the time domain the peak amplitudes of audible speech are sometimes not much higher than the peaks of background noise, especially if auto-leveling is occurring.
Analyzing an MP3 file in javascript would be significantly easier if the file were CBR (constant bit rate) instead of VBR (variable bit rate).
As far as I know, JavaScript is not powerful enough to do this.
You'll have to resort to flash, or some sort of server side processing to do this.
With the HTML5 audio/video tags, you might be able to trick the page into doing something like this. You could (hypothetically) identify silences server-side and send the timestamps of those silences to the client as meta data in the page (hidden fields or something) and then use that to allow JavaScript to identify those spots in the audio file.
If you use WebWorker threads you may be able to do this in Javascript, but that would require using more threads in the browser to do this. You could break up the problem into multiple threads and process it, but, it would be all but impossible to synchronize this with the playback. So, Javascript can determine the silent periods, by doing some audio processing, but since you can't link that to the playback well it would not be the best choice.
But, if you wanted to show the waveforms to the user then javascript and canvas can be used for this, but then see the next paragraph for the streaming.
Your best bet would be to have the server stream the audio and it can do the processing and find all the silences. Each of these should then be saved in a separate file, so that you can easily jump between the silences, and by streaming, your server app can determine when to load up the new file, so there isn't a break.
I don't think JavaScript is the tool you want to use to use for processing those audio files - that's asking for trouble. However, javascript could easily read a corresponding XML file which describes where those silences occur in the audio file, adjusting the user interface appropriately. Then, the question is what do you use to generate those XML files:
You can do it manually if you need to demo the capability right away. (Use audacity to see where those audio envelopes occur)
Check out this CodeProject article, which creates a wav processing library in C#. The author has created a function to extract silence from the input file. Probably a good place to start hacking.
Just two of my initial thoughts ... There are ALOT of audio processing APIs out there, but they are written for particular frameworks and application programming languages. Definitely make use of them before trying to write something from scratch ... unless you happen to really love fourier transforms.