Compare sound between source and microphone in JavaScript

Compare sound between source and microphone in JavaScript - javascript

I'm working about audio but I'm a newbie in this area. I would like to matching sound from microphone to my source audio(just only 1 sound) like Coke Ads from Shazam. Example Video (0.45 minute) However, I want to make it on website by JavaScript. Thank you.

Building something similar to the backend of Shazam is not an easy task. We need to:
Acquire audio from the user's microphone (easy)
Compare it to the source and identify a match (hmm... how do... )
How can we perform each step?
Aquire Audio
This one is a definite no biggy. We can use the Web Audio API for this. You can google around for good tutorials on how to use it. This link provides some good fundametal knowledge that you may want to understand when using it.
Compare Samples to Audio Source File
Clearly this piece is going to be an algorithmic challenge in a project like this. There are probably various ways to approach this part, and not enough time to describe them all here, but one feasible technique (which happens to be what Shazam actually uses), and which is also described in greater detail here, is to create and compare against a sort of fingerprint for smaller pieces of your source material, which you can generate using FFT analysis.
This works as follows:
Look at small sections of a sample no more than a few seconds long (note that this is done using a sliding window, not discrete partitioning) at a time
Calculate the Fourier Transform of the audio selection. This decomposes our selection into many signals of different frequencies. We can analyze the frequency domain of our sample to draw useful conclusions about what we are hearing.
Create a fingerprint for the selection by identifying critical values in the FFT, such as peak frequencies or magnitudes
If you want to be able to match multiple samples like Shazam does, you should maintain a dictionary of fingerprints, but since you only need to match one source material, you can just maintain them in a list. Since your keys are going to be an array of numerical values, I propose that another possible data structure to quickly query your dataset would be a k-d tree. I don't think Shazam uses one, but the more I think about it, the closer their system seems to an n-dimensional nearest neighbor search, if you can keep the amount of critical points consistent. For now though, just keep it simple, use a list.
Now we have a database of fingerprints primed and ready for use. We need to compare them against our microphone input now.
Sample our microphone input in small segments with a sliding window, the same way we did our sources.
For each segment, calculate the fingerprint, and see if it matches close to any from storage. You can look for a partial match here and there are lots of tweaks and optimizations you could try.
This is going to be a noisy and inaccurate signal so don't expect every segment to get a match. If lots of them are getting a match (you will have to figure out what lots means experimentally), then assume you have one. If there are relatively few matches, then figure you don't.
Conclusions
This is not going to be an super easy project to do well. The amount of tuning and optimization required will prove to be a challenge. Some microphones are inaccurate, and most environments have other sounds, and all of that will mess with your results, but it's also probably not as bad as it sounds. I mean, this is a system that from the outside seems unapproachably complex, and we just broke it down into some relatively simple steps.
Also as a final note, you mention Javascript several times in your post, and you may notice that I mentioned it zero times up until now in my answer, and that's because language of implementation is not an important factor. This system is complex enough that the hardest pieces to the puzzle are going to be the ones you solve on paper, so you don't need to think in terms of "how can I do X in Y", just figure out an algorithm for X, and the Y should come naturally.

Related

How can you avoid false positives with music identifying algorithms?

I’m a music producer/composer who will be submitting works to new music libraries. In some cases, I’d like to use previous projects as a starting point. So while the result will be new, unique compositions, I want to avoid a scenario where an algorithm might mistake a new song (or song segment) for a previous work.
I’d like to develop some rules of thumb to keep in mind to ensure this doesn’t happen. Specifically, to understand more about how music identifying algorithms work and what combination of parameters need to be different - and to what degrees they need to be different - so as to avoid creating false positive identifications against my other works.
For example:
Imagine “song a” is part of “library a”. Then I create “song b” for “library b”. The arrangement is similar, same instruments are used, same tempo, same key, and mix is essentially the same. But the chord progression and melody are different, though a similar vibe. Could that trigger a false positive?
Or a scenario like above where maybe the instrumentation is similar, but also using some alternate voices (Like an alternate synth patch for the baseline, and similar but different percusssion samples). New key, and a speed increase of 5 bpm. Is that enough to differentiate?
Or imagine a scenario where the bulk of the track is significantly different for all parameters, including a new tempo and key, except there is a 20 second break in the middle that resembles a previous work: an ambient tonal bed with light percussion. The same tonal bed is used, but in the new key and tempo, and the percussion is close to the same. Then a user uses only those 20 seconds in a video. How different would those 20 seconds need to be from the original, and across what parameters, to avoid a false positive?
These examples are just thought experiments to try and understand how it all works. I imagine any new compositions I make should easily be adequately different from previous compositions, and the cumulative differences would easily extend beyond tenants listed in above scenarios.
But given the fact that there are some parameters that could be very similar…(even just from a mix perspective and instruments used), I would like to develop a deeper understanding of what gets analyzed. And consequently, what sort of differences I should ensure remain constant - because it seems to me even 20 seconds of enough similarity could trigger a potential issue.
Thanks!
Ps:
Note I welcome any insight offered, and am certainly receptive to the answer being couched in coding language…this is stack exchange after all, and it could be pretty interesting. But at the end of the day, I’m not a coder (though i am coding curious), and need to translate any clarity offered into practical considerations that could be employed from a music production POV. Which is to say, if it’s easy enough to include some language/concepts with that in mind, I’d be very grateful. Parameters like: tempo, key, chord progressions, rhythm elements, frequency considerations, sounds used, overall mix, etc etc. Thanks again!

attempting to actually answer the question, despite the discussion in the comments, I happen to know of the existence of this video by computerphile. at least some of the music matching algorithms out in the wild must be based on that.
P.S. linked is How Shazam Works (Probably!) featuring David Domminney Fowler. I barely remember the details of the video, except its existence, which is why the answer is so bad. edits are welcome.

Web Audio API - Get correct frequency

I am using the Web Audio API to get the frequency of the sound, which is coming from the microphone. For this I found some useful code on this github repo: https://gist.github.com/giraj/250decbbc50ce091f79e .
Now my problem is, that I am getting a lot of different frequencies, for only one little sound. This sound might be from my voice, or from an instrument.
These frequencies are like between 90 and 4000Hz. But as I know, one note of a human voice or from an instrument, only can have one single frequency amount in Hz. And I am pretty sure, that I am only playing one single tone.
So how can I know, which frequency of the 3 or 4 frequencies per tone is the one I am searching. I need this value, to recognize musical notes like C, D, E from their frequencies. I hope this question isn't off-topic, because I really tried hard to find a solution and I don't know if this is a solveable issue from the API itself, or if I have to eliminate some frequencies somehow. I would appreciate any kind of help.
Edit: And I want to add, that I never reach the same values of the notes as listed in this frequency list: http://www.phy.mtu.edu/~suits/notefreqs.html . I am using a piano app, which is giving always the correct frequencies on frequency apps on the play store. So I even doubt the results I am receiving.

I've been messing with the same question and have some interesting partial answers. This website http://www.phy.mtu.edu/~suits/Physicsofmusic.html has a huge amount of information explain music in math terms and is super helpful.
I wrote something that uses a web audio analyser and simply buckets the fft results into bins by musical pitch - it gives you a graphic of what the fft results are and is kind of indicative of what frequencies are actually in the sound. It's at https://aerik.github.io/NoteDetector.htm.
After I started with that I found another guy's code that uses "auto-correlation" to detect the fundamental. This might be closer to what you're looking for: https://github.com/cwilso/PitchDetect The problem I'm having with that is it that, while it works well for fairly pure tones, it still has a lot of noise.
I'm thinking of combining his approach with mine by comparing the autocorrelation result with the signal strength from the fft.
It's a fun project, but I don't think there are any simple answers.

I'm a professional singer, pianist, and voice teacher transitioning into code, so I think I can speak to some of the confusing results you're getting here.
Bottom line: you are actually producing many different frequencies at the same time when you sing or play a note on an instrument, so chances are the results you're seeing are accurate. What you're aiming for, however, is almost certainly the fundamental pitch, which is the lowest one.
Longer, more complex answer with physics: Unless you're looking at a sine wave (sounds like a mechanical beep, and won't come out of a decent musical instrument), the sound you're hearing probably contains many different frequencies. The sound is made up of a fundamental pitch (the lowest frequency, and usually the one we're talking about when we name a pitch in music), and a whole lot of overtones (other, higher frequencies that make up the characteristic sound of an instrument, or for singers, even a vowel).
Let's pick a number that's easy to work with: imagine your fundamental pitch is 100hz. We'll call that C1 for convenience in discussing musical implications (though it's not actually a C), and the numbers represent octave leaps with octaves ranging from C up to B. You could potentially have overtones at any of the following pitches: 200hz (C2), 300hz (G2), 400hz (C3), 500hz (E3), 600hz (G3), 700hz (Bb3), 800hz (C4), 900 hz (D4), 1000hz (E4), etc. Different instruments might make some overtones pop out more than others, or skip some of these entirely (many will skip every other overtone), but all the overtones will be within this pattern.
Notice that all the overtones are multiples of the fundamental. That means you can use the pattern in all those other pitches you see to figure out the fundamental pitch underneath. From a musical standpoint, you might also notice that the pitches you see first in this overtone series are the ones we consider most consonant — octaves, perfect 5ths, major 3rds, major triads. This is not a coincidence, and the way the overtones line up with these other pitches is almost certainly why we find them pretty to listen to.
Boiling all this down to how you'd determine the fundamental pitch given a series of overtones presumably resulting from the same fundamental: you're essentially looking for the greatest common factor of the various frequencies you'll see. It is probably also the lowest frequency you detected, but be careful with this heuristic, because you may have unrelated noise in your signal. Anything that doesn't fall into your nice list of multiples is probably noise.
All this gets much more complicated, of course, when you play more than one (fundamental) pitch at once. I'm pondering chord detection myself, and found your question while looking for what people have already done in this area and how I can build on it.

Client side search engine optimization

Due to the reasons outlined in this question I am building my own client side search engine rather than using the ydn-full-text library which is based on fullproof. What it boils down to is that fullproof spawns "too freaking many records" in the order of 300.000 records whilst (after stemming) there are only about 7700 unique words. So my 'theory' is that fullproof is based on traditional assumptions which only apply to the server side:
Huge indices are fine
Processor power is expensive
(and the assumption of dealing with longer records which is just applicable to my case as my records are on average 24 words only1)
Whereas on the client side:
Huge indices take ages to populate
Processing power is still limited, but relatively cheaper than on the server side
Based on these assumptions I started of with an elementary inverted index (giving just 7700 records as IndexedDB is a document/nosql database). This inverted index has been stemmed using the Lancaster stemmer (most aggressive one of the two or three popular ones) and during a search I would retrieve the index for each of the words, assign a score based on overlap of the different indices and on similarity of typed word vs original (Jaro-Winkler distance).
Problem of this approach:
Combination of "popular_word + popular_word" is extremely expensive
So, finally getting to my question: How can I alleviate the above problem with a minimal growth of the index? I do understand that my approach will be CPU intensive, but as a traditional full text search index seems unusably big this seems to be the only reasonable road to go down on. (Pointing me to good resources or works is also appreciated)
1 This is a more or less artificial splitting of unstructured texts into small segments, however this artificial splitting is standardized in the relevant field so has been used here as well. I have not studied the effect on the index size of keeping these 'snippets' together and throwing huge chunks of texts at fullproof. I assume that this would not make a huge difference, but if I am mistaken then please do point this out.

This is a great question, thanks for bringing some quality to the IndexedDB tag.
While this answer isn't quite production ready, I wanted to let you know that if you launch Chrome with --enable-experimental-web-platform-features then there should be a couple features available that might help you achieve what you're looking to do.
IDBObjectStore.openKeyCursor() - value-free cursors, in case you can get away with the stem only
IDBCursor.continuePrimaryKey(key, primaryKey) - allows you to skip over items with the same key
I was informed of these via an IDB developer on the Chrome team and while I've yet to experiment with them myself this seems like the perfect use case.
My thought is that if you approach this problem with two different indexes on the same column, you might be able to get that join-like behavior you're looking for without bloating your stores with gratuitous indexes.
While consecutive writes are pretty terrible in IDB, reads are great. Good performance across 7700 entries should be quite tenable.

Mass astar pathfinding

I'm trying to create a tower defence game in Javascript.
It's all going well apart from the pathfinding..
I'm using the astar code from this website: http://www.briangrinstead.com/blog/astar-search-algorithm-in-javascript which uses a binary heap (which I believe is fairly optimal)
The problem i'm having is I want to allow people to block the path of the "attackers". This means that each "attacker" needs to be able to find its way to the exit on its own (as someone could just cut off a single "attacker" and it would need to find its own way to the exit). Now 5/6 attackers can pathfind at any one time with no issue. But say the path is blocked for 10+ attackers, all 10 of them will need to fire its pathfinding script at the same time which just drops the FPS to about 1/2 per sec.
This must be a common problem for anyone who has a lot of entities pathfinding at anyone time, so I imagine there must be a better way than my approach.
So my question is: What is the best way to implement mass pathfinding algorithm to multiple "bots" in the most efficient way.
Thanks,
James

Use Anti-objects, this is the only way to get cheap pathfinding, afaik :
http://www.cs.colorado.edu/~ralex/papers/PDF/OOPSLA06antiobjects.pdf
Anti-object basically mean that instead of bots having individual ai, you will have one "swarm ai", which is bound to your game map.
p.s.: Here is another link about pathfinding in general (possibly the best online reference available):
http://theory.stanford.edu/~amitp/GameProgramming/index.html

Just cache the result.
Store the path as the value in a hash table (object), give each node a UUID, concatenate the UUIDs to form a unique hash table key and insert the path into it.
When you retrieve the path back out of the hash table, walk the path, and see if it's still valid, if not, recalculate and insert the new one back in.
There are many optimization that you can do :)
Like c69 said swarm AI or hive mind come to mind :P

Handling combat effects in game development

I'm trying to nut out a highlevel tech spec for a game I'm tinkering with as a personal project. It's a turn based adventure game that's probably closest to Archon in terms of what I'm trying to do.
What I'm having trouble with is conceptualising the best way to develop a combat system that I can implement simply at first, but that will allow expansion and complexity to be added in the future.
Specifically I'm having trouble trying to figure out how to handle combat special effects, that is, bonuses or negatives that may be applied or removed by an actor, an item or an environment.
Do I have the actor handle all effects that are in play for/against them should the game itself check each weapon, armour, actor and location each time it tries to make a decisive roll.
Are effects handled in individual objects or is there an 'effect' object or a bit of both?
I may well have not explained myself at all well here, and I'm more than happy to try and expand the question if my request is simply too broad and airy. But my intial thinking is that smarter people than me have spent the time and effort in figuring things like this out and frankly I don't want to taint the conversation with the cul-de-sac of my own stupidity too early.
The language in question is javascript, although at this point I don't imagine it makes a great difference.

What you're calling 'special effects' used to be called 'modifiers' but nowadays go by the term popular in MMOs as 'buffs'. Handling these is as easy or as difficult as you want it to be, given that you get to choose how much versatility you want to be able to bestow at each stage.
Fundamentally though, each aspect of the system typically stores a list of the modifiers that apply to it, and you can query them on demand. Typically there are only a handful of modifiers that apply to any one player at any given time so it's not a problem - take the player's statistics and any modifiers imparted by skills/spells/whatever, add on any modifiers imparted by worn equipment, then add anything imparted by the weapon in question. If you come up with a standard interface here (eg. sumModifiersTo(attributeID)) that is used by actors, items, locations, etc., then implementing this can be quick and easy.
Typically the 'effect' objects would be contained within the entity they pertain to: actors have a list of effects, and the items they wear or use have their own list of effects. Where effects are explicitly activated and/or time-limited, it's up to you where you want to store them - eg. if you have magical potions or other consumables, their effects will need to be appended to the Actor rather than the (presumably destroyed) item.
Don't be tempted to try and have the effects modify actor attributes in-place, as you quickly find that it's easy for the attributes to 'drift' if you don't ensure all additions and removals are done following the correct protocol. It also makes it much harder to bypass certain modifiers later. eg. Imagine a magical shield that only protects against other magic - you can pass some sort of predicate to your modifier totalling function that disregards certain types of effect to do this.

Take a look at the book, Head First Design Patterns, by Elisabeth Freeman. Specifically, read up on the Decorator and Factory patterns and the method of programming to interfaces, not implementations. I found that book to be hugely effective in illustrating some of the complex concepts that may get you going on this.
Hope this helps to point you in the right direction.

At first blush I would say that the individual combatants (player and NPC) have a role in determining what their combat characteristics are (i.e. armor value, to-hit number, damage range, etc.) given all the modifiers that apply to that combatant. So then the combat system is not trying to figure out whether or not the character's class gives him/her an armor bonus, whether a magic weapon weighs in on the to hit, etc.
But I would expect the combat system itself to be outside of the individual combatants. That it would take information about an attacker and a desired type of attack and a target or set of targets and resolve that.
To me, that kind of model reflects how we actually ran combat in pencil and paper RPGs. The DM asked each player for the details of his or her character and then ran the combat using that information as the inputs. That it works in the real world suggests its a pretty flexible system.

Develop Reference

JavaScript is the programming language of the Web.