I do a lot of work in the music industry. One thing about the music industry - and really every industry - is that there's a lot of junk data out there.
Now, there are various ways that this manifests in the music industry. The obvious one is what I'll cover today - composition titles and associated filenames. Then we have composer names. And publisher names. This isn't even getting into the realm of metadata.
There are lots of places where we need to match up pieces of data that might not match with a simple string compare. There are also issues of missing data - audio files might not be present or lines may be missing from the sheet.
Managing and ingesting data of this kind requires fuzzy matching as part of a process that alerts the user to all of the ways that the data is problematic.
In Ruby, I use a gem called "fuzzy-string-match" which implements what's known as the Jaro-Winkler algorithm. (An alternative is the Levenshtein distance algorithm.)
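If you want to follow along, the setup is minimal. Note that the name you require drops the hyphens from the gem name:

# gem install fuzzy-string-match   (or add it to your Gemfile)
require 'fuzzystringmatch'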
As I said before, I'll cover the concept of composition titles and filenames. It's very common that I'll receive a group of files along with a spreadsheet of metadata. The trick is to match the lines in the spreadsheet with the associated file path.
Right now, I'm dealing with a dataset where the file path looks like this:
country/{sometimes genre/}album/filename.[mp3|wav]
It turns out that the "genre" - if present - may or may not match what's in the spreadsheet, so it'll have to be ignored.
The album column contains an "album number" consisting of a few letters to identify the publisher along with a serial number plus the title, so it's guaranteed to have some uniqueness.
The filename, though. May God help me. The filename might be any number of things:
- The composition title
- The album number/name plus the composition title
- The album number/name, the track number, and the composition title
- The track number and composition title
- (I'm not making this up) The album number/name, the track number, the track number again, and the composition title
For that last one, it looks like the composition title came from a source that already had the track number tacked on. This belief is supported by the fact that some of those files also have an extension of ".mp3.mp3".
I would also point out that the various components of this filename have delimiters between them, typically some combination of dashes, spaces, and underscores.
I'm not going to embarrass my client, if for no other reason than that their data is normal - no worse than any number of other datasets I've encountered. So consider this file path, which has some details changed:
Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3
So,
country: Canada
album: ZZZ 001_Very Nice Music Vol. 1
track number: 10
composition title: Fighter
Here's another one:
Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3
Note that in this one the album designation is missing "Vol. 1" in the filename. Plus we have an extraneous space. What can you trust here?
If you're thinking I should just dig out the album name and track number - good idea. Except, records still might not match up. It's important to me to have a bit of a double-check to make sure the titles are correct(ish) as well.
The general way to accomplish this:
- Create likely possible file paths for each row in the spreadsheet.
- Use the fuzzy matcher to determine if it's really close to one of the file paths
- If so, match it up and take that file path out of the pool
- Output a new sheet with a new "full file path" column
- Output separately a list of unmatched file paths
In this list, the second item is the meat of the algorithm. It's also where we can get the most gains from a correct implementation.
The Jaro-Winkler Algorithm
Let's talk about the Jaro-Winkler algorithm. To make it easy, let's look at examples using irb.
With the gem, computing a Jaro-Winkler score is a simple method call.
$ irb
3.3.6 :001 > require 'fuzzystringmatch'
=> true
3.3.6 :002 > fuzzy = FuzzyStringMatch::JaroWinkler.create(:pure)
=> #<FuzzyStringMatch::JaroWinklerPure:0x000000011d0c0120>
3.3.6 :003 > fuzzy.getDistance('string 1', 'string 2')
=> 0.975
3.3.6 :004 > fuzzy.getDistance('the same thing', 'the same thing')
=> 1.0
3.3.6 :005 > fuzzy.getDistance('a string like this', 'something different')
=> 0.5467836257309941
You can see here that the value "1.0" means equality, and that number diminishes to "0.0" as the strings diverge. I would note that the linked article shows the opposite, which makes sense given that it's a "distance", but as long as it's consistent we can work with whatever we get.
3.3.6 :006 > fuzzy.getDistance('monday', 'MONDAY')
=> 0.0
Ack. It's case-sensitive.
3.3.6 :007 > fuzzy.getDistance('a new day', 'a_new_day')
=> 0.8666666666666666
3.3.6 :008 > fuzzy.getDistance('a new day', 'a new day')
=> 0.8898071625344353
3.3.6 :009 > fuzzy.getDistance('a new day', 'a new day')
=> 0.8505369274600044
It cares about whitespace and non-alphanumeric characters as well - the last two examples have extra spaces in the second string, even if they're hard to see. But there's something to notice here: adding more spaces doesn't make the score drop by much.
Just from what I've shown here, we can come up with a few ideas to make the filename matches more likely:
- Remove all file extensions from the filenames, including duplicate extensions
- Force the file path to lowercase
- Remove all non-alphanumeric characters
- Compress multiple spaces down to a single space
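Rolled into a single throwaway method, that cleanup might look something like this (just a sketch; later I'll build it out of small, composable lambdas instead):

def normalize_filename(name)
  name
    .sub(/(\.(mp3|wav))+$/i, '') # strip extensions, even doubled ones
    .downcase                    # case-insensitive comparison
    .gsub(/[^a-z0-9]/, ' ')      # turn every non-alphanumeric character into a space
    .squeeze(' ')                # collapse runs of spaces
    .strip
end

normalize_filename("ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3")
# => "zzz 001 very nice music vol 1 10 10 fighter"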
Let's then consider this file path:
Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3
First, I can cut it up into four pieces:
Mexico
Rock
YYY 001_Rock Vol. 1
YYY 001_Rock_9_9_ Love You Much.mp3
I'll ignore the genre. Let's munge the album name:
yyy 001 rock vol 1
And let's do the same for the filename, removing the extension as well:
yyy 001 rock 9 9 love you much
Now, the row in the spreadsheet will show that the album name is:
YYY 001_Rock Vol. 1
That'll then turn into the same thing we see above:
yyy 001 rock vol 1
There are two issues in the filename. First, the "Vol. 1" is missing from the album designation. Second, the track number is duplicated.
Let's see how those things affect the scoring. Given the established pattern, the ideal filename from the data would be YYY 001_Rock Vol. 1_9_Love You Much.mp3. Let's munge that (yyy 001 rock vol 1 9 love you much) and compare:
3.3.6 :010 > fuzzy.getDistance('yyy 001 rock vol 1 9 love you much', 'yyy 001 rock 9 9 love you much')
=> 0.9202640894085828
So, it comes in at 0.92, which is close. Is it close enough?
A naive algorithm might have a cutoff value that is considered "close enough". But if we do a complete cartesian product and sort by the fuzzy distance (descending) we'll get all of the best matches at the top. This can be an expensive operation, and it might make sense to look for exact matches before running the fuzzy matcher. I won't optimize it here, but you get the idea.
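Here's a rough sketch of that exact-match shortcut - not something I'll wire in below, and expected_paths / munged_paths are hypothetical names for the two munged sets:

# Pair up exact matches first so the fuzzy (cartesian product) pass
# only has to deal with whatever is left over.
exact_matches   = expected_paths & munged_paths
expected_paths -= exact_matches
munged_paths   -= exact_matches
# ...then run the fuzzy matcher over the (smaller) remainders.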
Munging Strings For Better Matching
Taking our example, we can run a set of mungers against the two strings, compare all possible pairs, and extract the highest score as the official score for those two strings. Let's look at how to do this in Ruby.
First, what are our "two strings"? On one side, we have the file path on disk. On the other side, we have the expected file path generated from the spreadsheet data. In reality, there will be a few different "expected" file paths, since there are a few ways those are generated.
Let's create a simple set of mungers. We will munge the country, album name, and filename items separately. All will go through the basic munging:
- Force to lowercase
- Change all non-alphanumeric characters to spaces
- Collapse multiple contiguous spaces to one single space
Filenames will have to be treated differently. Just looking at the examples shown:
- Remove "vol x", where "x" is a number
- Change " x x " to " x " - in other words, if a number is duplicated remove one of them.
There's a potential pitfall here. If we have "vol. 1" and track "1" there'll likely be a duplicated number. For those mungers, though, we'll try all combinations - including none, so the original filename will still be in the list.
Let's start with our simple mungers (I'm going to skip the newer Ruby syntax so this works back into the Ruby 2 series):
downcase_munger = lambda { |x| x.downcase }
alphanumeric_filter_munger = lambda { |x| x.gsub(/\P{Alnum}/, ' ') }
spaces_munger = lambda { |x| x.gsub(/ +/, ' ').strip }
And our specific filename mungers:
remove_vol_munger = lambda { |x| x.sub(/ vol \d+ /, ' ') }
remove_dup_track_number_munger = lambda { |x| x.sub(/ (\d+) (\1) /, ' \\1 ') }
Finally, we also want to remove the extensions from the filenames, knowing that they might be duplicated (or replicated):
remove_ext_munger = lambda { |x| x.sub(/(\.(mp3|wav))+$/i, '') }
We're going to use a fun trick to apply every combination of mungers to our strings. Note that downcase_munger, alphanumeric_filter_munger, and spaces_munger are always applied to our strings at the start, with spaces_munger applied again at the end. But our other two mungers for filenames need to be applied in all combinations. Since there are only two, this is very easy: none, both, and each individually. I have code that uses up to seven mungers, so generalizing this makes sense.
Ruby has an awesomely simple way to do this with the combination method for arrays:
3.3.6 :001 > [1,2,3,4].combination(3).to_a
=> [[1, 2, 3], [1, 2, 4], [1, 3, 4], [2, 3, 4]]
3.3.6 :002 > [1,2,3,4].combination(2).to_a
=> [[1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4]]
3.3.6 :003 > [1,2].combination(2).to_a
=> [[1, 2]]
3.3.6 :004 > [1,2].combination(1).to_a
=> [[1], [2]]
As you can see, given an array it returns all possible combinations of "n" elements.
What if those elements are mungers?
In Ruby, starting in version 2.5 of the language, it's possible to easily compose functions using the ">>" or "<<" operator.
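A quick illustration of how that composition works, with toy lambdas rather than anything project-specific:

add_one = lambda { |n| n + 1 }
double  = lambda { |n| n * 2 }

(add_one >> double).call(3) # => 8 (add_one runs first, then double)
(add_one << double).call(3) # => 7 (double runs first, then add_one)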
As an aside, I'm going to assume that my mungers can be called in any order, meaning we only need to worry about "combinations" and not "permutations". In reality order does matter, but it doesn't matter enough for me to bother with it.
Here are our steps so far:
- For the spreadsheet: come up with a list of all expected file paths given the country, album name, track number, and composition title.
- For the file paths: remove the genre if applicable, apply the general mungers to the country and album name, then run all mungers against the filename. Reassemble the paths.
- Run the cartesian product of expected file paths and file paths. Since there will be multiple items for each spreadsheet line and each file path, we'll again run a cartesian product and take only the closest match.
We will create an array that has these elements:
- spreadsheet line
- full file path
- Jaro-Winkler score
Scores below a certain threshold will have to be ignored. We'll figure that out later.
The General Algorithm
So, let's talk about the spreadsheet lines. Here is the data for the file paths that we've referenced above:
country | album | track_number | title |
---|---|---|---|
Canada | ZZZ 001_Very Nice Music Vol. 1 | 10 | Fighter |
Mexico | YYY 001_Rock Vol. 1 | 9 | Love You Much |
Here are all of our combinations of these data items:
- title - "%s"
- The album plus the title - "%s - %s"
- The album, track_number, and title - "%s - %d - %s"
- The track_number and title - "%d - %s"
I can ignore the duplicated track number as the mungers will handle that. Let's see how this works.
downcase_munger = lambda { |x| x.downcase }
alphanumeric_filter_munger = lambda { |x| x.gsub(/\P{Alnum}/, ' ') }
spaces_munger = lambda { |x| x.gsub(/ +/, ' ').strip }
general_munger = downcase_munger >> alphanumeric_filter_munger >> spaces_munger
filename_patterns = [
"%<title>s",
"%<album>s - %<title>s",
"%<album>s - %<track_number>d - %<title>s",
"%<track_number>d - %<title>s"
]
# Note that the keys need to be symbols
csv_rows = [
{ country: "Canada", album: "ZZZ 001_Very Nice Music Vol. 1",
track_number: 10, title: "Fighter"
},
{ country: "Mexico", album: "YYY 001_Rock Vol. 1",
track_number: 9, title: "Love You Much"
}
]
With all of that in place, we can get a group of possible (expected) filenames for each csv row:
add_possible_filenames = lambda do |csv_rows|
csv_rows.each do |csv_row|
country = general_munger.call csv_row[:country]
album = general_munger.call csv_row[:album]
path_prefix = "#{country}/#{album}/"
csv_row[:possible_filenames] = filename_patterns.map { |pattern|
general_munger.call sprintf(pattern, csv_row)
}.map { |filename| path_prefix + filename }
end
end
3.3.6 :067 > add_possible_filenames.call csv_rows
=>
[{:country=>"Canada",
:album=>"001_Very Nice Music Vol. 1",
:track_number=>10,
:title=>"Fighter",
:possible_filenames=>
["canada/zzz 001 very nice music vol 1/fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 10 fighter",
"canada/zzz 001 very nice music vol 1/10 fighter"]},
{:country=>"Mexico",
:album=>"YYY 001_Rock Vol. 1",
:track_number=>9,
:title=>"Love You Much",
:possible_filenames=>
["mexico/yyy 001 rock vol 1/love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 9 love you much",
"mexico/yyy 001 rock vol 1/9 love you much"]}]
I placed the possible filenames in the csv_rows under "possible_filenames", and they're already generally munged.
Now, let's look at the filepaths:
filename_list = [
"Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3",
"Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3"
]
filename_mungers = [ remove_vol_munger, remove_dup_track_number_munger ]
apply_filename_mungers = lambda do |filename|
munged_filenames = [ filename ] +
1.upto(filename_mungers.length).map do |i|
filename_mungers.combination(i).to_a.map do |munger_combo|
munger_combo.inject(:>>).call(filename)
end
end.flatten
munged_filenames.map { |x| spaces_munger.call x }.sort.uniq
end
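# Quick sanity check: calling apply_filename_mungers on the munged
# Mexico filename should give back the original plus the variant with
# the duplicated track number removed:
#
#   apply_filename_mungers.call("yyy 001 rock 9 9 love you much")
#   # => ["yyy 001 rock 9 9 love you much", "yyy 001 rock 9 love you much"]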
create_filename_lists = lambda do |filename_list|
filename_list.map do |filepath|
path_pieces = filepath.split("/")
path_pieces.delete_at(1) if path_pieces.size == 4
country = general_munger.call path_pieces[0]
album = general_munger.call path_pieces[1]
filename = (remove_ext_munger >> general_munger).call path_pieces[2]
filenames = apply_filename_mungers.call filename
[ filepath, filenames.map { |fn| "#{country}/#{album}/#{fn}" } ]
end
end
Now, to apply this:
3.3.6 :075 > file_path_variations = create_filename_lists.call filename_list
=>
[["Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3",
["mexico/yyy 001 rock vol 1/yyy 001 rock 9 9 love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock 9 love you much"]],
["Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3",
["canada/zzz 001 very nice music vol 1/zzz 001 very nice music 10 10 fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music 10 fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 10 10 fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 10 fighter"]]]
Here we have an array where the first element is the file path on disk and the second element is a subarray with a few different variations, based on the mungers. Again, there may be any number of mungers depending on the data.
At this point, we want to combine everything. For each spreadsheet row and each file path we want to get the Jaro-Winkler distance between all variations and capture the closest (highest, in our case).
If you're using a language that copies values rather than passing references, you might want a slightly different data structure than what I'm doing here. Ruby and JavaScript pass object references, meaning 1) data isn't copied and 2) equality can be checked against the actual object reference.
However, note that the number of comparisons grows with the product of the two set sizes - roughly quadratically - so it works great for hundreds and maybe thousands of records. Beyond that you'll likely need to start working on optimization.
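To make the reference point concrete, here's a tiny illustration (assuming the csv_rows from above). Because the same row hashes are carried around everywhere, comparing rows later on is safe even though they're plain hashes:

row  = csv_rows.first
copy = row.dup

row == copy                # => true  (hash value equality)
row.equal?(copy)           # => false (a copy is a different object)
row.equal?(csv_rows.first) # => true  (the same reference everywhere)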
compute_cartesian_product_array = lambda do |csv_rows, file_path_variations|
csv_rows.map do |csv_row|
file_path_variations.map do |filepath, variations|
closest = csv_row[:possible_filenames].map do |possible_filename|
variations.map do |variation|
fuzzy.getDistance(possible_filename, variation)
end
end.flatten.max
[csv_row, filepath, closest]
end
end.flatten(1)
end
Just with two rows and two file paths, we get a four-element array:
[[{:country=>"Canada",
:album=>"ZZZ 001_Very Nice Music Vol. 1",
:track_number=>10,
:title=>"Fighter",
:possible_filenames=>
["canada/zzz 001 very nice music vol 1/fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 10 fighter",
"canada/zzz 001 very nice music vol 1/10 fighter"]},
"Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3",
0.6296653796653796],
[{:country=>"Canada",
:album=>"ZZZ 001_Very Nice Music Vol. 1",
:track_number=>10,
:title=>"Fighter",
:possible_filenames=>
["canada/zzz 001 very nice music vol 1/fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 10 fighter",
"canada/zzz 001 very nice music vol 1/10 fighter"]},
"Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3",
1.0],
[{:country=>"Mexico",
:album=>"YYY 001_Rock Vol. 1",
:track_number=>9,
:title=>"Love You Much",
:possible_filenames=>
["mexico/yyy 001 rock vol 1/love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 9 love you much",
"mexico/yyy 001 rock vol 1/9 love you much"]},
"Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3",
0.9804808984852584],
[{:country=>"Mexico",
:album=>"YYY 001_Rock Vol. 1",
:track_number=>9,
:title=>"Love You Much",
:possible_filenames=>
["mexico/yyy 001 rock vol 1/love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 9 love you much",
"mexico/yyy 001 rock vol 1/9 love you much"]},
"Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3",
0.6403194506642782]]
Now, we can process this array to get the likely matches, starting with the most likely and descending. At each step, we will remove "used" items from the cartesian_product_array - based on both the "csv_row" and the "filepath". The array will quickly shrink as this action is destructive. Again, this is unoptimized and there are many obvious optimizations to speed this up.
As I've written this, everything will be "consumed". What that means is that when we finish, whichever set is smaller (CSV rows or file paths) will have a match for every element. Some of those matches may be incorrect simply because there was no true match and the algorithm grabbed the closest. You can see above that absolutely wrong items still scored above 0.6. We probably need to program in a cutoff of 0.8 or 0.9, which we can work on next.
The Naive Algorithm
Let's do the naive algorithm first.
cartesian_product_array.sort_by!(&:last)
This will put the most likely matches at the end. We can grab the last element in this array and assume it's a match.
3.3.6 :313 > cartesian_product_array.last
=>
[{:country=>"Canada",
:album=>"ZZZ 001_Very Nice Music Vol. 1",
:track_number=>10,
:title=>"Fighter",
:possible_filenames=>
["canada/zzz 001 very nice music vol 1/fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 10 fighter",
"canada/zzz 001 very nice music vol 1/10 fighter"]},
"Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3",
1.0]
That is a nice match - the score is "1.0" meaning it was an exact match. We need to remove every instance of this csv_row and every instance of this file path from our array.
3.3.6 :317 > current_match = cartesian_product_array.last
3.3.6 :318 > cartesian_product_array.reject! { |row|
3.3.6 :319 > row[0] == current_match[0] || row[1] == current_match[1]
3.3.6 :320 > }
=>
[[{:country=>"Mexico",
:album=>"YYY 001_Rock Vol. 1",
:track_number=>9,
:title=>"Love You Much",
:possible_filenames=>
["mexico/yyy 001 rock vol 1/love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 9 love you much",
"mexico/yyy 001 rock vol 1/9 love you much"]},
"Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3",
0.9804808984852584]]
That knocked out three lines! This algorithm continues to speed up as it goes along. At this point, we save the current_match and do this again, and we'll have both matches.
3.3.6 :321 > current_match = cartesian_product_array.last
=>
[{:country=>"Mexico",
...
3.3.6 :322 > cartesian_product_array.reject! { |row|
3.3.6 :323 > row[0] == current_match[0] || row[1] == current_match[1]
3.3.6 :324 > }
=> []
Let's turn that into a function. I'll just add the matched file path to the csv_row.
add_file_path_to_csv_rows = lambda do |cartesian_product_array|
cartesian_product_array.sort_by!(&:last)
while cartesian_product_array.any?
current_match = cartesian_product_array.last
cartesian_product_array.reject! { |row|
row[0] == current_match[0] || row[1] == current_match[1]
}
current_match[0][:matched_file_path] = current_match[1]
end
end
Run that, and it'll add the "matched_file_path" to the csv_rows.
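For reference, "run that" is just the one-liner below - the same call we'll use again later. Note that the lambda mutates the row hashes in place:

add_file_path_to_csv_rows.call cartesian_product_array

The csv_rows then look like this: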
[{:country=>"Canada",
:album=>"ZZZ 001_Very Nice Music Vol. 1",
:track_number=>10,
:title=>"Fighter",
:possible_filenames=>
["canada/zzz 001 very nice music vol 1/fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 10 fighter",
"canada/zzz 001 very nice music vol 1/10 fighter"],
:matched_file_path=>
"Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3"},
{:country=>"Mexico",
:album=>"YYY 001_Rock Vol. 1",
:track_number=>9,
:title=>"Love You Much",
:possible_filenames=>
["mexico/yyy 001 rock vol 1/love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 9 love you much",
"mexico/yyy 001 rock vol 1/9 love you much"],
:matched_file_path=>
"Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3"}]
More Complicated Data
Okay, let's be frank. That example is a little sterile given that we knew up front that our two files would match the two lines in the sheet. Real life is never so nice. I mean, it just isn't. So given that, how can we make this work when stuff doesn't match? Let's add some crap data and see what happens.
Let's add an extraneous file path first:
filename_list = [
"Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3",
"Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3",
"USA/AAA 001_I Rock, You Don't/AAA 001_I Rock, You Don't - 1 - I Rock, You Don't.mp3"
]
With that, we can compute a new set of file path variations, and I'll also reset my csv_rows.
Let's get a new cartesian product array:
[[{:country=>"Canada",
:album=>"ZZZ 001_Very Nice Music Vol. 1",
:track_number=>10,
:title=>"Fighter",
:possible_filenames=>
["canada/zzz 001 very nice music vol 1/fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 10 fighter",
"canada/zzz 001 very nice music vol 1/10 fighter"]},
"Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3",
0.6296653796653796],
[{:country=>"Canada",
:album=>"ZZZ 001_Very Nice Music Vol. 1",
:track_number=>10,
:title=>"Fighter",
:possible_filenames=>
["canada/zzz 001 very nice music vol 1/fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 10 fighter",
"canada/zzz 001 very nice music vol 1/10 fighter"]},
"Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3",
1.0],
[{:country=>"Canada",
:album=>"ZZZ 001_Very Nice Music Vol. 1",
:track_number=>10,
:title=>"Fighter",
:possible_filenames=>
["canada/zzz 001 very nice music vol 1/fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 fighter",
"canada/zzz 001 very nice music vol 1/zzz 001 very nice music vol 1 10 fighter",
"canada/zzz 001 very nice music vol 1/10 fighter"]},
"USA/AAA 001_I Rock You Don't/AAA 001_I Rock You Don't - 1 - I Rock You Don't.mp3",
0.6125451577579237],
[{:country=>"Mexico",
:album=>"YYY 001_Rock Vol. 1",
:track_number=>9,
:title=>"Love You Much",
:possible_filenames=>
["mexico/yyy 001 rock vol 1/love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 9 love you much",
"mexico/yyy 001 rock vol 1/9 love you much"]},
"Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3",
0.9804808984852584],
[{:country=>"Mexico",
:album=>"YYY 001_Rock Vol. 1",
:track_number=>9,
:title=>"Love You Much",
:possible_filenames=>
["mexico/yyy 001 rock vol 1/love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 9 love you much",
"mexico/yyy 001 rock vol 1/9 love you much"]},
"Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3",
0.6403194506642782],
[{:country=>"Mexico",
:album=>"YYY 001_Rock Vol. 1",
:track_number=>9,
:title=>"Love You Much",
:possible_filenames=>
["mexico/yyy 001 rock vol 1/love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 love you much",
"mexico/yyy 001 rock vol 1/yyy 001 rock vol 1 9 love you much",
"mexico/yyy 001 rock vol 1/9 love you much"]},
"USA/AAA 001_I Rock You Don't/AAA 001_I Rock You Don't - 1 - I Rock You Don't.mp3",
0.6289747064137309]]
The "I Rock, You Don't" file isn't matching anything. If I now run add_file_path_to_csv_rows
, I get the same result. Now, we can find missing file paths:
find_missing_file_paths = lambda do |csv_rows, filename_list|
filename_list.sort - csv_rows.map { |x| x[:matched_file_path] }.compact.sort
end
3.3.6 :428 > find_missing_file_paths.call(csv_rows, filename_list)
=> ["USA/AAA 001_I Rock You Don't/AAA 001_I Rock You Don't - 1 - I Rock You Don't.mp3"]
Now, what happens if we have an extra row in the CSV file? Let's start out with the same number of rows, but items that clearly don't match. I'll leave "I Rock, You Don't" in the list of files, and we'll add another line to the CSV file:
country | album | track_number | title |
---|---|---|---|
Canada | ZZZ 001_Very Nice Music Vol. 1 | 10 | Fighter |
Mexico | YYY 001_Rock Vol. 1 | 9 | Love You Much |
UK | UUU 003_Folk Vol. 3 | 1 | Serious Sunday |
The first two will match up, and we'll see "Serious Sunday" get matched with "I Rock, You Don't". Let's modify the matcher to store the JW distance in csv_rows as well:
add_file_path_to_csv_rows = lambda do |cartesian_product_array|
cartesian_product_array.sort_by!(&:last)
while cartesian_product_array.any?
current_match = cartesian_product_array.last
cartesian_product_array.reject! { |row|
row[0] == current_match[0] || row[1] == current_match[1]
}
current_match[0][:matched_file_path] = current_match[1]
current_match[0][:matched_file_score] = current_match[2]
end
end
With that, I'll run everything:
3.3.6 :451 > add_possible_filenames.call csv_rows;
3.3.6 :452 > cartesian_product_array = compute_cartesian_product_array.call(csv_rows, file_path_variations);
3.3.6 :453 > add_file_path_to_csv_rows.call cartesian_product_array;
And the output (note that "possible_filenames" have been removed):
[{:country=>"Canada",
:album=>"ZZZ 001_Very Nice Music Vol. 1",
:track_number=>10,
:title=>"Fighter",
:matched_file_path=>
"Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3",
:matched_file_score=>1.0},
{:country=>"Mexico",
:album=>"YYY 001_Rock Vol. 1",
:track_number=>9,
:title=>"Love You Much",
:matched_file_path=>
"Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3",
:matched_file_score=>0.9804808984852584},
{:country=>"UK",
:album=>"UUU 003_Folk Vol. 3",
:track_number=>1,
:title=>"Serious Sunday",
:matched_file_path=>
"USA/AAA 001_I Rock You Don't/AAA 001_I Rock You Don't - 1 - I Rock You Don't.mp3",
:matched_file_score=>0.5716374269005847}]
So, you can see that "I Rock, You Don't" and "Serious Sunday" matched, but the score is only 0.57. We can get around this by ignoring low scores; we just have to define what "low" is. Ultimately, that's something to decide through trial and error. For now, I think 0.9 is reasonable.
Let's take this one step further and see what happens if we add another CSV row:
country | album | track_number | title |
---|---|---|---|
Canada | ZZZ 001_Very Nice Music Vol. 1 | 10 | Fighter |
Mexico | YYY 001_Rock Vol. 1 | 9 | Love You Much |
UK | UUU 003_Folk Vol. 3 | 1 | Serious Sunday |
USA | AAA 002_Country | 2 | Let's Go Out |
At this point, we have two CSV rows and one file that are orphans.
[{:country=>"Canada",
:album=>"ZZZ 001_Very Nice Music Vol. 1",
:track_number=>10,
:title=>"Fighter",
:matched_file_path=>
"Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3",
:matched_file_score=>1.0},
{:country=>"Mexico",
:album=>"YYY 001_Rock Vol. 1",
:track_number=>9,
:title=>"Love You Much",
:matched_file_path=>
"Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3",
:matched_file_score=>0.9804808984852584},
{:country=>"UK",
:album=>"UUU 003_Folk Vol. 3",
:track_number=>1,
:title=>"Serious Sunday"},
{:country=>"USA",
:album=>"AAA 002_Country",
:track_number=>2,
:title=>"Let's Go Out",
:matched_file_path=>
"USA/AAA 001_I Rock You Don't/AAA 001_I Rock You Don't - 1 - I Rock You Don't.mp3",
:matched_file_score=>0.6940492321589883}]
You can see that "Serious Sunday" didn't get matched, and "Let's Go Out" got matched to "I Rock, You Don't" with a score of 0.69.
A More Sophisticated Algorithm
We have a couple of places where we could cut out the low scores, but earlier in the process is better, since there's no reason to even drag those around. I will note that if you have every reason to believe the matching will be 1:1, you might want to leave the low scores in. But for what we're doing here, we won't miss them.
compute_cartesian_product_array = lambda do |csv_rows, file_path_variations|
csv_rows.map do |csv_row|
file_path_variations.map do |filepath, variations|
closest = csv_row[:possible_filenames].map do |possible_filename|
variations.map do |variation|
fuzzy.getDistance(possible_filename, variation)
end
end.flatten.max
closest >= 0.9 ? [csv_row, filepath, closest] : nil
end.compact
end.flatten(1)
end
Re-run the whole chain with that in place, and here are the csv_rows:
[{:country=>"Canada",
:album=>"ZZZ 001_Very Nice Music Vol. 1",
:track_number=>10,
:title=>"Fighter",
:matched_file_path=>
"Canada/ZZZ 001_Very Nice Music Vol. 1/ZZZ 001_Very Nice Music Vol. 1 - 10 - 10_Fighter.mp3.mp3",
:matched_file_score=>1.0},
{:country=>"Mexico",
:album=>"YYY 001_Rock Vol. 1",
:track_number=>9,
:title=>"Love You Much",
:matched_file_path=>
"Mexico/Rock/YYY 001_Rock Vol. 1/YYY 001_Rock_9_9_ Love You Much.mp3",
:matched_file_score=>0.9804808984852584},
{:country=>"UK",
:album=>"UUU 003_Folk Vol. 3",
:track_number=>1,
:title=>"Serious Sunday"},
{:country=>"USA",
:album=>"AAA 002_Country",
:track_number=>2,
:title=>"Let's Go Out"}]
And, of course, we now have a missing file path (really, a missing CSV entry):
["USA/AAA 001_I Rock You Don't/AAA 001_I Rock You Don't - 1 - I Rock You Don't.mp3"]
Wrap Up
There are a lot of options for how to handle this in your UI/UX. I like to show the matches and make anything of lower quality red, to draw attention to it. It's often the case that I'll do this with a Dropbox or Box directory, so the user has to find the directory and is then shown a matching list of files. If things don't match properly, they can rename them, add the missing files, whatever is appropriate, and then try matching again.
While this algorithm's running time grows quadratically as data is added, cutting off low-quality matches up front eases a lot of that pain. It should still be fine for thousands of rows, as long as you have enough memory.
Feel free to ask questions below.
Addendum
I wanted to make an addendum on this since I didn't really go into detail about why I do the matching in this manner. I'm doing a cartesian product of two sets, getting the distance, and then matching individually starting at the best matches and working my way down.
Why not just determine what's "close enough" and assume those are matches?
A lot of the data that I work with has really close names. Even the data set referenced above has this issue, and some other data sets have it far worse.
It's common for me to get a set of files with names like these (note that this is real data except for the title):
I Rock, You Don't - full.wav
I Rock, You Don't - full no drums.wav
I Rock, You Don't - under.wav
I Rock, You Don't - under no drums.wav
I Rock, You Don't - bass stem.wav
I Rock, You Don't - brass stem.wav
I Rock, You Don't - drums stem.wav
I Rock, You Don't - fx stem.wav
I Rock, You Don't - keys stem.wav
I Rock, You Don't - strings stem.wav
I Rock, You Don't - synth drones stem.wav
I Rock, You Don't - synth pulse stem.wav
I Rock, You Don't - sting.wav
I Rock, You Don't - 30.wav
I Rock, You Don't - 60.wav
I'll direct your attention to the "bass" and "brass" stems. Check this out:
3.3.6 :003 > fuzzy.getDistance("I Rock, You Don't - bass stem.wav", "I Rock, You Don't - brass stem.wav")
=> 0.9962514417531718
Those are really close, and if you're doing a really simple "take anything over .95" type algorithm you'll have problems.
I also can't just take exact matches, because it's so common with these filenames to have "drum stem" instead of "drums stem", etc. Plus typos.
The idea is to find the closest match first, and then mark those items on both sides as "matched" and remove them from the pool. By doing that, it's extremely unlikely to have issues such as "bass stem" and "brass stem" getting mixed up.