DEV Community

Bob Lied
Bob Lied

Posted on

PWC 233 Data on the side

PWC is back, with forever an asterisk where #232 should be. Best wishes for full recovery to our intrepid host and benevolent overlord.

This week's challenges have something in common: they require us to organize data by a trait that is not directly in the data. All software problems, of course, can be solved by adding a level of indirection, and that applies here. Supplementing data with an invented data structure on the side is a common programming strategy. It also reminds me of the Schwartzian transform, and maybe that will be useful this week.

Task 1: Similar Words

You are given an array of words made up of alphabets only.

Write a script to find the number of pairs of similar words.
Two words are similar if they consist of the same characters.
Enter fullscreen mode Exit fullscreen mode

Example 1

Input: @words = ("aba", "aabb", "abcd", "bac", "aabc")
Output: 2
Pair 1: similar words ("aba", "aabb")
Pair 2: similar words ("bac", "aabc")

Example 2

Input: @words = ("aabb", "ab", "ba")
Output: 3
Pair 1: similar words ("aabb", "ab")
Pair 2: similar words ("aabb", "ba")
Pair 3: similar words ("ab", "ba")

Example 3

Input: @words = ("nba", "cba", "dba")
Output: 0

Thoughts

First, I notice that the task requires us to print a count of pairs, not the pairs themselves. Counting pairs is a specific case of a general combinatorial problem; how many ways can we choose k things at a time from a set of n? In general, choosing k from n is n!/(n-k)!β€’k!; when k is 2, it reduces to nβ€’(n-1)/2. Let's tuck that away for a moment.

To see if words are similar by the task definition, we will need to find the signature of a word that identifies which letters it contains. We'll need to decide what "made up of alphabets only" really means. A diligent developer would consider at least Unicode, case folding and things like whether n and Γ± are "similar". But we are Perl programmers, and our defining virtue is (say it with me) laziness, so we choose "alphabets only" to mean lower-case English alphabet a to z.

I can think of a couple of ways to make a word signature.

  1. We can split the word into its characters; then sort that list of characters, remove duplicates, and then join the now-sorted characters into a string. For instance "joinuniqsortsplit" becomes "ijlnopqrstu".
  2. Given that we have only the 26 lowercase letters of the English alphabet, we have a set that can be efficiently represented by a bit vector. If we let bit 0 correspond to a and bit 25 to z, we can set the bits to 1 for letters that are present, and each such set creates an integer between 1 ("a", "aa", "aaa" and so on) and 2^26-1 ("abcdefghijklmnopqrstuvwxyz" and any other string that contains all the letters). For instance, "joinuniqsortsplit" becomes 0x01feb00, or 2091776.
    zy xwvu tsrq ponm lkji hgfe dcba
    00 0001 1111 1110 1011 0000 0000
Enter fullscreen mode Exit fullscreen mode

The bit vector solution is kind of fun; it's probably what I would do in a C/C++ solution. But Perl, so the first one.

So what data do we need? How about this: for every unique word signature, a list of words that have that signature. For example 1, that would look like:

my %similar = (
  ab => [ "aba", "aabb" ],
 abc => [ "bac", "aabc" ],
abcd => [ "abcd" ],
);
Enter fullscreen mode Exit fullscreen mode

Because we only need a count of pairs, we don't really need to preserve the list of words -- we could map the signatures to a count. But let's roll with it for now and optimize later if we must (say it with me -- "laziness").

So, on to our solution. There are two parts. One is a function that reduces a word to its signature, using uniq from List::Util to reduce a sorted list to single elements:

sub wordHash($word)
{
    join "", uniq sort { $a cmp $b } split(//, $word)
}
Enter fullscreen mode Exit fullscreen mode

Why did I name this "wordHash"? Reducing a word to its signature reminded me of reducing a complex data structure to a single value which could be used as a key in a hash table. The same sort of problem arises in serializing and de-serializing data, for instance in Java remote procedure calls. I should probably change the name from "hash" to "signature" or something more directly applicable to this problem. Naming things is one of the two hardest problems in computer science; the others are cache invalidation and off-by-one errors.

The other part is to create a hash of similar words, and then iterate over it to find the pair counts:

sub similarWords(@wordList)
{
    # Create a hash mapping signatures to words that have that signature
    my %similar;
    for my $word ( @wordList )
    {
        # Build an array of words via reference
        push @{$similar{wordHash($word)}}, $word;
    }

    my $pairCount = 0;
    for my $signature ( keys %similar)
    {
        my $n = scalar $similar{$signature}->@*;
        next unless $n > 1;
        $pairCount += $n*($n-1) / 2;
    }
    return $pairCount;
}
Enter fullscreen mode Exit fullscreen mode

Here we retrieve the combinatorial pair-counting from where we tucked it away earlier to add up the sum count. Getting $n is perhaps a little weird, so let's break it down:

  • $similar{$signature} retrieves the value from the hash. The values stored are array references.
  • $similar{$signature}->@* is post-fix dereferencing, which is a scary way to say "the actual array"
  • scalar ... when applied to an array yields the size of the array.

One more thing that might be considered is what should happen if there are duplicate words with the same signature. Suppose our list is ("ab", "ab", "ab"). Is that three unique words and therefore three pairs, or is it just one and therefore no pairs? What to do, what to do? Hint: laziness.

Task 2: Frequency Sort

You are given an array of integers.

Write a script to sort the given array in increasing order
based on the frequency of the values. If multiple values
have the same frequency then sort them in decreasing order.
Enter fullscreen mode Exit fullscreen mode

Example 1

Input: @ints = (1,1,2,2,2,3)
Ouput: (3,1,1,2,2,2)
'3' has a frequency of 1
'1' has a frequency of 2
'2' has a frequency of 3

Example 2

Input: @ints = (2,3,1,3,2)
Ouput: (1,3,3,2,2)
'2' and '3' both have a frequency of 2, so they are sorted in decreasing order.

Example 3

Input: @ints = (-1,1,-6,4,5,-6,1,4,1)
Ouput: (5,-1,4,4,-6,-6,1,1,1)

Thoughts

Hmm, a problem of multiple sort criteria. Let's restate it.
Put things in order. The things that only occur once should come first, then the things that occur twice, and so on. Within the groups, the numbers should be in descending order.

More than thirty years ago, Stephen Covey told us in The 7 Habits of Highly Effective People that we should begin with the end in mind. And since many people have called me a highly effective person ... hold on, I'm being told the word they used was "defective." Anyway.

In the end, we're going to need to determine the frequency of each element in the list. And if we had a table of frequencies, it would be very easy to sort by frequency. To make such a table, it's not hard to iterate over the list and increment a count in a hash, but it's even easier to use a module: the frequency function from List::MoreUtils.

Given the frequency table, then implementing the two-part sort is easy, if somewhat idiomatic.

sub frequencySort(@list)
{
    use List::MoreUtils qw/frequency/;
    my %freq = frequency(@list);

    [ sort { $freq{$a} <=> $freq{$b} || $b <=> $a} @list ]
}
Enter fullscreen mode Exit fullscreen mode

The sort criteria are two parts separated by ||. If the first condition determines the order, that's that, and the second condition isn't even considered (shortcut evaluation). But if they compare equal on frequency, then the <=> operator returns 0, which is false in Perl. Therefore, the second condition does get evaluated. The second condition is a simple numeric comparison, but $b and $a have been switched so that it becomes a descending sort.

Finally, putting the whole thing in [] brackets creates a reference to an anonymous array. Because that's the last statement in the subroutine, it becomes the return value. Shame on me for not being explicit with a return keyword. Or ending the statement with a semicolon. Did I mention "laziness"?

Top comments (0)