TLDR
One pass over sorted data is a better strategy than searching through random data.
The Problem
Perl Weekly Challenge 244, Task 1, Count Smaller asks us to compare each element of an array against all the others:
You are given an array of integers. Write a
script to calculate the number of integers
smaller than the integer at each index.
Example 1
Input:
@int = (8, 1, 2, 2, 3)
Output: (4, 0, 1, 1, 3)
For index = 0, count of elements less than 8 is 4.
For index = 1, count of elements less than 1 is 0.
For index = 2, count of elements less than 2 is 1.
For index = 3, count of elements less than 2 is 1.
For index = 4, count of elements less than 3 is 3.
Example 2
Input:
@int = (6, 5, 4, 8)
Output: (2, 1, 0, 3)
Example 3
Input:
@int = (2, 2, 2)
Output: (0, 0, 0)
Idea 1: Brute force search
The first idea, as usual, is brute force: for every element in the array, scan the array and count the values that are less than it. An optimization that immediately occurs to me is that the scan for smaller values would be easier if the numbers were sorted; we can quit early and don't have to scan the entire list every time.
sub countSmaller_A($nums)
{
    # Sort a copy so the inner scan can stop at the first value >= $i.
    my @sorted = sort { $a <=> $b } $nums->@*;
    my @smaller = ();
    for my $i ( $nums->@* )
    {
        my $count = 0;
        for my $j ( @sorted )
        {
            last if $j >= $i;   # sorted, so nothing smaller remains
            $count++;
        }
        push @smaller, $count;
    }
    return \@smaller;
}
Searching in a sorted list has a further optimization available. We could find the index of the first element in @sorted that is greater than or equal to $i using binary search (or, more realistically, using List::MoreUtils::bsearch_index, because my odds of getting a binary search algorithm right are slim). Maybe I'll come back to this, but this is basically an O(n^2) algorithm, so there are probably better ideas to explore.
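If I were to try it, a sketch might look like the following. I'm reaching for lower_bound from List::MoreUtils here (my choice, not part of the benchmark later in this post); it returns the index of the first element that does not compare less than the target, and in a sorted array that index is exactly the count of strictly smaller elements.
use List::MoreUtils qw/lower_bound/;

sub countSmaller_bsearch($nums)
{
    my @sorted = sort { $a <=> $b } $nums->@*;
    # lower_bound does a binary search: index of the first element >= $i,
    # which equals the number of elements strictly less than $i.
    return [ map { my $i = $_; lower_bound { $_ <=> $i } @sorted } $nums->@* ];
}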
Idea 2: Frequency counts
The second approach is to count how many times each value occurs, saving a frequency table. Then select from that table all the lesser numbers and add them up. This is a bit of a Perl tour-de-force, using array operations up the wazoo. The code is compact, but the complexity might not be worth it.
sub countSmaller_B($nums)
{
    use List::Util qw/sum0/;

    # Tally how many times each value occurs.
    my %freq;
    $freq{$_}++ for $nums->@*;

    # For each original number, add up the counts of every smaller value.
    return [ map { my $i = $_; sum0 @freq{ grep { $_ < $i } keys %freq } } $nums->@* ];
}
Building the frequency table is relatively obvious, but what's going on in that last line?
- return [ ... ] -- the answer is going to be an array reference to the list of counts
- map { my $i = $_; ... } $nums->@* -- the list is generated by mapping each of the original numbers to its count of smaller numbers. Each of the original numbers is assigned to $i while we're doing the transformation
- grep { $_ < $i } keys %freq -- we select all the keys from the frequency table %freq that are less than the number we're looking at
- @freq{ grep ... } -- using the list of keys from the grep, take a hash slice from the %freq table. This gives us the occurrence count of each value less than $i
- sum0 @freq{ ... } -- the sum0 function adds up the values in a list, defaulting to 0 if the list is empty
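To make the slice-and-sum concrete, here's a small standalone walk-through of Example 1's data (my own illustration, not from the original solution):
use List::Util qw/sum0/;

my @int = (8, 1, 2, 2, 3);
my %freq;
$freq{$_}++ for @int;                       # (1 => 1, 2 => 2, 3 => 1, 8 => 1)

my $i      = 3;                             # the value at index 4
my @lesser = grep { $_ < $i } keys %freq;   # (1, 2), in some order
my @counts = @freq{@lesser};                # (1, 2) -- one 1 and two 2s
print sum0(@counts), "\n";                  # prints 3, matching the output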
Idea 3: Make one pass over sorted data
Let's return for a moment to that sorted array we built back in Idea 1. Once the array is sorted, it's pretty easy to find all the smaller numbers -- it's a monotonically non-decreasing sequence as we scan from left to right.
There are two complications: repeated values, and the fact that the answer has to be in the same order as the original array. Hashes for caches and hash slices are going to come to our rescue.
sub countSmaller_C($nums)
{
    my @sorted = sort { $a <=> $b } $nums->@*;
    my %smaller;
    my $lessCount = 0;
    while ( defined(my $i = shift @sorted) )
    {
        # Record the count only for the first occurrence of a value, so
        # duplicates all share the count of strictly smaller elements.
        if ( ! exists $smaller{$i} )
        {
            $smaller{$i} = $lessCount;
        }
        $lessCount++;
    }
    # A hash slice keyed by the original array restores the input order.
    return [ @smaller{$nums->@*} ];
}
The first thing we do is take the hit of doing the sort, on the premise that we're going to get that performance back with a simpler search loop.
The %smaller variable is a hash that is going to contain our answer, but in an arbitrary order. We will un-arbitrary it at the end with a hash slice that uses the original array for keys -- that will solve our ordering problem.
The $lessCount variable is an accumulator that counts up the number of smaller values as we pass over the sorted array.
The while loop takes elements off the sorted array from left to right. We saw last week that shifting elements off an array can be more efficient than using array indexing. A quick note: the test has to include defined because there could be a zero in the data, and that would test as false.
Within the loop, we use the %smaller hash as a cache for values we've already seen. That solves our repeated-values problem.
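Here's a hand trace of Example 1 through countSmaller_C, to show the cache and the final slice in action (annotations mine):
# @sorted = (1, 2, 2, 3, 8)
#   shift 1: new value, $smaller{1} = 0; $lessCount becomes 1
#   shift 2: new value, $smaller{2} = 1; $lessCount becomes 2
#   shift 2: already cached, skipped;    $lessCount becomes 3
#   shift 3: new value, $smaller{3} = 3; $lessCount becomes 4
#   shift 8: new value, $smaller{8} = 4; $lessCount becomes 5
# The slice @smaller{8, 1, 2, 2, 3} restores the input order:
my $counts = countSmaller_C([8, 1, 2, 2, 3]);
print "@$counts\n";   # 4 0 1 1 3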
Time for benchmarking
This is only an exhibition; no wagering, please. As I've been doing for the past week, I brought out the Benchmark module for some simple comparison.
sub runBenchmark($repeat)
{
    use Benchmark qw/cmpthese/;

    # 100 random integers in [0, 100) -- bigger than the task examples.
    my @data = map { int(rand(100)) } 1..100;
    cmpthese($repeat, {
        "simple "   => sub { countSmaller_A(\@data) },
        "frequency" => sub { countSmaller_B(\@data) },
        "one pass " => sub { countSmaller_C(\@data) },
    });
}
The benchmark generates bigger data than the task examples, and then runs the functions thousands of times to get something like a statistically valid result.
And here's the big reveal, sorted from slowest to fastest:
               Rate frequency    simple  one pass
frequency    1339/s        --      -63%      -95%
simple       3647/s      172%        --      -88%
one pass    29268/s     2085%      702%        --
First of all, those compact array operations in Idea 2 look cute, but there's a lot of data access going on underneath. We were better off with the more readable loops from Idea 1.
But, wow! Look at the one pass solution! Eight times faster than the simple loop, and more than twenty times faster than the frequency approach. Better algorithms beat micro-optimizations.