Finding The Odd One Out In C#

Finding The Odd One Out In C#

Solutions to the finding the odd one out problem are of interest when one must find missing matches in some dataset, for example when performing data quality analysis for missing event pairs.

This article describes and benchmarks three approaches for a simple variation of this problem and then a hybrid approach for optimal performance.

The Problem

We have a variable-sized array of integer values, in the example below of nine items.

Values: 2 5 3 5 5 3 5 2 3
-------------------------
 Index: 0 1 2 3 4 5 6 7 8

We want to find the odd one out in the array.

These are the rules:

  • We can pair each item in the array against another item of equal value.
  • We can only pair each item once.

These are the constraints:

  • There can be any number of odd ones out in the array - items that don’t match with any other item.
  • There can also be no odd ones out, meaning all values have a corresponding pair.
  • The array can be of arbitrary length or even empty.
  • Each item can be of arbitrary value from minimum integer to maximum integer.

Given this setup, we want an algorithm that returns the value of the odd one out in the array, with the following behaviour:

  • If a single odd one out exists, return that value.
  • If multiple odd ones out exist, return any single one of those values.
  • If no odd one out exists, return null.

In the example above:

  • 2 occurs twice, forming one pair, with no odd one out.
  • 5 occurs four times, forming two pairs, with no odd one out.
  • 3 occurs three times, forming one pair, with one odd one out.
  • There is one odd one out with value 3, so the algorithm returns 3.

A Simple Solution

One simple algorithm can be to iterate through all individual items and then count how many items of that value exist in the array.

public static int? FindOddOneOutByIterating(int[] values)
{
    // keep track of items already tested
    var tested = new HashSet<int>();

    // check each item in the array
    for (int i = 0; i < values.Length; ++i)
    {
        // check if we have tested this value already
        if (tested.Contains(values[i]))
        {
            continue;
        }

        // count the number of times this value shows in the array
        int count = 0;
        for (int j = 0; j < values.Length; ++j)
        {
            if (values[i] == values[j])
            {
                ++count;
            }
        }

        // check if it appears an odd number of times
        if (count % 2 != 0)
        {
            return values[i];
        }

        // mark this value as tested
        tested.Add(values[i]);
    }

    // we did not find an unpaired item
    return null;
}

The above algorithm is self-explanatory. It brute-forces the problem by testing every single value in the array for an odd number of occurences until it finds one. To avoid testing the same value twice, it makes use of a HashSet<T> to keep track of values already tested. Time complexity varies from a best-case scenario O(n) when the odd one out is on the first position, to a worst-case scenario O((n*n)/2) when the odd one out is on the last position.

This works, but can we do better?

A Better Solution

We can attempt to improve upon the brute-force aspect of the previous solution by sorting the data first, and then counting the occurrences of each item in sequential fashion.

public static int? FindOddOneOutByOrdering(int[] values)
{
    // we need at least one item for this algorithm to make sense
    int count = 0;
    int? last_value = null;

    // copy and sort the array
    var sorted = new int[values.Length];
    Array.Copy(values, sorted, values.Length);
    Array.Sort(sorted);

    // look for odd occurrences of a value until we find one
    for (int i = 0; i < sorted.Length; ++i)
    {
        if (sorted[i] == last_value)
        {
            ++count;
        }
        else
        {
            if (count % 2 != 0)
            {
                return sorted[i];
            }
            else
            {
                count = 1;
                last_value = sorted[i];
            }
        }
    }

    // extra check for the last value
    return (count % 2 != 0 ? last_value : null);
}

The above algorithm makes use of the Array.Sort() method to sort a copy of the input array first. Then it tests each value in the sorted sequence until it finds a value with an odd number of occurrences.

Array.Sort() selects an appropriate algorithm based on length between Insertion Sort, Heapsort and Quicksort, with an overall worst-case time complexity of O(n*log(n)). We then pay a further O(n) in a worst-case to find an odd one left out.

This looks a bit better than the worst-case scenario of the first approach, but can we do even better?

An Efficient Solution

We can achieve a more efficient algorithm by keeping track of value occurrences while making a single pass over the array. In fact, we do not even need to know how many occurrences of each value there are, unlike what the brute-force algorithm would make us believe. We only need to know whether there is an current odd one left out for each value that we track.

public static int? FindOddOneOutByHashing(int[] values)
{
    // prepare a lookup for on-the-go checks
    var candidates = new HashSet<int>();

    // check each item in the enumeration
    for (int i = 0; i < values.Length; ++i)
    {
        // check if we have found this value once before
        if (candidates.Contains(values[i]))
        {
            // if yes then it is no longer a candidate for odd occurrences
            candidates.Remove(values[i]);
        }
        else
        {
            // if not then it is either the first time it appears in the enumeration
            // or all previous appearances have found a pair
            // in that case we add it to the lookup as a new candidate for odd occurences
            candidates.Add(values[i]);
        }
    }

    // check if any candidate survived
    if (candidates.Count > 0)
    {
        return candidates.First();
    }
    else
    {
        return null;
    }
}

The above algorithm makes use of a HashSet<T> collection as a cheap state-machine for keeping track of whether a given value has been matched or not. Contains() and Remove() are O(1) operations, while Add() approaches a O(1) operation, given a decent GetHashCode() implementation to minimize hash bucket sizes.

The HashSet<T>’s approaching O(1) time complexity combined with the single-pass iteration over the array makes this algorithm approach O(n), at the expense of allocating the HashSet<T> at startup.

At the end there is an extra O(1) operation to fetch any identified odd one out.

Performance

Performance comparison for the three approaches isn’t as straightforward as one would expect.

  • The simple algorithm varies in performance depending on what position the odd item out is in the array. Its best-case scenario may be a nice O(n) but its worst-case scenario is a not so nice O((n*n)/2).
  • The ordering algorithm brings complexity down to a more manageable O(n*log(n) + n) worst-case scenario, with some up-front cost to sort the array.
  • The hashing algorithm brings complexity down to almost O(n), albeit at a price of creating and maintaining a HashSet<T> over the single pass. But, then again, so does the brute force algorithm pay for it.

Let’s see how performance compares on a best-case scenario and a worst-case scenario.

You can also run these benchmarks yourself from the Quicker repository.

Best Case Scenario

In a best-case scenario, the odd element is at the start of the input array, which is ideal for the brute scenario.

Each test array contains a pattern similiar to:

[0] [1] [2] [3] [4] [0] [1] [2] [3] [4] [0]

Here are some numbers, lower is better.

Method N Mean Ratio Rank
FindOddOneOutByIterating 1 19.73 ns 1.00 *
FindOddOneOutByOrdering 1 35.33 ns 1.79 **
FindOddOneOutByHashing 1 102.13 ns 5.15 ***
         
FindOddOneOutByIterating 11 26.97 ns 1.00 *
FindOddOneOutByOrdering 11 118.27 ns 4.32 **
FindOddOneOutByHashing 11 449.65 ns 16.47 ***
         
FindOddOneOutByIterating 101 100.84 ns 1.00 *
FindOddOneOutByOrdering 101 1,066.39 ns 10.63 **
FindOddOneOutByHashing 101 3,402.77 ns 33.71 ***
         
FindOddOneOutByIterating 1001 656.41 ns 1.00 *
FindOddOneOutByOrdering 1001 22,699.50 ns 34.62 **
FindOddOneOutByHashing 1001 31,599.56 ns 48.05 ***
         
FindOddOneOutByIterating 10001 6,183.70 ns 1.00 *
FindOddOneOutByOrdering 10001 546,227.42 ns 88.33 ***
FindOddOneOutByHashing 10001 349,001.85 ns 56.44 **
         
FindOddOneOutByIterating 100001 61,403.81 ns 1.00 *
FindOddOneOutByOrdering 100001 7,183,987.40 ns 117.00 ***
FindOddOneOutByHashing 100001 3,155,303.61 ns 51.40 **



For a best-case scenario:

  • The brute-force algorithm shows a clear advantage in all cases where the first item is the odd one out. As it iterates only once through the array in that case, it outpaces the setup time required for sorting and the maintenance time required for hash pairing.
  • The ordering algorithm does start out better than hashing, but then it loses pace as the array size becomes significant.
  • The hashing algorithm shows the greatest initial cost, but its performance benchmark remains stable as we add more items.

Worst Case Scenario

However, the tables turn in a dramatic fashion for a worst-case scenario, where the odd one out is at the end of the array.

Each test array contains a pattern similiar to:

[0] [1] [2] [3] [4] [0] [1] [2] [3] [4] [5]

Here are some numbers, lower is better.

Method N Mean Ratio Rank
FindOddOneOutByIterating 1 19.86 ns 1.00 *
FindOddOneOutByOrdering 1 34.18 ns 1.72 **
FindOddOneOutByHashing 1 91.43 ns 4.62 ***
         
FindOddOneOutByIterating 11 328.65 ns 1.00 **
FindOddOneOutByOrdering 11 136.84 ns 0.42 *
FindOddOneOutByHashing 11 429.78 ns 1.31 ***
         
FindOddOneOutByIterating 101 7,284.18 ns 1.00 ***
FindOddOneOutByOrdering 101 1,323.07 ns 0.18 *
FindOddOneOutByHashing 101 3,368.68 ns 0.46 **
         
FindOddOneOutByIterating 1001 348,152.80 ns 1.00 ***
FindOddOneOutByOrdering 1001 22,426.86 ns 0.06 *
FindOddOneOutByHashing 1001 31,421.15 ns 0.09 **
         
FindOddOneOutByIterating 10001 31,160,440.83 ns 1.00 ***
FindOddOneOutByOrdering 10001 476,150.85 ns 0.02 **
FindOddOneOutByHashing 10001 347,095.77 ns 0.01 *
         
FindOddOneOutByIterating 100001 3,072,503,593.03 ns 1.000 ***
FindOddOneOutByOrdering 100001 7,446,879.34 ns 0.002 **
FindOddOneOutByHashing 100001 3,091,703.46 ns 0.001 *



The charts now make clear the difference in scalability between the three algorithms.

  • While brute-forcing works for just above 10 items, it becomes unusable after that.
  • The ordering algorithm holds its ground well, event if struggling a bit as the array grows. It also shows lower initial cost for small arrays than the hashing alternative.
  • However, when array length becomes significant, the hashing algorithm leaves all contenders behind with no apologies. It runs in the same time regardless of where the odd one out is and it scales in a linear fashion. For a 100K-long array, the hashing algorithm takes less than 0.03% of the time of brute-forcing and less than half the time of ordering.

The Winner

… is not what you expect.

The most interesting bit in these tests is not the scalability of the hashing algorithm. It’s the performance cross-overs between 10 and 100 items and then between 1K and 10K items. Not one of these algorithms is better than all the others ones all the time - each one has their particular performance domain.

This realization opens the door for a hybrid algorithm, a pseudo-algorithm that elects the better suited approach depending on the incoming number of items. Similar to the sort technique selection in Array.Sort(), this can provide the best performance for any count of input elements.

Here is one way to implement this:

public static int? FindOddOneOutHybrid(int[] values, int orderingThreshold = 10, int hashingThreshold = 10000)
{
    if (values.Length > hashingThreshold) return FindOddOneOutByHashing(values);
    if (values.Length > orderingThreshold) return FindOddOneOutByOrdering(values);
    return FindOddOneOutByIterating(values);
}

The default thresholds are taken from earlier testing and we can always optimize them with finer-grained benchmarking.

Here are the results of a new benchmark run with default thresholds. There is a slight variance in values as I have run this on a laptop, but the overall performance ranking holds.

Method N Mean Ratio Rank
FindOddOneOutByIterating 1 20.65 ns 1.00 **
FindOddOneOutByOrdering 1 35.98 ns 1.81 ***
FindOddOneOutByHashing 1 90.05 ns 4.38 **
FindOddOneOutHybrid 1 19.92 ns 0.97 *
         
FindOddOneOutByIterating 11 345.30 ns 1.00 **
FindOddOneOutByOrdering 11 137.49 ns 0.39 *
FindOddOneOutByHashing 11 461.86 ns 1.37 ***
FindOddOneOutHybrid 11 136.58 ns 0.39 *
         
FindOddOneOutByIterating 101 7,509.89 ns 1.00 **
FindOddOneOutByOrdering 101 1,350.84 ns 0.18 *
FindOddOneOutByHashing 101 3,491.06 ns 0.46 ***
FindOddOneOutHybrid 101 1,378.60 ns 0.18 **
         
FindOddOneOutByIterating 1001 351,577.94 ns 1.00 ***
FindOddOneOutByOrdering 1001 23,636.44 ns 0.07 *
FindOddOneOutByHashing 1001 32,036.17 ns 0.09 **
FindOddOneOutHybrid 1001 23,039.26 ns 0.07 *
         
FindOddOneOutByIterating 10001 31,723,898.86 ns 1.00 ***
FindOddOneOutByOrdering 10001 481,376.16 ns 0.02 **
FindOddOneOutByHashing 10001 348,620.23 ns 0.01 *
FindOddOneOutHybrid 10001 353,378.17 ns 0.01 *
         
FindOddOneOutByIterating 100001 3,106,550,714.05 ns 1.000 **
FindOddOneOutByOrdering 100001 7,459,762.64 ns 0.002 ***
FindOddOneOutByHashing 100001 3,290,452.46 ns 0.001 **
FindOddOneOutHybrid 100001 3,138,249.50 ns 0.001 *

Hint: You can click on the series names in the chart to hide and show individual lines.



The charts speak for themselves. The true winner of this contest is not an individual algorithm - but the use of the most appropriate algorithm for the given situation.

Takeaway

This article described a simple variation of the finding the odd one out problem and compared three possible solutions for it.

As the benchmarks show, it can pay off by orders of magnitude to invest in some setup time at the beginning of an algorithm.

It also pays off to create a hybrid algorithm - one that elects the most appropriate approach for the situation at hand.

Resources

You can find all the code for this post in the Quicker repository, including all the benchmarks.

Jorge Candeias's Picture

About Jorge Candeias

Jorge helps organizations succeed by building high-performing solutions on the Microsoft tech stack.

London, United Kingdom https://jorgecandeias.github.io