What is the quickest (and least resource intensive) to compare two massive (>50.000 items) and as a result have two lists like the ones below:
items that show up in the first list but not in the second items that show up in the second list but not in the first
Currently I'm working with the List or IReadOnlyCollection and solve this issue in a linq query:
var list1 = list.Where(i => !list2.Contains(i)).ToList();
var list2 = list2.Where(i => !list.Contains(i)).ToList();
But this doesn't perform as good as i would like. Any idea of making this quicker and less resource intensive as i need to process a lot of lists?
Use Except
:
var firstNotSecond = list1.Except(list2).ToList();
var secondNotFirst = list2.Except(list1).ToList();
I suspect there are approaches which would actually be marginally faster than this, but even this will be vastly faster than your O(N * M) approach.
If you want to combine these, you could create a method with the above and then a return statement:
return !firstNotSecond.Any() && !secondNotFirst.Any();
One point to note is that there is a difference in results between the original code in the question and the solution here: any duplicate elements which are only in one list will only be reported once with my code, whereas they'd be reported as many times as they occur in the original code.
For example, with lists of [1, 2, 2, 2, 3]
and [1]
, the "elements in list1 but not list2" result in the original code would be [2, 2, 2, 3]
. With my code it would just be [2, 3]
. In many cases that won't be an issue, but it's worth being aware of.
Enumerable.SequenceEqual Method Determines whether two sequences are equal according to an equality comparer. MS.Docs
Enumerable.SequenceEqual(list1, list2);
This works for all primitive data types. If you need to use it on custom objects you need to implement IEqualityComparer
Defines methods to support the comparison of objects for equality.
IEqualityComparer Interface Defines methods to support the comparison of objects for equality. MS.Docs for IEqualityComparer
SequenceEqual
is a simple bool
. The OP wants two lists of results - and describes what they want in terms of set operations: "items that show up in the first list but not in the second". There's no indication that ordering is relevant, whereas SequenceEqual does consider it to be relevant. This appears to be answering an entirely different question.
More efficient would be using Enumerable.Except
:
var inListButNotInList2 = list.Except(list2);
var inList2ButNotInList = list2.Except(list);
This method is implemented by using deferred execution. That means you could write for example:
var first10 = inListButNotInList2.Take(10);
It is also efficient since it internally uses a Set<T>
to compare the objects. It works by first collecting all distinct values from the second sequence, and then streaming the results of the first, checking that they haven't been seen before.
Set<T>
is built from the second sequence (i.e. it's fully iterated and stored), then items that can be added from the first sequence are yielded.
Where
is partially deferred because in list.Where(x => x.Id == 5)
the value of the number 5
is stored at the start, rather than executed lazily.
If you want the results to be case insensitive, the following will work:
List<string> list1 = new List<string> { "a.dll", "b1.dll" };
List<string> list2 = new List<string> { "A.dll", "b2.dll" };
var firstNotSecond = list1.Except(list2, StringComparer.OrdinalIgnoreCase).ToList();
var secondNotFirst = list2.Except(list1, StringComparer.OrdinalIgnoreCase).ToList();
firstNotSecond
would contain b1.dll
secondNotFirst
would contain b2.dll
using System.Collections.Generic;
using System.Linq;
namespace YourProject.Extensions
{
public static class ListExtensions
{
public static bool SetwiseEquivalentTo<T>(this List<T> list, List<T> other)
where T: IEquatable<T>
{
if (list.Except(other).Any())
return false;
if (other.Except(list).Any())
return false;
return true;
}
}
}
Sometimes you only need to know if two lists are different, and not what those differences are. In that case, consider adding this extension method to your project. Note that your listed objects should implement IEquatable!
Usage:
public sealed class Car : IEquatable<Car>
{
public Price Price { get; }
public List<Component> Components { get; }
...
public override bool Equals(object obj)
=> obj is Car other && Equals(other);
public bool Equals(Car other)
=> Price == other.Price
&& Components.SetwiseEquivalentTo(other.Components);
public override int GetHashCode()
=> Components.Aggregate(
Price.GetHashCode(),
(code, next) => code ^ next.GetHashCode()); // Bitwise XOR
}
Whatever the Component
class is, the methods shown here for Car
should be implemented almost identically.
It's very important to note how we've written GetHashCode. In order to properly implement IEquatable
, Equals
and GetHashCode
must operate on the instance's properties in a logically compatible way.
Two lists with the same contents are still different objects, and will produce different hash codes. Since we want these two lists to be treated as equal, we must let GetHashCode
produce the same value for each of them. We can accomplish this by delegating the hashcode to every element in the list, and using the standard bitwise XOR to combine them all. XOR is order-agnostic, so it doesn't matter if the lists are sorted differently. It only matters that they contain nothing but equivalent members.
Note: the strange name is to imply the fact that the method does not consider the order of the elements in the list. If you do care about the order of the elements in the list, this method is not for you!
Not for this Problem, but here's some code to compare lists for equal and not! identical objects:
public class EquatableList<T> : List<T>, IEquatable<EquatableList<T>> where T : IEquatable<T>
/// <summary>
/// True, if this contains element with equal property-values
/// </summary>
/// <param name="element">element of Type T</param>
/// <returns>True, if this contains element</returns>
public new Boolean Contains(T element)
{
return this.Any(t => t.Equals(element));
}
/// <summary>
/// True, if list is equal to this
/// </summary>
/// <param name="list">list</param>
/// <returns>True, if instance equals list</returns>
public Boolean Equals(EquatableList<T> list)
{
if (list == null) return false;
return this.All(list.Contains) && list.All(this.Contains);
}
Except
If only combined result needed, this will work too:
var set1 = new HashSet<T>(list1);
var set2 = new HashSet<T>(list2);
var areEqual = set1.SetEquals(set2);
where T is type of lists element.
try this way:
var difList = list1.Where(a => !list2.Any(a1 => a1.id == a.id))
.Union(list2.Where(a => !list1.Any(a1 => a1.id == a.id)));
While Jon Skeet's answer is an excellent advice for everyday's practice with small to moderate number of elements (up to a few millions) it is nevertheless not the fastest approach and not very resource efficient. An obvious drawback is the fact that getting the full difference requires two passes over the data (even three if the elements that are equal are of interest as well). Clearly, this can be avoided by a customized reimplementation of the Except
method, but it remains that the creation of a hash set requires a lot of memory and the computation of hashes requires time.
For very large data sets (in the billions of elements) it usually pays off to consider the particular circumstances. Here are a few ideas that might provide some inspiration: If the elements can be compared (which is almost always the case in practice), then sorting the lists and applying the following zip approach is worth consideration:
/// <returns>The elements of the specified (ascendingly) sorted enumerations that are
/// contained only in one of them, together with an indicator,
/// whether the element is contained in the reference enumeration (-1)
/// or in the difference enumeration (+1).</returns>
public static IEnumerable<Tuple<T, int>> FindDifferences<T>(IEnumerable<T> sortedReferenceObjects,
IEnumerable<T> sortedDifferenceObjects, IComparer<T> comparer)
{
var refs = sortedReferenceObjects.GetEnumerator();
var diffs = sortedDifferenceObjects.GetEnumerator();
bool hasNext = refs.MoveNext() && diffs.MoveNext();
while (hasNext)
{
int comparison = comparer.Compare(refs.Current, diffs.Current);
if (comparison == 0)
{
// insert code that emits the current element if equal elements should be kept
hasNext = refs.MoveNext() && diffs.MoveNext();
}
else if (comparison < 0)
{
yield return Tuple.Create(refs.Current, -1);
hasNext = refs.MoveNext();
}
else
{
yield return Tuple.Create(diffs.Current, 1);
hasNext = diffs.MoveNext();
}
}
}
This can e.g. be used in the following way:
const int N = <Large number>;
const int omit1 = 231567;
const int omit2 = 589932;
IEnumerable<int> numberSequence1 = Enumerable.Range(0, N).Select(i => i < omit1 ? i : i + 1);
IEnumerable<int> numberSequence2 = Enumerable.Range(0, N).Select(i => i < omit2 ? i : i + 1);
var numberDiffs = FindDifferences(numberSequence1, numberSequence2, Comparer<int>.Default);
Benchmarking on my computer gave the following result for N = 1M:
Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated DiffLinq 115.19 ms 0.656 ms 0.582 ms 1.00 2800.0000 2800.0000 2800.0000 67110744 B DiffZip 23.48 ms 0.018 ms 0.015 ms 0.20 - - - 720 B
And for N = 100M:
Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated DiffLinq 12.146 s 0.0427 s 0.0379 s 1.00 13000.0000 13000.0000 13000.0000 8589937032 B DiffZip 2.324 s 0.0019 s 0.0018 s 0.19 - - - 720 B
Note that this example of course benefits from the fact that the lists are already sorted and integers can be very efficiently compared. But this is exactly the point: If you do have favourable circumstances, make sure that you exploit them.
A few further comments: The speed of the comparison function is clearly relevant for the overall performance, so it may be beneficial to optimize it. The flexibility to do so is a benefit of the zipping approach. Furthermore, parallelization seems more feasible to me, although by no means easy and maybe not worth the effort and the overhead. Nevertheless, a simple way to speed up the process by roughly a factor of 2, is to split the lists respectively in two halfs (if it can be efficiently done) and compare the parts in parallel, one processing from front to back and the other in reverse order.
I have used this code to compare two list which has million of records.
This method will not take much time
//Method to compare two list of string
private List<string> Contains(List<string> list1, List<string> list2)
{
List<string> result = new List<string>();
result.AddRange(list1.Except(list2, StringComparer.OrdinalIgnoreCase));
result.AddRange(list2.Except(list1, StringComparer.OrdinalIgnoreCase));
return result;
}
I compared 3 different methods for comparing different data sets. Tests below create a string collection of all the numbers from 0
to length - 1
, then another collection with the same range, but with even numbers. I then pick out the odd numbers from the first collection.
Using Linq Except
public void TestExcept()
{
WriteLine($"Except {DateTime.Now}");
int length = 20000000;
var dateTime = DateTime.Now;
var array = new string[length];
for (int i = 0; i < length; i++)
{
array[i] = i.ToString();
}
Write("Populate set processing time: ");
WriteLine(DateTime.Now - dateTime);
var newArray = new string[length/2];
int j = 0;
for (int i = 0; i < length; i+=2)
{
newArray[j++] = i.ToString();
}
dateTime = DateTime.Now;
Write("Count of items: ");
WriteLine(array.Except(newArray).Count());
Write("Count processing time: ");
WriteLine(DateTime.Now - dateTime);
}
Output
Except 2021-08-14 11:43:03 AM
Populate set processing time: 00:00:03.7230479
2021-08-14 11:43:09 AM
Count of items: 10000000
Count processing time: 00:00:02.9720879
Using HashSet.Add
public void TestHashSet()
{
WriteLine($"HashSet {DateTime.Now}");
int length = 20000000;
var dateTime = DateTime.Now;
var hashSet = new HashSet<string>();
for (int i = 0; i < length; i++)
{
hashSet.Add(i.ToString());
}
Write("Populate set processing time: ");
WriteLine(DateTime.Now - dateTime);
var newHashSet = new HashSet<string>();
for (int i = 0; i < length; i+=2)
{
newHashSet.Add(i.ToString());
}
dateTime = DateTime.Now;
Write("Count of items: ");
// HashSet Add returns true if item is added successfully (not previously existing)
WriteLine(hashSet.Where(s => newHashSet.Add(s)).Count());
Write("Count processing time: ");
WriteLine(DateTime.Now - dateTime);
}
Output
HashSet 2021-08-14 11:42:43 AM
Populate set processing time: 00:00:05.6000625
Count of items: 10000000
Count processing time: 00:00:01.7703057
Special HashSet test:
public void TestLoadingHashSet()
{
int length = 20000000;
var array = new string[length];
for (int i = 0; i < length; i++)
{
array[i] = i.ToString();
}
var dateTime = DateTime.Now;
var hashSet = new HashSet<string>(array);
Write("Time to load hashset: ");
WriteLine(DateTime.Now - dateTime);
}
> TestLoadingHashSet()
Time to load hashset: 00:00:01.1918160
Using .Contains
public void TestContains()
{
WriteLine($"Contains {DateTime.Now}");
int length = 20000000;
var dateTime = DateTime.Now;
var array = new string[length];
for (int i = 0; i < length; i++)
{
array[i] = i.ToString();
}
Write("Populate set processing time: ");
WriteLine(DateTime.Now - dateTime);
var newArray = new string[length/2];
int j = 0;
for (int i = 0; i < length; i+=2)
{
newArray[j++] = i.ToString();
}
dateTime = DateTime.Now;
WriteLine(dateTime);
Write("Count of items: ");
WriteLine(array.Where(a => !newArray.Contains(a)).Count());
Write("Count processing time: ");
WriteLine(DateTime.Now - dateTime);
}
Output
Contains 2021-08-14 11:19:44 AM
Populate set processing time: 00:00:03.1046998
2021-08-14 11:19:49 AM
Count of items: Hosting process exited with exit code 1.
(Didnt complete. Killed it after 14 minutes)
Conclusion:
Linq Except ran approximately 1 second slower on my device than using HashSets (n=20,000,000).
Using Where and Contains ran for a very long time
Closing remarks on HashSets:
Unique data
Make sure to override GetHashCode (correctly) for class types
May need up to 2x the memory if you make a copy of the data set, depending on implementation
HashSet is optimized for cloning other HashSets using the IEnumerable constructor, but it is slower to convert other collections to HashSets (see special test above)
First approach:
if (list1 != null && list2 != null && list1.Select(x => list2.SingleOrDefault(y => y.propertyToCompare == x.propertyToCompare && y.anotherPropertyToCompare == x.anotherPropertyToCompare) != null).All(x => true))
return true;
Second approach if you are ok with duplicate values:
if (list1 != null && list2 != null && list1.Select(x => list2.Any(y => y.propertyToCompare == x.propertyToCompare && y.anotherPropertyToCompare == x.anotherPropertyToCompare)).All(x => true))
return true;
One line:
var list1 = new List<int> { 1, 2, 3 };
var list2 = new List<int> { 1, 2, 3, 4 };
if (list1.Except(list2).Count() + list2.Except(list1).Count() == 0)
Console.WriteLine("same sets");
I think this is a simple and easy way to compare two lists element by element
x=[1,2,3,5,4,8,7,11,12,45,96,25]
y=[2,4,5,6,8,7,88,9,6,55,44,23]
tmp = []
for i in range(len(x)) and range(len(y)):
if x[i]>y[i]:
tmp.append(1)
else:
tmp.append(0)
print(tmp)
Maybe it's funny, but this works for me:
string.Join("",List1) != string.Join("", List2)
ToString
results of two objects don't per se make them equal and that both lists are supposed to be sorted.
This is the best solution you'll found
var list3 = list1.Where(l => list2.ToList().Contains(l));
List<T>
for each element in list1
. Also the result is called list3
when it's not a List<T>
.
Success story sharing
Equals(object)
and/or implementsIEquatable<T>
it should be fine.IEquatable<T>
implementation or theobject.Equals(object)
method. It sounds like you should create a new question with a minimal reproducible example - we can't really diagnose things in comments.