ChatGPT解决这个技术问题 Extra ChatGPT

Quickest way to compare two generic lists for differences

What is the quickest (and least resource intensive) to compare two massive (>50.000 items) and as a result have two lists like the ones below:

items that show up in the first list but not in the second items that show up in the second list but not in the first

Currently I'm working with the List or IReadOnlyCollection and solve this issue in a linq query:

var list1 = list.Where(i => !list2.Contains(i)).ToList();
var list2 = list2.Where(i => !list.Contains(i)).ToList();

But this doesn't perform as good as i would like. Any idea of making this quicker and less resource intensive as i need to process a lot of lists?

If you come across this question and consider adding a new answer, please note that they're not asking for a way, but the quickest way.

J
Jon Skeet

Use Except:

var firstNotSecond = list1.Except(list2).ToList();
var secondNotFirst = list2.Except(list1).ToList();

I suspect there are approaches which would actually be marginally faster than this, but even this will be vastly faster than your O(N * M) approach.

If you want to combine these, you could create a method with the above and then a return statement:

return !firstNotSecond.Any() && !secondNotFirst.Any();

One point to note is that there is a difference in results between the original code in the question and the solution here: any duplicate elements which are only in one list will only be reported once with my code, whereas they'd be reported as many times as they occur in the original code.

For example, with lists of [1, 2, 2, 2, 3] and [1], the "elements in list1 but not list2" result in the original code would be [2, 2, 2, 3]. With my code it would just be [2, 3]. In many cases that won't be an issue, but it's worth being aware of.


This is really a huge performance gain! Thanks for this answer.
I'm wondering for two huge lists, is it useful to sort before compare? or inside Except extension method, the list passed in is sorted already.
@Larry: It's not sorted; it builds a hash set.
@PranavSingh: It will work for anything that has appropriate equality - so if your custom type overrides Equals(object) and/or implements IEquatable<T> it should be fine.
@k2ibegin: It uses the default equality comparer, which will use an IEquatable<T> implementation or the object.Equals(object) method. It sounds like you should create a new question with a minimal reproducible example - we can't really diagnose things in comments.
C
Community

Enumerable.SequenceEqual Method Determines whether two sequences are equal according to an equality comparer. MS.Docs

Enumerable.SequenceEqual(list1, list2);

This works for all primitive data types. If you need to use it on custom objects you need to implement IEqualityComparer

Defines methods to support the comparison of objects for equality.

IEqualityComparer Interface Defines methods to support the comparison of objects for equality. MS.Docs for IEqualityComparer


this should be the accepted answer. The question is not about SETS but about LISTS, which can contain duplication of elements.
I don't see how this could be the answer, given that the result of SequenceEqual is a simple bool. The OP wants two lists of results - and describes what they want in terms of set operations: "items that show up in the first list but not in the second". There's no indication that ordering is relevant, whereas SequenceEqual does consider it to be relevant. This appears to be answering an entirely different question.
yes, correct, seems that I answered this one too fast and didn't look at the second part of the request...same as the first two comments...
This is a cute one (in my opinion the best answer) and works like a charm. Thanks!
T
Tim Schmelter

More efficient would be using Enumerable.Except:

var inListButNotInList2 = list.Except(list2);
var inList2ButNotInList = list2.Except(list);

This method is implemented by using deferred execution. That means you could write for example:

var first10 = inListButNotInList2.Take(10);

It is also efficient since it internally uses a Set<T> to compare the objects. It works by first collecting all distinct values from the second sequence, and then streaming the results of the first, checking that they haven't been seen before.


Hmm. Not quite deferred. I'd say partially deferred. A complete Set<T> is built from the second sequence (i.e. it's fully iterated and stored), then items that can be added from the first sequence are yielded.
@spender, that's like saying that execution of Where is partially deferred because in list.Where(x => x.Id == 5) the value of the number 5 is stored at the start, rather than executed lazily.
A
Arulkumar

If you want the results to be case insensitive, the following will work:

List<string> list1 = new List<string> { "a.dll", "b1.dll" };
List<string> list2 = new List<string> { "A.dll", "b2.dll" };

var firstNotSecond = list1.Except(list2, StringComparer.OrdinalIgnoreCase).ToList();
var secondNotFirst = list2.Except(list1, StringComparer.OrdinalIgnoreCase).ToList();

firstNotSecond would contain b1.dll

secondNotFirst would contain b2.dll


D
Devon Parsons
using System.Collections.Generic;
using System.Linq;

namespace YourProject.Extensions
{
    public static class ListExtensions
    {
        public static bool SetwiseEquivalentTo<T>(this List<T> list, List<T> other)
            where T: IEquatable<T>
        {
            if (list.Except(other).Any())
                return false;
            if (other.Except(list).Any())
                return false;
            return true;
        }
    }
}

Sometimes you only need to know if two lists are different, and not what those differences are. In that case, consider adding this extension method to your project. Note that your listed objects should implement IEquatable!

Usage:

public sealed class Car : IEquatable<Car>
{
    public Price Price { get; }
    public List<Component> Components { get; }

    ...
    public override bool Equals(object obj)
        => obj is Car other && Equals(other);

    public bool Equals(Car other)
        => Price == other.Price
            && Components.SetwiseEquivalentTo(other.Components);

    public override int GetHashCode()
        => Components.Aggregate(
            Price.GetHashCode(),
            (code, next) => code ^ next.GetHashCode()); // Bitwise XOR
}

Whatever the Component class is, the methods shown here for Car should be implemented almost identically.

It's very important to note how we've written GetHashCode. In order to properly implement IEquatable, Equals and GetHashCode must operate on the instance's properties in a logically compatible way.

Two lists with the same contents are still different objects, and will produce different hash codes. Since we want these two lists to be treated as equal, we must let GetHashCode produce the same value for each of them. We can accomplish this by delegating the hashcode to every element in the list, and using the standard bitwise XOR to combine them all. XOR is order-agnostic, so it doesn't matter if the lists are sorted differently. It only matters that they contain nothing but equivalent members.

Note: the strange name is to imply the fact that the method does not consider the order of the elements in the list. If you do care about the order of the elements in the list, this method is not for you!


A
Armstrongest

Not for this Problem, but here's some code to compare lists for equal and not! identical objects:

public class EquatableList<T> : List<T>, IEquatable<EquatableList<T>> where    T : IEquatable<T>

/// <summary>
/// True, if this contains element with equal property-values
/// </summary>
/// <param name="element">element of Type T</param>
/// <returns>True, if this contains element</returns>
public new Boolean Contains(T element)
{
    return this.Any(t => t.Equals(element));
}

/// <summary>
/// True, if list is equal to this
/// </summary>
/// <param name="list">list</param>
/// <returns>True, if instance equals list</returns>
public Boolean Equals(EquatableList<T> list)
{
    if (list == null) return false;
    return this.All(list.Contains) && list.All(this.Contains);
}

This is what you need to be able to compare custom data types. Then use Except
You can probably do better with sortable types. This runs in O(n^2), while you can do O(nlogn).
A
Ali Khaleghi Karsalari

If only combined result needed, this will work too:

var set1 = new HashSet<T>(list1);
var set2 = new HashSet<T>(list2);
var areEqual = set1.SetEquals(set2);

where T is type of lists element.


A
Alex

try this way:

var difList = list1.Where(a => !list2.Any(a1 => a1.id == a.id))
            .Union(list2.Where(a => !list1.Any(a1 => a1.id == a.id)));

This suffers from horrible performance, requiring a scan of the second list for every item in the first. Not downvoting because it works, but it's as bad as the original code.
I think the OP was addressing the issue of not being implemented IEquitable for the objects.
K
KloppyToppy

While Jon Skeet's answer is an excellent advice for everyday's practice with small to moderate number of elements (up to a few millions) it is nevertheless not the fastest approach and not very resource efficient. An obvious drawback is the fact that getting the full difference requires two passes over the data (even three if the elements that are equal are of interest as well). Clearly, this can be avoided by a customized reimplementation of the Except method, but it remains that the creation of a hash set requires a lot of memory and the computation of hashes requires time.

For very large data sets (in the billions of elements) it usually pays off to consider the particular circumstances. Here are a few ideas that might provide some inspiration: If the elements can be compared (which is almost always the case in practice), then sorting the lists and applying the following zip approach is worth consideration:

/// <returns>The elements of the specified (ascendingly) sorted enumerations that are
/// contained only in one of them, together with an indicator,
/// whether the element is contained in the reference enumeration (-1)
/// or in the difference enumeration (+1).</returns>
public static IEnumerable<Tuple<T, int>> FindDifferences<T>(IEnumerable<T> sortedReferenceObjects,
    IEnumerable<T> sortedDifferenceObjects, IComparer<T> comparer)
{
    var refs  = sortedReferenceObjects.GetEnumerator();
    var diffs = sortedDifferenceObjects.GetEnumerator();
    bool hasNext = refs.MoveNext() && diffs.MoveNext();
    while (hasNext)
    {
        int comparison = comparer.Compare(refs.Current, diffs.Current);
        if (comparison == 0)
        {
            // insert code that emits the current element if equal elements should be kept
            hasNext = refs.MoveNext() && diffs.MoveNext();

        }
        else if (comparison < 0)
        {
            yield return Tuple.Create(refs.Current, -1);
            hasNext = refs.MoveNext();
        }
        else
        {
            yield return Tuple.Create(diffs.Current, 1);
            hasNext = diffs.MoveNext();
        }
    }
}

This can e.g. be used in the following way:

const int N = <Large number>;
const int omit1 = 231567;
const int omit2 = 589932;
IEnumerable<int> numberSequence1 = Enumerable.Range(0, N).Select(i => i < omit1 ? i : i + 1);
IEnumerable<int> numberSequence2 = Enumerable.Range(0, N).Select(i => i < omit2 ? i : i + 1);
var numberDiffs = FindDifferences(numberSequence1, numberSequence2, Comparer<int>.Default);

Benchmarking on my computer gave the following result for N = 1M:

Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated DiffLinq 115.19 ms 0.656 ms 0.582 ms 1.00 2800.0000 2800.0000 2800.0000 67110744 B DiffZip 23.48 ms 0.018 ms 0.015 ms 0.20 - - - 720 B

And for N = 100M:

Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated DiffLinq 12.146 s 0.0427 s 0.0379 s 1.00 13000.0000 13000.0000 13000.0000 8589937032 B DiffZip 2.324 s 0.0019 s 0.0018 s 0.19 - - - 720 B

Note that this example of course benefits from the fact that the lists are already sorted and integers can be very efficiently compared. But this is exactly the point: If you do have favourable circumstances, make sure that you exploit them.

A few further comments: The speed of the comparison function is clearly relevant for the overall performance, so it may be beneficial to optimize it. The flexibility to do so is a benefit of the zipping approach. Furthermore, parallelization seems more feasible to me, although by no means easy and maybe not worth the effort and the overhead. Nevertheless, a simple way to speed up the process by roughly a factor of 2, is to split the lists respectively in two halfs (if it can be efficiently done) and compare the parts in parallel, one processing from front to back and the other in reverse order.


S
Sathish

I have used this code to compare two list which has million of records.

This method will not take much time

    //Method to compare two list of string
    private List<string> Contains(List<string> list1, List<string> list2)
    {
        List<string> result = new List<string>();

        result.AddRange(list1.Except(list2, StringComparer.OrdinalIgnoreCase));
        result.AddRange(list2.Except(list1, StringComparer.OrdinalIgnoreCase));

        return result;
    }

A
Ahmad

I compared 3 different methods for comparing different data sets. Tests below create a string collection of all the numbers from 0 to length - 1, then another collection with the same range, but with even numbers. I then pick out the odd numbers from the first collection.

Using Linq Except

public void TestExcept()
{
    WriteLine($"Except {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
        array[i] = i.ToString();
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);
    var newArray = new string[length/2];
    int j = 0;
    for (int i = 0; i < length; i+=2)
    {
        newArray[j++] = i.ToString();
    }
    dateTime = DateTime.Now;
    Write("Count of items: ");
    WriteLine(array.Except(newArray).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}

Output

Except 2021-08-14 11:43:03 AM
Populate set processing time: 00:00:03.7230479
2021-08-14 11:43:09 AM
Count of items: 10000000
Count processing time: 00:00:02.9720879

Using HashSet.Add

public void TestHashSet()
{
    WriteLine($"HashSet {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var hashSet = new HashSet<string>();
    for (int i = 0; i < length; i++)
    {
        hashSet.Add(i.ToString());
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);
    var newHashSet = new HashSet<string>();
    for (int i = 0; i < length; i+=2)
    {
        newHashSet.Add(i.ToString());
    }
    dateTime = DateTime.Now;
    Write("Count of items: ");
    // HashSet Add returns true if item is added successfully (not previously existing)
    WriteLine(hashSet.Where(s => newHashSet.Add(s)).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}

Output

HashSet 2021-08-14 11:42:43 AM
Populate set processing time: 00:00:05.6000625
Count of items: 10000000
Count processing time: 00:00:01.7703057

Special HashSet test:

public void TestLoadingHashSet()
{
    int length = 20000000;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
       array[i] = i.ToString();
    }
    var dateTime = DateTime.Now;
    var hashSet = new HashSet<string>(array);
    Write("Time to load hashset: ");
    WriteLine(DateTime.Now - dateTime);
}
> TestLoadingHashSet()
Time to load hashset: 00:00:01.1918160

Using .Contains

public void TestContains()
{
    WriteLine($"Contains {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
        array[i] = i.ToString();
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);
    var newArray = new string[length/2];
    int j = 0;
    for (int i = 0; i < length; i+=2)
    {
        newArray[j++] = i.ToString();
    }
    dateTime = DateTime.Now;
    WriteLine(dateTime);
    Write("Count of items: ");
    WriteLine(array.Where(a => !newArray.Contains(a)).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}

Output

Contains 2021-08-14 11:19:44 AM
Populate set processing time: 00:00:03.1046998
2021-08-14 11:19:49 AM
Count of items: Hosting process exited with exit code 1.
(Didnt complete. Killed it after 14 minutes)

Conclusion:

Linq Except ran approximately 1 second slower on my device than using HashSets (n=20,000,000).

Using Where and Contains ran for a very long time

Closing remarks on HashSets:

Unique data

Make sure to override GetHashCode (correctly) for class types

May need up to 2x the memory if you make a copy of the data set, depending on implementation

HashSet is optimized for cloning other HashSets using the IEnumerable constructor, but it is slower to convert other collections to HashSets (see special test above)


It's not convincing to show data of only one run. The order of runs may matter and there's always noise involved. Things like this should be benchmarked reliably, for example by a benchmarking framework like benchmarkdotnet.org.
I didnt include all the data, but I did run it several times before/after/repeated etc, results were consistent.
u
user3415602

First approach:

if (list1 != null && list2 != null && list1.Select(x => list2.SingleOrDefault(y => y.propertyToCompare == x.propertyToCompare && y.anotherPropertyToCompare == x.anotherPropertyToCompare) != null).All(x => true))
   return true;

Second approach if you are ok with duplicate values:

if (list1 != null && list2 != null && list1.Select(x => list2.Any(y => y.propertyToCompare == x.propertyToCompare && y.anotherPropertyToCompare == x.anotherPropertyToCompare)).All(x => true))
   return true;

They're not asking for a way, but the quickest way. Prove that if you answer the question. Besides, your methods always return true.
k
kame

One line:

var list1 = new List<int> { 1, 2, 3 };
var list2 = new List<int> { 1, 2, 3, 4 };
if (list1.Except(list2).Count() + list2.Except(list1).Count() == 0)
    Console.WriteLine("same sets");

They're not asking for a way, but the quickest way. Prove that if you answer the question.
u
user10915707

I think this is a simple and easy way to compare two lists element by element

x=[1,2,3,5,4,8,7,11,12,45,96,25]
y=[2,4,5,6,8,7,88,9,6,55,44,23]

tmp = []


for i in range(len(x)) and range(len(y)):
    if x[i]>y[i]:
        tmp.append(1)
    else:
        tmp.append(0)
print(tmp)

This is a C# question, and you have not provided C# code.
Perhaps you could delete this answer and move it to (for example) How can I compare two lists in python and return matches?
A
Adrian Mole

Maybe it's funny, but this works for me:

string.Join("",List1) != string.Join("", List2)

as it's written here it would not even work for List or List, as for example the two lists 11;2;3 and 1;12;3 would be identical since you don't join the strings with some unique separator that isn't a possible item in the list. Apart from that, concatenating strings for a list with a lot of items is probably a performance killer.
@SwissCoder: You are wrong, this is not a performacne killer for string. If you have two list with 50.000 strings (each of length 3) this algorithm needs 3 ms on my machine. The accepted answere needs 7. I think the trick is Jibz only needs one string comparision. Of course he has to add a unique separator.
@user1027167: Im not talking about comparing strings directly (as this is also not the question). Calling the .ToString() method of all the objects in a List with 50.000 objects can create a huge string, depending how it's implemented. I don't think that's the way to go. Then it's also risky to rely on a character or string being "unique", the code would not really be reusable like that.
Ok that's true. The questioner asked for the quickest way without giving the datatype of his lists. Probably this answere is the quickest way for the use case of the questioner.
Not to mention that equal ToString results of two objects don't per se make them equal and that both lists are supposed to be sorted.
F
Fajoui El Mahdi

This is the best solution you'll found

var list3 = list1.Where(l => list2.ToList().Contains(l));

This is actually very bad because it creates a new List<T> for each element in list1. Also the result is called list3 when it's not a List<T>.