How do I efficiently obtain the frequency count for each unique value in a NumPy array? For example:
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> freq_count(x)
[(1, 5), (2, 3), (5, 1), (25, 1)]
Would collections.Counter(x) be sufficient?
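As a sketch of that route (Counter accepts any iterable; note that the keys here will be NumPy scalars, so the exact repr depends on your NumPy version):

from collections import Counter
import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
freq = Counter(x)          # keys are NumPy scalars, values are plain ints
print(freq.most_common())  # e.g. [(1, 5), (2, 3), (5, 1), (25, 1)]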
Use numpy.unique with return_counts=True (for NumPy 1.9+):
import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
unique, counts = np.unique(x, return_counts=True)
>>> print(np.asarray((unique, counts)).T)
[[ 1  5]
 [ 2  3]
 [ 5  1]
 [25  1]]
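If a mapping is handier than the stacked array, the two return values zip together directly; a small sketch:

import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
unique, counts = np.unique(x, return_counts=True)
freq = dict(zip(unique, counts))  # {value: count} mapping
print(freq)  # e.g. {1: 5, 2: 3, 5: 1, 25: 1}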
In comparison with scipy.stats.itemfreq:
In [4]: x = np.random.random_integers(0,100,1e6)
In [5]: %timeit unique, counts = np.unique(x, return_counts=True)
10 loops, best of 3: 31.5 ms per loop
In [6]: %timeit scipy.stats.itemfreq(x)
10 loops, best of 3: 170 ms per loop
Take a look at np.bincount: http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html
import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
y = np.bincount(x)
ii = np.nonzero(y)[0]
And then:

list(zip(ii, y[ii]))
# [(1, 5), (2, 3), (5, 1), (25, 1)]
or:

np.vstack((ii, y[ii])).T
# array([[ 1,  5],
#        [ 2,  3],
#        [ 5,  1],
#        [25,  1]])
or however you want to combine the counts and the unique values.
What about using the return_inverse option here?
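A sketch of how return_inverse could be used for counting: the inverse array maps each element to its index in the unique values, so a bincount over it yields the per-value frequencies.

import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
unique, inverse = np.unique(x, return_inverse=True)
counts = np.bincount(inverse)  # counts[i] is the frequency of unique[i]
print(list(zip(unique, counts)))  # e.g. [(1, 5), (2, 3), (5, 1), (25, 1)]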
Use this:
>>> import numpy as np
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> np.array(np.unique(x, return_counts=True)).T
array([[ 1,  5],
       [ 2,  3],
       [ 5,  1],
       [25,  1]])
Original answer: use scipy.stats.itemfreq (warning: deprecated):
>>> from scipy.stats import itemfreq
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> itemfreq(x)
/usr/local/bin/python:1: DeprecationWarning: `itemfreq` is deprecated! `itemfreq` is deprecated and will be removed in a future version. Use instead `np.unique(..., return_counts=True)`
array([[  1.,   5.],
       [  2.,   3.],
       [  5.,   1.],
       [ 25.,   1.]])
I was also interested in this, so I did a little performance comparison (using perfplot, a pet project of mine). Result:
y = np.bincount(a)
ii = np.nonzero(y)[0]
out = np.vstack((ii, y[ii])).T
is by far the fastest. (Note the log-scaling.)
https://i.stack.imgur.com/mjDiR.png
Code to generate the plot:
import numpy as np
import pandas as pd
import perfplot
from scipy.stats import itemfreq


def bincount(a):
    y = np.bincount(a)
    ii = np.nonzero(y)[0]
    return np.vstack((ii, y[ii])).T


def unique(a):
    unique, counts = np.unique(a, return_counts=True)
    return np.asarray((unique, counts)).T


def unique_count(a):
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)
    np.add.at(count, inverse, 1)
    return np.vstack((unique, count)).T


def pandas_value_counts(a):
    out = pd.value_counts(pd.Series(a))
    out.sort_index(inplace=True)
    out = np.stack([out.keys().values, out.values]).T
    return out


b = perfplot.bench(
    setup=lambda n: np.random.randint(0, 1000, n),
    kernels=[bincount, unique, itemfreq, unique_count, pandas_value_counts],
    n_range=[2 ** k for k in range(26)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()
You may need to pass equality_check=array_sorteq to perfplot.show(). What was causing an error (in Python 2) was pd.value_counts (even with sort=False).
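array_sorteq isn't defined above; a plausible stand-in (a hypothetical helper, not part of perfplot) would treat two kernel outputs as equal if they match after sorting:

import numpy as np

def array_sorteq(a, b):
    # Hypothetical helper: consider two results equal if they contain
    # the same values once both are sorted along axis 0.
    a, b = np.asarray(a), np.asarray(b)
    return a.shape == b.shape and np.array_equal(np.sort(a, axis=0), np.sort(b, axis=0))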
Using the pandas module:
>>> import pandas as pd
>>> import numpy as np
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> pd.value_counts(x)
1     5
2     3
25    1
5     1
dtype: int64
df = pd.DataFrame(x)
df = df.astype('category')
print(df.describe())

will give info like:

count     10
unique     4
top        1
freq       5

which can be useful.
This is by far the most general and performant solution; surprised it hasn't been posted yet.
import numpy as np

def unique_count(a):
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)  # dtype=int: np.int was removed in NumPy 1.24
    np.add.at(count, inverse, 1)
    return np.vstack((unique, count)).T

print(unique_count(np.random.randint(-10, 10, 100)))
Unlike the currently accepted answer, it works on any datatype that is sortable (not just positive ints), and it has optimal performance; the only significant expense is in the sorting done by np.unique.
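For example, a quick sketch of the same approach on non-integer data (strings), where np.bincount alone would not apply:

import numpy as np

a = np.array(['pear', 'apple', 'apple', 'pear', 'fig'])
unique, inverse = np.unique(a, return_inverse=True)
count = np.zeros(len(unique), dtype=int)
np.add.at(count, inverse, 1)  # scatter-add one per occurrence
print(list(zip(unique, count)))  # e.g. [('apple', 2), ('fig', 1), ('pear', 2)]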
If you see AttributeError: 'numpy.ufunc' object has no attribute 'at' (NumPy older than 1.8), np.bincount(inverse) computes the same counts.
numpy.bincount is probably the best choice. If your array contains anything besides small dense integers, it might be useful to wrap it in something like this:
def count_unique(keys):
    uniq_keys = np.unique(keys)
    bins = uniq_keys.searchsorted(keys)
    return uniq_keys, np.bincount(bins)
For example:
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> count_unique(x)
(array([ 1, 2, 5, 25]), array([5, 3, 1, 1]))
Even though it has already been answered, I suggest a different approach that makes use of numpy.histogram. Given a sequence, this function returns the frequency of its elements grouped in bins.
Beware though: it works in this example because the numbers are integers. If they were real numbers, this solution would not apply as nicely (see the sketch after the example).
>>> from numpy import histogram
>>> y = histogram(x, bins=x.max()-1)
>>> y
(array([5, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1]),
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.,
12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22.,
23., 24., 25.]))
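To illustrate the caveat above: with real numbers, np.unique still counts exact duplicates, whereas histogram binning would lump nearby but distinct values into the same bin. A sketch:

import numpy as np

x = np.array([0.5, 0.5, 1.25, 0.5, 2.0])
values, counts = np.unique(x, return_counts=True)
print(values)  # e.g. [0.5  1.25 2.  ]
print(counts)  # e.g. [3 1 1]
# histogram(x, bins=...) would instead merge distinct floats falling in one bin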
Old question, but I'd like to provide my own solution, which turns out to be the fastest: use a normal list instead of np.array as input (or convert to a list first), based on my benchmark.
Check it out if you encounter the same need.
def count(a):
    results = {}
    for x in a:
        if x not in results:
            results[x] = 1
        else:
            results[x] += 1
    return results
For example:

>>> %timeit count([1,1,1,2,2,2,5,25,1,1])
100000 loops, best of 3: 2.26 µs per loop

>>> %timeit count(np.array([1,1,1,2,2,2,5,25,1,1]))
100000 loops, best of 3: 8.8 µs per loop

>>> %timeit count(np.array([1,1,1,2,2,2,5,25,1,1]).tolist())
100000 loops, best of 3: 5.85 µs per loop
Meanwhile, the accepted answer would be slower, and the scipy.stats.itemfreq solution is even worse.
More in-depth testing did not confirm the formulated expectation.
from zmq import Stopwatch
aZmqSTOPWATCH = Stopwatch()

aDataSETasARRAY = (100 * abs(np.random.randn(150000))).astype(int)  # np.int was removed in NumPy 1.24
aDataSETasLIST  = aDataSETasARRAY.tolist()

import numba
@numba.jit
def numba_bincount(anObject):
    np.bincount(anObject)
    return

aZmqSTOPWATCH.start(); np.bincount(aDataSETasARRAY); aZmqSTOPWATCH.stop()
14328L

aZmqSTOPWATCH.start(); numba_bincount(aDataSETasARRAY); aZmqSTOPWATCH.stop()
592L

aZmqSTOPWATCH.start(); count(aDataSETasLIST); aZmqSTOPWATCH.stop()
148609L
Ref. comments below on cache and other in-RAM side-effects that influence a small dataset massively repetitive testing results.
numpy is not necessarily the way to go.
import pandas as pd
import numpy as np
x = np.array( [1,1,1,2,2,2,5,25,1,1] )
print(dict(pd.Series(x).value_counts()))
This gives you: {1: 5, 2: 3, 5: 1, 25: 1}
collections.Counter(x) also gives the same result. I believe the OP wants an output that resembles R's table function. Keeping the Series may be more useful. Use pd.Series(x.reshape(-1)) if it is a multidimensional array; a sketch follows.
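A sketch of that multidimensional case (flatten first, since pd.Series needs 1-D input):

import numpy as np
import pandas as pd

x2d = np.array([[1, 1, 2],
                [2, 5, 25]])
# reshape(-1) flattens the array so it can be wrapped in a Series
print(pd.Series(x2d.reshape(-1)).value_counts())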
To count unique non-integers, similar to Eelco Hoogendoorn's answer but considerably faster (a factor of 5 on my machine), I used weave.inline to combine numpy.unique with a bit of C code:
import numpy as np
from scipy import weave

def count_unique(datain):
    """
    Similar to numpy.unique function for returning unique members of
    data, but also returns their counts
    """
    data = np.sort(datain)
    uniq = np.unique(data)
    nums = np.zeros(uniq.shape, dtype='int')

    code = """
    int i, count, j;
    j = 0;
    count = 0;
    for(i = 1; i < Ndata[0]; i++){
        count++;
        if(data(i) > data(i-1)){
            nums(j) = count;
            count = 0;
            j++;
        }
    }
    // Handle last value
    nums(j) = count + 1;
    """
    weave.inline(code,
                 ['data', 'nums'],
                 extra_compile_args=['-O2'],
                 type_converters=weave.converters.blitz)
    return uniq, nums
Profile info
> %timeit count_unique(data)
> 10000 loops, best of 3: 55.1 µs per loop
Eelco's pure numpy
version:
> %timeit unique_count(data)
> 1000 loops, best of 3: 284 µs per loop
Note: there's some redundancy here (np.unique also performs a sort), meaning that the code could probably be further optimized by putting the unique functionality inside the C code loop.
Multi-dimensional frequency count, i.e. counting arrays (rows):
>>> print(color_array)
array([[255, 128, 128],
[255, 128, 128],
[255, 128, 128],
...,
[255, 128, 128],
[255, 128, 128],
[255, 128, 128]], dtype=uint8)
>>> np.unique(color_array, return_counts=True, axis=0)
(array([[ 60, 151, 161],
        [ 60, 155, 162],
        [ 60, 159, 163],
        [ 61, 143, 162],
        [ 61, 147, 162],
        [ 61, 162, 163],
        [ 62, 166, 164],
        [ 63, 137, 162],
        [ 63, 169, 164],
        ...], dtype=uint8),
 array([  1,   2,   2,   1,   4,   1,   1,   2,
          3,   1,   1,   1,   2,   5,   2,   2,
        898,   1,   1, ...]))
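The output above is truncated. To pair each unique row with its count, the two return values zip together; a sketch, reusing color_array from above:

rows, counts = np.unique(color_array, axis=0, return_counts=True)
for row, c in zip(rows, counts):
    print(row, c)  # each unique RGB triple followed by its frequency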
import pandas as pd
import numpy as np
print(pd.Series(name_of_array).value_counts())
from collections import Counter
import numpy as np

x = np.array([1,1,1,2,2,2,5,25,1,1])
counter = Counter(x)  # frequency count of each value
mode = counter.most_common(1)[0][0]  # the most frequent value
Many simple problems get complicated because simple functionality like R's order(), which gives a statistical result in both ascending and descending order, is missing in various Python libraries. But if we remember that all such statistical ordering and parameters are easily found in pandas, we can get results sooner than by looking in 100 different places. Also, the development of R and pandas go hand-in-hand because they were created for the same purpose. To solve this problem, I use the following code that gets me by anywhere:
unique, counts = np.unique(x, return_counts=True)
d = {'unique': unique, 'counts': counts}  # pass the arrays to a dictionary
df = pd.DataFrame(d)  # a dictionary can be easily passed to make a DataFrame
df.sort_values(by='counts', ascending=False, inplace=True)
df = df.reset_index(drop=True)  # optional, only if you want to use it further
Something like this should do it:

import numpy
# create 100 random numbers
arr = numpy.random.random_integers(0, 50, 100)
# create a dictionary of the unique values
d = dict([(i, 0) for i in numpy.unique(arr)])
for number in arr:
    d[number] += 1  # increment when that value is found
Also, this previous post on Efficiently counting unique elements seems pretty similar to your question, unless I'm missing something.
Prior to NumPy 1.9, the return_counts keyword argument didn't exist, which might explain the exception. In that case, the docs suggest that np.unique(x, True) is equivalent to np.unique(x, return_index=True), which doesn't return counts. Instead, use unique, idx = np.unique(x, return_inverse=True); counts = np.bincount(idx). When this feature was added (see here), some informal testing had the use of return_counts clocking over 5x faster.
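Based on that workaround, a sketch of a version-tolerant helper (falls back to return_inverse plus bincount when return_counts is unavailable, i.e. NumPy older than 1.9):

import numpy as np

def value_counts(x):
    try:
        # NumPy 1.9+: counts come straight from np.unique
        unique, counts = np.unique(x, return_counts=True)
    except TypeError:
        # Older NumPy: derive counts from the inverse indices
        unique, idx = np.unique(x, return_inverse=True)
        counts = np.bincount(idx)
    return unique, counts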