How do I convert a pandas dataframe into a NumPy array?
DataFrame:
import numpy as np
import pandas as pd
index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)
df = df.rename_axis('ID')
gives
label A B C
ID
1 NaN 0.2 NaN
2 NaN NaN 0.5
3 NaN 0.2 0.5
4 0.1 0.2 NaN
5 0.1 0.2 0.5
6 0.1 NaN 0.5
7 0.1 NaN NaN
I would like to convert this to a NumPy array, like so:
array([[ nan, 0.2, nan],
[ nan, nan, 0.5],
[ nan, 0.2, 0.5],
[ 0.1, 0.2, nan],
[ 0.1, 0.2, 0.5],
[ 0.1, nan, 0.5],
[ 0.1, nan, nan]])
Also, is it possible to preserve the dtypes, like this?
array([[ 1, nan, 0.2, nan],
[ 2, nan, nan, 0.5],
[ 3, nan, 0.2, 0.5],
[ 4, 0.1, 0.2, nan],
[ 5, 0.1, 0.2, 0.5],
[ 6, 0.1, nan, 0.5],
[ 7, 0.1, nan, nan]],
dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('B', '<f8')])
df.to_numpy() is better than df.values, here's why.*
It's time to deprecate your usage of values
and as_matrix()
.
pandas v0.24.0
introduced two new methods for obtaining NumPy arrays from pandas objects:
to_numpy(), which is defined on Index, Series, and DataFrame objects, and array, which is defined on Index and Series objects only.
If you visit the v0.24 docs for .values
, you will see a big red warning that says:
Warning: We recommend using DataFrame.to_numpy() instead.
See this section of the v0.24.0 release notes, and this answer for more information.
* - to_numpy()
is my recommended method for any production code that needs to run reliably for many versions into the future. However if you're just making a scratchpad in jupyter or the terminal, using .values
to save a few milliseconds of typing is a permissable exception. You can always add the fit n finish later.
Towards Better Consistency: to_numpy()
In the spirit of better consistency throughout the API, a new method to_numpy
has been introduced to extract the underlying NumPy array from DataFrames.
# Setup
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
index=['a', 'b', 'c'])
# Convert the entire DataFrame
df.to_numpy()
# array([[1, 4, 7],
# [2, 5, 8],
# [3, 6, 9]])
# Convert specific columns
df[['A', 'C']].to_numpy()
# array([[1, 7],
# [2, 8],
# [3, 9]])
As mentioned above, this method is also defined on Index
and Series
objects (see here).
df.index.to_numpy()
# array(['a', 'b', 'c'], dtype=object)
df['A'].to_numpy()
# array([1, 2, 3])
By default, a view is returned, so any modifications made will affect the original.
v = df.to_numpy()
v[0, 0] = -1
df
A B C
a -1 4 7
b 2 5 8
c 3 6 9
If you need a copy instead, use to_numpy(copy=True)
.
pandas >= 1.0 update for ExtensionTypes
If you're using pandas 1.x, chances are you'll be dealing with extension types a lot more. You'll have to be a little more careful that these extension types are correctly converted.
a = pd.array([1, 2, None], dtype="Int64")
a
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
# Wrong
a.to_numpy()
# array([1, 2, <NA>], dtype=object) # yuck, objects
# Correct
a.to_numpy(dtype='float', na_value=np.nan)
# array([ 1., 2., nan])
# Also correct
a.to_numpy(dtype='int', na_value=-1)
# array([ 1, 2, -1])
This is called out in the docs.
If you need the dtypes in the result...
As shown in another answer, DataFrame.to_records
is a good way to do this.
df.to_records()
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
# dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
This cannot be done with to_numpy
, unfortunately. However, as an alternative, you can use np.rec.fromrecords
:
v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
# dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
Performance wise, it's nearly the same (actually, using rec.fromrecords
is a bit faster).
df2 = pd.concat([df] * 10000)
%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Rationale for Adding a New Method
to_numpy()
(in addition to array
) was added as a result of discussions under two GitHub issues GH19954 and GH23623.
Specifically, the docs mention the rationale:
[...] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. [...]
to_numpy
aims to improve the consistency of the API, which is a major step in the right direction. .values
will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.
Critique of Other Solutions
DataFrame.values
has inconsistent behaviour, as already noted.
DataFrame.get_values()
is simply a wrapper around DataFrame.values
, so everything said above applies.
DataFrame.as_matrix()
is deprecated now, do NOT use!
To convert a pandas dataframe (df) to a numpy ndarray, use this code:
df.values
array([[nan, 0.2, nan],
[nan, nan, 0.5],
[nan, 0.2, 0.5],
[0.1, 0.2, nan],
[0.1, 0.2, 0.5],
[0.1, nan, 0.5],
[0.1, nan, nan]])
Note: The .as_matrix()
method used in this answer is deprecated. Pandas 0.23.4 warns:
Method .as_matrix will be removed in a future version. Use .values instead.
Pandas has something built in...
numpy_matrix = df.as_matrix()
gives
array([[nan, 0.2, nan],
[nan, nan, 0.5],
[nan, 0.2, 0.5],
[0.1, 0.2, nan],
[0.1, 0.2, 0.5],
[0.1, nan, 0.5],
[0.1, nan, nan]])
object
.
to_numpy
instead (not .values
either). More here.
I would just chain the DataFrame.reset_index() and DataFrame.values functions to get the Numpy representation of the dataframe, including the index:
In [8]: df
Out[8]:
A B C
0 -0.982726 0.150726 0.691625
1 0.617297 -0.471879 0.505547
2 0.417123 -1.356803 -1.013499
3 -0.166363 -0.957758 1.178659
4 -0.164103 0.074516 -0.674325
5 -0.340169 -0.293698 1.231791
6 -1.062825 0.556273 1.508058
7 0.959610 0.247539 0.091333
[8 rows x 3 columns]
In [9]: df.reset_index().values
Out[9]:
array([[ 0. , -0.98272574, 0.150726 , 0.69162512],
[ 1. , 0.61729734, -0.47187926, 0.50554728],
[ 2. , 0.4171228 , -1.35680324, -1.01349922],
[ 3. , -0.16636303, -0.95775849, 1.17865945],
[ 4. , -0.16410334, 0.0745164 , -0.67432474],
[ 5. , -0.34016865, -0.29369841, 1.23179064],
[ 6. , -1.06282542, 0.55627285, 1.50805754],
[ 7. , 0.95961001, 0.24753911, 0.09133339]])
To get the dtypes we'd need to transform this ndarray into a structured array using view:
In [10]: df.reset_index().values.ravel().view(dtype=[('index', int), ('A', float), ('B', float), ('C', float)])
Out[10]:
array([( 0, -0.98272574, 0.150726 , 0.69162512),
( 1, 0.61729734, -0.47187926, 0.50554728),
( 2, 0.4171228 , -1.35680324, -1.01349922),
( 3, -0.16636303, -0.95775849, 1.17865945),
( 4, -0.16410334, 0.0745164 , -0.67432474),
( 5, -0.34016865, -0.29369841, 1.23179064),
( 6, -1.06282542, 0.55627285, 1.50805754),
( 7, 0.95961001, 0.24753911, 0.09133339),
dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
You can use the to_records
method, but have to play around a bit with the dtypes if they are not what you want from the get go. In my case, having copied your DF from a string, the index type is string (represented by an object
dtype in pandas):
In [102]: df
Out[102]:
label A B C
ID
1 NaN 0.2 NaN
2 NaN NaN 0.5
3 NaN 0.2 0.5
4 0.1 0.2 NaN
5 0.1 0.2 0.5
6 0.1 NaN 0.5
7 0.1 NaN NaN
In [103]: df.index.dtype
Out[103]: dtype('object')
In [104]: df.to_records()
Out[104]:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
(4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
(7, 0.1, nan, nan)],
dtype=[('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
In [106]: df.to_records().dtype
Out[106]: dtype([('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Converting the recarray dtype does not work for me, but one can do this in Pandas already:
In [109]: df.index = df.index.astype('i8')
In [111]: df.to_records().view([('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Out[111]:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
(4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
(7, 0.1, nan, nan)],
dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Note that Pandas does not set the name of the index properly (to ID
) in the exported record array (a bug?), so we profit from the type conversion to also correct for that.
At the moment Pandas has only 8-byte integers, i8
, and floats, f8
(see this issue).
np.array
constructor.
It seems like df.to_records()
will work for you. The exact feature you're looking for was requested and to_records
pointed to as an alternative.
I tried this out locally using your example, and that call yields something very similar to the output you were looking for:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
(4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
(7, 0.1, nan, nan)],
dtype=[(u'ID', '<i8'), (u'A', '<f8'), (u'B', '<f8'), (u'C', '<f8')])
Note that this is a recarray
rather than an array
. You could move the result in to regular numpy array by calling its constructor as np.array(df.to_records())
.
to_records()
over 5 years earlier?
Try this:
a = numpy.asarray(df)
Here is my approach to making a structure array from a pandas DataFrame.
Create the data frame
import pandas as pd
import numpy as np
import six
NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)
A B C
ID
1 NaN 0.2 NaN
2 NaN NaN 0.5
3 NaN 0.2 0.5
4 0.1 0.2 NaN
5 0.1 0.2 0.5
6 0.1 NaN 0.5
7 0.1 NaN NaN
Define function to make a numpy structure array (not a record array) from a pandas DataFrame.
def df_to_sarray(df):
"""
Convert a pandas DataFrame object to a numpy structured array.
This is functionally equivalent to but more efficient than
np.array(df.to_array())
:param df: the data frame to convert
:return: a numpy structured array representation of df
"""
v = df.values
cols = df.columns
if six.PY2: # python 2 needs .encode() but 3 does not
types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
else:
types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
dtype = np.dtype(types)
z = np.zeros(v.shape[0], dtype)
for (i, k) in enumerate(z.dtype.names):
z[k] = v[:, i]
return z
Use reset_index
to make a new data frame that includes the index as part of its data. Convert that data frame to a structure array.
sa = df_to_sarray(df.reset_index())
sa
array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
(4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
(7L, 0.1, nan, nan)],
dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
EDIT: Updated df_to_sarray to avoid error calling .encode() with python 3. Thanks to Joseph Garvin and halcyon for their comment and solution.
A Simpler Way for Example DataFrame:
df
gbm nnet reg
0 12.097439 12.047437 12.100953
1 12.109811 12.070209 12.095288
2 11.720734 11.622139 11.740523
3 11.824557 11.926414 11.926527
4 11.800868 11.727730 11.729737
5 12.490984 12.502440 12.530894
USE:
np.array(df.to_records().view(type=np.matrix))
GET:
array([[(0, 12.097439 , 12.047437, 12.10095324),
(1, 12.10981081, 12.070209, 12.09528824),
(2, 11.72073428, 11.622139, 11.74052253),
(3, 11.82455653, 11.926414, 11.92652727),
(4, 11.80086775, 11.72773 , 11.72973699),
(5, 12.49098389, 12.50244 , 12.53089367)]],
dtype=(numpy.record, [('index', '<i8'), ('gbm', '<f8'), ('nnet', '<f4'),
('reg', '<f8')]))
Two ways to convert the data-frame to its Numpy-array representation.
mah_np_array = df.as_matrix(columns=None)
mah_np_array = df.values
Doc: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.as_matrix.html
I went through the answers above. The "as_matrix()" method works but its obsolete now. For me, What worked was ".to_numpy()".
This returns a multidimensional array. I'll prefer using this method if you're reading data from excel sheet and you need to access data from any index. Hope this helps :)
Just had a similar problem when exporting from dataframe to arcgis table and stumbled on a solution from usgs (https://my.usgs.gov/confluence/display/cdi/pandas.DataFrame+to+ArcGIS+Table). In short your problem has a similar solution:
df
A B C
ID
1 NaN 0.2 NaN
2 NaN NaN 0.5
3 NaN 0.2 0.5
4 0.1 0.2 NaN
5 0.1 0.2 0.5
6 0.1 NaN 0.5
7 0.1 NaN NaN
np_data = np.array(np.rec.fromrecords(df.values))
np_names = df.dtypes.index.tolist()
np_data.dtype.names = tuple([name.encode('UTF8') for name in np_names])
np_data
array([( nan, 0.2, nan), ( nan, nan, 0.5), ( nan, 0.2, 0.5),
( 0.1, 0.2, nan), ( 0.1, 0.2, 0.5), ( 0.1, nan, 0.5),
( 0.1, nan, nan)],
dtype=(numpy.record, [('A', '<f8'), ('B', '<f8'), ('C', '<f8')]))
A simple way to convert dataframe to numpy array:
import pandas as pd
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df_to_array = df.to_numpy()
array([[1, 3],
[2, 4]])
Use of to_numpy is encouraged to preserve consistency.
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
Try this:
np.array(df)
array([['ID', nan, nan, nan],
['1', nan, 0.2, nan],
['2', nan, nan, 0.5],
['3', nan, 0.2, 0.5],
['4', 0.1, 0.2, nan],
['5', 0.1, 0.2, 0.5],
['6', 0.1, nan, 0.5],
['7', 0.1, nan, nan]], dtype=object)
Some more information at: [https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html] Valid for numpy 1.16.5 and pandas 0.25.2.
Further to meteore's answer, I found the code
df.index = df.index.astype('i8')
doesn't work for me. So I put my code here for the convenience of others stuck with this issue.
city_cluster_df = pd.read_csv(text_filepath, encoding='utf-8')
# the field 'city_en' is a string, when converted to Numpy array, it will be an object
city_cluster_arr = city_cluster_df[['city_en','lat','lon','cluster','cluster_filtered']].to_records()
descr=city_cluster_arr.dtype.descr
# change the field 'city_en' to string type (the index for 'city_en' here is 1 because before the field is the row index of dataframe)
descr[1]=(descr[1][0], "S20")
newArr=city_cluster_arr.astype(np.dtype(descr))
Success story sharing
as_matrix
to another solution, in this case,to_numpy
without explaining how to recover the column selecting functionality ofas_matrix
! I am sure there are other ways to select columns, butas_matrix
was at least one of them!df[[col1, col2']].to_numpy()
? Not sure why you think wanting to advertise an updated alternative to a deprecated function warrants a downvote on the answer.