I have a pandas DataFrame, df_test
. It contains a column 'size' which represents size in bytes. I've calculated KB, MB, and GB using the following code:
df_test = pd.DataFrame([
{'dir': '/Users/uname1', 'size': 994933},
{'dir': '/Users/uname2', 'size': 109338711},
])
df_test['size_kb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0, grouping=True) + ' KB')
df_test['size_mb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 2, grouping=True) + ' MB')
df_test['size_gb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 3, grouping=True) + ' GB')
df_test
dir size size_kb size_mb size_gb
0 /Users/uname1 994933 971.6 KB 0.9 MB 0.0 GB
1 /Users/uname2 109338711 106,776.1 KB 104.3 MB 0.1 GB
[2 rows x 5 columns]
I've run this over 120,000 rows and time it takes about 2.97 seconds per column * 3 = ~9 seconds according to %timeit.
Is there anyway I can make this faster? For example, can I instead of returning one column at a time from apply and running it 3 times, can I return all three columns in one pass to insert back into the original dataframe?
The other questions I've found all want to take multiple values and return a single value. I want to take a single value and return multiple columns.
You can return a Series from the applied function that contains the new data, preventing the need to iterate three times. Passing axis=1
to the apply function applies the function sizes
to each row of the dataframe, returning a series to add to a new dataframe. This series, s, contains the new values, as well as the original data.
def sizes(s):
s['size_kb'] = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
s['size_mb'] = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
s['size_gb'] = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return s
df_test = df_test.append(rows_list)
df_test = df_test.apply(sizes, axis=1)
Use apply and zip will 3 times fast than Series way.
def sizes(s):
return locale.format("%.1f", s / 1024.0, grouping=True) + ' KB', \
locale.format("%.1f", s / 1024.0 ** 2, grouping=True) + ' MB', \
locale.format("%.1f", s / 1024.0 ** 3, grouping=True) + ' GB'
df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes))
Test result are:
Separate df.apply():
100 loops, best of 3: 1.43 ms per loop
Return Series:
100 loops, best of 3: 2.61 ms per loop
Return tuple:
1000 loops, best of 3: 819 µs per loop
apply
on the entire frame instead of specific columns
zip
approach does not retain the correct index. result_type=expand
however will.
ValueError: Columns must be same length as key
Some of the current replies work fine, but I want to offer another, maybe more "pandifyed" option. This works for me with the current pandas 0.23 (not sure if it will work in previous versions):
import pandas as pd
df_test = pd.DataFrame([
{'dir': '/Users/uname1', 'size': 994933},
{'dir': '/Users/uname2', 'size': 109338711},
])
def sizes(s):
a = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
b = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
c = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return a, b, c
df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1, result_type="expand")
Notice that the trick is on the result_type
parameter of apply
, that will expand its result into a DataFrame
that can be directly assign to new/old columns.
.apply()
on DataFrames, not on Series. Also, with pandas 1.1.5 this doesn't work at all.
apply()
doesn’t have an axis
parameter), but it works for me with 1.1.5.
Just another readable way. This code will add three new columns and its values, returning series without use parameters in the apply function.
def sizes(s):
val_kb = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
val_mb = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
val_gb = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return pd.Series([val_kb,val_mb,val_gb],index=['size_kb','size_mb','size_gb'])
df[['size_kb','size_mb','size_gb']] = df.apply(lambda x: sizes(x) , axis=1)
A general example from: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
#foo bar
#0 1 2
#1 1 2
#2 1 2
df.apply(x, axis=1)
. Also, it's basically the same solution as that of Jesse.
The performance between the top answers is significantly varied, and Jesse & famaral42 have already discussed this, but it is worth sharing a fair comparison between the top answers, and elaborating on a subtle but important detail of Jesse's answer: the argument passed in to the function, also affects performance.
(Python 3.7.4, Pandas 1.0.3)
import pandas as pd
import locale
import timeit
def create_new_df_test():
df_test = pd.DataFrame([
{'dir': '/Users/uname1', 'size': 994933},
{'dir': '/Users/uname2', 'size': 109338711},
])
return df_test
def sizes_pass_series_return_series(series):
series['size_kb'] = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
series['size_mb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
series['size_gb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return series
def sizes_pass_series_return_tuple(series):
a = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
b = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
c = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return a, b, c
def sizes_pass_value_return_tuple(value):
a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
return a, b, c
Here are the results:
# 1 - Accepted (Nels11 Answer) - (pass series, return series):
9.82 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2 - Pandafied (jaumebonet Answer) - (pass series, return tuple):
2.34 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3 - Tuples (pass series, return tuple then zip):
1.36 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 4 - Tuples (Jesse Answer) - (pass value, return tuple then zip):
752 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Notice how returning tuples is the fastest method, but what is passed in as an argument, also affects the performance. The difference in the code is subtle but the performance improvement is significant.
Test #4 (passing in a single value) is twice as fast as test #3 (passing in a series), even though the operation performed is ostensibly identical.
But there's more...
# 1a - Accepted (Nels11 Answer) - (pass series, return series, new columns exist):
3.23 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2a - Pandafied (jaumebonet Answer) - (pass series, return tuple, new columns exist):
2.31 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3a - Tuples (pass series, return tuple then zip, new columns exist):
1.36 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 4a - Tuples (Jesse Answer) - (pass value, return tuple then zip, new columns exist):
694 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In some cases (#1a and #4a), applying the function to a DataFrame in which the output columns already exist is faster than creating them from the function.
Here is the code for running the tests:
# Paste and run the following in ipython console. It will not work if you run it from a .py file.
print('\nAccepted Answer (pass series, return series, new columns dont exist):')
df_test = create_new_df_test()
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('Accepted Answer (pass series, return series, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('\nPandafied (pass series, return tuple, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('Pandafied (pass series, return tuple, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('\nTuples (pass series, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('Tuples (pass series, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
print('Tuples (pass value, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
Really cool answers! Thanks Jesse and jaumebonet! Just some observation in regards to:
zip(* ...
... result_type="expand")
Although expand is kind of more elegant (pandifyed), **zip is at least 2x faster. On this simple example below, I got 4x faster.
import pandas as pd
dat = [ [i, 10*i] for i in range(1000)]
df = pd.DataFrame(dat, columns = ["a","b"])
def add_and_sub(row):
add = row["a"] + row["b"]
sub = row["a"] - row["b"]
return add, sub
df[["add", "sub"]] = df.apply(add_and_sub, axis=1, result_type="expand")
# versus
df["add"], df["sub"] = zip(*df.apply(add_and_sub, axis=1))
A fairly fast way to do this with apply and lambda. Just return the multiple values as a list and then use to_list()
import pandas as pd
dat = [ [i, 10*i] for i in range(100000)]
df = pd.DataFrame(dat, columns = ["a","b"])
def add_and_div(x):
add = x + 3
div = x / 3
return [add, div]
start = time.time()
df[['c','d']] = df['a'].apply(lambda x: add_and_div(x)).to_list()
end = time.time()
print(end-start) # output: 0.27606
Simply and easy:
def func(item_df):
return [1,'Label 1'] if item_df['col_0'] > 0 else [0,'Label 0']
my_df[['col_1','col2']] = my_df.apply(func, axis=1,result_type='expand')
I believe the 1.1 version breaks the behavior suggested in the top answer here.
import pandas as pd
def test_func(row):
row['c'] = str(row['a']) + str(row['b'])
row['d'] = row['a'] + 1
return row
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
df.apply(test_func, axis=1)
The above code ran on pandas 1.1.0 returns:
a b c d
0 1 i 1i 2
1 1 i 1i 2
2 1 i 1i 2
While in pandas 1.0.5 it returned:
a b c d
0 1 i 1i 2
1 2 j 2j 3
2 3 k 3k 4
Which I think is what you'd expect.
Not sure how the release notes explain this behavior, however as explained here avoiding mutation of the original rows by copying them resurrects the old behavior. i.e.:
def test_func(row):
row = row.copy() # <---- Avoid mutating the original reference
row['c'] = str(row['a']) + str(row['b'])
row['d'] = row['a'] + 1
return row
Generally, to return multiple values, this is what I do
def gimmeMultiple(group):
x1 = 1
x2 = 2
return array([[1, 2]])
def gimmeMultipleDf(group):
x1 = 1
x2 = 2
return pd.DataFrame(array([[1,2]]), columns=['x1', 'x2'])
df['size'].astype(int).apply(gimmeMultiple)
df['size'].astype(int).apply(gimmeMultipleDf)
Returning a dataframe definitively has its perks, but sometimes not required. You can look at what the apply()
returns and play a bit with the functions ;)
You can go 40+ times faster than the top answers here if you do your math in numpy instead. Adapting @Rocky K's top two answers. The main difference is running on an actual df of 120k rows. Numpy is way faster at math when you apply your functions array-wise (instead of applying a function value-wise). The best answer is by far the third one because it uses numpy for the math. Also notice that it only calculates 1024**2 and 1024**3 once each instead of once for each row, saving 240k calculations. Here are the timings on my machine:
Tuples (pass value, return tuple then zip, new columns dont exist):
Runtime: 10.935037851333618
Tuples (pass value, return tuple then zip, new columns exist):
Runtime: 11.120025157928467
Use numpy for math portions:
Runtime: 0.24799370765686035
Here is the script I used (adapted from Rocky K) to calculate these times:
import numpy as np
import pandas as pd
import locale
import time
size = np.random.random(120000) * 1000000000
data = pd.DataFrame({'Size': size})
def sizes_pass_value_return_tuple(value):
a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
return a, b, c
print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
df1 = data.copy()
start = time.time()
df1['size_kb'], df1['size_mb'], df1['size_gb'] = zip(*df1['Size'].apply(sizes_pass_value_return_tuple))
end = time.time()
print('Runtime:', end - start, '\n')
print('Tuples (pass value, return tuple then zip, new columns exist):')
df2 = data.copy()
start = time.time()
df2 = pd.concat([df2, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
df2['size_kb'], df2['size_mb'], df2['size_gb'] = zip(*df2['Size'].apply(sizes_pass_value_return_tuple))
end = time.time()
print('Runtime:', end - start, '\n')
print('Use numpy for math portions:')
df3 = data.copy()
start = time.time()
df3['size_kb'] = (df3.Size.values / 1024).round(1)
df3['size_kb'] = df3.size_kb.astype(str) + ' KB'
df3['size_mb'] = (df3.Size.values / 1024 ** 2).round(1)
df3['size_mb'] = df3.size_mb.astype(str) + ' MB'
df3['size_gb'] = (df3.Size.values / 1024 ** 3).round(1)
df3['size_gb'] = df3.size_gb.astype(str) + ' GB'
end = time.time()
print('Runtime:', end - start, '\n')
It gives a new dataframe with two columns from the original one.
import pandas as pd
df = ...
df_with_two_columns = df.apply(lambda row:pd.Series([row['column_1'], row['column_2']], index=['column_1', 'column_2']),axis = 1)
Success story sharing
rows_list
in this answer?pd.Series(data, index=...)
. Otherwise you get cryptic errors when you try to assign the result back into the parent dataframe.rows_list
formulation so that your answer will compile without any problems (see also @David Stansby comment). I proposed this as an edit to avoid you the hassle, but evidently moderators prefer comments over edits.