I want to create a new column in a pandas data frame by applying a function to two existing columns. Following this answer, I've been able to create a new column when I only need one column as an argument:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
def fx(x):
    return x * x
print(df)
df['newcolumn'] = df.A.apply(fx)
print(df)
However, I cannot figure out how to do the same thing when the function requires multiple arguments. For example, how do I create a new column by passing column A and column B to the function below?
def fxy(x, y):
    return x * y
You can go with @greenAfrican's example if you are able to rewrite your function. But if you don't want to rewrite it, you can wrap it in an anonymous function inside apply, like this:
>>> def fxy(x, y):
...     return x * y
>>> df['newcolumn'] = df.apply(lambda x: fxy(x['A'], x['B']), axis=1)
>>> df
    A   B  newcolumn
0  10  20        200
1  20  30        600
2  30  10        300
Alternatively, you can use the underlying NumPy function:
>>> import numpy as np
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
>>> df['new_column'] = np.multiply(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  10         300
Or, in the general case, vectorize an arbitrary function:
>>> def fx(x, y):
...     return x * y
...
>>> df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  10         300
np.vectorize() is amazingly fast. Thank you.
np.vectorize does not work. The reason is that one of the columns is of type pandas._libs.tslibs.timestamps.Timestamp, which the vectorization turns into numpy.datetime64. The two types are not interchangeable, so the function misbehaves. Any suggestions on this? (Other than .apply, as this is apparently to be avoided.)
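One option that avoids both np.vectorize and .apply is a plain list comprehension over zip of the columns. Iterating over a datetime column yields pandas Timestamp objects, so a function that relies on Timestamp methods keeps working. A minimal sketch, assuming a hypothetical datetime column named when, an integer column n, and a hypothetical function label (none of these come from the original comment):

```python
import pandas as pd

# Hypothetical data: "when" holds datetimes, "n" holds integers.
df = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "n": [1, 2],
})

def label(ts, n):
    # day_name() is a pandas Timestamp method; numpy.datetime64 does not have it.
    return f"{ts.day_name()}-{n}"

# zip walks both columns in step; each ts is a pandas Timestamp.
df["label"] = [label(ts, n) for ts, n in zip(df["when"], df["n"])]
print(df)
```

This still calls the function once per row, so it is not vectorized in the NumPy sense; it only keeps the element types intact.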
This solves the problem:
df['newcolumn'] = df.A * df.B
You could also do:
def fab(row):
    return row['A'] * row['B']
df['newcolumn'] = df.apply(fab, axis=1)
If you need to create multiple columns at once:
Create the dataframe:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
Create the function:
def fab(row):
    return row['A'] * row['B'], row['A'] + row['B']
Assign the new columns:
df['newcolumn'], df['newcolumn2'] = zip(*df.apply(fab, axis=1))
What does zip do here? Thanks!
zip iterates over several iterables (e.g. lists, iterators) simultaneously. *df.apply will yield N (N=len(df)) iterables, each with 2 elements; zip iterates over those N rows simultaneously, so it instead yields 2 iterables of N elements each. You can test this: zip(['a','b'],['c','d'],['e','f']) will yield [('a', 'c', 'e'), ('b', 'd', 'f')] (basically, the transpose). Note that I am intentionally using the word yield, as opposed to return, because we are talking about iterators. To see the result, convert the zip object into a list: list(zip(['a','b'],['c','d'],['e','f'])).
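To make the transpose concrete, here is a small sketch in plain Python. The rows list is a hypothetical stand-in for what df.apply(fab, axis=1) produces on the sample frame, namely one (A*B, A+B) tuple per row:

```python
# Hypothetical stand-in for the per-row tuples returned by fab.
rows = [(200, 30), (600, 50), (300, 40)]

# zip(*rows) is the same as zip(rows[0], rows[1], rows[2]):
# it pairs up the first elements, then the second elements.
products, sums = zip(*rows)

print(products)  # (200, 600, 300)
print(sums)      # (30, 50, 40)
```

Each of the two resulting tuples has one entry per row, which is why they can be assigned directly to two new columns.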
One more clean, dict-style syntax:
df["new_column"] = df.apply(lambda x: x["A"] * x["B"], axis = 1)
or,
df["new_column"] = df["A"] * df["B"]
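Both forms should produce the same values; the second operates on whole columns at once instead of calling a function per row. A quick check, as a sketch on the sample frame from the question (the names row_wise and column_wise are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

# Row by row: the lambda receives each row as a Series.
row_wise = df.apply(lambda x: x["A"] * x["B"], axis=1)

# Column arithmetic: pandas multiplies the two columns elementwise.
column_wise = df["A"] * df["B"]

# Compare the two results element by element.
print(row_wise.tolist() == column_wise.tolist())
```

For simple arithmetic like this, the column form is usually the one to prefer.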
This will dynamically give you the desired result. It works even if you have more than two arguments:
df['anothercolumn'] = df[['A', 'B']].apply(lambda x: fxy(*x), axis=1)
print(df)
    A   B  newcolumn  anothercolumn
0  10  20        100            200
1  20  30        400            600
2  30  10        900            300
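As a sketch of the more-than-two-arguments case, the same unpacking works with three columns. The column C and the function fxyz below are hypothetical, added only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10], "C": [1, 2, 3]})

def fxyz(x, y, z):
    # Hypothetical three-argument function.
    return x * y + z

# *row unpacks the selected values in column order: A, B, C.
df["result"] = df[["A", "B", "C"]].apply(lambda row: fxyz(*row), axis=1)
print(df)
```

The order of the column list determines which value lands in which parameter, so it needs to match the function signature.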