How can I use the apply() function for a single column?

python pandas dataframe

I have a pandas data frame with two columns. I need to change the values of the first column without affecting the second one and get back the whole data frame with just first column values changed. How can I do that using apply in pandas?

Please post some input sample data and desired output.

You should almost never use apply in a situation like this. Operate on the column directly instead.

As Ted Petrou said, avoid using apply as much as possible. If you're not sure you need to use it, you probably don't. I recommend taking a look at When should I ever want to use pandas apply() in my code?.

The question is not completely clear: do you want to apply a function to every element of a column, or apply a function to the column as a whole (for example, reverse the column)?

Mateen Ulhaq

Given a sample dataframe df as:

what you want is:

df['a'] = df['a'].apply(lambda x: x + 1)

that returns:

apply should never be used in a situation like this

@TedPetrou you're perfectly right, it was just an example on how to apply a general function on one single column, as the OP asked.

When I try doing this I get the following warning: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"

As a matter of curiosity: why should apply not be used in that situation? What is the situation exactly?

@UncleBenBen in general apply uses an internal loop over rows that is far slower than vectorized functions, like e.g. df.a = df.a / 2 (see Mike Muller answer).
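Regarding the SettingWithCopyWarning mentioned in the comments above: it typically appears when the frame being modified is itself a slice of another frame. The following is only a minimal sketch with made-up sample data (the columns a and b and the filter on b are assumptions, not the commenter's actual code). It shows two common ways to make the intent explicit.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Option 1: work on an explicit copy of the slice, so pandas knows the
# original frame is not supposed to change.
subset = df[df['b'] > 10].copy()
subset['a'] = subset['a'].apply(lambda x: x + 1)

# Option 2: modify the original frame through .loc, as the warning suggests.
mask = df['b'] > 10
df.loc[mask, 'a'] = df.loc[mask, 'a'].apply(lambda x: x + 1)
```

With the explicit .copy() the assignment targets an independent frame, and with .loc the assignment targets the original frame directly.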

Fabio Lamanna

For a single column it is better to use map(), like this:

df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])

    a   b  c
0  15  15  5
1  20  10  7
2  25  30  9



df['a'] = df['a'].map(lambda a: a / 2.)

      a   b  c
0   7.5  15  5
1  10.0  10  7
2  12.5  30  9

Why is map() better than apply() for a single column?

This was very useful. I used it to extract file names from paths stored in a column: df['file_name'] = df['Path'].map(lambda a: os.path.basename(a))

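Series.map() is for a Series (i.e. a single column) and operates on one cell at a time, while DataFrame.apply() is for a DataFrame and operates on a whole row (axis=1) or a whole column (axis=0) at a time. Note that Series.apply() also exists and, like map(), works element by element.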

@jpcgt Does that mean that map is faster than apply in this case?

I get a "SettingWithCopyWarning" when I use this code.
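To make the map() versus apply() discussion in the comments above concrete, here is a small sketch. It reuses the sample values from this answer; the helper name halve is an assumption introduced for illustration. For a plain function on one column, Series.map(), Series.apply() and a row-wise DataFrame.apply() give the same values; they differ in what is passed to the function.

```python
import pandas as pd

df = pd.DataFrame({'a': [15, 20, 25], 'b': [15, 10, 30], 'c': [5, 7, 9]})

def halve(value):
    # Hypothetical helper: receives one scalar at a time.
    return value / 2.0

# Series.map and Series.apply both call halve once per element.
via_map = df['a'].map(halve)
via_apply = df['a'].apply(halve)

# DataFrame.apply with axis=1 passes a whole row (a Series) to the function,
# so the column has to be selected inside the lambda.
row_wise = df.apply(lambda row: halve(row['a']), axis=1)
```

Which variant is faster depends on the pandas version and the data, so timing them on your own data is the safest way to decide.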

above_c_level

Given the following dataframe df and the function complex_function,

  import pandas as pd

  def complex_function(x, y=0):
      if x > 5 and x > y:
          return 1
      else:
          return 2

  df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

     col1  col2
  0     1     6
  1     4     7
  2     6     1
  3     2     2
  4     7     8

there are several solutions to use apply() on only one column. In the following I will explain them in detail.

I. Simple solution

The straightforward solution is the one from @Fabio Lamanna:

  df['col1'] = df['col1'].apply(complex_function)

Output:

     col1  col2
  0     2     6
  1     2     7
  2     1     1
  3     2     2
  4     1     8

Only the first column is modified; the second column is unchanged. The solution is beautiful. It is just one line of code and it reads almost like English: "Take 'col1' and apply the function complex_function to it."

However, if you need data from another column, e.g. 'col2', this does not work. If you want to pass the values of 'col2' to the parameter y of complex_function, you need something else.

II. Solution using the whole dataframe

Alternatively, you could use the whole dataframe as described in this or this SO post:

  df['col1'] = df.apply(lambda x: complex_function(x['col1']), axis=1)

or if you prefer (like me) a solution without a lambda function:

  def apply_complex_function(x): return complex_function(x['col1'])
  df['col1'] = df.apply(apply_complex_function, axis=1)

There is a lot going on in this solution that needs to be explained. The apply() function works on pd.Series and pd.DataFrame. But you cannot use df['col1'] = df.apply(complex_function).loc[:, 'col1'], because it would throw a ValueError.

Hence, you need to specify which column to use. To complicate things, the apply() function only accepts callables. To solve this, you need to define a (lambda) function that takes the column x['col1'] as its argument; i.e. we wrap the column information in another function.

Unfortunately, the default value of the axis parameter is zero (axis=0), which means it will try executing column-wise and not row-wise. This wasn't a problem in the first solution, because we gave apply() a pd.Series. But now the input is a dataframe and we must be explicit (axis=1). (I marvel how often I forget this.)
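The effect of the axis parameter can be seen with a small sketch on the same sample data. The use of max() here is only an illustrative choice, not part of the solution above.

```python
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

# axis=0 (the default): the function is called once per column and
# receives that column as a pd.Series.
col_max = df.apply(lambda column: column.max())

# axis=1: the function is called once per row and receives that row
# as a pd.Series indexed by the column names.
row_max = df.apply(lambda row: row.max(), axis=1)
```

Here col_max has one entry per column, while row_max has one entry per row, which is why row-wise access such as x['col1'] only works with axis=1.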

Whether you prefer the version with the lambda function or without is subjective. In my opinion the line of code is complicated enough to read even without a lambda function thrown in. You only need the (lambda) function as a wrapper. It is just boilerplate code. A reader should not be bothered with it.

Now, you can modify this solution easily to take the second column into account:

  def apply_complex_function(x): return complex_function(x['col1'], x['col2'])
  df['col1'] = df.apply(apply_complex_function, axis=1)

Output:

     col1  col2
  0     2     6
  1     2     7
  2     1     1
  3     2     2
  4     2     8

At index 4 the value has changed from 1 to 2, because the first condition 7 > 5 is true but the second condition 7 > 8 is false.

Note that you only needed to change the first line of code (i.e. the function) and not the second line.

Side note

Never put the column information into your function.

  def bad_idea(x):
      return x['col1'] ** 2

By doing this, you make a general function dependent on a column name! This is a bad idea, because the next time you want to use this function, you cannot. Worse: Maybe you rename a column in a different dataframe just to make it work with your existing function. (Been there, done that. It is a slippery slope!)
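As a contrast to bad_idea, a sketch of the alternative could look like this. The function name square, the second dataframe other and its column length are assumptions made up for the illustration. The general function only knows about values, and the column is chosen at the call site.

```python
import pandas as pd

def square(value):
    # General function: no knowledge of any dataframe or column name.
    return value ** 2

df = pd.DataFrame({'col1': [1, 4, 6], 'col2': [6, 7, 1]})
other = pd.DataFrame({'length': [2, 3]})  # hypothetical second dataframe

# The column name appears only where the function is applied,
# so the same function works for both dataframes without renaming anything.
df['col1'] = df['col1'].apply(square)
other['length'] = other['length'].apply(square)
```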

III. Alternative solutions without using apply()

Although the OP specifically asked for a solution with apply(), alternative solutions were suggested. For example, the answer of @George Petrov suggested using map(), and the answer of @Thibaut Dubernet proposed assign().

I fully agree that apply() is seldom the best solution, because apply() is not vectorized. It is an element-wise operation with expensive function calling and overhead from pd.Series.

One reason to use apply() is that you want to use an existing function and performance is not an issue. Or your function is so complex that no vectorized version exists.

Another reason to use apply() is in combination with groupby(). Please note that DataFrame.apply() and GroupBy.apply() are different functions.

So it does make sense to consider some alternatives:

map() only works on pd.Series, but accepts dict and pd.Series as input. Using map() with a function is almost interchangeable with using apply(). It can be faster than apply(). See this SO post for more details.

  df['col1'] = df['col1'].map(complex_function)
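The dict input mentioned above is something apply() does not offer. A minimal sketch follows; the labels dict is invented purely for illustration.

```python
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

# Hypothetical lookup table: keys are values of col1, values are the replacements.
labels = {1: 'low', 2: 'low', 4: 'mid', 6: 'high'}

# Each element of col1 is looked up in the dict.
# Values without a matching key (here: 7) become NaN.
mapped = df['col1'].map(labels)
```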

applymap() does almost the same for dataframes: it applies a function to every element. It does not support pd.Series and it will always return a dataframe. However, it can be faster. The documentation states: "In the current implementation applymap calls func twice on the first column/row to decide whether it can take a fast or slow code path." But if performance really counts you should seek an alternative route. (Since pandas 2.1, DataFrame.applymap() is deprecated in favour of DataFrame.map().)

  df['col1'] = df.applymap(complex_function).loc[:, 'col1']

assign() is not a feasible replacement for apply(). It has a similar behaviour in only the most basic use cases. It does not work with the complex_function. You still need apply() as you can see in the example below. The main use case for assign() is method chaining, because it gives back the dataframe without changing the original dataframe.

  df = df.assign(col1=df['col1'].apply(complex_function))
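The method-chaining use case could be sketched as follows. The query() step is just an assumed example of a follow-up operation in the chain and is not part of the original answer.

```python
import pandas as pd

def complex_function(x, y=0):
    if x > 5 and x > y:
        return 1
    else:
        return 2

df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

# assign() returns a new dataframe, so further steps can be chained
# while the original df stays untouched.
result = (
    df
    .assign(col1=lambda d: d['col1'].apply(complex_function))
    .query('col1 == 1')  # illustrative follow-up step
)
```

Note that apply() is still doing the actual work inside assign(); assign() only provides the non-mutating, chainable wrapper.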

Annex: How to speed up apply?

I only mention it here because it was suggested by other answers, e.g. @durjoy. The list is not exhaustive:

1. Do not use apply(). This is no joke. For most numeric operations, a vectorized method exists in pandas. If/else blocks can often be refactored with a combination of boolean indexing and .loc. My example complex_function could be refactored in this way.

2. Refactor to Cython. If you have a complex equation and the parameters of the equation are in your dataframe, this might be a good idea. Check out the official pandas user guide for more information.

3. Use raw=True parameter. Theoretically, this should improve the performance of apply() if you are just applying a NumPy reduction function, because the overhead of pd.Series is removed. Of course, your function has to accept an ndarray. You have to refactor your function to NumPy. By doing this, you will have a huge performance boost.

4. Use 3rd party packages. The first thing you should try is Numba. I do not know swifter mentioned by @durjoy; and probably many other packages are worth mentioning here.

5. Try/Fail/Repeat. As mentioned above, map() and applymap() can be faster - depending on the use case. Just time the different versions and choose the fastest. This approach is the most tedious one with the least performance increase.
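The refactoring of complex_function mentioned in the first point could look roughly like this. It is a sketch of one possible vectorized version (using boolean indexing with .loc, and alternatively numpy.where), not the only way to do it.

```python
import numpy as np
import pandas as pd

def complex_function(x, y=0):
    if x > 5 and x > y:
        return 1
    else:
        return 2

df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

# Reference result using the row-wise apply() from solution II.
expected = df.apply(lambda row: complex_function(row['col1'], row['col2']), axis=1)

# The if-condition expressed on whole columns at once.
condition = (df['col1'] > 5) & (df['col1'] > df['col2'])

# Variant A: boolean indexing and .loc (start with the else-value, overwrite matches).
via_loc = pd.Series(2, index=df.index)
via_loc.loc[condition] = 1

# Variant B: numpy.where picks 1 where the condition holds and 2 elsewhere.
via_where = pd.Series(np.where(condition, 1, 2), index=df.index)
```

Both variants avoid calling a Python function once per row, which is where most of the apply() overhead comes from.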

What if I have more complex slices than just col1? How do I avoid duplicating the slice expression? Say, for instance: df.loc[:, ~df.columns.isin(skip_cols)]. Writing this twice, on both sides of the assignment, seems unidiomatic.
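One possible way to address the comment above is to compute the column selection once and reuse it on both sides of the assignment. This is only a sketch; the sample data, the contents of skip_cols and the multiplication by 10 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data and skip list.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
skip_cols = ['c']

# Compute the selection once and store it under a name ...
cols = df.columns[~df.columns.isin(skip_cols)]

# ... then reuse it on both sides of the assignment.
df[cols] = df[cols].apply(lambda column: column * 10)
```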

Chrisji

You don't need a function at all. You can work on a whole column directly.

Example data:

>>> df = pd.DataFrame({'a': [100, 1000], 'b': [200, 2000], 'c': [300, 3000]})
>>> df

      a     b     c
0   100   200   300
1  1000  2000  3000

Halve all the values in column a:

>>> df.a = df.a / 2
>>> df

       a     b     c
0   50.0   200   300
1  500.0  2000  3000

What if I want to split every element in a column by "/" and take the first part?

@KamranHosseini use df['newcolumn'] = df['a'].str.split('/')[0]

@Arun df['a'].str.split('/') produces a Series object, right? So wouldn't df['a'].str.split('/')[0] produce a single element from that Series? I don't think you can assign that to a entire column like that.

@TheUnknownDev it's specific to Kamran's comment above, not to the OP's case. When the series consists of str values delimited by '/', we can use it to get the first part, e.g. '100/101' in a series will be split as 100. Tested and verified!
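To settle the question raised in the comments above, here is a small sketch with made-up strings. Indexing the result of str.split() with [0] selects the row labelled 0, whereas .str[0] takes the first part of every row, which is presumably what was wanted.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'a': ['100/101', '200/201', '300']})

# .str[0] indexes into each split list, giving the first part per row.
first_part = df['a'].str.split('/').str[0]

# Plain [0] is a label lookup on the Series: it returns the split list
# of the row with index label 0 only.
first_row_only = df['a'].str.split('/')[0]

df['newcolumn'] = first_part
```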

Thibaut Dubernet

Although the given responses are correct, they modify the initial data frame, which is not always desirable (and, given the OP asked for examples "using apply", it might be they wanted a version that returns a new data frame, as apply does).

This is possible using assign: it is valid to assign to existing columns, as the documentation states (emphasis is mine):

Assign new columns to a DataFrame. Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

In short:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])

In [3]: df.assign(a=lambda df: df.a / 2)
Out[3]: 
      a   b  c
0   7.5  15  5
1  10.0  10  7
2  12.5  30  9

In [4]: df
Out[4]: 
    a   b  c
0  15  15  5
1  20  10  7
2  25  30  9

Note that the function will be passed the whole dataframe, not only the column you want to modify, so you will need to make sure you select the right column in your lambda.

I'm trying to keep things immutable, thinking in Functional Programming. I'm very, very, glad of your answer! :-)

durjoy

If you are really concerned about the execution speed of your apply function and you have a huge dataset to work on, you could use swifter to speed up execution. Here is an example of swifter on a pandas dataframe:

import pandas as pd
import swifter

def fnc(m):
    return m*3+4

df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})

# apply a self created function to a single column in pandas
df["y"] = df.m.swifter.apply(fnc)

This lets swifter use all of your CPU cores to compute the result where it can, so it can be much faster than a plain apply. Try it and let me know if it is useful for you.

Great library and great example!

Hari_pb

Here is a more complex computation using datetime that also handles nulls and empty strings. I subtract 30 years (approximated as 30*365 days, so leap days are ignored) from a date column using apply with a lambda, and convert the date format. Nulls are first replaced by '' via fillna, and the condition if x != '' else x leaves those empty values untouched.

import datetime

df['Date'] = df['Date'].fillna('')
df['Date'] = df['Date'].apply(lambda x : ((datetime.datetime.strptime(str(x), '%m/%d/%Y') - datetime.timedelta(days=30*365)).strftime('%Y%m%d')) if x != '' else x)
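For comparison, the same computation can probably be written without apply(). This is a sketch under the assumption that the column holds strings in %m/%d/%Y format or nulls; the sample dates are made up. One difference to be aware of: errors='coerce' also turns malformed dates into empty strings instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data: two dates and one missing value.
df = pd.DataFrame({'Date': ['01/15/2020', None, '12/31/1999']})

# Parse the whole column at once; nulls and unparseable values become NaT.
parsed = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors='coerce')

# Subtract the same 30*365 days as in the apply() version above.
shifted = parsed - pd.Timedelta(days=30 * 365)

# Format back to strings; NaT becomes a missing value, which is replaced by ''.
df['Date'] = shifted.dt.strftime('%Y%m%d').fillna('')
```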
