I have the following DataFrame:
daysago line_race rating rw wrating
line_date
2007-03-31 62 11 56 1.000000 56.000000
2007-03-10 83 11 67 1.000000 67.000000
2007-02-10 111 9 66 1.000000 66.000000
2007-01-13 139 10 83 0.880678 73.096278
2006-12-23 160 10 88 0.793033 69.786942
2006-11-09 204 9 52 0.636655 33.106077
2006-10-22 222 8 66 0.581946 38.408408
2006-09-29 245 9 70 0.518825 36.317752
2006-09-16 258 11 68 0.486226 33.063381
2006-08-30 275 8 72 0.446667 32.160051
2006-02-11 475 5 65 0.164591 10.698423
2006-01-13 504 0 70 0.142409 9.968634
2006-01-02 515 0 64 0.134800 8.627219
2005-12-06 542 0 70 0.117803 8.246238
2005-11-29 549 0 70 0.113758 7.963072
2005-11-22 556 0 -1 0.109852 -0.109852
2005-11-01 577 0 -1 0.098919 -0.098919
2005-10-20 589 0 -1 0.093168 -0.093168
2005-09-27 612 0 -1 0.083063 -0.083063
2005-09-07 632 0 -1 0.075171 -0.075171
2005-06-12 719 0 69 0.048690 3.359623
2005-05-29 733 0 -1 0.045404 -0.045404
2005-05-02 760 0 -1 0.039679 -0.039679
2005-04-02 790 0 -1 0.034160 -0.034160
2005-03-13 810 0 -1 0.030915 -0.030915
2004-11-09 934 0 -1 0.016647 -0.016647
I need to remove the rows where line_race
is equal to 0
. What's the most efficient way to do this?
If I'm understanding correctly, it should be as simple as:
df = df[df.line_race != 0]
But for any future bypassers you could mention that df = df[df.line_race != 0]
doesn't do anything when trying to filter for None
/missing values.
Does work:
df = df[df.line_race != 0]
Doesn't do anything:
df = df[df.line_race != None]
Does work:
df = df[df.line_race.notnull()]
df = df[df.columns[2].notnull()]
, but one way or another you need to be able to index the column somehow.
df = df[df.line_race != 0]
drops the rows but also does not reset the index. So when you add another row in the df it may not add at the end. I'd recommend resetting the index after that operation (df = df.reset_index(drop=True)
)
==
operator to start. stackoverflow.com/questions/3257919/…
None
values you can use is
instead of ==
and is not
instead of !=
, like in this example df = df[df.line_race is not None]
will work
just to add another solution, particularly useful if you are using the new pandas assessors, other solutions will replace the original pandas and lose the assessors
df.drop(df.loc[df['line_race']==0].index, inplace=True)
.reset_index()
as well if someone ends up using index accessors
If you want to delete rows based on multiple values of the column, you could use:
df[(df.line_race != 0) & (df.line_race != 10)]
To drop all rows with values 0 and 10 for line_race
.
drop = [0, 10]
and then something like df[(df.line_race != drop)]
df[(df.line_race != drop)]
does not work, but I guess there is a possibility to do it more efficient. I do not have a solution right now, but if someone has, please let us now.
Though the previous answer are almost similar to what I am going to do, but using the index method does not require using another indexing method .loc(). It can be done in a similar but precise manner as
df.drop(df.index[df['line_race'] == 0], inplace = True)
The best way to do this is with boolean masking:
In [56]: df
Out[56]:
line_date daysago line_race rating raw wrating
0 2007-03-31 62 11 56 1.000 56.000
1 2007-03-10 83 11 67 1.000 67.000
2 2007-02-10 111 9 66 1.000 66.000
3 2007-01-13 139 10 83 0.881 73.096
4 2006-12-23 160 10 88 0.793 69.787
5 2006-11-09 204 9 52 0.637 33.106
6 2006-10-22 222 8 66 0.582 38.408
7 2006-09-29 245 9 70 0.519 36.318
8 2006-09-16 258 11 68 0.486 33.063
9 2006-08-30 275 8 72 0.447 32.160
10 2006-02-11 475 5 65 0.165 10.698
11 2006-01-13 504 0 70 0.142 9.969
12 2006-01-02 515 0 64 0.135 8.627
13 2005-12-06 542 0 70 0.118 8.246
14 2005-11-29 549 0 70 0.114 7.963
15 2005-11-22 556 0 -1 0.110 -0.110
16 2005-11-01 577 0 -1 0.099 -0.099
17 2005-10-20 589 0 -1 0.093 -0.093
18 2005-09-27 612 0 -1 0.083 -0.083
19 2005-09-07 632 0 -1 0.075 -0.075
20 2005-06-12 719 0 69 0.049 3.360
21 2005-05-29 733 0 -1 0.045 -0.045
22 2005-05-02 760 0 -1 0.040 -0.040
23 2005-04-02 790 0 -1 0.034 -0.034
24 2005-03-13 810 0 -1 0.031 -0.031
25 2004-11-09 934 0 -1 0.017 -0.017
In [57]: df[df.line_race != 0]
Out[57]:
line_date daysago line_race rating raw wrating
0 2007-03-31 62 11 56 1.000 56.000
1 2007-03-10 83 11 67 1.000 67.000
2 2007-02-10 111 9 66 1.000 66.000
3 2007-01-13 139 10 83 0.881 73.096
4 2006-12-23 160 10 88 0.793 69.787
5 2006-11-09 204 9 52 0.637 33.106
6 2006-10-22 222 8 66 0.582 38.408
7 2006-09-29 245 9 70 0.519 36.318
8 2006-09-16 258 11 68 0.486 33.063
9 2006-08-30 275 8 72 0.447 32.160
10 2006-02-11 475 5 65 0.165 10.698
UPDATE: Now that pandas 0.13 is out, another way to do this is df.query('line_race != 0')
.
query
. It allows for more rich selection criteria (eg. set-like operations like df.query('variable in var_list')
where 'var_list' is a list of desired values)
query
is not very useful if the column name has a space in it.
df = df.rename(columns=lambda x: x.strip().replace(' ','_'))
df.columns = df.columns.str.replace(' ', '_')
.
In case of multiple values and str dtype
I used the following to filter out given values in a col:
def filter_rows_by_values(df, col, values):
return df[~df[col].isin(values)]
Example:
In a DataFrame I want to remove rows which have values "b" and "c" in column "str"
df = pd.DataFrame({"str": ["a","a","a","a","b","b","c"], "other": [1,2,3,4,5,6,7]})
df
str other
0 a 1
1 a 2
2 a 3
3 a 4
4 b 5
5 b 6
6 c 7
filter_rows_by_values(df, "str", ["b","c"])
str other
0 a 1
1 a 2
2 a 3
3 a 4
def filter_rows_by_values(df, col, values, true_or_false = False): return df[df[col].isin(values) == true_or_false]
df[df[col].isin(values) == False]
by another negating condition using the tilde ~
invert operator df[~df[col].isin(values)]
. See How can I obtain the element-wise logical NOT of a pandas Series?
The given answer is correct nontheless as someone above said you can use df.query('line_race != 0')
which depending on your problem is much faster. Highly recommend.
DataFrame
variable names like me (and, I'd venture to guess, everyone as compared to the df
used for examples), because you only have to write it once.
One of the efficient and pandaic way is using eq()
method:
df[~df.line_race.eq(0)]
df[df.line_race.ne(0)]
?
Another way of doing it. May not be the most efficient way as the code looks a bit more complex than the code mentioned in other answers, but still alternate way of doing the same thing.
df = df.drop(df[df['line_race']==0].index)
Adding one more way to do this.
df = df.query("line_race!=0")
I compiled and run my code. This is accurate code. You can try it your own.
data = pd.read_excel('file.xlsx')
If you have any special character or space in column name you can write it in ''
like in the given code:
data = data[data['expire/t'].notnull()]
print (date)
If there is just a single string column name without any space or special character you can directly access it.
data = data[data.expire ! = 0]
print (date)
Just adding another way for DataFrame expanded over all columns:
for column in df.columns:
df = df[df[column]!=0]
Example:
def z_score(data,count):
threshold=3
for column in data.columns:
mean = np.mean(data[column])
std = np.std(data[column])
for i in data[column]:
zscore = (i-mean)/std
if(np.abs(zscore)>threshold):
count=count+1
data = data[data[column]!=i]
return data,count
Just in case you need to delete the row, but the value can be in different columns. In my case I was using percentages so I wanted to delete the rows which has a value 1 in any column, since that means that it's the 100%
for x in df:
df.drop(df.loc[df[x]==1].index, inplace=True)
Is not optimal if your df have too many columns.
It doesn't make much difference for simple example like this, but for complicated logic, I prefer to use drop()
when deleting rows because it is more straightforward than using inverse logic. For example, delete rows where A=1 AND (B=2 OR C=3)
.
Here's a scalable syntax that is easy to understand and can handle complicated logic:
df.drop( df.query(" `line_race` == 0 ").index)
Success story sharing
df
is large? Or, can I do it inplace?df
with 2M rows and it went pretty fast.df = df[df['line race'] != 0]
df=df[~df['DATE'].isin(['2015-10-30.1', '2015-11-30.1', '2015-12-31.1'])]