Thursday, June 15, 2017

python pandas practice

iternate dataframe

for index, row in prediction_df.iterrows():

rename dataframe column name

df=df.rename(columns = {'two':'new_name'})

convert a list of string to dataframe

the first item in the list is the column names
pd.DataFrame(data[1:], columns=data[0])

dataFrame get all columns

list(df) returns all columns except the 'index'
to include the index, one way is
columns = list(df)

get values from next row

df['nextDayClose'] = df['Close'].shift(-1)

rolling window

Here is the example to calculate the 7 days average.
df['7_mean'] = df['Close'].rolling(window=7).mean()

dataframe remove rows containing NAN

dat.dropna(how='any') #to drop if any value in the row has a nan
dat.dropna(how='all') #to drop if all values in the row are nan

get last value of one column


dataframe change type to numeric

df = df.apply(pd.to_numeric, errors='ignore')

dataframe get one column

* putting text in the square brackets means manipulate columns
* putting number in the square brackets means manipulate rows

dataframe get multiple columns

df[['colName1', 'colName2', 'colName3']]

dataframe get first row

the api is like this df[startIndex:endIndex:sort], df[:1] equals df[0:1] which returns the first row of the dataframe

dataframe get last row


dataframe get all rows except first/last


dataframe reverse the order

to get rows except last and reverse the order, this one does not work, df[:-1:-1]
it has to be df[:-1][::-1]

dataframe sort by columns


dataframe keep rows have null/nan values

df = df[df.isnull().any(axis=1)]

readcsv and explicitly specify column type

quotes = pd.read_csv(csv, dtype={'Symbol': 'str'})
DataFrame.from_csv does not have this feature. It will convert "INF" to number infinite.

dataframe drop duplicates

quotes.drop_duplicates(subset=['Date'], keep='last', inplace=True)

two dataframes merge like inner join

merged = pd.merge(DataFrameA,DataFrameB, on=['Code','Date'])

dataframe only contain rows have null values

try to understand this

dataframe locate a cell

df.loc[0,'Close'] return the value of first row and column 'Close'
to set the value, df.loc[0,'Close']=123

dataframe locate by value

df.loc[df['column_name'] == some_value]

dataframe locate by lambda

quotes_df.loc[lambda df: pd.isnull(df.Open) | pd.isnull(df.Volume) | df.Open == 0,:]

dataframe locate with or condition

new_df = prediction_df.loc[(prediction_df['result'] != 0) | (prediction_df['prediction'] != 0)]
df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~

replace null with Nan

df = df.replace('null', np.nan)
then we can do dropna to delete the rows
df.dropna(how='any', subset=['Open', 'High', 'Low', 'Close', 'Volume'], inplace=True)

dateframe group by

group by one column value and returns multiple dataframes
dateSymbol_df is
          Date Symbol
    0   2017-08-10   SFBC
    1   2017-08-10   SVBI
    2   2017-08-09  SENEB
for date, symbol_df in dateSymbol_df.groupby('Date'):
    dateSymbols[date] = list(symbol_df['Symbol'])
the dateSymbols is
    '2017-08-10': ['SFBC','SVBI'],
    '2017-08-09': ['SENEB'],

iterate between two dates

for i in pd.date_range(start, end):

print more columns

pd.set_option('display.expand_frame_repr', False)

No comments:

Post a Comment