Pandas is the king of data manipulation in Python and your best friend for most data needs. Any data scientist who intends to use Python as their tool of choice must master pandas; it is compulsory, like learning to walk before you can run.

Here is my quick reference list of functions. Note that since reading written material is no substitute for repeated practice, you should not expect to remember the functions below. Better to treat this list as a cheatsheet to refer to when working through practice problems, such as the ones here.

Creating a dataframe

pd.DataFrame(x, index, colnames) creates a pandas dataframe from some data x, a list of row labels index and a list of column labels colnames. When passed by keyword, these parameters are named data, index and columns.

For example:
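A minimal sketch of this call, using made-up data. The names data and df and the labels are purely illustrative.

```python
import pandas as pd

# Hypothetical data: three rows, two columns
data = [[1, 2], [3, 4], [5, 6]]

# Row labels go in index, column labels go in columns
df = pd.DataFrame(data, index=["a", "b", "c"], columns=["x", "y"])
print(df)
```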

You can also create a dataframe without following this syntax. Here’s a multi-column version from a dictionary:
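A small sketch of the dictionary form, again with invented values. Each key becomes a column label and each list becomes that column's data.

```python
import pandas as pd

# Keys become column labels; the lists must all have the same length
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat"],
    "age": [31, 45, 27],
})
print(df)
```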

To create a Series, use pd.Series(x, index). It lets you create a series from an array, dict or scalar x with index index.
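The three kinds of input could look like this. The variable names and values are illustrative assumptions, not part of any fixed recipe.

```python
import pandas as pd

# From a list, with explicit index labels
s_from_list = pd.Series([10, 20, 30], index=["a", "b", "c"])

# From a dict: the keys become the index
s_from_dict = pd.Series({"a": 10, "b": 20})

# From a scalar: the value is repeated once per index label
s_from_scalar = pd.Series(5, index=["a", "b", "c"])
```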

Dataframe functions

The following are some useful dataframe functions:

  1. pd.DataFrame.head() – returns the first five rows of a dataframe by default; pass n for a different number.
  2. pd.DataFrame.tail() – returns the last five rows of a dataframe by default; pass n for a different number.
  3. pd.DataFrame.index – the index (row labels) of a dataframe.
  4. pd.DataFrame.columns – the column labels of a dataframe.
  5. pd.DataFrame.dtypes – the data type of each column of a dataframe.
  6. pd.DataFrame.values – the values of a dataframe as a numpy array.
  7. pd.DataFrame.describe() – summarise a dataframe: return summary statistics including the number of non-null observations per column, the mean of each column and the standard deviation of each column.
  8. pd.DataFrame.info() – print a brief summary of a dataframe, including column data types and non-null counts.
  9. pd.DataFrame.T – transpose a dataframe.
  10. pd.DataFrame.sort_index() – sort a dataframe by its index labels. Can specify the axis (row labels or column labels) and the order of sorting.
  11. pd.DataFrame.sort_values('col') – sort a dataframe by the values in the column named col.
  12. pd.DataFrame.iloc[i] – slice and subset your data by integer position.
  13. pd.DataFrame.loc[] – slice and subset your data by label (index and column labels, which are often strings) or by a boolean mask.
  14. pd.DataFrame.isin(l) – return a dataframe of True/False values indicating whether each element is in the list l.
  15. pd.DataFrame.set_index(s) – set the index of a dataframe to column name(s) s, where s can be a list of column names to create a MultiIndex.
  16. pd.DataFrame.swaplevel(i,j) – swap the levels i and j in a MultiIndex.
  17. pd.DataFrame.drop('c1', axis=1, inplace=True) – drop a column c1 from a dataframe, modifying the dataframe in place.
  18. pd.DataFrame.iterrows() – a generator for iterating over the rows of a dataframe as (index, Series) pairs.
  19. pd.DataFrame.apply(f, axis) – apply a function f to each column (axis=0) or each row (axis=1) of a dataframe.
  20. pd.DataFrame.applymap(f) – apply a function f elementwise to a dataframe. Deprecated since pandas 2.1 in favour of pd.DataFrame.map(f).
  21. pd.DataFrame.drop(s, axis=1) – return a copy of a dataframe with column s removed; the original is left unchanged.
  22. pd.DataFrame.resample('offsetString') – convenient way to group timeseries into bins. See here for details on the offset string and here for some examples.
  23. pd.DataFrame.merge(df2) – join a dataframe df2 to another dataframe. Can specify the type of join with the how parameter.
  24. pd.DataFrame.append(df2) – append the rows of the dataframe df2 to a dataframe (similar to rbind() in R). Deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat([df, df2]) instead.
  25. pd.DataFrame.reset_index() – reset the index back to the default numeric row counter; the old index becomes a column unless drop=True is passed.
  26. pd.DataFrame.idxmax() – dataframe equivalent of the numpy argmax method; returns the index label of the maximum in each column.
  27. pd.DataFrame.isnull() – indicates whether each value is null.
  28. pd.DataFrame.from_dict(d) – create a dataframe from a dictionary d.
  29. pd.DataFrame.stack() – turn column labels into the innermost level of the index.
  30. pd.DataFrame.unstack() – turn the innermost level of the index into column labels.
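To make a few of the entries above concrete, here is a short sketch of selecting, sorting, filtering, dropping and merging. The dataframes, column names and values are all invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp": [4, 19, 31]},
    index=["r1", "r2", "r3"],
)

first_row = df.iloc[0]              # select by integer position
lima_temp = df.loc["r2", "temp"]    # select by row label and column label
by_temp = df.sort_values("temp", ascending=False)

# isin gives a boolean Series that can be used to filter rows
mask = df["city"].isin(["Oslo", "Pune"])
subset = df[mask]

doubled = df["temp"].apply(lambda t: t * 2)

# Without inplace=True, drop returns a new dataframe and leaves df alone
no_temp = df.drop("temp", axis=1)

# A left join keeps every row of df, filling unmatched cells with NaN
pop = pd.DataFrame({"city": ["Oslo", "Lima"], "pop_m": [0.7, 10.0]})
merged = df.merge(pop, on="city", how="left")
```

Note that merge returns a result with a fresh numeric index, so the r1, r2, r3 labels do not survive the join in this sketch.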

Groupby methods

To group a dataframe by a column (or columns), use pd.DataFrame.groupby('colname'). This returns a DataFrameGroupBy object, on which you can call a certain set of methods.

Assume gb is a DataFrameGroupBy object returned from calling pd.DataFrame.groupby(). There is a basic family of functions that you can commonly call on these objects: sum, min, max, mean, median and std will all be very useful to you. Some other useful methods are:

  1. gb.agg(arr) – apply each of the functions named in the list arr to every group and return the results, one column per function.
  2. gb.size() – return the number of elements in each group.
  3. gb.describe() – return summary statistics for each group.
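A minimal sketch of these groupby methods on a toy dataframe. The team and score columns are assumptions made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [1, 3, 2, 4, 6],
})

gb = df.groupby("team")

means = gb["score"].mean()    # one value per group
sizes = gb.size()             # number of rows in each group

# agg with a list of function names gives one column per function
summary = gb["score"].agg(["min", "max", "sum"])
```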

String methods

The pandas library provides a set of methods dedicated to string manipulation and string handling. They are reached through the str accessor on Series objects, which sits at pd.Series.str in the pandas hierarchy of functions.

Let s be a Series made up of strings. Then the following are some useful methods:

  1. s.str[0] – return the first character of each element in s.
  2. s.str.lower() – change each element of s to lowercase.
  3. s.str.upper() – change each element of s to uppercase.
  4. s.str.len() – return the number of characters in each element of s.
  5. s.str.strip() – remove leading and trailing whitespace from the elements of s.
  6. s.str.replace('s1', 's2') – replace a substring s1 with a substring s2 for each element of s. Since pandas 2.0, s1 is treated as a literal string by default; pass regex=True to use a regular expression.
  7. s.str.split('s1') – split up the elements of s using s1 as a separator; each element becomes a list.
  8. s.str.get(i) – extract the ith item from each element of s, for example from the lists produced by split.
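The string methods chain together naturally. Below is a small sketch on an invented Series of names, with regex=False passed explicitly so the behaviour does not depend on the pandas version.

```python
import pandas as pd

s = pd.Series(["  Alice Smith ", "bob jones", "Carol  "])

clean = s.str.strip()             # drop leading/trailing whitespace
lower = clean.str.lower()
lengths = clean.str.len()
initials = clean.str[0]           # first character of each element

parts = clean.str.split(" ")      # each element becomes a list of substrings
first_names = parts.str.get(0)    # take item 0 from each list

# Literal (non-regex) replacement
fixed = clean.str.replace("bob", "Bob", regex=False)
```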

Miscellaneous functions

  1. pd.__version__  – return the version of pandas.
  2. pd.date_range() – create a sequence of dates as a DatetimeIndex. Some options include:
    • a start date and an end date (e.g. pd.date_range('2015-01-05', '2015-01-10'))
    • a start date, end date and a frequency (e.g. pd.date_range('2016-01', '2016-10', freq='M'); from pandas 2.2 the month-end alias is 'ME' and 'M' is deprecated)
    • a start date and the number of periods (e.g. pd.date_range('2016-01', periods=10))
  3. pd.read_csv(filepath, sep, index_col) – read in a csv file from a local path or a web address. Specify the separator with the sep parameter, and the column to use as the row labels of the table with the index_col parameter.
  4. pd.Series.value_counts() – count how many times each value appears in a column. The top-level pd.value_counts() is deprecated in recent pandas versions.
  5. pd.crosstab() – create a frequency table of two or more factors.
  6. pd.Series.map(f) – the Series version of applymap.
  7. pd.to_datetime() – convert something to a numpy datetime64 format.
  8. pd.to_numeric() – convert something to a numeric format (integer or float, depending on the values).
  9. pd.concat(objs) – put together the dataframes in the list objs along a given axis, similar to rbind() or cbind() in R.
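Here is a brief sketch of several of these functions in use. All data, column names and variable names are invented for the example, and read_csv is left out because it needs a real file or URL.

```python
import pandas as pd

days = pd.date_range("2015-01-05", "2015-01-10")   # daily, both ends included
ten_days = pd.date_range("2016-01", periods=10)    # ten days from 2016-01-01

dates = pd.to_datetime(pd.Series(["2020-01-31", "2020-02-29"]))
nums = pd.to_numeric(pd.Series(["1", "2", "3.5"]))  # strings to numbers

df = pd.DataFrame({
    "sex": ["f", "m", "f", "f"],
    "smoker": ["n", "n", "y", "n"],
})
counts = df["sex"].value_counts()              # frequency of each value
table = pd.crosstab(df["sex"], df["smoker"])   # two-way frequency table

top = pd.DataFrame({"x": [1, 2]})
bottom = pd.DataFrame({"x": [3]})
rows = pd.concat([top, bottom], ignore_index=True)  # row-wise, like rbind()
cols = pd.concat([top, top], axis=1)                # column-wise, like cbind()
```

Passing ignore_index=True in the row-wise case renumbers the result from zero; without it, the original row labels of each piece are kept and may repeat.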

Final words

This list is by no means complete, nor does it pretend to be. It is simply a list of the functions I have encountered on my journey learning pandas.

If you create your own list, and post your list on your blog, and send me a link to your list, then we both may learn something new today.
