Note: This is Part 2 of a three-part article.
In Part 1 of this article, we downloaded multiple CSV files containing LendingClub's historical loan data. Here we'll determine if this data has any issues and, if so, how to deal with them.
Looking at our data
Our fpath_list variable contains the paths of all the CSVs we downloaded. Let's take a quick look at one of them by calling the pandas.read_csv() method and displaying the first and last few lines of the file:
Hopefully, you noticed that something doesn't look right about the way this DataFrame is displayed. Let’s look at the list of columns:
Not what you expected, right?
We used read_csv() with default values for all of its parameters. While these default values work well for CSV files that are properly structured, the one we have is not.
Problems with our files
Our CSV files have two issues:
- They don't have column names on the first line. The first line is just this string: 'Notes offered by Prospectus (<https://www.lendingclub.com/info/prospectus.action>)'
- They contain summary totals in the last two lines.
The solution: skiprows= and skipfooters=
Below you see that we dealt with these issues by using the skiprows= and skipfooters= parameters. We encourage you to refer to the documentation to see how to use these and other parameters of the read_csv() method.
We have a list of file paths for our CSV files, which, hopefully, all have the structure. (It would be time-consuming to tweak the data-loading code for each one of them.) If this is the case, the next step is to load all of them into one big DataFrame.
A quick and easy way to achieve this is to first create an empty list (df_list) that will store individual DataFrames (one per CSV file). To populate this empty list, we'll iterate through file paths. On every iteration, we’ll:
- Load a CSV file with read_csv() and the parameters we identified earlier
- Make an assertion that all CSV files have the same columns as the very first file we looked at (for a brief intro to assertions, see here)
- Append the newly created DataFrame to df_list
After the loop is done, we'll combine all DataFrames in the df_list into one (df_all) using pandas' concat (concatenate) method.
Missing values and outliers
Our dataset contains some missing values. Ideally, for every column that's missing values, you'd like to know why values are missing and the preferred way of dealing with them (filling with some constant value, removing from the dataset, etc).
If you are lucky, there will either be an accompanying metadata file describing the reasons why some data is missing, or one of your colleagues might possess this information. In reality, this is rare, and most of the time, you'll have to make some educated guesses and decide how to deal with these issues. Here, for simplicity, we'll do the following:
- Calculate the percentage of values missing in each column
- Remove columns that are missing more than a predefined (by us) PERCENT_MISSING_THRES threshold value
Feel free to tweak this threshold in either direction and see how this changes the number of columns.
Now we are moving on to feature engineering, the process of modifying or creating new features from existing ones. Feature engineering can be domain-specific and can get fairly complex.
Here, we'll leave all numerical features (integers and floats) as is and focus our attention on object-type (which in pandas world often means string-type) columns. Let's see how many of them we have:
Which columns are object type?
And what data do they contain?
First, we notice a few datetime columns that were loaded as strings. Let's convert them to datetime objects with pandas' to_datetime() method.
Next, we see a few columns ('term', 'int_rate', 'revol_util', and 'emp_length') that contain numerical information but are stored as strings because of the unit names stored along with numbers (%, months, years, etc). For each of these columns, we'll need to come up with a function that takes a raw string value and outputs a cleaned-up numerical value (either integer or float, depending on the original data).
Below you see how we deal with the 'term' column by defining the term_to_int function to map a value like ' 36 months' to the number 36. Note that we use the strip() method inside this function. It turns out that all the values in the 'term' column start with a leading space! Can you guess how we found that out? Hint: check out pandas.Series.values.
Once we have the function that transforms our values, we use the pandas.Series.apply() method to apply this function to our column.
We deal with the columns 'int_rate', 'revol_util' and 'emp_length' the same way, adjusting our transformation function as needed by the raw values contained in each column.
What about all the other columns that contain string values? We can see their cardinality (that's the number of unique values in a categorical column) by chaining together the following methods:
It looks like all columns except for emp_title have a very low number of unique values compared to the total number of observations in our dataset. There's not much we can do here — in today's world, the number of different occupations is very large. We did, however, notice that differences in capitalization (e.g. teacher vs Teacher) contributed to the high cardinality of the emp_title column. One thing we can do is normalize the values by making them all lowercase.
As for the remaining variables containing strings, we'll convert them to pandas' categorical data type.
The categorical data type is useful when:
- We have a string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory
- The lexical order of a variable is not the same as the logical order ('one', 'two', 'three'). By converting to a categorical and specifying the order of the categories, sorting and min/max will use the logical order instead of the lexical order
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types)
Let's see what datatypes we have now. The only object datatype we expect to see is for the 'emp_title' column.
This concludes our feature engineering efforts.
Checkpointing the results
It’s a good idea to occasionally create checkpoints of your results by saving intermediate DataFrames. The main goal is to allow yourself to pause and return to your work later without having to redo earlier, potentially time-consuming steps. The most common way of doing this is to use pandas.DataFrame.to_csv() which, as the name suggests, saves your DataFrame as a CSV file. This approach has its pros and cons.
- Can be opened by many software tools like Excel, as well as other programming languages
- Files are typically much larger than the information they store (because everything is stored in plain text)
- The files don't store information about data types. That means all your dates/times, categorical variables, etc. will be loaded as simple strings
If you are planning to share your results with someone else, CSV may be the preferred format.
Pros of Feather:
- It has high read and write performance
- It can be read and written in Python, Rm and a few other programming languages
- It stores information about DataFrame data types
- Unfortunately, unlike CSV files, Feather files are binary files that cannot be viewed in text editors
Let’s load the Feather file back in and check the data types.
As expected, the Feather format has preserved the data types, and they are indeed the same as what we had earlier.
Now our data is ready for further exploration.
In Part 3, we'll take a look at a couple of Python libraries that can help us expedite the exploration stage.
Learn more about the data science bootcamp and other courses by visiting Practicum and signing up for your free introductory class.