Using machine learning to predict LendingClub loan defaults

This spring I took some time to test out scikit-learn (often shortened to sklearn), the free, open-source machine learning library, recycling a project from a Data Analytics class in my MBA program.

The basic challenge: given application data from thousands of loans, can you predict which loans will default and which will not?

My code is on GitHub here:


In Excel, figuring out which columns were important and which were not was a long and painful process. Once I had a basic script in Python, I was able to use a decision tree classifier to work that out for me, and with greater accuracy. You can learn more about that here:

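To give a feel for how a decision tree can rank columns for you, here is a minimal sketch. It is not my actual script: the data is synthetic, the column names (interest_rate, annual_income, noise, defaulted) are made up for illustration, and the default flag is deliberately built from interest_rate alone so the ranking is easy to check by eye.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for cleaned loan data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
loans = pd.DataFrame({
    "interest_rate": rng.uniform(5, 25, n),
    "annual_income": rng.uniform(20_000, 150_000, n),
    "noise": rng.normal(size=n),
})
# Assumption for the demo: default depends only on the interest rate.
loans["defaulted"] = (loans["interest_rate"] > 15).astype(int)

X = loans.drop(columns="defaulted")
y = loans["defaulted"]

# A shallow tree keeps the result easy to interpret.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# feature_importances_ holds one score per column; the scores sum to 1.
importances = pd.Series(clf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On real loan data the scores would be spread across several columns rather than concentrated in one, but reading the sorted list is the same idea as what took me hours of trial and error in Excel.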
In 39 lines of (amateur) code you can go from basic, cleaned .csv data to visual tree graphs. It should take less than a minute to run. Powerful stuff, and what I love about coding data analytics solutions like this is that you can reuse them for many types of classification and regression problems.
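As a rough illustration of the "data to tree graph" step, the sketch below fits a small tree and exports it in Graphviz DOT format using scikit-learn's export_graphviz. Again, this is not my actual script: the tiny table is hand-made, and the column names and class labels ("repaid", "defaulted") are assumptions chosen for the example. With real data you would load your cleaned file with pandas.read_csv instead.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Tiny hand-made table standing in for a cleaned .csv (hypothetical columns).
toy = pd.DataFrame({
    "interest_rate": [6.0, 8.5, 11.0, 14.0, 17.5, 19.0, 22.0, 24.5],
    "annual_income": [90, 40, 75, 55, 60, 35, 80, 45],  # in thousands
    "defaulted": [0, 0, 0, 0, 1, 1, 1, 1],
})
features = ["interest_rate", "annual_income"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(toy[features], toy["defaulted"])

# With out_file=None, export_graphviz returns the DOT description as a string.
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=features,
    class_names=["repaid", "defaulted"],  # order follows the sorted labels 0, 1
    filled=True,
)
print(dot_source[:200])
```

If you save that string to a file such as loan_tree.dot and have Graphviz installed, running dot -Tpng loan_tree.dot -o loan_tree.png should turn it into an image of the tree.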

You can get the data here:

Feel free to install Python along with pandas and scikit-learn and give it a shot yourself! I am sure there are things I could improve, but it picked up on the proper drivers without any prompting from me. scikit-learn has a lot of interesting features that I am only starting to wrap my head around.