This spring I took some time to try out scikit-learn (often written sklearn), the free, open-source machine learning library, recycling a project from a Data Analytics class in my MBA program.
The basic challenge: Given a bunch of loan application data from thousands of loans, can you predict which loans will default and which will not?
My code is on GitHub: https://github.com/dforrestwilson/albums/blob/master/lendingclub.py
In Excel, figuring out which columns were important and which were not was a long and painful process. Once I had a basic script in Python, I was able to use a decision tree classifier to work that out for me, and with greater accuracy. You can learn more about decision trees here: http://scikit-learn.org/stable/modules/tree.html
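To give a feel for how a decision tree can rank columns, here is a minimal sketch. It is not the script linked above: the data is synthetic, and the column names (interest_rate, annual_income, noise, defaulted) are made up for illustration. It assumes pandas, NumPy, and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for loan data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 1000
loans = pd.DataFrame({
    "interest_rate": rng.uniform(5, 25, n),
    "annual_income": rng.uniform(20_000, 150_000, n),
    "noise": rng.normal(size=n),
})
# In this toy data, default depends only on the interest rate.
loans["defaulted"] = (loans["interest_rate"] > 15).astype(int)

features = ["interest_rate", "annual_income", "noise"]
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(loans[features], loans["defaulted"])

# feature_importances_ scores each column by how much it reduces impurity.
importances = pd.Series(clf.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print(importances)
```

Because the toy label is built from interest_rate alone, that column should come out on top. Real loan data will be far messier, so treat the ranking as a starting point rather than a verdict.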
In 39 lines of (amateur) code you can go from basic, cleaned .csv data to visual tree graphs. It should take less than a minute to run. Powerful stuff, and what I love about coding data analytics solutions like this is that you can reuse them for many types of classification and regression problems.
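The CSV-to-tree-graph flow can be sketched roughly like this. Again, this is an illustration under assumptions and not the linked script: the tiny in-memory CSV, its column names, and the class labels "repaid" and "defaulted" are all invented. It uses scikit-learn's export_graphviz to produce Graphviz DOT text describing the fitted tree.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Tiny in-memory stand-in for a cleaned CSV; columns and values are made up.
csv_text = """interest_rate,annual_income,defaulted
7.5,85000,0
9.0,62000,0
11.2,54000,0
16.8,41000,1
19.5,38000,1
22.1,30000,1
"""
df = pd.read_csv(io.StringIO(csv_text))

features = ["interest_rate", "annual_income"]
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(df[features], df["defaulted"])

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=features,
    class_names=["repaid", "defaulted"],  # labels for classes 0 and 1
    filled=True,
)
print(dot_source)
```

If you save that text to a file such as tree.dot and have Graphviz installed, a command like `dot -Tpng tree.dot -o tree.png` should turn it into an image. Swapping DecisionTreeClassifier for DecisionTreeRegressor is one way the same pattern carries over to regression problems.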
You can get the data here: https://www.lendingclub.com/info/download-data.action
Feel free to install Python along with pandas and scikit-learn and give it a shot yourself! I am sure there are things I could improve, but it picked up on the proper drivers of default without any prompting from me. Scikit-learn has a lot of interesting features that I am only starting to wrap my head around.