Examining Employee Data in MySQL (with a dash of Tableau)

In this post, I discuss a handful of SQL queries on employee data that I implemented in MySQL Workbench. I wrote these queries from scratch, deliberately choosing research questions that would require more complex code than I had much experience writing.

Predicting Customer Churn with Python

In this post, I examine and discuss the four classifiers I fit to predict customer churn: K Nearest Neighbors, Logistic Regression, Random Forest, and Gradient Boosting. I first outline the data cleaning and preprocessing procedures I implemented to prepare the data for modeling. I then discuss each model in turn, highlighting what the model actually does, how I tuned its hyperparameters, and the results it produced. The best-performing model has an out-of-sample test performance of about 80% and, more importantly, an AUC of 0.84. I conclude by comparing the efficacy of the models and discussing avenues by which they could be improved further.

An Exploration of Northwind Using MySQL

I’ve been focusing recently on taking my SQL skills to the next level: defining more complicated subqueries, working with triggers and views, using Online Analytical Processing tools, and the like. After taking six online courses and working through numerous exercises, I wanted to get my hands dirty with some data in a more applied way. In this post, I walk through some queries and analyses I performed on the oft-used Northwind data. All of the analyses herein were done in MySQL using the MySQL Workbench GUI.

Retail Sales Forecasts

I’ve spent quite a bit of time over the last year or so digging into the nitty-gritty of time series models. My first thorough exposure to the topic came by way of a month-long Advanced Time Series Analysis course at the Inter-University Consortium for Political and Social Research. I then spent some time learning various tools for manipulating and modeling time series in R. One flavor of time series models particularly piqued my interest: forecasting. After taking Rob Hyndman’s Forecasting course on DataCamp and reading some of his book with George Athanasopoulos, I wanted to get my hands dirty trying out some of the different forecasting models.

Loan Default and Returns on Investment Analyses

In this post, I discuss the logistic regression and random forest classification models I built to predict loan default. I also examine returns on investment as a function of a few key predictors. I find that the logistic regression model slightly outperforms the random forest model with respect to predictive ability. I conclude by discussing the strengths and weaknesses of both models and highlighting other pertinent variables that might be used to build a better predictive model.

Boston Geographical Crime Analyses

These analyses served as the basis for the final project I submitted for a graduate-level Geographic Information Systems (GIS) course at Indiana University. I was primarily interested in examining the spatial distribution of crime in Boston as well as the correlates of crime rates. Drawing on prior research, I began the project with the following hypotheses:

  • The proportion of the population that is white is negatively associated with crime rates (Liska & Bellair 1995; Blau & Blau 1982; Braithwaite 1979)
  • Median income is negatively associated with crime rates (Gould et al. 2002; Levitt 1999)