wuju Predictions made simple. Sorta.


New Version, Hearst Finalist

It's been an incredibly busy month for me.  I got married, went on a honeymoon, and presented at the Convio Summit in Baltimore earlier today.  I then flew to Boston, where I am preparing to present my solution for the Hearst Challenge tomorrow afternoon.  I am a finalist and am really hoping that tomorrow is a great day for me, but even if I don't end up winning, it is an honor to be a finalist and I'm looking forward to meeting everyone.

So that was one big long paragraph that serves as an excuse for why some of the features I've mentioned are not yet complete (and thus the new version is not 1.0).  However, I have made an interim release with many new features, enhancements, and bug fixes.  You can download it here.  The changes:

  • Added automatic checks for updates
  • Added ability for a user to manually check for updates
  • Added ability for users to download new versions directly from the application
  • Added functionality to return error messages from R directly to the user
  • Added pop-up changelog on first use
  • Minor tweaks to the UI to make it less garish
  • Fixed a problem that prevented saving Logistic Regression models
  • Added troubleshooting text to the error messages displayed when instantiating the DataHelper class throws an exception

I will have more features soon, and after I present tomorrow, I will post information about my solution for the Hearst Challenge.  Sorry for all the delay!

Filed under: wuju No Comments

New Version Coming, Hearst Challenge Update

Wow, I've really neglected this the last few weeks.  Luckily, it looks like hardly anybody is visiting the website these days so nobody is missing out too much.  I do have an upcoming release of wuju which I daresay I will call version 1.0.  Some of the new features include:

  • Changed the way I call the model-fitting commands in R to lower the memory overhead of the call (for R dorks: where possible, I'll call the S3 method that takes the x and y data directly rather than the formula method).
  • The application will automatically check for updates and provide a direct download link if an update exists.  No more filling out forms.
  • Finally, support for loading/saving files in various Excel file formats.
  • Errors raised by R will be returned directly to the user instead of merely "Error in the application" messages.
  • Bug fix for a problem you may have encountered when trying to save a Logistic Regression model.
  • Support for unsupervised learning via K-Means clustering.
  • Summary statistics for error rates/confusion matrices on completed models.
  • Other assorted minor bug fixes and aesthetic changes.

I can tell all 3 of you reading this are super excited.
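
For the curious, here is a rough idea of what the error rate and confusion matrix summaries in the list above boil down to.  This is only an illustrative Python sketch, not wuju's code (wuju does its computation in R), and the function names and sample labels are made up for the example.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair of class labels."""
    return Counter(zip(actual, predicted))


def error_rate(actual, predicted):
    """Fraction of records whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Made-up labels: 1 = responded, 0 = did not respond.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
for (a, p), count in sorted(matrix.items()):
    print(f"actual={a} predicted={p}: {count}")
print(f"error rate: {error_rate(actual, predicted):.2f}")
```

The off-diagonal cells (actual and predicted disagree) are the misclassifications, so the error rate is just their total divided by the number of records.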

As I've mentioned a couple times here, I've been hard at work competing in the Hearst Analytics Challenge.  The 3 month wait is nearly over, as the live leaderboard has closed with me (team NP) at the top.  Sadly, this does not mean I am about to be $25,000 richer just yet.  There is a hold-out evaluation set that everyone will have to score and submit predictions for, and performance on that set is what will determine the winner.

I think I have a fairly good shot here, however.  The evaluation set is substantially similar to the validation set we've been using to receive scores from the live scoring feature, and my model's performance has held up through the last week or so despite a flurry of entries from Global Decision, Euro RSCG Discovery, R&R, Rexer-Russo, and AmAnalytics.  I was able to improve my score by about 1/10,000th of a point late in the game, which I hope will help.  All that said, we are all incredibly close in the standings.  I don't have time (and perhaps not even the talent!) to do the math right now, but I would be shocked if there were a statistically significant difference between my score and that of anyone else with a sub-.228 score.

I'm also really curious to see who all these people are that I've been e-competing against for a while.  I've googled pretty much every team name on that leaderboard (except a few that I recognize) and would love to hear from other participants about their experiences in the challenge.  Sadly, the discussion around this challenge is nowhere near as robust as the discussions that happen over at Kaggle, and I've got several theories for why that may be.  At any rate, check back soon - I'll be posting information about my ultimate placement in the contest, my thoughts on the contest, and a description of my methodology.

Filed under: wuju 1 Comment

More Features In the Works

Hello again!  I'm back from vacation and working on some bug fixes and new features.  We've got much better error messages coming, plus a fix for a nasty bug that might prevent you from saving logistic regression models.  I am also building in a feature that will automatically check for a new version and provide you with a download link if one exists so you don't have to repeatedly fill out the download form.  I know you're all just so excited you can't stand it.

As far as data mining features go, I don't have a lot in the works.  I'll probably provide support for a couple of additional algorithms (K-Means clustering and Gradient Boosting Machines, I think).  I'm also working on a feature that will show you the R code that's being generated by wuju so that you can replicate it yourself or use wuju to help learn R, though this feature may not be ready in time for the next release.

I've been competing in the Hearst Analytics Challenge, using it both as a testing ground and a proof-of-efficacy for wuju.  I'm on the leaderboard as "NP".  I was in first place until mid-July, when Global Decision took the lead.  I haven't made any official entries in a while, just coasting comfortably in second place.  But with Sigma Square pushing ahead of me the last couple of days, I am motivated to attack the problem again.  I'll be posting a series of blog entries on my work in this challenge, how I used wuju, and what I think of the actual competition once the contest is over in September.

In the meantime, let me know if you have any bugs to report or new features you'd like to see.  Thanks!

Filed under: wuju No Comments

Version Released!

Hello! I've made a second beta release of wuju. We've got some new features here:

  • You can now read in training data in various formats: CSV, Tab-Delimited, or arbitrary-delimited (space, pipe, dot, etc).
  • You can now read in prediction data in various formats: CSV, Tab-Delimited, or arbitrary-delimited (space, pipe, dot, etc).
  • Wuju helps you deal with blank values by prompting you to replace them in both your training and your prediction data. See my previous post for more information on this.
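
To give a sense of what reading an arbitrarily delimited file involves, here is a small Python sketch.  It is not how wuju does it (wuju hands the work to R), and the function name, sample data, and the choice to turn blank cells into None are assumptions made for the example.

```python
import csv
import io


def read_delimited(text, delimiter=","):
    """Parse delimited text with a header row into a list of dicts.

    Blank cells become None so they can be spotted and replaced later.
    The delimiter must be a single character (comma, tab, pipe, etc.).
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    return [
        {name: (cell if cell != "" else None) for name, cell in zip(header, row)}
        for row in body
    ]


# Made-up pipe-delimited sample with one blank gift amount.
sample = "id|gift|channel\n1|25|mail\n2||web\n"
records = read_delimited(sample, delimiter="|")
print(records)
```

Passing delimiter="\t" handles tab-delimited text the same way, which is why a single loader can cover all three formats.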

I've also fixed one bug:

  • Information on the 'Use Models' tab about which variables were specified as numeric or categorical contained pesky 'as.numeric()' or 'as.factor()' text around the variable names.  I've removed that.
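
The fix amounts to unwrapping the variable name before displaying it.  A minimal Python sketch of the idea follows; wuju itself is a .NET application, so this is illustrative only, and the function name and sample variable names are hypothetical.

```python
import re

# Matches a whole term of the form as.numeric(name) or as.factor(name).
_WRAPPER = re.compile(r"^as\.(?:numeric|factor)\((.*)\)$")


def display_name(term):
    """Return the bare variable name, dropping an as.numeric()/as.factor() wrapper."""
    term = term.strip()
    match = _WRAPPER.match(term)
    return match.group(1) if match else term


for term in ["as.numeric(gift_amount)", "as.factor(channel)", "recency"]:
    print(display_name(term))
```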

I've added a short form to the download page so that I can get a better handle on who is using wuju for what, so hopefully that's not too annoying for you all. Download the new version here. You can simply delete the first version from your computer. Models created using the first version will be fully compatible with this newer version.

I'm still working on fixed-width text, Excel file compatibility, and a few other enhancements, but I'll be on vacation most of the coming week. Look for more updates early next weekend. Thanks!

Filed under: wuju No Comments

Working with Missing Data

One frustrating but unavoidable truth about working with data in the real world is that frequently there will be missing values - blanks - in your data.  As suggested in the comments here, I'm working on a few ideas I have that will allow you to more easily cope with this truth using wuju.

Currently, the plan is to have wuju check your file for missing values when you load it.  If it finds any columns that contain missing values, it will prompt you to replace them with new values.  My thinking is to wholesale replace all missing values with one of: the mean of the column, the median of the column, the mode of the column, zero, one, negative one, or nothing (leave NA).  For most imputation purposes, these values should work fine.  Here's an example of what I've got:

Any thoughts?  Should I add an "other" option where the user can specify a constant number?  I'm doing a little work to polish this up before the next release, which is why the weekend passed with no release.  I realize that this only works for numeric columns, but based on my testing I believe that is sufficient, as R treats blank values in factor columns as their own distinct value.
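
For anyone who wants to see the replacement options spelled out, here is a minimal Python sketch of the idea.  It is not wuju's implementation (wuju drives R), and the strategy names are hypothetical labels chosen for the example.

```python
import statistics


def impute(values, strategy):
    """Replace missing entries (None) in a numeric column.

    strategy is one of: "mean", "median", "mode", "zero", "one",
    "negative_one", or "leave" (keep the missing values as they are).
    Note: mean/median/mode raise an error if every value is missing.
    """
    if strategy == "leave":
        return list(values)

    present = [v for v in values if v is not None]
    constants = {"zero": 0.0, "one": 1.0, "negative_one": -1.0}
    if strategy in constants:
        fill = constants[strategy]
    elif strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return [fill if v is None else v for v in values]


# Made-up column with two blanks.
column = [2.0, None, 4.0, 4.0, None, 10.0]
print(impute(column, "mean"))
print(impute(column, "median"))
```

The fill value is computed from the non-missing entries only, so the blanks themselves never influence the mean, median, or mode.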

Let me know what you think.  I hope to get this pushed out by Tuesday, along with the aforementioned new file format support.

Filed under: wuju No Comments

Upcoming Features

We're approaching the 50th download of wuju, which is great!  Sadly, I haven't gotten feedback from anyone who's used it yet, so if you've given it a whirl and have some thoughts, please share.

In the meantime, I've been working out a development plan to make some improvements and add some functionality.  First up on the list will be adding the ability to load/save files in formats other than CSV.  My current thinking is to add:

  • Tab-delimited text
  • Arbitrarily-delimited text (space, dot, pipe, etc.)
  • Excel 2003 format (xls)
  • Excel 2007+ format (xlsx)
  • Fixed-width

I'm not sure I'll get around to fixed-width by the next release, but the rest are high priorities.  I'm also planning on removing the "as.numeric()" and "as.factor()" text from the list of variables used in a particular model so that only the variable name is displayed.  Finally, I'm working on improving some of the error messages and adding a way to output summary information for models (things like the coefficients, r², etc).  The summary output already works in a 32-bit environment, but I'm having trouble with 64-bit R.  I'm hunting around in the R.NET source code to try and figure that out.

I anticipate releasing a new version, containing most of these changes, by the end of next week.  Until then, I'll try and post more often here, turning this page into a bit of a development blog.

Filed under: wuju 5 Comments

Introducing wuju

In the non-profit direct marketing industry, organizations have myriad options when it comes to finding vendors to help with predictive modeling projects.  In my own work, I've had the good fortune to work at an organization that has the budget available to support these projects, and I have colleagues who understand the value that these investments can bring to the table.

Unfortunately, many organizations either do not have the money to spend on these projects or do not have the organizational buy-in needed to pursue them.  Building capacity in-house is the next best option, but hiring new staff can be even more challenging, and existing staff taking time out of their day to learn to use open source tools like R or Weka is often impractical.

Most modeling projects in our field employ only the basic methods: linear regression, logistic regression, decision trees, or random forests.  Frequently, only simple data is needed to produce a valuable model: RFM variables, acquisition channel, presence or absence of phone numbers and email addresses, etc.  My point here is that you don't need a lot of money, a vendor, a degree in statistics, or the ability to write code in order to create a basic, useful model.  If you have some of those things you will likely build a better model, but most of the time you're not looking for perfection.  You're looking for something to quickly and easily prioritize telemarketing contacts.  You're looking for the best 15,000 lapsed names to add to your renewal mailing.

You can do this yourself.  Wuju is designed to help you.  It uses a powerful (free!) statistical computing platform (The R Project) to do the heavy lifting, the .NET platform to provide a simple user interface, and the R.NET package to connect the two.  Check out the rest of the site for more details, or download it and give it a shot.  Like it?  Hate it?  Confused?  Leave a comment or shoot me an email: wuju@adamscruggs.com.  Thanks!

Filed under: wuju No Comments