
Justin Bozonier is the Product Optimization Specialist at GrubHub, formerly a Sr. Developer/Analyst at Cheezburger. He has engineered a large, scalable analytics system and worked on actuarial modeling software. As Product Optimization Specialist he currently leads split test design, implementation, and analysis. Justin is a DZone MVB and is not an employee of DZone.

Algorithm of the Week: Practical Parallelizing in R

07.02.2013

I wrote an algorithm in R to run a Monte Carlo simulation of how many test subjects I need in a split test to detect an X% shift in the mean. It required hundreds of thousands of calculations to produce the final table, so the algorithm ran for a few minutes.
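To make the setup concrete, here is a minimal sketch of the kind of simulation I mean. The function name, the 25/1 normal parameters, and the 2% shift are illustrative stand-ins, not the values from my actual analysis:

estimate_power <- function(n, shift=0.02, trials=1000) {
  # Simulate `trials` split tests with n subjects per arm and return
  # the fraction that detect the shift at the 0.05 level
  p_values <- sapply(1:trials, function(i) {
    control   <- rnorm(n, mean=25, sd=1)
    treatment <- rnorm(n, mean=25 * (1 + shift), sd=1)
    t.test(control, treatment)$p.value
  })
  mean(p_values < 0.05)
}

estimate_power(100)  # repeat over a grid of n values to build the final table

Running a sweep like that serially is what took minutes, which is what pushed me toward parallelizing it.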

I'll talk about my specific problem in a future post, but for now I'll quickly introduce you to parallel operations in R.

First, install the "multicore" package. I can't say it's the "best" package, but it works: install.packages("multicore"), then load it with library(multicore). (Note that mclapply parallelizes by forking the R process, so it won't run in parallel on Windows.)

Now you can use a function called "mclapply" in place of "lapply" (or, as here, "mapply"; note that mclapply takes the data first and the function second). Let's create a slightly contrived example:

# Serial version: compare two simulated samples with a t-test, 10,000 times
mapply(function(i) t.test(rnorm(10000, mean=25, sd=1), rnorm(10000, mean=25, sd=1)), 1:10000)

This creates two normal distributions of 10,000 elements each and compares them with a t-test, 10,000 times over. It takes several seconds before it starts printing results. Because I implemented this with an apply function rather than a for loop, I can easily convert it to be multi-core friendly. Check out this example and try running it:

# Parallel version: the same work, fanned out across cores by mclapply
mclapply(1:10000, function(i) t.test(rnorm(10000, mean=25, sd=1), rnorm(10000, mean=25, sd=1)))

You might also try watching your CPU usage with both versions. The "mclapply" version will automatically max out all of your cores and finish MUCH faster, while the standard version keeps only a single core busy.
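If you want numbers instead of eyeballing a CPU meter, base R's system.time makes the comparison easy. A quick sketch; the mc.cores value is illustrative, since by default mclapply picks its worker count from the cores it detects:

run_once <- function(i) t.test(rnorm(10000, mean=25, sd=1), rnorm(10000, mean=25, sd=1))

system.time(mapply(run_once, 1:10000))                # serial: one core busy
system.time(mclapply(1:10000, run_once))              # parallel: all cores
system.time(mclapply(1:10000, run_once, mc.cores=2))  # cap the worker count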

OK, I know this seems like a contrived situation, but it sets us up perfectly for my next post, where I'll extend this technique to find the number of samples you need in an experiment to measure a statistically significant shift in the results. Because we're generating all of the data ourselves and aren't I/O bound, a simple multicore technique like this saves us a lot of time.

Stay tuned!

(Note: This article and the opinions expressed are solely my own and do not represent those of my employer.)

Published at DZone with permission of Justin Bozonier, author and DZone MVB.
