
Tips on Computing with Big Data in R


Big data is ubiquitous. The good news is that it provides great opportunities for the data analyst. With more data comes more information and more insight: you can relax assumptions required with smaller data sets and let the data speak for itself. But big data also presents problems. The size of data sets is now growing much faster than the speed of single cores, of RAM, and of hard drives, and many tools do not handle this well; once the data gets too large, the software simply stops working.


When it comes to big data, R users face two issues of scale: capacity and speed. Capacity is the first limit: data sets in R typically must fit into memory, and even when they do, only certain types of analyses are feasible. Speed is the second: in some cases, computation may simply be too slow to be useful.

If you are used to working with smaller data sets in R, you need to think differently about how you perform your analyses when you move to big data. If you are used to working in a High Performance Computing (HPC) environment, you also need to think differently when big data enters the picture. High Performance Computing is CPU-centric, typically focusing on using many cores to perform lots of processing on small amounts of data. High Performance Analytics (HPA) is data-centric: the focus is on feeding data to the cores, that is, on disk I/O, data locality, efficient threading, and data management in RAM.


The RevoScaleR package that is included with Machine Learning Server provides tools and examples for addressing the speed and capacity issues involved in High-Performance Analytics. It provides data management and analysis functionality that scales from small, in-memory data sets to huge data sets stored on disk. The analysis functions are threaded to use multiple cores, and computations can be distributed across multiple computers (nodes) on a cluster or in the cloud.
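As a rough illustration of how this looks in practice, the sketch below imports a text file into RevoScaleR's .xdf format and runs a summary and a linear model on it. The file and variable names (flights.csv, ArrDelay, DayOfWeek) are hypothetical placeholders; the functions shown (rxImport, rxSummary, rxLinMod) are part of the RevoScaleR package.

# A minimal sketch, assuming RevoScaleR is available (it ships with Machine Learning Server).
# File and variable names are hypothetical placeholders.
library(RevoScaleR)

# Import a CSV into the .xdf format, which stores the data on disk in chunks
# so it does not have to fit in memory all at once.
flightsXdf <- rxImport(inData = "flights.csv", outFile = "flights.xdf", overwrite = TRUE)

# Summary statistics are computed chunk by chunk, using multiple cores.
rxSummary(~ ArrDelay, data = flightsXdf)

# Fit a linear model without pulling the full data set into RAM.
fit <- rxLinMod(ArrDelay ~ DayOfWeek, data = flightsXdf)
summary(fit)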


In this article, we review some tips for handling big data with R.

Upgrade hardware
It is always best to start with the easiest things first, and in some cases getting a better computer, or upgrading the one you have, can help a great deal. Usually the most important consideration is memory. If you are analyzing data that only just fits into memory on your current system, getting more memory will not only let you finish your analysis, it is also likely to speed things up considerably. This is because your operating system starts to "thrash" when it gets low on memory, swapping data out to disk so that other work can continue, which can slow your system to a crawl. Getting more cores can also help, but only up to a point. R itself can generally use only one core at a time internally. In addition, for many data analysis problems the bottlenecks are disk I/O and the speed of RAM, so efficiently using more than 4 or 8 cores on commodity hardware can be difficult.
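If you are not sure how close your analysis is to the memory limit, base R can give a rough picture of how much space individual objects take and how much the session has allocated. The data frame below is only an example; substitute your own objects.

# Approximate memory accounting in base R.
bigdf <- data.frame(x = rnorm(1e6), y = rnorm(1e6))  # example object
print(object.size(bigdf), units = "MB")              # size of this one object
gc()                                                 # run a garbage collection and report memory in use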

Minimize copies of data
When working with small data sets, an extra copy is not a problem. With big data it can slow the analysis, or even bring it to a screeching halt. Be aware of the 'automatic' copying that occurs in R. For example, if a data frame is passed into a function, a copy is made only if the data frame is modified; but if a data frame is put into a list, a copy is automatically made. In many of the basic analysis algorithms, such as lm and glm, multiple copies of the data set are made as the computations progress, which severely limits their ability to process big data sets. The RevoScaleR analysis functions (for instance, rxSummary, rxCube, rxLinMod, rxLogit, rxGlm, rxKmeans) are all implemented with a focus on efficient use of memory; data is not copied unless absolutely necessary. Reducing copies of data and tuning algorithms in this way can dramatically increase speed and capacity.
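You can observe this copying behavior directly with base R's tracemem(), which prints a message whenever the traced object is duplicated (it is available in builds of R with memory profiling enabled, the default for CRAN binaries). The function names below are only illustrative.

# Watching copy-on-modify with tracemem() in base R.
df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
tracemem(df)                        # start reporting duplications of df

read_only <- function(d) nrow(d)    # only reads its argument
read_only(df)                       # no copy is reported

modify <- function(d) { d$x <- d$x * 2; nrow(d) }  # modifies its argument
modify(df)                          # a copy is reported before the modification
untracemem(df)                      # stop tracing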
