Introduction to Data Science - Unit 2 - Topic 6: PROBLEMS AND GENERAL TECHNIQUES FOR HANDLING LARGE DATA, PYTHON TOOLS

PROGRAMMING TIPS FOR DEALING WITH LARGE DATA

The tricks that work in a general programming context still apply to data science. Several might be worded slightly differently, but the principles are essentially the same for all programmers. This section recapitulates those tricks that are important in a data science context.

You can divide the general tricks into three categories:


•  Don’t reinvent the wheel. Use tools and libraries developed by others.

•  Get the most out of your hardware. Your machine is never used to its full potential; with simple adaptations you can make it work harder.

•  Reduce the computing need. Slim down your memory and processing needs as much as possible.

1.     Don’t reinvent the wheel:

“Don’t repeat anyone” is probably even better than “don’t repeat yourself.” Add value with your actions: make sure that they matter. Solving a problem that has already been solved is a waste of time.

a.     Exploit the power of databases: The first reaction most data scientists have when working with large data sets is to prepare their analytical base tables inside a database. This method works well when the features you want to prepare are fairly simple; a short sketch of this approach follows this list.

b.     Use optimized libraries: Creating machine-learning libraries such as Mahout and Weka requires time and knowledge. They are highly optimized and incorporate best practices and state-of-the-art technologies, so use them rather than writing your own implementations.
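
For point (a), here is a minimal sketch of pushing an aggregation into the database (via Python's standard sqlite3 module) instead of pulling every row into Python first. The sales table and its columns are purely illustrative.

import sqlite3

# A small in-memory database standing in for a real one; the table and
# column names are made up purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (1, 25.5), (2, 7.25), (3, 99.0), (2, 3.75)],
)

# Let the database build the aggregated analytical base table instead of
# looping over raw rows in Python.
rows = conn.execute(
    "SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total "
    "FROM sales GROUP BY customer_id"
).fetchall()
print(rows)   # e.g. [(1, 2, 35.5), (2, 2, 11.0), (3, 1, 99.0)]
conn.close()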

2.     Get the most out of your hardware: Some resources on a computer sit idle while others are over-utilized. This slows down programs and can even make them fail. Sometimes it’s possible (and necessary) to shift the workload from an overtaxed resource to an underutilized one using the following techniques, each illustrated with a short sketch after the list:

a.      Feed the CPU compressed data. A simple trick to avoid CPU starvation is to feed the CPU compressed data instead of the inflated (raw) data.

b.     Make use of the GPU. Sometimes your CPU and not your memory is the bottleneck. If your computations are parallelizable, you can benefit from switching to the GPU.

c.      Use multiple threads. It’s still possible to parallelize computations on your CPU. Normal Python threads pay off when the work spends its time waiting or inside libraries that release the GIL; for CPU-bound pure-Python code, multiple processes are usually the better choice.
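
For point (a), this minimal sketch keeps a data file gzip-compressed on disk and decompresses it on the fly while reading, using only the standard library. The file name and columns are made up; the script writes the file itself so it can run as-is.

import gzip
import csv

# Write a compressed CSV so the example is self-contained; in practice the
# compressed file would already exist.
with gzip.open("measurements.csv.gz", "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sensor", "value"])
    for i in range(100_000):
        writer.writerow([i % 10, i * 0.5])

# Read it back: the CPU spends a little time decompressing, but far less
# time is lost waiting on disk I/O than with the inflated (raw) file.
total = 0.0
with gzip.open("measurements.csv.gz", "rt", newline="") as f:
    reader = csv.reader(f)
    next(reader)                  # skip the header row
    for sensor, value in reader:
        total += float(value)
print(total)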
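
For point (b), the sketch below moves a matrix multiplication to the GPU with CuPy. It assumes a CUDA-capable GPU and the optional cupy package are available; the array sizes are arbitrary.

import numpy as np
import cupy as cp   # assumes the optional cupy package is installed

a_cpu = np.random.rand(2000, 2000).astype(np.float32)
b_cpu = np.random.rand(2000, 2000).astype(np.float32)

a_gpu = cp.asarray(a_cpu)     # copy the data to GPU memory
b_gpu = cp.asarray(b_cpu)
c_gpu = a_gpu @ b_gpu         # the matrix multiply runs on the GPU
c_cpu = cp.asnumpy(c_gpu)     # copy the result back to the CPU
print(c_cpu.shape)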
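
For point (c), a hedged sketch with concurrent.futures: threads overlap tasks that spend their time waiting (time.sleep stands in for network or disk waits), while CPU-bound pure-Python work is handed to processes to sidestep the GIL.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_bound_task(task_id):
    time.sleep(1)                 # pretend to wait on the network or disk
    return task_id

def cpu_bound_task(n):
    return sum(i * i for i in range(n))   # pure-Python number crunching

if __name__ == "__main__":
    # Four 1-second waits overlap, so this finishes in roughly 1 second.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(io_bound_task, range(4)))
    print(f"I/O-bound tasks with threads: {time.perf_counter() - start:.1f}s")

    # CPU-bound pure-Python work is better served by processes (no shared GIL).
    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:
        list(pool.map(cpu_bound_task, [2_000_000] * 4))
    print(f"CPU-bound tasks with processes: {time.perf_counter() - start:.1f}s")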

3.     Reduce your computing needs

“Working smart + hard = achievement.” This also applies to the programs you write. The best way to avoid having large data problems is by removing as much of the work as possible up front and letting the computer work only on the part that can’t be skipped. The following list contains methods to help you achieve this; short sketches of several of them follow the list:

a.      Profile your code and remediate the slow pieces.

b.     Use compiled code whenever possible, certainly when loops are involved.

c.      Otherwise, compile the code yourself.

d.     Avoid pulling data into memory.

e.      Use generators to avoid intermediate data storage.

f.      Use as little data as possible.

g.      Use your math skills to simplify calculations as much as possible.
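
For point (a), a minimal sketch using the standard-library cProfile and pstats modules; slow_sum is just a stand-in for whatever function turns out to be slow in a real program.

import cProfile
import pstats

# A stand-in for a slow piece of code found in a real program.
def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

# Profile the call, save the statistics, and print the functions ordered
# by cumulative time so the slow pieces stand out.
cProfile.run("slow_sum(2_000_000)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)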
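
For points (b) and (c), the sketch below replaces a pure-Python loop with NumPy's compiled sum, and notes, as an assumption since it needs the optional numba package, how you could compile your own loop with Numba.

import numpy as np

data = np.random.rand(1_000_000)

# Pure-Python loop: every iteration is interpreted.
def python_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total

# Vectorized: the same loop runs inside NumPy's compiled C code.
fast_total = data.sum()
print(python_sum(data), fast_total)

# If no library covers your loop, compile it yourself, e.g. with Numba
# (assumes the optional numba package is installed):
# from numba import njit
#
# @njit
# def compiled_sum(values):
#     total = 0.0
#     for v in values:
#         total += v
#     return total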
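
For points (d) and (f), a sketch that streams a hypothetical large CSV through pandas in chunks, reading only the column it needs at a reduced dtype; the file name and column names are invented.

import pandas as pd

# Process a large file chunk by chunk instead of loading it all at once.
# "transactions.csv" and its columns are hypothetical.
total = 0.0
reader = pd.read_csv(
    "transactions.csv",
    usecols=["amount"],           # read only the column you actually need
    dtype={"amount": "float32"},  # smaller dtype -> smaller memory footprint
    chunksize=100_000,            # rows held in memory at any one time
)
for chunk in reader:
    total += chunk["amount"].sum()
print(total)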
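
For point (e), a small generator pipeline: values are parsed, filtered, and summed one at a time, so no intermediate list is ever built. The input file numbers.txt (one number per line) is an assumption.

def read_values(path):
    """Yield one numeric value per line without building a list."""
    with open(path) as f:
        for line in f:
            yield float(line.strip())

# Generator expressions chain lazily: nothing is stored in between.
values = read_values("numbers.txt")
positives = (v for v in values if v > 0)
print(sum(positives))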
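
Finally, for point (g), a tiny example of letting math remove the work altogether: the closed-form formula for the sum of the first n integers replaces a loop over a billion numbers.

n = 1_000_000_000

# Brute force: sum(range(n)) would touch a billion numbers.
# Closed form: the same answer, n(n-1)/2, in constant time.
total = n * (n - 1) // 2      # sum of 0 .. n-1
print(total)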
