Introduction to Data Science - Unit 2 - Topic 6: PROBLEMS AND GENERAL TECHNIQUES FOR HANDLING LARGE DATA, PYTHON TOOLS
PROGRAMMING TIPS FOR DEALING WITH LARGE DATA
The tricks that work in a general programming context still apply to data science. Several might be worded slightly differently, but the principles are essentially the same for all programmers. This section recapitulates those tricks that are important in a data science context. You can divide the general tricks into three parts, as the following mind map shows:
• Don’t reinvent the wheel. Use tools and libraries developed by others.
• Get the most out of your hardware. Your machine is never used to its full potential; with simple adaptations you can make it work harder.
• Reduce your computing needs. Slim down your memory and processing needs as much as possible.
1. Don’t reinvent the wheel:
“Don’t repeat anyone” is probably even better than “don’t repeat yourself.” Add value with your actions: make sure that they matter. Solving a problem that has already been solved is a waste of time.
a. Exploit the power of databases: The first reaction most data scientists have when working with large data sets is to prepare their analytical base tables inside a database. This method works well when the features you want to prepare are fairly simple (see the sketch after this list).
b. Use optimized libraries: Creating libraries such as Mahout, Weka, and other machine-learning libraries requires time and knowledge, so use them rather than reimplementing the algorithms yourself. They are highly optimized and incorporate best practices and state-of-the-art technologies.
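As a minimal sketch of point a, the snippet below pushes the aggregation work into the database and pulls only the small summary table into Python. The sales.db file, the sales table, and its columns are hypothetical; adapt the names to your own schema.

    import sqlite3
    import pandas as pd

    # Hypothetical SQLite database and table; the point is that the
    # GROUP BY runs inside the database engine, not in Python.
    conn = sqlite3.connect("sales.db")

    query = """
        SELECT customer_id,
               COUNT(*)    AS n_orders,
               SUM(amount) AS total_spent
        FROM sales
        GROUP BY customer_id
    """

    # Only the small per-customer summary ever reaches Python's memory.
    base_table = pd.read_sql_query(query, conn)
    conn.close()
    print(base_table.head())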
2. Get the most out of your hardware: Some resources on a computer sit idle while others are over-utilized. This slows down programs and can even make them fail. Sometimes it’s possible (and necessary) to shift the workload from an overtaxed resource to an underutilized one using the following techniques:
a. Feed the CPU compressed data. A simple trick to avoid CPU starvation is to feed the CPU compressed data instead of the inflated (raw) data (see the first sketch after this list).
b. Make use of the GPU. Sometimes your CPU and not your memory is the bottleneck. If your computations are parallelizable, you can benefit from switching to the GPU (see the second sketch after this list).
c. Use multiple threads. It’s still possible to parallelize computations on your CPU. You can achieve this with normal Python threads (see the third sketch after this list).
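For point a, here is a minimal sketch of feeding the CPU compressed data. It assumes a hypothetical gzip-compressed CSV file, transactions.csv.gz, with an amount column; pandas decompresses it on the fly, so far less data has to come off the disk.

    import pandas as pd

    # Hypothetical gzip-compressed CSV. The compressed file is much
    # smaller on disk, so the CPU spends less time waiting for I/O
    # and a little extra time decompressing, which is usually a win.
    total = 0
    for chunk in pd.read_csv("transactions.csv.gz",
                             compression="gzip",
                             chunksize=100_000):
        total += chunk["amount"].sum()

    print(total)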
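For point b, here is a sketch using CuPy (one of several GPU options), assuming CuPy is installed and a CUDA-capable GPU is available; the array contents are made up for illustration.

    import numpy as np
    import cupy as cp  # assumes CuPy and a CUDA-capable GPU

    # Element-wise math on millions of values is embarrassingly
    # parallel, so it maps well onto the GPU.
    x_cpu = np.random.rand(10_000_000).astype(np.float32)

    x_gpu = cp.asarray(x_cpu)                 # copy data to GPU memory
    y_gpu = cp.sqrt(x_gpu) * cp.log1p(x_gpu)  # computed on the GPU
    y_cpu = cp.asnumpy(y_gpu)                 # copy the result back

    print(y_cpu[:5])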
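For point c, here is a sketch with the standard library’s ThreadPoolExecutor. Because of Python’s global interpreter lock, plain threads help most when the work is I/O-bound (downloads, disk reads) or runs inside libraries that release the lock; the URLs below are placeholders.

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    # Placeholder URLs; downloading is I/O-bound, so threads overlap
    # the waiting time and finish sooner than a sequential loop.
    urls = [
        "https://example.com/data1.csv",
        "https://example.com/data2.csv",
        "https://example.com/data3.csv",
    ]

    def fetch(url):
        with urllib.request.urlopen(url) as response:
            return response.read()

    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(fetch, urls))

    print([len(r) for r in results])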
3. Reduce your computing needs: “Working smart + hard = achievement.” This also applies to the programs you write. The best way to avoid having large data problems is by removing as much of the work as possible up front and letting the computer work only on the part that can’t be skipped. The following list contains methods to help you achieve this:
a. Profile your code and remediate slow pieces of code (see the profiling sketch after this list).
b. Use compiled code whenever possible, certainly when loops are involved.
c. Otherwise, compile the code yourself (a sketch covering points b and c follows this list).
d. Avoid pulling data into memory.
e. Use generators to avoid intermediate data storage (a sketch covering points d and e follows this list).
f. Use as little data as possible.
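For point a, here is a minimal profiling sketch using the standard library’s cProfile; slow_feature is a made-up function standing in for whatever part of your own pipeline you suspect is slow.

    import cProfile
    import pstats

    def slow_feature(n):
        # Stand-in for a suspect piece of your own pipeline.
        return sum(i ** 2 for i in range(n))

    # Profile the call and print the ten most time-consuming functions.
    cProfile.run("slow_feature(1_000_000)", "profile.out")
    stats = pstats.Stats("profile.out")
    stats.sort_stats("cumulative").print_stats(10)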
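For points b and c, here is a sketch contrasting a plain Python loop with NumPy’s precompiled vectorized code and, as one way to “compile it yourself,” a Numba-compiled version of the same loop; it assumes Numba is installed, and the timings you see will depend on your machine.

    import numpy as np
    from numba import njit  # assumes Numba is installed

    x = np.random.rand(1_000_000)

    # b. Prefer precompiled, vectorized code over a Python loop.
    total_vectorized = np.sum(x * x)

    # c. Or compile the loop yourself with Numba's JIT compiler.
    @njit
    def sum_of_squares(values):
        total = 0.0
        for v in values:
            total += v * v
        return total

    total_compiled = sum_of_squares(x)
    print(total_vectorized, total_compiled)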
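For points d and e, here is a sketch that processes a hypothetical large log file without ever pulling it into memory: a generator yields one parsed value at a time, and sum consumes the values as they stream past.

    def amounts(path):
        # Generator: yields one value at a time instead of building a
        # list, so the whole file never sits in memory at once.
        # Assumes a headerless CSV with the amount in the second column.
        with open(path) as f:
            for line in f:
                yield float(line.split(",")[1])

    total = sum(amounts("big_log.csv"))
    print(total)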