Introduction to Data Science - Unit 2 - Topic 5: PROBLEMS AND GENERAL TECHNIQUES FOR HANDLING LARGE DATA, PYTHON TOOLS
HANDLING LARGE DATA: PROBLEMS AND GENERAL TECHNIQUES
The problems you face when handling large data:
A large volume of data poses new challenges, such as overloaded memory
and algorithms that never stop running. It forces you to adapt and expand your
repertoire of techniques. But even when you can perform your analysis, you
should take care of issues such as I/O (input/output) and CPU starvation,
because these can cause speed issues. The figure below shows a mind map that will gradually unfold as we go through the steps: problems, solutions, and tips.
A computer has only a limited amount of RAM. When you try to squeeze more data into this memory than actually fits, the OS will start swapping out memory blocks to disk, which is far less efficient than having it all in memory. Moreover, only a few algorithms are designed to handle large data sets; most of them load the whole data set into memory at once, which causes the out-of-memory error. Other algorithms need to hold multiple copies of the data in memory or store intermediate results. All of these aggravate the problem.
General techniques for handling large volumes of data
Never-ending algorithms, out-of-memory errors, and speed issues are the
most common challenges you face when working with large data. In this section,
we’ll investigate solutions to overcome or alleviate these problems. The
solutions can be divided into three categories: using the correct algorithms,
choosing the right data structure, and using the right tools.
1. Choosing the right algorithm
Choosing the right
algorithm can solve more problems than adding more or better hardware. An
algorithm that’s well suited for handling large data doesn’t need to load the
entire data set into memory to make predictions.
a. Online learning algorithms
Several, but not all, machine learning algorithms
can be trained using one observation at a time instead of taking all the data
into memory. Upon the arrival of a new data point, the model is trained and the
observation can be forgotten; its effect is now incorporated into the model’s
parameters.
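Scikit-learn exposes this pattern through the partial_fit method on several of its estimators. Below is a minimal sketch using SGDClassifier on a synthetic stream invented for illustration; each small batch updates the model and is then thrown away.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online learning sketch: feed the model one small batch at a time via
# partial_fit, so the full data set never has to sit in memory.
model = SGDClassifier()
classes = np.array([0, 1])            # all labels must be declared up front

rng = np.random.default_rng(42)
for _ in range(1000):                 # a stream of small batches
    X = rng.normal(size=(10, 4))      # 10 observations, 4 features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    model.partial_fit(X, y, classes=classes)   # learn, then forget the batch

print(model.predict(rng.normal(size=(3, 4))))  # e.g. [1 0 1]
```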
b. Dividing a large matrix into many small ones
By cutting a large data table into small matrices, for instance, we can still do a linear regression. The logic behind this matrix splitting is that a linear regression can be calculated from running matrix sums that are accumulated block by block, as sketched below.
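Here is a minimal sketch of that idea with synthetic data: ordinary least squares only needs the running sums X'X and X'y, and both can be accumulated one small block at a time.

```python
import numpy as np

# Block-wise linear regression sketch: accumulate the normal-equation
# terms X'X and X'y per chunk, so the full table never loads at once.
rng = np.random.default_rng(0)
n_features = 5
XtX = np.zeros((n_features, n_features))
Xty = np.zeros(n_features)

for _ in range(100):                       # 100 chunks of 1,000 rows each
    X = rng.normal(size=(1000, n_features))
    y = X @ np.arange(1, n_features + 1) + rng.normal(size=1000)
    XtX += X.T @ X                         # running sum of X'X
    Xty += X.T @ y                         # running sum of X'y

beta = np.linalg.solve(XtX, Xty)           # same answer as fitting all rows
print(beta)                                # close to [1, 2, 3, 4, 5]
```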
c. MapReduce
MapReduce algorithms are easy to understand
with an analogy: Imagine that you were asked to count all the votes for the
national elections. Your country has 25 parties, 1,500 voting offices, and 2
million people. You could choose to gather all the voting tickets from every
office individually and count them centrally, or you could ask the local
offices to count the votes for the 25 parties and hand over the results to you,
and you could then aggregate them by party.
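The analogy maps directly onto code. Below is a toy sketch in plain Python (made-up party names, three tiny "offices"): the map step produces local counts and the reduce step merges them.

```python
from collections import Counter
from functools import reduce

# MapReduce sketch of the election analogy.
offices = [
    ["party_a", "party_b", "party_a"],   # ballots at office 1
    ["party_b", "party_b", "party_c"],   # ballots at office 2
    ["party_a", "party_c", "party_a"],   # ballots at office 3
]

# Map: each office counts its own votes locally.
local_counts = [Counter(ballots) for ballots in offices]

# Reduce: merge the local tallies into one national result.
national = reduce(lambda a, b: a + b, local_counts)
print(national)   # Counter({'party_a': 4, 'party_b': 3, 'party_c': 2})
```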
2. Choosing the right data structure
Algorithms can make or break your program, but the way you store your
data is of equal importance. Data structures have different storage
requirements, but also influence the performance of CRUD (create, read,
update, and delete) and other operations on the data set.
a. Sparse data
A sparse data set contains relatively little information compared to its number of entries (observations). Picture a table in which almost everything is "0", with just a single "1" present in the second observation on variable 9. Data like this might look ridiculous, but it is often what you get when converting textual data to binary data. Imagine a set of 100,000 completely unrelated Twitter tweets.
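SciPy offers dedicated structures for this. A minimal sketch with scipy.sparse: the same table stored in compressed sparse row form keeps only the nonzero entry.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sparse storage sketch: 2 observations x 10 variables, all zeros except
# a single 1 in the second observation on variable 9.
dense = np.zeros((2, 10))
dense[1, 8] = 1                  # second row, ninth variable (0-indexed)

sparse = csr_matrix(dense)
print(sparse)                    # only the nonzero cell is stored: (1, 8) 1.0
print(sparse.data.nbytes)        # 8 bytes of values vs. 160 for the dense table
```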
b. Tree structures
Trees are a class of data structure that allows you to retrieve
information much faster than scanning through a table. A tree always has a root
value and subtrees of children, each with its children, and so on.
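Python's standard library has no built-in search tree, but the bisect module illustrates the same divide-and-conquer retrieval over sorted keys; this is a sketch of the idea, not a full tree implementation.

```python
import bisect

# Binary search over sorted keys inspects about log2(n) entries,
# instead of scanning the whole table the way a linear search would.
keys = list(range(0, 20_000_000, 2))   # ten million sorted even keys

def contains(sorted_keys, target):
    i = bisect.bisect_left(sorted_keys, target)   # ~24 comparisons here
    return i < len(sorted_keys) and sorted_keys[i] == target

print(contains(keys, 1_234_568))   # True
print(contains(keys, 1_234_567))   # False: odd keys were never stored
```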
c. Hash tables
Hash tables are data structures that calculate a key for every value in
your data and put the keys in a bucket. This way you can quickly retrieve the
information by looking in the right bucket when you encounter the data.
Dictionaries in Python are a hash table implementation, and they’re a close
relative of key-value stores.
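Because a dict is exactly this structure, a tiny sketch makes the point:

```python
# A Python dict is a hash table: each key is hashed straight to its
# bucket, so lookups stay fast no matter how large the table grows.
word_counts = {}
for word in ["big", "data", "big", "python"]:
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts["big"])   # 2 -- one hash computation, no scanning
```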
3. Selecting the right tools
With the right class of algorithms and data structures in place, it’s
time to choose the right tool for the job.
a. Python tools
Python has a number of libraries that can help you deal with large data.
They range from smarter data structures and code optimizers to just-in-time
compilers. The following is a list of libraries we like to use when confronted
with large data:
Cython - Because Python is dynamically typed, the interpreter has to infer data types at run time, which slows execution. Cython, a superset of Python, solves this problem by letting the programmer specify data types while developing the program. Once the compiler has this information, it runs programs much faster.
Numexpr - Numexpr is at the core of many of the big data packages, as NumPy is for in-memory packages. Numexpr is a numerical expression evaluator for NumPy that can be many times faster than the original NumPy.
Numba - Numba helps you achieve greater speed by compiling your code right before you execute it, also known as just-in-time compiling (see the sketch after this list).
Bcolz - Bcolz helps you overcome the out-of-memory problem that can occur when using NumPy. It can store and work with arrays in an optimally compressed form.
Blaze - Blaze is ideal if you want to use the power of a database backend but like the "Pythonic way" of working with data. Blaze will translate your Python code into SQL, but it can handle many more data stores than relational databases, such as CSV files, Spark, and others.
Theano - Theano enables you to work directly with the graphics processing unit (GPU) and perform symbolic simplifications whenever possible, and it comes with an excellent just-in-time compiler.
Dask - Dask enables you to optimize your flow of calculations and execute them efficiently.
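As an example of the just-in-time compiling that Numba offers, here is a minimal sketch (synthetic data, default settings): the first call compiles the function to machine code, and later calls run at near-native speed despite the explicit Python loop.

```python
import numpy as np
from numba import jit

@jit(nopython=True)           # compile to machine code on first call
def running_sum(values):
    total = 0.0
    for v in values:          # a plain loop Numba can compile efficiently
        total += v
    return total

data = np.random.rand(10_000_000)
print(running_sum(data))      # subsequent calls skip compilation
```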
b. Use Python as a master to control other tools
Most software and tool producers support a Python interface to their software. This enables you to tap into specialized pieces of software with the ease and productivity that come with Python. In this way, Python sets itself apart from other popular data science languages such as R and SAS.
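As a small, hypothetical illustration of that master role, the sketch below shells out from Python to a command-line tool and captures its output. The file name and the Unix wc utility are assumptions made for the example; the same pattern extends to database drivers, Spark connectors, and other Python interfaces.

```python
import subprocess

# Python driving an external tool: run a command, capture its output.
# Assumes a Unix-like system with `wc` and a hypothetical votes.csv file.
result = subprocess.run(
    ["wc", "-l", "votes.csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())   # e.g. "2000000 votes.csv"
```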