Introduction to Data Science - Unit 2 - Topic 5: PROBLEMS AND GENERAL TECHNIQUES FOR HANDLING LARGE DATA, PYTHON TOOLS

HANDLING LARGE DATA: PROBLEMS AND GENERAL TECHNIQUES

The problems you face when handling large data:

A large volume of data poses new challenges, such as overloaded memory and algorithms that never stop running. It forces you to adapt and expand your repertoire of techniques. But even when you can perform your analysis, you should take care of issues such as I/O (input/output) and CPU starvation, because these can cause speed problems. The figure below shows a mind map that will gradually unfold as we go through the steps: problems, solutions, and tips.

[Figure: mind map of the problems, solutions, and tips for handling large data]

A computer only has a limited amount of RAM. When you try to squeeze more data into memory than actually fits, the OS starts swapping memory blocks out to disk, which is far less efficient than having it all in memory. Moreover, only a few algorithms are designed to handle large data sets; most load the whole data set into memory at once, which causes out-of-memory errors. Other algorithms need to hold multiple copies of the data in memory or store intermediate results, and all of this aggravates the problem.

General techniques for handling large volumes of data

Never-ending algorithms, out-of-memory errors, and speed issues are the most common challenges you face when working with large data. In this section, we'll investigate solutions to overcome or alleviate these problems. The solutions can be divided into three categories: using the correct algorithms, choosing the right data structures, and using the right tools.

1.     Choosing the right algorithm

Choosing the right algorithm can solve more problems than adding more or better hardware. An algorithm that’s well suited for handling large data doesn’t need to load the entire data set into memory to make predictions.


a.      Online learning algorithms

Several, but not all, machine learning algorithms can be trained one observation at a time instead of loading all the data into memory. When a new data point arrives, the model is updated and the observation can be forgotten; its effect is now incorporated into the model's parameters.
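
A minimal sketch of this idea using scikit-learn's SGDClassifier (the random batches below stand in for a real data stream): partial_fit updates the model one mini-batch at a time, so the full data set never has to be in memory.

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # partial_fit must be told all class labels up front

for _ in range(100):                          # one hundred incoming batches
    X = np.random.randn(50, 10)               # 50 observations, 10 features
    y = (X.sum(axis=1) > 0).astype(int)       # a toy target variable
    model.partial_fit(X, y, classes=classes)  # update the model, then forget the batch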

b.     Dividing a large matrix into many small ones

By cutting a large data table into smaller matrices, we can, for instance, still do a linear regression. The logic behind this matrix splitting, and how a linear regression can be calculated with matrices, is sketched below.
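
A minimal sketch of the block-wise approach (the chunk generator below simulates reading blocks from disk): each block contributes to the cross-product matrices X'X and X'y, and solving the normal equations at the end gives the same coefficients as ordinary least squares on the full data.

import numpy as np

n_features = 5
XtX = np.zeros((n_features, n_features))   # running sum of X'X
Xty = np.zeros(n_features)                 # running sum of X'y

def read_chunks():                         # stand-in for a real block reader
    for _ in range(20):
        X = np.random.randn(1000, n_features)
        y = X @ np.arange(1, n_features + 1) + np.random.randn(1000)
        yield X, y

for X, y in read_chunks():                 # only one block in memory at a time
    XtX += X.T @ X
    Xty += X.T @ y

beta = np.linalg.solve(XtX, Xty)           # the regression coefficients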

c.      Mapreduce

MapReduce algorithms are easy to understand with an analogy: imagine you were asked to count all the votes for the national elections. Your country has 25 parties, 1,500 voting offices, and 2 million voters. You could gather all the ballots from every office and count them centrally, or you could ask the local offices to count the votes for the 25 parties and hand over their results, which you then aggregate by party.
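
The analogy in miniature, in plain Python (the ballots are made up): the map step produces local counts per office, and the reduce step merges them into the national tally.

from collections import Counter
from functools import reduce

offices = [
    ["party_a", "party_b", "party_a"],   # ballots at office 1
    ["party_b", "party_b", "party_c"],   # ballots at office 2
]

local_counts = map(Counter, offices)               # map: count votes locally
total = reduce(lambda a, b: a + b, local_counts)   # reduce: merge the tallies
print(total)   # Counter({'party_b': 3, 'party_a': 2, 'party_c': 1})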

2.     Choosing the right data structure

Algorithms can make or break your program, but the way you store your data is of equal importance. Data structures differ in their storage requirements and also influence the performance of CRUD (create, read, update, and delete) and other operations on the data set.


a.      Sparse data

A sparse data set contains relatively little information compared to its number of entries (observations). Picture a table in which almost every value is “0”, with just a single “1” for the second observation on variable 9. Data like this might look ridiculous, but it is often what you get when converting textual data to binary data: imagine a set of 100,000 completely unrelated tweets.
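
A small sketch with SciPy's sparse matrices (one common way to store such data in Python): only the nonzero entries are kept, so a mostly-zero matrix barely takes any memory.

import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((3, 10))
dense[1, 8] = 1              # the single "1": observation 2, variable 9
sparse = csr_matrix(dense)   # compressed sparse row format

print(sparse.nnz)            # 1 stored value instead of 30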

b.     Tree structures

Trees are a class of data structures that allow you to retrieve information much faster than scanning through a table. A tree always has a root value and subtrees of children, each with its own children, and so on.
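
A minimal binary search tree sketch (a toy implementation, not a production index): every comparison discards half of the remaining tree, so a lookup takes roughly log(n) steps instead of a full scan.

class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def insert(root, key, value):
    if root is None:                       # empty spot found: attach here
        return Node(key, value)
    if key < root.key:
        root.left = insert(root.left, key, value)
    else:
        root.right = insert(root.right, key, value)
    return root

def search(root, key):
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right  # halve the search space
    return root.value if root else None

root = None
for k in [42, 17, 63, 8, 29]:
    root = insert(root, k, "record-%d" % k)
print(search(root, 29))                    # 'record-29'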



c.      Hash tables

Hash tables are data structures that calculate a key for every value in your data and put the keys in buckets. When you need a value, you can retrieve it quickly by looking in the right bucket. Dictionaries in Python are a hash table implementation, and they're a close relative of key-value stores.
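
In Python this costs nothing extra, because dict is a hash table:

counts = {}                                 # a hash table
for word in ["big", "data", "big"]:
    counts[word] = counts.get(word, 0) + 1  # hash the key, go straight to its bucket

print(counts["big"])                        # 2 -- no scan over all entries needed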

3.     Selecting the right tools

With the right class of algorithms and data structures in place, it’s time to choose the right tool for the job.



a.     Python tools

Python has a number of libraries that can help you deal with large data. They range from smarter data structures and code optimizers to just-in-time compilers. The following is a list of libraries we like to use when confronted with large data:

Cython - Python's flexibility with data types comes at a performance cost, because the interpreter has to check types at run time. Cython, a superset of Python, solves this problem by letting the programmer specify the data types while developing the program. Once the compiler has this information, it runs programs much faster.
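
A hedged sketch of what Cython code looks like (this belongs in a .pyx file and must be compiled first, for example with cythonize):

# Cython: the cdef declarations fix the types, so the compiler can
# turn this loop into plain C instead of interpreted Python.
def arithmetic_sum(int n):
    cdef int i
    cdef long total = 0
    for i in range(n):
        total += i
    return total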

Numexpr - Numexpr is at the core of many big data packages, just as NumPy is for in-memory packages. It is a numerical expression evaluator for NumPy arrays that can be many times faster than the original NumPy.
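
A small Numexpr sketch (the arrays are made up): ne.evaluate compiles the expression string and evaluates it in cache-friendly chunks, avoiding the large temporary arrays plain NumPy would allocate.

import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

result = ne.evaluate("2 * a + b ** 2")   # same answer as 2*a + b**2 in NumPy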

Numba - Numba helps you achieve greater speed by compiling your code right before you execute it, a technique known as just-in-time compiling.
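
A small Numba sketch: the @jit decorator compiles the function to machine code the first time it is called.

import numpy as np
from numba import jit

@jit(nopython=True)
def total(arr):
    s = 0.0
    for x in arr:          # this loop runs as compiled machine code
        s += x
    return s

print(total(np.arange(1e6)))   # the first call triggers compilation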

Bcolz - Bcolz helps you overcome the out-of-memory problem that can occur when using NumPy. It can store and work with arrays in an optimally compressed form.
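
A hedged Bcolz sketch (the rootdir name is an assumption): a carray keeps its data compressed, and giving it a rootdir stores it on disk so it can grow beyond RAM.

import numpy as np
import bcolz

# a compressed, disk-backed array of ten million integers
a = bcolz.carray(np.arange(10_000_000), rootdir="mydata", mode="w")
print(a.sum())   # computed chunk by chunk on the compressed data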

Blaze - Blaze is ideal if you want the power of a database backend together with the “Pythonic way” of working with data. Blaze translates your Python code into SQL, but it can handle many more data stores than relational databases alone, such as CSV files and Spark.
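
A hedged Blaze sketch (the CSV file is hypothetical, and depending on the release the entry point is data or Data): the expression stays symbolic, and Blaze translates it into whatever the backend understands.

from blaze import data

accounts = data("accounts.csv")              # could equally be a SQL table or Spark
big = accounts[accounts.amount > 100].name   # translated for the backend, not run in RAM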

Theano - Theano enables you to work directly with the graphics processing unit (GPU) and do symbolic simplifications whenever possible, and it comes with an excellent just-in-time compiler.
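
A small Theano sketch: you build a symbolic expression, and Theano simplifies and compiles it (targeting the GPU when one is configured).

import theano
import theano.tensor as T

x = T.dvector("x")               # a symbolic vector
expr = (x ** 2).sum()            # a symbolic expression over it
f = theano.function([x], expr)   # simplified and compiled just-in-time

print(f([1.0, 2.0, 3.0]))        # 14.0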

Dask - Dask enables you to optimize your flow of calculations and execute them efficiently.
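
A small Dask sketch: the array is split into chunks, and compute() executes the resulting task graph chunk by chunk.

import dask.array as da

x = da.ones((100_000, 1_000), chunks=(10_000, 1_000))  # 10 lazy chunks
print(x.mean().compute())                              # 1.0, never fully in RAM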

b.     Use Python as a master to control other tools

Most software and tool producers support a Python interface to their software. This enables you to tap into specialized pieces of software with the ease and productivity that come with Python. In this way, Python sets itself apart from other popular data science languages such as R and SAS.
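
A hedged illustration with the standard library's subprocess module (the input file is hypothetical): Python drives an external command-line tool and captures its output, and dedicated interfaces such as pyspark apply the same idea to entire compute engines.

import subprocess

# run the external 'sort' tool on a (hypothetical) file of votes
result = subprocess.run(
    ["sort", "votes.txt"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.splitlines()[:5])   # first five sorted lines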
