Introduction to Data Science - Unit 2 - Topic 2: PYTHON TOOLS LIKE SKLEARN, OPTIMIZING OPERATIONS

 

PYTHON TOOLS LIKE SKLEARN

Python has an overwhelming number of packages that can be used in a machine learning setting. The Python machine learning ecosystem can be divided into three main types of packages.

PACKAGES FOR WORKING WITH DATA IN MEMORY

When prototyping, the following packages can get you started by providing advanced functionalities with a few lines of code:

•  SciPy is a library for scientific computing that builds on NumPy; together with packages such as matplotlib, Pandas, and SymPy it forms the wider SciPy ecosystem.

•  NumPy gives you access to powerful array and linear algebra functions.

•  Matplotlib is a popular 2D plotting package with some 3D functionality.

•  Pandas is a high-performance, easy-to-use data-wrangling package. It introduces the data frame to Python, a type of in-memory data table, a concept that will sound familiar to regular users of R.

 

•  SymPy is a package for symbolic mathematics and computer algebra.

•  StatsModels is a package for statistical methods and algorithms.

•  Scikit-learn is a library filled with machine learning algorithms (see the sketch after this list).

•  RPy2 allows you to call R functions from within Python. R is a popular open source statistics program.

•  NLTK (Natural Language Toolkit) is a Python toolkit with a focus on text analytics.
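
To give a feel for how these packages combine, here is a minimal sketch that loads a small table with Pandas, converts it to NumPy arrays, and fits a Scikit-learn classifier. The column names and values are made up purely for illustration.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# A tiny, made-up data set: two numeric features and a binary label.
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 60, 65, 80, 90],
    "label":  [0, 0, 1, 1, 1],
})

X = df[["height", "weight"]].values   # Pandas data frame -> NumPy array
y = df["label"].values

model = LogisticRegression()
model.fit(X, y)                       # train on the in-memory data
print(model.predict([[175, 70]]))     # predict the label for a new observation

A few lines like these are enough for prototyping because all the data fits comfortably in memory.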

These libraries are good to get started with, but once you decide to run a certain Python program at frequent intervals, performance comes into play.





OPTIMIZING OPERATIONS

Once your application moves into production, the libraries listed here can help you deliver the speed you need. Sometimes this involves connecting to big data infrastructures such as Hadoop and Spark.

•  Numba and NumbaPro: these use just-in-time compilation to speed up applications written in pure Python with a few annotations. NumbaPro also lets you use the power of your graphics processor unit (GPU). See the Numba sketch after this list.

•  PyCUDA: this allows you to write code that is executed on the GPU instead of the CPU, making it ideal for calculation-heavy applications. It works best with problems that lend themselves to parallelization and need little input compared to the number of required computing cycles. An example is studying the robustness of your predictions by calculating thousands of different outcomes from a single start state.

•  Cython, or C for Python: this brings the C programming language to Python. C is a lower-level language, so the code is compiled to something closer to what the computer actually executes (machine code). The closer code is to bits and bytes, the faster it runs. A computer is also faster when it knows the types of variables in advance (static typing). Python wasn't designed for this, and Cython helps you overcome the shortfall.

•  Blaze: this gives you data structures that can be bigger than your computer's main memory, enabling you to work with large data sets.

•  Dispy and IPCluster: these packages allow you to write code that can be distributed over a cluster of computers.

•  PP: Python is executed as a single process by default. With the help of PP you can parallelize computations on a single machine or over clusters (see the parallelization sketch after this list).

•  Pydoop and Hadoopy: these connect Python to Hadoop, a common big data framework.

•  PySpark: this connects Python and Spark, an in-memory big data framework (see the PySpark sketch after this list).
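
As an illustration of the just-in-time compilation that Numba offers, here is a minimal sketch: the @jit decorator compiles the plain-Python loop to machine code the first time the function is called. The function and data are made up for illustration.

import numpy as np
from numba import jit

@jit(nopython=True)           # compile this function to machine code on first call
def running_total(values):
    total = 0.0
    for v in values:          # a plain Python loop, but compiled by Numba
        total += v
    return total

data = np.arange(1_000_000, dtype=np.float64)
print(running_total(data))    # later calls reuse the compiled version

After the first (compiling) call, loops like this typically run far faster than ordinary interpreted Python.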
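
PP's own API is dated, so as a stand-in the sketch below shows the same idea, spreading independent computations over several worker processes, using Python's standard multiprocessing module instead. The worker function is made up for illustration.

from multiprocessing import Pool

def simulate(seed):
    # A made-up, CPU-heavy task; each call is independent of the others.
    total = 0
    for i in range(100_000):
        total += (seed * i) % 7
    return total

if __name__ == "__main__":
    with Pool(processes=4) as pool:       # four worker processes
        results = pool.map(simulate, range(8))
    print(results)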
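
To show what connecting Python to Spark looks like, here is a minimal PySpark sketch that builds a two-row Spark data frame and prints it. It assumes a local Spark installation, and the column names and values are made up for illustration.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session running locally.
spark = SparkSession.builder.appName("sketch").master("local[*]").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()       # the distributed equivalent of printing a table

spark.stop()

In production the same code can run against a real cluster simply by pointing the session at a cluster master instead of local[*].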
