Introduction to Data Science - Unit 2 - Topic 2: PYTHON TOOLS LIKE SKLEARN, OPTIMIZING OPERATIONS
PYTHON TOOLS LIKE SKLEARN
Python has an overwhelming number of packages that can be used in a
machine learning setting. The Python machine learning ecosystem can be divided
into three main types of packages.
PACKAGES FOR WORKING WITH DATA IN MEMORY
When prototyping, the following packages can get you started by providing advanced functionality with only a few lines of code (a short sketch follows the list below):
Ø SciPy is a library that integrates fundamental packages often used in
scientific computing such as NumPy, matplotlib, Pandas, and SymPy.
Ø NumPy gives you access to powerful array functions and linear algebra
functions.
Ø Matplotlib is a popular 2D plotting package with some 3D
functionality.
Ø Pandas is a high-performance, but easy-to-use, data-wrangling
package. It introduces data frames to Python, a type of in-memory data table.
It’s a concept that should sound familiar to regular users of R.
Ø SymPy is a package used for symbolic mathematics and computer algebra.
Ø StatsModels is a package for statistical methods and
algorithms.
Ø Scikit-learn is a library filled with machine learning
algorithms.
Ø RPy2 allows you to call R functions from within Python. R is a popular open
source statistics program.
Ø NLTK (Natural Language Toolkit) is a Python toolkit with a focus on text analytics.
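To show how these packages work together in the "few lines of code" spirit mentioned above, here is a minimal sketch, assuming NumPy, pandas, and scikit-learn are installed; the column names and generated data are illustrative only, not from a real source.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Build a small in-memory data frame with pandas.
df = pd.DataFrame({"x": np.arange(10, dtype=float)})
df["y"] = 2.5 * df["x"] + np.random.normal(scale=0.1, size=10)

# Fit a linear regression with scikit-learn in two lines.
model = LinearRegression().fit(df[["x"]], df["y"])
print(model.coef_, model.intercept_)

The same pattern scales to real data sets: pandas handles loading and wrangling, NumPy supplies the underlying arrays, and scikit-learn provides the learning algorithm.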
These libraries are
good to get started with, but once you make the decision to run a
certain Python
program at frequent intervals, performance comes into play.
OPTIMIZING OPERATIONS
Once your application moves into production, the libraries listed here
can help you deliver the speed you need. Sometimes this involves connecting to
big data infrastructures such as Hadoop and Spark.
Ø Numba and NumbaPro—These use just-in-time compilation to speed up applications written in plain Python, requiring only a few annotations (see the Numba sketch after this list). NumbaPro also lets you harness the power of your graphics processing unit (GPU).
Ø PyCUDA—This allows you to write code that will be executed on
the GPU instead of your CPU and is therefore ideal for calculation-heavy
applications. It works best with problems that lend themselves to being
parallelized and need little input compared to the number of required computing
cycles. An example is studying the robustness of your predictions by
calculating thousands of different outcomes based on a single start state.
Ø Cython, or C for Python—This brings the C programming language to Python. C is a lower-level language, so the code is closer to what the computer eventually executes (machine code), and the closer code is to bits and bytes, the faster it runs. A computer is also faster when it knows the type of a variable in advance (static typing). Python wasn’t designed for this, and Cython helps you overcome this shortfall.
Ø Blaze —Blaze gives you data structures that can be bigger than
your computer’s main memory, enabling you to work with large data sets.
Ø Dispy and IPCluster —These packages allow you to write code that can be distributed over a cluster
of computers.
Ø PP —Python
is executed as a single process by default. With the help of PP you can
parallelize computations on a single machine or over clusters.
Ø Pydoop and Hadoopy—These connect Python to Hadoop, a common big data framework.
Ø PySpark—This connects Python and Spark, an in-memory big data framework (a short PySpark sketch also follows this list).
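To illustrate the Numba bullet above, here is a minimal sketch of just-in-time compilation, assuming Numba and NumPy are installed; the function and data are made up for illustration.

import numpy as np
from numba import jit

@jit(nopython=True)  # compiled to machine code on the first call
def total(values):
    s = 0.0
    for v in values:  # a plain Python loop, which Numba compiles
        s += v
    return s

data = np.random.rand(1_000_000)
print(total(data))  # the first call triggers compilation; later calls run at native speed

Decorating the function is the only change to the original Python code, which is exactly the appeal of this approach.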
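For the PySpark bullet, a minimal sketch looks like the following; it assumes a working Spark installation, and the file and column names are hypothetical.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("example").getOrCreate()

# Read a CSV file into a distributed DataFrame and run a simple aggregation.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)  # hypothetical file
df.groupBy("region").count().show()

spark.stop()

The code stays close to ordinary pandas-style thinking, while Spark distributes the work across the cluster’s memory.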