Wednesday 19 April 2017

Get Started with Docker on Ubuntu






1. $ sudo apt install docker.io
To install Docker on Ubuntu

An image is a lightweight, stand-alone, executable package that includes everything needed to run a piece of software: the code, a runtime, libraries, environment variables, and configuration files.

A container is a runtime instance of an image – what the image becomes in memory when actually executed. It runs completely isolated from the host environment by default, only accessing host files and ports if configured to do so.

2. $ sudo docker run hello-world
To run your first app in Docker

3. $ sudo docker images
To see the installed images
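By default the listing hides intermediate image layers; add the -a flag to show them as well:
$ sudo docker images -a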

4. $ docker --help
To see the list of Docker commands

5. $ sudo docker run -it fedora bash
   $ sudo docker run -it ubuntu bash
To launch a terminal with a Fedora (or Ubuntu) container instance.
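Inside the container's root shell (the # prompt) you can confirm which distribution you are running, and exit leaves the shell and stops the container:
# cat /etc/os-release
# exit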

6. $ sudo docker inspect fedora
To get the details of the fedora image
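If you only need one field from that long JSON output, the --format flag accepts a Go template; for example, to print just the image ID:
$ sudo docker inspect --format '{{.Id}}' fedora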

7. $ sudo docker version
To get the Docker version

8. $ sudo docker ps
To see running containers, their IDs, and other details
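Only running containers are listed by default; add -a to include stopped ones:
$ sudo docker ps -a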

9. $ sudo docker stop <container-id>
To gracefully stop a container; take <container-id> from the docker ps output above
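A stopped container is not deleted; it can be started again or removed (same placeholder ID as above):
$ sudo docker start <container-id>
$ sudo docker rm <container-id>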

10. $ sudo docker history fedora
Shows the history (layers) of the fedora image
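The CREATED BY column is truncated by default; --no-trunc shows the full command behind each layer:
$ sudo docker history --no-trunc fedora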



References:
1. https://docs.docker.com/engine/reference/commandline/container_start/#parent-command
2. https://docs.docker.com/engine/userguide/

Saturday 4 March 2017

Machine Learning! Where To Start? - Tools

Popular Tools in 5 points

Octave

  1. This is an Open Source project.
  2. Easy to write complex Machine Learning equations.
  3. Vectorization makes it easy to manipulate matrices and write matrix operations.
  4. Python-like CLI.
  5. Limited to small data sets (cannot handle big data).

R

  1. This is an Open Source project.
  2. R can create graphics to be displayed on the screen or saved to file. It can also prepare models that can be queried and updated.
  3. R is a tool to use when you need to analyze data, plot data or build a statistical model for data.
  4. Built with a statistics-centric design for computation.
  5. This blog post covers almost everything: http://machinelearningmastery.com/what-is-r/

Python 

  1. NumPy and Pandas are the two most recommended libraries to get you started with data manipulation and some statistical calculations.
  2. The IPython notebook is also gaining popularity nowadays.
  3. scikit-learn and TensorFlow are machine learning libraries that make model building simpler for everyone (see the short sketch after this list).
  4. Python is the more popular choice among programmers.
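
As an illustration of point 3, here is a minimal model-building sketch with scikit-learn; it assumes scikit-learn is installed and uses the bundled Iris data set and a logistic regression purely as an example:

# Minimal scikit-learn sketch: load data, split, fit, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a small built-in data set and hold out part of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a simple classifier and report accuracy on the held-out data.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))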

Apache Mahout 

  1. Open-source, scalable machine learning platform.
  2. Runs multiple MapReduce jobs to execute a machine learning algorithm.
  3. It is built on top of Hadoop.
  4. It uses batch processing.

Apache Spark

  1. Open source Big Data platform. 
  2. MLlib is the machine learning library available here.
  3. It uses in-memory processing, which is why it is faster than Mahout.
  4. It supports micro-batch processing.

Summary

Hopefully, this will help you make the right choice. If you would like to understand the mathematics of machine learning, I would recommend using Octave or Python to implement the algorithms without any library. If your objective is only to apply these algorithms, then you may start with Python and scikit-learn; moreover, there are plenty of tutorials on the Internet that can get you started. Other popular libraries are Apache Apex-SAMOA, H2O, Flink, Weka, Java-ML, etc. Please comment if you would like to see some other comparisons.

Sunday 26 February 2017

Auto Import Python libraries at startup


*Requirements: the description below applies to Linux.

Create a file auto_import.py 

import numpy as np
import pandas as pd
# ... and so on,
# with all your import statements

In Terminal 

1. Open the $HOME/.bashrc file

$ gedit $HOME/.bashrc

2. Copy-paste the following line at the end of the file:

# be sure of the path you provide

export PYTHONSTARTUP=/path/to/auto_import.py

3. Save and close the file 

4. Relaunch the terminal or use the command below to apply the changes in the bashrc file

$ exec bash

5. In the terminal

$ python 

>>> # Congratulations! All your libraries are auto-imported.
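
For example, with the auto_import.py shown above, the np and pd aliases are available as soon as the interpreter starts:

>>> np.arange(3)
array([0, 1, 2])
>>> pd.Series([1, 2, 3]).sum()
6

Note that PYTHONSTARTUP only affects interactive sessions; a script run as python some_script.py still needs its own import statements.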