By: Brian Ray [email protected]
This project uses Docker containers to set up a network of services and workbenches commonly used by data scientists working on machine learning problems. It is currently marked as experimental, and contributions are welcome. The Docker Compose file outlines several of the containers; they should be configured to work with each other over the docpyml network you create on your Docker VM (a quick check of this name-based discovery is sketched after the container list below).
List of Containers:
- docpyml-namenode: Hadoop NameNode. Keeps the directory tree of all files in the file system.
- docpyml-datanode1: Hadoop DataNode (HDFS data storage)
- docpyml-datanode2: Hadoop DataNode (HDFS data storage)
- docpyml-spark-master: Apache Spark Master
- spark-worker (you may launch many of these): Spark workers; each also contains the Python version matching docpyml-conda
- docpyml-sparknotebook: Preconfigured Spark Notebook
- docpyml-hdfsfb: HDFS FileBrowser from Cloudera Hue
- docpyml-conda: Anaconda Python 3.5 with Jupyter Notebook, machine learning packages, pySpark preconfigured
- docpyml-rocker: RStudio
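Because every service joins the same docpyml network, containers can reach one another by service name through Docker's built-in DNS. As a minimal sanity check, assuming the network and the namenode container are already up (busybox here is just a throwaway image for the test):

docker run --rm --net docpyml busybox ping -c 1 docpyml-namenode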
Prerequisites: Docker Toolbox.
Optionally, adjust your Docker VM settings (stop the machine, resize it, then restart it):
docker-machine stop
VBoxManage modifyvm default --cpus 4
VBoxManage modifyvm default --memory 8192
docker-machine start
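To confirm the new settings took effect, VBoxManage can report them directly (this assumes your machine is named default, as above):

VBoxManage showvminfo default | grep -E "Memory size|Number of CPUs"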
To start the environment:
docker network create docpyml
docker-compose up -d
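Once compose returns, you can confirm that the containers came up and joined the network using standard commands (nothing project-specific here):

docker-compose ps
docker network inspect docpyml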
If Docker complains that it is not running, try this first:
eval "$(docker-machine env default)"
To scale up the Spark workers:
docker-compose scale spark-worker=3
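To verify that the extra workers started, list them with compose; if the Compose file publishes the Spark master's web UI (conventionally port 8080, though the mapping depends on the project's configuration), you can also check that the workers registered there:

docker-compose ps spark-worker
# Spark master UI, if port 8080 is published:
# http://$(docker-machine ip default):8080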