Friday, December 27, 2013

Setting up multi-node Hadoop cluster on Mac


On Mac (OS X Lion 10.7.5)
  1. Download
    1. Download VMware Fusion
    2. Install it. You may need to buy a license after the trial period.
    3. Download the Ubuntu desktop 12.04 LTS image to a local folder

  2. Launch
    1. Common for all the four nodes below
      1. We will create the four nodes below to simulate a realistic multi-node cluster. The edge node will host Cloudera Manager (to install Hadoop on the cluster) and Eclipse (to develop code), and will be used to submit jobs, etc. The name node will run the namenode, secondary namenode and job tracker services. The data nodes will run the datanode and task tracker services.
      2. Open VMware Fusion and create four virtual machines (VMs) using the Ubuntu image that you downloaded earlier: click "Add" >> "New" >> "Install from disc or image" >> "Continue" >> "Use another disc or disc image" >> point to the downloaded Ubuntu image file >> "Customize" based on the memory/processors you have available on your Mac, then name each virtual machine according to its usage. (You will need to remember the user ID and password that you provide here.)
      3. Launch each of the machines
      4. Login to the machine
      5. Click on "Dash" >> search for "Terminal" >> Open Terminal
      6. sudo apt-get install openssh-server (to accept SSH connections)
      7. ssh-keygen -t rsa -P "" (then hit enter when prompted)
      8. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      9. chmod 600 ~/.ssh/authorized_keys (sshd may ignore this file if other users can write to it)
      10. Run the "ifconfig" command and note down the IP address of each machine
      11. sudo vi /etc/hostname (replace "ubuntu" with new VM name e.g. "edge" or "nn1" or "dn1" or "dn2")
      12. sudo hostname <VM_Name> (e.g. sudo hostname edge) to change the VM name
      13. sudo vi /etc/hosts (Comment out the lines for "localhost" and "ubuntu" then add a line for each of the VMs "IPaddress   Machine_Name")
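      14. sudo vi /etc/sudoers (add a line at the bottom "<user_id>  ALL=(ALL) NOPASSWD: ALL") to give <user_id> root privileges without a password prompt. "sudo visudo" is a safer way to edit this file because it checks the syntax before saving.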
      15. Set time and timezone
        1. sudo apt-get install ntp
        2. sudo dpkg-reconfigure tzdata
      16. Restart the machine (click the power button at the top right >> "Shut Down" >> "Restart")
    2. Edge node
      1. cat ~/.ssh/id_rsa.pub
      2. Highlight and copy the contents from the cat command above
      3. Go to each of the other three nodes, vi ~/.ssh/authorized_keys
      4. Paste the copied contents at the end of the above file & save
      5. SSH to all four machines, including itself, a couple of times to make sure you are not prompted for anything. The first time, you may need to type "yes" to accept the host key (e.g. ssh nn1, ssh dn1, ssh dn2)
      6. Download Cloudera Manager & run it per the instructions at the same link
      7. cd to download dir
      8. chmod +x cloudera-manager-installer.bin
      9. sudo ./cloudera-manager-installer.bin
      10. Follow the instructions
      11. Open a browser and go to http://localhost:7180
      12. Log in with "admin" and "admin"
      13. Start the install; use the user ID that you chose when you created the VMs
      14. Enter the IP addresses of all four nodes (including the edge node itself)
      15. Continue the installation.
    3. Name node
      1. cat ~/.ssh/id_rsa.pub
      2. Highlight and copy the contents from the cat command above
      3. Go to each of the other three nodes, vi ~/.ssh/authorized_keys
      4. Paste the copied contents at the end of the above file & save
      5. SSH to all four machines, including itself, a couple of times to make sure you are not prompted for anything. The first time, you may need to type "yes" to accept the host key (e.g. ssh nn1, ssh dn1, ssh dn2)
    4. Data node 1
      1. cat ~/.ssh/id_rsa.pub
      2. Highlight and copy the contents from the cat command above
      3. Go to each of the other three nodes, vi ~/.ssh/authorized_keys
      4. Paste the copied contents at the end of the above file & save
      5. SSH to all four machines, including itself, a couple of times to make sure you are not prompted for anything. The first time, you may need to type "yes" to accept the host key (e.g. ssh nn1, ssh dn1, ssh dn2)
    5. Data node 2
      1. cat ~/.ssh/id_rsa.pub
      2. Highlight and copy the contents from the cat command above
      3. Go to each of the other three nodes, vi ~/.ssh/authorized_keys
      4. Paste the copied contents at the end of the above file & save
      5. SSH to all four machines, including itself, a couple of times to make sure you are not prompted for anything. The first time, you may need to type "yes" to accept the host key (e.g. ssh nn1, ssh dn1, ssh dn2)
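
The copy-and-paste of public keys that is repeated on each node above could probably be replaced with ssh-copy-id, which appends the local public key to the remote authorized_keys file. The sketch below only prints the commands (a dry run) so it can be reviewed before anything is run. The host names are the examples used in this post, and "hduser" is a made-up placeholder for the user ID you chose when creating the VMs.

```shell
#!/usr/bin/env bash
# Sketch: list the ssh-copy-id commands one node would run to push its
# public key to the other three nodes. Nothing is executed remotely.
# NODES uses the example host names from this post; adjust as needed.

NODES="edge nn1 dn1 dn2"

# Print one ssh-copy-id command per remote node, skipping the local node.
# $1 = name of the node you are on, $2 = user ID on the remote nodes
key_copy_commands() {
  local self="$1" user="$2" node
  for node in $NODES; do
    [ "$node" = "$self" ] && continue
    echo "ssh-copy-id -i ~/.ssh/id_rsa.pub ${user}@${node}"
  done
}

# Dry run for the edge node; "hduser" is a hypothetical user ID.
key_copy_commands edge hduser
```

Running each printed command once (and entering the password when asked) should leave the same result as pasting the key by hand. Repeat with "nn1", "dn1" and "dn2" as the first argument on the other nodes.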

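The /etc/hosts lines from the common steps above ("IPaddress   Machine_Name") have to be identical on all four VMs, so it may help to generate them once and reuse the output. This is a minimal sketch; the IP addresses are placeholders and must be replaced with the ones that ifconfig reported on your machines.

```shell
#!/usr/bin/env bash
# Sketch: print "IPaddress<TAB>Machine_Name" lines for /etc/hosts.
# The addresses used in the example call are placeholders, not real values.

# Arguments come in pairs: an IP address followed by a machine name.
hosts_entries() {
  while [ "$#" -ge 2 ]; do
    printf '%s\t%s\n' "$1" "$2"
    shift 2
  done
}

hosts_entries \
  192.168.1.10 edge \
  192.168.1.11 nn1 \
  192.168.1.12 dn1 \
  192.168.1.13 dn2
```

On each VM the output could be appended with "| sudo tee -a /etc/hosts" instead of typing the lines in vi.
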
Thursday, December 19, 2013

Model thinking

Course by Scott E. Page. Models help solve problems by turning abstract or complex things into something concrete that you can tweak and play around with. With a model you can introduce and analyze, one by one, the parameters that influence the result. Models help explain why things happened one way (e.g. why the rich get richer), or help in coming up with equations or rules that make it easy to predict what is going to happen next when parameters change.

Productivity is highest during wars. Why? People are focused and produce more. A manager must introduce problems periodically so that the team will produce more and be more resilient (e.g. controlled forest fires).

Some problems become very easy to solve once you change the way you represent the problem (e.g. Cartesian vs. polar coordinates, or the sum-to-15 puzzle).
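
As an illustration, assume the sum-to-15 puzzle is the game where two players take turns picking numbers from 1 to 9 and the first to hold three numbers that add up to 15 wins. The sketch below lists every winning set of three. There are exactly eight, and they are the rows, columns and diagonals of a 3x3 magic square, so the game is tic-tac-toe in a different representation.

```shell
#!/usr/bin/env bash
# Sketch: list every set of three distinct numbers from 1..9 that sums to 15.
#
# Magic square for comparison (every row, column and diagonal sums to 15):
#   2 7 6
#   9 5 1
#   4 3 8

sum15_triples() {
  local a b c
  for a in 1 2 3 4 5 6 7 8 9; do
    for b in 1 2 3 4 5 6 7 8 9; do
      for c in 1 2 3 4 5 6 7 8 9; do
        # Requiring a < b < c reports each set once, whatever the pick order.
        if [ "$a" -lt "$b" ] && [ "$b" -lt "$c" ] && [ $((a + b + c)) -eq 15 ]; then
          echo "$a $b $c"
        fi
      done
    done
  done
}

sum15_triples
```

Seen as a list of sums the game is hard to play well; seen as tic-tac-toe on the magic square it is a game most people already know how to play.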

Many models are better than one, because any single model can get stuck.

Do higher-level work and delegate lower-level work to machines. That is the technology advantage that sustains the growth rate. Innovation (new skills) is the way to keep increasing your salary (output). Typically people see a rapid rise in salary in the early years and then it flattens out, just as a country's high growth peters out after a while unless it innovates.

Data points are given. Now find some insight from them. Build models. Predict what is going to happen next.