Seven Databases in Seven Weeks – Hbase Day 2

This post is a recap of the second day of Hbase from the Seven Databases in Seven Weeks book.

Most of the commands and scripts can be found at GitHub:
https://github.com/eyalgo/seven-dbs-in-seven-weeks/tree/master/hbase/day_2

Streaming Script
The first thing in day 2 was to download lots of data (big data) and stream it into Hbase.
There’s a JRuby script, which I had to alter in order for it to work.
https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/import_from_wikipedia.rb

After altering it, as the book suggested, I had to add some compression to the column family.
After that, I could run the script:

curl http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_from_wikipedia.rb

curl http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell import_from_wikipedia.rb

This is the output while the script runs (curl’s progress meter is interleaved with the script’s own messages):

1 10.0G    1  128M    0     0   456k      0  6:23:37  0:04:48  6:18:49  817k19000 records inserted (Serotonin)
  1 10.0G    1  131M    0     0   461k      0  6:19:03  0:04:51  6:14:12  921k19500 records inserted (Serotonin specific reuptake inhibitors)
  1 10.0G    1  135M    0     0   469k      0  6:12:45  0:04:54  6:07:51 1109k20000 records inserted (Tennis court)
  1 10.0G    1  138M    0     0   477k      0  6:06:12  0:04:57  6:01:15 1269k20500 records inserted (Tape drive)

The next part in this chapter talks about regions and some other plumbing stuff.

Build links table
In this part the source is the large Wiki table and the output is a ‘links’ table.
Each link has a ‘From:’ side and a ‘To:’ side.
Here’s a link to the altered working script.
https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/generate_wiki_links.rb
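
To make the idea of the links table concrete, here is a minimal plain-Ruby sketch that runs without Hbase. It assumes wiki links look like [[Target]] or [[Target|label]], and that each link becomes a "to:" cell on the source row and a "from:" cell on the target row. The pattern, the method name link_cells and the cell layout are simplifications for illustration, not the code of the linked script.

```ruby
# Simplified, hypothetical pattern: matches [[Target]] and [[Target|label]].
# The real script's pattern may be stricter (for example about namespaces).
LINK_PATTERN = /\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]/

# Returns the cells one article would contribute to the links table:
# a "to:" cell on the source row and a "from:" cell on the target row.
def link_cells(title, text)
  text.scan(LINK_PATTERN).flat_map do |target, label|
    label ||= '' # links without a label get an empty value
    [
      { row: title,  column: "to:#{target}",  value: label },
      { row: target, column: "from:#{title}", value: label }
    ]
  end
end

link_cells('Star Wars', 'See [[Yoda]] and [[Luke Skywalker|Luke]].').each do |cell|
  puts cell.inspect
end
```

Every link produces two cells, so the table can be read in both directions: which pages an article links to, and which pages link to it.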

The rest of the chapter shows how to look at the data, count it and more.

Homework
The main part of the homework was to create a new table, ‘foods’, that takes its data from an XML file, which can be downloaded from the US’s health & nutrition site.
This data shows the nutrition facts per type of food.

I decided to create a very simple table with one column family, facts, which does not have any special options.
Each row of data from the XML file goes into facts.
I also decided that the row’s key would be the Display_Name. After all, it’s much easier to look up by name than by some ID.

create 'foods' , 'facts'

In order to see how I should create the script I looked at two sources:
1. The script that imported data for the Wiki table
2. One element (food) from the XML

Here’s one element:

<Food_Display_Row>
  <Food_Code>12350000</Food_Code>
  <Display_Name>Sour cream dip</Display_Name>
  <Portion_Default>1.00000</Portion_Default>
  <Portion_Amount>.25000</Portion_Amount>
  <Portion_Display_Name>cup </Portion_Display_Name>
  <Factor>.25000</Factor>
  <Increment>.25000</Increment>
  <Multiplier>1.00000</Multiplier>
  <Grains>.04799</Grains>
  <Whole_Grains>.00000</Whole_Grains>
  <Vegetables>.04070</Vegetables>
  <Orange_Vegetables>.00000</Orange_Vegetables>
  <Drkgreen_Vegetables>.00000</Drkgreen_Vegetables>
  <Starchy_vegetables>.00000</Starchy_vegetables>
  <Other_Vegetables>.04070</Other_Vegetables>
  <Fruits>.00000</Fruits>
  <Milk>.00000</Milk>
  <Meats>.00000</Meats>
  <Soy>.00000</Soy>
  <Drybeans_Peas>.00000</Drybeans_Peas>
  <Oils>.00000</Oils>
  <Solid_Fats>105.64850</Solid_Fats>
  <Added_Sugars>1.57001</Added_Sugars>
  <Alcohol>.00000</Alcohol>
  <Calories>133.65000</Calories>
  <Saturated_Fats>7.36898</Saturated_Fats>
</Food_Display_Row>

I created the script by examining the wiki script and one element.
A new document is opened when the script sees the opening XML tag Food_Display_Row.
When it sees the closing Food_Display_Row tag, it writes the document to the table.

include Java
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
import 'org.apache.hadoop.hbase.HBaseConfiguration'
import 'javax.xml.stream.XMLStreamConstants'

# Child elements of Food_Display_Row that are stored as columns in 'facts'.
# Food_Code is not stored.
FIELDS = %w[
  Display_Name Portion_Default Portion_Amount Portion_Display_Name Factor
  Increment Multiplier Grains Whole_Grains Vegetables Orange_Vegetables
  Drkgreen_Vegetables Starchy_vegetables Other_Vegetables Fruits Milk Meats
  Drybeans_Peas Soy Oils Solid_Fats Added_Sugars Alcohol Calories Saturated_Fats
]

# Convert each argument to a Java byte array, as the Hbase client expects.
def jbytes( *args )
  args.map { |arg| arg.to_s.to_java_bytes }
end

# Read the XML as a stream of events from standard input.
factory = javax.xml.stream.XMLInputFactory.newInstance
reader = factory.createXMLStreamReader(java.lang.System.in)

document = nil
buffer = nil
count = 0

puts( @hbase )
conf = HBaseConfiguration.new
table = HTable.new( conf, "foods" )
table.setAutoFlush( false ) # buffer puts until flushCommits is called

while reader.has_next
  type = reader.next

  if type == XMLStreamConstants::START_ELEMENT
    name = reader.local_name
    if name == 'Food_Display_Row'
      document = {}   # a new food starts
    elsif FIELDS.include?( name )
      buffer = []     # start collecting this field's text
    end

  elsif type == XMLStreamConstants::CHARACTERS
    buffer << reader.text unless buffer.nil?

  elsif type == XMLStreamConstants::END_ELEMENT
    name = reader.local_name
    if FIELDS.include?( name )
      document[name] = buffer.join
      buffer = nil

    elsif name == 'Food_Display_Row'
      # The food is complete: write all of its fields as one row.
      key = document['Display_Name'].to_java_bytes

      p = Put.new( key )
      FIELDS.each do |field|
        p.add( *jbytes( "facts", field, document[field] ) )
      end

      table.put( p )

      count += 1
      table.flushCommits() if count % 10 == 0
      if count % 500 == 0
        puts "#{count} records inserted (#{document['Display_Name']})"
      end
    end
  end
end

table.flushCommits()
exit
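
The open/close logic of the script above can also be tried without Hbase or JRuby. The following is a minimal plain-Ruby sketch under stated assumptions: each_event is a toy tokenizer that only handles simple XML without attributes (it is not a real XML parser), and parse_foods is a hypothetical helper that collects the documents into an array instead of writing them to a table. Unlike the import script, it keeps every child element, including Food_Code.

```ruby
ROW = 'Food_Display_Row'

# Toy tokenizer (an assumption for this sketch, not a real XML parser):
# yields :start/:end with a tag name, or :text with the characters.
def each_event(xml)
  xml.scan(%r{<(/?)(\w+)>|([^<]+)}) do |slash, name, text|
    if text
      yield :text, text
    elsif slash == '/'
      yield :end, name
    else
      yield :start, name
    end
  end
end

# Same state machine as the import script: open a document on the opening
# row tag, buffer the text of each child, emit the document on the closing tag.
def parse_foods(xml)
  foods = []
  document = nil
  buffer = nil
  each_event(xml) do |type, value|
    case type
    when :start
      if value == ROW
        document = {}
      else
        buffer = []
      end
    when :text
      buffer << value unless buffer.nil?
    when :end
      if value == ROW
        foods << document
        document = nil
      elsif document
        document[value] = buffer.join
      end
      buffer = nil
    end
  end
  foods
end

sample = <<~XML
  <Food_Display_Row>
    <Food_Code>12350000</Food_Code>
    <Display_Name>Sour cream dip</Display_Name>
    <Calories>133.65000</Calories>
  </Food_Display_Row>
XML

parse_foods(sample).each { |food| puts food.inspect }
```

Whitespace between elements is ignored because the buffer is only open between a child’s opening and closing tags.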

Following are the shell commands that take the XML file and stream it to Hbase.
The first command runs against the file with the single element.
After I verified the result, I ran it against the full file.

curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/food-display-example.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb

curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/MyFoodapediaData/Food_Display_Table.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb

Let’s get some food…

get 'foods' , 'fruit smoothie made with milk'

And the result:

COLUMN                      CELL
facts:Added_Sugars          timestamp=1399932481440, value=82.54236
facts:Alcohol               timestamp=1399932481440, value=.00000
facts:Calories              timestamp=1399932481440, value=197.96000
facts:Display_Name          timestamp=1399932481440, value=fruit smoothie made with milk
facts:Drkgreen_Vegetables   timestamp=1399932481440, value=.00000
facts:Drybeans_Peas         timestamp=1399932481440, value=.00000
facts:Factor                timestamp=1399932481440, value=1.00000
facts:Fruits                timestamp=1399932481440, value=.56358
facts:Grains                timestamp=1399932481440, value=.00000
facts:Increment             timestamp=1399932481440, value=.25000
facts:Meats                 timestamp=1399932481440, value=.00000
facts:Milk                  timestamp=1399932481440, value=.22624
facts:Multiplier            timestamp=1399932481440, value=.25000
facts:Oils                  timestamp=1399932481440, value=.00808
facts:Orange_Vegetables     timestamp=1399932481440, value=.00000
facts:Other_Vegetables      timestamp=1399932481440, value=.00000
facts:Portion_Amount        timestamp=1399932481440, value=1.00000
facts:Portion_Default       timestamp=1399932481440, value=2.00000
facts:Portion_Display_Name  timestamp=1399932481440, value=cup
facts:Saturated_Fats        timestamp=1399932481440, value=1.91092
facts:Solid_Fats            timestamp=1399932481440, value=24.14304
facts:Soy                   timestamp=1399932481440, value=.00000
facts:Starchy_vegetables    timestamp=1399932481440, value=.00000
facts:Vegetables            timestamp=1399932481440, value=.00000
facts:Whole_Grains          timestamp=1399932481440, value=.00000


Seven Databases in Seven Weeks – Hbase Day 1

Hbase is a columnar NoSQL database.
The first day of Hbase was short and clear.
Installing it was easy. No issues whatsoever.
The examples simulated some wiki pages with revisions.
It was fairly easy.

Installation
I found a really easy tutorial on how to install Hbase on Fedora:
http://tutorialforlinux.com/2014/03/18/how-to-getting-started-with-apache-hbase-on-fedora-19-20-21-3264bit-linux-easy-guide/

Hbase will usually work on several (many) servers. It is recommended to run it with at least 5 machines.
However, it’s possible to run it on a single machine for POC / learning purposes. I am using an old, weak laptop, and Hbase works just fine.

JRuby Script
Part of the learning consists of understanding JRuby, as some scripts and exercises use it.

To run a JRuby script with the Hbase classes on the classpath, use something like:
/opt/hbase-latest/bin/hbase org.jruby.Main PATH-TO-SCRIPT

The example script, put_multiple_columns, initially didn’t work. I think it’s due to a version difference.
In the book’s forum I found a similar question and an answer for that problem:
http://forums.pragprog.com/forums/202/topics/11494

I uploaded the working script to GitHub: GitHub-put_multiple_columns.rb

Day 1 Material
Some links, material and homework answers are on GitHub:
https://github.com/eyalgo/seven-dbs-in-seven-weeks/tree/master/hbase/day_1

Day 1 Homework
The exercise is more about JRuby / Ruby and less about Hbase.

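# Put several column values into one row with a single Put.
# column_values maps "family:qualifier" (or just "family:") to a value.
def put_many( table_name, row, column_values )
  import 'org.apache.hadoop.hbase.client.HTable'
  import 'org.apache.hadoop.hbase.client.Put'
  import 'org.apache.hadoop.hbase.HBaseConfiguration'

  # Convert each argument to a Java byte array, as the Hbase client expects.
  def jbytes( *args )
    args.map { |arg| arg.to_s.to_java_bytes }
  end

  puts( @hbase )
  conf = HBaseConfiguration.new
  table = HTable.new( conf, table_name )
  p = Put.new( *jbytes( row ) )

  column_values.each do |key, value|
    # Split "family:qualifier"; the qualifier may be missing.
    (key_family, key_name) = key.split(':')
    key_name ||= ""
    p.add( *jbytes( key_family, key_name, value ))
  end

  table.put( p )
end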
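
The only non-obvious Ruby in put_many is how a key is split into family and qualifier. Here is a small plain-Ruby sketch of just that step; split_column is a hypothetical helper name, and the keys 'text:' and 'revision:author' are only illustrative.

```ruby
# Same splitting logic as put_many, without any Hbase calls.
def split_column(key)
  family, qualifier = key.split(':')
  # "text:".split(':') drops the trailing empty string, so the qualifier
  # is nil and has to default to "".
  [family, qualifier || '']
end

p split_column('text:')            # family only, empty qualifier
p split_column('revision:author')  # family and qualifier
```

If a qualifier could itself contain a colon, key.split(':', 2) would keep everything after the first colon together.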

Day 2, working with big data looks really interesting…


Seven Databases in Seven Weeks – Riak

In this post I am summarizing the three days of Riak, which is the second database in the Seven Databases in Seven Weeks book.
I am writing it mainly to remember some tweaks I had to make while reading this chapter, as sometimes the book wasn’t entirely correct.

A good blog, which I used a little, can be found at:
http://blog.wakatta.jp/blog/2011/12/09/seven-databases-in-seven-weeks-riak-day-3/
(this link points to Riak’s 3rd day)

I have everything pushed to GitHub as raw material:
https://github.com/eyalgo/seven-dbs-in-seven-weeks

Installing
The book recommends installing from the source code.
I needed to install Erlang as well.

Besides the information in the book, the following link was the most helpful:
http://docs.basho.com/riak/latest/ops/building/installing/from-source/

I installed everything under /usr/local/riak/.

Start / Stop / Restart
A nice command line to start/stop/restart all the servers:

# under /usr/local/riak/riak-1.4.8/dev
for node in `ls`; do $node/bin/riak start; done
# change start to restart or stop

Port
The HTTP port configured on my machine was 10018 for dev1, 10028 for dev2, etc.
The port is set in the app.config file, under each node’s etc folder.

Day 3 Issues
Pre-commit
I kept getting a “PUT aborted by pre-commit hook” message instead of the one described in the book.
I had to add the language (javascript) to the hook definition:

curl -i -X PUT http://localhost:10018/riak/animals -H "content-type: application/json" -d '{"props":{"precommit":[{"name":"good_score","language":"javascript"}]}}'

(see: http://blog.sacaluta.com/2012/07/riak-precommit-hook-example.html)
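
The JSON body of that request is easier to read when it is built from a nested structure. A minimal plain-Ruby sketch, assuming only the fields shown in the curl command above:

```ruby
require 'json'

# Bucket properties: one pre-commit hook, with its language stated explicitly.
payload = {
  props: {
    precommit: [
      { name: 'good_score', language: 'javascript' }
    ]
  }
}

puts JSON.pretty_generate(payload) # readable form
puts JSON.generate(payload)        # compact form, usable as the curl -d body
```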

Running a solr query
Running the suggested query from the book
( curl http://localhost:10018/solr/animals/select?wt=json&q=nickname:rin%20breed:shepherd&q.op=and)
kept returning 400 – Bad Request.
All I needed to do was to surround the URL with single quotes ('). Without them the shell treats each & as a command separator, so curl only receives the URL up to the first &.
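
Another way to avoid shell quoting altogether is to build the URL in a script. The sketch below uses the same host, port and parameters as the query above. It encodes the space as + and the colons as %3A, and it is an assumption on my side that Riak’s Solr-style endpoint treats this form the same as the book’s.

```ruby
require 'uri'

# Query parameters from the book's example; the value of q contains a space.
params = [
  ['wt', 'json'],
  ['q', 'nickname:rin breed:shepherd'],
  ['q.op', 'and']
]

# encode_www_form escapes the values and joins the pairs with '&'.
url = "http://localhost:10018/solr/animals/select?#{URI.encode_www_form(params)}"
puts url

# With a node running, the request could then be sent from Ruby, e.g.:
#   require 'net/http'
#   puts Net::HTTP.get(URI(url))
```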

Inverted Index
Running the link as mentioned in the book gives a bad response:

Invalid link walk query submitted. Valid link walk query format is: ...

The correct way is described in http://docs.basho.com/riak/latest/dev/using/2i/

curl http://localhost:10018/buckets/animals/index/mascot_bin/butler

Conclusion
The Riak chapter gives a taste of this database.
It explains more about its “tooling” than about its application.
I feel that it didn’t explain much about why someone would use it instead of something else (let’s wait for Redis).

The book had errors in how to run commands.
I had to find out by myself how to fix these problems.
Perhaps it’s because I’m reading the eBook (PDF on my computer and mobi on my Kindle), and the hard copy has fewer issues.
The good part of this problem is that I had to drill down, read more online and learn more from those mistakes.


The Node Beginner Book – Book Review

A few days ago I finished a beginner’s book about Node.
Name of the book: The Node Beginner Book, A comprehensive Node.js tutorial
Author: Manuel Kiessling
I liked it and the way it’s written, so I would like to share it with you.

I came across this book while looking for Node tutorials on the web.
So first, here’s the site of the book: http://www.nodebeginner.org/
It contains part of the first chapter of the book, as a tutorial and explanation.
I started going over the tutorial and when I was done with this part I went on and bought the book.
Another cool thing is that you can buy it from Leanpub as part of a bundle that has Hands-On Node.js as well. (I haven’t read this one yet).

The book walks through a very basic Node server construction.
Instead of just Hello World, we actually build a small server that can upload and show an image.

It covers the basics, which I feel is enough to understand the concepts of Node and to give a really good kickoff for someone who is interested in Node.

We start by installing Node and learning a little bit about JS.
At the beginning there’s a clear explanation of the use case we’re going to develop,
and of the importance of the architecture and what it should look like.

Then, step by step, we build server.js, index.js, router.js and the request handlers.
I think that this is really important, as it emphasizes a good approach to architecture design.
The author emphasizes how important it is to separate concerns and write organized code.

Another really good aspect is the explanation of functional programming and how it helps in Node and in an HTTP server. Now, you’re not going to be a functional programmer after reading this book, but you will definitely understand the concepts and get the idea.
For me, it’s a really good thing. As a Java developer, I don’t use the functional paradigm, and it’s an important tool these days.
(Yes, I know that there are many other functional languages. But that’s my point. By reading this book, I had a good opportunity to play with some functional paradigm.)

The book gradually evolves the server creation.
After we build the server.js, we start enhancing it.
We build the index.js file, which holds the routing map.
We build router.js that routes to the request handler.
And requestHandlers.js to work with the different requests.

Each part in the system evolves while reading the book.
For example, at the beginning a function does not accept any parameter. Then it accepts some and later the parameters change.
Every change is explained in context, along with how it helps with aspects such as good architecture and design, asynchronous code and other concepts.

One of the examples I liked shows why passing a callback function is important. The book shows nicely what happens if you run a slow operation (a find in the file system) synchronously, without a callback: basically, your whole server gets stuck.

Towards the end, after we have built a simple yet flexible server, we learn some technical Node stuff, such as how to use external libraries with the package manager, NPM.
And by using it, we learn how to show an image, upload a file and rename it.

At the end of the book we get a working Node server that can upload an image and show it.
It’s fun!

I highly recommend it to anyone who wants to understand what Node is all about, but more than just the syntax.

The author has another book, which I bought but haven’t read yet: The Node Craftsman Book.

Happy reading!


Getting Started with Google Guava – Book Review

I recently got my hands (my kindle) on the book: Getting Started with Google Guava by Bill Bejeck.

I love reading technical books and always hope to learn new stuff. As an extensive user of the Guava library, I was really intrigued to see what I was missing from this library and how I could improve the usage of it.

I will not go over it chapter by chapter with explanations, as anyone can check the TOC and see the details of what this book covers. Instead, I will try to give my own impression.

The book covers all aspects of the Guava library. For each aspect, the author shows the most used implementation and mentions other ones.

In nearly every chapter, I was introduced to some gems that immediately went into our own codebase when I started refactoring. That was FUN. And I saw code improvements instantly.

I really enjoyed reading the code examples with the extensive usage of JUnit as showcases for the behavior of the various classes. It’s a great way of showing what the library does. And as a side effect, it shows developers how a test is used as the specs of the code.

It seems that the author was very meticulous in writing clean and testable code: two areas which I think are, well, the most important for being a professional developer (a craftsman).

I think that this book is great for both newbies and experienced Guava users.
I think it is also great for developers who want to learn how to write cleaner, better code.