Seven Databases in Seven Weeks – Hbase Day 2

This post is a recap of the second day of Hbase from the Seven Databases in Seven Weeks book.

Most of the commands and scripts can be found at GitHub:
https://github.com/eyalgo/seven-dbs-in-seven-weeks/tree/master/hbase/day_2

Streaming Script
The first thing in day 2 was to download lots of data (big data) and stream it into Hbase.
There’s a JRuby script, which I had to alter in order for it to work.
https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/import_from_wikipedia.rb

After altering it, as the book suggested, I had to add some compression to the column family.
After that, I could run the script:

curl http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_from_wikipedia.rb

curl http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell import_from_wikipedia.rb

This is the output while the script runs

1 10.0G    1  128M    0     0   456k      0  6:23:37  0:04:48  6:18:49  817k19000 records inserted (Serotonin)
  1 10.0G    1  131M    0     0   461k      0  6:19:03  0:04:51  6:14:12  921k19500 records inserted (Serotonin specific reuptake inhibitors)
  1 10.0G    1  135M    0     0   469k      0  6:12:45  0:04:54  6:07:51 1109k20000 records inserted (Tennis court)
  1 10.0G    1  138M    0     0   477k      0  6:06:12  0:04:57  6:01:15 1269k20500 records inserted (Tape drive)

The next part in this chapter talks about regions and some other plumbing stuff.

Build links table
In this part the source is the large Wiki table and the output is ‘links’ table.
Each link has ‘From:’ and ‘To:’
Here’s a link to the altered working script.
https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/generate_wiki_links.rb

The rest of the chapter shows how to look at the data, count it and more.

Homework
The main part in the homework, was to create a new table: ‘foods’ that takes data from an XML, which can be downloaded from the US’s health & nutrition site.
This data shows the nutrition facts per type of food.

I decided to create a very simple table. The column family does not have any special options.
I created one column family: facts. Each row data from the XML file will be part of facts.
I also decided that the row’s key would be the Display_Name. After all, it’s much easier to look by key and not by some ID.

create 'foods' , 'facts'

In order to see how I should create the script I looked at two sources:
1. The script that imported data for the Wiki table
2. One element (food) from the XML

Here’s one element:

<Food_Display_Row>
  <Food_Code>12350000</Food_Code>
  <Display_Name>Sour cream dip</Display_Name>
  <Portion_Default>1.00000</Portion_Default>
  <Portion_Amount>.25000</Portion_Amount>
  <Portion_Display_Name>cup </Portion_Display_Name>
  <Factor>.25000</Factor>
  <Increment>.25000</Increment>
  <Multiplier>1.00000</Multiplier>
  <Grains>.04799</Grains>
  <Whole_Grains>.00000</Whole_Grains>
  <Vegetables>.04070</Vegetables>
  <Orange_Vegetables>.00000</Orange_Vegetables>
  <Drkgreen_Vegetables>.00000</Drkgreen_Vegetables>
  <Starchy_vegetables>.00000</Starchy_vegetables>
  <Other_Vegetables>.04070</Other_Vegetables>
  <Fruits>.00000</Fruits>
  <Milk>.00000</Milk>
  <Meats>.00000</Meats>
  <Soy>.00000</Soy>
  <Drybeans_Peas>.00000</Drybeans_Peas>
  <Oils>.00000</Oils>
  <Solid_Fats>105.64850</Solid_Fats>
  <Added_Sugars>1.57001</Added_Sugars>
  <Alcohol>.00000</Alcohol>
  <Calories>133.65000</Calories>
  <Saturated_Fats>7.36898</Saturated_Fats>
</Food_Display_Row>

I created the script by examining the wiki script and one element.
Opening a document is when seeing an open XML element tag: Food_Display_Row
When seeing Food_Display_Row as the close tag, the script creates the document.

include Java
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
import 'org.apache.hadoop.hbase.HBaseConfiguration'
import 'javax.xml.stream.XMLStreamConstants'

def jbytes( *args )
  args.map { |arg| arg.to_s.to_java_bytes }
end

factory = javax.xml.stream.XMLInputFactory.newInstance
reader = factory.createXMLStreamReader(java.lang.System.in)

document = nil
buffer = nil
count = 0

puts( @hbase )
conf = HBaseConfiguration.new
table = HTable.new( conf, "foods" )
table.setAutoFlush( false )

while reader.has_next
  type = reader.next
  
  if type == XMLStreamConstants::START_ELEMENT # (3)
  
    case reader.local_name
    when 'Food_Display_Row' then document = {}
    when /Display_Name|Portion_Default|Portion_Amount|Portion_Display_Name|Factor/ then buffer = []
    when /Increment|Multiplier|Grains|Whole_Grains|Vegetables|Orange_Vegetables/ then buffer = []
    when /Drkgreen_Vegetables|Starchy_vegetables|Other_Vegetables|Fruits|Milk|Meats/ then buffer = []
    when /Drybeans_Peas|Soy|Oils|Solid_Fats|Added_Sugars|Alcohol|Calories|Saturated_Fats/ then buffer = []
    end
    
  elsif type == XMLStreamConstants::CHARACTERS
    buffer << reader.text unless buffer.nil?
    
  elsif type == XMLStreamConstants::END_ELEMENT
    
    case reader.local_name
    when /Display_Name|Portion_Default|Portion_Amount|Portion_Display_Name|Factor/
      document[reader.local_name] = buffer.join
    when /Increment|Multiplier|Grains|Whole_Grains|Vegetables|Orange_Vegetables/
      document[reader.local_name] = buffer.join
    when /Drkgreen_Vegetables|Starchy_vegetables|Other_Vegetables|Fruits|Milk|Meats/
      document[reader.local_name] = buffer.join
    when /Drybeans_Peas|Soy|Oils|Solid_Fats|Added_Sugars|Alcohol|Calories|Saturated_Fats/
      document[reader.local_name] = buffer.join

    when 'Food_Display_Row'
      key = document['Display_Name'].to_java_bytes
      
      p = Put.new( key )
      p.add( *jbytes( "facts", "Display_Name", document['Display_Name'] ) )
      p.add( *jbytes( "facts", "Portion_Default", document['Portion_Default'] ) )
      p.add( *jbytes( "facts", "Portion_Amount", document['Portion_Amount'] ) )
      p.add( *jbytes( "facts", "Portion_Display_Name", document['Portion_Display_Name'] ) )
      p.add( *jbytes( "facts", "Factor", document['Factor'] ) )
      p.add( *jbytes( "facts", "Increment", document['Increment'] ) )
      p.add( *jbytes( "facts", "Multiplier", document['Multiplier'] ) )
      p.add( *jbytes( "facts", "Grains", document['Grains'] ) )
      p.add( *jbytes( "facts", "Whole_Grains", document['Whole_Grains'] ) )
      p.add( *jbytes( "facts", "Vegetables", document['Vegetables'] ) )
      p.add( *jbytes( "facts", "Orange_Vegetables", document['Orange_Vegetables'] ) )
      p.add( *jbytes( "facts", "Drkgreen_Vegetables", document['Drkgreen_Vegetables'] ) )
      p.add( *jbytes( "facts", "Starchy_vegetables", document['Starchy_vegetables'] ) )
      p.add( *jbytes( "facts", "Other_Vegetables", document['Other_Vegetables'] ) )
      p.add( *jbytes( "facts", "Fruits", document['Fruits'] ) )
      p.add( *jbytes( "facts", "Milk", document['Milk'] ) )
      p.add( *jbytes( "facts", "Meats", document['Meats'] ) )
      p.add( *jbytes( "facts", "Drybeans_Peas", document['Drybeans_Peas'] ) )
      p.add( *jbytes( "facts", "Soy", document['Soy'] ) )
      p.add( *jbytes( "facts", "Oils", document['Oils'] ) )
      p.add( *jbytes( "facts", "Solid_Fats", document['Solid_Fats'] ) )
      p.add( *jbytes( "facts", "Added_Sugars", document['Added_Sugars'] ) )
      p.add( *jbytes( "facts", "Alcohol", document['Alcohol'] ) )
      p.add( *jbytes( "facts", "Calories", document['Calories'] ) )
      p.add( *jbytes( "facts", "Saturated_Fats", document['Saturated_Fats'] ) )

      table.put( p )
      
      count += 1
      table.flushCommits() if count % 10 == 0
      if count % 500 == 0
        puts "#{count} records inserted (#{document['Display_Name']})"
      end
    end
  end
end

table.flushCommits()
exit

Following are the shell commands that take the XML file and stream them to Hbase.
The first command runs against the file with the single element.
After I verified the correctness, I ran it against to full file.

curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/food-display-example.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb

curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/MyFoodapediaData/Food_Display_Table.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb

Let’s get some food…

get 'foods' , 'fruit smoothie made with milk'

And the result:

COLUMN CELL
facts:Added_Sugars timestamp=1399932481440, value=82.54236
facts:Alcohol timestamp=1399932481440, value=.00000
facts:Calories timestamp=1399932481440, value=197.96000
facts:Display_Name timestamp=1399932481440, value=fruit smoothie made with milk
facts:Drkgreen_Vegetables timestamp=1399932481440, value=.00000
facts:Drybeans_Peas timestamp=1399932481440, value=.00000
facts:Factor timestamp=1399932481440, value=1.00000
facts:Fruits timestamp=1399932481440, value=.56358
facts:Grains timestamp=1399932481440, value=.00000
facts:Increment timestamp=1399932481440, value=.25000
facts:Meats timestamp=1399932481440, value=.00000
facts:Milk timestamp=1399932481440, value=.22624
facts:Multiplier timestamp=1399932481440, value=.25000
facts:Oils timestamp=1399932481440, value=.00808
facts:Orange_Vegetables timestamp=1399932481440, value=.00000
facts:Other_Vegetables timestamp=1399932481440, value=.00000
facts:Portion_Amount timestamp=1399932481440, value=1.00000
facts:Portion_Default timestamp=1399932481440, value=2.00000
facts:Portion_Display_Name timestamp=1399932481440, value=cup
facts:Saturated_Fats timestamp=1399932481440, value=1.91092
facts:Solid_Fats timestamp=1399932481440, value=24.14304
facts:Soy timestamp=1399932481440, value=.00000
facts:Starchy_vegetables timestamp=1399932481440, value=.00000
facts:Vegetables timestamp=1399932481440, value=.00000
facts:Whole_Grains timestamp=1399932481440, value=.00000

Linkedin Twitter facebook github

Advertisements

Seven Databases in Seven Weeks – Hbase Day 1

Hbase is a columnar NoSQL database.
The first day of Hbase was short and clear.
Installing it was easy. No issues whatsoever.
The examples simulated some wiki pages with revisions.
It was fairly easy.

Installation
I found a really easy tutorial on how to install Hbase on Fedora:
http://tutorialforlinux.com/2014/03/18/how-to-getting-started-with-apache-hbase-on-fedora-19-20-21-3264bit-linux-easy-guide/

Hbase will usually work on several (many) servers. It is recommended to run it with at least 5 machines.
However, it’s possible to run it on a single machine for POC / learning purposes. I am using an old, weak laptop, and Hbase works just fine.

JRuby Script
Part of the learning consists of understanding JRuby, as some scripts and exercises use it.

To load a JRuby script into the Hbase shell, run something like:
/opt/hbase-latest/bin/hbase org.jruby.Main PATH-TO-SCRIPT

The example script: put_multiple_columns initially didn’t work. I think it’s due to different versions.
In the book’s forum I found a similar question and an answer for that problem:
http://forums.pragprog.com/forums/202/topics/11494

I uploaded the working script to GitHub: GitHub-put_multiple_columns.rb

Day 1 Material
Under GitHub, some links, material and homework answers.
https://github.com/eyalgo/seven-dbs-in-seven-weeks/tree/master/hbase/day_1

Day 1 Homework
The exercise is more of a JRuby / Ruby and less of Hbase.

def put_many( table_name, row, column_values )
  import 'org.apache.hadoop.hbase.client.HTable'
  import 'org.apache.hadoop.hbase.client.Put'
  import 'org.apache.hadoop.hbase.HBaseConfiguration'

  def jbytes( *args )
    args.map { |arg| arg.to_s.to_java_bytes }
  end

  puts( @hbase )
  conf = HBaseConfiguration.new
  table = HTable.new( conf, table_name )
  p = Put.new( *jbytes( row ) )
  
  column_values.each do |key, value|
    (key_family, key_name) = key.split(':')
    key_name ||= ""
    p.add( *jbytes( key_family, key_name, value ))
  end
  
  table.put( p )
end

Day 2, working with big data looks really interesting…

Linkedin Twitter facebook github