Seven Databases in Seven Weeks – HBase Day 2

This post is a recap of the second day of HBase from the book Seven Databases in Seven Weeks.

Most of the commands and scripts can be found at GitHub:
https://github.com/eyalgo/seven-dbs-in-seven-weeks/tree/master/hbase/day_2

Streaming Script
The first task on day 2 was to download lots of data (big data) and stream it into HBase.
There’s a JRuby script for that, which I had to alter to get it working.
https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/import_from_wikipedia.rb

After altering the script, I added compression to the column family, as the book suggests.
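
The exact schema change is not shown in this post. As a rough sketch only, the statements for the HBase shell could be generated like this. The table name wiki, the column family name text and the GZ algorithm are all assumptions here, and compression_statements is a made-up helper name. In HBase 0.94 a table normally has to be disabled before its schema is altered, which is why the alter is wrapped in disable/enable.

```shell
# Print the HBase shell statements that enable compression on one column family.
# Arguments: table name, column family name. GZ is an assumed choice of algorithm.
compression_statements() {
  printf "disable '%s'\n" "$1"
  printf "alter '%s', {NAME => '%s', COMPRESSION => 'GZ'}\n" "$1" "$2"
  printf "enable '%s'\n" "$1"
}

# Usage (needs a running HBase; 'wiki' and 'text' are assumed names):
#   compression_statements wiki text | /opt/hbase/hbase-0.94.18/bin/hbase shell
```
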
After that, I could run the script:

curl http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_from_wikipedia.rb

curl http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell import_from_wikipedia.rb

This is the output while the script runs (curl’s progress meter is interleaved with the script’s messages):

1 10.0G    1  128M    0     0   456k      0  6:23:37  0:04:48  6:18:49  817k19000 records inserted (Serotonin)
  1 10.0G    1  131M    0     0   461k      0  6:19:03  0:04:51  6:14:12  921k19500 records inserted (Serotonin specific reuptake inhibitors)
  1 10.0G    1  135M    0     0   469k      0  6:12:45  0:04:54  6:07:51 1109k20000 records inserted (Tennis court)
  1 10.0G    1  138M    0     0   477k      0  6:06:12  0:04:57  6:01:15 1269k20500 records inserted (Tape drive)

The next part in this chapter talks about regions and some other plumbing stuff.

Build links table
In this part, the source is the large wiki table and the output is the ‘links’ table.
Each link has a ‘From:’ and a ‘To:’.
Here’s a link to the altered, working script:
https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/generate_wiki_links.rb

The rest of the chapter shows how to look at the data, count it and more.

Homework
The main part of the homework was to create a new table, ‘foods’, that takes its data from an XML file, which can be downloaded from the US’s health & nutrition site.
This data shows the nutrition facts for each type of food.

I decided to create a very simple table with one column family, facts, and no special options.
Every field of a row in the XML file becomes a column in facts.
I also decided that the row’s key would be the Display_Name. After all, it’s much easier to look up a food by its name than by some ID.

create 'foods' , 'facts'

In order to see how I should create the script I looked at two sources:
1. The script that imported data for the Wiki table
2. One element (food) from the XML

Here’s one element:

<Food_Display_Row>
  <Food_Code>12350000</Food_Code>
  <Display_Name>Sour cream dip</Display_Name>
  <Portion_Default>1.00000</Portion_Default>
  <Portion_Amount>.25000</Portion_Amount>
  <Portion_Display_Name>cup </Portion_Display_Name>
  <Factor>.25000</Factor>
  <Increment>.25000</Increment>
  <Multiplier>1.00000</Multiplier>
  <Grains>.04799</Grains>
  <Whole_Grains>.00000</Whole_Grains>
  <Vegetables>.04070</Vegetables>
  <Orange_Vegetables>.00000</Orange_Vegetables>
  <Drkgreen_Vegetables>.00000</Drkgreen_Vegetables>
  <Starchy_vegetables>.00000</Starchy_vegetables>
  <Other_Vegetables>.04070</Other_Vegetables>
  <Fruits>.00000</Fruits>
  <Milk>.00000</Milk>
  <Meats>.00000</Meats>
  <Soy>.00000</Soy>
  <Drybeans_Peas>.00000</Drybeans_Peas>
  <Oils>.00000</Oils>
  <Solid_Fats>105.64850</Solid_Fats>
  <Added_Sugars>1.57001</Added_Sugars>
  <Alcohol>.00000</Alcohol>
  <Calories>133.65000</Calories>
  <Saturated_Fats>7.36898</Saturated_Fats>
</Food_Display_Row>

I created the script by examining the wiki script and one element.
The script opens a new document when it sees the opening tag of an element named Food_Display_Row.
When it sees Food_Display_Row as the closing tag, the script writes the document to HBase.

include Java
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
import 'org.apache.hadoop.hbase.HBaseConfiguration'
import 'javax.xml.stream.XMLStreamConstants'

# Child elements of Food_Display_Row that are stored as columns in 'facts'.
# Food_Code is not in the list, so it is skipped.
fields = %w[
  Display_Name Portion_Default Portion_Amount Portion_Display_Name Factor
  Increment Multiplier Grains Whole_Grains Vegetables Orange_Vegetables
  Drkgreen_Vegetables Starchy_vegetables Other_Vegetables Fruits Milk Meats
  Drybeans_Peas Soy Oils Solid_Fats Added_Sugars Alcohol Calories Saturated_Fats
]

# Convert each argument to a Java byte array, which is what Put works with
def jbytes( *args )
  args.map { |arg| arg.to_s.to_java_bytes }
end

factory = javax.xml.stream.XMLInputFactory.newInstance
reader = factory.createXMLStreamReader(java.lang.System.in)

document = nil
buffer = nil
count = 0

conf = HBaseConfiguration.new
table = HTable.new( conf, "foods" )
table.setAutoFlush( false ) # collect puts on the client until flushCommits

while reader.has_next
  type = reader.next

  if type == XMLStreamConstants::START_ELEMENT
    name = reader.local_name
    if name == 'Food_Display_Row'
      document = {} # a new food starts
    elsif fields.include?( name )
      buffer = []   # start collecting the text of this field
    end

  elsif type == XMLStreamConstants::CHARACTERS
    buffer << reader.text unless buffer.nil?

  elsif type == XMLStreamConstants::END_ELEMENT
    name = reader.local_name
    if fields.include?( name )
      document[name] = buffer.join
      buffer = nil
    elsif name == 'Food_Display_Row'
      # The food is complete: write it as one row, keyed by Display_Name
      p = Put.new( document['Display_Name'].to_java_bytes )
      fields.each do |field|
        p.add( *jbytes( "facts", field, document[field] ) )
      end
      table.put( p )

      count += 1
      table.flushCommits() if count % 10 == 0
      if count % 500 == 0
        puts "#{count} records inserted (#{document['Display_Name']})"
      end
    end
  end
end

table.flushCommits()
exit

Following are the shell commands that take the XML file and stream it into HBase.
The first command runs against the file with the single element.
After I verified that the result was correct, I ran it against the full file.

curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/food-display-example.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb

curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/MyFoodapediaData/Food_Display_Table.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb
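
The single-element file used by the first command can be cut out of the full download. How it was actually produced is not stated here; the following is only a sketch, assuming every tag sits on its own line as in the element shown earlier. The function name first_food_row is made up for the example.

```shell
# Print the first <Food_Display_Row> element of an XML file.
# Assumes the opening and closing tags are on their own lines.
first_food_row() {
  awk '
    /<Food_Display_Row>/   { inside = 1 }         # the first element starts here
    inside                 { print }              # copy every line of the element
    /<\/Food_Display_Row>/ { if (inside) exit }   # stop after its closing tag
  ' "$1"
}

# Usage:
#   first_food_row MyFoodapediaData/Food_Display_Table.xml > food-display-example.xml
```

A single Food_Display_Row is a well-formed XML document on its own, so the result can be streamed to the import script without a wrapping root element.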

Let’s get some food…

get 'foods' , 'fruit smoothie made with milk'

And the result:

COLUMN CELL
facts:Added_Sugars timestamp=1399932481440, value=82.54236
facts:Alcohol timestamp=1399932481440, value=.00000
facts:Calories timestamp=1399932481440, value=197.96000
facts:Display_Name timestamp=1399932481440, value=fruit smoothie made with milk
facts:Drkgreen_Vegetables timestamp=1399932481440, value=.00000
facts:Drybeans_Peas timestamp=1399932481440, value=.00000
facts:Factor timestamp=1399932481440, value=1.00000
facts:Fruits timestamp=1399932481440, value=.56358
facts:Grains timestamp=1399932481440, value=.00000
facts:Increment timestamp=1399932481440, value=.25000
facts:Meats timestamp=1399932481440, value=.00000
facts:Milk timestamp=1399932481440, value=.22624
facts:Multiplier timestamp=1399932481440, value=.25000
facts:Oils timestamp=1399932481440, value=.00808
facts:Orange_Vegetables timestamp=1399932481440, value=.00000
facts:Other_Vegetables timestamp=1399932481440, value=.00000
facts:Portion_Amount timestamp=1399932481440, value=1.00000
facts:Portion_Default timestamp=1399932481440, value=2.00000
facts:Portion_Display_Name timestamp=1399932481440, value=cup
facts:Saturated_Fats timestamp=1399932481440, value=1.91092
facts:Solid_Fats timestamp=1399932481440, value=24.14304
facts:Soy timestamp=1399932481440, value=.00000
facts:Starchy_vegetables timestamp=1399932481440, value=.00000
facts:Vegetables timestamp=1399932481440, value=.00000
facts:Whole_Grains timestamp=1399932481440, value=.00000



Git Pull Requests Using GitHub

Old Habits

We’ve been working with git for more than a year.
The SCM was migrated from SVN, with all its history.
Our habits were migrated as well.

Our flow is (was) fairly simple:
The master branch is where we deploy our code from.
When working on a feature, we create a feature branch. Several people can work on this branch.
Some create a private local branch. Some don’t.
Code review is done one-on-one. One member asks another to join and walks them through the code.

Introducing Pull Request to the Team

Recently, with the help of a teammate, I introduced the pull request concept to the team.
It takes some time to grasp the methodology and see the benefits.
However, I already see improvements in collaboration, code quality and coding behaviors.

Benefits

  1. Better collaboration
    When a person makes a change and opens a pull request, the entire team can see the change.
    Everyone can comment and give remarks.
    Changes are discussed before they are merged into the main branch.
  2. Code ownership
    Everyone knows about the change, and anyone can check it and comment. As a result, everyone can “own” the code.
    It helps each team member take part in writing and reviewing any piece of code.
  3. Branches organization
    There’s an extra review of the code before it is merged.
    Branches can be (IMHO should be) deleted after the feature is merged.
    The git history (the log) is clearer. (This one depends entirely on the quality of the commit messages.)
  4. Improved code quality
    I see that it improves the code quality even before the code review.
    People don’t want to introduce bad code when they know that everyone can see it.
  5. Better code review
    We’ve been doing extensive code review since the beginning of the project.
    However, as I explained above, we did it one-on-one, and usually the writer explained the code to the reviewer.
    In my view, by doing that we miss the advantages of code review. The quality of the review decreases when the writer explains the material to the reviewer.
    With a pull request, if the reviewer does not understand something, it means that perhaps the code is not clean enough.
    The result is more remarks and comments and, thus, better code.
  6. Mentoring
    When a senior reviews a junior’s code one-on-one, nobody else sees it.
    It’s more difficult for the senior to showcase the expectations of how the code should look and how code review should be performed.
    (There are of course other ways of passing this on, like code dojos and pair programming, although the latter is also one-on-one.)
    By reading the review comments in the pull request, the team can see what’s important and how to review.
    Everyone benefits from the reviews of other team members.
  7. Improved git usage habits
    When someone collaborates with the whole team, he/she will probably write better commit messages.
    The commits will be smaller and more frequent, as no one wants to read a huge diff and no one wants to “upset” the team.
    Using pull requests forces the usage of branches, which improves the git history.

Objections

Others may call this section “disadvantages”.
But the way I see it, these are more like complaints of the kind “why do we need this? We were fine with how things were until now”.

  1. I get too many emails already
    Well, this is true. Using pull requests, we start getting many more emails, which is annoying.
    There’s too much noise. I might not notice important emails.
    The answer to that is simple:
    If you are part of this feature, then the email is important because it is about code changes in parts that you are working on.
    If you want to stop receiving emails for a particular pull request, you can mute it.
    Mute Thread

  2. If we start emailing, we’ll stop talking to each other
    I disagree with this statement.
    It will probably reduce the one-on-one review talks. But in my (short) experience, it improved our verbal discussions.
    The verbal discussion comes after the reviewer has looked at the code change. Only if the reviewer did not understand something will she approach the developer.
    The one-on-one discussions are much more efficient and ‘to the point’.
  3. Ahh! I need to think of better commit comments. Now I have more to think about
    This is good, isn’t it?
    By using pull requests, each team member needs to improve the way commit messages are written.
    It will also improve git habits, in terms of smaller and more frequent commits.
  4. It’s harder to understand. I prefer that the other developer explains the intentions to me
    Don’t we miss important advantages of code review by getting a walkthrough from the writer?
    I mean, if I need an explanation of what the code does, then we’d better fix that code.
    So, if it’s hard to understand, I can write my comments until it improves.

How?

In this section I will briefly explain how we chose to use pull requests.
The screenshots are taken from GitHub, although Bitbucket supports pull requests as well.

Branching From the “main” Branch

I intentionally did not write ‘master’.
Let’s say that I work on some feature in a branch called FEATURE_A (for me, this is the main branch).
This branch was created from master.
Let’s say that I need to implement some kind of sub-feature in FEATURE_A.
Example (extremely simple): Add toString to class Person.
Then I will create a branch (locally, out of FEATURE_A):

# On branch FEATURE_A, after pull from remote do:
# git checkout -b <branch-name-with-good-description>
git checkout -b FEATURE_A_add_toString_Person

# In order to push it to remote (GitHub), run this:
# git push -u origin <branch-name-with-good-description>
git push -u origin FEATURE_A_add_toString_Person
# The branch can also be pushed later

Doing a Pull Request

After some work on the branch, and after pushing it to GitHub, I can open a pull request.
There are a few ways of doing it.
The one I find “coolest” is using a button/link in GitHub for opening the pull request.
When I open the repository on GitHub’s website, it shows a clickable notice for the last branch that I pushed to.
After the pull request is sent, all team members receive an email.
You can also assign a specific person to the pull request if you want him/her to do the actual code review.

Compare & Pull Request


Assign Menu

Changing the Branch for the diff

By default, GitHub offers to open the pull request against the master branch.
As explained above, sometimes (usually?) we’ll want to diff/merge against some feature branch and not master.
In the pull request dialog, you can select the branch you want to compare your working branch to.

Edit Diff Branch

Code Review and Discussion

Any commit pushed to the branch will be added to the pull request.
Any team member can add a comment. You can add it at the bottom of the discussion.
Or, a really nice option, you can add a comment on a specific line of code.

Several Commits in Pull Request

Merging and Deleting the Branch

After the discussion and some more pushed code, everyone is satisfied and the code can be merged.
GitHub will tell you whether your working branch can be merged into the main (diff) branch of that pull request.
Sometimes the branches can’t be merged automatically.
In that case, we do the merge locally, fix conflicts (if any) and then push again.
We try to remember to do it often, so usually GitHub tells us that the branches can be merged automatically.
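
The local merge described above can be sketched as a small shell function. This is only a sketch under a few assumptions: update_from_base is a made-up name, the remote is called origin, and the function runs while the working branch is checked out. It uses fetch followed by merge, which is roughly what git pull does by default.

```shell
# Bring the latest state of the base branch into the current working branch
# and push the result, so that the pull request becomes mergeable again.
# Argument: the base (diff) branch of the pull request, e.g. FEATURE_A.
update_from_base() {
  git fetch origin || return 1
  if git merge --no-edit "origin/$1"; then
    # Clean merge: publish the current branch under the same name
    git push origin HEAD
  else
    echo "Conflicts: fix the files, then run 'git add', 'git commit' and 'git push'." >&2
    return 1
  fi
}

# Usage, while on the working branch:
#   update_from_base FEATURE_A
```
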

Branches can be automatically merged


Confirm Merge


After the pull request is merged, it is automatically closed.
If you are finished, you can delete the branch.
Post Merge / Closed Screen

Who’s Responsible?

People asked

  • Who should merge?
  • Who should delete the branch?

We found that it makes the most sense for the person who initiated the pull request to merge and delete.
The merge is done only after the reviewer has given the OK.

Helpful git Commands

Here’s a list of helpful git commands we use.

# Merge one branch (from the remote) into another
# I'm on branch BRANCH_A and want to merge every pushed change from BRANCH_B
git pull origin BRANCH_B

# Show the branches on the remote
git remote show origin

# List the stale remote-tracking branches (already deleted on the remote) that would be pruned
git remote prune origin --dry-run

# Actually remove all those stale remote-tracking branches
git remote prune origin

# Delete the local branch
git branch -d <branch-name>
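
The list above deletes the local branch and prunes stale tracking references, but it does not delete the branch on the remote itself. GitHub offers a button for that once the pull request is merged; from the command line, a sketch could look like this. The name delete_merged_branch is made up, the remote is assumed to be called origin, and the function should be run from the branch that the work was merged into.

```shell
# Delete a merged branch both on the remote and locally.
# Argument: the name of the branch to delete.
delete_merged_branch() {
  # Remove the branch from the remote first ...
  git push origin --delete "$1" &&
  # ... then locally; -d refuses to delete a branch that is not merged yet
  git branch -d "$1"
}

# Usage, while on FEATURE_A after the merge:
#   delete_merged_branch FEATURE_A_add_toString_Person
```
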

Resources

https://help.github.com/articles/using-pull-requests
https://www.atlassian.com/git/workflows#!pull-request

Enjoy…
