Seven Databases in Seven Weeks – Hbase Day 1

Hbase is a columnar NoSQL database.
The first day of Hbase was short and clear.
Installing it was easy. No issues whatsoever.
The examples simulated some wiki pages with revisions.
It was fairly easy.

Installation
I found a really easy tutorial on how to install Hbase on Fedora:
http://tutorialforlinux.com/2014/03/18/how-to-getting-started-with-apache-hbase-on-fedora-19-20-21-3264bit-linux-easy-guide/

Hbase will usually work on several (many) servers. It is recommended to run it with at least 5 machines.
However, it’s possible to run it on a single machine for POC / learning purposes. I am using an old, weak laptop, and Hbase works just fine.

JRuby Script
Part of the learning consists of understanding JRuby, as some scripts and exercises use it.

To load a JRuby script into the Hbase shell, run something like:
/opt/hbase-latest/bin/hbase org.jruby.Main PATH-TO-SCRIPT

The example script: put_multiple_columns initially didn’t work. I think it’s due to different versions.
In the book’s forum I found a similar question and an answer for that problem:
http://forums.pragprog.com/forums/202/topics/11494

I uploaded the working script to GitHub: GitHub-put_multiple_columns.rb

Day 1 Material
Under GitHub, some links, material and homework answers.
https://github.com/eyalgo/seven-dbs-in-seven-weeks/tree/master/hbase/day_1

Day 1 Homework
The exercise is more of a JRuby / Ruby and less of Hbase.

def put_many( table_name, row, column_values )
  import 'org.apache.hadoop.hbase.client.HTable'
  import 'org.apache.hadoop.hbase.client.Put'
  import 'org.apache.hadoop.hbase.HBaseConfiguration'

  def jbytes( *args )
    args.map { |arg| arg.to_s.to_java_bytes }
  end

  puts( @hbase )
  conf = HBaseConfiguration.new
  table = HTable.new( conf, table_name )
  p = Put.new( *jbytes( row ) )
  
  column_values.each do |key, value|
    (key_family, key_name) = key.split(':')
    key_name ||= ""
    p.add( *jbytes( key_family, key_name, value ))
  end
  
  table.put( p )
end

Day 2, working with big data looks really interesting…

Linkedin Twitter facebook github

Advertisement

Seven Databases in Seven Days – Riak

In this post I am summarizing the three days of Riak, which is the second database in the Seven Databases in Seven Days book.
This post is actually in order for me to remember some tweaks I had to do while reading this chapter as sometimes the book wasn’t entirely correct.

A good blog, which I used a little, can be found at:
http://blog.wakatta.jp/blog/2011/12/09/seven-databases-in-seven-weeks-riak-day-3/
(this link directs to the 3rd Riak’s day)

I have everything pushed to GitHub as raw material:
https://github.com/eyalgo/seven-dbs-in-seven-weeks

Installing
The book recommends to install using the source code itself.
I needed to install Erlang as well.

Besides the information in the book, the following link was mostly helpful:
http://docs.basho.com/riak/latest/ops/building/installing/from-source/

I installed everything under /usr/local/riak/.

Start / Stop / Restart
A nice command line to start/stop/restart all the servers:

# under /usr/local/riak/riak-1.4.8/dev
for node in `ls`; do $node/bin/riak start; done
# change start to restart or stop

Port
The port which was installed in my machine was: 10018 for dev1, 10028 for dev2 etc.
The port is located in app.config file, under the etc folder.

Day 3 Issues
Pre-commit
I kept getting PUT aborted by pre-commit hook message instead of the one described in the book.
I had to add the language (javascript) to the operation:

curl -i -X PUT http://localhost:10018/riak/animals -H "content-type: application/json" -d '{"props":{"precommit":[{"name":"good_score","language":"javascript"}]}}'

(see: http://blog.sacaluta.com/2012/07/riak-precommit-hook-example.html)

Running a solr query
Running the suggested query from the book
( curl http://localhost:10018/solr/animals/select?wt=json&q=nickname:rin%20breed:shepherd&q.op=and)
kept returning 400 – Bad Request.
All I needed to do was to surround the URL with: ‘ (apostrophe).

Inverted Index
Running the link as mentioned in the book gives bad response:

Invalid link walk query submitted. Valid link walk query format is: ...

The correct way, as described in http://docs.basho.com/riak/latest/dev/using/2i/

curl http://localhost:10018/buckets/animals/index/mascot_bin/butler

Conclusion
Riak chapter gives a taste of this database.
It explains more about the “tooling” of it rather than the application of it.
I feel that it didn’t explain too much on why someone would use it instead of something else (let’s wait for Redis).

The book had errors in how to run commands.
I had to find by myself how to fix these problems.
Perhaps it’s because I’m reading eBook (PDF on my computer and mobi on my Kindle), and the hard-copy has less issues.
The good part of this problem, is that I had to drill down and read more online and learn more from those mistakes.

Linkedin Twitter facebook github

Ease at Work – A Talk by Kent Beck

The other day I was watching Kent Beck’s talk about Ease at Work.
It touched me deeply and I could really relate to what he described.
http://www.infoq.com/presentations/self-image

In order for me to remember, I decided to summarize the key points of the presentation.

The concept of Ease
There are many meanings to the word Ease, but in this case it’s not something like:
“I want to be rich and sit all day on the beach”.
Usually, us programmers would like to pick up hard and challenging work.

Here are the three points that relate to Ease in our case.

  1. State of comfort
  2. Freedom from worry, pain, or agitation
  3. Readiness in performance, facility

The pendulum inside us
People tend to have emotions, feelings and state of mind that change like a pendulum.
We sometimes feel at the top of the world. “I am the best programmer in the world and can solve anything”. This is one side.
Other times we feel deeply at the bottom: “I am soooo bad on what I do. I should stop programming and work on something completely different”.

There’s a huge range between the two sides of the pendulum.

So, being at ease would mean bring the amplitude of the swing down.
Decrease the range.

Once the pendulum swings at lower range, I will feel better on what I’m doing.

List of things that help being at ease
In his talk, Kent explains each point. I will just list the points.

  1. My work matters
  2. My code works
  3. I’m proud of my work
  4. Making public commitments
  5. Accountability (I am accountable)
  6. I interpret feedback
  7. I am a beginner – I don’t know everything
  8. Meditation
  9. I serve. No expectations of reward

These points are reminder for me. Whenever I don’t feel at ease, I should check what’s wrong. These points are a good starting point.

My personal addition
I thought of another two points:

  1. Clear vision of the future
  2. A good developer (well, any employee) wants to know that he/she has a career path.
    I will be more at ease if “I know where I’m going from here”

  3. Professional and flourishing environment
  4. A good developer will feel much more comfortable (at ease) in an environment of people he/she can learn from.
    The team should be diverse enough so everyone can learn from everyone else and mentor others.
    A good developer will feel at ease if he/she is surrounded with members who have the same passion as him (for technology, clean-code or whatever)

Linkedin Twitter facebook github

Why Abstraction is Really Important

Abstraction
Abstraction is one of the key elements of good software design.
It helps encapsulate behavior. It helps decouple software elements. It helps having more self-contained modules. And much more.

Abstraction makes the application extendable in much easier way. It makes refactoring much easier.
When developing with higher level of abstraction, you communicate the behavior and less the implementation.

General
In this post, I want to introduce a simple scenario that shows how, by choosing a simple solution, we can get into a situation of hard coupling and rigid design.

Then I will briefly describe how we can avoid situation like this.

Case study description
Let’s assume that we have a domain object called RawItem.

public class RawItem {
    private final String originator;
    private final String department;
    private final String division;
    private final Object[] moreParameters;
    
    public RawItem(String originator, String department, String division, Object... moreParameters) {
        this.originator = originator;
        this.department = department;
        this.division = division;
        this.moreParameters = moreParameters;
    }
}

The three first parameters represent the item’s key.
I.e. An item comes from an originator, a department and a division.
The “moreParameters” is just to emphasize the item has more parameters.

This triplet has two basic usages:
1. As key to store in the DB
2. As key in maps (key to RawItem)

Storing in DB based on the key
The DB tables are sharded in order to evenly distribute the items.
Sharding is done by a hash key modulo function.
This function works on a string.

Suppose we have N shards tables: (RAW_ITEM_REPOSITORY_00, RAW_ITEM_REPOSITORY_01,..,RAW_ITEM_REPOSITORY_NN),
then we’ll distribute the items based on some function and modulo:

String rawKey = originator + "_"  + department + "_" + division;
// func is String -> Integer function, N = # of shards
// Representation of the key is described below
int shard = func(key)%N;

Using the key in maps
The second usage for the triplet is mapping the items for fast lookup.
So, when NOT using abstraction, the maps will usually look like:

Map<String, RawItem> mapOfItems = new HashMap<>();
// Fill the map...

“Improving” the class
We see that we have common usage for the key as string, so we decide to put the string representation in the RawItem.

// new member
private final String key;

// in the constructor:
this.key = this.originator + "_" + this.department + "_"  + this.division;

// and a getter
public String getKey() {
  return key;
}

Assessment of the design
There are two flows here:
1. Coupling between the sharding distribution and the items’ mapping
2. The mapping key is strict. any change forces change in the key, which might introduce hard to find bugs

And then comes a new requirement
Up until now, the triplet: originator, department and division made up a key of an item.
But now, a new requirement comes in.
A division can have subdivision.
It means that, unlike before, we can have two different items from the same triplet. The items will differ by the subdivision attribute.

Difficult to change
Regarding the DB distribution, we’ll need to keep the concatenated key of the triplet.
We must keep the modulo function the same. So distribution will remain using the triplets, but the schema will change and hava ‘subdivision’ column as well.
We’ll change the queries to use the subdivision together with original key.

In regard to the mapping, we’ll need to do a massive refactoring and to pass an ItemKey (see below) instead of just String.

Abstraction of the key
Let’s create ItemKey

public class ItemKey {
    private final String originator;
    private final String department;
    private final String division;
    private final String subdivision;

    public ItemKey(String originator, String department, String division, String subdivision) {
        this.originator = originator;
        this.department = department;
        this.division = division;
        this.subdivision = subdivision;
    }

    public String asDistribution() {
        return this.originator + "_" + this.department + "_"  + this.division;
    }
}

And,

Map<ItemKey, RawItem> mapOfItems = new HashMap<>();
// Fill the map...
    // new constructor for RawItem
    public RawItem(ItemKey itemKey, Object... moreParameters) {
        // fill the fields
    }

Lesson Learned and conclusion
I wanted to show how a simple decision can really hurt.

And, how, by a small change, we made the key abstract.
In the future the key can have even more fields, but we’ll need to change only the inner implementation of it.
The logic and mapping usage should not be changed.

Regarding the change process,
I haven’t described how to do the refactoring, as it really depends on how the code looks like and how much is it tested.
In our case, some parts were easy, while others were really hard. The hard parts were around code that was looking deep in the implementation of the key (string) and the item.

This situation was real
We actually had this flow in our design.
Everything was fine for two years, until we had to change the key (add the subdivision).
Luckily all of our code is tested so we could see what breaks and fix it.
But it was painful.

There are two abstraction that we could have initially implement:
1. The more obvious is using a KEY class (as describe above). Even if it only has one String field
2. Any map usage need to be examined whether we’ll benefit by hiding it using abstraction

The second abstraction is harder to grasp and to fully understand and implement.

So,
do abstraction, tell a story and use the interfaces and don’t get into details while telling it.

Linkedin Twitter facebook github