General

HBase Writer version 0.98.7 is now released.  There are several changes to this version making it easier to use and more stable. 

Easier Library Usage:
Support was added to make extending HBase Writer much easier.  Previously, to add custom logic to HBase Writer, the user had to extend the HBaseWriter class among other classes or fork and maintain a separate branch.  With this update the user can extend the HBaseWriterProcessor class and reference the new class from the Heritrix job config.  That's it.  Here is an example of how to extend.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.archive.io.RecordingInputStream;
import org.archive.io.RecordingOutputStream;
import org.archive.io.ReplayInputStream;
// Imports for CrawlURI and the HBase Writer classes depend on the Heritrix and HBase Writer versions in use.

public class MyHBaseWriterProcessor extends HBaseWriterProcessor {
    @Override
    public void modifyPut(final HBaseParameters hBaseParameters, final CrawlURI curi, final String ip, Put put,
            RecordingOutputStream recordingOutputStream, RecordingInputStream recordingInputStream) throws IOException {
        // To access the client request data:
        ReplayInputStream requestStream = recordingOutputStream.getReplayInputStream();
        byte[] req = HBaseWriter.getByteArrayFromInputStream(requestStream, (int) recordingOutputStream.getSize());
        // To access the server response data:
        ReplayInputStream responseStream = recordingInputStream.getReplayInputStream();
        byte[] res = HBaseWriter.getByteArrayFromInputStream(responseStream, (int) recordingInputStream.getSize());
        // ... do stuff ...
        // To add custom cells to the Put object that gets written to the HBase table:
        put.add(Bytes.toBytes("mycf"), Bytes.toBytes("myqual"), Bytes.toBytes("myvalue"));
        put.add(Bytes.toBytes("mycf"), Bytes.toBytes("myqual2"), Bytes.toBytes("myvalue2"));
        // You also have direct access to the host IP, the CrawlURI object and the HBaseParameters object.
    }
}

 

New Features:
A new feature was added to log Heritrix annotations, if any are encountered during the crawl.  For example, if a fetched URL has data larger than the configured max size limit, no data will be written to HBase, but an annotation cell, "c:an" => "size" will get written instead.  The default table column names have been shortened to something more reasonable.  The delimiter used when multiple annotations are present is also configurable.
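To make the annotation cell more concrete, here is a minimal sketch of how several annotations could be combined into a single cell value using a configurable delimiter.  The class name, the method name and the second annotation used in the usage note are hypothetical and only for illustration; this is not HBase Writer's actual code.

```java
import java.util.List;

// Hypothetical helper, not part of HBase Writer: builds the value that
// would be stored in the annotation cell (for example "c:an").
class AnnotationCellValue {
    /** Joins all annotations seen for one URL with the configured delimiter. */
    static String join(List<String> annotations, String delimiter) {
        return String.join(delimiter, annotations);
    }
}
```

With a delimiter of "," and the annotations "size" and "other", the cell value would be "size,other".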

Other Additions:
A wrapper method was added so that all data written to the "Put" object gets serialized, but only if a serializer is specified in the config.

Both the request and response streams are now closed at the end of the write to the table.  Previously, either stream could be left open if the custom method was used and did not close it; this should no longer be a problem.

A bug in the shouldProcess() method was found and fixed.  If a record got an IOException, HBase Writer was logging the record as an error but still trying to process it, because the method was returning 'true'.  Now it returns 'false', so the record is skipped, as expected.

The HBase Writer project now includes a text file listing its current dependencies, to make it easier to update the HBase-Writer dependencies inside of heritrix/lib.  Click here for a link to the list.

This latest version checks for null values and provides setters for all column name variables.  The <properties> section of the project pom.xml file was moved to the top of the file.  All Maven plugin versions in the pom.xml file were updated to their latest versions.  Support was added for the latest versions of Hadoop, HBase and Heritrix.
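The shouldProcess() fix described above can be illustrated with a simplified sketch.  The RecordCheck interface and the class name are hypothetical stand-ins and do not mirror the real HBaseWriterProcessor code; the point is only that an IOException now leads to 'false'.

```java
import java.io.IOException;

// Simplified stand-in for the processor logic; all names are illustrative.
class ShouldProcessSketch {
    /** Hypothetical check that may fail while inspecting a record. */
    interface RecordCheck {
        boolean isWritable() throws IOException;
    }

    static boolean shouldProcess(RecordCheck check) {
        try {
            return check.isWritable();
        } catch (IOException e) {
            // Log the record as an error ...
            System.err.println("Skipping record: " + e.getMessage());
            // ... and skip it. Before the fix, the equivalent path returned true.
            return false;
        }
    }
}
```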

Resources:
Click here to check out the project website, the source code or to download the jar library.  Alternatively, you can configure your Maven project to use the Nexus archive repository hosted by OpenSource Masters; click here for access.  Or download the jar and try it out today; click here.

Thank you and Enjoy! :)

-RJ

HBase-Writer version 0.94.0 has been released and is available for download now.  This version of HBase-Writer continues to support both Heritrix2 & Heritrix3 (3.1.1) and has been tested against the latest release versions of HBase (0.95.1) and Hadoop (1.1.2) and all their dependencies.  An exception handling bug was discovered in the makeWriter() method.  Previously, a RuntimeException was raised without the parent exception being logged; this is fixed in this latest release.  Several new dependencies were added from HBase and they have been added to the README files.  The HBase server I tested on is running 0.95.2, but hbase-writer is built against 0.95.1 because of a RuntimeException raised during unit testing.  Here is the stack trace:

testCreateHBaseWriter(org.archive.io.hbase.TestHBaseWriter)  Time elapsed: 0.366 sec  <<< FAILURE!
java.lang.RuntimeException: hbase-default.xml file seems to be for and old version of HBase (0.95.2-hadoop2), this version is 0.95.2-hadoop1
    at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:70) .....

After reading the hbase mailing list and talking with some developers, it seems to be caused by packaging issues.  These issues should be resolved in v0.96.x.  I didn't debug further to find where the reference to 0.95.2-hadoop2 is coming from, but 0.95.1-hadoop1 builds and passes the instance creation test, so the latest release of hbase-writer has this version set in the Maven build file (pom.xml).  Here are the jar dependencies I needed to copy from my test hbase installation (v0.95.2-hadoop1 running hadoop v1.1.2) into my test heritrix installation (v3.1.1):

cp hbase-writer-0.94.0.jar heritrix/lib/

cp hbase/lib/hbase-common-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/hbase-protocol-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/hbase-server-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/hbase-client-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/protobuf-java-2.4.1.jar heritrix/lib/
cp hbase/lib/commons-configuration-1.6.jar heritrix/lib/
cp hbase/lib/slf4j-api-1.6.4.jar heritrix/lib/
cp hbase/lib/htrace-core-2.00.jar heritrix/lib/
cp hbase/lib/jackson-mapper-asl-1.8.8.jar heritrix/lib/
cp hbase/lib/jackson-core-asl-1.8.8.jar heritrix/lib/

In hbase-writer trunk I have added a "jar-with-dependencies" goal, so hbase-writer and all of its dependencies can be packaged into a single jar that you can copy over to heritrix/lib.

After adding the bean configuration described in hbase-writer's README for Heritrix3, you should be able to start up Heritrix3 and use the Heritrix3 web-ui to make crawls that write to hbase tables. 

Happy crawling..... Thank you and Enjoy :)

HBase-Writer version 0.90.4 has been released and is available for download now.  This version of HBase-Writer continues to support both Heritrix2 & Heritrix3 and contains two major bug fixes.  Connections and resources were not being pooled, because the pooling was unknowingly removed in the last update.  Connections were also not being closed properly, creating the potential for the application to hang with an out-of-memory error.  It is highly recommended that you switch to this new version if you are using an older version of the plugin.  Both issues are addressed by a patch submitted, once again, by Greg Lu.  Thank you to Greg for giving back to the open source community.  Next on the TODO list is to add unit tests that check for connection and resource pooling.  In the meantime, for future releases, I will add JMX support to my testing instance and use JConsole to monitor the object creation count over the course of a few crawls.  This should help ensure pooling is being used.  Thanks for checking it out and Enjoy! :)
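As a rough idea of what a pooling check could look like, the sketch below uses a deliberately simplified, single-threaded pool that counts how many writers it creates and always returns a borrowed writer in a finally block.  The CountingPool and Writer classes are hypothetical and are not HBase Writer's or Heritrix's actual pool classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified pool used only to illustrate the idea.
// It is not thread-safe and is not the pool HBase Writer uses.
class CountingPool {
    /** Stand-in for a pooled writer. */
    static class Writer {
        void write(String record) { /* no-op in this sketch */ }
    }

    private final Deque<Writer> idle = new ArrayDeque<>();
    private int created = 0;

    Writer borrow() {
        Writer w = idle.poll();
        if (w == null) {          // nothing idle, so create a new writer
            w = new Writer();
            created++;
        }
        return w;
    }

    void giveBack(Writer w) {
        idle.push(w);
    }

    int createdCount() {
        return created;
    }

    /** Writes one record and always returns the writer to the pool. */
    void writeRecord(String record) {
        Writer w = borrow();
        try {
            w.write(record);
        } finally {
            giveBack(w);          // skipping this step is what leaks resources
        }
    }
}
```

If pooling works, writing many records one after another should leave createdCount() at 1; a count that grows with every record suggests writers are not being returned.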

HBase-Writer version 0.90.3 has been released and is available for download now.  This version of HBase-Writer continues to have support for both Heritrix2 & Heritrix3.  HBase-Writer is now using the WARCWriterPool from Heritrix I/O instead of its own implementation.   The README file changed slightly, mainly to be compatible with Spring 3.x since the last version of Heritrix was using Spring 2.x.  Be sure to check the README file for the rest of the details.  Much thanks to Karthik MV for submitting the initial compatibility patch.  Anyone is free to create Issues if you want to see support added for something or if you have a bug to report.  Thanks for checking it out and Enjoy! :)

HBase-Writer version 0.9-SNAPSHOT has been released and is available for download now.  This version of HBase-Writer has support for both Heritrix2 & Heritrix3.  The README file changed and some new ones were added.  Be sure to check them out for new Heritrix3 support.  Much thanks to Greg Lu for spearheading this effort and sending in the initial patch.  Once Heritrix has an official 3.0.0-RELEASE, HBase-Writer will release version 0.9-RELEASE.  Feel free to create Issues if you want to see support added for something or if you have a bug to report.  Thanks for checking it out and Enjoy! :)

HBase-Writer version 0.20.3 has been released and is available for download now.  This version of HBase-Writer has a new runtime dependency:  ZooKeeper.  This is because HBase-0.20.X now depends on ZooKeeper to manage configuration and connection information.  This version has been tested on a few Heritrix2-2.0.2 crawls on Hadoop 0.20.1, HBase 0.20.1 and ZooKeeper 3.2.1, and works fine as far as my tests go.  There are two main differences to be aware of when upgrading from 0.19.x to 0.20.x:

  1. In the global sheet configuration for your heritrix job, there is no "master" address for HBaseWriterProcessor anymore.  Instead you need to provide a comma-separated list of zookeeper hosts that make up the zookeeper quorum (zkquorum).  Heritrix will talk to ZooKeeper to determine the master address of HBase.  This has been done by HBase in 0.20.x to avoid the Master node being a SPOF (single point of failure).  Support for an alternate zk client port has been added as well.
  2. You need to add the zookeeper.jar to the lib/ folder.  The zookeeper jar is included with the HBase distribution, or you can download it from the OSM Archive Repository.
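As a sketch of how these two settings relate to HBase itself: the HBase client reads the quorum and client port from the "hbase.zookeeper.quorum" and "hbase.zookeeper.property.clientPort" properties.  The helper below is hypothetical and uses a plain Map in place of a real HBase configuration object; how HBase-Writer passes the values on internally is an assumption here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns the two job settings into HBase client properties.
class ZkSettingsSketch {
    static Map<String, String> toHBaseProperties(String zkQuorum, int zkClientPort) {
        Map<String, String> props = new LinkedHashMap<>();
        // Strip stray whitespace so "zk1, zk2" becomes "zk1,zk2".
        props.put("hbase.zookeeper.quorum", zkQuorum.replaceAll("\\s+", ""));
        props.put("hbase.zookeeper.property.clientPort", Integer.toString(zkClientPort));
        return props;
    }
}
```

The host names you pass in (for example "zk1,zk2,zk3") are whatever machines run your ZooKeeper quorum; 2181 is ZooKeeper's usual default client port.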

The other changes in this version were under-the-hood.  The BatchUpdate API has been deprecated in HBase-0.20.x and HBase-Writer is now using the new Put/Get API from HBase to write and manage records when doing crawls.  Feel free to create Issues if you want to see support added for something or if you have a bug to report.  Thanks for checking it out and Enjoy! :)

HBase-Writer version 0.19.1 has been released and is available for download now.  This version has been tested on a few Heritrix2 crawls on Hadoop 0.19.0 and HBase 0.19.0 and runs better now.  This version fixes the new feature introduced in the previous release so that it works properly.  In 0.19.0, if "only_new_records" was set to "true" and duplicate url records were in the hbase table, Heritrix would not download the content.  That sounds fine, except that you then can't crawl any new records, because you have to download the page to get all the links to follow.  This issue would be better solved by Heritrix itself, by overriding extractor classes in Heritrix or by taking snapshots during the crawl so you can pick up where you left off.  So now in hbase-writer version 0.19.1, when "only_new_records" is set to "true", Heritrix will always download the content associated with the crawled urls, but the content will only be written to the given HBase table once.  The next version of hbase-writer will have the option to not download the content if the record in hbase already exists (0.19.0 functionality).
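The 0.19.1 behaviour can be sketched as follows.  The class is a hypothetical illustration that uses an in-memory map in place of the HBase table; it assumes the content has already been downloaded and only decides whether to write it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the "only_new_records" write decision.
class OnlyNewRecordsSketch {
    // Stand-in for the HBase table, keyed by rowkey (the url).
    private final Map<String, byte[]> table = new HashMap<>();
    private final boolean onlyNewRecords;

    OnlyNewRecordsSketch(boolean onlyNewRecords) {
        this.onlyNewRecords = onlyNewRecords;
    }

    /** The content is always fetched by the crawler; returns true if it was written. */
    boolean write(String rowKey, byte[] content) {
        if (onlyNewRecords && table.containsKey(rowKey)) {
            return false;         // record already exists, so skip the write
        }
        table.put(rowKey, content);
        return true;
    }
}
```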

Also important to note: Hadoop uses Java 1.6 now, and so HBase-Writer does as well.  Happy crawling & enjoy!

HBase-Writer version 0.19.0 has been released and is available for download now.  This version has been tested on a few Heritrix2 crawls on Hadoop 0.19.0 and HBase 0.19.0 and runs well.  I was able to add a new feature: "only-new-records".  This boolean option is set to "false" by default and will crawl and write all urls & their content to the given hbase table (as expected).  But by setting this to "true", you ensure that only new urls (rowkeys) are written.

The way it works is that normally, when you crawl the same site more than once, you are adding multiple cells to the various crawl columns (i.e. "content:raw_data", "curi:url", etc.), but each cell will have a different timestamp associated with it.  So, for example, if you crawl the same site 5 times in a row, you will get one rowkey for each url crawled, but 5 occurrences of each column, each with its unique timestamp.  The only exception is that columns updated by the crawler in the same batchUpdate will have the same timestamp.  This is so you know that all cells with the same timestamp came from the same fetch.  So when the HBase-Writer option "only-new-records" is set to "true", you will get no more than one occurrence of each column per rowkey.

This is useful in cases where you want to crawl a site over a long period of time and plan on starting and stopping the crawler many times.  This can also be useful if you want to crawl a site and only get new urls.  Future versions will implement the feature of not downloading the content from the webserver in addition to not writing it to HBase.  This can greatly reduce the load on the webserver you are crawling, as only the header is fetched and needed to determine whether the url already exists.
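To illustrate the timestamp behaviour, here is a simplified in-memory model of one row.  It is not the HBase client API, and it keeps every version, whereas a real table only keeps as many versions as the column family is configured for.  The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Simplified in-memory model of one row with versioned cells.
class VersionedRowSketch {
    // column name -> (timestamp -> value)
    private final Map<String, TreeMap<Long, String>> columns = new HashMap<>();

    /** Stores all cells of one fetch under a single shared timestamp. */
    void recordFetch(long timestamp, Map<String, String> cells) {
        for (Map.Entry<String, String> cell : cells.entrySet()) {
            columns.computeIfAbsent(cell.getKey(), k -> new TreeMap<>())
                   .put(timestamp, cell.getValue());
        }
    }

    /** Number of versions (occurrences) stored for a column. */
    int versionCount(String column) {
        TreeMap<Long, String> versions = columns.get(column);
        return versions == null ? 0 : versions.size();
    }
}
```

Calling recordFetch five times with five different timestamps for the same url leaves five versions in each column, which is the situation "only-new-records" avoids.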

Also important to note: Hadoop uses Java 1.6 now, and so HBase-Writer does as well.  Happy crawling & enjoy!