HBase Writer version 0.98.7 is now released. This version includes several changes that make it easier to use and more stable.

Easier Library Usage:
Support was added to make extending HBase Writer much easier. Previously, to add custom logic to HBase Writer, the user had to extend the HBaseWriter class (among other classes) or fork the project and maintain a separate branch. With this update the user can extend the HBaseWriterProcessor class and reference the new class from the Heritrix job config. That's it. Here is an example of how to extend it:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.archive.io.RecordingInputStream;
import org.archive.io.RecordingOutputStream;
import org.archive.io.ReplayInputStream;
import org.archive.modules.CrawlURI;
// HBaseWriterProcessor, HBaseWriter and HBaseParameters come from the HBase Writer jar.

public class MyHBaseWriterProcessor extends HBaseWriterProcessor {
	@Override
	public void modifyPut(final HBaseParameters hBaseParameters, final CrawlURI curi, final String ip, Put put,
			RecordingOutputStream recordingOutputStream, RecordingInputStream recordingInputStream) throws IOException {
		// To access the client request data
		ReplayInputStream requestStream = recordingOutputStream.getReplayInputStream();
		byte[] req = HBaseWriter.getByteArrayFromInputStream(requestStream, (int) recordingOutputStream.getSize());
		// To access the server response data
		ReplayInputStream responseStream = recordingInputStream.getReplayInputStream();
		byte[] res = HBaseWriter.getByteArrayFromInputStream(responseStream, (int) recordingInputStream.getSize());
		// ... do stuff ...
		// To add custom cells to the Put object that is written to the HBase table
		put.add(Bytes.toBytes("mycf"), Bytes.toBytes("myqual"), Bytes.toBytes("myvalue"));
		put.add(Bytes.toBytes("mycf"), Bytes.toBytes("myqual2"), Bytes.toBytes("myvalue2"));
		// You also have direct access to the host IP, the CrawlURI object and the HBaseParameters object.
	}
}

 

New Features:
A new feature was added to log Heritrix annotations, if any are encountered during the crawl. For example, if a fetched URL has data larger than the configured max size limit, no data will be written to HBase, but an annotation cell ("c:an" => "size") will be written instead. The delimiter used when multiple annotations are present is configurable. The default table column names have also been shortened to something more reasonable.
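
To make the annotation cell more concrete, here is a minimal, dependency-free sketch of how several annotations might be joined into the single value stored under "c:an". The class name, the method name, the "example" annotation and the delimiters are all assumptions made for illustration; this is not HBase Writer's actual code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase Writer: shows how a list of
// Heritrix annotations could become one cell value using a configurable delimiter.
final class AnnotationCellSketch {

    // Returns the value for the annotation cell, or null when there is nothing to write.
    static String joinAnnotations(List<String> annotations, String delimiter) {
        if (annotations == null || annotations.isEmpty()) {
            return null;
        }
        return String.join(delimiter, annotations);
    }

    public static void main(String[] args) {
        // A record that exceeded the size limit carries only the "size" annotation.
        System.out.println(joinAnnotations(Arrays.asList("size"), ","));
        // Several annotations are joined with whatever delimiter is configured ("example" is a placeholder).
        System.out.println(joinAnnotations(Arrays.asList("size", "example"), "|"));
    }
}
```

Returning null for an empty list lets the caller skip the annotation cell entirely when a record has no annotations.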

Other Additions:
A wrapper method was added so all data written to the "Put" object gets serialized, but only if a serializer is specified in the config. I am also ensuring both the request and response streams are closed at the end of the write to the table. Both could previously be left open if the custom method was used and the streams were not closed; now this shouldn't be a problem.

A bug in the shouldProcess() method was found and fixed. If a record got an IOException, HBase Writer was logging the record as an error but still trying to process it, because the method was returning 'true'. Now it returns 'false' and, as expected, the record is not processed.

The HBase Writer project now includes a text file containing a list of current dependencies to make it easier to update HBase-Writer dependencies inside of heritrix/lib. Click here for a link to the list. This latest version also checks for null and allows setters for all column name variables.

The <properties> section of the project pom.xml file was moved to the top of the file. All Maven plugin versions in the pom.xml file were updated to their latest versions. Support was added for the latest versions of Hadoop, HBase and Heritrix.
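
The shouldProcess() fix described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's real method: the Record interface, the size check and the class name are hypothetical stand-ins. The only point is the return value in the catch block.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration of the fix; names and the size check are assumptions.
final class ShouldProcessSketch {
    private static final Logger LOG = Logger.getLogger(ShouldProcessSketch.class.getName());

    // Stand-in for a fetched record whose inspection may fail with an IOException.
    interface Record {
        long size() throws IOException;
    }

    // Returns true only when the record was inspected successfully and has content.
    static boolean shouldProcess(Record record) {
        try {
            return record.size() > 0;
        } catch (IOException e) {
            LOG.log(Level.SEVERE, "Could not inspect record; skipping it", e);
            // The bug was the equivalent of returning true here, so a failed
            // record was logged as an error and then processed anyway.
            return false;
        }
    }

    public static void main(String[] args) {
        Record ok = () -> 42L;
        Record broken = () -> { throw new IOException("simulated read failure"); };
        System.out.println(shouldProcess(ok));     // prints "true"
        System.out.println(shouldProcess(broken)); // prints "false" (after logging the error)
    }
}
```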

Resources:
Click here to check out the project website, the source code or to download the jar library. Alternatively, you can configure your Maven project to use the Nexus archive repository hosted by OpenSource Masters; click here for access. Or download the jar and try it out today: click here.

Thank you and Enjoy! :)

-RJ