OpenSource Masters

Welcome to Open Source Masters

Reliable and affordable enterprise services and custom products, from small computer networks and websites to large cluster systems and high-availability application services.

We use Open Source solutions in the following areas:

    Cloud application development and consulting
    Installing High Availability Cluster networks
    Data mining and web scale crawling
    Distributed Computing/Clustering/BigTable implementations
    BigTable/NoSQL and Relational Data Designs
    P2P & Server/Client Network Applications
    Large-scale Production Deployments and Build Management
    Fast Searching and Indexing
    Secure End-To-End Transactions
    Automated Build Process and Deployments
    Testing, Debugging & Troubleshooting processes
    Long-term Data Warehousing


HBase Writer 0.98.7 Released

HBase Writer version 0.98.7 is now released.  This version includes several changes that make it easier to use and more stable.

Easier Library Usage:
Support was added to make extending HBase Writer much easier.  Previously, to add custom logic to HBase Writer, the user had to extend the HBaseWriter class among other classes, or fork and maintain a separate branch.  With this update the user can extend the HBaseWriterProcessor class and reference the new class from the Heritrix job config.  That's it.  Here is an example of how to extend it:

public class MyHBaseWriterProcessor extends HBaseWriterProcessor {
	@Override
	public void modifyPut(final HBaseParameters hBaseParameters, final CrawlURI curi, final String ip, Put put,
			RecordingOutputStream recordingOutputStream, RecordingInputStream recordingInputStream) throws IOException {
		// To access the client request data
		ReplayInputStream requestStream = recordingOutputStream.getReplayInputStream();
		byte[] req = HBaseWriter.getByteArrayFromInputStream(requestStream, (int) recordingOutputStream.getSize());
		// To access the server response data
		ReplayInputStream responseStream = recordingInputStream.getReplayInputStream();
		byte[] res = HBaseWriter.getByteArrayFromInputStream(responseStream, (int) recordingInputStream.getSize());
		// ... do stuff ...
		// To add custom cells to the HBase table Put object
		put.add(Bytes.toBytes("mycf"), Bytes.toBytes("myqual"), Bytes.toBytes("myvalue"));
		put.add(Bytes.toBytes("mycf"), Bytes.toBytes("myqual2"), Bytes.toBytes("myvalue2"));
		// You also have direct access to the host IP, the CrawlURI object and the hBaseParameters object.
	}
}

 

New Features:
A new feature was added to log Heritrix annotations, if any are encountered during the crawl.  For example, if a fetched URL has data larger than the configured maximum size, no data will be written to HBase, but an annotation cell, "c:an" => "size", will be written instead.  The default table column names have been shortened to something more reasonable.  The delimiter used when multiple annotations are present is also configurable.
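
As a rough illustration of the annotation logging described above, the sketch below joins a record's annotations with a configurable delimiter to form the value of the "c:an" cell.  The class name, method name and the second annotation are hypothetical and are not part of HBase Writer; only the "c:an" column and the "size" annotation come from the release notes.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of HBase Writer: builds the annotation cell value.
class AnnotationCellSketch {
    // Column family "c" and qualifier "an", per the "c:an" cell named above.
    static final String FAMILY = "c";
    static final String QUALIFIER = "an";

    // Joins all annotations into one string using the configured delimiter.
    // Returns null when there is nothing to log, so the caller can skip the cell.
    static String buildAnnotationValue(Collection<String> annotations, String delimiter) {
        if (annotations == null || annotations.isEmpty()) {
            return null;
        }
        return String.join(delimiter, annotations);
    }

    public static void main(String[] args) {
        // "exampleAnnotation" is a made-up second annotation for illustration only.
        List<String> annotations = Arrays.asList("size", "exampleAnnotation");
        System.out.println(FAMILY + ":" + QUALIFIER + " => " + buildAnnotationValue(annotations, ","));
    }
}
```

In a real processor the returned string would be converted to bytes and added to the Put object under the annotation column; how HBase Writer does this internally is not shown here.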

Other Additions:
A wrapper method was added so that all data written to the "Put" object gets serialized, but only if a serializer is specified in the config.  Both the request and response streams are now closed at the end of the write to the table.  Previously, either could be left open if the custom method was used and did not close them.

A bug in the shouldProcess() method was found and fixed.  If a record got an IOException, HBase Writer logged the record as an error but still tried to process it, because the method returned 'true'.  Now the method returns 'false' and, as expected, the record is not processed.

The HBase Writer project now includes a text file listing its current dependencies, to make it easier to update HBase-Writer dependencies inside of heritrix/lib.  Click here for a link to the list.  This latest version checks for null values and provides setters for all column name variables.  The <properties> section of the project pom.xml file was moved to the top of the file.  All Maven plugin versions in the pom.xml file were updated to their latest versions.  Support was added for the latest versions of Hadoop, HBase and Heritrix.
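
To make the shouldProcess() fix more concrete, here is a minimal sketch of the pattern, assuming a hypothetical RecordCheck interface in place of the real Heritrix record.  The actual shouldProcess() signature and logging in HBase Writer may differ; the point is only that the IOException branch returns false.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the processor; not the real HBase Writer class.
class ShouldProcessSketch {
    private static final Logger LOG = Logger.getLogger(ShouldProcessSketch.class.getName());

    // Hypothetical record check that can fail with an IOException.
    interface RecordCheck {
        boolean hasContent() throws IOException;
    }

    // Returns false when the check throws, so the record is skipped
    // instead of being logged as an error and then processed anyway.
    static boolean shouldProcess(RecordCheck record) {
        try {
            return record.hasContent();
        } catch (IOException e) {
            LOG.log(Level.SEVERE, "Could not inspect record, skipping it", e);
            return false; // the buggy behaviour described above returned true here
        }
    }
}
```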

Resources:
Click here to check out the project website, the source code or to download the jar library.  Alternatively, you can configure your Maven project to use the Nexus archive repository hosted by OpenSource Masters; click here for access.  Or download the jar and try it out today; click here.

Thank you and Enjoy! :)

-RJ

HBase-Writer 0.94.0 Released

HBase-Writer version 0.94.0 has been released and is available for download now.  This version of HBase-Writer continues to support both Heritrix2 & Heritrix3 (3.1.1) and has been tested against the latest release versions of HBase (0.95.1) and Hadoop (1.1.2) and all their dependencies.  An exception handling bug was discovered in the makeWriter() method.  Previously, a RuntimeException thrown there did not log the parent exception; this is fixed in this release.  Several new dependencies were added from HBase, and they have been added to the README files.  The HBase server I tested on is running 0.95.2, but hbase-writer is built against 0.95.1 because of a RuntimeException raised during unit testing.  Here is the stack trace:

testCreateHBaseWriter(org.archive.io.hbase.TestHBaseWriter)  Time elapsed: 0.366 sec  <<< FAILURE!
java.lang.RuntimeException: hbase-default.xml file seems to be for and old version of HBase (0.95.2-hadoop2), this version is 0.95.2-hadoop1
    at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:70) .....

After reading the hbase mailing list and talking with some developers, it seems to be caused by packaging issues.  These issues should be resolved in v0.96.x.  I didn't bother to debug where the reference to 0.95.2-hadoop2 is coming from, but 0.95.1-hadoop1 builds and passes the instance creation test, so the latest release of hbase-writer has this version set in the Maven build file (pom.xml).  Here are the jar dependencies I needed to copy from my test hbase installation (v0.95.2-hadoop1 running hadoop v1.1.2) into my test heritrix installation (v3.1.1):

cp hbase-writer-0.94.0.jar heritrix/lib/

cp hbase/lib/hbase-common-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/hbase-protocol-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/hbase-server-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/hbase-client-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/protobuf-java-2.4.1.jar heritrix/lib/
cp hbase/lib/commons-configuration-1.6.jar heritrix/lib/
cp hbase/lib/slf4j-api-1.6.4.jar heritrix/lib/
cp hbase/lib/htrace-core-2.00.jar heritrix/lib/
cp hbase/lib/jackson-mapper-asl-1.8.8.jar heritrix/lib/
cp hbase/lib/jackson-core-asl-1.8.8.jar heritrix/lib/

In hbase-writer TRUNK I have added a "jar-with-dependencies" goal, so hbase-writer and all of its dependencies can be packaged into a single jar that you can copy into heritrix/lib.

After adding the bean configuration described in hbase-writer's README for Heritrix3, you should be able to start up Heritrix3 and use the Heritrix3 web-ui to make crawls that write to hbase tables. 

Happy crawling..... Thank you and Enjoy :)

HBase-Writer 0.90.4 Released

HBase-Writer version 0.90.4 has been released and is available for download now.  This version of HBase-Writer continues to support both Heritrix2 & Heritrix3 and includes two major bug fixes.  Connections and resources were not being pooled, because pooling was unknowingly removed in the last update.  In addition, connections were not being closed properly, creating the potential for the application to hang on an out-of-memory error.  It is highly recommended that you switch to this new version if you are using an older version of the plugin.  These issues have been addressed by a patch submitted, once again, by Greg Lu.  Thank you to Greg for giving back to the open source community.  Next on the TODO list is to add unit tests that check for connection and resource pooling.  In the meantime, for future releases, I will add JMX support to my testing instance and use JConsole to monitor the object creation count over the course of a few crawls.  This should help ensure pooling is being used.  Thanks for checking it out and Enjoy! :)
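
The JMX monitoring idea mentioned above could look roughly like the following sketch, which exposes an object creation counter as a standard MBean that JConsole can read.  All class, method and object names here are hypothetical and do not come from HBase-Writer; it only shows one way such a counter might be wired up.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical monitoring helper; none of these names come from HBase-Writer.
class PoolingMonitorSketch {
    // Standard MBean convention: a public interface named <ClassName>MBean.
    public interface CreationCounterMBean {
        long getCreatedCount();
    }

    static final class CreationCounter implements CreationCounterMBean {
        private final AtomicLong created = new AtomicLong();

        // Call this wherever a new writer or connection object is constructed.
        void recordCreation() {
            created.incrementAndGet();
        }

        @Override
        public long getCreatedCount() {
            return created.get();
        }
    }

    // Registers a counter on the platform MBean server so JConsole can display it.
    static CreationCounter register(String objectName) {
        CreationCounter counter = new CreationCounter();
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            server.registerMBean(counter, new ObjectName(objectName));
        } catch (JMException e) {
            throw new IllegalStateException("Could not register " + objectName, e);
        }
        return counter;
    }

    // Reads the attribute back through JMX, the same way a JMX client would.
    static long readCount(String objectName) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            return (Long) server.getAttribute(new ObjectName(objectName), "CreatedCount");
        } catch (JMException e) {
            throw new IllegalStateException("Could not read " + objectName, e);
        }
    }
}
```

If pooling is working, the count shown in JConsole should level off after the first few records rather than growing with every URL crawled.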
