Welcome to Open Source Masters
Reliable and affordable enterprise services and custom products. From small computer networks and websites to large cluster systems and high availability application services.
Utilizing Open Source solutions in the following areas:
Cloud application development and consulting
Installing High Availability Cluster networks
Data mining and web scale crawling
Distributed Computing/Clustering/BigTable implementations
BigTable/NoSQL and Relational Data Designs
P2P & Server/Client Network Applications
Large-scale Production Deployments and Build Management
Fast Searching and Indexing
Secure End-To-End Transactions
Automated Build Process and Deployments
Testing, Debugging & Troubleshooting processes
Long-term Data Warehousing
HBase Writer 0.98.7 Released
HBase Writer version 0.98.7 is now released. This version includes several changes that make it easier to use and more stable.
Easier Library Usage:
Support was added to make extending HBase Writer much easier. Previously, to add custom logic to HBase Writer, the user had to extend the HBaseWriter class, among other classes, or fork and maintain a separate branch. With this update, the user can extend the HBaseWriterProcessor class and reference the new class from the Heritrix job config. That's it. Here is an example of how to extend it:
public class MyHBaseWriterProcessor extends HBaseWriterProcessor {
    @Override
    public void modifyPut(final HBaseParameters hBaseParameters, final CrawlURI curi, final String ip, Put put,
            RecordingOutputStream recordingOutputStream, RecordingInputStream recordingInputStream) throws IOException {
        // To access the client request data
        ReplayInputStream requestStream = recordingOutputStream.getReplayInputStream();
        byte[] req = HBaseWriter.getByteArrayFromInputStream(requestStream, (int) recordingOutputStream.getSize());
        // To access the server response data
        ReplayInputStream responseStream = recordingInputStream.getReplayInputStream();
        byte[] res = HBaseWriter.getByteArrayFromInputStream(responseStream, (int) recordingInputStream.getSize());
        // ... do stuff ..
        // To add custom cells to the Put object that gets written to the HBase table
        put.add(Bytes.toBytes("mycf"), Bytes.toBytes("myqual"), Bytes.toBytes("myvalue"));
        put.add(Bytes.toBytes("mycf"), Bytes.toBytes("myqual2"), Bytes.toBytes("myvalue2"));
        // You also have direct access to the host IP, the CrawlURI object and the hBaseParameters object.
    }
}
New Features:
A new feature was added to log Heritrix annotations, if any are encountered during the crawl. For example, if a fetched URL has data larger than the configured max size limit, no data will be written to HBase, but an annotation cell, "c:an" => "size", will be written instead. The default table column names have been shortened to something more reasonable. The delimiter used when multiple annotations are present is also configurable.
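To illustrate how several annotations could end up in a single cell value with a configurable delimiter, here is a minimal sketch in plain Java. It is not taken from the HBase Writer source: the class name AnnotationCellSketch, the method toCellValue and the second annotation name "other" are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical helper, for illustration only; not part of HBase Writer.
final class AnnotationCellSketch {

    /**
     * Joins the annotations into one cell value using the given delimiter.
     * Returns null when there are no annotations, meaning no cell is needed.
     */
    static String toCellValue(Collection<String> annotations, String delimiter) {
        if (annotations == null || annotations.isEmpty()) {
            return null;
        }
        return String.join(delimiter, annotations);
    }

    public static void main(String[] args) {
        // A single annotation is written as-is.
        System.out.println(toCellValue(Arrays.asList("size"), ","));
        // Multiple annotations are separated by the configured delimiter.
        System.out.println(toCellValue(Arrays.asList("size", "other"), "|"));
    }
}
```

The returned string would then be converted to bytes and added to the Put under the annotation column, in the same way as the custom cells in the modifyPut() example.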
Other Additions:
A wrapper method was added so that all data written to the "Put" object is serialized, but only if a serializer is specified in the config.
Both the request and response streams are now closed at the end of the write to the table. Previously, either could be left open if the custom method was used and did not close them; this should no longer be a problem.
A bug in the shouldProcess() method was found and fixed. If a record got an IOException, HBase Writer logged the record as an error but still tried to process it, because the method returned 'true'. It now returns 'false', so the record is skipped as expected.
The HBase Writer project now includes a text file listing its current dependencies, to make it easier to update the HBase-Writer dependencies inside heritrix/lib. Click here for a link to the list.
This latest version checks for null and provides setters for all column name variables.
The <properties> section of the project pom.xml file was moved to the top of the file.
All Maven plugin versions in the pom.xml file were updated to their latest versions.
Support was added for the latest versions of Hadoop, HBase and Heritrix.
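The shouldProcess() fix is easiest to see in code. The following is only a sketch of the pattern, written in plain Java under my own assumptions: the RecordCheck interface and the class name ShouldProcessSketch are hypothetical stand-ins, and the real method in HBase Writer has a different signature.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of the fixed control flow; not the HBase Writer source.
final class ShouldProcessSketch {
    private static final Logger LOG = Logger.getLogger(ShouldProcessSketch.class.getName());

    /** Stand-in for any check on a record that may fail with an IOException. */
    interface RecordCheck {
        boolean isWritable() throws IOException;
    }

    static boolean shouldProcess(RecordCheck check) {
        try {
            return check.isWritable();
        } catch (IOException e) {
            // Log the failure, including the exception itself.
            LOG.log(Level.SEVERE, "Record could not be read, skipping it", e);
            // Returning true here was the bug: the record was logged as an
            // error and then processed anyway.
            return false;
        }
    }
}
```

The point is simply that the error path and the "skip this record" path must be the same path.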
Resources:
Click here to check out the project website, the source code or to download the jar library. Alternatively, you can configure your Maven project to use the Nexus archive repository hosted by OpenSource Masters; click here for access. Or download the jar and try it out today; click here.
Thank you and Enjoy! :)
-RJ
HBase-Writer 0.94.0 Released
HBase-Writer version 0.94.0 has been released and is available for download now. This version of HBase-Writer continues to support both Heritrix2 & Heritrix3 (3.1.1) and has been tested against the latest release version of HBase (0.95.1) and Hadoop (1.1.2) and all their dependencies. An exception handling bug was discovered in the makeWriter() method: previously, the parent exception was not logged when a RuntimeException occurred. This is fixed in the latest release. HBase introduced several new dependencies, and they have been added to the README files. The HBase server I tested on is running 0.95.2, but hbase-writer is built against 0.95.1 because of a RuntimeException raised during unit testing. Here is the stack trace:
testCreateHBaseWriter(org.archive.io.hbase.TestHBaseWriter) Time elapsed: 0.366 sec <<< FAILURE!
java.lang.RuntimeException: hbase-default.xml file seems to be for and old version of HBase (0.95.2-hadoop2), this version is 0.95.2-hadoop1
at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:70) .....
After reading the hbase mailing list and talking with some developers, it seems to be caused by packaging issues. These issues should be resolved in v0.96.x. I didn't bother to debug where the reference to 0.95.2-hadoop2 is coming from, but 0.95.1-hadoop1 builds and passes the instance creation test, so the latest release of hbase-writer has this version set in the Maven build file (pom.xml). Here are the jar dependencies I needed to copy from my test hbase installation (v0.95.2-hadoop1 running hadoop v1.1.2) into my test heritrix installation (v3.1.1):
cp hbase-writer-0.94.0.jar heritrix/lib/
cp hbase/lib/hbase-common-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/hbase-protocol-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/hbase-server-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/hbase-client-0.95.2-hadoop1.jar heritrix/lib/
cp hbase/lib/protobuf-java-2.4.1.jar heritrix/lib/
cp hbase/lib/commons-configuration-1.6.jar heritrix/lib/
cp hbase/lib/slf4j-api-1.6.4.jar heritrix/lib/
cp hbase/lib/htrace-core-2.00.jar heritrix/lib/
cp hbase/lib/jackson-mapper-asl-1.8.8.jar heritrix/lib/
cp hbase/lib/jackson-core-asl-1.8.8.jar heritrix/lib/
In hbase-writer TRUNK I have now added a "jar-with-dependencies" goal, so hbase-writer and all of its dependencies can be packaged into a single jar that you copy over to heritrix/lib.
After adding the bean configuration described in hbase-writer's README for Heritrix3, you should be able to start up Heritrix3 and use the Heritrix3 web-ui to make crawls that write to hbase tables.
Happy crawling..... Thank you and Enjoy :)
HBase-Writer 0.90.4 Released
HBase-Writer version 0.90.4 has been released and is available for download now. This version of HBase-Writer continues to support both Heritrix2 & Heritrix3 and includes two major bug fixes. First, connections and resources were not being pooled, because pooling was unintentionally removed in the last update. Second, connections were not being closed properly, creating the potential for the application to hang with an OutOfMemoryError. It is highly recommended that you switch to this new version if you are using an older version of the plugin. Once again, these issues were addressed by a patch submitted by Greg Lu. Thank you to Greg for giving back to the open source community. Next on the TODO list is to add unit tests that check for connection and resource pooling. In the meantime, for future releases, I will add JMX support to my testing instance and use JConsole to monitor the object creation count over the course of a few crawls. This should help ensure pooling is being used. Thanks for checking it out and Enjoy! :)
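For readers wondering what "monitoring the object creation count" would tell you, here is a small sketch of a pool that counts how many resources it has created. This is plain Java written for illustration only, not the pooling code in HBase-Writer or HBase; the class name CountingPool and its methods are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical pool, for illustration only; not part of HBase-Writer.
final class CountingPool<T> {
    private final BlockingQueue<T> idle;
    private final Supplier<T> factory;
    private final AtomicInteger created = new AtomicInteger();

    CountingPool(int capacity, Supplier<T> factory) {
        this.idle = new ArrayBlockingQueue<>(capacity);
        this.factory = factory;
    }

    /** Reuses an idle resource if one is available, otherwise creates one. */
    T borrow() {
        T pooled = idle.poll();
        if (pooled != null) {
            return pooled;
        }
        created.incrementAndGet();
        return factory.get();
    }

    /** Returns a resource to the pool; it is dropped if the pool is full. */
    void giveBack(T resource) {
        idle.offer(resource);
    }

    /** Number of resources created so far; this is the value worth watching. */
    int createdCount() {
        return created.get();
    }
}
```

If pooling works, the created count stays flat while the crawl keeps borrowing and returning resources. If it keeps climbing with every record, resources are being created instead of reused.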