Articles
Written by Ryan Smith
Sunday, 22 January 2012
HBase-Writer version 0.90.4 has been released and is available for download now. This version of HBase-Writer continues to support both Heritrix2 & Heritrix3 and includes two major bug fixes. First, connections and resources were not being pooled, because the pooling code was inadvertently removed in the last update. Second, connections were not being closed properly, which created the potential for the application to hang on an out-of-memory exception. If you are using an older version of the plugin, it is highly recommended that you switch to this new version. Both issues have been addressed by a patch submitted, once again, by Greg Lu. Thank you to Greg for giving back to the open source community. Next on the TODO list is to add unit tests that check for connection and resource pooling. In the meantime, for future releases, I will add JMX support to my testing instance and use JConsole to monitor the object creation count over the course of a few crawls. This should help ensure pooling is being used. Thanks for checking it out and Enjoy! :)
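As a rough illustration of what a pooling check could look like, here is a minimal sketch that uses only JDK classes. `CountingWriterPool` and its nested `Writer` are hypothetical stand-ins, not the actual HBase-Writer or Heritrix classes. The idea is simply to count how many objects get constructed and to confirm that returned or leftover objects are closed.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in, not an HBase-Writer class: a tiny pool that counts
// how many writer objects it has ever constructed.
final class CountingWriterPool {
    // Placeholder for an expensive resource such as a table connection.
    static final class Writer implements AutoCloseable {
        private boolean closed;
        boolean isClosed() { return closed; }
        @Override public void close() { closed = true; }
    }

    private final BlockingQueue<Writer> idle;
    private final AtomicInteger created = new AtomicInteger();

    CountingWriterPool(int maxIdle) {
        this.idle = new ArrayBlockingQueue<>(maxIdle);
    }

    // Reuse an idle writer when one exists; construct a new one otherwise.
    Writer borrow() {
        Writer writer = idle.poll();
        if (writer == null) {
            created.incrementAndGet();
            writer = new Writer();
        }
        return writer;
    }

    // Hand the writer back for reuse; close it if the pool is already full.
    void giveBack(Writer writer) {
        if (!idle.offer(writer)) {
            writer.close();
        }
    }

    // Close every idle writer, for example when a crawl job ends.
    void shutdown() {
        Writer writer;
        while ((writer = idle.poll()) != null) {
            writer.close();
        }
    }

    int createdCount() { return created.get(); }

    public static void main(String[] args) {
        CountingWriterPool pool = new CountingWriterPool(2);
        for (int i = 0; i < 1000; i++) {
            pool.giveBack(pool.borrow());
        }
        // With pooling in effect this count stays small no matter how many
        // borrow/return cycles run; without pooling it would grow per cycle.
        System.out.println("writers created: " + pool.createdCount());
        pool.shutdown();
    }
}
```

A unit test can assert on `createdCount()` directly, which is the same number one would otherwise watch through JConsole.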

Written by Ryan Smith
Wednesday, 16 November 2011
HBase-Writer version 0.90.3 has been released and is available for download now. This version of HBase-Writer continues to support both Heritrix2 & Heritrix3. HBase-Writer now uses the WARCWriterPool from Heritrix I/O instead of its own implementation. The README file changed slightly, mainly to be compatible with Spring 3.x, since the last version of Heritrix was using Spring 2.x. Be sure to check the README file for the rest of the details. Many thanks to Karthik MV for submitting the initial compatibility patch. Feel free to create Issues if you want to see support added for something or if you have a bug to report. Thanks for checking it out and Enjoy! :)

Written by Ryan Smith
Monday, 29 March 2010
HBase-Writer version 0.9-SNAPSHOT has been released and is available for download now. This version of HBase-Writer supports both Heritrix2 & Heritrix3. The README file changed and some new ones were added. Be sure to check them out for the new Heritrix3 support. Many thanks to Greg Lu for spearheading this effort and sending in the initial patch. Once Heritrix has an official 3.0.0-RELEASE, HBase-Writer will release version 0.9-RELEASE. Feel free to create Issues if you want to see support added for something or if you have a bug to report. Thanks for checking it out and Enjoy! :)

Written by Ryan Smith
Monday, 16 February 2009
HBase-Writer version 0.20.3 has been released and is available for download now. This version of HBase-Writer has a new runtime dependency: ZooKeeper. This is because HBase-0.20.x now depends on ZooKeeper to manage configuration and connection information. This version has been tested on a few Heritrix2-2.0.2 crawls on Hadoop 0.20.1, HBase 0.20.1 and ZooKeeper 3.2.1, and works fine as far as my tests go. There are 2 main differences to be aware of when upgrading from 0.19.x to 0.20.x:
- In the global sheet configuration for your Heritrix job, there is no "master" address for HBaseWriterProcessor anymore. Instead you need to provide a comma-separated list of the ZooKeeper hosts that make up the ZooKeeper quorum (zkquorum). Heritrix will talk to ZooKeeper to determine the master address of HBase. HBase made this change in 0.20.x to avoid the Master node being a SPOF (single point of failure). Support for an alternate zk client port has been added as well.
- You need to add the zookeeper.jar to the lib/ folder. The zookeeper jar is included with the HBase distribution, or you can download it from the OSM Archive Repository.
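To make the quorum setting a bit more concrete, here is a small sketch of how a comma-separated zkquorum value and a client port could be turned into the two HBase client settings `hbase.zookeeper.quorum` and `hbase.zookeeper.property.clientPort`. The class name, method names and host names are made up for illustration, and this is not the actual HBaseWriterProcessor code. It uses a plain `java.util.Properties` so it runs without the HBase jars.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of HBase-Writer.
final class ZkQuorumSketch {
    // Split "host1, host2,host3" into trimmed, non-empty host names.
    static List<String> parseQuorum(String zkquorum) {
        List<String> hosts = new ArrayList<>();
        for (String part : zkquorum.split(",")) {
            String host = part.trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("zkquorum must list at least one host");
        }
        return hosts;
    }

    // Build the two client-side settings HBase reads to locate ZooKeeper.
    static Properties toHBaseClientSettings(String zkquorum, int clientPort) {
        Properties settings = new Properties();
        settings.setProperty("hbase.zookeeper.quorum",
                String.join(",", parseQuorum(zkquorum)));
        settings.setProperty("hbase.zookeeper.property.clientPort",
                Integer.toString(clientPort));
        return settings;
    }

    public static void main(String[] args) {
        // Example hosts are placeholders; 2181 is ZooKeeper's usual client port.
        Properties settings = toHBaseClientSettings(
                "zk1.example.com, zk2.example.com, zk3.example.com", 2181);
        settings.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

In a real job these values would be set on the HBase configuration object rather than on a `Properties` instance. The point is only that no master address appears anywhere.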
The other changes in this version are under the hood. The BatchUpdate API has been deprecated in HBase-0.20.x, and HBase-Writer now uses the new Put/Get API from HBase to write and manage records when doing crawls. Feel free to create Issues if you want to see support added for something or if you have a bug to report. Thanks for checking it out and Enjoy! :)

Written by Ryan Smith
Monday, 16 February 2009
HBase-Writer version 0.19.1 has been released and is available for download now. This version has been tested on a few Heritrix2 crawls on Hadoop 0.19.0 and HBase 0.19.0 and runs better now. This version fixes the new feature from the previous release so that it works properly. In 0.19.0, if "only_new_records" was set to "true" and duplicate url records were in the hbase table, Heritrix would not download the content. That is fine, except that you then can't crawl any new records, because you have to download the page to get all the links to follow. This issue would be better solved in Heritrix itself, by overriding extractor classes in Heritrix or by taking snapshots during the crawl so you can pick up where you left off. So now in hbase-writer version 0.19.1, when "only_new_records" is set to "true", Heritrix will always download the content associated with the crawled urls, but that content will only be written to the given HBase table once. The next version of hbase-writer will have the option to not download the content if the record in hbase already exists (the 0.19.0 functionality). Also important to note: Hadoop uses Java 1.6 now, and so HBase-Writer does as well. Happy crawling & enjoy!
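The difference between the two releases comes down to two separate decisions: whether to fetch a url and whether to write it. The sketch below spells that out as a pair of tiny functions. The class, enum and method names are hypothetical and only paraphrase the behaviour described above, not the plugin's real code.

```java
// Hypothetical model of the behaviour described in this post.
final class OnlyNewRecordsPolicy {
    enum Version { V0_19_0, V0_19_1 }

    // 0.19.0 skipped the download for known rows, so their links were never
    // extracted. 0.19.1 always downloads so the crawl can keep following links.
    static boolean shouldFetch(Version version, boolean onlyNewRecords, boolean rowExists) {
        if (version == Version.V0_19_0) {
            return !(onlyNewRecords && rowExists);
        }
        return true;
    }

    // In both releases an existing row is left alone when the option is on.
    static boolean shouldWrite(boolean onlyNewRecords, boolean rowExists) {
        return !(onlyNewRecords && rowExists);
    }

    public static void main(String[] args) {
        for (Version version : Version.values()) {
            System.out.println(version
                    + " known url, option on: fetch=" + shouldFetch(version, true, true)
                    + " write=" + shouldWrite(true, true));
        }
    }
}
```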

Written by Ryan Smith
Thursday, 12 February 2009
HBase-Writer version 0.19.0 has been released and is available for download now. This version has been tested on a few Heritrix2 crawls on Hadoop 0.19.0 and HBase 0.19.0 and runs well. I was able to add a new feature: "only-new-records". This boolean option is set to "false" by default, in which case all urls & their content are crawled and written to the given hbase table (as expected). By setting it to "true", you ensure that only new urls (rowkeys) are written. Normally, when you crawl the same site more than once, you add multiple cells to the various crawl columns (e.g. "content:raw_data", "cui:url", etc.), and each cell has a different timestamp associated with it. So, for example, if you crawl the same site 5 times in a row, you will get one rowkey for each url crawled, but 5 occurrences of each column, each with its own timestamp. The only exception is that columns updated by the crawler in a single batchUpdate will have the same timestamp. This is so you know that all cells with the same timestamp came from the same fetch. When the HBase-Writer option "only-new-records" is set to "true", you will get no more than one occurrence of each column per rowkey. This is useful in cases where you want to crawl a site over a long period of time and plan on starting and stopping the crawler many times. It can also be useful if you want to crawl a site and only get new urls. Future versions will implement the feature of not downloading the content from the webserver in addition to not writing it to HBase. This can greatly reduce the load on the webserver you are crawling, as only the header needs to be fetched to determine whether the url already exists. Also important to note: Hadoop uses Java 1.6 now, and so HBase-Writer does as well. Happy crawling & enjoy!
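The rowkey, column and timestamp behaviour can be modelled with nested maps. The sketch below is an in-memory toy, not HBase itself: `VersionedTableSketch` is a hypothetical name, and it ignores any limit on stored versions that a real table may be configured with. It only shows how repeated crawls stack up timestamped cells, and how the option caps that at one cell per column.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model: rowkey -> column -> (timestamp -> value).
final class VersionedTableSketch {
    private final Map<String, Map<String, NavigableMap<Long, String>>> rows = new HashMap<>();

    // Write all cells of one fetch under a single timestamp.
    // Returns false when onlyNewRecords is set and the rowkey already exists.
    boolean write(String rowKey, Map<String, String> cells, long timestamp, boolean onlyNewRecords) {
        if (onlyNewRecords && rows.containsKey(rowKey)) {
            return false;
        }
        Map<String, NavigableMap<Long, String>> row =
                rows.computeIfAbsent(rowKey, key -> new HashMap<>());
        for (Map.Entry<String, String> cell : cells.entrySet()) {
            row.computeIfAbsent(cell.getKey(), key -> new TreeMap<>())
               .put(timestamp, cell.getValue());
        }
        return true;
    }

    // Number of timestamped cells stored for one column of one row.
    int versions(String rowKey, String column) {
        Map<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) {
            return 0;
        }
        return row.get(column).size();
    }

    public static void main(String[] args) {
        Map<String, String> cells = new HashMap<>();
        cells.put("content:raw_data", "<html>...</html>");
        cells.put("cui:url", "http://example.com/");

        VersionedTableSketch everyCrawl = new VersionedTableSketch();
        VersionedTableSketch newOnly = new VersionedTableSketch();
        for (long crawl = 1; crawl <= 5; crawl++) {
            everyCrawl.write("http://example.com/", cells, crawl, false);
            newOnly.write("http://example.com/", cells, crawl, true);
        }
        System.out.println("option off: "
                + everyCrawl.versions("http://example.com/", "content:raw_data") + " cells");
        System.out.println("option on:  "
                + newOnly.versions("http://example.com/", "content:raw_data") + " cells");
    }
}
```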

Written by Ryan Smith
Tuesday, 11 November 2008
ApacheCon 2008 - New Orleans, November 3, 2008 - November 7, 2008, Sheraton Hotel on Canal Street.
What a great conference to attend: informative and entertaining. Not many "geek" conferences are considered fun, but ApacheCon 2008 in New Orleans was a blast! That was mainly because on Nov. 6th at 7:30pm, HotWax Media and Brainfood.com put together a New Orleans style funeral with a marching brass band and marched down Canal Street with a police escort. This funeral was to celebrate the death of commercial proprietary closed-source software. A few people carried a fake casket behind an 8-member brass band, leading about 200 conference attendees and some restaurant/bar patrons to the Howling Wolf bar to watch the Rebirth Brass Band do their thing. They're an amazing band, they covered many songs, and I recommend checking them out if you get the chance. It was a great experience; definitely one I will never forget.
The speakers at the conference were great. All were prepared, no one was fumbling through their slides or notes, and there were no audio/video troubles. I was able to audio record some of the sessions, and many of the presenters provided their presentation materials. I am already looking forward to ApacheCon 2009 in Oakland, CA. Thanks to Charel Morris for the personal help in getting my conference pass.

Written by Ryan Smith
Wednesday, 03 December 2008
HBase-Writer 0.18.2 has been released. This release adds support for a max content size; the default max size is 20 MB. Any content item crawled that is bigger than 20 MB will be rejected by the writer. This release also contains a bug fix: if HBase threw an exception, the writer wasn't being added back to the Heritrix writer pool. The writer is now being added back. Thanks to Andrew Purtell at Apache for these patches.
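Both changes follow a familiar shape: check the size before borrowing a writer, and return the writer in a `finally` block so an exception cannot leak it. The sketch below shows that shape with hypothetical names (`GuardedWriteSketch`, `RecordWriter`) and is not the actual patch. It also assumes 20 MB means 20 * 1024 * 1024 bytes, which the release notes do not specify, and it is not thread-safe.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a size guard plus a leak-free borrow/return cycle.
final class GuardedWriteSketch {
    // Assumption: 20 MB interpreted as 20 * 1024 * 1024 bytes.
    static final long DEFAULT_MAX_CONTENT_BYTES = 20L * 1024 * 1024;

    // Stand-in for whatever actually writes a record to HBase.
    interface RecordWriter {
        void write(byte[] content) throws IOException;
    }

    private final Deque<RecordWriter> idle = new ArrayDeque<>();
    private final long maxContentBytes;

    GuardedWriteSketch(long maxContentBytes, RecordWriter... writers) {
        this.maxContentBytes = maxContentBytes;
        for (RecordWriter writer : writers) {
            idle.push(writer);
        }
    }

    // Returns false when the content is rejected for being too large.
    boolean store(byte[] content) throws IOException {
        if (content.length > maxContentBytes) {
            return false;
        }
        RecordWriter writer = idle.pop(); // throws if no writer is idle
        try {
            writer.write(content);
            return true;
        } finally {
            // Runs on success and on exception, so the writer is never lost.
            idle.push(writer);
        }
    }

    int idleWriters() { return idle.size(); }

    public static void main(String[] args) {
        GuardedWriteSketch sketch = new GuardedWriteSketch(DEFAULT_MAX_CONTENT_BYTES,
                content -> { throw new IOException("simulated HBase failure"); });
        try {
            sketch.store(new byte[16]);
        } catch (IOException e) {
            System.out.println("write failed: " + e.getMessage());
        }
        System.out.println("idle writers after failure: " + sketch.idleWriters());
    }
}
```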
HBase-Writer is a processor plugin following the Heritrix2 processor API. With HBase-Writer, you can have Heritrix2 crawl and save its results directly to a table in HBase. The HBase-Writer plugin is based on the Heritrix-HDFS-Writer plugin. Thanks to Questio for the support in releasing this project.