Welcome to Open Source Masters

Reliable and affordable computer services.  From home computer system repair to websites to enterprise security systems and application services.  We do it all.

 

Utilizing Open Source solutions in the following areas:

  • Desktop/Laptop Computers
  • Reliable Enterprise Applications
  • Distributed Computing/Clustering
  • Relational Data Designs
  • P2P & Server/Client Network Applications
  • Large-scale Production Deployments
  • Fast Searching and Indexing
  • Secure End-To-End Transactions
  • Automated Build Process
  • Testing, Debugging & Troubleshooting
  • Automated Long-term Data Persistence
 
HBase-Writer 0.20.3 Released

HBase-Writer version 0.20.3 has been released and is available for download now.  This version of HBase-Writer has a new runtime dependency: ZooKeeper.  This is because HBase 0.20.x now depends on ZooKeeper to manage configuration and connection information.  This version has been tested on a few Heritrix2-2.0.2 crawls on Hadoop 0.20.1, HBase 0.20.1 and ZooKeeper 3.2.1, and works fine as far as my tests go.  There are 2 main differences to be aware of when upgrading from 0.19.x to 0.20.x:

  1. In the global sheet configuration for your Heritrix job, there is no "master" address for HBaseWriterProcessor anymore.  Instead, you need to provide a comma-separated list of the ZooKeeper hosts that make up the ZooKeeper quorum (zkquorum).  Heritrix will talk to ZooKeeper to determine the master address of HBase.  HBase made this change in 0.20.x to avoid the Master node being a SPOF (single point of failure).  Support for an alternate ZooKeeper client port has been added as well.
  2. You need to add the zookeeper.jar to the lib/ folder.  The ZooKeeper jar is included with the HBase distribution, or you can download it from the OSM Archive Repository.
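
To illustrate what a comma-separated zkquorum value amounts to, here is a minimal sketch (in Python for brevity; HBase-Writer itself is Java) that splits such a list and pairs each host with a client port.  The function names, the example hostnames and the output format are assumptions made for illustration, not HBase-Writer's actual code; 2181 is ZooKeeper's default client port.

```python
from typing import List


def parse_zk_quorum(zkquorum: str) -> List[str]:
    """Split a comma-separated quorum string into clean host names.

    Hypothetical helper: HBase-Writer's own parsing may differ.
    """
    hosts = [host.strip() for host in zkquorum.split(",")]
    hosts = [host for host in hosts if host]  # drop empty entries such as "a,,b"
    if not hosts:
        raise ValueError("zkquorum must name at least one ZooKeeper host")
    return hosts


def zk_connect_string(zkquorum: str, client_port: int = 2181) -> str:
    """Build a host:port,host:port string from the quorum list.

    client_port defaults to 2181, ZooKeeper's default client port; pass
    another value to model the alternate client port setting.
    """
    return ",".join(f"{host}:{client_port}" for host in parse_zk_quorum(zkquorum))


if __name__ == "__main__":
    # Example hostnames are placeholders.
    print(zk_connect_string("zk1.example.com, zk2.example.com,zk3.example.com"))
    print(zk_connect_string("zk1.example.com", client_port=2222))
```

An odd number of quorum hosts (1, 3, 5) is the usual ZooKeeper recommendation, since the ensemble needs a majority of its servers to be up.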

The other changes in this version were under the hood.  The BatchUpdate API has been deprecated in HBase 0.20.x, and HBase-Writer now uses the new Put/Get API from HBase to write and manage records during crawls.  Feel free to create Issues if you want to see support added for something or if you have a bug to report.  Thanks for checking it out and enjoy! :)

 
ApacheCon 2008

ApacheCon 2008 - New Orleans
November 3, 2008 - November 7, 2008
Sheraton Hotel on Canal Street

What a great conference to attend: informative and entertaining.  Not many "geek" conferences are considered fun, but ApacheCon 2008 in New Orleans was a blast!  Mainly because on Nov. 6th at 7:30pm, HotWax Media and Brainfood.com put together a New Orleans-style funeral with a marching brass band and marched down Canal Street with a police escort.  This funeral was to celebrate the death of commercial proprietary closed-source software.  A few people carried a fake casket behind an 8-member brass band, leading about 200 conference attendees and some restaurant/bar patrons to the Howling Wolf bar to watch the Rebirth Brass Band do their thing.  They're an amazing band; they covered many songs, and I recommend checking them out if you get the chance.  It was a great experience, definitely one I will never forget.

The speakers at the conference were great.  All were prepared; no one was fumbling through their slides or notes, and there were no audio/video troubles.  I was able to audio record some of the sessions, and many of the presenters provided their presentation materials.  I am already looking forward to ApacheCon 2009 in Oakland, CA.  Thanks to Charel Morris for the personal help in getting my conference pass.

 

 
HBase-Writer 0.19.1 Released

HBase-Writer version 0.19.1 has been released and is available for download now.  This version has been tested on a few Heritrix2 crawls on Hadoop 0.19.0 and HBase 0.19.0 and runs better now.  This version fixes the "only_new_records" feature from the previous release so that it works properly.  In 0.19.0, if "only_new_records" was set to "true" and duplicate url records were in the hbase table, Heritrix would not download the content.  That is fine, except that you then can't crawl any new records, because you have to download the page to get all the links to follow.  Skipping downloads would be better solved by Heritrix itself, by overriding extractor classes in Heritrix or by taking snapshots during the crawl so you can pick up where you left off.  So now in hbase-writer version 0.19.1, when "only_new_records" is set to "true", Heritrix will always download the content associated with the crawled urls, but that content will only be written to the given HBase table once.  The next version of hbase-writer will have the option to not download the content if the record in hbase already exists (0.19.0 functionality).
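
The 0.19.1 behavior described above can be sketched roughly as follows (in Python for brevity; HBase-Writer itself is Java).  This is only an illustration under stated assumptions: the class name, the in-memory dict standing in for the HBase table, and the fetch callback are all hypothetical, and the overwrite behavior when "only_new_records" is off is assumed rather than taken from the release notes.

```python
from typing import Callable, Dict


class CrawlWriterSketch:
    """Hypothetical stand-in for the writer step of a crawl."""

    def __init__(self, only_new_records: bool) -> None:
        self.only_new_records = only_new_records
        self.table: Dict[str, str] = {}  # url -> content, stands in for the HBase table
        self.downloads = 0

    def process(self, url: str, fetch: Callable[[str], str]) -> bool:
        """Download the url, then decide whether to write it.

        Returns True if the content was written to the table.
        """
        # Always download, so the crawler can still extract links to follow.
        content = fetch(url)
        self.downloads += 1

        # With only_new_records on, an existing record is left untouched.
        if self.only_new_records and url in self.table:
            return False

        # Assumption: with the flag off, a repeat write replaces the record.
        self.table[url] = content
        return True


if __name__ == "__main__":
    writer = CrawlWriterSketch(only_new_records=True)
    print(writer.process("http://example.com/", lambda u: "first fetch"))
    print(writer.process("http://example.com/", lambda u: "second fetch"))
    print(writer.downloads, writer.table)
```

The point of the sketch is the ordering: the download happens before the duplicate check, which is what lets the crawl keep discovering new links even when the record itself is skipped.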

Also important to note: Hadoop uses Java 1.6 now, and so HBase-Writer does as well.  Happy crawling & enjoy!

 
All Content and Images Copyright © opensourcemasters.com 2007 - 2010