|
I noticed that TsProcessor uses the timestamp as the key when putting logs into HBase. But my logs are coming in so fast that several of them have the same timestamp, like this:

2012-01-20 20:03:14,041 [INFO] [communication thread] [org.apache.hadoop.mapred.LocalJobRunner.statusUpdate()] 10 threads, 28 requests, 0 errors, 0 forbidden, 0.6 pages/s, 80 kb/s,
2012-01-20 20:03:14,852 [INFO] [Thread-274] [jcrawler.fetch.mapreduce.FetchMapper.doWork()] -activeThreads=10, spinWaiting=7, fetchQueues.totalSize=649
2012-01-20 20:03:14,852 [INFO] [Thread-274] [jcrawler.fetch.mapreduce.FetchMapper.feedQueueManager()] feeding 649 input urls ...
2012-01-20 20:03:14,852 [INFO] [Thread-274] [jcrawler.fetch.mapreduce.FetchMapper.logHeapUsage()] Fetcher feeding queue manager. Heap usage: 327668152 out of 932118528 bytes.

I think that because of this, the entries are being collapsed and only one log is kept for a given timestamp. Any idea how to fix this?

Thanks, |
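To make the reported symptom concrete, here is a minimal sketch in Python, not Chukwa or HBase code. It assumes, as the post describes, that a write to an existing key replaces the earlier value; a plain dictionary stands in for the table, and the message strings are shortened from the log lines above.

```python
# The four log entries from the post; three share the timestamp 20:03:14,852.
entries = [
    ("2012-01-20 20:03:14,041", "LocalJobRunner.statusUpdate()"),
    ("2012-01-20 20:03:14,852", "FetchMapper.doWork()"),
    ("2012-01-20 20:03:14,852", "FetchMapper.feedQueueManager()"),
    ("2012-01-20 20:03:14,852", "FetchMapper.logHeapUsage()"),
]

# A dict stands in for the table (assumption): writing to a key that already
# exists replaces the value stored under it.
table = {}
for timestamp, message in entries:
    table[timestamp] = message

# Only one entry per distinct timestamp is left.
for key, message in table.items():
    print(key, message)
```

Four entries go in, but only as many come out as there are distinct timestamps, which matches the "only one log for a given timestamp" behaviour described in the question.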
|
You might want to extend TsProcessor and put the chunk sequence ID into the primary key to ensure that you get ordered entries in HBase. Hope this works for your use case.

regards,
Eric

On Fri, Jan 20, 2012 at 8:20 PM, Abhijit Dhar <[hidden email]> wrote:
> I noticed that TsProcessor is using the timestamp as the key for putting logs
> into hbase. But, my logs are coming in so fast that they have same timestamp
> like this:
>
> 2012-01-20 20:03:14,041 [INFO] [communication thread]
> [org.apache.hadoop.mapred.LocalJobRunner.statusUpdate()] 10 threads, 28
> requests, 0 errors, 0 forbidden, 0.6 pages/s, 80 kb/s,
> 2012-01-20 20:03:14,852 [INFO] [Thread-274]
> [jcrawler.fetch.mapreduce.FetchMapper.doWork()] -activeThreads=10,
> spinWaiting=7, fetchQueues.totalSize=649
> 2012-01-20 20:03:14,852 [INFO] [Thread-274]
> [jcrawler.fetch.mapreduce.FetchMapper.feedQueueManager()] feeding 649 input
> urls ...
> 2012-01-20 20:03:14,852 [INFO] [Thread-274]
> [jcrawler.fetch.mapreduce.FetchMapper.logHeapUsage()] Fetcher feeding queue
> manager. Heap usage: 327668152 out of 932118528 bytes.
>
> I think because of this, they are getting reduced and takes only one log for
> a given timestamp.
> Any idea how to fix this?
>
> Thanks,
>
> --
> View this message in context: http://apache-chukwa.679492.n3.nabble.com/Missing-logs-in-hbase-because-of-same-timestamp-tp3677271p3677271.html
> Sent from the Chukwa - Users mailing list archive at Nabble.com. |
|
Hi,
I used the chunk sequence ID, but even that is the same for a bunch of log entries streaming into my custom LogProcessor (which I extended from AbstractProcessor). Is there anything else I can use that is unique? I would expect everyone to hit this problem if logs go missing whenever timestamps are identical (or am I missing something?).
Thanks,
Abhijit |
|
Btw, I just tried using the field "startOffset" from AbstractProcessor, and it looks like it is unique within a chunk. I think this will work. Please let me know if I'm wrong.
|
|
Yes, startOffset should be unique throughout the life cycle of the data stream.

regards,
Eric

On Wed, Jan 25, 2012 at 7:50 PM, Abhijit Dhar <[hidden email]> wrote:
> Btw, I just tried using the field "startOffset" from AbstractProcessor and
> looks like that is unique within a chunk. I think this will work. Please let
> me know if I'm wrong. |
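The idea the thread settles on, making the key unique by combining the timestamp with startOffset, can be sketched as follows. This is an illustration in Python under stated assumptions, not the TsProcessor or AbstractProcessor API: the key layout, the helper names `to_millis` and `make_row_key`, the offset values, and the choice to treat log timestamps as UTC are all hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def to_millis(log_timestamp: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS,mmm' into milliseconds since the epoch.

    Assumption: timestamps are treated as UTC; the thread does not say.
    """
    parsed = datetime.strptime(log_timestamp, "%Y-%m-%d %H:%M:%S,%f")
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def make_row_key(log_timestamp: str, start_offset: int) -> str:
    """Hypothetical key layout: timestamp first, then the offset.

    Zero-padding both parts keeps string order equal to numeric order.
    """
    return f"{to_millis(log_timestamp):013d}-{start_offset:020d}"


# (timestamp, made-up start offset, shortened message)
entries = [
    ("2012-01-20 20:03:14,041", 0, "LocalJobRunner.statusUpdate()"),
    ("2012-01-20 20:03:14,852", 153, "FetchMapper.doWork()"),
    ("2012-01-20 20:03:14,852", 289, "FetchMapper.feedQueueManager()"),
    ("2012-01-20 20:03:14,852", 404, "FetchMapper.logHeapUsage()"),
]

# A dict stands in for the table; every entry now gets its own key.
table = {make_row_key(ts, offset): msg for ts, offset, msg in entries}

for key in sorted(table):
    print(key, table[key])
```

Putting the timestamp first keeps entries sorted by time, and the offset only breaks ties between entries that share a millisecond. Whether that ordering is wanted, and how the real key should be encoded, depends on the table design in use.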
