Missing logs in hbase because of same timestamp

Abhijit Dhar
I noticed that TsProcessor uses the timestamp as the key when putting logs into HBase. But my logs arrive so fast that several of them share the same timestamp, like this:

2012-01-20 20:03:14,041 [INFO] [communication thread] [org.apache.hadoop.mapred.LocalJobRunner.statusUpdate()] 10 threads, 28 requests, 0 errors, 0 forbidden, 0.6 pages/s, 80 kb/s,
2012-01-20 20:03:14,852 [INFO] [Thread-274] [jcrawler.fetch.mapreduce.FetchMapper.doWork()] -activeThreads=10, spinWaiting=7, fetchQueues.totalSize=649
2012-01-20 20:03:14,852 [INFO] [Thread-274] [jcrawler.fetch.mapreduce.FetchMapper.feedQueueManager()] feeding 649 input urls ...
2012-01-20 20:03:14,852 [INFO] [Thread-274] [jcrawler.fetch.mapreduce.FetchMapper.logHeapUsage()] Fetcher feeding queue manager. Heap usage: 327668152 out of 932118528 bytes.

I think that because of this they are getting reduced to a single entry, so only one log is kept for a given timestamp.
Any idea how to fix this?

Thanks,
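
The effect described above can be reproduced without a cluster. The following is only a sketch, assuming the row key is derived from nothing but the 23-character timestamp prefix; a plain TreeMap stands in for the table, and the class name, method name, and abbreviated log lines are made up for illustration (none of this is Chukwa or HBase API).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TimestampKeyCollision {
    // Stand-in for a table keyed by row key: a later put with the same key
    // replaces the earlier value. Assumption: the key is the timestamp only.
    static Map<String, String> storeByTimestamp(List<String> logLines) {
        Map<String, String> table = new TreeMap<>();
        for (String line : logLines) {
            // "2012-01-20 20:03:14,852" is 23 characters long
            String key = line.substring(0, 23);
            table.put(key, line);
        }
        return table;
    }

    public static void main(String[] args) {
        // Abbreviated versions of the log lines from the post
        List<String> lines = Arrays.asList(
            "2012-01-20 20:03:14,041 [INFO] statusUpdate() 10 threads, 28 requests",
            "2012-01-20 20:03:14,852 [INFO] doWork() -activeThreads=10",
            "2012-01-20 20:03:14,852 [INFO] feedQueueManager() feeding 649 input urls ...",
            "2012-01-20 20:03:14,852 [INFO] logHeapUsage() Fetcher feeding queue manager.");
        Map<String, String> table = storeByTimestamp(lines);
        // Four lines go in, but the three lines stamped ,852 share one key
        System.out.println(lines.size() + " lines in, " + table.size() + " rows stored");
        table.forEach((k, v) -> System.out.println(k + " => " + v));
    }
}
```

In this sketch only the last line written for each timestamp survives, which matches the symptom of missing logs.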

Re: Missing logs in hbase because of same timestamp

Eric Yang-3
You might want to extend TsProcessor and put the chunk sequence ID into
the primary key to ensure that you get ordered entries in HBase.  Hope
this works for your use case.

regards,
Eric

On Fri, Jan 20, 2012 at 8:20 PM, Abhijit Dhar <[hidden email]> wrote:

> I noticed that TsProcessor is using the timestamp as the key for putting logs
> into hbase. But, my logs are coming in so fast that they have same timestamp
> like this:
>
> 2012-01-20 20:03:14,041 [INFO] [communication thread]
> [org.apache.hadoop.mapred.LocalJobRunner.statusUpdate()] 10 threads, 28
> requests, 0 errors, 0 forbidden, 0.6 pages/s, 80 kb/s,
> 2012-01-20 20:03:14,852 [INFO] [Thread-274]
> [jcrawler.fetch.mapreduce.FetchMapper.doWork()] -activeThreads=10,
> spinWaiting=7, fetchQueues.totalSize=649
> 2012-01-20 20:03:14,852 [INFO] [Thread-274]
> [jcrawler.fetch.mapreduce.FetchMapper.feedQueueManager()] feeding 649 input
> urls ...
> 2012-01-20 20:03:14,852 [INFO] [Thread-274]
> [jcrawler.fetch.mapreduce.FetchMapper.logHeapUsage()] Fetcher feeding queue
> manager. Heap usage: 327668152 out of 932118528 bytes.
>
> I think because of this, they are getting reduced and takes only one log for
> a given timestamp.
> Any idea how to fix this?
>
> Thanks,
>
> --
> View this message in context: http://apache-chukwa.679492.n3.nabble.com/Missing-logs-in-hbase-because-of-same-timestamp-tp3677271p3677271.html
> Sent from the Chukwa - Users mailing list archive at Nabble.com.
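
One way to picture the suggestion of putting a sequence ID into the key is a fixed-width composite key. This is a rough sketch, not Chukwa's key format: the class name, the "-" separator, the field widths, and the sample values are all assumptions made for illustration. HBase orders row keys lexicographically by bytes, so zero-padding the numeric parts keeps the lexicographic order in line with the numeric order.

```java
class CompositeRowKey {
    // Hypothetical key layout: 13-digit epoch milliseconds, a dash,
    // then a 10-digit zero-padded sequence number.
    static String rowKey(long timestampMillis, long sequenceId) {
        return String.format("%013d-%010d", timestampMillis, sequenceId);
    }

    public static void main(String[] args) {
        long ts = 1327089794852L; // illustrative epoch-millisecond value

        // Padded keys: sequence 2 sorts before sequence 10, as intended
        String padded2 = rowKey(ts, 2);
        String padded10 = rowKey(ts, 10);
        System.out.println(padded2 + " < " + padded10 + " : "
            + (padded2.compareTo(padded10) < 0));

        // Unpadded keys: "…-2" sorts after "…-10" because '2' > '1'
        String raw2 = ts + "-" + 2;
        String raw10 = ts + "-" + 10;
        System.out.println(raw2 + " < " + raw10 + " : "
            + (raw2.compareTo(raw10) < 0));
    }
}
```

A key like this is only unique if the second component differs for every record that shares a timestamp, so that property is worth checking for whichever field is used.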

Re: Missing logs in hbase because of same timestamp

Abhijit Dhar
Hi,

I used the chunk sequence ID, but even that is the same for a bunch of log entries streaming into my custom LogProcessor (I extended it from AbstractProcessor).
Is there anything else I can use that is unique? I would expect everyone to hit this problem if logs really go missing because of identical timestamps (or am I missing something?).

Thanks,
Abhijit

Re: Missing logs in hbase because of same timestamp

Abhijit Dhar
Btw, I just tried using the field "startOffset" from AbstractProcessor, and it looks like that is unique within a chunk. I think this will work. Please let me know if I'm wrong.

Re: Missing logs in hbase because of same timestamp

Eric Yang-3
Yes, startOffset should be unique throughout the life cycle of the data stream.

regards,
Eric

On Wed, Jan 25, 2012 at 7:50 PM, Abhijit Dhar <[hidden email]> wrote:
> Btw, I just tried using the field "startOffset" from AbstractProcessor and
> looks like that is unique within a chunk. I think this will work. Please let
> me know if I'm wrong.
>
> --
> View this message in context: http://apache-chukwa.679492.n3.nabble.com/Missing-logs-in-hbase-because-of-same-timestamp-tp3677271p3689656.html
> Sent from the Chukwa - Users mailing list archive at Nabble.com.
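
The startOffset approach can be sketched the same way. This is a minimal illustration under stated assumptions: the offset is modeled here as the byte position at which each record begins in the input, which may not be exactly how Chukwa assigns the field, and the class name, method name, key layout, and abbreviated log lines are made up for the example. A TreeMap again stands in for the table.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class OffsetKeyDemo {
    // Builds keys of the form "<timestamp>-<zero-padded start offset>".
    // Assumption: startOffset is the byte position where the record begins.
    static Map<String, String> storeByTimestampAndOffset(List<String> logLines) {
        Map<String, String> table = new TreeMap<>();
        long startOffset = 0;
        for (String line : logLines) {
            String timestamp = line.substring(0, 23);
            String key = String.format("%s-%010d", timestamp, startOffset);
            table.put(key, line);
            // Advance past this record plus its trailing newline
            startOffset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return table;
    }

    public static void main(String[] args) {
        // Abbreviated versions of the log lines from the first post
        List<String> lines = Arrays.asList(
            "2012-01-20 20:03:14,041 [INFO] statusUpdate() 10 threads, 28 requests",
            "2012-01-20 20:03:14,852 [INFO] doWork() -activeThreads=10",
            "2012-01-20 20:03:14,852 [INFO] feedQueueManager() feeding 649 input urls ...",
            "2012-01-20 20:03:14,852 [INFO] logHeapUsage() Fetcher feeding queue manager.");
        Map<String, String> table = storeByTimestampAndOffset(lines);
        // Each record starts at a different offset, so no key repeats
        System.out.println(lines.size() + " lines in, " + table.size() + " rows stored");
        table.keySet().forEach(System.out::println);
    }
}
```

Because every record begins at a different position, the three lines stamped ,852 get three distinct keys in this sketch and still sort in arrival order.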