public class SolrDeleteDuplicates extends Object implements org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>, org.apache.hadoop.util.Tool
Values are SolrDeleteDuplicates.SolrRecord instances (which contain id,
boost and timestamp). SolrDeleteDuplicates.SolrRecords with the same
digest will be grouped together. Of these documents with the same digest,
all are deleted except the one with the highest score (boost field). If two
(or more) documents have the same score, then the document with the latest
timestamp is kept, and every other document is deleted from the Solr index.
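
The selection rule described above can be sketched in plain Java. This is a minimal illustration, not the class's actual reduce implementation: `DedupSketch`, `Rec`, and `idsToDelete` are hypothetical names introduced here, and `Rec` only stands in for SolrDeleteDuplicates.SolrRecord with the three fields the description mentions (id, boost, timestamp).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DedupSketch {

    /** Hypothetical stand-in for SolrDeleteDuplicates.SolrRecord. */
    static final class Rec {
        final String id;
        final float boost;
        final long tstamp;

        Rec(String id, float boost, long tstamp) {
            this.id = id;
            this.boost = boost;
            this.tstamp = tstamp;
        }
    }

    /**
     * Given all records that share one digest, returns the ids to delete:
     * every record except the one with the highest boost, with ties on
     * boost broken in favor of the latest timestamp.
     */
    static List<String> idsToDelete(Iterable<Rec> sameDigest) {
        List<String> toDelete = new ArrayList<>();
        Rec keep = null;
        for (Rec r : sameDigest) {
            if (keep == null) {
                keep = r; // first record is the initial candidate to keep
                continue;
            }
            boolean better = r.boost > keep.boost
                    || (r.boost == keep.boost && r.tstamp > keep.tstamp);
            if (better) {
                toDelete.add(keep.id); // previous candidate loses
                keep = r;
            } else {
                toDelete.add(r.id);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<Rec> group = Arrays.asList(
                new Rec("http://a.example/", 1.0f, 100L),
                new Rec("http://b.example/", 2.0f, 50L),
                new Rec("http://c.example/", 2.0f, 75L));
        // The third record is kept: highest boost, and later than the second.
        System.out.println(idsToDelete(group));
    }
}
```

In this sketch a single pass is enough because only the current best record has to be remembered, which fits a reducer that receives its values as an Iterator. How the real class issues the deletions against Solr is not shown here.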

Unlike DeleteDuplicates, we assume that two documents in a Solr index
will never have the same URL. So this class only deals with documents
with different URLs but the same digest.

Modifier and Type | Class and Description |
---|---|
static class | SolrDeleteDuplicates.SolrInputFormat |
static class | SolrDeleteDuplicates.SolrInputSplit |
static class | SolrDeleteDuplicates.SolrRecord |
Modifier and Type | Field and Description |
---|---|
static org.slf4j.Logger | LOG |
Constructor and Description |
---|
SolrDeleteDuplicates() |
Modifier and Type | Method and Description |
---|---|
void | close() |
void | configure(org.apache.hadoop.mapred.JobConf job) |
void | dedup(String solrUrl) |
void | dedup(String solrUrl, boolean noCommit) |
org.apache.hadoop.conf.Configuration | getConf() |
static void | main(String[] args) |
void | reduce(org.apache.hadoop.io.Text key, Iterator<SolrDeleteDuplicates.SolrRecord> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord> output, org.apache.hadoop.mapred.Reporter reporter) |
int | run(String[] args) |
void | setConf(org.apache.hadoop.conf.Configuration conf) |

public org.apache.hadoop.conf.Configuration getConf()
Specified by: getConf in interface org.apache.hadoop.conf.Configurable

public void setConf(org.apache.hadoop.conf.Configuration conf)
Specified by: setConf in interface org.apache.hadoop.conf.Configurable

public void configure(org.apache.hadoop.mapred.JobConf job)
Specified by: configure in interface org.apache.hadoop.mapred.JobConfigurable

public void close() throws IOException
Specified by: close in interface Closeable
Specified by: close in interface AutoCloseable
Throws: IOException

public void reduce(org.apache.hadoop.io.Text key, Iterator<SolrDeleteDuplicates.SolrRecord> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
Specified by: reduce in interface org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
Throws: IOException

public void dedup(String solrUrl) throws IOException
Throws: IOException

public void dedup(String solrUrl, boolean noCommit) throws IOException
Throws: IOException

public int run(String[] args) throws IOException
Specified by: run in interface org.apache.hadoop.util.Tool
Throws: IOException

Copyright © 2013 The Apache Software Foundation