org.apache.pig.piggybank.evaluation.util
Class Top
java.lang.Object
org.apache.pig.EvalFunc<DataBag>
org.apache.pig.piggybank.evaluation.util.Top
- All Implemented Interfaces:
- Algebraic
public class Top
- extends EvalFunc<DataBag>
- implements Algebraic
The Top UDF accepts a bag of tuples and returns the top-n tuples, ranked by
the value of a tuple field of type long. Both n and the field number need to
be provided to the UDF. The UDF iterates through the input bag and retains
only the top-n tuples by storing them in a priority queue of size n+1, where
the priority is the long field. This is efficient because a priority queue
exposes its least element in constant time, O(1), and each insertion or
removal costs O(log n) for heap restructuring. The UDF is especially helpful
for turning the nested grouping operation inside out and retaining the top-n
in a nested group.
Assumes all tuples in the bag contain an element of the same type in the compared column.
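The bounded priority queue technique described above can be sketched in plain Java. This is an illustration only, not the UDF's actual source: the class name TopNSketch and the method topN are hypothetical, and it ranks bare Long values instead of Pig Tuple objects compared on a chosen field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: not part of piggybank.
class TopNSketch {

    /** Returns the n largest values of input, in descending order. */
    static List<Long> topN(List<Long> input, int n) {
        // Min-heap with room for n + 1 entries: the smallest retained
        // value is always at the head of the queue.
        PriorityQueue<Long> heap = new PriorityQueue<>(n + 1);
        for (long value : input) {
            heap.add(value);            // O(log n) insertion
            if (heap.size() > n) {
                heap.poll();            // evict the current minimum, O(log n)
            }
        }
        // The heap's iteration order is unspecified, so sort before returning.
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        // prints [9, 7, 5]
        System.out.println(topN(Arrays.asList(5L, 1L, 9L, 3L, 7L), 3));
    }
}
```

Because the queue never holds more than n+1 entries, memory stays proportional to n regardless of the size of the input bag.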
Sample usage:
A = LOAD 'test.tsv' as (first: chararray, second: chararray);
B = GROUP A BY (first, second);
C = FOREACH B generate FLATTEN(group), COUNT(*) as count;
D = GROUP C BY first; -- again group by first
topResults = FOREACH D {
result = Top(10, 2, C); -- retain the top 10 rows per 'first', ranked by column 2 (count)
GENERATE FLATTEN(result);
}
Constructor Summary
Top()
Methods inherited from class org.apache.pig.EvalFunc
finish, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setPigLogger, setReporter, warn
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Top
public Top()
exec
public DataBag exec(Tuple tuple)
throws IOException
- Description copied from class:
EvalFunc
- This callback method must be implemented by all subclasses. This
is the method that will be invoked on every Tuple of a given dataset.
Since the dataset may be divided up in a variety of ways the programmer
should not make assumptions about state that is maintained between
invocations of this method.
- Specified by:
exec
in class EvalFunc<DataBag>
- Parameters:
tuple
- the Tuple to be processed.
- Returns:
- result, of type T.
- Throws:
IOException
updateTop
protected static void updateTop(PriorityQueue<Tuple> store,
int limit,
DataBag inputBag)
getArgToFuncMapping
public List<FuncSpec> getArgToFuncMapping()
throws FrontendException
- Overrides:
getArgToFuncMapping
in class EvalFunc<DataBag>
- Returns:
- A List containing FuncSpec objects representing the Function class
which can handle the inputs corresponding to the schema in the objects
- Throws:
FrontendException
outputSchema
public Schema outputSchema(Schema input)
- Overrides:
outputSchema
in class EvalFunc<DataBag>
- Parameters:
input
- Schema of the input
- Returns:
- Schema of the output
getInitial
public String getInitial()
- Specified by:
getInitial
in interface Algebraic
- Returns:
- A string to instantiate f_init. f_init should be an eval func
getIntermed
public String getIntermed()
- Specified by:
getIntermed
in interface Algebraic
- Returns:
- A string to instantiate f_intermed. f_intermed should be an eval func
getFinal
public String getFinal()
- Specified by:
getFinal
in interface Algebraic
- Returns:
- A string to instantiate f_final. f_final should be an eval func parametrized by
the same datum as the eval func implementing this interface
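A top-n computation fits the Algebraic interface because the top-n of a whole bag can be recovered from the top-n of each of its partitions. The sketch below illustrates that property only. It is not the code behind getInitial, getIntermed, or getFinal, the names AlgebraicTopSketch, topN, and combine are hypothetical, and it works on bare Long values instead of Pig tuples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: not part of piggybank.
class AlgebraicTopSketch {

    /** Per-partition step: keep only the n largest values. */
    static List<Long> topN(List<Long> input, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>(n + 1);
        for (long value : input) {
            heap.add(value);
            if (heap.size() > n) {
                heap.poll();            // drop the smallest retained value
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    /** Merge step: the top-n of the partial results is the global top-n. */
    static List<Long> combine(List<List<Long>> partials, int n) {
        List<Long> merged = new ArrayList<>();
        for (List<Long> partial : partials) {
            merged.addAll(partial);
        }
        return topN(merged, n);
    }

    public static void main(String[] args) {
        List<Long> partialA = topN(Arrays.asList(4L, 8L, 1L), 2); // [8, 4]
        List<Long> partialB = topN(Arrays.asList(9L, 2L, 7L), 2); // [9, 7]
        // prints [9, 8]
        System.out.println(combine(Arrays.asList(partialA, partialB), 2));
    }
}
```

Any value discarded by a partition is already outranked by n values in that same partition, so it cannot appear in the global result. That is why the partial results can be merged safely.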