org.apache.pig.piggybank.evaluation.util
Class Top

java.lang.Object
  extended by org.apache.pig.EvalFunc<DataBag>
      extended by org.apache.pig.piggybank.evaluation.util.Top
All Implemented Interfaces:
Algebraic

public class Top
extends EvalFunc<DataBag>
implements Algebraic

The Top UDF accepts a bag of tuples and returns the top-n tuples, ranked by a tuple field of type long. Both n and the field number must be provided to the UDF; in the sample below, field number 2 selects the count column, so fields are counted from zero. The UDF iterates through the input bag and retains only the top-n tuples by storing them in a priority queue of size n+1, where the priority is the long field. This is efficient because the priority queue exposes its least element in constant time, O(1), and removing that element costs O(log n) for heap restructuring. The UDF is especially helpful for turning a nested grouping operation inside out and retaining the top-n in a nested group. It assumes that all tuples in the bag contain an element of the same type in the compared column.

Sample usage:

    A = LOAD 'test.tsv' as (first: chararray, second: chararray);
    B = GROUP A BY (first, second);
    C = FOREACH B generate FLATTEN(group), COUNT(*) as count;
    D = GROUP C BY first; // again group by first
    topResults = FOREACH D {
        result = Top(10, 2, C); // and retain top 10 occurrences of 'second' in first
        GENERATE FLATTEN(result);
    }
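The bounded priority queue technique described above can be illustrated with a minimal, standalone sketch. It does not use the Pig API: the class name TopNSketch, the method topN, and the use of long[] rows in place of Pig tuples are all assumptions made for illustration, and the actual Top implementation may differ in its details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of piggybank.
public class TopNSketch {

    /** Returns the n rows with the largest value in column fieldIndex (in no particular order). */
    static List<long[]> topN(List<long[]> rows, int n, int fieldIndex) {
        List<long[]> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap keyed on the compared field: the head is always the
        // smallest row currently retained. Capacity n+1 leaves room for the
        // one extra row that is added before the smallest is evicted.
        PriorityQueue<long[]> heap = new PriorityQueue<>(
                n + 1, Comparator.comparingLong((long[] row) -> row[fieldIndex]));
        for (long[] row : rows) {
            heap.add(row);
            if (heap.size() > n) {
                heap.poll(); // drop the current least element
            }
        }
        result.addAll(heap);
        return result;
    }

    public static void main(String[] args) {
        List<long[]> rows = Arrays.asList(
                new long[]{1, 5}, new long[]{2, 9}, new long[]{3, 1}, new long[]{4, 7});
        // Keep the 2 rows with the largest value in column 1.
        for (long[] row : topN(rows, 2, 1)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

Because the heap never holds more than n+1 rows, memory stays proportional to n rather than to the size of the input bag.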


Nested Class Summary
static class Top.Final
           
static class Top.Initial
           
static class Top.Intermed
           
 
Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
pigLogger, reporter, returnType
 
Constructor Summary
Top()
           
 
Method Summary
 DataBag exec(Tuple tuple)
          This callback method must be implemented by all subclasses.
 List<FuncSpec> getArgToFuncMapping()
           
 String getFinal()
           
 String getInitial()
           
 String getIntermed()
           
 Schema outputSchema(Schema input)
           
protected static void updateTop(PriorityQueue<Tuple> store, int limit, DataBag inputBag)
           
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setPigLogger, setReporter, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Top

public Top()
Method Detail

exec

public DataBag exec(Tuple tuple)
             throws IOException
Description copied from class: EvalFunc
This callback method must be implemented by all subclasses. This is the method that will be invoked on every Tuple of a given dataset. Since the dataset may be divided up in a variety of ways, the programmer should not make assumptions about state that is maintained between invocations of this method.

Specified by:
exec in class EvalFunc<DataBag>
Parameters:
tuple - the Tuple to be processed.
Returns:
result, of type T.
Throws:
IOException

updateTop

protected static void updateTop(PriorityQueue<Tuple> store,
                                int limit,
                                DataBag inputBag)

getArgToFuncMapping

public List<FuncSpec> getArgToFuncMapping()
                                   throws FrontendException
Overrides:
getArgToFuncMapping in class EvalFunc<DataBag>
Returns:
A List containing FuncSpec objects representing the Function class which can handle the inputs corresponding to the schema in the objects
Throws:
FrontendException

outputSchema

public Schema outputSchema(Schema input)
Overrides:
outputSchema in class EvalFunc<DataBag>
Parameters:
input - Schema of the input
Returns:
Schema of the output

getInitial

public String getInitial()
Specified by:
getInitial in interface Algebraic
Returns:
A string to instantiate f_init. f_init should be an eval func

getIntermed

public String getIntermed()
Specified by:
getIntermed in interface Algebraic
Returns:
A string to instantiate f_intermed. f_intermed should be an eval func

getFinal

public String getFinal()
Specified by:
getFinal in interface Algebraic
Returns:
A string to instantiate f_final. f_final should be an eval func parametrized by the same datum as the eval func implementing this interface
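The three Algebraic methods above name functions that let a computation be split into partial stages whose results are later combined. The following sketch shows why a top-n selection suits that pattern: taking the top-n of each chunk and then the top-n of the merged partial results gives the same answer as a single pass over all the data. It is plain Java, not the Pig API; AlgebraicTopSketch, partialTop, and finalTop are hypothetical names, and the sketch makes no claim about how Top.Initial, Top.Intermed, and Top.Final are actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of piggybank.
public class AlgebraicTopSketch {

    /** Partial stage: keep the n largest values of one chunk. */
    static List<Long> partialTop(List<Long> chunk, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // natural order, so a min-heap
        for (Long value : chunk) {
            heap.add(value);
            if (heap.size() > n) {
                heap.poll(); // evict the smallest retained value
            }
        }
        return new ArrayList<>(heap);
    }

    /** Final stage: merge the partial results, reduce once more, sort descending. */
    static List<Long> finalTop(List<List<Long>> partials, int n) {
        List<Long> merged = new ArrayList<>();
        for (List<Long> partial : partials) {
            merged.addAll(partial);
        }
        List<Long> top = partialTop(merged, n);
        Collections.sort(top, Collections.reverseOrder());
        return top;
    }

    public static void main(String[] args) {
        // Three chunks, as if the input had been split across workers.
        List<List<Long>> partials = Arrays.asList(
                partialTop(Arrays.asList(5L, 1L, 9L), 2),
                partialTop(Arrays.asList(7L, 3L), 2),
                partialTop(Arrays.asList(8L, 2L), 2));
        System.out.println(finalTop(partials, 2)); // prints [9, 8]
    }
}
```

Any value in the overall top-n must also be in the top-n of its own chunk, so discarding the rest early loses nothing and reduces the data passed between stages.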


Copyright © ${year} The Apache Software Foundation