org.apache.pig.backend.hadoop.executionengine.mapReduceLayer
Class MRCompiler

java.lang.Object
  extended by org.apache.pig.impl.plan.PlanVisitor<PhysicalOperator,PhysicalPlan>
      extended by org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhyPlanVisitor
          extended by org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler

public class MRCompiler
extends PhyPlanVisitor

Compiles a given physical plan into a DAG of MapReduce operators, which can then be converted into the JobControl structure. It is implemented as a visitor of the PhysicalPlan it is compiling. Currently supports all operators except the MR Sort operator. Uses a predecessor-based depth-first traversal: to compile an operator, it first compiles the predecessors into MapReduce operators and then tries to merge the current operator into one of them, the goal being to keep the number of MROpers to a minimum. It also merges multiple map jobs, created by compiling the inputs individually, into a single job. In that case a new map job is created and the contents of the previous map plans are added to it. However, any other state held by the previous map jobs must be moved over manually, so take care of this when adding something new; requestedParallelism is an example. A new MapReduce operator is started only for blocking operators and splits, using a store-load combination to connect the two operators. Whenever this happens, care is taken to add the MROper to the MRPlan and connect it appropriately.
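
The traversal described above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions, not Pig's implementation: Op, Job, and ToyPlanCompiler are hypothetical stand-ins for PhysicalOperator, MapReduceOper, and MRCompiler, the plan is assumed to be tree-shaped, and carrying requestedParallelism over by taking the maximum is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of predecessor-first compilation; not Pig's MRCompiler. */
final class ToyPlanCompiler {

    /** Hypothetical stand-in for a physical operator in a tree-shaped plan. */
    static final class Op {
        final String name;
        final boolean blocking;
        final int parallelism; // -1 means "not requested"
        final List<Op> preds;

        Op(String name, boolean blocking, int parallelism, Op... preds) {
            this.name = name;
            this.blocking = blocking;
            this.parallelism = parallelism;
            this.preds = Arrays.asList(preds);
        }
    }

    /** Hypothetical stand-in for a MapReduce operator. */
    static final class Job {
        final List<String> ops = new ArrayList<>();
        int requestedParallelism = -1;
    }

    /** Jobs that are still part of the plan, in creation order. */
    final List<Job> jobs = new ArrayList<>();

    Job compile(Op op) {
        // Depth-first: compile every predecessor before the operator itself.
        List<Job> predJobs = new ArrayList<>();
        for (Op pred : op.preds) {
            predJobs.add(compile(pred));
        }

        Job target;
        if (predJobs.isEmpty()) {
            target = newJob(); // a leaf such as a load starts its own job
        } else if (op.blocking) {
            // Blocking operator: close the predecessors with a store and
            // start a new job that begins with the matching load.
            for (Job pj : predJobs) {
                pj.ops.add("store(tmp)");
            }
            target = newJob();
            target.ops.add("load(tmp)");
        } else if (predJobs.size() == 1) {
            target = predJobs.get(0); // merge into the single predecessor job
        } else {
            // Several map jobs: create a new one and move their contents over.
            target = newJob();
            for (Job pj : predJobs) {
                target.ops.addAll(pj.ops);
                // Extra state has to be carried over by hand (assumption: max).
                target.requestedParallelism =
                        Math.max(target.requestedParallelism, pj.requestedParallelism);
                jobs.remove(pj);
            }
        }
        target.ops.add(op.name);
        target.requestedParallelism = Math.max(target.requestedParallelism, op.parallelism);
        return target;
    }

    private Job newJob() {
        Job job = new Job();
        jobs.add(job);
        return job;
    }

    public static void main(String[] args) {
        Op a = new Op("loadA", false, -1);
        Op b = new Op("loadB", false, 4);
        Op union = new Op("union", false, -1, a, b);
        Op sort = new Op("sort", true, 2, union);
        Op store = new Op("store", false, -1, sort);

        ToyPlanCompiler compiler = new ToyPlanCompiler();
        compiler.compile(store);
        for (Job job : compiler.jobs) {
            System.out.println(job.ops + " parallelism=" + job.requestedParallelism);
        }
    }
}
```

In this toy plan the two load jobs are merged into one map job, and only the blocking sort forces a second job, connected to the first through the store/load pair.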


Field Summary
static String USER_COMPARATOR_MARKER
           
 
Fields inherited from class org.apache.pig.impl.plan.PlanVisitor
mCurrentWalker, mPlan
 
Constructor Summary
MRCompiler(PhysicalPlan plan)
           
MRCompiler(PhysicalPlan plan, PigContext pigContext)
           
 
Method Summary
 MROperPlan compile()
          The front-end method that the user calls to compile the plan.
 void connectMapToReduceLimitedSort(MapReduceOper mro, MapReduceOper sortMROp)
           
 CompilationMessageCollector getMessageCollector()
           
 MROperPlan getMRPlan()
          Used to get the compiled plan
 POForEach getPlainForEachOP()
           
 PhysicalPlan getPlan()
          Used to get the plan that was compiled
 Pair<MapReduceOper,Integer> getQuantileJob(POSort inpSort, MapReduceOper prevJob, FileSpec lFile, FileSpec quantFile, int rp, Pair<Integer,Byte>[] fields)
           
protected  Pair<MapReduceOper,Integer> getSamplingJob(POSort sort, MapReduceOper prevJob, List<PhysicalPlan> transformPlans, FileSpec lFile, FileSpec sampleFile, int rp, List<PhysicalPlan> sortKeyPlans, String udfClassName, String[] udfArgs, String sampleLdrClassName)
          Create a sampling job to collect statistics by sampling an input file.
 Pair<MapReduceOper,Integer> getSkewedJoinSampleJob(POSkewedJoin op, MapReduceOper prevJob, FileSpec lFile, FileSpec sampleFile, int rp)
          Create Sampling job for skewed join.
 MapReduceOper getSortJob(POSort sort, MapReduceOper quantJob, FileSpec lFile, FileSpec quantFile, int rp, Pair<Integer,Byte>[] fields)
           
 void randomizeFileLocalizer()
           
 void simpleConnectMapToReduce(MapReduceOper mro)
           
 void visitCollectedGroup(POCollectedGroup op)
           
 void visitDistinct(PODistinct op)
           
 void visitFilter(POFilter op)
           
 void visitFRJoin(POFRJoin op)
          This operator has multiple inputs (one per join input), but the compiler prunes off all inputs except the fragment input, creates a separate MR job for each of the replicated inputs, and uses their outputs as the replicated files configured in the POFRJoin operator.
 void visitGlobalRearrange(POGlobalRearrange op)
           
 void visitLimit(POLimit op)
           
 void visitLoad(POLoad op)
           
 void visitLocalRearrange(POLocalRearrange op)
           
 void visitMergeJoin(POMergeJoin joinOp)
          Since merge-join works on two inputs there are exactly two MROper predecessors identified as left and right.
 void visitPackage(POPackage op)
           
 void visitPOForEach(POForEach op)
           
 void visitSkewedJoin(POSkewedJoin op)
           
 void visitSort(POSort op)
           
 void visitSplit(POSplit op)
          Compiles a split operator.
 void visitStore(POStore op)
           
 void visitStream(POStream op)
           
 void visitUnion(POUnion op)
           
 
Methods inherited from class org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhyPlanVisitor
visitAdd, visitAnd, visitBinCond, visitCast, visitCogroup, visitCombinerPackage, visitComparisonFunc, visitConstant, visitCross, visitDemux, visitDivide, visitEqualTo, visitGreaterThan, visitGTOrEqual, visitIsNull, visitJoinPackage, visitLessThan, visitLocalRearrangeForIllustrate, visitLTOrEqual, visitMapLookUp, visitMod, visitMultiply, visitMultiQueryPackage, visitNegative, visitNot, visitNotEqualTo, visitOr, visitPartitionRearrange, visitPOOptimizedForEach, visitPreCombinerLocalRearrange, visitProject, visitRead, visitRegexp, visitSplit, visitSubtract, visitUserFunc
 
Methods inherited from class org.apache.pig.impl.plan.PlanVisitor
popWalker, pushWalker, visit
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

USER_COMPARATOR_MARKER

public static final String USER_COMPARATOR_MARKER
See Also:
Constant Field Values
Constructor Detail

MRCompiler

public MRCompiler(PhysicalPlan plan)
           throws MRCompilerException
Throws:
MRCompilerException

MRCompiler

public MRCompiler(PhysicalPlan plan,
                  PigContext pigContext)
           throws MRCompilerException
Throws:
MRCompilerException
Method Detail

randomizeFileLocalizer

public void randomizeFileLocalizer()

getMRPlan

public MROperPlan getMRPlan()
Used to get the compiled plan

Returns:
map reduce plan built by the compiler

getPlan

public PhysicalPlan getPlan()
Used to get the plan that was compiled

Overrides:
getPlan in class PlanVisitor<PhysicalOperator,PhysicalPlan>
Returns:
physical plan

getMessageCollector

public CompilationMessageCollector getMessageCollector()

compile

public MROperPlan compile()
                   throws IOException,
                          PlanException,
                          VisitorException
The front-end method that the user calls to compile the plan. Assumes that every submitted plan has a Store operator as its leaf.

Returns:
A map reduce plan
Throws:
IOException
PlanException
VisitorException

visitSplit

public void visitSplit(POSplit op)
                throws VisitorException
Compiles a split operator. The split job is closed by replacing the split operator with a store. A new map MROper is then created and returned as the current MROper, into which subsequent operators are compiled. The new MROper is connected to the split job through a store-load pair. The split operator is also added to the splitsSeen map.

Overrides:
visitSplit in class PhyPlanVisitor
Parameters:
op - The split operator
Throws:
VisitorException
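
The split handling described above can be sketched with a toy model. This is a hedged illustration rather than Pig's code: SplitSketch and Job are hypothetical, the temporary file name is invented, and the value type stored in splitsSeen is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of closing a split job with a store; not Pig's visitSplit. */
final class SplitSketch {

    /** Hypothetical stand-in for a MapReduce operator. */
    static final class Job {
        final List<String> ops = new ArrayList<>();
    }

    /** Jobs created while handling splits. */
    final List<Job> plan = new ArrayList<>();

    /** Split name -> job that was closed for it (value type is an assumption). */
    final Map<String, Job> splitsSeen = new HashMap<>();

    Job visitSplit(Job splitJob, String splitName) {
        String tmp = "tmp_" + splitName; // hypothetical temporary file name

        // Close the split job: a store takes the place of the split operator.
        splitJob.ops.add("store(" + tmp + ")");
        splitsSeen.put(splitName, splitJob);

        // Start a new map job that loads what the split job stored.
        Job next = new Job();
        next.ops.add("load(" + tmp + ")");
        plan.add(next);
        return next; // later operators are compiled into this job
    }

    public static void main(String[] args) {
        SplitSketch sketch = new SplitSketch();
        Job splitJob = new Job();
        splitJob.ops.add("load(input)");

        Job current = sketch.visitSplit(splitJob, "split1");
        current.ops.add("filter");

        System.out.println(splitJob.ops);
        System.out.println(current.ops);
    }
}
```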

visitLoad

public void visitLoad(POLoad op)
               throws VisitorException
Overrides:
visitLoad in class PhyPlanVisitor
Throws:
VisitorException

visitStore

public void visitStore(POStore op)
                throws VisitorException
Overrides:
visitStore in class PhyPlanVisitor
Throws:
VisitorException

visitFilter

public void visitFilter(POFilter op)
                 throws VisitorException
Overrides:
visitFilter in class PhyPlanVisitor
Throws:
VisitorException

visitStream

public void visitStream(POStream op)
                 throws VisitorException
Overrides:
visitStream in class PhyPlanVisitor
Throws:
VisitorException

connectMapToReduceLimitedSort

public void connectMapToReduceLimitedSort(MapReduceOper mro,
                                          MapReduceOper sortMROp)
                                   throws PlanException,
                                          VisitorException
Throws:
PlanException
VisitorException

simpleConnectMapToReduce

public void simpleConnectMapToReduce(MapReduceOper mro)
                              throws PlanException
Throws:
PlanException

getPlainForEachOP

public POForEach getPlainForEachOP()

visitLimit

public void visitLimit(POLimit op)
                throws VisitorException
Overrides:
visitLimit in class PhyPlanVisitor
Throws:
VisitorException

visitLocalRearrange

public void visitLocalRearrange(POLocalRearrange op)
                         throws VisitorException
Overrides:
visitLocalRearrange in class PhyPlanVisitor
Throws:
VisitorException

visitCollectedGroup

public void visitCollectedGroup(POCollectedGroup op)
                         throws VisitorException
Overrides:
visitCollectedGroup in class PhyPlanVisitor
Throws:
VisitorException

visitPOForEach

public void visitPOForEach(POForEach op)
                    throws VisitorException
Overrides:
visitPOForEach in class PhyPlanVisitor
Throws:
VisitorException

visitGlobalRearrange

public void visitGlobalRearrange(POGlobalRearrange op)
                          throws VisitorException
Overrides:
visitGlobalRearrange in class PhyPlanVisitor
Throws:
VisitorException

visitPackage

public void visitPackage(POPackage op)
                  throws VisitorException
Overrides:
visitPackage in class PhyPlanVisitor
Throws:
VisitorException

visitUnion

public void visitUnion(POUnion op)
                throws VisitorException
Overrides:
visitUnion in class PhyPlanVisitor
Throws:
VisitorException

visitFRJoin

public void visitFRJoin(POFRJoin op)
                 throws VisitorException
This operator has multiple inputs (one per join input), but the compiler prunes off all inputs except the fragment input, creates a separate MR job for each of the replicated inputs, and uses their outputs as the replicated files configured in the POFRJoin operator. It also marks the job as an FRJoin job and sets some parameters associated with it.

Overrides:
visitFRJoin in class PhyPlanVisitor
Throws:
VisitorException
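
As a rough illustration of the pruning described above, the following toy sketch produces one job per replicated input and a fragment job that refers to the resulting files. All names here (FRJoinSketch, Result, the repl_ file prefix) are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of fragment-replicate join compilation; not Pig's visitFRJoin. */
final class FRJoinSketch {

    /** Hypothetical result: the jobs created and the files they produce. */
    static final class Result {
        final List<String> replicationJobs = new ArrayList<>();
        final List<String> replicatedFiles = new ArrayList<>();
        String fragmentJob;
    }

    static Result compile(List<String> inputs, int fragmentIndex) {
        Result result = new Result();
        for (int i = 0; i < inputs.size(); i++) {
            if (i == fragmentIndex) {
                continue; // the fragment input stays with the join itself
            }
            // Each replicated input gets its own job that stores a file.
            String file = "repl_" + inputs.get(i); // hypothetical file name
            result.replicationJobs.add("load(" + inputs.get(i) + ") -> store(" + file + ")");
            result.replicatedFiles.add(file);
        }
        // The join job reads only the fragment input and is configured
        // with the replicated files.
        result.fragmentJob =
                "load(" + inputs.get(fragmentIndex) + ") -> frjoin" + result.replicatedFiles;
        return result;
    }

    public static void main(String[] args) {
        Result result = compile(Arrays.asList("A", "B", "C"), 0);
        result.replicationJobs.forEach(System.out::println);
        System.out.println(result.fragmentJob);
    }
}
```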

visitMergeJoin

public void visitMergeJoin(POMergeJoin joinOp)
                    throws VisitorException
Since merge-join works on two inputs, there are exactly two MROper predecessors, identified as left and right. Instead of merging the two operators, each is used to generate an MR job. The first MR operator is run to generate an on-the-fly index on the right side; the second performs the actual join. The first is identified as rightMROper and the second as curMROper.
  1) RightMROper: If it is in the map phase, it can be preceded only by POLoad; anything else in the physical plan is yanked and set as the inner plans of joinOp. If it is in the reduce phase, this operator is closed and a new MROper is started.
  2) LeftMROper: If it is in the map phase, the join operator is added to it. If it is in the reduce phase, it is closed and a new MROper is started.

Overrides:
visitMergeJoin in class PhyPlanVisitor
Throws:
VisitorException
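
The two cases above can be written out as a small decision sketch. It is a toy model under stated assumptions, not Pig's code: MergeJoinSketch, Job, Phase, and RightSide are hypothetical, operators are plain strings, and a map-phase right job is assumed to start with its load.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the merge-join phase handling; not Pig's visitMergeJoin. */
final class MergeJoinSketch {

    enum Phase { MAP, REDUCE }

    /** Hypothetical stand-in for a MapReduce operator. */
    static final class Job {
        final Phase phase;
        final List<String> ops;

        Job(Phase phase, String... ops) {
            this.phase = phase;
            this.ops = new ArrayList<>(Arrays.asList(ops));
        }
    }

    /** Right-side job plus the operators moved into the join's inner plans. */
    static final class RightSide {
        final Job job;
        final List<String> innerPlan;

        RightSide(Job job, List<String> innerPlan) {
            this.job = job;
            this.innerPlan = innerPlan;
        }
    }

    /** Closes a job with a store and starts a new map job with the load. */
    static Job closeAndRestart(Job job) {
        job.ops.add("store(tmp)");
        return new Job(Phase.MAP, "load(tmp)");
    }

    static RightSide prepareRight(Job right) {
        if (right.phase == Phase.REDUCE) {
            return new RightSide(closeAndRestart(right), new ArrayList<>());
        }
        // Map phase: keep only the leading load (assumed to be ops.get(0))
        // and yank everything after it into the join's inner plans.
        List<String> tail = right.ops.subList(1, right.ops.size());
        List<String> inner = new ArrayList<>(tail);
        tail.clear();
        return new RightSide(right, inner);
    }

    static Job placeJoin(Job left) {
        Job current = left.phase == Phase.MAP ? left : closeAndRestart(left);
        current.ops.add("mergeJoin");
        return current;
    }

    public static void main(String[] args) {
        RightSide right = prepareRight(new Job(Phase.MAP, "load(R)", "filter"));
        Job current = placeJoin(new Job(Phase.REDUCE, "package"));
        System.out.println(right.job.ops + " inner=" + right.innerPlan);
        System.out.println(current.ops);
    }
}
```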

visitDistinct

public void visitDistinct(PODistinct op)
                   throws VisitorException
Overrides:
visitDistinct in class PhyPlanVisitor
Throws:
VisitorException

visitSkewedJoin

public void visitSkewedJoin(POSkewedJoin op)
                     throws VisitorException
Overrides:
visitSkewedJoin in class PhyPlanVisitor
Throws:
VisitorException

visitSort

public void visitSort(POSort op)
               throws VisitorException
Overrides:
visitSort in class PhyPlanVisitor
Throws:
VisitorException

getSortJob

public MapReduceOper getSortJob(POSort sort,
                                MapReduceOper quantJob,
                                FileSpec lFile,
                                FileSpec quantFile,
                                int rp,
                                Pair<Integer,Byte>[] fields)
                         throws PlanException
Throws:
PlanException

getQuantileJob

public Pair<MapReduceOper,Integer> getQuantileJob(POSort inpSort,
                                                  MapReduceOper prevJob,
                                                  FileSpec lFile,
                                                  FileSpec quantFile,
                                                  int rp,
                                                  Pair<Integer,Byte>[] fields)
                                           throws PlanException,
                                                  VisitorException
Throws:
PlanException
VisitorException

getSkewedJoinSampleJob

public Pair<MapReduceOper,Integer> getSkewedJoinSampleJob(POSkewedJoin op,
                                                          MapReduceOper prevJob,
                                                          FileSpec lFile,
                                                          FileSpec sampleFile,
                                                          int rp)
                                                   throws PlanException,
                                                          VisitorException
Create Sampling job for skewed join.

Throws:
PlanException
VisitorException

getSamplingJob

protected Pair<MapReduceOper,Integer> getSamplingJob(POSort sort,
                                                     MapReduceOper prevJob,
                                                     List<PhysicalPlan> transformPlans,
                                                     FileSpec lFile,
                                                     FileSpec sampleFile,
                                                     int rp,
                                                     List<PhysicalPlan> sortKeyPlans,
                                                     String udfClassName,
                                                     String[] udfArgs,
                                                     String sampleLdrClassName)
                                              throws PlanException,
                                                     VisitorException
Create a sampling job to collect statistics by sampling an input file. The sequence of operations is as follows:
  • Transform input sample tuples into another tuple.
  • Add an extra field "all" into the tuple.
  • Package all tuples into one bag.
  • Add a constant field for the number of reducers.
  • Sort the bag.
  • Invoke the UDF with the number of reducers and the sorted bag.
  • Store the data generated by the UDF into a file.

Parameters:
sort - the POSort operator used to sort the bag
prevJob - previous job of current sampling job
transformPlans - PhysicalPlans to transform input samples
lFile - path of input file
sampleFile - path of output file
rp - configured parallelism
sortKeyPlans - PhysicalPlans to be set into POSort operator to get sorting keys
udfClassName - the class name of UDF
udfArgs - the arguments of UDF
sampleLdrClassName - class name for the sample loader
Returns:
pair
Throws:
PlanException
VisitorException
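
The sequence of operations in a sampling job can be mimicked in memory with plain Java collections. The sketch below is only an illustration under stated assumptions: SamplingSketch is hypothetical, the quantiles method is an invented stand-in for the UDF named by udfClassName, the sample is assumed to be non-empty, and the result is returned to the caller instead of being stored in a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

/** In-memory toy version of the sampling pipeline; not Pig's getSamplingJob. */
final class SamplingSketch {

    /** Hypothetical UDF: picks numReducers - 1 boundaries from a sorted bag. */
    static List<Integer> quantiles(int numReducers, List<Integer> sortedBag) {
        List<Integer> boundaries = new ArrayList<>();
        for (int i = 1; i < numReducers; i++) {
            boundaries.add(sortedBag.get(i * sortedBag.size() / numReducers));
        }
        return boundaries;
    }

    static List<Integer> run(List<String> samples,
                             Function<String, Integer> transform,
                             int rp,
                             BiFunction<Integer, List<Integer>, List<Integer>> udf) {
        // Transform each sample, tag it with the constant key "all",
        // and package everything under that key into a single bag.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String sample : samples) {
            grouped.computeIfAbsent("all", key -> new ArrayList<>())
                   .add(transform.apply(sample));
        }
        List<Integer> bag = grouped.getOrDefault("all", new ArrayList<>());

        // The number of reducers travels with the bag as a constant.
        int numReducers = rp;

        // Sort the bag, then hand it to the UDF with the reducer count.
        Collections.sort(bag);
        return udf.apply(numReducers, bag); // a real job would store this output
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("9", "3", "7", "1", "5", "8");
        List<Integer> result =
                run(samples, Integer::parseInt, 3, SamplingSketch::quantiles);
        System.out.println(result);
    }
}
```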


Copyright © The Apache Software Foundation