org.apache.pig.piggybank.evaluation.util.apachelogparser
Class SearchTermExtractor

java.lang.Object
  extended by org.apache.pig.EvalFunc<String>
      extended by org.apache.pig.piggybank.evaluation.util.apachelogparser.SearchTermExtractor

public class SearchTermExtractor
extends EvalFunc<String>

SearchTermExtractor takes a URL string and extracts the search terms from it. For example, given

http://www.google.com/search?hl=en&safe=active&rls=GGLG,GGLG:2005-24,GGLG:en&q=purpose+of+life&btnG=Search

the string purpose of life would be extracted.

From Pig Latin, usage looks something like:

searchTerm = FOREACH row GENERATE org.apache.pig.piggybank.evaluation.util.apachelogparser.SearchTermExtractor(referer);

Supported search engines include alltheweb.com, altavista.com, aolsearch.aol.com, arianna.libero.it, as.starware.com, ask.com, blogs.icerocket.com, blueyonder.co.uk, busca.orange.es, buscador.lycos.es, buscador.terra.es, buscar.ozu.es, categorico.it, cerca.lycos.it, cuil.com, excite.it, godado.com, godado.it, gps.virgin.net, hotbot.com, ilmotore.com, it.altavista.com, ithaki.net, libero.it, lycos.es, lycos.it, mamma.com, megasearching.net, mirago.co.uk, netscape.com, ozu.es, ricerca.alice.it, search.aol.co.uk, search.bbc.co.uk, search.conduit.com, search.icq.com, search.live.com, search.lycos.co.uk, search.lycos.com, search.msn.co.uk, search.msn.com, search.myway.com, search.mywebsearch.com, search.ntlworld.com, search.orange.co.uk, search.sweetim.com, search.virginmedia.com, simpatico.ws, soso.com, suche.fireball.de, suche.web.de, terra.es, tesco.net, thespider.it, tiscali.co.uk, uk.altavista.com, uk.ask.com.

Thanks to Spiros Denaxas for his URI::ParseSearchString, which is the basis for the lookups.
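
The kind of extraction described above can be illustrated with a minimal, standalone sketch. This is not the class's actual implementation, which relies on per-engine lookups derived from URI::ParseSearchString. The class name SearchTermSketch and the helper extractParam are hypothetical, and the sketch assumes the search terms live in a single named query parameter (q in the example URL).

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical illustration only; not part of piggybank.
class SearchTermSketch {

    // Returns the decoded value of the given query parameter,
    // or null if the URL is null, malformed, or lacks that parameter.
    static String extractParam(String url, String param) {
        if (url == null) {
            return null;
        }
        try {
            String query = URI.create(url).getRawQuery();
            if (query == null) {
                return null;
            }
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                if (eq > 0 && pair.substring(0, eq).equals(param)) {
                    // URLDecoder turns '+' into a space and decodes %XX escapes.
                    return URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
                }
            }
            return null;
        } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            // Malformed URL or escape sequence: treat as "no search terms".
            return null;
        }
    }

    public static void main(String[] args) {
        String url = "http://www.google.com/search?hl=en&safe=active"
                + "&rls=GGLG,GGLG:2005-24,GGLG:en&q=purpose+of+life&btnG=Search";
        System.out.println(extractParam(url, "q"));
    }
}
```

Different engines use different parameter names for the search terms, which is presumably why the real class keys its lookups on the host; the sketch leaves that choice to the caller.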


Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
SearchTermExtractor()
           
 
Method Summary
 String exec(Tuple input)
          This callback method must be implemented by all subclasses.
 List<FuncSpec> getArgToFuncMapping()
           
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, outputSchema, progress, setPigLogger, setReporter, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SearchTermExtractor

public SearchTermExtractor()
Method Detail

exec

public String exec(Tuple input)
            throws IOException
Description copied from class: EvalFunc
This callback method must be implemented by all subclasses. It is invoked on every Tuple of a given dataset. Since the dataset may be divided up in a variety of ways, the programmer should not make assumptions about state that is maintained between invocations of this method.

Specified by:
exec in class EvalFunc<String>
Parameters:
input - the Tuple to be processed.
Returns:
the extracted search terms, of type String.
Throws:
IOException

getArgToFuncMapping

public List<FuncSpec> getArgToFuncMapping()
                                   throws FrontendException
Overrides:
getArgToFuncMapping in class EvalFunc<String>
Returns:
A List of FuncSpec objects, each representing a function class that can handle inputs matching the schema in that object.
Throws:
FrontendException


Copyright © The Apache Software Foundation