org.apache.mahout.cf.taste.impl.model.file
Class FileDataModel

java.lang.Object
  extended by org.apache.mahout.cf.taste.impl.model.file.FileDataModel
All Implemented Interfaces:
Refreshable, DataModel

public class FileDataModel
extends java.lang.Object
implements DataModel

A DataModel backed by a comma-delimited file. This class typically expects a file where each line contains a user ID, followed by an item ID, followed by a preference value, separated by commas. Tabs may also be used as the delimiter.

The preference value is assumed to be parseable as a double. The user IDs and item IDs are parsed as longs.
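
As a minimal sketch of the line format described above, the following plain-JDK snippet splits one line on a comma or a tab and parses the three fields. The class and method names (PreferenceLineSketch, parse) are hypothetical and not part of Mahout, and the delimiter choice shown here is an assumption rather than the library's actual logic.

```java
// Illustrative only: hypothetical helper, not part of Mahout.
final class PreferenceLineSketch {
  final long userID;
  final long itemID;
  final double value;

  PreferenceLineSketch(long userID, long itemID, double value) {
    this.userID = userID;
    this.itemID = itemID;
    this.value = value;
  }

  /** Parses "userID,itemID,value"; a tab may be used instead of a comma. */
  static PreferenceLineSketch parse(String line) {
    // Assumption: a line uses a comma if it contains one, otherwise a tab.
    String delimiter = line.indexOf(',') >= 0 ? "," : "\t";
    String[] tokens = line.split(delimiter);
    if (tokens.length < 3) {
      throw new IllegalArgumentException("Expected at least 3 fields: " + line);
    }
    // IDs are parsed as longs, the preference value as a double.
    return new PreferenceLineSketch(
        Long.parseLong(tokens[0].trim()),
        Long.parseLong(tokens[1].trim()),
        Double.parseDouble(tokens[2].trim()));
  }
}
```

For example, PreferenceLineSketch.parse("123,456,3.5") would yield user 123, item 456 and value 3.5.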

This class will reload data from the data file when refresh(Collection) is called, unless the file has been reloaded very recently already.
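
The "unless reloaded very recently" behavior can be pictured as a simple time guard. This is only a sketch: the class name ReloadGuardSketch and the configurable minimum interval are hypothetical, and this page does not state what interval the real class uses.

```java
// Illustrative only: hypothetical time guard, not Mahout's implementation.
final class ReloadGuardSketch {
  private final long minIntervalMillis; // assumed, configurable in this sketch
  private long lastReloadMillis;
  private boolean loadedOnce;
  int reloadCount;

  ReloadGuardSketch(long minIntervalMillis) {
    this.minIntervalMillis = minIntervalMillis;
  }

  /** Reloads unless a reload happened within the minimum interval; returns true if it reloaded. */
  boolean refresh(long nowMillis) {
    if (loadedOnce && nowMillis - lastReloadMillis < minIntervalMillis) {
      return false; // reloaded very recently, so skip this request
    }
    lastReloadMillis = nowMillis;
    loadedOnce = true;
    reloadCount++; // stands in for re-reading the data file
    return true;
  }
}
```

Passing the current time in as a parameter keeps the sketch easy to exercise without waiting on a real clock.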

This class will also look for update "delta" files in the same directory, with file names that start the same way (up to the first period). These files should have the same format, and provide updated data that supersedes what is in the main data file. This is a mechanism that allows an application to push updates without re-copying the entire data file.
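
To illustrate the naming rule for update files, here is a small sketch that selects file names sharing the data file's prefix up to the first period. The helper names and the example file names are hypothetical; the library's actual matching may apply further conditions not described on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical helper, not part of Mahout.
final class UpdateFileNameSketch {
  /** Returns the part of a file name before its first period, or the whole name if it has none. */
  static String prefixOf(String fileName) {
    int period = fileName.indexOf('.');
    return period < 0 ? fileName : fileName.substring(0, period);
  }

  /** Selects names that share the data file's prefix, excluding the data file itself. */
  static List<String> findUpdateFileNames(String dataFileName, List<String> namesInDirectory) {
    String prefix = prefixOf(dataFileName);
    List<String> result = new ArrayList<>();
    for (String name : namesInDirectory) {
      if (name.startsWith(prefix) && !name.equals(dataFileName)) {
        result.add(name);
      }
    }
    return result;
  }
}
```

Under this rule, a data file named ratings.csv would pick up ratings.update1.csv in the same directory, but not other.csv.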

The line may contain a blank preference value (e.g. "123,456,"). This is interpreted to mean "delete preference", and is only useful in the context of an update delta file (see above). Note that if the line is empty or begins with '#' it will be ignored as a comment.

It is also acceptable for the lines to contain additional fields. Fields beyond the third will be ignored.
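
The rules above for comments, blank preference values and extra fields can be sketched as a small classifier. The names LineKindSketch and Kind are hypothetical, and the sketch assumes comma-delimited lines that carry a preference field.

```java
// Illustrative only: hypothetical helper, not part of Mahout.
final class LineKindSketch {
  enum Kind { IGNORED, DELETE_PREFERENCE, SET_PREFERENCE }

  static Kind classify(String line) {
    // Empty lines and lines beginning with '#' are treated as comments.
    if (line.isEmpty() || line.charAt(0) == '#') {
      return Kind.IGNORED;
    }
    // A limit of -1 keeps trailing empty tokens, so "123,456," yields 3 tokens.
    String[] tokens = line.split(",", -1);
    if (tokens.length < 3) {
      throw new IllegalArgumentException("This sketch expects a preference field: " + line);
    }
    // A blank third field means "delete preference"; fields beyond the third are ignored.
    return tokens[2].trim().isEmpty() ? Kind.DELETE_PREFERENCE : Kind.SET_PREFERENCE;
  }
}
```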

Finally, for applications that have no notion of a preference value (that is, the user simply expresses a preference for an item, but no degree of preference), the caller can simply omit the third token in each line altogether -- for example, "123,456".

Note that it's all-or-nothing -- all of the items in the file must express no preference value, or they all must express one. These cannot be mixed. Put another way, there will always be the same number of delimiters on every line of the file.
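
One way to check the all-or-nothing rule before loading a file is to compare delimiter counts across lines. This is a sketch only: the class DelimiterCountSketch is hypothetical, and it assumes comment and empty lines have already been filtered out.

```java
import java.util.List;

// Illustrative only: hypothetical pre-check, not part of Mahout.
final class DelimiterCountSketch {
  /** Counts occurrences of the delimiter character in a line. */
  static int countDelimiters(String line, char delimiter) {
    int count = 0;
    for (int i = 0; i < line.length(); i++) {
      if (line.charAt(i) == delimiter) {
        count++;
      }
    }
    return count;
  }

  /** Returns true if every line has the same number of delimiters as the first line. */
  static boolean isConsistent(List<String> lines, char delimiter) {
    if (lines.isEmpty()) {
      return true;
    }
    int expected = countDelimiters(lines.get(0), delimiter);
    for (String line : lines) {
      if (countDelimiters(line, delimiter) != expected) {
        return false;
      }
    }
    return true;
  }
}
```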

This class is not intended for use with very large amounts of data (over, say, tens of millions of rows). For that, a JDBC-backed DataModel and a database are more appropriate.

It is possible and likely useful to subclass this class and customize its behavior to accommodate application-specific needs and input formats. See processLine(String, FastByIDMap, boolean) and processLineWithoutID(String, FastByIDMap).


Constructor Summary
FileDataModel(java.io.File dataFile)
           
FileDataModel(java.io.File dataFile, boolean transpose)
           
 
Method Summary
protected  DataModel buildModel()
           
static char determineDelimiter(java.lang.String line, int maxDelimiters)
           
 java.io.File getDataFile()
           
 char getDelimiter()
           
 LongPrimitiveIterator getItemIDs()
           
 FastIDSet getItemIDsFromUser(long userID)
           
 int getNumItems()
           
 int getNumUsers()
           
 int getNumUsersWithPreferenceFor(long... itemIDs)
           
 PreferenceArray getPreferencesForItem(long itemID)
           
 PreferenceArray getPreferencesFromUser(long userID)
           
 java.lang.Float getPreferenceValue(long userID, long itemID)
          Retrieves the preference value for a single user and item.
 LongPrimitiveIterator getUserIDs()
           
 boolean hasPreferenceValues()
           
protected  void processFile(FileLineIterator dataOrUpdateFileIterator, FastByIDMap<?> data, boolean fromPriorData)
           
protected  void processFileWithoutID(FileLineIterator dataOrUpdateFileIterator, FastByIDMap<FastIDSet> data)
           
protected  void processLine(java.lang.String line, FastByIDMap<?> data, boolean fromPriorData)
           Reads one line from the input file and adds the data to a Map data structure which maps user IDs to preferences.
protected  void processLineWithoutID(java.lang.String line, FastByIDMap<FastIDSet> data)
           
protected  long readItemIDFromString(java.lang.String value)
          Subclasses may wish to override this if ID values in the file are not numeric.
protected  long readUserIDFromString(java.lang.String value)
          Subclasses may wish to override this if ID values in the file are not numeric.
 void refresh(java.util.Collection<Refreshable> alreadyRefreshed)
           Triggers "refresh" -- whatever that means -- of the implementation.
protected  void reload()
           
 void removePreference(long userID, long itemID)
          See the warning at setPreference(long, long, float).
 void setPreference(long userID, long itemID, float value)
          Note that this method only updates the in-memory preference data that this maintains; it does not modify any data on disk.
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

FileDataModel

public FileDataModel(java.io.File dataFile)
              throws java.io.IOException
Parameters:
dataFile - file containing preferences data. If the file is compressed (and its name ends in .gz or .zip accordingly), it will be decompressed as it is read
Throws:
java.io.FileNotFoundException - if dataFile does not exist
java.io.IOException - if file can't be read

FileDataModel

public FileDataModel(java.io.File dataFile,
                     boolean transpose)
              throws java.io.IOException
Parameters:
transpose - transposes user IDs and item IDs, which is convenient for 'flipping' the data model
Throws:
java.io.IOException
See Also:
FileDataModel(File)
Method Detail

getDataFile

public java.io.File getDataFile()

getDelimiter

public char getDelimiter()

reload

protected void reload()

buildModel

protected DataModel buildModel()
                        throws java.io.IOException
Throws:
java.io.IOException

determineDelimiter

public static char determineDelimiter(java.lang.String line,
                                      int maxDelimiters)

processFile

protected void processFile(FileLineIterator dataOrUpdateFileIterator,
                           FastByIDMap<?> data,
                           boolean fromPriorData)

processLine

protected void processLine(java.lang.String line,
                           FastByIDMap<?> data,
                           boolean fromPriorData)

Reads one line from the input file and adds the data to a Map data structure which maps user IDs to preferences. This assumes that each line of the input file corresponds to one preference. After reading a line and determining which user and item the preference pertains to, the method should look to see if the data contains a mapping for the user ID already, and if not, add an empty List of Preferences to the data.

Note that if the line is empty or begins with '#' it will be ignored as a comment.

Parameters:
line - line from input data file
data - all data read so far, as a mapping from user IDs to preferences
fromPriorData - an implementation detail -- if true, data will map IDs to PreferenceArray since the framework is attempting to read and update raw data that is already in memory. Otherwise it maps to Collections of Preferences, since it's reading fresh data. Subclasses must be prepared to handle this wrinkle.
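
The fresh-data case described above (fromPriorData is false) can be sketched with plain JDK collections. This is not the library's code: a java.util.Map stands in for FastByIDMap, and the ItemPref class is a hypothetical stand-in for a Preference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: plain-JDK stand-in for the fresh-data path of processLine.
final class ProcessLineSketch {
  /** Hypothetical stand-in for a single preference (item ID plus value). */
  static final class ItemPref {
    final long itemID;
    final double value;

    ItemPref(long itemID, double value) {
      this.itemID = itemID;
      this.value = value;
    }
  }

  /** Adds the preference on one "userID,itemID,value" line to the map of user IDs to preferences. */
  static void processLine(String line, Map<Long, List<ItemPref>> data) {
    // Empty lines and '#' comments are ignored.
    if (line.isEmpty() || line.charAt(0) == '#') {
      return;
    }
    String[] tokens = line.split(",");
    long userID = Long.parseLong(tokens[0].trim());
    long itemID = Long.parseLong(tokens[1].trim());
    double value = Double.parseDouble(tokens[2].trim());
    // If the user has no mapping yet, add an empty list first.
    List<ItemPref> prefs = data.get(userID);
    if (prefs == null) {
      prefs = new ArrayList<>();
      data.put(userID, prefs);
    }
    prefs.add(new ItemPref(itemID, value));
  }
}
```

A subclass overriding the real method would additionally have to handle the case where data already maps IDs to PreferenceArray, which this sketch leaves out.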

processFileWithoutID

protected void processFileWithoutID(FileLineIterator dataOrUpdateFileIterator,
                                    FastByIDMap<FastIDSet> data)

processLineWithoutID

protected void processLineWithoutID(java.lang.String line,
                                    FastByIDMap<FastIDSet> data)

readUserIDFromString

protected long readUserIDFromString(java.lang.String value)
Subclasses may wish to override this if ID values in the file are not numeric. This provides a hook by which subclasses can inject an IDMigrator to perform translation.


readItemIDFromString

protected long readItemIDFromString(java.lang.String value)
Subclasses may wish to override this if ID values in the file are not numeric. This provides a hook by which subclasses can inject an IDMigrator to perform translation.
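
As an illustration of the kind of translation such an override might perform, the sketch below maps an arbitrary string ID to a long by hashing it. The class StringIDSketch and the choice of MD5 are assumptions made for this example; this page does not specify how an IDMigrator translates IDs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: one possible string-to-long translation, not Mahout's API.
final class StringIDSketch {
  /** Maps a string ID to a long using the first 8 bytes of its MD5 digest. */
  static long toLongID(String value) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(value.getBytes(StandardCharsets.UTF_8));
      long id = 0L;
      for (int i = 0; i < 8; i++) {
        // Shift in one byte at a time, masking to avoid sign extension.
        id = (id << 8) | (digest[i] & 0xFFL);
      }
      return id;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }
}
```

A hash like this is one-way, so an application that needs to turn long IDs back into the original strings would have to store the mapping separately.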


getUserIDs

public LongPrimitiveIterator getUserIDs()
                                 throws TasteException
Specified by:
getUserIDs in interface DataModel
Returns:
all user IDs in the model, in order
Throws:
TasteException - if an error occurs while accessing the data

getPreferencesFromUser

public PreferenceArray getPreferencesFromUser(long userID)
                                       throws TasteException
Specified by:
getPreferencesFromUser in interface DataModel
Parameters:
userID - ID of user to get prefs for
Returns:
user's preferences, ordered by item ID
Throws:
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data

getItemIDsFromUser

public FastIDSet getItemIDsFromUser(long userID)
                             throws TasteException
Specified by:
getItemIDsFromUser in interface DataModel
Parameters:
userID - ID of user to get prefs for
Returns:
IDs of items user expresses a preference for
Throws:
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data

getItemIDs

public LongPrimitiveIterator getItemIDs()
                                 throws TasteException
Specified by:
getItemIDs in interface DataModel
Returns:
a List of all item IDs in the model, in order
Throws:
TasteException - if an error occurs while accessing the data

getPreferencesForItem

public PreferenceArray getPreferencesForItem(long itemID)
                                      throws TasteException
Specified by:
getPreferencesForItem in interface DataModel
Parameters:
itemID - item ID
Returns:
all existing Preferences expressed for that item, ordered by user ID, as an array
Throws:
NoSuchItemException - if the item does not exist
TasteException - if an error occurs while accessing the data

getPreferenceValue

public java.lang.Float getPreferenceValue(long userID,
                                          long itemID)
                                   throws TasteException
Description copied from interface: DataModel
Retrieves the preference value for a single user and item.

Specified by:
getPreferenceValue in interface DataModel
Parameters:
userID - user ID to get pref value from
itemID - item ID to get pref value for
Returns:
preference value from the given user for the given item or null if none exists
Throws:
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data

getNumItems

public int getNumItems()
                throws TasteException
Specified by:
getNumItems in interface DataModel
Returns:
total number of items known to the model. This is generally the union of all items preferred by at least one user but could include more.
Throws:
TasteException - if an error occurs while accessing the data

getNumUsers

public int getNumUsers()
                throws TasteException
Specified by:
getNumUsers in interface DataModel
Returns:
total number of users known to the model.
Throws:
TasteException - if an error occurs while accessing the data

getNumUsersWithPreferenceFor

public int getNumUsersWithPreferenceFor(long... itemIDs)
                                 throws TasteException
Specified by:
getNumUsersWithPreferenceFor in interface DataModel
Parameters:
itemIDs - item IDs to check for
Returns:
the number of users who have expressed a preference for all of the items
Throws:
TasteException - if an error occurs while accessing the data
NoSuchItemException - if an item does not exist
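
The "preference for all of the items" semantics can be sketched with plain collections. The class CoPreferenceCountSketch and its map of user IDs to item ID sets are hypothetical stand-ins for the model's internal data, and the sketch does not reproduce the exception behavior listed above.

```java
import java.util.Map;
import java.util.Set;

// Illustrative only: plain-JDK sketch of the counting semantics.
final class CoPreferenceCountSketch {
  /** Counts users whose item set contains every one of the given item IDs. */
  static int countUsersWithPreferenceFor(Map<Long, Set<Long>> itemIDsByUser, long... itemIDs) {
    int count = 0;
    for (Set<Long> items : itemIDsByUser.values()) {
      boolean hasAll = true;
      for (long itemID : itemIDs) {
        if (!items.contains(itemID)) {
          hasAll = false;
          break;
        }
      }
      if (hasAll) {
        count++;
      }
    }
    return count;
  }
}
```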

setPreference

public void setPreference(long userID,
                          long itemID,
                          float value)
                   throws TasteException
Note that this method only updates the in-memory preference data that this maintains; it does not modify any data on disk. Therefore any updates from this method are only temporary, and lost when data is reloaded from a file. This method should also be considered relatively slow.

Specified by:
setPreference in interface DataModel
Parameters:
userID - user to set preference for
itemID - item to set preference for
value - preference value
Throws:
NoSuchItemException - if the item does not exist
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data

removePreference

public void removePreference(long userID,
                             long itemID)
                      throws TasteException
See the warning at setPreference(long, long, float).

Specified by:
removePreference in interface DataModel
Parameters:
userID - user from which to remove preference
itemID - item to remove preference for
Throws:
NoSuchItemException - if the item does not exist
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data

refresh

public void refresh(java.util.Collection<Refreshable> alreadyRefreshed)
Description copied from interface: Refreshable

Triggers "refresh" -- whatever that means -- of the implementation. The general contract is that any Refreshable should always leave itself in a consistent, operational state, and that the refresh atomically updates internal state from old to new.

Specified by:
refresh in interface Refreshable
Parameters:
alreadyRefreshed - Refreshables that are known to have already been refreshed as a result of an initial call to a refresh(Collection) method on some object. This ensures that objects in a refresh dependency graph aren't refreshed twice needlessly.

hasPreferenceValues

public boolean hasPreferenceValues()
Specified by:
hasPreferenceValues in interface DataModel

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.