AUTHOR: Stein de Groof
aprioryPositional.py contains a modified apriori algorithm that mines for
frequent sequences within itemsets.
There are two required arguments:
- filename: name of a file containing space-separated itemsets
- minsup: the minumum support to consider an itemset frequent
ex.: python3 aprioriPositional.py trump_tweets_2016_processed.txt 40
Two main modifications were made to the basic apriori algorithm:
+ Follow sets:
The function 'getFollow(db, frequent)' creates a dictionary where each item
is mapped onto all frequent items that can occur immediately after it. This
is used to trivially generate new candidates by appending all items that
could follow the rightmost item of a frequent sequence.
+ Positional inverted index
The function 'invertPositional(db)' creates a positional inverted index of
the db consisting of nested dictionaries, with the top level dictionary
being used to look up terms. These then map onto dictionaries of the
indices of itemsets in which those terms occur, and their positions within
This index is used by the 'getFrequent' function which looks up candidate
itemsets in this index instead of looping over the original db. By
intersecting the itemsets found for each item in a candidate we get the set
of itemsets in which the items of the candidate occur. The positions
recorded in the positional index then allow us to verify if the items are
adjacent and in order.
*Only frequent sequences of size 2 or greater are printed out as those are more
interesting than single items. To change this just modify the print loop at the
end of the program.
*The program does not do any error checking, so invalid inputs or edge cases
will likely result in ugly failures