name mode size
README 100644 1.83kB 100644 3.97kB 100644 3.06kB 100644 1.21kB
AUTHOR: Stein de Groof contains a modified apriori algorithm that mines for frequent sequences within itemsets. USAGE There are two required arguments: - filename: name of a file containing space-separated itemsets - minsup: the minumum support to consider an itemset frequent ex.: python3 trump_tweets_2016_processed.txt 40 ALGORITHM Two main modifications were made to the basic apriori algorithm: + Follow sets: The function 'getFollow(db, frequent)' creates a dictionary where each item is mapped onto all frequent items that can occur immediately after it. This is used to trivially generate new candidates by appending all items that could follow the rightmost item of a frequent sequence. + Positional inverted index The function 'invertPositional(db)' creates a positional inverted index of the db consisting of nested dictionaries, with the top level dictionary being used to look up terms. These then map onto dictionaries of the indices of itemsets in which those terms occur, and their positions within those itemsets. This index is used by the 'getFrequent' function which looks up candidate itemsets in this index instead of looping over the original db. By intersecting the itemsets found for each item in a candidate we get the set of itemsets in which the items of the candidate occur. The positions recorded in the positional index then allow us to verify if the items are adjacent and in order. NOTES *Only frequent sequences of size 2 or greater are printed out as those are more interesting than single items. To change this just modify the print loop at the end of the program. *The program does not do any error checking, so invalid inputs or edge cases will likely result in ugly failures