Posts Tagged ‘Python’

Organizing The Music

Monday, October 12th, 2009

So I’ve begun the process of manually merging my music collection. It’s a mess, quite frankly. I’ve got MP3s I’ve ripped or purchased on four different computers, spread throughout many directories. Compounding this is my iPod, which usually carries the latest tracks that I’ve added. Here’s how I’m organizing things. The fun part is that I got to write a Python script to help out.

First Steps

I’ve got one folder that was my primary music folder throughout my time in school. It rests on my file server. It generally contains all my music and is the most authoratative ‘source.’ In addition, it was the initial source, the ‘seed’ if you will, for the tracks on the iPod. At one point in the distant past, my iPod contained the tracks from this folder and nothing else. This is what I’m going to start with. To really drive home the point of my fresh start, I creates a share on my file server and started anew. These tracks wound up in a folder called ‘library.’

This is already a good start. I’ve been pretty meticulous in organizing my music library, essentially by artist then by album. The /library/ folder is going to be my new, massively-integrated library, as soon as I get finished organzing.

The iPod

Since my iPod contains several albums that never made it to the music share for one reason or another, it can also be considered ‘authoratative.’ So I ripped its contents to another folder in the new music shared, called /iPod/. I used the excellent tool SharePod to do this, as it allowed me to rip the tracks to artist/album folders with very little hastle.

Other Sources

I then rounded up all my other music, and put it into an ‘unsorted’ directory. This is stuff I would go through item by item, once the two primary sources were sorted out, and include or not include depending on if it wound up on my iPod or not. I have yet to get all the way through this step.

The Script

This is the important bit. I wrote a Python script to crawl through the two directories in parallel, and note any missing files or directories. This way, I’ll know what I need to copy from the /iPod/ folder to the /library/ folder. It’s a fairly simple command-line script, used like this:

compare.py left right outfile [filter1,filter2...]

left is the first directory, right is the second. outfile is a text file that the differences will be written to, and the [filter]s allow me to specify a whitelist of file types I care about. In this case, the whitelist would be restricted to audio file types. Here is the command I wound up running (drive Y:\ is the share I set up):

compare.ph Y:\library\ Y:\iPod\ Y:\results.txt mp3,m4a

This ran the Python script, comparing the /library/ and /iPod/ directories (and, recursively, their children), saving the log of all the differences to results.txt at the root of the share. Additionally, the program ignored any files except mp3 or m4a files (and directories, obviously). I wound up with a list of all the folders and files unique to the initial library and the one copied from my iPod. Then it was a simple matter to copy the iPod-unique folders to the library. I could even use it to update my iPod if I really wanted to, although it’s running pretty close to full now.

Of course, there’s still a lot of work to do: I’ve got to tag the /unsorted/ files. Have I mentioned how meticulous I am about my music library?

Source Code

import os # for files and paths
import sys # for command line arguments
 
def matches (path, fileName, filter):
    """Returns true if the given file matches the filter or is a directory, false otherwise.
    path - the directory the file resides in
    fileName - the name of the file in question
    filter - Either None to indicate no filtering should be applied, or a list of allowed extensions."""
    if filter == None:
        return True
    else:
        # if it's a directory, return true
        if (os.path.isdir(os.path.join(path, fileName))):
            return True
        ext = fileName.split(".").pop()
        return (ext in filter)
 
 
def compareDirectories (leftPath, rightPath, uniqueLeft, uniqueRight, filter = None):
    """Recursive function to compare the contents of two given directories. Two lists are
supplied to keep track of the unique files. An optional filter is allowed.
    leftPath - The path to the first directory.
    rightPath - The path to the second directory.
    uniqueLeft - A master list of files unique to the left directory tree.
    uniqueRight - A master list of files unique to the right directory tree.
    filter - Either None, or a list of allowed (whitelist) extensions for files. A unique file in
            either the left or right directory will not be counted as unique if its extension
            does not match one of the filter items."""
 
    # get contents of directories
    left = sorted(os.listdir(leftPath));
    right = sorted(os.listdir(rightPath));
 
    # without a filter, just add all unique files
    if (filter == None):
        # append unique files by using a list comprehension to get all files on one side
        # that are not on the other side
        uniqueLeft[len(uniqueLeft):] = [os.path.join(rightPath, fileName) for fileName in right if fileName not in left]
        uniqueRight[len(uniqueRight):] = [os.path.join(leftPath, fileName) for fileName in left if fileName not in right]
    # otherwise, use the filter function
    else:
        # same as above, but also checks to see that the files match the given filters
        uniqueLeft[len(uniqueLeft):] = [os.path.join(rightPath, fileName) for fileName in right
                                        if fileName not in left and matches(rightPath, fileName, filter)]
        uniqueRight[len(uniqueRight):] = [os.path.join(leftPath, fileName) for fileName in left
                                          if fileName not in right and matches(leftPath, fileName, filter)]
 
    # get a list of files in both directores. Since they by definition must be in both,
    # we can pull them from either side using a list comprehension to check that they're
    # in the other.
    both = [fileName for fileName in left if fileName in right]
 
    # now go through and recursively call the function for any directories in both parent directories
    for fileName in both:
        leftChild = os.path.join(leftPath, fileName)
        rightChild = os.path.join(rightPath, fileName)
        if (os.path.isdir(leftChild) and os.path.isdir(rightChild)):
            compareDirectories(leftChild, rightChild, uniqueLeft, uniqueRight, filter)
 
def usage ():
    print "\n\ncompare.py"
    print "Compares two directories recursively and lists files or folders unique to each one.\n"
    print "compare.py left right outfile [filter1,filter2...]"
    print "\tleft\tFirst directory to compare"
    print "\tright\tSecond directory to compare"
    print "\toutfile\tText file that results are written to"
    print "\t[filter1,filter2]\tOptional comma-separated whitelist"
    print" \t\t\t\tof extensions for files"
    exit()
 
if __name__ == "__main__":
    # slice off name of program from args
    args = sys.argv[1:]
 
    # if there's an incorrect number of parameters, print the usage
    if len(args) < 3 or len(args) > 4:
        usage()
 
    # set up filter whitelist, if any
    filter = None
    if len(args) == 4:
        filter = args[3].split(",")
 
    # set up lists of unique files on both sides
    uniqueRight = list();
    uniqueLeft = list();
 
    # do the comparison recursively
    compareDirectories(args[0], args[1], uniqueLeft, uniqueRight, filter)
 
    # write to the file
    out = open(args[2], 'w')
 
    out.write("UNIQUE TO LEFT:\n")
    for fileName in uniqueLeft:
        out.write(fileName + "\n")
 
    out.write("\nUNIQUE TO RIGHT:\n")
    for fileName in uniqueRight:
       out.write(fileName + "\n")
 
    out.close()

Wheel of Fortune Letter Frequency Analyzer

Wednesday, August 19th, 2009
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
from operator import itemgetter # for sorting
import sys # for command-line arguments
 
# makes sorting dictionaries prettier
def sortDictionary (s):
    return sorted(s.items(), key = itemgetter(1), reverse = True)
 
hexColors = ["F05DCF", "F4B213", "7BB5FE", "19B915",
       "C913E4", "E38080", "4891EB", "DCF725", "E02EB0",
       "EE7D18", "16D949", "73E0C9", "22F1DB", "1460A1",
       "CF8040", "FFFFFF", "CF8054", "204E00", "2B1160",
       "87513C", "DECEE9", "C913E4", "83B892", "597D4C",
       "DACA5D", "2F486B", "D79E17", "826889", "359DA1",
       "DE7A43", "568C51", "FBF786"]
 
 
if __name__ == "__main__":
    # set up command line arguments
    # thumb:   creates a smaller file, with shorter (or no, depending on letter count) labels
    # verbose: prints out each list of letters and frequencies, too
 
    if (len(sys.argv) < 2 or len(sys.argv) > 4):
        print "Usage: wof.py [filename] [t|f] [v]"
        exit()
 
    thumb = False
    verbose = False
    fileName = sys.argv[1]
    if len(sys.argv) > 2 and sys.argv[2].lower() == "t":
        thumb = True
    if len(sys.argv) > 3 and sys.argv[3].lower() == "v":
        verbose = True
 
 
    # set up lists of letters
    letters = ["a", "b", "c", "d", "e", "f", "g", "h",
               "i", "j", "k", "l", "m", "n", "o", "p",
               "q", "r", "s", "t", "u", "v", "w", "x",
               "y", "z"]
    consonants = ["b", "c", "d", "f", "g", "h",
                "j", "k", "l", "m", "n", "p",
               "q", "r", "s", "t", "v", "w", "x",
               "y", "z"]
    vowels = ["a", "e", "i", "o", "u"]
 
    # letters to exclude (already given to you on game show)
    already = ["r", "s", "t", "l", "n", "e"]
 
 
    # set up frequency dictionaries {leter : number of occurences}
    allFrequencies = dict((letter, 0) for letter in letters)
    vowelFrequencies = dict((letter, 0) for letter in vowels)
    consonantFrequencies = dict((letter, 0) for letter in consonants);
 
    # Read the data file. Should consist of one final puzzle
    # solution per line, optionally lines can start with "#" for a comment
    file = open(fileName)
    while True:
        line = file.readline()
        if not line: break #end of loop
        if line[0] == "#": continue # skip comments
        for letter in line:
            lower = letter.lower()
            if lower in allFrequencies:
                allFrequencies[lower] = allFrequencies[lower] + 1
            if lower in already: # exclude RSTLNE from vowels and consonants
                break
            if lower in vowelFrequencies:
                vowelFrequencies[lower] = vowelFrequencies[lower] + 1
            if lower in consonantFrequencies:
                consonantFrequencies[lower] = consonantFrequencies[lower] + 1
 
    #sort dictionaries
    allFrequencies = sortDictionary(allFrequencies);
    vowelFrequencies = sortDictionary(vowelFrequencies);
    consonantFrequencies = sortDictionary(consonantFrequencies);
 
    if verbose:
        #display the lists
        print "ALL:\n", allFrequencies
        print "\nVOWELS:\n", vowelFrequencies
        print "\nCONSONANTS:\n", consonantFrequencies
 
 
    charts = {"All+Letters" : allFrequencies, "Vowels" : vowelFrequencies,
             "Consonants" : consonantFrequencies}
 
    for chart in charts:
        # make the image URLs, using Google Charts
        if thumb:
            url = "http://chart.apis.google.com/chart?chs=100x100&cht=p"
        else:
            url = "http://chart.apis.google.com/chart?chs=400x300&cht=p"
 
        # build lists for data series and its labels
        labels = []
        data = []
        for entry in charts[chart]:
            if int(entry[1]) > 0: # exclude any letters not used
                # make sure a thumbnail doesn't have too many labels to clutter it
                if thumb and len(charts[chart]) <= 6:
                    labels.append(entry[0].upper())
                else:
                    labels.append(entry[0].upper() + "+(" + str(entry[1]) + ")")
                data.append(str(entry[1]))
 
        # set them to the query string parts for data and labels
        dataRange = "&chd=t:" + ",".join(data);
        if (thumb and len(charts[chart]) >= 6):
            labelRange = ""
        else:
            labelRange = "&chl=" + "|".join(labels);
 
 
        # build the array of chart colors
        chartColors = "&chco=" + ",".join(hexColors[0:len(charts[chart])-2])
 
        # build final URL
        url = url + dataRange + labelRange + "&" + chartColors + "&chtt=" + chart;
        print "\n", chart, "\n", url

How to Win (Or Maybe Not) on Wheel of Fortune

Wednesday, August 19th, 2009

Wheel! Of! Fortune!One nice thing about being a programmer is that you can automate certain calculations that you’d have to be crazy to attempt any other way. While some would see a non-programmer attempting to figure out some of this stuff as borderline insane, we coders just come across as eccentric with a lot of time on our hands. If people ask a question fairly frequently, and said question involves lots of number-crunching, you can bet some coder somewhere has taken a crack at trying to crunch those numbers.

Case in point: Wheel of Fortune. Now, I’ve never really been a huge fan of the show, but I see it a lot anyway. It’s on after Jeopardy!, which I really do enjoy and try to watch fairly frequently, so I’ve seen my fair share of episodes of Wheel. One thing that always got me was the final puzzle. For those of you who don’t know, this is how it works: the winning contestant from all the previous rounds must solve a shorter, harder puzzle by himself in a small span of time. He is given a (usually unhelpful) hint in the form a category, and some of the most common letters in the English language (R, S, T, L, N, and E) are already shown. Then the contestant must choose three more consonants and one more vowel. If any of these letters occur in the puzzle Vanna White shows them, and the players has ten seconds to guess what the word or phrase is.

There are other factors at play here, but they don’t relate to what I’m interested in most, namely: What are the best letters to pick? Can we do an analysis of the letter frequency of a whole bunch of these puzzles? Can we determine whether or not the producers of the show pay attention to these frequencies? Thanks to the Internet and some spare time in the hands of a programmer, the answer is a (qualified) yes, we can. Please note that while I do enjoy math, I am most certainly not a mathematician, so this is just an armchair analysis, and not a scholar’s take.

First, I needed a set of data. As interested as I was in determining the letter frequencies, I wasn’t about to spend six months collecting data by actually watching the end of each show. I have the Internet to do that sort of stuff for me! In this case, I found this forum, whose residents had already done the hard work. I was able to grab the final puzzles from a couple of threads on this site, and store them in some text files, one puzzle to a line. Then, I wrote a short Python script to parse through the results and generate links to a Google Charts representation of the data. If you’re going to screw around on the Internet, why waste time inputting data into Excel?

The Code

Below is the code. Please note that while I have commented it, it’s task-oriented code. I did not sit down and think things through for hours on end; I was more interested in the results produced by the code than the process of making it. To that end it may be a bit rough around the edges. If you’re a Python programmer you might even think it un-Pythonic.

Show/Hide Source Code

from operator import itemgetter # for sorting
import sys # for command-line arguments
 
# makes sorting dictionaries prettier
def sortDictionary (s):
    return sorted(s.items(), key = itemgetter(1), reverse = True)
 
hexColors = ["F05DCF", "F4B213", "7BB5FE", "19B915",
       "C913E4", "E38080", "4891EB", "DCF725", "E02EB0",
       "EE7D18", "16D949", "73E0C9", "22F1DB", "1460A1",
       "CF8040", "FFFFFF", "CF8054", "204E00", "2B1160",
       "87513C", "DECEE9", "C913E4", "83B892", "597D4C",
       "DACA5D", "2F486B", "D79E17", "826889", "359DA1",
       "DE7A43", "568C51", "FBF786"]
 
 
if __name__ == "__main__":
    # set up command line arguments
    # thumb:   creates a smaller file, with shorter (or no, depending on letter count) labels
    # verbose: prints out each list of letters and frequencies, too
 
    if (len(sys.argv) < 2 or len(sys.argv) > 4):
        print "Usage: wof.py [filename] [t|f] [v]"
        exit()
 
    thumb = False
    verbose = False
    fileName = sys.argv[1]
    if len(sys.argv) > 2 and sys.argv[2].lower() == "t":
        thumb = True
    if len(sys.argv) > 3 and sys.argv[3].lower() == "v":
        verbose = True
 
 
    # set up lists of letters
    letters = ["a", "b", "c", "d", "e", "f", "g", "h",
               "i", "j", "k", "l", "m", "n", "o", "p",
               "q", "r", "s", "t", "u", "v", "w", "x",
               "y", "z"]
    consonants = ["b", "c", "d", "f", "g", "h",
                "j", "k", "l", "m", "n", "p",
               "q", "r", "s", "t", "v", "w", "x",
               "y", "z"]
    vowels = ["a", "e", "i", "o", "u"]
 
    # letters to exclude (already given to you on game show)
    already = ["r", "s", "t", "l", "n", "e"]
 
 
    # set up frequency dictionaries {leter : number of occurences}
    allFrequencies = dict((letter, 0) for letter in letters)
    vowelFrequencies = dict((letter, 0) for letter in vowels)
    consonantFrequencies = dict((letter, 0) for letter in consonants);
 
    # Read the data file. Should consist of one final puzzle
    # solution per line, optionally lines can start with "#" for a comment
    file = open(fileName)
    while True:
        line = file.readline()
        if not line: break #end of loop
        if line[0] == "#": continue # skip comments
        for letter in line:
            lower = letter.lower()
            if lower in allFrequencies:
                allFrequencies[lower] = allFrequencies[lower] + 1
            if lower in already: # exclude RSTLNE from vowels and consonants
                break
            if lower in vowelFrequencies:
                vowelFrequencies[lower] = vowelFrequencies[lower] + 1
            if lower in consonantFrequencies:
                consonantFrequencies[lower] = consonantFrequencies[lower] + 1
 
    #sort dictionaries
    allFrequencies = sortDictionary(allFrequencies);
    vowelFrequencies = sortDictionary(vowelFrequencies);
    consonantFrequencies = sortDictionary(consonantFrequencies);
 
    if verbose:
        #display the lists
        print "ALL:\n", allFrequencies
        print "\nVOWELS:\n", vowelFrequencies
        print "\nCONSONANTS:\n", consonantFrequencies
 
 
    charts = {"All+Letters" : allFrequencies, "Vowels" : vowelFrequencies,
             "Consonants" : consonantFrequencies}
 
    for chart in charts:
        # make the image URLs, using Google Charts
        if thumb:
            url = "http://chart.apis.google.com/chart?chs=100x100&cht=p"
        else:
            url = "http://chart.apis.google.com/chart?chs=400x300&cht=p"
 
        # build lists for data series and its labels
        labels = []
        data = []
        for entry in charts[chart]:
            if int(entry[1]) > 0: # exclude any letters not used
                # make sure a thumbnail doesn't have too many labels to clutter it
                if thumb and len(charts[chart]) <= 6:
                    labels.append(entry[0].upper())
                else:
                    labels.append(entry[0].upper() + "+(" + str(entry[1]) + ")")
                data.append(str(entry[1]))
 
        # set them to the query string parts for data and labels
        dataRange = "&chd=t:" + ",".join(data);
        if (thumb and len(charts[chart]) >= 6):
            labelRange = ""
        else:
            labelRange = "&chl=" + "|".join(labels);
 
 
        # build the array of chart colors
        chartColors = "&chco=" + ",".join(hexColors[0:len(charts[chart])-2])
 
        # build final URL
        url = url + dataRange + labelRange + "&" + chartColors + "&chtt=" + chart;
        print "\n", chart, "\n", url

The Results

Might as well show off the pretty, pretty pictures, huh? Click any graph below to enlarge it.

At first look, the data is not too promising. I can give you two letters that will increase your chances of getting a ‘hit’, and one of them might come in handy. In our six months’ worth of data, O is the favorite… but not by much. Looking at both periods, it seems pretty clear that somebody at Merv Griffin Productions is responsible for distributing the vowels O, I, and A across the spectrum so none of them shows up too frequently. Notice how I and O are tied on the most recent set of data, but A is the second-most frequent vowel on the older figures. Combining all the numbers, we see that these three vowels are essentially tied in frequency, with U in a slightly lower class. But at least it’s something to work with, right? From the last six months of Wheel, it looks like ‘O’ is the best vowel to go with.

Now what about consonants? I was most excited when I pulled up the 2009 consonants graph (the first one I did), because you can clearly see that the top two letters are definitely a bit more common than the rest, and even the top three look pretty solid. H, G, and D… could those be the winning ones? My excitement faded, however, as I ran the earlier set of data through the script. Looking at the combined chart, H still has a statistically significant lead. But you’ll have no luck trying to discover the three letters to choose. But we can limit our options a little. F, G, and B are all clearly separated from the next letter (D) in the combined graph, with a decent-sized gap between them. It’s harder to say for certain, but it looks like the producers may be balancing these top four letters throughout their puzzles.

So, what to go with? You should definitely choose O for your vowel. H is the consonants which statistically is most likely to occur. Then any of F, G, and B would probably do you some good.

And how about those shifty producers? Are they gaming the final answers, so they don’t have to give out as much prize money? Are they maybe picking and choosing their phrases to deflect somebody who did a little research before heading down to the studio to play? Well, let’s try to find a pattern in the frequency of letters in the English language (please note that I’ve removed RSTLNE from these graphs):

Right away you should notice some major discrepancies between the Wheel of Fortune data set and written English. While H shows up at the top where we’d expect, D is clearly in a much higher class than B, F, and G that we picked above. In fact, F and G aren’t even in the top five, and other letters that show up often in English aren’t placed very high in the combined final puzzle data. This is probably the result of producers fine-tuning their answers over the years, either to avoid the letters contestants chose most often or to more evenly distribute the winning ones.

This is such a small data set, however, that we shouldn’t rely on it too heavily. After all, Wheel of Fortune has been on the air for twenty-six years, and we only have half a year’s worth of data, or around 2% of all that is available. But a small attempt at analyzing this data is probably better than going in blind and picking letters that you ‘think’ show up frequently.

There are other ways that this quick-and-dirty analysis can be improved. Mine is a pretty naive approach. Going over some basic rules of English might help to improve the method. For example, breaking the final puzzles down into phonemes could yield more information, as might looking at letter pairs instead of single letters. For instance, Q never occurs without U, and some letters are more common after others. This is especially useful in our task, as we need to choose three consonants but only one vowel. Consonants are most often followed by vowels, so consonant pairs increase the uniqueness of a phrase. P is often followed by R, L, or H, for example. Looking for patterns in the words themselves might also yield better predictions about what letters would be better to guess.

Another thing to realize is that these numbers are averages from a discrete set. Some puzzles might include the high-frequency letters and be solvable with only those (and RSTLNE), while some might not include a single one of the high-frequency picks. Picking from one of the high scorers might improve your odds of getting more letters, but it doesn’t guarantee that you’ll get some, or even any. You might wind up with something like ‘Blind Luck’, which doesn’t contain an O or an H. These estimates can help you, but only so much.

So, after all these calculations, I now know what I would do if I ever found myself in Wheel’s final round. I’d go for H, G, and B (G and B having been arbitrarily picked over F), and then O as my vowel. And maybe I’d win big. Of course, the biggest factor in all this is your ability to manipulate letters and words in your mind. That’s one subject in which I lack skill, as evinced by the Boggle-solving program I wrote (a story for another time). So I might tank, even if my statistically-chosen letters filled out quite a bit of the puzzle. A lot of it does come down to luck, which was probably the producers’ intention all along.