This post doesn’t have much to do with India. Or Padosan, which is a hilarious movie I’ve seen twice now, despite having the Hindi skills of a very shy two-year-old. It mostly has to do with how bored I currently am at work*.
Today morning, I woke with a desire to write Python. (And a headache, but that story ends pretty quickly with Ibuprofen.) More specifically, I was itching to use the NLTK Python module that I used briefly and superficially several months ago. Wrestling with language in a computational way is more intense than the WWE. Also, it involves less spandex. But I had what I thought was a pretty simple and fun way to test it out. I was going to write a classifier to determine if an Indian surname was South Indian or not.
Here are all the things I don’t know a lot about:
- Naive Bayes, or classifiers in general
- South India
But hey! I had the Internet, the NLTK book, my own “cultural experiences”, and a lot of free time!
What I know about classifiers and Naive Bayes comes from this great NLTK introduction to it. The way I understand it is, you build what’s called a feature set to distinguish properties of a corpus of data. For instance, you have a set of all the Tweets that have to do with McDonalds and you need to figure out if they are positive or negative. A feature set could include the frequency of the word “sucks” (or “sux”, I guess, given Twitter). Or, as the docs explain it, you could build a feature set that decides if a name is male or female (given the context of the Western world’s naming conventions) by looking at the last letter of the name.
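That last-letter example boils down to a feature function of about one line — roughly this (my paraphrase of the idea, not the book’s exact code):

```python
def gender_features(name):
    # A feature set is just a dict mapping feature names to values.
    # Here the single feature is the last letter of the name.
    return {'last_letter': name[-1].lower()}

print(gender_features('Shrek'))  # {'last_letter': 'k'}
```

The classifier never sees the raw name, only whatever dict this function hands it.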
I thought I had a pretty straightforward algorithm: count the number of times the letter ‘a’ appears in the name. It’s a sort of joke how many ‘a’s South Indian names have – they are complicated and sound like ‘Ramakrishnan’, ‘Subramaniam’, ‘Krishnaswamy’ and so on. Of course, I wouldn’t just count how many times ‘a’ appeared. I’d build a sort of density of ‘a’s in the name — the number of ‘a’s divided by the total length of the name.
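As a sketch, the density count is a couple of lines:

```python
def a_density(name):
    # Number of 'a's divided by the total length of the name.
    name = name.lower()
    return name.count('a') / len(name)

print(a_density('Narayanaswamy'))  # ~0.38 (13 letters, 5 of them 'a')
print(a_density('Rai'))            # ~0.33
```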
Immediately, I ran into problems which, frankly, a lobotomized monkey would’ve seen coming. The frequency of the letter ‘a’ in the name ‘Narayanaswamy’ – 0.384 – is impressive given how complicated it is. But the a-frequency in ‘Rai’ is 0.33. Not far behind, and not statistically significant.
So I would need to take into account the length of the word. I immediately thought of TF-IDF, which I was temporarily obsessed with some time last year. TF-IDF calculates the “importance” of a word in a corpus, and is probably one of the tools search engines use to rank documents by relevance. If you are looking for “brown dog”, it takes into account the possibility of the text being really small, the possibility of the text consisting entirely of the words “brown” and “dog” and nothing else, and the possibility of your query containing really common words like “the”, “an”, “what”, etc.
What if I treated each name like a text, with the “word” I was looking for actually being ‘a’?
I congratulated myself on this sophisticated thinking. Approximately three seconds later, I realized I was probably overdoing it. TF-IDF relies on having an actual corpus, a repository of texts which are composed of words. Part of a common TF-IDF algorithm takes into account the number of texts in the corpus and how many of them contain frequent occurrences of the word you’re looking at. But all I really needed was some way of classifying each name, not picking out the most relevant ones or ranking them.
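For reference, the plain-vanilla TF-IDF I had in mind goes something like this — one common variant; the +1 smoothing and log base vary between implementations:

```python
import math

def tf_idf(term, doc, corpus):
    # Term frequency: how often the term appears in this document.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms that show up in most documents
    # (like 'the') get pushed toward zero.
    n_containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + n_containing))
    return tf * idf

corpus = [['brown', 'dog', 'runs'],
          ['the', 'dog', 'sleeps'],
          ['the', 'cat', 'sits']]
print(round(tf_idf('brown', corpus[0], corpus), 3))  # 0.135
print(round(tf_idf('the', corpus[1], corpus), 3))    # 0.0 -- too common to matter
```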
Post-lunch, I had my third brainwave. I am pretty proud of this because post-lunch I usually have all the energy of an anesthetized sloth. How many times had I talked about how “complicated” South Indian surnames** are? Complicated in this instance is really just syllable count, isn’t it? How hard could it be to code something that counts the number of syllables in a name?
And then we could return to my beloved TF-IDF, because I could treat each syllable as a text, count the number of ‘a’s in the text, and treat the name as a corpus consisting of syllables. Then I could add the TF-IDF values for a name, and maybe proceed from there.
I was fairly pleased with myself until I dug into the problem of syllable determination a little more. Some quick searching revealed that several theses and dissertations had been written on this subject. Okay, not as easy as I had initially assumed.
This is when, as the children say, things got real.
Syllable recognition is one of those tasks that’s unthinkingly easy for a human and just bewilderingly difficult for a computer. We know that “racing” has two syllables, but how do you code for that? Some StackOverflow posts suggested segmentation by vowels, which made some sense: you split on vowels, which are where you expect one syllable to end and the next to start. Of course, this only works for English, but I’d be dealing with South Indian names transliterated into English anyway. Frankly, French would be more confusing. In this schema, “racing” fits perfectly well. The “ra” is the first syllable, and “cing” is the second, right?
Except for “flying”. And “attention”. And “delicious”. And…
I could build in a set of exceptions that had to do with “tion” and “cion” and “ceive” and a bunch of other things. Which still felt like cheating, but at least I was doing something slightly more sophisticated. And then I got to words like “hope”, which was ironic because I was feeling anything but. One syllable, but perhaps that was because the ‘o’ was concatenated with a single-syllable “pe” ending?
Okay, but then what about something like “chilly”? If I followed this (now increasingly idiotic) schema, I’d have “chi” and “lly”. Where would I decide that two consonants — the double l’s — would have to separate? And then what about “fetching”? Possibly I could code a set of rules that distinguished between soft consonants like ‘c’ and hard consonants like ‘t’. Except that ‘c’ is hard in ‘cookie’, ‘occult’ and a dozen other things.
I slowly began to understand why people would spend several years writing up papers about syllable splitting.
After checking Facebook, being reminded for the fiftieth time that a respectable radio institution like NPR covers Bieber arrests, and sadly contemplating pop culture, I returned to the problem. Googling this time for “determining vowels in words”, I then realized I had been looking at the thing in a noodle-headed fashion (again).
Why would I need to know what the syllables were, exactly? Why couldn’t I just build another density count, this time of the letter ‘a’ to the number of syllables in the name? And I wouldn’t even need to be precise about where the syllables ended and began — I’d just have to estimate the syllable count. I could do this by splitting along the vowels anyway.
At first, using this simple guesstimator returned some odd results for things like “Ramaswamy”, giving me 3 instead of 4 syllables. I then realized that for the purposes of the Tamil-English translation, I’d have to treat all y’s as i’s instead, so that the last syllable could be counted correctly.
But wait (there’s always something).
What about typically North Indian names like “Rao”? Or “Jain”? According to my dumb vowel-splitter, I’d count two syllables each for those names.
Here’s where the density count could help me, because the ratio of ‘a’s to syllables would actually be reduced, to 0.5 in this case. So a name like “Somasekhar”, which also has an a-to-syllable density of 0.5, would scrape in by the skin of its teeth.
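The whole heuristic, y-to-i hack included, fits in a few lines — a sketch:

```python
def syllable_count(name):
    # Crude estimate: every vowel marks a syllable. 'y' is treated as 'i'
    # so the last syllable of names like 'Ramaswamy' gets counted.
    name = name.lower().replace('y', 'i')
    return sum(ch in 'aeiou' for ch in name)

def a_per_syllable(name):
    # Density of 'a's relative to the estimated syllable count.
    return name.lower().count('a') / syllable_count(name)

print(syllable_count('Ramaswamy'))   # 4
print(a_per_syllable('Rao'))         # 0.5
print(a_per_syllable('Somasekhar'))  # 0.5 -- same score, different region
```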
Of course, it wasn’t necessary for me to cling to my letter a hypothesis. (Let’s be honest, it wasn’t necessary for me to be building a Python classifier for South Indian names, either.) I could have just stuck to syllables. But that would’ve scrapped my pet theory and made things simpler, which of course is sheer anathema to me. I like complicating my life.
So I clung to my letter-a-syllable hypothesis like a particularly stubborn limpet and proceeded to use NLTK’s classifier.
When I read the documentation I first thought I was missing something. Okay, so you define a feature set, blah blah, specify a corpus which you’ve tagged with the “correct” values, and then divide it into a training and testing set. You train the classifier on the training set and then test it against the testing set, reasonably enough. But how the hell would the classifier “know” what I was giving it? All I was saying was, “here is the feature I want you to identify in each of the names, and then you can figure out the rest”.
That’s the beauty of the classifier, though. It actually identifies trends of words based on the features you want it to focus on, and given enough data will actually generate a report of the most important results — like that density count — that helps it determine if the word it’s looking at belongs to one category or another.
Incidentally, the Naive Bayes classifier in particular assumes a very simplistic set of factors determines which category something falls into. They’re simplistic because the algorithm assumes that each feature is completely independent of the others, and calculates the joint probability of all of them appearing by just multiplying the likelihood of each one happening.
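Spelled out, the naive scoring is just a product of probabilities — the numbers below are made up purely for illustration:

```python
from functools import reduce

def naive_bayes_score(prior, likelihoods):
    # P(label | features) is proportional to
    # P(label) * P(f1 | label) * P(f2 | label) * ...
    # The 'naive' part: the features are assumed mutually independent.
    return reduce(lambda acc, p: acc * p, likelihoods, prior)

# Hypothetical numbers: P(south) = 0.5, P(4 syllables | south) = 0.6,
# P(name length 10 | south) = 0.3
print(naive_bayes_score(0.5, [0.6, 0.3]))  # ~0.09
```

The real classifier computes this score for each label and picks the biggest.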
By this time, it was about 6.30 pm and I was vaguely satisfied with the fact that I had at least coded some Python that day. I was halfway home when, for the fifteenth time that day, I was struck by how pointless everything I’d done thus far was.
Why was I trying to manipulate the classifier? Why was I so obsessed by coming up with an algorithm to define what a South Indian name was, trying to find thresholds for the density of the letter ‘a’ and every other harebrained scheme?
The point of defining features is to pick out specific characteristics of your data set that you think might be relevant to grouping the data. It’s up to the classifier, and the way it’s applied to the data, to figure out which characteristics actually stand out. Instead of poking and prodding my classifier, I should have just let it do its thing.
So today morning I came in and rewrote my feature-generating function. “Rewrote” is a little grandiose; it was three lines long. It defined three features to look for: the name length, the number of times ‘a’ appeared, and the number of syllables the name might have (I couldn’t let this one go, okay? I invested some time in that crap).
Then I wrote down some South and North*** Indian names on a couple of text files and used them to train my classifier.
A word about data. If you’re a veteran of data manipulation this will be old news, but… data is possibly the single most important thing you can collect to make sure your language models or classifiers work reasonably well. You’re going to be using some of that data to train your classifiers, so any inaccurate data (“Singh” is unlikely to be South Indian, so shouldn’t be in your “South Indian Names.txt” file), unclean data (“VenkataSingh” is an unhelpful concatenation of names that could skew your data), or a very small data set (you can’t really train anything on three names) are all going to muddy the waters. You’ll also want to keep your classifier as unbiased as possible, which means the data you use to train your model must be different from the data you use to test it out.
My data is, of course, suspect. I tend to pick names that people in my community traditionally favor, and they have a very specific length/letter-a pattern that just happens to fit my hypothesis. Imagine that.
My first attempt looked like this:
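In outline, it was something like the following — the name lists here are made-up stand-ins for my actual text files, so the exact numbers will vary with the data and the shuffle:

```python
import random
import nltk

def name_features(name):
    # Three features: length, 'a' count, and the crude syllable estimate.
    name = name.lower().replace('y', 'i')
    return {
        'length': len(name),
        'a_count': name.count('a'),
        'syllables': sum(ch in 'aeiou' for ch in name),
    }

# Placeholder training data -- in reality these came from my text files.
south = ['Ramakrishnan', 'Subramaniam', 'Krishnaswamy', 'Somasekhar',
         'Venkatasubramaniam', 'Narayanaswamy']
north = ['Rao', 'Jain', 'Singh', 'Kapoor', 'Mehta', 'Chopra']

labeled = [(n, 'south') for n in south] + [(n, 'north') for n in north]
random.shuffle(labeled)

featuresets = [(name_features(n), label) for n, label in labeled]
train_set, test_set = featuresets[6:], featuresets[:6]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
print(classifier.classify(name_features('Venkatasubramaniam')))
```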
Okay, taking this step by step: my accuracy was 0.933 out of 1.0. Cool. You know, impressive for the fifteen names I had in my corpus. Additionally, the most_informative_features function is brilliant. In this scenario it’s showing, among other things, that if you have 4 syllables, you’re 3.9 times more likely to have a South Indian name. Apparently, if your name length is 10, you’re 2.1 times more likely to be a South Indian.
Step two: It confirmed my hypothesis to some extent, but mainly I liked quantifying the things I’d suspected.
Step three: But… Venkatasubramaniam is pretty definitely a South Indian name.
I went back and adjusted my data set, a tactic otherwise known as “cheating”. I reasoned it was pretty small to begin with. Now I had 18 names each for my South and North Indian naming lists!
Note that the number of most informative features has increased. So has my accuracy! And, most importantly, Venkatasubramaniam has been restored to the pantheon of Overly Complicated But Kind of Adorably Traditional South Indian Names.
So, what did I learn from this involved exercise?
- Let the classifier do its thing and stop forcing a hypothesis
- Next time, try to actually get a corpus of data instead of “last name samplings of friends, aunties and uncles”
- Python list comprehensions are cool
- A very simple way to build features for classifying some basic data using NLTK