Python Coder Test
Since Net Communities have their new developer, I can post the coder test I wrote to find a replacement. The test was not meant to be particularly challenging, and we only requested that developers do one out of the three problems. Most developers did all three--presumably in an effort to impress--but nobody got all three correct, first time around. That was probably my fault, because the third problem had a subtle issue that is easy to overlook.
I'll quote the problems here, including typos. Occasionally the way I phrased things was slightly ambiguous, but I didn't penalize if the candidate misunderstood a requirement due to my use of language.
1. Write a script that takes a file containing a number of tags words (one per line) and sorts them by the number of times they occur in the file (descending order), but exclude words that only occur once. Ignore the case of the words, and filter out any puncutation characters. The output should be the word, followed by a space and the number of times it occurs in the file.

Example input:
microsoft
apple
microsoft.
Apple?
security
microsoft
internet
Example output:
microsoft 3
apple 2
This is something I would expect any programmer to be able to solve in any language. What I was looking for is simplicity in the solution. If a developer solves a simple problem like this with complex code, then he* would probably solve a complex problem with unmaintainable code. And although efficiency was not a requirement for any of the problems, I was disappointed by the number of developers who didn't know when to use a dictionary / set over a list. Many candidates failed on this problem because they didn't fulfill all the requirements--not because the code was buggy, but because a requirement was simply ignored. This left me in a bit of a quandary, because I suspected that most of the candidates could have done it adequately if prompted--but then shouldn't a developer pay attention to details?
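For illustration, a minimal sketch of the dictionary-based approach using collections.Counter (the names and structure here are mine, not any candidate's):

```python
import string
from collections import Counter

def count_tags(lines):
    # lowercase each word and delete punctuation via a translation table
    table = {ord(c): None for c in string.punctuation}
    counts = Counter(line.strip().lower().translate(table) for line in lines)
    # most_common() sorts by descending count; drop words seen only once
    return [(word, n) for word, n in counts.most_common() if n > 1]

tags = ['microsoft', 'apple', 'microsoft.', 'Apple?',
        'security', 'microsoft', 'internet']
for word, n in count_tags(tags):
    print(word, n)  # microsoft 3, then apple 2
```

In practice the input would come from a file rather than a hard-coded list.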
2. Write a script that removes blank paragraphs from a piece of html. A blank paragraph should be considered to be a <p> </p> tag containing only white space.
I was looking for a regular expression here. Even if a developer hasn't had a great deal of experience with regular expressions, he should be able to recognize that this is a situation where an R.E. would come in handy. One developer parsed the HTML to find blank paragraphs, which was hard to fault because it did work--but I tend to prefer the simplest solution that works, and a handful of lines is generally preferable to several pages of code.
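The kind of one-liner I mean is roughly this--a naive sketch that deliberately ignores complications like attribute strings, HTML comments and CDATA:

```python
import re

def strip_blank_paragraphs(html):
    # remove <p>...</p> pairs whose content is nothing but whitespace
    return re.sub(r'<p>\s*</p>', '', html)

print(strip_blank_paragraphs('<p>hello</p><p>  \n </p><p>world</p>'))
# <p>hello</p><p>world</p>
```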
3. Write a decorator that stores the result of a function call and returns the cached version in subsequent calls (with the same parameters) for 5 minutes, or ten times -- whichever comes first.
This was the one that everyone failed. The main issue was that to do it properly you need to create an immutable key out of the parameters to the function to be cached. But that was only the final hurdle; many candidates failed on simpler issues. In fact, many 'solutions' didn't actually run at all! This surprised me--if you have plenty of time to write the code and there is potentially a job riding on it, don't you test your solution?
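To make the hurdle concrete, here is a sketch of one workable shape (the names and the time/count bookkeeping are mine; the key construction is the part everyone tripped on):

```python
import functools
import time

def cached(seconds=300, max_uses=10):
    def decorator(f):
        store = {}
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            # a tuple of the positional args plus a frozenset of the keyword
            # items is hashable, so it can serve as a dictionary key
            key = (args, frozenset(kwargs.items()))
            if key in store:
                when, uses, result = store[key]
                if time.time() - when < seconds and uses < max_uses:
                    store[key] = (when, uses + 1, result)
                    return result
            result = f(*args, **kwargs)
            store[key] = (time.time(), 1, result)
            return result
        return wrapped
    return decorator
```

This still assumes all the arguments are hashable, which is exactly the limitation discussed in the comments.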
Although nobody got that last problem correct, we did interview some good developers with a mix of skills and abilities. Most of them were very passionate about working with Django, and passion is a good thing in any field.
Being on the other side of an interview was an interesting experience for me; I hope I wasn't too much of a jerk! Thanks to all who applied.
If anyone wants to post a solution to any of the problems, they are more than welcome--but I'm afraid the job has been filled.
* I hate to type 'he or she', please assume that's what I mean.
Question #1:
The "ignore punctuation" bit came from stack overflow.
#1 is easy
#2 I don't know a lot of HTML. Isn't it allowed to have e.g. attribute strings containing tags? In that case you would have to resort to a parser to be sure you were cleaning only HTML markup and not attribute strings.
#3 Would require pickling the parameters in order to deal with mutable data.
Let's see:
1)
2)
empty_p could also be re.compile(r'\s*) for a more general solution.
3) hmm... now this is more interesting:
I hope the formatting won't be way off...
hmm... wordpress seems to cut html formatting from comments.
so I guess this won't work either:
def f(x):
pass
For #2, I'd expect a regexp from a general developer, or one with perl experience - but I'd push harder on someone claiming python experience :-) I've moved a number of perl programmers over to python and one sign of being "early" in the transition is using regexps for things with more robust and readable functions available - str.endswith, os.path.basename, and for this particular example, BeautifulSoup:
Here is my solution to the first problem:
$ cat test.txt | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | grep -v '^ *1 '
3 microsoft
2 apple
The second problem is trivial with a regex, although as andre says, you could have attribute strings with tags in.
The third problem is tricky - strictly speaking pickling will not work, since it is not guaranteed that pickling the same parameters twice will always produce the same string - this is something that has bitten me in the past. In this case it will work most of the time, but you will get an occasional cache miss.
I forgot that uniq has an option to remove non-duplicated entries, so the grep at the end is unnecessary:
$ cat test.txt | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | sort | uniq -cd | sort -nr
3 microsoft
2 apple
I am quite a Python newbie (and I have not come across writing decorators yet, so the 3rd problem is screwed) but the following are my solutions to the test.
#1:
I like using with/as for the input file; using itertools.groupby to do the "uniq" thingy. Not sure about normalize, though: perhaps a regexp would be better?
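Since WordPress ate the code, here is a reconstruction of the shape described--with/as for the file, groupby for the "uniq" step; the normalize helper is a guess at what the original did:

```python
import string
from itertools import groupby

def normalize(word):
    # lowercase and drop punctuation characters; a regexp could do this too
    return ''.join(c for c in word.strip().lower()
                   if c not in string.punctuation)

def count_tags(lines):
    words = sorted(normalize(line) for line in lines)
    # groupby over sorted input behaves like `uniq -c`
    pairs = [(word, len(list(group))) for word, group in groupby(words)]
    return sorted((p for p in pairs if p[1] > 1), key=lambda p: -p[1])
```

With a file it would be used as `with open('tags.txt') as f: print(count_tags(f))` (the filename is made up, of course).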
#2:
So, I stand by the idea behind my code, in that using a regexp for parsing HTML is, uhm, really not exactly what I would like to see if I were the interviewer. Libraries to parse and manipulate HTML do exist, and choosing the right tool would be a matter of additional context given to the problem. (This is the same reason why I liked the idea of using itertools instead of grouping "by hand".) Here I used minidom because it's simple and in the standard library, even if for producing output it's not exactly practical. If external libraries were allowed, I would have tried something like BeautifulSoup (which BTW has been indeed used by another commenter).
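A sketch of the minidom approach (my reconstruction, since the original code was lost; it assumes the fragment is well-formed enough for an XML parser):

```python
from xml.dom import minidom

def remove_blank_paragraphs(html):
    # wrap the fragment in a root element so minidom will parse it
    doc = minidom.parseString('<root>%s</root>' % html)
    for p in doc.getElementsByTagName('p'):
        blank = all(node.nodeType == node.TEXT_NODE and not node.data.strip()
                    for node in p.childNodes)
        if blank:  # also true for <p></p>, since all() of nothing is True
            p.parentNode.removeChild(p)
    return ''.join(node.toxml() for node in doc.documentElement.childNodes)
```

As noted, serializing back out with minidom is not exactly practical, but it demonstrates the parse-don't-pattern-match idea.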
Uh, for Dave Kirby: the output format seems to prescribe that the word should be placed before the number of its occurrences, not the other way around.
Having written this blog entry, is there no more 'context implied' information that you could have supplied?
For example:
In 1, would it be correct to assume ASCII?
In 2, might you have said beforehand that a regex would do, or given some representative examples of input HTML?
In 3, you say no one "did it properly". This flips your comments on problem 2, where you wanted "the simplest solution". Without telling the candidates, you require them to be mind readers to produce the right level of detail.
In summary, I think it is easy to write insufficient specs. You might have tried your problem on a few anonymous coders, taking care to remember what extra needed to be imparted to get the answers you were thinking of.
- Paddy.
The third problem is indeed nontrivial. In the documentation of my decorator module I use a frozenset to ensure the function arguments are hashable:
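The frozenset trick, shown in isolation (a sketch; it still assumes every argument and keyword value is hashable):

```python
def make_key(args, kwargs):
    # a frozenset of the items is hashable and ignores keyword ordering,
    # so f(a=1, b=2) and f(b=2, a=1) map to the same cache entry
    return args, frozenset(kwargs.items())
```

For example, `make_key((1, 2), dict(a=1, b=2))` equals `make_key((1, 2), dict(b=2, a=1))`.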
Michele, your implementation has a problem with unhashable objects:
Doing it with repr() as I have done allows using unhashable objects, but has the drawback that objects need to implement __repr__ as a meaningful function, so that if X and Y differ in semantics and/or functionality, repr(X) != repr(Y)--which of course can't always be guaranteed.
I'm not sure that there is a correct implementation which does not assume some sort of predefined structure on the input.
Oh, and for all those using full blown html parsers to solve problem #2: when the problem is simple, the additional cost of parsing the html and building objects around it (a la BeautifulSoup or ElementTree) just isn't worth it.
"""Uh, for Dave Kirby: the output format seems to prescribe that the word should be placed before the number of its occurrencies, not the other way around."""
My bad - I shouldn't write code at 1am. It is easily fixed by adding
| awk '{print $2, $1}'
to the end of the pipeline.
For the python version of the same problem, I would use the string translate method to strip out punctuation - it is faster and simpler than processing a character at a time then joining the list back into a string:
The #end if/for comments are because the indentation gets mangled by the blog - there is a script 'pindent.py' included in the python distro that should re-indent code marked with these comments.
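The lost code aside, the translate idea looks roughly like this in modern spelling (Python 2's str.translate took a separate deletechars argument instead of a dict):

```python
import string

# build the deletion table once, not per word
PUNCT = {ord(c): None for c in string.punctuation}

def clean(word):
    # delete punctuation, then lowercase, in two fast built-in passes
    return word.translate(PUNCT).lower()

print(clean('Microsoft.'))  # microsoft
```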
If you have any examples of the good vs bad code that you could put up it would be interesting. Just to put myself on the firing line here's a quick solution for the first one:
microsoft 3
apple 2
Knowledge of the standard library can make #1 be succinct. Note, that #1's spec. is incomplete. For tags occurring the same number of times, it isn't defined in what order they should be output. I've gone for alphabetical.
In order to preserve indentation, I've replaced some of the spaces at the start of the line with underscores.
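Since the code itself was mangled, here is a sketch of the alphabetical tie-break described above (one way to pin down the order the spec leaves undefined):

```python
from collections import Counter

def ordered_tags(words):
    counts = Counter(words)
    # sort by descending count, then ascending tag name for equal counts
    return sorted(((w, n) for w, n in counts.items() if n > 1),
                  key=lambda pair: (-pair[1], pair[0]))

print(ordered_tags(['b', 'b', 'a', 'a', 'c']))  # [('a', 2), ('b', 2)]
```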
I've added pre tags around the code. It should make this thread easier to read for future visitors! Just wait for my new blog system that has syntax highlighting in the comments. :-)
Dave Kirby, show off! Your Python solution is nice and succinct though.
_Mark_ The beautiful soup solution is very nice, and more general.
Paddy3118, any assumption would be fine if it wasn't stated explicitly in the problem. I wouldn't penalize someone on a technicality. I only 'marked' the tests on a fail or pass basis. For problem three, the solutions that got close would treat floats and ints as the same, and unicode and strings as the same. I think that requirement is at least implicit from the description of the problem, but something only more experienced coders would pick up on.
For the first problem, I'm surprised nobody used the setdefault method on dictionaries. A defaultdict class is cleaner though. Using itertools.groupby is another possibility that avoids the need for a dictionary at all.
For problem 2, any of the solutions that actually parsed the html are better from a technical point of view. It depends on the context though, sometimes a simple RE will suffice.
For problem 3, my solution for generating the key was to use repr, as others have done. The simplest way to generate a key, I think, is just to do repr( (args, kwargs) ). I had also thought of pickling, but Dave Kirby suggests that won't work--which is a new one on me! I wonder if there are other Python serialization mechanisms that can guarantee consistent output for the same input.
Another issue in problem 3 is where do you store the dictionary that contains the cached output? Most solutions have one dictionary and use the function as part of the key. I don't like that solution much; it feels a little 'dirty'. IMHO it is better to store the dictionary as an attribute of the function being cached.
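A stripped-down sketch of that function-attribute variant (the 5-minute/10-use limits omitted for brevity; names are mine):

```python
import functools

def memoized(f):
    @functools.wraps(f)
    def wrapped(*args):
        # the cache dictionary lives on the wrapper function itself
        if args not in wrapped.cache:
            wrapped.cache[args] = f(*args)
        return wrapped.cache[args]
    wrapped.cache = {}  # one dictionary per decorated function
    return wrapped
```

A side benefit is that the cache is inspectable from outside, e.g. `square.cache` after decorating a `square` function.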
Actually, as long as you don't store the cache dictionary at module level, you will get a new cache dictionary for each function you wrap with the decorator. Adding the function as part of the key is redundant, but I have added it for good measure (just so you could reconstruct the call from the key, if you wanted to).
but I do like adding the cache to the function as an attribute, it has the same semantics, without the redundant part.
This python bug report gives more information on the pickling issue - it was closed as invalid, since pickled string equality is not guaranteed:
http://mail.python.org/pipermail/python-bugs-list/2004-January/021745.html
The main problem is object references - if you pickle two identical objects (or the same object twice), but there are a different number of references to the objects, then you may get a different pickle.
>>> import cPickle as pickle
>>> a = {}
>>> pickle.dumps(a)
'(dp1\n.'
>>> pickle.dumps({})
'(d.'
this does not seem to happen with the pure python version of pickle, but that is likely to be so slow that it is useless for memoizing (and equality is still not guaranteed).
I'll also note that for all that a regexp solution is supposed to be "simpler", we've seen only one (that apparently misunderstood the question) and from the comments, none of the suggestions that a regexp would be simpler considered things like html comments, cdata, or whether nbsp is whitespace... I've found it quite common that when you've got a simple regexp for something - you've missed some part of the problem :-)
(Also, like lambdas, regexps don't have docstrings...)
I'm guessing you are talking about my comment, unfortunately, wordpress deletes everything between angle brackets. I'll post it again substituting angle brackets with curly ones.
import re
empty_p = re.compile(r'{p}\s*{/p}') # substitute \s with ' ' if you don't include tabs and newlines as white space
input = empty_p.sub('', input)
empty_p could be generalized to be more robust:
empty_p = re.compile(r'{\s*p\s*}\s*{\s*/\s*p\s*}',re.IGNORECASE)
This will also catch p tags which are uppercase or contain spaces inside the tag.
the regexp solution IS simpler, given that you can speak regexp fluently, and it has the advantage of being quicker than other solutions.
But it depends on the problem: I won't use a regexp in a situation where the structure needing extraction is complicated, when I need to do multiple tasks with the html, or when the problem can't be solved with a regular language automaton.
And remember that most of the times, you write html parsers for specific sites, where you know how the html behaves.
and while I'm at it, I'll re-post my decorator which the wordpress-monster ate...
import datetime
import functools

def cache(max_time=datetime.timedelta(minutes=5), max_count=10):
    CACHE = {}
    def decorator(f):
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            key = repr(f) + '|' + repr(args) + '|' + repr(kwargs)
            if key in CACHE:
                first_access, count, res = CACHE[key]
                if datetime.datetime.now() - first_access >= max_time or count >= max_count:
                    del CACHE[key]
                else:
                    CACHE[key] = (first_access, count + 1, res)
                    return res
            res = f(*args, **kwargs)
            CACHE[key] = (datetime.datetime.now(), 1, res)
            return res
        return wrapped
    return decorator
damn, pre tags don't work :(
@Dave Kirby: Can't the ''.translate() also do the ''.lower() at the same time? (Also, the delete string is being built each time around the loop.) I'd also be interested to see your approach if the output had to be sorted by descending frequency and then ascending tag.
Just for completeness: ido, thank you for reposting that regexp - it is simple, and matches the most-naive interpretation of the request; it's just that *any* simple regexp will get the wrong answer in the face of html comments or cdata sections (which I won't try to enter here, since there's no preview to see if it eats them.) regexps have their place - my point is just that it isn't as broad a place as people seem to think :-)
@Ralph Corderoy: yes, .translate() could be used to do the lowercase conversion as well - I didn't do that originally because it would not work with unicode, but in hindsight my code would not work with unicode anyway, especially if there are unicode punctuation characters (I think that is true of all the other solutions here).
The unicode type also has a translate method, but it works differently - it takes a dictionary mapping character ordinals, and characters mapped to None are deleted.
Because I am in golfing mood:
from itertools import groupby;filter(lambda p:p[1]>1,((x[0],len(list(x[1]))) for x in groupby(sorted(e.strip('.? ').lower() for e in 'microsoft apple microsoft. Apple? security microsoft internet'.split()))))
Oops, forgot to sort:
from operator import itemgetter;from itertools import groupby;sorted(filter(lambda p:p[1]>1,((x[0],len(list(x[1]))) for x in groupby(sorted(e.strip('.? ').lower() for e in 'microsoft apple microsoft. Apple? security microsoft internet'.split())))),key=itemgetter(1),reverse=True)
For golf, you could import longname as l to save space later.
My solution to #1 was very similar to the one in the first comment, so I came up with this:
import string
raw_tags = ["".join([c for c in tag if c not in string.punctuation]).lower().strip() for tag in open('tags').readlines()]
tags = [(tag, len(raw_tags) - len([a for a in raw_tags if a != tag]) ) for tag in set(raw_tags)]
print "\n".join(["%s %s"% (k,v) for k, v in \
sorted([(tag, len(raw_tags) - len([a for a in raw_tags if a != tag]))\
for tag in set(raw_tags)], key=lambda x:(x[1], x[0]), reverse=True) if v != 1])
Is it elegant?
Where did I fail?
c = open("input.txt").read()
a =
b = filter(lambda x:x>1, sorted(, cmp=lambda x,y:-cmp(x,y)))
for w in b: print w,w
#1
I might post the others later if I feel like doing them. How's my code look?