Strings hiding in plain sight
It's not often I come across something in Python that surprises me, especially in something as mundane as string operations, but I guess Python still has a trick or two up its sleeve.
Have a look at this string:
>>> s = "A"
How many possible sub-strings are in s? To put it another way, how many values of x are there where the expression x in s is true?
Turns out it is 2.
2?
Yes, 2.
>>> "A" in s
True
>>> "" in s
True
The empty string is in the string "A". In fact, it's in all the strings.
>>> "" in "foo"
True
>>> "" in ""
True
>>> "" in "here"
True
Turns out the empty string has been hiding everywhere in my code.
Not a complaint, I'm sure the rationale is completely sound. And it turned out to be quite useful. I had a couple of lines of code that looked something like this:
if character in ('', '\n'):
    do_something(character)
In essence I wanted to know if character was an empty string or a newline. But knowing the empty string thang, I can replace it with this:
if character in '\n':
    do_something(character)
Which has exactly the same effect, but I suspect is a few nanoseconds faster.
Don't knock it. When you look after the nanoseconds, the microseconds look after themselves.
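To make the equivalence concrete, here is a minimal sketch comparing the two checks. The function names are hypothetical, chosen only for illustration. The only substrings of '\n' are '' and '\n', so the two checks agree for every string input; they differ only for non-string inputs, where the string version raises TypeError.

```python
def check_with_tuple(character):
    # Membership in a tuple: compares character against each element.
    return character in ('', '\n')


def check_with_string(character):
    # Substring test: true only for '' and '\n', the two substrings of '\n'.
    return character in '\n'


# The two checks agree for any string, whatever its length.
for candidate in ['', '\n', 'a', ' ', '\n\n', 'ab']:
    assert check_with_tuple(candidate) == check_with_string(candidate)

# They differ for non-strings: the tuple check simply returns False,
# while the substring check raises TypeError.
assert check_with_tuple(None) is False
try:
    check_with_string(None)
except TypeError:
    pass
```

So the swap is only safe when character is guaranteed to be a string.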
I'd argue this is an anti-feature. Looking at
if character in '\n':
I'd consider "" matching a bug.
Man, this is a nice comment box so far. If I have to sign up to Disqus or something I'm gonna be mad.
Why not just do
if character == '\n':
Then it won't match an empty string.
Also, did you have to sign up to Disqus? I guess I'll figure it out by replying.
It's set theory, not an anti-feature.
It is odd looking. But these edge cases tend to be well thought out in Python by some very smart people. So I'm guessing there is some solid thinking behind it.
The comment system is home grown. Did it work well? There may be some glitches left.
String containment is about substrings (slices), NOT about single characters: (sub in string) == any(sub == string[i:i + len(sub)] for i in range(len(string) - len(sub) + 1)).
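The slice-based view of containment can be written out as a runnable check. This is a sketch of the idea only, not how CPython implements the in operator, and contains is a hypothetical helper name. It compares sub against the slice string[i:i + len(sub)] for every start position i from 0 through len(string) - len(sub).

```python
def contains(sub, string):
    # Compare sub against every slice of the same length.
    # The range is empty when sub is longer than string, so any() is False.
    return any(
        sub == string[i:i + len(sub)]
        for i in range(len(string) - len(sub) + 1)
    )


# The sketch agrees with the built-in operator, including for the empty string.
for sub, string in [('', ''), ('', 'A'), ('A', 'A'), ('oo', 'foo'),
                    ('of', 'foo'), ('foo', 'fo')]:
    assert contains(sub, string) == (sub in string)
```

With sub == '' every slice string[i:i] is itself empty, which is why the empty string turns up in every string, including the empty one.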
Interesting find. Thanks for posting it. Personally I'd prefer the readability of ('', '\n'), so that it's clear that you're checking for the empty string, which is not obvious when using the string version.
I also have a problem with your assumption.
Did you measure the speed improvement or guess?
I did a small test (Python 3.5.2, 32-bit) and the behaviour is inconsistent. With the code below, exact matches to the set are slightly faster than checking the string.
Good point! You're right, I didn't test it.
Your test doesn't try all the possible inputs, which may perform differently. I tweaked it to try the empty string, a newline, and another character.
It does look like the string version is a tiny bit faster (and I do mean tiny).
You're definitely right about the readability. I'd only use this in a very tight loop and with a comment.
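For anyone who wants to repeat the measurement, here is a minimal sketch using the standard timeit module. It is an assumption about how such a test might look, not the code used in the comments above; time_check and compare are hypothetical names. It tries the same three inputs mentioned: the empty string, a newline, and another character.

```python
import timeit

# Empty string, newline, and another character.
INPUTS = ['', '\n', 'a']


def time_check(expression, value, number=1000000):
    # Seconds taken to evaluate `expression` `number` times,
    # with the name `character` bound to `value`.
    return timeit.timeit(expression, globals={'character': value}, number=number)


def compare(number=1000000):
    # Time the tuple check and the string check for each input.
    results = {}
    for value in INPUTS:
        results[value] = {
            'tuple': time_check("character in ('', '\\n')", value, number),
            'string': time_check("character in '\\n'", value, number),
        }
    return results


if __name__ == '__main__':
    for value, timings in compare().items():
        print(repr(value), timings)
```

Timings will vary with the Python version, build, and machine, so any difference is only meaningful if it holds up across several runs.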
BTW the ('', '\n') is just a tuple. Didn't occur to me to try with a set.
I actually did know that the empty string is contained in every string, but for me
if character in ('', '\n'):
is clearer (to the reader) than
if character in '\n':
because although the latter works and may be a few "nanoseconds" faster, the former makes it clear to the reader that the effect is intentional.
No argument there.
I used it in a tight CPU-bound loop and with a comment. Still feels a little dirty...
As an elaboration, guess HOW MANY empty strings are in 'abc'.
That is peculiar. Replacing the empty string with something else actually makes more empty strings.
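The two comments above can be checked with str.count and str.replace. That these are the calls the commenters had in mind is an assumption. The empty string matches once before each character and once at the end, so a string of length n contains n + 1 of them.

```python
s = 'abc'

# One empty string before each character, plus one at the end.
empty_count = s.count('')          # 4

# Replacing '' inserts the replacement at each of those positions.
replaced = s.replace('', '-')      # '-a-b-c-'

# The result is longer, so it contains even more empty strings.
recount = replaced.count('')       # 8

print(empty_count, replaced, recount)
```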