Monday 10 September 2012

Python: str.split has a "limit" argument (maxsplit)

I recently made a little discovery: str.split has a maxsplit argument, which caps the number of splits it performs. This can be really useful for text parsing.
In the past I have run into a few (rare) situations where I needed this but didn't know about maxsplit, and ended up using str.join and slices to rebuild the rest of the string, delimiters included.

That approach is tedious to write and ugly to read.
>>> url = '/posts/blog-1/10251/'
>>>
>>> #problem: split the URL into two parts
... #such that first_part == 'posts' and second_part == 'blog-1/10251'
... #first solution: split and join with slices.
...
>>> first_part = url.strip('/').split('/')[0]
>>> second_part = '/'.join(url.strip('/').split('/')[1:])
>>> first_part, second_part
('posts', 'blog-1/10251')
However, with the maxsplit argument it becomes much more readable.
>>> #second solution: use unpacking, and str.split() with the maxsplit argument
...
>>> first_part, second_part = url.strip('/').split('/', 1)
>>> first_part, second_part
('posts', 'blog-1/10251')
>>>
Note that maxsplit is the number of splits to perform, not the number of fragments to return. So pass n-1 when you want n fragments.
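
To make the off-by-one concrete, here is a minimal sketch. The name split_into is a hypothetical helper of my own, not something Python provides, and it assumes you want at most n fragments with n of at least 1.

```python
def split_into(text, sep, n):
    """Split text into at most n fragments (hypothetical helper, not a built-in)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # n fragments require n - 1 splits
    return text.split(sep, n - 1)


url = '/posts/blog-1/10251/'
parts = split_into(url.strip('/'), '/', 2)
print(parts)  # ['posts', 'blog-1/10251']
```

If the string contains fewer delimiters than requested splits, you simply get fewer fragments back, so unpacking into a fixed number of names can still fail on short input.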

What about splitting by whitespace?

Splitting on whitespace is one of the most useful features of str.split(). I usually invoke it by calling "".split() with no arguments, so I was worried: in the Python 2 interpreter used here, split() takes no keyword arguments, which means maxsplit can only be passed positionally, after a separator. Fortunately, passing None as the separator, as in "".split(None, 1), also selects whitespace splitting.
This is nice because the split-and-join tactic above cannot recover the exact whitespace that was there: the delimiter is an arbitrary run of whitespace, not a single known character.
>>> 'OneFragment TwoFragments ThreeFragments'.split()
['OneFragment', 'TwoFragments', 'ThreeFragments']
>>> 'OneFragment TwoFragments ThreeFragments'.split(maxsplit=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: split() takes no keyword arguments
>>> 'OneFragment TwoFragments ThreeFragments'.split(None, 1)
['OneFragment', 'TwoFragments ThreeFragments']
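
The difference between a None separator and an explicit ' ' is easy to miss, so here is a small sketch of it. The sample string is made up for illustration, and the last line assumes Python 3, where maxsplit is also accepted as a keyword (unlike the Python 2 session above).

```python
# Made-up sample: three spaces, then a tab, between the words
text = 'OneFragment   TwoFragments\tThreeFragments'

# sep=None treats any run of whitespace as a single delimiter
by_whitespace = text.split(None, 1)
print(by_whitespace)  # ['OneFragment', 'TwoFragments\tThreeFragments']

# an explicit ' ' splits on exactly one space, leaving the other two behind
by_space = text.split(' ', 1)
print(by_space)  # ['OneFragment', '  TwoFragments\tThreeFragments']

# Python 3 only: the keyword form gives the same result as split(None, 1)
by_keyword = text.split(maxsplit=1)
```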

Split by whitespace, and preserve it.

When you split on whitespace, str.split treats spaces, tabs, carriage returns, newlines and other whitespace characters as delimiters. Sometimes you want to preserve that information, but once you call str.split and join the pieces back together, there is no way to get it back. It's gone. The maxsplit argument, however, leaves the whitespace inside the unsplit remainder intact.
>>> 'And together we fled.\r\nWe said:\r\n\t"Hello!"'.split(None, 1)
['And', 'together we fled.\r\nWe said:\r\n\t"Hello!"']
>>> print _[1]
together we fled.
We said:
        "Hello!"

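For comparison, here is a rough sketch that runs both tactics on the same string, so the loss is visible side by side. The variable names are mine and purely illustrative.

```python
text = 'And together we fled.\r\nWe said:\r\n\t"Hello!"'

# split-and-join: every run of whitespace collapses into one space
words = text.split()
rest_joined = ' '.join(words[1:])

# maxsplit: only the first whitespace run is consumed, the remainder is untouched
first_word, rest = text.split(None, 1)

print(repr(rest_joined))  # 'together we fled. We said: "Hello!"'
print(repr(rest))         # 'together we fled.\r\nWe said:\r\n\t"Hello!"'
```

The whitespace between the first fragment and the remainder is still consumed by the split, so only the whitespace inside the remainder survives.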