The fastest way to detect a vowel in a string
31 points by azhenley
This matches old Python performance lore. Specifically: replacing an explicit loop with a call to a C function that contains essentially the same loop (NumPy-style vectorisation) works for string processing too.
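Roughly this contrast, as a sketch (the function names are mine, not the article’s):

    import re

    VOWEL_RE = re.compile(r"[aeiou]")   # the scanning loop lives inside the C regex engine

    def loop_search(s):
        # Explicit loop: every iteration executes Python bytecode
        for i, ch in enumerate(s):
            if ch in "aeiou":
                return i
        return -1

    def regex_search(s):
        # Same loop, but run once inside the C extension
        m = VOWEL_RE.search(s)
        return m.start() if m else -1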
Haha, the primes example is ridiculous. It reminds me of Gödel numbering, likely the inspiration.
Also, using Wilson’s theorem for primality testing, haha, this is hilarious (because it’s just about the worst possible way to test for primes, but it’s fine for such small primes).
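For anyone who hasn’t seen it, a Wilson’s theorem test looks something like this (my sketch, not the article’s code):

    from math import factorial

    def is_prime_wilson(p: int) -> bool:
        # Wilson's theorem: p > 1 is prime iff (p - 1)! == -1 (mod p).
        # Computing the full factorial makes this absurdly slow for large p,
        # but it is fine for single-digit primes.
        return p > 1 and factorial(p - 1) % p == p - 1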
Was hoping to see some cool SIMD trick actually. Is this possible with Python?
Only in the same way as the regex part: by calling into a native library that does it (I’m guessing something from the scientific ecosystem would have it, NumPy or pandas?). Python itself doesn’t expose anything like that.
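Something along these lines with NumPy, presumably (my sketch; this is vectorised native code rather than hand-written SIMD, and whether it actually uses SIMD depends on the build and CPU):

    import numpy as np

    def has_vowel_np(s: str) -> bool:
        # View the string's bytes as a uint8 array; np.isin runs its
        # comparison loop in compiled C, not in the Python interpreter.
        arr = np.frombuffer(s.encode("ascii", "replace"), dtype=np.uint8)
        vowels = np.frombuffer(b"aeiouAEIOU", dtype=np.uint8)
        return bool(np.isin(arr, vowels).any())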
Matches my experience. Years ago in uni I was trying to performance-tweak the Python microformats2 parser because it was a big chunk of the processing time in brid.gy, and using regex (and sometimes combining multiple regex and then using simple checks to decide which it was) made a difference in many cases. Although overall “parsing HTML into Python objects” was the big whammy of course, at some point optimizing Python is not the most efficient use of your time…
Regex being the fastest is exactly what I’d expect - I don’t know why the author would assume similar performance to the for loop. Fundamentally this is comparing the performance of a for loop in Python to a for loop in C (I can’t recall which regex engine Python uses).
This is similar to those articles that compare “Python” performance to C by comparing a naive C implementation of something (matrix multiplies or similar) to Python code that … calls a library written in C and assembly specifically for the purpose of numeric processing.
There’s a real gap in understanding when some people compare Python performance, and it remains confusing to me. The same gap doesn’t seem to exist in the JS community: there’s much more awareness of when something is implemented in something other than JS, even if it’s being called from JS.
I wonder if it can be made faster with magic numbers and bitmasking. A loop is necessary, and since the check is static, it can be micro-optimized. Even checking “e” first (the most common letter) can improve performance for most real-world cases.
> magic numbers and bitmasking
This is what I was expecting. Maybe I’ve read too many of that kind of article.
After .lower() and subtracting 97 (ord('a')) you get 0, 4, 8, 14, 20, or 000000, 000100, 001000, 001110, 010100 in binary. Kind of interesting. You can rule out any byte that is odd. I guess you still have to compare every byte to something, so the comparison would have to be very fast to make up for the overhead, and the whole thing would be worthwhile only for very long strings. Since you are looking at English-language strings, I doubt any of this would matter because you’d find a vowel very quickly.
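In pure Python the per-character check could look roughly like this (a sketch; the arithmetic is still interpreter-bound, so the idea only really pays off in native code):

    # Bits 0, 4, 8, 14, 20 correspond to a, e, i, o, u after subtracting ord('a')
    VOWEL_MASK = (1 << 0) | (1 << 4) | (1 << 8) | (1 << 14) | (1 << 20)

    def first_vowel_bitmask(s: str) -> int:
        for i, ch in enumerate(s.lower()):
            off = ord(ch) - ord('a')
            if 0 <= off < 26 and (VOWEL_MASK >> off) & 1:
                return i
        return -1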
Falsehoods people believe about English:
Do you believe everything you said here about English or would any other human language also fit here?
I see a lot of folklore along the lines of “haha, English is so nonsensical,” but just about any human language looks that way (and also, they’re not actually nonsensical: linguistics couldn’t exist if human languages didn’t exhibit remarkable regularity). I wonder if you hold the same beliefs about other languages.
Hmm.
English has heuristics, not rules. For every rule of English, there is at least one reasonable exception. The prescriptivist view is simply wrong.
There are languages where spelling is perfectly regularized, so that if you can read a word, you know how to pronounce it. There are languages with authorities trying to control their usage (notably L’Académie française for French) with some degree of success.
Do I believe what I said about English? Ja, you betcha.
By that standard, all languages have “heuristics not rules”. That’s just how languages work. But at that point, you’re just using “heuristic” as a new word for “rule”.
Sure, some languages might have more regular spelling or top-down prescriptivist systems, but (a) spelling is only part of a language, (b) those languages still have exceptions, and (c) authorities that determine how a language gets used exist in every language - how else do you think schools work?
A better list of falsehoods would probably begin: “some languages are uniquely special snowflakes and that factoid you know about them is definitely true”.
How do I think schools work? Well, for English, the subject in question, the answer is “badly, inaccurately, and often without even straightforward presentation of the more accurate heuristics”. It is a testament to students, not to pedagogical techniques, that so many of them end up reasonably literate.
There is no authority which determines how English works.
I can’t tell which code snippets correspond to which timing results: the headings in the article don’t match the labels in the tables and the function names in the snippets are all the same instead of matching the labels.
The image at the top looks like it should scroll to reveal the cropped text, but it doesn’t. The bytecode seems unrelated to the Python code, so I am not sure why they are shown next to each other. Confusing.
Just reiterates how incredibly slow Python truly is. Anything it does fast is actually done in C.
I really want a graph or two for this.
I graphed the data in a Google Doc.
A little analysis:
I’d try making the vowels set a global constant and using `in` with the simple for loop; wonder if that would make much of a difference. Probably wouldn’t beat regex, though, which is mildly annoying, because I like sets XD
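Something like this, presumably (names are mine, not the article’s):

    VOWELS = frozenset("aeiouAEIOU")   # built once at module load

    def first_vowel(s):
        for i, ch in enumerate(s):
            if ch in VOWELS:           # O(1) membership test per character
                return i
        return -1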
Python was never meant as a “write it in Pure This” language, but as a so-called glue language, like the shell itself. If you want low-level, ignoring all internationalization niceties and character classification databases, as per the article, Cython makes this easy. On an i7-1370P on otherwise idle Linux, I get about 27 microseconds with this Cython:
    cdef int first(const char *s, int n, char *set):
        cdef int i                          # Location of first in `set`
        for 0 <= i < n:
            if set[<int>(s[i])] == 1: return i
        return -1                           # ^^^-- to silence [char] warning

    cdef char set[256]                      # Only 4 64B cache lines
    from libc.string cimport memset
    memset(&set[0], 0, 256)                 # Set up a char set
    set[ord('a'[0])] = 1; set[ord('e'[0])] = 1
    set[ord('i'[0])] = 1; set[ord('o'[0])] = 1
    set[ord('u'[0])] = 1  # set[ord('y'[0])] = 1
    from sys import argv, exit; from time import time
    cdef bytes av = bytes(argv[1], encoding="utf8") \
                    if len(argv) > 1 else b''
    cdef const char *s = av
    cdef int n = len(av)
    t0 = time()
    cdef int spot = first(s, n, set)
    print(time() - t0, " for spot ", spot)
and test code for a 100_000-byte non-vowel prefix (L2 CPU cache territory, basically):
cycc first.pyx; test=$(repeat 100000 printf h)i; repeat 5 ./first $test
with output:
2.7179718017578125e-05 for spot 100000
2.7179718017578125e-05 for spot 100000
2.7179718017578125e-05 for spot 100000
3.0994415283203125e-05 for spot 100000
2.6941299438476562e-05 for spot 100000
Also, if you are going to do multiple runs for something like this, it’s best to take the min over the N runs (5 here); more details over at https://github.com/c-blake/bu/blob/main/doc/tim.md
EDIT: This is about 1 CPU clock cycle per input byte. As for the claim at the end of TFA, the fastest way likely involves SIMD, much as with strlen or its generalization memchr, only for 5 specific bytes; you can probably do a little better or worse depending upon whether you do 1 pass or 5 passes, whether strings fit in CPU caches, memory bandwidth, and yadda-yadda. For example, I believe memchr achieves better than 5 B/cycle in L1 cache on most modern CPUs.
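A rough pure-Python analogue of the 5-pass idea, for what it’s worth (my sketch; a single-byte bytes.find ends up in optimized C in CPython):

    def first_vowel_5pass(data: bytes) -> int:
        # One bytes.find() per vowel (each a single-byte scan in C,
        # roughly the "5-pass" approach), then take the earliest hit.
        hits = [i for i in (data.find(v) for v in b"aeiou") if i != -1]
        return min(hits) if hits else -1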