rstripping Simon Pegg: Don’t use rstrip for file extension removal

March 12, 2018 i82much Leave a comment

In Python, what does '/path/to/my/simon_pegg.jpeg'.rstrip('.jpeg') yield?

If you guessed '/path/to/my/simon_pegg' – good try, but not quite right. The real answer is '/path/to/my/simon_'.

Despite what you might intuitively think, rstrip does NOT strip off a substring from the end of a string. Instead, it eliminates all of the characters from the end of the string that are in the argument. (Note that this is not mutating the string; it returns a copy of the string.)

Since pegg contains the characters that are in .jpeg, it is eliminated as well.

While this behavior is documented, it may be surprising.

Why does it matter? There are many instances of people attempting to use this rstrip approach to strip off a file extension. For instance, you might be converting an image from one filetype to another, and need to construct the final path. Or you might want to rename a bunch of files to have consistent extensions (jpeg → jpg).

This buggy implementation of stripping a file extension is tricky because most of the time it works – but it works by coincidence (the file name happens not to end with the characters in the file extension).

Github is rife with examples of people making this same mistake. For instance,

main.py: name = str(name).rstrip(".tif")

fixname.py:os.rename(i.path, i.path.rstrip('.gzip').rstrip('.gz') + '.gzip')

selector.py:l = [i.rstrip(".jpg") for i in k]

The Go language has the same semantics for its TrimRight function. This leads to the same sort of mistakes when people use it to trim file names. For instance,

filehelper.go: filename := strings.TrimRight(f.Name( ".pdf")

latest_images.go: idStr := strings.TrimRight(f.Name(), ".jpg")

The lessons to be learned from this are,

Read the documentation of the library functions you use.
Test your code, and not just of the happy paths. Good tests should try to break your implementation and exercise edge cases.

(Hat tip to my colleague Fredrik Lundh who alerted me to this problem and inspired this post)

Categories: go, programming, Python, Uncategorized Tags: bug, github, golang, python, rstrip

Go gotcha #1: variable shadowing within inner scope due to use of := operator

March 3, 2014 i82much 3 comments

Disclaimer: Go is open source and developed by many Google employees. I work for Google, but the opinions expressed here are my own and do not necessarily represent that of Google.

Last week I described how the range keyword in conjunction with taking the address of the iterator variable will lead to the wrong result. This week I’ll discuss how it’s possible to accidentally shadow one variable with another, leading to hard to find bugs.

Let’s take the same basic setup as last week; we have a Solution struct, and we’re searching for the best (lowest cost) one in a slice of potential candidates.

package main

import "fmt"

type Solution struct {
    Name     string
    Cost     int
    Complete bool
}

func FindBestSolution(solutions []*Solution) *Solution {
    var best *Solution
    for _, solution := range solutions {
        if solution.Complete {
            if best == nil || solution.Cost < best.Cost {
                best := solution
                fmt.Printf("new best: %v\n", *best)
            }
        }
    }
    return best
}

func main() {
    solutions := []*Solution{
        &Solution{
            Name:     "foo",
            Cost:     10,
            Complete: true,
        },
    }
    fmt.Printf("Best solution is: %v", FindBestSolution(solutions))
}

Output:

new best: {foo 10 true}
Best solution is: <nil>
Program exited.

Go playground

What’s going on? We see that we have a good candidate solution from the debugging information. Why does the function return the wrong value?

The bug is in this line:

best := solution

The problem is that we’re declaring and initializing a new variable (with the := operator) rather than assigning to the existing best variable in the outer scope. The corrected line is

best = solution

Use = to change the value of an existing variable, use := to create a new variable.

If I had not referenced the new variable with the debug print statement, this code would not have compiled:

if best == nil || solution.Cost < best.Cost {
    best := solution
}
prog.go:16: best declared and not used
 [process exited with non-zero status]

Go playground

Why is this shadowing of variables in other scopes allowed at all?

There is a long thread on the subject on Go-nuts, debating this subject.

Arguments For

Nate Finch:

type M struct{}

func (m M) Max() int {
    return 5
}

func foo() {
    math := M{}
    fmt.Println(math.Max())
}

If shadowing didn’t work, importing math would suddenly break this program.

…
My point was about adding an import after writing a lot of code (when
adding features or whatever), and that without shadowing, merely importing
a package now has the potential to break existing code….

The current shadowing rules insulate code inside functions from what
happens at the top level of the file, so that adding imports to the file
will never break existing code (now waiting for someone to prove me wrong
on this 😉

Rui Maciel:

There is a simpler and better solution: use a short variable declaration
when you actually want to declare a variable, and use an assignment
operator when all you want to do is assign a value to a variable which
you’ve previously declared. This doesn’t require any change to either
the language or the compiler, particularly one which is that cryptic.

Arguments Against

Johann Höchtl:

See it this way. I can carry a gun in my hand aiming towards a target. I
pull the trigger and hit the target. Everything happens exactly the whay
it is expected to happen.

Suddenly an inner block jumps in … the instructor. Me, a gun in my
hand, the instructor in between and on the other side the target. I pull
the trigger.

Still … everything happens exactly the way it is told to behave. Which
still makes the end results not a desirable result. Adding an “inner
block”, which by itself is behaving in a fully specified way,
influences the whole.

Somewhat odd I admit, but you may get what I mean?

Conclusion

I don’t think that the shadowing should be an error but I do think there should be a warning. The go vet tool already helps find common mistakes, such as forgetting to include arguments to printf. For instance:

example.go:

package main

import "fmt"

func main() {
    fmt.Printf("%v %v", 5)
}

Run:

go vet example.go

example.go:6: missing argument for Printf verb %v: need 2, have 1

If the vet tool were modified to add this warning, it would occasionally yield false positives for those cases where the shadowing is done intentionally. I think this is a worthwhile tradeoff. Virtually all Go programmers I’ve talked with have made this mistake, and I would be willing to bet that these cases are far more frequent than intentional shadowing.

Categories: go, programming Tags: assignment, declaration, go, golang, gotcha, shadowing, variable, workaround

Go gotcha #0: Why taking the address of an iterated variable is wrong

February 25, 2014 i82much 6 comments

Golang mascot
Go mascot – by Renée French under Creative Commons Attribution-Share Alike 3.0 Unported from http://en.wikipedia.org/wiki/File:Golang.png

Disclaimer: Go is open source and developed by many Google employees. I work for Google, but the opinions expressed here are my own and do not necessarily represent that of Google.

Go is my new favorite programming language. It’s compact, garbage collected, terse, and very easy to read. There are some things that trip me up even now after I’ve been using it for awhile. Today I’m going to discuss the range construct and how it has a surprising feature that might violate your assumptions.

Range

First, the range keyword is a way to iterate through the various builtin data structures in Go. For instance,

a := map[string]int {
    "hello": 1,
    "world": 2,
}
// 2 element range gets key and value
for key, value := range a {
    fmt.Printf("key %s value %d\n", key, value)
}
// 1 element is just the key
for key := range a {
    fmt.Printf("key %s\n", key)
}

// Works for slices (think of them as vectors/lists) too
b := []string {"hello", "world"}
// 2 element range gets the index as well as the entry
for i, s := range b {
    fmt.Printf("entry %d: %s\n", i, s)
}
// 1 element gets just the index (notice the pattern?)
for i := range b {
    fmt.Printf("entry %d\n", i)
}

This outputs

key hello value 1
key world value 2
key hello
key world
entry 0: hello
entry 1: world
entry 0
entry 1

Try this code in the Go Playground

Solution search – pointers

Imagine the case where we have a struct as follows

type Solution struct {
    Name string
    Cost int
    Complete bool
}

Say that we’re doing some sort of optimization where we’re looking for the minimum cost solution that meets some criteria; for simplicity’s sake, I’ve put that as the ‘complete’ bool. It’s possible that no such Solution matches, in which case we return a nil solution.

A reasonable implementation would be as follows

func FindBestSolution(solutions []Solution) *Solution {
    var best *Solution
    for _, solution := range solutions {
        if solution.Complete {
            if best == nil || solution.Cost < best.Cost {
                best = &solution
            }
        }
    }
    return best
}

Do you see the bug? Don’t worry if you don’t – I’ve made this mistake a few times now.

Let’s add some tests to find the problem. This is an example of a table driven test, where the test cases are given as a slice of struct literals. This makes it very easy to add new test cases.

func TestFindBestSolution(t *testing.T) {
    tests := []struct {
        name      string
        solutions []Solution
        want      *Solution
    }{
        {
            name:      "Nil list",
            solutions: nil,
            want:      nil,
        },
        {
            name: "No complete solution",
            solutions: []Solution{
                {
                    Name:     "Foo",
                    Cost:     25,
                    Complete: false,
                },
            },
            want: nil,
        },
        {
            name: "Sole solution",
            solutions: []Solution{
                {
                    Name:     "Bar",
                    Cost:     12,
                    Complete: true,
                },
            },
            want: &Solution{
                Name:     "Bar",
                Cost:     12,
                Complete: true,
            },
        },
        {
            name: "Multiple complete solution",
            solutions: []Solution{
                {
                    Name:     "Foo",
                    Cost:     25,
                    Complete: false,
                },
                {
                    Name:     "Bar",
                    Cost:     12,
                    Complete: true,
                },
                {
                    Name:     "Baz",
                    Cost:     25,
                    Complete: true,
                },
            },
            want: &Solution{
                Name:     "Bar",
                Cost:     12,
                Complete: true,
            },
        },
    }
    for _, test := range tests {
        got := FindBestSolution(test.solutions)
        if got == nil && test.want != nil {
            t.Errorf("FindBestSolution(%q): got nil wanted %v", test.name, *test.want)
        } else if got != nil && test.want == nil {
            t.Errorf("FindBestSolution(%q): got %v wanted nil", test.name, *got)
        } else if got == nil && test.want == nil {
            // This is OK
        } else if *got != *test.want {
            t.Errorf("FindBestSolution(%q): got %v wanted %v", test.name, *got, *test.want)
        }
    }
}

If you run the tests you’ll find that the last test fails:

--- FAIL: TestFindBestSolution (0.00 seconds)
    prog.go:82: FindBestSolution("One complete solution"): got {Baz 25 true} wanted {Bar 12 true}
FAIL
 [process exited with non-zero status]

This is strange – it works fine in the single element case, but not with multiple values. Let’s try adding a case where the correct value is last in the list.

    {
        name: "Multiple - correct solution is last",
        solutions: []Solution{
            {
                Name:     "Baz",
                Cost:     25,
                Complete: true,
            },
            {
                Name:     "Bar",
                Cost:     12,
                Complete: true,
            },
        },
        want: &Solution{
            Name:     "Bar",
            Cost:     12,
            Complete: true,
        },
    },

Sure enough, this test passes. So somehow if the element is last the algorithm works. What’s going on?

From the go-wiki entry on Range:

When iterating over a slice or map of values, one might try this:

items := make([]map[int]int, 10)
for _, item := range items {
        item = make(map[int]int, 1) // Oops! item is only a copy of the slice element.
        item[1] = 2                 // This 'item' will be lost on the next iteration.
}

The make and assignment look like they might work, but the value property of range (stored here as item) is a copy of the value from items, not a pointer to the value in items.

This is exactly what’s happening in this case. The solution variable is getting a copy of each entry, not the entry itself. Thus when you take the address of the entry, you end up with a pointer pointing at the LAST element in the slice (since the iteration stops at that point). To illustrate:

package main

import "fmt"

func main() {
    strings := []string{"some","value"}
    for i, s := range strings {
        fmt.Printf("Element %d: %s Pointer %v\n", i, s, &s)
    }
}

Element 0: some Pointer 0x10500168
Element 1: value Pointer 0x10500168

Note that the same pointer is used in both cases. This explains why the Solution pointer ended up pointing at the last element of the slice.
Playground

So how do we work around this problem? The key is to introduce a new variable whose address it’s safe to take; its contents won’t change out from underneath you.

Broken:

if solution.Complete {
    if best == nil || solution.Cost < best.Cost {
        best = &solution
    }
}

Fixed:

if solution.Complete {
    if best == nil || solution.Cost < best.Cost {
        tmp := solution
        best = &tmp
    }
}

With this patch the tests pass:

PASS

Program exited.

Alternative design

A great feature of Go is that you can return multiple values from a single function. Here’s an alternative implementation that doesn’t suffer from the previous problem.

func FindBestSolution(solutions []Solution) (Solution, bool) {
    var best Solution
    found := false
    for _, solution := range solutions {
        if solution.Complete {
            if !found || solution.Cost < best.Cost {
                best = solution
                found = true
            }
        }
    }
    return best, found
}

Since best is copying the VALUE of the solution variable, this works correctly. You can play with this example and see how the tests change in the Playground.

This illustrates one other nice feature of Go – all types have a ‘zero’ value that is legal to use. For strings this is the empty string, for pointers it’s nil, for ints it’s 0, for structs all of types are set to zero values. The line var best Solution implicitly sets best to be the zero solution. If I wanted to I could get rid of the found bool altogether and just compare the returned solution with another zero valued Solution.

Conclusion

I introduced some basic features of Go, including maps, slices, range, structs, and functions. I provided links to the amazingly useful Go playground which lets you easily test out code, format it, and share it with others.

I showed two implementations of a function that searches through a slice of struct values, searching for a solution that meets some criteria.

The first example using pointers led to a subtle bug that’s hard to find and solve unless you know how range works. I showed how to write unit tests that exercise the function and helped flush out the bug. I also explained what the bug was and how to work around it.

Finally I showed a version of the same function that uses Go’s multiple return types to return a found boolean rather than using a nil pointer to signify that the value wasn’t found.

Categories: go, programming Tags: garbage collection, go, golang, google, gopher, gotcha, iterator, java, pointer, programming languages, workaround

Developmentality

Archive

rstripping Simon Pegg: Don’t use rstrip for file extension removal

Go gotcha #1: variable shadowing within inner scope due to use of := operator

Arguments For

Arguments Against

Conclusion

Go gotcha #0: Why taking the address of an iterated variable is wrong

Range

Solution search – pointers

Alternative design

Conclusion

Top Posts

Email Subscription

Tags

Categories

Google+

Follow on twitter

Search the site

Stack Overflow profile

Nick’s tweets

Archives