
Fix panic caused by accessing non-existent header #5804

Merged 5 commits into git-lfs:main on Jun 20, 2024

Conversation

blanet
Contributor

@blanet blanet commented Jun 18, 2024

We encountered a panic while checking out a repository with many LFS files, even with a fresh build from main:

panic: runtime error: index out of range [0] with length 0

goroutine 49 [running]:
github.com/git-lfs/git-lfs/v3/tq.(*basicDownloadAdapter).download(0xc0000148d0, 0xc0009391a0, 0xc0005e7970, 0x0, 0xc000014620, 0x0, {0xb4d7b0, 0xc000552780})
	/data00/go/pkg/mod/github.com/git-lfs/git-lfs/v3@v3.4.0/tq/basic_download.go:156 +0x14a8
github.com/git-lfs/git-lfs/v3/tq.(*basicDownloadAdapter).DoTransfer(0xc0000148d0, {0x40?, 0x0?}, 0xc0009391a0, 0x0?, 0x0?)
	/data00/go/pkg/mod/github.com/git-lfs/git-lfs/v3@v3.4.0/tq/basic_download.go:96 +0x3a8
github.com/git-lfs/git-lfs/v3/tq.(*adapterBase).worker(0xc0000c46c0, 0x5, {0x0, 0x0})
	/data00/go/pkg/mod/github.com/git-lfs/git-lfs/v3@v3.4.0/tq/adapterbase.go:183 +0x58b
created by github.com/git-lfs/git-lfs/v3/tq.(*adapterBase).Begin
	/data00/go/pkg/mod/github.com/git-lfs/git-lfs/v3@v3.4.0/tq/adapterbase.go:96 +0x270
error: external filter 'git-lfs filter-process' failed

After some digging, we found the panic was caused by accessing a non-existent "Retry-After" response header when receiving a 429 status code. We should use the Get() method to safely access response headers.
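For illustration, here is a minimal, self-contained Go sketch (not the git-lfs source) of the difference between indexing the header map directly, which panics when the header is absent, and calling the standard library's Get() method, which simply returns an empty string:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// A response whose server sent a 429 but no Retry-After header.
	res := &http.Response{StatusCode: http.StatusTooManyRequests, Header: http.Header{}}

	// Unsafe: indexing http.Header (a map[string][]string) for a missing key
	// yields a nil slice, so taking element [0] panics with
	// "index out of range [0] with length 0", as seen in the trace above.
	// retryAfter := res.Header["Retry-After"][0]

	// Safe: Get() returns "" when the header is absent.
	if retryAfter := res.Header.Get("Retry-After"); retryAfter == "" {
		fmt.Println("no Retry-After header; fall back to the default retry behavior")
	}
}
```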

@blanet blanet requested a review from a team as a code owner June 18, 2024 09:21

Reported-by: Yunpeng Li <liyunpeng.chn@bytedance.com>
Signed-off-by: Xing Xin <xingxin.xx@bytedance.com>
@blanet blanet changed the title from "fix panic when accessing non-existent response header" to "Fix panic caused by accessing non-existent header" Jun 18, 2024

Several of our test scripts in the t/ directory do not have file
permissions allowing them to be executed directly, so we correct
that now.

A prior commit in this PR resolves a bug where a 429 response to an
upload or download request causes a Go panic in the client if the
response lacks a Retry-After header.

We can simulate this condition and confirm that the changes fix the
problem by updating our test server to respond with a 429 but no
Retry-After header when passed a new sentinel value, and then adding
a pair of tests which send this value.  Both tests generate the
Go panic condition without the other changes in this PR.
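A hypothetical sketch of the sort of test-server change this describes; the sentinel strings, function name, and surrounding structure are illustrative only, not the actual lfstest-gitserver code:

```go
package lfstestsketch

import (
	"net/http"
	"strings"
)

// maybeRateLimit sketches the idea: when the repository name carries a
// sentinel marker, answer storage requests with 429, and for the new
// sentinel deliberately omit the Retry-After header so the client-side
// panic path is exercised.
func maybeRateLimit(w http.ResponseWriter, repoName string) bool {
	if strings.Contains(repoName, "rate-limit-no-header") { // hypothetical new sentinel
		w.WriteHeader(http.StatusTooManyRequests) // 429 with no Retry-After header
		return true
	}
	if strings.Contains(repoName, "rate-limit") { // hypothetical existing sentinel
		w.Header().Set("Retry-After", "1") // headers must be set before WriteHeader
		w.WriteHeader(http.StatusTooManyRequests)
		return true
	}
	return false
}
```
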
In commits 2197f76 and
4fe0a8d of PR git-lfs#3449, support was added
to parse any Retry-After headers provided by a server with a 429 response
and delay making further requests until the appropriate time.  As part of
this work, the NewRetriableLaterError() function in the "errors" package
was introduced, which attempts to parse the Retry-After header value.

This function accepts an input error, such as that defined by the
handleResponse() method of the "lfshttp" package, with the expectation
that it will be wrapped in a new retriableLaterError structure.

However, the NewRetriableLaterError() function overwrites its input
"err" argument when it attempts to parse the header, and so only
wraps a "nil" value.  We can resolve this problem by using a different
name for the errors returned from the header parsing attempts.

As well, we rename the value returned from the attempt to parse the
header as a timestamp formatted according to RFC 1123, so it does
not collide with the name of the standard library's "time" package.
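To make the renaming concrete, here is a simplified sketch of the corrected pattern; the wrapper type, its fields, and the exact parsing rules are assumptions for illustration, and only the NewRetriableLaterError() name and the RFC 1123 detail come from the commit message above:

```go
package errorsketch

import (
	"strconv"
	"time"
)

// retriableLaterError is an illustrative stand-in for the wrapper type;
// its fields do not mirror the real git-lfs "errors" package.
type retriableLaterError struct {
	cause error
	until time.Time
}

func (e retriableLaterError) Error() string { return "retry later: " + e.cause.Error() }

// NewRetriableLaterError shows the fix: the parse results get their own
// names (parseErr, retryAfter), so the caller's err is never overwritten
// and is still intact when it is wrapped below.
func NewRetriableLaterError(err error, header string) error {
	if header == "" {
		return nil
	}
	if seconds, parseErr := strconv.Atoi(header); parseErr == nil {
		return retriableLaterError{cause: err, until: time.Now().Add(time.Duration(seconds) * time.Second)}
	}
	if retryAfter, parseErr := time.Parse(time.RFC1123, header); parseErr == nil {
		return retriableLaterError{cause: err, until: retryAfter}
	}
	return nil
}
```
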
A prior commit in this PR resolves a bug where a 429 response to an
upload or download request causes a Go panic in the client if the
response lacks a Retry-After header.

The same condition, when it occurs in the response to a batch API
request, does not trigger a Go panic; instead, we simply fail without
retrying the batch API request at all.  This stands in contrast to
how we now handle 429 responses for object uploads
and downloads when no Retry-After header is provided, because in
that case, we perform multiple retries, following the exponential
backoff logic introduced in PR git-lfs#4097.

This difference stems in part from the fact that the download()
function of the basicDownloadAdapter structure and the DoTransfer()
function of the basicUploadAdapter structure both handle 429 responses
by first calling the NewRetriableLaterError() function of the "errors"
package to try to parse any Retry-After header, and if that returns
nil, then calling the NewRetriableError() function, so they always
return some form of retriable error after a 429 status code is received.

We therefore modify the handleResponse() method of the Client structure
in the "lfshttp" package to likewise always return a retriable error
of some kind after a 429 response.  If a Retry-After header is found
and is able to be parsed, then a retriableLaterError (from the "errors"
package) is returned; otherwise, a generic retriableError is returned.
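A condensed sketch of the decision rule this paragraph describes; the function name here is made up, and the constructor signatures are assumed from their use elsewhere in this PR, so treat it as illustrative rather than the actual lfshttp code:

```go
package lfshttpsketch

import (
	"net/http"

	"github.com/git-lfs/git-lfs/v3/errors" // import path assumed from the module path in the stack trace
)

// classify429 illustrates the rule: a 429 always yields some retriable error,
// whether or not a Retry-After header was provided.
func classify429(res *http.Response, err error) error {
	if res.StatusCode != http.StatusTooManyRequests {
		return err
	}
	// Retry-After present and parseable: retry at the server-specified time.
	if later := errors.NewRetriableLaterError(err, res.Header.Get("Retry-After")); later != nil {
		return later
	}
	// No usable Retry-After: fall back to a generic retriable error, which the
	// transfer queue retries with exponential backoff.
	return errors.NewRetriableError(err)
}
```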

This change is not sufficient on its own, however.  When the batch
API returns 429 responses without a Retry-After header, the transfer
queue now retries its requests following the exponential backoff
logic, as we expect.  If one of those eventually succeeds, though,
the batch is still processed as if it encountered an unrecoverable
failure, and the Git LFS client ultimately returns a non-zero exit code.

The reason this occurs is because the enqueueAndCollectRetriesFor()
method of the TransferQueue structure in the "tq" package sets the
flag which causes it to return an error for the batch either when an
object in the batch cannot be retried (because it has reached its
retry limit) or when an object in the batch can be retried but no
specific retry wait time was provided by a retriableLaterError.

The latter of these two cases is what is now triggered when the batch
API returns a 429 status code and no Retry-After header.  In commit
a3ecbcc of PR git-lfs#4573 this code was
updated to improve how batch API 429 responses with Retry-After headers
are handled, building on the original code introduced in PR git-lfs#3449
and some fixes in PR git-lfs#3930.  This commit added the flag, named
hasNonScheduledErrors, which is set if any object in a batch which
experiences an error either cannot be retried, or can be retried but
does not have a specific wait time as provided by a Retry-After header.
If the flag is set, then the error encountered during the processing
of the batch is returned by the enqueueAndCollectRetriesFor() method,
and although it is wrapped by the NewRetriableError() function, because
the error is returned instead of nil, it is collected into the errors
channel of the queue by the collectBatches() caller method, and this
ultimately causes the client to report the error and return a non-zero
exit code.

By contrast, the handleTransferResult() method of the TransferQueue
structure treats retriable errors from individual object uploads and
downloads in the same way for both errors with a specified wait time
and those without.

To bring our handling of batch API requests into alignment with
this approach, we can simply avoid setting the flag variable when
a batch encounters an error and an object can be retried but without
a specified wait time.

We also rename the flag variable to hasNonRetriableObjects, which
better reflects its meaning, as it signals the fact that at least
one object in the batch can not be retried.  As well, we update
some related comments to clarify the current actions and intent of
this section of code in the enqueueAndCollectRetriesFor() method.
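An illustrative paraphrase of the per-object handling described above; the stub types and the retryLater()/retryWithBackoff() helpers are hypothetical, while canRetryObjectLater(), canRetryObject(), and the hasNonRetriableObjects flag are the names discussed in this PR:

```go
package tqsketch

// Minimal stand-ins so the sketch compiles; they do not mirror the real tq package types.
type object struct{ Oid string }

type queue struct{}

func (q *queue) canRetryObjectLater(oid string, err error) bool { return false } // stub
func (q *queue) canRetryObject(oid string, err error) bool      { return false } // stub
func (q *queue) retryLater(t object)                            {}               // hypothetical: retry at the Retry-After time
func (q *queue) retryWithBackoff(t object)                      {}               // hypothetical: retry with exponential backoff

// handleBatchError paraphrases the loop: only objects that are out of
// retries mark the batch as failed; retriable objects, with or without a
// server-specified wait time, are simply requeued.
func (q *queue) handleBatchError(batch []object, err error) bool {
	hasNonRetriableObjects := false
	for _, t := range batch {
		if q.canRetryObjectLater(t.Oid, err) {
			q.retryLater(t)
		} else if q.canRetryObject(t.Oid, err) {
			q.retryWithBackoff(t) // after this PR, this case no longer fails the batch
		} else {
			hasNonRetriableObjects = true
		}
	}
	return hasNonRetriableObjects
}
```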

We then add a test to the t/t-batch-retries-ratelimit.sh test suite
like the ones we added to the t/t-batch-storage-retries-ratelimit.sh
script in a previous commit in this PR.  The test relies on a new
sentinel value in the test repository name which we now recognize in
our lfstest-gitserver test server, and which causes the test server
to return a 429 response to batch API requests, but without a
Retry-After header.  This test fails without both of the changes
we make in this commit to ensure we handle 429 batch API responses
without Retry-After headers.
@chrisd8088
Contributor

Hey, thanks for the bug report and fix! This seems correct and valuable to me.

I think it does expose some other problems which we should address at the same time, however.

First, it would be ideal if we could also add some tests for this condition. For example, I was able to trigger the panic condition by commenting out the lines which set the Retry-After header for uploads or downloads in our test server, building the lfstest-gitserver with those changes, and then running the t/t-batch-storage-retries-ratelimit.sh test script.

Second, a similar kind of error occurs if one comments out the Retry-After header that's set for batch responses, builds the test server, and runs the t/t-batch-retries-ratelimit.sh script. The proposed change to enqueueAndCollectRetriesFor() in this PR doesn't prevent that on its own, unfortunately. This condition doesn't cause a Go panic, but the client exits with an error code and doesn't retry the batch requests at all.

There's a history of efforts to better support 429 response handling, starting with PR #3449, with fixes and improvements in PRs #3930, #4097, and #4573 (and possibly others). I've spent a fair bit of time today reading through these PRs, and trying to put together some additional commits for this PR.

My fix-batch-retry-without-header branch contains four commits in addition to the one in this PR. Would you consider pulling those into this PR? Otherwise I can open a separate PR which contains your changes too.

The purpose of these four extra commits is to add the tests I outlined above, and also fix the related problem with how we handle batch API responses when a 429 is received without a Retry-After header. My changes should ensure that in that situation, as for the ones you addressed with individual object uploads and downloads, we follow the exponential backoff strategy from PR #4097 and make a series of batch API requests until either one succeeds or we reach the maximum number of requests permitted by the lfs.transfer.maxRetries configuration setting.

And thank you again for your contribution; we greatly appreciate the help!

Contributor

@chrisd8088 chrisd8088 left a comment


Thanks again for finding this bug and fixing it!

It would be great to add some tests for this condition, like the ones I've put into my fix-batch-retry-without-header branch, if you're willing to consider adding the commits from there to this PR.

Thanks again very much for the contribution!

Comment on lines +415 to +417
if header == "" {
	return nil
}
Contributor


This is a good change, thank you!

As I mentioned in #5804 (comment), this change on its own isn't enough to help us handle batch API responses when we receive a 429 without a Retry-After header, because we don't convert those into any kind of retriable error; we fall through and return a non-retriable error, which means the client simply exits in such a case.

To help us handle 429s without Retry-After headers from the batch API as well as from individual object uploads and downloads, I've proposed some additional commits in my fix-batch-retry-without-header branch.

Would you be willing to pull those into this PR as well, if they seem reasonable to you?

Contributor Author


Sure! It's difficult for me to construct those tests because I'm not very familiar with the implementation details of git-lfs, so thanks for your help. :-)

I really appreciate your efforts in tracking the long history of the retry mechanism.

Contributor


No worries at all! It was an interesting exercise and I think we're making real improvements. Thank you again for the key analysis and fix.

@chrisd8088 chrisd8088 added bug and removed bug labels Jun 19, 2024
@chrisd8088 chrisd8088 requested a review from a team June 19, 2024 18:46
for _, t := range batch {
	if q.canRetryObject(t.Oid, err) {
		hasNonScheduledErrors = true
Contributor


My reading of the history here is that in commit 2197f76 of PR #3449, when support for Retry-After handling was added, the following conditional which checks q.canRetryObjectLater() was introduced with an extra line that set err = nil.

That turned out to cause a subtle bug where, when the batch API returned a 429 response with a Retry-After header, only the first object in the batch was scheduled to be retried. All the others would be skipped, because when q.canRetryObjectLater() was called for them as the loop moved through the list of objects in the batch, that would call q.canRetryLater(err) with a nil for err, which meant errors.IsRetriableLaterError(err) would return false, and the conditional branch would not run and hence would not append the objects to the next batch.

This was fixed in commit a3ecbcc of PR #4573, which removed the err = nil line and replaced it with the hasNonScheduledErrors flag. However, the new implementation maintained the distinction between how the two types of retriable errors were handled, which reflects the original implementation in PR #3449.

I suspect that this distinction is not valid, which is why I remove the hasNonScheduledErrors = true line here, and which in turn allows us to now handle the condition where the batch API returns a 429 but not a Retry-After header.

I also suspect we haven't bumped into this problem before in part because the handleResponse() method of the Client structure in the lfshttp package, which processes the response from the batch API, has in general not created "generic" retriable errors before, only "retriable later" errors, in the specific case of a 429 response with a valid Retry-After header.

@chrisd8088
Contributor

I'm going to request a review from @bk2204 here, because I don't think I should review my own commits. :-)

@chrisd8088 chrisd8088 requested a review from a team June 19, 2024 19:21
@chrisd8088 chrisd8088 merged commit b0e2f25 into git-lfs:main Jun 20, 2024
10 checks passed