> While our log analysis, conducted from March 5 through March 8, confirmed that this was a rare issue, it could not rule out the possibility that a session had been incorrectly returned but then never used. This was not a risk we were willing to take, given the potential impact of even one of these incorrectly returned sessions being used.
It is funny how they make it seem like they are super cautious, when in fact they had no other choice but to do it once you rephrase the problem as "some people may have gotten admin rights to the GitHub accounts of some of our customers".
You say that, but resetting all the sessions is very publicly visible and plenty of smaller companies would paper over such things. Honestly I don't really expect any less from a company at GitHub or Microsoft scale, but don't kid yourself into thinking that kind of thing goes un-noticed, un-investigated or un-actioned elsewhere.
To the contrary, I find my session tokens get reset left and right by websites that don’t even care about how much of an inconvenience it is for the customers.
If your concern is tracking, clearing the cookies doesn't help much. The next day you will log in to the same services and websites that you logged into before, and if they share data with 3rd parties they will keep sharing it, with or without cookies.
"Delete cookies and site data when Firefox is closed" has always bothered me; it should be "Delete cookies and site data when Firefox is opened", because abnormal termination is not "closing", and you'd still want the cookies deleted when you start up again.
It is not. But a ton of sites seem to be inconsistent about it, or it feels that way at least. Some sites I frequent don't log me out for months at a time, and then suddenly they do. And then they do it again a few weeks later. Followed by multiple months of not doing it once more.
I use a food delivery service (website only, no app) that doesn't even remember the chosen language setting, and of course, never keeps you logged in past 24hrs. So effectively, I need to log in every time I make an order...
That happens to me even on YouTube. Every couple of months I have to fix the language setting and switch back to dark theme. What can't be fixed are Google's weird attempts at auto-translating video titles from other languages, making me click on videos that I simply cannot understand. Don't do that, Google. Do full title, thumbnail and video translation with 90% accuracy or just...don't. Half-assed tech is easy but not helpful for anyone.
This is my biggest pet peeve about delivery service websites. What are the chances that I've moved house since I last ordered a pizza two days ago? Why do you incessantly ask for my postcode? When I log in, assume I'm the same person at the same address. If I have moved and accidentally order to the old house...well that's on me and I'd have to be pretty stupid (or hungry) to do such a thing.
Because Discord doesn't let you change avatars I have created three different Discord accounts. I get logged out of all three all the time. It's a massive pain.
Maybe part of the sentence is missing and they meant change avatars _per server_? That would be the only reason I can imagine someone might want to have multiple accounts.
Obviously if they are logging all actions of all users, and there's some decent retention period, they can find out how many people got unintended access and what all they did with it. Recently, I got advice from a C-level executive to include such analytics in ASPSecurityKit [0], because that's what companies are looking for these days. This GH incident makes me consider his suggestion more seriously.
This points to the last A in the AAA of security, which stands for Authentication, Authorization & Accounting. The AAA moniker is commonly used in reference to either RADIUS or Diameter (network protocols), but the concept is widely used for software application security as well.
So Accounting covers what resources were accessed, at what time, by whom, and what commands were issued?
Well that's embarrassing. Passing a mutable dictionary—containing callbacks that could be closing over heaven-even-knows-what, no less!—between threads? And it's re-used between requests?
They are lucky this wasn't much, much worse.
Why does the error reporting service need to call the web thread back anyway?
Yet another case where immutable-everything would have prevented this hairball from accreting in the first place.
An IDOR is embarrassing - this was a complex bug and Github's investigation/response was great here. I'd be proud if my company handled a security incident like this.
For the people as confused as me, it stands for "Insecure Direct Object Reference" and not "International Day of Radiology" or "Iowa Department of Revenue".
Unicorn isn't GitHub's code, though. So the object-reuse that was the root cause here wasn't their code, and treating your webserver as a working black box is pretty typical (right up until you hit an edge case like this).
And Unicorn is pretty old. It wouldn't surprise me if this object reuse dates back to mongrel's code base, back when doing anything with threads in Ruby would have been unusual.
It's still bad code on unicorn's part (when I got to that part of the post I said "Unicorn does WHAT?" out loud), and GH should have been more careful introducing threads to an environment that didn't have concurrency before, but every step of the chain is pretty understandable IMHO.
I think this issue is relevant for Python programmers as well. uWSGI for many years shipped a default allocation mode for the WSGI environment that simply cleared it between requests. One had to switch over to `wsgi-env-behaviour=holy` to prevent it. Holy is now the default, but many people still set it to cheap because that prevented some segfaults in uWSGI for a while.
> On March 2, 2021, we received a report via our support team from a user who, while using GitHub.com logged in as their own user, was suddenly authenticated as another user.
Sobering thought for me is how probable it is that a similar issue might be present in other platforms. The title calls it "rare" but given (a) the number of complex-enough web/app platforms in use today, (b) just the endless number of ways/infra permutations for "doing x", and (c) the size of a typical company's codebase and dependencies, could it be a more common occurrence than we think?
Total anecdote, up for you to believe: Back in April 2016, one of my then-housemates bought a brand-new Macbook Air. A week into his ownership, he exclaims surprise that he is logged-in to another Facebook account, someone we are totally not acquainted with. I borrow his MBA and try to investigate but, alas, I lack expertise for it. The only relevant thing I can eke out is that they both went to the same event a week or so ago and probably shared a Wifi AP, where a "leak" could've happened. Or maybe a hash collision in FB's side?
I would've reported it to FB but then I don't have enough details; with the two conjectures I have, I'm basically holding water with a sieve. But now the thought that someone I totally don't know might end up logged in to my FB account is a possibility that bothers me. And I have no idea how to prevent it other than minimizing what could be compromised (and no, "Get off FB!" just isn't in the cards, I'm sorry).
Kudos to GitHub for investigating and fixing this issue. Another thought that occurred to me while writing this is how many users of other platforms might be filing issues like this but can't provide enough detail, and so their reports are eventually marked as "Could Not Reproduce". Not exactly leaky sessions, but just general security mishaps that they would struggle to describe to support.
> Hello everyone! This is forum regular fronzelneekburm, posting from the account of some poor Chinese guy that gog randomly logged me into. I'll explain the situation in further detail in another post from my actual account once I logged out of this one.
> The underlying bug existed on GitHub.com for a cumulative period of less than two weeks at various times between February 8, 2021 and March 5, 2021. Once the root cause was identified and a fix developed, we immediately patched GitHub.com on March 5. A second patch was deployed on March 8 to implement additional measures to further harden our application from this type of bug. There is no indication that other GitHub.com properties or products were affected by this issue, including GitHub Enterprise Server. We believe that this session misrouting occurred in fewer than 0.001% of authenticated sessions on GitHub.com.
If we use the 0.001% figure and assume all 56 million of their users are authenticated, then this occurred with 560 sessions at most.
You are right in that when something only happens to a minuscule % of users it is hard to investigate and repro. But at the same time, something as critical as "I logged in and was authenticated as another user" should probably be immediately escalated to the highest levels within your company.
So I am one of the users who reported this. I quickly filed a support ticket but then I also searched my network to find someone on GH's security team to email. I'm not sure if support or the security person I emailed escalated it first.
I recently reported a phishing site to GitHub which looked exactly like GitHub.com and GitHub responded within a day. So maybe your escalation wouldn't even have been necessary. :)
I believe they are referring to the rarity of hitting this bug during the period of time it was broken, and not the rarity of the bug in general.
People were speculating on this here: https://hackertimes.com/item?id=26395138, and I even left a comment about this type of bug, but on Node. This category of bug, where session data leaks across requests, is actually incredibly common.
> The only relevant thing I can eke out is that they both went to the same event a week or so ago and probably shared a Wifi AP, where a "leak" could've happened. Or maybe a hash collision in FB's side?
Given two users in extremely close geographic proximity, the chances of them having manually logged in on the device themselves with your housemate either forgetting or not knowing are much much higher than the chances of an authentication exploit in the world's most popular site that would have gone undiscovered for a subsequent five years at this point.
Probably somewhere between the two (but closer to the "they just logged in on that computer" side of things) is the chance of the LAN being some eldritch abomination that just occasionally serves responses to the wrong local IPs, but that's not something any online service needs to care about.
I once had the exact same thing happen on PayPal. I logged into my account, but was greeted by the information of another user. I immediately logged out. I never reported it and it never happened again.
Really interesting write-up. Curious how they do logging, and how they were able to get such a detailed look at the previous requests, including the HTTP response. Wouldn’t it be a massive amount of data if all HTTP requests/responses were logged? (Not to mention the security implications of that)
It would indeed be a massive amount of data, but bear in mind that only a small fraction of HTTP requests hitting GitHub are actually both authenticated and attempting to change the state of the system. Most requests are unauthenticated read-only and cause no state changes in GitHub.
It's a reasonable theory, but step one required to trigger this bug is an unauthenticated request. It's unclear whether their logs indicated that, or whether they got lucky during the repro attempt and someone thought to try an unauthenticated request first when back-to-back sessions didn't repro.
We recently fixed a bug with incomplete requests by noticing in nginx logs that the size of the headers + the size of the body was the size of the buffer of a linux socket.
It's pretty common to do exhaustive logging even if the data is enormous. It's valuable for debugging and security investigations. As far as the security of storing log data, you need log messages to not contain PII regardless of how long you retain them.
It's not uncommon to see companies talk about processing terabytes or even petabytes of log data per day. In the end the storage isn't that expensive [1], and you can discard everything after a little while.
[1] Assuming of course that any company generating a petabyte of logs every day is doing so because they have an enormous number of customers, AFAIK the cost of storing 1PB of logs on something like Glacier is in the 100-300k USD/yr range. Absolutely a tolerable expense for the peace of mind you get by knowing you can go back in 2 months and find every single request that touched a security vulnerability or hunt down the path an attacker took to breach your infrastructure.
Purely a guess, but I've sometimes been able to work backwards from the content length (very often included in HTTP server logs) to figure out what the body had to be.
Yes, it is a large amount of data, but it's worth paying to store it because it lets you go back in time and work out wtf happened, which will save you far more hours and solve far more problems.
I guess the incident happened more often than they wanted to tell us. So as a first step they probably turned up the logging and waited for new reports...
Given the rarity of this bug, I doubt a small sample size would have captured the right requests to be able to debug this. The probability would be the probability of this bug occurring * (sample percentage)^3 [you'd need to have sampled the correct 3 consecutive requests].
As expensive as it is, datadog provides the required level of insight via, among other drill-down methods, detailed flamegraphs, to get to the bottom of problems like this one.
No, not affiliated with datadog, and not convinced the cost/benefit ratio is positive for us in the long run.
That provides more information but doesn’t fully answer any of my questions.
Things that are still unclear:
- GitHub provides an estimate of the prevalence at 0.001%, but doesn’t state how many leaked sessions were actually identified.
- Is GitHub confident that they identified all leaked user sessions? This has bearing on my question about notification of users whose sessions were leaked; if they did not identify all leaked sessions, they obviously could not contact those users.
- GitHub states that this bug could not be intentionally triggered by a malicious party (why?), but they don’t state whether any of the natural occurrences may have been used maliciously.
- How long/for how many requests could these sessions have been valid between their leak and when they were definitively revoked on the 8th?
- Is it possible that any private user data was leaked? If so, what data?
People with technical understanding can infer most of this information, but I think GitHub should issue some straightforward answers rather than beating around the bush.
(Correction to my original comment: the bug was live for a cumulative period of “less than 2 weeks”)
> GitHub states that this bug could not be intentionally triggered by a malicious party (why?),
Because this bug requires two users making authenticated requests at nearly the same time. An attacker can't force their target to perform actions, especially not in the same small time frame that is likely required (milliseconds? less?). I suppose if the malicious party didn't care what account they got swapped into, and GitHub didn't rate limit their requests, perhaps they could trigger this bug with enough time?
GitHub says there could have been sessions that were leaked but can't be identified now:
"While our log analysis, conducted from March 5 through March 8, confirmed that this was a rare issue, it could not rule out the possibility that a session had been incorrectly returned but then never used."
Those sessions are now invalidated and were apparently never used.
Brutal - Unicorn didn't use Rack.env in a thread safe way. It's tougher given that this was in a widely used app-server and not even in Rails code.
Impressed with GH folks in finding this, though I wonder how many more of these types of bugs are out there in Ruby libs. It'd be nice to make this more difficult to do, a la FP languages, by making it really intentional to modify state. Maybe in Ruby 4??
I feel like if you dig down deep enough, every single site on the web has a serious security bug in its app code or the layers it runs on top of. How many more bugs are there still in the CPU your server runs on?
> Unicorn didn't use Rack.env in a thread safe way.
Was Unicorn itself not thread safe? It sounds like Unicorn had the requirement that once you send back the response to the user you must stop accessing the request object. Maybe Unicorn didn't document that requirement clearly, but even if it's not documented it's not clear to me that Unicorn itself is not thread safe; rather, it has undocumented pitfalls that can easily lead to thread unsafety.
unicorn is single-threaded by design, and is not expected to support any background threading workarounds that people build on top of it. That "feature" in github shouldn't be passing the env hash.
This is a good example of why anything related with threading makes me nervous when using Ruby. When a language is mutable and thread-unsafe by default, keeping the whole stack sane at all times seems prohibitively arduous. One cannot code-review a whole ecosystem.
I believe that this could be addressed with an architecture that favored single-threaded processes that can handle a single request at a time. Said processes would be numerous, have a (best-effort) low memory footprint, and be pooled under some master process written in a concurrent/performant manner.
The result wouldn't be necessarily pretty or resource-efficient (especially in terms of RAM), but it also can be seen as a consequence of Ruby's requirements for safety.
An org like GH almost certainly can afford to use more/beefier servers if it means that entire classes of bugs will go away.
To be fair to Ruby, most languages are largely mutable and without any strong thread-safety guarantees. Does that make it okay? No, but it’s far from a Ruby-specific problem.
> I believe that this could be addressed with an architecture that favored single-threaded processes that can handle a single request at a time.
Unicorn (the application web server GitHub is using) does follow this model. GitHub added their own additional thread within that process that broke that design in an unforeseen way.
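To make that concrete, here is a minimal Ruby sketch of the hazard (hypothetical method and key names, not GitHub's or Unicorn's actual code): the unsafe version lets a reference to the live env hash escape into a background thread, while the safer version copies out only the values it needs before spawning the thread.

    # UNSAFE: the background thread keeps a reference to the live env hash.
    # If the server recycles that same hash for the next request, the thread
    # now reads another user's request data.
    def check_features_async_unsafe(env)
      Thread.new do
        sleep 0.5                         # deferred/periodic work
        user = env["warden.user"]         # may now belong to a different request
        puts "checked features for #{user}"
      end
    end

    # SAFER: copy out the values you need up front, so no reference to the
    # shared, recycled env object escapes the request's lifetime.
    def check_features_async_safe(env)
      user = env["warden.user"]
      Thread.new do
        sleep 0.5
        puts "checked features for #{user}"
      end
    end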
I just want to make clear that for me, write ups containing this level of detail analyzing the error that occurred give me greater confidence in the company rather than less confidence. Greater even than if I had never heard of the original bug. There are many bugs I have never heard about, but any company that uses this level of care and openness is likely to be very reliable.
I have to say the presentation is exceptional, including very good diagrams of the problem.
But there's a thought in the back of my head wondering what kind of stuff hit the fan, and wondering who had to create the presentation. Was it the person who checked in the broken code, was it the bug fixer or his manager, or was it a pm that got stuck having to write up what happened?
My guess is that someone close to the remediation wrote it (or it was written by the eng team) and it went through 5 levels of editing before publication.
I agree with you in general, but in this particular case they knowingly deployed thread-unsafe code in production while increasing the threaded concurrency. That should have broken internal development rules.
Now imagine what happens behind the scenes at other service providers. I’d bet the ones that have policies that bury the details of incidents, on average, have worse problems.
Yes. Writing code with bugs breaks internal development rules. The issue wasn't thread-unsafe code. It was confusion about object lifecycle. An env object was assumed to no longer be used by the web framework after a request terminated, while in reality it was recycled for future requests as a performance optimization.
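A toy Ruby illustration of that lifecycle confusion (an assumed simplification, not Unicorn's actual code): because the very same hash object is cleared and refilled, any reference held past the end of a request silently starts showing the next request's data.

    # One hash object, recycled for every "request" handled by this worker.
    ENV_HASH = {}

    def handle_request(path, user)
      ENV_HASH.clear                    # reuse instead of allocating a new hash
      ENV_HASH["PATH_INFO"]    = path
      ENV_HASH["current_user"] = user   # hypothetical key, for illustration
      ENV_HASH
    end

    env_a = handle_request("/repos", "alice")
    env_b = handle_request("/settings", "bob")

    puts env_a.equal?(env_b)     # => true: both variables point at the same object
    puts env_a["current_user"]   # => "bob": the "old" reference now shows the
                                 #    next request's data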
Agreed. However, I smirked a bit when I read your comment, since this is ultimately coming from Microsoft. Let's hope some of this mentality seeps into other Microsoft projects
What would be the benefit for Microsoft in that case? They are transforming themselves from a platform company based around Windows into a platform company based around Azure. GitHub is a relevant part of that.
This doesn't mean they are only doing good™ with GitHub, but destroying it and open source vendors would run counter to their interests.
Every bug is a "n00b mistake" when you are looking right at it. There is not a single large project that is not filled with or at least has had trivial bugs with serious repercussions.
> Every bug is a "n00b mistake" when you are looking right at it.
Except the ones that are just the universe conspiring against you, like bit flipping in the SM-1800 because of heavily irradiated cattle being transported on a nearby train line.
Some time ago I dealt with this exact bug for an e-commerce site with millions of requests per day, and it was also Rails + Unicorn.
ANY time ruby 1.8 raised a timeout, it was silently swallowed, and Unicorn returned the adjacent request session_id. We had to constantly defend against any code that might hit the native timeout.
> We initially thought this to be an internal reporting problem only and that we would see some data logged for an otherwise unrelated request from the background thread.
So they already conceded that their code is fundamentally not thread-safe and they decided to let it slide.
Yes, but in fairness, we all make tradeoffs to get more done in less time. It's much easier to justify throwing your top engineers at a race condition for a few days when you know it's placing security at risk.
> One such performance improvement involved moving the logic to check a user’s enabled features into a background thread that refreshed on an interval rather than checking their enablement while the request was being processed
It's never a good idea to perform multithreaded operations within the web server unless you have a strong knowledge of its internal workings. For instance, if you mismanage threads in an IIS web application you can end up having your web requests queued, since the web server by default has a fixed number of worker threads to serve web requests.
From their bug description it's clear that Ruby on Rails is no exception to this rule. In such cases, where background processing is required due to some condition triggered by a web request, it's best to stick with a proper queue system [1]
Another case where using immutable data by default would have prevented any shenanigans.
It also shows the highly entangled relationship between state and complexity.
(in this case, the env object)
Immutable data forces you to design your system in a different way and remove layers of complexity.
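As a small Ruby-flavoured illustration of that point (a sketch only, not a claim that Rack could simply do this): if the per-request data were frozen, both the "recycle the same object" optimization and any late write from another thread would fail loudly instead of silently leaking.

    env = { "PATH_INFO" => "/repos", "current_user" => "alice" }.freeze

    begin
      env.clear                       # the reuse-the-same-hash optimization
    rescue FrozenError => e
      puts "caught: #{e.message}"
    end

    begin
      env["current_user"] = "bob"     # a late write from a background thread
    rescue FrozenError => e
      puts "caught: #{e.message}"
    end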
While in general I agree, in this case that aspect of the problem entirely came down to "don't use object pools that span multiple requests". That's a conscious optimization that was made and then retracted, not a subtlety that accidentally slipped in due to mutability itself.
I guess the point here is that an object pool isn't the default anyway: they've chosen to do something non-default for performance, which you can still do in immutable-by-default languages.
How would that help here? They would just exchange one piece of immutable data (a not-signed-in user's cookie) for another piece of wrong immutable data (someone else's session).
"While the immediate risk has been mitigated, we worked with the maintainer of Unicorn to upstream the change to make new requests allocate their own environment hashes. If Unicorn uses new hashes for each environment, it removes the possibility of one request mistakenly getting a hold of an object that can affect the next request."
> One such performance improvement involved moving the logic to check a user’s enabled features into a background thread that refreshed on an interval rather than checking their enablement while the request was being processed.
This type of thing is why PHP's shared-nothing model is vastly undervalued IMO.
Does PHP have the ability to use threads if you really need it? Yes.
Are they generally a PITA to use? Yes.
Is this a problem for the vast, vast, vast majority of projects? Not in the slightest.
Having a separate invocation of the runtime for each request is a feature, not a failing.
This is some fantastic debugging. I love war stories like this! They give me confidence that the world is full of mere mortals like myself, and we can do great things nonetheless.
I may be missing something here. I do see that the patch[1] now creates a new request object, and thus a new "env" Ruby hash for each request.
But I don't see the old behavior described as "allocates one single Ruby Hash that is then cleared (using Hash#clear) between each request"...even after poking around in http_request.rb and other places. It reads like the pre-patched version would just re-use the hash as-is, only overwriting/adding if it saw a new key or value, but not deleting any keys.
> Because the Rack environment is the same object for all requests
Who in the heck thought this would be a good idea? For the record, back when Desk.com existed, I also saw strange session issues that were probably complex and irreproducible race conditions as well.
Note that in Elixir/Phoenix, this class of bug is literally impossible, thanks to immutability-all-the-way-down. Turing bless it.
At D2L we have a large C# code base which gets deployed in a few different subsets, but is largely a monolithic fleet of web servers.
To prevent these kind of problems we have a few approaches, but the main way is to prevent shared mutable state. To do this we have a custom C# code analyzer (source here: https://github.com/Brightspace/D2L.CodeStyle/tree/master/src... , but it's not documented for external consumption at this point... ImmutableDefinitionChecker is the main place to look.) It goes like this:
    [Immutable]
    public class Foo : Bar {
        // object is a scary field type, but "new object()"
        // is always an immutable value.
        private readonly object m_lock = new object();

        // we'll check that ISomething is [Immutable] here
        private readonly ISomething m_something;

        // Fine because lambdas in initializers can't
        // close over mutable state that we wouldn't
        // otherwise catch.
        private readonly Action m_abc = () => {};

        // see the constructor
        private readonly Func<int, int> m_xyz;

        // danger: not readonly
        private int m_zzz;

        // danger: arrays are always mutable
        private readonly int[] m_array1 = new[]{ 1, 2, 3 };

        // ok
        private readonly ImmutableArray<int> m_array2
            = ImmutableArray.Create( 1, 2, 3 );

        public Foo( ISomething something ) {
            m_something = something;

            // ok: static lambdas can't close over mutable state
            m_xyz = static _ => 3;

            // danger: lambdas can in general close over mutable
            // state.
            int i = 0;
            m_xyz = x => { i += x; return i; };
        }
    }
Here we see that the type has the [Immutable] attribute, so we will check that all the members are readonly and also contain immutable values (by looking at all assignments, which is easy for readonly fields/properties). Additionally, we will check that instances of Bar (our base class) are known to be immutable.
Any class that were to extend Foo (as a base class) will be required to be [Immutable] as well. There's a decent number of aspects to this analysis (e.g. a generic type can be "conditionally immutable" -- ImmutableArray<int> is, but ImmutableArray<object> is not), check the source if you're interested.
We require all static (global) variables be immutable, and any "singleton" type (of which we have tens of thousands if I remember correctly) also must be.
Another important thing we do is to cut off any deferred boot-up code from accessing customer data (via a thread-local variable that is checked before access through the data layer). This prevents accidentally caching something for a particular customer inside, say, a persistent Lazy<T> (or the constructor for a singleton, etc.)
We've adapted our code base to be very strict about this over the last few years and it discovered many obscure thread-safety bugs and doubtlessly prevented many from happening since.
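For readers outside the C# world, a rough Ruby analogue of that boot-time guard idea (all names here are hypothetical; the real implementation described above is a C# analyzer plus runtime checks):

    # Mark the current thread while deferred boot-up / warm-up code runs.
    def with_boot_guard
      Thread.current[:booting] = true
      yield
    ensure
      Thread.current[:booting] = false
    end

    # Every data-layer entry point checks the flag, so customer-specific data
    # can't accidentally get cached into long-lived singletons during boot.
    def load_customer_record(id)
      raise "customer data access during boot" if Thread.current[:booting]
      { id: id }   # stand-in for the real lookup
    end

    begin
      with_boot_guard { load_customer_record(42) }
    rescue RuntimeError => e
      puts "caught: #{e.message}"    # => customer data access during boot
    end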
This is really cool, it seems similar in intent to Rust's "shared or mutable, but not both" philosophy. I hope it spreads to C# more generally somehow.
This post mortem and changes to unicorn should be useful for a lot of other ruby shops. I suspect github aren't the only people who are not defensively copying rack data when sharing it with background threads.
How do you even have logs for something like this? I feel like the log level to see the raw cookie responses would be a no go since you'd easily have hundreds of gigabytes of logs every day.
This is the kind of bug that makes me leery of frameworks and tall application stacks. Having to trust more than one layer of multithreaded code between an application and the kernel gets riskier with every layer.
Also another reason not to mix the main application code with the authentication/authorization code. Setting security cookies should be done by dedicated code in a dedicated process to avoid this kind of interaction, and the load-balancer layer should filter the setting of security cookies from non-security backends to enforce it.
I really like Python, but the difficulties in using the same process for multiple requests really get me (and I think this bug highlights a similar sort of difficulty with Ruby).
There are of course safe ways to do all of this kind of stuff, but there are _so many_ unsafe ways to do things, and accidentally share things across requests. At least in the Django world I think ASGI is going to make it more likely that stuff is done the right way but there's a big learning curve there.
It doesn't seem like they took this security incident very seriously. It's a pretty serious bug that was reported by two different users within a few hours. It could have been abused by others, but it doesn't look like it occurred to them to roll back recent changes whilst they investigated the problem.
There is a lesson here about the way we go about sharing memory between threads that is implicit and fundamentally broken. Adding extra threading in order to improve performance is clearly the wrong tradeoff when you're in an environment where threads can introduce bugs like these. When it takes 10 senior engineers to figure out why one user session gets data from another session you've already lost.
Shared memory environments are very unsafe and we should be working towards a memory model where data that belongs to different organizations can never get mixed. This isn't impossible, but it requires new programming languages that allow us to express which memory areas can't get mixed up.
We've got to start designing our backend infrastructure differently such that this entire class of memory errors can't ever occur.
Replace Rails and similar server-side frameworks with CGI scripts, and you're done. Every request is a new process with its own memory space.
Yes, performance is an issue with that old approach. Around the turn of the century I was responsible for a Perl web application that ran as a CGI script, and to speed it up I wrote my own Perl httpd server that had my web application code pre-loaded. Then for each request I forked a new process, which carried over all of the pre-loaded code and already-running perl.exe, so there was very little startup time (compared to a normal CGI script startup.) I was careful to take full advantage of the operating system's Copy-on-write memory semantics, so that the only memory that had to be allocated for the new process was Perl's runtime stack. All of the memory containing the application code was shared across processes because it never got written to.
This web application is still running today, though I left the company in 2010. Back then, it was handling about six million requests per day, about 75% of which was during US daytime working hours. So, around 160 requests/second, spread across five or six servers that were getting old in 2010.
The modern frameworks have taken the same concept a step further, and replaced the process fork with a thread pool. That's faster, but it allows memory leaking in a way process forking does not. You have to go out of your way to share memory between processes, but with threads it's easy to do accidentally.
In many languages and web frameworks, the assumption is that the application server will do this for you: for example, by issuing a new env object with each request so that requests can't easily share memory with other requests, though as noted that might not stop background threads from retaining such memory.
In other cases, you might assume containers would be enough separation, or maybe different databases (sharding, for example). One approach I've considered but not yet implemented is to take end-user JWT cookies and use them to authenticate to a database such that a user can only ever see the rows they should be allowed access to, and all other SQL queries would fail.
But no matter how you implement a security control, it's possible to make mistakes and allow others to see data you didn't intend to share with them. Even Rust, famous for its explicit memory ownership model, could still have similar bugs, they just might be a bit more obvious to spot in some session connection pool implemented incorrectly, for example, or maybe exposed via a compiler optimization gone wrong. Code review will still likely be one of the primary defenses of this kind of bug, though hopefully better tooling and language support to detect race conditions or shared ownership warnings will help over time?
This happened to a student colleague of mine in I think 2014. We were working on uni projects and when he logged in to check the repository, he was suddenly another user and we were all really surprised.
I'm not sure what difference it makes. Ultimately, all memory gets reused (you don't throw out your 32GB DIMM after you've written 32GB to it), so some tooling is required to manage the reuse. You can read this story and say "I wouldn't reuse a per-request data structure between requests" or especially "if I did, I wouldn't pass it to a background thread", but it doesn't get at the core of the problem -- if your program has multiple threads of execution, you need some way to check that there isn't unexpected sharing. One way is a tool like ThreadSanitizer, run your tests against it, run 1% of your production traffic against it, and you have a higher probability of detecting data races. Another way is to have some sort of structured concurrency (see "goroutines considered harmful" which was on HN just the other day).
Ultimately, I don't think one can ever be happy about "there was a giant security hole in our app, and we only found out because a customer noticed it and reported it to us, it turns out it was in a random library that we didn't even write and so now that we found this everything is fine!" Everything is not fine, and this will just happen again.
Food for thought: if they had used Puma (which is sort of what people have been using since 2017 when it comes to Ruby/Rails), would this bug have still existed?
Huh? I want my bank to be VERY clear on when I transferred money for my home closing (and keep that record). They need to keep records and the browser used, IP used, authentication flow etc. Why does this have to be thrown away?
You're right but the parent's point is that multithreaded code with shared objects is known to be particularly difficult to do perfectly, and hard work to debug. It's so difficult that the rational strategy may be to architect to avoid it. We will have bugs, but not those bugs. Those bugs stink.
This is the kind of multithreading bug that Rust’s borrow checker can prevent: the compiler would have complained that the background task and the main request handler have write access to the env object at the same time, which is not allowed.
Writing your website in PHP would also have prevented it, because its amateur process per request design is actually the best way to do things, scales well, and has the right security properties. Having an app server that can have requests from different users in the same memory space is a mistake.
Oof, my brain. With that giant fiasco master/slave thread the other day, I glanced at “race condition” and “GitHub” and immediately thought this was related to that.
It doesn't. The issue here seems like a more vanilla logical data race + object reuse that Rust can't really do anything about. Rust only makes sure that only one thread at a time writes and other reads don't race that write. It can't guarantee that the relative ordering of reads and writes makes sense for your use-case.
It can simplify certain abstractions of course so that you can provide easier tools for devs to avoid footguns.
What rust would do for this particular situation is make it painful to share an object like this across threads and make it obvious that such a share is happening. That wouldn't prevent this problem, but it would make it harder to stumble into.
Maybe, maybe not. A claim like that is definitely in the realm of "people think race condition bugs are solved by switching to Rust". I don't know Rust well enough to evaluate your claim but that feels like an incorrect statement - sharing data between threads should be as easy as wrapping something in Arc & Mutex to get similar semantics I think.
Even if that particular bug is actually harder to stumble into, there's no guarantee the space of thread synchronization issues one might stumble into is inherently smaller for Rust than for Ruby, so all you're doing is exchanging one class of failure modes for another.
As I said, they aren't solved but they are hard to do unknowingly.
In most other languages, sharing mutable data across threads is unprotected. You are lucky if you get a compiler warning. Rust disallows that completely. You cannot share mutable data across threads without first wrapping that data in thread safety constructs.
That's what happened here. A mutable variable was shared between threads and updated from different threads. Rust makes it obvious that this variable is accessed from multiple threads even if it doesn't prevent 2 threads from writing to the variable at wrong times. That's where it is a partial solution to the problem.
Sharing is as easy as wrapping the thing in a Mutex, and if you are a programmer that sees something wrapped in a mutex you should immediately start thinking of the thread implications. You don't have to, but you should.
That mutex follows the variable around anywhere you want to start changing it.
In a language like C or Ruby, that's not something that is patently obvious. Shared variables aren't wrapped with anything. It's perfectly understandable for someone to not realize this bit of shared state is mutated across threads.
I wonder if Microsoft has long-term plans to move Github off of things like Unicorn/Rails/Erlang into a technology stack that would be more familiar to them.
I'm not judging the choices of technology, but I'm thinking it was easier for them to buy GitHub outright than to try to compete with it. I assume GitHub still operates independently, with engineers who are familiar with this stack.
Agreed, the comment was short sighted. Was hesitant to even post it considering many involved are likely here on HN.
I wasn't sure how independent Github is from Microsoft after the acquisition. My thoughts were initially in moving engineers around internally from within Microsoft to work on Github.
Tracking down multi-threaded race conditions in a Rails app under Unicorn/Rack takes a lot of experience with specific tooling.
I love how everyone here on HN shits on Gab's mishap just because they happen to host a lot of nonsense alt-right garbage and their CEO is a weirdo, but everyone's totally cool with GitHub knowingly having inconsistencies across threads:
> We initially thought this to be an internal reporting problem only and that we would see some data logged for an otherwise unrelated request from the background thread. Though inconsistent, we considered this safe since each request has its own request data and Rails creates a new controller object instance for each request.
I mean this is like an SNL skit: "though inconsistent, we considered this safe" -- inconsistency, no matter where it's happening, literally means your code isn't safe. Like, jeez... this is GITHUB we're talking about. I mean, I give money to GitHub -- this is pretty serious and we should really stop with the "omg I love these detailed post-mortems!" stuff. A billion-dollar company knowingly having code that's not thread safe in their flagship product and going "eh, it's probably fine" is bonkers to me.
I think there's a pretty big difference between an engineer writing thread-unsafe code, testing and seeing inconsistencies that (incorrectly) appear to not affect any external users, shipping it, and then the security team finding the problem and developing new processes to prevent this in the future — vs the CTO of Gab writing raw SQL using string concatenation, shipping it without code review, and then trying to cover it up afterwards.
No software maintained by a large team will be bug-free forever, and that includes security bugs. The response to bugs is what matters. In this case Github's response was quite mature; in Gab's case, it wasn't.
The billions don't automatically scale development. You know that. Development is still done at the human level. But you still might be right that there is no excuse.
Since this was essentially a potential organizational administrator account takeover security vulnerability, I can see why this got a write up. And as others have said, it's impressive.
Github Actions has had many problems that affect many people this year alone. While the missed deadlines and stress for people can't be quantified by Github, they are happening.
I'd urge Github to take ongoing service interruptions in Github Actions more seriously. I hope Github will take a stronger stance on providing customer communication on migration timelines between Packages and Container Registry.
Just because a bug or service interruption doesn't risk random user session exposure doesn't mean it isn't important to communicate about and describe how the business is working to prevent it from happening again.
> We believe that transparency is key in earning and keeping the trust of our users and want to share more about this bug.
I find it ironic that the largest repository of open source code is closed source and owned by the largest closed source software company on the planet, yet they somehow still "believe in transparency."
That's like McDonald's "transparently" recalling burgers because of a chemical contamination, and then when you ask what else is in the burgers they tell you it's a secret.
Strategic issues aside, open sourcing a codebase that is over a decade old is extremely hard. You need to be absolutely sure there's nothing in the code or commit history that's not suitable for release.
Just one example: what if someone pasted an error log message into a commit message that included a piece of PII?
Fortunately Microsoft, like McDonalds, can use their billions of dollars in the bank to produce high quality graphics to 'explain the situation' - and when people visit the site, they will be Delighted and Impressed with these high quality graphics - instantly remedying the situation. 'Look how nice we explained it - look at the cute cartoon characters solving a puzzle in the header at the top of the page, see how slick we are?'