Greetings GNOMErs,
at this point, I think it’s safe to assume that many of you already heard of a memory leak that was plaguing GNOME Shell. Well, as of yesterday, the two GitLab’s MRs that help fixing that issue were merged, and will be available in the next GNOME version. The fixes are being considered for backporting to GNOME 3.28 – after making sure they work as expected and don’t break your computer.
First, I’d like to thank the GJS maintainer, Philip C., for all the hand-holding, the reviews, and the incredibly insightful discussions we had. Secondly, to my employer, Endless, for the support they gave me to fix this issue. And last but not least, to the Ubuntu folks, which made a public call for testing with the changes – this will give us confidence that the fix is working, and that backporting it will be a relatively safe and smooth process.
I’m writing this blog post with three goals in mind:
- Explain in greater details what is the issue (or at least, what we think it is), the journey to find it, and how it was fixed.
- Give more exposure to important extra work from other contributors that absolutely deserve more credits.
- Expose a social issue that showed up during this time, and open a discussion about it.
Memory Leak
To me, it all started when I saw GitLab’s ticket #64 passing by in the IRC channels. It was challenging enough, I was curious to dig into GNOME Shell/Mutter/GJS internals, perfect match. Of course, when you’re not familiar with a given codebase, the first step to fixing a bug is being able to reproduce it, so I started to play around with GNOME Shell to see if I could find a reliable way to reproduce it.
Well, I found a way and wrote a very simple observation: running animations (showing and hiding the Overview, switching applications using Alt+Tab, etc) was reliably increasing memory usage. Then a few people came in, and dropped bits of useful information here and there. But at this point, it was still pointing to a wide range of directions, and definitely there was not actionable task there. This is when OMG! Ubuntu first wrote about it.
Carlos Garnacho then came in and wrote a pretty relevant comment with important information. It was specially insightful because he put numbers on the guts of GNOME Shell. His comment was the first real solid step to uncover what was going on.
A week passed, and I experimented different toys tools in order to have a better understanding of memory management inside GNOME Shell. This is the kind of tedious work that nobody talks about, but I learned tons of new stuff, so in the end it was worth the hassle. I even wrote about my crazy experiments, and the results of this long week are documented in a long comment in GNOME/gnome-shell#64. I kept experimenting until I reached heapgraph, an interesting tool that allowed generating the following picture:
Well, as stated in the comment, GJS’ garbage collect was indeed collecting memory when triggered. Problem is, it wasn’t being triggered at all. That was the leading clue to one of the problems that was going on. One idea came to my mind, then, and I decided to investigate it further.
A Simple Example
Consider that we have a few objects in memory, and they have parent/child relationships:
Lets suppose that we decided that we don’t need the root object anymore, so we drop a reference to it, and it is marked for garbage collection.
If we destroy the root object, we would need to destroy the other objects related to it, and go destroying everyone that depended, directly or indirectly, on the root object. Traditionally, JavaScript objects track who they own, so the garbage collector can clean up every other dependent object. Here’s the problem: C objects don’t track who owns them; instead, they only track how many owners they have. This is the traditional reference counting mechanism, and it works fine in C land because C is not garbage collected. To the garbage collector, however, the C objects would look like this:
The garbage collector, then, will go there and destroy the root one. This object will be finalized, and the directly dependent objects will be marked for garbage collection.
But… when will the next GC happen? Who knows! Can be now, can be in 10 minutes, or tomorrow morning! And that was the biggest offender to the memory leak – objects were piling up to be garbage collected, and these objects had child objects that would only be collected after, and so it goes. In other words, this is not really a memory leak – the memory is not being lost. I’d label it as a “misbehavior” instead.
The Solution
While people might think this was somehow solved, the patches that were merged does not fix that in the way it should be fixed. The “solution” is basically throwing a grenade to kill ants. We now queue a garbage collection every time an object is marked for destruction. So every single time an object becomes red, as in the example, we queue a GC. This is, of course, a very aggressive solution.
But it is not all bad. Some early tests shows that this has a small impact on performance – at least, it’s much smaller than what we were expecting. A very convincing explanation is that the higher frequency of GCs is reducing the number of things that are being destroyed each GC. So now we have smaller and more frequent garbage collections.
EDIT: Looks like people need more clarification here, since the comments about it are just plain wrong. I’ll be technical, and precise – if you don’t understand, please do some research. The garbage collector is scheduled every time a GObject wrapped in GJS has its toggle reference gone from >1 to 1. And scheduled here means that a GC is injected into the mainloop as an idle callback, that will be executed when there’s nothing else to be executed in the mainloop. The absolute majority of the time, it means that only one GC will happen, even if hundreds of GObjects are disposed. I’ve spotted in the wild it happening twice. This fix is strictly specific to GObjects wrapped by GJS; all other kinds of memory management, such as strings and whatever else, aren’t affected by this fix. Together with this patch, an accompanying solution landed that reduces the number of objects with a toggle reference.
This obviously needs more testing on a wider ranger of hardwares, specially on lower ends. But, quite honestly, I’m personally sure that this apparently small performance penalty is compensated by the memory management gains.
Other Improvements
While the previous section covered my side of this history, there are a few other contributors that did a great job, and I think it would be unfair with them if their work was not properly highlighted.
Red Hat’s Carlos Garnacho published two merge requests for GJS that, in my testing, substantially improved the smoothness of GNOME Shell. The first one changes the underlying data structure of JS objects, which allows us to stop using an O(n) algorithm and starting an O(1) one. The second one is particularly interesting, and it yields the most noticeable improvements in my computer. Gross, it vastly reduces the number of temporary memory allocations. He also has a number of patches on Mutter and GNOME Shell.
Another prominent contributor regarding performance is Canonical’s Daniel van Vugt, which helped early testing the GJS patches, and is doing some deep surgeries in Mutter to make the rendering smoother.
And for every great contributor, there is a great reviewer too. It would be extremely unfair if those relevant people haven’t had their work valued by the community, so please, take a moment to appreciate their work. They deserve it.
Final Thoughts
At this point, hopefully the cautious reader will have at least a superficial knowledge on the problem, the solution, and other relevant work around the performance topic. Which is good – if I managed to communicate that well enough, by the time you finish reading this blog post, you’ll have more knowledge. And more knowledge is good.
You can stop here if you want nothing more than technical knowldedge.
Still around?
Well, I’d like to raise an interesting discussion about how people reacted to the memory leak news, and reflect upon that. By reading the repercussions of the news, I found it quite intriguing to read comments like these:
As a regular contributor for the last few years, this kind of comment sound alien to me. These comments sound completely disconnected to the reality of the development process of GNOME. It completely misses the individuality of the people involved. Maybe because we all know each other, but it is just plain impossible to me to paint this whole community as “they”; “GNOME developers”; etc. To a deeper degree, it misses the nuances and the beauty of community-driven development, and each and every individual that make it happen.
To some degree, I think this is a symptom of users being completely disconnected to GNOME development itself.
It almost feels like there’s a wall between the community and the users of what this community produces. Which is weird. We are an open community, with open development, no barriers for new contributors – and yet, there is such a distance between the community of users and the community of developers/designers/outreachers/etc.
Is that a communication problem from our side? How can we bridge this gap? Well, do we want to bridge this gap? Is it healthy to reduce the communication bandwidth in order to increase focus, or would it be better to increase that and deal with the accompanying noise?
I would love to hear your opinions, comments and thoughts on this topic.
Leave a Reply