The Infamous GNOME Shell Memory Leak

Memory graph

Greetings GNOMErs,

at this point, I think it’s safe to assume that many of you have already heard of the memory leak that was plaguing GNOME Shell. Well, as of yesterday, the two GitLab MRs that help fix that issue were merged, and will be available in the next GNOME version. The fixes are being considered for backporting to GNOME 3.28 – after making sure they work as expected and don’t break your computer.

First, I’d like to thank the GJS maintainer, Philip C., for all the hand-holding, the reviews, and the incredibly insightful discussions we had. Secondly, my employer, Endless, for the support they gave me to fix this issue. And last but not least, the Ubuntu folks, who made a public call for testing with the changes – this will give us confidence that the fix is working, and that backporting it will be a relatively safe and smooth process.

As always, great new features and fixes are a courtesy of Endless

I’m writing this blog post with three goals in mind:

  1. Explain in greater detail what the issue is (or at least, what we think it is), the journey to find it, and how it was fixed.
  2. Give more exposure to important extra work from other contributors who absolutely deserve more credit.
  3. Expose a social issue that showed up during this time, and open a discussion about it.

Memory Leak

To me, it all started when I saw GitLab ticket #64 passing by in the IRC channels. It was challenging enough, and I was curious to dig into GNOME Shell/Mutter/GJS internals: a perfect match. Of course, when you’re not familiar with a given codebase, the first step to fixing a bug is being able to reproduce it, so I started to play around with GNOME Shell to see if I could find a reliable way to do so.

Well, I found a way and wrote down a very simple observation: running animations (showing and hiding the Overview, switching applications using Alt+Tab, etc.) was reliably increasing memory usage. Then a few people came in, and dropped bits of useful information here and there. But at this point, it was still pointing in a wide range of directions, and there was definitely no actionable task there. This is when OMG! Ubuntu first wrote about it.

Carlos Garnacho then came in and wrote a pretty relevant comment with important information. It was especially insightful because he put numbers on the guts of GNOME Shell. His comment was the first real solid step towards uncovering what was going on.

A week passed, and I experimented with different tools in order to get a better understanding of memory management inside GNOME Shell. This is the kind of tedious work that nobody talks about, but I learned tons of new stuff, so in the end it was worth the hassle. I even wrote about my crazy experiments, and the results of this long week are documented in a long comment in GNOME/gnome-shell#64. I kept experimenting until I reached heapgraph, an interesting tool that allowed me to generate the following picture:

Memory graph
Notice the sudden drops of memory at x=42 and x=71

Well, as stated in the comment, GJS’ garbage collector was indeed collecting memory when triggered. Problem is, it wasn’t being triggered at all. That was the leading clue to one of the problems that was going on. One idea came to my mind, then, and I decided to investigate it further.

A Simple Example

Consider that we have a few objects in memory, and they have parent/child relationships:

Example 1
The root object is “1”

Let’s suppose that we decide that we don’t need the root object anymore, so we drop our reference to it, and it is marked for garbage collection.

Example 2
The root object is now marked for garbage collection

If we destroy the root object, we need to destroy the other objects related to it, and go on destroying everything that depended, directly or indirectly, on the root object. Traditionally, JavaScript objects track who they own, so the garbage collector can clean up every other dependent object. Here’s the problem: C objects don’t track who owns them; instead, they only track how many owners they have. This is the traditional reference counting mechanism, and it works fine in C land because C is not garbage collected. To the garbage collector, however, the C objects would look like this:

Example 3
The garbage collector has no means to know the relationships between C objects.

The garbage collector, then, will go there and destroy the root one. This object will be finalized, and the directly dependent objects will be marked for garbage collection.

Example 4
Only the directly dependent objects are marked for the next garbage collection.

But… when will the next GC happen? Who knows! It can be now, it can be in 10 minutes, or tomorrow morning! And that was the biggest contributor to the memory leak – objects were piling up waiting to be garbage collected, and these objects had child objects that would only be collected after them, and so on. In other words, this is not really a memory leak – the memory is not being lost. I’d label it as a “misbehavior” instead.
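The cascade described above can be sketched with a toy model. This is only an illustration, not GJS code: the `Wrapper` and `ToyCollector` classes, and the seven-object tree shape, are assumptions made up for this example. Each collection pass finalizes only the objects that were already marked, and finalizing an object merely marks its direct children for the next pass.

```python
class Wrapper:
    """Toy stand-in for a reference-counted C object (hypothetical model)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.refcount = 0  # the C side only knows a count, not who the owners are
        for child in self.children:
            child.refcount += 1


class ToyCollector:
    """Each pass finalizes only the objects that are already marked."""

    def __init__(self):
        self.marked = []

    def drop_reference(self, obj):
        obj.refcount -= 1
        if obj.refcount == 0:
            self.marked.append(obj)

    def collect(self):
        """Run one GC pass and return the names finalized in that pass."""
        to_finalize, self.marked = self.marked, []
        for obj in to_finalize:
            # Finalizing releases the children, which only marks them;
            # they are not destroyed until a later pass.
            for child in obj.children:
                self.drop_reference(child)
        return [obj.name for obj in to_finalize]


def build_tree():
    """Assumed shape: root "1" owns "2" and "3", which own two leaves each."""
    leaves = [Wrapper(str(n)) for n in (4, 5, 6, 7)]
    return Wrapper("1", [Wrapper("2", leaves[:2]), Wrapper("3", leaves[2:])])


def collect_everything(root, collector):
    root.refcount = 1               # the single reference we are about to drop
    collector.drop_reference(root)
    passes = []
    while collector.marked:
        passes.append(collector.collect())
    return passes


passes = collect_everything(build_tree(), ToyCollector())
print(passes)  # [['1'], ['2', '3'], ['4', '5', '6', '7']]
```

In this model a tree of depth three needs three separate passes before all of its memory is released. If those passes are minutes or hours apart, the memory stays allocated the whole time, which matches the growth seen in the graph.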

The Solution

While people might think this was somehow solved, the patches that were merged do not fix it in the way it should be fixed. The “solution” is basically throwing a grenade to kill ants. We now queue a garbage collection every time an object is marked for destruction. So every single time an object becomes red, as in the example, we queue a GC. This is, of course, a very aggressive solution.

But it is not all bad. Some early tests show that this has a small impact on performance – at least, it’s much smaller than what we were expecting. A very convincing explanation is that the higher frequency of GCs reduces the number of things that are destroyed in each GC. So now we have smaller and more frequent garbage collections.

EDIT: Looks like people need more clarification here, since the comments about it are just plain wrong. I’ll be technical, and precise – if you don’t understand, please do some research. The garbage collector is scheduled every time a GObject wrapped in GJS has its toggle reference go from >1 to 1. And scheduled here means that a GC is injected into the mainloop as an idle callback, which will be executed when there’s nothing else to be executed in the mainloop. The vast majority of the time, it means that only one GC will happen, even if hundreds of GObjects are disposed. I’ve spotted it happening twice in the wild. This fix is strictly specific to GObjects wrapped by GJS; all other kinds of memory management, such as strings and whatever else, aren’t affected by this fix. Together with this patch, an accompanying solution landed that reduces the number of objects with a toggle reference.
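To make the scheduling behavior concrete, here is a minimal sketch of how hundreds of requests can collapse into a single GC. It is not the GJS implementation and does not use the GLib API: `ToyMainLoop`, `GcScheduler` and `on_toggle_down` are hypothetical names, and the sketch assumes that a request made while a GC is already queued is simply ignored.

```python
from collections import deque


class ToyMainLoop:
    """Minimal stand-in for a main loop with idle callbacks (not the GLib API)."""

    def __init__(self):
        self.idle_queue = deque()

    def idle_add(self, callback):
        self.idle_queue.append(callback)

    def run_idle(self):
        # Idle callbacks run only once nothing else is left to do.
        while self.idle_queue:
            self.idle_queue.popleft()()


class GcScheduler:
    """Queues at most one pending GC, however many objects ask for it."""

    def __init__(self, loop):
        self.loop = loop
        self.pending = False
        self.gc_runs = 0

    def on_toggle_down(self):
        """Called when a wrapped object's reference count goes from >1 to 1."""
        if self.pending:
            return  # assumption: a GC is already queued, so do not queue another
        self.pending = True
        self.loop.idle_add(self._run_gc)

    def _run_gc(self):
        self.pending = False
        self.gc_runs += 1  # a real implementation would run the collector here


loop = ToyMainLoop()
scheduler = GcScheduler(loop)
for _ in range(300):        # hundreds of objects released in one burst
    scheduler.on_toggle_down()
loop.run_idle()
print(scheduler.gc_runs)    # 1
```

Under these assumptions, a second GC only happens if more objects are released after the idle callback has already run, which would be consistent with occasionally seeing two collections instead of one.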

This obviously needs more testing on a wider range of hardware, especially on lower-end machines. But, quite honestly, I’m personally sure that this apparently small performance penalty is compensated for by the memory management gains.

Other Improvements

While the previous section covered my side of this story, there are a few other contributors who did a great job, and I think it would be unfair to them if their work was not properly highlighted.

Red Hat’s Carlos Garnacho published two merge requests for GJS that, in my testing, substantially improved the smoothness of GNOME Shell. The first one changes the underlying data structure of JS objects, which allows us to stop using an O(n) algorithm and start using an O(1) one. The second one is particularly interesting, and it yields the most noticeable improvements on my computer. Roughly speaking, it vastly reduces the number of temporary memory allocations. He also has a number of patches for Mutter and GNOME Shell.

Another prominent contributor regarding performance is Canonical’s Daniel van Vugt, who helped with early testing of the GJS patches, and is doing some deep surgery in Mutter to make the rendering smoother.

And for every great contributor, there is a great reviewer too. It would be extremely unfair if those people did not have their work valued by the community, so please, take a moment to appreciate their work. They deserve it.

Final Thoughts

At this point, hopefully the careful reader will have at least a superficial knowledge of the problem, the solution, and other relevant work around the performance topic. Which is good – if I managed to communicate that well enough, by the time you finish reading this blog post, you’ll have more knowledge. And more knowledge is good.

You can stop here if you want nothing more than technical knowledge.

Still around?

Well, I’d like to raise an interesting discussion about how people reacted to the memory leak news, and reflect upon that. By reading the repercussions of the news, I found it quite intriguing to read comments like these:

weird comment 1

Screenshot from 2018-04-20 22-51-53

Screenshot from 2018-04-20 22-52-17

As a regular contributor for the last few years, this kind of comment sounds alien to me. These comments sound completely disconnected from the reality of the development process of GNOME. They completely miss the individuality of the people involved. Maybe it is because we all know each other, but it is just plain impossible for me to paint this whole community as “they”, “GNOME developers”, etc. To a deeper degree, they miss the nuances and the beauty of community-driven development, and each and every individual that makes it happen.

To some degree, I think this is a symptom of users being completely disconnected from GNOME development itself.

It almost feels like there’s a wall between the community and the users of what this community produces. Which is weird. We are an open community, with open development, no barriers for new contributors – and yet, there is such a distance between the community of users and the community of developers/designers/outreachers/etc.

Is that a communication problem from our side? How can we bridge this gap? Well, do we want to bridge this gap? Is it healthy to reduce the communication bandwidth in order to increase focus, or would it be better to increase that and deal with the accompanying noise?

I would love to hear your opinions, comments and thoughts on this topic.


123 thoughts on “The Infamous GNOME Shell Memory Leak”

  1. Hello, I found your article when searching for more information about the memory leak in GNOME in Fedora 28. Although I am not a skilled developer (at least not for now), I appreciate that there is a community of people who take care of such a beautiful and polished product as GNOME is. There are a few issues which I hope will be fixed sometime, such as full support for fractional scaling and the high memory consumption.

    You mentioned the communication between users and contributors and its drawbacks. I think there is often a problem in managing communication and integrating the community into the development process. The problem is that feedback given by users comes in a different form. In usual user speech on forums, in comments below the articles of internet magazines and elsewhere on the internet, there is a big messy mix of emotion, technical disorientation and random thoughts. These must be carefully cut away in order to deliver only the useful feedback to contributors.

    So I would personally suggest creating and maintaining official channels for “user – contributor” communication. Of course that requires a lot of effort and takes a lot of human resources to manage. But a socially skilled volunteer advisor should be able to collect, analyze and restyle the feedback from users and deliver it in a more understandable way to contributors.
    These were just my thoughts. I hope they were clear enough to understand, as I am not a native English speaker.

    I appreciate your work and wish you to have a great day!


  2. Hey thx for the insights about the memory leak, I think I learned something here.

    About the communication issue you asked about. I don’t know how it works with the gnome project since I never contributed but I helped out at the home-assistant https://www.home-assistant.io/ project.

    In general, the discussions at the home-assistant project are really nice and to the point. I rarely see trolls there and it always surprised me how nice the community is over there. I believe most of the credit for this goes to Paulus, who is the founder of home-assistant and he just is a nice guy. Compare his way of treating people with Linus and well, the fish stinks from the head…
    But that is not all. I always found home-assistant to be one of the most beginner-friendly projects I ever saw. If you take some time browsing the webpage, you will find a detailed section on how to build components, how the core works, basically how everything works, and they just try to make it easy to contribute. You can find many of the developers on Discord and usually they respond and are helpful. The barrier to contribution is as low as it gets, I think.

    The key aspect is to get people to contribute to a project in any kind of way. This way “users” become “developers” and “they” becomes “us”, and to achieve this, I think, the GNOME project would need to invest a lot more in explaining how GNOME works. Saying “here is the code, have a look and then you know how it works” is not enough. One has to invest first if one wants some return.


  3. This was reported as a bug over and over again since the absolute disaster that is gnome-shell was first released. The response was the typical GNOME denial and closing bugs just because they were reported 1 release ago. The GNOME development process has become an absolute joke…

    Zawinski was right 15 years ago and he’s even more right now.


    1. These quotes from Bug #685513 sum up the situation nicely:

      “Guessing wildly, people are probably discussing, like, fifteen different bugs, here. This bug report is now so diffuse and vague that there is really zero chance of the devs paying any more attention to it.”

      And:

      “All of our research and investigation, including all of the massif and valgrind logs you guys have provided (thanks for that, it really helped us out!) has shown that it’s a compound of many, many different technical issues. So this bug isn’t really going to help fix anything. It’s going to rack up more “me too” comments while people complain about high memory usage.”

      All this “Gnome developers are incompetent/don’t care/can’t fix a simple problem” gibberish is just that. Claiming “Memory leak” is a single issue, the same issue as it was years ago, or framing it as denial or apathy is really just tiresome mudslinging and naivety.


  4. I still don’t completely understand the leak, nor the proposed solution (ignorance on my part), but why doesn’t running imports.system.gc() 10,000 times (or more… I’ve tried) consecutively in lookingglass reclaim all the “lost” memory? And, how/why is the proposed solution more effective than running imports.system.gc() in lookingglass?


  5. Hi, your final remark “To some degree, I think this is a symptom of users being completely disconnected to GNOME development itself.” stuck with me; it was an item on a JupiterBroadcasting show as well. I thought it would be a good idea to bring to your attention the first example I would see of one-sided decisions made by a GNOME dev, regardless of what users might want, and also “final”, as in “without the ability to opt in if people would still prefer to have the functionality”.

    So it took about a month for the recent announcement to remove the ability to launch programs from nautilus. Things like this have a deeper impact than you might think. It is not necessarily about the decision itself, just the way it is done, and how users feel their opinion does not matter anyway.

    Just wanted to add my 2 cents here, as you genuinely seem to care.


  6. Hey George, I just wanted to let you know that the workaround has made its way into Debian’s 3.28.2 build and so far it seems okay. I’m still running a few tests (with and without the extension that was mentioned in another comment), but so far I’ve left my PC running overnight and gnome-shell’s memory usage hasn’t gone above 200 MB.

    Good job!

