Conquering my academic demon
In 2021, I was about halfway through a Master's in Computer Science at Columbia University. This was during COVID, so classes were being held remotely. I was working in a standard software engineering job at the time and decided to attend school part-time while the campus was closed, meaning I would take one class at a time and do my coursework in the evenings after work. The class I was taking that semester was called "Neural Networks and Deep Learning", a required course for those on the machine learning (ML) track of the Master's.
As someone who had been studying/using machine learning on and off for about eight years at that point (I began my Master's several years after finishing a B.S. in Computer Science and working in the tech industry), most of the course was redundant for me. I had already done a lot of "multi-layer perceptron from scratch" tutorials and seen every convolutional neural network GIF under the sun. Despite this, I was not able to get the course requirement waived or substitute a more advanced course in its place. Stuck with it, then, I decided I would make the best of the situation. I would engage as much as possible with the review work to make sure I was solid on my fundamentals, while using any extra time I had from pre-existing knowledge to look for additional learning opportunities. Such an opportunity came in the form of the course's final project: a completely open-ended prompt that simply required us to pick any (neural network related) topic, write some code, and type up a pseudo-research paper in LaTeX.
Generally speaking, I am not great at coming up with novel research ideas on my own. Point me at something specific you need built, or something you need optimized, and I will work tirelessly and diligently towards the task's completion. But coming up with something truly novel, truly unique? That is not one of my strengths, especially under deadline pressure. When such ideas do arrive, they are the product of luck, subconscious processes, or one of the muses showing up at my house by mistake. The point is that I can't force inspiration. So rather than chase novelty relative to the wider research community, I prioritized novelty relative to my own knowledge, and decided to choose a project to learn about an area of machine learning I was quite weak in: reinforcement learning (RL). More specifically, I decided I wanted to teach an agent how to solve Rubik's cubes from scratch.
The more I thought about this idea, the more sense it made. Rubik's cubes seemed to me to be much simpler than games like Chess and Go that RL had already conquered, so they would be within my computational budget, while still being complex enough to not permit lucky or random solutions. Something was missing, though. Maybe it was because I was worried RL was already a solved problem, or maybe it was because of the nature of the Rubik's cube itself, but for whatever reason I started viewing the problem in terms of graphs, which led me naturally to graph neural networks, another topic exploding in popularity at the time. Rather than pick between RL and graph learning, I decided I would do both. At this point it was definitely substantial enough to be a final project. It featured two trending topics that I could compare, contrast, or combine to my liking.
My plan was absolutely genius. It would allow me to gain more from the course than just strengthening my fundamentals; I would be gaining important new knowledge. In two different areas! What could go wrong?
As you probably guessed, quite a bit. I found that off-the-shelf machine learning algorithms were not as stable as I expected. Even with a "simple" problem like a Rubik's cube, which is a fully observable Markov decision process, easily expressed as an OpenAI gym environment, I could not get the state-of-the-art open source frameworks to solve the problem better than a hand-written tabular Q-learning algorithm. I spent many sleepless nights bashing my head against my laptop trying to get my custom PyTorch RL set-ups to work. And that was the more successful half of the project. Graph learning was a total nonstarter. I tried to combine RL and graph learning with a novel algorithm, but with no GNN fundamentals to serve as a template, I had no way of coming up with a sensible architecture and no way of judging novelty in the first place.
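For reference, the hand-written baseline I mention is simple enough to sketch. The following is an illustrative tabular Q-learning loop, not my original code; the gym-style `reset()`/`step()` interface, hyperparameter values, and `n_actions` default are all assumptions:

```python
import collections
import random

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.99,
               epsilon=0.1, n_actions=12, max_steps=30):
    """Tabular Q-learning over hashable states.

    Assumes a gym-style env exposing reset() -> state and
    step(action) -> (next_state, reward, done, info).
    """
    # Q-table: each unseen state starts with all-zero action values.
    Q = collections.defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            if random.random() < epsilon:  # epsilon-greedy exploration
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward, done, _ = env.step(action)
            # Bellman update toward the one-step bootstrapped target.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q
```

The appeal of the tabular version is exactly its stability: there is no network to diverge, just a lookup table nudged toward a well-defined target.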
In the end, rather than mastering both RL and GNNs through this course project, I ended up mastering neither, and ironically ended up developing a bit of a phobia of both (especially RL). My ego had taken a hit, and though I would have liked to take the time to improve my understanding and results, in just a couple of weeks the next course would begin, which combined with my day job would command my full attention.
By sheer coincidence, shortly after the course ended, I came across a video from YouTuber Simon Clark titled "Conquering my academic demon". (Yes, I stole the title verbatim.) In the video (which I highly recommend viewing), Simon recounts his own struggle with the course in quantum field theory in his fourth year reading physics at the University of Oxford. He discusses the multi-faceted reasons behind his initial "failure" and the psychological effect it had on him at the time. Revisiting the subject years later with the objectivity only time and physical distance can bring, he was able to "conquer his academic demon" by both learning the subject in the literal sense (i.e. mastering the basic physics of quantum field theory) while also recontextualizing his original struggle, showcasing intellectual and psychological growth.
Inspired by his journey, I vowed to myself that I would one day revisit my own academic demon (this final project for "Neural Networks and Deep Learning" class). I wanted to master the subject matter (at least the basics) and in so doing improve my own "meta-learning" strategies (i.e. learning how to learn). No matter how long it took, and no matter how much personal and professional life got in the way, I simply had to teach a machine to solve a (pocket) Rubik's cube in a proper way. I'm happy to say, about four years on from the original project's due date, that I finally have.
Why I failed (the first time)
Scope creep
The first and most obvious thing you probably noticed about my proposal is that it is essentially two disjoint proposals: solve the cubes with reinforcement learning and solve the cubes with graph neural networks. If I were well-grounded in one of the topics and wanted to extend my knowledge to the other, combining them in a novel way, that strategy could have made sense. Unfortunately, since I was going into the project without a solid foundation in either, I ended up struggling with the initial (re-)implementations of algorithms, which left no time for novel combinations. I should have simply picked one topic or the other, rather than biting off more than I could chew. It is much better to master one subject than make no real progress on either.
Misunderstanding the research landscape
In 2013, DeepMind agents could beat most Atari games. In 2016, AlphaGo beat the human champion in Go. In 2019, DeepMind agents were grandmaster level in StarCraft II. At around the same time, OpenAI had a system on par with the best humans at Dota 2. Reinforcement learning looked unstoppable. As long as you had an environment with clearly defined reward, state, and action spaces, it should be simple enough to point an algorithm at it, leave it running for a while, and come back to an agent that had utterly mastered it. Which algorithm hardly seemed to matter: dozens of new acronyms claiming to be state-of-the-art were being added to the RL zoo every year.
Thus I thought solving a Rubik's cube would be quite trivial in 2021, since I would say it is a harder problem than Atari but easier than mastering Go, Chess, StarCraft II, or Dota 2. While that ordering is certainly true, and my later success with the problem shows that it isn't impossible, I can now say with confidence that it is much harder than it sounds. After spending more time with the subject, I now understand that training networks with reinforcement learning is much less stable than supervised learning, and state-of-the-art results are much harder to replicate. To get the amazing results I linked above, teams with dozens of experts spent months tuning the systems and fixing (often very silly) bugs to stabilize training. They also had a lot more compute to work with. These undocumented tips and tricks seem to matter far more in RL than they do in supervised learning. I say this for the following reason. If you take an architecture like a convolutional neural network (CNN) and apply it to image classification, or a transformer and apply it to next token prediction, you obviously wouldn't get state-of-the-art results for your hobby project either. The difference is that the training is unlikely to diverge completely, especially at small scales (i.e. when working with models with fewer than ten billion parameters).
I had underestimated the difficulty of reproducing top-tier RL results, and to make matters worse, I also overestimated the generalization properties of current deep neural network architectures. Out-of-distribution length generalization and true algorithmic reasoning are still very much unsolved problems, though there has been good progress with specialized architectures in recent years. Solving a Rubik's cube efficiently is inherently algorithmic. Given a certain cube configuration, you repeatedly execute a fixed set of moves until you arrive at a new configuration, then apply the next set of moves, and so on until the cube is solved. When training a transformer or any other network on this task, ideally we would like the network to truly "grok" the algorithmic method of solving cubes, rather than simply memorizing certain moves. Even though my networks were eventually able to solve the pocket cubes with 100% accuracy, I am not convinced they ever did.
It goes to show that even now in 2025, there are interesting contrasts in what AI can do vs. where it still struggles. Moravec's paradox is alive and well. We have seen LLMs achieve gold medal performance at IMO 2025 while still hallucinating basic facts and struggling to count the number of letters in words, so perhaps I shouldn't have been surprised that this project ended up being non-trivial.
Working hard but not smart
My original failure was not due to laziness or a lack of work ethic. This is harder to admit, as it would be less bruising to my ego to claim I simply didn't try the first time around. The truth is I did try, I just did so very ineffectively.
To be fair, the early stages were productive for me. Setting up a class to represent the cubes, converting this to an OpenAI gym environment, and running open-source implementations of common RL algorithms were all accomplished without much trouble. These tasks were all well within my wheelhouse as a software engineer and are amenable to unit testing.
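To give a flavor of that set-up stage, here is a minimal sketch of a gym-style environment for a permutation puzzle. The class name, the sparse reward scheme, and the toy 4-element move tables are illustrative stand-ins; the real pocket cube plugs 24-sticker permutation tables into `moves`:

```python
import random

class PermutationPuzzleEnv:
    """Gym-style environment for any puzzle whose moves are permutations
    of a state vector. A pocket cube fits this mold: 24 stickers, 12 moves.

    Each move is a permutation `perm` where next_state[i] = state[perm[i]].
    """
    def __init__(self, solved_state, moves, scramble_len=10):
        self.solved = tuple(solved_state)
        self.moves = moves
        self.scramble_len = scramble_len
        self.state = self.solved

    def _apply(self, state, perm):
        return tuple(state[i] for i in perm)

    def reset(self, seed=None):
        # Scramble by applying random moves from the solved state.
        rng = random.Random(seed)
        self.state = self.solved
        for _ in range(self.scramble_len):
            self.state = self._apply(self.state, rng.choice(self.moves))
        return self.state

    def step(self, action):
        self.state = self._apply(self.state, self.moves[action])
        done = self.state == self.solved
        reward = 1.0 if done else 0.0  # sparse reward: +1 only on solve
        return self.state, reward, done, {}

# Toy instantiation: a 4-sticker "puzzle" with one rotation and its
# inverse, standing in for the real pocket-cube move tables.
toy_moves = [(1, 2, 3, 0), (3, 0, 1, 2)]
env = PermutationPuzzleEnv("ABCD", toy_moves, scramble_len=3)
```

Because the state is just a tuple and the moves are just permutations, every piece of this is trivially unit-testable, which is what made this phase of the project comfortable territory.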
The trouble came when it was time to actually teach the computer to solve the cubes. I hadn't procrastinated, so I still had plenty of time for this part after completing all of the set-up. Better still, there were early signs of life. My initial agents were solving around 25-40% of the trial cubes I was giving them. I hadn't expected instant results, so nothing was out of the ordinary at this point. I had a few theories about things I thought would help performance, or possible places to check for bugs in my implementation, so I leisurely set about trying these ideas and fixing things. Surprisingly, despite my theories being seemingly sound, and some legitimate bugs being caught, the eval numbers were stubborn and never budged from the 25-40% solve rate. Weeks were going by with work getting done but no actual progress being made.
Eventually, I found myself only two weeks from the deadline. Logically I knew there probably wouldn't be time for a major overhaul of my approach at this point. Rather than calmly re-examine the issues I was facing from first principles, the deadline pressure caused me to spiral, and my only strategy was to tweak something in the code, pray, and then wait several hours for it to run. I was staying up late, banging my head against the wall, and watching my training jobs tick along for no particular reason (other than giving myself the illusion I was doing something).
Why I succeeded (the second time)
In some ways, this section is by necessity an inverse of the above section. However, rather than just saying "this time I worked smart, not hard", I will go into the specifics of how I actually did that.
I started small and didn't forget the basics
Because of deadline pressure, by the end of my original attempt at the project it felt like I had no choice but to spend every hour of free time working directly on the project goal, which was to fully solve randomly shuffled pocket cubes from scratch. My original motivation for the project was to master RL and graph learning, but I had completely lost sight of that and became overly focused on the output of the mastery and not acquiring the skills themselves.
The first thing I did the second time around was to spend time with the two subjects completely independently of the Rubik's cube project. When learning a new subject, I find that usually a lot of blog posts and videos are poorly written or don't quite click with me. They are either too high-level to the point of not teaching any new information, or too low-level to the point of irrelevance. Thus for me, it is important to consume content in a variety of different formats from a variety of different sources, until I find something that makes the subject click. For both RL and graph learning, I read dozens of blogs and papers and watched countless videos. Not all of them were useful, but the ones that were more than made up for any "wasted" time on the others. Two resources I used that I highly recommend are the Deep Reinforcement Learning course offered by HuggingFace and Jure Leskovec's Machine Learning with Graphs course on YouTube.
In addition to going back to the basics in terms of properly learning the subjects, I found it necessary to downsize the goal of my project and make progress incrementally. Rather than going straight to solving a fully shuffled cube, I tried to see if I could solve cubes that were just one step away from being solved, then cubes that were two steps away, then three steps away, etc. I started plotting more metrics during training to better understand what was going on, and inspecting algorithm-specific quantities like the Q-values in Q-learning. These things seem obvious but they do lead to interesting insights. For instance, I often found my loss would go up but the network would actually be solving cubes at a higher rate, and vice versa (particularly for the actor loss in actor-critic set-ups). Manually inspecting failure cases is one of the most fruitful things one can do in machine learning. One time I was debugging a certain agent that had poor performance and found that it was primarily due to getting stuck in a loop when attempting to solve the cubes. Fixing its propensity for getting into cycles by increasing its stochasticity was enough to turn a seemingly bad model into a great one.
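The incremental-difficulty idea can be sketched as a small curriculum loop. Everything below is invented for illustration: the helper names, the promotion threshold, and the episode counts are assumptions, not my actual training harness:

```python
import random

def scramble(solved, moves, k, rng):
    """Produce a state by applying k random permutation moves to the
    solved state (so it is at most k moves from solved)."""
    state = tuple(solved)
    for _ in range(k):
        perm = rng.choice(moves)
        state = tuple(state[i] for i in perm)
    return state

def curriculum(train_episode, eval_solve_rate, max_depth=5,
               promote_at=0.95, episodes_per_round=100, max_rounds=50):
    """Train at scramble depth k, promoting to depth k+1 only once the
    agent solves at least `promote_at` of held-out cubes at depth k."""
    depth, history = 1, []
    for _ in range(max_rounds):
        if depth > max_depth:
            break
        for _ in range(episodes_per_round):
            train_episode(depth)        # one training episode at this depth
        rate = eval_solve_rate(depth)   # measured solve rate at this depth
        history.append((depth, rate))
        if rate >= promote_at:
            depth += 1
    return history  # (depth, solve_rate) pairs, useful for plotting
```

Keeping the `history` around is the point: plotting solve rate per depth over time is exactly the kind of metric that revealed the loss-vs-solve-rate mismatches mentioned above.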
Finally, in addition to small incremental tasks building up to the larger goal, it is sometimes useful to construct artificial toy problems, even if they are superficially orthogonal to the original problem. I highly recommend creating the toy problem from scratch rather than using someone else's. For instance, when learning RL, if you use popular toy problems like Cart Pole and Lunar Lander, you will likely end up copy/pasting others' solutions and having the illusion of competence. By developing my own problems, even straightforward ones, I was able to avoid the temptation to copy/paste. I discovered that when starting from scratch in an editor, finding a solution is considerably more difficult, but the sense of satisfaction and the knowledge acquired is commensurately greater.
I did proper sweeps, not just "tweak and pray"
There are many hyperparameters that affect the performance of a machine learning model. Some are obvious (like learning rates), and others are not technically hyperparameters (like the choice of backbone for the model architecture), but can still be swept over as if they were, to compare and contrast different strategies. When I originally failed with the project, my workflow was little more than making a copy of my colab, bundling a few random changes to hyperparameters or some aspect of the RL training loop, running it, and hoping for the best. When things inevitably didn't work, I would close the tab (unhelpfully named something like "Copy of copy of copy of rubiksml (5).ipynb").
A much better approach is to implement all the features, options, and configurations you think should be useful, and then do a simple grid search over the cross-product of all the options. This approach was sufficient for me and easy to implement by hand. It doesn't require any framework imports (although there are some very good ones) or fancy things like Bayesian hyperparameter optimization. It solves a serious problem of the "tweak and pray" strategy, which is that after a few days, you can barely remember which ideas you already tried (and that already failed), and which remain to be tested. Tracking both positive and negative results from the past is vital to know what to work on in the future, so checkpointing and documentation are essential. While a quick tweak and re-run is faster for a single experiment and might be the optimal strategy if your deadline is in exactly one hour, if you have a deadline even just a week away, you will be much better off doing an organized sweep.
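A hand-rolled sweep can be as little as a cross-product over a config dictionary. The hyperparameter names and values below are hypothetical, and `train_and_evaluate` is a stand-in for whatever your training entry point is:

```python
import itertools

# Hypothetical sweep space; every value list is illustrative.
sweep = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "discount":      [0.9, 0.99],
    "backbone":      ["mlp", "transformer"],
    "epsilon_decay": [0.99, 0.999],
}

def grid(sweep):
    """Yield one config dict per point in the cross-product of all options."""
    keys = list(sweep)
    for values in itertools.product(*(sweep[k] for k in keys)):
        yield dict(zip(keys, values))

results = []
for config in grid(sweep):
    # score = train_and_evaluate(config)  # hypothetical training call
    score = 0.0  # placeholder so the sketch runs end to end
    results.append((score, config))      # keep every result, good or bad
results.sort(key=lambda r: r[0], reverse=True)
```

The crucial habit is the `results` list: persisting every (score, config) pair, including the failures, is what kills the "wait, did I already try that?" problem.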
I took time away from the problem
To be fair, this one is very hard to do under deadline pressure and was much easier to do when I was able to work on the problem freely with no deadlines. But despite this advice not being applicable in all situations, it is important to highlight and appreciate its utility.
Time away from a problem is so valuable because as useful as organized hyperparameter sweeps are, sometimes the real solution cannot be obtained with any combination of hyperparameters you've currently implemented. You may need more data or a fundamentally different approach. Time away from the problem allows for more creativity to happen, and prevents a streetlight effect where you only search for solutions in the region that you have already implemented in code.
My first go around, I tried to set goals like "read 50 pages a day" from papers or a textbook. This works reasonably well for literature classes but not as well for technical courses since the information content of different pages can be highly asymmetrical. There have been single pages of papers that took me more than a week to truly understand. When you encounter such pages filled with difficult equations, you need to decide whether it's important to learn or whether it is skippable. If it is important to learn, you should take however long it takes to learn it, rather than skip it to keep up with some arbitrary "pages per day" goal.
This approach requires both knowing your limits (to prevent going down rabbit holes) and having the internal discipline to not skip things that shouldn't be skipped or simply not reading anything. When researching a problem, it is common to find a paper that links to dozens of other papers, each of which links to dozens of other papers, in an endless chain. This can become its own form of procrastination where you think you are working towards your goal but are actually getting led farther away. You'll only have so much bandwidth after a long day at work for your side projects, so it is important to effectively prioritize what you use it on. Make peace with not being able to read every paper or book on your subject, and spend some free time away from the subject rather than pursuing every aspect of the long-tail.
I parallelized more, in every way
If you have a large number of options to explore (e.g. a grid search) and a short amount of time to explore them in, your only option is to parallelize your search.
Even for a simple problem like pocket cubes, single-threaded RL isn't super fast. What seems like an experiment that should only take ten minutes to converge could end up taking hours, which doesn't sound like a lot, but in practice it means only being able to try a few ideas in one day. Production RL use-cases achieve impressive wall clock times via many actors generating data concurrently for learners to train on. Due to the high cost of cloud GPU rentals, it was important to benchmark how long it took to train to convergence on various types of GPUs and then determine the optimal set-up to get the most training runs done in the least time at the lowest cost.
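The cost calculation itself is simple arithmetic. All of the benchmark times and hourly prices below are made-up numbers for illustration, not real measurements or current cloud prices:

```python
# Hypothetical benchmark results: minutes for one training run to
# converge, and hourly rental price, per GPU type. All figures invented.
benchmarks = {
    "T4":   {"minutes_per_run": 90, "usd_per_hour": 0.35},
    "V100": {"minutes_per_run": 40, "usd_per_hour": 0.80},
    "A100": {"minutes_per_run": 20, "usd_per_hour": 1.10},
}

def cost_per_run(spec):
    """Dollars spent per converged training run on this GPU."""
    return spec["usd_per_hour"] * spec["minutes_per_run"] / 60.0

def best_gpu(benchmarks, budget_usd):
    """Pick the GPU that fits the most full training runs into the budget."""
    name, spec = min(benchmarks.items(), key=lambda kv: cost_per_run(kv[1]))
    return name, int(budget_usd // cost_per_run(spec))
```

With numbers like these, the fastest GPU can also be the cheapest per run, which is why benchmarking beats guessing from hourly prices alone.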
Beyond literal parallelization of the training jobs, it was also critical to parallelize coding tasks and learning tasks. Even for top AI labs with hundreds of thousands of high-end GPUs, there is inevitably still a significant gap in time between when a training job is launched and when it will complete. What you do with that downtime is important. For corporate, cutting-edge large-scale deep learning, there may be a lot of model babysitting involved to make sure faulty hardware or bizarre gradient spikes don't derail the model training. For my training jobs, however, training was much more stable (since I was working with much smaller models). This meant that the effort I spent babysitting models during my original attempt was mostly pointless. The second time around, I used the downtime more productively. I no longer assumed a run would get the eval results I was hoping for; rather, I assumed it wouldn't, and that I would need new ideas and new code to get across the finish line. With this in mind, I would continually watch videos, read papers, and code small demos to test ideas while waiting for other results to come in.
Increasing the amount of work I could do concurrently in a cost-effective way, combined with using downtime more productively, vastly increased my rate of progress.
I didn't trust everything I read on the internet
As much as I recommended consuming online content above, it is not without its pitfalls. A lot of online content creators don't actually know the subject material they are teaching; they are copying content they consumed from others. As in any game of telephone, some information is altered or lost with each copy, leaving content that is incomplete or just plain wrong.
I watched several lectures on actor-critic algorithms from very good sources like DeepMind, HuggingFace, and others, and each one had key differences. This is probably because of undocumented assumptions baked into the formulas (e.g. rewards being 0 or 1, or value estimates being scaled a certain way); if these unstated assumptions are violated, signs can flip or training can blow up. When re-implementing someone else's algorithm, you need to trust yourself to understand the concepts and then implement the core idea in your problem space, which may require your own sensible version of clipping, changing signs, or scaling values to match your inputs and outputs. Using A2C as an example, it was critical to put a stop_gradient/detach on the advantage and TD target so that the agent only learns to make the actor probabilities go up or down rather than trying to drive the advantage to 0.
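To make that detach detail concrete, here is a sketch of one-step A2C losses in PyTorch. The function name, signature, and single-transition framing are my own for illustration; the key point is which terms are treated as constants:

```python
import torch

def a2c_losses(logits, action, value, next_value, reward, gamma=0.99):
    """One-step advantage actor-critic losses for a single transition.

    The TD target and advantage are detached so they act as constants:
    gradients flow only through the actor's log-prob and the critic's
    value estimate, never "through" the target itself.
    """
    td_target = (reward + gamma * next_value).detach()  # constant target
    advantage = (td_target - value).detach()            # constant weight
    log_prob = torch.log_softmax(logits, dim=-1)[action]
    actor_loss = -log_prob * advantage      # push prob up/down by advantage
    critic_loss = (td_target - value) ** 2  # value chases the fixed target
    return actor_loss, critic_loss
```

Without the detach on `advantage`, the actor loss would also push the critic to shrink the advantage toward 0, which is exactly the failure mode described above.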
On the infrastructure side, in languages like Java and C++ we're used to compilers working correctly. It's exceptionally rare for code to fail due to a compiler bug as opposed to user error. PyTorch is also pretty reliable. But there are other, more experimental machine learning frameworks you may encounter that do have legitimate bugs on their end. These bugs can be very subtle and cause nondeterminism. This makes machine learning code harder to debug than regular code, since you must consider not only your own errors but also possible errors in the framework you are using.
Finally, it's not just factual errors in formulas that can be wrong in online content. There is also some received wisdom in the ML community that may not actually be true. One design problem I had to face specific to Rubik's cubes was how to represent the action space. Half of the turns that can be made are equivalent to the other half, e.g. turning the front face clockwise is equivalent to turning the back face counterclockwise. You might think that any neural network should easily be able to learn that. I have so often heard "you don't need inductive biases in the model, as it will easily learn those patterns and relationships given sufficient data". In practice, though, sometimes it will and sometimes it won't. I completely believe in the bitter lesson, but people need to remember that the bitter lesson applies in the infinite data, infinite compute regime, a regime we are only guaranteed on an infinite timeline. If you have limited compute, as I did on this project, and limited data, you can get far more compute-efficient results by applying relevant inductive biases.
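As a concrete example of such an inductive bias, the pocket cube's action space can be folded in half by exploiting the face equivalences described above, instead of hoping the network learns them. This sketch (move names in standard cube notation) is illustrative, not my exact implementation:

```python
# The 12 quarter-turns of a cube in standard notation
# (U = up face clockwise, U' = up face counterclockwise, etc.).
MOVES = ["U", "U'", "D", "D'", "L", "L'", "R", "R'", "F", "F'", "B", "B'"]

# A pocket cube has no fixed centers, so each turn of one face equals the
# opposite turn of the opposite face, up to a whole-cube rotation:
# e.g. F (front clockwise) is equivalent to B' (back counterclockwise).
EQUIVALENT = {"D'": "U", "D": "U'", "L'": "R", "L": "R'", "B'": "F", "B": "F'"}

def canonical_action_space(moves):
    """Fold the 12 moves into 6 canonical generators, halving the
    action space the policy network has to learn over."""
    return sorted({EQUIVALENT.get(m, m) for m in moves})
```

Baking the symmetry in like this means the network never wastes capacity learning that two actions are the same thing, which matters precisely in the limited-data, limited-compute regime.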
Conclusion
Four years after I originally submitted the project, I was finally able to finish what I had started: solving pocket Rubik's cubes with a variety of reinforcement learning and graph learning techniques. What had started as a fun testbed for exploring these subdomains had morphed into my academic white whale, an intellectual quagmire that weighed on my mind for too many years. Finishing the project after all this time allowed me to not just accomplish my original goals (solving the cubes, learning RL, and learning graph neural networks), but also learn how to learn more effectively, how to manage my time and conduct research better, and ultimately grow as a person and be kinder to myself. I am confident that the next time I face a difficult problem I will be able to solve it in a quicker and happier way.
To those interested, the complete code for the project is available at https://github.com/dsteiner93/rubiks-ml. Thank you for reading and I wish you the best of luck conquering your own academic demons!