Monorepository and polyrepository: two poor solutions to many unsolved problems. Part 1

TL;DR

I feel that code-keeping doesn’t have a satisfying solution these days. However, if you need to solve the problem right now, consider a monorepository, as the approach currently has the best tooling.

If you are ready for a long trip into the world of software engineering problems and feel open to ideas that could evolve code-keeping practices everywhere, please continue reading.

Defining the topic

Monorepositories and polyrepositories are relatively new discussion topics. The first mention of monorepo on Twitter is from 2011 (or from 2010, if we count the phrase “mono repo”). The first tweet containing the word “polyrepo” is from 2013. These are two approaches to organizing your source code in either one repository or multiple repositories.

It’s worth noting that there is no agreement on the name of polyrepo. The word “multirepo” is also in use, and it appeared even earlier than the other two.

In the case of a monorepository, you create a single repo containing multiple projects divided by folders. When using polyrepositories, each project has its own dedicated repo. There is also a hybrid approach, toward which most of us probably drift over time. Still, we will concentrate on the two extremes to highlight the differences better.
The most extreme example of a polyrepo I’ve ever seen is the Predix Design System space on GitHub. Each UI component there has a dedicated repository. This approach means a separate set of supplementary files for each element of the UI. A dropdown has its own license, and it’s the same for a tooltip. It’s scary to imagine the amount of work required to support this separation of concerns.

Overcoming the subtle naming differences

One thing that differentiates an experienced programmer from a newcomer is the ability to see differences among almost identical phenomena. Let’s imagine a function, readAloud, that takes this article and reads it aloud for you. If we invoke it as readAloud(article), we’ll continue calling it a function. Invoking it as article.readAloud() brings us to a different world with a different name for a notion producing the same result: a method. With the function, we can be in either a structured or a functional world. Using methods brings us to the world of object-oriented programming.
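
Here is a minimal sketch of the distinction in TypeScript; the speak helper and the class body are hypothetical, just enough to make the two call styles concrete:

function speak(text: string): void {
  // A stand-in for a real text-to-speech API.
  console.log(`Reading aloud: ${text}`);
}

// The standalone function: the structured/functional style.
function readAloud(article: { text: string }): void {
  speak(article.text);
}

// The method: the object-oriented style, invoked on the object itself.
class Article {
  constructor(public text: string) {}

  readAloud(): void {
    speak(this.text);
  }
}

readAloud(new Article('Monorepos and polyrepos…'));  // a function call
new Article('Monorepos and polyrepos…').readAloud(); // a method call; same result, different world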

A similarly subtle difference surrounds the term “monorepository”. What is the difference between a monorepository and a monolith? Does one imply the other? Does a polyrepository have something in common with microservices?

No, all of these concepts are very different. When talking about a monolith or microservices, or even services, we mean live instances of the application: the state that code takes after pressing the “run” button, in the shape of one process or many. A monorepository and a polyrepository are approaches to organizing your code blueprints; they are the way you structure the files that contain the code.

A monorepository can either lead to a monolith or to any other deployment approach. A polyrepository has the same possibilities. On the one hand, you can develop several microservices in dedicated repos, and on the other, you can create a few open-source libraries with a corresponding repository for each. Then you can utilize them in your monolithic application living in a dedicated repo.

Now we understand that when talking about something named “repository”, we mean the way the source code is organized.

Why do we even bother?

As I’ve already written in some of my previous articles, I like the “why” question. I’ve recently finished reading the Toyota Production System book, and it is also famous for introducing the “five whys” method to the world. The idea is to ask “why” five times in order to dig deep enough into the reasons for things. On the one hand, you’ll quickly pass over the symptoms and avoid any attempt to fix them, leaving the problem untouched. On the other hand, you won’t go too far into the things you can’t influence, e.g., the culture of your fellow countrymen or the existence of the four seasons.

The reason we care about the topic is that the issue of storing code is unavoidable for anyone writing it. We must also consider that versioning code with a version control system is now the de facto standard in the industry. Almost everyone uses Git, and the major code hosting services, such as GitHub, Bitbucket, and GitLab, are built around it. There is no way to bypass this fact.

What is the state of thought in modern code organization?

Judging by what I’ve seen while researching this post, people are trying to wrap up this discussion. Some articles claim they will end the ongoing debate; others ask you to stop considering the alternative for specific reasons. Neither side is convincing enough to make me stick to one single solution. The examples of Google, Facebook, Microsoft, Uber, Airbnb, and Twitter utilizing the monorepository approach make me a little nervous about contradicting it. On the other hand, the flaws are so apparent that I can’t say humanity has reached the ultimate solution for organizing source code storage.

What is the plan?

This topic is difficult to write about because of the scale at which the problem becomes observable. Also, during our investigation, we’ll notice many different variables that can bring us to one side or to the other. Unfortunately, this article won’t be as practical as my previous ones. We will not build live examples, but we’ll concentrate on moving the whole discussion forward, bringing attention to the topics not yet covered.

Please also remember that my primary area of knowledge is the front end. Thus, I may miss some of the particularities that back-end developers experience. However, I plan to cover things that are common to both worlds.

In the following sections, we’ll investigate the claimed pros of each approach and assess them. Then I’ll share my view on what we’re missing to upgrade the whole practice of code storage. There should be room to probe the limits of the dogmas we take for granted.

A deep dive into the discussion

Earlier I mentioned several famous companies known for using monorepositories. You may find the very same list on the dedicated Wikipedia page. Their presence on this list doesn’t mean that all their code exists in a single monorepo. E.g., Google has a GitHub space dedicated to its Material Design components, and there are 26 repositories in it. Microsoft has a Git repository for the Windows system, and there are no other projects in it. It was 270 GB in 2017 and required additional engineering effort to stay maintainable.

These examples demonstrate that even the term “monorepo” can mean different things. Also, we can see that companies do use the hybrid approach despite having the reputation of being monorepo advocates. The common ground here is Git scalability problems.

What praise do monorepos get?

At the time of writing this article, monorepos have gained the most significant support from the worldwide programming community. Google, Facebook, Microsoft, and others invest effort into making monorepos easier to use, as they usually do with their tooling.

So why do people love monorepos? Are there incontestable reasons for their love?

Importing necessary parts is easy

When you have all the necessary parts of the project locally, the ease of importing comes out of the box. It’s worth mentioning that the difference from a polyrepo is not that, with a polyrepo, your solution uses remote dependencies. The actual difference is that in a monorepo, the locations of all modules are known up front, without additional provisioning: the file structure is predefined. In a polyrepo, you have to put in extra work to set everything in the proper order.

Using the importing features of a programming language is more straightforward than going the library route: packaging parts of your software and publishing them to a dedicated registry. Once you take the registry route, the ease of importing evaporates for monorepos and polyrepos alike.
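
A quick sketch of the difference; the module paths and the package name are invented for illustration:

// Monorepo style: the module's location is fixed by the repository layout,
// so a plain relative import works without any publishing step.
import { formatDate } from '../../shared/utils/format-date';

// Registry style: the same helper must first be packaged, versioned, and
// published; only then can another project import it by package name.
// import { formatDate } from '@my-org/date-utils';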

Cross-project changes are easy

How easy this is also depends on how you export your dependencies. If your approach is local, path-based exporting, then yes, you’ll see the results of your changes on the fly.

However, I’ve tried applying cross-package automatic refactoring to the Jest project, which uses Lerna for monorepository management. My WebStorm IDE couldn’t cope with the task. Possibly, the reason is that Lerna wires up the local packages via the package.json file, which introduces a level of indirection the IDE cannot see through.

Another possible obstacle is that some monorepositories are so huge that they require partial loading. As I mentioned before, the Microsoft Windows repository weighed 270 GB in 2017. Cloning the whole project would take 12+ hours, and running git status would require 10 minutes. Microsoft created the Git Virtual File System tool to allow on-demand file loading. It works like the desktop clients of cloud storage services: you see only references, not the whole files, until you open them. This type of cloning prevents our tools from analyzing the entire codebase locally, even with a simple grep.

Despite the inability to use automatic refactoring tools in either case, monorepositories have another advantage here. You can create a branch and apply an isolated, cohesive update to the whole project. Polyrepositories can’t do that; they require manual synchronization or custom tooling, which means they tend to drift out of sync. I once experienced this and was then asked to arrange a post-mortem; the cause was a broken CI/CD pipeline for the back end.

When a project is in a very early stage, and the walls are shifting every day, a monorepo seems a pretty fitting solution.

Improved reusability

This positive side of the monorepo indeed exists. However, as you might have already noticed, this article is more a bitter pill than a sweet candy. As written above, your advantage in the case of a monorepo is the predefined positioning of existing modules. We already know that this advantage exists only for relatively small monorepositories that do not require virtualization. Huge repositories lack discoverability, and thus the improved reusability fades. Polyrepositories are not much better here. You can try searching in your code hosting tool and then start cursing it up and down: the UX of the default tools (e.g., GitHub search) is extremely poor. Much more sophisticated services exist, like Sourcegraph and possibly others I do not know of. They do a good job, but they do not aim at the core of the problem, and perhaps cannot. Let me go deeper here.
I read the book Facts and Fallacies of Software Engineering a few years ago. We can find two interesting facts in it:

Fact 15. Reuse-in-the-small (libraries of subroutines) began nearly 50 years ago (the 1950s) and is a well-solved problem.

Fact 16. Reuse-in-the-large (components) remains a mostly unsolved problem, even though everyone agrees it is important and desirable.

The problem from the second fact exists both in monorepos and polyrepos. The reason behind it is not the technical weakness of existing tools but the significant amount of managerial activities required to create something reusable for multiple teams.

You may think that the focus on the second part means no problems with reuse-in-the-small. Sorry, but we also have problems here. Consider the following story about a small digit-filtering function:

Imagine that you need a function to clean incoming strings of everything but the numbers of the decimal system. You sit for a while, looking through the shared files of your project. Looking into the utilities doesn’t lead to any success, and you decide to write something on your own:

function keepDigitsOnly(input: string): string {
  // Keep every run of decimal digits and glue the runs back together.
  const ONLY_DIGITS = /\d+/g;

  return input.match(ONLY_DIGITS)?.join('') || '';
}

Your solution seems to work fine; you put your function into the project’s utilities folder, make a PR, and deliver the new functionality. After a while, a QA engineer comes to you and reports a problem with the digit-filtering feature. It occurs in a module you never even touched, and you certainly didn’t use your function there. Maybe someone else did?

It turns out that in one rarely visited corner of the project, there are some local utilities, and they contain the following function, written a few years ago by a developer who no longer works here:

const removeAnythingButDigits = (characters: string): string => {
  const DIGITS = new Set(['0', '1', '2', '3', '4', '5', '', '7', '8', '9']);

  return characters
    .split('')
    .filter((character) => DIGITS.has(character))
    .join('');
};

Do you see the omitted “6” here? Yes, this is the root cause of the problem the QA engineer discovered. As a solution, you remove the local function written by your predecessor and use yours. Everything looks fine, except that you are still unprotected from the very same problem next time. And it doesn’t matter how many repositories you have.
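
For what it’s worth, a single shared sanity check, hypothetical and independent of the repository count, would have exposed the divergence immediately:

// Both implementations should keep every decimal digit intact.
const digits = '0123456789';

console.assert(keepDigitsOnly(digits) === digits);         // passes
console.assert(removeAnythingButDigits(digits) === digits); // fails: the '6' is dropped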

Why does reusability suffer? How did the duplication sneak in there? It came from the attempt to solve the same problem twice. What are the tools for problem discovery in your projects? Do you have a searchable registry of the issues solved in your source files? I don’t.

As a side note, solving the same problem several times doesn’t always mean something disappointing for us. Think of local food or local architecture. Our lives would be much duller if we all agreed to the first outcome we discovered and it magically spread worldwide.

Speaking of magic, that brings us to a small-scale remedy for the reusability problem, and it, too, has nothing to do with the number of repositories you have. The key is to keep Conway’s law in mind:

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.

It doesn’t apply only to organizations, but to teams as well. You’ll have poor reusability despite a well-designed code-keeping strategy if you do not regularly communicate with your teammates about the solutions you create or do not use a good structure for them. I want to reemphasize that this approach has some scalability issues. Dunbar’s number might be food for reflection in this case.

Straightforward introduction of common CI/CD policies

I am not a DevOps engineer, but investigating GitHub Actions gave me some sense of why it’s easier to set up common CI/CD for a single repo than for multiple repos. Yes, it is indeed easier: you only need to maintain a single YAML file instead of numerous files, and you can set up path-based triggers using the paths property.
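
A minimal sketch of such a path-based trigger in GitHub Actions; the folder name and the job itself are hypothetical:

# Run this job only when files under packages/web-app change.
on:
  push:
    paths:
      - 'packages/web-app/**'

jobs:
  build-web-app:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test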

However, you are not trapped when you’d like to have a single workflow for multiple repositories: reusable workflows are at your service. I hope other tools, like Jenkins, have something similar for eliminating duplication.
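
Here is a sketch of how that looks; the organization and file names are invented. One repository declares a workflow with the workflow_call trigger, and any other repository references it:

# shared-workflows/.github/workflows/ci.yml
on:
  workflow_call:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test

# .github/workflows/ci.yml in a consuming repository
on: push

jobs:
  ci:
    uses: my-org/shared-workflows/.github/workflows/ci.yml@main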

Straightforward introduction of common processes

On a daily basis, I work in a monorepository in Bitbucket. We regularly rethink our processes, and one of the latest upgrades we made was the introduction of the Definition of Done (DoD). For now, the upgrade lives on the periphery of our knowledge base, and as you might suppose, it is not yet an everyday reality but rather a description of something good that happens somewhere, but not here. We want to bring it closer to our daily processes. We can do so by adding the checklist to every PR; furthermore, we can even block the PR merge until everything is checked.
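
As a sketch of that kind of single-point automation, a check with the Danger tool (shown here in its GitHub flavor; the checklist format is my assumption) could fail the build while the PR description still contains unchecked Markdown checkboxes:

// dangerfile.ts
import { danger, fail } from 'danger';

// Assumes the DoD checklist lives in the PR description as "- [ ]" items.
const unchecked = (danger.github.pr.body.match(/- \[ \]/g) || []).length;

if (unchecked > 0) {
  fail(`The Definition of Done still has ${unchecked} unchecked item(s).`);
}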

Whether all your teams agree on a unified Definition of Done or you have a very authoritative manager, this single-point automation can be beneficial. Adding the same approach to multiple repositories and then maintaining it requires manual labor, which is prone to errors and memory lapses.

On the other hand, when you follow the “teams define processes themselves” approach, you might soon face the limitations of the monorepo and the need for the flexibility of multiple repositories.

What praise do polyrepos get?

Well, the fun fact is that I couldn’t find any significant praise that doesn’t start with the words “monorepo proponents tend to brag about X, but we also have it”. One genuine point was about the speed of CI/CD pipelines: each repository gets its own default pool of dedicated workers. I’m not sure a monorepo couldn’t overcome this limitation by manually increasing the available resources.

It makes partial open source possible

When you want to open-source only a part of your project, there is no way around extracting it into a separate repository. This is possibly the only unavoidable reason to use another repo if you prefer the monorepository approach.
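
If that part already lives in a folder of your monorepo, Git can carve it out together with its history; a sketch with hypothetical paths and names:

# Extract the folder's history into a dedicated branch...
git subtree split --prefix=packages/ui-kit -b ui-kit-only

# ...and push that branch to a fresh repository as its main branch.
git push git@github.com:my-org/ui-kit.git ui-kit-only:main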

This concludes my overview of monorepos and polyrepos. In the second part of this article, I’ll discuss potential improvements to industry standards that could lead us to a better code-keeping solution.
