Is there a legal reason to have license block in every single PHP file?
Closed, Resolved (Public)

Description

It wastes space.
It adds noise to the diff of the first change to any file every year.
It also confuses Git copy detection, see D3878 for example.
Also, why don't we have a license in JS and CSS files? And images? What exactly does that mean?
If not deleted, can the license blocks at least be converted to a one-liner (a URL)?

Event Timeline

"Applying the license to new software" states: "Each original source document SHOULD include a short license header". It says SHOULD, not MUST. I'll try to delete the headers, since they are causing trouble.

@davidrecordon was the guy I used to go to at FB for these sorts of questions. I am not sure who owns open source nowadays. I am conceptually down with "less" always. :)

I completely agree that these headers cause concrete problems. I'd love to get rid of them, but I'm not comfortable doing it without an OK from Facebook.

I don't know why we need them or what the legal reasoning behind having them is, or in what scenario anyone would benefit from their presence. I think I questioned the need to add them in the first place, but got some pushback and it seemed like it wasn't the most important battle to fight at the time.

Ideally, I'd like to understand the legal value of having these headers, weigh that value against the cost of maintaining them, and justify their existence or removal based on that analysis. Right now the implementation is pretty half-assed (no notices on CSS, JS, minified CSS/JS, images, other binaries, test cases, documentation, XHPAST, build artifacts, Aphlict, etc). And the copyright years on files with notices are more or less arbitrary. This state of things is simply because it was the bar Facebook set. It's possible that there is careful legal reasoning behind this, but without knowing what that reasoning is it feels completely arbitrary. I've just been maintaining this arbitrary bar because that's what I was asked to do.

I wrote to opensource@ explaining what trouble the headers cause us and why we think it is not obligatory to have them in every single file, and asking whether we can delete them.

Alma Chao (Lead Open Source and IP Counsel at Facebook) is OK with it:

Yeah there's no requirement that we include it in each file if we have it in the root directory. I think we have it in each file just in case people copy single files instead of the entire repo, which allows them to identify the license easier. But it's up to you guys. I'm ok with removing them as long as there's a license file in the main directory.

@davidrecordon will share his thoughts here.

^^ Was exactly my understanding, that we only need one LICENSE file per distro. I might also toss out there that this might be a good time to decide if we want to stick with the Apache license or maybe move to something a bit more protective for building out a SaaS company.

Facebook is the only entity that has a copyright license to the entire codebase, so we'd need to work something out with them if we want to change the license (or fork and have some weird mixed license). I'm comfortable with the status quo of granting copyright license to Facebook and distributing under an Apache license, but now is probably the right time to talk about it if you or anyone else has concerns.

My general thinking is that a permissive license is good for the Facebook/Phacility relationship, that someone cloning the business probably isn't a credible threat, and that having a permissive license is potentially an advantage and a differentiator for us. But I'm mostly coming from a point of view of naive idealism rather than, say, cutthroat business savvy.

So basically the options are:

  • Ask Facebook to change the license to something slightly more restrictive (Mozilla Public License or GPL v3).
  • Fork the code and keep portions under Apache (the Facebook contributions) and the rest under our own license.
  • Do nothing, nobody is evil in this world.

I suppose there is a fourth option, which is for Phacility to create a non-profit with Facebook and have it control Phabricator's source. This is how WordPress is structured and might be the most 'correct' way to go, but it is also a bit more work. BTW, who owns the term "Phabricator"?

My only concern with Apache is its extreme openness. As you said, it's unlikely that someone would improve Phabricator and not contribute the source back, but I can see it happening. I mean, if GitHub is fixing Windows bugs, hell yeah, I want those updates.

There are a few reasons software often contains copyright headers on a per-file basis. It's worth figuring out whether any of these are things we care about for Phabricator at this point.

  • Making it easy for someone to identify the license in general when copying code. This may apply more to a file like utils.php in libphutil in terms of someone coming across it in a variety of contexts and the header making it clear that they can use it within their own project (open source or not). This seems to be the thrust behind why the Apache Software Foundation recommends this practice (http://www.apache.org/legal/src-headers.html): "License headers allow someone examining the file to know the terms for the work, even when it is distributed without the rest of the distribution. Without a licensing notice, it must be assumed that the author has reserved all rights, including the right to copy, modify, and redistribute."
  • On the opposite side of that is making it clear when code is not open source. This is why Facebook puts a header on all of our internal source code so that if it were to become available outside the company there's no question that someone coming across it does not have a license to use it. Considering that one argument toward removing the headers on Phabricator is that it confuses Git copy detection, then removing the headers doesn't feel like solving the underlying problem for users of Phabricator. If this is painful for Phabricator development it has to be painful for other projects reviewing code with Differential as well.
  • It's easier to identify the non-Apache code within Phabricator since all of those files have copyright headers as well. Likely not a compelling argument since they're all in the externals directory, but many of the licenses require keeping the copyright header intact so it's a bit of a moot point.

As Alma said, I don't think there's a definitive legal argument that Phabricator source must contain the header in each file, whereas there is one for why internal Facebook source code must. At the same time, if Git copy detection is a real issue, I don't see removing the headers for this project as the best solution. If updating the year is a real problem, I think someone finally scripted it at Facebook and we could just give you that script as well.

@chad Even under GPLv3 they wouldn't be required to contribute it back as long as they were just SaaS, right?

As far as I know no one has trademarked "Phabricator".

Right, Mozilla and GPL are copyleft style (http://en.wikipedia.org/wiki/Copyleft) licenses which trigger upon distribution of the source code. The AGPL is more of the license @chad is describing which defines distribution as running the software on a network.

Ah sorry, you are right. I was mostly concerned with getting the source back if they had made positive changes. It seems like we can solve most of that with continuing to be really open and amenable to contributions. Thanks for the clarification.

@davidrecordon Thanks!

The "identify the license when copying code" point makes sense. At least on our side, this seems pretty low-value though: very few files in the codebase are useful or even usable on their own, as almost everything relies on the library framework to load or execute. The utils and utf8 functions in libphutil are the only exceptions I can think of offhand. We could retain the header in just those files fairly easily and without much cost (they're large enough to not hit Git copy detection, which is the major cost we pay) -- do you think that's worthwhile?

As far as Git copy detection is concerned, we could definitely try to address it technically but I don't see a clear path forward. Git detects copies by looking at line-by-line similarity. Since we have a large number of small files, a 17-line copyright header triggers detection as a copy fairly often. SVN and Mercurial track moves as first-class VCS operations, and do not suffer from this problem (with the tradeoff that manual copies and moves aren't detected). The technical approaches I see are:
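To make the small-file effect concrete, here is a minimal sketch. It is not Git's actual similarity algorithm: `line_similarity` is a hypothetical stand-in (shared lines divided by the longer file's line count), and the header text and class bodies are made up for illustration.

```python
from collections import Counter

def line_similarity(a: str, b: str) -> float:
    """Share of lines two texts have in common, relative to the longer text.
    A crude stand-in for illustration only, not Git's real metric."""
    lines_a = Counter(a.splitlines())
    lines_b = Counter(b.splitlines())
    shared = sum((lines_a & lines_b).values())  # multiset intersection
    longest = max(sum(lines_a.values()), sum(lines_b.values()))
    return shared / longest if longest else 0.0

# Hypothetical 17-line header: opener, 15 text lines, closer.
HEADER_LINES = ["/*"] + [f" * license text line {i}" for i in range(1, 16)] + [" */"]

def make_file(body_lines, with_header=True):
    """Build a small PHP-like file, optionally with the header on top."""
    parts = ["<?php", ""]
    if with_header:
        parts += HEADER_LINES + [""]
    return "\n".join(parts + body_lines) + "\n"

# Two unrelated, made-up classes.
body_a = ["final class A {", "  public function run() {", "    return 1;", "  }", "}"]
body_b = ["final class B {", "  private $x;", "  public function get() {",
          "    return $this->x;", "  }", "}"]

with_header = line_similarity(make_file(body_a), make_file(body_b))
without_header = line_similarity(make_file(body_a, False), make_file(body_b, False))
print(f"with header: {with_header:.2f}, without header: {without_header:.2f}")
```

Under this toy measure the two unrelated files look far more alike with the header than without it, which is the effect described above; the residual similarity without the header comes from shared lines such as `<?php` and `}`.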

  1. Implement explicit move/copy tracking in Git.
  2. Implement explicit move/copy tracking in arc (arc mv, arc cp) which run the underlying command in SVN and Mercurial but do something scary in Git.
  3. Implement with textconv/diff drivers (see T204) so the copy detection doesn't see the headers. I'm not immediately sure where in the pipeline copy/move detection is relative to textconv and diff drivers. We'd want to show them in Differential so this might require diffing twice.
  4. Implement our own move/copy detection and stop using the -C -M flags.
  5. Implement some code which "undetects" spurious Git moves/copies.

None of these seem very attractive to me. (1) would be great but is probably not realistic. (2) through (5) are not so great. We also haven't had any requests to improve this behavior from users that I'm aware of. If there were a more attractive option I'd say we should definitely do this regardless (we still have <?php\n\n in common and lines like }\n that could be safely discarded, and could generally do a better job with a small amount of logic than Git can naively) but I don't see a clean way to implement this -- although we already do some line-by-line copy detection, so maybe some combination of (4) and (5) wouldn't be that bad. @vrana, do you have thoughts on how practical this is?

T784 is also vaguely related.

Even if we had a good solution here, these headers still impose an (admittedly very small) cost every time someone opens a file and has to scroll down past the header, and there's nothing we can do about that technically (well, "write a plugin for every editor..."). Removing them is a cleaner fix, and I think desirable even if we did everything we could on the git/arc/Differential sides.

The "making it clear when code is non-apache" point makes sense in the general case, but I think you're right that they're all moot in this specific case. There are plenty of good technical arguments that align well with the externals/ separation too, so it's unlikely that this will change in the future.

(The copyright year updating stuff is fully automated, it just also has some small-but-nonzero cost that would be nice to stop paying.)

@chad, I live in the same idealistic world as Evan and I don't think that a different license would solve any problem we could have. Not contributing changes back is also a pain for the non-contributor, as upgrades become harder.

On the other hand, I think that we should register a Phabricator trademark, as we could have real trouble if someone else did that. Either Facebook or a Phacility/Facebook non-profit org should do it. I registered a trademark for my project Adminer some time ago to avoid this.

@davidrecordon, thanks for sharing your thoughts. As Evan mentioned, it's almost impossible to use any file separately; the only exception is probably utils.php (utf8.php depends on utils). Plus, if I don't know the license of some file then I simply can't use it, right? That seems like a good default for us.

I agree that making it clear what is licensed and what not (or under a different license) is very important. But I guess that the rule "traverse directories up until you find LICENSE" is obvious enough? We already use this rule for non-source-code files (binaries, docs).
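The "traverse directories up until you find LICENSE" rule can be sketched in a few lines. This is only an illustration of the rule as stated here, not something Phabricator or arc ships; `find_license` and the demo paths are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional

def find_license(path: Path, name: str = "LICENSE") -> Optional[Path]:
    """Return the nearest file called `name` in the directory of `path`
    or in any of its ancestors, or None if there is none."""
    path = path.resolve()
    start = path if path.is_dir() else path.parent
    for directory in (start, *start.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None

# Demo on a throwaway tree (all paths here are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "LICENSE").write_text("license text\n")
    source = root / "src" / "utils" / "utils.php"
    source.parent.mkdir(parents=True)
    source.write_text("<?php\n")

    found = find_license(source)
    found_is_root_license = found == root / "LICENSE"
    missing = find_license(source, name="NO-SUCH-LICENSE-FILE")
```

The nearest match wins, so a directory such as externals/ with its own LICENSE file would shadow the one at the repository root.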

@epriestley, I think that investing a big effort in improving copy detection is not worth it. This problem isn't triggered that often. What would (4) or (5) look like? Ignore the first comment in a file? Ignore known licenses? Allow specifying a regexp for what to ignore?
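For a sense of what the "ignore the first comment" or "regexp" variants might involve, here is a rough sketch. It is an assumption about one possible preprocessing step, not existing arc or Differential behavior; `LEADING_COMMENT` and `strip_leading_comment` are hypothetical names, and the pattern only handles a single leading `/* ... */` block.

```python
import re

# Assumed pattern: an optional "<?php" opener, then one leading /* ... */ block comment.
LEADING_COMMENT = re.compile(r"\A(<\?php\s*)?/\*.*?\*/\s*", re.DOTALL)

def strip_leading_comment(source: str, pattern: re.Pattern = LEADING_COMMENT) -> str:
    """Drop a leading block comment (keeping the "<?php" opener) so that a
    later similarity check only sees the code. Other text is left untouched."""
    return pattern.sub(lambda m: m.group(1) or "", source, count=1)

# Made-up sample files.
with_header = "<?php\n\n/*\n * Copyright ...\n * Licensed under ...\n */\n\nfinal class A {}\n"
no_header = "<?php\n\nfinal class A {}\n"

stripped = strip_leading_comment(with_header)
untouched = strip_leading_comment(no_header)
```

A caller could pass a different compiled pattern to cover other comment styles, which is roughly the "allow specifying a regexp" option.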

I'd be slightly surprised if Facebook didn't register the trademark when buying domain names; we generally do. In any case, I can check on that.

For companies with proprietary code I'd agree that a default of no clear license meaning all rights reserved is a good thing. I disagree that it is a good thing for open source projects which is what the Apache Software Foundation is trying to say behind their reasoning of all of their projects including the header. If we've all put in the effort of doing a good job releasing open source code, we might as well remove any barrier toward it being used widely.

Also, in this idealistic world, having people know to keep going up directories until they find a LICENSE file works. From what we see internally of people finding code in various places, that's far from the case.

If this problem isn't triggered very often, why is it a problem here pushing the removal of the headers? I'm just having a hard time making sense of the argument since either it's a big problem which I'd then argue we should try to fix for users of Phabricator as well or it's not and we shouldn't be using it as the reason to remove copyright headers :)

The biggest reason for me is the wasted space. The header takes 50% (16/32) of lines displayed in my editor by default. I would much rather see more code than a text with no value for me.

Other reasons are smaller, although the broken copied-file detection is really quite confusing when it happens. I don't see a simple solution here and it isn't worth investing much effort in it. Even if we solved this subproblem, the main problem would remain.

Another reason is consistency - we don't have license headers in images, which are quite easily usable by themselves. On the other hand, most PHP files in Phabricator aren't usable by themselves. If you want to use some code fragment then you need to figure out the license anyway. I just don't see how this can be a problem: "Ah, there's some code in a public repo on GitHub. I wonder if I can use it or change it?" I just can't imagine anyone who wouldn't be able to figure this out and would instead choose not to use the code.

Removing the headers is a quick, clear-cut, simple solution with well-understood technical effects and no product impact. It also solves a lot of smaller issues which smarter arc/Differential/Diffusion handling would not: having to scroll past the headers in editors, approve automated application of them from the lint pipeline, dealing with them if you want to grep for a term which appears in them, etc. Around 12% of the lines and 16% of the bytes in the PHP codebase are license headers.
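Figures like these can be measured with a short script. The sketch below is not how the 12% and 16% numbers were produced (the thread does not say); it assumes the first `/* ... */` block in each file is the license header, and `header_share` is a hypothetical helper.

```python
import re

# Assumption: the first /* ... */ block in a file is its license header.
FIRST_BLOCK_COMMENT = re.compile(r"/\*.*?\*/\n?", re.DOTALL)

def header_share(sources):
    """Return (line_fraction, byte_fraction) taken up by header comments
    across an iterable of source texts."""
    header_lines = header_bytes = total_lines = total_bytes = 0
    for text in sources:
        total_lines += text.count("\n")
        total_bytes += len(text.encode("utf-8"))
        match = FIRST_BLOCK_COMMENT.search(text)
        if match:
            header_lines += match.group().count("\n")
            header_bytes += len(match.group().encode("utf-8"))
    if not total_lines or not total_bytes:
        return 0.0, 0.0
    return header_lines / total_lines, header_bytes / total_bytes

# Made-up sample file: 3 header lines out of 7, 12 header bytes out of 28.
sample = "<?php\n\n/*\n * x\n */\n\ncode();\n"
line_fraction, byte_fraction = header_share([sample])
```

In practice the input would be the contents of every PHP file in the repository rather than a single sample string.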

The biggest issue for me personally is really just that they feel like clutter. I'm generally very happy with the other 88% of the codebase, and if I'm not I can fix it, but every time I open a file there's a big block of boilerplate that I can't do anything about.

Solving the problem generally in tooling is desirable, but it is a very complicated, murky solution with a lot of limitations (e.g., it won't do anything for editors or grep or feelings of clutter or wasted space) and open technical and production questions: how do we actually do it? All of the approaches above take a long time to develop and have tradeoffs. How do we communicate what we've done to the user, so they aren't surprised when "git" produces different results than Differential does? Most of the approaches would also require the end user to either change their behavior (run arc commands instead of git commands) or configure things (specify patterns that should be ignored or undone in move heuristics). AFAIK no users have asked for this feature, so it's possible that it would benefit no one else even if we did build it. And we can possibly now offer them a simpler, more easily understood solution if they do ask for it: remove the headers.

avivey changed the visibility from "All Users" to "Public (No Login Required)". Feb 26 2016, 7:03 PM