Implement robots.txt
Closed, ResolvedPublic

Assigned To

None

Authored By

	epriestley
	Mar 14 2014, 1:34 PM

Description

We currently seem to have 3-4 separate webspiders aggressively crawling every version of every file in Diffusion. Since the content is loaded via Ajax in most cases, they can't even index anything meaningful. These pages require git operations and are relatively expensive for us to generate.

Diviner looks like it's also a bit of a spider trap, although not as bad (and there's content there, and indexing it could be useful, and it's not as costly for us to generate).

I think T3923 is the most general solution here (force individual clients to back off) but serving robots.txt could be helpful too. Particularly, I suspect no installs are ever interested in spiders generating an index of Diffusion. Can you guys think of any reason to let spiders into Diffusion?

Can you come up with reasonable use cases for giving administrators more control? Given that policies already exist, my thinking is that we should just block /diffusion/ unconditionally and leave it at that for now.
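
As a rough illustration of what "block /diffusion/ unconditionally" would mean for a well-behaved crawler, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt body, the `ExampleBot` user agent, the helper names, and the sample paths are all assumptions made up for this example; they are not taken from the actual commit.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt body: every crawler is kept out of /diffusion/ and
# nothing else is restricted. The real file served by Phabricator may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /diffusion/
"""


def build_parser(body: str) -> RobotFileParser:
    """Parse a robots.txt body held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the (hypothetical) crawler `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Hypothetical paths: one under Diffusion, one under Diviner-style docs.
    for path in ("/diffusion/X/browse/", "/book/example/"):
        print(path, "allowed" if is_allowed(parser, path) else "blocked")
```

Note that robots.txt is only advisory: it helps with spiders that honor it, which is why per-client backoff (T3923) remains the more general fix.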

Revisions and Commits

rP Phabricator
	D8532	rP838f78128544 Add a robots.txt file to disallow /diffusion/

Related Objects

Mentioned Here: T4558: Make Diviner useful for third-parties
T7472: Search repositories using a dedicated index for performance (and SVN support)
T7701: Surface Code/File search better in Diffusion

Event Timeline

epriestley created this task. Mar 14 2014, 1:34 PM

epriestley raised the priority of this task from to Normal.

epriestley updated the task description.

epriestley added subscribers: epriestley, chad, btrahan.

epriestley edited this Maniphest Task. Mar 14 2014, 2:23 PM

epriestley edited this Maniphest Task. Mar 14 2014, 6:53 PM

Closed by commit rP838f78128544.

At Wikimedia we would prefer to allow search engines to index our code (https://phabricator.wikimedia.org/T76992). However, it does seem like it might not be that useful if the spiders can't make sense of it. Is it really likely to generate a ton of load without generating useful search results?

Can you give me an example of the kind of integration you're interested in? For example, what's a good query I can plug into Google or Bing and get a reasonable result in a codebase for?

I'm not sure if this is a good idea or if I'm just talking out my ass here, and it certainly isn't high priority by any means. But with that said:

Example use case: I often use Google to look for uses of a class, e.g. looking for examples of Phabricator custom fields in the wild. I would search for "extends PhabricatorCommitCustomField" or "extends DifferentialStoredCustomField" to see if anyone published their extensions to Phabricator so I could get a better idea of how these fields are used. In this case, sadly, I didn't find much at all, though it probably isn't because of the robots.txt exclusion in Diffusion. That's just my own recent use case. Another less concrete example would be source code comments: they contain prose that often explains something about the code which might not be published elsewhere, sometimes with references to related bug trackers, etc. My totally unfounded assertion is that there is probably some value in making them findable.

It's fairly obvious that most search engines aren't very well optimized for searching code. With the exception of a few specialized code search engines, it seems there is a lot of room for improvement.

Maybe it would be useful to offer a simplified view of the code, without so many links to revisions, logs, tags, branches, etc. Something like a plain git-log with a lot less cross-linking? This could be cached and offered to robots as well as anonymous / JavaScript-disabled clients?

Anyway this probably isn't worth the effort right now but it might be at some point?

...so I could get a better idea of how these fields are used. In this case, sadly, I didn't find much at all, though it probably isn't because of the robots.txt exclusion in Diffusion.

Spending time on this in the hope of improving this use case seems pretty low-value compared to, e.g., improving the documentation or organizing community resources.

Another less concrete example would be source code comments: they contain prose that often explains something about the code which might not be published elsewhere, sometimes with references to related bug trackers, etc.

Maybe we could just extract this into the documentation. But you can already, e.g., search Google for "phabricator mpull", and get a link to the documentation, then click from there into the source code.

Basically, my belief here is that Diviner (or other equivalent generated documentation) is a better target for essentially all source queries -- and it's already indexed, and Googleable. I don't think the (substantial) effort to create a spiderable view of Diffusion is likely to be worthwhile, and may make things worse for many queries (if searching for, e.g., PHP functions took me to the source code instead of the documentation my life would be worse overall). GitHub also denies source indexing, and I generally don't know of any class of queries where source is broadly preferable to documentation as a result set.

I think the situation is simply that source is preferable to nothing, and is mainly useful when there is a lack of documentation, which is the case with a large portion of source code. The Wikimedia task covering this topic is referring to our own code, which doesn't use Diviner and is documented on-wiki but only sporadically. Obviously this isn't anything we expect the upstream to be concerned with; I was simply cleaning up one of our workboards and it sparked a few thoughts on the subject.

Thanks for the considerate response, as usual!

Since your code is mostly PHP, you could possibly just run Diviner on things to get something for Google to hit. That might be way off (and we may not be able to parse/atomize things in the MW codebase correctly), but I could see it being at least as good as indexing the source itself. If someone searches for a class or function name, I think getting the definition is almost always way more useful than getting callsites -- maybe even if the hit is shell docs with a link to the definition. And in some future world where everyone has infinite time to generate wonderful stable docs, the doc hit seems clearly preferable.

I'm not sure that more than, like, 1% of projects ever get that far, and maybe this isn't a practical goal that makes sense for very many projects (we probably have like a 10% reasonable-documentation rate, if that), but projects concerned about googleability are probably at least more likely to have (or want) meaningful docs.

In any case, since we want nice docs some day, it's easier for us to prioritize Diviner (T4558) -- which we perceive as the "right" solution here, for the most part -- than indexable source (which feels like a lot of effort pursuing a second-rate result that isn't really a very good solution to many problems).

You can also do source code search within Phabricator (see T7701, T7472). I think improving this is the "right" solution to some problems which indexing the codebase could solve only approximately or poorly.

Very good suggestions, and I will look into running Diviner on our codebase to see how that looks. It might be as good as or better than what I was imagining in my previous comment. And Phabricator code search is definitely a nice and desirable feature. That is probably a lot more valuable than search-engine-friendly Diffusion.

Thanks!
