Localize Phabricator
Closed, Resolved (Public)

Assigned To
Authored By
qgil, Jun 5 2014

Description

Opportunity: Phabricator is ready to be localized.

Problem: it comes without any translations.

Requirement: Wikimedia needs at least the strings visible to users of a bare bones Phabricator instance with Legalpad.

Possibility: create a Phabricator project at http://translatewiki.net and let our community contribute some translations. If translations are already available somewhere, they could be integrated as well. At some point Phabricator could get fresh translations merged directly into your code base, just like we do with MediaWiki (probably the most localized piece of software in the world).

I will try to find volunteers to help set up the project. In the meantime, see https://translatewiki.net/wiki/Translating:New_project

Twin report: https://phabricator.wikimedia.org/T225

Details

Differential Revisions
D9425: Very rough proof-of-concept of inline-translatable strings
D9424: Add a "translate mode" to libphutil
Commits
D16932 / rTRANSLATEWIKIe998f7f5f1f9: Move frequency data to top level when exporting for Translatewiki
D16931 / rTRANSLATEWIKI00108900457e: In translatewiki exports: human-readable types, subdivision by application…
D16825 / rTRANSLATEWIKI1433d08de8f6: Add a "--clean" flag to "bin/translatewiki export" to drop project caches
D16824 / rPafa1bb286044: Fix some grammatical gender constants
D16823 / rP980367452522: Extract variable type information from pht() calls
D16822 / rPHUc6634479d0f1: Add a `phutil_person()` wrapper for the string extractor
D16821 / rPHU5dbc2a8e01f6: Add an optional pht() callback for collecting string frequencies
D16809 / rPHUbf430206c8a9: Fix a minor i18n issue in libphutil
D16810 / rTRANSLATEWIKI6f4a1dfabf01: Provide a rough import/export pathway for translation files in Translatewiki…
D16807 / rP2f93ce4c25be: Don't show "Limited" or "Test" translations unless an install is in developer…
D16808 / rP960c0be6898c: Fix some issues with Phabricator i18n string extraction
D16227 / rPccc7c1b42436: Make i18n string extraction faster and more flexible
D15983 / rP10cc633b88b2: Warn and continue when failing to extract pht() strings
D15980 / rP5b77b86ffbf9: Show translation option names natively, instead of in the current translation
D15979 / rPHU1d216e95b39c: Provide locale support for languages we've seen translations for
D15978 / rP10ffa42504db: Separate locales into more usable groups in the translation menu
D15974 / rPHUec2301efad5b: Allow locale definitions to provide gender/plural rules

Event Timeline

epriestley moved this task from Backlog to The Queue on the Prioritized board. Jul 2 2016, 7:24 PM

I've not looked into the details of the implementation you started, but because it's gettext I thought maybe some links wouldn't harm:

This comment was removed by st_we.
popozhu removed a subscriber: popozhu. Aug 4 2016, 9:22 AM

I'd like to get this task un-stuck. I'm unclear on Phabricator's current internationalization capabilities.

At https://secure.phabricator.com/settings/user/MZMcBride/ I see a "Translation" option below the text "Choose which language you would like the Phabricator UI to use." and next to an associated drop-down menu. I changed my settings to "Spanish (Spain)", but it seems to have had little or no effect. Nothing in the user interface looks different to me, in a Maniphest task view or in the settings pane itself or anywhere else in the user interface. Am I doing something wrong?

chad added a comment. Edited Oct 30 2016, 1:03 AM

The current status is the infrastructure is provided, but the translations are not (you need to either translate it yourself, or check the community wiki page here for community contributed translations). Then install said translations locally.

This task isn't currently stuck on anything other than engineering resources. It's been prioritized and will get completed sooner rather than later.

In T5267#198537, @chad wrote:

The current status is the infrastructure is provided, but the translations are not (you need to either translate it yourself, or check the community wiki page here for community contributed translations). Then install said translations locally.

I'm confused why the drop-down menu for this installation and for Wikimedia's installation at https://phabricator.wikimedia.org/settings/user/MZMcBride/ have user-visible and user-selectable options for a "Spanish (Spain)" translation if the language isn't installed/available/working. Should this issue be filed as a separate task? Where does the drop-down list of languages come from? The user experience is a bit rough, to have a user find and change the option for the user interface language only to discover that the option is non-functional. :-/

I looked at https://secure.phabricator.com/diffusion/P/browse/master/src/docs/contributor/internationalization.diviner, but it's not clear to me what needs to happen to get a Spanish user interface working for Wikimedia's Phabricator installation. We would create or use our own version control repository to house PHP(?) language files that would override Phabricator's system keys/messages, it sounds like. Is that correct? If so, is there an example file that we could start from or would we just duplicate the English file that ships with Phabricator and go from there?

chad added a comment. Oct 30 2016, 1:39 AM

Everything you mention is covered by this task.

chad added a comment. Oct 30 2016, 1:45 AM

If you want to pursue i18n ahead of the upstream, T10814 has a walkthrough I did with another user. It's basically: dump strings into a file, translate, add to src/extensions. https://secure.phabricator.com/w/community_resources/#translations has examples of Chinese and German translations that users have shared.

chad added a comment. Oct 30 2016, 1:47 AM

P1792 is the German example.

I would like to submit a translation to upstream. Where do I start?

It looks like T10814, specifically T10814#170824, provides some useful info. Thank you for that link!

I'm confused why the drop-down menu for this installation and for Wikimedia's installation at https://phabricator.wikimedia.org/settings/user/MZMcBride/ have user-visible and user-selectable options for a "Spanish (Spain)" translation if the language isn't installed/available/working. Should this issue be filed as a separate task? Where does the drop-down list of languages come from? The user experience is a bit rough, to have a user find and change the option for the user interface language only to discover that the option is non-functional. :-/

This link provides some context: https://secure.phabricator.com/diffusion/PHU/browse/master/src/internationalization/locales/.

Would upstream Phabricator accept patches to add files such as P1970 (Spanish translation) and P1792 (German translation) to the main Git repository (i.e., not in an extension or other separate place)? I'm trying to figure out, for Wikimedia's Phabricator installation, whether we should create/use our own Git repository or if we could submit patches upstream.

We could also sync to this Phabricator installation's Paste application if there's a real need to, but it feels very weird to use the Paste application as a version control system for language files.

chad added a comment. Oct 30 2016, 7:27 AM

I would subscribe to the Paste, and spend 20 seconds updating the file if it changes. Like I mentioned before, this task has already been prioritized and we're not going to work on it until other things in the prioritization queue have been completed. I think self-maintaining is a reasonable work-around.

@MZMcBride, couple of questions so we can improve this:

  • Did you notice that "Spanish (Spain)" was under the "Limited Translations" heading, instead of the "Translations" heading?
  • If so, what did you think "Limited Translations" meant or might have meant?

For context, here's what I expected you to see:

I think D16810 should get this started, at least. Known technical issues:

  • When exporting, I don't know how to escape $ for Translatewiki (this affects ~10 strings).
    • Likewise, when importing, I don't know how to unescape $.
  • We don't currently extract docblocks properly. This affects ~50 strings, but most of them are huge blocks of help text in configuration, so they probably are not very high priorities to translate.
  • We don't use (and don't plan to use) string keys. I've generated keys by digesting the source strings. This may or may not be the best approach we can take.
  • We don't currently export any gender/plural information, but we could sometimes provide it automatically because we can infer it statically from the pht() callsites. Unclear how important this really is. Both genders and plurals are likely relatively clear, I think (e.g., "...$1 task(s)..." is consistently a plural).

Likely issues with content:

  • We don't have QQQ metadata/hints. I've autogenerated some QQQ information (links to where strings appear in the codebase) but this may or may not be of much use.
  • We undoubtedly have some strings which are just not great quality. For example, these strings currently appear in the export corpus:
"$1"
"$1 $2"
"$1 $2 $3"
"$1 ($2)"
""
"<$1: $2>"
"$1: $2"

On the one hand, these probably don't need to be translated most of the time. On the other hand, good luck translating them.

sgrimm removed a subscriber: sgrimm. Nov 6 2016, 9:49 PM

  • Did you notice that "Spanish (Spain)" was under the "Limited Translations" heading, instead of the "Translations" heading?

Yep.

  • If so, what did you think "Limited Translations" meant or might have meant?

I expected at least the user settings user interface to be in Spanish after successfully changing the translation drop-down selection. For example, "Choose which language you would like the Phabricator UI to use." would be in Spanish if I selected "Spanish (Spain)". I clicked around a bit and saw no part of the user interface that was in Spanish.

I expected "limited" to mean incomplete or partial. That is, I thought the "main" and most user-visible parts of the user interface would be translated, but I figured that more obscure parts of the user interface might fall back to the English translation.

What does "limited" mean in this context? Where would I see Spanish text in this installation or in Wikimedia's installation currently? I have "Spanish (Spain)" set for both and still haven't seen a non-English string anywhere yet.

I also failed to realize that there's apparently a distinction between registering a translation in Phabricator (i.e., making it available in the settings drop-down menu) and actually providing/installing an interface translation.

The technical definition of "Limited" is that it has fewer than 512 strings, including 0 strings.

Most of the translations in the "Limited" section currently have 0 strings because we've accepted the locale definitions (which define a locale code and plural/gender rules) upstream so that third-parties can provide local translations, but have not yet accepted any translation strings upstream.

I think I'm just going to move forward with D16807, which hides these "Limited" translations unless an install is in developer mode. Anyone adding new translation strings should, at least theoretically, be running in that mode.
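
To make the rule above concrete, here is a minimal sketch in Python (not Phabricator's own PHP). The 512 threshold and the developer-mode exception come from the comments above; the function names are made up for illustration and are not part of any real API.

```python
# Threshold quoted in the thread; everything else here is hypothetical.
LIMITED_THRESHOLD = 512


def is_limited(string_count: int) -> bool:
    """A translation is "Limited" if it has fewer than 512 strings (0 included)."""
    return string_count < LIMITED_THRESHOLD


def is_selectable(string_count: int, developer_mode: bool) -> bool:
    """Limited translations are only offered when developer mode is on."""
    return developer_mode or not is_limited(string_count)
```

Under these assumptions a locale definition with zero strings is still registered, but it only shows up in the drop-down on a developer-mode install.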

Likewise, when importing, I don't know how to unescape $.

You don't need to take care of import. Translate will do all the conversion to and from the chosen l10n format. Just make sure to use strings from the l10n files directly and to make the JSON (flat and) valid.

  • When exporting, I don't know how to escape $ for Translatewiki (this affects ~10 strings).

I don't know what this means. $ in the string contents isn't anything special for translatewiki.net.

  • We don't use (and don't plan to use) string keys. I've generated keys by digesting the source strings. This may or may not be the best approach we can take.

I don't know what would be worse. Well, actually I do: not supporting l10n at all, so props for that ;)

Right now you are suggesting an implementation in which:

  • Fixing just a spelling mistake in original text throws all translations away for that string.
  • Different strings which use the same text are forced to have a single translation, such as "Open" (which can either be verb or adjective and requires different translations in pretty much any language other than English).

Gettext solves the first by fuzzy-matching to reuse existing translations, but even that requires translators to confirm or fix them. Gettext solves the second by letting developers differentiate the uses with msgctxt. Gettext's mistake is not making identical strings different by default, so translators need to keep poking the developers every time they forget to differentiate the strings. It's easy to translate the same string multiple times with the help of translation memory and message documentation, but it is impossible to provide a good translation if a string is used in two different contexts with the same translation.
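
Both consequences of digest-derived keys can be seen in a few lines. This is only a sketch in Python: the thread does not say which digest is used, so SHA-256 truncated to 12 hex characters is an assumption, and the example strings are invented.

```python
import hashlib


def digest_key(source: str) -> str:
    # Hypothetical key scheme: first 12 hex chars of SHA-256 of the source text.
    return hashlib.sha256(source.encode("utf-8")).hexdigest()[:12]


# A spelling fix changes the key, so translations stored under the old key
# no longer match anything.
typo = digest_key("Recieve email notifications.")
fixed = digest_key("Receive email notifications.")
keys_differ = typo != fixed

# Two call sites with identical text ("Open" as a verb, "Open" as an
# adjective) produce the same key and therefore share one translation.
verb_key = digest_key("Open")
adjective_key = digest_key("Open")
keys_collide = verb_key == adjective_key
```

Any content-derived key has these two properties regardless of the hash chosen; msgctxt-style context or hand-assigned keys are the usual ways around the second one.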

  • We don't currently export any gender/plural information, but we could sometimes provide it automatically because we can infer it statically from the pht() callsites. Unclear how important this really is. Both genders and plurals are likely relatively clear, I think (e.g., "...$1 task(s)..." is consistently a plural).

I have no idea what this means either. Having strings such as "...$1 task(s)..." means you don't support plurals at all. Yes, the translator can see there is some weak attempt to work around the issue, but that does not help as he/she cannot provide a proper translation if there is no dedicated support for plural forms.

Here the standard solutions are:

  • Duplicating a plural-using string N times (N varies per language), once for each plural form, and choosing one of the strings at runtime depending on a number and on some logic (which also varies per language). This approach supports only one plural per string and leads to a lot of repetition for longer strings.
  • Supporting an inline plural syntax, such as the one used by MediaWiki. Instead of duplicating the strings, there needs to be a little more code that parses the translations during runtime to replace parts of the translation, but the logic is the same. This approach supports multiple plurals per string.

Now, translatewiki.net supports inline syntax even when a project uses the duplication approach (such as Gettext): we can convert the former to the latter on the fly when exporting from translatewiki.net. This only helps to avoid repetition though, the limitation of one plural per string stays.
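
As a rough illustration of converting the inline form to the duplicated form, here is a Python sketch. It is not translatewiki.net's actual exporter: the regular expression, the function name, and the sample message are assumptions, and it handles exactly one PLURAL block, mirroring the one-plural-per-string limit mentioned above.

```python
import re
from typing import List

# Matches a single inline block such as {{PLURAL:$2|a task|$2 tasks}}.
PLURAL_RE = re.compile(r"\{\{PLURAL:\$(\d+)\|([^{}]*)\}\}")


def expand_plural(message: str) -> List[str]:
    """Turn one inline-plural message into one full string per plural form."""
    matches = list(PLURAL_RE.finditer(message))
    if not matches:
        return [message]
    if len(matches) > 1:
        raise ValueError("the duplication approach allows one plural per string")
    match = matches[0]
    forms = match.group(2).split("|")
    # Splice each form into the surrounding text.
    return [message[:match.start()] + form + message[match.end():] for form in forms]


forms = expand_plural("$1 closed {{PLURAL:$2|a task|$2 tasks}}.")
```

Here `forms` holds two complete strings, one per plural form, which is the shape a Gettext-style backend expects; which form is picked for a given number is still decided by per-language logic at runtime.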

Likely issues with content:

  • We don't have QQQ metadata/hints. I've autogenerated some QQQ information (links to where strings appear in the codebase) but this may or may not be of much use.

Better than nothing, assuming you can handle additions from translatewiki.net and not throw them away when regenerating files.

  • We undoubtedly have some strings which are just not great quality. For example, these strings currently appear in the export corpus: "$1 $2 $3"

At translatewiki.net we mark these kinds of strings as ignored (not shown at all) or optional (allows customization, but not shown to translators by default).

epriestley added a comment. Edited Nov 7 2016, 2:51 PM

I don't know what this means. $ in the string contents isn't anything special for translatewiki.net.

Suppose I have this string in Phabricator:

pht('These cupcakes cost $1 each.');

Today, if we export this in the MW JSON format, it will look like this:

"These cupcakes cost $1 each."

However, that isn't distinguishable from this string, which has the same export format, but means something different, since "$1" is "first parameter":

pht('These cupcakes cost %s each.', $variable_dollar_amount);

Likewise, if a translated string contains the literal substrings $1, {{GENDER:...}}, the | character in a GENDER/PLURAL variant, etc., I'm not sure how to identify that they are literal substrings instead of tokens in the translation grammar.
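
The collision can be reproduced with a naive converter. This Python sketch is not the actual export code in D16810; the function name is made up, and it deliberately ignores details such as a literal `%%`.

```python
import re


def to_mw_placeholders(pht_string: str) -> str:
    """Rewrite printf-style %s/%d as $1, $2, ... in order of appearance."""
    counter = 0

    def replace(match: "re.Match[str]") -> str:
        nonlocal counter
        counter += 1
        return f"${counter}"

    return re.sub(r"%[sd]", replace, pht_string)


# A literal dollar amount and a real parameter end up as the same export.
literal = to_mw_placeholders("These cupcakes cost $1 each.")
parameter = to_mw_placeholders("These cupcakes cost %s each.")
ambiguous = literal == parameter
```

Once both strings read "These cupcakes cost $1 each.", nothing in the exported text says whether `$1` is a price or a parameter, which is why some escaping rule (or not converting at all) is needed.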

Fixing just a spelling mistake in original text throws all translations away for that string.

Yeah, this isn't great. I think this is fairly rare, though, and if it turns out to be a big problem we could mitigate it without switching to string keys, by maintaining a "typo fix" map of the digested keys. I think this could be at least partially automated, and the total size of the map should be fairly small. Typos are also a lot more common in big blocks of text (like help/configuration stuff) which are probably less important to translate: we don't typo the common short phrases that appear most frequently in the UI very often, since someone would tend to notice pretty quickly.
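
A "typo fix" map could look roughly like the following Python sketch. Nothing here is implemented in Phabricator as far as this thread shows: the digest scheme, the map layout, and the example strings (including the Spanish one) are all invented for illustration.

```python
import hashlib
from typing import Dict, Optional


def digest_key(source: str) -> str:
    # Hypothetical key scheme: truncated SHA-256 of the source string.
    return hashlib.sha256(source.encode("utf-8")).hexdigest()[:12]


# Translation stored under the key of the old, misspelled source string.
translations: Dict[str, str] = {digest_key("Recieve email."): "Recibir correo."}

# Typo-fix map: key of the corrected string -> key of the string it replaced.
typo_fixes: Dict[str, str] = {digest_key("Receive email."): digest_key("Recieve email.")}


def lookup(source: str) -> Optional[str]:
    """Find a translation directly, or through the typo-fix map."""
    key = digest_key(source)
    if key in translations:
        return translations[key]
    old_key = typo_fixes.get(key)
    if old_key is not None:
        return translations.get(old_key)
    return None
```

The map only needs one entry per corrected string, so it stays small as long as typo fixes are rare.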

As an aside, some small string changes aren't typo fixes: for example, we may correct a material error in a piece of setup help that told the user to run the wrong command, or something like that. In this case, translations should be regenerated. How do you currently deal with this case with string keys? Do you just manually rekey setup-help to setup-help-v2? Or is this rare enough that you just accept that the translations may occasionally be out of date? Or does the tool identify translations with changed English text and highlight them for review?

Having strings such as "...$1 task(s)..." means you don't support plurals at all.

Our "string keys" are written in a sort of "proto-english". This language is usually close enough to English that we can show it in the UI without translating it explicitly. For example, here's how we write a string with a gender and plural tasks:

pht("%s closed %s task(s).", $user, $count);

In this case, the "proto-english" key is readable to English users, but not the best localization, so we would translate this string into English like this:

array(
  array(
    '%s closed a task.',
    '%s closed %s tasks.',
  )
)

Because English isn't gendered, the outer array() just skips variation on the first parameter. If we were translating into a language with gendered verbs and three plural variants, we would translate like this:

array(
  array(
    '%s he-closed a task.',
    '%s he-closed some few-tasks.',
    '%s he-closed %s tasks.',
  ),
  array(
    '%s she-closed a task.',
    '%s she-closed some few-tasks.',
    '%s she-closed %s tasks.',
  )
)

This format is intended to be efficient for machines to read, so it is cumbersome to type by hand, but the translation engine can parse it quickly and select runtime variants quickly.
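
To show how a runtime lookup over this nested shape might work, here is a Python sketch. It is not libphutil's implementation: the gender indexes, the three-way plural rule, and the left-to-right placeholder filling are assumptions chosen to fit the made-up language in the example above.

```python
import re
from typing import List, Sequence

HE, SHE = 0, 1  # hypothetical gender indexes into the outer list

VARIANTS: List[List[str]] = [
    ["%s he-closed a task.", "%s he-closed some few-tasks.", "%s he-closed %s tasks."],
    ["%s she-closed a task.", "%s she-closed some few-tasks.", "%s she-closed %s tasks."],
]


def plural_index(count: int) -> int:
    # Made-up rule for the imaginary language: one / few / many.
    if count == 1:
        return 0
    if 2 <= count <= 4:
        return 1
    return 2


def fill(template: str, args: Sequence[object]) -> str:
    """Substitute %s left to right; unused trailing arguments are ignored."""
    remaining = iter(args)
    return re.sub(r"%s", lambda match: str(next(remaining)), template)


def render(gender: int, user: str, count: int) -> str:
    template = VARIANTS[gender][plural_index(count)]
    return fill(template, [user, count])
```

Selection is two list indexings and one substitution, with no parsing of the translation at lookup time.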

D16810 provides a conversion layer between MediaWiki-style JSON {{GENDER:$1|X|Y}} variants and our nested representation.

However, the string that we present to translators right now just says:

$1 closed $2 task(s).

Today, they would need to infer that they can safely put {{GENDER:$1|X|Y}} on $1 and {{PLURAL:$2|X|Y}} on $2. Presumably, if we had this information at generation time, a better string to generate would be something like this, which tells the translator explicitly about the types of the variables:

{{GENDER:$1|$1|$1}} closed {{PLURAL:$2|$2|$2}} task(s).

(This may not be the best string we could generate, just an improvement over "$1 closed $2 task(s).", since it has more information for translators.)
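
Generating such a hinted string is mechanical once the parameter types are known. This Python sketch assumes the types have already been extracted from the pht() call site; the type names "person" and "number" and the function names are illustrative only.

```python
import re
from typing import Sequence


def hint(position: int, kind: str) -> str:
    """Wrap a positional variable in a GENDER/PLURAL marker based on its type."""
    var = f"${position}"
    if kind == "person":
        return "{{GENDER:" + var + "|" + var + "|" + var + "}}"
    if kind == "number":
        return "{{PLURAL:" + var + "|" + var + "|" + var + "}}"
    return var


def export_with_hints(pht_string: str, kinds: Sequence[str]) -> str:
    """Replace each %s with a typed hint, numbering parameters from 1."""
    index = 0

    def replace(match: "re.Match[str]") -> str:
        nonlocal index
        index += 1
        return hint(index, kinds[index - 1])

    return re.sub(r"%s", replace, pht_string)


hinted = export_with_hints("%s closed %s task(s).", ["person", "number"])
```

With types `person` and `number`, `hinted` comes out as the hinted string shown above.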

Better than nothing, assuming you can handle additions from translatewiki.net and not throw them away when regenerating files.

I don't plan to maintain Translatewiki QQQ data ourselves, but I think it would be easy to merge it nondestructively if you generate it.

At translatewiki.net we mark these kinds of strings as ignored (not shown at all) or optional (allows customization, but not shown to translators by default).

Is there a way for me to mark strings like that in the MediaWiki-style JSON files (e.g., some special "[SPECIAL:ignored]" token at the beginning of the string)? Or is this marking external? (If so, what's the best format to provide it?)

One other question: we could probably provide ranking data for strings (e.g., which strings are most important to translate) by having Phabricator track when it does a string lookup and then figuring out which strings were looked up most frequently over a period of normal use. This might help translators focus on the most important/frequent strings and not have to worry about all the help/setup text and rare error messages. Would this data be useful? If so, how can we provide it in a format you can read?
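
The frequency idea amounts to a counter hooked into the lookup path. Here is a Python sketch of that shape; the callback mechanism and all names are assumptions, and the sample strings are invented.

```python
from collections import Counter
from typing import Callable, Optional

lookups: Counter = Counter()
_lookup_callback: Optional[Callable[[str], None]] = None


def set_lookup_callback(callback: Optional[Callable[[str], None]]) -> None:
    """Register (or clear) a function that is called on every string lookup."""
    global _lookup_callback
    _lookup_callback = callback


def translate(source: str) -> str:
    if _lookup_callback is not None:
        _lookup_callback(source)
    return source  # no translations are loaded in this sketch


set_lookup_callback(lambda source: lookups.update([source]))
for text in ["Save", "Cancel", "Save", "Save", "Rebuild search index"]:
    translate(text)

# Most frequently looked-up strings first.
ranked = [source for source, _ in lookups.most_common()]
```

Running this over a period of normal use and exporting `lookups` alongside the strings would give translators a rough priority order.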

The rTRANSLATEWIKI repository is now available. Here is an example of the en.json MediaWiki-style JSON string file we currently generate:

https://secure.phabricator.com/diffusion/TRANSLATEWIKI/browse/master/projects/libphutil/en.json

Here is the QQQ file we currently generate:

https://secure.phabricator.com/diffusion/TRANSLATEWIKI/browse/master/projects/libphutil/qqq.json

I've linked the libphutil versions since they're quicker to load if you want to take a look, but phabricator files are also included.

I've published a mirror to GitHub if that's easier to work with:

https://github.com/phacility/translatewiki

Here's how I'd like to move forward:

  • WMF/TW: Import the en.json MW-style JSON strings to create a new translatable project on Translatewiki. Let us know if you encounter issues.
  • Phacility: We'll fix any bugs with the existing export process so you can get the en.json importing cleanly.
  • WMF/TW: Create a new Git repository you control. Format a libphutil library inside it and publish it somewhere, following the documentation. This is now a Phabricator extension.
  • Phacility: We can help with this if it isn't straightforward. The rTRANSLATEWIKI repository is an example of a repository containing a simple libphutil library (in src/). I think @20after4 is also at least somewhat familiar with this. You could also publish the translations to the existing phabricator-extensions repository, but I'm not sure if it's simpler internally to create a separate repository for this stuff or not (if you want to redistribute the translations more broadly, a separate repository is probably better eventually, at least).
  • WMF/TW: Test translate a couple of strings (say, in Spanish). Export them as es.json and use the bin/translatewiki generate tool in rTRANSLATEWIKI to convert es.json into TranslateWikiSpanishTranslation.php or similar. Let us know if you encounter bugs/issues.
  • Phacility: We'll fix any bugs with the existing generation process so you can convert es.json cleanly.
  • WMF/TW: Publish that file in your new Git repository (use arc liberate to get it added to the class map).
  • WMF/TW: Install the extension library alongside a copy of Phabricator, select "Spanish (Spain)" as your translation, and verify that the strings you translated now appear in the UI. Note that after D16807 you must translate 512 strings or put the install in phabricator.developer-mode for the translation to be selectable in the UI.
  • WMF/TW: Automate that whole mess, then repeat the process with as many strings/languages as you want to translate.

Then, in the future:

  • We can discuss improvements to the export/import process, including:
    • fixing escaping bugs;
    • fixing heredoc extraction bugs;
    • adding priority/frequency information;
    • adding variable type (GENDER/PLURAL) hints;
    • adding support for a "typo fix" string map;
    • adding other automatic QQQ/metadata (but we don't have the resources to manually generate human QQQ labels);
    • annotating ignored/optional strings;
    • fixing bad/ambiguous/untranslatable strings;
    • whatever else comes up.
  • Phacility will implement anything which we mutually agree to be worthwhile/valuable. I imagine some of this stuff is pretty important and some of it isn't really a big deal. For example, we'll fix the heredoc thing eventually, but I suspect none of the heredoc strings are normally visible to non-administrators.
  • We can keep en.json up to date either with a regular export on our side (say, weekly when we promote stable), or you can just run the bin/translatewiki export tool (which is all we'd do). I'm not sure which is easier (see note below about translating your own extensions, which may motivate doing the export yourself).
  • If you produce a redistributable translation we'll list it on Community Resources and likely want to promote it and make it easy to install after the Packages application becomes useful. I have no idea what the timeline on this is, but it would end up as something like arc install translatewiki/translations. (This would be a wrapper for cloning the repository you publish, installing it correctly locally, and configuring Phabricator to load it.)
  • We may eventually mutually want to just upstream everything, but I don't want to commit to that yet. In particular, we don't want to commit to anything coming back upstream yet: translation files, QQQ help text, etc. Exceptions:
    • We'll accept bug fixes to the import/export process or translation engine (or just tell us what the bug is and we'll fix it).
    • We'll accept patches which fix bad strings (or just give us a list of the bad strings you run into and we'll fix them).
    • We'll accept new locale definitions (or just tell us which locales you want and we'll add them, but we need to know the gender/plural rules too since we don't speak all human languages).
  • We don't currently support RTL languages, and support is likely more involved, but we can look at this in the future.
  • I personally only speak English, but if a Pirate/Doge/Emoji translation can be made available on Translatewiki I'd be happy to translate some strings to become more familiar with that side of the process. (I suspect Canadian and British English aren't hugely useful exercises in learning how the platform works? I also can't really faithfully localize into either locale.)

Also note that the rTRANSLATEWIKI repository works on any libphutil library, including third-party extensions. The projects/ directory includes translations for the rTRANSLATEWIKI repository itself, which is probably not too useful, but this means you can translate your own extensions like phabricator-extensions and phabricator-sprint. Obviously, these translations won't come upstream, but you can distribute them with the extensions and everything should work properly.

avivey awarded a token. Nov 7 2016, 5:47 PM

Some of the terms that I use below (e.g. outdated) are explained in https://www.mediawiki.org/wiki/Help:Extension:Translate/Glossary

I don't know what this means. $ in the string contents isn't anything special for translatewiki.net.

Suppose I have this string in Phabricator:

pht('These cupcakes cost $1 each.');

Today, if we export this in the MW JSON format, it will look like this:

"These cupcakes cost $1 each."

However, that isn't distinguishable from this string, which has the same export format, but means something different, since "$1" is "first parameter":

pht('These cupcakes cost %s each.', $variable_dollar_amount);

I do understand that if you do the conversion, there will be conflicts. What I am trying to understand is whether the conversion is necessary in the first place. It looks like we did not specify clearly enough what the MW JSON format is. The format is just a JSON file with key-value pairs, and a @metadata key will be added at the top when we export. Until recently the values could only be strings (except @metadata), but now we also support arrays (with string keys; numerical indexes are discouraged).

Variables such as $1 are not part of the spec, and it makes no difference to us if you use %s instead. So unless this conversion is required by your system, you can drop it.
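
A quick way to check that an export matches the format described above is sketched below in Python. It checks only the stricter flat form (string values plus an optional @metadata object) and does not cover the newer array values; the key "a1b2c3" is a made-up placeholder.

```python
import json
from typing import Any, Dict


def check_flat(document: str) -> Dict[str, Any]:
    """Parse a message file and insist that every message value is a string."""
    data = json.loads(document)
    if not isinstance(data, dict):
        raise ValueError("top level must be a JSON object")
    for key, value in data.items():
        if key == "@metadata":
            continue  # metadata block, not a message
        if not isinstance(value, str):
            raise ValueError(f"{key!r} is not a plain string")
    return data


sample = json.dumps({
    "@metadata": {"authors": []},
    "a1b2c3": "%s closed %s task(s).",  # printf-style variables left as-is
})
messages = check_flat(sample)
```

Note that the sample keeps `%s` in the value, in line with the point that `$1`-style variables are not required by the format.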

Our support for variables means defining one or more regular expressions to match variables in the text. These are used to mark translations as fuzzy if they do not use a variable present in the source text. We can also provide variables as insertables.
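
The check described above can be sketched as follows. This is illustrative Python, not the Translate extension's code: the pattern matches `$1`-style variables only, and the function names and Spanish sample strings are assumptions.

```python
import re
from typing import Set

# Project-defined pattern; here it matches $1, $2, ...
VARIABLE_RE = re.compile(r"\$\d+")


def missing_variables(source: str, translation: str) -> Set[str]:
    """Variables that appear in the source but not in the translation."""
    wanted = set(VARIABLE_RE.findall(source))
    present = set(VARIABLE_RE.findall(translation))
    return wanted - present


def is_fuzzy(source: str, translation: str) -> bool:
    return bool(missing_variables(source, translation))
```

A project that keeps printf-style placeholders would supply a different pattern (for example one matching `%s`), which is why the variable syntax does not need to be converted for this check to work.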

Likewise, if a translated string contains the literal substrings $1, {{GENDER:...}}, the | character in a GENDER/PLURAL variant, etc., I'm not sure how to identify that they are literal substrings instead of tokens in the translation grammar.

If you don't also use $1 as a variable, you don't need to do anything. We won't treat it differently from any other text.

I am not sure why you would have literal {{GENDER:...}} in strings, but one option is to pass the content of that as a variable, assuming it does not need translation. If it does, then I don't see the issue of it literally appearing in the string as long as your system knows how to handle it.

Fixing just a spelling mistake in original text throws all translations away for that string.

Yeah, this isn't great. I think this is fairly rare, though, and if it turns out to be a big problem we could mitigate it without switching to string keys, by maintaining a "typo fix" map of the digested keys. I think this could be at least partially automated, and the total size of the map should be fairly small. Typos are also a lot more common in big blocks of text (like help/configuration stuff) which are probably less important to translate: we don't typo the common short phrases that appear most frequently in the UI very often, since someone would tend to notice pretty quickly.

Since you already seem to have a mapping from proto-English to real text, to me it seems you can use this same approach for typos.

As an aside, some small string changes aren't typo fixes: for example, we may correct a material error in a piece of setup help that told the user to run the wrong command, or something like that. In this case, translations should be regenerated. How do you currently deal with this case with string keys? Do you just manually rekey setup-help to setup-help-v2? Or is this rare enough that you just accept that the translations may occasionally be out of date? Or does the tool identify translations with changed English text and highlight them for review?

We track changes and retain history, if there are identifiers that make it possible. Changes in original content will by default cause translations to be marked as outdated. A human will check changes to the source strings, and they can skip the marking for spelling mistake corrections that don't affect translations.

If the content changes in an incompatible manner (based on human judgement), we recommend using a different key.

Having strings such as "...$1 task(s)..." means you don't support plurals at all.

Our "string keys" are written in a sort of "proto-english". This language is usually close enough to English that we can show it in the UI without translating it explicitly. For example, here's how we write a string with a gender and plural tasks:

pht("%s closed %s task(s).", $user, $count);

In this case, the "proto-english" key is readable to English users, but not the best localization, so we would translate this string into English like this:

array(
  array(
    '%s closed a task.',
    '%s closed %s tasks.',
  )
)

[...]

So your system is clearly more advanced than I had understood from looking at the files. But this raises the question: wouldn't it be better to translate from this real English instead of the proto-English strings? That should provide an unambiguous mapping as well as the typo-corrected English strings for translators.

D16810 provides a conversion layer between MediaWiki-style JSON {{GENDER:$1|X|Y}} variants and our nested representation.

This makes sense. What/Where is your logic for determining which array index to use for plurals and genders? Perhaps we could simplify things here a bit. If you would use the CLDR plural rule system with array keys such as one, few, many etc, then you could place plural arrays directly in the JSON and our system would support it already with automatic conversion between arrays and inline plural syntax.

For gender we don't currently have a similar system, so it could stay in your code or we could add support for that on our side. I am a bit worried, though: how will you handle cases such as %s thanked %s, where there are two users with genders that should be available for translation?

I don't plan to maintain Translatewiki QQQ data ourselves, but I think it would be easy to merge it nondestructively if you generate it.

If "it" means just the additional explanations, sure. If it means the references to source code locations, I am not so sure. For Gettext we parse these things from the comments in the file, so they do come out of band in a way. This probably needs some more thinking.

At translatewiki.net we mark these kinds of strings as ignored (not shown at all) or optional (allows customization, but not shown to translators by default).

Is there a way for me to mark strings like that in the MediaWiki-style JSON files (e.g., some special "[SPECIAL:ignored]" token at the beginning of the string)? Or is this marking external? (If so, what's the best format to provide it?)

We keep this database ourselves. Example.

As a start, feel free to provide a list of keys you think should be either optional or ignored. From there we can continue as we do for other projects.

One other question: we could probably provide ranking data for strings (e.g., which strings are most important to translate) by having Phabricator track when it does a string lookup and then figuring out which strings were looked up most frequently over a period of normal use. This might help translators focus on the most important/frequent strings and not have to worry about all the help/setup text and rare error messages. Would this data be useful? If so, how can we provide it in a format you can read?

I am glad you asked. We do have a couple of options. For example in MediaWiki core, messages are split into multiple files such as API, Installer (and the rest). Because each file (set of files when including translations) becomes one message group, translators can choose a group they want to work on.

In addition you can also provide a list of the N most important keys and we will create a secondary message group out of those. We have this kind of message group for MediaWiki currently.


Variables such as $1 are not part of the spec, and it makes no difference to us if you use %s instead. So unless this conversion is required by your system, you can drop it.

At least some translators are likely familiar with $1, {{GENDER:$1|X|Y}}, and so on, right? And you have documentation on this syntax to help translators who aren't familiar with it learn it?

My thinking was that it would be easier for translators to work with a familiar system than a new one. The format isn't relevant to us since we need an export/import process anyway, and there's no real cost to us in expressing translations in a syntax that is more likely to be familiar to translators. If there's a better format, I'm happy to export that instead.

Our internal format is spread across multiple different files, some of which are runtime-generated, and I don't want to guarantee that any of the formats are stable, so we should pick some export format. As much as possible, I'd like to optimize for these things, in order from most important to least important:

  1. Ease of translation for translators (e.g., familiar syntax, documentation).
  2. Ease of import for Translatewiki.
  3. Ease of export for us.

I think we need an import/export process on our side no matter what, so I think the net cost for making that process more complicated is very small compared to the cost of giving translators unfamiliar syntax to work with. But we can export/import in whatever format is best given these constraints.

See the next comment for an overview of what data we have now, if it's helpful.

Where is your logic for determining which array index to use for plurals and genders?

It's just positional and hard-coded in the locale definition -- here's the definition for Czech, which defines 0 = one, 1 = few, 2 = many:

https://secure.phabricator.com/diffusion/PHU/browse/master/src/internationalization/locales/PhutilCzechLocale.php

We can add a mapping on the import process between CLDR order and our internal order if they differ, although I suspect they'll tend to be the same in most cases, since the plural forms are usually well-ordered and we can define our locales in terms of CLDR in the future.

(It also looks like CLDR defines four plural forms for Czech. The PhutilCzechLocale rules were written by a native Czech speaker, but we may need to revisit them.)
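
A minimal sketch of the positional scheme described above, written in Python rather than PHP and not taken from libphutil. The function names and the exact Czech ranges are assumptions for illustration; only the ordering 0 = one, 1 = few, 2 = many comes from the comment above.

```python
def czech_plural_index(n):
    """Hypothetical rule: map a count to a positional index (0=one, 1=few, 2=many)."""
    if n == 1:
        return 0
    if 2 <= n <= 4:
        return 1
    return 2


def select_plural_variant(n, translations):
    """Pick the variant at the rule's index, falling back to the last available one."""
    index = czech_plural_index(n)
    if index >= len(translations):
        return translations[-1]
    return translations[index]


variants = ["one-form", "few-form", "many-form"]
chosen = [select_plural_variant(n, variants) for n in (1, 3, 7)]
```

Because the index is purely positional, a translation file only has to list its variants in the order the locale expects.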

If "it" means just the additional explanations, sure. If it means the references to source code locations, I am not so sure.

I just mean that if you have a file like this:

{
  "some-key": "This appears on the home page."
}

...and we provide this qqq.json file:

{
  "some-key": "Used in: x.php:123, y.php:345"
}

...it's easy to write a script that merges them like this:

{
  "some-key": "This appears on the home page.\n\n~~~\n\n Used in: x.php:123, y.php:345"
}

e.g., using "~~~" as a delimiter or whatever. Not sure how much of a mess that is.
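
A rough sketch of such a merge script, in Python. The function name `merge_qqq` is hypothetical, and the delimiter is the arbitrary "~~~" suggested above; nothing here reflects an existing Translatewiki tool.

```python
DELIMITER = "\n\n~~~\n\n"  # arbitrary separator, as suggested above


def merge_qqq(human_docs, generated_docs):
    """Append generated usage notes to human-written docs without losing either."""
    merged = dict(human_docs)
    for key, usage in generated_docs.items():
        # Drop any previously generated tail so that re-running the merge is idempotent.
        base = merged.get(key, "").split(DELIMITER, 1)[0]
        merged[key] = base + DELIMITER + usage
    return merged


human = {"some-key": "This appears on the home page."}
generated = {"some-key": "Used in: x.php:123, y.php:345"}
merged = merge_qqq(human, generated)
```

Splitting on the delimiter before appending means the human-written part survives repeated exports unchanged.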

it seems you can use this same approach for typos

We could, but that would require us to leave the typos in proto-english. I think we can accommodate this without needing to do that.

wouldn't it be better to translate from this real-English instead of the proto-English strings

I assume the proto-English is more friendly for translation, but we could easily export real-English instead. My thinking was just that this is easier for translators to deal with:

$1 added $2 task(s) to his/her queue.

...than this (admittedly, an extreme example):

{{GENDER:$1|{{PLURAL:$2|$1 added a task to his queue.|$1 added $2 tasks to his queue.}}|{{PLURAL:$2|$1 added a task to her queue.|$1 added $2 tasks to her queue.}}|{{PLURAL:$2|$1 added a task to their queue.|$1 added $2 tasks to their queue.}}}}

If the second form is actually better, it's easy to export that instead. Or if the second form is better in terms of data, but MW-style JSON is not the best format to express it in, we can export in some other format -- I just used MW-style JSON since it's the example you gave earlier and I'm assuming it has the most documentation/support/familiarity among Translatewiki translators.
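
For illustration, the fully nested form above can be produced mechanically from nested variant lists. This is a minimal Python sketch, not the actual exporter; `to_nested_inline` and its parameters are hypothetical names.

```python
def to_nested_inline(variants, types, args):
    """Render nested variant lists as nested {{GENDER}}/{{PLURAL}} markup.

    variants: nested lists of strings, one nesting level per variant variable
    types:    tag name for each nesting level, e.g. ["GENDER", "PLURAL"]
    args:     the $N variable each level depends on, e.g. [1, 2]
    """
    if isinstance(variants, str):
        return variants
    inner = [to_nested_inline(v, types[1:], args[1:]) for v in variants]
    return "{{%s:$%d|%s}}" % (types[0], args[0], "|".join(inner))


queue = [
    ["$1 added a task to his queue.", "$1 added $2 tasks to his queue."],
    ["$1 added a task to her queue.", "$1 added $2 tasks to her queue."],
    ["$1 added a task to their queue.", "$1 added $2 tasks to their queue."],
]
inline = to_nested_inline(queue, ["GENDER", "PLURAL"], [1, 2])
```

Each leaf repeats the whole sentence, which is why the result is verbose: no text is shared between variants.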

We can't easily export in a minimal inline form like this:

$1 added {{PLURAL:$2|a task|$2 tasks}} to {{GENDER:$1|his|her|their}} queue.

...because we don't have data in an inline format like that, and the nested internal representation means that the different translations may share no text, at least in the general case. We could attempt to automatically produce a minimal inline translation like this by diffing the variants, but that may be a significant amount of work and not produce particularly good results (although perhaps Translatewiki can already do this).

This example is also sort of made up -- I believe we have no English strings which actually have gendered variants today (that is, we never say "his" or "her" or similar anywhere in the product right now, as far as I know).

For example in MediaWiki core, messages are split into multiple files such as API, Installer (and the rest).

Ah, interesting. We can do this to some degree: our groups won't be perfect and we're probably going to end up with a big "everything else" group, but it should be fairly easy to get most of the Maniphest strings into a "Maniphest" group by just looking at file paths at export time. Perhaps more importantly, we should be able to filter a lot of strings that no one cares about (e.g., from unused applications) out into separate groups.
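
A minimal sketch of path-based grouping, in Python. The `applications/<name>/` layout and the rule for strings used by several applications are assumptions for illustration, not the exporter's actual behavior; the input shape follows the "uses" lists in the extracted string data.

```python
def group_for_path(path):
    """Guess a message group from a source path (hypothetical layout)."""
    parts = path.strip("/").split("/")
    if len(parts) >= 2 and parts[0] == "applications":
        return parts[1]
    return "other"


def group_strings(extracted):
    """Map group name -> sorted list of string keys, based on where each string is used."""
    groups = {}
    for key, info in extracted.items():
        names = {group_for_path(use["file"]) for use in info["uses"]}
        # A string used by several applications goes to the catch-all group.
        name = names.pop() if len(names) == 1 else "other"
        groups.setdefault(name, []).append(key)
    return {name: sorted(keys) for name, keys in groups.items()}
```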

I'll add some code to start collecting frequency data, too.

Here are the current internal data formats:

Proto-English Strings: These appear in the codebase, like this, and are what engineers write when developing new features. They are version-controlled and distributed with the upstream, since they're embedded in the codebase itself.

They are automatically extracted into datafiles by bin/i18n, which uses static analysis to find function calls to pht() and extract their arguments.

return pht('You see %s task(s) in front of you.', $task_count);

Translation Files: These are the final product, used at runtime to translate the UI. They are version-controlled. Some translations (like US English) are distributed with the upstream. Others are provided by third parties. These files can be provided by libraries/extensions, so they don't need to appear in the upstream to translate the UI. A single translation (like "Spanish (Spain)") may draw from multiple translation files.

Example: PhabricatorUSEnglishTranslation.php

They look like this:

<?php

final class PhabricatorUSEnglishTranslation
  extends PhutilTranslation {

  public function getLocaleCode() {
    return 'en_US';
  }

  protected function getTranslations() {
    return array(
      'No daemon(s) with id(s) "%s" exist!' => array(
        'No daemon with id %s exists!',
        'No daemons with ids %s exist!',
      ),

  ...

Locale Definitions: These define plural/gender rules at runtime, and other metadata about a locale. They are used to populate the Settings > Translations dropdown. They are version-controlled, and currently distributed with the upstream. They can be provided by third parties, but there's value in having one authoritative definition of "Spanish (Spain)" rather than multiple competing definitions, so we expect to maintain them for all major languages in the long run.

These files can also define fallback locales, so, e.g., "English (Canada)" can default to use "English (US)" strings if translations are not available.

Example: PhutilEmojiLocale.php

They look like this:

<?php

/**
 * A picture is worth a thousand words.
 */
final class PhutilEmojiLocale extends PhutilLocale {

  public function getLocaleCode() {
    return 'en_X*';
  }

  public function getLocaleName() {
    return pht('Emoji (Internet)');
  }

  public function getFallbackLocaleCode() {
    return 'en_US';
  }

  public function isSillyLocale() {
    return true;
  }

  public function selectPluralVariant($variant, array $translations) {
    // Emoji have a unique variant for every available value: 0, 1, 2, 3, ...
    if (count($translations) <= $variant) {
      return end($translations);
    }

    return $translations[$variant];
  }

}

Extracted File Cache: This is a cache used by the string extractor to improve extraction performance, so the extraction definition can be updated quickly (in ~3 seconds, vs 1m40s without the cache) after a small number of files change. This is a temporary cache and not version-controlled. It appears in src/.cache/i18n_files.json after bin/i18n extract is run.

Because bin/translatewiki export runs bin/i18n extract internally, this file is available to the export/import process, although it is currently unused. I believe it is not likely to be useful (it is used while compiling i18n_strings.json, and the useful parts are part of that file).

It looks like this:

i18n_files.json
{
  "version": 1,
  "files": {
    "__tests__/PhutilLibraryTestCase.php": "a32681574776eefaa731289d1278c534",
    "aphront/requeststream/AphrontRequestStream.php": "e4dd336304b1ec54452cae1daf9faee8",
    ...
  },
  "strings": {
    "757b6834e1f5598d40338293e611d78a": [
      {
        "string": "An %s already exists. Dispose of the previous guard before creating a new one.",
        "file": "/aphront/writeguard/AphrontWriteGuard.php",
        "line": 68
      },
      {
        "string": "An %s is being created in a context which permits unguarded writes unconditionally. This is not allowed and indicates a serious error.",
        "file": "/aphront/writeguard/AphrontWriteGuard.php",
        "line": 75
      },
      ...,
    ],
    ...
  }
}
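
To illustrate why such a cache speeds up extraction, here is a small Python sketch that decides which files need re-extraction. It assumes the values under "files" are MD5 hashes of file contents, which the example suggests but the thread does not state; the function names are hypothetical.

```python
import hashlib


def content_hash(source):
    """Hash file contents (assumed here to be MD5, as the 32-hex-digit values suggest)."""
    return hashlib.md5(source.encode("utf-8")).hexdigest()


def files_to_reextract(sources, cache):
    """Return paths whose content hash is missing from, or differs from, the cache."""
    cached = cache.get("files", {})
    return sorted(
        path for path, source in sources.items()
        if cached.get(path) != content_hash(source)
    )
```

Only the returned paths would need to be parsed again; strings for every other file can be reused from the cache.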

Extracted String Data: This is a datafile of strings extracted from the codebase through static analysis of the Proto-English strings in pht(...) calls. It is automatically generated by bin/i18n extract. This is a temporary file (intended to be used by other processes) and not version controlled. It appears in src/.cache/i18n_strings.json after extraction.

This file is available to the import/export process, and we're basically just converting it into a more usable form.

i18n_strings.json
{
  "\"%s\" class \"%s\" has an invalid \"%s\" property. Field constants must be strings and no more than %s bytes in length.": {
    "uses": [
      {
        "file": "/object/Phobject.php",
        "line": 92
      }
    ]
  },
  "\"%s\" class \"%s\" must define a \"%s\" constant.": {
    "uses": [
      {
        "file": "/object/Phobject.php",
        "line": 82
      }
    ]
  },
  "\"%s\" is not an exact quantity.": {
    "uses": [
      {
        "file": "/utils/utils.php",
        "line": 1130
      }
    ]
  },
  "\"Says/remarks\" word edit smoothenss.": {
    "uses": [
      {
        "file": "/utils/__tests__/PhutilProseDiffTestCase.php",
        "line": 58
      }
    ]
  },
  ...

In the short term, I plan to add:

  • Tags/applications/message groups for strings, which will be automatically generated from filenames for now and appear in i18n_strings.json (if it's useful, we can manually classify them later).
  • Frequency information, which I'm going to have to figure out how to collect and store.

In the longer term, I plan to add (if we actually need them):

  • A "typo fix" map, which maps new proto-English keys to older equivalent keys.
  • An "optional"/"untranslatable" map, which adds this metadata for keys.
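
A minimal sketch of how a "typo fix" map could be consulted at lookup time, in Python. The function name and data shapes are hypothetical; this only illustrates the idea of mapping a new key to an older equivalent key.

```python
def lookup_translation(key, translations, typo_fixes):
    """Find a translation for key, falling back through older equivalent keys."""
    seen = set()
    while key is not None and key not in seen:
        if key in translations:
            return translations[key]
        seen.add(key)  # guard against accidental cycles in the map
        key = typo_fixes.get(key)
    return None


translations = {"Sucess!": "Exito!"}      # translated before the typo was fixed
typo_fixes = {"Success!": "Sucess!"}      # new key -> older equivalent key
found = lookup_translation("Success!", translations, typo_fixes)
```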

Variables such as $1 are not part of the spec, and it makes no difference to us if you use %s instead. So unless this conversion is required by your system, you can drop it.

At least some translators are likely familiar with $1, {{GENDER:$1|X|Y}}, and so on, right? And you have documentation on this syntax to help translators who aren't familiar with it learn it?

We already have multiple different projects (MediaWiki is only one of dozens) at translatewiki.net using multiple different kinds of variables, so I don't expect translators to care much whether they see %s or $1 when they know the concept of variables.

There would be some gains for unifying this:

  • better translation memory matches
  • better support for right-to-left languages

But in my opinion that is best done at translatewiki.net, transparently to the projects. It might not be much effort for you, but all other projects would need to do the same (and most of them won't). Doing it on our side will allow us to be more innovative, such as mapping variables on the fly to something else that works better when mixed in between right-to-left text, but only for those languages.

For PLURAL and GENDER, using compatible syntax definitely makes sense and that's why we convert CLDR and Gettext plurals to very similar syntax as MediaWiki's.

Where is your logic for determining which array index to use for plurals and genders?

It's just positional and hard-coded in the locale definition -- here's the definition for Czech, which defines 0 = one, 1 = few, 2 = many:

https://secure.phabricator.com/diffusion/PHU/browse/master/src/internationalization/locales/PhutilCzechLocale.php

We can add a mapping on the import process between CLDR order and our internal order if they differ, although I suspect they'll tend to be the same in most cases, since the plural forms are usually well-ordered and we can define our locales in terms of CLDR in the future.

It would be great if you could export to something like this:

"aBmeIEk": {
    "one": "%s star",
    "other": "%s stars"
},

Then our system knows automatically how to handle this.
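
A minimal sketch of converting positional plural variants into the keyed form above, in Python. The function name and the per-locale category list are assumptions; a real mapping would come from the locale definition or from CLDR data.

```python
def to_cldr_plural(variants, categories):
    """Label positional plural variants with CLDR category names."""
    if len(variants) > len(categories):
        raise ValueError("more variants than plural categories")
    labelled = dict(zip(categories, variants))
    # CLDR always expects "other"; reuse the last variant if none was supplied.
    labelled.setdefault("other", variants[-1])
    return labelled


stars = to_cldr_plural(["%s star", "%s stars"], ["one", "other"])
```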

wouldn't it be better to translate from this real-English instead of the proto-English strings

I assume the proto-English is more friendly for translation, but we could easily export real-English instead. My thinking was just that this is easier for translators to deal with: [...]

We actually have a practice of putting {{GENDER}} and {{PLURAL}} in the source text so that translators know they can be used. Otherwise they would think those are not available (which is sometimes the case for some of our products).

We can't easily export in a minimal inline form like this:

$1 added {{PLURAL:$2|a task|$2 tasks}} to {{GENDER:$1|his|her|their}} queue.

...because we don't have data in an inline format like that, and the nested internal representation means that the different translations may share no text, at least in the general case. We could attempt to automatically produce a minimal inline translation like this by diffing the variants, but that may be a significant amount of work and not produce particularly good results (although perhaps Translatewiki can already do this).

We don't have logic to produce minimal inline forms, but we can consider implementing it and making it available to all products. Like you said, I expect most cases to be simple enough. Translators can use the inline syntax themselves in the translations, which is imho more important, to allow easy proofreading by others.

"aBmeIEk": {
    "one": "%s star",
    "other": "%s stars"
},

When multiple variants exist, should we export a nested structure like this?

"abcdef": {
  "male": {
    "one": "%s added a task to his queue.",
    "other": "%s added %s tasks to his queue."
  },
  "female": {
    "one": "%s added a task to her queue.",
    "other": "%s added %s tasks to her queue."
  },
  "neutral": {
    "one": "%s added a task to their queue.",
    "other": "%s added %s tasks to their queue."
  }
}

(Internally, we don't currently know whether a variant is a gender or plural at extraction time, but we can work toward inferring/annotating this.)

Additionally, many strings are potentially variant in other languages but are not variant in English. This is OK?

"123abc": "%s pressed the button."

Another language might conceivably produce gendered variants for that depending on the gender of the subject (perhaps "%s pressed his/her button." is more colloquial among native speakers than "the button"), but English won't. It's correct for us to export that string as-is, without annotation that the "%s" is a gendered variable and may vary on gender in translation, but does not vary in English?

We actually have a practice of putting {{GENDER}} and {{PLURAL}} in the source text so that translators know they can be used.

What's the best way to extract this string, if there's only one variant in English but the parameter is gendered and we know that it's gendered at extraction time? For example:

"%s pressed the button."

Should we export this?

"123abc": "%s {{GENDER}} pressed the button."

...and expect translators or the system to remove that marker? We can alternatively export something like this:

"123abc": {
  "proto": "%s pressed the button labeled '%s' %s time(s).",
  "types": ["gender", null, "plural"],
  "english": { "... <nice English translations in nested format> ..." }
}

...and you can mangle that however you want.

Specifically, we will infer variable types from the arguments at the pht() callsites:

pht(
  '%s pressed the button %s time(s).',
  $user,
  new PhutilNumber($count));

We know that new PhutilNumber(...) and phutil_count() arguments are plurals. These are in wide use already, and should be easy to extract.

We can add a phutil_person() marker to annotate gendered arguments, although annotating the codebase will require some work (we do not currently have any marker that arguments are gendered which is available to the static analyzer), so this data will probably not be available for a while.
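
As a rough illustration of the inference described above, this Python sketch classifies pht() argument expressions by their source text. The real extractor uses static analysis rather than regular expressions, so treat the patterns and the function name as assumptions; only the marker names PhutilNumber, phutil_count() and phutil_person() come from the discussion.

```python
import re

PLURAL_ARG = re.compile(r"^(new\s+PhutilNumber|phutil_count)\s*\(")
GENDER_ARG = re.compile(r"^phutil_person\s*\(")


def infer_types(argument_sources):
    """Classify each pht() argument expression as 'plural', 'gender', or None."""
    types = []
    for source in argument_sources:
        source = source.strip()
        if PLURAL_ARG.match(source):
            types.append("plural")
        elif GENDER_ARG.match(source):
            types.append("gender")
        else:
            types.append(None)
    return types


types = infer_types(["phutil_person($user)", "$label", "phutil_count($tasks)"])
```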

When multiple variants exist, should we export a nested structure like this?

[...]

Nope, we only support this for plurals currently, not gender.

Additionally, many strings are potentially variant in other languages but are not variant in English. This is OK?

"123abc": "%s pressed the button."

In MediaWiki that would be "$1 {{GENDER:$1|pressed}} the button." (or "{{GENDER:$1|$1}} pressed the button."). For you I guess you would wrap the whole text inside the GENDER invocation, if you know it can be used; and you also need to drop the first value since %s cannot be repeated as far as I know.

Alternatively, this can be documented in the message documentation (qqq), but translators do not always read it even though they should.

We can alternatively export something like this:

"123abc": {
  "proto": "%s pressed the button labeled '%s' %s time(s).",
  "types": ["gender", null, "plural"],
  "english": { "... <nice English translations in nested format> ..." }
}

We don't have support for this kind of format out of the box, so preferably not.

We know that new PhutilNumber(...) and phutil_count() arguments are plurals. These are in wide use already, and should be easy to extract.

One more clarification, does your system actually support multiple plural variants per message? In other words is your array syntax always messagekey[gender][plural] or could it also be messagekey[plural][plural] depending on the variables?

If the former, then it should be easy for you to automatically generate the strings in the format I proposed with one and other keys.

We can add a phutil_person() marker to annotate gendered arguments, although annotating the codebase will require some work (we do not currently have any marker that arguments are gendered which is available to the static analyzer), so this data will probably not be available for a while.

This would be good and it's okay if it takes time. I think with this information you should be able to generate inline gender syntax such as {{GENDER|%s}} and translators can then use it in the appropriate place in their language, if any.


To summarize what I am proposing:

Case 1: You use GENDER in real English translations:

"abcdef": {
  "one": "{{GENDER|%s added a task to his queue.|%s added a task to her queue.|%s added a task to their queue.}}",
  "other": "{{GENDER|%s added %s tasks to his queue.|%s added %s tasks to her queue.|%s added %s tasks to their queue.}}"
}

Case 2: You don't use GENDER in real English translations:

"abcdef": {
  "one": "{{GENDER|%s}} pressed the button once.",
  "other": "{{GENDER|%s}} pressed the button %s times."
}

I am assuming here that you keep track of the index of the variable as well as the type. If not, then you need to wrap the whole text again.

If we implement the automatic inline converter for PLURAL, we can also use it for GENDER.

%s cannot be repeated as far as I know.

(Our underlying interpolation engine is sprintf(), so we support sprintf() syntax and %s can be repeated with %1$s. As currently written, the importer should handle this conversion properly, so translators can type $1 $2 $2 $2 $2 and we'll convert to %s %s %2$s %2$s %2$s on import.)
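
A minimal Python sketch of the $N to sprintf() conversion described above. It is not the actual importer; the rule used here (a variable that is next in sequence becomes %s, anything else becomes a positional %N$s) is an assumption chosen to reproduce the example given.

```python
import re


def mw_to_sprintf(text):
    """Convert $1-style variables to sprintf() conversions.

    A variable that is next in sequence becomes a plain %s; anything else
    (a repeat or an out-of-order reference) becomes a positional %N$s.
    """
    next_sequential = 1

    def replace(match):
        nonlocal next_sequential
        index = int(match.group(1))
        if index == next_sequential:
            next_sequential += 1
            return "%s"
        return "%%%d$s" % index

    # Escape literal percent signs first so they survive sprintf().
    return re.sub(r"\$(\d+)", replace, text.replace("%", "%%"))


converted = mw_to_sprintf("$1 $2 $2 $2 $2")
```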

Nope, we only support this for plurals currently, not gender.

How should we express a string where there are multiple variables and one or more non-initial variables are plural?

In other words is your array syntax always messagekey[gender][plural] or could it also be messagekey[plural][plural] depending on the variables?

We support unlimited gender/plural variants in any order. We have a number of strings like this:

'%s edited %s subscriber(s), added %d: %s; removed %d: %s.'

This might, in theory, translate as something like this in a very gender/plural variant language:

array(
  array(
    array(
      array(
        '%s he-edited one-subscriber, added one: %4$s; removed one: %6$s.',
        '%s he-edited one-subscriber, added one: %4$s; removed a few: %6$s.',
        '%s he-edited one-subscriber, added one: %4$s; removed many: %6$s.',
      ),
      array(
        '%s he-edited few-subscriber, added one: %4$s; removed one: %6$s.',
        '%s he-edited few-subscriber, added one: %4$s; removed a few: %6$s.',
        '%s he-edited few-subscriber, added one: %4$s; removed many: %6$s.',
      ),
      array(
        '%s he-edited many-subscriber, added one: %4$s; removed one: %6$s.',
        '%s he-edited many-subscriber, added one: %4$s; removed a few: %6$s.',
        '%s he-edited many-subscriber, added one: %4$s; removed many: %6$s.',
      ),
    ),
    array(
      array(
        '%s he-edited one-subscriber, added a few: %4$s; removed one: %6$s.',
        '%s he-edited one-subscriber, added a few: %4$s; removed a few: %6$s.',
        '%s he-edited one-subscriber, added a few: %4$s; removed many: %6$s.',
      ),
      array(
        '%s he-edited few-subscriber, added a few: %4$s; removed one: %6$s.',
        '%s he-edited few-subscriber, added a few: %4$s; removed a few: %6$s.',
        '%s he-edited few-subscriber, added a few: %4$s; removed many: %6$s.',
      ),
      array(
        '%s he-edited many-subscriber, added a few: %4$s; removed one: %6$s.',
        '%s he-edited many-subscriber, added a few: %4$s; removed a few: %6$s.',
        '%s he-edited many-subscriber, added a few: %4$s; removed many: %6$s.',
      ),
    ),
    ...

"abcdef": {
  "one": "{{GENDER|%s}} pressed the button once.",
  "other": "{{GENDER|%s}} pressed the button %s times."
}

We can do this pretty easily, but I'm not sure what format you want for strings like "gender, plural, plural" or "gender, gender, gender" or "plain string with no gender or plural, gender, plural, plural, gender, plain string, plain string".

That is, our strings may have an unlimited number of gender variables, plural variables, and plain strings, in any order. How should we export them in the general case?
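
To make the general case concrete, here is a small Python sketch of looking up one leaf in an arbitrarily nested translation and of how quickly the number of leaves grows. The function names are hypothetical and the lookup is a simplification of what a runtime would do.

```python
from math import prod


def select_variant(translations, indices):
    """Walk one nesting level per variant variable; a plain string ends the walk."""
    node = translations
    for index in indices:
        if isinstance(node, str):
            break
        node = node[min(index, len(node) - 1)]
    return node


def variant_count(forms_per_variable):
    """Worst-case number of leaf strings for a message."""
    return prod(forms_per_variable)


nested = [["his one", "his many"], ["her one", "her many"]]
leaf = select_variant(nested, [1, 0])
```

With four variant variables of three forms each, `variant_count([3, 3, 3, 3])` gives 81 leaves, which is why a flat one/other export cannot represent every string.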

In other words is your array syntax always messagekey[gender][plural] or could it also be messagekey[plural][plural] depending on the variables?

We support unlimited gender/plural variants in any order. We have a number of strings like this:

Okay I think I understood it finally. This is indeed a bit more complicated and does not map nicely to what I proposed. I'll think about this for a while to see if I can come up with ways to handle this nicely.

Sounds good, thanks!

epriestley moved this task from The Queue to Paused on the Prioritized board. Nov 10 2016, 2:39 PM

Pausing this from our perspective since we're waiting on guidance to move forward.

To summarize that last group of changes:

  • We now extract variable type data (for gender/plural) into i18n_strings.json, so it's available for the exporter. We don't do anything useful with it yet.
  • The phutil_person() annotation (for gender) is available, but we need to do a bit more work before we can really start using it, and then actually go add annotations to the codebase over time.
  • The translation engine now supports a callback for frequency sampling (so we can figure out the most commonly-used strings) but we don't have any code to actually store/compile/export that data yet.
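
A minimal sketch of what could sit behind such a frequency callback, in Python. The class and method names are hypothetical, and scaling frequencies relative to the most common string is an assumption about how a "0.9 out of 1.0" style score might be derived.

```python
from collections import Counter


class FrequencySampler:
    """Counts string lookups; `record` is meant to be registered as the callback."""

    def __init__(self):
        self.counts = Counter()

    def record(self, key):
        self.counts[key] += 1

    def most_common_keys(self, n):
        """Keys of the n most frequently looked-up strings."""
        return [key for key, _ in self.counts.most_common(n)]

    def relative_frequencies(self):
        """Scale counts so the most common string has frequency 1.0."""
        top = max(self.counts.values(), default=0)
        if not top:
            return {}
        return {key: count / top for key, count in self.counts.items()}
```

The output of `most_common_keys(n)` is the kind of "N most important keys" list that could back a secondary message group.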

Okay, I have been thinking about this for a bit. I don't have a perfect solution, but I have the following proposal.

To remove the blocker for going forward, I now agree with your suggestion of using the proto-English version as source text, combined with automatically generated documentation of the variables present. Additionally, I would do an experiment (for translatewiki.net) to allow updates by our translators to this source language [1] so that they can add the inline PLURAL/GENDER annotations. This would be the interchange format between Phabricator and translatewiki.net, because it is easier to go from the inline format to your expanded format than vice versa. I have some PHP code already that can be used as a starting point for the inline->expanded conversion, assuming you haven't done that already.

[1] For other projects, updates to the source language are not allowed via the interface.

As a possible extension, we could experiment with automatically wrapping the variables (and possibly the word immediately after) with PLURAL/GENDER, or even wrapping the whole text multiple times. However, I think this might not be worth it if my proposed crowd-sourcing method is efficient.

I also considered whether we could use your expanded format as the source. One thing still unclear to me is whether you currently use/have proto-English or the expanded versions. Do developers add the expanded version when adding new messages? If so, perhaps they could be taught to add it in inline format? If not, then my proposal seems suitable. If you already have a lot of these expanded versions, then there would be some duplicate work initially.

The expanded->inline conversion is trivial to do for the one-variable case with some common affix matching. I tested this with some prototyping code. When there are multiple variables, it gets more complicated. Again, pursuing this approach does not seem worth it if my proposal is acceptable and works.
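
For the one-variable case, common affix matching might look like the following Python sketch. This is not the prototype mentioned above; the function name is hypothetical, and the choice to break the prefix at a space (so that words and variables are not split) is an assumption.

```python
import os


def to_inline(variants, tag, variable):
    """Collapse variants of one variable into prefix{{TAG:$N|a|b}}suffix."""
    prefix = os.path.commonprefix(variants)
    # Do not split inside a word or a variable: back up to the last space.
    prefix = prefix[: prefix.rfind(" ") + 1]
    rests = [v[len(prefix):] for v in variants]
    # The common suffix is the common prefix of the reversed remainders.
    suffix = os.path.commonprefix([r[::-1] for r in rests])[::-1]
    middles = [r[: len(r) - len(suffix)] for r in rests]
    return "%s{{%s:$%d|%s}}%s" % (prefix, tag, variable, "|".join(middles), suffix)


plural = to_inline(["$1 closed a task.", "$1 closed $2 tasks."], "PLURAL", 2)
```

The suffix is not aligned to word boundaries here, so some results may split a word; expanding the inline form still yields the original variants.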

That sounds fine to me. Do you have any preference on what format you'd like the proto-english + variable type annotations in? Maybe something like this, since it sounds like you're doing some level of custom stuff on your side anyway?

[
  ...,
  "abcdef12345": {
    "proto": "The proto-english string with %s variable(s).",
    "types": ["PLURAL"],
    "english": [
      "If available, a nested list of English variants.",
    ]
  },
  ...
]

We can either put usage information (places in the source code where the call appears) in that same structure as JSON, or we can export it as a block of wiki markdown in the "QQQ" format as we currently do.

Do developers add the expanded version when adding new messages?

Very rarely -- only if the proto-english isn't close enough to English. Often (95% of the time?), proto-English is exactly the same as English. In cases where it isn't ("variable(s)", for example) we don't necessarily translate -- for example, if the string is an internal error message and the proto-English is realistically good enough for English-speaking users, it tends to go untranslated. Writing the nested translations by hand is a bit of a pain and we'll presumably have better tools for it some day, so it often doesn't get done if users won't normally see it.

I have some PHP code already that can be used as a starting point for the inline->expanded conversion, assuming you haven't done that already.

The inline -> expansion code I have so far is here, if it's useful. We can accept inline strings and expand them, or you can expand them on your side:

https://secure.phabricator.com/source/translatewiki/browse/master/src/management/TranslatewikiManagementGenerateWorkflow.php;1433d08de8f616e7d6d3b8da5f09f41162942325$165-373

I think it handles all the tricky cases but didn't get as far as building a test suite to verify that.
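
For readers without access to that code, the inline -> expanded direction can be sketched in Python as follows. This is an independent illustration, not a port of the linked workflow: it handles only non-nested tags, and nesting the result by ascending variable number is an assumption.

```python
import re

TAG = re.compile(r"\{\{(?:PLURAL|GENDER):\$(\d+)\|([^{}]*)\}\}")


def expand_inline(text):
    """Expand non-nested inline tags into nested lists, outermost level = lowest $N."""
    tags = TAG.findall(text)
    if not tags:
        return text
    variable = min(int(number) for number, _ in tags)
    width = max(
        len(options.split("|"))
        for number, options in tags
        if int(number) == variable
    )

    def pick(choice):
        def replace(match):
            if int(match.group(1)) != variable:
                return match.group(0)
            options = match.group(2).split("|")
            # Short option lists reuse their last option.
            return options[min(choice, len(options) - 1)]
        return TAG.sub(replace, text)

    return [expand_inline(pick(choice)) for choice in range(width)]


expanded = expand_inline(
    "$1 added {{PLURAL:$2|a task|$2 tasks}} to {{GENDER:$1|his|her|their}} queue."
)
```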

What works out of the box is JSON files with string key-value pairs for source text. For documentation the qqq.json that is "just-another-language" is what we use most often. The strings of that file are parsed as wikitext using MediaWiki syntax and displayed automatically to translators.

The only precedent for more structured information is what we do with Gettext. We have a bunch of things we call translation aids; qqq is one of them. For Gettext we have an aid that parses the Gettext unit comments that say where the string comes from and produces links to the online version control system.

As an example, for MediaWiki we (the developers) manually write something like this to qqq.json when we add a new message:

Message shown on Special:Foo.

Variables:
* $1: Number of files, can be used with {{PLURAL}}
* $2: User name for {{GENDER}}
* $3: User name that links to the user page

Adding support for the format you propose does not look very complicated – so it is not a blocker – but it would be unique to your project, with an additional maintenance cost. Hence I'm suggesting that you consider using a type of format that we already support.
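
One way to stay within the already supported qqq.json format is to generate the "Variables:" section from extracted variable types. The Python sketch below is an illustration only; the function name and the hint wording are assumptions modeled on the MediaWiki example above.

```python
def variable_docs(types):
    """Build the 'Variables:' section of a qqq entry from variable types."""
    hints = {
        "plural": "Number, can be used with {{PLURAL}}",
        "gender": "User name, can be used with {{GENDER}}",
    }
    lines = ["Variables:"]
    for position, kind in enumerate(types, start=1):
        lines.append("* $%d: %s" % (position, hints.get(kind, "Plain value")))
    return "\n".join(lines)


docs = variable_docs(["gender", None, "plural"])
```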


If you already have it, good.

Maybe it would be simplest if you just give me an example of exactly what you want us to emit?

We're happy to emit exactly what you want, except that we support arbitrary levels of nesting, so we cannot emit into a format which is less powerful than that. We currently emit flat JSON in separate "en" and "qqq" files, following the example from T5267#177060.

We can emit all of this data separately, or encode it into one or more JSON files (or other files) however you like:

  1. Proto-english strings.
  2. Nested English translations.
  3. Variable types.
  4. Usage frequency information ("this string has frequency 0.9 out of 1.0", or "this string is part of the 'most common 10% of strings' group").
  5. Application groupings ("this string is part of Maniphest").
  6. Places in the code where the string appears ("this string appears at x.php line 234").
  7. Additional human-readable context information, like "qqq" (very rarely available).

I don't know how to emit (2), (4), (5), or (6) in formats you already support. If you can point me at an example of the best possible format on your side, I'm happy to make our exporter emit whatever is easiest for you.

  1. Proto-English strings: in en.json, which would be updated to the inline format.
  2. Nested English translations: per my understanding, you would use the existing ones until they are produced automatically by converting the inline-format en.json to the nested format. These would not be used in translatewiki.net.
  3. Variable types: as wikitext in qqq.json, or see [1].
  4. Usage frequency information: you could produce a separate list of message keys in order of frequency. We can then create a group out of the top X. Alternatively, you could order the messages in the files by frequency, so that the most frequent messages appear to translators first. But this creates unnecessary churn in the files, so I don't recommend it.
  5. Application groupings: I suggest that you create a separate set of files for each group, for example core/{en,qqq,de,fi,...}.json and maniphest/{en,qqq,de,fi,...}.json. I am assuming here that strings only belong to one application at a time. If this is not the case, you could put strings that belong to multiple applications in the core files.
  6. Places in the code where the string appears: see [1].
  7. Additional human-readable context information: in qqq.json.

[1] You could extend en.json as you propose, but it is perhaps better to create a new file such as metadata.json, which is similar to qqq.json but lets you define the structure of the values as you need. Either way, as this is a completely new thing, there is no existing format to mimic.

Examples (whitespace is not significant):
en.json you can produce:

{
    "AFEAFE": "You have $1 file(s) and $2 image(s)",
    "32AFA2": "Hello"
}

en.json (or any translation) after update from us:

{
    "@metadata": {
        "authors": [
            "Author1",
            "Author2"
        ]
    },
    "AFEAFE": "You have $1 {{PLURAL:$1|file|files}} and $2 {{PLURAL:$2|image|images}}",
    "32AFA2": "Hello"
}

qqq.json if you put variable information here:

{
    "AFEAFE": "Variables:\n* $1 number\n* $2 number",
    "32AFA2": "A friendly greeting"
}
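
As a rough illustration of how an exporter could build such qqq.json entries from variable type data, here is a minimal Python sketch. The function name `qqq_entry` and the shape of its input are assumptions made for this example, not part of either project's code; it only reproduces the "Variables:" layout shown above and puts the type information before any free-form description.

```python
def qqq_entry(variable_types, description=""):
    """Build one qqq.json value (hypothetical helper, for illustration only).

    variable_types: dict mapping a placeholder such as "$1" to a type name.
    description: optional free-form context for translators.
    """
    parts = []
    if variable_types:
        lines = ["Variables:"]
        # One wikitext bullet per variable, e.g. "* $1 number".
        lines += ["* %s %s" % (name, vtype) for name, vtype in variable_types.items()]
        parts.append("\n".join(lines))
    if description:
        parts.append(description)
    # Type information comes first, separated from the description by a blank line.
    return "\n\n".join(parts)
```

Calling `qqq_entry({"$1": "number", "$2": "number"})` yields the same text as the "AFEAFE" entry above.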

en.expanded.php, or whatever you call your runtime format, would be produced from the en.json file(s). The same goes for translations. You can run this script, or we could run it at commit time, but that really depends on how we integrate the updates, i.e. whether WMF delivers them manually somehow or we can give them to you directly. Below is just an example for completeness; I know you format them a bit differently.

$messagesEn = [
    // Outer index: plural form of $1 (files); inner index: plural form of $2 (images).
    'AFEAFE' => [
        [ 'You have $1 file and $2 image', 'You have $1 file and $2 images' ],
        [ 'You have $1 files and $2 image', 'You have $1 files and $2 images' ],
    ],
    '32AFA2' => 'Hello'
];
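
To make the inline-to-expanded step more concrete, here is a minimal Python sketch of the conversion for the simplest case. It assumes only two-form `{{PLURAL:$n|one|many}}` markers with no nesting and no GENDER, and the name `expand_inline` is made up for this example. It is not the PHP code linked earlier in this thread, just an illustration of the idea.

```python
import re

# Matches a flat, two-form marker such as {{PLURAL:$1|file|files}}.
PLURAL_RE = re.compile(r"\{\{PLURAL:(\$\d+)\|([^|{}]*)\|([^|{}]*)\}\}")

def expand_inline(message):
    """Expand inline PLURAL markers into nested lists of plain strings."""
    # Collect plural variables in order of first appearance.
    variables = []
    for match in PLURAL_RE.finditer(message):
        if match.group(1) not in variables:
            variables.append(match.group(1))
    if not variables:
        return message  # nothing to expand

    def render(choice):
        # choice maps each variable to 0 (singular) or 1 (plural).
        return PLURAL_RE.sub(lambda m: m.group(2 + choice[m.group(1)]), message)

    def build(depth, choice):
        if depth == len(variables):
            return render(choice)
        # One nesting level per variable: index 0 is singular, index 1 is plural.
        return [build(depth + 1, {**choice, variables[depth]: form}) for form in (0, 1)]

    return build(0, {})
```

For the "AFEAFE" message from the translated en.json example, this returns a two-by-two nested list whose outer index follows $1 and whose inner index follows $2. Languages with more than two plural forms would need a different marker pattern and more variants per level.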

metadata.json could look something like:

{
    "AFEAFE": {
        "variables": { "$1": "number", "$2": "number" },
        "sources": {
            "foo/file.php": "12"
        }
    },
    "32AFA2": {
        "sources": {
            "foo/bar.php": ["14", "42"]
        }
    }
}

But I don't consider (6) vital information at this point. If you put (3) in qqq.json, you can skip creating metadata.json. But if you want to put (3) in a structured format, then of course include (6) in that same format.

we support arbitrary levels of nesting so we can not emit into a format which is less powerful than that

For English this is not a problem in my suggestion, as proto-English does not contain nesting. For existing translations into other languages, it of course is. To import those, we would need to forcefully expand them to something like {{GENDER:$1|{{PLURAL:$2|He kissed me once|He kissed me $2 times}}|{{PLURAL:$2|She kissed me once|She kissed me $2 times}}|{{PLURAL:$2|They kissed me once|They kissed me $2 times}}}} (using singular they here), which is not pretty but can round-trip, and translators can simplify it on our side. If you want, you could also do this initially for the nested English strings you have.
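
The forced expansion described above could look roughly like the following Python sketch. It assumes the nested translation is available as nested lists of complete sentences, with one nesting level per variable, and that the caller knows the kind (GENDER or PLURAL) and placeholder for each level. The function name `collapse` and this input shape are assumptions for illustration, not the actual storage format.

```python
def collapse(nested, variables):
    """Turn nested variants into one inline {{KIND:$n|...}} string.

    nested: a plain string, or a list with one entry per form of the
            first variable (each entry nested the same way for the rest).
    variables: list of (kind, placeholder) pairs, outermost level first,
               e.g. [("GENDER", "$1"), ("PLURAL", "$2")].
    """
    if isinstance(nested, str):
        return nested
    (kind, name), rest = variables[0], variables[1:]
    # Each form becomes one "|"-separated alternative of the marker.
    forms = "|".join(collapse(child, rest) for child in nested)
    return "{{%s:%s|%s}}" % (kind, name, forms)
```

Every alternative repeats the whole sentence, which is why the result is verbose. The output carries the same information as the nested form, so translators can shorten it later without anything being lost on import.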

Okay, I believe we already produce a format which is compatible with that definition:

https://secure.phabricator.com/source/translatewiki/browse/master/projects/phabricator/en.json
https://secure.phabricator.com/source/translatewiki/browse/master/projects/phabricator/qqq.json

I'll plan to make these changes:

  • Include variable type information as human readable text in a standard format in "qqq.json", supplementing the additional human-readable source code usage information.
  • Our strings are not unique to a particular application, but I'll implement this rule:
    • If the string is used in zero applications (e.g., only in shared code), or used in more than one application, put it in "core/".
    • If the string is used in exactly one application, put it in "application-name/".
  • In the future, we'll export frequency information in metadata.json or some similar file once we collect it, as an ordered list of every string key.
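
The grouping rule in the list above can be sketched in a few lines of Python. The function names, the example key "0000AA", and the assumption that the extractor provides a set of application names per string key are all hypothetical. This illustrates the rule and is not the exporter's actual code.

```python
def assign_group(applications):
    """Return the group directory for a string used by these applications."""
    apps = set(applications)
    if len(apps) == 1:
        # Used in exactly one application: that application's own group.
        return next(iter(apps))
    # Used nowhere specific (shared code only) or in several applications.
    return "core"

def group_strings(usage):
    """usage: dict of string key -> set of application names using it."""
    groups = {}
    for key, apps in usage.items():
        groups.setdefault(assign_group(apps), []).append(key)
    return groups

example = group_strings({
    "AFEAFE": {"maniphest"},
    "32AFA2": {"maniphest", "differential"},
    "0000AA": set(),  # hypothetical key that appears only in shared code
})
```

Here `example` places "AFEAFE" under "maniphest" and the other two keys under "core".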

We now export strings into projects/<project>/<group>/en.json, etc. For example, here are strings which appear ONLY in Maniphest:

https://secure.phabricator.com/source/translatewiki/browse/master/projects/phabricator/maniphest/en.json;00108900457eda804c9c786dd76ff96620a05195

We now include type information at the top of the human-readable "qqq" string. For example, here is a string with five variables, two of which are plurals:

https://secure.phabricator.com/source/translatewiki/browse/master/projects/phabricator/maniphest/qqq.json;00108900457eda804c9c786dd76ff96620a05195$64

We now export a frequency.json, with string keys in a list. Since we haven't actually collected frequency data, the order is currently (roughly) random:

https://secure.phabricator.com/source/translatewiki/browse/master/projects/phabricator/maniphest/frequency.json;00108900457eda804c9c786dd76ff96620a05195

To clarify, this would be better as one global list across projects. Is that possible?

Sure, not a problem.
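
For illustration, once a single global list exists, a "top X" group of the kind suggested earlier in this thread could be derived roughly as in the Python sketch below. The function name `top_messages` and the in-memory shapes (a list of keys, most frequent first, plus a dict of per-group en.json contents) are assumptions for the example, not an existing translatewiki.net feature.

```python
def top_messages(frequency_order, groups, limit):
    """Pick the `limit` most frequent messages across all groups.

    frequency_order: global list of string keys, most frequent first.
    groups: dict of group name -> {key: source text}, e.g. loaded from
            core/en.json and maniphest/en.json.
    """
    lookup = {}
    for messages in groups.values():
        lookup.update(messages)
    # Keep the frequency order; skip keys that no longer exist in any group.
    return {key: lookup[key] for key in frequency_order[:limit] if key in lookup}
```

Because the group is derived from the separate list, the en.json files themselves never need to be reordered, which avoids the churn mentioned above.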

urzds added a subscriber: urzds.Nov 28 2016, 12:37 PM
tomheng added a subscriber: tomheng.Feb 1 2017, 3:27 PM

Seeing the long list of merged commits here (thanks Evan and Nikerabbit for elaborating and explaining needs!), are there any further code changes needed in Phab before someone could set up a "Phabricator" project at http://translatewiki.net and let volunteers start contributing translations?

I'm basically wondering what's the status here and what's needed before calling this resolved when it comes to the Phabricator code side of things...

My belief is that we've satisfied all the requirements for Translatewiki to move forward on their side.

There are some additional areas for iteration/improvement on our side, and I imagine there will be ongoing work as we run into issues and surprises, but I believe this isn't blocked by anything on our side.

aklapper closed this task as Resolved.Mar 15 2017, 7:39 PM
aklapper assigned this task to epriestley.

Awesome (thanks for the super fast answer)!

I'm closing this task as resolved. Any potential specific issues coming up in the future should get filed as separate tasks.

Twilight removed a subscriber: Twilight.Jul 16 2017, 2:10 PM