
Phacility Cluster: Ad-Hoc Query Log
Open, Normal, Public

Description

Our use of Phacility cluster data is bound by the Phacility Data Use Policy.

This task logs cases where we have performed anonymous, ad-hoc data collection against cluster instances. The goal of this collection is generally to inform decisions about our product roadmap (for example, by helping us allocate resources to features which are used more frequently by customers).

A typical result of this kind of data collection looks something like this:

Across all instances, X% have sent more than 1000 messages in Conpherence.

These results are private, and shared only with staff whose decisions will be informed by the data.

Event Timeline

chad added a comment. Oct 14 2016, 3:06 PM

Such task

epriestley updated the task description. (Show Details) Oct 14 2016, 3:08 PM

Circa September 13, 2016, we ran a query which collected this data from instances:

  • Total number of Conpherence threads.
  • Total number of messages.
  • Percentage of threads with a visibility policy other than "Room Members".

Our goal in collecting this data was to inform decisions about thread join behaviors (T11628) and direct messaging behaviors (T11308) in the product.
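
Queries along these lines could produce those counts. This is a sketch only; the Conpherence table and column names, and the 'participants' policy value standing in for "Room Members", are assumptions rather than the exact queries that were run:

```sql
-- Total threads and total messages (table names are assumptions):
SELECT COUNT(*) FROM <INSTANCE>_conpherence.conpherence_thread;
SELECT COUNT(*) FROM <INSTANCE>_conpherence.conpherence_transaction;

-- Fraction of threads with a visibility policy other than the default
-- ('participants' is a guess at the "Room Members" policy constant):
SELECT SUM(IF(viewPolicy != 'participants', 1, 0)) / COUNT(*)
  FROM <INSTANCE>_conpherence.conpherence_thread;
```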

epriestley added a comment. (Edited) Oct 14 2016, 3:18 PM

I am running a query to collect this data:

  • Percentage of users who have changed the "Self Actions" email preference.
    • Percentage of these users who have changed it to "Disable".
  • Percentage of installs that have changed the "Self Actions" email preference as a global default.

Our goal in collecting this data is to inform decisions about the default value of this setting (see prior discussion in T9161 and linked tasks).

epriestley updated the task description. (Show Details) Oct 14 2016, 3:20 PM

Per above, here are the specific queries I executed while collecting statistics about the "Self Actions" preference:

User Self Mail
SELECT
  SUM(IF(preferences LIKE '%"self-mail":"0"%', 1, 0)) User_Disabled,
  SUM(IF(preferences LIKE '%"self-mail":"1"%', 1, 0)) User_Enabled,
  SUM(IF(preferences LIKE '%"self-mail"%', 0, 1)) User_NoSetting
  FROM <INSTANCE>_user.user u
  LEFT JOIN <INSTANCE>_user.user_preferences p
  ON u.phid = p.userPHID;
Instance Self Mail
SELECT
  SUM(IF(preferences LIKE '%"self-mail":"0"%', 1, 0)) User_Disabled,
  SUM(IF(preferences LIKE '%"self-mail":"1"%', 1, 0)) User_Enabled,
  SUM(IF(preferences LIKE '%"self-mail"%', 0, 1)) User_NoSetting
  FROM <INSTANCE>_user.user_preferences p
  WHERE p.builtinKey = 'global';

I aggregated this data across all instances before sharing it internally.
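
The aggregation step might look something like this sketch. The instance names, per-instance counts, and the summing loop are hypothetical stand-ins for the real tooling; the point is that only summed totals leave the loop:

```shell
# Hypothetical aggregation sketch: run the query above once per instance
# and keep only the summed totals. Instance names and counts are invented.
total_disabled=0
total_enabled=0
for instance in turtle walrus; do
  # Stand-in for running the SQL above against ${instance}_user via mysql.
  case "$instance" in
    turtle) disabled=2; enabled=1 ;;
    walrus) disabled=5; enabled=0 ;;
  esac
  total_disabled=$((total_disabled + disabled))
  total_enabled=$((total_enabled + enabled))
done
# Only these aggregate totals would be shared, never per-instance rows.
echo "Disabled: ${total_disabled}, Enabled: ${total_enabled}"
```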

As a followup, I ran this script on a subset of instances to collect data about mail delivered on a per-action basis versus a per-user basis:

<?php

require_once 'scripts/__init_script__.php';

$sent_count = 0;
$self_count = 0;
$other_count = 0;

$table = new PhabricatorMetaMTAMail();
foreach (new LiskMigrationIterator($table) as $message) {
  $id = $message->getID();

  $status = $message->getStatus();

  if ($status != 'sent' && $status != 'void') {
    // Ignore mail not yet processed.
    echo "IGNORED {$id}: Not yet processed.\n";
    continue;
  }

  $actor_phid = $message->getActorPHID();
  if (!$actor_phid) {
    echo "IGNORED {$id}: No valid actor.\n";
    continue;
  }

  $parameters = $message->getParameters();
  $sent_info = idx($parameters, 'actors.sent');
  if (!$sent_info) {
    echo "IGNORED {$id}: No sent info.\n";
    continue;
  }

  $actor_info = idx($sent_info, $actor_phid);
  if (!$actor_info) {
    echo "IGNORED {$id}: No actor sent info.\n";
    continue;
  }

  $to = idx($parameters, 'to', array());
  $cc = idx($parameters, 'cc', array());
  if (!in_array($actor_phid, $to) && !in_array($actor_phid, $cc)) {
    echo "IGNORED {$id}: No actor real recipient.\n";
    continue;
  }

  $deliverable = idx($actor_info, 'deliverable');
  if ($deliverable === null) {
    echo "IGNORED {$id}: No deliverable info.\n";
    continue;
  }

  if ($deliverable) {
    $sent_count++;
  } else if (in_array('self', idx($actor_info, 'reasons', array()))) {
    $self_count++;
  } else {
    $other_count++;
  }
}

echo ">Sent\t{$sent_count}\tSelf\t{$self_count}\tOther\t{$other_count}\n";

I aggregated the data across all sampled instances before sharing it internally.

epriestley moved this task from Backlog to Near Future on the Phacility board. Nov 14 2016, 7:02 PM

We queried a subset of free instances to gain insight into free instance usage patterns (mostly: are they actually being used?). Specifically, I ran these queries to count repositories and feed stories:

SELECT COUNT(*) FROM <INSTANCE>_repository.repository;
SELECT COUNT(*) FROM <INSTANCE>_feed.feed_storydata;

We queried a subset of instances to gain insight into how frequently repository data is hosted on GitLab, so we're better prepared to respond to any indirect fallout from a recent operational incident on their side:

SELECT * FROM <INSTANCE>_repository.repository_uri WHERE uri LIKE "%gitlab%";

We queried a subset of instances for the date of the most recent session activity, to estimate the impact of requiring free instances to log in periodically to remain active:

SELECT FROM_UNIXTIME(MAX(sessionExpires)) FROM <INSTANCE>_user.phabricator_session;

Continuing from above, I queried a subset of instances for session counts to help us understand usage patterns for free instances (see T12217):

SELECT COUNT(*) FROM <INSTANCE>_user.phabricator_session;

A small subset of instances was queried with the script below to assess the impact/priority of T12137. The script examines fulltext search queries, but its only output is the ratio of potentially affected unique queries first issued in the last 30 days, like this:

12.3% Affected

This script does not disclose the actual content (particularly, the search terms) used by queries, and no query content was directly examined.

fulltext_queries.php
<?php

require_once 'scripts/__init_script__.php';

$table = new PhabricatorSavedQuery();

$window = (PhabricatorTime::getNow() - phutil_units('30 days in seconds'));

$unique = array();
foreach (new LiskMigrationIterator($table) as $query) {
  if ($query->getDateCreated() < $window) {
    // Skip old queries (this may also skip common queries which were first
    // run a while ago, even if they have been run more recently).
    continue;
  }

  $fulltext_engine = 'PhabricatorSearchApplicationSearchEngine';
  if ($query->getEngineClassName() != $fulltext_engine) {
    // Skip queries which aren't fulltext global search queries.
    continue;
  }

  $unique[$query->getParameter('query')] = true;
}

$stopwords = Filesystem::readFile('resources/sql/stopwords.txt');
$stopwords = explode("\n", trim($stopwords));
$stopwords = array_fuse($stopwords);

$unique = array_keys($unique);

$affected = 0;
foreach ($unique as $u_query) {
  $tokens = preg_split('/\s+/', $u_query);
  foreach ($tokens as $token) {
    if (strlen($token) < 3) {
      $affected++;
      continue 2;
    }
    if (isset($stopwords[phutil_utf8_strtolower($token)])) {
      $affected++;
      continue 2;
    }
  }
}

printf("%0.1f%% Affected\n", 100 * ($affected / count($unique)));

I queried the production backup log to find the maximum size of any active instance's backups to contextualize decisions around backup and storage management in D18094.

I searched for .hgsub files containing URI paths matching ssh://- in working copies stored on disk to better assess the impact of the security issue described in T12961. I did not locate any.

We received our first credit card dispute on January 31st.

In responding to this dispute, I am accessing some of the customer's data, including: billing records, support history, and instance activity logs. This access requires disclosure under our data use policy (https://admin.phacility.com/L4).

I'll follow up with more details on what was accessed once we conclude our response.

I wrote our disclosure guideline thinking more about law enforcement requests than customer disputes, and this disclosure feels like it might come across as a little intrusive. I may update the policy to discuss disputes specifically, perhaps with language like this:

Card Disputes

If you file a routine payment dispute with us, we'll review what happened but handle it privately.

If you file a payment dispute (for example, a credit card dispute) we may access your data in a narrow way in order to respond to the dispute. When the access is routine and the scope of our access is tightly limited, we may not require senior personnel to authorize this access and may not publicly disclose this access. When responding to payment disputes, routine data access encompasses:

  • billing history;
  • support history;
  • anonymous usage information;
  • other similar information related to service and payments; and
  • other information directly related to the dispute.

If we access information outside this scope while responding to a payment dispute, we will treat this as an exceptional access and apply the normal standards for exceptional access to it, including authorization and disclosure requirements.

If anyone has feedback about this potential change, feel free to post on the Discourse forum or email me directly at epriestley@phacility.com.

Since we've only received one dispute after several years of running a service there's no rush for us to make a policy change here, but if we see more of this in the future I'd lean more toward crafting policy language around this case.

I'll follow up with more details on what was accessed once we conclude our response.

I primarily accessed billing and support history. I roughly assessed instance activity by examining the row counts and creation dates of feed stories in the phabricator_feed.feed_storydata table.
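
The row-count and creation-date check could be approximated with a query along these lines (a sketch; the dateCreated column name is an assumption based on the standard Lisk table layout):

```sql
SELECT COUNT(*) Stories,
  FROM_UNIXTIME(MIN(dateCreated)) Oldest,
  FROM_UNIXTIME(MAX(dateCreated)) Newest
  FROM <INSTANCE>_feed.feed_storydata;
```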

pasik added a subscriber: pasik. May 12 2018, 2:12 PM
epriestley moved this task from Near Future to Eternal on the Phacility board. Aug 10 2018, 6:06 PM

See PHI942 (internal) and PHI892 (internal).

Currently, we include a table of database migrations in the weekly changelog, including each migration's runtime here on secure. Sometimes this isn't a very good proxy for the actual cost of migrations on large installs, and it doesn't do enough to warn that there may be substantial downtime. In particular, a couple of self-hosted installs ran into multi-hour runtimes for the 20180910.audit.03.status.php migration, even though it took 54s on secure.

The runtime for migrations can vary widely based on many factors that I don't think we can accurately estimate. I have very low confidence we could ever build a tool like this that was even in the ballpark of correct, although this is probably how we'd like the tool to work:

$ ./bin/storage upgrade
This migration will take 9h 32m. Continue? [Y/n]

However, we can do a better job than we currently do by supplementing this table with data from the SAAS cluster. In particular, we could publish data like this (for example) by showing both the timing information from secure and the timing information from the deployment to the Phacility SAAS cluster:

Migration | secure | Mean SAAS | Max SAAS
xyz.sql   | 19s    | 7s        | 17h 42m
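
Reducing per-instance timings to mean/max columns like these is straightforward; for example (a sketch with invented durations, in seconds):

```shell
# Hypothetical sketch: reduce per-instance migration durations (seconds)
# for one migration to the mean/max values described above.
result="$(printf '%s\n' 7 3 63720 12 |
  awk '{ sum += $1; if ($1 > max) max = $1 }
       END { printf "mean %.1fs max %ds", sum / NR, max }')"
echo "$result"
```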

This implies querying and exposing some SAAS data as part of the deployment process (that is, the timing information itself). I'm generally hesitant about this, but it feels reasonable in this case because it is non-identifying timing data. I believe it could only leak information if we somehow ran an adversarial migration like this:

if ($password === 'hunter2') {
  sleep(17);
} else {
  sleep(3);
}

If the maximum SAAS instance time for this adversarial migration is 17s, it confirms that one instance has the password hunter2.

I can't imagine a way that we could accidentally write this kind of adversarial migration, and if an actual attacker can control a migration they can do far worse damage (migrations can execute arbitrary code).

You can also get some kind of proxy for this data already by looking at the timestamp difference between "Performing an upgrade shortly" and "Upgrade complete" on the @phacilitystatus Twitter.

Generally, disclosing this seems uncontroversial so I currently plan to have the cluster deployment process supplement the changelog with deployment data in some future release. This should give self-hosted installs better guidance about expected downtime for larger instances.