⚓ T360748 [User Story] Create the MPIC database schema

[User Story] Create the MPIC database schema
Closed, Resolved · Public · 3 Estimated Story Points

Description

Create a database schema for the Metrics Platform Instrument Configurator (MPIC) using the data modeling in the design doc.

Use Case

The MPIC app uses this database as the canonical data store of instrument configurations.

User Story/ies

As an engineer, I'd like to persist data about Metrics Platform instrument configurations in a data store that powers the MPIC API.

Outcome

A database schema implementing the MPIC data model is agreed upon.

Acceptance Criteria

  • Database schema creates tables outlined in the design doc.

Required

  • Testing Instructions
  • Documentation
  • Submit MPIC DB schema for creating the tables in staging and production

Technical Notes

T331516 provides sample schemas and queries that could be useful for this ticket.

Artifacts & Resources

Event Timeline

cjming set the point value for this task to 3. (Mar 22 2024, 8:02 PM)
cjming renamed this task from [User Story] Create/deploy the MPIC database to [User Story] Create the MPIC database schema. (Apr 19 2024, 2:52 AM)
cjming updated the task description.

Proposed database schema for MPIC:

schema.json
[
	{
		"name": "instruments",
		"columns": [
			{
				"name": "id",
				"comment": "The instrument ID for referring to for updating and deleting.",
				"type": "integer",
				"options": { "unsigned": true, "notnull": true, "autoincrement": true }
			},
			{
				"name": "name",
				"comment": "The instrument name.",
				"type": "binary",
				"options": { "notnull": true, "length": 255 }
			},
			{
				"name": "slug",
                                 "comment": "The instrument slug as a parameter for specific instrument endpoints.",
				"type": "binary",
				"options": { "notnull": true, "length": 255 }
			},
			{
				"name": "description",
				"comment": "The instrument description.",
				"type": "text",
				"options": { "notnull": true }
			},
			{
				"name": "creator",
				"comment": "The actor of the analytics instrument creator.",
				"type": "binary",
				"options": { "notnull": true, "length": 255 }
			},
			{
				"name": "purpose",
				"comment": "The purpose of the analytics instrument.",
				"type": "binary",
				"options": { "notnull": true, "length": 255 }
			},
			{
				"name": "created_at",
				"type": "mwtimestamp",
				"options": { "notnull": true }
			},
			{
				"name": "updated_at",
				"type": "mwtimestamp",
				"options": { "notnull": true }
			},
			{
				"name": "start_date",
				"type": "mwtimestamp",
				"options": { "notnull": true }
			},
			{
				"name": "end_date",
				"type": "mwtimestamp",
				"options": { "notnull": false }
			},
			{
				"name": "task",
				"comment": "The Phabricator task id.",
				"type": "text",
				"options": { "notnull": true }
			},
			{
				"name": "compliance_requirements",
				"type": "binary",
				"options": { "notnull": true, "length": 255 }
			},
			{
				"name": "sample_rate",
				"type": "mwtinyint",
				"options": { "notnull": true, "unsigned": true, "default": 0 }
			},
			{
				"name": "sample_unit",
				"type": "binary",
				"options": { "notnull": true, "length": 255 }
			},
			{
				"name": "contextual_attributes",
				"type": "text",
				"options": { "notnull": true }
			},
			{
				"name": "environments",
				"type": "binary",
				"options": { "notnull": true, "length": 255 }
			},
			{
				"name": "security_legal",
				"comment": "A link to documentation including review and approval from Security and Legal.",
				"type": "text",
				"options": { "notnull": true }
			}
		],
		"indexes": [
			{
				"name": "created_at",
				"comment": "Created sort",
				"columns": [ "created_at" ],
				"unique": false
			},
			{
				"name": "creator",
				"comment": "Creator sort",
				"columns": [ "creator", "updated_at" ],
				"unique": false
			},
			{
				"name": "updated_at",
				"comment": "Default sort",
				"columns": [ "updated_at" ],
				"unique": false
			}
		],
		"pk": [ "id" ]
	},
	{
		"name": "instrument_sample_rates",
		"columns": [
			{
				"name": "id",
				"comment": "The instrument sample rate ID for referring to for overriding sample rate.",
				"type": "integer",
				"options": { "unsigned": true, "notnull": true, "autoincrement": true }
			},
			{
				"name": "instrument_id",
				"comment": "The analytics instrument ID for referring to for foreign key constraint.",
				"type": "integer",
				"options": { "unsigned": true, "notnull": true }
			},
			{
				"name": "dblist",
				"comment": "Wiki that the stream applies to for overriding sample rate/unit.",
				"type": "binary",
				"options": { "notnull": true, "length": 255 }
			},
			{
				"name": "sample_rate",
				"type": "mwtinyint",
				"options": { "notnull": true, "unsigned": true, "default": 0 }
			}
		],
		"indexes": [],
		"pk": [ "id" ]
	}
]

Note that the above schema is formatted per https://www.mediawiki.org/wiki/Manual:Schema_changes#Notes when we thought the database would live in MediaWiki. Since this is now a standalone database in either WikiKube or the DSE-K8s cluster, maybe all we need is a sql file to create the tables we need. In any event, for the above format, I wasn't sure how to do enum fields for:

  • compliance_requirements
  • contextual_attributes
  • environments

According to https://www.mediawiki.org/wiki/Manual:Schema_changes#Best_practices_in_choosing_the_data_type, enums are discouraged anyway. If this is the case, then do we need to do separate tables for each with lookup tables?
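
For illustration only (these table and column names are hypothetical, not a proposal), "separate tables with lookup tables" for a field like environments could look something like:

CREATE TABLE IF NOT EXISTS environments (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE IF NOT EXISTS instrument_environments (
    instrument_id INT UNSIGNED NOT NULL,
    environment_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (instrument_id, environment_id)
);

INSERT INTO environments (name) VALUES ('development'), ('staging'), ('production'), ('external');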

I also have an outstanding question about the "task" field, and about confirming that we need one more column to hold a URL for "Security and Legal" (I erred on the side of adding the column in the schema/script above). TBD.

In case all we do need is a sql file, I am including the sql script we use for local development to create the database and build the tables:

init.sql
CREATE DATABASE IF NOT EXISTS mpic;

GRANT ALL PRIVILEGES ON mpic.* TO 'maria'@'%';

USE mpic;

CREATE TABLE IF NOT EXISTS instruments (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    slug VARCHAR(255) NOT NULL,
    description TEXT,
    creator VARCHAR(255) NOT NULL,
    owner VARCHAR(255) NOT NULL,
    purpose VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
    start_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
    end_date TIMESTAMP,
    task TEXT NOT NULL,
    compliance_requirements SET('legal', 'gdpr') NOT NULL,
    sample_unit VARCHAR(255) NOT NULL,
    sample_rate FLOAT NOT NULL,
    contextual_attributes SET(
        'agent_app_install_id',
        'agent_client_platform',
        'agent_client_platform_family',
        'page_id',
        'page_title',
        'page_namespace',
        'page_namespace_name',
        'page_revision_id',
        'page_wikidata_id',
        'page_wikidata_qid',
        'page_content_language',
        'page_is_redirect',
        'page_user_groups_allowed_to_move',
        'page_user_groups_allowed_to_edit',
        'mediawiki_skin',
        'mediawiki_version',
        'mediawiki_is_production',
        'mediawiki_is_debug_mode',
        'mediawiki_database',
        'mediawiki_site_content_language',
        'mediawiki_site_content_language_variant',
        'performer_is_logged_in',
        'performer_id',
        'performer_name',
        'performer_session_id',
        'performer_pageview_id',
        'performer_groups',
        'performer_is_bot',
        'performer_language',
        'performer_language_variant',
        'performer_can_probably_edit_page',
        'performer_edit_count',
        'performer_edit_count_bucket',
        'performer_registration_dt'
    ) NOT NULL,
    environments SET('development', 'staging', 'production', 'external') NOT NULL,
    security_legal TEXT,
    PRIMARY KEY (id)
);

CREATE TABLE IF NOT EXISTS instrument_sample_rates (
    instrument_id INT UNSIGNED NOT NULL,
    dblist VARCHAR(255) NOT NULL,
    sample_rate FLOAT NOT NULL,
    CONSTRAINT fk_de_type
    FOREIGN KEY (instrument_id)
    REFERENCES instruments (id)
);
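
For reference, the kind of lookup this override table supports would be roughly the following (the instrument id is just a placeholder; the application would fall back to the instrument's default sample_rate/sample_unit when no row matches):

SELECT dblist, sample_rate
FROM instrument_sample_rates
WHERE instrument_id = 1;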

@Sfaci does the above seem good enough for now to get the staging/prod dbs going? Once we're happy with it, we can provide the script or schema (not sure what format they need) for T361955.

cc @phuedx -- I pasted the questions I have about the task field and Security and Legal in the design doc if you could clarify - thanks!

Thanks @cjming! All this helps a lot to catch up on the database schema work.
I only have a few comments:

  • Just in case (although I would say we are not going to use that schema representation anyway): it seems the owner column is missing from the JSON schema you added above.
  • Regarding the SQL script, I would say that task and security_legal don't need to be represented as TEXT (65k characters). I think we could just use VARCHAR with a convenient maximum size.
  • According to the demo meeting we had last week, it seems we should add something to track the current status of an instrument (On, Off, Error). Could we take the opportunity to add a column to the instruments table to store it? I would say we can, at least, determine and save the On/Off status when a user turns the instrument on or off. From that meeting we also had an action item to try to determine the Error status, but as far as I understand, a status column is needed either way, as is the turn on/off feature. Please correct me if I'm wrong.

About the ENUM issue, I have taken a look at the ticket where the problem is described, and it seems that the three main reasons to avoid them don't apply to this project. It seems they could work for us. Also, that ticket was created in 2015; things could be different now.

I don't know if all this is enough to get the database working in both environments. I'm not sure what the SREs expect from us here, but an SQL script sounds good. This script would create our database and tables. Maybe we should remove the CREATE DATABASE ... and GRANT PRIVILEGES ... statements, and we may have to tune something else, but I think this is something we can start discussing with them, although we still have to explore how to deal with upgrades/migrations.

Just one more thing: should we close the previous ticket about the database schema, T331516: Design and get approval for database schema, as invalid?

cc @phuedx -- I pasted the questions I have about the task field and Security and Legal in the design doc if you could clarify - thanks!

Done.

Note that the above schema is formatted per https://www.mediawiki.org/wiki/Manual:Schema_changes#Notes when we thought the database would live in MediaWiki. Since this is now a standalone database in either WikiKube or the DSE-K8s cluster, maybe all we need is a sql file to create the tables we need. In any event, for the above format, I wasn't sure how to do enum fields for:

  • compliance_requirements
  • contextual_attributes
  • environments

According to https://www.mediawiki.org/wiki/Manual:Schema_changes#Best_practices_in_choosing_the_data_type, enums are discouraged anyway. If this is the case, then do we need to do separate tables for each with lookup tables?

How about a bitmask? The largest enum is instruments.contextual_attributes, which has 34 entries (currently). That'll require 34 bits to store. Using a BIGINT (64 bits) gives us flexibility in adding more context attributes as the platform evolves. The other enums are trivially small and could be stored in SMALLINTs (16 bits).
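
A minimal sketch of how a bitmask column could be queried, assuming contextual_attributes were stored as an unsigned BIGINT and page_id were (hypothetically) assigned the bit value 8:

-- Which instruments collect page_id?
SELECT id, name
FROM instruments
WHERE contextual_attributes & 8 <> 0;

-- Writing is just addition/OR of the selected bit values, e.g.
-- agent_app_install_id (1) + agent_client_platform (2) + page_id (8) = 11.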

  • Regarding the SQL script, I would say that task and security_legal don't need to be represented as TEXT (65k characters). I think we could just use VARCHAR with a convenient maximum size.

+1

Thanks @Sfaci, @phuedx

Responses inline:

owner column is missed in the json schema

whoops - will add that and include full revised init.sql in a follow up comment after getting some clarification

task and security_legal don't need to be represented as TEXT

how's this? Is a max size of 1000 still too big? It ends up being 150-200 words, or about half a page. I figure there might be reason to include multiple Phab task IDs/URLs for task, and either longer-form text or long/multiple URLs (Google Docs, etc.) for security_legal_review:

task VARCHAR(1000) NOT NULL,
security_legal_review VARCHAR(1000) NOT NULL,

status as a column is a reality and also the turn on/off feature

ah good point - i will add

status VARCHAR(255) NOT NULL,

ENUM issue

Good point - I'll insert an ENUM in the JSON schema and see if it flies. I'll also try Sam's suggestion re: bitmasking below.

not sure what SREs expect here from us but a SQL script sounds good

OK - I'll remove the top lines and just keep the CREATE statements; updated script to come in a follow-up comment.

should we close as invalid the previous ticket about the database schema T331516

It's already closed/resolved as status -- no need to reference it anymore. We can just use this ticket for settling the schema.

How about a bitmask? The largest enum is instruments.contextual_attributes, which has 34 entries (currently). That'll require 34 bits to store. Using a BIGINT (64 bits) gives us flexibility in adding more context attributes as the platform evolves. The other enums are trivially small and could be stored in SMALLINTs (16 bits).

Looks up how to do bitmasking...

So like this?


CREATE TABLE IF NOT EXISTS instruments (
    ...
    contextual_attributes BIGINT NOT NULL
    ...
);

CREATE TABLE IF NOT EXISTS contextual_attributes (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    attribute VARCHAR(255) NOT NULL,
    bitmaskvalue INT NOT NULL,
    PRIMARY KEY (id)
);

INSERT INTO contextual_attributes (attribute, bitmaskvalue) VALUES ('agent_app_install_id', 1);
INSERT INTO contextual_attributes (attribute, bitmaskvalue) VALUES ('agent_client_platform', 2);
INSERT INTO contextual_attributes (attribute, bitmaskvalue) VALUES ('agent_client_platform_family', 4);
INSERT INTO contextual_attributes (attribute, bitmaskvalue) VALUES ('page_id', 8);
INSERT INTO contextual_attributes (attribute, bitmaskvalue) VALUES ('page_title', 16);
INSERT INTO contextual_attributes (attribute, bitmaskvalue) VALUES ('page_namespace', 32);
INSERT INTO contextual_attributes (attribute, bitmaskvalue) VALUES ('page_namespace_name', 64);
INSERT INTO contextual_attributes (attribute, bitmaskvalue) VALUES ('page_revision_id', 128);
INSERT INTO contextual_attributes (attribute, bitmaskvalue) VALUES (<another attribute>, <increasing powers of 2>);
...

/* Then inserting into the instrument table with say 3 contextual attributes would look like this? */
INSERT INTO instruments (... contextual_attributes ...) VALUES (... 131  ...);
/* 131 in binary equals 00000000 00000000 10000011 meaning bit 1, 2, and 8 are included ?  i.e. agent_app_install_id, agent_client_platform, and page_revision_id ? */

/* another example */
INSERT INTO instruments (... contextual_attributes ...) VALUES (... 80379  ...);
/* 80379 in binary equals 00000001 00111001 11111011  << idk what this means - does this mean all 8 if we use inclusion vs intersection? */

I have never used bitmasks before in a db context -- after some research, I get that it's useful but it makes my brain hurt - binary is totally inscrutable. It makes more intuitive sense to me to create a lookup table and join them in queries -- if performance is not an issue, is this an acceptable approach?

Anyway, lmk if above is correct.

Btw this is just with 8 bits -- assuming we have a minimum of 34 bits that may change over time for the contextual attributes (and is this problematic? wouldn't we have to update all rows if we change something?), I can't get my head around binary numbers that big.

And as I hammer on the forms, the other thought I have is: if we have all the contextual attributes in their own table, could we expose these in an endpoint to power the options for its corresponding form field, rather than hard-code them in the app?

About the ENUM issue, I was wondering if we should focus on reducing the complexity here, considering what we would have to do if we add a new contextual attribute some day. We added a new one just a few days ago (maybe I'm wrong and that's not so usual). Is it worth taking that into consideration? It doesn't sound good to have to modify the database schema whenever we add a new contextual attribute. That's why I also like the classical approach Clare mentioned of having a separate table with all the values and just joining them when needed. It's even better for searching them when autocompleting in the ChipInput/Lookup component we have in the registration form. And if a contextual attribute is added or changed, we only need to add a new value or modify an existing one.
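
(Sketching that point: with a plain lookup table keyed on the attribute name, introducing a new contextual attribute is a one-row data change rather than a schema change. Table and column names here are placeholders only:)

INSERT INTO contextual_attributes (attribute) VALUES ('some_new_attribute');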

So like this?
<snip />

More like:

CREATE TABLE IF NOT EXISTS instruments (
    /* ... */
    contextual_attributes BIGINT NOT NULL
    /* ... */
);

const APP_AGENT_INSTALL_ID = 1;
const APP_AGENT_CLIENT_PLATFORM = 2;
const APP_AGENT_CLIENT_PLATFORM_FAMILY = 4;
const PAGE_ID = 8;

const _CONTEXT_ATTRIBUTES = [
        'APP_AGENT_INSTALL_ID',
        'APP_AGENT_CLIENT_PLATFORM',
        'APP_AGENT_CLIENT_PLATFORM_FAMILY',
        'PAGE_ID',
];

class ContextAttributes {
  /**
   * Encodes a list of context attributes.
   *
   * This method is expected to be used when processing raw HTML form input.
   *
   * @param {number[]} raw
   * @return {number} The encoded context attributes ready to insert into the backing store
   */
  encode( raw ) {
    return raw.reduce(
      ( result, contextAttribute ) => result + contextAttribute,
      0
    );
  }

  /**
   * @param {number} encoded
   * @param {number} contextAttribute
   *
   * @example
   * if ( ContextAttributes.hasContextAttribute( fromDB, APP_AGENT_INSTALL_ID ) ) {
   *   // ...
   * }
   */
  hasContextAttribute( encoded, contextAttribute ) {
    return Boolean( encoded & contextAttribute );
  }

  /**
   * Decodes the encoded context attributes to a list of names of attributes.
   *
   * This method is only expected to be used when debugging.
   *
   * @internal
   *
   * @param {number} encoded
   * @return {string[]}
   */
  decode( encoded ) {
    return _CONTEXT_ATTRIBUTES.reduce(
      ( result, contextAttribute, i ) => {
        if ( encoded & ( 2 ** i ) ) {
          result.push( contextAttribute );
        }

        return result;
      },
      []
    );
  }
}

We'd manipulate the bitmasks in the app and only store the resulting number in the DB.

Btw this is just with 8 bits -- assuming we have minimum 34 bits that may change over time in the contextual attributes context (and is this problematic? wouldn't we have to update all rows if we change something?), I can't get my head around binary numbers that big.

You would have to change all rows if you changed the values associated with an existing context attribute, yes. You wouldn't have to change all rows if you added a new one.

An unsigned BIGINT is 64 bits. That's plenty of room for growth if we start adding more context attributes.

And as I hammer on the forms, the other thought i have is if we have all the contextual attributes in their own table, could we expose these in an endpoint to power the options for its corresponding form field? Rather than hard-code them in the app?

Do you need an endpoint or can you query them once during app initialisation?

About the ENUM issue, I was wondering if we should focus on reducing the complexity here ...

@phuedx and I discussed - Sam is fine with however we want to build it. Sounds like we both still lean towards old-school lookup tables, so we'll go with that.

Updated proposed schema - going with a straight-up script. The JSON schema version has peculiarities related to MW, so let's just go with the CREATEs and INSERTs:

init.sql

CREATE TABLE IF NOT EXISTS instruments (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    slug VARCHAR(255) NOT NULL,
    description TEXT,
    creator VARCHAR(255) NOT NULL,
    owner VARCHAR(255) NOT NULL,
    purpose VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
    start_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
    end_date TIMESTAMP,
    task VARCHAR(1000) NOT NULL,
    compliance_requirements SET('legal', 'gdpr') NOT NULL,
    sample_unit VARCHAR(255) NOT NULL,
    sample_rate FLOAT NOT NULL,
    environments SET('development', 'staging', 'production', 'external') NOT NULL,
    security_legal_review VARCHAR(1000) NOT NULL,
    status VARCHAR(255) NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE IF NOT EXISTS instrument_sample_rates (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    instrument_id INT UNSIGNED NOT NULL,
    dblist VARCHAR(255) NOT NULL,
    sample_rate FLOAT NOT NULL,
    CONSTRAINT fk_de_type
    FOREIGN KEY (instrument_id)
    REFERENCES instruments (id)
);

CREATE TABLE IF NOT EXISTS contextual_attributes (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    contextual_attribute_name VARCHAR(255) NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE IF NOT EXISTS instrument_contextual_attribute_lookup (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    instrument_id INT UNSIGNED NOT NULL,
    contextual_attribute_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (id)
);

INSERT INTO contextual_attributes (contextual_attribute_name) VALUES
 ('agent_app_install_id'),
 ('agent_client_platform'),
 ('agent_client_platform_family'),
 ('page_id'),
 ('page_title'),
 ('page_namespace'),
 ('page_namespace_name'),
 ('page_revision_id'),
 ('page_wikidata_id'),
 ('page_wikidata_qid'),
 ('page_content_language'),
 ('page_is_redirect'),
 ('page_user_groups_allowed_to_move'),
 ('page_user_groups_allowed_to_edit'),
 ('mediawiki_skin'),
 ('mediawiki_version'),
 ('mediawiki_is_production'),
 ('mediawiki_is_debug_mode'),
 ('mediawiki_database'),
 ('mediawiki_site_content_language'),
 ('mediawiki_site_content_language_variant'),
 ('performer_is_logged_in'),
 ('performer_id'),
 ('performer_name'),
 ('performer_session_id'),
 ('performer_pageview_id'),
 ('performer_groups'),
 ('performer_is_bot'),
 ('performer_language'),
 ('performer_language_variant'),
 ('performer_can_probably_edit_page'),
 ('performer_edit_count'),
 ('performer_edit_count_bucket'),
 ('performer_registration_dt');
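
For reference, reading an instrument's contextual attributes with this structure would be a join along these lines (the instrument id is just a placeholder):

SELECT ca.contextual_attribute_name
FROM instrument_contextual_attribute_lookup ical
JOIN contextual_attributes ca ON ca.id = ical.contextual_attribute_id
WHERE ical.instrument_id = 1;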

I'm trying to test the db/init.sql script locally and not quite getting all the tables created. Still troubleshooting but if anything looks obvs wrong, please lmk.
Tested locally - tables are created during docker build and test data is inserted properly.
https://gitlab.wikimedia.org/repos/data-engineering/mpic/-/merge_requests/11

^^ MR provides additional things like creating db/user, using the correct db name, and one more INSERT statement to add test data for instruments - per discussion with @Sfaci, we will not include this INSERT on staging/prod, nor the CREATE db, USE db statements.