diff options
author | Nikolay Perfilov <pnv1@yandex-team.ru> | 2025-05-23 09:35:12 +0300 |
---|---|---|
committer | GitHub <noreply@github.com> | 2025-05-23 09:35:12 +0300 |
commit | 91510c3b3c7655a944d9b1e2aa3fc62de1a7588f (patch) | |
tree | 53b83eeea3321f7e2bd6db321deb1ccfa653bc21 | |
parent | 985adf4d2e990d931958ed19662b61b9987b07e5 (diff) | |
download | ydb-91510c3b3c7655a944d9b1e2aa3fc62de1a7588f.tar.gz |
Add import dataset examples (#16250)
33 files changed, 1904 insertions, 1 deletions
diff --git a/ydb/docs/en/core/olap/_includes/olap-links.md b/ydb/docs/en/core/olap/_includes/olap-links.md new file mode 100644 index 00000000000..ac6b93cf458 --- /dev/null +++ b/ydb/docs/en/core/olap/_includes/olap-links.md @@ -0,0 +1,5 @@ +## Related sections + +* [{#T}](../../faq/analytics.md) +* [{#T}](../../concepts/datamodel/table.md#column-oriented-tables) +* [{#T}](../../yql/reference/builtins/aggregation.md)
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/datasets/_includes/intro.md b/ydb/docs/en/core/olap/datasets/_includes/intro.md new file mode 100644 index 00000000000..ed19a4cb7e6 --- /dev/null +++ b/ydb/docs/en/core/olap/datasets/_includes/intro.md @@ -0,0 +1,5 @@ +{% note info %} + +This page is part of the [Dataset Import](../index.md) section, which includes examples of loading popular datasets into {{ ydb-short-name }}. Before starting, please review the [general information](../index.md#general-info) on requirements and the import process. + +{% endnote %}
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/datasets/animal-crossing.md b/ydb/docs/en/core/olap/datasets/animal-crossing.md new file mode 100644 index 00000000000..b88bbda5d25 --- /dev/null +++ b/ydb/docs/en/core/olap/datasets/animal-crossing.md @@ -0,0 +1,157 @@ +# Animal Crossing New Horizons Catalog + +{% include [intro](_includes/intro.md) %} + +A catalog of items from the popular game Animal Crossing: New Horizons. + +**Source**: [Kaggle - Animal Crossing New Horizons Catalog](https://www.kaggle.com/datasets/jessicali9530/animal-crossing-new-horizons-nookplaza-dataset/) + +**Size**: 51 KB + +## Loading Example + +1. Download and unzip the `accessories.csv` file from Kaggle. + +2. This file includes a BOM (Byte Order Mark). However, the import command does not support files with a BOM. To resolve this, remove the BOM bytes from the beginning of the file by executing the following command: + + ```bash + sed -i '1s/^\xEF\xBB\xBF//' accessories.csv + ``` + +3. The column names in the file contain spaces, which are incompatible with YDB since YDB does not support spaces in column names. Replace spaces in the column names with underscores, for example, by executing the following command: + + ```bash + sed -i '1s/ /_/g' accessories.csv + ``` + +4. Create a table in {{ ydb-short-name }} using one of the following methods: + + {% list tabs %} + + - Embedded UI + + For more information on [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `accessories` ( + `Name` Text NOT NULL, + `Variation` Text NOT NULL, + `DIY` Text NOT NULL, + `Buy` Text NOT NULL, + `Sell` Uint64 NOT NULL, + `Color_1` Text NOT NULL, + `Color_2` Text NOT NULL, + `Size` Text NOT NULL, + `Miles_Price` Text NOT NULL, + `Source` Text NOT NULL, + `Source_Notes` Text NOT NULL, + `Seasonal_Availability` Text NOT NULL, + `Mannequin_Piece` Text NOT NULL, + `Version` Text NOT NULL, + `Style` Text NOT NULL, + `Label_Themes` Text NOT NULL, + `Type` Text NOT NULL, + `Villager_Equippable` Text NOT NULL, + `Catalog` Text NOT NULL, + `Filename` Text NOT NULL, + `Internal_ID` Uint64 NOT NULL, + `Unique_Entry_ID` Text NOT NULL, + PRIMARY KEY (`Unique_Entry_ID`) + ) + WITH ( + STORE = COLUMN + ); + ``` + + - YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `accessories` ( + `Name` Text NOT NULL, + `Variation` Text NOT NULL, + `DIY` Text NOT NULL, + `Buy` Text NOT NULL, + `Sell` Uint64 NOT NULL, + `Color_1` Text NOT NULL, + `Color_2` Text NOT NULL, + `Size` Text NOT NULL, + `Miles_Price` Text NOT NULL, + `Source` Text NOT NULL, + `Source_Notes` Text NOT NULL, + `Seasonal_Availability` Text NOT NULL, + `Mannequin_Piece` Text NOT NULL, + `Version` Text NOT NULL, + `Style` Text NOT NULL, + `Label_Themes` Text NOT NULL, + `Type` Text NOT NULL, + `Villager_Equippable` Text NOT NULL, + `Catalog` Text NOT NULL, + `Filename` Text NOT NULL, + `Internal_ID` Uint64 NOT NULL, + `Unique_Entry_ID` Text NOT NULL, + PRIMARY KEY (`Unique_Entry_ID`) + ) + WITH ( + STORE = COLUMN + );' + ``` + + {% endlist %} + +5. Execute the import command: + + ```bash + ydb import file csv --header --path accessories accessories.csv + ``` + +## Analytical Query Example + +Identify the top five most popular primary colors of accessories: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + Color_1, + COUNT(*) AS color_count + FROM accessories + GROUP BY Color_1 + ORDER BY color_count DESC + LIMIT 5; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + Color_1, + COUNT(*) AS color_count + FROM accessories + GROUP BY Color_1 + ORDER BY color_count DESC + LIMIT 5;' + ``` + +{% endlist %} + +Result: + +```raw +┌──────────┬─────────────┐ +│ Color_1 │ color_count │ +├──────────┼─────────────┤ +│ "Black" │ 31 │ +├──────────┼─────────────┤ +│ "Green" │ 27 │ +├──────────┼─────────────┤ +│ "Pink" │ 20 │ +├──────────┼─────────────┤ +│ "Red" │ 20 │ +├──────────┼─────────────┤ +│ "Yellow" │ 19 │ +└──────────┴─────────────┘ +```
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/datasets/chess.md b/ydb/docs/en/core/olap/datasets/chess.md new file mode 100644 index 00000000000..b82d0904c40 --- /dev/null +++ b/ydb/docs/en/core/olap/datasets/chess.md @@ -0,0 +1,105 @@ +# Chess Position Evaluations + +{% include [intro](_includes/intro.md) %} + +The dataset includes 513 million chess position evaluations performed by the Stockfish engine for analysis on the Lichess platform. + +**Source**: [Kaggle - Chess Position Evaluations](https://www.kaggle.com/datasets/lichess/chess-evaluations) + +**Size**: 59.66 GB + +## Loading Example + +1. Download the `evals.csv` file from Kaggle. + +2. Create a table in {{ ydb-short-name }} using one of the following methods: + + {% list tabs %} + + - Embedded UI + + For more information on [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `evals` ( + `fen` Text NOT NULL, + `line` Text NOT NULL, + `depth` Uint64, + `knodes` Uint64, + `cp` Double, + `mate` Double, + PRIMARY KEY (`fen`, `line`) + ) + WITH ( + STORE = COLUMN, + UNIFORM_PARTITIONS = 50 + ); + ``` + + - YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `evals` ( + `fen` Text NOT NULL, + `line` Text NOT NULL, + `depth` Uint64, + `knodes` Uint64, + `cp` Double, + `mate` Double, + PRIMARY KEY (`fen`, `line`) + ) + WITH ( + STORE = COLUMN, + UNIFORM_PARTITIONS = 50 + );' + ``` + {% endlist %} + +3. Execute the import command: + + ```bash + ydb import file csv --header --null-value "" --path evals evals.csv + ``` + +## Analytical Query Example + +Identify positions with the highest number of moves analyzed by the Stockfish engine: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + fen, + MAX(depth) AS max_depth, + SUM(knodes) AS total_knodes + FROM evals + GROUP BY fen + ORDER BY max_depth DESC + LIMIT 10; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + fen, + MAX(depth) AS max_depth, + SUM(knodes) AS total_knodes + FROM evals + GROUP BY fen + ORDER BY max_depth DESC + LIMIT 10;' + ``` + +{% endlist %} + +This query performs the following actions: + +* Finds positions (represented in FEN format) with the maximum analysis depth. +* Sums the number of analyzed nodes (knodes) for each position. +* Sorts results by maximum analysis depth in descending order. +* Outputs the top 10 positions with the highest analysis depth.
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/datasets/covid.md b/ydb/docs/en/core/olap/datasets/covid.md new file mode 100644 index 00000000000..4d6bdf2972e --- /dev/null +++ b/ydb/docs/en/core/olap/datasets/covid.md @@ -0,0 +1,159 @@ +# COVID-19 Open Research Dataset + +{% include [intro](_includes/intro.md) %} + +An open dataset of COVID-19 research. + +**Source**: [Kaggle - COVID-19 Open Research Dataset Challenge](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv) + +**Size**: 1.65 GB (metadata.csv file) + +## Loading Example + +1. Download and unzip the `metadata.csv` file from Kaggle. + +2. The dataset includes completely identical rows. Since YDB requires unique primary key values, add a new column named `row_id` to the file, where the key value will be equal to the row number in the original file. This prevents the removal of duplicate data. This operation can be carried out using the awk command: + + ```bash + awk 'NR==1 {print "row_id," \$0; next} {print NR-1 "," \$0}' metadata.csv > temp.csv && mv temp.csv metadata.csv + ``` + +3. Create a table in {{ ydb-short-name }} using one of the following methods: + + {% list tabs %} + + - Embedded UI + + For more information on [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `covid_research` ( + `row_id` Uint64 NOT NULL, + `cord_uid` Text NOT NULL, + `sha` Text NOT NULL, + `source_x` Text NOT NULL, + `title` Text NOT NULL, + `doi` Text NOT NULL, + `pmcid` Text NOT NULL, + `pubmed_id` Text NOT NULL, + `license` Text NOT NULL, + `abstract` Text NOT NULL, + `publish_time` Text NOT NULL, + `authors` Text NOT NULL, + `journal` Text NOT NULL, + `mag_id` Text, + `who_covidence_id` Text, + `arxiv_id` Text, + `pdf_json_files` Text NOT NULL, + `pmc_json_files` Text NOT NULL, + `url` Text NOT NULL, + `s2_id` Uint64, + PRIMARY KEY (`row_id`) + ) + WITH ( + STORE = COLUMN + ); + ``` + + - YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `covid_research` ( + `row_id` Uint64 NOT NULL, + `cord_uid` Text NOT NULL, + `sha` Text NOT NULL, + `source_x` Text NOT NULL, + `title` Text NOT NULL, + `doi` Text NOT NULL, + `pmcid` Text NOT NULL, + `pubmed_id` Text NOT NULL, + `license` Text NOT NULL, + `abstract` Text NOT NULL, + `publish_time` Text NOT NULL, + `authors` Text NOT NULL, + `journal` Text NOT NULL, + `mag_id` Text, + `who_covidence_id` Text, + `arxiv_id` Text, + `pdf_json_files` Text NOT NULL, + `pmc_json_files` Text NOT NULL, + `url` Text NOT NULL, + `s2_id` Uint64, + PRIMARY KEY (`row_id`) + ) + WITH ( + STORE = COLUMN + );' + ``` + + {% endlist %} + +4. Execute the import command: + + ```bash + ydb import file csv --header --null-value "" --path covid_research metadata.csv + ``` + +## Analytical Query Example + +Run a query to determine the journals with the highest number of publications: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + journal, + COUNT(*) AS publication_count + FROM covid_research + WHERE journal IS NOT NULL AND journal != '' + GROUP BY journal + ORDER BY publication_count DESC + LIMIT 10; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + journal, + COUNT(*) AS publication_count + FROM covid_research + WHERE journal IS NOT NULL AND journal != "" + GROUP BY journal + ORDER BY publication_count DESC + LIMIT 10;' + ``` + +{% endlist %} + +Result: + +```raw +┌───────────────────────────────────┬───────────────────┐ +│ journal │ publication_count │ +├───────────────────────────────────┼───────────────────┤ +│ "PLoS One" │ 9953 │ +├───────────────────────────────────┼───────────────────┤ +│ "bioRxiv" │ 8961 │ +├───────────────────────────────────┼───────────────────┤ +│ "Int J Environ Res Public Health" │ 8201 │ +├───────────────────────────────────┼───────────────────┤ +│ "BMJ" │ 6928 │ +├───────────────────────────────────┼───────────────────┤ +│ "Sci Rep" │ 5935 │ +├───────────────────────────────────┼───────────────────┤ +│ "Cureus" │ 4212 │ +├───────────────────────────────────┼───────────────────┤ +│ "Reactions Weekly" │ 3891 │ +├───────────────────────────────────┼───────────────────┤ +│ "Front Psychol" │ 3541 │ +├───────────────────────────────────┼───────────────────┤ +│ "BMJ Open" │ 3515 │ +├───────────────────────────────────┼───────────────────┤ +│ "Front Immunol" │ 3442 │ +└───────────────────────────────────┴───────────────────┘ +```
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/datasets/ecommerce.md b/ydb/docs/en/core/olap/datasets/ecommerce.md new file mode 100644 index 00000000000..455386ad6b9 --- /dev/null +++ b/ydb/docs/en/core/olap/datasets/ecommerce.md @@ -0,0 +1,145 @@ +# E-Commerce Behavior Data + +{% include [intro](_includes/intro.md) %} + +User behavior data from a multi-category online store. + +**Source**: [Kaggle - E-commerce behavior data](https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store/data) + +**Size**: 9 GB + +## Loading Example + +1. Download and unzip the `2019-Nov.csv` file from Kaggle. + +2. The dataset includes completely identical rows. Since YDB requires unique primary key values, add a new column named `row_id` to the file, where the key value will be equal to the row number in the original file. This prevents the removal of duplicate data. This operation can be carried out using the awk command: + + ```bash + awk 'NR==1 {print "row_id," \$0; next} {print NR-1 "," \$0}' 2019-Nov.csv > temp.csv && mv temp.csv 2019-Nov.csv + ``` + +3. Create a table in {{ ydb-short-name }} using one of the following methods: + + {% list tabs %} + + - Embedded UI + + For more information on [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `ecommerce_table` ( + `row_id` Uint64 NOT NULL, + `event_time` Text NOT NULL, + `event_type` Text NOT NULL, + `product_id` Uint64 NOT NULL, + `category_id` Uint64, + `category_code` Text, + `brand` Text, + `price` Double NOT NULL, + `user_id` Uint64 NOT NULL, + `user_session` Text NOT NULL, + PRIMARY KEY (`row_id`) + ) + WITH ( + STORE = COLUMN, + UNIFORM_PARTITIONS = 50 + ); + ``` + + - YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `ecommerce_table` ( + `row_id` Uint64 NOT NULL, + `event_time` Text NOT NULL, + `event_type` Text NOT NULL, + `product_id` Uint64 NOT NULL, + `category_id` Uint64, + `category_code` Text, + `brand` Text, + `price` Double NOT NULL, + `user_id` Uint64 NOT NULL, + `user_session` Text NOT NULL, + PRIMARY KEY (`row_id`) + ) + WITH ( + STORE = COLUMN, + UNIFORM_PARTITIONS = 50 + );' + ``` + + {% endlist %} + +4. Execute the import command: + + ```bash + ydb import file csv --header --null-value "" --path ecommerce_table 2019-Nov.csv + ``` + +## Analytical Query Example + +Identify the most popular product categories on November 1, 2019: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + category_code, + COUNT(*) AS view_count + FROM ecommerce_table + WHERE + SUBSTRING(CAST(event_time AS String), 0, 10) = '2019-11-01' + AND event_type = 'view' + GROUP BY category_code + ORDER BY view_count DESC + LIMIT 10; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + category_code, + COUNT(*) AS view_count + FROM ecommerce_table + WHERE + SUBSTRING(CAST(event_time AS String), 0, 10) = "2019-11-01" + AND event_type = "view" + GROUP BY category_code + ORDER BY view_count DESC + LIMIT 10;' + ``` + +{% endlist %} + +Result: + +```raw +┌────────────────────────────────────┬────────────┐ +│ category_code │ view_count │ +├────────────────────────────────────┼────────────┤ +│ null │ 453024 │ +├────────────────────────────────────┼────────────┤ +│ "electronics.smartphone" │ 360650 │ +├────────────────────────────────────┼────────────┤ +│ "electronics.clocks" │ 43581 │ +├────────────────────────────────────┼────────────┤ +│ "computers.notebook" │ 40878 │ +├────────────────────────────────────┼────────────┤ +│ "electronics.video.tv" │ 40383 │ +├────────────────────────────────────┼────────────┤ +│ "electronics.audio.headphone" │ 37489 │ +├────────────────────────────────────┼────────────┤ +│ "apparel.shoes" │ 31013 │ +├────────────────────────────────────┼────────────┤ +│ "appliances.kitchen.washer" │ 28028 │ +├────────────────────────────────────┼────────────┤ +│ "appliances.kitchen.refrigerators" │ 27808 │ +├────────────────────────────────────┼────────────┤ +│ "appliances.environment.vacuum" │ 26477 │ +└────────────────────────────────────┴────────────┘ +```
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/datasets/index.md b/ydb/docs/en/core/olap/datasets/index.md new file mode 100644 index 00000000000..ebbd648cd0b --- /dev/null +++ b/ydb/docs/en/core/olap/datasets/index.md @@ -0,0 +1,54 @@ +# Overview + +These pages describe popular datasets that you can load into {{ ydb-short-name }} to familiarize yourself with the database's functionality and test various use cases. + +## Prerequisites + +To load the datasets, you will need: + +1. Installed [{{ ydb-short-name }} CLI](../../reference/ydb-cli/) +2. [Optional] Configured [connection profile](../../reference/ydb-cli/profile/create.md) to {{ ydb-short-name }} to avoid specifying connection parameters with every command + +## General Information on Data Loading {#general-info} + +{{ ydb-short-name }} supports importing data from CSV files using the [command](../../reference/ydb-cli/export-import/import-file.md) `ydb import file csv`. Example command: + +```bash +ydb import file csv --header --null-value "" --path <table_path> <file>.csv +``` + +Where: + +* `--header` indicates that the first row of the file contains the column names, and the actual data starts from the second row; +* `--null-value ""` specifies that an empty string in the CSV will be interpreted as a null value during data import into the table. + +A table must already exist in {{ ydb-short-name }} for data import. The primary way to create a table is by executing the [`CREATE TABLE` YQL query](../../yql/reference/syntax/create_table/index.md). Instead of writing the query manually, you can try running the import command from a file, as shown in any example in this section, without creating the table first. In this case, the CLI will suggest a `CREATE TABLE` query, which you can use as a base, edit if necessary, and execute. + +To import data into {{ ydb-short-name }}, a table must be pre-created. Typically, a table is created using the [YQL `CREATE TABLE` query](../../yql/reference/syntax/create_table/index.md). However, instead of crafting such a query manually, you can initiate the import command `ydb import file csv` as shown in the import examples in this section. If the table doesn't exist, the CLI will automatically suggest a `CREATE TABLE` query that you can use to create the table. + +{% note info "Selecting a Primary Key" %} + +{{ ydb-short-name }} requires a primary key for the table. It significantly speeds up data loading and processing, and it also allows for deduplication: rows with identical values in the primary key columns replace each other. + +If the imported dataset doesn't have suitable columns for a primary key, we add a new column with row numbers and use it as the primary key, as each row number is unique within the file. + +{% endnote %} + +## Features and Limitations + +When woimporting data to {{ ydb-short-name }}, consider the following points: + +1. **Column Names**: Column names should not contain spaces or special characters. + +2. **Data Types**: + - Date/time strings with timezone (e.g., "2019-11-01 00:00:00 UTC") will be imported as Text type. + - The Bool type is not supported as a column type; use Text or Int64 instead. + +## Available Datasets + +* [Chess Position Evaluations](chess.md) - Stockfish engine chess position evaluations +* [Video Game Sales](video-games.md) - video game sales data +* [E-Commerce Behavior Data](ecommerce.md) - user behavior data from an online store +* [COVID-19 Open Research Dataset](covid.md) - open research dataset on COVID-19 +* [Netflix Movies and TV Shows](netflix.md) - data on Netflix movies and shows +* [Animal Crossing New Horizons Catalog](animal-crossing.md) - item catalog from the game
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/datasets/netflix.md b/ydb/docs/en/core/olap/datasets/netflix.md new file mode 100644 index 00000000000..2759e299105 --- /dev/null +++ b/ydb/docs/en/core/olap/datasets/netflix.md @@ -0,0 +1,127 @@ +# Netflix Movies and TV Shows + +{% include [intro](_includes/intro.md) %} + +Data on movies and TV shows available on Netflix. + +**Source**: [Kaggle - Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows) + +**Size**: 3.4 MB + +## Loading Example + +1. Download and unzip the `netflix_titles.csv` file from Kaggle. + +2. Create a table in {{ ydb-short-name }} using one of the following methods: + + {% list tabs %} + + - Embedded UI + + For more information on [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `netflix` ( + `show_id` Text NOT NULL, + `type` Text NOT NULL, + `title` Text NOT NULL, + `director` Text NOT NULL, + `cast` Text, + `country` Text NOT NULL, + `date_added` Text NOT NULL, + `release_year` Uint64 NOT NULL, + `rating` Text NOT NULL, + `duration` Text NOT NULL, + `listed_in` Text NOT NULL, + `description` Text NOT NULL, + PRIMARY KEY (`show_id`) + ) + WITH ( + STORE = COLUMN + ); + ``` + + - YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `netflix` ( + `show_id` Text NOT NULL, + `type` Text NOT NULL, + `title` Text NOT NULL, + `director` Text NOT NULL, + `cast` Text, + `country` Text NOT NULL, + `date_added` Text NOT NULL, + `release_year` Uint64 NOT NULL, + `rating` Text NOT NULL, + `duration` Text NOT NULL, + `listed_in` Text NOT NULL, + `description` Text NOT NULL, + PRIMARY KEY (`show_id`) + ) + WITH ( + STORE = COLUMN + );' + ``` + + {% endlist %} + +3. Execute the import command: + + ```bash + ydb import file csv --header --null-value "" --path netflix netflix_titles.csv + ``` + +## Analytical Query Example + +Identify the top three countries with the most content added to Netflix in 2020: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + country, + COUNT(*) AS count + FROM netflix + WHERE + CAST(SUBSTRING(CAST(date_added AS String), 7, 4) AS Int32) = 2020 + AND date_added IS NOT NULL + GROUP BY country + ORDER BY count DESC + LIMIT 3; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + country, + COUNT(*) AS count + FROM netflix + WHERE + CAST(SUBSTRING(CAST(date_added AS String), 7, 4) AS Int32) = 2020 + AND date_added IS NOT NULL + GROUP BY country + ORDER BY count DESC + LIMIT 3;' + ``` + +{% endlist %} + +Result: + +```raw +┌─────────────────┬───────┐ +│ country │ count │ +├─────────────────┼───────┤ +│ "United States" │ 22 │ +├─────────────────┼───────┤ +│ "" │ 7 │ +├─────────────────┼───────┤ +│ "Canada" │ 3 │ +└─────────────────┴───────┘ +```
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/datasets/toc_p.yaml b/ydb/docs/en/core/olap/datasets/toc_p.yaml new file mode 100644 index 00000000000..4936e305cd3 --- /dev/null +++ b/ydb/docs/en/core/olap/datasets/toc_p.yaml @@ -0,0 +1,15 @@ +items: + - name: Overview + href: index.md + - name: Chess Position Evaluations + href: chess.md + - name: Video Game Sales + href: video-games.md + - name: E-Commerce Behavior Data + href: ecommerce.md + - name: COVID-19 Open Research Dataset + href: covid.md + - name: Netflix Movies and TV Shows + href: netflix.md + - name: Animal Crossing New Horizons Catalog + href: animal-crossing.md
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/datasets/video-games.md b/ydb/docs/en/core/olap/datasets/video-games.md new file mode 100644 index 00000000000..ea4f19f9644 --- /dev/null +++ b/ydb/docs/en/core/olap/datasets/video-games.md @@ -0,0 +1,117 @@ +# Video Game Sales + +{% include [intro](_includes/intro.md) %} + +Data on video game sales. + +**Source**: [Kaggle - Video Game Sales](https://www.kaggle.com/datasets/gregorut/videogamesales) + +**Size**: 1.36 MB + +## Loading Example + +1. Download and unzip the `vgsales.csv` file from Kaggle. + +2. Create a table in {{ ydb-short-name }} using one of the following methods: + + {% list tabs %} + + - Embedded UI + + For more information on [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `vgsales` ( + `Rank` Uint64 NOT NULL, + `Name` Text NOT NULL, + `Platform` Text NOT NULL, + `Year` Text NOT NULL, + `Genre` Text NOT NULL, + `Publisher` Text NOT NULL, + `NA_Sales` Double NOT NULL, + `EU_Sales` Double NOT NULL, + `JP_Sales` Double NOT NULL, + `Other_Sales` Double NOT NULL, + `Global_Sales` Double NOT NULL, + PRIMARY KEY (`Rank`) + ) + WITH ( + STORE = COLUMN + ); + ``` + + - YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `vgsales` ( + `Rank` Uint64 NOT NULL, + `Name` Text NOT NULL, + `Platform` Text NOT NULL, + `Year` Text NOT NULL, + `Genre` Text NOT NULL, + `Publisher` Text NOT NULL, + `NA_Sales` Double NOT NULL, + `EU_Sales` Double NOT NULL, + `JP_Sales` Double NOT NULL, + `Other_Sales` Double NOT NULL, + `Global_Sales` Double NOT NULL, + PRIMARY KEY (`Rank`) + ) + WITH ( + STORE = COLUMN + );' + ``` + + {% endlist %} + +3. Execute the import command: + + ```bash + ydb import file csv --header --null-value "" --path vgsales vgsales.csv + ``` + +## Analytical Query Example + +To identify the publisher with the highest average game sales in North America, execute the query: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + Publisher, + AVG(NA_Sales) AS average_na_sales + FROM vgsales + GROUP BY Publisher + ORDER BY average_na_sales DESC + LIMIT 1; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + Publisher, + AVG(NA_Sales) AS average_na_sales + FROM vgsales + GROUP BY Publisher + ORDER BY average_na_sales DESC + LIMIT 1;' + ``` + +{% endlist %} + +Result: + +```raw +┌───────────┬──────────────────┐ +│ Publisher │ average_na_sales │ +├───────────┼──────────────────┤ +│ "Palcom" │ 3.38 │ +└───────────┴──────────────────┘ +``` + +This query helps find the publisher with the greatest success in North America by average sales.
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/index.md b/ydb/docs/en/core/olap/index.md new file mode 100644 index 00000000000..7dae3da1bc8 --- /dev/null +++ b/ydb/docs/en/core/olap/index.md @@ -0,0 +1,11 @@ +# For Analysts + +This section provides examples and recommendations for handling [analytical (OLAP) scenarios](../faq/analytics.md) in {{ ydb-short-name }}. + +## This section includes the following materials + +* [{#T}](scenarios.md) +* [{#T}](optimizing.md) +* [Dataset import](datasets/index.md) + +{% include [olap-links](_includes/olap-links.md) %}
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/optimizing.md b/ydb/docs/en/core/olap/optimizing.md new file mode 100644 index 00000000000..6397049c783 --- /dev/null +++ b/ydb/docs/en/core/olap/optimizing.md @@ -0,0 +1,17 @@ +# Optimizing analytical queries + +1. **Selecting the right primary key** + * Use columns frequently involved in filtering + * Consider column order to optimize range queries + +2. **Optimizing the data schema** + * Use [column-based storage](../concepts/datamodel/table.md#column-oriented-tables) for analytical tables + * Choose data types carefully to save space + * Implement [partitioning](../concepts/datamodel/table.md#olap-tables-partitioning) for large (larger than 2Gb) analytical tables + +3. **Query optimization** + * Retrieve only necessary columns + * Use appropriate [indexes](../concepts/secondary_indexes.md) + * Formulate filter conditions effectively + +{% include [olap-links](_includes/olap-links.md) %}
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/scenarios.md b/ydb/docs/en/core/olap/scenarios.md new file mode 100644 index 00000000000..11b09326906 --- /dev/null +++ b/ydb/docs/en/core/olap/scenarios.md @@ -0,0 +1,18 @@ +# Typical OLAP scenarios + +1. **Batch analytics** + * Aggregating large volumes of historical data + * Generating reports over extended periods + * Calculating statistics across entire datasets + +2. **Interactive analytics** + * Performing ad-hoc queries for data exploration + * Building dynamic dashboards + * Conducting drill-down analyses at various levels of detail + +3. **Data preprocessing** + * Cleaning and normalizing data + * Enriching data from multiple sources + * Preparing data for machine learning + +{% include [olap-links](_includes/olap-links.md) %}
\ No newline at end of file diff --git a/ydb/docs/en/core/olap/toc_p.yaml b/ydb/docs/en/core/olap/toc_p.yaml new file mode 100644 index 00000000000..eedd4b95791 --- /dev/null +++ b/ydb/docs/en/core/olap/toc_p.yaml @@ -0,0 +1,10 @@ +items: + - name: Typical OLAP scenarios + href: scenarios.md + - name: Optimizing analytical queries + href: optimizing.md + - name: Dataset import + href: datasets/index.md + include: + mode: link + path: datasets/toc_p.yaml
\ No newline at end of file diff --git a/ydb/docs/en/core/recipes/index.md b/ydb/docs/en/core/recipes/index.md index a80ec20c7aa..4afc67eb5a5 100644 --- a/ydb/docs/en/core/recipes/index.md +++ b/ydb/docs/en/core/recipes/index.md @@ -5,3 +5,4 @@ This section of {{ ydb-short-name }} documentation contains ready-to-use recipes * [{#T}](ydb-sdk/index.md) * [{#T}](ydb-cli/index.md) * [{#T}](../yql/reference/recipes/index.md) +* [{#T}](../olap/index.md) diff --git a/ydb/docs/en/core/recipes/toc_p.yaml b/ydb/docs/en/core/recipes/toc_p.yaml index 19bb3ca8a06..ad6cc009e13 100644 --- a/ydb/docs/en/core/recipes/toc_p.yaml +++ b/ydb/docs/en/core/recipes/toc_p.yaml @@ -12,3 +12,8 @@ items: include: mode: link path: ../yql/reference/recipes/toc_i.yaml +- name: For Analysts + href: ../olap/index.md + include: + mode: link + path: ../olap/toc_p.yaml diff --git a/ydb/docs/ru/core/olap/_includes/olap-links.md b/ydb/docs/ru/core/olap/_includes/olap-links.md new file mode 100644 index 00000000000..bdde0875479 --- /dev/null +++ b/ydb/docs/ru/core/olap/_includes/olap-links.md @@ -0,0 +1,5 @@ +## Связанные разделы + +* [{#T}](../../faq/analytics.md) +* [{#T}](../../concepts/datamodel/table.md#column-oriented-tables) +* [{#T}](../../yql/reference/builtins/aggregation.md)
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/datasets/_includes/intro.md b/ydb/docs/ru/core/olap/datasets/_includes/intro.md new file mode 100644 index 00000000000..74eefd12b88 --- /dev/null +++ b/ydb/docs/ru/core/olap/datasets/_includes/intro.md @@ -0,0 +1,5 @@ +{% note info %} + +Эта страница является частью раздела [Импорт датасетов](../index.md), где описаны примеры загрузки популярных наборов данных в {{ ydb-short-name }}. Перед началом работы ознакомьтесь с [общей информацией](../index.md#general-info) о требованиях и процессе импорта. + +{% endnote %}
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/datasets/animal-crossing.md b/ydb/docs/ru/core/olap/datasets/animal-crossing.md new file mode 100644 index 00000000000..e1ff02106d2 --- /dev/null +++ b/ydb/docs/ru/core/olap/datasets/animal-crossing.md @@ -0,0 +1,157 @@ +# Animal Crossing New Horizons Catalog + +{% include [intro](_includes/intro.md) %} + +Каталог предметов из популярной игры Animal Crossing: New Horizons. + +**Источник**: [Kaggle - Animal Crossing New Horizons Catalog](https://www.kaggle.com/datasets/jessicali9530/animal-crossing-new-horizons-nookplaza-dataset/) + +**Размер**: 51 KB + +## Пример загрузки + +1. Скачайте и разархивируйте файл `accessories.csv` с Kaggle + +2. Этот файл включает в себя маску BOM (Byte Order Mark). Однако команда импорта не поддерживает файлы с маской BOM. Чтобы устранить проблему, удалите BOM-байты из начала файла, выполнив следующую команду: + +```bash +sed -i '1s/^\xEF\xBB\xBF//' accessories.csv +``` + +3. Имена колонок в файле содержат пробелы, что не совместимо с YDB, поскольку YDB не поддерживает пробелы в именах колонок. Необходимо заменить пробелы в именах колонок, например, на символы подчеркивания. Вы можете сделать это, выполнив следующую команду: + +```bash +sed -i '1s/ /_/g' accessories.csv +``` + +4. Создайте таблицу в {{ ydb-short-name }} одним из следующих способов: + +{% list tabs %} + +- Embedded UI + + Подробнее про [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `accessories` ( + `Name` Text NOT NULL, + `Variation` Text NOT NULL, + `DIY` Text NOT NULL, + `Buy` Text NOT NULL, + `Sell` Uint64 NOT NULL, + `Color_1` Text NOT NULL, + `Color_2` Text NOT NULL, + `Size` Text NOT NULL, + `Miles_Price` Text NOT NULL, + `Source` Text NOT NULL, + `Source_Notes` Text NOT NULL, + `Seasonal_Availability` Text NOT NULL, + `Mannequin_Piece` Text NOT NULL, + `Version` Text NOT NULL, + `Style` Text NOT NULL, + `Label_Themes` Text NOT NULL, + `Type` Text NOT NULL, + `Villager_Equippable` Text NOT NULL, + `Catalog` Text NOT NULL, + `Filename` Text NOT NULL, + `Internal_ID` Uint64 NOT NULL, + `Unique_Entry_ID` Text NOT NULL, + PRIMARY KEY (`Unique_Entry_ID`) + ) + WITH ( + STORE = COLUMN + ); + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `accessories` ( + `Name` Text NOT NULL, + `Variation` Text NOT NULL, + `DIY` Text NOT NULL, + `Buy` Text NOT NULL, + `Sell` Uint64 NOT NULL, + `Color_1` Text NOT NULL, + `Color_2` Text NOT NULL, + `Size` Text NOT NULL, + `Miles_Price` Text NOT NULL, + `Source` Text NOT NULL, + `Source_Notes` Text NOT NULL, + `Seasonal_Availability` Text NOT NULL, + `Mannequin_Piece` Text NOT NULL, + `Version` Text NOT NULL, + `Style` Text NOT NULL, + `Label_Themes` Text NOT NULL, + `Type` Text NOT NULL, + `Villager_Equippable` Text NOT NULL, + `Catalog` Text NOT NULL, + `Filename` Text NOT NULL, + `Internal_ID` Uint64 NOT NULL, + `Unique_Entry_ID` Text NOT NULL, + PRIMARY KEY (`Unique_Entry_ID`) + ) + WITH ( + STORE = COLUMN + );' + ``` + +{% endlist %} + +5. Выполните команду импорта: + +```bash +ydb import file csv --header --path accessories accessories.csv +``` + +## Пример аналитического запроса + +Определим пять самых популярных основных цветов аксессуаров: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + Color_1, + COUNT(*) AS color_count + FROM accessories + GROUP BY Color_1 + ORDER BY color_count DESC + LIMIT 5; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + Color_1, + COUNT(*) AS color_count + FROM accessories + GROUP BY Color_1 + ORDER BY color_count DESC + LIMIT 5;' + ``` + +{% endlist %} + +Результат: + +```raw +┌──────────┬─────────────┐ +│ Color_1 │ color_count │ +├──────────┼─────────────┤ +│ "Black" │ 31 │ +├──────────┼─────────────┤ +│ "Green" │ 27 │ +├──────────┼─────────────┤ +│ "Pink" │ 20 │ +├──────────┼─────────────┤ +│ "Red" │ 20 │ +├──────────┼─────────────┤ +│ "Yellow" │ 19 │ +└──────────┴─────────────┘ +```
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/datasets/chess.md b/ydb/docs/ru/core/olap/datasets/chess.md new file mode 100644 index 00000000000..ffdbdc647bc --- /dev/null +++ b/ydb/docs/ru/core/olap/datasets/chess.md @@ -0,0 +1,106 @@ +# Chess Position Evaluations + +{% include [intro](_includes/intro.md) %} + +Датасет включает 513 миллионов оценок шахматных позиций, выполненных движком Stockfish для анализа на платформе Lichess. + +**Источник**: [Kaggle - Chess Position Evaluations](https://www.kaggle.com/datasets/lichess/chess-evaluations) + +**Размер**: 59.66 GB + +## Пример загрузки + +1. Скачайте файл `evals.csv` с Kaggle + +2. Создайте таблицу в {{ ydb-short-name }} одним из следующих способов: + +{% list tabs %} + +- Embedded UI + + Подробнее про [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `evals` ( + `fen` Text NOT NULL, + `line` Text NOT NULL, + `depth` Uint64, + `knodes` Uint64, + `cp` Double, + `mate` Double, + PRIMARY KEY (`fen`, `line`) + ) + WITH ( + STORE = COLUMN, + UNIFORM_PARTITIONS = 50 + ); + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `evals` ( + `fen` Text NOT NULL, + `line` Text NOT NULL, + `depth` Uint64, + `knodes` Uint64, + `cp` Double, + `mate` Double, + PRIMARY KEY (`fen`, `line`) + ) + WITH ( + STORE = COLUMN, + UNIFORM_PARTITIONS = 50 + );' + ``` + +{% endlist %} + +3. Выполните команду импорта: + +```bash +ydb import file csv --header --null-value "" --path evals evals.csv +``` + +## Пример аналитического запроса + +Определим позиции с наибольшим количеством ходов, проанализированных движком Stockfish: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + fen, + MAX(depth) AS max_depth, + SUM(knodes) AS total_knodes + FROM evals + GROUP BY fen + ORDER BY max_depth DESC + LIMIT 10; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + fen, + MAX(depth) AS max_depth, + SUM(knodes) AS total_knodes + FROM evals + GROUP BY fen + ORDER BY max_depth DESC + LIMIT 10;' + ``` + +{% endlist %} + +Этот запрос выполняет следующие действия: + +* Находит позиции (представленные в формате FEN) с максимальной глубиной анализа (depth). +* Суммирует количество проанализированных узлов (knodes) для каждой позиции. +* Сортирует результаты по максимальной глубине анализа в порядке убывания. +* Выводит топ-10 позиций с наибольшей глубиной анализа.
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/datasets/covid.md b/ydb/docs/ru/core/olap/datasets/covid.md new file mode 100644 index 00000000000..7ca55d9a867 --- /dev/null +++ b/ydb/docs/ru/core/olap/datasets/covid.md @@ -0,0 +1,159 @@ +# COVID-19 Open Research Dataset + +{% include [intro](_includes/intro.md) %} + +Открытый набор данных исследований COVID-19. + +**Источник**: [Kaggle - COVID-19 Open Research Dataset Challenge](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv) + +**Размер**: 1.65 GB (файл metadata.csv) + +## Пример загрузки + +1. Скачайте и разархивируйте файл `metadata.csv` с Kaggle + +2. Датасет включает в себя полностью идентичные строки. Поскольку YDB требует указания уникальных значений первичного ключа, добавим в файл новую колонку под названием `row_id`, где значение ключа будет равно номеру строки в исходном файле. Это позволит предотвратить удаление повторяющихся данных. Эту операцию можно осуществить с помощью команды awk: + +```bash +awk 'NR==1 {print "row_id," $0; next} {print NR-1 "," $0}' metadata.csv > temp.csv && mv temp.csv metadata.csv +``` + +3. Создайте таблицу в {{ ydb-short-name }} одним из следующих способов: + +{% list tabs %} + +- Embedded UI + + Подробнее про [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `covid_research` ( + `row_id` Uint64 NOT NULL, + `cord_uid` Text NOT NULL, + `sha` Text NOT NULL, + `source_x` Text NOT NULL, + `title` Text NOT NULL, + `doi` Text NOT NULL, + `pmcid` Text NOT NULL, + `pubmed_id` Text NOT NULL, + `license` Text NOT NULL, + `abstract` Text NOT NULL, + `publish_time` Text NOT NULL, + `authors` Text NOT NULL, + `journal` Text NOT NULL, + `mag_id` Text, + `who_covidence_id` Text, + `arxiv_id` Text, + `pdf_json_files` Text NOT NULL, + `pmc_json_files` Text NOT NULL, + `url` Text NOT NULL, + `s2_id` Uint64, + PRIMARY KEY (`row_id`) + ) + WITH ( + STORE = COLUMN + ); + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `covid_research` ( + `row_id` Uint64 NOT NULL, + `cord_uid` Text NOT NULL, + `sha` Text NOT NULL, + `source_x` Text NOT NULL, + `title` Text NOT NULL, + `doi` Text NOT NULL, + `pmcid` Text NOT NULL, + `pubmed_id` Text NOT NULL, + `license` Text NOT NULL, + `abstract` Text NOT NULL, + `publish_time` Text NOT NULL, + `authors` Text NOT NULL, + `journal` Text NOT NULL, + `mag_id` Text, + `who_covidence_id` Text, + `arxiv_id` Text, + `pdf_json_files` Text NOT NULL, + `pmc_json_files` Text NOT NULL, + `url` Text NOT NULL, + `s2_id` Uint64, + PRIMARY KEY (`row_id`) + ) + WITH ( + STORE = COLUMN + );' + ``` + +{% endlist %} + +4. Выполните команду импорта: + +```bash +ydb import file csv --header --null-value "" --path covid_research metadata.csv +``` + +## Пример аналитического запроса + +Выполните запрос для определения журналов с наибольшим количеством публикаций: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + journal, + COUNT(*) AS publication_count + FROM covid_research + WHERE journal IS NOT NULL AND journal != '' + GROUP BY journal + ORDER BY publication_count DESC + LIMIT 10; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + journal, + COUNT(*) AS publication_count + FROM covid_research + WHERE journal IS NOT NULL AND journal != "" + GROUP BY journal + ORDER BY publication_count DESC + LIMIT 10;' + ``` + +{% endlist %} + +Результат: + +```raw +┌───────────────────────────────────┬───────────────────┐ +│ journal │ publication_count │ +├───────────────────────────────────┼───────────────────┤ +│ "PLoS One" │ 9953 │ +├───────────────────────────────────┼───────────────────┤ +│ "bioRxiv" │ 8961 │ +├───────────────────────────────────┼───────────────────┤ +│ "Int J Environ Res Public Health" │ 8201 │ +├───────────────────────────────────┼───────────────────┤ +│ "BMJ" │ 6928 │ +├───────────────────────────────────┼───────────────────┤ +│ "Sci Rep" │ 5935 │ +├───────────────────────────────────┼───────────────────┤ +│ "Cureus" │ 4212 │ +├───────────────────────────────────┼───────────────────┤ +│ "Reactions Weekly" │ 3891 │ +├───────────────────────────────────┼───────────────────┤ +│ "Front Psychol" │ 3541 │ +├───────────────────────────────────┼───────────────────┤ +│ "BMJ Open" │ 3515 │ +├───────────────────────────────────┼───────────────────┤ +│ "Front Immunol" │ 3442 │ +└───────────────────────────────────┴───────────────────┘ +```
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/datasets/ecommerce.md b/ydb/docs/ru/core/olap/datasets/ecommerce.md new file mode 100644 index 00000000000..d70028955b9 --- /dev/null +++ b/ydb/docs/ru/core/olap/datasets/ecommerce.md @@ -0,0 +1,145 @@ +# E-Commerce Behavior Data + +{% include [intro](_includes/intro.md) %} + +Данные о поведении пользователей в мультикатегорийном интернет-магазине. + +**Источник**: [Kaggle - E-commerce behavior data](https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store/data) + +**Размер**: 9 GB + +## Пример загрузки + +1. Скачайте и разархивируйте файл `2019-Nov.csv` с Kaggle + +2. Датасет включает в себя полностью идентичные строки. Поскольку YDB требует указания уникальных значений первичного ключа, добавим в файл новую колонку под названием `row_id`, где значение ключа будет равно номеру строки в исходном файле. Это позволит предотвратить удаление повторяющихся данных. Эту операцию можно осуществить с помощью команды awk: + +```bash +awk 'NR==1 {print "row_id," $0; next} {print NR-1 "," $0}' 2019-Nov.csv > temp.csv && mv temp.csv 2019-Nov.csv +``` + +3. Создайте таблицу в {{ ydb-short-name }} одним из следующих способов: + +{% list tabs %} + +- Embedded UI + + Подробнее про [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `ecommerce_table` ( + `row_id` Uint64 NOT NULL, + `event_time` Text NOT NULL, + `event_type` Text NOT NULL, + `product_id` Uint64 NOT NULL, + `category_id` Uint64, + `category_code` Text, + `brand` Text, + `price` Double NOT NULL, + `user_id` Uint64 NOT NULL, + `user_session` Text NOT NULL, + PRIMARY KEY (`row_id`) + ) + WITH ( + STORE = COLUMN, + UNIFORM_PARTITIONS = 50 + ); + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `ecommerce_table` ( + `row_id` Uint64 NOT NULL, + `event_time` Text NOT NULL, + `event_type` Text NOT NULL, + `product_id` Uint64 NOT NULL, + `category_id` Uint64, + `category_code` Text, + `brand` Text, + `price` Double NOT NULL, + `user_id` Uint64 NOT NULL, + `user_session` Text NOT NULL, + PRIMARY KEY (`row_id`) + ) + WITH ( + STORE = COLUMN, + UNIFORM_PARTITIONS = 50 + );' + ``` + +{% endlist %} + +4. Выполните команду импорта: + +```bash +ydb import file csv --header --null-value "" --path ecommerce_table 2019-Nov.csv +``` + +## Пример аналитического запроса + +Определим самые популярные категории продуктов 1 ноября 2019 года: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + category_code, + COUNT(*) AS view_count + FROM ecommerce_table + WHERE + SUBSTRING(CAST(event_time AS String), 0, 10) = '2019-11-01' + AND event_type = 'view' + GROUP BY category_code + ORDER BY view_count DESC + LIMIT 10; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + category_code, + COUNT(*) AS view_count + FROM ecommerce_table + WHERE + SUBSTRING(CAST(event_time AS String), 0, 10) = "2019-11-01" + AND event_type = "view" + GROUP BY category_code + ORDER BY view_count DESC + LIMIT 10;' + ``` + +{% endlist %} + +Результат: + +```raw +┌────────────────────────────────────┬────────────┐ +│ category_code │ view_count │ +├────────────────────────────────────┼────────────┤ +│ null │ 453024 │ +├────────────────────────────────────┼────────────┤ +│ "electronics.smartphone" │ 360650 │ +├────────────────────────────────────┼────────────┤ +│ "electronics.clocks" │ 43581 │ +├────────────────────────────────────┼────────────┤ +│ "computers.notebook" │ 40878 │ +├────────────────────────────────────┼────────────┤ +│ "electronics.video.tv" │ 40383 │ +├────────────────────────────────────┼────────────┤ +│ "electronics.audio.headphone" │ 37489 │ +├────────────────────────────────────┼────────────┤ +│ "apparel.shoes" │ 31013 │ +├────────────────────────────────────┼────────────┤ +│ "appliances.kitchen.washer" │ 28028 │ +├────────────────────────────────────┼────────────┤ +│ "appliances.kitchen.refrigerators" │ 27808 │ +├────────────────────────────────────┼────────────┤ +│ "appliances.environment.vacuum" │ 26477 │ +└────────────────────────────────────┴────────────┘ +```
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/datasets/index.md b/ydb/docs/ru/core/olap/datasets/index.md new file mode 100644 index 00000000000..d6fe10289b1 --- /dev/null +++ b/ydb/docs/ru/core/olap/datasets/index.md @@ -0,0 +1,54 @@ +# Обзор + +На этих страницах описаны популярные датасеты, которые вы можете загрузить в {{ ydb-short-name }} для ознакомления с функциональностью базы данных и тестирования различных сценариев использования. + +## Предварительные требования + +Для загрузки датасетов вам потребуется: + +1. Установленный [{{ ydb-short-name }} CLI](../../reference/ydb-cli/) +2. [Опционально] Настроенный [профиль подключения](../../reference/ydb-cli/profile/create.md) к {{ ydb-short-name }}, чтобы не указывать параметры подключения при каждом вызове + +## Общая информация о загрузке данных {#general-info} + +{{ ydb-short-name }} поддерживает импорт данных из CSV-файлов с помощью [команды](../../reference/ydb-cli/export-import/import-file.md) `ydb import file csv`. Пример запуска команды: + +```bash +ydb import file csv --header --null-value "" --path <путь_к_таблице> <файл>.csv +``` + +Где: + +* `--header` означает, что в первой строке файла содержится список имён колонок, а сами данные начинаются со второй строки; +* `--null-value ""` означает, что пустая строка вместо значения в CSV будет интерпретироваться как null-значение при импорте данных из csv-файла в таблицу. + +Для импорта данных нужна заранее созданная таблица в {{ ydb-short-name }}. Основной способ создания таблицы - выполнить [YQL-запрос `CREATE TABLE`](../../yql/reference/syntax/create_table/index.md). Чтобы не составлять его полностью вручную, можно попробовать выполнить команду импорта из файла, как в любом примере ниже, не создавая перед этим таблицу. В таком случае CLI предложит текст `CREATE TABLE`, который можно будет взять за основу, при необходимости отредактировать и выполнить. + +Для импорта данных в {{ ydb-short-name }} таблица должна быть заранее создана. Обычно таблицу создают с помощью [YQL-запроса `CREATE TABLE`](../../yql/reference/syntax/create_table/index.md). Однако, вместо того чтобы писать такой запрос вручную, можно запустить команду импорта `ydb import file csv` так же, как показано в примерах импорта в этом разделе. Если таблица отсутствует, CLI автоматически предложит готовый текст запроса `CREATE TABLE`, который можно будет использовать для создания таблицы. + +{% note info "Выбор первичного ключа" %} + +{{ ydb-short-name }} требует, чтобы у таблицы был первичный ключ. Он значительно ускоряет загрузку и обработку данных, а также позволяет выполнять дедупликацию: строки с одинаковыми значениями в столбцах первичного ключа заменяются друг другом. + +Если в импортируемом наборе данных нет подходящих столбцов для первичного ключа, мы добавляем новый столбец с номерами строк и используем его в качестве первичного ключа, так как номер каждой строки уникален в пределах файла. + +{% endnote %} + +## Особенности и ограничения + +При работе с загрузкой CSV-файлов в {{ ydb-short-name }} следует учитывать следующие моменты: + +1. **Имена колонок**: Названия колонок не должны содержать пробелы или специальные символы. + +2. **Типы данных**: + - Строки в формате даты/времени с указанием временной зоны (например, "2019-11-01 00:00:00 UTC") будут импортированы как тип Text + - Тип Bool не поддерживается в качестве типа колонки, используйте Text или Int64 + +## Доступные датасеты + +* [Chess Position Evaluations](chess.md) - оценки шахматных позиций от движка Stockfish +* [Video Game Sales](video-games.md) - данные о продажах видеоигр +* [E-Commerce Behavior Data](ecommerce.md) - данные о поведении пользователей в интернет-магазине +* [COVID-19 Open Research Dataset](covid.md) - открытый набор данных исследований COVID-19 +* [Netflix Movies and TV Shows](netflix.md) - данные о фильмах и сериалах на Netflix +* [Animal Crossing New Horizons Catalog](animal-crossing.md) - каталог предметов из игры
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/datasets/netflix.md b/ydb/docs/ru/core/olap/datasets/netflix.md new file mode 100644 index 00000000000..cfb1a1395a8 --- /dev/null +++ b/ydb/docs/ru/core/olap/datasets/netflix.md @@ -0,0 +1,127 @@ +# Netflix Movies and TV Shows + +{% include [intro](_includes/intro.md) %} + +Данные о фильмах и сериалах на платформе Netflix. + +**Источник**: [Kaggle - Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows) + +**Размер**: 3.4 MB + +## Пример загрузки + +1. Скачайте и разархивируйте файл `netflix_titles.csv` с Kaggle + +2. Создайте таблицу в {{ ydb-short-name }} одним из следующих способов: + +{% list tabs %} + +- Embedded UI + + Подробнее про [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `netflix` ( + `show_id` Text NOT NULL, + `type` Text NOT NULL, + `title` Text NOT NULL, + `director` Text NOT NULL, + `cast` Text, + `country` Text NOT NULL, + `date_added` Text NOT NULL, + `release_year` Uint64 NOT NULL, + `rating` Text NOT NULL, + `duration` Text NOT NULL, + `listed_in` Text NOT NULL, + `description` Text NOT NULL, + PRIMARY KEY (`show_id`) + ) + WITH ( + STORE = COLUMN + ); + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `netflix` ( + `show_id` Text NOT NULL, + `type` Text NOT NULL, + `title` Text NOT NULL, + `director` Text NOT NULL, + `cast` Text, + `country` Text NOT NULL, + `date_added` Text NOT NULL, + `release_year` Uint64 NOT NULL, + `rating` Text NOT NULL, + `duration` Text NOT NULL, + `listed_in` Text NOT NULL, + `description` Text NOT NULL, + PRIMARY KEY (`show_id`) + ) + WITH ( + STORE = COLUMN + );' + ``` + +{% endlist %} + +3. Выполните команду импорта: + +```bash +ydb import file csv --header --null-value "" --path netflix netflix_titles.csv +``` + +## Пример аналитического запроса + +Определим три страны, из которых было добавлено больше всего контента на Netflix в 2020 году: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + country, + COUNT(*) AS count + FROM netflix + WHERE + CAST(SUBSTRING(CAST(date_added AS String), 7, 4) AS Int32) = 2020 + AND date_added IS NOT NULL + GROUP BY country + ORDER BY count DESC + LIMIT 3; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + country, + COUNT(*) AS count + FROM netflix + WHERE + CAST(SUBSTRING(CAST(date_added AS String), 7, 4) AS Int32) = 2020 + AND date_added IS NOT NULL + GROUP BY country + ORDER BY count DESC + LIMIT 3;' + ``` + +{% endlist %} + +Результат: + +```raw +┌─────────────────┬───────┐ +│ country │ count │ +├─────────────────┼───────┤ +│ "United States" │ 22 │ +├─────────────────┼───────┤ +│ "" │ 7 │ +├─────────────────┼───────┤ +│ "Canada" │ 3 │ +└─────────────────┴───────┘ +```
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/datasets/toc_p.yaml b/ydb/docs/ru/core/olap/datasets/toc_p.yaml new file mode 100644 index 00000000000..afd6aaf4ba4 --- /dev/null +++ b/ydb/docs/ru/core/olap/datasets/toc_p.yaml @@ -0,0 +1,15 @@ +items: + - name: Обзор + href: index.md + - name: Chess Position Evaluations + href: chess.md + - name: Video Game Sales + href: video-games.md + - name: E-Commerce Behavior Data + href: ecommerce.md + - name: COVID-19 Open Research Dataset + href: covid.md + - name: Netflix Movies and TV Shows + href: netflix.md + - name: Animal Crossing New Horizons Catalog + href: animal-crossing.md
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/datasets/video-games.md b/ydb/docs/ru/core/olap/datasets/video-games.md new file mode 100644 index 00000000000..3e783b72bf0 --- /dev/null +++ b/ydb/docs/ru/core/olap/datasets/video-games.md @@ -0,0 +1,117 @@ +# Video Game Sales + +{% include [intro](_includes/intro.md) %} + +Данные о продажах видеоигр. + +**Источник**: [Kaggle - Video Game Sales](https://www.kaggle.com/datasets/gregorut/videogamesales) + +**Размер**: 1.36 MB + +## Пример загрузки + +1. Скачайте и разархивируйте файл `vgsales.csv` с Kaggle + +2. Создайте таблицу в {{ ydb-short-name }} одним из следующих способов: + +{% list tabs %} + +- Embedded UI + + Подробнее про [Embedded UI](../../reference/embedded-ui/ydb-monitoring). + + ```sql + CREATE TABLE `vgsales` ( + `Rank` Uint64 NOT NULL, + `Name` Text NOT NULL, + `Platform` Text NOT NULL, + `Year` Text NOT NULL, + `Genre` Text NOT NULL, + `Publisher` Text NOT NULL, + `NA_Sales` Double NOT NULL, + `EU_Sales` Double NOT NULL, + `JP_Sales` Double NOT NULL, + `Other_Sales` Double NOT NULL, + `Global_Sales` Double NOT NULL, + PRIMARY KEY (`Rank`) + ) + WITH ( + STORE = COLUMN + ); + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'CREATE TABLE `vgsales` ( + `Rank` Uint64 NOT NULL, + `Name` Text NOT NULL, + `Platform` Text NOT NULL, + `Year` Text NOT NULL, + `Genre` Text NOT NULL, + `Publisher` Text NOT NULL, + `NA_Sales` Double NOT NULL, + `EU_Sales` Double NOT NULL, + `JP_Sales` Double NOT NULL, + `Other_Sales` Double NOT NULL, + `Global_Sales` Double NOT NULL, + PRIMARY KEY (`Rank`) + ) + WITH ( + STORE = COLUMN + );' + ``` + +{% endlist %} + +3. Выполните команду импорта: + +```bash +ydb import file csv --header --null-value "" --path vgsales vgsales.csv +``` + +## Пример аналитического запроса + +Чтобы определить издателя, у которого наибольшая средняя продажа игр в Северной Америке, выполните запрос: + +{% list tabs %} + +- Embedded UI + + ```sql + SELECT + Publisher, + AVG(NA_Sales) AS average_na_sales + FROM vgsales + GROUP BY Publisher + ORDER BY average_na_sales DESC + LIMIT 1; + ``` + +- YDB CLI + + ```bash + ydb sql -s \ + 'SELECT + Publisher, + AVG(NA_Sales) AS average_na_sales + FROM vgsales + GROUP BY Publisher + ORDER BY average_na_sales DESC + LIMIT 1;' + ``` + +{% endlist %} + +Результат: + +```bash +┌───────────┬──────────────────┐ +│ Publisher │ average_na_sales │ +├───────────┼──────────────────┤ +│ "Palcom" │ 3.38 │ +└───────────┴──────────────────┘ +``` + +Запрос позволит найти, какой издатель достиг наибольшего успеха в Северной Америке по средней продаже.
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/index.md b/ydb/docs/ru/core/olap/index.md new file mode 100644 index 00000000000..71b4d86b15c --- /dev/null +++ b/ydb/docs/ru/core/olap/index.md @@ -0,0 +1,11 @@ +# Для аналитиков + +В этом разделе собраны примеры и рекомендации по работе с [аналитическими (OLAP) сценариями](../faq/analytics.md) в {{ ydb-short-name }}. + +## Данный раздел содержит следующие материалы + +* [{#T}](scenarios.md) +* [{#T}](optimizing.md) +* [Импорт датасетов](datasets/index.md) + +{% include [olap-links](_includes/olap-links.md) %}
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/optimizing.md b/ydb/docs/ru/core/olap/optimizing.md new file mode 100644 index 00000000000..fcf0ba4e835 --- /dev/null +++ b/ydb/docs/ru/core/olap/optimizing.md @@ -0,0 +1,17 @@ +# Оптимизация аналитических запросов + +1. **Правильно выбирать первичный ключ** + * Использовать колонки, по которым часто выполняется фильтрация + * Учитывать порядок колонок для оптимизации range-запросов + +2. **Оптимизировать схему данных** + * Использовать [колоночное хранение](../concepts/datamodel/table.md#column-oriented-tables) для аналитических таблиц + * Правильно выбирать типы данных для экономии места + * Применять [партиционирование](../concepts/datamodel/table.md#olap-tables-partitioning) для больших (более 2Гб) аналитических таблиц + +3. **Оптимизировать запросы** + * Запрашивать только необходимые колонки + * Использовать подходящие [индексы](../concepts/secondary_indexes.md) + * Правильно формулировать условия фильтрации + +{% include [olap-links](_includes/olap-links.md) %}
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/scenarios.md b/ydb/docs/ru/core/olap/scenarios.md new file mode 100644 index 00000000000..2941892de9b --- /dev/null +++ b/ydb/docs/ru/core/olap/scenarios.md @@ -0,0 +1,18 @@ +# Типичные OLAP сценарии + +1. **Пакетная аналитика** + * Агрегация больших объемов исторических данных + * Построение отчетов за длительные периоды + * Вычисление статистик по всему набору данных + +2. **Интерактивная аналитика** + * Ad-hoc запросы для исследования данных + * Построение динамических дашбордов + * Drill-down анализ с разными уровнями детализации + +3. **Предварительная обработка данных** + * Очистка и нормализация данных + * Обогащение данных из разных источников + * Подготовка данных для машинного обучения + +{% include [olap-links](_includes/olap-links.md) %}
\ No newline at end of file diff --git a/ydb/docs/ru/core/olap/toc_p.yaml b/ydb/docs/ru/core/olap/toc_p.yaml new file mode 100644 index 00000000000..4999a820fc1 --- /dev/null +++ b/ydb/docs/ru/core/olap/toc_p.yaml @@ -0,0 +1,10 @@ +items: + - name: Типичные OLAP-сценарии + href: scenarios.md + - name: Оптимизация аналитических запросов + href: optimizing.md + - name: Импорт датасетов + href: datasets/index.md + include: + mode: link + path: datasets/toc_p.yaml
\ No newline at end of file diff --git a/ydb/docs/ru/core/recipes/index.md b/ydb/docs/ru/core/recipes/index.md index 6dda2e081b8..76efb7787d8 100644 --- a/ydb/docs/ru/core/recipes/index.md +++ b/ydb/docs/ru/core/recipes/index.md @@ -5,3 +5,4 @@ * [{#T}](ydb-sdk/index.md) * [{#T}](ydb-cli/index.md) * [{#T}](../yql/reference/recipes/index.md) +* [{#T}](../olap/index.md)
\ No newline at end of file diff --git a/ydb/docs/ru/core/recipes/toc_p.yaml b/ydb/docs/ru/core/recipes/toc_p.yaml index 19bb3ca8a06..66b7f823924 100644 --- a/ydb/docs/ru/core/recipes/toc_p.yaml +++ b/ydb/docs/ru/core/recipes/toc_p.yaml @@ -12,3 +12,8 @@ items: include: mode: link path: ../yql/reference/recipes/toc_i.yaml +- name: Для аналитиков + href: ../olap/index.md + include: + mode: link + path: ../olap/toc_p.yaml diff --git a/ydb/docs/ru/core/recipes/ydb-cli/toc_p.yaml b/ydb/docs/ru/core/recipes/ydb-cli/toc_p.yaml index 4d243c0aefa..54d0f932c9e 100644 --- a/ydb/docs/ru/core/recipes/ydb-cli/toc_p.yaml +++ b/ydb/docs/ru/core/recipes/ydb-cli/toc_p.yaml @@ -1,5 +1,5 @@ items: -- name: Конвертация типов таблиц +- name: Конвертация типов таблиц href: convert-table-type.md - name: Нагрузочное тестирование href: benchmarks.md |