Welcome to the Climate Data Modelling project! This project extracts climate data from public data sources and stores it in a BigQuery data warehouse. We first download the data into a data lake using bash scripts, then use Python scripts to read the data from the data lake and load it into the data warehouse. The resulting tables can then be used to tackle climatology problems with deep learning approaches.
## Set up the environment

Create a Python virtual environment and install the dependencies:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements/requirements.txt
```
## Create a VM instance

- Go to the GCP Console.
- Click on the hamburger menu on the left.
- Go to Compute Engine -> VM Instances.
- Click on Create Instance at the top.
- Select the machine configuration that you want. GPUs are not available in the free trial, so if you are on the free trial, please don't choose GPUs.
- Click on Create at the bottom of the page.
- Wait a few minutes for the instance to be created. (A programmatic alternative is sketched below.)
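If you prefer to create the instance from code instead of the console, here is a minimal sketch using the `google-cloud-compute` client library. The project ID, zone, machine type, and boot image below are placeholder assumptions, not values taken from this project; substitute your own.

```python
# Minimal sketch: create a VM instance with the google-cloud-compute client.
# All names here (project, zone, machine type, image) are assumptions.
from google.cloud import compute_v1

project_id = "climate-data-modeling"  # assumed project ID
zone = "us-central1-a"                # assumed zone

instance = compute_v1.Instance()
instance.name = "climate-modelling-vm"
instance.machine_type = f"zones/{zone}/machineTypes/e2-medium"  # no GPU (free trial)

boot_disk = compute_v1.AttachedDisk()
boot_disk.boot = True
boot_disk.auto_delete = True
init_params = compute_v1.AttachedDiskInitializeParams()
init_params.source_image = "projects/debian-cloud/global/images/family/debian-12"
init_params.disk_size_gb = 20
boot_disk.initialize_params = init_params
instance.disks = [boot_disk]

nic = compute_v1.NetworkInterface()
nic.network = "global/networks/default"
instance.network_interfaces = [nic]

operation = compute_v1.InstancesClient().insert(
    project=project_id, zone=zone, instance_resource=instance
)
operation.result()  # block until the create operation finishes
```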
## Start the VM and connect over SSH

- Go to the GCP Console.
- Click on the hamburger menu on the left.
- Go to Compute Engine -> VM Instances.
- You will see the VM instance that you created in the previous step.
- Select that VM and click on Start/Resume at the top.
- Wait for the machine to boot up.
- Once the machine has booted, an SSH option appears under Connect. Click on it.
- A new window will open. After a few seconds, you will see the Linux shell of the virtual machine.
- Now you can start working on the VM. (Starting the instance can also be scripted; see the sketch below.)
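Starting a stopped instance can also be done from Python. A small sketch, assuming the same placeholder names as the creation sketch above (SSH itself still happens through the console or `gcloud compute ssh`):

```python
# Minimal sketch: start an existing (stopped) VM instance.
# Assumes the placeholder project/zone/instance names from the sketch above.
from google.cloud import compute_v1

operation = compute_v1.InstancesClient().start(
    project="climate-data-modeling",   # assumed project ID
    zone="us-central1-a",              # assumed zone
    instance="climate-modelling-vm",   # assumed instance name
)
operation.result()  # block until the instance is running
```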
## Create the data lake bucket and folders

- Go to the GCP Console.
- Click on the hamburger menu on the left.
- Go to Cloud Storage -> Buckets.
- Click on Create at the top.
- Name your bucket `data-lake-bucket-climate-modelling`, choose `us (multiple regions)` as the location, Multi-region as the location type, and Standard as the storage class. Leave everything else at the defaults and click Create.
- You will see your bucket under Cloud Storage -> Buckets.
- Click on the bucket.
- Click Create Folder at the top.
- Name the folder `daylength` and click Create.
- Repeat the previous two steps to create the other folders: `maximumtemperature`, `minimumtemperature`, `shortwaveradiation`, and `snowwaterequivalent`.
- Your bucket and the folders inside it are now created.
- Important: these bucket and folder names are used in our code, so if you want to change them, change them in the code as well. (A scripted alternative is sketched below.)
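The same bucket and folder layout can also be created from Python with the `google-cloud-storage` client. This is only a sketch: the project ID is an assumption, and GCS "folders" are simulated here with zero-byte placeholder objects, which is how the console's Create Folder button works too.

```python
# Minimal sketch: create the data lake bucket and its "folders".
# Uses the bucket/folder names from the steps above; project ID is assumed.
from google.cloud import storage

client = storage.Client(project="climate-data-modeling")  # assumed project ID
bucket = client.create_bucket(
    "data-lake-bucket-climate-modelling",  # bucket name used by the project's code
    location="US",                         # us (multi-region); Standard class by default
)

folders = [
    "daylength",
    "maximumtemperature",
    "minimumtemperature",
    "shortwaveradiation",
    "snowwaterequivalent",
]
for folder in folders:
    # GCS has no real folders; a zero-byte object ending in "/" shows up as one.
    bucket.blob(f"{folder}/").upload_from_string("")
```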
## Download the data into the data lake

- To download the files for a specific variable (e.g. `tmin`, `tmax`), run the corresponding script located in `load_nc_files_to_data_lake/`.
- To download all the files, run `download_all_files.sh`, located in `load_nc_files_to_data_lake/`.
## Create empty tables with schemas

- To create the empty tables, run the corresponding scripts in `create_empty_tables_with_schemas/`.
- If you want to change the schema of any of the tables, refer to this code and modify it there.
- To check that the tables were created successfully, go to Cloud Console -> BigQuery, then in the side panel find your dataset and the tables that were created when you ran these scripts. (A minimal table-creation sketch follows.)
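For orientation, here is a minimal sketch of creating an empty table with a nested, repeated field using the `google-cloud-bigquery` client. The table name and the exact schema (a `date` timestamp plus a repeated `dayls` STRUCT of `x`, `y`, `dayl`) are assumptions inferred from the example queries later in this README; the scripts in `create_empty_tables_with_schemas/` are the source of truth.

```python
# Minimal sketch: create an empty table with a nested, repeated field.
# The schema below is an assumption inferred from the example queries in this
# README; check create_empty_tables_with_schemas/ for the real definitions.
from google.cloud import bigquery

client = bigquery.Client(project="climate-data-modeling")  # assumed project ID

schema = [
    bigquery.SchemaField("date", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField(
        "dayls",
        "RECORD",
        mode="REPEATED",  # an array of STRUCT<x, y, dayl>
        fields=[
            bigquery.SchemaField("x", "FLOAT"),
            bigquery.SchemaField("y", "FLOAT"),
            bigquery.SchemaField("dayl", "FLOAT"),
        ],
    ),
]

table = bigquery.Table("climate-data-modeling.climate_data.daylength", schema=schema)
client.create_table(table)  # raises if the table already exists
```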
## Insert date rows into the tables

- Run the corresponding scripts located in `insert_rows_with_only_dates_in_tables/` to insert the date values into the BigQuery tables.
- You can check that the values were inserted using the BigQuery UI available in the Google Cloud Console. (A streaming-insert sketch follows.)
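A sketch of inserting date-only rows with a streaming insert, assuming the table and schema from the creation sketch above (the project's actual scripts may insert differently):

```python
# Minimal sketch: stream rows that carry only a date (empty dayls array).
# Table name and schema are the assumptions from the creation sketch above.
from google.cloud import bigquery

client = bigquery.Client(project="climate-data-modeling")  # assumed project ID
rows = [
    {"date": "1997-01-01T12:00:00", "dayls": []},
    {"date": "1997-01-02T12:00:00", "dayls": []},
]
errors = client.insert_rows_json("climate-data-modeling.climate_data.daylength", rows)
if errors:
    raise RuntimeError(f"Insert failed: {errors}")
```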
## Insert the x/y climate values

- Please wait about 30 minutes: once you insert rows into a table, BigQuery does not allow updates or deletes on those rows for some time, while they sit in the streaming buffer. After waiting, run the scripts located in `insert_x_y_climate_values_into_tables/`.
- Check in the BigQuery UI console that the values are being updated. (A Python sketch of such an update follows.)
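The updates these scripts perform look roughly like the UPDATE query at the end of this README. Here is a sketch of issuing such a statement from Python with query parameters; the table name and the values are placeholders, not the project's real ones.

```python
# Minimal sketch: append one (x, y, dayl) STRUCT to the repeated dayls field.
# Table name and values are placeholders; see the SQL examples below.
import datetime
from google.cloud import bigquery

client = bigquery.Client(project="climate-data-modeling")  # assumed project ID
query = """
UPDATE `climate-data-modeling.climate_data.daylength`
SET dayls = ARRAY_CONCAT(dayls, [STRUCT(@x AS x, @y AS y, @dayl AS dayl)])
WHERE date = @date
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("x", "FLOAT64", -100.0),
        bigquery.ScalarQueryParameter("y", "FLOAT64", -100.0),
        bigquery.ScalarQueryParameter("dayl", "FLOAT64", 300.0),
        bigquery.ScalarQueryParameter(
            "date", "TIMESTAMP",
            datetime.datetime(1997, 1, 1, 12, tzinfo=datetime.timezone.utc),
        ),
    ]
)
client.query(query, job_config=job_config).result()  # wait for the DML job
```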
## Create a new empty table with the schema of an old table

`testing_table_3` is the new table and will have the same schema as `testing_table_2`.

```sql
CREATE TABLE `climate-data-modeling.python_creating_dataset.testing_table_3`
AS
SELECT *
FROM `climate-data-modeling.python_creating_dataset.testing_table_2`
WHERE 1 = 0;
```
## Delete a table

```sql
DROP TABLE `climate-data-modeling.python_creating_dataset.testing_table_3_16`;
```
## See the schema of a table

```sql
SELECT
  column_name,
  data_type
FROM
  `climate-data-modeling.python_creating_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
  table_name = 'testing_table_3'
ORDER BY
  ordinal_position;
```
## See the values of a table

```sql
SELECT * FROM `climate-data-modeling.climate_data.daylength` LIMIT 1000;
```
## See the length of a nested, repeated field

Here `dayls` is a repeated field (an array) of STRUCT values.

```sql
SELECT ARRAY_LENGTH(dayls) AS num_values
FROM `climate-data-modeling.climate_data.daylength_data`;
```
## Append values to a table row manually

This UPDATE appends one STRUCT of `(x, y, dayl)` values to the repeated `dayls` field of an existing row.

```sql
UPDATE `climate-data-modeling.python_creating_dataset.testing_table_3`
SET dayls = ARRAY_CONCAT(dayls, [STRUCT(-100.0 AS x, -100.0 AS y, 300.00 AS dayl)])
WHERE date = '1997-01-01T12:00:00';
```
These queries can be modified as needed when you are testing.