You are now running an AWS Glue Studio notebook. Before you can use it, you must start an interactive session; the magics in the table below configure that session, and a sample configuration cell follows the table.
Magic | Type | Description |
---|---|---|
%%configure | Dictionary | A JSON-formatted dictionary of configuration parameters for the session. Each parameter can be specified here or through an individual magic. |
%profile | String | Specify a profile in your AWS configuration to use as the credentials provider. |
%iam_role | String | Specify an IAM role to execute your session with. |
%region | String | Specify the AWS region in which to initialize a session. |
%session_id | String | Returns the session ID for the running session. |
%connections | List | Specify a comma separated list of connections to use in the session. |
%additional_python_modules | List | Comma separated list of pip packages, S3 paths, or private pip arguments. |
%extra_py_files | List | Comma separated list of additional Python files from S3. |
%extra_jars | List | Comma separated list of additional JARs to include in the cluster. |
%number_of_workers | Integer | The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too. |
%glue_version | String | The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (e.g. %glue_version 2.0). |
%security_config | String | Define a security configuration to be used with this session. |
%%sql | String | Run SQL code. All lines after the initial %%sql magic are passed as part of the SQL code. |
%streaming | String | Changes the session type to Glue Streaming. |
%etl | String | Changes the session type to Glue ETL. |
%status |  | Returns the status of the current Glue session, including its duration, configuration, and executing user / role. |
%stop_session |  | Stops the current session. |
%list_sessions |  | Lists all currently running sessions by name and ID. |
%worker_type | String | Standard, G.1X, or G.2X. number_of_workers must be set too. Default is G.1X. |
%spark_conf | String | Specify custom spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer. |
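For example, a session can be pinned to a specific Glue version and worker fleet by running the magics before the first code cell. The values below are illustrative placeholders, not requirements:

%glue_version 3.0
%region us-east-1
%worker_type G.1X
%number_of_workers 5

Per the %%configure row above, the same parameters can also be supplied as a single JSON dictionary in its own cell; assuming the keys mirror the magic names:

%%configure
{
    "glue_version": "3.0",
    "region": "us-east-1",
    "worker_type": "G.1X",
    "number_of_workers": 5
}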
# Default Glue notebook boilerplate: imports plus the Spark and Glue contexts
import datetime
import sys

from pyspark.context import SparkContext
from pyspark.sql.types import *
import pyspark.sql.functions as F

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

# Reuse the SparkContext the interactive session already started, if any
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
# Read the raw JSON files from S3 into a DynamicFrame.
# groupFiles/groupSize coalesce many small files into ~1 MiB read groups.
json_fpath = "s3://acc-lakeformation-spec/db_dummy_json/"
df_json = glueContext.create_dynamic_frame_from_options(
    "s3",
    {
        "paths": [json_fpath],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "1048576",  # bytes
    },
    format="json",
)
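For comparison, the same prefix could be read with plain Spark, at the cost of losing the small-file grouping that the DynamicFrame options provide; a minimal sketch:

# Sketch: plain Spark read of the same prefix, without Glue's file grouping
df_plain = spark.read.json(json_fpath)
df_plain.printSchema()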
# Collect to a pandas DataFrame on the driver (fine for small datasets)
df = df_json.toDF().toPandas()
df.head()

# Keep only the last occurrence of each serial number
df2 = df.drop_duplicates(['serial'], keep='last')
df2
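Note that toPandas() pulls the whole dataset onto the driver, so this dedupe only scales to what fits in driver memory. A distributed alternative is sketched below, assuming "last" means the row with the greatest start timestamp per serial (pandas keep='last' goes by row order instead):

# Sketch: Spark-native "keep last record per serial", ordered by 'start'
from pyspark.sql import Window

w = Window.partitionBy('serial').orderBy(F.col('start').desc())
df_last = (df_json.toDF()
           .withColumn('rn', F.row_number().over(w))
           .filter(F.col('rn') == 1)
           .drop('rn'))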
# Derive the partition value from the ISO-8601 'start' timestamp; note the
# Hive-style 'anomesdia=' prefix is baked into the column value here
df2['anomesdia'] = df2['start'].apply(lambda x: datetime.datetime.fromisoformat(x).strftime('anomesdia=%Y-%m-%d'))
df2.head()
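The row-by-row apply() works, but pandas can build the same column vectorized, which is faster on larger frames; a sketch assuming start holds ISO-8601 strings (which fromisoformat already requires):

# Sketch: vectorized equivalent of the apply() above
import pandas as pd

df2['anomesdia'] = 'anomesdia=' + pd.to_datetime(df2['start']).dt.strftime('%Y-%m-%d')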
# Inspect the inferred schema before converting back to Spark
df_json.printSchema()
df_df = df_json.toDF()
df_df.schema

# Rebuild a Spark DataFrame from pandas, reusing the source schema
df3 = spark.createDataFrame(df2, df_json.toDF().schema)
df3.show()
df3.count()
# Normalize column types; 'anomesdia' becomes a proper DATE for partitioning
df4 = df3.select(
    df3['serial'].cast(StringType()).alias('serial'),
    df3['status'].cast(StringType()).alias('status'),
    df3['start'].cast(StringType()).alias('start'),
    df3['start'].cast(DateType()).alias('anomesdia')
)
df4.show()
# Write parquet partitioned by anomesdia; repartition(1) forces a single
# output file per write, which keeps small datasets tidy
path = 's3://acc-lakeformation-spec/db_dummy_parquet/'
df4.repartition(1).write.mode('append').format('parquet').partitionBy('anomesdia').save(path)
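To confirm the partitioned layout, the dataset can be read straight back with Spark; a quick sanity check against the same output path:

# Read the parquet back and confirm the partition column survived
df_check = spark.read.parquet(path)
df_check.printSchema()
df_check.groupBy('anomesdia').count().show()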