You are now running an AWS Glue Studio notebook. Before you can start using it, you must start an interactive session.
Magic | Type | Description |
---|---|---|
%%configure | Dictionary | A JSON-formatted dictionary of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
%profile | String | Specify a profile in your AWS configuration to use as the credentials provider. |
%iam_role | String | Specify an IAM role to execute your session with. |
%region | String | Specify the AWS Region in which to initialize a session. |
%session_id | String | Returns the session ID for the running session. |
%connections | List | Specify a comma-separated list of connections to use in the session. |
%additional_python_modules | List | Comma-separated list of pip packages, S3 paths, or private pip arguments. |
%extra_py_files | List | Comma-separated list of additional Python files from S3. |
%extra_jars | List | Comma-separated list of additional JARs to include in the cluster. |
%number_of_workers | Integer | The number of workers of a defined worker_type that are allocated when a job runs. worker_type must also be set. |
%worker_type | String | Standard, G.1X, or G.2X. number_of_workers must also be set. Default is G.1X. |
%glue_version | String | The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (e.g., %glue_version 2.0). |
%security_config | String | Define a security configuration to be used with this session. |
%%sql | String | Run SQL code. All lines after the initial %%sql magic are passed as part of the SQL code. |
%streaming | String | Changes the session type to Glue Streaming. |
%etl | String | Changes the session type to Glue ETL. |
%status | | Returns the status of the current Glue session, including its duration, configuration, and executing user/role. |
%stop_session | | Stops the current session. |
%list_sessions | | Lists all currently running sessions by name and ID. |
%spark_conf | String | Specify custom Spark configurations for your session, e.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer |
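As a sketch, several of the individual magics above can be bundled into a single %%configure cell. The values below are illustrative only, and the assumption here is that the dictionary keys mirror the individual magic names:

```
%%configure
{
    "region": "us-east-1",
    "glue_version": "3.0",
    "worker_type": "G.1X",
    "number_of_workers": 5
}
```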
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue boilerplate: wrap the Spark context in a GlueContext
# and expose the underlying SparkSession and a Job handle.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
from pyspark import cloudpickle

def fn(*args, **kwargs):
    import socket
    import time
    # time.sleep(5)
    print(socket.gethostname())
    return f"{args}___{kwargs}___{socket.gethostname()}"
rdd = sc.parallelize(range(10), 10)\
    .map(lambda _: cloudpickle.dumps(fn(_)))

idx = 1
inst = []
for c in rdd.collect():
    print(idx, c)
    inst.append(c)
    idx += 1
print(len(set(inst)))
rdd = sc.parallelize(range(20), 20)\
    .map(lambda _: cloudpickle.dumps(fn()))

idx = 1
inst = []
for c in rdd.collect():
    print(idx, c)
    inst.append(c)
    idx += 1
print(len(set(inst)))
len(set(inst))
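Each collected element above is a pickled byte string, and counting distinct elements approximates counting distinct executor hosts (in the first experiment every element is distinct regardless, since the partition index `_` is baked into the arguments). A minimal local sketch of the same serialize/deserialize round trip, using the standard pickle module in place of PySpark's vendored cloudpickle (both expose the same dumps/loads interface):

```python
import pickle
import socket

def fn(*args, **kwargs):
    # Mirrors the notebook's fn: embed the executing host's name in the result
    return f"{args}___{kwargs}___{socket.gethostname()}"

# Serialize as the map() step does, then deserialize as you would after collect()
blobs = [pickle.dumps(fn()) for _ in range(20)]
results = {pickle.loads(b) for b in blobs}

# Locally everything runs on one host, so only one distinct result survives
print(len(results))  # prints 1 on a single machine
```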
%stop_session
%status
%number_of_workers 20