You are now running an AWS Glue Studio notebook. Before you can start using it, you must start an interactive session.
Magic | Type | Description |
---|---|---|
%%configure | Dictionary | A JSON-formatted dictionary of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
%profile | String | Specify a profile in your AWS configuration to use as the credentials provider. |
%iam_role | String | Specify an IAM role to execute your session with. |
%region | String | Specify the AWS Region in which to initialize a session. |
%session_id | String | Returns the session ID for the running session. |
%connections | List | Specify a comma-separated list of connections to use in the session. |
%additional_python_modules | List | Comma-separated list of pip packages, S3 paths, or private pip arguments. |
%extra_py_files | List | Comma-separated list of additional Python files from S3. |
%extra_jars | List | Comma-separated list of additional JARs to include in the cluster. |
%number_of_workers | Integer | The number of workers of a defined worker_type that are allocated when a job runs. worker_type must also be set. |
%worker_type | String | Standard, G.1X, or G.2X. number_of_workers must also be set. Default is G.1X. |
%glue_version | String | The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (e.g., %glue_version 2.0). |
%security_config | String | Define a security configuration to be used with this session. |
%%sql | String | Run SQL code. All lines after the initial %%sql magic are passed as part of the SQL code. |
%streaming | String | Changes the session type to Glue Streaming. |
%etl | String | Changes the session type to Glue ETL. |
%status | | Returns the status of the current Glue session, including its duration, configuration, and executing user/role. |
%stop_session | | Stops the current session. |
%list_sessions | | Lists all currently running sessions by name and ID. |
%spark_conf | String | Specify custom Spark configurations for your session, e.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer |
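As a sketch, several of the individual magics above can be bundled into a single %%configure cell. The values below are illustrative only, and the assumption here is that the dictionary keys mirror the individual magic names:

```
%%configure
{
    "region": "us-east-1",
    "glue_version": "3.0",
    "worker_type": "G.1X",
    "number_of_workers": 5
}
```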
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue boilerplate: wrap the Spark context in a GlueContext
# and expose the underlying SparkSession and a Job handle.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
from pyspark import cloudpickle

def fn(*args, **kwargs):
    import socket
    import time
    # time.sleep(5)
    print(socket.gethostname())
    return f"{args}___{kwargs}___{socket.gethostname()}"
rdd = sc.parallelize(range(10), 10)\
    .map(lambda _: cloudpickle.dumps(fn(_)))

idx = 1
inst = []
for c in rdd.collect():
    print(idx, c)
    inst.append(c)
    idx += 1
print(len(set(inst)))
rdd = sc.parallelize(range(20), 20)\
    .map(lambda _: cloudpickle.dumps(fn()))

idx = 1
inst = []
for c in rdd.collect():
    print(idx, c)
    inst.append(c)
    idx += 1
print(len(set(inst)))
len(set(inst))
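Each collected element above is a pickled byte string, and counting distinct elements approximates counting distinct executor hosts (in the first experiment every element is distinct regardless, since the partition index `_` is baked into the arguments). A minimal local sketch of the same serialize/deserialize round trip, using the standard pickle module in place of PySpark's vendored cloudpickle (both expose the same dumps/loads interface):

```python
import pickle
import socket

def fn(*args, **kwargs):
    # Mirrors the notebook's fn: embed the executing host's name in the result
    return f"{args}___{kwargs}___{socket.gethostname()}"

# Serialize as the map() step does, then deserialize as you would after collect()
blobs = [pickle.dumps(fn()) for _ in range(20)]
results = {pickle.loads(b) for b in blobs}

# Locally everything runs on one host, so only one distinct result survives
print(len(results))  # prints 1 on a single machine
```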
%stop_session
%status
%number_of_workers 20