
Data types¶

Numbers¶

Python supports several numeric types out of the box, including integers (int), floating-point numbers (float), and complex numbers. Arithmetic operations follow standard mathematical precedence rules (parentheses, exponents, multiplication/division, addition/subtraction). Understanding numeric types is foundational because NumPy extends these into powerful n-dimensional arrays, and every machine learning model ultimately operates on numbers.

1 + 1

1 * 3

1 / 2

The ** operator raises a number to a power, so this is 2 to the power of 4

2 ** 4

4 % 2

5 % 2

(2 + 3) * (5 + 5)
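As a quick check of the numeric types and precedence rules described above, the sketch below inspects a few results. The variable names are illustrative and not part of the original notebook.

```python
# Illustrative names; not part of the original notebook
whole = 7          # int
ratio = 7 / 2      # the / operator always returns a float
z_num = 3 + 4j     # complex number: real part 3, imaginary part 4

print(type(whole).__name__, type(ratio).__name__, type(z_num).__name__)
print(ratio, abs(z_num))

# Precedence: the exponent is evaluated first, then multiplication, then addition
precedence_result = 2 + 3 * 2 ** 2
print(precedence_result)
```

Adding parentheses, as in (2 + 3) * 2 ** 2, would change the order of evaluation and therefore the result.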

Variable Assignment¶

Variables in Python act as labels that point to objects stored in memory. Unlike statically typed languages, Python infers the type at runtime, so you can reassign a variable to a completely different type. Variable naming conventions matter: names must start with a letter or underscore, cannot begin with a number, and should be descriptive. In data science workflows, clear variable names like learning_rate or batch_size make code far more readable than single letters.

# Names cannot start with a number or contain special characters (underscores are allowed)
name_of_var = 2

x = 2
y = 3

z = x + y
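To make the point about runtime typing concrete, here is a minimal sketch in which one name is rebound to objects of different types. The name value is illustrative only.

```python
# 'value' is an illustrative name, not from the original notebook
value = 2
first_type = type(value).__name__     # the label points at an int

value = 'two'                         # same label, now pointing at a str
second_type = type(value).__name__

print(first_type, second_type)
```

Rebinding a name like this is legal, but it can make code harder to follow, so it is usually worth keeping one name tied to one kind of value.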

Strings¶

Strings are sequences of characters, created with single quotes ('...') or double quotes ("..."). Python treats both identically, but using double quotes is handy when your string contains apostrophes. Strings are immutable – once created, individual characters cannot be changed in place. This immutability concept reappears in NumPy when dealing with array views versus copies, so it is worth internalizing early.
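A short sketch of what immutability means in practice: assigning to a character position fails, so a changed string has to be built as a new object. The names greeting and changed are illustrative.

```python
greeting = 'hello'   # illustrative name

try:
    greeting[0] = 'J'            # strings do not support item assignment
except TypeError as err:
    print('TypeError:', err)

# Build a new string instead of modifying the existing one
changed = 'J' + greeting[1:]
print(changed)
print(greeting)                  # the original string is untouched
```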

'single quotes'

"double quotes"

" wrap lot's of other quotes"

Printing¶

The print() function sends output to the console. In Jupyter notebooks, the last expression in a cell is automatically displayed, but print() gives you explicit control over what gets shown and when. This distinction matters when debugging data pipelines – you often need to inspect intermediate values mid-cell rather than relying on the automatic display of the final result.

x = 'hello'

If you evaluate x on its own instead of passing it to print(), Jupyter displays the value with quotes to show that it is a string

print(x)

num = 12
name = 'Sam'

f-string Formatting (Modern Python)¶

Python 3.6 introduced f-strings (formatted string literals), which are the recommended way to embed expressions inside strings. Prefix the string with f and place variables or expressions inside curly braces {}. f-strings are generally faster and more readable than the older .format() method, which is shown first below for comparison. In data science, you will use f-strings constantly for logging metrics, printing model accuracy, and generating dynamic file paths.

print('My number is: {one}, and my name is: {two}'.format(one=num,two=name))

print('My number is: {}, and my name is: {}'.format(num,name))

The same output written as an f-string (tested 6/14/2023)

print(f"My number is: {num}, and my name is {name}")
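Since f-strings accept arbitrary expressions and format specifiers inside the braces, a small sketch of metric logging may help. The values of accuracy and epoch are made up for illustration.

```python
accuracy = 0.91234   # hypothetical metric value
epoch = 3            # hypothetical epoch counter

# Any expression can go inside the braces; ':.2f' formats to two decimal places
message = f"Epoch {epoch + 1}: accuracy = {accuracy:.2f}"
print(message)
```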

String Indexing¶

Strings support bracket-based indexing to access individual characters by position. Python uses zero-based indexing, meaning the first character is at index 0. This same indexing convention carries directly into NumPy arrays and Pandas DataFrames, so mastering it here pays dividends throughout your data science journey.

s = 'abcdefghijkl'

s[2]

String Slicing¶

Slicing extracts a substring using the syntax s[start:stop], where start is inclusive and stop is exclusive. Omitting start defaults to the beginning; omitting stop defaults to the end. An optional third value, s[start:stop:step], selects every step-th character. This start:stop:step pattern is identical to how NumPy array slicing works, making it one of the most important Python patterns to internalize for scientific computing.

s[:3]

slice the other way around: from a given index through the end of the string

s[1:]

to grab a section, give a start index and an end index (the end index itself is not included)

s[2:5]
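The examples above use only start and stop. The sketch below adds the optional step value; letters is an illustrative name holding the same text as s.

```python
letters = 'abcdefghijkl'   # same text as s above, under an illustrative name

every_other = letters[::2]     # every second character from the start
middle_step = letters[1:8:3]   # indices 1, 4 and 7
backwards = letters[::-1]      # a negative step walks from the end to the start

print(every_other, middle_step, backwards)
```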

Lists¶

Lists are Python’s most versatile ordered collection. They can hold mixed types (integers, strings, even other lists), are mutable (you can change elements in place), and support methods like append(), pop(), and sort(). In data science, raw data often arrives as Python lists before being converted to NumPy arrays with np.array(). Understanding list behavior – especially indexing, slicing, and mutability – provides the mental model you need for working with arrays and tensors.

[1,2,3]

['hi',1,[1,2]]

my_list = ['a','b','c']

my_list.append('d')

my_list

Note that indexing and slicing work the same way as for the strings explained above

my_list[0]

my_list[1]

my_list[1:]

my_list[:1]

reassign elements in list using index of element

my_list[0] = 'NEW'

my_list
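The paragraph above also mentions sort() and pop(), which change a list in place. A minimal sketch, using an illustrative list called scores:

```python
scores = [3, 1, 2]       # illustrative list

scores.append(5)         # now [3, 1, 2, 5]
scores.sort()            # sorts in place and returns None
last = scores.pop()      # removes and returns the final element

print(scores, last)
```

Because these methods modify the list itself, there is no need to reassign the result; writing scores = scores.sort() would actually replace the list with None.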

Nested Lists¶

Lists can contain other lists, creating nested structures that represent multi-dimensional data. Accessing elements requires chaining index operations – nest[3][2][0] drills from the outermost list inward. This nesting concept maps directly to multi-dimensional NumPy arrays, where arr[i][j][k] or arr[i, j, k] accesses elements in 3D arrays used for image data (height, width, channels) and video tensors.

nest = [1,2,3,[4,5,['target']]]

this returns the nested list, since that is what is stored at index 3

nest[3]

to access elements within that nested list

nest[3][2]

and further into nested list

nest[3][2][0]

Dictionaries¶

Dictionaries store data as key-value pairs, providing O(1) average lookup time by key. Unlike lists which are accessed by integer index, dictionaries use descriptive keys like strings. In machine learning, dictionaries are everywhere: model hyperparameters ({'learning_rate': 0.01, 'epochs': 100}), JSON API responses, feature mappings, and configuration files. Values can be any Python object, including lists and other dictionaries, enabling deeply nested data structures.

d = {'key1':'item1','key2':'item2'}

access values using keys not index

d['key1']

can even store lists as values

d = {'k1':[1,2,3]}

to access values, use key

d['k1']

then can access list using indexing with key noted first

d['k1'][1]

or can assign list to variable

my_list = d['k1']

my_list

my_list[1]

nested dictionaries

d = {'k1':{'innerkey':[1,2,3]}}

d['k1']

d['k1']['innerkey']

d['k1']['innerkey'][1]
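Dictionaries are mutable, so keys can be added or overwritten after creation. The sketch below reuses the hyperparameter example from the text; the batch_size and dropout entries are hypothetical additions.

```python
params = {'learning_rate': 0.01, 'epochs': 100}   # example from the text above

params['batch_size'] = 32     # add a new key (hypothetical value)
params['epochs'] = 50         # overwrite an existing value

# .get() returns a fallback instead of raising KeyError when a key is missing
dropout = params.get('dropout', 0.0)

print(params)
print(dropout)
```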

Booleans¶

Booleans represent truth values: True or False. They are the result of comparison operations and form the basis of conditional logic. In NumPy, boolean arrays become a powerful tool for boolean masking – selecting elements from an array that satisfy a condition, such as arr[arr > 0] to filter out negative values. Understanding booleans here prepares you for data filtering, which is one of the most common operations in data analysis.

True

False
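As a plain-Python analogue of the masking idea described above (no NumPy required), the sketch below turns comparisons into booleans and then uses them to keep matching elements. It relies on list comprehensions, which are introduced later in this notebook; values is an illustrative list.

```python
values = [-2, 5, 0, 7]   # illustrative data

is_positive = values[1] > 0                  # a single comparison yields a bool
flags = [v > 0 for v in values]              # one bool per element
positives = [v for v in values if v > 0]     # keep only elements where the test is True

print(is_positive, flags, positives)
```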

Tuples¶

Tuples are ordered, immutable sequences created with parentheses (). Once created, their elements cannot be changed, added, or removed. This immutability makes tuples useful for fixed collections like coordinates, RGB color values, or array shapes. In NumPy, the .shape attribute returns a tuple (e.g., (3, 4) for a 3x4 matrix), and many functions expect tuples for dimension specifications.

t = (1,2,3)

t[0]

tuples are immutable and do not support item reassignment, so you cannot reassign an element as you would with a list

this means a tuple is the right choice when you don’t want a user to be able to change what’s inside; the line below raises a TypeError

t[0] = 'NEW'
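The reassignment above stops the cell with a TypeError. If you want to see the error without halting execution, one option is to catch it, as in this sketch (point and the px, py, pz names are illustrative):

```python
point = (1, 2, 3)   # illustrative tuple

try:
    point[0] = 'NEW'
except TypeError as err:
    message = str(err)
    print('TypeError:', message)

# Reading values is still fine, for example by unpacking into separate names
px, py, pz = point
print(px, py, pz)
```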

Sets¶

Sets are unordered collections of unique elements. Adding a duplicate to a set silently does nothing – the set enforces uniqueness automatically. Use the set() function to deduplicate a list, or create a set directly with curly braces {1, 2, 3}. Sets also support mathematical set operations like union, intersection, and difference. In data preprocessing, sets are useful for finding unique categories, checking membership efficiently, and comparing feature sets between datasets.

sets = collections of unique elements

{1,2,3}

So will reduce duplicates down to unique elements

{1,2,3,1,2,1,2,3,3,3,3,2,2,2,1,1,2}

can use set as function to grab unique elements for you

array = [1,1,1,2,2,3,3,3,3,4,4,5,5,5,5,5,6]
set(array)

to add, use add() instead of append

s = {1,2,3}

s.add(5)

if you try to add a duplicate you won’t get an error but your set will remain unchanged

s.add(1)
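The set operations mentioned above (union, intersection, difference) can be sketched with two made-up sets of category labels:

```python
train_labels = {'cat', 'dog', 'bird'}   # hypothetical category sets
test_labels = {'dog', 'bird', 'fish'}

both = train_labels & test_labels         # intersection: in both sets
either = train_labels | test_labels       # union: in at least one set
only_train = train_labels - test_labels   # difference: in train but not in test

print(both, either, only_train)
print('cat' in train_labels)              # membership test
```

Sets are unordered, so the printed order of elements may differ from run to run.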

Comparison Operators¶

Comparison operators (>, <, >=, <=, ==, !=) evaluate two values and return a boolean result. The equality operator == checks if two values are the same, while = is the assignment operator – confusing these is a common beginner mistake. In NumPy, comparison operators are vectorized: applying arr > 5 to an array returns a boolean array of the same shape, which is the foundation of data filtering and conditional selection in scientific computing.

Comparison operators return boolean values

1 > 2

1 < 2

1 >= 1

1 <= 4

to check for equality, use ==; a single = is interpreted by Python as variable assignment

1 == 1

'hi' == 'bye'

to check for inequality

1 != 3

Logic Operators¶

The logical operators and, or, and not combine boolean expressions. and requires both conditions to be True; or requires at least one; not inverts the result. In NumPy, the Python keywords and/or do not work element-wise on arrays – instead you use & (and), | (or), and ~ (not) with parentheses around each condition, for example (arr > 3) & (arr < 7). This distinction trips up many beginners when transitioning from Python to NumPy.

(1 > 2) and (2 < 3)

(1 > 2) or (2 < 3)

(1 == 2) or (2 == 3) or (4 == 4)
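The cells above cover and and or; the not operator is sketched below, with illustrative variable names.

```python
in_range = (5 > 3) and (5 < 7)    # both conditions hold
negated = not in_range             # not flips True to False
fallback = (1 == 2) or (not (1 == 2))   # the second operand rescues the or

print(in_range, negated, fallback)
```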

if, elif, else Statements¶

Conditional statements control the flow of execution based on boolean conditions. Python uses indentation (not braces) to define code blocks, which enforces readable code structure. The if block runs when its condition is True; elif (else-if) checks additional conditions in order; else catches everything that did not match. In ML pipelines, conditionals are used for branching logic like early stopping, selecting different model architectures, or handling edge cases in data preprocessing.

if 1 < 2:
    print('Yep!')

if 1 < 2:
    print('yep!')

if 1 < 2:
    print('first')
else:
    print('last')

if 1 > 2:
    print('first')
else:
    print('last')

if 1 == 2:
    print('first')
elif 3 == 3:
    print('middle')
else:
    print('Last')

keep in mind if using multiple elif statements, will stop at first one true, so:

if 1==2:
    print('first')
elif 4==4:
    print('second')
elif 3==3:
    print('middle')
else:
    print('last')

for Loops¶

A for loop iterates over each element in a sequence (list, string, range, etc.), executing the indented block once per element. The loop variable (e.g., item) takes on the value of each element in turn. While for loops are intuitive, they are slow for large numerical datasets because Python executes one iteration at a time. This is exactly why NumPy exists – it replaces explicit loops with vectorized operations that run in optimized C code, often 10-100x faster.

seq = [1,2,3,4,5]

Use for loop to perform action for each element in sequence

for item in seq:
    print(item)

for item in seq:
    print('Yep')

for jelly in seq:
    print(jelly+jelly)

while Loops¶

A while loop repeats its body as long as a condition remains True. Unlike for loops which iterate a known number of times, while loops are useful when the termination condition depends on runtime state. Be careful to update the condition variable inside the loop, or you will create an infinite loop. In ML training, the conceptual equivalent is the training loop that runs until convergence, a maximum number of epochs, or an early stopping criterion.

Use while loop to perform action until a certain condition is met

the example below uses .format(); with f-string notation the same line would be print(f'i is: {i}')

i = 1
while i < 5:
    print('i is: {}'.format(i))
    i = i+1

range()¶

The range() function produces a lazy sequence of integers: a range object computes each value on demand rather than storing them all in memory. Strictly speaking it is not a generator but an immutable sequence type, so it also supports len(), indexing, and repeated iteration. range(n) produces values from 0 to n-1; range(start, stop, step) provides full control. To see the actual values, wrap it in list(). This lazy evaluation pattern is important in data science: when dealing with millions of rows, generating data on demand rather than all at once prevents memory exhaustion.

range(5)

for i in range(5):
    print(i)

range returns a lazy range object rather than a list; if you want a list, convert it like so

list(range(5))
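The three-argument form range(start, stop, step) is not shown in the cells above, so here is a brief sketch. The variable names are illustrative.

```python
evens = list(range(0, 10, 2))        # start at 0, stop before 10, step by 2
countdown = list(range(5, 0, -1))    # a negative step counts down

five = range(5)
# A range object supports len(), indexing and membership tests without building a list
print(len(five), five[2], 3 in five)
print(evens, countdown)
```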

List Comprehension¶

List comprehensions provide a concise, Pythonic way to create new lists by applying an expression to each element of an existing iterable, all in a single line. The syntax [expression for item in iterable] replaces the multi-line pattern of initializing an empty list, looping, and appending. List comprehensions are not just syntactic sugar – they are generally faster than equivalent for loops because Python optimizes them internally. This compact style of expressing transformations is a stepping stone toward understanding NumPy’s vectorized operations.

x = [1,2,3,4]

list comprehension is just an easier way to perform some action with an already existing list, to produce another list as output

Here is using for loop to iterate through list and perform action

out = []
for item in x:
    out.append(item**2)
print(out)

to instead use list comprehension - put inside brackets - don’t have to instantiate empty list first

[item**2 for item in x]

to do it all in one motion:

out = [item**2 for item in x]

out

Functions¶

Functions encapsulate reusable blocks of logic, defined with the def keyword. They accept parameters (inputs), execute a body of code, and optionally return a value. Default parameter values (e.g., param1='default') make arguments optional. Docstrings – triple-quoted strings immediately after the def line – document what the function does and can be viewed in Jupyter with Shift+Tab. Writing well-structured functions is critical in ML projects for keeping preprocessing, training, and evaluation logic modular and testable.

using = in parameters sets default value for parameter

Triple quotes (""") let you write a multi-line string, which is how a documentation string (docstring) is added

NOTE: to view a function’s docstring in Jupyter, type the function name without (), then press Shift+Tab to see the description from your docstring

You will get signature call (what function expects to receive) then your docstring

This can be used with pre-existing functions as well for a quick reference to the function usage

def my_func(param1='default'):
    """
    Docstring goes here.
    """
    print(param1)

if you leave off the (), you do not call the function; you only ask Python what this object is

my_func

to call/execute function

my_func()

my_func('new param')

my_func(param1='new param')

use return to send a value back to the caller so it can be assigned to a variable

def square(x):
    return x**2

out = square(2)

print(out)
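Outside Jupyter, a docstring can also be read from the function's __doc__ attribute or with the built-in help(). The sketch below combines that with a default parameter; scale and its parameters are hypothetical names chosen for illustration.

```python
def scale(value, factor=2):
    """Return value multiplied by factor."""
    return value * factor

print(scale.__doc__)                   # the docstring is stored on the function object

default_result = scale(5)              # factor falls back to its default of 2
custom_result = scale(5, factor=10)    # the default can be overridden by keyword

print(default_result, custom_result)
```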

Lambda Expressions¶

Lambda expressions create small, anonymous (unnamed) functions in a single line using the syntax lambda arguments: expression. They are most useful when you need a quick throwaway function as an argument to higher-order functions like map(), filter(), or sorted(). While regular def functions are preferred for anything complex or reusable, lambdas keep code compact for simple transformations – like applying a normalization step or extracting a sort key in data processing pipelines.

def times2(var):
    return var*2

times2(2)

lambda var: var*2
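A lambda on its own line, as above, creates a function object and immediately discards it. The sketch below shows the more typical use as a sort key passed to sorted(); the pairs list is made-up data.

```python
pairs = [('b', 3), ('a', 1), ('c', 2)]   # made-up (name, number) pairs

# key receives each element and returns the value to sort by
by_number = sorted(pairs, key=lambda pair: pair[1])
by_name = sorted(pairs, key=lambda pair: pair[0])

print(by_number)
print(by_name)
```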

map and filter¶

map(function, iterable) applies a function to every element of an iterable and returns a lazy map object. filter(function, iterable) returns only the elements for which the function returns True. Both return lazy iterators, so wrap them in list() to see results. These functional programming tools pair naturally with lambda expressions for concise data transformations. In practice, Pandas .apply() and NumPy vectorization largely replace map/filter for tabular and array data, but understanding these primitives builds your functional programming intuition.

seq = [1,2,3,4,5]

use map to apply a function to each element in a list; the general form is map(function, seq), where function is any callable, such as times2 defined above

if you don’t create new list, will just return object

map(times2,seq)

to assign results to a list so they’re stored

list(map(times2,seq))

generally using map because you don’t want to write out a whole function - you want to perform a simple expression

so instead you use lambda

the general form is lambda <what you want to pass in>: <what you want to return out>, passed to map together with seq

list(map(lambda var: var*2,seq))

use filter in a similar way

here, the lambda expression tests a condition and returns True or False for each element of the iterable (a list in this case); filter keeps only the elements for which it returns True

filter(lambda item: item%2 == 0,seq)

cast to list to see actual output

For example to see even numbers in list

list(filter(lambda item: item%2 == 0,seq))

Methods¶

Methods are functions that belong to a specific object type. You call them with dot notation: object.method(). Each Python type has its own set of methods – strings have .lower(), .upper(), .split(); lists have .append(), .pop(), .sort(); dictionaries have .keys(), .values(), .items(). In Jupyter, typing an object name followed by a dot and pressing Tab reveals all available methods. This autocompletion habit will serve you well when exploring NumPy array methods and Pandas DataFrame methods.

st = 'hello my name is Sam'

so if you type st. (the name followed by a dot) and then hit Tab in Jupyter, you will get a list of all methods you can call on a string object

st.lower()

st.upper()

splits on whitespace by default

st.split()

tweet = 'Go Sports! #Sports'

tweet.split('#')

tweet.split('#')[1]

d.keys()

d.items()

lst = [1,2,3]

pop will “pop” the last item out of the list and return it - so remove from list and return

lst.pop()

lst

can also use pop with index

first = lst.pop(0)

lst

first

'x' in [1,2,3]

'x' in ['x','y','z']

Tuple Unpacking¶

Tuple unpacking lets you assign multiple variables from a tuple (or any iterable) in a single statement. When iterating over a list of tuples with for (a, b) in list_of_tuples, each tuple is automatically unpacked into the named variables. This pattern appears frequently in Python’s enumerate() (which yields index-value pairs) and in zip() (which pairs elements from multiple iterables). In data science, tuple unpacking is used when iterating over dictionary .items(), unpacking train/test splits, and destructuring return values from functions.

x = [(1,2),(3,4),(5,6)]

x[0]

normal iterating through items

for item in x:
    print(item)

to instead unpack within for loop

for (a,b) in x:
    print(a)
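The same unpacking pattern applies to enumerate(), zip(), and dictionary .items(), all mentioned above. A minimal sketch with made-up feature names and values:

```python
names = ['age', 'height', 'weight']   # made-up feature names
values = [30, 175, 70]                # made-up values

# enumerate yields (index, element) pairs
indexed = []
for index, name in enumerate(names):
    indexed.append(f"{index}:{name}")

# zip pairs up elements from two iterables
paired = list(zip(names, values))

# .items() yields (key, value) pairs from a dictionary
record = dict(paired)
lines = []
for key, val in record.items():
    lines.append(f"{key}={val}")

print(indexed)
print(paired)
print(lines)
```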