Data types
Numbers
Python supports several numeric types out of the box, including integers (int), floating-point numbers (float), and complex numbers. Arithmetic operations follow standard mathematical precedence rules (parentheses, exponents, multiplication/division, addition/subtraction). Understanding numeric types is foundational because NumPy extends these into powerful n-dimensional arrays, and every machine learning model ultimately operates on numbers.
1 + 1
1 * 3
1 / 2
** raises a number to a power, so this is 2 to the power of 4
2 ** 4
4 % 2
5 % 2
(2 + 3) * (5 + 5)
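As a small illustrative sketch of the numeric types and precedence rules described above (the variable names are made up for this example):

```python
# Each literal has a concrete numeric type.
int_value = 7
float_value = 7.0
complex_value = 2 + 3j

type_names = [type(v).__name__ for v in (int_value, float_value, complex_value)]
print(type_names)

# True division always returns a float; floor division of two ints returns an int.
true_div = 7 / 2      # 3.5
floor_div = 7 // 2    # 3
remainder = 7 % 2     # 1

# Exponentiation binds tighter than multiplication, which binds tighter than addition.
precedence = 2 + 3 * 2 ** 2   # evaluated as 2 + (3 * (2 ** 2))
print(true_div, floor_div, remainder, precedence)
```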
Variable Assignment
Variables in Python act as labels that point to objects stored in memory. Unlike statically typed languages, Python is dynamically typed: the type belongs to the object rather than the name, so you can reassign a variable to an object of a completely different type. Variable naming conventions matter: names must start with a letter or underscore (never a digit), may contain only letters, digits, and underscores, and should be descriptive. In data science workflows, clear variable names like learning_rate or batch_size make code far more readable than single letters.
# Cannot start with a number or special character
name_of_var = 2
x = 2
y = 3
z = x + y
z
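A minimal sketch of rebinding a name to a different type, plus a quick way to check whether a string would be a valid name. The variable names here are illustrative only:

```python
value = 2
first_type = type(value).__name__    # the name currently points at an int

value = 'two'                        # rebinding the same name to a str is allowed
second_type = type(value).__name__

# str.isidentifier() reports whether a string is a syntactically valid name.
valid = 'learning_rate'.isidentifier()
invalid = '2nd_rate'.isidentifier()  # starts with a digit, so not a valid name

print(first_type, second_type, valid, invalid)
```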
Strings
Strings are sequences of characters, created with single quotes ('...') or double quotes ("..."). Python treats both identically, but using double quotes is handy when your string contains apostrophes. Strings are immutable: once created, individual characters cannot be changed in place. The related question of whether an operation modifies an object in place or produces a new one reappears in NumPy when dealing with array views versus copies, so it is worth internalizing early.
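To make the immutability point concrete, here is a short sketch (the names word and new_word are chosen just for this example):

```python
word = 'hello'

try:
    word[0] = 'J'              # strings do not support item assignment
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# Build a new string instead of editing the old one.
new_word = 'J' + word[1:]

print(error_name, word, new_word)
```

The original string is untouched; the "change" is really a new object.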
'single quotes'
"double quotes"
" wrap lot's of other quotes"
Printing
The print() function sends output to the console. In Jupyter notebooks, the last expression in a cell is automatically displayed, but print() gives you explicit control over what gets shown and when. This distinction matters when debugging data pipelines, because you often need to inspect intermediate values mid-cell rather than relying on the automatic display of the final result.
x = 'hello'
If you don't pass the value to print(), Jupyter displays it with quotes to show it's a string
x
print(x)
num = 12
name = 'Sam'
f-string Formatting (Modern Python)
Python 3.6 introduced f-strings (formatted string literals), which are the recommended way to embed expressions inside strings. Prefix the string with f and place variables or expressions inside curly braces {}. f-strings are typically faster and more readable than the older .format() method shown below. In data science, you will use f-strings constantly for logging metrics, printing model accuracy, and generating dynamic file paths.
print('My number is: {one}, and my name is: {two}'.format(one=num,two=name))
print('My number is: {}, and my name is: {}'.format(num,name))
Updated way to run - tested 6/14/2023
print(f"My number is: {num}, and my name is {name}")
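Beyond plain variables, the braces can hold expressions and an optional format specification after a colon. A small sketch, assuming made-up metric values (accuracy and epoch are hypothetical):

```python
accuracy = 0.91734   # hypothetical metric value
epoch = 3            # hypothetical zero-based epoch counter

# Expressions are evaluated inside the braces; a format spec follows the colon.
summary = f"epoch {epoch + 1}: accuracy {accuracy:.2%}"
padded = f"{epoch:03d}"        # zero-padded to width 3
rounded = f"{accuracy:.3f}"    # three decimal places

print(summary)
print(padded, rounded)
```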
String Indexing
Strings support bracket-based indexing to access individual characters by position. Python uses zero-based indexing, meaning the first character is at index 0. This same indexing convention carries directly into NumPy arrays and Pandas DataFrames, so mastering it here pays dividends throughout your data science journey.
s = 'abcdefghijkl'
s[2]
String Slicing
Slicing extracts a substring using the syntax s[start:stop], where start is inclusive and stop is exclusive. Omitting start defaults to the beginning; omitting stop defaults to the end. An optional third value, s[start:stop:step], takes every step-th character. This start:stop:step pattern is the same one NumPy array slicing uses, making it one of the most important Python patterns to internalize for scientific computing.
s[:3]
slice the other way around - from index and beyond
s[1:]
so to grab a section - from index to end index
s[2:5]
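The step part of start:stop:step is not exercised by the cells above, so here is a brief sketch using the same alphabet string under a different name (letters is just an illustrative name):

```python
letters = 'abcdefghijkl'

every_other = letters[::2]        # indices 0, 2, 4, ...
middle_step = letters[1:8:3]      # indices 1, 4, 7
reversed_letters = letters[::-1]  # a negative step walks backwards
last_three = letters[-3:]         # negative indices count from the end

print(every_other, middle_step, reversed_letters, last_three)
```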
Lists
Lists are Python's most versatile ordered collection. They can hold mixed types (integers, strings, even other lists), are mutable (you can change elements in place), and support methods like append(), pop(), and sort(). In data science, raw data often arrives as Python lists before being converted to NumPy arrays with np.array(). Understanding list behavior (especially indexing, slicing, and mutability) provides the mental model you need for working with arrays and tensors.
[1,2,3]
['hi',1,[1,2]]
my_list = ['a','b','c']
my_list.append('d')
my_list
Note indexing is same as for string explained above
my_list[0]
my_list[1]
my_list[1:]
my_list[:1]
reassign elements in list using index of element
my_list[0] = 'NEW'
my_list
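Mutability has one consequence worth seeing early: two names can refer to the same list. The sketch below uses illustrative names (scores, alias, snapshot) to show in-place methods and the difference between an alias and a copy:

```python
scores = [3, 1, 2]
scores.append(5)       # list is now [3, 1, 2, 5]
last = scores.pop()    # removes and returns the last element
scores.sort()          # sorts in place and returns None

# Two names can refer to the same list, so a change via one is visible via the other.
alias = scores
alias[0] = 99

# A full slice (or list.copy()) makes a shallow copy that is independent at the top level.
snapshot = scores[:]
snapshot[1] = -1

print(last, scores, snapshot)
```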
Nested Lists
Lists can contain other lists, creating nested structures that represent multi-dimensional data. Accessing elements requires chaining index operations: nest[3][2][0] drills from the outermost list inward. This nesting concept maps directly to multi-dimensional NumPy arrays, where arr[i][j][k] or arr[i, j, k] accesses elements in 3D arrays used for image data (height, width, channels) and video tensors.
nest = [1,2,3,[4,5,['target']]]
index 3 holds the nested list, so this returns that inner list
nest[3]
to access elements within that nested list
nest[3][2]
and further into nested list
nest[3][2][0]
Dictionaries
Dictionaries store data as key-value pairs, providing O(1) average lookup time by key. Unlike lists which are accessed by integer index, dictionaries use descriptive keys like strings. In machine learning, dictionaries are everywhere: model hyperparameters ({'learning_rate': 0.01, 'epochs': 100}), JSON API responses, feature mappings, and configuration files. Values can be any Python object, including lists and other dictionaries, enabling deeply nested data structures.
d = {'key1':'item1','key2':'item2'}
d
access values using keys not index
d['key1']
can even store lists as values
d = {'k1':[1,2,3]}
to access values, use key
d['k1']
then can access list using indexing with key noted first
d['k1'][1]
or can assign list to variable
my_list = d['k1']
my_list
my_list[1]
nested dictionaries
d = {'k1':{'innerkey':[1,2,3]}}
d['k1']
d['k1']['innerkey']
d['k1']['innerkey'][1]
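The cells above only read from dictionaries. As a rough sketch of the other everyday operations, here is a hypothetical hyperparameter dictionary like the one mentioned in the text (the keys and values are illustrative):

```python
# Hypothetical hyperparameter dictionary, mirroring the example in the text.
params = {'learning_rate': 0.01, 'epochs': 100}

params['batch_size'] = 32                # add a new key
params['epochs'] = 50                    # overwrite an existing value
momentum = params.get('momentum', 0.9)   # .get() returns a default instead of raising KeyError
has_lr = 'learning_rate' in params       # membership tests check keys, not values

for key, value in params.items():
    print(f"{key} = {value}")
```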
Booleans
Booleans represent truth values: True or False. They are the result of comparison operations and form the basis of conditional logic. In NumPy, boolean arrays become a powerful tool for boolean masking: selecting elements from an array that satisfy a condition, such as arr[arr > 0] to keep only the positive values. Understanding booleans here prepares you for data filtering, which is one of the most common operations in data analysis.
True
False
Tuples
Tuples are ordered, immutable sequences created with parentheses (). Once created, their elements cannot be changed, added, or removed. This immutability makes tuples useful for fixed collections like coordinates, RGB color values, or array shapes. In NumPy, the .shape attribute returns a tuple (e.g., (3, 4) for a 3x4 matrix), and many functions expect tuples for dimension specifications.
t = (1,2,3)
t[0]
tuples are immutable and do not support item assignment, so you cannot reassign an element like you would in a list
use a tuple when you don't want a user to be able to change what's inside; the assignment below raises a TypeError
t[0] = 'NEW'
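The failing assignment above can be caught and inspected. The sketch below also shows two tuple details that the text does not cover but that follow from standard Python behavior: tuples as dictionary keys and the one-element tuple syntax (all names are illustrative):

```python
point = (1, 2, 3)

try:
    point[0] = 'NEW'
    outcome = 'changed'
except TypeError:
    outcome = 'TypeError'

# Tuples are hashable when their elements are, so they can serve as dictionary keys.
grid = {(0, 0): 'origin', (3, 4): 'corner'}

single = (5,)     # the trailing comma, not the parentheses, makes a one-element tuple
not_tuple = (5)   # this is just the int 5

print(outcome, grid[(3, 4)], type(single).__name__, type(not_tuple).__name__)
```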
Sets
Sets are unordered collections of unique elements. Adding a duplicate to a set silently does nothing; the set enforces uniqueness automatically. Use the set() function to deduplicate a list, or create a set directly with curly braces {1, 2, 3}. Sets also support mathematical set operations like union, intersection, and difference. In data preprocessing, sets are useful for finding unique categories, checking membership efficiently, and comparing feature sets between datasets.
sets = collections of unique elements
{1,2,3}
So will reduce duplicates down to unique elements
{1,2,3,1,2,1,2,3,3,3,3,2,2,2,1,1,2}
can use set as function to grab unique elements for you
array = [1,1,1,2,2,3,3,3,3,4,4,5,5,5,5,5,6]
set(array)
to add, use add() instead of append
s = {1,2,3}
s.add(5)
s
if you try to add a duplicate you won't get an error, but your set will remain unchanged
s.add(1)
s
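The union, intersection, and difference operations mentioned in the text are not shown in the cells above. A minimal sketch with hypothetical label sets:

```python
train_labels = {'cat', 'dog', 'bird'}   # hypothetical category sets
test_labels = {'dog', 'bird', 'fish'}

union = train_labels | test_labels        # everything in either set
common = train_labels & test_labels       # only what both share
only_train = train_labels - test_labels   # in train but not in test

# Sets are unordered, so sort when you need a stable, repeatable order.
unique_sorted = sorted(set([3, 1, 3, 2, 1]))

print(sorted(union), sorted(common), sorted(only_train), unique_sorted)
```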
Comparison Operators
Comparison operators (>, <, >=, <=, ==, !=) evaluate two values and return a boolean result. The equality operator == checks if two values are the same, while = is the assignment operator; confusing these is a common beginner mistake. In NumPy, comparison operators are vectorized: applying arr > 5 to an array returns a boolean array of the same shape, which is the foundation of data filtering and conditional selection in scientific computing.
Comparison operators return boolean values
1 > 2
1 < 2
1 >= 1
1 <= 4
to check for equality, use ==; using only = will make Python think it's variable assignment
1 == 1
'hi' == 'bye'
to check for inequality
1 != 3
Logic Operators
The logical operators and, or, and not combine boolean expressions. and requires both conditions to be True; or requires at least one; not inverts the result. In NumPy, the Python keywords and/or do not work element-wise on arrays; instead you use & (and), | (or), and ~ (not) with parentheses around each condition, for example (arr > 3) & (arr < 7). This distinction trips up many beginners when transitioning from Python to NumPy.
(1 > 2) and (2 < 3)
(1 > 2) or (2 < 3)
(1 == 2) or (2 == 3) or (4 == 4)
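The cells above cover and and or; the sketch below adds not and illustrates short-circuit evaluation, which is standard Python behavior though not discussed in the text (names are illustrative):

```python
a = (1 > 2) and (2 < 3)   # the left side is already False, so the result is False
b = (1 > 2) or (2 < 3)    # the right side is True, so the result is True
c = not (1 > 2)           # not inverts False to True

# and/or short-circuit: the right operand is skipped when the left one decides the result.
values = []
safe = bool(values) and values[0] > 0   # values[0] is never evaluated for an empty list

print(a, b, c, safe)
```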
if, elif, else Statements
Conditional statements control the flow of execution based on boolean conditions. Python uses indentation (not braces) to define code blocks, which enforces readable code structure. The if block runs when its condition is True; elif (else-if) checks additional conditions in order; else catches everything that did not match. In ML pipelines, conditionals are used for branching logic like early stopping, selecting different model architectures, or handling edge cases in data preprocessing.
if 1 < 2:
    print('Yep!')
if 1 < 2:
    print('yep!')
if 1 < 2:
    print('first')
else:
    print('last')
if 1 > 2:
    print('first')
else:
    print('last')
if 1 == 2:
    print('first')
elif 3 == 3:
    print('middle')
else:
    print('Last')
keep in mind that with multiple elif branches, Python stops at the first condition that is true, so this prints 'second' and never checks 3 == 3:
if 1 == 2:
    print('first')
elif 4 == 4:
    print('second')
elif 3 == 3:
    print('middle')
else:
    print('last')
for Loops
A for loop iterates over each element in a sequence (list, string, range, etc.), executing the indented block once per element. The loop variable (e.g., item) takes on the value of each element in turn. While for loops are intuitive, they are slow for large numerical datasets because Python executes one iteration at a time. This is exactly why NumPy exists: it replaces explicit loops with vectorized operations that run in optimized C code, often 10-100x faster.
seq = [1,2,3,4,5]
Use for loop to perform action for each element in sequence
for item in seq:
    print(item)
for item in seq:
    print('Yep')
for jelly in seq:
    print(jelly+jelly)
while Loops
A while loop repeats its body as long as a condition remains True. Unlike for loops which iterate a known number of times, while loops are useful when the termination condition depends on runtime state. Be careful to update the condition variable inside the loop, or you will create an infinite loop. In ML training, the conceptual equivalent is the training loop that runs until convergence, a maximum number of epochs, or an early stopping criterion.
Use a while loop to repeat an action for as long as a condition holds
also note that the .format() call below can be written as an f-string instead: print(f'i is: {i}')
i = 1
while i < 5:
    print('i is: {}'.format(i))
    i = i+1
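The training-loop analogy in the text can be sketched with a toy loop that stops on either convergence or an epoch cap. Everything here is hypothetical: halving the loss is only a stand-in for a real training step.

```python
loss = 1.0          # hypothetical starting loss
tolerance = 0.05    # hypothetical convergence threshold
max_epochs = 100    # safety cap so the loop cannot run forever
epoch = 0

while loss > tolerance and epoch < max_epochs:
    loss = loss / 2   # stand-in for one training step
    epoch += 1        # updating the loop state is what prevents an infinite loop

print(f"stopped after {epoch} epochs with loss {loss}")
```

Combining the convergence test with a hard cap is a common way to guard against conditions that never become false.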
range()
The range() function produces a lazy sequence of integers: a range object computes values on demand instead of storing them all in memory. Strictly speaking it is an immutable sequence type rather than a generator, so it also supports len(), indexing, and repeated iteration. range(n) produces values from 0 to n-1; range(start, stop, step) provides full control. To see the actual values, wrap it in list(). This lazy evaluation pattern is important in data science: when dealing with millions of rows, generating data on demand rather than all at once helps avoid memory exhaustion.
range(5)
for i in range(5):
    print(i)
range returns a lazy range object; if you want the values as a list, convert it like so
list(range(5))
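A short sketch of the three-argument form and of the sequence behavior of range objects (variable names are illustrative):

```python
evens = list(range(0, 10, 2))       # start, stop (exclusive), step
countdown = list(range(5, 0, -1))   # a negative step counts down

r = range(1_000_000)
length = len(r)   # known without materializing a million values
tenth = r[10]     # range objects support indexing

# Unlike a generator, the same range can be iterated more than once.
first_pass = list(range(3))
small = range(3)
second_pass = [list(small), list(small)]

print(evens, countdown, length, tenth, second_pass)
```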
List Comprehension
List comprehensions provide a concise, Pythonic way to create new lists by applying an expression to each element of an existing iterable, all in a single line. The syntax [expression for item in iterable] replaces the multi-line pattern of initializing an empty list, looping, and appending. List comprehensions are not just syntactic sugar: they are often somewhat faster than equivalent for loops because they avoid the repeated append lookups and calls. This compact style of expressing transformations is a stepping stone toward understanding NumPy's vectorized operations.
x = [1,2,3,4]
list comprehension is just an easier way to perform some action with an already existing list, to produce another list as output
Here is using for loop to iterate through list and perform action
out = []
for item in x:
    out.append(item**2)
print(out)
to use a list comprehension instead, put the expression inside brackets; you don't have to create an empty list first
[item**2 for item in x]
to do it all in one motion:
out = [item**2 for item in x]
out
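Comprehensions can also filter, branch, and nest. A brief sketch with made-up inputs:

```python
numbers = [1, 2, 3, 4, 5, 6]

# A trailing "if" filters which items are included.
even_squares = [n**2 for n in numbers if n % 2 == 0]

# A conditional expression before "for" chooses the value for every item.
labels = ['even' if n % 2 == 0 else 'odd' for n in numbers]

# Two "for" clauses behave like nested loops, outer loop first.
pairs = [(a, b) for a in [1, 2] for b in ['x', 'y']]

print(even_squares, labels, pairs)
```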
Functions
Functions encapsulate reusable blocks of logic, defined with the def keyword. They accept parameters (inputs), execute a body of code, and optionally return a value. Default parameter values (e.g., param1='default') make arguments optional. Docstrings (triple-quoted strings immediately after the def line) document what the function does and can be viewed in Jupyter with Shift+Tab. Writing well-structured functions is critical in ML projects for keeping preprocessing, training, and evaluation logic modular and testable.
using = in the parameter list sets a default value for that parameter
""" notation (triple quotes) is how to add a multi-line string as a documentation string
NOTE if you want to know what function docstring is - in jupyter - type function name without (), then shift + tab for description from your docstring
You will get signature call (what function expects to receive) then your docstring
This can be used with pre-existing functions as well for a quick reference to the function usage
def my_func(param1='default'):
    """
    Docstring goes here.
    """
    print(param1)
if you don't use (), you will not call the function; you will just ask Python what this object is
my_func
to call/execute function
my_func()
my_func('new param')
my_func(param1='new param')
use return to hand a value back to the caller, which can then assign it to a variable
def square(x):
    return x**2
out = square(2)
print(out)
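Putting defaults, docstrings, and return together, here is a minimal sketch. The function scale and its factor parameter are hypothetical, invented for this example:

```python
def scale(values, factor=2.0):
    """Return a new list with every value multiplied by factor."""
    return [v * factor for v in values]

default_call = scale([1, 2])              # factor falls back to its default
keyword_call = scale([1, 2], factor=10)   # the default is overridden by keyword
doc = scale.__doc__                       # the docstring is stored on the function object

print(default_call, keyword_call)
print(doc)
```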
Lambda Expressions
Lambda expressions create small, anonymous (unnamed) functions in a single line using the syntax lambda arguments: expression. They are most useful when you need a quick throwaway function as an argument to higher-order functions like map(), filter(), or sorted(). While regular def functions are preferred for anything complex or reusable, lambdas keep code compact for simple transformations, like applying a normalization step or extracting a sort key in data processing pipelines.
def times2(var):
    return var*2
times2(2)
lambda var: var*2
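A bare lambda like the one above is created and immediately discarded. The sort-key use mentioned in the text might look like this sketch, where the (name, score) records are invented for illustration:

```python
records = [('resnet', 0.91), ('vgg', 0.88), ('mlp', 0.95)]  # hypothetical (name, score) pairs

# The lambda extracts the score from each record to use as the sort key.
by_score = sorted(records, key=lambda rec: rec[1], reverse=True)
best_name = by_score[0][0]

# A lambda can be called directly once it is wrapped in parentheses.
doubled = (lambda var: var * 2)(21)

print(by_score, best_name, doubled)
```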
map and filter
map(function, iterable) applies a function to every element of an iterable and returns a lazy map object. filter(function, iterable) returns only the elements for which the function returns True. Both return lazy iterators, so wrap them in list() to see results. These functional programming tools pair naturally with lambda expressions for concise data transformations. In practice, Pandas .apply() and NumPy vectorization largely replace map/filter for tabular and array data, but understanding these primitives builds your functional programming intuition.
seq = [1,2,3,4,5]
use map to apply a function to each element in a list; the general form is map(function, seq)
if you don't convert the result to a list, you just get a map object back
map(times2,seq)
to collect the results in a list so they're stored
list(map(times2,seq))
map is often used when you don't want to write out a whole function and only need a simple expression
so instead you use lambda
the pattern is map(lambda <what you want to pass in>: <what you want to return>, seq)
list(map(lambda var: var*2,seq))
use filter in a similar way
here, the lambda expression is evaluated for each element of the iterable (a list in this case), and filter keeps only the elements for which it returns True
filter(lambda item: item%2 == 0,seq)
cast to list to see actual output
For example to see even numbers in list
list(filter(lambda item: item%2 == 0,seq))
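For comparison, the same transformations can be written as a list comprehension, and the laziness of map objects can be observed directly. A small sketch (names are illustrative):

```python
seq = [1, 2, 3, 4, 5]

doubled = list(map(lambda v: v * 2, seq))
evens = list(filter(lambda v: v % 2 == 0, seq))

# Comprehension equivalent of filtering and then mapping.
doubled_evens = [v * 2 for v in seq if v % 2 == 0]

# A map object is an iterator, so it is exhausted after one pass.
lazy = map(lambda v: v * 2, seq)
first_pass = list(lazy)
second_pass = list(lazy)

print(doubled, evens, doubled_evens, first_pass, second_pass)
```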
Methods
Methods are functions that belong to a specific object type. You call them with dot notation: object.method(). Each Python type has its own set of methods: strings have .lower(), .upper(), .split(); lists have .append(), .pop(), .sort(); dictionaries have .keys(), .values(), .items(). In Jupyter, typing an object name followed by a dot and pressing Tab reveals all available methods. This autocompletion habit will serve you well when exploring NumPy array methods and Pandas DataFrame methods.
st = 'hello my name is Sam'
so if you type "st." and then hit Tab in Jupyter, you will get a list of all methods you can call on a string object
st.lower()
st.upper()
splits on whitespace by default
st.split()
tweet = 'Go Sports! #Sports'
tweet.split('#')
tweet.split('#')[1]
d
d.keys()
d.items()
lst = [1,2,3]
pop will "pop" the last item out of the list and return it, so it is removed from the list and returned
lst.pop()
lst
can also use pop with index
first = lst.pop(0)
lst
first
'x' in [1,2,3]
'x' in ['x','y','z']
Tuple Unpacking
Tuple unpacking lets you assign multiple variables from a tuple (or any iterable) in a single statement. When iterating over a list of tuples with for (a, b) in list_of_tuples, each tuple is automatically unpacked into the named variables. This pattern appears frequently in Python's enumerate() (which yields index-value pairs) and in zip() (which pairs elements from multiple iterables). In data science, tuple unpacking is used when iterating over dictionary .items(), unpacking train/test splits, and destructuring return values from functions.
x = [(1,2),(3,4),(5,6)]
x[0]
normal iterating through items
for item in x:
    print(item)
to instead unpack within for loop
for (a,b) in x:
    print(a)
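The enumerate(), zip(), and function-return cases mentioned in the text can be sketched as follows. The data and the helper min_max are hypothetical, written only for this example:

```python
names = ['a', 'b', 'c']
scores = [90, 80, 70]

# enumerate() yields (index, value) pairs that unpack into two loop variables.
indexed = [(i, name) for i, name in enumerate(names)]

# zip() pairs up elements from several iterables.
paired = {name: score for name, score in zip(names, scores)}

# Starred unpacking collects the remaining items into a list.
first, *rest = scores

def min_max(values):
    """Hypothetical helper: return the smallest and largest value as a tuple."""
    return min(values), max(values)

low, high = min_max(scores)   # the returned tuple is unpacked into two names

print(indexed, paired, first, rest, low, high)
```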