Intermediate Python for Data Science

01 Sep

Matplotlib (Check this chapter, Rodrigo)

Data Visualization is a key skill for aspiring data scientists. Matplotlib makes it easy to create meaningful and insightful plots. In this chapter, you will learn to build various types of plots and to customize them to make them more visually appealing and interpretable.

Dictionaries and Pandas

Learn about the dictionary, an alternative to the Python list, and the Pandas DataFrame, the de facto standard to work with tabular data in Python. You will get hands-on practice with creating, manipulating and accessing the information you need from these data structures.

Dictionaries. A dictionary associates pairs of related objects: each unique key maps to a value. Keeping the same information in two parallel lists and searching one to index the other is inefficient and error-prone. Use a dictionary instead of a list when you want to index by unique keys, that is, when you need a lookup table.
1. Create dictionaries. The general way to build them is my_dict = {key1: value1, key2: value2, ...}
2. Access dictionaries. my_dict[key] returns the value associated with key. The .keys() method returns all the keys in the dictionary.
3. Select, update and remove. Use my_dict[key], my_dict[key] = new_value and del my_dict[key] respectively. Assigning to a key that does not exist yet adds a new key:value pair.
4. Observations:
  1. Keys have to be immutable objects (their content cannot change after creation), for example integers, booleans or strings; lists are not allowed as keys.
  2. You can check whether a key is in your dictionary with "key in my_dict".
  3. Dictionaries can contain key:value pairs where the values are themselves dictionaries. You can chain square brackets to select nested elements. Ex. europe['spain']['population']
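
The dictionary operations above can be sketched as follows. This is only a minimal illustration: the europe dictionary follows the example in the notes, but its contents are placeholder values made up for the sketch, not real statistics.

```python
# Placeholder values for illustration only; not real statistics.
europe = {
    "spain": {"capital": "madrid", "population": 46.0},
    "france": {"capital": "paris", "population": 66.0},
}

# Access: chained square brackets reach into the nested dictionary.
spain_capital = europe["spain"]["capital"]
country_names = list(europe.keys())

# Update an existing value, and add a new key with the same syntax.
europe["spain"]["population"] = 47.0
europe["italy"] = {"capital": "rome", "population": 59.0}

# Membership test, then removal.
has_italy = "italy" in europe
del europe["france"]

# Keys must be immutable: using a list as a key raises TypeError.
key_error_message = ""
try:
    bad = {["a", "b"]: 1}
except TypeError as err:
    key_error_message = str(err)

print(spain_capital, country_names, has_italy, sorted(europe))
print("list as key ->", key_error_message)
```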
Pandas package. When we work with large amounts of data, we prefer it in tabular form, where each column can hold a different variable type (which is why NumPy arrays, which hold a single type, are not always an option). For this we have the Pandas package: a high-level data manipulation tool built on NumPy. Pandas stores tabular data in a so-called DataFrame, which is the focus of these notes.
1. Create one. There are different ways to create this Pandas data structure.
  1. From dictionaries. Each dictionary key becomes a column label and each value is a list containing that column's elements: import pandas as pd; my_df = pd.DataFrame(my_dict)
    1. Observation: we can name the rows of the DataFrame with my_df.index = ['row1', 'row2', ...]
  2. From a CSV file. Most of the time we need to import data from a .csv file; for this we use the pd.read_csv() function: my_df = pd.read_csv("path/to/my_data.csv").
    1. Observation: if the first column of the .csv file holds the row labels, pass the index_col=0 argument to pd.read_csv()
2. Index and select data. There are mainly two ways to do this: square brackets, and the loc and iloc indexers.
  1. Square brackets [ ].
    1. Column access. my_df['col1'] returns the col1 column as a Pandas Series (a 1D labelled array), while my_df[['col1']] returns a DataFrame containing only col1. We can select several columns at once with my_df[['col1', 'col2', ...]], which returns a DataFrame with the selected columns.
    2. Row access. With square brackets we can only access rows by slicing. my_df[a:b] returns the rows at positions a up to, but not including, b.
    3. Observation: Square brackets have limited functionality.
  2. loc and iloc. These are DataFrame indexers for selecting data; they are used with square brackets, not called like methods.
    1. loc. It is label-based.
      1. Row access. my_df.loc['row_name'] returns the row as a Pandas Series, while my_df.loc[['row_name']] returns it as a DataFrame. We can access several rows with my_df.loc[['row1', 'row2', ...]].
      2. Row & Column loc. We can intersect rows and columns to get only those, with my_df.loc[['row1','row2',...],['col1','col2',...]].
        Observation. We can also use the slice ':' in place of ['row1', 'row2', ...] to select all rows, so we access whole columns with my_df.loc[:, ['col1', 'col2', ...]].
    2. iloc. It is integer-position-based. It works like loc, but instead of the row/column labels we use their positions 0, 1, 2, ... Note that slices in iloc exclude the end position, whereas label slices in loc include the end label.
    3. Observation: older Pandas versions offered ix to combine label-based selection the loc way with position-based selection the iloc way. ix was deprecated and then removed in Pandas 1.0, so use loc and iloc instead.
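
As a rough sketch of the creation and selection steps above, assuming a small hand-made dictionary (the country data and the row labels BR, RU, IN are placeholders chosen for the example):

```python
import pandas as pd

# Placeholder data for illustration; the area values are approximate.
my_dict = {
    "country": ["Brazil", "Russia", "India"],
    "capital": ["Brasilia", "Moscow", "New Delhi"],
    "area": [8.5, 17.1, 3.3],
}
my_df = pd.DataFrame(my_dict)      # keys become column labels
my_df.index = ["BR", "RU", "IN"]   # name the rows

# Square brackets
col_series = my_df["country"]      # single brackets -> Series
col_frame = my_df[["country"]]     # double brackets -> DataFrame
first_two = my_df[0:2]             # rows at positions 0 and 1

# loc: label-based
row_series = my_df.loc["RU"]                                   # Series
sub_by_label = my_df.loc[["BR", "IN"], ["country", "capital"]]
all_rows_some_cols = my_df.loc[:, ["capital"]]

# iloc: the same selection by integer position
sub_by_pos = my_df.iloc[[0, 2], [0, 1]]

print(sub_by_label)
```

Here sub_by_label and sub_by_pos select the same cells, once by label and once by position.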

ch2_Dictionaries&Pandas.pdf

Logic, Control Flow and Filtering

Boolean logic is the foundation of decision-making in your Python programs. Learn about different comparison operators, how you can combine them with boolean operators and how to use the boolean outcomes in control structures. You'll also learn to filter data from Pandas DataFrames using logic.

Comparison operators: they tell how Python values relate. The usual ones are <, <=, >, >=, ==, !=. Generally we should compare two objects of the same type (int and float are an exception). Recall that with a NumPy array x we could write x > 23: NumPy treats 23 as if it were an array filled with 23s (broadcasting) and compares it with x element-wise.
Boolean operators: these are and, or, not. When working with NumPy arrays, use np.logical_and(), np.logical_or() and np.logical_not() instead, so that the operation is performed element-wise.
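
A small sketch of element-wise comparison and the NumPy logical functions; the array x and its values are made up for the example.

```python
import numpy as np

x = np.array([21, 23, 25, 27])  # hypothetical values

# The scalar 23 is broadcast and compared with every element of x.
greater = x > 23

# Element-wise boolean logic needs the NumPy functions.
in_range = np.logical_and(x > 21, x < 27)
outside = np.logical_not(in_range)

# A boolean array can be used to select elements.
selected = x[in_range]

# Plain `and` does not work on arrays with more than one element.
ambiguous = False
try:
    (x > 21) and (x < 27)
except ValueError:
    ambiguous = True

print(greater, in_range, selected, ambiguous)
```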
if, elif, else: These statements are for control flow. After each colon, the expression goes on the next line, indented. elif and else are optional, and only the first branch whose condition is True is executed.
1. if condition : expression; elif condition : expression; else : expression.
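
Written out with indentation, the pattern looks like this. The function describe_area and its thresholds are hypothetical, chosen only to show the three branches.

```python
def describe_area(area):
    """Classify an area value (hypothetical thresholds, for illustration)."""
    if area > 10:
        return "large"
    elif area > 5:       # checked only if the first condition was False
        return "medium"
    else:                # runs when no condition above was True
        return "small"


print(describe_area(17.1), describe_area(8.5), describe_area(1.2))
```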
Filtering Pandas DataFrames: to filter a DataFrame we first need a boolean Pandas Series (not a DataFrame) that marks which rows to keep.
1. Ex.: To select the observations that meet a condition on a certain variable we need 3 steps: select the column associated with that variable as a Series, do the comparison on that column, and use the result to select the observations. brics["area"] (or brics.loc[:, "area"]); brics["area"] > 8; brics[brics["area"] > 8].
2. Ex. with boolean operators: As Pandas is built on NumPy, we can use NumPy's boolean functions and write brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]
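
The three filtering steps can be sketched as below. The brics DataFrame here is a stand-in built by hand with approximate area values, since the original data file is not part of these notes; the & variant at the end is an alternative spelling that Pandas also accepts, not something from the course notes.

```python
import numpy as np
import pandas as pd

# Stand-in for the brics data; area values are approximate placeholders.
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "area": [8.5, 17.1, 3.3, 9.6, 1.2],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

area = brics["area"]      # step 1: select the column as a Series
is_large = area > 8       # step 2: boolean Series
large = brics[is_large]   # step 3: keep the rows where it is True

# Combining two conditions element-wise.
mid = brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

# Equivalent spelling with the & operator (parentheses are required).
mid_alt = brics[(brics["area"] > 8) & (brics["area"] < 10)]

print(large)
print(mid)
```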

ch3_Logic_ControlFLow_Filtering.pdf

Loops

There are several techniques to repeatedly execute Python code. While loops are like repeated if statements; the for loop is there to iterate over all kinds of data structures.

While. 'while condition : expression'. The expression is executed repeatedly for as long as the condition is True, so it must eventually make the condition False.
For. 'for var in seq : expression', where seq may be any iterable object such as a list or a string (or range(100) if, for example, we want to iterate a hundred times). This way we only get the elements of seq; to also get the index of the element on each iteration we can use enumerate(), as in 'for index, var in enumerate(seq)', where index and var store the corresponding index and element of seq.
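
A short sketch of both loop types; the variable names and values (error, fam) are made up for the example.

```python
# while: repeat until the condition becomes False.
error = 50.0
steps = 0
while error > 1:
    error = error / 4   # the body must move the condition toward False
    steps += 1

# for with enumerate: get the index and the element together.
fam = [1.73, 1.68, 1.71]
labelled = []
for index, height in enumerate(fam):
    labelled.append("index " + str(index) + ": " + str(height))

# for over a string yields one character at a time.
letters = []
for ch in "abc":
    letters.append(ch)

# range(100) runs the body a hundred times (0 through 99).
total = 0
for i in range(100):
    total += i

print(steps, error, labelled, letters, total)
```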
Looping Data Structures
1. Dictionary. A dictionary has two parts per entry, the key and the value. Looping over it directly only yields the keys; to get both, use the .items() method: for key, value in my_dict.items(): expression, where key and value store each pair.
2. NumPy arrays. For 1D arrays we loop in the usual way: for val in my_1Darray: expression. However, if we have a 2D array like meas = np.array([np_height, np_weight]) and loop over it in the usual way, it iterates over each row, not each element; to iterate over every element we use the np.nditer() function: for val in np.nditer(my_2Darray): expression.
3. Pandas DataFrame. As we have rows and columns, to iterate over the rows we need the .iterrows() method of DataFrames: for lab, row in my_df.iterrows(): expression, where lab is the label of the current row and row is its content as a Pandas Series.
  1. Ex. to print out the BRIC countries' capitals we write for lab, row in brics.iterrows(): print(lab + ": " + row["capital"])
  2. Ex. (Inefficient way to add a new column to a df): for lab, row in brics.iterrows(): brics.loc[lab, "name_length"] = len(row["country"]). It is slow because a new Series is created on every iteration.
  3. Ex. (Efficient way to add a new column to a df): brics["name_length"] = brics["country"].apply(len). The .apply() method calls the function on each element of the "country" column, with no explicit loop. If instead of a function like len() we want to use a method (for ex. the upper() method of strings), we pass it through its type: cars["COUNTRY"] = cars["country"].apply(str.upper)
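
The looping patterns for the three data structures can be sketched together. All data below (world, meas, brics) are small hand-made placeholders, not the course datasets.

```python
import numpy as np
import pandas as pd

# Dictionary: .items() yields (key, value) pairs.
world = {"afghanistan": 30.55, "albania": 2.77}   # placeholder values
pairs = []
for key, value in world.items():
    pairs.append(key + " -- " + str(value))

# 2D NumPy array: a plain for loop yields rows, np.nditer yields elements.
meas = np.array([[1.73, 1.68], [65.4, 59.2]])
n_rows = 0
for row in meas:
    n_rows += 1
values = []
for val in np.nditer(meas):
    values.append(float(val))

# DataFrame: .iterrows() yields (label, row-as-Series) pairs.
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India"],
        "capital": ["Brasilia", "Moscow", "New Delhi"],
    },
    index=["BR", "RU", "IN"],
)
lines = []
for lab, row in brics.iterrows():
    lines.append(lab + ": " + row["capital"])

# Adding columns without an explicit loop.
brics["name_length"] = brics["country"].apply(len)
brics["COUNTRY"] = brics["country"].apply(str.upper)

print(pairs, n_rows, values, lines)
print(brics)
```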

ch4_Loops.pdf

Case Study: Hacker Statistics

This chapter blends together everything you've learned up to now. You will use hacker statistics to calculate your chances of winning a bet. Use random number generators, loops and matplotlib to get the competitive edge!

Random generators. We use the random subpackage of NumPy: np.random.seed(x) sets the seed to x (this ensures reproducibility), np.random.rand() returns a pseudo-random float in the interval [0, 1), and np.random.randint(a, b) returns a pseudo-random integer from a (inclusive) to b (exclusive).
Distribution. To obtain the distribution of an outcome through multiple simulations, we save the final result of each simulation and then plot the results on a histogram.
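
A minimal sketch of this idea, assuming a made-up experiment (counting tails in 10 coin tosses, repeated 1000 times); the seed value 123 and the game itself are illustrative choices. To keep the sketch free of plotting dependencies it counts the outcomes with np.bincount; passing final_tails to matplotlib's plt.hist() would draw the corresponding histogram.

```python
import numpy as np

# Same seed -> same sequence of pseudo-random numbers.
np.random.seed(123)
first_draw = np.random.rand()
np.random.seed(123)
second_draw = np.random.rand()   # identical to first_draw

# Simulate the experiment many times and keep each final result.
np.random.seed(123)
final_tails = []
for simulation in range(1000):
    tails = 0
    for toss in range(10):
        # randint(0, 2) returns 0 or 1; here 1 stands for tails.
        if np.random.randint(0, 2) == 1:
            tails += 1
    final_tails.append(tails)

# How often each outcome 0..10 occurred: the data behind the histogram.
counts = np.bincount(final_tails, minlength=11)
print(counts)
```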
