• ANALITICAS
  • Posts
  • Import data from an Excel file (.xlsx) with Python and pandas

Import data from an Excel file (.xlsx) with Python and pandas

Table of Contents

Python and the pandas library to analyse data has one basic requirement: the data! Where do we get it from? Where do we import it? What do we do with it?

The data we will import into our script be stored in a dataframe (often abbreviated as df).

Importing the file

The .xlsx extension indicates files in the standard Microsoft Excel format. Pandas has a suitable command to import the data from .

df = pd.read_excel('file.xlsx')

If you use a pandas version lower than 1.2, you'll need to specify the engine argument to read an .xlsx file.

The openpyxl package needs to be downloaded and installed separately. Alternatively, you can update pandas to version 1.2 or higher.

Here's how to read an .xlsx file with a pandas version lower than 1.2:

df = pd.read_excel('file.xlsx', engine='openpyxl')

Specify the path to the file

In the example above the file to import is in the location as our Python script. If this is not the case, we need to specify the file path:

df = pd.read_excel(r'C:\User\path\file.xlsx')

Specify which sheet to import

The read_excel() command always reads the data from the first sheet of the Excel file. If we want to choose another sheet, you need to leverage the sheet_name parameter.

Additionally, it's possible to select the sheet either by its name or by its position within the Excel file.

# select the sheet based on its name

df = pd.read_excel(r'C:\User\path\file.xlsx', sheet_name='sheet1')

df = pd.read_excel(r'C:\User\path\file.xlsx', sheet_name='sheet2')


# select the sheet based on its position

df = pd.read_excel(r'C:\User\path\file.xlsx', sheet_name=0)

df = pd.read_excel(r'C:\User\path\file.xlsx', sheet_name=1)

In Python, counting starts from zero. Therefore, when we assign the value 0 to the sheet_name argument, we are referring to the first sheet.

When the sheet_name argument is not specified, its default value corresponds to zero.

This means that the command will read the first sheet of the Excel file, whatever it may be.

Specify the header row

Sometimes you may want to specify which row of the Excel sheet is the header of our dataset.

In this case, we just need to include the header argument:

df = pd.read_excel(r'C:\User\path\file.xlsx', header=1)

In this case, we are indicating that the second row of the Excel sheet (header = 1) is the header.

Specify the index column

Pandas automatically adds an index column to our dataframe.

If we want to assign the index to a specific column, we need to use the index_col argument.

# assign the index based on the column position

df = pd.read_excel(r'C:\User\path\file.xlsx', index_col=0)



# assign the index based on the column name

df = pd.read_excel(r'C:\User\path\file.xlsx', index_col="colonna-1")

The default value of the index_col argument is "None". This is why pandas creates an additional one.

Rename the columns

While importing data from an Excel file, you can also rename the columns. Just specify the names argument.

df = pd.read_excel(r'\User\path\file.xlsx', names=['ID', 'city', 'email'])

Exclude the rows at the beginning of the file

If your .xlsx file contains rows that are not part of the dataset. If these are just above the data (i.e. some notes), you can leverage the skiprows argument:

df = pd.read_excel(r'\User\path\file.xlsx', skiprows=4)

In this case, we will import the data from our Excel file excluding the first 4 rows of the spreadsheet.

The default value of the skiprows argument is equal to zero. This means that, unless we specify another value, no row will be excluded.

Exclude rows at the end of the file

Similar to the previous example, if we want to exclude from the import some rows at the end of the data, we can use the skipfooter argument:

df = pd.read_excel(r'\User\path\file.xlsx', skipfooter=2)

In this example, we imported our data excluding the last two rows of the dataset.

The default value of the skipfooter argument is equal to zero, therefore, unless we change it, no row will be excluded from the import.

Convert integers to decimals

The convert_float argument is a bit more niche compared to those seen so far. It is used to convert integers to decimals while importing data from an Excel file.

This can happen because all numbers, regardless of whether or not they have values after the decimal point, are classified as float by Excel (float is the data type corresponding to numbers with a decimal point). So the number 2 actually corresponds to 2.0.

When we import data through the pandas library, Python automatically converts all numbers without values after the decimal point to integers.

To prevent this from happening, and to preserve the float type, we need to change the default value of the convert_float argument:

dati = pd.read_excel(r'\User\path\file.xlsx', sheet_name=0, convert_float=False)

What if you need to import data from a .csv file? We’ve got you covered. Here’s a quick lesson about that.

Don’t miss out the next issue of the ANALITICAS newsletter! Subscribe now: