Create DataFrame from string table
Table of Contents
You can find the code for the read_table function mentioned in this note in
subdirectory 20230318.
The really handy Pandas function read_fwf creates a DataFrame from a
fixed-width text table, for example:
import io
import pandas as pd
table = """
FirstName LastName FirstAppearance
Donald Duck 1934
Mickey Mouse 1928
Goofy 1932
"""
# read_fwf expects a filename, either as a string, a path object, or as the
# next line shows, a stream/file-like object
df = pd.read_fwf(io.StringIO(table))
print(df)
FirstName LastName FirstAppearance
0 Donald Duck 1934
1 Mickey Mouse 1928
2 Goofy NaN 1932
I mostly use it in unit tests, when I need to set up DataFrame instances.
They’re easier to create using read_fwf and the resulting code is more
readable, more communicative.
However, read_fwf doesn’t handle column names that contain spaces:
table = """
FirstName LastName First appearance
Donald Duck 1934
Mickey Mouse 1928
Goofy 1932
"""
df = pd.read_fwf(io.StringIO(table))
print(df)
FirstName LastName First appearance
0 Donald Duck 1934 NaN
1 Mickey Mouse 1928 NaN
2 Goofy NaN 1932 NaN
Column “First appearance” is interpreted as two separate columns, which is not
what you want. Enclosing the column name in single quotes, as a hint to
read_fwf, doesn’t help.
This lead me to write function read_table that wraps read_fwf to allow you
to specify multi-word column names:
table = """
FirstName LastName First appearance
Donald Duck 1934
Mickey Mouse 1928
Goofy 1932
"""
df = read_table(table, columns=["FirstName", "LastName", "First appearance"])
print(df)
FirstName LastName First appearance
0 Donald Duck 1934
1 Mickey Mouse 1928
2 Goofy NaN 1932
read_table only accepts a string whereas read_fwf accepts a filename, path
object or stream/file-like object. As mentioned, my primary use case for
read_table is to setup a DataFrame from hard-coded data.
read_fwf doesn’t like it when you have a column whose values have spaces:
table = """
Name FirstAppearance
Donald Duck 1934
Mickey Mouse 1928
Goofy 1932
"""
stream = io.StringIO(table)
df = pd.read_fwf(stream)
print(df)
Name Unnamed: 1 FirstAppearance
0 Donald Duck 1934
1 Mickey Mouse 1928
2 Goofy NaN 1932
read_table has your back here too if you specify the column names:
table = """
Name FirstAppearance
Donald Duck 1934
Mickey Mouse 1928
Goofy 1932
"""
df = read_table(table, columns=["Name", "FirstAppearance"])
print(df)
Name FirstAppearance
0 Donald Duck 1934
1 Mickey Mouse 1928
2 Goofy 1932
read_table wraps read_fwf and if you pass in a keyword argument that
read_table doesn’t have in its signature, it passes it on to read_fwf. This
allows read_table to support the use of a column to specify an index:
table = """
FirstName LastName FirstAppearance
0 Donald Duck 1934
2 Mickey Mouse 1928
3 Goofy 1932
"""
df = read_table(table, index_col=0)
print(df)
print(df.index)
FirstName LastName FirstAppearance
0 Donald Duck 1934
2 Mickey Mouse 1928
3 Goofy NaN 1932
Int64Index([0, 2, 3], dtype='int64')
Here, read_table passes index_col=0 to read_fwf. Because of that,
read_fwf lets the first column provide the index.
Closing thoughts
I realized that having to specify all columns names is not ideal. In one of the
examples above, only column name “First Appearance” has a space but read_table
requires you to specify the other column names too. This is something to address
in a newer version.
Furthermore, in the example with multi-word column values, you had to specify
the column names to import the table correctly. There’s no real need for that as
read_table can determine the column names automatically.