Reading tables is a pretty common thing to do and there are a number of ways to read tables besides writing a read function yourself. That’s not to say these are magic bullets. Every table is different and can have its own eccentricities. If you find yourself reading the same type of quirky file over and over again it could be worth your effort to write your own reader that does things just the way you like. That said, here are some other options.

numpy.loadtxt

numpy.loadtxt is a very simple reader. There are ways to make it do some advanced things like handle missing data or read non-numeric columns but they are all a bit tedious so this function is best used with well behaved tables.

numpy.loadtxt has a couple of useful keywords. Use the skiprows keyword to skip header lines. By default numpy.loadtxt will split columns on white space but you can specify other separators using the delimiter keyword. If you want select only certain columns from the table use the usecols keyword. That can be useful if you want to skip a text column.

Normally the data is returned as one large 2D array but setting unpack=True will return the the columns as individual arrays.

Examples

Reading a well formatted, white space delimited table into a single array:

 >>> np.loadtxt('data_table.txt', skiprows=1)
 array([[ 0.2536, 0.1008, 0.3857],
 [ 0.4839, 0.4536, 0.3561],
 [ 0.1292, 0.6875, 0.5929],
 [ 0.1781, 0.3049, 0.8928],
 [ 0.6253, 0.3486, 0.8791]])

Reading a well formatted, white space delimited table into a three arrays:

 >>> a,b,c = np.loadtxt('data_table.txt', skiprows=1, unpack=True)
 >>> a
 array([ 0.2536, 0.4839, 0.1292, 0.1781, 0.6253])
 >>> b
 array([ 0.1008, 0.4536, 0.6875, 0.3049, 0.3486])
 >>> c
 array([ 0.3857, 0.3561, 0.5929, 0.8928, 0.8791])

Reading a table with NAN values:

 >>> np.loadtxt('data_table2.txt', skiprows=1)
 array([[ 0.4839, 0.4536, 0.3561],
 [ 0.1292, 0.6875, nan],
 [ 0.1781, 0.3049, 0.8928],
 [ nan, 0.5801, 0.2038],
 [ 0.5993, 0.4357, 0.741 ]])

Reading a nicely formatted CSV file. Skip the first column since it contains strings:

 >>> np.loadtxt('exoplanetData_clean.csv', skiprows=1, delimiter=',', usecols=(1,2,3))
 array([[ 0.2 , 0.33 , 29.329 ],
 [ 9.1 , 1.29 , 60.3251],
 [ 17. , 0.96 , 143.213 ],
 [ 6.8 , 0.38 , 20.8613],
 [ 4.7 , 0.38 , 42.6318]])

numpy.genfromtxt

numpy.genfromtxt is a more robust version of loadtxt that can better handle missing data. It supports most of the same keywords as loadtxt except it has skip_header instead of skiprows.

Examples

numpy.genfromtxt will automatically convert character strings in numeric columns to nan:

 >>> np.genfromtxt('data_table3.txt', skip_header=1)
 array([[ 0.4839, 0.4536, 0.3561],
 [ 0.1292, 0.6875, nan],
 [ 0.1781, 0.3049, 0.8928],
 [ nan, 0.5801, 0.2038],
 [ 0.5993, 0.4357, 0.741 ]])

But you can also specify the missing value string and substitution values:

 >>> np.genfromtxt('data_table3.txt', skip_header=1, missing_values=('MISSING','MISSING','MISSING'), filling_values=(-999,-999,-999))
 array([[ 4.83900000e-01, 4.53600000e-01, 3.56100000e-01],
 [ 1.29200000e-01, 6.87500000e-01, -9.99000000e+02],
 [ 1.78100000e-01, 3.04900000e-01, 8.92800000e-01],
 [ -9.99000000e+02, 5.80100000e-01, 2.03800000e-01],
 [ 5.99300000e-01, 4.35700000e-01, 7.41000000e-01]])

When you set dtype=None numpy.genfromtxt will attempt to determine the type of each column, even detecting string columns. If you set names=True numpy.genfromtxt will take the column names from the header line:

 >>> np.genfromtxt('exoplanetData_clean.csv', names=True, delimiter=',', dtype=None)
 array([('Kepler-30 b', 0.2, 0.33, 29.329),
 ('Kepler-30 c', 9.1, 1.29, 60.3251),
 ('Kepler-30 d', 17.0, 0.96, 143.213),
 ('Kepler-31 b', 6.8, 0.38, 20.8613),
 ('Kepler-31 c', 4.7, 0.38, 42.6318)],
 dtype=[('Planet_Name', '|S11'), ('Pl_Mass', '<f8'), ('Pl_Radius', '<f8'), ('Pl_Period', '<f8')])

>>> a = np.genfromtxt('exoplanetData_clean.csv', names=True, delimiter=',', dtype=None)
 >>> a['Pl_Mass']
 array([ 0.2, 9.1, 17. , 6.8, 4.7])

Setting usemask=True will return a masked array where missing data is masked:

 >>> np.genfromtxt('exoplanetData.csv', delimiter=',', skip_header=1, dtype=None, usemask=True)
 masked_array(data = [('Kepler-32 b', 4.1, 0.37, 5.90124) ('Kepler-32 c', 0.5, 0.33, 8.7522)
 ('Kepler-33 b', --, 0.16, 5.66793) ('Kepler-33 c', --, 0.29, 13.17562)],
 mask = [(False, False, False, False) (False, False, False, False)
 (False, True, False, False) (False, True, False, False)],
 fill_value = ('N/A', 1e+20, 1e+20, 1e+20),
 dtype = [('f0', '|S11'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])

asciitable

asciitable is a third-party Python tool for reading text files. Installation instructions are on the asciitable website. asciitable can read and write text tables in a variety of formats.

One nice thing about asciitable is that it will try to guess the format of your table so you can type less when reading in most tables. For non-standard tables you can define specialized reader classes that will be used to interpret the table.

Examples

Reading files with no missing data:

 >>> asciitable.read('data_table.txt')
 rec.array([(0.2536, 0.1008, 0.3857), (0.4839, 0.4536, 0.3561),
 (0.1292, 0.6875, 0.5929), (0.1781, 0.3049, 0.8928),
 (0.6253, 0.3486, 0.8791)],
 dtype=[('Value1', '<f8'), ('Value2', '<f8'), ('Value3', '<f8')])

>>> asciitable.read('exoplanetData_clean.csv')
 rec.array([('Kepler-30 b', 0.2, 0.33, 29.329),
 ('Kepler-30 c', 9.1, 1.29, 60.3251),
 ('Kepler-30 d', 17.0, 0.96, 143.213),
 ('Kepler-31 b', 6.8, 0.38, 20.8613),
 ('Kepler-31 c', 4.7, 0.38, 42.6318)],
 dtype=[('Planet Name', '|S11'), ('Pl. Mass', '<f8'), ('Pl. Radius', '<f8'), ('Pl. Period', '<f8')])

Notice that it identified the string column without having to specify any special keywords.

When you specify the value of missing data using the fill_values keyword asciitable will by default return a masked array:

 >>> asciitable.read('data_table3.txt', fill_values=('MISSING',-999))
 masked_rec.array(data = [(0.4839, 0.4536, 0.3561) (0.1292, 0.6875, --) (0.1781, 0.3049, 0.8928)
 (--, 0.5801, 0.2038) (0.5993, 0.4357, 0.741)],
 mask = [(False, False, False) (False, False, True) (False, False, False)
 (True, False, False) (False, False, False)],
 fill_value = (1e+20, 1e+20, 1e+20),
 dtype = [('Value1', '<f8'), ('Value2', '<f8'), ('Value3', '<f8')])

>>> asciitable.read('exoplanetData.csv', fill_values=('', 'nan'))
 masked_rec.array(data = [('Kepler-32 b', 4.1, 0.37, 5.90124) ('Kepler-32 c', 0.5, 0.33, 8.7522)
 ('Kepler-33 b', --, 0.16, 5.66793) ('Kepler-33 c', --, 0.29, 13.17562)],
 mask = [(False, False, False, False) (False, False, False, False)
 (False, True, False, False) (False, True, False, False)],
 fill_value = ('N/A', 1e+20, 1e+20, 1e+20),
 dtype = [('Planet Name', '|S11'), ('Pl. Mass', '<f8'), ('Pl. Radius', '<f8'), ('Pl. Period', '<f8')])

ATpy

If you want to get really fancy there’s atpy. atpy can read a huge number of table formats assuming you have the correct reader installed. For example, reading fits tables requires pyfits and reading basic text tables requires asciitable. What atpy gives you is a powerful Table class for working with your data.

Read more about atpy and asciitable at this AstroBetter blog post: http://www.astrobetter.com/atpy-and-asciitable/. atpy and asciitable can also be used to write table data files.

Examples

 >>> t = atpy.Table('data_table.txt', type='ascii')
 >>> t.data
 array([(0.2536, 0.1008, 0.3857), (0.4839, 0.4536, 0.3561),
 (0.1292, 0.6875, 0.5929), (0.1781, 0.3049, 0.8928),
 (0.6253, 0.3486, 0.8791)],
 dtype=[('Value1', '<f8'), ('Value2', '<f8'), ('Value3', '<f8')])
 >>> t.columns
 {
 Value1 : type=float64,
 Value2 : type=float64,
 Value3 : type=float64
 }
 >>> t.Value1
 array([ 0.2536, 0.4839, 0.1292, 0.1781, 0.6253])
 >>> t.row(0)
 (0.2536, 0.1008, 0.3857)

The Table class even has a cool where() method:

 >>> t = atpy.Table('exoplanetData_clean.csv', type='ascii')
 >>> t.data
 array([('Kepler-30 b', 0.2, 0.33, 29.329),
 ('Kepler-30 c', 9.1, 1.29, 60.3251),
 ('Kepler-30 d', 17.0, 0.96, 143.213),
 ('Kepler-31 b', 6.8, 0.38, 20.8613),
 ('Kepler-31 c', 4.7, 0.38, 42.6318)],
 dtype=[('Planet Name', '|S11'), ('Pl. Mass', '<f8'), ('Pl. Radius', '<f8'), ('Pl. Period', '<f8')])
 >>> t_new = t.where((t['Pl. Mass'] > 5) & (t['Pl. Period'] < 100))
 >>> t_new.data
 array([('Kepler-30 c', 9.1, 1.29, 60.3251),
 ('Kepler-31 b', 6.8, 0.38, 20.8613)],
 dtype=[('Planet Name', '|S11'), ('Pl. Mass', '<f8'), ('Pl. Radius', '<f8'), ('Pl. Period', '<f8')])

When reading text tables with atpy you can pass the same arguments you would have passed to asciitable, as these are all just passed onto asciitable.read() under the hood:

 >>> t = atpy.Table('exoplanetData.csv', type='ascii', fill_values=('', 'nan'))
 >>> t.data
 array([('Kepler-32 b', 4.1, 0.37, 5.90124),
 ('Kepler-32 c', 0.5, 0.33, 8.7522),
 ('Kepler-33 b', nan, 0.16, 5.66793),
 ('Kepler-33 c', nan, 0.29, 13.17562)],
 dtype=[('Planet Name', '|S11'), ('Pl. Mass', '<f8'), ('Pl. Radius', '<f8'), ('Pl. Period', '<f8')])

Notice that the data are not a masked array. Be aware that comparisons to nan always return False so if you do a where on a column containing nan values those rows will never be part of your result.

Data Files

Here are the data files used in the examples above.

data_table.txt:

Value1  Value2  Value3
0.2536  0.1008  0.3857
0.4839  0.4536  0.3561
0.1292  0.6875  0.5929
0.1781  0.3049  0.8928
0.6253  0.3486  0.8791

data_table2.txt:

Value1  Value2  Value3
0.4839  0.4536  0.3561
0.1292  0.6875  NAN
0.1781  0.3049  0.8928
NAN     0.5801  0.2038
0.5993  0.4357  0.7410

data_table3.txt:

Value1  Value2  Value3
0.4839  0.4536  0.3561
0.1292  0.6875  MISSING
0.1781  0.3049  0.8928
MISSING 0.5801  0.2038
0.5993  0.4357  0.7410

exoplanetData_clean.csv:

Planet Name,Pl. Mass,Pl. Radius,Pl. Period
Kepler-30 b,0.2,0.33,29.329
Kepler-30 c,9.1,1.29,60.3251
Kepler-30 d,17,0.96,143.213
Kepler-31 b,6.8,0.38,20.8613
Kepler-31 c,4.7,0.38,42.6318

exoplanetData.csv (contains missing data):

Planet Name,Pl. Mass,Pl. Radius,Pl. Period
Kepler-32 b,4.1,0.37,5.90124
Kepler-32 c,0.5,0.33,8.7522
Kepler-33 b,,0.16,5.66793
Kepler-33 c,,0.29,13.17562

12 thoughts on “Reading Text Tables with Python”

Phil Rosenfield says:

So, which is the fastest for large data? Making my own table classes has been faster than loadtxt and genfromtxt.

March 9, 2012 at 8:49 pm Reply
1. jiffyclub says:
  
  I did some quick tests on a 100000 line CSV file. genfromtxt was faster than loadtxt.
  
  asciitable could be faster than genfromtxt but only if you specify things like column names and the delimiter so that it doesn’t do any guessing. Fiddling with more advanced options for asciitable, like fully defining the reader class, might speed it up further.
  
  When you really need to optimize it’s probably best to a) make sure your table is well behaved and b) write your own reader.
  
  March 10, 2012 at 1:21 am Reply
  1. jiffyclub says:
    
    Actually if you *really* need to optimize text tables are probably not the right tool for you. You’d want to use HDF5 with pytables or something like that.
    
    March 17, 2012 at 10:30 am
Guillaume says:

‘Pandas’ (http://pandas.pydata.org/) is also a nice python library to handle tables, with great I/O functions.

March 9, 2012 at 11:18 pm Reply
IDL’s GAUSSFIT in Python | Pen and Pants says:

[…] Post navigation ← Previous […]

March 16, 2012 at 11:32 pm Reply
How to read data from text file » SaiProg says:

[…] Other way to open text data is to use the Numpy function genfromtxt which handles better missing data and can read at the same time numeric and string columns. Read more about reading text file in https://penandpants.com/2012/03/09/reading-text-tables-with-python/. […]

July 28, 2012 at 5:59 pm Reply
Como ler dados de arquivo de texto » SaiProg says:

[…] Leia mais sobre como ler arquivos de texto em https://penandpants.com/2012/03/09/reading-text-tables-with-python/. […]

July 29, 2012 at 12:15 am Reply
Marcel Haas says:

I tried reading in huge files (several GB, I know they shouldn’t be ascii files, but I didn’t make them, but do need them…) with loadtxt and genfromtxt. I found indeed that genfromtxt is indeed a factor ~1.5 faster, but it does use double the memory (note that I am sing a small subset of columns of the total table in both cases, using the usecols keywords). Is that behavior expected? If memory becomes an issue, would loadtxt be better?

January 8, 2013 at 4:50 pm Reply
1. jiffyclub says:
  
  I haven’t played around with them enough to know much about their memory usage, but I can say that writing your own custom readers would most likely give you the fastest, most efficient results.
  
  January 10, 2013 at 8:43 pm Reply
Eli Bressert (@astrobiased) says:

If you use Pandas read_tables with chunking enabled you can load ASCII data fast while saving it to another format (e.g., HDF5). This helps you get around the memory issue when dealing with GBs of ASCII files especially. I did this for a 2 TB ASCII, while converting it to an SQL database and it worked really well.

January 20, 2013 at 3:56 am Reply
平静 (@0Heisei) says:

This was hugely useful! I was trying to get 3 separate arrays from as textfile and I did not know how to do it.

March 25, 2014 at 8:24 am Reply
[Python]csv読み込み備忘録 | Momentum says:

[…] Reading Text Tables with Python | Pen and Pants https://penandpants.com/2012/03/09/reading-text-tables-with-python/ […]

November 23, 2014 at 11:02 pm Reply

Pen and Pants

Don't Leave Home Without Them

Reading Text Tables with Python

numpy.loadtxt

Examples

numpy.genfromtxt

Examples

asciitable

Examples

ATpy

Examples

Data Files

12 thoughts on “Reading Text Tables with Python”

Leave a comment Cancel reply

numpy.loadtxt

Examples

numpy.genfromtxt

Examples

asciitable

Examples

ATpy

Examples

Data Files

Share this:

Related

12 thoughts on “Reading Text Tables with Python”

Leave a comment Cancel reply