Importing data with NumPy genfromtxt
NumPy provides several functions to create arrays from tabular data. We focus here on the genfromtxt function.
In a nutshell, genfromtxt runs two main loops. The first loop converts each line of the file into a sequence of strings. The second loop converts each string to the appropriate data type. This mechanism is slower than a single loop, but gives more flexibility. In particular, genfromtxt is able to take missing data into account, whereas faster and simpler functions like loadtxt cannot.
Note
When giving examples, we will use the following conventions:
>>> import numpy as np
>>> from io import StringIO
Defining the input
The only mandatory argument of genfromtxt is the source of the data. It can be a string, a list of strings, a generator or an open file-like object with a read method, for example, a file or io.StringIO object. If a single string is provided, it is assumed to be the name of a local or remote file. If a list of strings or a generator returning strings is provided, each string is treated as one line in a file. When the URL of a remote file is passed, the file is automatically downloaded to the current directory and opened.
Recognized file types are text files and archives. Currently, the function recognizes gzip and bz2 (bzip2) archives. The type of the archive is determined from the extension of the file: if the filename ends with ‘.gz’, a gzip archive is expected; if it ends with ‘bz2’, a bzip2 archive is assumed.
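As a minimal sketch of the in-memory sources described above, the same kind of table can be handed to genfromtxt as a list of strings, a generator, or a StringIO buffer. The variable names and the numbered_rows helper are made up for illustration.

```python
import numpy as np
from io import StringIO

# A list of strings: each item is treated as one line of a file.
lines = ["1 2 3", "4 5 6"]
from_list = np.genfromtxt(lines)


def numbered_rows(count):
    """Hypothetical generator yielding one text line per row."""
    for i in range(count):
        yield f"{i} {i * i}"


# A generator returning strings is consumed line by line.
from_generator = np.genfromtxt(numbered_rows(3))

# A file-like object with a read method, here an in-memory text buffer.
from_buffer = np.genfromtxt(StringIO("1 2 3\n4 5 6"))
```

All three calls rely on the default whitespace splitting and the default dtype=float, so each returns a 2D float array.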
Splitting the lines into columns
The delimiter argument
Once the file is defined and open for reading, genfromtxt splits each non-empty line into a sequence of strings. Empty or commented lines are just skipped. The delimiter keyword is used to define how the splitting should take place.
Quite often, a single character marks the separation between columns. For example, comma-separated files (CSV) use a comma (,) or a semicolon (;) as delimiter:
>>> data = u"1, 2, 3\n4, 5, 6"
>>> np.genfromtxt(StringIO(data), delimiter=",")
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])
Another common separator is "\t", the tabulation character. However, we are not limited to a single character: any string will do. By default, genfromtxt assumes delimiter=None, meaning that the line is split along white spaces (including tabs) and that consecutive white spaces are considered as a single white space.
Alternatively, we may be dealing with a fixed-width file, where columns are defined as a given number of characters. In that case, we need to set delimiter to a single integer (if all the columns have the same size) or to a sequence of integers (if columns can have different sizes):
>>> data = u"  1  2  3\n  4  5 67\n890123  4"
>>> np.genfromtxt(StringIO(data), delimiter=3)
array([[   1.,    2.,    3.],
       [   4.,    5.,   67.],
       [ 890.,  123.,    4.]])
>>> data = u"123456789\n   4  7 9\n   4567 9"
>>> np.genfromtxt(StringIO(data), delimiter=(4, 3, 2))
array([[ 1234.,   567.,    89.],
       [    4.,     7.,     9.],
       [    4.,   567.,     9.]])
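The character-based delimiters mentioned earlier in this section can be compared side by side. The sketch below uses made-up sample strings to show a tab delimiter, a multi-character string delimiter, and the default whitespace splitting.

```python
import numpy as np
from io import StringIO

# Tab-separated values.
tab_text = "1\t2\t3\n4\t5\t6"
by_tab = np.genfromtxt(StringIO(tab_text), delimiter="\t")

# A delimiter longer than one character ("::" is an arbitrary choice).
multi_text = "1::2::3\n4::5::6"
by_string = np.genfromtxt(StringIO(multi_text), delimiter="::")

# delimiter=None (the default): runs of spaces and tabs count as one separator.
ws_text = "1   2\t3\n4 5    6"
by_whitespace = np.genfromtxt(StringIO(ws_text))
```

Each call should produce the same 2x3 float array, since only the separator differs between the three inputs.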
The autostrip argument
By default, when a line is decomposed into a series of strings, the individual entries are not stripped of leading or trailing white spaces. This behavior can be overridden by setting the optional argument autostrip to True:
>>> data = u"1, abc , 2\n 3, xxx, 4"
>>> # Without autostrip
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5")
array([['1', ' abc ', ' 2'],
       ['3', ' xxx', ' 4']], dtype='<U5')
>>> # With autostrip
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5", autostrip=True)
array([['1', 'abc', '2'],
       ['3', 'xxx', '4']], dtype='<U5')
The comments argument
The optional argument comments is used to define a character string that marks the beginning of a comment. By default, genfromtxt assumes comments='#'. The comment marker may occur anywhere on the line. Any character present after the comment marker(s) is simply ignored:
>>> data = u"""#
... # Skip me !
... # Skip me too !
... 1, 2
... 3, 4
... 5, 6 #This is the third line of the data
... 7, 8
... # And here comes the last line
... 9, 0
... """
>>> np.genfromtxt(StringIO(data), comments="#", delimiter=",")
array([[1., 2.],
       [3., 4.],
       [5., 6.],
       [7., 8.],
       [9., 0.]])
New in version 1.7.0: When comments is set to None, no lines are treated as comments.
Note
There is one notable exception to this behavior: if the optional argument names=True, the first commented line will be examined for names.
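Files produced by other tools may use a different comment marker. As a small sketch, assuming a made-up sample in which "%" starts a comment, the comments argument can be pointed at that character instead of the default "#":

```python
import numpy as np
from io import StringIO

# Hypothetical sample using "%" rather than "#" as the comment marker.
text = "% units: metres\n1 2 % first reading\n3 4"

# Whole-line comments are skipped; trailing comments are cut off.
readings = np.genfromtxt(StringIO(text), comments="%")
```

Here the first line is dropped entirely and the text after "%" on the second line is ignored, leaving two rows of two values.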
Skipping lines and choosing columns
The skip_header and skip_footer arguments
The presence of a header in the file can hinder data processing. In that case, we need to use the skip_header optional argument. The value of this argument must be an integer which corresponds to the number of lines to skip at the beginning of the file, before any other action is performed. Similarly, we can skip the last n lines of the file by using the skip_footer argument and giving it a value of n:
>>> data = u"\n".join(str(i) for i in range(10))
>>> np.genfromtxt(StringIO(data),)
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
>>> np.genfromtxt(StringIO(data),
...               skip_header=3, skip_footer=5)
array([ 3.,  4.])
By default, skip_header=0 and skip_footer=0, meaning that no lines are skipped.
The usecols argument
In some cases, we are not interested in all the columns of the data but only a few of them. We can select which columns to import with the usecols argument. This argument accepts a single integer or a sequence of integers corresponding to the indices of the columns to import. Remember that by convention, the first column has an index of 0. Negative integers behave the same as regular Python negative indexes.
For example, if we want to import only the first and the last columns, we can use usecols=(0, -1):
>>> data = u"1 2 3\n4 5 6"
>>> np.genfromtxt(StringIO(data), usecols=(0, -1))
array([[ 1.,  3.],
       [ 4.,  6.]])
If the columns have names, we can also select which columns to import by giving their name to the usecols argument, either as a sequence of strings or a comma-separated string:
>>> data = u"1 2 3\n4 5 6"
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a", "c"))
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a, c"))
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
Choosing the data type
The main way to control how the sequences of strings we have read from the file are converted to other types is to set the dtype argument. Acceptable values for this argument are:
• a single type, such as dtype=float. The output will be 2D with the given dtype, unless a name has been associated with each column with the use of the names argument (see below). Note that dtype=float is the default for genfromtxt.
• a sequence of types, such as dtype=(int, float, float).
• a comma-separated string, such as dtype="i4,f8,|U3".
• a dictionary with two keys 'names' and 'formats'.
• a sequence of tuples (name, type), such as dtype=[('A', int), ('B', float)].
• an existing numpy.dtype object.
• the special value None. In that case, the type of the columns will be determined from the data itself (see below).
In all the cases but the first one, the output will be a 1D array with a structured dtype. This dtype has as many fields as items in the sequence. The field names are defined with the names keyword.
When dtype=None, the type of each column is determined iteratively from its data. We start by checking whether a string can be converted to a boolean (that is, if the string matches true or false, regardless of case); then whether it can be converted to an integer, then to a float, then to a complex and finally to a string. This behavior may be changed by modifying the default mapper of the StringConverter class.
The option dtype=None is provided for convenience. However, it is significantly slower than setting the dtype explicitly.
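To make the difference between these dtype forms concrete, here is a small sketch on made-up data. The encoding argument is passed explicitly only so that the example behaves the same across NumPy versions; the variable names are illustrative.

```python
import numpy as np
from io import StringIO

text = "1 2.5 3\n4 5.5 6"

# Single type: a plain 2D array of floats.
plain = np.genfromtxt(StringIO(text), dtype=float)

# Sequence of types: a 1D structured array with default field names.
structured = np.genfromtxt(StringIO(text), dtype=(int, float, int))

# Dictionary with 'names' and 'formats' keys: the field names are given explicitly.
named = np.genfromtxt(
    StringIO(text),
    dtype={"names": ["a", "b", "c"], "formats": [int, float, int]},
)

# dtype=None: the type of each column is inferred from the data.
guessed = np.genfromtxt(StringIO(text), dtype=None, encoding="utf-8")
```

In this sketch plain has shape (2, 3), while the other three results are 1D structured arrays with one field per column.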
Setting the names
The names argument
A natural approach when dealing with tabular data is to allocate a name to each column. A first possibility is to use an explicit structured dtype, as mentioned previously:
>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=[(_, int) for _ in "abc"])
array([(1, 2, 3), (4, 5, 6)],
      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
Another simpler possibility is to use the names keyword with a sequence of strings or a comma-separated string:
>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, names="A, B, C")
array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
      dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
In the example above, we used the fact that by default, dtype=float. By giving a sequence of names, we are forcing the output to a structured dtype.
We may sometimes need to define the column names from the data itself. In that case, we must use the names keyword with a value of True. The names will then be read from the first line (after the skip_header ones), even if the line is commented out:
>>> data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6")
>>> np.genfromtxt(data, skip_header=1, names=True)
array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
      dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])
The default value of names is None. If we give any other value to the keyword, the new names will overwrite the field names we may have defined with the dtype:
>>> data = StringIO("1 2 3\n 4 5 6")
>>> ndtype = [('a', int), ('b', float), ('c', int)]
>>> names = ["A", "B", "C"]
>>> np.genfromtxt(data, names=names, dtype=ndtype)
array([(1, 2.0, 3), (4, 5.0, 6)],
      dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')])
The defaultfmt argument
If names=None but a structured dtype is expected, names are defined with the standard NumPy default of "f%i", yielding names like f0, f1 and so forth:
>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=(int, float, int))
array([(1, 2.0, 3), (4, 5.0, 6)],
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])
In the same way, if we don't give enough names to match the length of the dtype, the missing names will be defined with this default template:
>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=(int, float, int), names="a")
array([(1, 2.0, 3), (4, 5.0, 6)],
      dtype=[('a', '<i8'), ('f0', '<f8'), ('f1', '<i8')])
We can override this default with the defaultfmt argument, which takes any format string:
>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i")
array([(1, 2.0, 3), (4, 5.0, 6)],
      dtype=[('var_00', '<i8'), ('var_01', '<f8'), ('var_02', '<i8')])
Note
We need to keep in mind that defaultfmt is used only if some names are expected but not defined.
Validating names
NumPy arrays with a structured dtype can also be viewed as recarray, where a field can be accessed as if it were an attribute. For that reason, we may need to make sure that the field name doesn’t contain any space or invalid character, or that it does not correspond to the name of a standard attribute (like size or shape), which would confuse the interpreter. genfromtxt accepts three optional arguments that provide a finer control on the names:
deletechars
Gives a string combining all the characters that must be deleted from the name. By default, invalid characters are ~!@#$%^&*()-=+~\|]}[{';: /?.>,<.
excludelist
Gives a list of the names to exclude, such as return, file, print… If one of the input names is part of this list, an underscore character ('_') will be appended to it.
case_sensitive
Whether the names should be case-sensitive (case_sensitive=True), converted to upper case (case_sensitive=False or case_sensitive='upper') or to lower case (case_sensitive='lower').
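The effect of these three arguments can be seen on a header containing awkward names. The sketch below uses made-up column names and adds "size" to excludelist as an illustrative choice; it relies on the default deletechars to drop the hyphen.

```python
import numpy as np
from io import StringIO

# Hypothetical header: a reserved word, an attribute name, and a hyphenated name.
text = "Return Size my-col\n1 2 3\n4 5 6"

table = np.genfromtxt(
    StringIO(text),
    names=True,              # read the field names from the first line
    case_sensitive="lower",  # convert names to lower case
    excludelist=["size"],    # extra name to exclude, on top of the defaults
)

# "return" and "size" get a trailing underscore; "-" is deleted from "my-col".
field_names = table.dtype.names
```

With these settings the field names should come out as return_, size_ and mycol, all of which are safe to use as recarray attributes.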
Tweaking the conversion
The converters argument
Usually, defining a dtype is sufficient to define how the sequence of strings must be converted. However, some additional control may sometimes be required. For example, we may want to make sure that a date in a format YYYY/MM/DD is converted to a datetime object, or that a string like xx% is properly converted to a float between 0 and 1. In such cases, we should define conversion functions with the converters argument.
The value of this argument is typically a dictionary with column indices or column names as keys and conversion functions as values. These conversion functions can either be actual functions or lambda functions. In any case, they should accept only a string as input and output only a single element of the wanted type.
In the following example, the second column is converted from a string representing a percentage to a float between 0 and 1:
>>> # the field arrives as bytes or str depending on the NumPy version and encoding
>>> convertfunc = lambda x: float(x.strip(b"%" if isinstance(x, bytes) else "%"))/100.
>>> data = u"1, 2.3%, 45.\n6, 78.9%, 0"
>>> names = ("i", "p", "n")
>>> # General case .....
>>> np.genfromtxt(StringIO(data), delimiter=",", names=names)
array([(1., nan, 45.), (6., nan, 0.)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])
We need to keep in mind that by default, dtype=float. A float is therefore expected for the second column. However, the strings ' 2.3%' and ' 78.9%' cannot be converted to float and we end up having np.nan instead. Let's now use a converter:
>>> # Converted case ...
>>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
...               converters={1: convertfunc})
array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])
The same results can be obtained by using the name of the second column ("p") as key instead of its index (1):
>>> # Using a name for the converter ...
>>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
...               converters={"p": convertfunc})
array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])
Converters can also be used to provide a default for missing entries. In the following example, the converter convert transforms a stripped string into the corresponding float or into -999 if the string is empty. We need to explicitly strip the string from white spaces as it is not done by default:
>>> data = u"1, , 3\n 4, 5, 6"
>>> convert = lambda x: float(x.strip() or -999)
>>> np.genfromtxt(StringIO(data), delimiter=",",
...               converters={1: convert})
array([[   1., -999.,    3.],
       [   4.,    5.,    6.]])
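A converter can also be an ordinary named function, which is easier to document and test than a lambda. The sketch below assumes a made-up "minutes:seconds" column and a hypothetical to_seconds helper; encoding is passed explicitly so that the converter receives str, and the bytes check is only a safeguard for setups where it does not.

```python
import numpy as np
from io import StringIO


def to_seconds(field):
    """Hypothetical converter: turn 'M:SS' into a number of seconds (float)."""
    if isinstance(field, bytes):  # safeguard if the field arrives as bytes
        field = field.decode("utf-8")
    minutes, seconds = field.strip().split(":")
    return 60.0 * int(minutes) + float(seconds)


text = "1, 1:30\n2, 0:45"
laps = np.genfromtxt(
    StringIO(text),
    delimiter=",",
    encoding="utf-8",
    converters={1: to_seconds},  # apply the converter to the second column
)
```

Because the converter returns a float, the result stays a regular 2D float array, with the second column expressed in seconds.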
Using missing and filling values
Some entries may be missing in the dataset we are trying to import. In a previous example, we used a converter to transform an empty string into a float. However, user-defined converters may rapidly become cumbersome to manage.
The genfromtxt function provides two other complementary mechanisms: the missing_values argument is used to recognize missing data and a second argument, filling_values, is used to process these missing data.
missing_values
By default, any empty string is marked as missing. We can also consider more complex strings, such as "N/A" or "???" to represent missing or invalid data. The missing_values argument accepts three kinds of values:
a string or a comma-separated string
This string will be used as the marker for missing data for all the columns
a sequence of strings
In that case, each item is associated to a column, in order.
a dictionary
Values of the dictionary are strings or sequence of strings. The corresponding keys can be column indices (integers) or column names (strings). In addition, the special key None can be used to define a default applicable to all columns.
filling_values
We know how to recognize missing data, but we still need to provide a value for these missing entries. By default, this value is determined from the expected dtype according to this table:
Expected type    Default
bool             False
int              -1
float            np.nan
complex          np.nan+0j
string           '???'
We can get finer control over the conversion of missing values with the filling_values optional argument. Like missing_values, this argument accepts different kinds of values:
a single value
This will be the default for all columns
a sequence of values
Each entry will be the default for the corresponding column
a dictionary
Each key can be a column index or a column name, and the corresponding value should be a single object. We can use the special key None to define a default for all columns.
In the following example, we suppose that the missing values are flagged with "N/A" in the first column and by "???" in the third column. We wish to transform these missing values to 0 if they occur in the first and second column, and to -999 if they occur in the last column:
>>> data = u"N/A, 2, 3\n4, ,???"
>>> kwargs = dict(delimiter=",",
...               dtype=int,
...               names="a,b,c",
...               missing_values={0:"N/A", 'b':" ", 2:"???"},
...               filling_values={0:0, 'b':0, 2:-999})
>>> np.genfromtxt(StringIO(data), **kwargs)
array([(0, 2, 3), (4, 0, -999)],
      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
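The simpler forms of filling_values, a single value or a sequence of values, can be sketched in the same spirit. The data below is made up, and the missing entries are plain empty fields, which are treated as missing by default.

```python
import numpy as np
from io import StringIO

text = "1,,3\n4,5,"

# A single value: the same default for every column.
same_fill = np.genfromtxt(StringIO(text), delimiter=",", filling_values=0)

# A sequence of values: one default per column, in order.
per_column = np.genfromtxt(
    StringIO(text), delimiter=",", filling_values=(-1, -2, -3)
)
```

In the first call both gaps become 0; in the second, the gap in the second column becomes -2 and the gap in the third column becomes -3.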
usemask
We may also want to keep track of the occurrence of missing data by constructing a boolean mask, with True entries where data was missing and False otherwise. To do that, we just have to set the optional argument usemask to True (the default is False). The output array will then be a MaskedArray.
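As a short sketch of usemask on made-up data, the call below flags "N/A" as missing and returns a masked array. The autostrip=True setting is an extra choice made here to drop the spaces around each field.

```python
import numpy as np
from io import StringIO

text = "1, N/A, 3\n4, 5, 6"

masked = np.genfromtxt(
    StringIO(text),
    delimiter=",",
    autostrip=True,
    missing_values="N/A",  # marker for missing data in every column
    usemask=True,          # return a MaskedArray instead of a plain ndarray
)

# True where an entry was missing, False elsewhere.
missing_flags = masked.mask
```

The masked entry is ignored by reductions such as masked.sum(), and masked.count() reports how many entries are actually present.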
Shortcut functions
In addition to genfromtxt, the numpy.lib.npyio module provides several convenience functions derived from genfromtxt. These functions work the same way as the original, but they have different default values.
recfromtxt
Returns a standard numpy.recarray (if usemask=False) or a MaskedRecords array (if usemask=True). The default dtype is dtype=None, meaning that the types of each column will be automatically determined.
recfromcsv
Like recfromtxt, but with a default delimiter=",".
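A comparable result can be obtained from genfromtxt itself by viewing its output as a recarray. The sketch below is an approximation under that assumption rather than the implementation of the shortcut functions, and it uses a made-up CSV sample.

```python
import numpy as np
from io import StringIO

text = "a,b\n1,2.5\n3,4.5"

# genfromtxt with CSV-style settings, then viewed as a record array.
records = np.genfromtxt(
    StringIO(text),
    delimiter=",",
    names=True,        # take the field names from the first line
    dtype=None,        # infer the type of each column
    encoding="utf-8",
).view(np.recarray)

# Fields are now reachable as attributes.
first_column = records.a
second_column = records.b
```

Attribute access such as records.a is the main convenience of a recarray over a plain structured array, where the same data would be read as records["a"].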