How do I parse XML and get instances of a particular node attribute?

Mateen Ulhaq

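I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; but in this context what they chiefly add is more speed. The ease of programming depends on the API, which ElementTree defines.

First build an Element instance root from the XML, e.g. with the XML function, or by parsing a file with something like: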

import xml.etree.ElementTree as ET
root = ET.parse('thefile.xml').getroot()

Or any of the many other ways shown at ElementTree. Then do something like:

for type_tag in root.findall('bar/type'):
    value = type_tag.get('foobar')
    print(value)

Output:

1
2
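For a version that runs without a file on disk, here is a minimal sketch that parses the XML from a string with ET.fromstring. The sample document and the variable names xml_text and values are assumptions for illustration, modelled on the XML in the question.

```python
import xml.etree.ElementTree as ET

# Sample document, assumed to match the question's XML
xml_text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

root = ET.fromstring(xml_text)  # root is the <foo> element

# 'bar/type' is resolved relative to root, so it matches foo/bar/type
values = [type_tag.get('foobar') for type_tag in root.findall('bar/type')]
print(values)
```

Element.get returns None when the attribute is missing, so no exception handling is needed for elements without foobar.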

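You seem to ignore xml.etree.cElementTree, which comes with Python and in some aspects is faster than lxml ("lxml's iterparse() is slightly slower than the one in cET", from an e-mail by the lxml author).

ElementTree works and is included with Python. XPath support is limited, though, and you can't traverse up to the parent of an element, which can slow development down (especially if you don't know this). See python xml query get parent for details.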
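One common workaround for the missing parent link is to build a child-to-parent mapping once and look parents up in it. This is a sketch only; parent_map and first_type are names chosen here for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>')

# Map every child element to its parent by walking the whole tree once
parent_map = {child: parent for parent in root.iter() for child in parent}

first_type = root.find('bar/type')
print(parent_map[first_type].tag)
```

The mapping has to be rebuilt if the tree is modified afterwards.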

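lxml adds more than speed. It provides easy access to information such as the parent node and the line number in the XML source, which can be very useful in several scenarios.

It seems that ElementTree has some vulnerability issues. This is a quote from the docs: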

Warning The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.

@Cristik This seems to be the case with most XML parsers, see the XML vulnerabilities page.

Mateen Ulhaq

minidom is the quickest to get started with and is pretty straightforward.

XML:

<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>

Python:

from xml.dom import minidom

dom = minidom.parse('items.xml')
elements = dom.getElementsByTagName('item')

print(f"There are {len(elements)} items:")

for element in elements:
    print(element.attributes['name'].value)

Output:

There are 4 items:
item1
item2
item3
item4

How do you get the value of "item1"? For example: Value1
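Assuming the question is about text content inside the element (a hypothetical item containing Value1, which the sample XML above does not have), a minimal sketch with minidom could look like this:

```python
from xml.dom import minidom

# Hypothetical variant of the sample where the item has text content
dom = minidom.parseString('<data><items><item name="item1">Value1</item></items></data>')

item = dom.getElementsByTagName('item')[0]
# The text lives in a child text node, not on the element itself
value = item.firstChild.data
print(item.getAttribute('name'), value)
```

For an empty element firstChild is None, so real code would need to check for that first.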

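Where is the documentation for minidom? I only found this, but it doesn't help: docs.python.org/2/library/xml.dom.minidom.html

I'm also confused why it finds item straight from the top level of the document. Wouldn't it be cleaner if you supplied it the path (data->items)? What if you also had data->secondSetOfItems that also had nodes named item, and you wanted to list only one of the two sets of item?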
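getElementsByTagName searches all descendants of the node it is called on, so one way to restrict the search is to call it on the parent element rather than on the document. A sketch, using a hypothetical document with the two groups described in the comment:

```python
from xml.dom import minidom

# Hypothetical document with two groups that both contain <item> nodes
doc = minidom.parseString(
    '<data>'
    '<items><item name="item1"/><item name="item2"/></items>'
    '<secondSetOfItems><item name="other1"/></secondSetOfItems>'
    '</data>'
)

# Searching from the document finds every <item>, regardless of its parent
all_names = [e.getAttribute('name') for e in doc.getElementsByTagName('item')]

# Searching from the <items> element only sees its own descendants
items_node = doc.getElementsByTagName('items')[0]
scoped_names = [e.getAttribute('name') for e in items_node.getElementsByTagName('item')]

print(all_names)
print(scoped_names)
```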

Please see stackoverflow.com/questions/21124018/…

The syntax won't work here; you need to remove the parentheses: for s in itemlist: print(s.attributes['name'].value)

the Tin Man

You can use BeautifulSoup:

from bs4 import BeautifulSoup

x = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

# features="xml" selects the XML parser (this requires lxml to be installed)
y = BeautifulSoup(x, features="xml")
>>> y.foo.bar.type["foobar"]
'1'

>>> y.foo.bar.find_all("type")
[<type foobar="1"/>, <type foobar="2"/>]

>>> y.foo.bar.find_all("type")[0]["foobar"]
'1'
>>> y.foo.bar.find_all("type")[1]["foobar"]
'2'

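Three years later, with bs4, this is a great solution: very flexible, especially if the source is not well formed.

@YOU BeautifulStoneSoup is depreciated. Just use BeautifulSoup(source_xml, features="xml")

Another 3 years later, I just tried to load XML using ElementTree. Unfortunately it was unable to parse unless I adjusted the source in places, but BeautifulSoup worked right away without any changes!

@andi You mean "deprecated." "Depreciated" means it decreased in value, usually due to age or wear and tear from normal use.

Another 3 years, and now BS4 is not fast enough. It takes ages. Looking for any faster solutions.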

Stevoisiak

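There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.

The relevant metrics can be found in the table below, copied from the cElementTree website: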

library                         time      space
xml.dom.minidom (Python 2.1)    6.3 s     80000k
gnosis.objectify                2.0 s     22000k
xml.dom.minidom (Python 2.4)    1.4 s     53000k
ElementTree 1.2                 1.6 s     14500k
ElementTree 1.2.4/1.3           1.1 s     14500k
cDomlette (C extension)         0.540 s   20500k
PyRXPU (C extension)            0.175 s   10850k
libxml2 (C extension)           0.098 s   16000k
readlines (read as utf-8)       0.093 s   8850k
cElementTree (C extension)  --> 0.047 s   4900k <--
readlines (read as ascii)       0.032 s   5050k

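As pointed out by @jfs, cElementTree comes bundled with Python:

Python 2: from xml.etree import cElementTree as ElementTree.

Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically; xml.etree.cElementTree was removed in Python 3.9).

Are there any downsides to using cElementTree? It seems to be a no-brainer.

Apparently they don't want to use the library on OS X, as I have spent over 15 minutes trying to figure out where to download it from and no link works. Lack of documentation prevents good projects from thriving; I wish more people would realize that.

@Stunner: it is in the standard library, i.e. you don't need to download anything. On Python 2: from xml.etree import cElementTree as ElementTree. On Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically).

@mayhewsw It takes more effort to figure out how to use ElementTree efficiently for a particular task. For documents that fit in memory, it's a lot easier to use minidom, and it works fine for smaller XML documents.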
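For documents that do not fit comfortably in memory, incremental parsing is one option. Below is a minimal sketch using ET.iterparse from the standard library. The in-memory BytesIO source stands in for a large file and is an assumption for illustration, as are the variable names.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a large file on disk; iterparse accepts a filename or a file object
source = io.BytesIO(b'<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>')

values = []
# With events=('end',) each element is delivered once it has been fully parsed
for event, elem in ET.iterparse(source, events=('end',)):
    if elem.tag == 'type':
        values.append(elem.get('foobar'))
        elem.clear()  # drop the element's contents to keep memory use low

print(values)
```

Whether this is faster than a full parse depends on the document; the main gain is that the whole tree never has to be held in memory at once.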

the Tin Man

I suggest xmltodict for simplicity.

It parses your XML to an OrderedDict (the output below is from Python 2, hence the u'' prefixes):

>>> e = """<foo>
...     <bar>
...         <type foobar="1"/>
...         <type foobar="2"/>
...     </bar>
... </foo>"""

>>> import xmltodict
>>> result = xmltodict.parse(e)
>>> result

OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])

>>> result['foo']

OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])

>>> result['foo']['bar']

OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])

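Agreed. If you don't need XPath or anything complicated, this is much simpler to use (especially in the interpreter); handy for REST APIs that publish XML instead of JSON.

Remember that OrderedDict does not support duplicate keys. Most XML is chock-full of multiple siblings of the same type (say, all the paragraphs in a section, or all the types in your bar). So this will only work for very limited special cases.

@TextGeek In this case, result["foo"]["bar"]["type"] is a list of all <type> elements, so it still works (even though the structure may be a bit unexpected).

No updates since 2019.

I just realized there have been no updates since 2019. We need to find an active fork.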

sandy

lxml.objectify is really simple.

Taking your sample text:

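from lxml import objectify
from collections import defaultdict

# the sample XML from the question
text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

count = defaultdict(int)

root = objectify.fromstring(text)

# iterating root.bar.type yields every <type> sibling under <bar>
for item in root.bar.type:
    count[item.attrib.get("foobar")] += 1

print(dict(count))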

Output:

{'1': 1, '2': 1}

count stores the count of each item in a dictionary with default values, so you don't have to check for membership. You can also try looking at collections.Counter.
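As a rough illustration of the collections.Counter suggestion, here is a sketch that uses the standard-library ElementTree rather than lxml.objectify so that it runs without extra packages. The variable names are chosen here for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

text = '<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>'
root = ET.fromstring(text)

# Counter tallies each attribute value without any explicit membership check
count = Counter(item.get('foobar') for item in root.findall('bar/type'))
print(dict(count))
```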

the Tin Man

Python has an interface to the expat XML parser.

xml.parsers.expat

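It's a non-validating parser, so XML that is well formed but does not conform to its DTD or schema will not be caught (malformed XML still raises an error). But if you know your file is correct, then this is pretty good: you'll probably get the exact info you want, and you can discard the rest on the fly.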

from xml.parsers import expat

stringofxml = """<foo>
    <bar>
        <type arg="value" />
        <type arg="value" />
        <type arg="value" />
    </bar>
    <bar>
        <type arg="value" />
    </bar>
</foo>"""
count = 0

# called once for every opening tag; attr is a dict of the tag's attributes
def start(name, attr):
    global count
    if name == 'type':
        count += 1

p = expat.ParserCreate()
p.StartElementHandler = start
p.Parse(stringofxml)

print(count)  # prints 4
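The start handler also receives the element's attributes, so the same approach can collect attribute values instead of just counting. A minimal sketch follows; the sample XML, the args list and the malformed_rejected flag are assumptions for illustration. It also shows that malformed input raises ExpatError.

```python
from xml.parsers import expat

xml_text = '<foo><bar><type arg="a"/><type arg="b"/></bar></foo>'
args = []

def start(name, attrs):
    # attrs is a dict mapping attribute names to values
    if name == 'type':
        args.append(attrs.get('arg'))

parser = expat.ParserCreate()
parser.StartElementHandler = start
parser.Parse(xml_text, True)  # True marks this as the final chunk

# Malformed input is still rejected, even though the parser is non-validating
try:
    expat.ParserCreate().Parse('<foo><bar></foo>', True)
    malformed_rejected = False
except expat.ExpatError:
    malformed_rejected = True

print(args, malformed_rejected)
```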

the Tin Man

Just to add another possibility, you can use untangle, a simple xml-to-python-object library. Here is an example:

Installation:

pip install untangle

Usage:

Your XML file (slightly changed):

<foo>
   <bar name="bar_name">
      <type foobar="1"/>
   </bar>
</foo>

Accessing the attributes with untangle:

import untangle

obj = untangle.parse('/path_to_xml_file/file.xml')

print(obj.foo.bar['name'])
print(obj.foo.bar.type['foobar'])

The output will be:

bar_name
1

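More information about untangle can be found in "untangle".

Also, if you are curious, you can find a list of tools for working with XML and Python in "Python and XML". You will also see that the most common ones were mentioned in previous answers.

What makes untangle different from minidom?

I can't tell you the difference between those two, as I have not worked with minidom.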

gatkin

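I might suggest declxml.

Full disclosure: I wrote this library because I was looking for a way to convert between XML and Python data structures without needing to write dozens of lines of imperative parsing/serialization code with ElementTree.

With declxml, you use processors to declaratively define the structure of your XML document and how to map between XML and Python data structures. Processors are used for both serialization and parsing, as well as for a basic level of validation.

Parsing into Python data structures is straightforward: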

import declxml as xml

xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.dictionary('bar', [
        xml.array(xml.integer('type', attribute='foobar'))
    ])
])

xml.parse_from_string(processor, xml_string)

Which produces the output:

{'bar': {'foobar': [1, 2]}}

You can also use the same processor to serialize data to XML:

data = {'bar': {
    'foobar': [7, 3, 21, 16, 11]
}}

xml.serialize_to_string(processor, data, indent='    ')

Which produces the following output:

<?xml version="1.0" ?>
<foo>
    <bar>
        <type foobar="7"/>
        <type foobar="3"/>
        <type foobar="21"/>
        <type foobar="16"/>
        <type foobar="11"/>
    </bar>
</foo>

If you want to work with objects instead of dictionaries, you can define processors to transform data to and from objects as well.

import declxml as xml

class Bar:

    def __init__(self):
        self.foobars = []

    def __repr__(self):
        return 'Bar(foobars={})'.format(self.foobars)


xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.user_object('bar', Bar, [
        xml.array(xml.integer('type', attribute='foobar'), alias='foobars')
    ])
])

xml.parse_from_string(processor, xml_string)

Which produces the following output:

{'bar': Bar(foobars=[1, 2])}

the Tin Man

Here is a very simple but effective piece of code using cElementTree.

import sys

try:
    # Python 2 and Python 3 before 3.9: the C-accelerated implementation
    import xml.etree.cElementTree as ET
except ImportError:
    # Python 3.9+: cElementTree was removed; ElementTree uses the C accelerator automatically
    import xml.etree.ElementTree as ET

def exit_err(message):
    # helper the original snippet relied on: report the error and exit
    sys.exit(message)

def find_in_tree(tree, node):
    found = tree.find(node)
    if found is None:
        print("No %s in file" % node)
        found = []
    return found

# Parse an XML file (specify the path)
def_file = "xml_file_name.xml"
try:
    dom = ET.parse(def_file)
    root = dom.getroot()
except (OSError, ET.ParseError):
    exit_err("Unable to open and parse input definition file: " + def_file)

# Find the node 'myNode'; iterating over the result yields its child nodes
fwdefs = find_in_tree(root, "myNode")

This is from "python xml parse".

the Tin Man

XML:

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

Python code:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

tree = ET.parse("foo.xml")
root = tree.getroot()
print(root.tag)

for form in root.findall("./bar/type"):
    # print every attribute value of the element
    for value in form.attrib.values():
        print(value)

Output:

foo
1
2

Martijn Pieters

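If you use python-benedict, there is no need to use a library-specific API. Just initialize a new instance from your XML and manage it easily, since it is a dict subclass.

Installation is easy: pip install python-benedict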

from benedict import benedict as bdict

# data-source can be a URL, a filepath or a data-string (as in this example)
data_source = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

data = bdict.from_xml(data_source)
t_list = data['foo.bar.type'] # yes, keypath supported; this is the list of <type> entries
for t in t_list:
   print(t['@foobar'])

It supports and normalizes I/O operations with many formats: Base64, CSV, JSON, TOML, XML, YAML and query-string.

It is well tested and open source on GitHub. Disclosure: I am the author.

G M

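xml.etree.ElementTree vs. lxml

These are some advantages of the two most used libraries that I would have liked to know before choosing between them.

xml.etree.ElementTree:

From the standard library: no need to install any module.

lxml:

Easily write the XML declaration: for instance, do you need to add standalone="no"?

Pretty printing: you can get nicely indented XML without extra code.

Objectify functionality: it lets you work with XML as if it were a normal Python object hierarchy (.node).

sourceline lets you easily get the line number of the XML element you are working with.

You can also use the built-in XSD schema checker.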
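For comparison, here is a sketch of how far the standard library gets on the first two points, assuming Python 3.9 or later (ET.indent was added in 3.9 and the xml_declaration argument of ET.tostring in 3.8). The sample document and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<foo><bar><type foobar="1"/></bar></foo>')

ET.indent(root, space="  ")  # in-place pretty printing, available since Python 3.9
pretty = ET.tostring(root, encoding="unicode")
print(pretty)

# xml_declaration=True (Python 3.8+) prepends a declaration, but there is no standalone option
with_decl = ET.tostring(root, encoding="utf-8", xml_declaration=True)
print(with_decl.decode("utf-8").splitlines()[0])
```

The remaining points (objectify, sourceline, XSD validation) have no direct counterpart in xml.etree.ElementTree.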

the Tin Man

import xml.etree.ElementTree as ET
data = '''<foo>
           <bar>
               <type foobar="1"/>
               <type foobar="2"/>
          </bar>
       </foo>'''
tree = ET.fromstring(data)
lst = tree.findall('bar/type')
for item in lst:
    print(item.get('foobar'))

This will print the value of the foobar attribute.

smci

simplified_scrapy: a new lib. I fell in love with it after using it, and I recommend it to you.

from simplified_scrapy import SimplifiedDoc
xml = '''
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
'''

doc = SimplifiedDoc(xml)
types = doc.selects('bar>type')
print (len(types)) # 2
print (types.foobar) # ['1', '2']
print (doc.selects('bar>type>foobar()')) # ['1', '2']

Here are more examples. This lib is easy to use.

Siraj

# If the XML is in the form of a byte string, as shown below:
from lxml import etree

# Sample XML as a byte string with the namespace {http://xmlns.abc.com}
message = b'<?xml version="1.0" encoding="UTF-8"?>\r\n<pa:Process xmlns:pa="http://xmlns.abc.com">\r\n\t<pa:firsttag>SAMPLE</pa:firsttag></pa:Process>\r\n'

print('************message conversion and parsing starts*************')

message = message.decode('utf-8')
# lxml refuses a str that carries an encoding declaration, so strip it
message = message.replace('<?xml version="1.0" encoding="UTF-8"?>\r\n', '')
message = message.replace('pa:Process>\r\n', 'pa:Process>')
print(message)

print('******Parsing starts*************')
# remove_blank_text drops ignorable whitespace between elements; it does not remove the namespace
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(message, parser)  # parsing of the XML happens here
print('******Parsing completed************')

result = {}
for child in root:  # iterate over the parsed children and store values in a dictionary
    print(child.tag, child.text)
    print('****Deriving from xml tree*****')
    if child.tag == "{http://xmlns.abc.com}firsttag":  # tags keep their namespace in {uri}name form
        result["FIRST_TAG"] = child.text
        print(result)


### output
'''************message conversion and parsing starts*************
<pa:Process xmlns:pa="http://xmlns.abc.com">

    <pa:firsttag>SAMPLE</pa:firsttag></pa:Process>
******Parsing starts*************
******Parsing completed************
{http://xmlns.abc.com}firsttag SAMPLE
****Deriving from xml tree*****
{'FIRST_TAG': 'SAMPLE'}'''
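As an alternative to string replacement, here is a sketch that parses the bytes directly with the standard-library ElementTree and looks the tag up through a prefix-to-URI mapping. This is a different technique from the answer's code; the ns mapping and the variable names are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

message = (b'<?xml version="1.0" encoding="UTF-8"?>\r\n'
           b'<pa:Process xmlns:pa="http://xmlns.abc.com">\r\n'
           b'\t<pa:firsttag>SAMPLE</pa:firsttag></pa:Process>\r\n')

# Parsing the bytes directly lets the parser honour the encoding declaration
root = ET.fromstring(message)

# The prefix used here is local to this mapping; only the URI has to match
ns = {'pa': 'http://xmlns.abc.com'}
first = root.find('pa:firsttag', ns)

result = {'FIRST_TAG': first.text}
print(result)
```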

Please also include some context explaining how your answer solves the problem. Code-only answers are discouraged.

Liju

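If you don't want to use any external libraries or third-party tools, try the code below.

It parses XML into a Python dictionary.

It parses XML attributes as well.

It also parses empty tags (such as <details/> in the sample input) and tags that carry only attributes (such as <details empty="True"/>).

Code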

import re

def getdict(content):
    # each match is a (tag, attribute string, inner content) tuple
    res = re.findall(r"<(?P<var>\S*)(?P<attr>[^/>]*)(?:(?:>(?P<val>.*?)</(?P=var)>)|(?:/>))", content)
    if len(res) >= 1:
        attreg = r"(?P<avr>\S+?)(?:(?:=(?P<quote>['\"])(?P<avl>.*?)(?P=quote))|(?:=(?P<avl1>.*?)(?:\s|$))|(?P<avl2>[\s]+)|$)"
        if len(res) > 1:
            return [{i[0]: [{"@attributes": [{j[0]: (j[2] or j[3] or j[4])} for j in re.findall(attreg, i[1].strip())]}, {"$values": getdict(i[2])}]} for i in res]
        else:
            # a single match: unpack the tuple instead of indexing the result list
            tag, attrs, val = res[0]
            return {tag: [{"@attributes": [{j[0]: (j[2] or j[3] or j[4])} for j in re.findall(attreg, attrs.strip())]}, {"$values": getdict(val)}]}
    else:
        # no tags left: this is plain text content
        return content

with open("test.xml","r") as f:
    print(getdict(f.read().replace('\n','')))

Sample input

<details class="4b" count=1 boy>
    <name type="firstname">John</name>
    <age>13</age>
    <hobby>Coin collection</hobby>
    <hobby>Stamp collection</hobby>
    <address>
        <country>USA</country>
        <state>CA</state>
    </address>
</details>
<details empty="True"/>
<details/>
<details class="4a" count=2 girl>
    <name type="firstname">Samantha</name>
    <age>13</age>
    <hobby>Fishing</hobby>
    <hobby>Chess</hobby>
    <address current="no">
        <country>Australia</country>
        <state>NSW</state>
    </address>
</details>

Output (beautified)

[
  {
    "details": [
      {
        "@attributes": [
          {
            "class": "4b"
          },
          {
            "count": "1"
          },
          {
            "boy": ""
          }
        ]
      },
      {
        "$values": [
          {
            "name": [
              {
                "@attributes": [
                  {
                    "type": "firstname"
                  }
                ]
              },
              {
                "$values": "John"
              }
            ]
          },
          {
            "age": [
              {
                "@attributes": []
              },
              {
                "$values": "13"
              }
            ]
          },
          {
            "hobby": [
              {
                "@attributes": []
              },
              {
                "$values": "Coin collection"
              }
            ]
          },
          {
            "hobby": [
              {
                "@attributes": []
              },
              {
                "$values": "Stamp collection"
              }
            ]
          },
          {
            "address": [
              {
                "@attributes": []
              },
              {
                "$values": [
                  {
                    "country": [
                      {
                        "@attributes": []
                      },
                      {
                        "$values": "USA"
                      }
                    ]
                  },
                  {
                    "state": [
                      {
                        "@attributes": []
                      },
                      {
                        "$values": "CA"
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "details": [
      {
        "@attributes": [
          {
            "empty": "True"
          }
        ]
      },
      {
        "$values": ""
      }
    ]
  },
  {
    "details": [
      {
        "@attributes": []
      },
      {
        "$values": ""
      }
    ]
  },
  {
    "details": [
      {
        "@attributes": [
          {
            "class": "4a"
          },
          {
            "count": "2"
          },
          {
            "girl": ""
          }
        ]
      },
      {
        "$values": [
          {
            "name": [
              {
                "@attributes": [
                  {
                    "type": "firstname"
                  }
                ]
              },
              {
                "$values": "Samantha"
              }
            ]
          },
          {
            "age": [
              {
                "@attributes": []
              },
              {
                "$values": "13"
              }
            ]
          },
          {
            "hobby": [
              {
                "@attributes": []
              },
              {
                "$values": "Fishing"
              }
            ]
          },
          {
            "hobby": [
              {
                "@attributes": []
              },
              {
                "$values": "Chess"
              }
            ]
          },
          {
            "address": [
              {
                "@attributes": [
                  {
                    "current": "no"
                  }
                ]
              },
              {
                "$values": [
                  {
                    "country": [
                      {
                        "@attributes": []
                      },
                      {
                        "$values": "Australia"
                      }
                    ]
                  },
                  {
                    "state": [
                      {
                        "@attributes": []
                      },
                      {
                        "$values": "NSW"
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]

The approach is nice, but the result it returns is not convenient to use.
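If a flatter result is preferred, one possibility is to build the dictionary from the standard-library ElementTree instead of regular expressions. This is a sketch of a different technique, not the answer's method: it assumes well-formed XML (quoted attribute values, a single root element), and element_to_dict with its 'attributes'/'text'/'children' layout is a hypothetical structure chosen here for illustration.

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Convert an element into a dict with 'attributes', 'text' and 'children' keys."""
    return {
        'attributes': dict(elem.attrib),
        'text': (elem.text or '').strip(),
        # children are kept as (tag, dict) pairs so repeated tags such as <hobby> survive
        'children': [(child.tag, element_to_dict(child)) for child in elem],
    }

sample = """<details class="4a" count="2">
    <name type="firstname">Samantha</name>
    <hobby>Fishing</hobby>
    <hobby>Chess</hobby>
</details>"""

parsed = element_to_dict(ET.fromstring(sample))
hobbies = [d['text'] for tag, d in parsed['children'] if tag == 'hobby']
print(parsed['attributes'], hobbies)
```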

Siraj

If the source is an XML file, like this sample

<pa:Process xmlns:pa="http://sssss">
    <pa:firsttag>SAMPLE</pa:firsttag>
</pa:Process>

you can try the following code

from lxml import etree

metadata = 'C:\\Users\\PROCS.xml'  # sample XML file; its contents are shown above
# remove_blank_text drops ignorable whitespace; the namespace (http://sssss) is stripped in the loop below
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)  # parse the XML file PROCS.xml
root = tree.getroot()  # the root element, Process

# Strip the "{namespace}" prefix from every tag name
for elem in root.iter():
    if not hasattr(elem.tag, 'find'):
        continue  # skip comments and processing instructions, whose tag is not a string
    i = elem.tag.find('}')
    if i >= 0:
        elem.tag = elem.tag[i+1:]

result = {}  # collects the extracted values
for elem in tree.iter():
    if elem.tag == "firsttag":  # tag names no longer carry the namespace
        result["FIRST_TAG"] = str(elem.text)
        print(result)

The output will be

{'FIRST_TAG': 'SAMPLE'}
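When the namespace itself does not matter, the standard-library ElementTree (Python 3.8 and later) also accepts a {*} wildcard in tag lookups, which avoids rewriting the tags. A sketch, with the file contents inlined as a string for illustration:

```python
import xml.etree.ElementTree as ET

xml_text = """<pa:Process xmlns:pa="http://sssss">
    <pa:firsttag>SAMPLE</pa:firsttag>
</pa:Process>"""

root = ET.fromstring(xml_text)

# '{*}firsttag' matches firsttag in any namespace (supported since Python 3.8)
elem = root.find('{*}firsttag')
result = {'FIRST_TAG': elem.text}
print(result)
```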
