March 23, 2020 · 3 min read
Imagine the following situation:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetHotelsResponse>
      <GetHotelsResult>
        <Hotels>
          <Hotel>
            <Id>1</Id>
            <Name>Hotel A</Name>
            <Latitude>41.390205</Latitude>
            <Longitude>2.6466666667</Longitude>
          </Hotel>
          <!-- Hundreds of thousands ... -->
          <Hotel>
            <Id>351987</Id>
            <Name>Hotel Z</Name>
            <Latitude>40.416775</Latitude>
            <Longitude>2.154007</Longitude>
          </Hotel>
        </Hotels>
      </GetHotelsResult>
    </GetHotelsResponse>
  </soap:Body>
</soap:Envelope>
(Yes, this still happens in the real world :( )
You would probably use the suds library to consume SOAP APIs, and a first attempt at importing the hotels would look something like this:
from suds.client import Client

wsdl = 'https://api.soap.com/v1?wsdl'
client = Client(wsdl)

hotels = client.service.GetHotels()
for hotel in hotels:
    import_hotel(hotel)
This snippet would work most of the time, but it has an issue in our scenario: suds parses the response and builds all the Python objects in memory at once. Given how huge the list of hotels is, this snippet would consume all your available memory, or at least more than the 32GB we have on our server.
Unfortunately, suds does not appear to have a way to parse SOAP responses iteratively, but it does have a way to get the raw XML response without parsing it into Python objects. Taking advantage of the lxml library, we can build an iterative version:
import cStringIO

from lxml import etree
from suds.client import Client

wsdl = 'https://api.soap.com/v1?wsdl'
client = Client(wsdl)

# Get the raw XML response without parsing SOAP elements into Python objects
client.set_options(retxml=True)
xml = cStringIO.StringIO(client.service.GetHotels())

# Iterate over the XML hotel by hotel
context = etree.iterparse(xml, events=('end',), tag='Hotel')
for _event, xml_hotel in context:
    import_hotel(xml_hotel)
Now we are parsing and importing hotels one by one without consuming all the memory, right? Well, not really. If you run this code you'll see that it starts importing hotels very quickly, but soon after it slows down to a crawl. Why? Although etree.iterparse does not load the entire XML up front, it still builds the tree incrementally and keeps a reference to every node it has already parsed. We need to manually free two kinds of references:
import cStringIO

from lxml import etree
from suds.client import Client

wsdl = 'https://api.soap.com/v1?wsdl'
client = Client(wsdl)

# Get the raw XML response without parsing SOAP elements into Python objects
client.set_options(retxml=True)
xml = cStringIO.StringIO(client.service.GetHotels())

# Iterate over the XML hotel by hotel
context = etree.iterparse(xml, events=('end',), tag='Hotel')
for _event, xml_hotel in context:
    import_hotel(xml_hotel)
    # Free memory
    xml_hotel.clear()  # free children
    while xml_hotel.getprevious() is not None:  # free preceding siblings
        del xml_hotel.getparent()[0]
Finally, we can import the entire list of hotels without running out of memory. :)
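To see the iterparse-and-clear pattern in isolation, here is a minimal, self-contained sketch you can run as-is. It targets Python 3 (io.BytesIO standing in for the Python 2 cStringIO used above), and the tiny inline payload is a made-up stand-in for the real SOAP response:

```python
import io
from lxml import etree

# Hypothetical stand-in for the raw XML returned by the SOAP service
payload = b"""<?xml version="1.0" encoding="utf-8"?>
<Hotels>
  <Hotel><Id>1</Id><Name>Hotel A</Name></Hotel>
  <Hotel><Id>2</Id><Name>Hotel B</Name></Hotel>
</Hotels>"""

ids = []
# Fire an event each time a complete <Hotel> element has been parsed
context = etree.iterparse(io.BytesIO(payload), events=('end',), tag='Hotel')
for _event, xml_hotel in context:
    ids.append(xml_hotel.findtext('Id'))
    # Free memory exactly as above: clear children, drop preceding siblings
    xml_hotel.clear()
    while xml_hotel.getprevious() is not None:
        del xml_hotel.getparent()[0]

print(ids)  # → ['1', '2']
```

Because each element is read at its 'end' event and then discarded, peak memory stays roughly constant no matter how many hotels the payload contains.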
You can find more details about iterative XML parsing here: https://www.ibm.com/developerworks/xml/library/x-hiperfparse