python

All posts tagged python

A friend was looking for a way to list the space usage on a windows server that only had FTP access. I had written something similar for a project long ago, and polished up to do the job.

This python script will walk an FTP directory in a top-down, depth-first pattern. It uses the ftplib library, which I believe is built-in to most or all python distributions. Configure the FTP_* variables near the top to set the server, port, user, password, and the delay between each FTP operation (to avoid hammering the server). The script recursively processes directories, creating a dirStruct tuple that contains the following items:

(pwd, subdirList, fileList, sizeInFilesHere, sizeTotal)
    pwd is a string like "/debian/dists/experimental"
    subdirList is a list of tuples just like this one
    fileList is a list of (filename, sizeInBytes) tuples
    sizeInFilesHere is a sum of all the files in this directory
    sizeTotal is a sum of all the files in this directory and all subdirectories

It also writes data to two CSV files:

  • dirStruct_only_folders.csv
    • Contains entries for just the directories.
    • Local size is the total size of files in that folder (does not count subdirs).
    • Total size is the sum of local size and total size of all subdirs.
  • dirStruct_complete.csv
    • Contains entries for both files and folders.
    • Files do not have a total size, only a local size.
#!/usr/bin/env python
#
# A script to recursively walk an FTP server directory structure, recording information
# about the file and directory sizes as it traverses the folders.
#
# Stores output in two CSV files:
#  dirStruct_only_folders.csv
#     Contains entries for just the directories.
#     Local size is the total size of files in that folder (does not count subdirs).
#     Total size is the sum of local size and total size of all subdirs.
#  dirStruct_complete.csv
#     Contains entries for both files and folders.
#     Files do not have a total size, only a local size.
#
# Customize the FTP_* variables below.
#
# Basically does a depth-first search.
#
# Written by Matthew L Beckler, matthew at mbeckler dot org.
# Released into the public domain, do whatever you like with this.
# Email me if you like the script or have suggestions to improve it.

from ftplib import FTP
from time import sleep


FTP_SERVER = "ftp.debian.org"
FTP_PORT = "21" # 21 is the default
FTP_USER = "" # leave empty for anon FTP server
FTP_PASS = ""
FTP_DELAY = 1 # how long to wait between calls to the ftp server

def parseListLine(line):
   # Files look like          "-rw-r--r--    1 1176     1176       176158 Mar 30 01:52 README.mirrors.html"
   # Directories look like    "drwxr-sr-x   15 1176     1176         4096 Feb 15 09:22 dists"
   # Returns (name, isDir, sizeBytes)
   items = line.split()
   return (items[8], items[0][0] == "d", int(items[4]))

# Since the silly ftp library makes us use a callback to handle each line of text from the server,
# we have a global lines buffer. Clear the buffer variable before doing each call.
lines = []
def appendLine(line):
   global lines
   lines.append(line)
def getListingParsed(ftp):
   """ This is a sensible interface to the silly line getting system. Returns a copy of the directory listing, parsed. """
   global lines
   lines = []
   ftp.dir(appendLine)
   myLines = lines[:]
   parsedLines = map(parseListLine, myLines)
   return parsedLines
   
def descendDirectories(ftp):
   # Will return a tuple for the current ftp directory, like this:
   # (pwd, subdirList, fileList, sizeInFilesHere, sizeTotal)
   #     pwd is a string like "/debian/dists/experimental"
   #     subdirList is a list of tuples just like this one
   #     fileList is a list of (filename, sizeInBytes) tuples
   #     sizeInFilesHere is a sum of all the files in this directory
   #     sizeTotal is a sum of all the files in this directory and all subdirectories

   sleep(FTP_DELAY) # be a nice client

   # make our directory structure to return
   pwd = ftp.pwd()
   subdirList = []
   fileList = []
   sizeInFilesHere = 0
   sizeTotal = 0

   print pwd + "/"
   items = getListingParsed(ftp)
   for name, isDir, sizeBytes in items:
      if not isDir:
         fileList.append( (name, sizeBytes) )
         sizeInFilesHere += sizeBytes
      else:
         # is a directory, so recurse
         ftp.cwd(name)
         struct = descendDirectories(ftp)
         ftp.cwd("..")
         subdirList.append(struct)
         sizeTotal += struct[4]

   # add in the size of all files here to sizeTotal
   sizeTotal += sizeInFilesHere
   return (pwd, subdirList, fileList, sizeInFilesHere, sizeTotal)

def pprintBytes(b):
   """ Pretty prints a number of bytes with a proper suffix, like K, M, G, T. """
   suffixes = ["", "K", "M", "G", "T", "?"]
   ix = 0
   while (b > 1024):
      b /= 1024.0
      ix += 1
   s = suffixes[min(len(suffixes) - 1, ix)]
   if int(b) == b:
      return "%d%s" % (b, s)
   else:
      return "%.1f%s" % (b, s)

def pprintDirStruct(dirStruct):
   """ Pretty print the directory structure. RECURSIVE FUNCTION! """
   print "{}/ ({} in {} files here, {} total)".format(dirStruct[0], pprintBytes(dirStruct[3]), len(dirStruct[2]), pprintBytes(dirStruct[4]))
   for ds in dirStruct[1]:
      pprintDirStruct(ds)

def saveDirStructToCSV(dirStruct, fid, includeFiles):
   """ Save the directory structure to a CSV file. RECURSIVE FUNCTION! """
   # Info about this directory itself
   fid.write("\"{}/\",{},{}\n".format(dirStruct[0], dirStruct[3], dirStruct[4]))
   pwd = dirStruct[0]

   # Info about files here
   if includeFiles:
      for name, size in dirStruct[2]:
         fid.write("\"{}\",{},\n".format(pwd + "/" + name, size))

   # Info about dirs here, recurse
   for ds in dirStruct[1]:
      saveDirStructToCSV(ds, fid, includeFiles)

print "Connecting to FTP server '%s' port %s..." % (FTP_SERVER, FTP_PORT)
ftp = FTP()
ftp.connect(FTP_SERVER, FTP_PORT)
if FTP_USER == "":
   ftp.login()
else:
   ftp.login(FTP_USER, FTP_PASS)

print "Walking directory structure..."
dirStruct = descendDirectories(ftp)

print ""
print "Finished descending directories, here is the info:"
pprintDirStruct(dirStruct)
print ""

FILENAME = "dirStruct_complete.csv"
print "Saving complete directory info (files and folders) to a CSV file: '%s'" % FILENAME
with open(FILENAME, "w") as fid:
   fid.write("\"Path\",\"Local size\",\"Total size\"\n")
   saveDirStructToCSV(dirStruct, fid, includeFiles=True)

FILENAME = "dirStruct_only_folders.csv"
print "Saving directory info (only folders) to a CSV file: '%s'" % FILENAME
with open(FILENAME, "w") as fid:
   fid.write("\"Path\",\"Local size\",\"Total size\"\n")
   saveDirStructToCSV(dirStruct, fid, includeFiles=False)

Sample CSV output:

"Path","Local size","Total size"
"/plugins/",5426535,7594527
"/plugins/foo-1.1.jar",7774,
"/plugins/CHANGELOG.txt",45169,

Local size is just the size of the file itself, or the size of all files in a directory. Total size is the total size of the files in a directory plus the total sizes of all subdirectories. Files do not have a total size entry.

I’m writing a fun little webapp using Flask and Python and Sqlalchemy, running on Heroku using a PostgreSQL database. I use a sqlite3 database file for local testing, and PostgreSQL when I deploy, so naturally there are some minor snags to be run into when switching between database engines.

Tonight I ran into a tricky issue after adding a ton more foreign-key / relationships to my database-backed models. I was getting an error like this when I tried to issue my db.drop_all() command in my python script that initializes my database tables:

sqlalchemy.exc.InternalError: (InternalError) cannot drop table pages because other objects depend on it
DETAIL:  constraint pagesections_parent_page_id_fkey on table pagesections depends on table pages
HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 '\nDROP TABLE pages' {}

A bunch of searching for solutions indicated that maybe it would work if you run db.reflect() immediately before the db.drop_all(), but apparently the reflect function is broken for the current flask/sqlalchemy combination. Further searching revealed a mystical “DropEverything” function, and I finally found a copy here. I had to do a few small modifications to get it to work in the context of Flask’s use of Sqlalchemy.

def db_DropEverything(db):
    # From http://www.sqlalchemy.org/trac/wiki/UsageRecipes/DropEverything

    conn=db.engine.connect()

    # the transaction only applies if the DB supports
    # transactional DDL, i.e. Postgresql, MS SQL Server
    trans = conn.begin()

    inspector = reflection.Inspector.from_engine(db.engine)

    # gather all data first before dropping anything.
    # some DBs lock after things have been dropped in 
    # a transaction.
    metadata = MetaData()

    tbs = []
    all_fks = []

    for table_name in inspector.get_table_names():
        fks = []
        for fk in inspector.get_foreign_keys(table_name):
            if not fk['name']:
                continue
            fks.append(
                ForeignKeyConstraint((),(),name=fk['name'])
                )
        t = Table(table_name,metadata,*fks)
        tbs.append(t)
        all_fks.extend(fks)

    for fkc in all_fks:
        conn.execute(DropConstraint(fkc))

    for table in tbs:
        conn.execute(DropTable(table))

    trans.commit()

I had to change the uses of engine to db.engine since Flask’s SQLalchemy takes care of that for you. You get the db object from the app, like this “from myapp import db”, and this is how I defined db in myapp:

from flask.ext.sqlalchemy import SQLAlchemy

app = Flask(__name__, etc)

# DATABASE_URL is set if we are running on Heroku
if 'DATABASE_URL' in os.environ:
    app.config['HEROKU'] = True
    app.config['SQLALCHEMY_DATABASE_URI'] = os.environ['DATABASE_URL']
else:
    app.config['HEROKU'] = False
    app.config['SQLALCHEMY_DATABASE_URI'] = "sqlite:///" + os.path.join(PROJECT_ROOT, "../app.db")

db = SQLAlchemy(app)

And then this is the important parts of my db_create.py script:

from sqlalchemy.engine import reflection
from sqlalchemy.schema import (
        MetaData,
        Table,
        DropTable,
        ForeignKeyConstraint,
        DropConstraint,
        )

from cyosa import app, db

if not app.config['HEROKU'] and os.path.exists("app.db"):
    os.remove("app.db")

def db_DropEverything(db):
    # listed above

db_DropEverything(db)
db.create_all()

# add your instances of models here, be sure to db.session.commit()

I’ve found this little function to be useful when dividing work across a fixed number of workers. It does integer division, but rounds up (unlike normal integer division which truncates). Written in python, but it would be easy to write in another language.

def divide_round_up(n, d):
    return (n + (d - 1))/d

For example, if we were distributing N jobs to 4 CPU threads, this function will tell us how many total iterations are required to process N threads:

>>> divide_round_up(0,4)
0
>>> divide_round_up(3,4)
1
>>> divide_round_up(4,4)
1
>>> divide_round_up(5,4)
2
>>> divide_round_up(7,4)
2
>>> divide_round_up(8,4)
2

Update: My friend Joe wrote in to say “That has overflow issues if n+d>int_max. How about (n/d)+(bool(n%d))”. For my needs, I’m not going to be anywhere close to int_max, but it’s a valid point.

Notes for myself so I can do this again someday… (UPDATE: I had to refresh the links at the bottom of this article to the latest version of Django. Good luck.)

Let’s say we are working on a Django project called applesauce, for lack of a better name. We start by creating a directory for this project:
cd ~
mkdir applesauce
cd applesauce

We use pip to install python packages, so we must install it first:
sudo apt-get install python-pip python-setuptools

We use virtualenv to manage our python environment to make sure we don’t go too crazy. We can use pip to install it:
sudo pip install virtualenv

virtualenv has this idea of using a directory to store its settings, so let’s make one for it, and use it as a new virtual environment:
mkdir applesauce-virtualenv
virtualenv applesauce-virtualenv

The –no-site-packages is the default now, so we don’t include it here (you could, but it’s deprecated). This helps keep things sane.

virtualenv creates a shell script for you to source that sets up the necessary environment variables. You will need to source this file every time you want to work on this project:

source applesauce-virtualenv/bin/activate

You should see that your shell prompt has changed, and now has “(applesauce-virtualenv)” at the front of the prompt. This lets you know that you have already “logged in” to your virtualenv.

Now, let’s install some of the packages required to work with Django. We start with the non-python system-wide packages and libraries:
sudo apt-get install libxml2-dev libxslt1-dev python2.7-dev libpng12-dev libfreetype6-dev build-essential python-dev

Then, make sure that you are in your virtual environment and install these python things using pip. Since we’re installing these packages into the virtualenv we don’t need to use sudo.

pip install django django-debug-toolbar south httplib2 lxml

If you have any modules/packages that you want to be able to use in your Django project, you can copy them into applesauce-virtualenv/lib/python2.7/site-packages/

To verify that Django can be seen by Python, type python from your shell. Then at the Python prompt, try to import Django:

>>> import django
>>> print django.get_version()
1.4.1

Now, you can start working with Django. You might want to start with the official tutorial, Writing your first Django app.

(Some information taken from the Django Quick install guide.)